In 1985 Alison Bechdel theorised that movie conversations between women were very different to movie conversations between men, and she was right. Is this still the case? In this experiment we teamed up with our colleagues at Doberman to build on the work of Bechdel and use Deep Learning to take the analysis one step further. Doberman has previously built an app to determine the average speaking time between the genders in meeting conversations, so we relied on their expertise to build an interactive app around the model we set up on our platform.
Men and women are created equal - or are they? We used AI to find out.
As a starting point, we wanted to see if movie conversations between men and movie conversations between women were different enough that it could be spotted within the first two or three phrases. Our approach was that we would try to train an AI model to see if it could accurately predict when a movie conversation was between two men and when it was between two women. The assumption here was that if the model could do this with a relatively high accuracy, then men and women were still being portrayed very differently.
Our findings show that movie conversations between men and movie conversations between women are still different enough that an AI model can tell them apart with 74% accuracy. That is, when presented with just two or three lines of conversation, our model can usually tell you whether the dialogue is between two men or two women.
The Bechdel Test
The Bechdel Test is a three-step questionnaire for evaluating the presence of women in a movie. It was first laid out in a comic strip in 1985 by Alison Bechdel, to make the point that despite how basic the requirements are, a huge number of movies don’t actually fulfil them.
According to the test, a movie has to have:
- at least two female characters...
- ...who talk to each other...
- ...about something besides a man.
In a study of about 1,800 movies released between 1970 and 2013, only about half of the movies actually passed the test. Although the study found an improvement over this period, it also showed that progress plateaued toward the end of it. With all the advancements that have been made in AI for text data, we thought it would be interesting to use NLP to look for patterns in the ways that men and women are depicted in movies. Could we create an AI model that distinguishes conversations between two men from conversations between two women with high accuracy?
To explore this, we decided to use the Cornell movie dialog corpus, which has dialogs from 617 movies amounting to about 200,000 conversational exchanges. In some exchanges each character is labeled with their gender. Combining the conversations with the gender labels of the characters resulted in a new dataset of about 23,000 conversations, each labeled as either FF (female to female) or MM (male to male). All conversations between female and male characters were discarded.
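The labeling step described above can be sketched roughly as follows. This is a minimal illustration, not the code we ran: the record layout, the function name label_conversation and the gender codes "f" and "m" are assumptions made for the example.

```python
from typing import List, Optional

def label_conversation(speaker_genders: List[str]) -> Optional[str]:
    """Return 'FF' or 'MM', or None when the exchange should be discarded.

    Assumes genders are coded as 'f' / 'm' (any case); anything else,
    including mixed-gender pairs, is treated as unusable.
    """
    genders = {g.lower() for g in speaker_genders}
    if genders == {"f"}:
        return "FF"
    if genders == {"m"}:
        return "MM"
    return None

# Hypothetical records: one dict per conversation.
conversations = [
    {"genders": ["f", "f"], "lines": ["Are you coming?", "In a minute."]},
    {"genders": ["m", "f"], "lines": ["Where were you?", "Out."]},
    {"genders": ["m", "m"], "lines": ["You ready?", "Always."]},
]

# Keep only the conversations that received an FF or MM label.
labeled = [
    (label, c["lines"])
    for c in conversations
    if (label := label_conversation(c["genders"])) is not None
]
```

With the three toy records above, only the first and last conversation survive; the mixed one is dropped.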
However, the label distribution of conversations in the new dataset was unbalanced. The new dataset included around 20,000 conversations labeled as MM and only 3,000 labeled as FF. Perhaps this is an indication of how much less screen time women get compared to men (and indeed how important the Bechdel test was for pointing this out), but since we don’t know much about how the labels were assigned it is hard to say for sure. To mitigate the effect of working with an unbalanced dataset, we collected a subsample with an equal label distribution. To increase the number of samples in that set we decided to split the conversations, such that each new conversation only included two or three utterances. Hence we changed the definition of a conversation from the original setup, where a conversation was defined to be a whole dialog between two characters, to instead be defined as two characters exchanging two or three lines. This resulted in a balanced dataset of about 10,000 conversations.
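As a rough sketch of the two steps above, splitting dialogs into short exchanges and balancing the labels might look like this. The chunking rule (non-overlapping windows of up to three lines, dropping leftovers shorter than two), the downsampling strategy and the order of the two steps are all assumptions for illustration.

```python
import random
from typing import Dict, List, Tuple

Sample = Tuple[str, List[str]]  # (label, lines)

def split_conversation(lines: List[str], chunk_size: int = 3,
                       min_size: int = 2) -> List[List[str]]:
    """Cut a dialog into non-overlapping chunks of 2 to 3 utterances."""
    chunks = [lines[i:i + chunk_size] for i in range(0, len(lines), chunk_size)]
    return [c for c in chunks if len(c) >= min_size]

def balance(samples: List[Sample], seed: int = 0) -> List[Sample]:
    """Downsample every label to the size of the smallest class."""
    rng = random.Random(seed)
    by_label: Dict[str, List[Sample]] = {}
    for sample in samples:
        by_label.setdefault(sample[0], []).append(sample)
    n = min(len(items) for items in by_label.values())
    balanced: List[Sample] = []
    for items in by_label.values():
        balanced.extend(rng.sample(items, n))
    rng.shuffle(balanced)
    return balanced
```

For example, a seven-line dialog yields two three-line chunks and the single leftover line is dropped, since one line on its own is not an exchange.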
As previously mentioned, our aim was to use the dataset to train an AI classification model for investigating gender biases in movie dialogs. We wanted to make sure that the text in the dataset was not biased in a way that would make the classification decision trivial. For example, if all conversations between females in the dataset included female third-person pronouns (she, her), whereas all male conversations included male third-person pronouns (he, him), the classification could be based simply on finding those words, which would harm the quality of the gender bias investigation. However, from analysing the occurrence of third-person pronouns in the dataset with regard to class membership, we could conclude that third-person pronouns were equally distributed over both classes (see figure), and hence we can assume that they will not skew the results.
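A check of this kind can be sketched in a few lines. The version below is an assumption about how such an analysis could be done, not a record of our exact procedure: it reports, per class, the share of conversations that contain at least one female or male third-person pronoun, using only the four pronouns named above.

```python
import re
from typing import Dict, Iterable, Tuple

FEMALE_PRONOUNS = {"she", "her"}
MALE_PRONOUNS = {"he", "him"}

def pronoun_rates(samples: Iterable[Tuple[str, str]]) -> Dict[str, Dict[str, float]]:
    """samples: (label, text) pairs.

    Returns, for each label, the fraction of conversations that mention
    at least one female / male third-person pronoun.
    """
    totals: Dict[str, int] = {}
    hits: Dict[str, Dict[str, int]] = {}
    for label, text in samples:
        words = set(re.findall(r"[a-z']+", text.lower()))
        totals[label] = totals.get(label, 0) + 1
        counts = hits.setdefault(label, {"female": 0, "male": 0})
        counts["female"] += bool(words & FEMALE_PRONOUNS)
        counts["male"] += bool(words & MALE_PRONOUNS)
    return {
        label: {kind: count / totals[label] for kind, count in counts.items()}
        for label, counts in hits.items()
    }
```

If the rates for "female" and "male" pronouns look similar across FF and MM, the pronouns alone are unlikely to give the label away.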
The AI classification model we used was based on Bidirectional Encoder Representations from Transformers (BERT) embeddings, with a two-neuron dense layer with sigmoid activation as the classification head.
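To make the classification head concrete, here is a minimal sketch in plain Python of what a two-unit dense layer with sigmoid activation computes on top of a text embedding. The embedding is a stand-in for the BERT output; the function names, the label order and the tiny two-dimensional example are assumptions for illustration, and a real model would learn the weights during training.

```python
import math
from typing import List, Tuple

LABELS = ["FF", "MM"]  # assumed order of the two output units

def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))

def dense_sigmoid_head(embedding: List[float],
                       weights: List[List[float]],
                       biases: List[float]) -> List[float]:
    """One dense layer: a weighted sum per output unit, then a sigmoid.

    weights holds one row per output unit, each as long as the embedding.
    """
    return [
        sigmoid(sum(w * x for w, x in zip(row, embedding)) + b)
        for row, b in zip(weights, biases)
    ]

def predict(embedding: List[float],
            weights: List[List[float]],
            biases: List[float]) -> Tuple[str, float]:
    """Return the label of the highest-scoring unit and its score."""
    scores = dense_sigmoid_head(embedding, weights, biases)
    best = max(range(len(scores)), key=scores.__getitem__)
    return LABELS[best], scores[best]
```

Because each unit has its own sigmoid, the two scores are independent and need not sum to one; the prediction is simply the unit with the higher score.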
The best result we achieved was a 74% accuracy. When we trained it on slightly longer phrases we reached 81% accuracy. This indicates that we are still far from a world where men and women in movies are created equal.
To better understand the model, we decided to dig a bit deeper into what the model got wrong and took out some randomly selected examples where the model was over 90% sure of its prediction to see if there were any patterns. One example from each category was excluded because of explicit language.
Understanding the model
So what is the model looking at in these examples? To better understand this, we decided to use an explainability framework called LIME, which can be applied to any type of model to better understand it. LIME changes the input of data samples to observe how this affects the predictions. In this case, what we get out is an indication of which words are important for determining the prediction that the model has made. There is a caveat here, however, in that LIME only shows us single words without telling us anything about how the context that the word is in might be impacting its importance. In the examples, therefore, you will see a few very common words highlighted - like ‘them’ or 'about' or ‘want’ - where it is likely that the model has relied on these words because of their context rather than the words themselves.
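The idea of perturbing the input and watching the prediction move can be illustrated with a much simpler stand-in. The sketch below is not LIME itself (LIME samples many perturbed inputs and fits a local linear model to them); it only removes one word at a time and records how much the predicted probability changes. The toy_predict_proba function and its cue words are hypothetical placeholders for a trained classifier.

```python
from typing import Callable, List, Tuple

def word_importance(text: str,
                    predict_proba: Callable[[str], float]) -> List[Tuple[str, float]]:
    """Leave-one-word-out importance, largest absolute change first."""
    words = text.split()
    base = predict_proba(text)
    scores = []
    for i, word in enumerate(words):
        perturbed = " ".join(words[:i] + words[i + 1:])
        # Positive value: the word pushed the prediction up.
        scores.append((word, base - predict_proba(perturbed)))
    return sorted(scores, key=lambda pair: abs(pair[1]), reverse=True)

def toy_predict_proba(text: str) -> float:
    """Hypothetical stand-in for the model's probability of the FF class."""
    cues = {"mom": 0.3, "sister": 0.2}
    score = 0.5 + sum(cues.get(w.lower().strip(".,!?"), 0.0) for w in text.split())
    return min(score, 1.0)

ranking = word_importance("Call Mom before dinner", toy_predict_proba)
```

With the toy predictor, "Mom" ends up at the top of the ranking and the other words score zero. Like LIME's output, this shows single words only and says nothing about how context changes their effect.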
If we look at the examples of where the model has made the wrong prediction while being very sure that it is right, this can help draw back the curtain a bit to see what it is that the model uses to make its prediction. A couple of words, like ‘Mom’, ‘dinner’ and ‘sister’, as well as names of other people, seem to steer the model a lot toward making a prediction that the speaker is female. For men, words like ‘chains’, 'corpse’ and ‘Jaeger shot’ (hmm…) steer the model toward predicting that the speaker is male. Of course, there is a limit to the types of conclusions we can draw from this, and further research would need to be done to understand the workings of the model at a deeper level.
It is remarkable that with just a two or three phrase exchange in a movie, we can predict with a 74% accuracy whether a conversation is between two men or between two women. It would seem like we have a long way to go before men and women are created equal in movies.
But there is hope, and a lot has happened in the last few years, not least the #metoo movement, which has led to an increased focus on the experiences of women. There has also been an increase in the share of female directors in the top 100 movies list, as this article shows, from an average of 4.8% between 2007 and 2019 to a spike of 10.6% in 2019. So change is on the way!
But until then, try out our classifier and see if you can trick it when writing conversations between men and conversations between women. If you can - congratulations, you’re part of a much needed change! If not, perhaps it’s worth having a think about how you are portraying men and women differently in your work. We will certainly be thinking about this the next time we try our hand at writing a movie script!