Scraping, screenplays and sexism

Print Friendly, PDF & Email

In the past couple of days there have been two big data posts that analyses sex and screenplays.

Polygraph’s Hannah Anderson and Matt Daniels scraped and analysed 2,000 screenplays and their dialogue to get data on the division of dialogue according to sex, age and other factors.

The Economist looked at data from USC Annenberg on nudity and ‘sexualised attire’ (aka revealing outfits and the like) in film, along with lead and speaking roles by sex.

script-analysed

Getting screenplay data

Both reports focused on presenting the data and key thoughts rather than delving too deep into interpretation. Analysing Hollywood is a complex business – like William Goldman said “nobody knows anything” when it comes to predicting success, let alone Hollywood and sexism.

The main thing of interest for me is the methods of analysing screenplays. Matt has a long and detailed method with links to script sources, along with the code on Github and a list of where he got the data from.

Potential uses

Both studies used data to explore issues around gender and films, but there is further potential with the data. For example:

  • emotion and sentiment – not a fan due to the drawbacks but possible to trace emotion in scripts, looking at such things as whether beginning, middle or ends are more or less emotional and is there a pattern
  • the split of action and dialogue in a script – do successful scripts have a divide (aka an avoidance of walls of text)
  • are women more confident or not – an extension of their sexism report, but it could be a question of whether female characters tend to ask more characters (or use emotional language)
  • writing level – what is the typical readability for the dialogue of heroes and villains, along with scripts in general and how does this vary by genre (would The Imitation Game or A Beautiful Mind be more difficult to read, let alone film, than Die Hard?)
  • is good writing important in a successful script – as with the study of readability, does having too many adverbs and other things that Hemingway hates hinder scripts
  • statistical significance – as Matt acknowledges, there are no statistical tests in their report, what tests could be done

Why we need this data

Maybe nothing comes out, but there is no harm in trying and while I never expect any rules to come out (Goldman is already laughing) but perhaps some very broad principles could emerge from the data. Even a finding of nothing can be something to report. The only pity is that due to grey areas of scraping we’d have to start from scratch rather than use the script data the teams have already used.

But it will be worth it and we can get away from what the Polygraph article calls “all rhetoric and no data, which gets us nowhere in terms of having an informed discussion.”

In the meantime if you want to search the data you can either check out the links or use the Polygraph tool here.