This section is only for readers who want to recreate the experiments themselves. For an in-depth look at the analyses, read:
- part 1 – setting out the problem and the original Success with Style experiment
- part 2 – recreating the original experiment
- part 3 – putting the original data through a different text analysis
- part 4 – updating the data and looking at just the first chapter
- part 5 – what this means for writers
My main thoughts are:
- I wish I’d double-checked, then triple-checked, the data. I produced the source data a couple of years ago when I first looked at this, and it had some errors (mainly column sorts that did not capture all columns). Thanks, Excel.
- I should have asked the original authors for their methods. It was an interesting exercise to try to repeat the experiment, but there would have been no harm in asking
- I should have made R do more of the hard work around producing images and other things I could have automated better
- I wish I’d learnt about GutenbergR sooner, so that I could have downloaded different books, dropped poetry (which is not of interest to me) and replaced it with another genre
Next time I would:
- get more books and genres (using GutenbergR)
- focus on fewer tests and review the tests I use
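For anyone curious what "more books and genres" might look like in practice, here is a minimal sketch of pulling one genre with the gutenbergr package. It is not code from this study: the helper names `pick_ids_by_subject` and `download_genre` and the subject pattern are made up for illustration, and it assumes gutenbergr is installed and a network connection is available.

```r
# Sketch only: helper names and the subject pattern are hypothetical.

# Pick up to `n` Project Gutenberg IDs whose subject heading matches `pattern`.
# `subjects` should look like gutenbergr::gutenberg_subjects
# (columns: gutenberg_id, subject_type, subject).
pick_ids_by_subject <- function(subjects, pattern, n = 10) {
  hits <- subjects[grepl(pattern, subjects$subject, ignore.case = TRUE), ]
  head(unique(hits$gutenberg_id), n)
}

# Download the chosen books. Nothing runs until you call this,
# because it needs the gutenbergr package and hits the network.
download_genre <- function(pattern, n = 10) {
  # gutenberg_works() defaults to English works that have text available
  works <- gutenbergr::gutenberg_works()
  subjects <- gutenbergr::gutenberg_subjects
  subjects <- subjects[subjects$gutenberg_id %in% works$gutenberg_id, ]
  ids <- pick_ids_by_subject(subjects, pattern, n)
  gutenbergr::gutenberg_download(ids, meta_fields = "title")
}

# Example (uncomment to run):
# detective <- download_genre("detective and mystery", n = 5)
```

Keeping the ID selection separate from the download means the genre filter can be checked on a small data frame before anything is fetched.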
For anyone looking to do their own experiments
Use the LIWC for machine analysis
The LIWC not only gave better machine learning performance; its own categories and tags also gave a better range of significant results than the Penn Treebank. Generally I found tone (emotion) the most interesting measure, as it assigns human feelings to the tags rather than just categorising grammatically (or is it linguistically?) what kind of word each one is.
Ultimately this study has been about what this means for readers and writers and that’s why the emotion is of most relevance to me.
That, and the fact that the LIWC is still being researched and updated (most recently in 2015, compared with Penn’s from 1996), means that it has the potential for a longer shelf life. I also found it easier to use.
The main downside is that it is not free.
Don’t skip the punctuation or action
The most surprising result was how punctuation affected results. I had tried running experiments without it but the machine learning performance decreased by around 5 percentage points while the tag differences did not seem to change.
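If you want to try a similar with-and-without-punctuation comparison, a rough base R sketch follows. It is not the code used here, and the LIWC computes its own punctuation categories. The function names `punct_per_1000` and `strip_punct` are hypothetical, and the sketch simply treats anything matching `[[:punct:]]` as punctuation.

```r
# Sketch only: function names are hypothetical, not from the study's code.

# Punctuation marks per 1,000 whitespace-separated words, one value per text.
punct_per_1000 <- function(text) {
  n_punct <- lengths(regmatches(text, gregexpr("[[:punct:]]", text)))
  n_words <- lengths(strsplit(trimws(text), "\\s+"))
  1000 * n_punct / n_words
}

# Remove punctuation so the same text can be re-run as a comparison.
strip_punct <- function(text) {
  gsub("[[:punct:]]", "", text)
}

sample_text <- "Hello, world! Is it?"
punct_per_1000(sample_text)               # rate with punctuation kept
punct_per_1000(strip_punct(sample_text))  # rate after stripping
```

Running the classifier once on the original texts and once on the stripped versions gives the kind of comparison described above.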
- R code on GitHub – it’s a bit messy but gets the job done
- R analysis
- 3k trimmer
- Source data on Google Drive
- source data
- original Success with Style paper
- R library to download Gutenberg data easily
This work required some small funding from Richardson Online Ltd, my consulting company, for work on the R code.