Success with Style part 6: retrospective and links

This section is only for readers who want to recreate the experiments themselves. If you want to look in depth at the analyses, read:

Retrospective thoughts

My main thoughts are:

  • I wish I’d double- then triple-checked the data. I produced the source data a couple of years ago when I first looked at this, and it had some errors (mainly around column sorts not capturing all columns). Thanks, Excel.
  • I should have asked the original authors for their methods. It was an interesting exercise to try to repeat it, but there was no harm in asking
  • I should have made R do more of the hard work around producing images and other things I could have automated better
  • I wish I’d learnt about GutenbergR sooner, so I could download different books, drop poetry (which is not of interest to me) and replace it with another genre

Next time I would:

  • get more books and genres (using GutenbergR)
  • focus on fewer tests and review the tests I use

For anyone looking to do their own experiments

Use the LIWC for machine analysis

The LIWC not only gave better machine learning performance; its own categories and tags also gave a better range of significant results than the Penn treebank. Generally I found tone (emotion) the most interesting measure, as it assigns human feelings to the tags rather than just categorising grammatically (or is it linguistically?) what kind of word each one is.

Ultimately this study has been about what this means for readers and writers and that’s why the emotion is of most relevance to me.

That, and the fact that the LIWC is still being researched and updated (the last update was in 2015, compared with Penn’s from 1996), means that it has the potential for a longer shelf life. I also found it easier to use.

The main downside is that it is not free.

Don’t skip the punctuation or action

The most surprising result was how much punctuation affected the results. I tried running the experiments without it, but machine learning performance dropped by around 5 percentage points, while the tag differences did not seem to change.

The LIWC results show that quotation marks (‘quote’) make up a higher proportion of tags in unsuccessful books, with the exception of short stories. In poorly performing Adventure books they account for nearly 1% of all tags.
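
For anyone recreating this, a tag's proportion is just its count divided by the total number of tags in the text. The following is a minimal Python sketch of that calculation, not the R code used in the study. The function name `tag_proportions` and the sample tags are assumptions made up for illustration; they are not LIWC output or data from the experiments.

```python
from collections import Counter


def tag_proportions(tags):
    """Return each tag's share of all tags as a fraction between 0 and 1."""
    counts = Counter(tags)
    total = sum(counts.values())
    if total == 0:
        # No tags at all: nothing to report.
        return {}
    return {tag: n / total for tag, n in counts.items()}


# Made-up tag sequence for illustration only; not data from the study.
example_tags = [
    "noun", "verb", "quote", "noun", "quote",
    "period", "verb", "noun", "adj", "noun",
]

shares = tag_proportions(example_tags)
print(f"quote share: {shares['quote']:.1%}")
```

Dropping the punctuation tags from the list before calling the function would give the "without punctuation" variant, which is one way to compare how the remaining proportions shift.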

Links

Sponsor

This work required some small funding from Richardson Online Ltd, my consulting company, for work on the R code.