“Success with Style” part 5: what does machine analysis mean for writers?
Now that this machine analysis of what makes a good and bad book is complete, what does it actually mean for writers?
I started this analysis back in May. Actually, it began far earlier, when the original Success with Style paper was published in 2014; it took me that long to realise I needed help with the analysis, even after I gained my R qualification.
And because the results raised more questions with each analysis, something I expected to take a month end-to-end became three months. Even now there is more I could do, but I have done enough to call it a day.
Success with Style: a recap
If you’ve not read the other parts (and they can be quite stats heavy), this series was prompted by a 2014 paper, Success with Style, that claimed to be able to say what makes a good book. However, my reading of it found some flaws, and it was unclear how the original authors constructed their experiment.
I took 758 books from the Project Gutenberg library across 8 genres (Adventure, Detective/mystery, Fiction, Historical fiction, Love-story, Poetry, Sci-fi and Short-stories), with half of them deemed successes (more than 30 downloads in the past month) and the rest unsuccessful/failures. I then put these through a variety of analyses:
- readability (how easy the books are to read, in particular Flesch-Kincaid grade level formula)
- the Stanford Tagger that uses the Penn treebank to analyse PoS (parts of speech)
- the LIWC (Linguistic Inquiry and Word Count) to analyse PoS
The latter two, the Penn and LIWC PoS analyses, split all the words in the books into different categories and do so in slightly different ways.
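As a rough illustration of the readability side, the Flesch-Kincaid grade level can be computed from counts of words, sentences and syllables. This is a minimal sketch in Python using a crude vowel-group syllable heuristic of my own, not the implementation from the original analysis (which was done in R):

```python
import re

def count_syllables(word: str) -> int:
    """Crude syllable estimate: count groups of consecutive vowels."""
    groups = re.findall(r"[aeiouy]+", word.lower())
    return max(1, len(groups))

def flesch_kincaid_grade(text: str) -> float:
    """Flesch-Kincaid grade level:
    0.39 * (words/sentences) + 11.8 * (syllables/words) - 15.59"""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    syllables = sum(count_syllables(w) for w in words)
    return (0.39 * (len(words) / len(sentences))
            + 11.8 * (syllables / len(words))
            - 15.59)

sample = "The cat sat on the mat. It was warm."
print(round(flesch_kincaid_grade(sample), 2))
```

A grade of around 9-10 roughly corresponds to the 15-year-old reading level mentioned in the findings below.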
I then repeated these analyses in slightly different ways: first using 2018 download data (with 41 books changing their success/fail category) and then analysing just the first 3,000 words on the principle that it is often only the first chapter that agents or publishers review when considering a book.
In all tests I was looking for statistical significance. The Harvard Business Review has a good overview of what this is, but in summary it is a test of whether your results are due to chance or whether there is likely an underlying reason for them, rather than the luck of the draw.
The P-value used to determine significance in all tests was 0.05, which is a fairly standard choice (note that this may have been too high – see the end of this page). Without reporting statistical significance it is hard to say whether your test means something or whether the result was just the luck of the data you drew.
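To make the idea concrete, here is a minimal permutation test in Python (the actual tests in this series were run in R; the numbers below are made up for illustration): shuffle the success/failure labels many times and count how often a difference at least as large as the observed one appears by chance.

```python
import random
from statistics import mean

def permutation_p_value(group_a, group_b, n_permutations=5000, seed=1):
    """Two-sided permutation test: how often does shuffling the labels
    produce a mean difference at least as large as the observed one?"""
    rng = random.Random(seed)
    observed = abs(mean(group_a) - mean(group_b))
    pooled = list(group_a) + list(group_b)
    hits = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)
        diff = abs(mean(pooled[:len(group_a)]) - mean(pooled[len(group_a):]))
        if diff >= observed:
            hits += 1
    return hits / n_permutations

# Hypothetical adjective proportions (%) for successful vs unsuccessful books
successful   = [5.1, 4.8, 5.3, 4.9, 5.0, 5.2]
unsuccessful = [6.0, 6.2, 5.9, 6.1, 6.3, 5.8]
p = permutation_p_value(successful, unsuccessful)
print(p < 0.05)  # prints True: below 0.05 would be called 'significant'
```

The low p-value here simply reflects that the two made-up groups barely overlap; real book data is much noisier.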
If you want to look in depth at the analyses read:
- part 1 – setting out the problem and the original Success with Style experiment
- part 2 – recreating the original experiment
- part 3 – putting the original data through a different text analysis
- part 4 – updating the data and looking at just the first chapter
- part 6 – experiment retrospective and code links, for anyone looking to recreate the experiments
Success with Style findings summary
The main findings are that:
- analysing an entire book is more accurate than just its first 3,000 words — it’s hard to judge a book on its first chapter
- the genre affects the significance of tests and not all genres are as easy to predict as others — science-fiction has the most exceptions to the principles
- books that are emotional (lots of either positive or negative emotions) tend to be unsuccessful
- adjectives and adverbs predominate in poorly performing books, and nouns and verbs in successful ones, but neither is a significant determiner of success or failure
- don’t talk too much — dialogue heavy books were more unsuccessful
- readability (as a computer measurement) is not generally a significant determiner of success, but don’t use too many long words in your writing. More successful books are slightly harder to read (have a higher readability grade) but can still be understood by a 15-year-old
- these rough criteria generally stood up even when the success/fail criteria changed over time, meaning there is some underlying value in them
- the LIWC is more accurate than the Penn treebank for predicting the success of a book
- including punctuation in the analysis leads to better machine learning prediction performance
What these findings mean for writers
First, I’m not about to say that there are rules for writing. At best, writers such as George Orwell* or Robert McKee have laid out principles, not rules (*Orwell does call his ‘rules’, but his last rule is to break them when needed). This analysis is not an attempt to create a set of rules.
Secondly, as with many experiments, it is dangerous to extrapolate beyond the original dataset. The 758 books in the Gutenberg dataset are all out of copyright and so are mostly from the 1850s to the early 20th century. The oldest author was Dante (born 1265) and the most recent Samuel Vaknin (born 1961); Gutenberg only lists authors’ birth and death dates, not publication dates. Many are also well-known classics, such as Robinson Crusoe, and so may have a built-in bias towards being downloaded due to name recognition.
Machine analysis is not a perfect tool. Even tools such as the LIWC, which is updated regularly (Penn’s work was mainly carried out between 1989 and 1996), still cannot accurately tell from context what a word such as ‘execute’ means: whether it is executing a plan or executing Ned Stark.
Finally, I didn’t clean my data: I didn’t remove common words or check for errors in the Gutenberg transcriptions. Cleaning is not essential, but skipping it may have led to some differences from what a cleaned-up dataset would have produced.
The first chapter is a bad guide for overall success
Machine analysis of the success of a book failed when making a judgment solely on the first 3,000 words. At around 55%, machine learning performance was only marginally better than a 50/50 guess for both the Penn treebank with Readability analysis and the LIWC analysis.
|PoS analysis|Accuracy|95% confidence interval|
|---|---|---|
|Penn & Readability 2013 (complete book)|65.62%|57.7-72.9%|
|Penn & Readability 2018 (complete book)|65.00%|57.5-72.8%|
|Penn & Readability 1st 3,000 words|55.62%|47.6-63.5%|
|LIWC 2013 (complete book)|75.00%|67.6-81.5%|
|LIWC 2018 (complete book)|71.70%|64.0-78.6%|
|LIWC 1st 3,000 words|56.25%|48.2-64.0%|
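The accuracy figures above appear consistent with a test set of 160 books (e.g. 55.62% = 89/160, 75.00% = 120/160), which the sketch below assumes. As an illustration of where such an interval comes from, here is the normal-approximation (Wald) 95% interval for a classifier’s accuracy; the intervals in the table differ slightly, which I assume is because they come from an exact binomial calculation in R:

```python
import math

def accuracy_ci_95(correct: int, total: int):
    """Normal-approximation (Wald) 95% confidence interval for accuracy."""
    p = correct / total
    half_width = 1.96 * math.sqrt(p * (1 - p) / total)
    return p - half_width, p + half_width

# 120 of 160 books classified correctly, as in the LIWC 2013 row (n=160 assumed)
lo, hi = accuracy_ci_95(120, 160)
print(f"{lo:.1%} - {hi:.1%}")  # prints "68.3% - 81.7%"
```

The intuition: the smaller the test set, the wider the interval, which is why an accuracy of 75% on only 160 books still leaves a band of more than ten percentage points.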
The first-chapter analysis did produce significant results for some of the same tests as the full-book analysis. However, assuming that analysing the complete book is the ‘truer’ test due to its better machine learning performance, the first chapter is not as valuable a basis for analysis as the whole book.
This means that machine analysis using the Penn Treebank, Readability or LIWC categories is not suitable for agencies, publishers or other services that ask to review based on one sample chapter.
However, do human readers for agencies react the same way as a machine? Looking at sites such as QueryShark, professional readers look at the cover letter/email for things such as who the protagonist is and what choices they face; QueryShark, for example, won’t even request the first chapter until they’ve read a query email.
An experiment would be to run sample chapters of successful and unsuccessful books past professional agency readers to get their view, but that is an experiment for another day.
Don’t be overly emotional
Overly emotional books perform poorly, whether it is overly negative or positive. The only emotional category seen commonly in successful books was Anger.
That’s not to say that emotion shouldn’t be included but that it should not overwhelm writing. This includes both the dialogue and the action.
This applied to all genres except Adventure, and even there the positive effect was small compared with the overwhelmingly strong net difference in unsuccessful books.
This ties in with writing tips on avoiding melodrama: show characters’ reactions and details rather than spelling things out:
Remember that the drama doesn’t have to be all the way at eleven in order to affect the reader. Readers get into the little aspects of people’s lives, too.
And writing extreme emotion well by not necessarily expressing it:
Unfortunately, many writers make the mistake of assuming that to be gripping, emotion must be dramatic. Sad people should burst into tears. Joyful characters must express their glee by jumping up and down. This kind of writing results in melodrama, which leads to a sense of disbelief in the reader because, in real life, emotion isn’t always so demonstrative.
And finally there is of course the Robot Devil’s demand that you shouldn’t just have your characters announce how they feel (so avoid naming emotions, or using too many per paragraph).
The emotion tags in the LIWC results supported this (Penn doesn’t offer emotion as a tag), with unsuccessful books overwhelmingly dominating the emotion categories:
Make it readable — but don’t worry too much
Although the Flesch-Kincaid readability was significant, with unsuccessful books scoring slightly lower (roughly one school year), I do not think the difference was great enough to be important.
The LIWC tests related to readability are the proportion of six-letter or longer words (i.e. long words) and dictionary words:
Looking at the overall rating, the proportion of six-letter or longer words and mean words per sentence were flagged as significant.
Overall make it readable without too many long words but don’t worry too much about the specifics.
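Both of these measures are simple to compute. A minimal sketch using my own approximations of LIWC’s six-letter-word (‘Sixltr’) and words-per-sentence categories, not LIWC’s actual implementation:

```python
import re

def long_word_proportion(text: str) -> float:
    """Proportion of words with six or more letters (a 'Sixltr'-style measure)."""
    words = re.findall(r"[A-Za-z']+", text)
    return sum(len(w) >= 6 for w in words) / len(words)

def mean_words_per_sentence(text: str) -> float:
    """Average sentence length, splitting sentences on . ! ? characters."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    return len(words) / len(sentences)

sample = "Extraordinary circumstances demanded patience. He waited."
print(round(long_word_proportion(sample), 2), mean_words_per_sentence(sample))
```

Even a toy example like this shows how heavily the measures depend on tokenisation choices, which is one reason different tools give slightly different readability figures.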
Avoiding adjectives isn’t the best advice
This very brief University of Pennsylvania study of a few books and their contents suggested that adjectives and adverbs predominate in badly written books while good books have a higher proportion of nouns and verbs.
These 2 charts suggest at first that this is the case.
Yet while Adjectives was the PoS with the greatest relative importance in the Penn PoS test of the original data this was not repeated in the 2018 data nor in the LIWC tests.
Likewise while adjectives and adverbs dominate unsuccessful books (ie negative plots) in most genres, this isn’t always the case. And the difference is small compared to noun dominance — which again has mixed results across the genres.
Finally, I carried out a fresh T-test (a common test to find significance) to find statistical significance for adjectives, adverbs, nouns and verbs overall and adjectives per genre:
The above charts do show that successful books have a lower proportion of adjectives and adverbs. However, contrary to the University of Pennsylvania post, successful books also have a lower proportion of nouns and verbs.
One reason is suggested by The Economist’s Johnson column:
How can usage-book writers have failed to notice that good writers use plenty of adverbs? One guess is that they are overlooking many: much, quite, rather and very are common adverbs, but they do not jump out as adverbs in the way that words ending with –ly do. A better piece of advice than “Don’t use adverbs” would be to consider replacing verbs that are combined with the likes of quickly, quietly, excitedly by verbs that include those meanings (race, tiptoe, rush) instead.
And for those advocating verbs instead, he adds:
It is hard to write without verbs. So “use verbs” is not really good advice either, since writers have to use verbs, and trying to add extra ones would not turn out well.
Not one of these results is statistically significant.
What this suggests is that the use of adjectives may be a symptom of bad writing but not a cause: their overuse is not a reason why a book is unsuccessful. This is the same conclusion that the University of Pennsylvania post that analysed books for their adverbs, adjectives, nouns and verbs came to:
I suspect that the differences in POS distributions are a symptom, not a cause, and that attempts to improve writing by using more verbs and adverbs would generally make things worse. But still.
What this means for writers, then, is: avoid them, but don’t worry too much.
Don’t talk too much
Too much dialogue, as indicated by quotation marks (“), was a sign of an unsuccessful book in all genres except for short stories.
For writers this most likely means that focusing on dialogue at the expense of description is not popular with readers.
Note that the quote mark proportion is a very rough approximation — it doesn’t allow for long paragraphs of dialogue, nor look at books with no quote marks and what their pattern is.
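For anyone wanting to try this rough proxy themselves, a minimal sketch (normalising quote marks per word token is my assumption, not necessarily what the analysis used):

```python
def quote_mark_proportion(text: str) -> float:
    """Rough dialogue proxy: straight and curly double quote marks per word."""
    quotes = sum(text.count(q) for q in ('"', '\u201c', '\u201d'))
    words = len(text.split())
    return quotes / words if words else 0.0

dialogue_heavy = '"Hello," she said. "Are you coming?" "Yes," he replied.'
descriptive = 'The rain fell steadily on the empty street as night drew in.'
print(quote_mark_proportion(dialogue_heavy) > quote_mark_proportion(descriptive))
# prints True
```

As the caveat above notes, this measure cannot distinguish a few long speeches from many short exchanges, and single-quote dialogue conventions would need extra handling.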
Genre shapes the rules — and science-fiction breaks them
One thing that was consistent in all analyses was that analysing by genre showed variation within the overall finding for the category. For readability, for example, it was only significant for the Adventure, Detective/mystery and Love story genres. Likewise in the LIWC tests there was no test category which produced a significant result across all genres.
This makes sense — after all, poetry was included in the analysis, and it’s hard to see a reader applying the same mental rules of what they count as good to a poem as to a science-fiction novel. Even short stories have slightly different writing ‘rules’ — the quotation proportion, for instance, shows that short stories can be more dialogue heavy than full length novels.
What is interesting is how some genres live up to their stereotypes. In the LIWC tests, for example, Clout “refers to the relative social status, confidence, or leadership that people display” and was defined through analysis of speech. For Clout, the Adventure, Detective/mystery and Fiction genres produced significant results. In my imagining of the archetypes, detectives and adventure heroes do have a certain clout or leadership.
Similarly, the readability tests were not significant for science-fiction or poetry. Again, I don’t think a poem is judged by its readability, nor, in my own experience, is poor readability or writing style a hindrance to sci-fi.
One thought is that if the idea or story is gripping, a science fiction novel has a greater chance of success regardless of style. More importantly, science fiction books often use long scientific terms or made-up words, which fare badly in readability analysis tools. I know I’ve read enough sci-fi novels with long scientific or pseudo-scientific terms, long sentences and wooden characters, but persevered because I found the concept interesting.
This coincides with recent research which suggests that readers treat science-fiction as different and ‘read stupidly’ compared with other genres:
Readers of the science fiction story “appear to have expected an overall simpler story to comprehend, an expectation that overrode the actual qualities of the story itself”, so “the science fiction setting triggered poorer overall reading”.
Whether this is the cause or the effect of science-fiction having its own rules in this study is not clear.
Summary of principles: follow good writing practice
What then has this study revealed? First, that one old saw about writing isn’t quite right. While adverbs and adjectives tend to predominate in unsuccessful books, they aren’t statistically significant.
This means that the higher proportion of adverbs and adjectives (and nouns and verbs) in unsuccessful books in this set of results doesn’t mean much, and the same is likely true of other results claiming the same.
While personally I prefer writing with fewer adjectives and adverbs, they’re not a hindrance. Children’s books in particular have more of them: Harry Potter and his chums tend to say things “severely”, “furiously”, “loftily” and so on, but the series is an undoubted hit with readers.
So the next time someone tells you that you must use fewer adverbs, or to swiftly remove unnecessary adjectives, you can tell them “no, it’s not statistically significant”. And they’ll love you for it (this advice is not statistically significant).
The next outcome is: know your genre. By splitting the books into a wide range of genres, not just fiction but poetry, love-stories and science-fiction among others, we saw that the so-called rules varied.
Sci-fi in particular was the exception to many of the findings. The tests I ran cannot say why, but we can speculate. One reason is that the audience for sci-fi is likely to be quite different from that of poetry, love stories and even regular fiction. Hard research on this is difficult to find, but sci-fi audiences do seem different from other readers, and this quote from a survey of sci-fi readers suggests they may value world building over other considerations:
The creativity that goes into world building and bringing ‘otherworldly’ characters to life in a way that we can identify with.
It may also be that the theme or subject of the books — something that the Penn and LIWC analyses cannot work out — may be more gripping to readers such that they ignore or overlook what would otherwise be considered weak writing.
Don’t talk too much: readers want more than just dialogue. Unlike films, where heavy exposition is unwelcome, in books readers seem to prefer stories with a good balance of description to talking.
Read more than the first chapter to get a true sense of a book, although I can’t yet say how many words give the best approximation.
Finally, don’t be overly emotional, either too positive or too negative, in your writing. This suggests the old ‘show, don’t tell’ writing saw is true: rather than telling us that someone is angry (and using that word), show their reaction, using nouns and verbs (and yes, adverbs if you must).
Comparison with the original Success with Style findings
These findings contrast with some of the original findings. If you can bear the sidebar of shame, the Daily Mail has summed up the original findings in a more readable way than the original paper:
Successful books tended to feature more nouns and adjectives, as well as a disproportionate use of the words ‘and’ and ‘but’ – when compared with less successful titles.
But my tests found that the proportion of adverbs, adjectives, nouns or verbs wasn’t statistically significant.
The most popular books also featured more verbs relating to ‘thought-processing’ such as ‘recognised’ and ‘remembered’.
This is statistically significant for most, but not all, genres so is something we agree on.
Verbs that serve the purpose of quotes and reports, for example the word ‘say’ and ‘said’, were heavily featured throughout the bestsellers.
My tests found the exact opposite: it was statistically significant that books with quotes did worse in most genres. Now, writers are often told to use only ‘said’ for dialogue tags, so it may be that bestsellers follow this advice while poorer writers use other terms, which is why successful books have the higher ‘said’ proportion. But a quick search for that schoolboy favourite, ‘ejaculated’ (as in to speak suddenly or sharply), found it in around half of all successful books (99 out of 206 books), so that is another reason to doubt this finding.
Alternatively, less successful books featured ‘topical, cliché’ words and phrases such as ‘love’, as well as negative and extreme words including ‘breathless’ and ‘risk’.
I didn’t look for specific words, though there are tools to do so if you wish. However, my results did say that overly emotional books do worse, and that ties in with ‘love’, ‘breathless’ and ‘risk’.
Poor-selling books also favoured the use of ‘explicitly descriptive verbs of actions and emotions’ such as ‘wanted’, ‘took’, ‘promised’ ‘cried’ and ‘cheered’. Books that made explicit reference to body parts also scored poorly.
This was true but the difference was small and not statistically significant for verbs. For emotions though it was significant in most genres (except, of course, sci-fi, that malcontent genre).
So only two of the original findings were fully agreed with in this study, and one partially.
Throughout all this, at the risk of being melodramatic, be true to yourself and write for yourself. This analysis gives pointers to the signs of a bad book (‘unsuccessful’ in the more diplomatic description) but that doesn’t mean you must slavishly follow these principles.
Write in the way you’re comfortable with and for the reasons you want. Just don’t be overly dramatic about it.
… with one last thing — too many things?
There is a big caveat with this study. I asked my friend, stats professor and consultant Dr Ben Parker (seriously clever with numbers, not bad with puns, and he offers some very reasonably priced but quality stats training courses and consultancy), to review it.
He thinks too many things may have been tested. He’s no doubt right: the appropriate statistical tests depend on the independent variables and their levels, and there is a chance I used the wrong ones by analysing too much.
Ben was also concerned that the p-value threshold of 0.05 was too high and may need to be 0.01 or lower. This is because if we test 20 variables at a 0.05 threshold, we already expect around one of them to appear significant by chance alone – and 1/20 is 0.05. Although I ran the tests separately each time in the code, there is a chance that I merged and analysed too much per test. This could also be a reason why the original research lacked statistical significance tests.
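Ben’s concern is the classic multiple-comparisons problem. The standard Bonferroni correction illustrates the usual fix (a sketch of the general principle, not a correction applied to these results):

```python
def bonferroni_alpha(alpha: float, n_tests: int) -> float:
    """Bonferroni-corrected per-test threshold: divide alpha by the number of tests."""
    return alpha / n_tests

def expected_false_positives(alpha: float, n_tests: int) -> float:
    """Expected number of 'significant' results from pure chance."""
    return alpha * n_tests

# With 20 tests at alpha = 0.05, roughly one test is expected to look
# significant by chance alone; Bonferroni drops the per-test threshold to 0.0025.
print(expected_false_positives(0.05, 20), bonferroni_alpha(0.05, 20))
```

Bonferroni is conservative; less strict corrections exist, but the principle of lowering the per-test threshold as the number of tests grows is the same.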
I did run the tests by testing values separately as well as all together, but I admit that I don’t have years of stats experience under my belt (unlike Ben, who knows his stuff) and may have overlooked some things. My code is on GitHub, and anyone willing to check is welcome to review and amend it. The conclusion is that the results are probably sound but the statistical significance may not be right.
However, even if the significance results are wrong, he suspects it is more than likely that the resulting charts and the differences they show are still broadly correct, which is why I have left the information as it stands.
Try it for yourself, fork it if you disagree, and use it for your own amusement.
All links to data and code are found in the final post.