
“Success with Style” part 5: what does machine analysis mean for writers?

Now that this machine analysis of what makes a good and bad book is complete, what does it actually mean for writers?

I started this analysis back in May. In truth it began far earlier, when the original Success with Style paper was published in 2014, but it took me that long to realise I needed help with the analysis, even after I got my R qualification.

And because the results raised more questions with each analysis, something I expected to take a month end-to-end became three months. Even now there is more I could do, but I have done enough to call it a day.


Success with Style: a recap

If you’ve not read the other parts (and they can be quite stats heavy), this series was prompted by a 2014 paper, Success with Style, that claimed to be able to say what makes a good book. However, my reading of it found some flaws, and it was unclear how the original authors constructed their experiment.

I took 758 books from the Project Gutenberg library across 8 genres (Adventure, Detective/mystery, Fiction, Historical fiction, Love-story, Poetry, Sci-fi and Short-stories), with half of them deemed successes (more than 30 downloads in the past month) and the rest deemed unsuccessful/failures. I then put these through a variety of analyses:

  • readability (how easy the books are to read, in particular Flesch-Kincaid grade level formula)
  • the Stanford Tagger that uses the Penn treebank to analyse PoS (parts of speech)
  • the LIWC (Linguistic Inquiry and Word Count) tool to analyse PoS

The latter two, the Penn and LIWC PoS analyses, split all the words in the books into different categories and do so in slightly different ways.
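
For anyone curious about the readability measure, here is a minimal sketch of the standard Flesch-Kincaid grade level formula in R. The word, sentence and syllable counts are assumed to come from whichever tool you use (mine came from a Perl readability module); the function name and example counts below are purely illustrative.

    # Flesch-Kincaid grade level from basic counts (sketch, not the study's code)
    flesch_kincaid_grade <- function(n_words, n_sentences, n_syllables) {
      0.39 * (n_words / n_sentences) + 11.8 * (n_syllables / n_words) - 15.59
    }

    # About 17 words per sentence and 1.45 syllables per word gives roughly grade 8
    flesch_kincaid_grade(n_words = 17000, n_sentences = 1000, n_syllables = 24650)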

I then repeated these analyses in slightly different ways: first using 2018 download data (with 42 books changing their success/fail category) and then analysing just the first 3,000 words, on the principle that it is often only the first chapter that agents or publishers review when considering a book.

Steps taken in the analysis of the books

In all tests I was looking for statistical significance. A good overview of what this means is on the Harvard Business Review, but in summary it is a test of whether the results could be down to chance or whether it is likely that there is an underlying reason for them, rather than the luck of the draw.

The p-value threshold used to determine significance in all tests was 0.05, which is a fairly standard choice (note that this may have been too high – see the end of this page). Without reporting statistical significance it is hard to say whether your test really means something or whether the result was just the luck of the data you drew.
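
As a minimal sketch of the kind of test used throughout, assuming a hypothetical data frame books with a status column (SUCCESS/FAILURE) and a numeric feature column such as readability:

    # Unpaired (Welch) t-test comparing a feature between the two groups (sketch)
    result <- t.test(readability ~ status, data = books)
    result$p.value          # probability of a difference at least this large if there were no real difference
    result$p.value < 0.05   # TRUE means significant at the 0.05 threshold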

If you want to look in depth at the analyses, read the earlier parts of this series.

Success with Style findings summary

The main findings are that:

  • analysing an entire book is more accurate than just its first 3,000 words — it’s hard to judge a book on its first chapter
  • the genre affects the significance of tests and not all genres are as easy to predict as others — science-fiction has the most exceptions to the principles
  • books that are emotional (lots of either positive or negative emotions) tend to be unsuccessful
  • adjectives and adverbs predominate in poorly performing books, and nouns and verbs in successful ones, but neither is a significant determiner of success or failure
  • don’t talk too much — dialogue-heavy books were more likely to be unsuccessful
  • readability (as a computer measurement) is not generally a significant determiner of success, but don’t use too many long words in your writing. More successful books are slightly harder to read (have a higher readability score) but can still be understood by a 15-year-old
  • these rough criteria generally stood up even when the success/fail criteria changed over time, meaning there is some underlying value in them
  • the LIWC is more accurate than the Penn treebank for predicting the success of a book
  • including punctuation in the analysis leads to better machine learning prediction performance

What these findings mean for writers

Caveats

First, I’m not about to say that there are rules for writing. At best, writers such as George Orwell* or Robert McKee have laid out principles, not rules (*Orwell does call his ‘rules’, but his last one is to ignore them when needed). This analysis is not meant to create a set of rules.

Secondly, as with many experiments, it is dangerous to extrapolate beyond the original dataset. The 758 books in the Gutenberg dataset are all out of copyright and so are mostly from the 1850s to the early 20th century. The oldest author was Dante (born 1265) and the most recent Samuel Vaknin (born 1961); Gutenberg only records authors’ birth and death dates, not publication dates. Many are also well known as classics, such as Robinson Crusoe, and so may have a built-in bias towards being downloaded due to name recognition.

Machine analysis is not a perfect tool. Even tools such as the LIWC, which is updated regularly (Penn’s work was mainly carried out between 1989 and 1996), still cannot reliably tell from context what a word such as ‘execute’ means: executing a plan or executing Ned Stark.

Finally, I didn’t clean my data: I didn’t remove common words or check for errors in the Gutenberg transcriptions. This isn’t essential, but it may have led to some differences from what a cleaned-up dataset would have produced.

The first chapter is a bad guide for overall success

Machine analysis of the success of a book failed when making a judgment solely on the first 3,000 words. At roughly 55-56%, its machine learning accuracy was only marginally better than a 50/50 guess for both the Penn treebank/readability analysis and the LIWC analysis.

PoS analysis                               Accuracy   95% Confidence interval
Penn & Readability 2013 (complete book)    65.62%     57.7-72.9%
Penn & Readability 2018 (complete book)    65.00%     57.5-72.8%
Penn & Readability (1st 3,000 words)       55.62%     47.6-63.5%
LIWC 2013 (complete book)                  75.00%     67.6-81.5%
LIWC 2018 (complete book)                  71.70%     64.0-78.6%
LIWC (1st 3,000 words)                     56.25%     48.2-64.0%
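
The confidence intervals on accuracy come from treating each test-set prediction as a binomial trial. As a sketch, an exact binomial test in R gives an interval of roughly the right width; the 160-book test split below is an illustrative assumption, not a figure from the study.

    # 95% confidence interval for classification accuracy (sketch)
    correct <- 105   # illustrative: 105 of 160 test books classified correctly
    total   <- 160
    correct / total                       # accuracy of about 65.6%
    binom.test(correct, total)$conf.int   # roughly 0.58 to 0.73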

The analysis of the first chapter did produce significant results for some of the same tests as the full-book analysis. However, assuming analysing the complete book is the ‘truer’ test due to its better machine learning performance, the first chapter isn’t as valuable a basis for analysis as the whole book.

This means that machine analysis using the Penn Treebank, Readability or LIWC categories is not suitable for agencies, publishers or other services that ask to review based on one sample chapter.

However, do human readers for agencies react the same way as a machine? Looking at sites such as QueryShark, professional readers look at the cover letter/email for things such as who the protagonist is and what choices they face — for example, QueryShark won’t even request the first chapter until they’ve read a query email.

One experiment would be to run sample chapters of successful and unsuccessful books past professional agency readers to get their view, but that is an experiment for another day.

Don’t be overly emotional

Overly emotional books perform poorly, whether the emotion is negative or positive. The only emotional category commonly seen in successful books was Anger.

That’s not to say that emotion shouldn’t be included but that it should not overwhelm writing. This includes both the dialogue and the action.

This applied to all genres except Adventure, and even there the positive effect was small compared with the overwhelmingly strong net difference in unsuccessful books.

This ties in with writing tips on avoiding melodrama — show characters’ reactions and details rather than spelling the emotion out:

Remember that the drama doesn’t have to be all the way at eleven in order to affect the reader. Readers get into the little aspects of people’s lives, too.

And writing extreme emotion well by not necessarily expressing it:

Unfortunately, many writers make the mistake of assuming that to be gripping, emotion must be dramatic. Sad people should burst into tears. Joyful characters must express their glee by jumping up and down. This kind of writing results in melodrama, which leads to a sense of disbelief in the reader because, in real life, emotion isn’t always so demonstrative.

And finally there is of course the Robot Devil’s demand that you shouldn’t just have your characters announce how they feel (so avoid naming emotions, or cramming too many into a paragraph).

Looking at the emotional tags in the LIWC results supports this (Penn doesn’t offer emotion as a tag): unsuccessful books overwhelmingly dominate the emotions:

Emotion PoS in the LIWC analysis – negative results are PoS tags more common in unsuccessful books and positive results are for successful books. ‘Affect’ includes emotions and other affective processes, ‘posemo’ is positive emotion and ‘negemo’ negative emotions.
T-test for significance using LIWC results for Tone (both positive and negative emotions). It is significant (p<0.05) for all genres except Historical fiction and Sci-fi. The figures at the top are the P-values — you can find out more on how to interpret boxplots.

Make it readable — but don’t worry too much

Although the Flesch-Kincaid readability difference was significant, with unsuccessful books scoring slightly lower (roughly one school year), I do not think the difference was great enough to be important.

T-test (a standard statistical test for significance) for mean words per sentence, mean syllables per word and Flesch-Kincaid readability (FR) — both mean syllables per word and readability are statistically significant as p<0.05.
Readability by genre. Readability is significant for Adventure, Detective/mystery and Love-story. Note how Sci-fi’s plots are noticeably different to the other genres.

Looking at LIWC tests related to readability, namely the proportion of six-letter or longer words (ie long words) and dictionary words:

Looking at the overall rating, the proportion of six-letter or longer words and mean words per sentence were flagged as significant.

Overall, make your writing readable and avoid too many long words, but don’t worry too much about the specifics.

Avoiding adjectives isn’t the best advice

This very brief University of Pennsylvania study of a few books and their contents suggested that adjectives and adverbs predominate in badly written books while good books have a higher proportion of nouns and verbs.

These two charts suggest at first glance that this is the case.

Adjectives (jj), adverbs (rb), nouns (nn) and verbs (vb) difference in proportion from Penn results — as before, positive results are from successful books and negative results from unsuccessful books

 

LIWC results for adjectives (adj) and adverbs

Yet while Adjectives was the PoS with the greatest relative importance in the Penn PoS test of the original data, this was not repeated in the 2018 data or in the LIWC tests.

Likewise while adjectives and adverbs dominate unsuccessful books (ie negative plots) in most genres, this isn’t always the case. And the difference is small compared to noun dominance — which again has mixed results across the genres.

Finally, I carried out a fresh T-test (a common test to find significance) to find statistical significance for adjectives, adverbs, nouns and verbs overall and adjectives per genre:

T-test with P-value for adjectives, adverbs, nouns and verbs. No P-value is lower than 0.05 so none is statistically significant.
T-test for adjectives per genre (Penn PoS). Again, none is statistically significant.
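
As a minimal sketch of how such a per-genre test can be run in R, assuming a hypothetical data frame penn with genre, status and a column of adjective proportions:

    # Per-genre Welch t-tests for the adjective proportion (sketch, hypothetical names)
    p_by_genre <- sapply(split(penn, penn$genre), function(g) {
      t.test(adjectives ~ status, data = g)$p.value
    })
    p_by_genre < 0.05   # TRUE marks genres with a significant difference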

The above charts do show that successful books have a lower proportion of adjectives and adverbs. However, contrary to the University of Pennsylvania post, successful books also have a lower proportion of nouns and verbs.

One reason is suggested by The Economist’s Johnson column:

How can usage-book writers have failed to notice that good writers use plenty of adverbs? One guess is that they are overlooking many: much, quite, rather and very are common adverbs, but they do not jump out as adverbs in the way that words ending with –ly do. A better piece of advice than “Don’t use adverbs” would be to consider replacing verbs that are combined with the likes of quickly, quietly, excitedly by verbs that include those meanings (race, tiptoe, rush) instead.

And for those advocating verbs instead, he adds:

It is hard to write without verbs. So “use verbs” is not really good advice either, since writers have to use verbs, and trying to add extra ones would not turn out well

Not one of these results is statistically significant.

What this suggests is that use of adjectives may be a symptom of bad writing but it’s not a cause — their overuse is not a reason why a book is unsuccessful. This is the same conclusion reached by the University of Pennsylvania post that analysed books for their adverbs, adjectives, nouns and verbs:

I suspect that the differences in POS distributions are a symptom, not a cause, and that attempts to improve writing by using more verbs and adverbs would generally make things worse. But still.

What this means for writers, then, is: avoid them if you like, but don’t worry too much.

Don’t talk too much

Too much dialogue, as indicated by quotation marks (“), was a sign of an unsuccessful book in all genres except for short stories.

The LIWC results show that quotations marks (‘quote’) are in a higher proportion in unsuccessful books, with the exception of short stories. In poorly performing Adventure books they account for nearly 1% of all tags
T-test for the statistical significance of quotation marks per genre. They are significant in Adventure, Detective/mystery, Fiction and Love stories as p<0.05. Unsuccessful books tend to have a higher proportion than successful ones. Successful sci-fi books have a very large range for the proportion, again showing how this genre has its own rules.

For writers this most likely means that focusing on dialogue at the expense of description is not popular with readers.

Note that the quote mark proportion is a very rough approximation — it doesn’t allow for long paragraphs of dialogue, nor look at books with no quote marks and what their pattern is.
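
As a very rough sketch of how such a quote-mark proportion could be approximated (my actual figures came from the LIWC’s own punctuation counts; the function below is illustrative only):

    # Crude proportion of quotation marks among tokens in a book (sketch)
    quote_proportion <- function(text) {
      chars  <- strsplit(text, "")[[1]]
      quotes <- sum(chars %in% c('"', "\u201C", "\u201D"))   # straight and curly quotes
      words  <- length(strsplit(text, "\\s+")[[1]])
      quotes / (words + quotes)
    }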

Genre shapes the rules — and science-fiction breaks them

One thing that was consistent in all analyses was that analysing by genre showed variation within the overall finding for the category. For readability, for example, it was only significant for the Adventure, Detective/mystery and Love story genres. Likewise in the LIWC tests there was no test category which produced a significant result across all genres.

This makes sense — after all, poetry was included in the analysis, and it’s hard to see a reader applying the same mental rules of what they count as good to a poem as to a science-fiction novel. Even short stories have slightly different writing ‘rules’ — the quotation proportion, for instance, shows that short stories can be more dialogue heavy than full length novels.

What is interesting is how some genres live up to their stereotypes. In the LIWC tests, for example, Clout “refers to the relative social status, confidence, or leadership that people display” and was defined through analysis of speech. Adventure, Detective/mystery and Fiction were the genres where it was significant. In my imagining of the archetypes, detectives and adventure heroes have a certain clout or leadership.

Difference in Clout (scaled and normalised) between more and less successful books. The boxes show the range of the mid 50% of results and the line the median, with successful sci-fi having the largest range.

Similarly, the readability tests were not significant for science-fiction or poetry. I don’t think a poem is judged by its readability, nor, in my own experience, is poor readability or writing style a hindrance to sci-fi.

Readability by genre (scaled and normalised) — see how different science-fiction’s boxplots are and how similar the medians (lines) are for success and failure.

One thought is that if the idea or story is gripping then a science fiction novel has a greater chance of success regardless of style. More importantly, science fiction books often contain long scientific terms or made-up words, which fare badly in readability analysis tools. I know I’ve read enough sci-fi novels with long scientific or pseudo-scientific terms, long sentences and wooden characters, but I persevered because I found the concept interesting.

This coincides with recent research which suggests that readers treat science-fiction as different and ‘read stupidly’ compared with other genres:

Readers of the science fiction story “appear to have expected an overall simpler story to comprehend, an expectation that overrode the actual qualities of the story itself”, so “the science fiction setting triggered poorer overall reading”.

Whether this is the cause or the effect of science-fiction having its own rules in this study is not clear.

Summary of principles: follow good writing practice

What then has this study revealed? First, that one old saw about writing isn’t quite right. While adverbs and adjectives tend to predominate in unsuccessful books, the differences aren’t statistically significant.

This means that unsuccessful books having a higher proportion of adverbs and adjectives (and nouns and verbs) in this set of results doesn’t mean much, and the same is likely true of other studies claiming the same.

Personally I prefer writing with fewer adjectives and adverbs, but they’re not a hindrance. Children’s books in particular have more: Harry Potter and his chums tend to say things “severely”, “furiously”, “loftily” and so on, yet the series is an undoubted hit with readers.

So the next time someone tells you that you must use fewer adverbs and swiftly remove unnecessary adjectives, you can tell them “no, it’s not statistically significant”. And they’ll love you for it (this advice is not statistically significant).

The next outcome is: know your genre. By splitting the books into a wide range of genres (not just fiction but poetry, love stories and science-fiction among others), we saw that the so-called rules varied.

Sci-fi in particular was the exception to many of the findings. The tests I ran cannot say why, but we can speculate. One reason is that the audience for sci-fi is likely to be quite different from that of poetry, love stories and even regular fiction. Hard research on this is hard to find, but sci-fi audiences do seem different from other readers, and this quote from a survey of sci-fi readers suggests they may value world building over other considerations:

The creativity that goes into world building and bringing ‘otherworldly’ characters to life in a way that we can identify with.

It may also be that the theme or subject of the books — something that the Penn and LIWC analyses cannot work out — may be more gripping to readers such that they ignore or overlook what would otherwise be considered weak writing.

Don’t talk too much — readers want more than just dialogue. Unlike films, where heavy exposition is unwelcome, in books it seems readers prefer stories with a good balance of description to dialogue.

Read more than the first chapter to get a true sense of a book — although I can’t yet say how many words give the best approximation.

Finally, don’t be overly emotional, either too positive or too negative, in your writing. This suggests that the old ‘show, don’t tell’ writing saw is true: rather than telling us that someone is angry (and using that word), show their reaction, using nouns and verbs (and yes, adverbs if you must).

Comparison with the original Success with Style findings

These findings contrast with some of the original ones — if you can bear the sidebar of shame, the Daily Mail summed up the original findings in a more readable way than the original paper:

Successful books tended to feature more nouns and adjectives, as well as a disproportionate use of the words ‘and’ and ‘but’ – when compared with less successful titles.

But my tests found that the proportion of adverbs, adjectives, nouns or verbs wasn’t statistically significant.

The most popular books also featured more verbs relating to ‘thought-processing’ such as ‘recognised’ and ‘remembered’.

T-test for significance using LIWC results for ‘cogproc’ (cognitive processes), which includes thought processing. Again the genre varies the results, but it is statistically significant for Detective/mystery, Fiction, Love stories, Poetry and Short stories.

This is statistically significant for most, but not all, genres so is something we agree on.

Verbs that serve the purpose of quotes and reports, for example the word ‘say’ and ‘said’, were heavily featured throughout the bestsellers.

My tests found the exact opposite: it was statistically significant that books heavy on quotation marks did worse in most genres. Now, it may be that writers are told to use only the word ‘said’ for dialogue tags, so bestsellers follow this and use ‘said’ while poorer writers use other terms, which is why successful books have the higher ‘said’ proportion. But a quick search for that schoolboy favourite, “ejaculated” (as in to speak suddenly or sharply), found it in around half of all successful books (99 out of 206 books), so it’s another reason to doubt this finding.
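
For anyone wanting to repeat that kind of quick word search, a sketch in R (the book_files vector of file paths is hypothetical):

    # Count how many books contain a given word at least once (sketch)
    contains_word <- sapply(book_files, function(f) {
      any(grepl("\\bejaculated\\b", readLines(f, warn = FALSE), ignore.case = TRUE))
    })
    sum(contains_word)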

Alternatively, less successful books featured ‘topical, cliché’ words and phrases such as ‘love’, as well as negative and extreme words including ‘breathless’ and ‘risk’.

I didn’t look for specific words, though there are tools to do so if you wish. However, my results did say that overly emotional books do worse, and that does tie in with ‘love’, ‘breathless’ and ‘risk’.

Poor-selling books also favoured the use of ‘explicitly descriptive verbs of actions and emotions’ such as ‘wanted’, ‘took’, ‘promised’ ‘cried’ and ‘cheered’. Books that made explicit reference to body parts also scored poorly.

This was true but the difference was small and not statistically significant for verbs. For emotions though it was significant in most genres (except, of course, sci-fi, that malcontent genre).

So, of the original findings, only two were fully agreed with in this study and one partially.

Final thought…

Throughout all this, at the risk of being melodramatic, be true to yourself and write for yourself. This analysis gives pointers on the signs of a bad book (‘unsuccessful’ in the more diplomatic description), but that doesn’t mean you must slavishly follow these principles.

Write in the way you’re comfortable with and for the reasons you want. Just don’t be overly dramatic about it.

… with one last thing — too many things?

There is a big caveat with this study. I asked my friend, stats professor and consultant Dr Ben Parker (seriously clever with numbers, not bad with puns, and he offers some very reasonably priced but quality stats training courses and consultancy), to look over the approach.

He thinks too many things may have been tested. He’s no doubt right: the statistical tests used should depend on the independent variables and their levels, and there is a chance I used the wrong ones by analysing too much.

Ben was also concerned that the p-value threshold of 0.05 was too high and may need to be 0.01 or lower. This is because if we test 20 variables at that threshold then, by chance alone, we would expect roughly one of them to appear significant – 1/20 is 0.05. I did run the tests separately each time in the code, but there is a chance that I merged and analysed too much per test. This could also be one reason why the original research lacked statistical significance tests.

I did run the tests by testing values separately as well as all together, but I admit that I don’t have years of stats experience under my belt (unlike Ben who knows his stuff) and may have overlooked some things. My code is on GitHub so anyone willing to check is welcome to review and amend. The conclusion is that the results are probably sound but the statistical significance may not be right.
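
One standard way to guard against this multiple-testing problem, and a reasonable check on my results, is to adjust the p-values for the number of tests run. A sketch with illustrative values:

    # Correcting p-values for multiple comparisons (sketch, illustrative values)
    p_values <- c(0.004, 0.012, 0.018, 0.033, 0.049)
    p.adjust(p_values, method = "bonferroni")   # conservative family-wise correction
    p.adjust(p_values, method = "BH")           # Benjamini-Hochberg false discovery rate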

However, even if the significance results are wrong, he suspects it is more than likely that the resulting charts and the differences in positions are still broadly correct, which is why I have left the information as it stands.

Try it for yourself, fork it if you disagree and use it for your own amusement.

All links to data and code are in the final post.


Success with Style part 6: retrospective and links

This part is only for those who want to recreate the experiments themselves. If you want to look in depth at the analyses, read the earlier parts of this series.

Retrospective thoughts

My main thoughts are:

  • I wish I’d double then triple checked the data. I produced the source data a couple of years ago and it had some errors (mainly around column sorts not capturing all columns; thanks, Excel)
  • I should have asked the original authors for their methods. It was an interesting exercise to try to repeat it, but there would have been no harm in asking
  • I should have made R do more of the hard work around producing images and other things I could have automated better
  • I wish I’d learnt about GutenbergR earlier, to download different books, eliminate poetry (not of interest to me) and replace it with another genre

Next time I would:

  • get more books and genres (using GutenbergR; see the sketch below)
  • focus on fewer tests and review the tests I use
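
As a sketch of what that could look like with the gutenbergr package (the subject filter and the number of books are illustrative only):

    # Downloading a fresh set of books with gutenbergr (sketch)
    library(gutenbergr)
    library(dplyr)

    scifi_ids <- gutenberg_subjects %>%
      filter(subject == "Science fiction") %>%
      pull(gutenberg_id)

    books <- gutenberg_download(head(scifi_ids, 10), meta_fields = "title")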

For anyone looking to do their own experiments

Use the LIWC for machine analysis

The LIWC not only gave better machine learning performance, its categories and tags also gave a wider range of significant results than the Penn treebank. Generally I found tone (emotion) the most interesting measure, as it assigns human feelings to the tags rather than just categorising grammatically (or is it linguistically?) what kind of word each is.

Ultimately this study has been about what this means for readers and writers and that’s why the emotion is of most relevance to me.

That, and the fact that the LIWC is still being researched and updated (most recently in 2015, compared with Penn’s from 1996), means it has the potential for a longer shelf life. I also found it easier to use.

The main downside is that it is not free.

Don’t skip the punctuation or action

The most surprising result was how punctuation affected results. I had tried running experiments without it but the machine learning performance decreased by around 5 percentage points while the tag differences did not seem to change.

 

The LIWC results show that quotations marks (‘quote’) are in a higher proportion in unsuccessful books, with the exception of short stories. In poorly performing Adventure books they account for nearly 1% of all tags.

Links

Sponsor

This work required some small funding from Richardson Online Ltd, my consulting company, for work on the R code.


“Success with Style” part 4 — modern data and just a chapter

When starting this analysis I spotted that the download data was for the past 30 days and that this was used for success or fail categorisation. 

Even if the data had covered the lifetime of each book, it has been nearly 5 years since the original downloads were counted. The best way to test this was to get the latest data (albeit still for the past 30 days).

The other thought was that the analyses looked at the entire book. But what if readers did not read the entire book, only a certain amount, before making a judgment? When submitting work to an agent or publisher for consideration, for example, often only the first chapter is requested. Based on this I analysed just the first 3,000 words of each book through the Penn and LIWC taggers and used the 2013 success/fail data to repeat the experiments.
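
A minimal sketch of taking the first 3,000 words of a plain-text Gutenberg file before tagging (the path and helper name are hypothetical, not the study’s code):

    # Return the first n words of a plain-text book (sketch)
    first_n_words <- function(path, n = 3000) {
      words <- scan(path, what = character(), quote = "", quiet = TRUE)
      paste(head(words, n), collapse = " ")
    }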

Finally I noticed a bias towards punctuation as markers for success or failure in the output and ran the experiments without the punctuation tags to see what the result would be.

Starting hypotheses

H0: There's no difference in the tests which produce significant results between the 2013 and 2018 data
HA: There is a difference in the tests which produce significant results between the 2013 and 2018 data

H0: There's no difference in the tests which produce significant results between the full machine analysis of the book and that of just the first 3,000 words
HB: There is a difference in the tests which produce significant results between the full machine analysis of the book and that of just the first 3,000 words

The hypotheses are fairly simple – if there is no difference in the 2018 data then most of the tests that proved significant with the 2013 data should also do so in 2018.

Likewise, if restricting the analysis to the first 3,000 words makes no difference, those tests should be significant in the same places.

3,000 words (3k words) is about 10 pages and is about one chapter’s length although of course there is no hard and fast rule about how long a chapter is.

Data used

Data summary

2018 data download date: 2018-07-22
2013 data download date: 2013-10-23
Unique books used: 758

Difference in 2013 and 2018 success rates

Status / genre         Count
FAILURE                22
  Adventure             5
  Detective/mystery     3
  Fiction               2
  Historical-fiction    1
  Love-story            1
  Poetry                8
  Short-stories         2
SUCCESS                20
  Adventure             3
  Detective/mystery     4
  Fiction               1
  Historical-fiction    4
  Love-story            3
  Sci-fi                5
Grand total            42

There were 758 unique books (the remaining 42 of the 800 listed appear in multiple categories). With 42 books changing status, that is 5.5% of the total books used; none of the books with a different success status was listed in multiple categories.

The new data was parsed through both the Perl Lingua tagger (which uses the Penn treebank) together with the Perl readability measure, and through the LIWC tagger.

Results for 2013, 2018 and 3,000 word data

Machine learning performance

The most important measure for me is which is the best for making predictions. 

Using all tags including punctuation

Analysis           Accuracy   95% Confidence Interval   Sensitivity   Specificity
Readability 2013   65.62%     57.7-72.9%                69%           63%
Readability 2018   65.00%     57.5-72.8%                68%           63%
Readability 3k     55.62%     47.6-63.5%                68%           44%
LIWC 2013          75.00%     67.6-81.5%                76%           74%
LIWC 2018          71.70%     64.0-78.6%                78%           66%
LIWC 3k            56.25%     48.2-64.0%                53%           60%

According to this, the LIWC is still the better tagger, and the 2013 and 2018 data are fairly similar for both readability and LIWC, with each result falling within the other’s 95% confidence interval.

Both for readability and LIWC the first 3,000 words (3k) are much worse predictors of overall success and barely better than a 50/50 guess.

Difference in significance in key measures

Punctuation

Overall, omitting punctuation made little difference to which tags were significant in the LIWC or Penn analyses. However, the machine learning performances all dropped by around 5 percentage points without it.

Readability

Genre                Significant 2013   Significant 2018   Significant 3k words
Adventure            TRUE               TRUE               TRUE
Detective/mystery    TRUE               TRUE               TRUE
Fiction              FALSE              FALSE              FALSE
Historical-fiction   FALSE              FALSE              FALSE
Love-story           TRUE               TRUE               TRUE
Poetry               FALSE              FALSE              FALSE
Sci-fi               FALSE              FALSE              FALSE
Short-stories        FALSE              FALSE              FALSE

Readability significance fell in the same genres across all three datasets.

LIWC categories

Test                      Genre                Significant 2013   Significant 2018   Significant 3k words
Clout                     Adventure            TRUE               FALSE              TRUE
                          Detective-mystery    TRUE               TRUE               FALSE
                          Fiction              TRUE               TRUE               FALSE
                          Historical-fiction   FALSE              FALSE              FALSE
                          Love-story           FALSE              FALSE              FALSE
                          Poetry               FALSE              FALSE              FALSE
                          Sci-fi               FALSE              FALSE              FALSE
                          Short-stories        FALSE              FALSE              FALSE
Authenticity              Adventure            FALSE              FALSE              FALSE
                          Detective-mystery    FALSE              FALSE              FALSE
                          Fiction              TRUE               TRUE               FALSE
                          Historical-fiction   FALSE              FALSE              TRUE
                          Love-story           FALSE              FALSE              FALSE
                          Poetry               TRUE               TRUE               FALSE
                          Sci-fi               FALSE              FALSE              FALSE
                          Short-stories        FALSE              FALSE              FALSE
Analytical                Adventure            FALSE              FALSE              FALSE
                          Detective-mystery    FALSE              FALSE              FALSE
                          Fiction              TRUE               TRUE               TRUE
                          Historical-fiction   FALSE              FALSE              FALSE
                          Love-story           FALSE              FALSE              TRUE
                          Poetry               FALSE              FALSE              FALSE
                          Sci-fi               FALSE              FALSE              FALSE
                          Short-stories        FALSE              FALSE              FALSE
6 letter words            Adventure            TRUE               TRUE               TRUE
                          Detective-mystery    FALSE              FALSE              FALSE
                          Fiction              FALSE              FALSE              FALSE
                          Historical-fiction   FALSE              FALSE              FALSE
                          Love-story           TRUE               TRUE               TRUE
                          Poetry               FALSE              FALSE              FALSE
                          Sci-fi               FALSE              FALSE              FALSE
                          Short-stories        FALSE              FALSE              FALSE
Dictionary words          Adventure            FALSE              FALSE              FALSE
                          Detective-mystery    FALSE              TRUE               TRUE
                          Fiction              TRUE               TRUE               FALSE
                          Historical-fiction   FALSE              FALSE              TRUE
                          Love-story           FALSE              FALSE              TRUE
                          Poetry               FALSE              FALSE              FALSE
                          Sci-fi               TRUE               TRUE               TRUE
                          Short-stories        FALSE              FALSE              FALSE
Tone                      Adventure            FALSE              FALSE              FALSE
                          Detective-mystery    TRUE               TRUE               TRUE
                          Fiction              TRUE               TRUE               TRUE
                          Historical-fiction   FALSE              FALSE              FALSE
                          Love-story           TRUE               TRUE               FALSE
                          Poetry               TRUE               TRUE               TRUE
                          Sci-fi               FALSE              FALSE              FALSE
                          Short-stories        TRUE               TRUE               TRUE
Mean words per sentence   Adventure            TRUE               TRUE               TRUE
                          Detective-mystery    FALSE              FALSE              FALSE
                          Fiction               TRUE               TRUE               FALSE
                          Historical-fiction   FALSE              FALSE              FALSE
                          Love-story           FALSE              FALSE              FALSE
                          Poetry               FALSE              FALSE              FALSE
                          Sci-fi               FALSE              FALSE              FALSE
                          Short-stories        FALSE              FALSE              TRUE

Whereas readability was consistent across the different approaches, the LIWC categories show a lot more variety.

Tone has the most success across this. As before the 2013 and 2018 data tend to match (but not always, as with Clout or Dictionary words) and 3,000 words, well, it does its own thing.

Tone was the most consistent measure throughout and, as last time, had the most significant genres, even with the 3k data.

Parts of speech tags (PoS) with the largest difference

The tables list the top 3 PoS that dominate in successful and unsuccessful books.

Penn data

Successful PoS 2013               Successful PoS 2018               Successful PoS 3k
INN – Preposition / Conjunction   INN – Preposition / Conjunction   INN – Preposition / Conjunction
DET – Determiner                  DET – Determiner                  DET – Determiner
NNS – Noun, plural                NNS – Noun, plural                NNS – Noun, plural

Unsuccessful PoS 2013                 Unsuccessful PoS 2018                 Unsuccessful PoS 3k
PRP – Determiner, possessive second   PRP – Determiner, possessive second   RB – Adverb
RB – Adverb                           VB – Verb, infinitive                 PRP – Determiner, possessive second
VB – Verb, infinitive                 RB – Adverb                           VB – Verb, infinitive

LIWC data

Successful PoS 2013                 Successful PoS 2018                 Successful PoS 3k
functional – Total function words   functional – Total function words   functional – Total function words
prep – Prepositions                 prep – Prepositions                 prep – Prepositions
article – Articles                  space – Space                       article – Articles

Unsuccessful PoS 2013          Unsuccessful PoS 2018          Unsuccessful PoS 3k
quote – Quotation marks        allpunc – All punctuation      adj – Common adjectives
allpunc – All punctuation      affect – Affective processes   adverb – Common adverbs
affect – Affective processes   posemo – Positive emotion      affect – Affective processes

The same tags dominate successful books in the Penn treebank across all three datasets – prepositions (for, of, although, that), determiners (this, each, some) and plural nouns (women, books).

For unsuccessful books it is again determiners that dominate, but in the possessive second person (mine, yours), along with adverbs (often, not, very, here) and infinitive verbs (take, live).

For LIWC it is quite similar. In successful books, function words (it, to, no, very), prepositions (to, with, above) and articles (a, an, the) dominate.

For unsuccessful books it is punctuation as a whole, quotation marks, social processes (mate, talk, they, which includes all family references) and affective processes (happy, cried), which include all emotional terms.

A high proportion of quotation marks suggests a high ratio of dialogue to action/description.

What does this tell us?

2013 v 2018 data

Overall there is more similarity than difference between the 2013 and 2018 Penn and readability results. The machine learning performance was also broadly the same, with each overall performance falling within the other’s 95% confidence interval.

The most successful PoS were also largely the same, as were the top 3 unsuccessful ones.

Likewise the LIWC categories generally matched in significance for both 2013 and 2018 data. The Successful PoS were broadly the same, as were the unsuccessful ones.

This suggests that while the original authors didn’t mention that the data only covered the previous 30 days, their results have largely held up.

The first chapter

Just judging a book by its first 3,000 words was not as accurate as analysing the whole book. The machine learning performance was barely better than a guess. 

However, the readability significance did match, and the dominant successful PoS were similar to those from the full-book 2013 and 2018 analyses.

Of all the LIWC categories described in part 3, Tone was both the most significant predictor across genres and the most consistent across the different tests.

Summary

The 2018 results generally match the 2013 results, which suggests the original method still holds as a good predictor of success or failure for those books.

The first 3,000 words’ results did not match the 2013 or 2018 data, and as their machine learning performance was the weakest, this suggests it is not an accurate way to predict a book’s success. It may be that there is a ‘sweet spot’ where the first x words correlate closely with the overall rating, but it is more than 3,000 words.

Successful books tend to use prepositions, determiners, nouns and function words. Unsuccessful ones skew towards quotation marks, punctuation and positive emotions (which in the LIWC overlap with affective processes).

This suggests that unsuccessful books may use shorter sentences (high punctuation rate), more dialogue (high quotation mark rate) and more adverbs, and are more emotional, particularly in positive emotions. Writers are frequently told by writing experts to avoid adverbs wherever possible.

Successful books by contrast tend to focus on the action – describing scenes and situations, hence the dominance of functional words, prepositions and articles. This makes them sound rather boring, but suggests that these bread and butter words are necessary to build a good story.

The LIWC data suggests that tone is the most reliable predictor of success. What isn’t answered is whether that is because it predominates in successful or in unsuccessful books, and whether it is positive or negative emotion. This is something to explore, though emotion and affect appearing in the top 3 tags of unsuccessful books suggests where the answer lies.

Including punctuation tags had some use: machine learning performance was better with them. So even though the punctuation tags can be hard to interpret, it is worth including them in any machine analysis, though more work is needed to interpret them.


“Success with Style” part 3: using LIWC data

Last time we replicated the original Success with Style output and methods, despite the method not being listed, and managed to get the data to broadly match. Great, but now we are going to look at a different way of analysing the same text.

In part 2 we used the Penn treebank to analyse the text and its parts of speech (PoS). This time we’re using LIWC, a tool developed at the University of Texas. It has similarities to the Penn treebank in that it categorises words and has similar categories, such as prepositions.

In part 1 we looked at the original experiment and recreated it in part 2. This time we’ll use the same input data but process it through a different NLP analysis program — the LIWC.

Hypotheses

H0: There's no difference in the proportion of LIWC categories in successful and unsuccessful books, regardless of genre
HA: There is a difference in the proportion of LIWC categories in successful and unsuccessful books, and the pattern will depend on genre

H0: There's no difference in the LIWC summary values of successful and unsuccessful books, regardless of the book's genre
HB: There is a difference in the LIWC summary values of successful and unsuccessful books, and the pattern will depend on genre

 

Success with Style LIWC

Method

The data, the measure of success and the method were the same as in part 1, along with the adjusted p-value (p<0.05 for significance) and the machine learning algorithm. Likewise, variables with many zeroes were not transformed.

Difference in success

The R code produced a different set of tags from the original analysis. You can find the LIWC definitions at the foot of this page.

Tags per genre

LIWC Difference in proportion function-article – original data

Overall biggest difference

PoS (successful books) Definition Diff (largest difference first) PoS (Unsuccessful books) Definition Diff (largest difference first)
functional Total function words 0.003835 quote Quotation marks -0.001814
prep Prepositions 0.001758 allpunc All Punctuation* ​ -0.001350
article Articles 0.001199 affect Affective processes -0.001231
ipron Impersonal pronouns 0.001198 social Social processes -0.001181
space Space 0.001155 posemo Positive emotion -0.001103
relativ Relativity 0.000860 ppron Personal pronouns -0.001047
number Numbers 0.000623 apostro Apostrophes -0.000999
focuspast Past focus 0.000463 female Female references -0.000963
power Power 0.000454 focuspresent Present focus -0.000929
cogproc Cognitive processes 0.000437 shehe 3rd pers singular -0.000905
period Periods/fullstop 0.000403 verb Common verbs -0.000642
comma Commas 0.000379 informal Informal language -0.000361
differ Differentiation 0.000369 exclam Exclamation marks -0.000323
otherp Other punctuation 0.000318 time Time -0.000319
parenth Parentheses (pairs) 0.000266 you 2nd person -0.000273
conj Conjunctions 0.000266 percept Perceptual processes -0.000236
quant Quantifiers 0.000257 affiliation Affiliation -0.000216
semic Semicolons 0.000254 focusfuture Future focus -0.000213
interrog Interrogatives 0.000233 sad Sadness -0.000202
colon Colons 0.000225 adj Common adjectives -0.000190
work Work 0.000197 family Family -0.000190
drives Drives 0.000163 nonflu Nonfluencies -0.000156
pronoun Total pronouns 0.000154 netspeak Netspeak -0.000154
cause Causation 0.000136 discrep Discrepancy -0.000140
anger Anger 0.000131 see See -0.000133
we 1st pers plural 0.000130 bio Biological processes -0.000130
certain Certainty 0.000125 i 1st pers singular -0.000121
compare Comparisons 0.000125 negemo Negative emotion -0.000111
they 3rd pers plural 0.000122 body Body -0.000104
death Death 0.000101 reward Reward -0.000098
tentat Tentative 0.000078 friend Friends -0.000088
ingest Ingestion 0.000060 risk Risk -0.000080
home Home 0.000055 negate Negations -0.000073
achieve Achievement 0.000038 auxverb Auxiliary verbs -0.000070
money Money 0.000016 motion Motion -0.000069
health Health 0.000011 insight Insight -0.000067
adverb Common adverbs 0.000011 hear Hear -0.000056
leisure Leisure 0.000003 feel Feel -0.000049
swear Swear words 0.000002 assent Assent -0.000046
male Male references -0.000045
qmark Question marks -0.000035
sexual Sexual -0.000028
anx Anxiety -0.000025
dash Dashes -0.000025
relig Religion -0.000010
filler Fillers -0.000008

A positive (negative) value means that the mean PoS proportion is higher in the more (less) successful books
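
A sketch of how such a difference table can be produced, assuming a hypothetical data frame liwc with one row per book, a status column and one column per tag proportion:

    # Mean difference in tag proportion between successful and unsuccessful books (sketch)
    tag_cols <- setdiff(names(liwc), c("title", "genre", "status"))
    diffs <- sapply(tag_cols, function(tag) {
      mean(liwc[liwc$status == "SUCCESS", tag]) - mean(liwc[liwc$status == "FAILURE", tag])
    })
    sort(diffs, decreasing = TRUE)   # positive values are higher in successful books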

Unpaired t-tests

Showing results of PoS tags that have significant adjusted P-values.

PoS Definition adjusted P-value
analytic Analytical thinking 0.017
tone Emotional tone 0
mWoSen Mean Words per Sentence 0
sixletter Six letter words 0
ppron Personal pronouns 0.005
ipron Impersonal pronouns 0
article Articles 0.005
prep Prepositions 0
adj Common adjectives 0.005
number Numbers 0
affect Affective processes 0
posemo Positive emotion 0
negemo Negative emotion 0.045
sad Sadness 0.009
social Social processes 0.044
family Family 0.041
friend Friends 0
female Female references 0.026
feel Feel 0.041
bio Biological processes 0.044
affiliation Affiliation 0.017
power Power 0.017
risk Risk 0.017
focuspresent Present focus 0.02
focusfuture Future focus 0
space Space 0.009
time Time 0
informal Informal language 0
nonflu Nonfluencies 0
colon Colons 0.028
exclam Exclamation marks 0
quote Quotation marks 0.005
apostro Apostrophes 0.017

33 out of 93 tags (including punctuation) of the transformed PoS were significantly different between successful and unsuccessful books. This means that we can reject the null hypothesis (hypothesis 1), since the proportion of more than one PoS was significantly different between more and less successful books.

Difference in LIWC summary variables

The LIWC has its own summary variables. Some of them are proprietary, so how they’re calculated is not clear, but they rely on the PoS tags. For example, ‘tone’ is overall emotion (both the positive and negative emotion tags). Like the tags, they are expressed as a proportion of the text (ie 0.85 means 85%), apart from mean words per sentence.

Variables Definition
Analytical thinking (Analytic) People low in analytical thinking tend to write and think using language that is more narrative, focusing on the here-and-now and personal experiences. Those high in analytical thinking perform better in college and have higher college board scores.
Clout Clout refers to the relative social status, confidence, or leadership that people display through their writing or talking. The algorithm was developed based on the results from a series of studies where people were interacting with one another.
Authenticity When people reveal themselves in an authentic or honest way, they are more personal, humble, and vulnerable.
Emotional tone (Tone) Although LIWC2015 includes both positive emotion and negative emotion dimensions, the Tone variable puts the two dimensions into a single summary variable. Numbers below 50 suggest a more negative emotional tone.
Measure Successful Unsuccessful P value Significant (p<0.05)?
Six letter words 0.1633 0.1552 0.0004 TRUE
Mean words per sentence 18.3832 17.0184 0.0007 TRUE
Dictionary words 0.8388 0.8410 0.6000 FALSE
Authentic 0.2240 0.2181 0.3900 FALSE
Analytic 0.7240 0.6939 0.0032 TRUE
Clout 0.7417 0.7499 0.3800 FALSE
Tone 0.3892 0.4486 0.0010 TRUE

Results show that the mean words per sentence was significantly different in successful books and comparable to the figures in the original test. Likewise, the proportion of words of six letters or more was significantly different in successful books. Tone, however, is lower in successful books (ie they use fewer emotional words, whether positive or negative).

Looking further at these categories by genre:

Difference in analytical words (scaled and normalized) between more and less successful books
Difference in authenticity (scaled and normalized) between more and less successful books
Difference in clout (scaled and normalized) between more and less successful books
Difference in Dictionary Words (scaled and normalized) between more and less successful books
Difference in mean words per sentence (scaled and normalized) between more and less successful books
Difference in proportion of 6 letter words (scaled and normalized) between more and less successful books
Difference in tone (scaled and normalized) between more and less successful books

Most important variables

PoS Definition Overall relative importance
ipron Impersonal pronouns 100.00
quote Quotation marks 86.40
otherp Other punctuation 69.99
posemo Positive emotion 68.88
time Time 67.30
space Space 64.90
parenth Parentheses (pairs) 58.40
you 2nd person 56.80
adj Common adjectives 46.73
risk Risk 41.25
sixletter Six letter words 40.70
semic Semicolons 38.60
power Power 35.29
netspeak Netspeak 31.52
number Numbers 30.08
swear Swear words 28.03
period Periods/fullstop 27.75
filler Fillers 25.91
certain Certainty 25.69
death Death 25.56
mWoSen Mean words per sentence 25.03
ppron Personal pronouns 22.95
colon Colons 20.12
focuspast Past focus 19.99
body Body 18.78
tone Emotional tone 18.57
leisure Leisure 17.86
focusfuture Future focus 16.08
home Home 14.88
exclam Exclamation marks 13.08
achieve Achievement 11.90
dicWo Dictionary words 11.72
apostro Apostrophes 9.99
work Work 9.22
ingest Ingestion 7.70
health Health 6.83
relig Religion 5.91
qmark Question marks 3.93
interrog Interrogatives 2.72
hear Hear 1.48

Machine learning performance

Accuracy 95% CI Sensitivity Specificity
75.00% 67.6%-81.5% 76% 74%
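
The exact model and tuning aren’t documented here, so the following is only a sketch of how accuracy, the confidence interval, sensitivity, specificity and the relative-importance tables above can be produced with the caret package; the liwc data frame and the random-forest choice are assumptions, not the study’s actual settings.

    # Train/test split, classification and performance summary with caret (sketch)
    library(caret)

    set.seed(42)
    idx   <- createDataPartition(liwc$status, p = 0.8, list = FALSE)
    train <- liwc[idx, ]
    test  <- liwc[-idx, ]

    fit  <- train(status ~ ., data = train, method = "rf")   # status must be a factor
    pred <- predict(fit, newdata = test)

    confusionMatrix(pred, test$status)   # accuracy, 95% CI, sensitivity, specificity
    varImp(fit)                          # relative importance scaled to 100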

Conclusion

  • The mean proportions of 33 PoS tags were significantly different between more successful and less successful books (reject null hypothesis 1)
  • Six letter word proportion, mean words per sentence, analytical words and tone were significantly different between more and less successful books (reject null hypothesis 2). Between these categories all genres except historical fiction had a significant difference, with tone (ie both positive and negative emotion use) being significant for 5 out of the 8 genres. No category in the Penn treebank analysis had this many significant genres.
  • Six letter words, Mean words per sentence, Dictionary words, Authentic, Analytic, Clout, and Tone can be used to predict the status of the book with an accuracy reaching 75%. This is superior to the readability, mean words per sentence and mean syllables per word score of 65%. 

Overall, the LIWC analysis performed better than the readability and Penn treebank analyses.

LIWC definitions

These are taken from the LIWC manual.

Abbreviation Category Examples
WC Word count ­
Summary Language Variables
Analytic Analytical thinking ­
Clout Clout ­
Authentic Authentic ­
Tone Emotional tone ­
WPS Words/sentence ­
Sixltr Words > 6 letters ­
Dic Dictionary words ­
Linguistic Dimensions
funct Total function words it, to, no, very
pronoun Total pronouns I, them, itself
ppron Personal pronouns I, them, her
i 1st pers singular I, me, mine
we 1st pers plural we, us, our
you 2nd person you, your, thou
shehe 3rd pers singular she, her, him
they 3rd pers plural they, their, they’d
ipron Impersonal pronouns it, it’s, those
article Articles a, an, the
prep Prepositions to, with, above
auxverb Auxiliary verbs am, will, have
adverb Common Adverbs very, really
conj Conjunctions and, but, whereas
negate Negations no, not, never
Other Grammar
verb Common verbs eat, come, carry
adj Common adjectives free, happy, long
compare Comparisons greater, best, after
interrog Interrogatives how, when, what
number Numbers second, thousand
quant Quantifiers few, many, much
Psychological Processes
affect Affective processes happy, cried
posemo Positive emotion love, nice, sweet
negemo Negative emotion hurt, ugly, nasty
anx Anxiety worried, fearful
anger Anger hate, kill, annoyed
sad Sadness crying, grief, sad
social Social processes mate, talk, they
family Family daughter, dad, aunt
friend Friends buddy, neighbor
female Female references girl, her, mom
male Male references boy, his, dad
cogproc Cognitive processes cause, know, ought
insight Insight think, know
cause Causation because, effect
discrep Discrepancy should, would
tentat Tentative maybe, perhaps
certain Certainty always, never
differ Differentiation hasn’t, but, else
percept Perceptual processes look, heard, feeling
see See view, saw, seen
hear Hear listen, hearing
feel Feel feels, touch
bio Biological processes eat, blood, pain
body Body cheek, hands, spit
health Health clinic, flu, pill
sexual Sexual horny, love, incest
ingest Ingestion dish, eat, pizza
drives Drives
affiliation Affiliation ally, friend, social
achieve Achievement win, success, better
power Power superior, bully
reward Reward take, prize, benefit
risk Risk danger, doubt
TimeOrient Time orientations
focuspast Past focus ago, did, talked
focuspresent Present focus today, is, now
focusfuture Future focus may, will, soon
relativ Relativity area, bend, exit
motion Motion arrive, car, go
space Space down, in, thin
time Time end, until, season
Personal concerns
work Work job, majors, xerox
leisure Leisure cook, chat, movie
home Home kitchen, landlord
money Money audit, cash, owe
relig Religion altar, church
death Death bury, coffin, kill
informal Informal language
swear Swear words fuck, damn, shit
netspeak Netspeak btw, lol, thx
assent Assent agree, OK, yes
nonflu Nonfluencies er, hm, umm
filler Fillers Imean, youknow
allpunc All Punctuation* ​
period Periods/fullstop .
comma Commas ,
colon Colons :
semic Semicolons ;
qmark Question marks ?
exclam Exclamation marks !
dash Dashes
quote Quotation marks
apostro Apostrophes
parenth Parentheses (pairs) ()
otherp Other punctuation

“Success with Style” part 2: recreating the original experiment

How did the team behind Success with Style develop their tests, which they claimed were statistically significant?

In part 1 we looked at the original paper and noted the lack of a hypothesis so I proposed one:

H0: There's no difference in the distribution of the proportion of PoS tags in successful and unsuccessful books, regardless of the book's genre.
HA: There is a difference in the distribution of the proportion of PoS tags in successful and unsuccessful books, and the pattern will depend on a book's genre.

I also added another:

H0: There's no difference in the Flesch-Kincaid readability of successful and unsuccessful books, regardless of the book's genre.
HA: There is a difference in the Flesch-Kincaid readability of successful and unsuccessful books, and the pattern will depend on a book's genre.

Note: since publishing I have updated some tables after noticing errors in the original data; I was caught out by Excel not always reordering all columns when sorting.

Hypotheses and data used

The original team used both the Fog and Flesch-Kincaid reading grade levels, but to save duplicating work I only used Flesch-Kincaid. My source data includes the Fog rating if you want it — my experience has been that Flesch-Kincaid gives more accurate results. The Flesch-Kincaid readability used here is the US school grade level, where the lower the value the easier the text is judged to read.

I’ll publish all data and code in the final part of this review, so you can run it yourself. I also capped unreliable words-per-sentence data: average words per sentence was capped at 50 (only 4 books were affected).
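
A one-line sketch of that cap, assuming a hypothetical mean_words_per_sentence column:

    # Cap implausible words-per-sentence values at 50 (sketch)
    books$mean_words_per_sentence <- pmin(books$mean_words_per_sentence, 50)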

The original team gathered a range of books and classed them by genre and success/failure based on the number of downloads over the 60 days prior to collection. We’ll use the same.

They had an equal number of books per genre and equal totals of failures and successes (758 unique books, with 42 appearing in multiple genres, giving 800 listed books: 400 failures and 400 successes).

Statistical tests

For these tests I’m greatly indebted to the users of Stack Overflow and to Ahmed Kamel. While I had the original ideas, it was he who got them into a working R script, and the analysis relies heavily on his work. I’d highly recommend Ahmed if you want help with your own statistical tests.

Statistical analysis was performed using RStudio v1.1.149. I’ve put a more detailed methodology at the end of this page. Significance uses p ≤ 0.05.

Difference in success

The R code managed to reproduce the original figures and I’ve displayed their tables and graphs as appropriate.

Tag difference per genre

Difference in proportion: cc-ls
Difference in proportion: md-rbs
Difference in proportion: sym-wdt
Difference in proportion: wp-lrb

Overall biggest difference

The data is side-by-side here, with the first two columns being the successful books and the last two the unsuccessful ones.

PoS (Successful books) Difference PoS (Unsuccessful books) Difference
INN – Preposition / Conjunction 0.005560 PRP – Determiner, possessive second -0.004326
DET – Determiner 0.003114 RB – Adverb -0.003033
NNS – Noun, plural 0.002730 VB – Verb, infinitive -0.002690
NN – Noun 0.001540 VBD – Verb, past tense -0.002665
CC – Conjunction, coordinating 0.001399 VBP – Verb, base present form -0.001630
CD – Adjective, cardinal number 0.001309 MD – Verb, modal -0.001306
WDT – Determiner, question 0.001050 FW – Foreign words -0.001169
WP – Pronoun, question 0.000558 POS – Possessive -0.000890
VBN – Verb, past/passive participle 0.000525 VBZ – Verb, present 3SG -s form -0.000392
PRPS – Determiner, possessive 0.000444 WRB – Adverb, question -0.000385
VBG – Verb, gerund 0.000259 UH – Interjection -0.000205
SYM – Symbol 0.000197 NNP – Noun, proper -0.000181
JJS – Adjective, superlative 0.000170 TO – Preposition -0.000107
JJ – Adjective 0.000083 EX – Pronoun, existential there -0.000063
WPS – Determiner, possessive & question 0.000045
JJR – Adjective, comparative 0.000041
RBR – Adverb, comparative 0.000013
RBS – Adverb, superlative 0.000003
LS – Symbol, list item 0.000002

 

A positive value means that the mean PoS proportion is higher in the more successful books, while a negative value means its proportion is higher in the less successful books.

Unpaired t-tests

For those not aware of significance, the P-value is used to determine whether a result is significant or just happened by chance. Statisticians may point out that probability is chance, but for a basic overview you can find out more here.

PoS P-value adjusted P-value
CD – Adjective, cardinal number 0 0
DET – Determiner 0 0
INN – Preposition / Conjunction 0 0
JJS – Adjective, superlative 0.012 0.039
MD – Verb, modal 0.004 0.015
POS – Possessive 0 0
PRPS – Determiner, possessive 0.022 0.057
VB – Verb, infinitive 0.018 0.052
WDT – Determiner, question 0 0
WP – Pronoun, question 0.033 0.078
WRB – Adverb, question 0.001 0.004

12 out of the 41 transformed PoS tags were significantly different between successful and unsuccessful books. This means we can reject the null hypothesis of hypothesis 1, since the proportion of more than one PoS was significantly different between more and less successful books.
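
To illustrate the kind of test behind this table (a minimal sketch, not the actual script, which relies on Ahmed’s work and will be published in the final part), assume a data frame books with a status column and one column per transformed PoS proportion:

  # Unpaired (Welch) t-test of each PoS proportion by success/failure,
  # with a false-discovery-rate adjustment for multiple comparisons
  pos_cols <- setdiff(names(books), "status")

  p_values <- sapply(pos_cols, function(tag) {
    t.test(books[[tag]] ~ books$status)$p.value
  })

  adjusted <- p.adjust(p_values, method = "fdr")

  data.frame(PoS = pos_cols, p = round(p_values, 3), p_adj = round(adjusted, 3))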

Difference in Flesch-Kincaid readability, mean words per sentence, and mean syllables per word between successful and unsuccessful books

Measure Successful Unsuccessful P value
Mean words per sentence 17.8 17 0.25
Mean syllables per word 1.45 1.43 0.005
Flesch-Kincaid readability 8.46 7.98 0.028

Results show that the mean readability was significantly higher in successful books compared to unsuccessful books. The same is true for the mean syllables per word, which was significantly higher in successful books.

The mean words per sentence was not significantly different between more and less successful books.

Looking further at readability by genre

genre FAILURE mean FAILURE SD SUCCESS mean SUCCESS SD P value Significant?
Adventure 7.54 1.83 9.76 3.86 0.0002 TRUE
Detective/mystery 6.82 1.40 7.56 2.03 0.0116 TRUE
Fiction 7.92 2.27 8.07 1.87 0.3852 FALSE
Historical-fiction 8.55 1.83 9.40 3.00 0.1247 FALSE
Love-story 7.61 1.57 8.83 3.32 0.0360 TRUE
Poetry 11.27 10.24 9.71 2.66 0.8450 FALSE
Sci-fi 6.33 1.52 6.43 1.38 0.5896 FALSE
Short-stories 8.99 2.74 7.90 2.02 0.0614 FALSE

Results show that there is a statistically significant difference in the mean readability between successful and unsuccessful books for the following genres: adventure, detective/mystery and love stories. The mean readability was significantly higher (ie, harder to read) for more successful books in those genres.
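
A sketch of how a per-genre comparison like this can be run (column names such as genre, status and fk_grade are illustrative, not my actual script):

  # Compare Flesch-Kincaid grade between successful and unsuccessful books,
  # one unpaired t-test per genre
  by_genre <- lapply(split(books, books$genre), function(g) {
    test <- t.test(fk_grade ~ status, data = g)
    data.frame(genre        = unique(g$genre),
               failure_mean = mean(g$fk_grade[g$status == "failure"]),
               success_mean = mean(g$fk_grade[g$status == "success"]),
               p_value      = test$p.value,
               significant  = test$p.value <= 0.05)
  })
  do.call(rbind, by_genre)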

Most important variables

Definition Overall relative importance
JJ – Adjective 100.000
UH – Interjection 86.810
PRPS – Determiner, possessive 69.049
TO – Preposition 67.866
INN – Preposition / Conjunction 67.570
WP – Pronoun, question 64.431
MD – Verb, modal 60.935
RBS – Adverb, superlative 59.996
WDT – Determiner, question 59.635
PRP – Determiner, possessive second 55.813
CD – Adjective, cardinal number 54.306
NN – Noun 48.380
EX – Pronoun, existential there 42.474
SYM – Symbol 40.823
Mean syllables per word 36.230
JJS – Adjective, superlative 35.699
NNP – Noun, proper 35.674
CC – Conjunction, coordinating 33.137
VBP – Verb, base present form 32.791
VBG – Verb, gerund 29.862
VBN – Verb, past/passive participle 29.826
POS – Possessive 28.903
WRB – Adverb, question 18.980
Flesch-Kincaid readability 18.371
VB – Verb, infinitive 14.735
NNS – Noun, plural 13.609
FW – Foreign words 13.562
DET – Determiner 3.757
LS – Symbol, list item 1.202

 

This shows that the most important tag in determining success or failure is the adjective. Relative importance doesn’t say whether adjectives push a book towards success or failure, only that the adjective carries a lot of weight in the model.

Machine learning performance

Accuracy 95% CI Sensitivity Specificity
65.62% 57.7-72.9% 69% 63%

Overall accuracy is 65.6%. The sensitivity is the true positive rate and the specificity is the true negative rate (ie, after allowing for false positives and false negatives). Note that for all the other tests I ignored punctuation tags but included them for the machine learning, as it improved performance. I left them out elsewhere because knowing whether a right-hand bracket was important did not seem to tell me anything.
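
In terms of the confusion matrix counts, and taking ‘success’ as the positive class, sensitivity = TP / (TP + FN) and specificity = TN / (TN + FP). So a sensitivity of 69% means 69% of the genuinely successful books were predicted as successes, and a specificity of 63% means 63% of the unsuccessful books were correctly predicted as failures.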

Conclusion

The mean of 12 PoS tags was significantly different between more successful and less successful books. We also saw the PoS pattern was largely dependent on the genre of the book.

This means we can reject the null hypothesis and say that there is a difference in the distribution of the proportion of PoS tags in successful and unsuccessful books, and the pattern will depend on a book’s genre.

Not only that, but the Flesch-Kincaid readability and mean syllables per word were significantly different between more and less successful books. This was most evident in adventure, detective/mystery and love stories, where the mean readability was significantly higher (ie, harder to read) in more successful books.

This means we can say that there is a difference in the Flesch-Kincaid readability of successful and unsuccessful books, and the pattern will depend on a book’s genre.

Overall, the Flesch-Kincaid readability, mean words per sentence and PoS can be used to predict the status of the book with an accuracy reaching 65.6%. This is comparable to the original experiment, which reported an overall accuracy of 64.5%.

But what happens when we try it with a different PoS tool that analyses text in a different way? Next time I’ll use LIWC data.

Method

I used R to perform the analysis. When running:

  • statistical analysis was performed using RStudio v 1.1.453.
  • the data set was split into a training set (80%) and a test set (20%). Analysis was performed on the training set, except when comparing readability across genres, where the whole data set was used due to the small sample size in each genre (a minimal sketch of the split follows below).
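
A minimal sketch of that split with the caret package (the books data frame and seed are illustrative; my actual code will be in the final part):

  library(caret)

  set.seed(42)                       # illustrative seed only
  train_idx <- createDataPartition(books$status, p = 0.8, list = FALSE)
  training  <- books[train_idx, ]    # 80%: used for the statistical tests and model tuning
  testing   <- books[-train_idx, ]   # 20%: held back to assess the final model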

The average difference in various parts of speech (PoS, the linguistic tags assigned to words) was calculated between successful and unsuccessful books. I used what I think were the original methods used by the team to calculate these differences.

Detailed methodology

I laid out the broad outline above; normally this would come first in a research paper, but it’s not the most engaging part. For those of you who are interested, this is the stats nitty-gritty, and the same approach is used in the other experiments.

Univariate statistical analysis

Variables were inspected for normality. Appropriate transformations (such as log, Box-Cox or Yeo-Johnson) were applied so that variables could assume an approximately normal distribution. This was followed by a series of unpaired t-tests to assess whether the mean proportion of each PoS was significantly different between successful and unsuccessful books.

P-values were adjusted for false discovery rate to avoid the inflation of type I error (a ‘false positive’ error). Analysis was performed only using the training data set. Variables were scaled before performing the tests.

Machine learning algorithm

A support vector machine (SVM) was used to predict the status of the book based on the variables deemed important by the initial univariate analysis. A LibLinear SVM with L2 regularisation, tuned over the training data, was used.

The model was tuned using 5-fold cross-validation. The final predictive power of the model was assessed using the 20% test data. Performance was assessed using accuracy, sensitivity and specificity.
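
A hedged sketch of that setup using caret, whose svmLinear3 method wraps LibLinear’s L2-regularised linear SVM (an approximation of the approach described above, not the exact script):

  library(caret)

  ctrl <- trainControl(method = "cv", number = 5)        # 5-fold cross-validation

  svm_fit <- train(status ~ ., data = training,          # status assumed to be a factor
                   method = "svmLinear3",                 # LiblineaR, L2-regularised
                   preProcess = c("center", "scale"),     # variables scaled before fitting
                   trControl = ctrl)

  pred <- predict(svm_fit, newdata = testing)
  confusionMatrix(pred, testing$status)                   # accuracy, sensitivity, specificity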

Variables with lots of zeroes

Ten variables had a lot of zeros and were heavily skewed. Thus, they were not transformed since none of the transformation algorithms fixed such a distribution. The remaining PoS did not contain such a large number of zeros and were transformed prior to performing the unpaired t-test. The package bestNormalize was used to find the most appropriate transformation.
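
A sketch of how bestNormalize picks a transformation for a single variable (the column name jj is illustrative):

  library(bestNormalize)

  bn <- bestNormalize(books$jj)        # tries log, Box-Cox, Yeo-Johnson, etc.
  books$jj_t <- predict(bn)            # transformed values for the training data
  bn$chosen_transform                  # which transformation was selected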

Three PoS were removed from the analysis (nnps, pdt and rp) since none of the novels included any of these PoS.

You can see the remaining variables and their transformation if you are keen.

Categories
Research

“Success with Style” part 1: analysing great (and not so great) books

Computers can predict how successful a book will be — or so ran the headlines a few years ago following university researchers’ publication of Success with style.

A bold claim, and if true it could benefit many people: the writer looking for feedback, the literary agent or publishing-house reader swamped among piles of manuscripts, the politician looking for the next great speech. But can artificial intelligence or machine learning really predict the success of books?

Machine learning and writing

Researchers at Stony Brook University in New York (Vikas Ashok, Song Feng and Yejin Choi) reduced stories and poems into their linguistic components and published the results. They claimed that success or failure is related to the types of words that make up a text.

Their paper, Success with style: using writing style to predict the success of novels, got a lot of attention at the time and has been cited multiple times since. Yet despite that happening in 2014, no publisher or agency has announced they’re replacing their readers with machines.

This may be in part because the authors didn’t detail their methods, and because the success rate was not 100%. Yet I had other issues with the paper. And now that I have completed my statistics studies, I can return to address them.

Investigating success

Things I wanted to investigate from the original paper:

  • the detailed methodology and how they got their results and why they chose to manipulate data in the way they did
  • the definition of success – not mentioned in the paper is that the download count used isn’t the total downloads of all time but the downloads over the previous 30 days, so it could be skewed. Is this the right measure?
  • the difference in proportions of success and failure was tiny, with no proportion being more than 1% – is this statistically significant?
  • the readability score – although not set out in their aims, their readability score is not divided by genre (like other scores). Why?

Success with style: the original process

The original researchers looked at 5 measures in a text (largely looking at how language was distributed and at sentiment), and we’re only going to focus on two, part-of-speech (PoS) tag distribution and readability, as these were fleshed out the most in the paper.

A PoS tagger tags each word based on its position and context, categorises it, and then reports how those categories are distributed through the text; there are several tools to do this, along with different tag sets to use. The Penn Treebank tag set is used in this paper. Although PoS tagging wasn’t the researchers’ highest-scoring predictor of success (they claim it only had a 2/3 success rate), this is still better than we’d expect by chance (50%, as it was a simple success/failure test).
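
To give a flavour of what a Penn-style tagger produces (this sentence is my own illustration, not taken from the books), a phrase such as ‘The old sailor told a long story’ comes out as: The/DT old/JJ sailor/NN told/VBD a/DT long/JJ story/NN ./. The proportions of each tag (two determiners, two adjectives, two nouns and one past-tense verb here) are what get compared between books.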

Yet the question that wasn’t asked, let alone answered, is: “are these results statistically significant?” That is, are these results we should sit up and pay attention to, or is the difference in the selected books just down to chance? Throughout I’ll use p ≤ 0.05 as my threshold for significance.

Success with style methodology

The original paper doesn’t detail the methods used, nor the hypothesis. As we’re going to perform statistical tests, we’re going to create a hypothesis.

If you haven’t come across a hypothesis before, the idea is that you start an experiment with a null hypothesis (H0), usually that the current situation is correct. You then offer an alternative hypothesis (HA) that represents your research question.

You test as if H0 is true. If the test results don’t provide convincing evidence for the alternative hypothesis, stick with H0. If they do then reject H0 in favour of HA (note this isn’t the same as saying that the test proves that the alternative hypothesis is true).

I interpreted the original PoS hypothesis as:

H0: There's no difference in the distribution of the proportion of parts of speech (PoS) tags in successful and unsuccessful books, regardless of the book's genre.
HA: There is a difference in the distribution of the proportion of PoS tags in successful and unsuccessful books, and the pattern will depend on a book's genre.

For readability I’ll be using the Flesch-Kincaid grade measure, where the lower the score, the more readable the work. The original researchers used other measures (Fog and Flesch), but this is the measure I’ve used elsewhere, and so:

H0: There's no difference in the readability measure of successful and unsuccessful books, regardless of the book's genre.
HA: There is a difference in the readability measure of successful and unsuccessful books, and the pattern will depend on a book's genre.

Recreating the Success with style method

The original team used the Stanford tagger. I used the Perl tagger as I already have it set up, but, like the Stanford tagger, it uses the Penn Treebank to assign PoS tags to English text. It’ll also be of interest to see if a different PoS program creates any difference.

I’m going to:

  1. Get the source books and the metadata used in the original study.
  2. Run the books through the PoS tagger and readability analyser.
  3. Recreate their output data and compare it with the original results.
  4. Carry out statistical tests for the significance of the results.

I’ll also use LIWC 2015, an alternative language tagger that I’ve used for past projects, and repeat the analysis with it.

Finally, as the original results relied on a 30-day success/failure measure, I’m going to reuse the same books to see if they now have different download counts and whether the accuracy is repeated.

Recreating the original results

Recreating the original results wasn’t easy. The image below is from the original paper, but how they got this is missing.

Chart showing the distribution of parts of speech tags
From the original paper: Differences in POS tag distribution between more successful and less successful books across different genres. Negative (positive) value indicates higher percentage in less (more) successful class. Ashok et al 2014

The output data offers a range of ways of interpreting it, but this is how it was created (all data links are at the end if you want the raw and manipulated data):

  1. Split the data into success/failure and sum all the tag data (eg all the CC tags, all the CD tags and so on).
  2. Work out the proportion of all tags the individual tags represent (eg for CC you get CC/(CC+CD+NN+…) in success/failure, so for the Adventure genre CC is 140,393/3,386,774 = 4.1%).
  3. For each tag, subtract the failure proportion from the success proportion to give a net value (eg 4.14% – 4.223% = −0.08% difference), as shown in my table; a short R sketch of these steps follows.
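
A sketch of those three steps in R, assuming a data frame tags with one row per book, a status column and a raw count column for each PoS tag (names illustrative):

  tag_cols <- setdiff(names(tags), c("status", "genre"))

  prop_by_status <- function(df) {
    totals <- colSums(df[, tag_cols])   # step 1: sum each tag's counts
    totals / sum(totals)                # step 2: each tag as a proportion of all tags
  }

  success_prop <- prop_by_status(tags[tags$status == "success", ])
  failure_prop <- prop_by_status(tags[tags$status == "failure", ])

  difference <- success_prop - failure_prop   # step 3: net difference per tag
  sort(difference, decreasing = TRUE)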

Here’s my initial output (next time I’ll detail how I got it):

Not precisely the same but very close, and I’ve used similar colours and the same (rather odd) arrangement of tags for easy comparison.

I labelled the y-axis in the charts to draw attention to the tiny scale of the differences. The scale is a maximum of 1% difference, with most differences within ±0.5%.

This is minuscule and raises the question of whether the differences are statistically significant.

Next time

Part 1 has been about the original experiment and how to go about recreating it. Part 2 will test it, reporting the differences in findings and their statistical significance.

Data

This research was sponsored by Richardson Online Ltd to highlight how computers, analysis and content can come together.