
“Success with Style” part 4 — modern data and just a chapter

When starting this analysis I spotted that the download data was for the past 30 days and that this was used for success or fail categorisation. 

Even if the data was for the lifetime of the book, it’s been nearly 5 years since the original downloads. The best way to test this then was to get the latest data (albeit still for the past 30 days).

The other thought was that the analyses looked at the entire book. But what if readers did not read the entire book and only read a certain amount before making a judgment? When submitting work to an agent or publisher for consideration, for example, often only the first chapter is requested. Based on this I ran just the first 3,000 words of each book through the Penn and LIWC taggers and used the 2013 success/fail data to repeat the experiments.

Finally I noticed a bias towards punctuation as markers for success or failure in the output and ran the experiments without the punctuation tags to see what the result would be.

Starting hypotheses

H0: There's no difference in the tests which produce significant results between the 2013 and 2018 data
HA: There is a difference in the tests which produce significant results between the 2013 and 2018 data

H0: There's no difference in the tests which produce significant results between the full machine analysis of the book and that of just the first 3,000 words
HB: There is a difference in the tests which produce significant results between the full machine analysis of the book and that of just the first 3,000 words

The hypotheses are fairly simple – if there is no difference in the 2018 data then most of the tests that proved significant with the 2013 data should also do so in 2018.

Likewise, if limiting the analysis to the first 3,000 words makes no difference, the test results should be significant at the same level.

3,000 words (3k words) is about 10 pages and about one chapter’s length, although of course there is no hard and fast rule about how long a chapter is.
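As a rough illustration of the 3,000-word cut-off, here is a minimal Python sketch. It assumes that "words" means whitespace-separated tokens, which may not be how the original sample was taken, and the function name first_n_words is purely illustrative.

```python
def first_n_words(text: str, n: int = 3000) -> str:
    """Return the first n whitespace-separated words of text.

    Assumption: a "word" is any run of non-whitespace characters.
    Texts shorter than n words are returned whole.
    """
    words = text.split()
    return " ".join(words[:n])


# Tiny example with n=3 so the result is easy to read
sample = first_n_words("It was a dark and stormy night", n=3)
print(sample)
```

The truncated text can then be passed to the taggers in the same way as the full book.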

Data used

Data summary

2018 data download date: 2018-07-22
2013 data download date: 2013-10-23
Unique books used: 759

Difference in 2013 and 2018 success rates

Row Labels Count
FAILURE 22
Adventure 5
Detective/mystery 3
Fiction 2
Historical-fiction 1
Love-story 1
Poetry 8
Short-stories 2
SUCCESS 20
Adventure 3
Detective/mystery 4
Fiction 1
Historical-fiction 4
Love-story 3
Sci-fi 5
Grand Total 42

There were 758 unique books (the remaining 42 of the 800 listed were in multiple categories). The 42 books whose success status differed make up 5.5% of the total books used, and none of them was listed in multiple categories.

The new data was parsed through the Perl Lingua tagger (using the Penn treebank), the Perl readability measure and the LIWC tagger.
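The figures above can be checked with a short Python sketch. The counts are copied from the table of changed books; the variable names are only illustrative.

```python
# Books whose success status differed between 2013 and 2018, by genre
changed = {
    "FAILURE": {"Adventure": 5, "Detective/mystery": 3, "Fiction": 2,
                "Historical-fiction": 1, "Love-story": 1, "Poetry": 8,
                "Short-stories": 2},
    "SUCCESS": {"Adventure": 3, "Detective/mystery": 4, "Fiction": 1,
                "Historical-fiction": 4, "Love-story": 3, "Sci-fi": 5},
}

listed_books = 800        # books listed across all genres
multi_category = 42       # listings that repeat a book in another genre
unique_books = listed_books - multi_category

totals = {status: sum(genres.values()) for status, genres in changed.items()}
total_changed = sum(totals.values())
share = total_changed / unique_books

print(totals)             # per-status totals
print(total_changed)      # grand total
print(f"{share:.1%}")     # share of unique books that changed status
```

This gives 22 and 20 for the two groups, 42 in total, and a share of 5.5%. The data summary lists 759 unique books rather than 758; either figure rounds to 5.5%.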

Results for 2013, 2018 and 3,000 word data

Machine learning performance

The most important measure for me is which is the best for making predictions. 

Using all tags including punctuation

Test | Accuracy | 95% Confidence Interval | Sensitivity | Specificity
Readability 2013 | 65.62% | 57.7-72.9% | 69% | 63%
Readability 2018 | 65.00% | 57.5-72.8% | 68% | 63%
Readability 3k | 55.62% | 47.6-63.5% | 68% | 44%
LIWC 2013 | 75.00% | 67.6-81.5% | 76% | 74%
LIWC 2018 | 71.70% | 64.0-78.6% | 78% | 66%
LIWC 3k | 56.25% | 48.2-64.0% | 53% | 60%

According to this, LIWC is still the best tagger, and the 2013 and 2018 data are fairly similar for both readability and LIWC, with each result falling within the other’s 95% confidence interval.
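The "within each other's confidence interval" comparison can be written out as a small Python sketch using the accuracy figures from the table above. This is only an informal consistency check, not a formal test of difference, and the helper name mutually_within is illustrative.

```python
# (accuracy %, CI lower %, CI upper %) taken from the performance table
results = {
    "Readability 2013": (65.62, 57.7, 72.9),
    "Readability 2018": (65.00, 57.5, 72.8),
    "LIWC 2013": (75.00, 67.6, 81.5),
    "LIWC 2018": (71.70, 64.0, 78.6),
}


def mutually_within(name_a: str, name_b: str) -> bool:
    """True if each accuracy lies inside the other's 95% interval."""
    acc_a, low_a, high_a = results[name_a]
    acc_b, low_b, high_b = results[name_b]
    return low_b <= acc_a <= high_b and low_a <= acc_b <= high_a


for tagger in ("Readability", "LIWC"):
    print(tagger, mutually_within(f"{tagger} 2013", f"{tagger} 2018"))
```

Both comparisons come out as True for the figures reported here.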

Both for readability and LIWC the first 3,000 words (3k) are much worse predictors of overall success and barely better than a 50/50 guess.
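One way to read "barely better than a 50/50 guess" is to ask whether 50% falls inside the reported 95% confidence interval. The Python sketch below does that for the intervals in the table; treating 50% as the chance level is an assumption that only holds if the success and failure classes are roughly balanced.

```python
CHANCE = 50.0  # assumed chance-level accuracy for two balanced classes

# 95% confidence intervals (lower %, upper %) from the performance table
intervals = {
    "Readability 2013": (57.7, 72.9),
    "Readability 2018": (57.5, 72.8),
    "Readability 3k": (47.6, 63.5),
    "LIWC 2013": (67.6, 81.5),
    "LIWC 2018": (64.0, 78.6),
    "LIWC 3k": (48.2, 64.0),
}


def includes_chance(low: float, high: float, chance: float = CHANCE) -> bool:
    """True if the chance level lies inside the interval."""
    return low <= chance <= high


at_chance = [name for name, (low, high) in intervals.items()
             if includes_chance(low, high)]
print(at_chance)
```

Only the two 3k intervals include 50%; the full-book intervals all sit above it.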

Difference in significance in key measures

Punctuation

Omitting punctuation made little difference to the LIWC or Penn analyses, although the machine learning performance dropped by around 5 percentage points in every case.

Readability 

Genre | Significant 2013 | Significant 2018 | Significant 3k words
Adventure | TRUE | TRUE | TRUE
Detective/mystery | TRUE | TRUE | TRUE
Fiction | FALSE | FALSE | FALSE
Historical-fiction | FALSE | FALSE | FALSE
Love-story | TRUE | TRUE | TRUE
Poetry | FALSE | FALSE | FALSE
Sci-fi | FALSE | FALSE | FALSE
Short-stories | FALSE | FALSE | FALSE

Readability was significant in the same genres (Adventure, Detective/mystery and Love-story) across all 3 datasets.

LIWC categories

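Test | Genre | Significant 2013 | Significant 2018 | Significant 3k words
Clout | Adventure | TRUE | FALSE | TRUE
Clout | Detective-mystery | TRUE | TRUE | FALSE
Clout | Fiction | TRUE | TRUE | FALSE
Clout | Historical-fiction | FALSE | FALSE | FALSE
Clout | Love-story | FALSE | FALSE | FALSE
Clout | Poetry | FALSE | FALSE | FALSE
Clout | Sci-fi | FALSE | FALSE | FALSE
Clout | Short-stories | FALSE | FALSE | FALSE

Authenticity | Adventure | FALSE | FALSE | FALSE
Authenticity | Detective-mystery | FALSE | FALSE | FALSE
Authenticity | Fiction | TRUE | TRUE | FALSE
Authenticity | Historical-fiction | FALSE | FALSE | TRUE
Authenticity | Love-story | FALSE | FALSE | FALSE
Authenticity | Poetry | TRUE | TRUE | FALSE
Authenticity | Sci-fi | FALSE | FALSE | FALSE
Authenticity | Short-stories | FALSE | FALSE | FALSE

Analytical | Adventure | FALSE | FALSE | FALSE
Analytical | Detective-mystery | FALSE | FALSE | FALSE
Analytical | Fiction | TRUE | TRUE | TRUE
Analytical | Historical-fiction | FALSE | FALSE | FALSE
Analytical | Love-story | FALSE | FALSE | TRUE
Analytical | Poetry | FALSE | FALSE | FALSE
Analytical | Sci-fi | FALSE | FALSE | FALSE
Analytical | Short-stories | FALSE | FALSE | FALSE

6 letter words | Adventure | TRUE | TRUE | TRUE
6 letter words | Detective-mystery | FALSE | FALSE | FALSE
6 letter words | Fiction | FALSE | FALSE | FALSE
6 letter words | Historical-fiction | FALSE | FALSE | FALSE
6 letter words | Love-story | TRUE | TRUE | TRUE
6 letter words | Poetry | FALSE | FALSE | FALSE
6 letter words | Sci-fi | FALSE | FALSE | FALSE
6 letter words | Short-stories | FALSE | FALSE | FALSE

Dictionary words | Adventure | FALSE | FALSE | FALSE
Dictionary words | Detective-mystery | FALSE | TRUE | TRUE
Dictionary words | Fiction | TRUE | TRUE | FALSE
Dictionary words | Historical-fiction | FALSE | FALSE | TRUE
Dictionary words | Love-story | FALSE | FALSE | TRUE
Dictionary words | Poetry | FALSE | FALSE | FALSE
Dictionary words | Sci-fi | TRUE | TRUE | TRUE
Dictionary words | Short-stories | FALSE | FALSE | FALSE

Tone | Adventure | FALSE | FALSE | FALSE
Tone | Detective-mystery | TRUE | TRUE | TRUE
Tone | Fiction | TRUE | TRUE | TRUE
Tone | Historical-fiction | FALSE | FALSE | FALSE
Tone | Love-story | TRUE | TRUE | FALSE
Tone | Poetry | TRUE | TRUE | TRUE
Tone | Sci-fi | FALSE | FALSE | FALSE
Tone | Short-stories | TRUE | TRUE | TRUE

Mean words per sentence | Adventure | TRUE | TRUE | TRUE
Mean words per sentence | Detective-mystery | FALSE | FALSE | FALSE
Mean words per sentence | Fiction | TRUE | TRUE | FALSE
Mean words per sentence | Historical-fiction | FALSE | FALSE | FALSE
Mean words per sentence | Love-story | FALSE | FALSE | FALSE
Mean words per sentence | Poetry | FALSE | FALSE | FALSE
Mean words per sentence | Sci-fi | FALSE | FALSE | FALSE
Mean words per sentence | Short-stories | FALSE | FALSE | TRUE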

Whereas readability was consistent across the different approaches, the LIWC categories show a lot more variety.

Tone has the most success across these. As before, the 2013 and 2018 data tend to match (but not always, as with Clout or Dictionary words), while the 3,000-word data does its own thing.

Tone was the most consistent throughout and, as last time, was significant in the most genres, even with the 3k data.
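To make the comparison across the three datasets easier to follow, the Python sketch below tallies the LIWC significance flags from the table above. The flags are transcribed by hand (T for TRUE, F for FALSE, in the order 2013, 2018, 3k), and the summarise helper is illustrative rather than part of the original analysis.

```python
GENRES = ["Adventure", "Detective-mystery", "Fiction", "Historical-fiction",
          "Love-story", "Poetry", "Sci-fi", "Short-stories"]

# One string per genre (same order as GENRES): flags for 2013, 2018, 3k
FLAGS = {
    "Clout": ["TFT", "TTF", "TTF", "FFF", "FFF", "FFF", "FFF", "FFF"],
    "Authenticity": ["FFF", "FFF", "TTF", "FFT", "FFF", "TTF", "FFF", "FFF"],
    "Analytical": ["FFF", "FFF", "TTT", "FFF", "FFT", "FFF", "FFF", "FFF"],
    "6 letter words": ["TTT", "FFF", "FFF", "FFF", "TTT", "FFF", "FFF", "FFF"],
    "Dictionary words": ["FFF", "FTT", "TTF", "FFT", "FFT", "FFF", "TTT", "FFF"],
    "Tone": ["FFF", "TTT", "TTT", "FFF", "TTF", "TTT", "FFF", "TTT"],
    "Mean words per sentence": ["TTT", "FFF", "TTF", "FFF", "FFF", "FFF", "FFF", "FFT"],
}


def summarise(rows):
    """Return ([significant genres per dataset], 2013-2018 matches, 2013-3k matches)."""
    counts = [sum(row[i] == "T" for row in rows) for i in range(3)]
    match_2018 = sum(row[0] == row[1] for row in rows)
    match_3k = sum(row[0] == row[2] for row in rows)
    return counts, match_2018, match_3k


summary = {test: summarise(rows) for test, rows in FLAGS.items()}
for test, (counts, match_2018, match_3k) in summary.items():
    print(f"{test}: significant genres {counts}, "
          f"2013 vs 2018 match {match_2018}/8, 2013 vs 3k match {match_3k}/8")

total_2018 = sum(s[1] for s in summary.values())
total_3k = sum(s[2] for s in summary.values())
print(f"Overall: 2013 vs 2018 {total_2018}/56, 2013 vs 3k {total_3k}/56")
```

On these flags Tone is significant in 5 genres for both 2013 and 2018 and in 4 for the 3k data. Across all tests the 2018 flags match 2013 in 54 of 56 cells, against 43 of 56 for the 3k data.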

Parts of speech tags (PoS) with the largest difference

The tables list the top 3 PoS that dominate in successful and unsuccessful books.

Penn data

Successful PoS 2013 | Successful PoS 2018 | Successful PoS 3k
INN – Preposition / Conjunction | INN – Preposition / Conjunction | INN – Preposition / Conjunction
DET – Determiner | DET – Determiner | DET – Determiner
NNS – Noun, plural | NNS – Noun, plural | NNS – Noun, plural

Unsuccessful PoS 2013 | Unsuccessful PoS 2018 | Unsuccessful PoS 3k
PRP – Determiner, possessive second | PRP – Determiner, possessive second | RB – Adverb
RB – Adverb | VB – Verb, infinitive | PRP – Determiner, possessive second
VB – Verb, infinitive | RB – Adverb | VB – Verb, infinitive

LIWC data

Successful PoS 2013 | Successful PoS 2018 | Successful PoS 3k
functional – Total function words | functional – Functional words | functional – Total function words
prep – Prepositions | prep – Prepositions | prep – Prepositions
article – Articles | space – Space | article – Articles

Unsuccessful PoS 2013 | Unsuccessful PoS 2018 | Unsuccessful PoS 3k
quote – Quotation marks | allpunc – All Punctuation* | adj – Common adjectives
allpunc – All Punctuation* | affect – Affective processes | adverb – Common Adverbs
affect – Affective processes | posemo – Positive emotion | affect – Affective processes

The same tags dominate successful books in all three Penn treebank datasets: prepositions (for, of, although, that), determiners (this, each, some) and plural nouns (women, books).

For unsuccessful books determiners also dominate, but in the possessive second person (mine, yours), along with adverbs (often, not, very, here) and infinitive verbs (take, live).

For LIWC it is quite similar. Function words (it, to, no, very), prepositions (to, with, above) and articles (a, an, the) dominate successful books.

For unsuccessful books it’s all punctuation, quotation marks, social words (mate, talk, they, plus all family references) and affective processes (happy, cried), which include all emotional terms.

A high rate of quotation marks suggests a high ratio of dialogue to action or description.
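As a rough way to see what a quotation mark rate looks like, here is a minimal Python sketch that counts double quotation marks per 100 words. It is a stand-in for illustration only: the choice of marks, the whitespace word count and the function name quote_rate are all assumptions, and LIWC's own quote measure may be calculated differently.

```python
# Assumption: straight and curly double quotes count as quotation marks;
# single quotes are left out because they also serve as apostrophes.
QUOTE_MARKS = ('"', "\u201c", "\u201d")


def quote_rate(text: str) -> float:
    """Quotation marks per 100 whitespace-separated words."""
    words = text.split()
    if not words:
        return 0.0
    quotes = sum(text.count(mark) for mark in QUOTE_MARKS)
    return 100 * quotes / len(words)


dialogue = '"Run," she said. "Now."'
description = "The road ran north across the empty plain."
print(quote_rate(dialogue) > quote_rate(description))
```

A dialogue-heavy passage scores higher than pure description on this measure, which is the pattern the unsuccessful books show.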

What does this tell us?

2013 v 2018 data

Overall there is more similarity than difference between the 2013 and 2018 Penn and readability results. The machine learning performance was also broadly the same, with each year’s accuracy falling within the other’s 95% confidence interval.

The most successful PoS were also largely the same, as were the top 3 unsuccessful ones.

Likewise the LIWC categories generally matched in significance for both the 2013 and 2018 data. The successful PoS were broadly the same, as were the unsuccessful ones.

This suggests that, although the original authors didn’t mention that the data covered only the previous 30 days, their results have largely held up.

The first chapter

Just judging a book by its first 3,000 words was not as accurate as analysing the whole book. The machine learning performance was barely better than a guess. 

However, the readability results did match, and the dominant successful PoS were similar to those in the full 2013 and 2018 data.

Of all the LIWC categories described in part 3, Tone was both the most significant predictor across genres and the most consistent across the different tests.

Summary

The 2018 results generally match the 2013 results, which suggests the original method holds up as a good predictor of the success or failure of those books.

The results for the first 3,000 words did not match the 2013 or 2018 data, and their machine learning performance was the weakest, which suggests that this is not an accurate way to predict a book’s success. It may be that there is a ‘sweet spot’ where the first x words correlate closely with the overall rating, but if so it is more than 3,000 words.

Successful books tend to use prepositions, determiners, nouns and function words. Unsuccessful ones skew towards quotation marks, punctuation and positive emotions (which in LIWC are similar to affective processes).

This suggests that unsuccessful books may use shorter sentences (high punctuation rate), more dialogue (high quotation mark rate) and more adverbs, and are more emotional, particularly with positive emotions. Writers are frequently told by writing experts to avoid adverbs wherever possible.

Successful books by contrast tend to focus on the action – describing scenes and situations, hence the dominance of functional words, prepositions and articles. This makes them sound rather boring, but suggests that these bread and butter words are necessary to build a good story.

The LIWC data suggests that tone is the most reliable predictor of success. But it does not answer whether that is because tone predominates in successful or unsuccessful books, or whether it is positive or negative emotion that matters. This is something to explore, though emotion and affect appearing in the top 3 for unsuccessful books suggests it predominates there.

Including the punctuation tags had some use, as machine learning performance was better with them. So even though punctuation tags can be hard to interpret, they are worth including in any machine analysis, but more work is needed to interpret them.

By Jonathan Richardson

Jonathan Richardson is a writer and the editor of Considered Words.

He's worked as a journalist, writer and analyst for organisations including the BBC and Which? He's also written for the stage in Cambridge, radio and sketches at the Edinburgh festival.

He's now a freelance writer and data analyst.
