When I started this analysis I spotted that the download data covered only the previous 30 days, and that this was what decided whether a book was categorised as a success or a failure.
Even if the data had covered the lifetime of the book, it has been nearly 5 years since the original downloads. The best way to test whether the results still hold was therefore to get the latest data (albeit still only for the past 30 days).
The other thought was that the analyses looked at the entire book. But what if readers do not read the entire book and instead read only a certain amount before making a judgment? When submitting work to an agent or publisher for consideration, for example, often only the first chapter is requested. Based on this I ran just the first 3,000 words of each book through the Penn and LIWC taggers and repeated the experiments using the 2013 success/fail data.
Finally I noticed a bias towards punctuation as markers for success or failure in the output and ran the experiments without the punctuation tags to see what the result would be.
Starting hypotheses
H0: There's no difference in the tests which produce significant results between the 2013 and 2018 data
HA: There is a difference in the tests which produce significant results between the 2013 and 2018 data
H0: There's no difference in the tests which produce significant results between the full machine analysis of the book and that of just the first 3,000 words
HB: There is a difference in the tests which produce significant results between the full machine analysis of the book and that of just the first 3,000 words
The hypotheses are fairly simple: if nothing has changed by 2018, then most of the tests that proved significant with the 2013 data should also be significant with the 2018 data.
Likewise, if restricting the analysis to the first 3,000 words makes no difference, the same tests should be significant for the 3,000-word data as for the full text.
3,000 words (3k words) is about 10 pages and is about one chapter’s length although of course there is no hard and fast rule about how long a chapter is.
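As a rough illustration of the truncation step, here is a minimal sketch of cutting a text down to its first 3,000 words before tagging. The function name and the whitespace-based definition of a "word" are my assumptions for illustration; the taggers may count words differently.

```python
def first_n_words(text: str, n: int = 3000) -> str:
    """Return the first n whitespace-separated words of text.

    Assumption: a "word" is any run of non-whitespace characters.
    Line breaks are collapsed into single spaces.
    """
    words = text.split()
    return " ".join(words[:n])


# Filler text standing in for a book: 8 tokens repeated 1,000 times.
sample = "It was a dark and stormy night . " * 1000
excerpt = first_n_words(sample)
print(len(excerpt.split()))
```

A text shorter than 3,000 words is returned whole, so short stories and poems would pass through unchanged.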
Data used
Data summary
2018 data download date | 2018-07-22 |
2013 data download date | 2013-10-23 |
Unique books used | 759 |
Difference in 2013 and 2018 success rates
Status and genre | Count |
FAILURE | 22 |
Adventure | 5 |
Detective/mystery | 3 |
Fiction | 2 |
Historical-fiction | 1 |
Love-story | 1 |
Poetry | 8 |
Short-stories | 2 |
SUCCESS | 20 |
Adventure | 3 |
Detective/mystery | 4 |
Fiction | 1 |
Historical-fiction | 4 |
Love-story | 3 |
Sci-fi | 5 |
Grand Total | 42 |
Of the 800 books listed, 758 were unique; the remaining 42 listings were books that appeared in more than one category. The 42 books whose success status differed between 2013 and 2018 make up 5.5% of the books used, and none of them was listed in multiple categories.
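The comparison behind the table above can be sketched as follows. The book ids and records are made up for illustration, and keying the counts by the newer status is my assumption; only the 42 and 758 figures come from this post.

```python
from collections import Counter

# Hypothetical records: book id -> (genre, status). Not the real data.
status_2013 = {
    "book-a": ("Adventure", "SUCCESS"),
    "book-b": ("Poetry", "SUCCESS"),
    "book-c": ("Sci-fi", "FAILURE"),
    "book-d": ("Fiction", "FAILURE"),
}
status_2018 = {
    "book-a": ("Adventure", "SUCCESS"),
    "book-b": ("Poetry", "FAILURE"),
    "book-c": ("Sci-fi", "SUCCESS"),
    "book-d": ("Fiction", "FAILURE"),
}


def changed_status(old, new):
    """Count books whose status differs, keyed by (new status, genre)."""
    changes = Counter()
    for book_id, (genre, old_status) in old.items():
        if book_id in new and new[book_id][1] != old_status:
            changes[(new[book_id][1], genre)] += 1
    return changes


changes = changed_status(status_2013, status_2018)
share = 42 / 758  # changed books / unique books, as reported above
print(changes)
print(f"{share:.1%}")
```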
The new data was run through the Perl Lingua Tagger (using the Penn treebank tags), the Perl readability measure and the LIWC tagger.
Results for 2013, 2018 and 3,000 word data
Machine learning performance
The most important measure for me is which approach is best at making predictions.
Using all tags including punctuation | Accuracy | 95% Confidence Interval | Sensitivity | Specificity |
Readability 2013 | 65.62% | 57.7-72.9% | 69% | 63% |
Readability 2018 | 65.00% | 57.5-72.8% | 68% | 63% |
Readability 3k | 55.62% | 47.6-63.5% | 68% | 44% |
LIWC 2013 | 75.00% | 67.6-81.5% | 76% | 74% |
LIWC 2018 | 71.70% | 64.0-78.6% | 78% | 66% |
LIWC 3k | 56.25% | 48.2-64.0% | 53% | 60% |
According to this, LIWC is still the best tagger, and the 2013 and 2018 data give fairly similar results for both readability and LIWC, with each accuracy falling inside the other year’s 95% confidence interval.
For both readability and LIWC, the first 3,000 words (3k) are a much worse predictor of overall success and barely better than a 50/50 guess.
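For anyone wanting to reproduce this kind of interval, here is a minimal sketch. I have not recorded here how the intervals were calculated, so the exact binomial (Clopper-Pearson) method below is an assumption, as is the test-set size: 105 correct out of 160 is simply one count that gives 65.62%. The second part checks the "within each other's interval" claim using the figures from the table.

```python
from math import comb


def binom_cdf(k: int, n: int, p: float) -> float:
    """P(X <= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k + 1))


def bisect_root(func, target, increasing, tol=1e-9):
    """Find p in [0, 1] where a monotonic func(p) crosses target."""
    lo, hi = 0.0, 1.0
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if (func(mid) < target) == increasing:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


def exact_interval(k: int, n: int, alpha: float = 0.05):
    """Clopper-Pearson interval for k successes out of n trials."""
    lower = 0.0 if k == 0 else bisect_root(
        lambda p: 1 - binom_cdf(k - 1, n, p), alpha / 2, increasing=True)
    upper = 1.0 if k == n else bisect_root(
        lambda p: binom_cdf(k, n, p), alpha / 2, increasing=False)
    return lower, upper


# Assumption: 105 correct out of 160 test books (105 / 160 = 65.625%).
lower, upper = exact_interval(105, 160)
print(f"{lower:.1%} to {upper:.1%}")

# (accuracy, interval) for 2013 and 2018, copied from the table above.
pairs = {
    "Readability": ((65.62, (57.7, 72.9)), (65.00, (57.5, 72.8))),
    "LIWC": ((75.00, (67.6, 81.5)), (71.70, (64.0, 78.6))),
}
for name, ((acc_a, ci_a), (acc_b, ci_b)) in pairs.items():
    mutual = ci_b[0] <= acc_a <= ci_b[1] and ci_a[0] <= acc_b <= ci_a[1]
    print(name, "within each other's interval:", mutual)
```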
Difference in significance in key measures
Punctuation
Overall, omitting punctuation made little difference to the LIWC or Penn analyses, although the machine learning performance dropped by around 5 percentage points in each case.
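One way to remove punctuation before counting tags is sketched below. This is an illustration rather than the exact procedure used: the tagged tokens are made up, and "PUNCT" is a placeholder tag name, not one taken from the Penn or LIWC output.

```python
from collections import Counter
import string


def drop_punctuation(tagged):
    """Remove (token, tag) pairs whose token is made up only of punctuation."""
    return [
        (token, tag)
        for token, tag in tagged
        if not all(ch in string.punctuation for ch in token)
    ]


# Hypothetical tagger output; "PUNCT" is a placeholder tag name.
tagged = [
    ("The", "DET"), ("books", "NNS"), (",", "PUNCT"),
    ("often", "RB"), ("sold", "VBD"), (".", "PUNCT"),
]
counts = Counter(tag for _, tag in drop_punctuation(tagged))
print(counts)
```

Filtering on the token rather than on the tag name avoids having to list every punctuation tag a tagger might emit.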
Readability
Genre | Significant 2013 | Significant 2018 | Significant 3k words |
Adventure | TRUE | TRUE | TRUE |
Detective/mystery | TRUE | TRUE | TRUE |
Fiction | FALSE | FALSE | FALSE |
Historical-fiction | FALSE | FALSE | FALSE |
Love-story | TRUE | TRUE | TRUE |
Poetry | FALSE | FALSE | FALSE |
Sci-fi | FALSE | FALSE | FALSE |
Short-stories | FALSE | FALSE | FALSE |
The same genres produced significant readability results in all three datasets (2013, 2018 and 3k words).
LIWC categories
Test | Genre | Significant 2013 | Significant 2018 | Significant 3k words |
Clout | Adventure | TRUE | FALSE | TRUE |
Clout | Detective-mystery | TRUE | TRUE | FALSE |
Clout | Fiction | TRUE | TRUE | FALSE |
Clout | Historical-fiction | FALSE | FALSE | FALSE |
Clout | Love-story | FALSE | FALSE | FALSE |
Clout | Poetry | FALSE | FALSE | FALSE |
Clout | Sci-fi | FALSE | FALSE | FALSE |
Clout | Short-stories | FALSE | FALSE | FALSE |
Authenticity | Adventure | FALSE | FALSE | FALSE |
Authenticity | Detective-mystery | FALSE | FALSE | FALSE |
Authenticity | Fiction | TRUE | TRUE | FALSE |
Authenticity | Historical-fiction | FALSE | FALSE | TRUE |
Authenticity | Love-story | FALSE | FALSE | FALSE |
Authenticity | Poetry | TRUE | TRUE | FALSE |
Authenticity | Sci-fi | FALSE | FALSE | FALSE |
Authenticity | Short-stories | FALSE | FALSE | FALSE |
Analytical | Adventure | FALSE | FALSE | FALSE |
Analytical | Detective-mystery | FALSE | FALSE | FALSE |
Analytical | Fiction | TRUE | TRUE | TRUE |
Analytical | Historical-fiction | FALSE | FALSE | FALSE |
Analytical | Love-story | FALSE | FALSE | TRUE |
Analytical | Poetry | FALSE | FALSE | FALSE |
Analytical | Sci-fi | FALSE | FALSE | FALSE |
Analytical | Short-stories | FALSE | FALSE | FALSE |
6 letter words | Adventure | TRUE | TRUE | TRUE |
6 letter words | Detective-mystery | FALSE | FALSE | FALSE |
6 letter words | Fiction | FALSE | FALSE | FALSE |
6 letter words | Historical-fiction | FALSE | FALSE | FALSE |
6 letter words | Love-story | TRUE | TRUE | TRUE |
6 letter words | Poetry | FALSE | FALSE | FALSE |
6 letter words | Sci-fi | FALSE | FALSE | FALSE |
6 letter words | Short-stories | FALSE | FALSE | FALSE |
Dictionary words | Adventure | FALSE | FALSE | FALSE |
Dictionary words | Detective-mystery | FALSE | TRUE | TRUE |
Dictionary words | Fiction | TRUE | TRUE | FALSE |
Dictionary words | Historical-fiction | FALSE | FALSE | TRUE |
Dictionary words | Love-story | FALSE | FALSE | TRUE |
Dictionary words | Poetry | FALSE | FALSE | FALSE |
Dictionary words | Sci-fi | TRUE | TRUE | TRUE |
Dictionary words | Short-stories | FALSE | FALSE | FALSE |
Tone | Adventure | FALSE | FALSE | FALSE |
Tone | Detective-mystery | TRUE | TRUE | TRUE |
Tone | Fiction | TRUE | TRUE | TRUE |
Tone | Historical-fiction | FALSE | FALSE | FALSE |
Tone | Love-story | TRUE | TRUE | FALSE |
Tone | Poetry | TRUE | TRUE | TRUE |
Tone | Sci-fi | FALSE | FALSE | FALSE |
Tone | Short-stories | TRUE | TRUE | TRUE |
Mean words per sentence | Adventure | TRUE | TRUE | TRUE |
Mean words per sentence | Detective-mystery | FALSE | FALSE | FALSE |
Mean words per sentence | Fiction | TRUE | TRUE | FALSE |
Mean words per sentence | Historical-fiction | FALSE | FALSE | FALSE |
Mean words per sentence | Love-story | FALSE | FALSE | FALSE |
Mean words per sentence | Poetry | FALSE | FALSE | FALSE |
Mean words per sentence | Sci-fi | FALSE | FALSE | FALSE |
Mean words per sentence | Short-stories | FALSE | FALSE | TRUE |
Whereas readability was consistent across the different approaches, the LIWC categories show a lot more variety.
Tone was significant in the most genres and, as last time, was the most consistent measure, remaining significant in four genres even with the 3k data. As before, the 2013 and 2018 data tend to match (but not always, as with Clout or Dictionary words), while the 3,000-word data, well, does its own thing.
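To make the comparison concrete, here is a small sketch that counts, for Tone and Clout, how many genres were significant in each dataset and in how many genres all three datasets agree. The TRUE/FALSE values are copied from the table above; the function name is just for illustration.

```python
# (2013, 2018, 3k) significance flags copied from the table above.
significance = {
    "Clout": {
        "Adventure": (True, False, True),
        "Detective-mystery": (True, True, False),
        "Fiction": (True, True, False),
        "Historical-fiction": (False, False, False),
        "Love-story": (False, False, False),
        "Poetry": (False, False, False),
        "Sci-fi": (False, False, False),
        "Short-stories": (False, False, False),
    },
    "Tone": {
        "Adventure": (False, False, False),
        "Detective-mystery": (True, True, True),
        "Fiction": (True, True, True),
        "Historical-fiction": (False, False, False),
        "Love-story": (True, True, False),
        "Poetry": (True, True, True),
        "Sci-fi": (False, False, False),
        "Short-stories": (True, True, True),
    },
}


def summarise(flags_by_genre):
    """Return significant-genre counts per dataset and the number of
    genres where all three datasets give the same answer."""
    significant = [
        sum(flags[i] for flags in flags_by_genre.values()) for i in range(3)
    ]
    agree = sum(len(set(flags)) == 1 for flags in flags_by_genre.values())
    return significant, agree


for test, flags_by_genre in significance.items():
    significant, agree = summarise(flags_by_genre)
    print(test, "significant (2013, 2018, 3k):", significant,
          "| genres in full agreement:", agree)
```

Tone is significant in 5, 5 and 4 genres and agrees across all three datasets in 7 of 8 genres, against 3, 2 and 1 genres and 5 of 8 for Clout.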
Parts of speech tags (PoS) with the largest difference
The tables list the top 3 PoS that dominate in successful and unsuccessful books.
Penn data
Successful PoS 2013 | Successful PoS 2018 | Successful PoS 3k |
IN – Preposition / Conjunction | IN – Preposition / Conjunction | IN – Preposition / Conjunction |
DET – Determiner | DET – Determiner | DET – Determiner |
NNS – Noun, plural | NNS – Noun, plural | NNS – Noun, plural |
Unsuccessful PoS 2013 | Unsuccessful PoS 2018 | Unsuccessful PoS 3k |
PRP – Determiner, possessive second | PRP – Determiner, possessive second | RB – Adverb |
RB – Adverb | VB – Verb, infinitive | PRP – Determiner, possessive second |
VB – Verb, infinitive | RB – Adverb | VB – Verb, infinitive |
LIWC data
Successful PoS 2013 | Successful PoS 2018 | Successful PoS 3k |
functional – Total function words | functional – Total function words | functional – Total function words |
prep – Prepositions | prep – Prepositions | prep – Prepositions |
article – Articles | space – Space | article – Articles |
Unsuccessful PoS 2013 | Unsuccessful PoS 2018 | Unsuccessful PoS 3k |
quote – Quotation marks | allpunc – All Punctuation* | adj – Common adjectives |
allpunc – All Punctuation* | affect – Affective processes | adverb – Common Adverbs |
affect – Affective processes | posemo – Positive emotion | affect – Affective processes |
The same tags dominate successful books in all three Penn treebank analyses: prepositions (for, of, although, that), determiners (this, each, some) and plural nouns (women, books).
For unsuccessful books determiners also dominate, but in the possessive second form (mine, yours), along with adverbs (often, not, very, here) and infinitive verbs (take, live).
For LIWC it is quite similar. Function words (it, to, no, very), prepositions (to, with, above) and articles (a, an, the) dominate successful books.
For unsuccessful books it is all punctuation, quotation marks, social words (mate, talk, they, plus all family references) and affective processes (happy, cried), which include all emotional terms.
A high rate of quotation marks suggests a high ratio of dialogue to action or description.
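A ranking like the ones in the tables above could be produced by comparing the mean rate of each tag in the two groups and keeping the largest gaps. The sketch below assumes that approach; the per-book rates are invented for illustration and are not the real data.

```python
from statistics import mean

# Hypothetical per-book tag rates (share of all tokens); not the real data.
successful = [
    {"DET": 0.11, "NNS": 0.06, "PRP": 0.03, "RB": 0.04, "VB": 0.03},
    {"DET": 0.10, "NNS": 0.05, "PRP": 0.04, "RB": 0.05, "VB": 0.03},
]
unsuccessful = [
    {"DET": 0.08, "NNS": 0.04, "PRP": 0.07, "RB": 0.07, "VB": 0.05},
    {"DET": 0.09, "NNS": 0.04, "PRP": 0.06, "RB": 0.06, "VB": 0.04},
]


def mean_rates(books):
    """Average rate of each tag across a group of books."""
    tags = {tag for book in books for tag in book}
    return {tag: mean(book.get(tag, 0.0) for book in books) for tag in tags}


def top_differences(group_a, group_b, n=3):
    """Tags whose mean rate in group_a most exceeds that in group_b."""
    a, b = mean_rates(group_a), mean_rates(group_b)
    diffs = {tag: a.get(tag, 0.0) - b.get(tag, 0.0) for tag in set(a) | set(b)}
    return sorted(diffs, key=diffs.get, reverse=True)[:n]


print("Successful:", top_differences(successful, unsuccessful, n=2))
print("Unsuccessful:", top_differences(unsuccessful, successful, n=3))
```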
What does this tell us?
2013 v 2018 data
Overall there is more similarity than difference between the 2013 and 2018 Penn and readability results. The machine learning performance was also broadly the same, with each year’s accuracy falling within the other’s 95% confidence interval.
The most successful PoS were also largely the same, as were the top 3 unsuccessful ones.
Likewise, the LIWC categories generally matched in significance for the 2013 and 2018 data. The successful PoS were broadly the same, as were the unsuccessful ones.
This suggests that, although the original authors didn’t mention that the data covered only the previous 30 days, their results have largely held up.
The first chapter
Judging a book by just its first 3,000 words was not as accurate as analysing the whole book. The machine learning performance was barely better than a guess.
However, the readability results did match, and the dominant successful PoS were similar to those of the full data in the 2013 and 2018 studies.
Of all the LIWC categories described in part 3, Tone was both the most frequently significant across genres and the most consistent across the different tests.
Summary
The 2018 results generally match the 2013 results, which suggests the original method holds up as a good predictor of success or failure for those books.
The results for the first 3,000 words did not match the 2013 or 2018 data, and their machine learning performance was the weakest, which suggests that this is not an accurate way to predict a book’s success. It may be that there is a ‘sweet spot’ where the first x words correlate closely with the overall rating, but if so it is more than 3,000 words.
Successful books tend to use prepositions, determiners, nouns and function words. Unsuccessful ones skew towards quotation marks, punctuation and positive emotions (which in LIWC are similar to affective processes).
This suggests that unsuccessful books may use shorter sentences (high punctuation rate), more dialogue (high quotation mark rate) and more adverbs, and are more emotional, particularly in positive emotions. Writers are frequently told by writing experts to avoid adverbs wherever possible.
Successful books by contrast tend to focus on the action, describing scenes and situations, hence the dominance of function words, prepositions and articles. This makes them sound rather boring, but suggests that these bread and butter words are necessary to build a good story.
The LIWC data suggests that tone is the most reliable predictor of success. What isn’t answered is whether that is because it predominates in successful or unsuccessful books, and whether it reflects positive or negative emotions. This is something to explore, though the appearance of emotion and affect in the top 3 for unsuccessful books suggests the link is there.
Having punctuation tags had some use, and machine learning performance was better with them. So although punctuation tags can be hard to interpret, they are worth including in any machine analysis, but more work is needed to interpret them.