How did the team behind Success with Style develop their tests, which they claimed were statistically significant?
In part 1 we looked at the original paper and noted the lack of a hypothesis, so I proposed one:
H0: There's no difference in the distribution of the proportion of PoS tags in successful and unsuccessful books, regardless of the book's genre.
HA: There is a difference in the distribution of the proportion of PoS tags in successful and unsuccessful books, and the pattern will depend on a book's genre.
I also added another:
H0: There's no difference in the Flesch-Kincaid readability of successful and unsuccessful books, regardless of the book's genre.
HA: There is a difference in the Flesch-Kincaid readability of successful and unsuccessful books, and the pattern will depend on a book's genre.
Note: since publishing I have updated some tables after noticing errors in the original data; I was caught out by Excel not always reordering all columns when sorting.
Hypotheses and data used
The original team used both the Fog index and the Flesch-Kincaid reading grade level, but to avoid duplicating work I only used Flesch-Kincaid; in my experience it gives more accurate results anyway. The Flesch-Kincaid readability used here is a US school grade level: the lower the value, the easier the book is judged to be to read.
Both the Fog and Flesch readability indices are in my source data if you want to run the analysis yourself – I'll publish all data and code in the final part of this review. I also capped unreliable words-per-sentence data: the average words per sentence was capped at 50 (this applied to only 4 books).
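For reference, the Flesch-Kincaid grade level combines sentence length and word length: grade = 0.39 × (words per sentence) + 11.8 × (syllables per word) − 15.59. Below is a minimal R sketch of that calculation with the 50-word cap applied; the function and example inputs are illustrative, not taken from my actual scripts.

```r
# Minimal sketch of the Flesch-Kincaid grade level used in this post.
# `words`, `sentences` and `syllables` are assumed to be per-book totals.
fk_grade <- function(words, sentences, syllables) {
  words_per_sentence <- words / sentences
  # Cap unreliable values at 50 words per sentence, as described above
  words_per_sentence <- pmin(words_per_sentence, 50)
  0.39 * words_per_sentence + 11.8 * (syllables / words) - 15.59
}

# Example: a book of 85,000 words in 5,000 sentences with 123,000 syllables
fk_grade(words = 85000, sentences = 5000, syllables = 123000)
```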
The original team gathered a range of books and classified them by genre, and as successes or failures based on the number of downloads in the 60 days before they collected the data. We'll use the same data.
They had an equal number of books per genre and equal totals of successes and failures (758 distinct books, 42 of which appear in more than one genre, giving 800 entries in total: 400 failures and 400 successes).
Statistical tests
For these tests I'm greatly indebted to the users of Stack Overflow and to Ahmed Kamel. While the original ideas were mine, it was Ahmed who turned them into a working R script, and the analysis relies heavily on his work. I'd highly recommend him if you want help with your own statistical tests.
Statistical analysis was performed using RStudio v 1.1.149; a more detailed methodology is at the end of this page. Significance was set at p ≤ 0.05.
Difference in success
The R code reproduced the original figures, and I've displayed the resulting tables and graphs where appropriate.
Tag difference per genre
Overall biggest difference
The data is side-by-side here, with the first two columns being the successful books and the last two the unsuccessful ones.
PoS (Successful books) | Difference | PoS (Unsuccessful books) | Difference |
---|---|---|---|
INN – Preposition / Conjunction | 0.005560 | PRP – Determiner, possessive second | -0.004326 |
DET – Determiner | 0.003114 | RB – Adverb | -0.003033 |
NNS – Noun, plural | 0.002730 | VB – Verb, infinitive | -0.002690 |
NN – Noun | 0.001540 | VBD – Verb, past tense | -0.002665 |
CC – Conjunction, coordinating | 0.001399 | VBP – Verb, base present form | -0.001630 |
CD – Adjective, cardinal number | 0.001309 | MD – Verb, modal | -0.001306 |
WDT – Determiner, question | 0.001050 | FW – Foreign words | -0.001169 |
WP – Pronoun, question | 0.000558 | POS – Possessive | -0.000890 |
VBN – Verb, past/passive participle | 0.000525 | VBZ – Verb, present 3SG -s form | -0.000392 |
PRPS – Determiner, possessive | 0.000444 | WRB – Adverb, question | -0.000385 |
VBG – Verb, gerund | 0.000259 | UH – Interjection | -0.000205 |
SYM – Symbol | 0.000197 | NNP – Noun, proper | -0.000181 |
JJS – Adjective, superlative | 0.000170 | TO – Preposition | -0.000107 |
JJ – Adjective | 0.000083 | EX – Pronoun, existential there | -0.000063 |
WPS – Determiner, possessive & question | 0.000045 | ||
JJR – Adjective, comparative | 0.000041 | ||
RBR – Adverb, comparative | 0.000013 | ||
RBS – Adverb, superlative | 0.000003 | ||
LS – Symbol, list item | 0.000002 | ||
A positive value means that the mean PoS proportion is higher in the more successful books, while a negative value means its proportion is higher in the less successful books.
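For those curious how a table like this could be produced, a minimal R sketch follows. It assumes a data frame `books` with one row per book, a `status` column of "SUCCESS"/"FAILURE", and one column per PoS tag holding that tag's proportion; these names are my assumptions, not necessarily the ones in the original scripts.

```r
# Hypothetical sketch: difference in mean PoS proportion between
# successful and unsuccessful books, one value per tag.
tag_cols <- setdiff(names(books), c("title", "genre", "status"))

mean_diff <- sapply(tag_cols, function(tag) {
  mean(books[books$status == "SUCCESS", tag]) -
    mean(books[books$status == "FAILURE", tag])
})

# Positive values: the tag's mean proportion is higher in successful books
sort(mean_diff, decreasing = TRUE)
```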
Unpaired t-tests
For those not familiar with significance testing, the p-value is used to determine whether a result is significant rather than simply down to chance. Statisticians may point out that probability is chance, but for a basic overview you can find out more here.
PoS | P-value | adjusted P-value |
---|---|---|
CD – Adjective, cardinal number | 0 | 0 |
DET – Determiner | 0 | 0 |
INN – Preposition / Conjunction | 0 | 0 |
JJS – Adjective, superlative | 0.012 | 0.039 |
MD – Verb, modal | 0.004 | 0.015 |
POS – Possessive | 0 | 0 |
PRPS – Determiner, possessive | 0.022 | 0.057 |
VB – Verb, infinitive | 0.018 | 0.052 |
WDT – Determiner, question | 0 | 0 |
WP – Pronoun, question | 0.033 | 0.078 |
WRB – Adverb, question | 0.001 | 0.004 |
12 out of the 41 transformed PoS were significantly different between successful and unsuccessful books. This means we can reject the null hypothesis (hypothesis 1), since the proportion of more than one PoS was significantly different between more and less successful books.
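A sketch of how these p-values could be generated, reusing the hypothetical `books` data frame and `tag_cols` from above (in the real analysis the tags are transformed and scaled first, and only the training split is used):

```r
# Unpaired t-test per PoS tag, then adjust for the false discovery rate
p_raw <- sapply(tag_cols, function(tag) {
  t.test(books[books$status == "SUCCESS", tag],
         books[books$status == "FAILURE", tag])$p.value
})

p_adj <- p.adjust(p_raw, method = "fdr")

# Tags whose adjusted p-value clears the 0.05 threshold
names(p_adj)[p_adj <= 0.05]
```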
Difference in Flesch-Kincaid readability, mean words per sentence, and mean syllables per word between successful and unsuccessful books
Measure | Successful | Unsuccessful | P value |
---|---|---|---|
Mean words per sentence | 17.8 | 17 | 0.25 |
Mean syllables per word | 1.45 | 1.43 | 0.005 |
Flesch-Kincaid readability | 8.46 | 7.98 | 0.028 |
Results show that the mean readability was significantly higher in successful books compared with unsuccessful books. The same is true for the mean syllables per word, which was significantly higher in successful books.
The mean words per sentence was not significantly different between more and less successful books.
Looking further at readability by genre
genre | FAILURE mean | FAILURE SD | SUCCESS mean | SUCCESS SD | P value | Significant? |
---|---|---|---|---|---|---|
Adventure | 7.54 | 1.83 | 9.76 | 3.86 | 0.0002 | TRUE |
Detective/mystery | 6.82 | 1.40 | 7.56 | 2.03 | 0.0116 | TRUE |
Fiction | 7.92 | 2.27 | 8.07 | 1.87 | 0.3852 | FALSE |
Historical-fiction | 8.55 | 1.83 | 9.40 | 3.00 | 0.1247 | FALSE |
Love-story | 7.61 | 1.57 | 8.83 | 3.32 | 0.0360 | TRUE |
Poetry | 11.27 | 10.24 | 9.71 | 2.66 | 0.8450 | FALSE |
Sci-fi | 6.33 | 1.52 | 6.43 | 1.38 | 0.5896 | FALSE |
Short-stories | 8.99 | 2.74 | 7.90 | 2.02 | 0.0614 | FALSE |
Results show that there is a statistically significant difference in the mean readability between successful and unsuccessful books for the following genres: adventure, detective/mystery and love stories. The mean readability was significantly higher (ie, harder to read) for more successful books in those genres.
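The per-genre comparison can be sketched in the same way; `flesch_kincaid` is an assumed column name, and (as noted in the methodology) the whole data set rather than just the training split is used here:

```r
# One unpaired t-test per genre on the Flesch-Kincaid grade level
genre_p <- sapply(split(books, books$genre), function(g) {
  t.test(flesch_kincaid ~ status, data = g)$p.value
})

data.frame(genre       = names(genre_p),
           p_value     = round(genre_p, 4),
           significant = genre_p <= 0.05)
```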
Most important variables
Variable | Overall relative importance |
---|---|
JJ – Adjective | 100.000 |
UH – Interjection | 86.810 |
PRPS – Determiner, possessive | 69.049 |
TO – Preposition | 67.866 |
INN – Preposition / Conjunction | 67.570 |
WP – Pronoun, question | 64.431 |
MD – Verb, modal | 60.935 |
RBS – Adverb, superlative | 59.996 |
WDT – Determiner, question | 59.635 |
PRP – Determiner, possessive second | 55.813 |
CD – Adjective, cardinal number | 54.306 |
NN – Noun | 48.380 |
EX – Pronoun, existential there | 42.474 |
SYM – Symbol | 40.823 |
Mean syllables per word | 36.230 |
JJS – Adjective, superlative | 35.699 |
NNP – Noun, proper | 35.674 |
CC – Conjunction, coordinating | 33.137 |
VBP – Verb, base present form | 32.791 |
VBG – Verb, gerund | 29.862 |
VBN – Verb, past/passive participle | 29.826 |
POS – Possessive | 28.903 |
WRB – Adverb, question | 18.980 |
Flesch-Kincaid readability | 18.371 |
VB – Verb, infinitive | 14.735 |
NNS – Noun, plural | 13.609 |
FW – Foreign words | 13.562 |
DET – Determiner | 3.757 |
LS – Symbol, list item | 1.202 |
This shows that the most important tag in determining success or failure is the adjective (JJ). Note that the importance score does not say whether adjectives push a book towards success or towards failure, only that the tag matters to the prediction.
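I haven't confirmed exactly how this table was built, but caret's `varImp()` produces a 0–100 scaled importance table of this shape from a fitted model; a sketch, assuming `svm_fit` is the model trained in the machine-learning step described in the methodology below:

```r
library(caret)

# Relative importance of each predictor, rescaled so the top variable is 100
importance <- varImp(svm_fit, scale = TRUE)
importance

# Quick visual check of the top 20 variables
plot(importance, top = 20)
```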
Machine learning performance
Accuracy | 95% CI | Sensitivity | Specificity |
---|---|---|---|
65.62% | 57.7-72.9% | 69% | 63% |
The overall accuracy is 65.6%. Sensitivity is the true positive rate and specificity is the true negative rate (ie, after allowing for false positives and false negatives). Note that for all other tests I ignored punctuation tags, but I included them for machine learning as this improved performance; I left them out elsewhere because knowing that, say, the right-hand bracket was important did not seem to tell me anything.
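These figures (accuracy with a 95% CI, sensitivity and specificity) are what caret's `confusionMatrix()` reports when the trained model is run against the held-out test set; a sketch, assuming the `svm_fit` and `test_data` objects from the methodology sketches below:

```r
library(caret)

# Predict success/failure for the 20% test set and summarise performance
preds <- predict(svm_fit, newdata = test_data)
confusionMatrix(preds, test_data$status, positive = "SUCCESS")
```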
Conclusion
The mean of 12 PoS tags was significantly different between more successful and less successful books. We also saw the PoS pattern was largely dependent on the genre of the book.
This means we can reject the null hypothesis and say that there is a difference in the distribution of the proportion of PoS tags in successful and unsuccessful books, and the pattern will depend on a book’s genre.
Not only that, but the Flesch-Kincaid readability and mean syllables per word were significantly different between more and less successful books. This was most evident in adventure, detective/mystery and love stories, where the mean readability was significantly higher (ie, harder to read) in more successful books.
This means we can say that there is a difference in the Flesch-Kincaid readability of successful and unsuccessful books, and the pattern will depend on a book’s genre.
Overall, the Flesch-Kincaid readability, mean syllables per word and PoS proportions can be used to predict the status of a book with an accuracy reaching 65.6%. This is comparable to the original experiment, which reported an overall accuracy of 64.5%.
But what happens when we try it with a different PoS tool that analyses text in a different way? Next time I’ll use LIWC data.
Method
I used R to perform the analysis. When running:
- statistical analysis was performed using RStudio v 1.1.453.
- the data set was split into a training (80%) and a test data set (20%). Analysis was performed on the training data set except when comparing readability across genres where the whole data was used due to the small sample size in each genre.
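A minimal sketch of that split, assuming a `books` data frame with a `status` factor; `createDataPartition()` keeps the success/failure balance in both splits, and the seed is purely illustrative:

```r
library(caret)
set.seed(42)  # illustrative seed, not necessarily the one actually used

# Stratified 80/20 split on success/failure
train_idx  <- createDataPartition(books$status, p = 0.8, list = FALSE)
train_data <- books[train_idx, ]
test_data  <- books[-train_idx, ]
```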
The average difference in various parts of speech (PoS, the linguistic tags assigned to words) was calculated between successful and unsuccessful books. I used what I think were the original methods used by the team to calculate these differences.
Detailed methodology
I laid out the broad outline above; in a research paper this would normally come first, but it's not the most engaging part. For those who are interested, this is the statistical nitty-gritty, and it is reused in the other experiments.
Univariate statistical analysis
Variables were inspected for normality. Appropriate transformations (such as log, Box-Cox and Yeo-Johnson) were applied so that variables approximated a normal distribution. This was followed by a series of unpaired t-tests to assess whether the mean proportion of each PoS differed significantly between successful and unsuccessful books.
P-values were adjusted for false discovery rate to avoid the inflation of type I error (a ‘false positive’ error). Analysis was performed only using the training data set. Variables were scaled before performing the tests.
Machine learning algorithm
A support vector machine (SVM) was used to predict the status of each book from the variables deemed important in the initial univariate analysis. A LiblineaR SVM with L2 regularisation, tuned over the training data, was used.
The model was tuned using 5-fold cross-validation. The final predictive power of the model was assessed using the 20% test data set. Performance was assessed using accuracy, sensitivity and specificity.
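In caret, one way to express this setup is the `svmLinear3` method, which wraps LiblineaR's L2-regularised linear SVM; the sketch below assumes `train_data` contains only the status column plus the selected predictors, and is not a copy of the actual script:

```r
library(caret)

# 5-fold cross-validation over the training data
ctrl <- trainControl(method = "cv", number = 5)

# LiblineaR L2-regularised linear SVM, with predictors centred and scaled
svm_fit <- train(status ~ ., data = train_data,
                 method     = "svmLinear3",
                 preProcess = c("center", "scale"),
                 trControl  = ctrl)

svm_fit$bestTune  # the cost/loss combination chosen by cross-validation
```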
Variables with lots of zeroes
Ten variables had a lot of zeros and were heavily skewed. Thus, they were not transformed since none of the transformation algorithms fixed such a distribution. The remaining PoS did not contain such a large number of zeros and were transformed prior to performing the unpaired t-test. The package bestNormalize was used to find the most appropriate transformation.
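A sketch of how `bestNormalize` is typically applied to a single column; `jj` is a hypothetical column name for the adjective proportion:

```r
library(bestNormalize)

# Compare candidate transformations (log, Box-Cox, Yeo-Johnson, ordered
# quantile, ...) and keep the one whose result looks most normal
bn <- bestNormalize(books$jj)
bn$chosen_transform        # which transformation was selected

books$jj_t <- predict(bn)  # the transformed values used for the t-tests
```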
Three PoS were removed from the analysis (nnps, pdt and rp) since none of the novels included any of these PoS.
You can see the remaining variables and their transformation if you are keen.