“Success with Style” part 2: recreating the original experiment


How did the team behind Success with Style develop their tests, which they claimed were statistically significant?

In part 1 we looked at the original paper and noted the lack of a stated hypothesis, so I proposed one:

H0: There's no difference in the distribution of the proportion of PoS tags in successful and unsuccessful books, regardless of the book's genre.
HA: There is a difference in the distribution of the proportion of PoS tags in successful and unsuccessful books, and the pattern will depend on a book's genre.

I also added another:

H0: There's no difference in the Flesch-Kincaid readability of successful and unsuccessful books, regardless of the book's genre.
HA: There is a difference in the Flesch-Kincaid readability of successful and unsuccessful books, and the pattern will depend on a book's genre.

Note: since publishing, I have updated some tables after noticing errors in the original data; I was caught out by Excel not always reordering all columns when sorting.

Hypotheses and data used

The original team used both the Gunning Fog index and the Flesch-Kincaid reading grade level, but to save duplicating work I only used Flesch-Kincaid; in my experience it gives more accurate results. The Fog rating is still in my source data if you want to run the analysis yourself — I'll publish all data and code in the final part of this review. The Flesch-Kincaid readability used here is the US school grade level: the lower the value, the easier the text is judged to read.

I also capped unreliable words-per-sentence data: the average words per sentence was capped at 50 (this applied to only 4 books).
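For reference, the Flesch-Kincaid grade level can be computed from word, sentence and syllable counts with the standard published formula. A minimal Python sketch (the function name and the pre-counted inputs are my own; it also applies the 50 words-per-sentence cap described above):

```python
def flesch_kincaid_grade(words, sentences, syllables, max_wps=50):
    """Flesch-Kincaid US school grade level; lower = easier to read."""
    words_per_sentence = min(words / sentences, max_wps)  # cap unreliable values at 50
    syllables_per_word = syllables / words
    return 0.39 * words_per_sentence + 11.8 * syllables_per_word - 15.59
```

For example, a book averaging 17 words per sentence and 1.45 syllables per word comes out at roughly grade 8.2, in line with the means reported below.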

The original team gathered a range of books and classified them by genre and by success/failure based on the number of downloads in the 60 days before collection. We'll use the same data.

They had an equal number of books per genre and equal totals of failures and successes: 758 distinct books, 42 of which appear in more than one genre, giving 800 entries in total, 400 failures and 400 successes.

Statistical tests

For these tests I'm greatly indebted to the users of Stack Overflow and to Ahmed Kamel. While I had the original ideas, it was he who turned them into a working R script, and the analysis relies heavily on his work. I'd highly recommend Ahmed if you want help with your own statistical tests.

Statistical analysis was performed using RStudio v 1.1.149. I've put a more detailed methodology at the end of this page. Significance is defined as p ≤ 0.05.

Difference in success

The R code managed to reproduce the original paper's figures, and I've displayed the resulting tables and graphs as appropriate.

Tag difference per genre

[Graph: difference in proportion (all tags), cc-ls]

[Graph: difference in proportion (all tags), md-rbs]

[Graph: difference in proportion, sym-wdt]

[Graph: difference in proportion, wp-lrb]

Overall biggest difference

The data is side-by-side here: the first two columns are the successful books and the last two the unsuccessful ones.

PoS (Successful books) Difference PoS (Unsuccessful books) Difference
INN – Preposition / Conjunction 0.005560 PRP – Determiner, possessive second -0.004326
DET – Determiner 0.003114 RB – Adverb -0.003033
NNS – Noun, plural 0.002730 VB – Verb, infinitive -0.002690
NN – Noun 0.001540 VBD – Verb, past tense -0.002665
CC – Conjunction, coordinating 0.001399 VBP – Verb, base present form -0.001630
CD – Adjective, cardinal number 0.001309 MD – Verb, modal -0.001306
WDT – Determiner, question 0.001050 FW – Foreign words -0.001169
WP – Pronoun, question 0.000558 POS – Possessive -0.000890
VBN – Verb, past/passive participle 0.000525 VBZ – Verb, present 3SG -s form -0.000392
PRPS – Determiner, possessive 0.000444 WRB – Adverb, question -0.000385
VBG – Verb, gerund 0.000259 UH – Interjection -0.000205
SYM – Symbol 0.000197 NNP – Noun, proper -0.000181
JJS – Adjective, superlative 0.000170 TO – Preposition -0.000107
JJ – Adjective 0.000083 EX – Pronoun, existential there -0.000063
WPS – Determiner, possessive & question 0.000045
JJR – Adjective, comparative 0.000041
RBR – Adverb, comparative 0.000013
RBS – Adverb, superlative 0.000003
LS – Symbol, list item 0.000002


A positive value means that the mean PoS proportion is higher in more successful books, while a negative value means its proportion is higher in less successful books.
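The "Difference" columns above are simply the mean tag proportion among successful books minus the mean among unsuccessful ones. A minimal sketch of that calculation (the record layout is illustrative, not the actual data format):

```python
from statistics import mean

def mean_proportion_difference(books, tag):
    """Mean PoS proportion in successful books minus the mean in unsuccessful ones.
    Positive => the tag is relatively more common in successful books."""
    succ = [b["proportions"][tag] for b in books if b["successful"]]
    fail = [b["proportions"][tag] for b in books if not b["successful"]]
    return mean(succ) - mean(fail)
```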

Unpaired t-tests

For those not aware of significance testing, the p-value is used to determine whether a result is significant or could plausibly have arisen by chance. Statisticians may point out that probability is chance, but for a basic overview you can find out more here.
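An unpaired t-test compares two group means scaled by their standard error. This sketch computes the t statistic of the unequal-variances (Welch) form in plain Python; the actual analysis used R, and converting t to a p-value needs the t distribution, which is omitted here:

```python
from math import sqrt
from statistics import mean, variance  # variance() is the sample variance

def welch_t(sample_a, sample_b):
    """t statistic for an unpaired t-test with unequal variances (Welch)."""
    na, nb = len(sample_a), len(sample_b)
    se2 = variance(sample_a) / na + variance(sample_b) / nb  # squared standard error
    return (mean(sample_a) - mean(sample_b)) / sqrt(se2)
```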

PoS P-value adjusted P-value
CD – Adjective, cardinal number 0 0
DET – Determiner 0 0
INN – Preposition / Conjunction 0 0
JJS – Adjective, superlative 0.012 0.039
MD – Verb, modal 0.004 0.015
POS – Possessive 0 0
PRPS – Determiner, possessive 0.022 0.057
VB – Verb, infinitive 0.018 0.052
WDT – Determiner, question 0 0
WP – Pronoun, question 0.033 0.078
WRB – Adverb, question 0.001 0.004

12 of the 41 transformed PoS tags were significantly different between successful and unsuccessful books. This means we can reject the null hypothesis of hypothesis 1, since the proportion of more than one PoS differed significantly between more and less successful books.

Difference in Flesch-Kincaid readability, mean words per sentence, and mean syllables per word between successful and unsuccessful books

Measure Successful Unsuccessful P value
Mean words per sentence 17.8 17 0.25
Mean syllables per word 1.45 1.43 0.005
Flesch-Kincaid readability 8.46 7.98 0.028

Results show that the mean readability was significantly higher in successful books compared to unsuccessful books (ie, successful books were judged harder to read). The same is true for the mean syllables per word, which was significantly higher in successful books.

The mean words per sentence was not significantly different between more and less successful books.

Looking further at readability by genre

genre FAILURE mean FAILURE SD SUCCESS mean SUCCESS SD P value Significant?
Adventure 7.54 1.83 9.76 3.86 0.0002 TRUE
Detective/mystery 6.82 1.40 7.56 2.03 0.0116 TRUE
Fiction 7.92 2.27 8.07 1.87 0.3852 FALSE
Historical-fiction 8.55 1.83 9.40 3.00 0.1247 FALSE
Love-story 7.61 1.57 8.83 3.32 0.0360 TRUE
Poetry 11.27 10.24 9.71 2.66 0.8450 FALSE
Sci-fi 6.33 1.52 6.43 1.38 0.5896 FALSE
Short-stories 8.99 2.74 7.90 2.02 0.0614 FALSE

Results show that there is a statistically significant difference in the mean readability between successful and unsuccessful books for the following genres: adventure, detective/mystery and love stories. The mean readability was significantly higher (ie, harder to read) for more successful books in those genres.

Most important variables

Definition Overall relative importance
JJ – Adjective 100.000
UH – Interjection 86.810
PRPS – Determiner, possessive 69.049
TO – Preposition 67.866
INN – Preposition / Conjunction 67.570
WP – Pronoun, question 64.431
MD – Verb, modal 60.935
RBS – Adverb, superlative 59.996
WDT – Determiner, question 59.635
PRP – Determiner, possessive second 55.813
CD – Adjective, cardinal number 54.306
NN – Noun 48.380
EX – Pronoun, existential there 42.474
SYM – Symbol 40.823
Mean syllables per word 36.230
JJS – Adjective, superlative 35.699
NNP – Noun, proper 35.674
CC – Conjunction, coordinating 33.137
VBP – Verb, base present form 32.791
VBG – Verb, gerund 29.862
VBN – Verb, past/passive participle 29.826
POS – Possessive 28.903
WRB – Adverb, question 18.980
Flesch-Kincaid readability 18.371
VB – Verb, infinitive 14.735
NNS – Noun, plural 13.609
FW – Foreign words 13.562
DET – Determiner 3.757
LS – Symbol, list item 1.202


This shows that the most important tag in determining success or failure is adjectives. However, the importance measure does not say whether adjectives drive success or failure, only that they are the most informative tag.

Machine learning performance

Accuracy 95% CI Sensitivity Specificity
65.62% 57.7-72.9% 69% 63%

Overall accuracy is 65.6%. Sensitivity is the true positive rate and specificity the true negative rate (ie, after allowing for false negatives and false positives respectively). Note that for all other tests I ignored punctuation tags but included them for machine learning, as doing so improved performance; I left them out elsewhere because knowing that, say, a right-hand bracket was important did not seem to tell me anything.
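All three figures in the table come straight from the confusion matrix on the test set. A minimal sketch (the counts in the test below are made up to illustrate the arithmetic, not the actual test-set counts):

```python
def classification_metrics(tp, fp, tn, fn):
    """Accuracy, sensitivity (true positive rate) and specificity (true negative rate)
    from the four confusion-matrix cells."""
    return {
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
    }
```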

Conclusion

The mean of 12 PoS tags was significantly different between more successful and less successful books. We also saw the PoS pattern was largely dependent on the genre of the book.

This means we can reject the null hypothesis and say that there is a difference in the distribution of the proportion of PoS tags in successful and unsuccessful books, and the pattern will depend on a book’s genre.

Not only that, but the Flesch-Kincaid readability and mean syllables per word were significantly different between more and less successful books. This was most evident in adventure, detective/mystery and love stories, where the mean readability was significantly higher (ie, harder to read) in more successful books.

This means we can say that there is a difference in the Flesch-Kincaid readability of successful and unsuccessful books, and the pattern will depend on a book’s genre.

Overall, the Flesch-Kincaid readability, mean words per sentence and PoS proportions can be used to predict the status of a book with an accuracy reaching 65.6%. This is comparable to the original experiment, which reported an overall accuracy of 64.5%.

But what happens when we try it with a different PoS tool that analyses text in a different way? Next time I’ll use LIWC data.

Method

I used R to perform the analysis. When running:

  • statistical analysis was performed using RStudio v 1.1.453.
  • the data set was split into a training (80%) and a test data set (20%). Analysis was performed on the training data set except when comparing readability across genres where the whole data was used due to the small sample size in each genre.
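The 80/20 split can be sketched as a simple shuffle-and-slice (the seed here is illustrative; the actual analysis used R's own sampling):

```python
import random

def train_test_split(items, test_fraction=0.2, seed=42):
    """Shuffle a copy of the data, then slice off the test fraction."""
    shuffled = items[:]
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]
```

With the 800 entries used here, this yields 640 training and 160 test entries.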

The average difference in various parts of speech (PoS, the linguistic tags assigned to words) was calculated between successful and unsuccessful books. I used what I think were the original methods used by the team to calculate these differences.

Detailed methodology

I laid out the broad outline above; in a research paper the methodology normally comes first, but it's not the most engaging part. For those of you who are interested, this is the statistical nitty-gritty, and it is used in the other experiments too.

Univariate statistical analysis

Variables were inspected for normality. Appropriate transformations, such as log, Box-Cox and Yeo-Johnson, were applied so that variables could assume an approximately normal distribution. This was followed by a series of unpaired t-tests to assess whether the mean proportion of each PoS was significantly different between successful and unsuccessful books.

P-values were adjusted for false discovery rate to avoid the inflation of type I error (a ‘false positive’ error). Analysis was performed only using the training data set. Variables were scaled before performing the tests.
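The usual false-discovery-rate adjustment, and presumably the one applied here via R's p.adjust with method = "fdr", is the Benjamini-Hochberg procedure. A pure-Python sketch of the same adjustment:

```python
def benjamini_hochberg(p_values):
    """Benjamini-Hochberg adjusted p-values (controls the false discovery rate).
    Each raw p is scaled by n/rank, then a running minimum enforces monotonicity."""
    n = len(p_values)
    order = sorted(range(n), key=lambda i: p_values[i])  # indices, smallest p first
    adjusted = [0.0] * n
    running_min = 1.0
    for rank in range(n, 0, -1):  # walk from the largest p down to the smallest
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * n / rank)
        adjusted[i] = running_min
    return adjusted
```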

Machine learning algorithm

A support vector machine was used to predict the status of a book from the variables deemed important in the initial univariate analysis: a LibLinear SVM with L2 regularisation, tuned over the training data.

The model was tuned using 5-fold cross validation. The final predictive power of the model was assessed using the 20% test data. Performance was assessed using accuracy, sensitivity, specificity.
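Five-fold cross-validation partitions the training data into five folds, training on four and validating on the held-out fifth in turn. A sketch of the fold construction (in practice the indices would be shuffled first, which the R tooling handles automatically):

```python
def k_fold_indices(n, k=5):
    """Split range(n) into k contiguous folds of near-equal size for cross-validation."""
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    folds, start = [], 0
    for size in fold_sizes:
        folds.append(list(range(start, start + size)))
        start += size
    return folds
```

Each fold serves once as the validation set while the remaining folds form the training set for that round.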

Variables with lots of zeroes

Ten variables had a lot of zeros and were heavily skewed. Thus, they were not transformed since none of the transformation algorithms fixed such a distribution. The remaining PoS did not contain such a large number of zeros and were transformed prior to performing the unpaired t-test. The package bestNormalize was used to find the most appropriate transformation.
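A sketch of the kind of zero-inflation check described (the 50% threshold is my own illustrative choice; the post does not state the exact cut-off used):

```python
def zero_fraction(values):
    """Fraction of zero entries in a variable's values."""
    return sum(1 for v in values if v == 0) / len(values)

def is_zero_inflated(values, threshold=0.5):
    """Flag variables to leave untransformed; the threshold is illustrative."""
    return zero_fraction(values) >= threshold
```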

Three PoS were removed from the analysis (nnps, pdt and rp) since none of the novels included any of these PoS.

You can see the remaining variables and their transformation if you are keen.