“Success with Style” part 3: using LIWC data

Last time we replicated the original Success with Style output and methods, even though the methods were not listed, and managed to get the data to broadly match. This time we are going to look at a different way of analysing the same text.

In part 2 we used the Penn Treebank to analyse the text and its parts of speech (PoS). This time we’re using LIWC (Linguistic Inquiry and Word Count), a tool developed at the University of Texas. Like the Penn Treebank, it categorises words, and some of its categories, such as prepositions, are similar.

In part 1 we looked at the original experiment, and in part 2 we recreated it. This time we’ll use the same input data but process it through a different NLP analysis program, LIWC.

Hypotheses

H0: There's no difference in the proportion of LIWC categories in successful and unsuccessful books, regardless of genre
HA: There is a difference in the proportion of LIWC categories in successful and unsuccessful books, and the pattern will depend on genre

H0: There's no difference in the LIWC summary values of successful and unsuccessful books, regardless of the book's genre
HB: There is a difference in the LIWC summary values of successful and unsuccessful books, and the pattern will depend on genre

 

Method

The data, the measure of success and the method were the same as in part 1, as were the p-value adjustment (p<0.05 for significance) and the machine learning algorithm. Likewise, variables with many zeroes were not transformed.
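The post does not say which p-value adjustment was used. As a minimal sketch, assuming a Benjamini-Hochberg correction (an assumption, not the confirmed method), adjusting a set of raw p-values and applying the p<0.05 threshold could look like this. The tag names and p-values are made up for illustration.

```python
def benjamini_hochberg(pvals):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(pvals)
    # Indices of the p-values, sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: pvals[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so that adjusted values never
    # decrease as the raw p-values increase.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvals[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Made-up raw p-values for four hypothetical tags (not the study's data).
raw = {"tag_a": 0.01, "tag_b": 0.04, "tag_c": 0.03, "tag_d": 0.5}
adjusted = dict(zip(raw, benjamini_hochberg(list(raw.values()))))

# Tags that stay significant at p<0.05 after adjustment.
significant = [tag for tag, p in adjusted.items() if p < 0.05]
```

Note that a tag with a raw p-value below 0.05 can stop being significant once the adjustment is applied, which is why the adjusted values are the ones reported.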

Difference in success

The R code produced a different set of tags from the original analysis. You can find the LIWC definitions at the foot of this page.

Tags per genre

LIWC Difference in proportion function-article – original data

Overall biggest difference

PoS (successful books) Definition Diff (largest difference first)
functional Total function words 0.003835
prep Prepositions 0.001758
article Articles 0.001199
ipron Impersonal pronouns 0.001198
space Space 0.001155
relativ Relativity 0.000860
number Numbers 0.000623
focuspast Past focus 0.000463
power Power 0.000454
cogproc Cognitive processes 0.000437
period Periods/fullstop 0.000403
comma Commas 0.000379
differ Differentiation 0.000369
otherp Other punctuation 0.000318
parenth Parentheses (pairs) 0.000266
conj Conjunctions 0.000266
quant Quantifiers 0.000257
semic Semicolons 0.000254
interrog Interrogatives 0.000233
colon Colons 0.000225
work Work 0.000197
drives Drives 0.000163
pronoun Total pronouns 0.000154
cause Causation 0.000136
anger Anger 0.000131
we 1st pers plural 0.000130
certain Certainty 0.000125
compare Comparisons 0.000125
they 3rd pers plural 0.000122
death Death 0.000101
tentat Tentative 0.000078
ingest Ingestion 0.000060
home Home 0.000055
achieve Achievement 0.000038
money Money 0.000016
health Health 0.000011
adverb Common adverbs 0.000011
leisure Leisure 0.000003
swear Swear words 0.000002

PoS (unsuccessful books) Definition Diff (largest difference first)
quote Quotation marks -0.001814
allpunc All Punctuation* -0.001350
affect Affective processes -0.001231
social Social processes -0.001181
posemo Positive emotion -0.001103
ppron Personal pronouns -0.001047
apostro Apostrophes -0.000999
female Female references -0.000963
focuspresent Present focus -0.000929
shehe 3rd pers singular -0.000905
verb Common verbs -0.000642
informal Informal language -0.000361
exclam Exclamation marks -0.000323
time Time -0.000319
you 2nd person -0.000273
percept Perceptual processes -0.000236
affiliation Affiliation -0.000216
focusfuture Future focus -0.000213
sad Sadness -0.000202
adj Common adjectives -0.000190
family Family -0.000190
nonflu Nonfluencies -0.000156
netspeak Netspeak -0.000154
discrep Discrepancy -0.000140
see See -0.000133
bio Biological processes -0.000130
i 1st pers singular -0.000121
negemo Negative emotion -0.000111
body Body -0.000104
reward Reward -0.000098
friend Friends -0.000088
risk Risk -0.000080
negate Negations -0.000073
auxverb Auxiliary verbs -0.000070
motion Motion -0.000069
insight Insight -0.000067
hear Hear -0.000056
feel Feel -0.000049
assent Assent -0.000046
male Male references -0.000045
qmark Question marks -0.000035
sexual Sexual -0.000028
anx Anxiety -0.000025
dash Dashes -0.000025
relig Religion -0.000010
filler Fillers -0.000008

A positive (negative) value means that the mean PoS proportion is higher in the more (less) successful books
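As an illustration of how a difference like the ones above can be computed, here is a minimal Python sketch (the original analysis was done in R, and its code is not shown here). The function name `mean_proportion_diff` and the per-book proportions are hypothetical.

```python
from statistics import mean


def mean_proportion_diff(successful, unsuccessful):
    """Return {tag: mean(successful) - mean(unsuccessful)}, largest first."""
    diffs = {}
    for tag in successful[0]:
        more = mean(book[tag] for book in successful)
        less = mean(book[tag] for book in unsuccessful)
        diffs[tag] = more - less
    # Sort so that the tags most over-represented in successful books come first.
    return dict(sorted(diffs.items(), key=lambda kv: kv[1], reverse=True))


# Made-up per-book proportions for two tags (not the study's data).
successful = [{"prep": 0.14, "quote": 0.020}, {"prep": 0.15, "quote": 0.022}]
unsuccessful = [{"prep": 0.13, "quote": 0.025}, {"prep": 0.14, "quote": 0.027}]

diffs = mean_proportion_diff(successful, unsuccessful)
```

In this toy input `prep` ends up with a positive difference (higher in the successful books) and `quote` with a negative one, mirroring the sign convention of the tables.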

Unpaired t-tests

Showing results of PoS tags that have significant adjusted p-values. An adjusted p-value shown as 0 is one that rounds to zero at three decimal places.

PoS Definition adjusted P-value
analytic Analytical thinking 0.017
tone Emotional tone 0
mWoSen Mean Words per Sentence 0
sixletter Six letter words 0
ppron Personal pronouns 0.005
ipron Impersonal pronouns 0
article Articles 0.005
prep Prepositions 0
adj Common adjectives 0.005
number Numbers 0
affect Affective processes 0
posemo Positive emotion 0
negemo Negative emotion 0.045
sad Sadness 0.009
social Social processes 0.044
family Family 0.041
friend Friends 0
female Female references 0.026
feel Feel 0.041
bio Biological processes 0.044
affiliation Affiliation 0.017
power Power 0.017
risk Risk 0.017
focuspresent Present focus 0.02
focusfuture Future focus 0
space Space 0.009
time Time 0
informal Informal language 0
nonflu Nonfluencies 0
colon Colons 0.028
exclam Exclamation marks 0
quote Quotation marks 0.005
apostro Apostrophes 0.017

33 out of 93 tags (including punctuation) of the transformed PoS were significantly different between successful and unsuccessful books. This means that we can reject the null hypothesis (hypothesis 1), since the proportion of more than one PoS was significantly different between more and less successful books.
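For readers who want to see what an unpaired t-test on one tag involves, the following is a rough sketch in Python. It assumes Welch's version of the test (the post does not say which variant was used) and, to stay within the standard library, approximates the two-sided p-value with a normal distribution, which is only reasonable for large samples. The sample values are made up.

```python
from math import sqrt
from statistics import NormalDist, mean, variance


def welch_t(a, b):
    """Welch's unpaired t statistic and an approximate two-sided p-value."""
    # Standard error of the difference in means, allowing unequal variances.
    se = sqrt(variance(a) / len(a) + variance(b) / len(b))
    t = (mean(a) - mean(b)) / se
    # Normal approximation to the t distribution (large-sample assumption).
    p = 2 * (1 - NormalDist().cdf(abs(t)))
    return t, p


# Made-up proportions of one tag in two groups of books (not the study's data).
more_successful = [0.14, 0.15, 0.16, 0.15]
less_successful = [0.12, 0.13, 0.12, 0.13]

t_stat, p_value = welch_t(more_successful, less_successful)
```

A statistics package would use the exact t distribution instead of the normal approximation, and the resulting p-values would then be adjusted for multiple comparisons before applying the p<0.05 threshold.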

Difference in LIWC summary variables

LIWC also reports its own summary variables. Some of them are proprietary, so exactly how they are calculated is not clear, but they rely on the PoS tags. For example, ‘tone’ combines the positive and negative emotion tags into one overall measure of emotion. Like the tags, the summary variables are expressed as a proportion of the text (ie 0.85 means 85% of the text), apart from mean words per sentence.
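To make the idea of a category proportion concrete, here is a minimal Python sketch. It uses a toy dictionary built only from the example words in the definitions table at the foot of this page; the real LIWC dictionary is much larger and is not reproduced here, and the function name `category_proportions` is hypothetical.

```python
import re

# Toy dictionary using example words from the LIWC definitions table.
TOY_CATEGORIES = {
    "article": {"a", "an", "the"},
    "prep": {"to", "with", "above"},
}


def category_proportions(text, categories=TOY_CATEGORIES):
    """Return the share of words in `text` that fall into each category."""
    words = re.findall(r"[a-z']+", text.lower())
    total = len(words)
    return {
        name: (sum(w in vocab for w in words) / total if total else 0.0)
        for name, vocab in categories.items()
    }


props = category_proportions("The cat went to the market with a friend")
```

The sentence has nine words, three of which are articles and two of which are prepositions in the toy dictionary, so the proportions come out at roughly 0.33 and 0.22.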

Variables Definition
Analytical thinking (Analytic) People low in analytical thinking tend to write and think in more narrative ways, focusing on the here-and-now and personal experiences. Those high in analytical thinking perform better in college and have higher college board scores.
Clout Clout refers to the relative social status, confidence, or leadership that people display through their writing or talking. The algorithm was developed based on the results from a series of studies where people were interacting with one another.
Authenticity When people reveal themselves in an authentic or honest way, they are more personal, humble, and vulnerable.
Emotional tone (Tone) Although LIWC2015 includes both positive emotion and negative emotion dimensions, the Tone variable puts the two dimensions into a single summary variable. Numbers below 50 suggest a more negative emotional tone.

Measure Successful Unsuccessful P value Significant (p<0.05)?
Six letter words 0.1633 0.1552 0.0004 TRUE
Mean words per sentence 18.3832 17.0184 0.0007 TRUE
Dictionary words 0.8388 0.8410 0.6000 FALSE
Authentic 0.2240 0.2181 0.3900 FALSE
Analytic 0.7240 0.6939 0.0032 TRUE
Clout 0.7417 0.7499 0.3800 FALSE
Tone 0.3892 0.4486 0.0010 TRUE

Results show that the mean words per sentence were significantly higher in successful books (18.38 against 17.02) and comparable to the figures in the original test. Likewise, the proportion of words of six letters or more is significantly higher in successful books, as is the Analytic score. Tone, however, is lower in successful books (ie they use fewer emotional words, either positive or negative).

Looking further at these categories by genre:

Difference in analytical words (scaled and normalized) between more and less successful books

Difference in authenticity (scaled and normalized) between more and less successful books

Difference in clout (scaled and normalized) between more and less successful books

Difference in Dictionary Words (scaled and normalized) between more and less successful books

Difference in mean words per sentence (scaled and normalized) between more and less successful books

Difference in proportion of 6 letter words (scaled and normalized) between more and less successful books

Difference in tone (scaled and normalized) between more and less successful books

Most important variables

PoS Definition Overall relative importance
ipron Impersonal pronouns 100.00
quote Quotation marks 86.40
otherp Other punctuation 69.99
posemo Positive emotion 68.88
time Time 67.30
space Space 64.90
parenth Parentheses (pairs) 58.40
you 2nd person 56.80
adj Common adjectives 46.73
risk Risk 41.25
sixletter Six letter words 40.70
semic Semicolons 38.60
power Power 35.29
netspeak Netspeak 31.52
number Numbers 30.08
swear Swear words 28.03
period Periods/fullstop 27.75
filler Fillers 25.91
certain Certainty 25.69
death Death 25.56
mWoSen Mean words per sentence 25.03
ppron Personal pronouns 22.95
colon Colons 20.12
focuspast Past focus 19.99
body Body 18.78
tone Emotional tone 18.57
leisure Leisure 17.86
focusfuture Future focus 16.08
home Home 14.88
exclam Exclamation marks 13.08
achieve Achievement 11.90
dicWo Dictionary words 11.72
apostro Apostrophes 9.99
work Work 9.22
ingest Ingestion 7.70
health Health 6.83
relig Religion 5.91
qmark Question marks 3.93
interrog Interrogatives 2.72
hear Hear 1.48
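The table above reports importance relative to the top variable, which scores 100. The post does not say how the scaling was done, so the sketch below simply assumes each raw score is divided by the largest one and multiplied by 100. The function name, variable names and raw scores are all hypothetical.

```python
def relative_importance(raw_scores):
    """Scale raw importance scores so the largest becomes 100, highest first."""
    top = max(raw_scores.values())  # assumes at least one positive score
    scaled = {name: 100.0 * score / top for name, score in raw_scores.items()}
    return dict(sorted(scaled.items(), key=lambda kv: kv[1], reverse=True))


# Made-up raw importance scores for three hypothetical variables.
raw_scores = {"var_b": 1.0, "var_a": 2.0, "var_c": 0.5}
ranked = relative_importance(raw_scores)
```

Other scalings exist (for example subtracting the minimum first), so the exact numbers depend on the tool used.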

Machine learning performance

Accuracy 95% CI Sensitivity Specificity
75.00% 67.6%-81.5% 76% 74%
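For reference, accuracy, sensitivity and specificity all come from the classifier's confusion matrix. The post does not report that matrix or how the interval was calculated, so the sketch below uses invented counts, treats "successful" as the positive class, and uses a Wilson score interval as one common choice for the 95% CI. All three are assumptions.

```python
from math import sqrt


def classification_metrics(tp, fn, tn, fp, z=1.96):
    """Accuracy, sensitivity, specificity and a Wilson interval for accuracy."""
    n = tp + fn + tn + fp
    accuracy = (tp + tn) / n
    sensitivity = tp / (tp + fn)  # share of successful books found
    specificity = tn / (tn + fp)  # share of unsuccessful books found
    # Wilson score interval for a binomial proportion (here: the accuracy).
    denom = 1 + z * z / n
    centre = (accuracy + z * z / (2 * n)) / denom
    half = z * sqrt(accuracy * (1 - accuracy) / n + z * z / (4 * n * n)) / denom
    return {
        "accuracy": accuracy,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "ci_low": centre - half,
        "ci_high": centre + half,
    }


# Invented confusion-matrix counts (not the study's results).
metrics = classification_metrics(tp=40, fn=10, tn=35, fp=15)
```

With these invented counts the accuracy is 75%, sensitivity 80% and specificity 70%. The width of the interval depends on the number of books in the test set, so it will not match the table unless the real counts are used.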

Conclusion

  • The mean proportions of 33 PoS tags were significantly different between more successful and less successful books (reject null hypothesis 1)
  • Six letter word proportion, mean words per sentence, analytical words and tone were significantly different between more and less successful books (reject null hypothesis 2). Across these categories, every genre except historical fiction had a significant difference, with tone (ie both positive and negative emotion use) being significant for 5 of the 8 genres. No category in the Penn treebank analysis had this many significant genres.
  • Six letter words, mean words per sentence, dictionary words, Authentic, Analytic, Clout and Tone can be used to predict the status of a book with an accuracy of 75%. This is better than the 65% achieved using readability, mean words per sentence and mean syllables per word.

Overall, the LIWC analysis performed better than both the readability analysis and the Penn treebank analysis.

LIWC definitions

These are taken from the LIWC manual.

Abbreviation Category Examples
WC Word count
Summary Language Variables
Analytic Analytical thinking
Clout Clout
Authentic Authentic
Tone Emotional tone
WPS Words/sentence
Sixltr Words > 6 letters
Dic Dictionary words
Linguistic Dimensions
funct Total function words it, to, no, very
pronoun Total pronouns I, them, itself
ppron Personal pronouns I, them, her
i 1st pers singular I, me, mine
we 1st pers plural we, us, our
you 2nd person you, your, thou
shehe 3rd pers singular she, her, him
they 3rd pers plural they, their, they’d
ipron Impersonal pronouns it, it’s, those
article Articles a, an, the
prep Prepositions to, with, above
auxverb Auxiliary verbs am, will, have
adverb Common Adverbs very, really
conj Conjunctions and, but, whereas
negate Negations no, not, never
Other Grammar
verb Common verbs eat, come, carry
adj Common adjectives free, happy, long
compare Comparisons greater, best, after
interrog Interrogatives how, when, what
number Numbers second, thousand
quant Quantifiers few, many, much
Psychological Processes
affect Affective processes happy, cried
posemo Positive emotion love, nice, sweet
negemo Negative emotion hurt, ugly, nasty
anx Anxiety worried, fearful
anger Anger hate, kill, annoyed
sad Sadness crying, grief, sad
social Social processes mate, talk, they
family Family daughter, dad, aunt
friend Friends buddy, neighbor
female Female references girl, her, mom
male Male references boy, his, dad
cogproc Cognitive processes cause, know, ought
insight Insight think, know
cause Causation because, effect
discrep Discrepancy should, would
tentat Tentative maybe, perhaps
certain Certainty always, never
differ Differentiation hasn’t, but, else
percept Perceptual processes look, heard, feeling
see See view, saw, seen
hear Hear listen, hearing
feel Feel feels, touch
bio Biological processes eat, blood, pain
body Body cheek, hands, spit
health Health clinic, flu, pill
sexual Sexual horny, love, incest
ingest Ingestion dish, eat, pizza
drives Drives
affiliation Affiliation ally, friend, social
achieve Achievement win, success, better
power Power superior, bully
reward Reward take, prize, benefit
risk Risk danger, doubt
TimeOrient Time orientations
focuspast Past focus ago, did, talked
focuspresent Present focus today, is, now
focusfuture Future focus may, will, soon
relativ Relativity area, bend, exit
motion Motion arrive, car, go
space Space down, in, thin
time Time end, until, season
Personal concerns
work Work job, majors, xerox
leisure Leisure cook, chat, movie
home Home kitchen, landlord
money Money audit, cash, owe
relig Religion altar, church
death Death bury, coffin, kill
informal Informal language
swear Swear words fuck, damn, shit
netspeak Netspeak btw, lol, thx
assent Assent agree, OK, yes
nonflu Nonfluencies er, hm, umm
filler Fillers Imean, youknow
allpunc All Punctuation*
period Periods/fullstop .
comma Commas ,
colon Colons :
semic Semicolons ;
qmark Question marks ?
exclam Exclamation marks !
dash Dashes
quote Quotation marks
apostro Apostrophes
parenth Parentheses (pairs) ()
otherp Other punctuation