Research Writing

“Success with Style” part 3: using LIWC data

Last time we replicated the original Success with Style output and methods, despite the methods not being listed. We managed to get the data to broadly match. Great, but now we are going to look at a different way of analysing the same text.

In part 2 we used the Penn treebank to analyse the text and its parts of speech (PoS). This time we’re using LIWC, a tool developed at the University of Texas. It has similarities to the Penn treebank in that it categorises words and has similar categories, such as prepositions.

In part 1 we looked at the original experiment and recreated it in part 2. This time we’ll use the same input data but process it through a different NLP analysis program — the LIWC.


H0: There's no difference in the proportion of LIWC categories in successful and unsuccessful books, regardless of genre
HA: There is a difference in the proportion of LIWC categories in successful and unsuccessful books, and the pattern will depend on genre

H0: There's no difference in the LIWC summary values of successful and unsuccessful books, regardless of the book's genre
HB: There is a difference in the LIWC summary values of successful and unsuccessful books, and the pattern will depend on genre


Success with Style LIWC

Method

The data, the measure of success and the method were the same as in part 1, as were the p-value adjustment (p<0.05 for significance) and the machine learning algorithm. Likewise, variables with many zeroes were not transformed.

Difference in success

The R code managed to create different tags to the original. You can find the LIWC definitions at the foot of this page.

Tags per genre

LIWC Difference in proportion function-article – original data

Overall biggest difference

PoS (successful books) Definition Diff (largest difference first) PoS (Unsuccessful books) Definition Diff (largest difference first)
functional Total function words 0.003835 quote Quotation marks -0.001814
prep Prepositions 0.001758 allpunc All Punctuation* -0.001350
article Articles 0.001199 affect Affective processes -0.001231
ipron Impersonal pronouns 0.001198 social Social processes -0.001181
space Space 0.001155 posemo Positive emotion -0.001103
relativ Relativity 0.000860 ppron Personal pronouns -0.001047
number Numbers 0.000623 apostro Apostrophes -0.000999
focuspast Past focus 0.000463 female Female references -0.000963
power Power 0.000454 focuspresent Present focus -0.000929
cogproc Cognitive processes 0.000437 shehe 3rd pers singular -0.000905
period Periods/fullstop 0.000403 verb Common verbs -0.000642
comma Commas 0.000379 informal Informal language -0.000361
differ Differentiation 0.000369 exclam Exclamation marks -0.000323
otherp Other punctuation 0.000318 time Time -0.000319
parenth Parentheses (pairs) 0.000266 you 2nd person -0.000273
conj Conjunctions 0.000266 percept Perceptual processes -0.000236
quant Quantifiers 0.000257 affiliation Affiliation -0.000216
semic Semicolons 0.000254 focusfuture Future focus -0.000213
interrog Interrogatives 0.000233 sad Sadness -0.000202
colon Colons 0.000225 adj Common adjectives -0.000190
work Work 0.000197 family Family -0.000190
drives Drives 0.000163 nonflu Nonfluencies -0.000156
pronoun Total pronouns 0.000154 netspeak Netspeak -0.000154
cause Causation 0.000136 discrep Discrepancy -0.000140
anger Anger 0.000131 see See -0.000133
we 1st pers plural 0.000130 bio Biological processes -0.000130
certain Certainty 0.000125 i 1st pers singular -0.000121
compare Comparisons 0.000125 negemo Negative emotion -0.000111
they 3rd pers plural 0.000122 body Body -0.000104
death Death 0.000101 reward Reward -0.000098
tentat Tentative 0.000078 friend Friends -0.000088
ingest Ingestion 0.000060 risk Risk -0.000080
home Home 0.000055 negate Negations -0.000073
achieve Achievement 0.000038 auxverb Auxiliary verbs -0.000070
money Money 0.000016 motion Motion -0.000069
health Health 0.000011 insight Insight -0.000067
adverb Common Adverbs 0.000011 hear Hear -0.000056
leisure Leisure 0.000003 feel Feel -0.000049
swear Swear words 0.000002 assent Assent -0.000046
male Male references -0.000045
qmark Question marks -0.000035
sexual Sexual -0.000028
anx Anxiety -0.000025
dash Dashes -0.000025
relig Religion -0.000010
filler Fillers -0.000008

A positive (negative) value means that the mean PoS proportion is higher in the more (less) successful books
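
As a rough illustration of how a "Diff" column like the one above can be produced, here is a minimal Python sketch. The book rows and their proportions are made up for the example, and `mean_diff` is a hypothetical helper rather than the code used for this analysis.

```python
from statistics import mean

# Hypothetical per-book proportions for two LIWC categories (made-up numbers).
books = [
    {"success": True,  "prep": 0.130, "quote": 0.020},
    {"success": True,  "prep": 0.128, "quote": 0.024},
    {"success": False, "prep": 0.126, "quote": 0.025},
    {"success": False, "prep": 0.128, "quote": 0.023},
]

def mean_diff(rows, category):
    """Mean proportion in successful books minus the mean in unsuccessful ones."""
    hit = [r[category] for r in rows if r["success"]]
    miss = [r[category] for r in rows if not r["success"]]
    return mean(hit) - mean(miss)

diffs = {cat: mean_diff(books, cat) for cat in ("prep", "quote")}

# Largest positive difference first, mirroring the layout of the table.
for cat, d in sorted(diffs.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{cat:6s} {d:+.6f}")
```

With these invented numbers `prep` comes out positive (higher in successful books) and `quote` negative, which is the same reading as the sign convention described above.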

Unpaired t-tests

Showing results of PoS tags that have significant adjusted P-values.

PoS Definition adjusted P-value
analytic Analytical thinking 0.017
tone Emotional tone 0
mWoSen Mean Words per Sentence 0
sixletter Six letter words 0
ppron Personal pronouns 0.005
ipron Impersonal pronouns 0
article Articles 0.005
prep Prepositions 0
adj Common adjectives 0.005
number Numbers 0
affect Affective processes 0
posemo Positive emotion 0
negemo Negative emotion 0.045
sad Sadness 0.009
social Social processes 0.044
family Family 0.041
friend Friends 0
female Female references 0.026
feel Feel 0.041
bio Biological processes 0.044
affiliation Affiliation 0.017
power Power 0.017
risk Risk 0.017
focuspresent Present focus 0.02
focusfuture Future focus 0
space Space 0.009
time Time 0
informal Informal language 0
nonflu Nonfluencies 0
colon Colons 0.028
exclam Exclamation marks 0
quote Quotation marks 0.005
apostro Apostrophes 0.017

33 out of 93 tags (including punctuation) of the transformed PoS were significantly different between successful and unsuccessful books. This means that we can reject the null hypothesis (hypothesis 1), since the proportion of more than 1 PoS was significantly different between more and less successful books.
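
This part doesn't restate which p-value adjustment was used, so the sketch below assumes the Benjamini-Hochberg procedure purely for illustration. The raw p-values are made up, and `benjamini_hochberg` is a hypothetical helper written for this example.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])  # indices, smallest p first
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values never decrease with rank.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

# Made-up raw p-values for four tags (assumption: not the real test output).
raw = {"prep": 0.0001, "article": 0.004, "comma": 0.03, "dash": 0.60}
adj = dict(zip(raw, benjamini_hochberg(list(raw.values()))))

# Tags that stay below the 0.05 threshold after adjustment.
significant = [tag for tag, p in adj.items() if p < 0.05]
print(significant)
```

Counting the entries in `significant` is then how a figure such as "33 out of 93" would be arrived at, whichever adjustment is actually applied.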

Difference in LIWC summary variables

The LIWC has its own definitions. Some of them are proprietary so how they’re calculated is not clear, but they rely on the PoS tags. For example, ‘tone’ is overall emotion (both the positive and negative emotion tags). Like the tags, they are expressed as a proportion of the text (ie 0.85 means 85% of the text), apart from mean words per sentence.
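
To make "proportion of the text" concrete, here is a toy Python sketch. The two word lists are just the example words from the LIWC definitions at the foot of this page; the real LIWC dictionary is far larger and proprietary, so treat `CATEGORIES` and `category_proportions` as stand-ins, not as how LIWC itself works.

```python
import re

# Toy category lists built from the example words in the definitions table.
# Assumption: these are illustrative only, not the real LIWC dictionary.
CATEGORIES = {
    "article": {"a", "an", "the"},
    "prep": {"to", "with", "above"},
}

def category_proportions(text):
    """Share of words in `text` that fall in each toy category."""
    words = re.findall(r"[a-z']+", text.lower())
    total = len(words)
    return {
        name: sum(w in vocab for w in words) / total if total else 0.0
        for name, vocab in CATEGORIES.items()
    }

props = category_proportions("The cat went to the shop with a friend.")
print(props)
```

The sample sentence has nine words, three of them articles and two of them prepositions from the toy lists, so the article proportion is 3/9 and the preposition proportion 2/9.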

Variables Definition
Analytical thinking (Analytic) People low in analytical thinking tend to write and think in more narrative ways, focusing on the here-and-now and personal experiences. Those high in analytical thinking perform better in college and have higher college board scores.
Clout Clout refers to the relative social status, confidence, or leadership that people display through their writing or talking. The algorithm was developed based on the results from a series of studies where people were interacting with one another.
Authenticity When people reveal themselves in an authentic or honest way, they are more personal, humble, and vulnerable.
Emotional tone (Tone) Although LIWC2015 includes both positive emotion and negative emotion dimensions, the Tone variable puts the two dimensions into a single summary variable. Numbers below 50 suggest a more negative emotional tone.

Measure Successful Unsuccessful P value Significant (p<0.05)?
Six letter words 0.1633 0.1552 0.0004 TRUE
Mean words per sentence 18.3832 17.0184 0.0007 TRUE
Dictionary words 0.8388 0.8410 0.6000 FALSE
Authentic 0.2240 0.2181 0.3900 FALSE
Analytic 0.7240 0.6939 0.0032 TRUE
Clout 0.7417 0.7499 0.3800 FALSE
Tone 0.3892 0.4486 0.0010 TRUE

Results show that successful books had significantly more words per sentence on average (18.38 v 17.02), comparable to the figures in the original test. Likewise the proportion of six letter words (or more) is significantly higher in successful books (0.1633 v 0.1552). The tone, however, is lower in successful books (0.3892 v 0.4486): by the LIWC definition a lower score suggests a more negative emotional tone, and the tag differences above also show successful books using fewer emotion words, both positive and negative.

Looking further at these categories by genre:

Difference in analytical words (scaled and normalized) between more and less successful books
Difference in authenticity (scaled and normalized) between more and less successful books
Difference in clout (scaled and normalized) between more and less successful books
Difference in Dictionary Words (scaled and normalized) between more and less successful books
Difference in mean words per sentence (scaled and normalized) between more and less successful books
Difference in proportion of 6 letter words (scaled and normalized) between more and less successful books
Difference in tone (scaled and normalized) between more and less successful books

Most important variables

PoS Definition Overall relative importance
ipron Impersonal pronouns 100.00
quote Quotation marks 86.40
otherp Other punctuation 69.99
posemo Positive emotion 68.88
time Time 67.30
space Space 64.90
parenth Parentheses (pairs) 58.40
you 2nd person 56.80
adj Common adjectives 46.73
risk Risk 41.25
sixletter Six letter words 40.70
semic Semicolons 38.60
power Power 35.29
netspeak Netspeak 31.52
number Numbers 30.08
swear Swear words 28.03
period Periods/fullstop 27.75
filler Fillers 25.91
certain Certainty 25.69
death Death 25.56
mWoSen Mean words per sentence 25.03
ppron Personal pronouns 22.95
colon Colons 20.12
focuspast Past focus 19.99
body Body 18.78
tone Emotional tone 18.57
leisure Leisure 17.86
focusfuture Future focus 16.08
home Home 14.88
exclam Exclamation marks 13.08
achieve Achievement 11.90
dicWo Dictionary words 11.72
apostro Apostrophes 9.99
work Work 9.22
ingest Ingestion 7.70
health Health 6.83
relig Religion 5.91
qmark Question marks 3.93
interrog Interrogatives 2.72
hear Hear 1.48

Machine learning performance

Accuracy 95% CI Sensitivity Specificity
75.00% 67.6%-81.5% 76% 74%
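
For anyone unsure how these four figures relate, here is a minimal Python sketch that derives them from a confusion matrix. The counts are hypothetical (chosen only to land near the reported numbers), "successful" is assumed to be the positive class, and the Wilson interval is one common choice that may not be the method behind the interval quoted above.

```python
from math import sqrt

def classification_summary(tp, fn, tn, fp, z=1.96):
    """Accuracy, sensitivity, specificity and a Wilson 95% interval for accuracy."""
    n = tp + fn + tn + fp
    acc = (tp + tn) / n
    # Wilson score interval (an assumption; other interval methods exist).
    denom = 1 + z * z / n
    centre = (acc + z * z / (2 * n)) / denom
    half = z * sqrt(acc * (1 - acc) / n + z * z / (4 * n * n)) / denom
    return {
        "accuracy": acc,
        "ci": (centre - half, centre + half),
        "sensitivity": tp / (tp + fn),  # successful books correctly identified
        "specificity": tn / (tn + fp),  # unsuccessful books correctly identified
    }

# Hypothetical confusion-matrix counts, not the author's.
result = classification_summary(tp=61, fn=19, tn=59, fp=21)
print(result)
```

Sensitivity only looks at the successful books and specificity only at the unsuccessful ones, which is why both can sit close to the overall accuracy when the two groups are the same size.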


  • The mean proportions of 33 PoS tags were significantly different between more successful and less successful books (reject null hypothesis 1)
  • Six letter word proportion, mean words per sentence, analytical words and tone were significantly different between more and less successful books (reject null hypothesis 2). Between these categories all genres except historical fiction had a significant difference, with tone (ie both positive and negative emotion use) being significant for 5 out of the 8 genres. No category in the Penn treebank analysis had this many significant genres.
  • Six letter words, Mean words per sentence, Dictionary words, Authentic, Analytic, Clout, and Tone can be used to predict the status of the book with an accuracy reaching 75%. This is superior to the readability, mean words per sentence and mean syllables per word score of 65%. 

Overall, the LIWC analysis performed better than the readability and Penn treebank analyses.

LIWC definitions

These are taken from the LIWC manual.

Abbreviation Category Examples
WC Word count
Summary Language Variables
Analytic Analytical thinking
Clout Clout
Authentic Authentic
Tone Emotional tone
WPS Words/sentence
Sixltr Words > 6 letters
Dic Dictionary words
Linguistic Dimensions
funct Total function words it, to, no, very
pronoun Total pronouns I, them, itself
ppron Personal pronouns I, them, her
i 1st pers singular I, me, mine
we 1st pers plural we, us, our
you 2nd person you, your, thou
shehe 3rd pers singular she, her, him
they 3rd pers plural they, their, they’d
ipron Impersonal pronouns it, it’s, those
article Articles a, an, the
prep Prepositions to, with, above
auxverb Auxiliary verbs am, will, have
adverb Common Adverbs very, really
conj Conjunctions and, but, whereas
negate Negations no, not, never
Other Grammar
verb Common verbs eat, come, carry
adj Common adjectives free, happy, long
compare Comparisons greater, best, after
interrog Interrogatives how, when, what
number Numbers second, thousand
quant Quantifiers few, many, much
Psychological Processes
affect Affective processes happy, cried
posemo Positive emotion love, nice, sweet
negemo Negative emotion hurt, ugly, nasty
anx Anxiety worried, fearful
anger Anger hate, kill, annoyed
sad Sadness crying, grief, sad
social Social processes mate, talk, they
family Family daughter, dad, aunt
friend Friends buddy, neighbor
female Female references girl, her, mom
male Male references boy, his, dad
cogproc Cognitive processes cause, know, ought
insight Insight think, know
cause Causation because, effect
discrep Discrepancy should, would
tentat Tentative maybe, perhaps
certain Certainty always, never
differ Differentiation hasn’t, but, else
percept Perceptual processes look, heard, feeling
see See view, saw, seen
hear Hear listen, hearing
feel Feel feels, touch
bio Biological processes eat, blood, pain
body Body cheek, hands, spit
health Health clinic, flu, pill
sexual Sexual horny, love, incest
ingest Ingestion dish, eat, pizza
drives Drives
affiliation Affiliation ally, friend, social
achieve Achievement win, success, better
power Power superior, bully
reward Reward take, prize, benefit
risk Risk danger, doubt
TimeOrient Time orientations
focuspast Past focus ago, did, talked
focuspresent Present focus today, is, now
focusfuture Future focus may, will, soon
relativ Relativity area, bend, exit
motion Motion arrive, car, go
space Space down, in, thin
time Time end, until, season
Personal concerns
work Work job, majors, xerox
leisure Leisure cook, chat, movie
home Home kitchen, landlord
money Money audit, cash, owe
relig Religion altar, church
death Death bury, coffin, kill
informal Informal language
swear Swear words fuck, damn, shit
netspeak Netspeak btw, lol, thx
assent Assent agree, OK, yes
nonflu Nonfluencies er, hm, umm
filler Fillers Imean, youknow
allpunc All Punctuation*
period Periods/fullstop .
comma Commas ,
colon Colons :
semic Semicolons ;
qmark Question marks ?
exclam Exclamation marks !
dash Dashes
quote Quotation marks
apostro Apostrophes
parenth Parentheses (pairs) ()
otherp Other punctuation
Research Writing

An Agile writers’ room: a better way of writing part 2

Last time we looked at the problem around writing and how too few individuals can write well enough consistently to reach the top. But together they may stand a better chance, and Agile methodology would be the way to do this.

That’s quite an assumption, but Agile (in all its forms, more on that later) is geared to testing and adaptation so the best thing is to plan how that would work and try it out in reality.

Agile writing room

Writing for publication is Waterfall but should it be Agile?

Agile is about working as a team to produce something together. Very idealistic, but don’t Waterfall and its related methodologies do the same?

The main difference is that Agile is not about working to produce one big, final, perfect result. Instead Agile is about breaking it down into small units, delivering the minimum needed in short sprints, testing, refining and adapting.

Agile v waterfall
Waterfall compared with Agile (via Agilenutshell)

This doesn’t mean Waterfall is bad; it suits big things where you can’t test, update or move things. Things such as building projects… and writing? Certainly when I’ve written professionally or creatively it’s been comparable to this – a set deadline, some editing and peer feedback, then submit your best and forget about it once done.

This makes sense at first – if you’re aiming for a deadline you must produce your best and it must be complete and on time. Yet content teams in the non-creative sector are switching away from this because of the benefits of breaking things down into bits. And you can also break the team roles down into bits and split them between members.

The Agile writing team

As the roles are split you’ll need people who can do all these things working together, feeding back and being aware of what others are doing. A mantra of Agile is that the unit of delivery is the team. The best Agile teams may not have the best individuals at each skill, such as the best developer, but they will be the best at working together to deliver what they need to.

You can be brilliant at your role but if you can’t work with others and adapt to help them then you can’t write in an Agile team.

So writers are all you need in a writing team, right? Yes, of course you can’t have a writing team without writers, but you need more.

Here’s a table looking at the skills you’d need in an Agile writing team and how they’d map to a writers’ room. The roles aren’t all that different in many cases; it’s how they work together that is. This is a big simplification: writing and Agile teams vary, and I’ve taken liberties with both the writers’ room and the Agile team for illustration.

Role | Agile | Writing teams
Deals with the vision and the bigger picture. Works with stakeholders. Decides on priorities and makes decisions. Keeps the team informed of priorities. Works with the backlog, making decisions and providing information in a timely manner. | Product owner (aka on-site customer or active stakeholder) | Executive producer, showrunner (depends on the team)
Creates the right environment. Removes blockers and works with the product owner to make the vision happen. Doer of the visionary pairing. | Delivery manager/scrum master. Problem solver, project management, but not technical planning and scheduling as that is left to the team. Works to hire the team. Has a range of skills to do things properly. Very practical person. | Co-producers, showrunner. Writers’ assistant can help with some of the lower level tasks.
Creator | Content designer, developer | Writers (story editors, staff writers etc)
Researches what the user needs, identifies the users | User researcher | Writers’ assistant (if asked by writer)
Testing and stretch exercises | Team develops this themselves | Team develops this themselves
Specialists with knowledge brought on for key parts | Technical or domain experts with specialist technical knowledge | Consulting producer
Testers | Independent test team, user researchers | External editor, readers
Anyone who is a direct user, indirect user, manager of users, senior manager or staff member. “Gold owner” who funds the project. Representatives of the customer. | Stakeholders (funder/commissioner) | Executive producers, studio

There are many differences, though. In Agile, because the team is responsible for delivery, its members are also collectively responsible for accepting work, allocating it and producing it.

So while the showrunner still has an editorial job, they are less of a tyrant of imaginings; in return for this loss of control there should be a gain in innovation.

An example of how it works

Agile has already transformed other creative ways of working. I’ve mentioned government a lot but other areas have changed too, such as marketing:

“[Before Agile we didn’t have] a clear focus of our tasks and communicating them as a team […] Now, before the start of each quarter we’d meet and decide what our team priorities would be, then each team member would be assigned to the priorities and off we’d go. We’d meet two mornings a week to discuss the progress of our priorities, our KPIs, and our blockers.”

Which Agile do I mean?

Agile experts reading this probably asked this question long ago, even though I said I’d look at the general principles. The main 3 forms of Agile, as the Harvard Business Review states, are:

  • scrum, which emphasises creative and adaptive teamwork in solving complex problems
  • lean development, which focuses on the continual elimination of waste
  • kanban, which concentrates on reducing lead times and the amount of work in process

My straw poll of Agile experts is that kanban would be a good way to start, as it’s about reducing the amount of work in process. But the beauty of Agile is that it can be adapted as needed.

Team writing in Agile is not for everyone, for various reasons. For instance, everyone needs to own a ticket. This responsibility is not for everyone. Consistency will be tricky. That is one for Agile to answer through the doing – there may not be a market, people may be afraid of ‘idea theft’ (not that that is really an issue). It may be less agile and more plodding.

Final thought: Agile writers, over complicating things?

It’s a fair question – is this overly complicated? My only defence is the William Goldman view of Hollywood – if, as he says, “no one knows anything” then who’s to say they know it won’t work?

Hollywood and TV (which this would be about writing scripts for) would be receptive to anything as long as it gets results. More and more places, including Amazon Studios, accept unsolicited scripts and only care if they tell a good story.

What they want is writers who can meet a specification, make changes as requested (and not be too difficult about pushing back) and do it on time.

From my time at the BBC, the thing that came up again and again when people asked “how does that person keep getting hired” was that while their scripts might at worst be called mediocre, they were never bad, they met the brief and, most important of all, they were on time.

That’s not too high a bar to hit.

Next steps

Theory is one thing but it’s nothing without putting it into action.

That’s what the plan is. It’ll be hard to get going – would this be voluntary or would I hire people; I have a breakdown of resources but will that work in practice?

So many questions, but the only way to answer them is not to speculate but to try.

Be prepared, be prepared to fail, but most importantly be prepared to learn and to develop from that. Success in terms of the project is that it even works and we complete an initial script. Surely we can do that?

News Scientific Research

Scrivener: the best tool for organising user research

User research involves a lot of, well, research; a lot of notes, documents, videos, pictures, post its and more. And they all need organising.

There’s no one solution for the problem of what to do with all this, but after a bit of experimentation I find that using Scrivener has been the best for me for keeping things organised.

Scrivener is often seen as a writing tool, but it’s more than a word processor. Yes, it is a writing tool – from word processing to screenplays – but it is also an organiser. Most importantly, it’s very simple to use and has more advanced features for those who want them.

Scrivener being used for user research
Scrivener lets you display folders and multiple documents at once

Renaming research in Scrivener

I’ve been using Scrivener for years, and coming to user research from an anthropological and journalistic background I focus on research that’s written up – observations, interviews, transcripts. But I also add photos, plan card sorts, organise thoughts with the card index display, and add spreadsheets, PDFs and presentations. Even if I don’t read the presentations directly in there, being able to search all relevant work in one search helps.

In Scrivener I like how easy it is to organise and rename documents, or duplicate them. Compared with doing this in Finder or Explorer, it is much less of a faff. Likewise documents open immediately rather than taking a few seconds as they do in Word or Google Drive (and often aren’t the one I want anyway).

While I still use Google Drive and Dropbox to organise files, particularly video, so much of my research is pure words (transcripts, proposals, documents or insights) that I find Scrivener is the best way to keep it all together.


I love tables. I like maths, I like spreadsheets. Really.

I like to organise interview questions in tables and use a Dewey-esque numbering system to help reorganise them. So question 101 is the first, but perhaps it needs to come later, so I reorganise it as 103 and sort.
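
Outside Scrivener, the same sort-code trick can be sketched in a few lines of Python. The question texts are invented for the example and `move` is a hypothetical helper; the only idea taken from above is that each question carries a number and the list is re-sorted after renumbering.

```python
# Each question carries a sort code; gaps leave room to move things around.
questions = [
    {"code": 101, "text": "How do you currently organise your notes?"},
    {"code": 102, "text": "What tools have you tried?"},
    {"code": 104, "text": "What would you change?"},
]

def move(items, old_code, new_code):
    """Give one question a new sort code, then return the list re-sorted."""
    if any(q["code"] == new_code for q in items):
        raise ValueError(f"sort code {new_code} is already in use")
    for q in items:
        if q["code"] == old_code:
            q["code"] = new_code
    return sorted(items, key=lambda q: q["code"])

# Question 101 needs to come later, so it becomes 103 and the list is re-sorted.
questions = move(questions, 101, 103)
print([q["code"] for q in questions])
```

Leaving gaps between codes is what makes this workable: a question can be slotted in between two others without renumbering everything after it.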

Likewise when reviewing a transcript I like to have each question in its own cell with thoughts and insights in the cell next to it.

Scrivener could be friendlier with tables – don’t create one at the end of a page or you’ll never get out, and I always have to customise it. But once I created a good, blank table I could copy and paste that.

Sort code | Quote | Observation
101 | I’m not really sure that it’s appropriate | User not keen on this
102 | Do I really have to give you a dummy quote? | Prefers to be in control of speech
250 | At this time, a friend shall lose his friend’s hammer and the young shall not know where lieth the things possessed by their fathers | Likes Brian?

Good things about using Scrivener for user research

What’s great:

  • Easy to move documents around and organise into folders and rename them
  • Split view makes reviewing transcripts and images easy
  • Colour and icon coding makes it easy to find key files
  • Compiling documents means you can produce consistent output, select just the ones you need for a single PDF or Word report, or output multiple documents, so you don’t have to worry about formatting until the end
  • Coding for things such as image captions means that you don’t have problems with Word getting confused about auto-numbers
  • Text file syncing – if out in the field you can create text notes and sync them automatically into the project 
  • Great search tool for searching titles or entire files
  • Corkboard views to organise thoughts, observations, insights etc
  • Good way to have a list of priorities and hierarchies
  • Importing documents automatically works pretty well, just drag and drop the Word docs to where you want them and it’ll convert them into a continuous webpage rather than multi page report

What’s not so great:

  • No dictation tool
  • Not always the best way to view documents and tables
  • No Android version (there is one for iOS), although it’s rare that you need the entire project on your phone
  • Adding a weblink – the link field is pre-filled with https://, but a URL copied from Chrome already includes that part, so if you forget to remove the duplicate you get a ‘broken’ link starting https://https://
  • Can be fiddly with bullets
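
If you export your notes and want to tidy up the doubled https:// links mentioned in the list above, a small Python sketch like this would do it. `fix_doubled_scheme` is a hypothetical helper, not a Scrivener feature, and it assumes the last scheme in the run is the one that was pasted from the browser.

```python
import re

def fix_doubled_scheme(url):
    """Collapse a repeated leading scheme such as 'https://https://' into one.

    The last scheme in the run is kept, on the assumption that it is the one
    copied from the browser.
    """
    return re.sub(r"^(?:https?://)+(https?://)", r"\1", url.strip())

print(fix_doubled_scheme("https://https://example.com/page"))
print(fix_doubled_scheme("https://example.com/page"))
```

Links that only have one scheme are left untouched, so it is safe to run over every link in a document.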

User research tools to support Scrivener

OneNote, which isn’t free, is good for:

  • Transcripts – jump to the audio where your notes are as it tracks your writing with recording (although only 15min recording on Android for some unknown reason). It can convert speech to text, though I find that’s a bit less reliable.
  • Optical character recognition – it’s not 100% accurate but it’s good enough for recognising text from images and these will be shown in search
  • Syncs across devices

I also use Trello to track research questions, answers and insights.

Overall Scrivener with its files synced through the cloud (Dropbox, OneDrive etc) has been great for keeping track of research. Scrivener isn’t free, but I feel I got my $45 worth of use long ago, and it’s less than what Microsoft charges for Office 365 (which includes OneNote).

Scrivener hasn’t sponsored or otherwise provided incentives for me to write this (nor has Microsoft, though I’d feel weird if they did), I just want to spread the word for a useful tool.


Daily Mail v The Guardian: equally angry?

This week two British media giants, the Daily Mail and the Guardian, got into an inter-title fight about who encourages hate and negativity.

The Press Gazette best sums up the story, which started when the Guardian implied that the Mail and Sun are to blame for the recent attack on a mosque.

The Guardian published a cartoon of a white van outside Finsbury Park mosque, where one person was killed, with ‘Read the Sun and the Daily Mail’ on the vehicle. The Mail took this as implying that it incited the attacker to kill Muslims and fumed, replying with the editorial “Fake news, the fascist Left and the REAL purveyors of hatred”.

In short, both sides accuse the other of peddling noxious opinions, and in particular the Daily Mail effectively says that the Guardian can get off its high horse as its views are just as noxious. Are they?

The Mail has a point

Yes, the Daily Mail has a point. While the Guardian may not typically have immigrants, saboteurs or judges as targets of its wrath, it does use similarly emotive language in descriptions of its enemies (usually Tories).

What it comes down to is this: the Mail says that the Guardian’s views may be politically left, but they are just as negative as the Guardian thinks the Mail’s are.

This chart shows the average proportion of ‘anger’ words in the body copy and headlines for 12,000 Mail and Guardian opinion pieces spanning the past couple of decades. The two papers are not so different in the average amount of anger and negative words they use in body copy and headlines, and both use more on average than other British newspapers.

Negative newspapers?

In 2013 I analysed 60,000 opinion columns from 6 British newspapers — the Daily Express, Mail, Independent, Mirror, Guardian and Telegraph — for a range of measures. This included sentiment, and emotional proportions within text, using the LIWC 2007.

I was looking at a range of things, including the question of whether the internet had changed the way newspapers wrote: would they become more emotional to target their niches? I chose opinion columns (editorials, and pieces by regular and guest columnists and commentators) because I took them to be the most suitable way to see what a paper really thinks, as opposed to how it reports a news event.

I split the headlines and body copy out as headlines are often written separately to the body, and can also give an idea of what phrasing the paper thinks will draw readers’ attention.
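
As a rough idea of how per-paper averages for headlines and body copy can be worked out once each article has been scored, here is a minimal Python sketch. The rows and scores are made up, and the field names (`headline_anger`, `body_anger`) are hypothetical rather than the columns in my data set.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical rows: one per article, with made-up anger scores for each part.
articles = [
    {"paper": "Mail", "headline_anger": 1.9, "body_anger": 0.8},
    {"paper": "Mail", "headline_anger": 2.1, "body_anger": 0.9},
    {"paper": "Guardian", "headline_anger": 1.5, "body_anger": 0.9},
    {"paper": "Guardian", "headline_anger": 1.7, "body_anger": 1.0},
]

def averages_by_paper(rows):
    """Mean headline and body score for each newspaper."""
    grouped = defaultdict(lambda: {"headline": [], "body": []})
    for row in rows:
        grouped[row["paper"]]["headline"].append(row["headline_anger"])
        grouped[row["paper"]]["body"].append(row["body_anger"])
    return {
        paper: {part: mean(scores) for part, scores in parts.items()}
        for paper, parts in grouped.items()
    }

summary = averages_by_paper(articles)
print(summary)
```

Keeping headline and body scores in separate lists is what allows the two to be charted side by side for each paper.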

At the time I vowed to publish each week. I didn’t in the end, in part as I saw no market and in part because I was looking around to see if anyone was interested in publishing. While I got some interest, it was a case of “what does this lead to”? This is what it leads to.

Whenever there are two colours, blue is the body copy, red is the headline. Y-axis is the proportion of content meeting that definition. Or just hover over the images for the legend to appear.

Average negative emotion in headlines and body for all newspapers

The following charts make it clearer, but there is a definite difference between newspapers in their negativity, and a similarity between the Mail and Guardian.

Average anger in headlines and body for all newspapers

The Guardian has angrier content, on average, than the Mail – 0.884 v 0.839.

Most negative content

The Daily Mail is the most negative, but the Guardian isn’t far behind.

Angriest headlines and body (split out)

The Daily Mail has the angriest headlines, but not the angriest content — that’s the Guardian



Positive message

The Mirror is overall the most positive, although the Guardian is slightly more positive in its message than the Mail.

Negative emotions in headlines and body over time

Before 2006 I have less data, which may explain the variation (and is why the other charts are based on data from 2008 onwards), but while headlines change in tone, the body copy has largely been consistent. Zoom in to 1 or 2-year views and there’s no large change over the months, not even at Christmas.

Change in negativity over time for all papers

All newspapers have largely been consistent over the years. I had been expecting them to become more emotional as they strive to distinguish themselves on the internet.

Mail change in negativity over time

Love it or hate it, the Mail has largely stuck to its tone over the years, perhaps a little more negative of late.

Guardian change in negativity over time

As with the Mail, the Guardian has been roughly consistent in its tone.

Word count over time

This is the only chart that shows a real change over time. Many style guides for online suggest keeping the body length short (something I ought to be better at) and you can see that as the internet becomes more important for revenue around 2005 the length shortens.

Why does it creep up again? Honest answer, I don’t know, but it could be a suspicion that people are so quick to move on to another article that it doesn’t matter whether it was long or not: if the reader likes it, they’ll stick to the end, regardless of the length (within reason). Or it could be my data set.

The Daily Mail v Mail Online

Part of the beef the Daily Mail has is that it accuses the Guardian of confusing MailOnline with the Daily Mail. I use ‘the Mail’ in general terms, partly for the reasons in this article. As such I can’t guarantee the data solely contains Daily Mail rather than MailOnline articles (they are apparently separate companies, though both are owned by DMGT), though if I reviewed it I probably could.

End thoughts

I should carry out significance tests, but for a quick and dirty evaluation (if 60,000 articles can be seen as that) it serves a point — that the Mail isn’t as wrong as many would like to think.
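As a rough idea of what such a significance test could look like, here is a minimal sketch of a two-sided permutation test on the difference in mean anger between two papers. It is not the analysis behind the charts: the function name `permutation_test`, the number of shuffles and the scores are all made up for illustration, and a real run would use the per-article scores from the data set.

```python
import random
from statistics import mean


def permutation_test(a, b, n_permutations=2000, seed=0):
    """Two-sided permutation test for a difference in group means."""
    rng = random.Random(seed)  # fixed seed so the result is repeatable
    observed = abs(mean(a) - mean(b))
    pooled = list(a) + list(b)
    n_a = len(a)
    extreme = 0
    for _ in range(n_permutations):
        # Shuffle the labels and see how often chance alone gives
        # a gap at least as large as the one observed
        rng.shuffle(pooled)
        diff = abs(mean(pooled[:n_a]) - mean(pooled[n_a:]))
        if diff >= observed:
            extreme += 1
    # Add-one correction keeps the estimate away from exactly zero
    return (extreme + 1) / (n_permutations + 1)


# Made-up anger scores, for illustration only (not from the data set)
paper_a = [0.91, 0.85, 0.88, 0.93, 0.80, 0.87]
paper_b = [0.84, 0.79, 0.86, 0.82, 0.88, 0.83]
p_value = permutation_test(paper_a, paper_b)
print(f"p = {p_value:.3f}")
```

A small p-value would suggest the gap between two papers is unlikely to be chance alone. With tens of thousands of articles, even tiny gaps may come out as significant, so the size of the difference still matters.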

As this former journalist says, the Daily Mail isn’t all bad and this wasn’t published to bash it. In fact it was the Guardian accusing others of being so hateful that spurred me on to this data research back in the day.

What can both papers learn? I’ve not seen their sales, link shares, page views or other closed data, which would be the best way to see whether there is a correlation between tone and readership. But they can both learn that while the topics of their wrath, their readership, their font and their style all differ, there are more similarities than some would be comfortable with.

Contact me if you want the data of nearly 60,000 articles, including 5,200 from the Mail and 7,200 from the Guardian, or go to Google Drive, but you must attribute it if you use it.


Bureaucrats for Brexit: the forthcoming multi-million pound gravy train

Last week just over half of us voted to leave the EU. The Leave campaign promised us massive savings, £350m a week no less (well actually, less, they admitted the morning of the result), but did not speak of the costs.

Not costs of the nosediving stock market, the torpedoed national credit rating, the plummeting investment or sinking trade figures. I’m talking about the cost of government producing laws and guiding the public how to follow them.

EU papers being crossed out

Getting legislation to laymen

The government, once it legislates, does not then just say “well we’ve passed a law, you people should read it and know what to do”. The various departments (HMRC, the Home Office etc) must produce guidance on how those affected need to follow the law and carry out its requirements. And that’s where I and others like me come in.

I’m a freelancer who turns laws into guidance the public understands – but there aren’t enough people like me, or civil servants, to update the content in light of Brexit. This work will have to be done in stages:

  • review all laws to see which will need to be updated
  • review all guidance, categorising it as:
    • not needing an update
    • update without a change in the law
    • update with a change in the law
  • debate and update these laws in parliament
  • update the guidance that needed a change in the law

That’s a lot of work, but how much are we looking at?

Review all EU-related laws

According to, there are 12,272 laws related to “European”. This may not capture all laws and some may be superseded, some may not be directly related to the EU, but let’s assume that this is the right figure.

These laws must be reviewed within the 2-year notice period that starts when we tell the EU that we’re out (formally known as Article 50), so as to be ready for exit day. We haven’t submitted Article 50 as of time of writing, and the civil service can’t start the work until it is submitted.

So that’s 12,272 laws that will need to be examined. In 2 years.

House of Commons debating
Busy day at the office – UK Parliament via Flickr

This is just for existing laws of course, and ignores any amendments, and I’ve not even considered all the new laws we’ll need to create just to leave. But in theory this review shouldn’t cost us any more as this would be included in the MPs’ salaries, barring expenses for many late nights.

Department of Brexit?

MPs don’t draft laws alone; they work with civil servants. So if MPs have 12,300 laws to review, it’s the civil service that will do the work of examining them and setting out the initial proposals for ministers to put to the House. Then the civil service will need to update the guidance to inform the public.

Perhaps a ‘Department of Brexit’ will be created to do this, or else the departments will create their own Brexit teams.

Yet the civil service is already running at high capacity. Even if some work can be ditched because it’s reviewing or enacting EU-related legislation that will no longer be needed, there simply aren’t enough staff to do this new, urgent mountain of work.

Update the guidance

There are around 12,000 EU-related publications on GOV.UK, the site where pan-UK (eg passports) and England-only guidance to the law is published. Scotland, Wales and Northern Ireland each have their own sites. Each page takes time to review and write; based on experience, let’s say each page requires 2.5 working days.

In some cases it’s a simple 2-minute read-through and no change will be needed. Other guidance, like farm grants, can take several weeks and involve several civil servants, the same ones doing all the reviewing for parliament. So 2.5 days seems fair.

To outsiders this may seem bureaucratic but the law is complex, often badly written, and subject to interpretation that requires a lot of input. Content is written, subbed, approved by other civil servants and amended as needed. The teams I work with go as quickly as possible but there are limits.

How long will it take?

Each page requires half a person-week of work, which works out at 20 pages per week for a team of 10. Again, this is reasonable, even if it sounds crazy to an outsider. So let’s bring in a team of 20, which will increase output to 40 pages per week.

Great, that means that team would take 300 weeks, or 6 years, just for the English law. Scotland, Wales and Northern Ireland don’t have as much as they don’t need information on passports (yet…). So instead of quadrupling let’s call it a round 1,000 weeks, or 20 years for the team to review and update all guides.

So to get this done in two years, a tenth of the time, we’d need 10 times the people, 400. Instantly ready to go, interviewed, vetted and knowing how to write in style. In addition to existing civil servants. With offices and equipment.
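The arithmetic in this section can be sketched in a few lines. The 2.5 days per page, the team sizes, the 12,000 pages and the round 1,000 weeks come from the text above. The 5-day week, the 50 working weeks a year and the function names are my own assumptions for illustration.

```python
DAYS_PER_PAGE = 2.5           # from the post
WORKING_DAYS_PER_WEEK = 5     # assumption
WORKING_WEEKS_PER_YEAR = 50   # assumption, matches 300 weeks being about 6 years


def pages_per_week(team_size):
    """Pages a team can review and rewrite in one working week."""
    return team_size * WORKING_DAYS_PER_WEEK / DAYS_PER_PAGE


def weeks_needed(pages, team_size):
    """Weeks for a team of the given size to get through all the pages."""
    return pages / pages_per_week(team_size)


def team_for_deadline(baseline_team, baseline_weeks, deadline_weeks):
    """Scale a team linearly so the same workload fits the deadline."""
    return baseline_team * baseline_weeks / deadline_weeks


print(pages_per_week(10))        # 20.0
print(pages_per_week(20))        # 40.0
print(weeks_needed(12_000, 20))  # 300.0
print(team_for_deadline(20, 1_000, 2 * WORKING_WEEKS_PER_YEAR))  # 200.0
```

On these assumptions, straight proportional scaling from a 20-person team gives about 200 people for the 2-year deadline rather than the 400 quoted above, so treat the 400 as the post's own working figure. The sketch also assumes output scales perfectly with headcount, which is generous for a team that has to be hired and trained first.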

How much will it cost?

We’ll take the average content designer salary as £40,000 per year (excluding pensions and benefits), which is higher than the current advert but accounts for senior roles. That works out at £16m a year, more like £20m with IT equipment, office space etc, or £40m for 2 years for the team (and this assumes all stay, there are no hiring problems etc).

In the context of the crashing economy, and of course the £350m a week Leave ‘claimed’ we’d save, this is not much. But the civil service has to find £3.5bn in cuts by 2019-20 and HMRC, which will have to update its guidance on trade with the EU, is already set to lose 20% of its staff, for example.

Hiring the civil servants would also mean that this £20m a year would be ongoing and keep rising, and if their workload does decrease when we are out of the EU it will be hard to get rid of them.

So the government would likely look to contractors, who would cost at least double but can be released once the job’s done. Let’s say £80m over 2 years. Just to review and revise existing laws.
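For anyone who wants to change the assumptions, the cost estimate can be sketched as a small calculation. The headcount of 400, the £40,000 salary, the 2 years and the doubling for contractors are the post's figures. The overhead factor of 1.25 is simply my reading of "£16m a year, more like £20m", and `staffing_cost` is a hypothetical helper.

```python
HEADCOUNT = 400              # the post's figure
SALARY = 40_000              # average content designer salary, from the post
YEARS = 2
OVERHEAD_FACTOR = 1.25       # assumption: turns £16m a year into "more like £20m"
CONTRACTOR_MULTIPLIER = 2    # "at least double", from the post


def staffing_cost(headcount, salary, years, overhead_factor=1.0):
    """Total cost of a team over a number of years, overheads included."""
    return headcount * salary * years * overhead_factor


salaries_per_year = staffing_cost(HEADCOUNT, SALARY, 1)
staff_total = staffing_cost(HEADCOUNT, SALARY, YEARS, OVERHEAD_FACTOR)
contractor_total = staff_total * CONTRACTOR_MULTIPLIER

print(f"Salaries per year: £{salaries_per_year:,.0f}")           # £16,000,000
print(f"Staff over {YEARS} years: £{staff_total:,.0f}")          # £40,000,000
print(f"Contractors over {YEARS} years: £{contractor_total:,.0f}")  # £80,000,000
```

Every figure scales linearly with the headcount, so halving or doubling the team size halves or doubles the bill.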

What else will taxpayers have to pay for?

We have a figure of £80m just for de-Euroising guidance. But the Department of Brexit would also have the budget for at least the following over 2 years:

  • trade negotiators – we’ll need to make trade deals with up to 50 countries, but we have no negotiators as the EU did this. We have 2 years to exit, and Leave’s much-vaunted EU-Canada deal took 7 years to complete
  • special commissions – for the Irish border, Gibraltar and other areas that arise
  • referendum campaigns – for potential Scottish and Northern Irish referendums
  • reform unloved EU laws – VAT on tampons, reviewing farm grants for the ‘butter mountains’. If we’re to leave it’s only fair MPs examine these much-mocked rules
  • document updates – passports, driving licences and the like, and I doubt that a simple switch to “European Economic Area” in the wording or whatever we go with will suffice or be cheap
  • visa system updates – a new or updated visa system to cope with EU migrants; the current one is already creaky and can involve processing a 41-page form and its supporting documents for each person
  • EU divorce negotiations – the big questions plus things like pensions for MEPs and eurocrats

Total price for that? If we’re involving lawyers we could well add another zero to the estimate for the updated guidance, shall we say £800m a year?


This £800m is a rough figure I’ve extrapolated. What’s terrible is not that I’ve made some very broad assumptions in my back-of-the-envelope calculations, but that this is an envelope more than what the Leave campaign told us.

In short taxpayers can expect a hefty bill they weren’t expecting. These are only initial costs. I suspect that as more people do the work the Leave campaign should have done and give detailed costs of untangling ourselves from Brussels the more bills we’ll find. But as many vote Leavers are saying already, freedom isn’t free, and they’ll be happy with it.

This poll suggests why this is the case. While those who voted remain find the economy (and so the costs of Brexit) the most important thing, for leave voters it was taking control of laws and immigration, and money’s no object there. So perhaps I shouldn’t be surprised no costing was done because it would only strengthen the remain arguments among its supporters but would do nothing for its own Leave voters.

In short then, half the country wanted us out, but all of us will have to pay. Start saving, taxpayers.


If you’re interested in the legal implications I recommend reading this blog on constitutional law.

Scientific Research Writing

Scraping, screenplays and sexism

In the past couple of days there have been two big data posts that analyse sex and screenplays.

Polygraph’s Hannah Anderson and Matt Daniels scraped and analysed 2,000 screenplays and their dialogue to get data on the division of dialogue according to sex, age and other factors.

The Economist looked at data from USC Annenberg on nudity and ‘sexualised attire’ (aka revealing outfits and the like) in film, along with lead and speaking roles by sex.


Getting screenplay data

Both reports focused on presenting the data and key thoughts rather than delving too deep into interpretation. Analysing Hollywood is a complex business – as William Goldman said, “nobody knows anything” when it comes to predicting success, let alone Hollywood and sexism.

What interests me most is how the screenplays were analysed. Matt has written up a long and detailed method with links to script sources, along with the code on Github and a list of where he got the data from.

Potential uses

Both studies used data to explore issues around gender and films, but there is further potential with the data. For example:

  • emotion and sentiment – not a fan due to the drawbacks, but it is possible to trace emotion in scripts, looking at such things as whether beginnings, middles or ends are more or less emotional and whether there is a pattern
  • the split of action and dialogue in a script – do successful scripts have a divide (aka an avoidance of walls of text)
  • are women more confident or not – an extension of their sexism report, but it could be a question of whether female characters tend to ask more questions (or use emotional language)
  • writing level – what is the typical readability for the dialogue of heroes and villains, along with scripts in general and how does this vary by genre (would The Imitation Game or A Beautiful Mind be more difficult to read, let alone film, than Die Hard?)
  • is good writing important in a successful script – as with the study of readability, does having too many adverbs and other things that Hemingway hates hinder scripts
  • statistical significance – as Matt acknowledges, there are no statistical tests in their report, what tests could be done
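To give a flavour of the writing-level idea in the list above, here is a minimal sketch that scores a line of dialogue with the Flesch Reading Ease formula, where higher scores mean easier text. The syllable counter is a crude vowel-run heuristic of my own, the function names are hypothetical, and the two lines of dialogue are invented rather than taken from any script.

```python
import re


def count_syllables(word):
    """Rough heuristic: count runs of vowels, with a minimum of one."""
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))


def flesch_reading_ease(text):
    """Flesch Reading Ease score for a piece of text (higher is easier)."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    if not sentences or not words:
        raise ValueError("text needs at least one sentence and one word")
    syllables = sum(count_syllables(w) for w in words)
    return (206.835
            - 1.015 * (len(words) / len(sentences))
            - 84.6 * (syllables / len(words)))


# Invented lines, for illustration only
action_line = "Get down. Stay low. Run when I say."
boffin_line = ("Cryptanalysis necessitates extraordinarily "
               "sophisticated mathematical intuition.")

print(round(flesch_reading_ease(action_line), 1))
print(round(flesch_reading_ease(boffin_line), 1))
```

Running the same function over every line spoken by heroes and by villains, then averaging by character or genre, would be one way into the question. The heuristic miscounts plenty of English words, so a proper study would want a pronunciation dictionary or an established readability library.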

Why we need this data

Maybe nothing comes out, but there is no harm in trying. I never expect any rules to come out (Goldman is already laughing), but perhaps some very broad principles could emerge from the data. Even a finding of nothing can be something to report. The only pity is that due to grey areas of scraping we’d have to start from scratch rather than use the script data the teams have already used.

But it will be worth it and we can get away from what the Polygraph article calls “all rhetoric and no data, which gets us nowhere in terms of having an informed discussion.”

In the meantime if you want to search the data you can either check out the links or use the Polygraph tool here.

CW News

Analysing content beyond Google Analytics

Last night I gave my talk at Content, Seriously: Real strategies for real content. It was a pleasure to address a room full of content professionals, and a relief to be able to answer their questions.

If you couldn’t make it, or wanted the links, here’s a copy.

[slideshare id=59687482&doc=analysingcontentbeyondgoogleanalyticspdf-160317164624]

CW News

Analysing content: come hear me speak

I’m giving a talk on word analysis, Google Analytics, and what I’ve been doing at Defra, the UK Department for Environment, Food and Rural Affairs, including how I’ve been measuring the work that other content designers and I have done.

I’m one of the speakers at the March meetup for ‘Content, Seriously: Real strategies for real content’, which this month is focusing on Meaningful metrics and meaningful measurement: the constant content challenge. I’ll be talking about how to go beyond Google Analytics to assess content, which as you may tell from this site, is something I love carrying out.

The talk’s on Wednesday 16 March 2016 from 7:00 PM to 9:00 PM at the very top floor of the Captain Kidd pub, 108 Wapping High Street, London E1W 2NE. It’s a pirate’s spit away from Wapping Station and a short walk from Tower Gateway. Come for the talks, stay for the Sam Smith’s beer, of which I’m a fan.

The other 2 speakers are doing bigger things than me so it’s an honour to be sharing the evening with:

  • Adrian Kingwell of Mezzo Labs – “Analysis paralysis: how I conquered my fear of numbers and learned to love Google Analytics”
  • Charlie Southwell of Transmute – “Better metrics for social media”

Tickets are £5, but Sam Smith’s beer is cheap and last time there was a round of drinks to be had.


Sitting around Arthur’s World

You’re in Arthur’s World whether you realise it or not — and that applies to both the audience and the main character.

In the loft of Shepherds Bush theatre, Arthur’s council bedsit is oblivious to the riots and death outside, and the audience inside sitting snugly along the wall.

Research Writing

Dressing your characters

Describing your character’s dress and appearance can be the sign of poor writing taste – but not if you do it in the right context, as a Harvard Business School study has just confirmed.

When writing a story, having a character know what the norms are and being able to conform or break them, and how others react to this, can help a story. While some dress differently “to communicate that they are different or worthy of attention”, the exact effects have been found in a psychological study.

And it led to interesting results relevant to writers.