How can a team of writers create a story when they don't share the same city, schedule or way of working?
That's what my two writers rooms faced. And to make the problem more complex, both teams started with a mere seed of a story idea.
Fortunately, the writers rooms comprise 8 extremely talented writers who are willing to engage with the problems and to share their own approaches and ideas.
This post is about how we overcame these obstacles and how we're adapting to new ones as they arise, for the project is still ongoing. It's about how you can write together as a group, remotely, in your free time, using free tools.
The end goal is not a finished book or screenplay, but a detailed summary setting out what will happen and in what order.
I chose this goal because a strong story comes across regardless of format. Though the medium shapes a story, I took the idealistic approach that a good story can be told in a variety of media.
The aim was to get the story in its clearest form. To me this is a treatment: a summary of the story.
To get there we needed a range of tools to help us explore ideas, comment on them, develop them, review and refine them, then review them further.
What we wanted
This project is as much an experiment in how best to write remotely using Agile methods as it is an attempt to produce a good story (the aim is to succeed at both).
The team knew it was an experiment and shared their technological and availability constraints.
But before we delved into technical solutions, we had to solve the biggest barrier: personal separation.
Technology cannot replicate what it's like to meet and know someone. That meant the teams had to meet each other.
A key requirement, then, was an in-person kick-off workshop. A one-day workshop won't solve every problem of getting to know others, but meeting in person meant the writers were more than just avatars and faces on a web chat.
Only then could we look at the problems of technology. Only then could we look at using technology to help us talk.
Tools for the job
First, we agreed that whichever tools we used, we would approach each other with respect and be constructive, regardless of the medium.
Second, each tool had to be fairly straightforward to use across the range of technologies that the writers had.
Finally, it had to be free, or at least very cheap, as this project has a tight budget.
Through my freelancing I was used to Trello, Google Drive and Slack (a holy trinity of UK government project management). They're free, so I used them, along with Zoom:
Zoom (subscription, with free plans available) for conference calls. It works across multiple platforms and let writers dial in from around the world without too much faff
Trello (free) to manage tasks and resources (such as definitions and example templates). Mainly used at the beginning, but also to help the team understand the documents as they grew. Cards were used to discuss ideas
Google Docs (free) for longer ideas and for scene exercises. Handy for leaving comments on ideas or for free text, though the default of view-only sharing and having to submit comments kept causing problems
Slack (free). Each team had a channel to discuss ideas, one to schedule catch-ups (with simple voting on dates) and a general channel for ideas from both teams
Review of technology
These tools have largely proved successful. The work is progressing and we are learning, which are the chief measures.
Most of these tools were new to the writers, but all picked them up quickly and took the initiative. Slack has proved the most used, from discussing ideas (though these sometimes need corralling into a new document) to general updates and reminders.
Getting Slack to do automatic reminders and nagging with its bots has helped take some pressure off managing two teams.
If I had the budget I'd consider Coda.io, which combines many of the above, but for now it's too expensive.
Technology is just a tool for better stories
The aim of all this of course was to produce a story. Technological tools are great only if we have something to say.
Next time I'll show you how we started from a simple idea and developed it into a complex story with a range of characters.
The project is ongoing and I'm happy to discuss it further if you're interested. Despite the name dropping, this post has not been sponsored or endorsed by any company.
Being 'storytelling animals', it's no big surprise. And technology is helping us get more stories.
I'm a great believer that technology is a tool that enables people to fulfil desires. And we desire stories.
The growth over the past decade of decent internet access, cheap ereaders, cheap high-quality subscription services, and free user-generated content on YouTube, blogs, podcasts and story sites means we can get an unimaginably large number of stories cheaply and quickly.
This boom is not just in fiction. There has been growth in factual stories too: TED talks (and even better PowerPoint presentations), along with 'scripted reality' shows such as Love Island.
New technology, same methods
Yet with all this technological change, there has been little innovation in how books and screenplays are produced since the development of writers rooms in the USA in the mid-20th century.
I'm generalising massively, but stories are still created by either:
A solo writer or a pair of writers (rarely more) who write an original idea or are commissioned to do so by others, such as a production company
A writers room, a group of writers who discuss ideas and then go off and write a script by themselves, almost always for a TV or radio show rather than a film
Even with the input of editors and showrunners, the bulk of the writing and fleshing out is done by one or at most two people, often working in a waterfall method.
That is, writers go away and write a couple of drafts, get some feedback, and the work is either adjusted or shelved. This feedback can come from an editor, agent, producer or a creative writing group.
For professionals and amateurs this process is broadly the same: the writer acts in isolation from feedback for much of the time.
The problem with waterfall
This means that for professional and amateur alike, feedback usually comes at the final stage. That's fair enough: it takes time and effort to go through someone's writing, and reviewers want to see something finished.
So feedback is often given towards the end, once the bulk of the story has been completed.
When a lot of feedback is given at once, you have to pick and choose the most important points. They could be about the character, the structure, the writing, the plot, the ending or a key scene. And if there are a lot of things to fix, you risk being overly negative by raising them all at once.
Even getting quality criticism can be tricky, which is why there are many services offering script and novel feedback for writers. But there's no guarantee they can fix all the problems, and for new writers selling a story, agents and producers want something ready to go, prêt à vendre, not something they have to work on.
This means that unknown writers with a story that has the core of a great idea but too many flaws, no matter how fixable, have little chance of a sale.
Better writing through better processes
One innovation brought in when the Government Digital Service was launched was that teams would use Agile project methodology for all projects, including writing content for webpages.
It was a revelation.
Joining an Agile writing team helped in several ways. First, the focus was on the audience and what each page needed to tell them: breaking each page down into its structure, reviewing it with colleagues and subject matter experts, writing it, then repeating the review-and-write cycle.
While a single writer could have got something live quicker than the team, the page that did go live had a consistency and quality that meant it would stand longer and serve more users.
So if it works for web content why not creative work?
Agile creative writing
My first step was to test this as a concept. I created the Agile Storytellers Meetup in London to try out the methods while teaching writers about Agile.
My key learning from this was to hold a retrospective at the end of each session to iterate on what works (or doesn't).
Combining these learnings with user research and conversations with people in the industry, I developed some principles:
Quality sells: contacts can get you in to an interview, but without quality there's no point
Make it a page turner: if you don't make it interesting, no one will want to read it
Work with others and their feedback, but have a clear vision of what you want: feedback is useful but must be in the context of your lodestar, your vision of what you're aiming to achieve
So I advertised for writers to join me in putting this into practice. And that's how we're working on a screenplay and a book as a writers room.
Next time: the problems with team writing and ways to fix them.
How can a light comedy idea about a weatherman getting accurate weather predictions by fax machine (of all devices) become a Cold War industrial thriller about sex discrimination, set in Antarctica? Through the power of Agile, of course.
At the latest Agile Storytellers session we focused on Agile brainstorming and idea-refining techniques to make ideas good enough to proceed with.
So this was a two-part operation: not just coming up with ideas, but using Agile methods to focus on getting results quickly. And we did it using loglines.
Out of many, one idea
Loglines are one-line summaries of a film's plot. Examples include:
'A New York cop in LA to reconcile with his wife must save her when her building is taken over by terrorists.' (Die Hard)
'The youngest son of a Mafia don is reluctantly pulled into the family business when he must avenge an attempt on his father's life.' (The Godfather)
We had a lucky dip of printouts of different loglines found on the internet, each of us drawing about half a dozen and then putting forward the one or two we thought best from our selection.
We then held a simple version of forced ranking, an Agile method of making people form an opinion on things they hadn't considered.
The first logline we laid down set our middle rank, and the other ideas were then placed as better or worse in relation to it. We then reached the top two:
Logline A: 'After discovering a fax machine that can send and receive messages one day into the future, an impossibly inaccurate weatherman struggles for career advancement while trying to maintain the space/time continuum.'
Logline B: 'Two gay men from San Francisco move to a small Wisconsin town to open a sushi dance club.'
Deciding on and refining an idea
Both loglines had an equal number of supporters in our vote. Taking inspiration from Six Thinking Hats, we looked beyond our initial feelings. While we thought the sushi club sounded fun, we didn't know enough about being gay men in San Francisco and/or Wisconsin, nor about sushi or dancing, to make a story that didn't rely on stereotypes and assumptions.
We then took our chosen logline as our draft vision statement. This meant it needed to be unambiguous, clear, fit with our values, be realistic, and short.
How to do this? First we thought about the questions and ambiguities the statement prompted. We wrote each question on a sticky note, then reviewed and clustered the questions, deciding on and labelling the groupings as:
The character
The rules
The setting
Now, we could have planned these groupings in advance, as they are fairly standard across stories, but it was good to see them come about organically.
Everyone has ideas. Everyone
Now it was time to get ideas on how to flesh out the story from these questions. But not everyone said that they had ideas. They were wrong.
The idea ball (a roll of tape in this case) was thrown around the group. Every time the ball was received, the holder had to come up with a suggestion for one of the 3 groupings or else pass. The ideas were noted.
Each idea could be independent of what went before, and the aim was to generate ideas, not to critique or question previous ones too much (although we did slide into that sometimes).
By the end, despite initial protestations of being bereft of ideas, we had a rough idea of the character, the where and when of the setting, and the rules of the world.
Being led by ideas, not forcing them
It was near the end that the rule about the fax (which had generated the most queries in the sticky-note section) changed: from a magical fax from the future to a regular fax, but with a message picked up by someone who shouldn't have.
In part this was because we kept asking how the fax worked, what the timeframe of its predictions was, and what the protagonist could do to solve the problem. Seeing as we related it to climate change, fixing it in a day was unrealistic, to put it mildly.
So we asked: where would climate be most important, the most visual place? After debate we decided on Antarctica, and once we did, the ideas flowed.
That the protagonist would be locked up at some point and have to escape, that something big had to happen (a glacier collapse). That it had to be man-made so that a man could stop it.
But then we asked: why a man? Why not a woman, particularly as most of the group at the meetup were women?
So why was she in Antarctica? To prove something? And while sexism is certainly not vanquished today, the fax as a sole means of communication, coupled with a more sexist time, seemed appropriate.
Short time, many ideas
By now time was catching up on us and we still lacked a story, though we had ideas and a protagonist.
With 'pass the card' we each wrote an idea for one topic, then passed it on to be added to by the next participant. When read out at the end, the cards were modified somewhat, but we ultimately had a rough spine of a story and its key players.
Pass the card
But a story needs its memorable moments. So we took a sheet of paper each, divided it into eight and each drew a key scene or sequence: crazy eights.
Crazy eights ideas by one of the more artistic members
An MVP output
Once we had shared them we cherry-picked the ones we liked. And behold, we now had a minimum viable product (MVP), or minimum viable story, as an output:
a setting: Antarctica during the Falklands War, since faxes are key and the war is a reason the base may be even more cut off
a big idea: what if someone found a message they shouldn't have, was trapped with the bad guys and isolated from help by thousands of miles
a protagonist: a female meteorologist who has something to prove (yes, this is still fairly 2D, but better than before)
an antagonist: the corporation that wants to carry out a mining test that could fracture an ice shelf (again 2D, but it has a motive)
a ticking clock: the test that will splinter off a glacier, causing flooding and other damage
a series of key events: finding the fax, the entrapment, the escape, the discovery of one of Scott's old supply bases just when all seems lost, the climax (sorry, you had to be there)
Less artistic ideas by myself
Summary and lessons learnt
So in the space of 2 hours we went from a pool of wildly different ideas to a story that had only the word 'fax' in common with the logline we started from.
We were proud of how much we got done in such a short time. It wasn't perfect, but it was a lot more than the zero we had 2 hours prior.
As usual we ended with a retrospective to find out what worked and what didn't work.
Overall the team liked taking a few ideas and building from there, the collaboration and how we got different points of view yet agreed on an outcome.
The team felt they learnt about listening, sharing and expressing ideas, and about building on them.
But the venue didn't score as well. We were in Queen Elizabeth Hall on the South Bank and while the staff and bar were lovely, we did get a few interruptions for spare change and had neighbours who disturbed us.
This was a pity as the last venue, WeWork, was seen as too formal. So the hunt for the Perfect Venue (R) continues.
Carrying out user research across the public sector is not the same as carrying it out with members of the public. That at least has been my experience of carrying out half a dozen different civil service-focused Discoveries.
The first thing I do for all projects is meet the team and host a research question workshop. Where this differed from other workshops is how we thought about users.
When reviewing civil servant personas from previous research, we found that some were little more than job titles. So we did two things.
First, we adapted the job titles into roles to reflect that users across different teams may have the same fundamental duties and needs but different titles. This allowed us to see patterns and groups.
Second, we wrote our potential users and their stories not just as 'As a…' but as 'As a… who…' (eg 'as an assistant who is in charge of a team's room bookings').
This helped us to really narrow down who our users were. It also helped us resolve a debate in one Discovery about who the end users of a service were and who our chief users were; to our surprise, they were not the same.
We had a fairly clear idea about the end users, but for the Discovery we determined it was more important to know who would implement and make decisions about the proposed service, and what their needs were.
Cross-government help
What really saved time was posting what I was working on to the cross-government user research Slack channel and mailing list.
While my team had contacts, other user researchers put me in touch with their teams when relevant. In some cases they even had previous user research I could look at: with room bookings, for example, I had 3 different previous projects I could study and borrow from.
As you may be aware, government has a lot of meetings and forums and groups. Going along and inviting myself to relevant meetings helped in multiple ways: I got research from the meetings; I got contacts; and I got people to spread the word about what I was doing.
The tricky bits of civil service research
User research is mostly for getting information from users, but on my projects the civil servants I spoke to expected more from interviews, particularly if a team member was present.
Some interviews did get bogged down when team members wanted to defend or explain why a problem the user mentioned existed, and that's not the aim of an interview. A decision has to be taken on the value of having a team member take part in research and on how to control the research session.
Confidentiality was also a concern. Itâs hard to be truly frank as a user if the person who designed the system youâre criticising is in the same room.
One strategy was to allow time at the end for Q&As between the team and users, and to shut the session down if it went too far off-topic.
This was even trickier in workshops, yet one reason we could get so many participants to attend was that our team of experts would be there. It was a question of balancing our desire to get information against the fact that professionals were giving up their time and so expected something in return.
Participants were also keen to know the next steps. Asking product managers to vow to blog at the end of the Discovery, Alpha or Beta meant I could tell users that there'd be a digest of learnings, and I invited many to the final Show and Tell.
What I learnt
We don't share enough with other user researchers. And a lot of user researchers across government have worked on similar projects with similar problems and needs.
Contacting others is easier thanks to Slack, public blogs, meetups and so on, but it requires more chasing and more channels to monitor. Combine this with projects using the same users and there can be research fatigue for participants.
Some blockers were technical and unique to the civil service. GDS doesn't have .gsi in its email addresses (a 'government secure intranet' marker that is being phased out anyway in favour of better cyber security behaviour) and lacked a landline. For some not up to speed with the latest policies this was a red flag, and one response told me I 'couldn't be trusted'.
With so many departments and agencies (despite decentralisation), along with local authorities, it can be tempting to stay in London and its surrounding area to meet users.
Yet bursting the London bubble and travelling the country was essential.
Hangouts, Appear.in and other remote tools are great, but opportunities to observe other working environments were essential to get a proper view of the work. Users were often keen to meet someone willing to come and see them, so the sessions were positive for all.
Overall it's been an enjoyable experience. Civil servants are not just users, they're people too. Shocking, I know.
Researching in government is rewarding as you have experts in their fields and they love talking about their work. Even those who are unhappy usually end their interviews with 'sorry for the rant' despite having given you reams of information.
And you hope that by the end of the project your team will have had insights and findings that will help a range of talented people across the country do a better job and so help the public.
Note: this was originally written for the GDS blog but due to team changes got lost to the aether.
Now that this machine analysis of what makes a good and bad book is complete, what does it actually mean for writers?
I started this analysis back in May. Actually it began far before then, when the original Success with Style paper was published in 2014; it took me that long to realise I needed help with the analysis, even after I got my R qualification.
And when each round of analysis raised more questions, something I expected to take a month end-to-end became 3 months. Even now there is more I could do, but I have done enough to call it a day.
Success with Style: a recap
If you've not read the other parts (and they can be quite stats heavy), this series was prompted by a 2014 paper that claimed to be able to say what makes a good book, Success with Style. However, my reading of it found some flaws, and it was unclear how the original authors created their experiment.
I took 758 books from the Project Gutenberg Library in 8 genres (Adventure, Detective/mystery, Fiction, Historical fiction, Love-story, Poetry, Sci-fi and Short-stories), with half of them deemed successes (more than 30 downloads in the past month) and the rest unsuccessful (failures). I then put these through a variety of analyses:
readability measures, such as the Flesch-Kincaid score
the Stanford Tagger, which uses the Penn treebank to analyse PoS (parts of speech)
the LIWC, which also analyses PoS
The latter two, the Penn and LIWC PoS analyses, split all the words in the books into different categories, and do so in slightly different ways.
I then repeated these analyses in slightly different ways: first using 2018 download data (with 42 books changing their success/fail category) and then analysing just the first 3,000 words, on the principle that it is often only the first chapter that agents or publishers review when considering a book.
Steps taken in the analysis of the books
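For the 3,000-word runs, each book was cut down before tagging. As a minimal sketch of that step (the file names are hypothetical; my actual pipeline differed in the details):

```r
# Take the first 3,000 words of a book before sending it to the taggers.
# "book.txt" is a hypothetical plain-text Gutenberg file.
text <- paste(readLines("book.txt", warn = FALSE), collapse = " ")

# Split on whitespace and keep the first 3,000 word tokens
words <- strsplit(text, "\\s+")[[1]]
first_3k <- paste(head(words, 3000), collapse = " ")

writeLines(first_3k, "book_3k.txt")  # ready for the Penn/LIWC taggers
```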
In all tests I was looking for statistical significance. There's a good overview of what this is on the Harvard Business Review, but in summary it is a test of whether the results are likely due to chance or whether there is an underlying reason for them, rather than the luck of the draw.
The p-value threshold used to determine significance in all tests was 0.05, which is a fairly standard choice (note that this may have been too high – see the end of this page). Without reporting statistical significance it's hard to say whether your test means something or whether the result was just the luck of the data you drew.
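To make that concrete, here is a minimal R sketch of the kind of two-sample t-test used throughout this series; the numbers are invented for illustration:

```r
# Hypothetical proportions of a PoS tag (eg adverbs) per book in each group
successful   <- c(0.042, 0.038, 0.051, 0.045, 0.040)
unsuccessful <- c(0.055, 0.061, 0.049, 0.058, 0.064)

# Welch two-sample t-test (R's default) for a difference in means
result <- t.test(successful, unsuccessful)
result$p.value  # below the 0.05 threshold counts as significant here
```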
If you want to look at the analyses in depth, read the earlier parts of this series. In summary:
analysing an entire book is more accurate than just its first 3,000 words — it’s hard to judge a book on its first chapter
the genre affects the significance of tests and not all genres are as easy to predict as others — science-fiction has the most exceptions to the principles
books that are emotional (lots of either positive or negative emotions) tend to be unsuccessful
adjectives and adverbs predominate in poorly performing books and nouns and verbs in successful ones but are not a significant determiner of success or failure
don’t talk too much — dialogue heavy books were more unsuccessful
readability (as a computer measurement) is not generally a significant determiner of success, but don’t use too many long words in your writing. More successful books are slightly harder to read (have a higher readability) but are still able to be understood by a 15-year-old
these rough criteria generally stood up even when the success/fail criteria changed over time, meaning there is some underlying value in them
the LIWC is more accurate than the Penn treebank for predicting the success of a book
including punctuation in the analysis leads to better machine learning prediction performance
What these findings mean for writers
Caveats
First, I'm not about to say that there are rules for writing. At best, writers such as George Orwell* or Robert McKee have laid out principles, not rules (*Orwell calls his 'rules', but his last one is to break them when needed). This analysis is not meant to create a set of rules.
Secondly, as with many experiments, it is dangerous to extrapolate beyond the original dataset. The 758 books in the Gutenberg dataset are all out of copyright and so are mostly from the 1850s to the early 20th century. The oldest author was Dante (born 1265) and the most recent Samuel Vaknin, born in 1961 (Gutenberg only gives an author's birth and death dates, not publication dates). Many are also well known as classics, such as Robinson Crusoe, and so may have a built-in bias towards being downloaded due to name recognition.
Machine analysis is not a perfect tool. Even tools such as the LIWC, which is updated regularly (Penn's was mainly carried out between 1989 and 1996), still cannot accurately tell the difference in context of a word such as 'execute': whether it's a plan or Ned Stark being executed.
Finally, I didn't clean my data: I didn't remove common words or check for errors in the Gutenberg transcriptions. This isn't essential, but a cleaned-up dataset may have produced slightly different results.
The first chapter is a bad guide for overall success
Machine analysis of the success of a book failed when making a judgment solely on the first 3,000 words. At around 55%, its machine learning performance was only marginally better than a 50/50 guess for both the Penn treebank with readability and the LIWC analyses.
PoS analysis: accuracy (95% confidence interval)
Penn & Readability 2013 (complete book): 65.62% (57.7-72.9%)
Penn & Readability 2018 (complete book): 65.00% (57.5-72.8%)
Penn & Readability, first 3,000 words: 55.62% (47.6-63.5%)
LIWC 2013 (complete book): 75.00% (67.6-81.5%)
LIWC 2018 (complete book): 71.70% (64.0-78.6%)
LIWC, first 3,000 words: 56.25% (48.2-64.0%)
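The confidence intervals in the table are the standard binomial intervals around an accuracy figure. A sketch of how to reproduce one in R, using invented prediction counts close to the LIWC 2013 row:

```r
# Hypothetical: 120 books classified correctly out of a 160-book test set
correct <- 120
total   <- 160

correct / total                      # 75% accuracy
binom.test(correct, total)$conf.int  # roughly 0.676 to 0.815, ie 67.6-81.5%
```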
The analysis of the first chapter did produce significant results for some of the same tests as the full-book analysis. However, assuming analysing the complete book is the 'truer' test due to its better machine learning performance, the first chapter isn't as valuable a basis for analysis as the whole book.
This means that machine analysis using the Penn treebank, readability or LIWC categories is not suitable for agencies, publishers or other services that review a book based on one sample chapter.
But do human readers for agencies react the same way as a machine? Looking at sites such as QueryShark, professional readers start with the cover letter or email, looking for things such as who the protagonist is and what choices they face. QueryShark, for example, won't even request the first chapter until they've read a query email.
One experiment would be to put sample chapters of successful and unsuccessful books in front of professional agency readers to get their view, but that's an experiment for another day.
Don’t be overly emotional
Overly emotional books perform poorly, whether the emotion is negative or positive. The only emotional category commonly seen in successful books was anger.
That’s not to say that emotion shouldn’t be included but that it should not overwhelm writing. This includes both the dialogue and the action.
This applied to all genres except Adventure, and even there the positive effect was small compared with the overwhelmingly strong net difference in unsuccessful books.
This ties in with writing tips on avoiding melodrama: show characters' reactions and details rather than spelling the emotion out:
Remember that the drama doesn't have to be all the way at eleven in order to affect the reader. Readers get into the little aspects of people's lives, too.
Unfortunately, many writers make the mistake of assuming that to be gripping, emotion must be dramatic. Sad people should burst into tears. Joyful characters must express their glee by jumping up and down. This kind of writing results in melodrama, which leads to a sense of disbelief in the reader because, in real life, emotion isn't always so demonstrative.
The emotional tags in the LIWC results supported this (Penn doesn't offer emotion as a tag): unsuccessful books overwhelmingly dominate the emotions:
Emotion PoS in the LIWC analysis – negative results are PoS tags more common in unsuccessful books and positive results are for successful books. 'Affect' includes emotions and other affective processes, 'posemo' is positive emotion and 'negemo' negative emotion.
T-test for significance using LIWC results for Tone (both positive and negative emotions). It is significant (p<0.05) for all genres except Historical fiction and Sci-fi. The figures at the top are the p-values; you can find out more on how to interpret boxplots.
Make it readable — but don’t worry too much
Although the Flesch-Kincaid readability test was significant, with unsuccessful books scoring slightly lower (roughly one school year), I do not think the difference was great enough to matter.
T-test (a standard statistical test for significance) for mean words per sentence, mean syllables per word and Flesch-Kincaid readability (FR) – both mean syllables per word and readability are statistically significant as p<0.05.
Readability by genre. Readability is significant for Adventure, Detective/mystery and Love-story. Note how Sci-fi's plots are noticeably different to the other genres.
I also looked at the LIWC tests related to readability: the proportion of six-letter or longer words (ie long words) and of dictionary words.
Looking at the overall rating, the proportion of six-letter or longer words and the mean words per sentence were flagged as significant.
Overall: make it readable without too many long words, but don't worry too much about the specifics.
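For reference, the Flesch-Kincaid grade level is a simple formula over the two means tested above; a quick R sketch with illustrative numbers:

```r
# Flesch-Kincaid grade level from mean sentence length and syllables per word
fk_grade <- function(words_per_sentence, syllables_per_word) {
  0.39 * words_per_sentence + 11.8 * syllables_per_word - 15.59
}

fk_grade(15, 1.45)  # about grade 7.4; higher scores mean harder reading
```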
Avoiding adjectives isn’t the best advice
This very brief University of Pennsylvania study of a few books and their contents suggested that adjectives and adverbs predominate in badly written books while good books have a higher proportion of nouns and verbs.
These 2 charts suggest at first glance that this is the case.
Adjectives (jj), adverbs (rb), nouns (nn) and verbs (vb) difference in proportion from Penn results — as before, positive results are from successful books and negative results from unsuccessful books
LIWC results for adjectives (adj) and adverbs
Yet while adjectives were the PoS with the greatest relative importance in the Penn test of the original data, this was not repeated in the 2018 data nor in the LIWC tests.
Likewise, while adjectives and adverbs dominate unsuccessful books (ie show negative plots) in most genres, this isn't always the case. And the difference is small compared with noun dominance, which again has mixed results across the genres.
Finally, I carried out a fresh t-test (a common test to find significance) for adjectives, adverbs, nouns and verbs overall, and for adjectives per genre:
T-test with p-values for adjectives, adverbs, nouns and verbs. No p-value is lower than 0.05, so none is statistically significant.
T-test for adjectives per genre (Penn PoS). Again, none is statistically significant.
The above charts do show that successful books have a lower proportion of adjectives and adverbs. However, contrary to the University of Pennsylvania column, successful books also have a lower proportion of nouns and verbs.
How can usage-book writers have failed to notice that good writers use plenty of adverbs? One guess is that they are overlooking many: much, quite, rather and very are common adverbs, but they do not jump out as adverbs in the way that words ending with -ly do. A better piece of advice than 'Don't use adverbs' would be to consider replacing verbs that are combined with the likes of quickly, quietly, excitedly by verbs that include those meanings (race, tiptoe, rush) instead.
And for those advocating verbs instead, he adds:
It is hard to write without verbs. So âuse verbsâ is not really good advice either, since writers have to use verbs, and trying to add extra ones would not turn out well
Not one of these results is statistically significant.
I suspect that the differences in POS distributions are a symptom, not a cause, and that attempts to improve writing by using more verbs and adverbs would generally make things worse. But still.
What this means for writers, then: avoid them if you like, but don't worry too much.
Don’t talk too much
Too much dialogue, as indicated by quotation marks (“), was a sign of an unsuccessful book in all genres except for short stories.
The LIWC results show that quotation marks ('quote') appear in a higher proportion in unsuccessful books, with the exception of short stories. In poorly performing Adventure books they account for nearly 1% of all tags.
T-test for the statistical significance of quotation marks by genre. They are significant in Adventure, Detective/mystery, Fiction and Love stories as p<0.05. Unsuccessful books tend to have a higher proportion than successful ones. Successful sci-fi books have a very large range for the proportion, again showing how this genre has its own rules.
For writers this most likely means that focusing on dialogue at the expense of description is not popular with readers.
Note that the quote mark proportion is a very rough approximation — it doesn’t allow for long paragraphs of dialogue, nor look at books with no quote marks and what their pattern is.
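As an illustration of how rough that approximation is, counting quote marks as a share of word tokens in R might look like this (the file name is hypothetical):

```r
# Rough sketch: quotation marks as a proportion of all word tokens.
# It counts the marks, not the length of the dialogue between them.
text <- paste(readLines("book.txt", warn = FALSE), collapse = " ")

quote_marks <- lengths(regmatches(text, gregexpr('"', text)))
tokens      <- lengths(regmatches(text, gregexpr("\\S+", text)))

quote_marks / tokens  # approached 1% in poorly performing Adventure books
```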
Genre shapes the rules — and science-fiction breaks them
One thing that was consistent in all analyses was that analysing by genre showed variation within the overall finding for the category. For readability, for example, it was only significant for the Adventure, Detective/mystery and Love story genres. Likewise in the LIWC tests there was no test category which produced a significant result across all genres.
This makes sense — after all, poetry was included in the analysis, and it’s hard to see a reader applying the same mental rules of what they count as good to a poem as to a science-fiction novel. Even short stories have slightly different writing ‘rules’ — the quotation proportion, for instance, shows that short stories can be more dialogue heavy than full length novels.
What is interesting is how some genres live up to their stereotypes. In the LIWC tests, for example, Clout "refers to the relative social status, confidence, or leadership that people display" and was defined through analysis of speech. It was significant for the Adventure, Detective-mystery and Fiction genres; in my imagining of the archetypes, detectives and adventure heroes do have a certain clout or leadership.
Difference in Clout (scaled and normalised) between more and less successful books. The boxes show the range of the mid 50% of results and the line the median, with successful sci-fi having the largest range.
Similarly, the readability tests were not significant for science-fiction or poetry. Again, I don't think a poem is judged by its readability, nor, in my own experience, is poor readability or writing style a hindrance to sci-fi.
Readability by genre (scaled and normalised) — see how different science-fiction’s boxplots are and how similar the medians (lines) are for success and failure.
One thought is that if the idea or story is gripping, a science fiction novel has a greater chance of success regardless. More importantly, science fiction books often use long scientific terms or made-up words, which fare badly in readability analysis tools. I know I've read enough sci-fi novels with long pseudo-scientific terms, long sentences and wooden characters where I persevered because I found the concept interesting.
Readers of the science fiction story 'appear to have expected an overall simpler story to comprehend, an expectation that overrode the actual qualities of the story itself', so 'the science fiction setting triggered poorer overall reading'.
Whether this is the cause or the effect of science-fiction having its own rules in this study is not clear.
Summary of principles: follow good writing practice
What then has this study revealed? First, that one old saw about writing isn’t quite right. While adverbs and adjectives tend to predominate in unsuccessful books, they aren’t statistically significant.
This means that the higher proportion of adverbs and adjectives (and nouns and verbs) in unsuccessful books in this set of results doesn't mean much, and the same likely applies to other results making the same claim.
Personally I prefer writing with fewer adjectives and adverbs, but they're not a hindrance. Children's books in particular have more of them: Harry Potter and his chums tend to say things "severely", "furiously", "loftily" and so on, but the series is an undoubted hit with readers.
So the next time someone tells you that you must use fewer adverbs or to swiftly remove unnecessary adjectives, you can tell them "no, it's not statistically significant". And they'll love you for it (this advice is not statistically significant).
The next outcome is: know your genre. By splitting the books into a wide range of genres, not just fiction but poetry, love stories and science-fiction among others, we saw that the so-called rules varied.
Sci-fi in particular was the exception to many of the findings. The tests I ran cannot say why, but we can speculate. One thought is that the audience for sci-fi is likely to be quite different from that for poetry, love stories and even regular fiction. Hard research on this is difficult to find, but sci-fi audiences do seem different from other readers, and this quote from a survey of sci-fi readers suggests they may value world building over other considerations:
The creativity that goes into world building and bringing ‘otherworldly’ characters to life in a way that we can identify with.
It may also be that the theme or subject of the books (something the Penn and LIWC analyses cannot work out) is gripping enough that readers ignore or overlook what would otherwise be considered weak writing.
Don't talk too much: readers want more than just dialogue. Unlike films, where heavy exposition is unwelcome, in books it seems readers prefer stories with a good balance of description to talking.
Read more than the first chapter to get a true sense of a book, although I can't yet say how many words give the best approximation.
Finally, don't be overly emotional, either too positive or too negative, in your writing. This suggests the old 'show, don't tell' writing saw is true: rather than telling us that someone is angry (and using that word), show their reaction, using nouns and verbs (and yes, adverbs if you must).
Comparison with the original Success with Style findings
Successful books tended to feature more nouns and adjectives, as well as a disproportionate use of the words 'and' and 'but', when compared with less successful titles.
But my tests found that the proportion of adverbs, adjectives, nouns or verbs wasn’t statistically significant.
The most popular books also featured more verbs relating to 'thought-processing' such as 'recognised' and 'remembered'.
T-test for significance using LIWC results for 'cogproc', the category for cognitive processes, which includes thought processing. Again genre varies the results, but it is statistically significant for Detective/mystery, Fiction, Love stories, Poetry and Short stories
This is statistically significant for most, but not all, genres so is something we agree on.
Verbs that serve the purpose of quotes and reports, for example the words 'say' and 'said', were heavily featured throughout the bestsellers.
My tests found the exact opposite: it was statistically significant in most genres that books with more quotes did worse. Now, writers are told to use only the word 'said' for dialogue tags, so it may be that bestsellers follow this while poorer writers use other terms, which would explain why successful books have the higher 'said' proportion. But a quick search for that schoolboy favourite, 'ejaculated' (as in to speak suddenly or sharply), found it in around half of all successful books (99 out of 206), so it's another reason to doubt this finding.
I didn't look for specific words, though there are tools to do so if you wish. However, my results did say that overly emotional books do worse, and that does tie in with 'love', 'breathless' and 'risk'.
Poor-selling books also favoured the use of 'explicitly descriptive verbs of actions and emotions' such as 'wanted', 'took', 'promised', 'cried' and 'cheered'. Books that made explicit reference to body parts also scored poorly.
This was true but the difference was small and not statistically significant for verbs. For emotions though it was significant in most genres (except, of course, sci-fi, that malcontent genre).
So of the original findings, only 2 were fully agreed with in this study, and one partially.
Final thought…
Throughout all this, at the risk of being melodramatic, be true to yourself and write for yourself. This analysis gives pointers on the signs of a bad book ('unsuccessful' in the more diplomatic description), but that doesn't mean you must slavishly follow these principles.
Write in the way you're comfortable with and for the reasons you want. Just don't be overly dramatic about it.
Ben was also concerned that the p-value threshold of 0.05 was too high and may need to be 0.01 or lower. This is because if we test 20 variables there is already a 1 in 20 chance that one of them will appear significant – and 1/20 is 0.05. I did run the tests separately each time, but in the code there is a chance that I merged and analysed too much per test. This could also be a reason why the original research lacked statistical significance tests.
I did run the tests by testing values separately as well as all together, but I admit that I don’t have years of stats experience under my belt (unlike Ben who knows his stuff) and may have overlooked some things. My code is on GitHub so anyone willing to check is welcome to review and amend. The conclusion is that the results are probably sound but the statistical significance may not be right.
However, even if the significance results are wrong, he suspects it is more than likely that the resulting charts and the differences in positions are still broadly correct, which is why I have left the information as it stands.
Try it for yourself, fork it if you disagree and use it for your own amusement.
I wish I'd double- then triple-checked the data. The source data I produced a couple of years ago had some errors (mainly around column sorts not capturing all columns). Thanks, Excel.
I should have asked the original authors for their methods. It was an interesting exercise to try to repeat their work, but there would have been no harm in asking
I should have made R do more of the hard work around producing images and other things I could have automated better
I wish I'd learnt about GutenbergR earlier, to download different books, eliminate poetry (not of interest to me) and replace it with another genre
Next time I would:
get more books and genres (using GutenbergR; see the sketch after this list)
focus on fewer tests and review the tests I use
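For anyone tempted by that first point, here's a minimal sketch of what the gutenbergr package offers; the subject filter is illustrative, as real Gutenberg subject headings are messier:

```r
library(gutenbergr)
library(dplyr)

# Find science fiction titles via the bundled subject metadata
scifi_ids <- gutenberg_subjects %>%
  filter(subject_type == "lcsh", grepl("Science fiction", subject)) %>%
  pull(gutenberg_id)

# Download a handful of full texts: one row per line of text per book
books <- gutenberg_download(head(scifi_ids, 5))
```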
For anyone looking to do their own experiments
Use the LIWC for machine analysis
The LIWC not only gave a better machine learning performance, its own categories and tags also gave a better range of significant results than the Penn treebank. Generally I found the tone (emotion) measure the most interesting, as it assigned human feelings to the tags, rather than just categorising grammatically (or is it linguistically?) what kind of word something is.
Ultimately this study has been about what this means for readers and writers and that’s why the emotion is of most relevance to me.
That, and the fact that the LIWC is still being researched and updated (most recently in 2015, compared with Penn's from 1996), means it has the potential for a longer shelf life. I also found it easier to use.
The main downside is that it is not free.
Don’t skip the punctuation or action
The most surprising result was how punctuation affected results. I had tried running experiments without it but the machine learning performance decreased by around 5 percentage points while the tag differences did not seem to change.
The LIWC results show that quotation marks ('quote') appear in a higher proportion in unsuccessful books, with the exception of short stories. In poorly performing Adventure books they account for nearly 1% of all tags.
User research is to design and product development as statistical significance is to data.
You can't be confident in figures if you haven't carried out significance tests. And you can't be confident in a design or product change if you haven't carried out user research.
Yet businesses that baulk at treating data as gospel without statistical significance tests will make product or design decisions without a jot of user research.
I've worked for organisations like this; perhaps you have too.
What is user research?
User research is many things, but in practical terms it's the tangible outcome of making your users, audience or customers the heart of what you do.
It's one thing for a company to tell us that customers are their 'number one priority'.
A company shows it by having user researchers who learn about their users: who they are, what they do, what they want, what they like and dislike, what influences them.
User researchers uncover these findings through interviews, observation, usability studies, surveys. Then they interpret and gather insight through multiple rounds of research.
Insight is the output: you find out who exactly your users are, and what their pain points and needs are. Insight is shared with the wider team (and the team should be joining in on research sessions too).
These findings sit within a goal. This can be a project, business or organisational goal, and how the product or service will best serve its users.
Let's say you've carried out A/B tests on two web pages and design B led to a 10% increase in goal completions.
Does this necessarily mean that design B is 'better'? You could have got lucky with a horde of spendthrift shoppers logging in together, or unlucky when the internet failed during design A's slot.
Newspapers and other everyday presentations of statistics typically omit statistical significance for simplicity. This is understandable, but statistics used in research and business must include it if the people using them want to understand their data. And if a business does not run these tests, why not?
Statistical significance, then, helps give you confidence (not certainty) that your findings are true. That your results weren't due to a lucky (or unlucky) sample or events.
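To make the A/B example concrete, here is a quick significance check in R with invented numbers:

```r
# Hypothetical A/B results: goal completions out of visitors for each design
completions <- c(a = 200, b = 220)
visitors    <- c(a = 2000, b = 2000)

# Two-sample test of proportions: is B's uplift more than luck?
prop.test(completions, visitors)$p.value
# Well above 0.05 here, so this sample alone shouldn't settle the argument
```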
How user research ≣ statistical significance
User research gives you confidence. Confidence that what you're seeing is an effect of the changes your team made and not due to chance. Confidence that audience tastes are changing or a competitor has emerged.
Confidence that when the CEO says they don't like something, you can push back because the users say otherwise. That you have research, not just opinions.
User research gives you confidence but never certainty. That's why research is an ongoing activity, much like how significance tests are carried out on each new result.
Caveats
A danger of statistical significance is that it can give the appearance of scientific certainty when none is there. For example, produce Google Analytics data with statistical significance and it will appear the more 'scientific' result.
Yet analytics are the outcome of people's behaviour and, unlike interviews, it's hard to follow up and probe why a user did what they did.
Finally, both statistical significance and user research need to state the practical significance. Both can say that there is an effect but both need to say what the practical outcome is. For example, whether the problem is a mere annoyance or one that prevents users completing their task.
User research > statistics?
Both statistical significance and user research give you confidence in your results. But good user research includes the user impact by default.
A key part of user research is that the whole team should join in, and so will expand their own knowledge. How often does the team join the web analyst and contribute to their research?
User research can probe and build understanding across a team in a way that statistics by itself finds hard to achieve.
And any company that wants to be serious about its development needs user research as much as it needs statistical tests on its data.
When starting this analysis I spotted that the download data was for the past 30 days and that this was used for success or fail categorisation.
Even if the data was for the lifetime of the book, it’s been nearly 5 years since the original downloads. The best way to test this then was to get the latest data (albeit still for the past 30 days).
The other thought was that the analyses looked at the entire book. But what if readers did not read the entire book but only read a certain amount before making a judgment? When submitting work to an agent or publisher for consideration, for example, often only the first chapter is requested. Based on this I analysed just the first 3,000 words of each book through the Penn and LIWC tagger and used its 2013 success/fail data to repeat the experiments.
Finally I noticed a bias towards punctuation as markers for success or failure in the output and ran the experiments without the punctuation tags to see what the result would be.
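I won't reproduce the full pipeline here, but as an illustrative sketch (the data frame and the classifier are assumptions for the example, not the original authors' exact method), dropping the punctuation tags before training might look like:

```r
library(randomForest)

# 'liwc' is an assumed data frame: one row per book, columns of LIWC tag
# proportions plus a 'success' factor; 'quote' and 'allpunc' carry punctuation
punct_cols <- c("quote", "allpunc")

with_punct    <- randomForest(success ~ ., data = liwc)
without_punct <- randomForest(success ~ ., data = liwc[, !(names(liwc) %in% punct_cols)])

# Compare the out-of-bag error rates printed for each model; in my runs,
# dropping punctuation cost around 5 percentage points of accuracy
with_punct
without_punct
```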
Starting hypotheses
H0: there is no difference in the tests which produce significant results between the 2013 and 2018 data
HA: there is a difference in the tests which produce significant results between the 2013 and 2018 data
H0: there is no difference in the tests which produce significant results between the full machine analysis of the book and that of just the first 3,000 words
HB: there is a difference in the tests which produce significant results between the full machine analysis of the book and that of just the first 3,000 words
The hypotheses are fairly simple – if there is no difference in the 2018 data, then most of the tests that proved significant with the 2013 data should also prove significant in 2018.
Likewise, if cutting to the first 3,000 words makes no difference, those tests should be significant at the same level.
3,000 words (3k) is about 10 pages, roughly one chapter's length, although of course there is no hard and fast rule about how long a chapter is.
Data used
Data summary
2018 data download date: 2018-07-22
2013 data download date: 2013-10-23
Unique books used: 758
Difference in 2013 and 2018 success rates
Changed to FAILURE: 22 (Adventure 5, Detective/mystery 3, Fiction 2, Historical-fiction 1, Love-story 1, Poetry 8, Short-stories 2)
Changed to SUCCESS: 20 (Adventure 3, Detective/mystery 4, Fiction 1, Historical-fiction 4, Love-story 3, Sci-fi 5)
Total changed: 42
There were 758 unique books (the remaining 42 of the 800 listed appeared in multiple categories). With 42 books changing status, that is 5.5% of the total used; none of the books with a different success status was listed in multiple categories.
The new data was parsed through the Perl Lingua tagger (using the Penn treebank), the Perl readability measure and the LIWC tagger.
Results for 2013, 2018 and 3,000 word data
Machine learning performance
The most important measure for me is which tagger is best for making predictions.
Using all tags including punctuation: accuracy (95% confidence interval), sensitivity, specificity
Readability 2013: 65.62% (57.7-72.9%), sensitivity 69%, specificity 63%
Readability 2018: 65.00% (57.5-72.8%), sensitivity 68%, specificity 63%
Readability, first 3,000 words: 55.62% (47.6-63.5%), sensitivity 68%, specificity 44%
LIWC 2013: 75.00% (67.6-81.5%), sensitivity 76%, specificity 74%
LIWC 2018: 71.70% (64.0-78.6%), sensitivity 78%, specificity 66%
LIWC, first 3,000 words: 56.25% (48.2-64.0%), sensitivity 53%, specificity 60%
According to this, the LIWC is still the best tagger, and the 2013 and 2018 data are fairly similar for both readability and LIWC, with each result falling within the other's 95% confidence interval.
Both for readability and LIWC the first 3,000 words (3k) are much worse predictors of overall success and barely better than a 50/50 guess.
Difference in significance in key measures
Punctuation
Overall, omitting punctuation made little difference to which tags were significant in the LIWC or Penn analyses. The machine learning performances, however, all dropped by around 5 percentage points.
Readability
Readability was significant in the same genres across all three datasets (2013, 2018 and the first 3,000 words):
Significant: Adventure, Detective/mystery, Love-story
Not significant: Fiction, Historical-fiction, Poetry, Sci-fi, Short-stories
LIWC categories
Genres with significant results for each LIWC category, with the datasets in which the result was significant in brackets (genres not listed were not significant in any dataset):
Clout: Adventure (2013, 3k), Detective-mystery (2013, 2018), Fiction (2013, 2018)
Authenticity: Fiction (2013, 2018), Historical-fiction (3k only), Poetry (2013, 2018)
Analytical: Fiction (2013, 2018, 3k), Love-story (3k only)
6-letter words: Adventure (2013, 2018, 3k), Love-story (2013, 2018, 3k)
Dictionary words: Detective-mystery (2018, 3k), Fiction (2013, 2018), Historical-fiction (3k only), Love-story (3k only), Sci-fi (2013, 2018, 3k)
Tone: Detective-mystery (2013, 2018, 3k), Fiction (2013, 2018, 3k), Love-story (2013, 2018), Poetry (2013, 2018, 3k), Short-stories (2013, 2018, 3k)
Mean words per sentence: Adventure (2013, 2018, 3k), Fiction (2013, 2018), Short-stories (3k only)
Whereas readability was consistent across the different approaches, the LIWC categories show a lot more variety.
Tone was the most consistent throughout and, as last time, was significant in the most genres, even with the 3k data. Otherwise the 2013 and 2018 data tend to match (though not always, as with Clout and Dictionary words), while the 3,000-word data, well, does its own thing.
Parts of speech tags (PoS) with the largest difference
The tables list the top 3 PoS that dominate in successful and unsuccessful books.
Penn data
Top successful PoS (the same in 2013, 2018 and 3k): INN – preposition/conjunction; DET – determiner; NNS – noun, plural
Top unsuccessful PoS (the same three in all datasets, in varying order): PRP – determiner, possessive second; RB – adverb; VB – verb, infinitive
LIWC data
| Successful PoS 2013 | Successful PoS 2018 | Successful PoS 3k |
| --- | --- | --- |
| functional – Total function words | functional – Total function words | functional – Total function words |
| prep – Prepositions | prep – Prepositions | prep – Prepositions |
| article – Articles | space – Space | article – Articles |

| Unsuccessful PoS 2013 | Unsuccessful PoS 2018 | Unsuccessful PoS 3k |
| --- | --- | --- |
| quote – Quotation marks | allpunc – All punctuation | adj – Common adjectives |
| allpunc – All punctuation | affect – Affective processes | adverb – Common adverbs |
| affect – Affective processes | posemo – Positive emotion | affect – Affective processes |
In the Penn treebank analysis the same tags dominate successful books across all three data sets: prepositions (for, of, although, that), determiners (this, each, some) and plural nouns (women, books).
For unsuccessful books determiners also dominate, but in the possessive second person (mine, yours), along with adverbs (often, not, very, here) and infinitive verbs (take, live).
The LIWC picture is similar. Function words (it, to, no, very) dominate successful books, along with prepositions (to, with, above) and articles (a, an, the).
For unsuccessful books it is punctuation overall, quotation marks, social processes (mate, talk, they, including all family references) and affective processes (happy, cried), which include all emotional terms.
A high rate of quotation marks suggests a high ratio of dialogue to action and description.
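If you want a rough sense of this in your own texts, one crude proxy is the share of double-quote characters per book. A minimal base R sketch, where `texts` is a hypothetical named vector of full book texts:

```r
# Share of double-quote characters per text: a crude dialogue proxy.
# `texts` is a hypothetical named character vector of full book texts.
texts <- c(book_a = 'She said "hello" and "goodbye".',
           book_b = "The storm rolled in over the hills.")

quote_rate <- sapply(texts, function(x) {
  hits <- gregexpr('"', x, fixed = TRUE)[[1]]
  n_quotes <- if (hits[1] == -1) 0 else length(hits)
  n_quotes / nchar(x)
})
quote_rate  # higher values suggest more dialogue-heavy books
```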
What does this tell us?
2013 v 2018 data
Overall there is more similarity than difference between the 2013 and 2018 Penn and readability results. The machine learning performance was also broadly the same, with each result falling within the other's 95% confidence interval.
The top 3 successful PoS were largely the same, as were the top 3 unsuccessful ones.
Likewise the LIWC categories generally matched in significance for both the 2013 and 2018 data, and the successful and unsuccessful PoS were broadly the same.
This suggests that although the original authors didn't mention that their download data covered only the previous 30 days, their results have largely held up.
The first chapter
Just judging a book by its first 3,000 words was not as accurate as analysing the whole book. The machine learning performance was barely better than a guess.
However, the readability results did match, and the dominant successful PoS were similar to those in the full 2013 and 2018 data.
Of all the LIWC categories described in part 3, Tone was both the most significant predictor across genres and the most consistent across the different tests.
Summary
The 2018 results generally match the 2013 results, suggesting the original method still holds as a good predictor of a book's success or failure.
The first 3,000 words did not match the 2013 or 2018 data, and since its machine learning performance was the weakest, it does not appear to be an accurate way to predict a book's success. There may be a 'sweet spot' where the first x words correlate closely with the overall rating, but it is more than 3,000 words.
Successful books tend to use prepositions, determiners, nouns and function words. Unsuccessful ones skew towards quotation marks, punctuation and positive emotions (which in the LIWC are similar to affective processes).
This suggests that unsuccessful books may use shorter sentences (high punctuation rate), more dialogue (high quotation mark rate) and more adverbs, and are more emotional, particularly positive in emotion. Writing experts frequently tell writers to avoid adverbs wherever possible.
Successful books, by contrast, tend to focus on the action, describing scenes and situations, hence the dominance of function words, prepositions and articles. This makes them sound rather boring, but suggests these bread-and-butter words are necessary to build a good story.
The LIWC data suggests that tone is the most reliable predictor of success. What isn't answered is whether that is because tone predominates in successful or in unsuccessful books, and whether it is driven by positive or negative emotions. This is something to explore, though emotion and affect appearing in the top 3 tags of unsuccessful books suggests the answer lies there.
Punctuation tags had some use: machine learning performance was better with them, so even though they can be hard to interpret, they are worth including in any machine analysis. More work is needed to interpret them.
Last time we replicated the original Success with Style output and methods, despite the method not being fully documented, and got the data to broadly match. Now we are going to analyse the same text in a different way.
In part 1 we looked at the original experiment, and in part 2 we recreated it using the Penn treebank to analyse the text and its parts of speech (PoS). This time we'll use the same input data but process it through a different NLP analysis program: LIWC, a tool developed at the University of Texas. It has similarities to the Penn treebank in that it categorises words, with some similar categories, such as prepositions.
Hypotheses
H0: There's no difference in the proportion of LIWC categories in successful and unsuccessful books, regardless of genre
HA: There is a difference in the proportion of LIWC categories in successful and unsuccessful books, and the pattern will depend on genre
H0: There's no difference in the LIWC summary values of successful and unsuccessful books, regardless of the book's genre
HB: There is a difference in the LIWC summary values of successful and unsuccessful books, and the pattern will depend on genre
Method
The data, the measure of success and the method were the same as in part 1, along with the adjusted p-value (p<0.05 for significance) and the machine learning algorithm. Likewise, variables with many zeroes were not transformed.
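As a sketch of the core calculation, assuming a hypothetical `liwc_output.csv` with one row per book, a `status` column and one proportion column per LIWC category, the per-category mean differences in the tables below could be computed like this:

```r
library(dplyr)

# Hypothetical LIWC export: one row per book, `status` ("success"/"failure")
# plus one proportion column per LIWC category.
liwc <- read.csv("liwc_output.csv")

# Mean proportion of each category per group
means <- liwc %>%
  group_by(status) %>%
  summarise(across(where(is.numeric), mean))

# Successful minus unsuccessful: positive values dominate successful books
cats <- setdiff(names(means), "status")
diffs <- unlist(means[means$status == "success", cats] -
                means[means$status == "failure", cats])
sort(diffs, decreasing = TRUE)
```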
Difference in success
The R code generated a different set of tags from the original Penn analysis. You can find the LIWC definitions at the foot of this page.
Tags per genre
[Chart: LIWC difference in proportion, function to article – original data]
Overall biggest difference
| PoS (successful books) | Definition | Diff (largest first) | PoS (unsuccessful books) | Definition | Diff (largest first) |
| --- | --- | --- | --- | --- | --- |
| functional | Total function words | 0.003835 | quote | Quotation marks | -0.001814 |
| prep | Prepositions | 0.001758 | allpunc | All punctuation | -0.001350 |
| article | Articles | 0.001199 | affect | Affective processes | -0.001231 |
| ipron | Impersonal pronouns | 0.001198 | social | Social processes | -0.001181 |
| space | Space | 0.001155 | posemo | Positive emotion | -0.001103 |
| relativ | Relativity | 0.000860 | ppron | Personal pronouns | -0.001047 |
| number | Numbers | 0.000623 | apostro | Apostrophes | -0.000999 |
| focuspast | Past focus | 0.000463 | female | Female references | -0.000963 |
| power | Power | 0.000454 | focuspresent | Present focus | -0.000929 |
| cogproc | Cognitive processes | 0.000437 | shehe | 3rd pers singular | -0.000905 |
| period | Periods/full stops | 0.000403 | verb | Common verbs | -0.000642 |
| comma | Commas | 0.000379 | informal | Informal language | -0.000361 |
| differ | Differentiation | 0.000369 | exclam | Exclamation marks | -0.000323 |
| otherp | Other punctuation | 0.000318 | time | Time | -0.000319 |
| parenth | Parentheses (pairs) | 0.000266 | you | 2nd person | -0.000273 |
| conj | Conjunctions | 0.000266 | percept | Perceptual processes | -0.000236 |
| quant | Quantifiers | 0.000257 | affiliation | Affiliation | -0.000216 |
| semic | Semicolons | 0.000254 | focusfuture | Future focus | -0.000213 |
| interrog | Interrogatives | 0.000233 | sad | Sadness | -0.000202 |
| colon | Colons | 0.000225 | adj | Common adjectives | -0.000190 |
| work | Work | 0.000197 | family | Family | -0.000190 |
| drives | Drives | 0.000163 | nonflu | Nonfluencies | -0.000156 |
| pronoun | Total pronouns | 0.000154 | netspeak | Netspeak | -0.000154 |
| cause | Causation | 0.000136 | discrep | Discrepancy | -0.000140 |
| anger | Anger | 0.000131 | see | See | -0.000133 |
| we | 1st pers plural | 0.000130 | bio | Biological processes | -0.000130 |
| certain | Certainty | 0.000125 | i | 1st pers singular | -0.000121 |
| compare | Comparisons | 0.000125 | negemo | Negative emotion | -0.000111 |
| they | 3rd pers plural | 0.000122 | body | Body | -0.000104 |
| death | Death | 0.000101 | reward | Reward | -0.000098 |
| tentat | Tentative | 0.000078 | friend | Friends | -0.000088 |
| ingest | Ingestion | 0.000060 | risk | Risk | -0.000080 |
| home | Home | 0.000055 | negate | Negations | -0.000073 |
| achieve | Achievement | 0.000038 | auxverb | Auxiliary verbs | -0.000070 |
| money | Money | 0.000016 | motion | Motion | -0.000069 |
| health | Health | 0.000011 | insight | Insight | -0.000067 |
| adverb | Common adverbs | 0.000011 | hear | Hear | -0.000056 |
| leisure | Leisure | 0.000003 | feel | Feel | -0.000049 |
| swear | Swear words | 0.000002 | assent | Assent | -0.000046 |
| | | | male | Male references | -0.000045 |
| | | | qmark | Question marks | -0.000035 |
| | | | sexual | Sexual | -0.000028 |
| | | | anx | Anxiety | -0.000025 |
| | | | dash | Dashes | -0.000025 |
| | | | relig | Religion | -0.000010 |
| | | | filler | Fillers | -0.000008 |
A positive (negative) value means that the mean PoS proportion is higher in the more (less) successful books
Unpaired t-tests
Showing only the PoS tags with significant adjusted P-values.

| PoS | Definition | Adjusted P-value |
| --- | --- | --- |
| analytic | Analytical thinking | 0.017 |
| tone | Emotional tone | 0 |
| mWoSen | Mean words per sentence | 0 |
| sixletter | Six letter words | 0 |
| ppron | Personal pronouns | 0.005 |
| ipron | Impersonal pronouns | 0 |
| article | Articles | 0.005 |
| prep | Prepositions | 0 |
| adj | Common adjectives | 0.005 |
| number | Numbers | 0 |
| affect | Affective processes | 0 |
| posemo | Positive emotion | 0 |
| negemo | Negative emotion | 0.045 |
| sad | Sadness | 0.009 |
| social | Social processes | 0.044 |
| family | Family | 0.041 |
| friend | Friends | 0 |
| female | Female references | 0.026 |
| feel | Feel | 0.041 |
| bio | Biological processes | 0.044 |
| affiliation | Affiliation | 0.017 |
| power | Power | 0.017 |
| risk | Risk | 0.017 |
| focuspresent | Present focus | 0.02 |
| focusfuture | Future focus | 0 |
| space | Space | 0.009 |
| time | Time | 0 |
| informal | Informal language | 0 |
| nonflu | Nonfluencies | 0 |
| colon | Colons | 0.028 |
| exclam | Exclamation marks | 0 |
| quote | Quotation marks | 0.005 |
| apostro | Apostrophes | 0.017 |
33 out of 93 tags (including punctuation) of the transformed PoS were significantly different between successful and unsuccessful books. This means we can reject the null hypothesis (hypothesis 1), since more than one PoS proportion was significantly different between more and less successful books.
Difference in LIWC summary variables
The LIWC has its own summary definitions. Some of them are proprietary, so how they're calculated is not clear, but they rely on the PoS tags. For example, 'tone' is overall emotion (both the positive and negative emotion tags). Like the tags, the summary values are expressed as a proportion of the text (ie 0.85 means 85%), apart from mean words per sentence.
| Variable | Definition |
| --- | --- |
| Analytical thinking (Analytic) | People low in analytical thinking tend to write and think in more narrative ways, focusing on the here-and-now and personal experiences. Those high in analytical thinking perform better in college and have higher college board scores. |
| Clout | Clout refers to the relative social status, confidence, or leadership that people display through their writing or talking. The algorithm was developed from a series of studies in which people were interacting with one another. |
| Authenticity | When people reveal themselves in an authentic or honest way, they are more personal, humble, and vulnerable. |
| Emotional tone (Tone) | Although LIWC2015 includes both positive and negative emotion dimensions, the Tone variable puts the two into a single summary variable. Numbers below 50 suggest a more negative emotional tone. |
| Measure | Successful | Unsuccessful | P value | Significant (p<0.05)? |
| --- | --- | --- | --- | --- |
| Six letter words | 0.1633 | 0.1552 | 0.0004 | TRUE |
| Mean words per sentence | 18.3832 | 17.0184 | 0.0007 | TRUE |
| Dictionary words | 0.8388 | 0.8410 | 0.6000 | FALSE |
| Authentic | 0.2240 | 0.2181 | 0.3900 | FALSE |
| Analytic | 0.7240 | 0.6939 | 0.0032 | TRUE |
| Clout | 0.7417 | 0.7499 | 0.3800 | FALSE |
| Tone | 0.3892 | 0.4486 | 0.0010 | TRUE |
Results show that the mean words per sentence was significantly different between successful and unsuccessful books, and comparable to the figures in the original test. Likewise the proportion of six letter words (or longer) is significantly higher in successful books. Tone, however, is lower in successful books (ie a more negative emotional tone).
Looking further at these categories by genre:
[Charts: differences (scaled and normalised) between more and less successful books for analytical words, authenticity, clout, dictionary words, mean words per sentence, proportion of 6 letter words, and tone]
Most important variables
| PoS | Definition | Overall relative importance |
| --- | --- | --- |
| ipron | Impersonal pronouns | 100.00 |
| quote | Quotation marks | 86.40 |
| otherp | Other punctuation | 69.99 |
| posemo | Positive emotion | 68.88 |
| time | Time | 67.30 |
| space | Space | 64.90 |
| parenth | Parentheses (pairs) | 58.40 |
| you | 2nd person | 56.80 |
| adj | Common adjectives | 46.73 |
| risk | Risk | 41.25 |
| sixletter | Six letter words | 40.70 |
| semic | Semicolons | 38.60 |
| power | Power | 35.29 |
| netspeak | Netspeak | 31.52 |
| number | Numbers | 30.08 |
| swear | Swear words | 28.03 |
| period | Periods/full stops | 27.75 |
| filler | Fillers | 25.91 |
| certain | Certainty | 25.69 |
| death | Death | 25.56 |
| mWoSen | Mean words per sentence | 25.03 |
| ppron | Personal pronouns | 22.95 |
| colon | Colons | 20.12 |
| focuspast | Past focus | 19.99 |
| body | Body | 18.78 |
| tone | Emotional tone | 18.57 |
| leisure | Leisure | 17.86 |
| focusfuture | Future focus | 16.08 |
| home | Home | 14.88 |
| exclam | Exclamation marks | 13.08 |
| achieve | Achievement | 11.90 |
| dicWo | Dictionary words | 11.72 |
| apostro | Apostrophes | 9.99 |
| work | Work | 9.22 |
| ingest | Ingestion | 7.70 |
| health | Health | 6.83 |
| relig | Religion | 5.91 |
| qmark | Question marks | 3.93 |
| interrog | Interrogatives | 2.72 |
| hear | Hear | 1.48 |
Machine learning performance
| Accuracy | 95% CI | Sensitivity | Specificity |
| --- | --- | --- | --- |
| 75.00% | 67.6%-81.5% | 76% | 74% |
Conclusion
The mean proportion of 33 PoS tags was significantly different between more successful and less successful books (reject null hypothesis 1).
Six letter word proportion, mean words per sentence, analytical words and tone were significantly different between more and less successful books (reject null hypothesis 2). Across these categories, every genre except historical fiction had a significant difference, with tone (ie both positive and negative emotion use) significant for 5 of the 8 genres. No category in the Penn treebank analysis had this many significant genres.
Six letter words, mean words per sentence, dictionary words, Authentic, Analytic, Clout and Tone can be used to predict the status of a book with an accuracy reaching 75%. This is superior to the 65% achieved with readability, mean words per sentence and mean syllables per word.
Overall, the LIWC analysis performed better than the readability and Penn treebank analysis.
Hypotheses and data used
H0: There's no difference in the distribution of the proportion of PoS tags in successful and unsuccessful books, regardless of the book's genre.
HA: There is a difference in the distribution of the proportion of PoS tags in successful and unsuccessful books, and the pattern will depend on a book's genre.
I also added another pair:
H0: There's no difference in the Flesch-Kincaid readability of successful and unsuccessful books, regardless of the book's genre.
HA: There is a difference in the Flesch-Kincaid readability of successful and unsuccessful books, and the pattern will depend on a book's genre.
Note: since publishing I have updated some tables after noticing errors in the original data. I was caught out by Excel not always reordering all columns when sorting.
The original team used both the Fog and Flesch-Kincaid reading grade levels, but to save duplicating work I only used Flesch-Kincaid; in my experience it gives more accurate results. Both indices are in my source data if you want to run the analysis yourself, and I'll publish all data and code in the final part of this review. The Flesch-Kincaid readability used here is the US school grade level: the lower the value, the easier the text is judged to read. I also capped unreliable words-per-sentence data, limiting average words per sentence to 50 (this applied to only 4 books).
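For reference, the Flesch-Kincaid grade level combines words per sentence and syllables per word. A minimal sketch with the 50-word cap applied (the counts below are made up for illustration):

```r
# Flesch-Kincaid US school grade level, with words per sentence capped at 50
fk_grade <- function(words, sentences, syllables) {
  wps <- pmin(words / sentences, 50)  # cap unreliable words-per-sentence values
  0.39 * wps + 11.8 * (syllables / words) - 15.59
}

# Illustrative counts for a single book
fk_grade(words = 80000, sentences = 4500, syllables = 116000)  # ~8.5
```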
The original team gathered a range of books and classed them by genre and by success or failure based on the number of downloads over the 60 days before collection. We'll use the same.
They had an equal number of successes and failures per genre: 758 books, with 42 appearing across multiple genres, giving a total of 800 books, 400 of them failures and 400 successes.
Statistical tests
For these tests I’m greatly indebted to the users of Stack Overflow and Ahmed Kamel. While I had the original ideas it was he who got them into a working R script and analysis and the analysis relies heavily on his work. I’d highly recommend Ahmed if you want help with your own statistical tests.
Statistical analysis was performed using R studio v 1.1.149. I’ve put a more detailed methodology at the end of this page. Significance uses p †0.05.
Difference in success
The R code managed to reproduce the original figures and I’ve displayed their tables and graphs as appropriate.
Tag difference per genre
[Charts: difference in proportion for tags cc-ls, md-rbs, sym-wdt and wp-lrb]
Overall biggest difference
The data is side-by-side here, with the first two columns being the successful books and the last two the unsuccessful ones.
| PoS (Successful books) | Difference | PoS (Unsuccessful books) | Difference |
| --- | --- | --- | --- |
| INN – Preposition / Conjunction | 0.005560 | PRP – Determiner, possessive second | -0.004326 |
| DET – Determiner | 0.003114 | RB – Adverb | -0.003033 |
| NNS – Noun, plural | 0.002730 | VB – Verb, infinitive | -0.002690 |
| NN – Noun | 0.001540 | VBD – Verb, past tense | -0.002665 |
| CC – Conjunction, coordinating | 0.001399 | VBP – Verb, base present form | -0.001630 |
| CD – Adjective, cardinal number | 0.001309 | MD – Verb, modal | -0.001306 |
| WDT – Determiner, question | 0.001050 | FW – Foreign words | -0.001169 |
| WP – Pronoun, question | 0.000558 | POS – Possessive | -0.000890 |
| VBN – Verb, past/passive participle | 0.000525 | VBZ – Verb, present 3SG -s form | -0.000392 |
| PRPS – Determiner, possessive | 0.000444 | WRB – Adverb, question | -0.000385 |
| VBG – Verb, gerund | 0.000259 | UH – Interjection | -0.000205 |
| SYM – Symbol | 0.000197 | NNP – Noun, proper | -0.000181 |
| JJS – Adjective, superlative | 0.000170 | TO – Preposition | -0.000107 |
| JJ – Adjective | 0.000083 | EX – Pronoun, existential there | -0.000063 |
| WPS – Determiner, possessive & question | 0.000045 | | |
| JJR – Adjective, comparative | 0.000041 | | |
| RBR – Adverb, comparative | 0.000013 | | |
| RBS – Adverb, superlative | 0.000003 | | |
| LS – Symbol, list item | 0.000002 | | |
A positive value means that the mean PoS proportion is higher in the more successful books, while a negative value means its proportion is higher in less successful books.
Unpaired t-tests
For those not aware of significance, the P-value is used to determine whether a result is significant or just happened by chance. Statisticians may point out that probability is chance, but for a basic overview you can find out more here.
| PoS | P-value | Adjusted P-value |
| --- | --- | --- |
| CD – Adjective, cardinal number | 0 | 0 |
| DET – Determiner | 0 | 0 |
| INN – Preposition / Conjunction | 0 | 0 |
| JJS – Adjective, superlative | 0.012 | 0.039 |
| MD – Verb, modal | 0.004 | 0.015 |
| POS – Possessive | 0 | 0 |
| PRPS – Determiner, possessive | 0.022 | 0.057 |
| VB – Verb, infinitive | 0.018 | 0.052 |
| WDT – Determiner, question | 0 | 0 |
| WP – Pronoun, question | 0.033 | 0.078 |
| WRB – Adverb, question | 0.001 | 0.004 |
12 out of 41 of the transformed PoS were significantly different between successful and unsuccessful books. This means we can reject the null hypothesis (hypothesis 1), since more than one PoS proportion was significantly different between more and less successful books.
Difference in Flesch-Kincaid readability, mean words per sentence, and mean syllables per word between successful and unsuccessful books
| Measure | Successful | Unsuccessful | P value |
| --- | --- | --- | --- |
| Mean words per sentence | 17.8 | 17 | 0.25 |
| Mean syllables per word | 1.45 | 1.43 | 0.005 |
| Flesch-Kincaid readability | 8.46 | 7.98 | 0.028 |
Results show that the mean readability was significantly higher in successful books compared to unsuccessful books (ie they are judged harder to read). The same is true for the mean syllables per word, which was significantly higher in successful books.
The mean words per sentence was not significantly different between more and less successful books.
Looking further at readability by genre
| Genre | Failure mean | Failure SD | Success mean | Success SD | P value | Significant? |
| --- | --- | --- | --- | --- | --- | --- |
| Adventure | 7.54 | 1.83 | 9.76 | 3.86 | 0.0002 | TRUE |
| Detective/mystery | 6.82 | 1.40 | 7.56 | 2.03 | 0.0116 | TRUE |
| Fiction | 7.92 | 2.27 | 8.07 | 1.87 | 0.3852 | FALSE |
| Historical-fiction | 8.55 | 1.83 | 9.40 | 3.00 | 0.1247 | FALSE |
| Love-story | 7.61 | 1.57 | 8.83 | 3.32 | 0.0360 | TRUE |
| Poetry | 11.27 | 10.24 | 9.71 | 2.66 | 0.8450 | FALSE |
| Sci-fi | 6.33 | 1.52 | 6.43 | 1.38 | 0.5896 | FALSE |
| Short-stories | 8.99 | 2.74 | 7.90 | 2.02 | 0.0614 | FALSE |
Results show a statistically significant difference in mean readability between successful and unsuccessful books for the following genres: adventure, detective/mystery and love story. The mean readability was significantly higher (ie harder to read) for more successful books in those genres.
Most important variables
| Variable | Overall relative importance |
| --- | --- |
| JJ – Adjective | 100.000 |
| UH – Interjection | 86.810 |
| PRPS – Determiner, possessive | 69.049 |
| TO – Preposition | 67.866 |
| INN – Preposition / Conjunction | 67.570 |
| WP – Pronoun, question | 64.431 |
| MD – Verb, modal | 60.935 |
| RBS – Adverb, superlative | 59.996 |
| WDT – Determiner, question | 59.635 |
| PRP – Determiner, possessive second | 55.813 |
| CD – Adjective, cardinal number | 54.306 |
| NN – Noun | 48.380 |
| EX – Pronoun, existential there | 42.474 |
| SYM – Symbol | 40.823 |
| Mean syllables per word | 36.230 |
| JJS – Adjective, superlative | 35.699 |
| NNP – Noun, proper | 35.674 |
| CC – Conjunction, coordinating | 33.137 |
| VBP – Verb, base present form | 32.791 |
| VBG – Verb, gerund | 29.862 |
| VBN – Verb, past/passive participle | 29.826 |
| POS – Possessive | 28.903 |
| WRB – Adverb, question | 18.980 |
| Flesch-Kincaid readability | 18.371 |
| VB – Verb, infinitive | 14.735 |
| NNS – Noun, plural | 13.609 |
| FW – Foreign words | 13.562 |
| DET – Determiner | 3.757 |
| LS – Symbol, list item | 1.202 |
This shows that adjectives are the most important tag in determining a book's status, though the importance score alone does not tell us whether they push a book towards success or failure.
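The relative importance scores above are scaled so the top variable is 100, which is how caret reports variable importance. A sketch, assuming `svm_fit` is a fitted caret `train` object like the one sketched in the methodology at the end of this page:

```r
library(caret)

# Variable importance scaled so the most important variable scores 100.
# `svm_fit` is assumed to be a fitted caret `train` object (see methodology).
importance <- varImp(svm_fit, scale = TRUE)
importance                  # table like the one above
plot(importance, top = 10)  # quick visual of the top 10 variables
```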
Machine learning performance
| Accuracy | 95% CI | Sensitivity | Specificity |
| --- | --- | --- | --- |
| 65.62% | 57.7%-72.9% | 69% | 63% |
Overall accuracy is 65.6%. The sensitivity is the true positive rate and the specificity is the true negative rate (ie after allowing for false positives or negatives). Note that for all other tests I ignored punctuation tags but included them for machine learning, as it improved performance. I left them out elsewhere because knowing whether a right-hand bracket was important did not seem to tell me anything.
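To make those definitions concrete, here is an illustrative confusion matrix consistent with the figures above, assuming an even 160-book test split (the counts are made up, but they reproduce the reported rates):

```r
# Hypothetical test-set counts (160 books: 80 true successes, 80 true failures)
tp <- 55; fn <- 25  # successes predicted correctly / missed
tn <- 50; fp <- 30  # failures predicted correctly / missed

(tp + tn) / (tp + tn + fp + fn)  # accuracy    = 0.656 (65.6%)
tp / (tp + fn)                   # sensitivity = 0.688 (~69%)
tn / (tn + fp)                   # specificity = 0.625 (~63%)
```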
Conclusion
The mean of 12 PoS tags was significantly different between more successful and less successful books. We also saw that the PoS pattern was largely dependent on the genre of the book.
This means we can reject the null hypothesis and say that there is a difference in the distribution of the proportion of PoS tags in successful and unsuccessful books, and the pattern will depend on a book's genre.
Not only that, but the Flesch-Kincaid readability and mean syllables per word were significantly different between more and less successful books. This was most evident in adventure, detective/mystery and love stories, where the mean readability was significantly higher (ie harder to read) in more successful books.
This means we can say that there is a difference in the Flesch-Kincaid readability of successful and unsuccessful books, and the pattern will depend on a book's genre.
Overall, the Flesch-Kincaid readability, mean words per sentence and PoS can be used to predict the status of a book with an accuracy reaching 65.6%. This is comparable to the original experiment, which gave an overall accuracy of 64.5%.
But what happens when we try it with a different PoS tool that analyses text in a different way? Next time I’ll use LIWC data.
Method
I used R to perform the analysis. When running:
Statistical analysis was performed using RStudio v1.1.453.
The data set was split into a training set (80%) and a test set (20%). Analysis was performed on the training set, except when comparing readability across genres, where the whole data set was used due to the small sample size in each genre.
The average difference in various parts of speech (PoS, the linguistic tags assigned to words) was calculated between successful and unsuccessful books. I used what I think were the original team's methods to calculate these differences.
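The difference calculation itself is simple once each book's tags are counted. A minimal sketch, assuming `book_tags` is a hypothetical list holding the tagger's output for each book:

```r
# Proportion of each PoS tag per book, from a hypothetical list of tag vectors
book_tags <- list(
  book_a = c("DT", "NN", "VBD", "IN", "DT", "NN"),
  book_b = c("PRP", "VBD", "RB", "JJ", "NNS")
)

pos_prop <- lapply(book_tags, function(tags) table(tags) / length(tags))
pos_prop$book_a  # eg DT appears in 2 of 6 tags = 0.33
```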
Detailed methodology
I laid out the broad outlines and normally this is put first in a research paper but it’s not the most engaging part. For those of you who are interested, this is the stats nitty gritty and is used in the other experiments.
Univariate statistical analysis
Variables were inspected for normality. Appropriate transformations (such as log, Box-Cox and Yeo-Johnson) were applied so that variables could assume an approximately normal distribution. This was followed by a series of unpaired t-tests to assess whether the mean proportion of each PoS was significantly different between successful and unsuccessful books.
P-values were adjusted for the false discovery rate to avoid inflating the type I error (a 'false positive'). Analysis was performed only on the training data set. Variables were scaled before performing the tests.
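In R this step amounts to one t-test per tag followed by a false-discovery-rate correction. A minimal sketch, assuming `pos` is a hypothetical data frame with a `status` column and one scaled column per tag:

```r
# Unpaired t-test per PoS column, with FDR (Benjamini-Hochberg) adjustment.
# `pos` is a hypothetical data frame: `status` plus one scaled column per tag.
tags <- setdiff(names(pos), "status")

p_raw <- sapply(tags, function(tag) t.test(pos[[tag]] ~ pos$status)$p.value)
p_adj <- p.adjust(p_raw, method = "fdr")

names(p_adj)[p_adj <= 0.05]  # tags significant after adjustment
```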
Machine learning algorithm
A support vector machine was used to predict the status of a book based on the variables deemed important in the initial univariate analysis. A LibLinear SVM with L2 regularisation, tuned over the training data, was used.
The model was tuned using 5-fold cross-validation. The final predictive power of the model was assessed on the 20% test data. Performance was assessed using accuracy, sensitivity and specificity.
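A sketch of that pipeline using caret, where `books` is a hypothetical data frame of scaled predictors plus a `status` factor (`svmLinear3` is caret's LiblineaR L2-regularised linear SVM):

```r
library(caret)

set.seed(42)
# 80/20 split, stratified on the outcome
idx       <- createDataPartition(books$status, p = 0.8, list = FALSE)
train_set <- books[idx, ]
test_set  <- books[-idx, ]

# L2-regularised LibLinear SVM tuned with 5-fold cross-validation
svm_fit <- train(status ~ ., data = train_set,
                 method    = "svmLinear3",
                 trControl = trainControl(method = "cv", number = 5))

# Accuracy, sensitivity and specificity on the held-out 20%
pred <- predict(svm_fit, newdata = test_set)
confusionMatrix(pred, test_set$status)
```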
Variables with lots of zeroes
Ten variables had a lot of zeroes and were heavily skewed, so they were not transformed, since none of the transformation algorithms fixed such a distribution. The remaining PoS did not contain such a large number of zeroes and were transformed before the unpaired t-tests. The package bestNormalize was used to find the most appropriate transformation.
Three PoS were removed from the analysis (NNPS, PDT and RP) since none of the novels included any of them.
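A sketch of the transformation step with bestNormalize, using a made-up skewed vector in place of a real PoS column:

```r
library(bestNormalize)

set.seed(1)
x <- rexp(800)          # stand-in for one skewed PoS proportion column

bn <- bestNormalize(x)  # compares log, Box-Cox, Yeo-Johnson, ORQ and others
bn$chosen_transform     # which transformation was selected
head(bn$x.t)            # the transformed values used in the t-tests
```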