Categories
General

Remote writing: technology for the online writers room

How can a team of writers create a story when they don’t share the same city, schedule or way of working?

That’s what my two writers rooms faced. And to make the problem more complex, both teams started with a mere seed of a story idea.

Fortunately the writers rooms comprise 8 extremely talented writers who are willing to engage with the problems and to share their own approaches and ideas.

This post is about how we overcame these obstacles and how we adapt to new ones as they come along, for the project is still ongoing. It’s about how you can write together as a group remotely, in your free time, using free tools.

The problems

I formed 2 writers rooms, with 2 different story ideas, but the same goal — to create a story as a team.

The end goal is not to have a finished book or screenplay, but a detailed summary breaking out what will happen and in what order.

I chose this goal because a strong story comes across regardless of format — though the medium is important to shaping a story, I took the idealistic approach that a good story can be told in a variety of media.

The aim was to get the story in its clearest form. To me this is a treatment, a summary of the story.

To get there we need a range of tools to help us explore ideas, comment on them, develop them, review and refine them, then review them further.

What we wanted

This project is as much an experiment to find how best to write remotely using Agile methods as it is an attempt to produce a good story (the aim is to succeed at both).

The team knew it was an experiment and shared their technological or availability constraints.

But before we delved into looking at technical solutions, we had to solve the biggest barrier — personal separation.

Technology cannot replicate what it’s like to meet and know someone. That meant the teams had to meet each other.

This involved a kick-off workshop in person and was a key requirement. A one day workshop won’t solve all problems of knowing others, but meeting in person meant writers were more than just avatars and faces on a web chat.

Only then could we look at problems of technology. Only then could we look at using technology to help us talk.

Tools for the job

First it was agreed that whichever tool we used, we would approach each other with respect and be constructive regardless of the medium.

Second, it had to be fairly straightforward to use across the range of technologies that the writers had.

Finally, it had to be free, or at least very cheap, as this project has a tight budget.

Through my freelancing I was used to Trello, Google Drive and Slack (a holy trinity in UK government when it comes to project management). They’re free so I used them along with Zoom:

  • Zoom (with subscription, free plans available) for conference calls — it works across multiple platforms and allowed writers to dial in from around the world without too much faff
  • Trello (free) — to manage tasks and resources (such as definitions, examples of templates and so on). Mainly used at the beginning, but also to help the team understand the documents as they grew. Cards were used to discuss ideas
  • Google Docs (free) — for longer ideas and for scene exercises. Handy for leaving comments on ideas or for free text, though the default of view-only sharing and having to submit comments keep causing problems
  • Slack (free) — each team had a channel to discuss ideas, one to schedule catchups (with simple voting on dates) and a general channel for both ideas

Review of technology

These tools have largely proved successful. The work is progressing and we are learning, which are the chief measures.

Most of these tools were new to the writers but all picked them up quickly and took the initiative. Slack has proved the most used, both for discussing ideas (though these sometimes need corralling into a new document) and for general updates and reminders.

Getting Slack to do automatic reminders and nagging with its bots has helped take some pressure off managing two teams.

If I had the budget I would consider Coda.io, which combines many of the above, but for now it’s too expensive.

Technology is just a tool for better stories

The aim of all this of course was to produce a story. Technological tools are great only if we have something to say.

Next time I’ll show you how we started from a simple idea and developed it into a complex story with a range of characters.

The project is ongoing and I’m happy to discuss it further if you’re interested. Despite the name dropping this post has not been sponsored or endorsed by any company.


Originally published at https://www.linkedin.com.


A problem with crafting stories

Why would someone hire a team of writers to form a writers room when they don’t already have a show in production?

For an experiment in storytelling.

We devour stories and we consume more every year, thanks in part to the growth of online video through the likes of Netflix or Amazon Video, audiobooks with Audible, and the increase in book and comic sales.

Being “storytelling animals”, it’s no big surprise. And technology is helping us get more stories.

I’m a great believer that technology is a tool that enables people to fulfil desires. And we desire stories.

Over the past decade we have seen the growth of decent internet access, cheap ereaders and cheap, high quality internet subscriptions, along with free user generated content on YouTube, blogs, podcasts and story sites. This means that we can get an unimaginably large number of stories cheaply and quickly.

This boom is not just related to fiction. There has been growth in factual stories — TED talks and even better PowerPoint presentations, along with ‘scripted reality’ shows such as Love Island.

New technology, same methods

Yet with all this technological change, there has been little innovation in how books and screenplays are produced since the development in the USA of writers rooms in the mid-20th century.

I’m generalising massively but stories are still either created by:

  1. A solo writer or a pair of writers (rarely more) who write an original idea or are commissioned to do so by others, such as a production company
  2. A writers room of a group of writers who’ll discuss ideas and then go off and write a script by themselves, almost always for a TV or radio show rather than a film

Even with the input of editors and showrunners, the bulk of the writing and fleshing out is done by one or at most two people, often working in a waterfall method.

That is, writers go away and write a couple of drafts, get some feedback and it’s either adjusted or shelved. This feedback can be from an editor, agent, producer or a creative writing group.

For professionals and amateurs this process is broadly the same: the writer acts in isolation from feedback for much of the time.

The problem with waterfall

This means that for both professionals and amateurs, feedback on writing usually comes at the final stage. That’s fair enough, for it takes time and effort to help someone and go through their writing, and reviewers want to see something finished.

So feedback is often given towards the end, once the bulk of the story has been completed.

When a lot of feedback is given at once you have to pick and choose the most important feedback to give. It could be about the character, the structure, the writing, the plot, the ending or a key scene. And if there are a lot of things to fix you risk being overly negative doing it all at once.

Even getting quality criticism can be tricky, which is why there are many services offering script and novel feedback for writers. But there’s no guarantee they can fix all problems and for new writers selling a story, agents and producers want something ready to go, pret a vendre, not something that they have to work on.

This means that unknown writers with a story that has a core of a great idea but too many flaws, no matter how fixable, have little chance of a sale.

Better writing through better processes

One innovation brought in when the Government Digital Service was launched was that teams would use Agile project methodology for all projects.

Including writing content for webpages. It was a revelation.

Joining an Agile writing team helped in several ways. First, the focus was on the audience and what each page needed to tell them. Then came breaking each page down into its structure, reviewing it with colleagues and subject matter experts, writing it, and repeating the review-and-write cycle.

While a single writer could have got something live quicker than the team, the page that did go live had a consistency and quality that meant it would stand longer and serve more users.

So if it works for web content why not creative work?

Agile creative writing

My first step was to test this as a concept. I created the Agile Storytellers Meetup in London to test the concepts while teaching writers about Agile.

My key learning from this was to hold a retrospective at the end of each session to iterate on what works (or doesn’t).

Combining my learnings from this along with user research and conversations with those in the industry I developed some principles:

  • Quality sells — contacts can get you in to an interview but without quality there’s no point
  • Make it a page turner — if you don’t make it interesting no one will want to read it
  • Work with others and their feedback, but have a clear vision of what you want — feedback is useful but must be in the context of your lodestar, your vision of what you’re aiming to achieve

So I set out advertising for writers to join me while we aim to put this into practice. And that’s how we’re working on a screenplay and a book as a writers room.

Next time — the problems with team writing and ways to fix them.


Originally published at https://www.linkedin.com.


Of faxes and futures — Agile storytellers March 2019

How can an idea go from a light comedy about a weatherman getting accurate weather predictions by a fax machine (of all devices) to a Cold War industrial thriller set in Antarctica about sex discrimination? Through the power of Agile of course.

At the latest Agile storytellers session we focused on Agile brainstorming and idea refining techniques to make ideas good enough to proceed with.

So this was a two part operation — not just coming up with ideas, but using Agile methods to focus on getting results quickly. And we did it using loglines.


Out of many, one idea

Loglines are one-line summaries of a film’s plot. Examples include:

“A New York cop in LA to reconcile with his wife must save her when her building is taken over by terrorists” — Die Hard.

“The youngest son of a Mafia don is reluctantly pulled into the family business when he must avenge an attempt on his father’s life.” — The Godfather

We had a lucky dip of print outs of different loglines we found on the internet, each drawing about half a dozen and then putting forward the 1 or 2 we thought best from our selection.

We then held a simple version of forced ranking, an Agile method that makes people form an opinion on things they didn’t yet have one about.

The first logline we laid down was set as our middle rank and the other ideas were then laid as either better or worse in relation to it. We then reached the top 2:
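For anyone who wants to see the ranking step in code, here is a rough sketch. It places each new logline in a list using only better-or-worse judgements against the ones already laid down. The `forced_rank` function, the `is_better` callback and the scores are all hypothetical stand-ins; in the session the comparison was a group conversation, not a lookup.

```python
from typing import Callable, List


def forced_rank(items: List[str], is_better: Callable[[str, str], bool]) -> List[str]:
    """Return items ordered best-first using only better/worse judgements."""
    ranked: List[str] = []
    for item in items:
        # Walk down the current ranking until we find something the new item beats.
        position = len(ranked)
        for index, placed in enumerate(ranked):
            if is_better(item, placed):
                position = index
                break
        ranked.insert(position, item)
    return ranked


# Hypothetical stand-in for the group's opinion: a lookup of made-up scores.
votes = {"fax weatherman": 5, "sushi dance club": 5, "heist": 2, "road trip": 1}
ranking = forced_rank(list(votes), lambda a, b: votes[a] > votes[b])
top_two = ranking[:2]
```

The point of the method survives in the sketch: nobody has to score an idea in the abstract, they only have to say whether it beats the one in front of them.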

Logline A: “After discovering a fax machine that can send and receive messages one day into the future, an impossibly inaccurate weather man struggles for career advancement while trying to maintain the space/time continuum.”

Logline B: “Two gay men from San Francisco move to a small Wisconsin town to open a sushi dance club.”

Deciding on and refining an idea

Both loglines had an equal number of supporters in our vote. Taking inspiration from 6 hats thinking we looked at them beyond our initial feelings. While we thought the sushi club sounded fun, we didn’t know enough about being gay men in San Francisco and/or Wisconsin, nor about sushi or dancing, to be able to make a story that didn’t rely on stereotypes and assumptions. So we went with the fax machine.

We then took our chosen logline as our draft vision statement. This meant it needed to be unambiguous, clear, fit with our values, be realistic, and short.

How to do this? First we thought about the questions and ambiguities that the statement prompted. We wrote each question on a sticky note, then reviewed the questions and grouped them by theme, deciding on and labelling the groupings as:

  • The character
  • The rules
  • The setting

Now we could have had these groupings already planned as these are fairly standard throughout stories, but it was good to see them come about organically.

Everyone has ideas — everyone

Now it was time to get ideas on how to flesh out the story from these questions. But not everyone said that they had ideas. They were wrong.

The idea ball (roll of tape in this case) was thrown around the group. Every time the ball was received the holder had to come up with a suggestion for one of the 3 groupings or else pass. The ideas were noted.

Each idea could be independent of what went before and the aim was to generate ideas, not to critique or question previous ones too much (although we did slide into that sometimes).

By the end and despite initial protestations of being bereft of ideas we had a rough idea of the character, where they were and when it was set and the rules of the world.

Being led by ideas, not forcing them

It was near the end that the rule about the fax — which had generated the most queries in the sticky note section — went from being a magical fax from the future to a regular fax, but with a message picked up by someone who shouldn’t have.

In part it was because we kept asking how the fax worked, what the timeframe of its predictions was, what the protagonist could do to solve the problem. Seeing as we saw it related to climate change, fixing it in a day was unrealistic, to put it mildly.

So we asked where would climate be important, the most visual place? After debate we decided on Antarctica and once we did that ideas flowed.

That the protagonist would be locked up at some point and have to escape, that something big had to happen (a glacier collapse). That it had to be man-made so that a man could stop it.

But then we asked ourselves: why a man and not a woman, particularly as most of the group at the meetup were women?

So why was she in Antarctica? To prove something? And while sexism is certainly not yet vanquished, the fax as a sole means of communication coupled with a more sexist time seemed appropriate.

Short time, many ideas

By now time was catching up on us and we still lacked a story, though we had ideas and a protagonist.

With pass the card we each wrote an idea for one topic then passed it on to be added to by the next participant. Once the cards were read out at the end we modified them somewhat, but ultimately had a rough spine of a story and its key players.

Pass the card

But a story needs its memorable moments. So we took a sheet of paper each, divided it into eight and each drew a key scene or sequence — crazy eights.

Crazy eights ideas by one of the more artistic members

An MVP output

Once we shared we cherry picked the ones we liked. And behold we now had a minimum viable product (MVP), or minimum viable story, as an output:

  • a setting — Antarctica, during the Falklands War, due to faxes being key and the reason why they may be even more cut off
  • a big idea — what if someone found a message that they shouldn’t have, was trapped with the bad guys and isolated from help by thousands of miles
  • a protagonist — a female meteorologist who has something to prove (yes this is still fairly 2D but better than before)
  • an antagonist — the corporation that wants to carry out a mining test that could fracture an ice shelf (again, 2D but has a motive)
  • a ticking clock — the test that will cause a glacier to splinter off that will cause flooding and other damage
  • a series of key events — finding the fax, the entrapment, the escape, the finding of one of Scott’s old supply bases just when all seems lost, the climax (sorry, you had to be there)
Less artistic ideas by myself

Summary and lessons learnt

So in the space of 2 hours we went from a pool of wildly different ideas to a story that had only the word ‘fax’ in common with the logline we started from.

We were proud of how much we got done in such a short time. It wasn’t perfect but it was a lot more than the zero we had 2 hours prior.

As usual we ended with a retrospective to find out what worked and what didn’t work.

Overall the team liked taking a few ideas and building from there, the collaboration and how we got different points of view yet agreed on an outcome.

The team felt they learnt about listening, sharing and expressing, and about building on ideas.

But the venue didn’t score as well. We were in Queen Elizabeth Hall on the South Bank and while the staff and bar were lovely, we did get a few interruptions for spare change and had neighbours who disturbed us.

This was a pity as the last venue, WeWork, was seen as too formal. So the hunt for the Perfect Venue (R) continues.

For the next Agile Storytellers session visit the Meetup Group.


Civil servants are user research participants too

Carrying out user research across the public sector is not the same as carrying it out with members of the public. That at least has been my experience of carrying out half a dozen different civil service-focused Discoveries.

As the Government Digital Service here in the UK likes to point out, civil servants are users too. But it’s a broad sector and my research projects have included central, devolved and local government, agencies, the police, and the NHS. Here are my experiences in light of the new guidance for services for civil servants.

Planning user research for civil servants

The first thing I do for all projects is meet the team and host a research question workshop. Where this differed from other workshops is how we thought about users.

We quickly decided that ‘civil servant’ was too broad a term for users and looked at the Civil Service profiles and Departmental IT profiles.

We found when reviewing civil servant personas from previous research that some are often just their job titles. So we did two things.

First, we adapted the job titles into roles to reflect that users across different teams may have the same fundamental duties and needs but different titles. This allowed us to see patterns and groups.

Second, we wrote our potential users and their stories not just as ‘As a [role]’ but as ‘As a [role] who…’ (eg “as an assistant who is in charge of a team’s room bookings”).

This helped us to really narrow down who our users were. It also helped us resolve one debate we had about who were end users of a service and who were our chief users for one Discovery, which to our surprise were not the same.

We also had a fairly clear idea about the end users but for the Discovery we determined it was more important to know who would implement and make decisions about the proposed service and their needs.

Cross-government help

What really saved time was posting what I was working on to the cross-government user research Slack channel and mailing list.

While my team had contacts, other user researchers put me in touch with their teams when relevant. In some cases they even had previous user research I could look at. With room bookings, for example, I had 3 different previous projects I could study and borrow from.

As you may be aware, government has a lot of meetings and forums and groups. Going along and inviting myself to relevant meetings helped in multiple ways: I got research from the meetings; I got contacts; and I got people to spread the word about what I was doing.

The tricky bits of civil service research

User research is mostly for getting information from users, but on my projects the civil servants I spoke to expected more from interviews, particularly if a team member was present.

Some interviews did get bogged down when team members wanted to defend the service or explain why the problem the user mentioned existed, and that’s not the aim of interviews. A decision has to be taken on the value of having a team member there to take part in research, and on how to control the research session.

Confidentiality was also a concern. It’s hard to be truly frank as a user if the person who designed the system you’re criticising is in the same room.

One strategy was to allow for time at the end for Q&As between the team and users, and to shut it down if it went too off-topic.

This was even trickier in workshops, yet one reason we could get so many participants to attend was that our team of experts would be there. It was a question of balancing your desire to get information while rewarding the fact that professionals were giving up their time and so expected something in return.

Participants were also keen to know the next steps. Asking product managers to vow to blog at the end of the Discovery, Alpha or Beta meant I could tell users that there’d be a digest of learnings, and I invited many to the final Show and Tell.

What I learnt

We don’t share enough with other user researchers. And a lot of user researchers across government have worked on similar projects with similar problems and needs.

Contacting others can be easier due to Slack, public blogs, meet ups and so on but it requires more chasing and more channels to monitor. Combine this with projects using the same users and there can be a case of research fatigue for participants.

Some blockers were technical and unique to the civil service. GDS doesn’t have .gsi in its email (a ‘government secure initiative’ that’s going anyway in favour of better cyber security behaviour), and lacked a landline. For some not up to speed with the latest policies this was a red flag and I was told I “couldn’t be trusted” with a response.

With so many departments and agencies (despite decentralisation) along with local authorities it can be tempting to stay in London and its area to meet users.

Yet bursting the London bubble and travelling the country was essential.

Hangouts, Appear.In and other remote tools are great but opportunities to observe other working areas were essential to get a proper view of work. Participants were often keen to meet someone who was willing to come and see them, so it was a positive session for all.

Overall it’s been an enjoyable experience. Civil servants are not just users, they’re people too. Shocking, I know.

Researching in government is rewarding as you have experts in their fields and they love talking about their work. Even those who are unhappy usually end their interviews with “sorry for the rant” despite having given you reams of information.

And you hope that by the end of the project your team will have had insights and findings that will help a range of talented people across the country do a better job and so help the public.

Note: this was originally written for the GDS blog but due to team changes got lost to the aether.

Categories
Research

“Success with Style” part 5: what does machine analysis mean for writers?

Now that this machine analysis of what makes a good and bad book is complete, what does it actually mean for writers?

I started this analysis back in May. Actually it was far before then, back when the original Success with Style paper was published in 2014 but it took me that long to realise I needed help with the analysis, even after I got my R qualification.

And when the results raised more questions with each analysis it meant that something I expected to take a month end-to-end became 3 months. Even now there is more I could do but I have done enough to call it a day.


Success with Style: a recap

If you’ve not read the other parts (and they can be quite stats heavy) this series was prompted by a 2014 paper that claimed to be able to say what makes a good book, Success with Style. However my reading of it found some flaws and it was unclear how the original authors created their experiment.

The dataset comprised 758 books from the Project Gutenberg Library in 8 genres (Adventure, Detective/mystery, Fiction, Historical fiction, Love-story, Poetry, Sci-fi and Short-stories). Half of them were deemed a success (more than 30 downloads in the past month) and those with fewer downloads were deemed unsuccessful/failures. I then put these through a variety of analyses:

  • readability (how easy the books are to read, in particular Flesch-Kincaid grade level formula)
  • the Stanford Tagger that uses the Penn treebank to analyse PoS (parts of speech)
  • the LIWC to analyse PoS

The latter two, the Penn and LIWC PoS analyses, split all the words in the books into different categories and do so in slightly different ways.

I then repeated these analyses in slightly different ways: first using 2018 download data (with 41 books changing their success/fail category) and then analysing just the first 3,000 words on the principle that it is often only the first chapter that agents or publishers review when considering a book.
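To make the two re-runs concrete, here is a minimal sketch of the labelling and truncation steps. The threshold of 30 comes from the description above; the function names and download counts are made up, and I am assuming the same threshold applies to both the 2013 and 2018 snapshots.

```python
from typing import Dict, List

DOWNLOAD_THRESHOLD = 30  # "more than 30 downloads in the past month" counts as a success


def label(downloads: int) -> str:
    """Classify a book from its monthly download count."""
    return "success" if downloads > DOWNLOAD_THRESHOLD else "failure"


def changed_labels(old: Dict[str, int], new: Dict[str, int]) -> List[str]:
    """Titles whose success/failure label differs between two download snapshots."""
    return [t for t in old if t in new and label(old[t]) != label(new[t])]


def first_words(text: str, limit: int = 3000) -> str:
    """Keep only the first `limit` whitespace-separated words of a book."""
    return " ".join(text.split()[:limit])


# Made-up download counts, purely for illustration.
downloads_2013 = {"Book A": 120, "Book B": 12, "Book C": 31}
downloads_2018 = {"Book A": 95, "Book B": 44, "Book C": 30}
moved = changed_labels(downloads_2013, downloads_2018)
```

Note how a book sitting just above the threshold ("Book C" here) flips category on a change of a single download, which is one reason to treat the success/failure split as rough.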

Steps taken in the analysis of the books

In all tests I was looking for statistical significance. A good overview of what this is is on the Harvard Business Review, but in summary it is a test of whether the results could plausibly be down to chance (the luck of the draw) or whether there is likely to be an underlying reason for them.

The P-value threshold used to determine significance in all tests was 0.05, which is a fairly standard choice (note that this may have been too high – see the end of this page). Without reporting statistical significance it’s hard to say whether your test means something or whether it was just the luck of the data you drew that gave you that result.
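If the idea of "luck of the draw" feels abstract, a permutation test shows it quite directly. To be clear, this is not the t-test used in the analysis; it is a swapped-in technique that answers the same question using only the Python standard library. It shuffles the success/failure labels many times and counts how often chance alone produces a gap as big as the one observed. The scores below are made up.

```python
import random
from statistics import mean


def permutation_p_value(group_a, group_b, rounds=10_000, seed=0):
    """Two-sided p-value for the difference in means, by shuffling group labels."""
    rng = random.Random(seed)
    observed = abs(mean(group_a) - mean(group_b))
    pooled = list(group_a) + list(group_b)
    size_a = len(group_a)
    at_least_as_extreme = 0
    for _ in range(rounds):
        rng.shuffle(pooled)
        diff = abs(mean(pooled[:size_a]) - mean(pooled[size_a:]))
        if diff >= observed:
            at_least_as_extreme += 1
    return at_least_as_extreme / rounds


# Made-up readability scores for two small groups of books.
successful = [9.1, 8.7, 9.4, 9.0, 8.9, 9.3]
unsuccessful = [8.0, 7.8, 8.3, 8.1, 7.9, 8.2]
p = permutation_p_value(successful, unsuccessful)
significant = p < 0.05  # the threshold used throughout this series
```

A small `p` means random relabelling rarely reproduces the observed gap, so chance is an unlikely explanation.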

If you want to look in depth at the analyses, read the earlier parts of this series.

Success with Style findings summary

The main findings are that:

  • analysing an entire book is more accurate than just its first 3,000 words — it’s hard to judge a book on its first chapter
  • the genre affects the significance of tests and not all genres are as easy to predict as others — science-fiction has the most exceptions to the principles
  • books that are emotional (lots of either positive or negative emotions) tend to be unsuccessful
  • adjectives and adverbs predominate in poorly performing books and nouns and verbs in successful ones but are not a significant determiner of success or failure
  • don’t talk too much — dialogue heavy books were more unsuccessful
  • readability (as a computer measurement) is not generally a significant determiner of success, but don’t use too many long words in your writing. More successful books are slightly harder to read (have a higher readability) but are still able to be understood by a 15-year-old
  • these rough criteria generally stood up even when the success/fail criteria changed over time, meaning there is some underlying value in them
  • the LIWC is more accurate than the Penn treebank for predicting the success of a book
  • including punctuation in the analysis leads to better machine learning prediction performance

What these findings mean for writers

Caveats

First I’m not about to say that there are rules for writing. At best writers such as George Orwell* or Robert McKee have laid out principles, not rules (*Orwell calls his rules, but his last one is to ignore them when needed). This analysis is not meant to create a set of rules.

Secondly, as with many experiments, it is dangerous to extrapolate beyond the original dataset. The 758 books in the Gutenberg dataset are all out of copyright and so are mostly from the 1850s to early 20th century. The oldest author was Dante (born 1265) and the most recent Samuel Vaknin, born in 1961 (Gutenberg only gives authors’ birth and death dates, not publication dates). Many are also well known as classics, such as Robinson Crusoe, and so may have a built in bias to being downloaded due to name recognition.

Machine analysis is not a perfect tool. Even tools such as the LIWC, which is updated regularly (Penn’s was mainly carried out between 1989 and 1996), still cannot accurately tell the difference in context of words such as ‘execute’ and whether it’s to execute a plan or Ned Stark.

Finally I didn’t clean my data: I didn’t remove common words or check for errors in the Gutenberg transcriptions. It’s not essential but may have led to some differences from what a cleaned up dataset would have produced.

The first chapter is a bad guide for overall success

Machine analysis of the success of a book failed when making a judgment solely on the first 3,000 words. At around 55-56% accuracy, its machine learning performance was only marginally better than a 50/50 guess for both the Readability and Penn treebank analysis and the LIWC analysis.

PoS analysis                               Accuracy   95% confidence interval
Penn & Readability 2013 (complete book)    65.62%     57.7-72.9%
Penn & Readability 2018 (complete book)    65.00%     57.5-72.8%
Penn & Readability 1st 3,000 words         55.62%     47.6-63.5%
LIWC 2013 (complete book)                  75.00%     67.6-81.5%
LIWC 2018 (complete book)                  71.70%     64.0-78.6%
LIWC 1st 3,000 words                       56.25%     48.2-64.0%
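The confidence intervals are what show whether an accuracy figure really beats a coin flip. As a hedged illustration, the sketch below computes a Wilson score interval for a proportion. Two loud assumptions: the size of the test set is not stated here, so the 160 books is inferred from the percentages (89/160 is 55.62%, 120/160 is 75.00%), and the intervals in the table may have been calculated with a different method, so the figures will be close but not identical.

```python
from math import sqrt


def wilson_interval(correct: int, total: int, z: float = 1.96):
    """Approximate 95% confidence interval for a proportion (Wilson score)."""
    p = correct / total
    denom = 1 + z * z / total
    centre = (p + z * z / (2 * total)) / denom
    half_width = z * sqrt(p * (1 - p) / total + z * z / (4 * total * total)) / denom
    return centre - half_width, centre + half_width


# Assumed test-set size of 160 books (inferred, not stated in the post).
first_chapter = wilson_interval(89, 160)   # the 55.62% first-3,000-words result
full_book = wilson_interval(120, 160)      # the 75.00% LIWC complete-book result
includes_coin_flip = first_chapter[0] < 0.5 < first_chapter[1]
```

Under these assumptions the first-chapter interval straddles 50%, so it cannot be told apart from guessing, while the full-book interval sits clearly above it.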

The analysis of the first chapter did produce significant results for some of the same tests as were produced in the full book analysis. However, assuming that analysing the complete book is the ‘truer’ test due to its better machine learning performance, this means that the first chapter isn’t as valuable a method of analysis as analysing the whole book.

This means that machine analysis using the Penn Treebank, Readability or LIWC categories is not suitable for agencies, publishers or other services that ask to review based on one sample chapter.

However, do human readers for agencies react the same way as a machine? Looking at sites such as QueryShark, professional readers look at the cover letter/email and look for things such as who the protagonist is and what choices they face. For example, QueryShark won’t even request the first chapter until they’ve read a query email.

An experiment would be to run sample chapters of successful and unsuccessful books against professional agency readers to get their view, but that would be an experiment for another day.

Don’t be overly emotional

Overly emotional books perform poorly, whether they are overly negative or overly positive. The only emotional category seen commonly in successful books was Anger.

That’s not to say that emotion shouldn’t be included but that it should not overwhelm writing. This includes both the dialogue and the action.

This applied to all genres except Adventure, and even then the positive effect was small compared with the overwhelmingly strong net difference in unsuccessful books.

This ties in with writing tips on avoiding melodrama, which say to show reactions and details of characters rather than spell the emotion out:

Remember that the drama doesn’t have to be all the way at eleven in order to affect the reader. Readers get into the little aspects of people’s lives, too.

And writing extreme emotion well by not necessarily expressing it:

Unfortunately, many writers make the mistake of assuming that to be gripping, emotion must be dramatic. Sad people should burst into tears. Joyful characters must express their glee by jumping up and down. This kind of writing results in melodrama, which leads to a sense of disbelief in the reader because, in real life, emotion isn’t always so demonstrative.

And finally there is of course the Robot Devil’s demand that you shouldn’t just have your characters announce how they feel (and so avoid naming emotions or entering too many per paragraph).

The emotional tags in the LIWC results supported this (Penn doesn’t offer emotion as a tag): unsuccessful books overwhelmingly dominate the emotion categories.

Emotion PoS in the LIWC analysis – negative results are PoS tags more common in unsuccessful books and positive results are for successful books. ‘Affect’ includes emotions and other affective processes, ‘posemo’ is positive emotion and ‘negemo’ negative emotions.
T-test for significance using LIWC results for Tone (both positive and negative emotions). It is significant (p<0.05) for all genres except Historical fiction and Sci-fi. The figures at the top are the P-values — you can find out more on how to interpret boxplots.

Make it readable — but don’t worry too much

Although the Flesch-Kincaid readability was significant, and slightly lower (roughly one school year) readability was marked in unsuccessful books, I do not think the difference was so great as to make it important.

T-test (a standard statistical test for significance) for mean words per sentence, mean syllables per word and Flesch-Kincaid readability (FR) — both mean syllables per word and readability are statistically significant as p<0.05.
Readability by genre. Readability is significant for Adventure, Detective/mystery and Love-story. Note how Sci-fi’s plots are noticeably different to the other genres.
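
For reference, the two standard Flesch formulas can be sketched from the same two inputs tested above (mean words per sentence and mean syllables per word). This is only an illustration: which of the two scores the Perl readability measure reported as "FR" is not stated here, and the function names are my own.

```python
def flesch_reading_ease(words_per_sentence, syllables_per_word):
    """Flesch Reading Ease: higher scores mean easier text."""
    return 206.835 - 1.015 * words_per_sentence - 84.6 * syllables_per_word


def flesch_kincaid_grade(words_per_sentence, syllables_per_word):
    """Flesch-Kincaid Grade Level: roughly the US school grade needed."""
    return 0.39 * words_per_sentence + 11.8 * syllables_per_word - 15.59


# Hypothetical book: 15 words per sentence, 1.5 syllables per word
print(flesch_reading_ease(15, 1.5))
print(flesch_kincaid_grade(15, 1.5))
```

Both scores depend only on sentence length and word length, which is why long made-up or scientific terms push a book towards a "harder" rating regardless of how gripping it is.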

Looking at LIWC tests related to readability: the proportion of six-letter or longer words (ie long words); dictionary words:

Looking at the overall rating, the proportion of six-letter or longer words and mean words per sentence were flagged as significant.

Overall make it readable without too many long words but don’t worry too much about the specifics.

Avoiding adjectives isn’t the best advice

This very brief University of Pennsylvania study of a few books and their contents suggested that adjectives and adverbs predominate in badly written books while good books have a higher proportion of nouns and verbs.

These 2 charts suggest at first that this is the case.

Adjectives (jj), adverbs (rb), nouns (nn) and verbs (vb) difference in proportion from Penn results — as before, positive results are from successful books and negative results from unsuccessful books

 

LIWC results for adjectives (adj) and adverbs

Yet while Adjectives was the PoS with the greatest relative importance in the Penn PoS test of the original data this was not repeated in the 2018 data nor in the LIWC tests.

Likewise while adjectives and adverbs dominate unsuccessful books (ie negative plots) in most genres, this isn’t always the case. And the difference is small compared to noun dominance — which again has mixed results across the genres.

Finally, I carried out a fresh T-test (a common test to find significance) to find statistical significance for adjectives, adverbs, nouns and verbs overall and adjectives per genre:

T-test with P-value for adjectives, adverbs, nouns and verbs. No P-value is lower than 0.05 so none is statistically significant.
T-test for adjectives per genre (Penn PoS). Again, none is statistically significant.
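
As a rough illustration of what such a T-test compares, here is a minimal sketch of Welch's version of the test (the one that does not assume equal variances) using only the Python standard library. Whether the original analysis used Welch's or Student's variant is not stated here, and the adjective proportions below are invented purely for the example.

```python
from math import sqrt
from statistics import mean, variance


def welch_t(sample_a, sample_b):
    """Return Welch's t statistic and its approximate degrees of freedom."""
    n_a, n_b = len(sample_a), len(sample_b)
    # Squared standard error contributed by each group
    se_a = variance(sample_a) / n_a
    se_b = variance(sample_b) / n_b
    t = (mean(sample_a) - mean(sample_b)) / sqrt(se_a + se_b)
    # Welch-Satterthwaite approximation for the degrees of freedom
    df = (se_a + se_b) ** 2 / (se_a ** 2 / (n_a - 1) + se_b ** 2 / (n_b - 1))
    return t, df


# Hypothetical adjective proportions (per cent of tags) per book
successful = [6.1, 5.8, 6.4, 6.0, 5.9]
unsuccessful = [6.3, 6.6, 5.9, 6.2, 6.5]
t, df = welch_t(successful, unsuccessful)
print(t, df)
```

The t statistic and degrees of freedom are then looked up against the t distribution to get the P-value; a statistics package would normally do that last step.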

The above charts do show that successful books have a lower proportion of adjectives and adverbs. However, contrary to the University of Pennsylvania column, successful books also have a lower proportion of nouns and verbs.

One reason is suggested by The Economist’s Johnson column:

How can usage-book writers have failed to notice that good writers use plenty of adverbs? One guess is that they are overlooking many: much, quite, rather and very are common adverbs, but they do not jump out as adverbs in the way that words ending with –ly do. A better piece of advice than “Don’t use adverbs” would be to consider replacing verbs that are combined with the likes of quickly, quietly, excitedly by verbs that include those meanings (race, tiptoe, rush) instead.

And for those advocating verbs instead, he adds:

It is hard to write without verbs. So “use verbs” is not really good advice either, since writers have to use verbs, and trying to add extra ones would not turn out well

Not one of these results is statistically significant.

What this suggests is that use of adjectives may be a symptom of bad writing but it’s not a cause: their overuse is not a reason why a book is unsuccessful. This is the same conclusion that the University of Pennsylvania post that analysed books for their adverbs, adjectives, nouns and verbs came to:

I suspect that the differences in POS distributions are a symptom, not a cause, and that attempts to improve writing by using more verbs and adverbs would generally make things worse. But still.

What this means for writers, then, is: avoid them, but don’t worry too much.

Don’t talk too much

Too much dialogue, as indicated by quotation marks (“), was a sign of an unsuccessful book in all genres except for short stories.

The LIWC results show that quotation marks (‘quote’) are in a higher proportion in unsuccessful books, with the exception of short stories. In poorly performing Adventure books they account for nearly 1% of all tags.
T-test for the statistical significance of quotation marks for genres. They are significant in Adventure, Detective/mystery, Fiction and Love stories as p<0.05. Unsuccessful books tend to have a higher proportion than successful ones. Successful Sci-fi books have a very large range for the proportion, again showing how this genre has its own rules.

For writers this most likely means that focusing on dialogue at the expense of description is not popular with readers.

Note that the quote mark proportion is a very rough approximation — it doesn’t allow for long paragraphs of dialogue, nor look at books with no quote marks and what their pattern is.
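
To show how rough this kind of measure is, here is a minimal sketch of a quote mark proportion. It is not how the LIWC counts its ‘quote’ category (that method is not described here); the tokeniser and the set of characters treated as quote marks are my own assumptions.

```python
import re

# Assumption: only double quotation marks (straight and curly) count
QUOTE_CHARS = {'"', "\u201c", "\u201d"}


def quote_proportion(text):
    """Share of tokens (words and punctuation marks) that are quote marks."""
    # A token is a run of word characters or a single punctuation mark
    tokens = re.findall(r"\w+|[^\w\s]", text)
    if not tokens:
        return 0.0
    quotes = sum(1 for token in tokens if token in QUOTE_CHARS)
    return quotes / len(tokens)


print(quote_proportion('He said "go" now'))
```

A one-word exclamation and a page-long speech each add the same two quote marks, which is exactly the limitation noted above.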

Genre shapes the rules — and science-fiction breaks them

One thing that was consistent in all analyses was that analysing by genre showed variation within the overall finding for the category. For readability, for example, it was only significant for the Adventure, Detective/mystery and Love story genres. Likewise in the LIWC tests there was no test category which produced a significant result across all genres.

This makes sense — after all, poetry was included in the analysis, and it’s hard to see a reader applying the same mental rules of what they count as good to a poem as to a science-fiction novel. Even short stories have slightly different writing ‘rules’ — the quotation proportion, for instance, shows that short stories can be more dialogue heavy than full length novels.

What is interesting is how some genres live up to their stereotypes. In the LIWC tests, for example, Clout “refers to the relative social status, confidence, or leadership that people display” and was defined through analysis of speech. It was significant for the Adventure, Detective-mystery and Fiction genres. In my imagining of the archetypes, detectives and adventure heroes have a certain clout or leadership.

Difference in Clout (scaled and normalised) between more and less successful books. The boxes show the range of the mid 50% of results and the line the median, with successful sci-fi having the largest range.

Similarly for the readability tests it was not shown to be significant for science-fiction or poetry. Again I don’t think a poem is judged by its readability, nor, again in my own experience, is poor readability or writing style a hindrance to sci-fi.

Readability by genre (scaled and normalised) — see how different science-fiction’s boxplots are and how similar the medians (lines) are for success and failure.

One thought is that if the idea or story is gripping then it seems that a science fiction novel has a greater chance of success. More importantly, science fiction books often have long scientific terms or made-up words which would fare badly in readability analysis tools. I know I’ve read enough sci-fi novels with long scientific or pseudo-scientific terms, long sentences and wooden characters, but I persevered because I found the concept interesting.

This coincides with recent research which suggests that readers treat science-fiction as different and ‘read stupidly’ compared with other genres:

Readers of the science fiction story “appear to have expected an overall simpler story to comprehend, an expectation that overrode the actual qualities of the story itself”, so “the science fiction setting triggered poorer overall reading”.

Whether this is the cause or the effect of science-fiction having its own rules in this study is not clear.

Summary of principles: follow good writing practice

What then has this study revealed? First, that one old saw about writing isn’t quite right. While adverbs and adjectives tend to predominate in unsuccessful books, they aren’t statistically significant.

This means that unsuccessful books having a higher proportion of adverbs and adjectives (and nouns and verbs) in this set of results doesn’t mean much and is likely to be the same in other results claiming the same.

While personally I agree that I prefer writing with fewer adjectives and adverbs, it’s not a hindrance. Children’s books in particular have more, for example Harry Potter and his chums tend to say things “severely”, “furiously”, “loftily” and so on but the series is an undoubted hit with readers.

So the next time someone tells you that you must use fewer adverbs, or to swiftly remove unnecessary adjectives, you can tell them “no, it’s not statistically significant”. And they’ll love you for it (this advice is not statistically significant).

The next outcome is to know your genre. By splitting the books into a wide range of genres, not just fiction but poetry, love-stories and science-fiction among others, we saw that the so-called rules varied.

Sci-fi in particular was the exception to many of the findings. The tests I ran cannot say why, but we can speculate. One possibility is that the audience for sci-fi is likely to be quite different from that of poetry, love stories and even regular fiction. Hard research on this is hard to find, but sci-fi audiences do seem different to other readers, and this quote from a survey of sci-fi readers suggests they may value world building over other considerations:

The creativity that goes into world building and bringing ‘otherworldly’ characters to life in a way that we can identify with.

It may also be that the theme or subject of the books — something that the Penn and LIWC analyses cannot work out — may be more gripping to readers such that they ignore or overlook what would otherwise be considered weak writing.

Don’t talk too much: readers want more than just dialogue. Unlike films, where heavy exposition is unwelcome, in books it seems readers prefer stories with a good balance of description to talking.

Read more than the first chapter to get a true sense of a book — although I can’t say how many words does give the best approximation just yet.

Finally, don’t be overly emotional, either too positive or too negative, in your writing. This suggests that the old ‘show, don’t tell’ writing saw is true: rather than telling us that someone is angry (and using that word), show their reaction, using nouns and verbs (and yes, adverbs if you must).

Comparison with the original Success with Style findings

These findings contrast with some of the original findings — if you can bear the sidebar of shame the Daily Mail has summed up the original findings in a more readable way than the original paper:

Successful books tended to feature more nouns and adjectives, as well as a disproportionate use of the words ‘and’ and ‘but’ – when compared with less successful titles.

But my tests found that the proportion of adverbs, adjectives, nouns or verbs wasn’t statistically significant.

The most popular books also featured more verbs relating to ‘thought-processing’ such as ‘recognised’ and ‘remembered’.

T-test for significance using LIWC results for ‘cogproc’ which shows cognitive processes, which includes thought processing. Again the genre varies the results but it is statistically significant for Detective/mystery, Fiction, Love stories, Poetry and Short stories

This is statistically significant for most, but not all, genres so is something we agree on.

Verbs that serve the purpose of quotes and reports, for example the word ‘say’ and ‘said’, were heavily featured throughout the bestsellers.

My tests found the exact opposite, that it was statistically significant that those with quotes did worse in most genres. Now it may be that writers are told to only use the word ‘said’ for dialogue tags so it may be that bestsellers follow this and use ‘said’ while poorer writers use other terms, which is why successful ones have the higher ‘said’ proportion. But a quick search of that schoolboy favourite, “ejaculated” (as in to speak suddenly or sharply) found it in around half of all successful books (99 out of 206 books) so it’s another reason to doubt this finding.

Alternatively, less successful books featured ‘topical, cliché’ words and phrases such as ‘love’, as well as negative and extreme words including ‘breathless’ and ‘risk’.

I didn’t look for specific words, though there are tools to do so if you wish. However, my results did say that overly emotional books do do worse and that does tie in with love, breathless and risk.

Poor-selling books also favoured the use of ‘explicitly descriptive verbs of actions and emotions’ such as ‘wanted’, ‘took’, ‘promised’ ‘cried’ and ‘cheered’. Books that made explicit reference to body parts also scored poorly.

This was true but the difference was small and not statistically significant for verbs. For emotions though it was significant in most genres (except, of course, sci-fi, that malcontent genre).

So of the original findings only 2 of those were fully agreed with in this study and one partially.

Final thought…

Throughout all this, at the risk of being melodramatic, be true to yourself and write for yourself. This analysis gives pointers on signs of a bad book (‘unsuccessful’ in the more diplomatic description) but that doesn’t mean you must slavishly follow these principles.

Write with what you’re comfortable with and for the reasons you want. Just don’t be overly dramatic about it.

… with one last thing — too many things?

There is a big caveat with this study. I asked my friend, stats professor and consultant Dr Ben Parker, for his view (he is seriously clever with numbers, not bad with puns, and he offers some very reasonably priced but quality stats training courses and consultancy).

He thinks too many things may have been tested. He’s no doubt right: the statistical tests used depend on the independent variables and their levels, and there is a chance I used the wrong ones by analysing too much.

Ben was also concerned that the p-value threshold of 0.05 was too high and may need to be 0.01 or lower. This is because if we tested 20 variables at that threshold, we would expect about one of them to appear significant by chance alone, as 1/20 is 0.05. I did run the tests separately each time in the code, but there is a chance that I may have merged and analysed too much per test. This could also be the reason why the original research was lacking in statistical significance tests.
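
The concern about testing many variables can be made concrete with a little arithmetic. The sketch below assumes the tests are independent, which the tests in this study may well not be, and uses the Bonferroni correction as one common (and conservative) way to pick a stricter threshold. It is not what was applied in this study.

```python
def familywise_error_rate(alpha, n_tests):
    """Chance of at least one false positive across n independent tests."""
    return 1 - (1 - alpha) ** n_tests


def bonferroni_threshold(alpha, n_tests):
    """Per-test threshold that keeps the overall error rate near alpha."""
    return alpha / n_tests


alpha, n_tests = 0.05, 20
print(alpha * n_tests)                         # expected false positives
print(familywise_error_rate(alpha, n_tests))   # chance of at least one
print(bonferroni_threshold(alpha, n_tests))    # stricter per-test threshold
```

With 20 tests at 0.05 the chance of at least one spurious "significant" result is well over a half, which is why a threshold of 0.01 or lower is a reasonable suggestion.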

I did run the tests by testing values separately as well as all together, but I admit that I don’t have years of stats experience under my belt (unlike Ben who knows his stuff) and may have overlooked some things. My code is on GitHub so anyone willing to check is welcome to review and amend. The conclusion is that the results are probably sound but the statistical significance may not be right.

However, even if the significance results are wrong, he suspects that it is more than likely that the resulting charts and the differences in positions are still broadly correct, which is why I have left the information as it stands.

Try for yourself, fork it if you disagree and use for your own amusement.

All links to data and code are found in this final post.

Categories
Research

Success with Style part 6: retrospective and links

This section is only if you want to recreate the experiments yourself. If you want to look in depth at the analyses read:

Retrospective thoughts

My main thoughts are:

  • I wish I’d double then triple checked the data. I produced the source data a couple of years ago when first looking at this, and it had some errors (mainly around column sorts not capturing all columns). Thanks, Excel.
  • I should have asked the original authors for their methods. It was an interesting exercise to try to repeat it, but there is no harm in asking
  • I should have made R do more of the hard work around producing images and other things I could have automated better
  • I wish I’d learnt about GutenbergR to download different books, eliminate poetry as not of interest to me and replace with another genre

Next time I would:

  • get more books and genres (using GutenbergR)
  • focus on fewer tests and review the tests I use

For anyone looking to do their own experiments

Use the LIWC for machine analysis

The LIWC not only gave a better machine learning performance, its own categories and tags also gave a better range of significant results than the Penn treebank. Generally I found the tone (emotion) the most interesting measure as it assigned human feelings to the tags, rather than just categorising grammatically (or is it linguistically?) what kind of word it is.

Ultimately this study has been about what this means for readers and writers and that’s why the emotion is of most relevance to me.

That and the fact that the LIWC is still being researched and updated (the last in 2015, compared with Penn’s from 1996) means that it has the potential for a longer shelf life. I also found it easier to use.

The main downside is that it is not free.

Don’t skip the punctuation or action

The most surprising result was how punctuation affected results. I had tried running experiments without it but the machine learning performance decreased by around 5 percentage points while the tag differences did not seem to change.

 

The LIWC results show that quotation marks (‘quote’) are in a higher proportion in unsuccessful books, with the exception of short stories. In poorly performing Adventure books they account for nearly 1% of all tags.

Links

Sponsor

This work required some small funding from Richardson Online Ltd, my consulting company, for work on the R code.

Categories
General

User research is statistical significance

User research is to design and product development as statistical significance is to data.

You can’t be confident in figures if you haven’t carried out significance tests. And you can’t be confident in a design or product change if you haven’t carried out user research.

Yet businesses that baulk at treating data as gospel without statistical significance tests will make product or design decisions without a jot of user research.

I’ve worked for organisations like this, perhaps you have too.

What is user research?

User research is many things, but in practical terms it’s the tangible outcome of making your users, audience or customers the heart of what you do.

It’s one thing for a company to tell us that customers are their “number one priority”.

A company shows it by having user researchers who learn about their users: who they are, what they do, what they want, what they like and dislike, what influences them.

User researchers uncover these findings through interviews, observation, usability studies, surveys. Then they interpret and gather insight through multiple rounds of research.


Insight is the output: you find out who exactly your users are, and what their pain points and needs are. Insight is shared with the wider team (and the team should be joining in on research sessions too).

These findings sit within a goal. This can be a project, business or organisational goal, and how the product or service will best serve its users.

This is a very quick overview of it and there are many places to find out more about user research, how it improves service design and why you should do it.

What is statistical significance?

Let’s say you’ve carried out A/B tests on two web pages and design B led to a 10% increase in goal completions.

Does this necessarily mean that design B is ‘better’? You could have got lucky with a horde of spendthrift shoppers logging in together, or unlucky when the internet failed during design A’s slot.


The point of statistical significance is to be able to say that the results are likely to be true, a “low chance of an effect that actually is a false alarm”. That if you repeated this you would have a good chance of getting similar results.

Newspapers and other everyday presentation of statistics typically omit statistical significance for simplicity. This is understandable, but statistics used in research and business must include it if they want to understand their data. And if the business does not run these tests, why not?

Statistical significance then helps give you the confidence — not certainty — that your findings are true. That your results weren’t due to a lucky (or unlucky) sample or events.
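
For the A/B test example above, one common way to check significance is a two-proportion z-test. The sketch below is a minimal version using only the Python standard library; the visitor and completion counts are made up for illustration, and a real test should also be planned around sample size before it starts.

```python
from math import sqrt
from statistics import NormalDist


def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Two-sided z-test for a difference between two conversion rates."""
    rate_a, rate_b = conv_a / n_a, conv_b / n_b
    # Pooled rate under the assumption that A and B perform the same
    pooled = (conv_a + conv_b) / (n_a + n_b)
    std_err = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (rate_b - rate_a) / std_err
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value


# Hypothetical: design B completes 10% more goals than design A
print(two_proportion_z_test(100, 1000, 110, 1000))      # small sample
print(two_proportion_z_test(1000, 10000, 1100, 10000))  # ten times the traffic
```

The same 10% uplift is not significant at 0.05 with 1,000 visitors per design but is with 10,000, which is the "lucky sample" problem in numbers.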

How user research ≣ statistical significance

User research gives you confidence. Confidence that what you’re doing has an effect due to changes your team made and not due to chance. Confidence that audience tastes are changing or a competitor has emerged.

Confidence that when the CEO says that they don’t like something you can push back because the users say otherwise. That you have research, not just opinions.


User research gives you confidence but never certainty. That’s why research is an ongoing activity, much like how significance tests are carried out on each new result.

Caveats

A danger of statistical significance is that it can give the appearance of scientific certainty when none is there. For example, produce Google Analytics data with statistical significance and it’ll appear the more ‘scientific’ result.

Yet the analytics is the outcome of people’s behaviour, and — unlike interviews — it’s hard to follow up and probe why a user did what they did with analytics.

Finally, both statistical significance and user research need to state the practical significance. Both can say that there is an effect but both need to say what the practical outcome is. For example, whether the problem is a mere annoyance or one that prevents users completing their task.

User research > statistics?

Both statistical significance and user research give you confidence in your results. But good user research includes the user impact by default.

A key part of user research is that the whole team should join in, and so will expand their own knowledge. How often does the team join the web analyst and contribute to their research?

User research can probe and build understanding across a team in a way that statistics by itself finds hard to achieve.

And any company that wants to be serious about its development needs user research as much as it needs statistical tests on its data.

Categories
Research Scientific Research

“Success with Style” part 4 — modern data and just a chapter

When starting this analysis I spotted that the download data was for the past 30 days and that this was used for success or fail categorisation. 

Even if the data was for the lifetime of the book, it’s been nearly 5 years since the original downloads. The best way to test this then was to get the latest data (albeit still for the past 30 days).

The other thought was that the analyses looked at the entire book. But what if readers did not read the entire book but only read a certain amount before making a judgment? When submitting work to an agent or publisher for consideration, for example, often only the first chapter is requested. Based on this I analysed just the first 3,000 words of each book through the Penn and LIWC tagger and used its 2013 success/fail data to repeat the experiments.
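
The truncation step could be as simple as the sketch below. It treats a word as anything separated by whitespace, which is my assumption; the taggers may count words differently, so the cut-off would only be approximately the same.

```python
def first_n_words(text, n=3000):
    """Return the first n whitespace-separated words of a text."""
    words = text.split()
    return " ".join(words[:n])


sample = "It was a bright cold day in April"
print(first_n_words(sample, n=4))
```

Note that joining on single spaces discards the original line breaks, which should not matter for word-level tagging but would for anything that relies on paragraph structure.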

Finally I noticed a bias towards punctuation as markers for success or failure in the output and ran the experiments without the punctuation tags to see what the result would be.

Starting hypotheses

H0: There's no difference in the tests which produce significant results between the 2013 and 2018 data
HA: There is a difference in the tests which produce significant results between the 2013 and 2018 data

H0: There's no difference in the tests which produce significant results between the full machine analysis of the book and that of just the first 3,000 words
HB: There is a difference in the tests which produce significant results between the full machine analysis of the book and that of just the first 3,000 words

The hypotheses are fairly simple – if there is no difference in the 2018 data then most of the tests that proved significant with the 2013 data should also do so in 2018.

Likewise, if analysing only the first 3,000 words makes no difference, the same tests should be significant as for the full text.

3,000 words (3k words) is about 10 pages and is about one chapter’s length although of course there is no hard and fast rule about how long a chapter is.

Data used

Data summary

2018 data download date: 2018-07-22
2013 data download date: 2013-10-23
Unique books used: 759

Difference in 2013 and 2018 success rates

Row Labels Count
FAILURE 22
Adventure 5
Detective/mystery 3
Fiction 2
Historical-fiction 1
Love-story 1
Poetry 8
Short-stories 2
SUCCESS 20
Adventure 3
Detective/mystery 4
Fiction 1
Historical-fiction 4
Love-story 3
Sci-fi 5
Grand Total 42

There were 758 unique books (the remaining 42 of the 800 listed were in multiple categories). With 42 differing that is 5.5% of the total books used and none of those with a different success status was listed in multiple categories.

The new data was parsed through both the Perl Lingua Tagger using the Penn treebank and Perl readability measure and the LIWC tagger.

Results for 2013, 2018 and 3,000 word data

Machine learning performance

The most important measure for me is which is the best for making predictions. 

Using all tags including punctuation

Data set | Accuracy | 95% confidence interval | Sensitivity | Specificity
Readability 2013 | 65.62% | 57.7-72.9% | 69% | 63%
Readability 2018 | 65.00% | 57.5-72.8% | 68% | 63%
Readability 3k | 55.62% | 47.6-63.5% | 68% | 44%
LIWC 2013 | 75.00% | 67.6-81.5% | 76% | 74%
LIWC 2018 | 71.70% | 64.0-78.6% | 78% | 66%
LIWC 3k | 56.25% | 48.2-64.0% | 53% | 60%

According to this, the LIWC is still the best tagger, and the 2013 and 2018 data are fairly similar for both readability and LIWC, with each result falling within the other’s 95% confidence interval.

Both for readability and LIWC the first 3,000 words (3k) are much worse predictors of overall success and barely better than a 50/50 guess.
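
For anyone unfamiliar with these measures, the sketch below shows how accuracy, sensitivity and specificity fall out of a confusion matrix, along with a 95% confidence interval for the accuracy. It uses the Wilson score interval, which may not be the interval method behind the figures above, treats "successful" as the positive class (my assumption), and the counts are invented.

```python
from math import sqrt
from statistics import NormalDist


def classification_summary(tp, fn, fp, tn, confidence=0.95):
    """Accuracy, sensitivity, specificity and a Wilson interval for accuracy."""
    total = tp + fn + fp + tn
    accuracy = (tp + tn) / total
    sensitivity = tp / (tp + fn)   # successful books correctly predicted
    specificity = tn / (tn + fp)   # unsuccessful books correctly predicted
    # Wilson score interval for a proportion
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    denom = 1 + z * z / total
    centre = (accuracy + z * z / (2 * total)) / denom
    half = z * sqrt(accuracy * (1 - accuracy) / total
                    + z * z / (4 * total * total)) / denom
    return {
        "accuracy": accuracy,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "ci_low": centre - half,
        "ci_high": centre + half,
    }


# Hypothetical test set of 100 books
print(classification_summary(tp=40, fn=10, fp=20, tn=30))
```

If the lower end of the interval sits close to 50%, as it does for the 3k rows, the classifier cannot be confidently distinguished from a coin toss.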

Difference in significance in key measures

Punctuation

Omitting punctuation made little difference to the tag results for the LIWC or Penn analyses, but the machine learning performance dropped by around 5 percentage points in every case.

Readability 

Genre | Significant 2013 | Significant 2018 | Significant 3k words
Adventure | TRUE | TRUE | TRUE
Detective/mystery | TRUE | TRUE | TRUE
Fiction | FALSE | FALSE | FALSE
Historical-fiction | FALSE | FALSE | FALSE
Love-story | TRUE | TRUE | TRUE
Poetry | FALSE | FALSE | FALSE
Sci-fi | FALSE | FALSE | FALSE
Short-stories | FALSE | FALSE | FALSE

Significant tags in the same genres for all 3 different categories.

LIWC categories

Test

genre

Significant 2013

Significant 2018

Significant 3k words

Clout

Adventure

TRUE

FALSE

TRUE

 

Detective-mystery

TRUE

TRUE

FALSE

 

Fiction

TRUE

TRUE

FALSE

 

Historical-fiction

FALSE

FALSE

FALSE

 

Love-story

FALSE

FALSE

FALSE

 

Poetry

FALSE

FALSE

FALSE

 

Sci-fi

FALSE

FALSE

FALSE

 

Short-stories

FALSE

FALSE

FALSE

         

Authenticity

Adventure

FALSE

FALSE

FALSE

 

Detective-mystery

FALSE

FALSE

FALSE

 

Fiction

TRUE

TRUE

FALSE

 

Historical-fiction

FALSE

FALSE

TRUE

 

Love-story

FALSE

FALSE

FALSE

 

Poetry

TRUE

TRUE

FALSE

 

Sci-fi

FALSE

FALSE

FALSE

 

Short-stories

FALSE

FALSE

FALSE

         

Analytical

Adventure

FALSE

FALSE

FALSE

 

Detective-mystery

FALSE

FALSE

FALSE

 

Fiction

TRUE

TRUE

TRUE

 

Historical-fiction

FALSE

FALSE

FALSE

 

Love-story

FALSE

FALSE

TRUE

 

Poetry

FALSE

FALSE

FALSE

 

Sci-fi

FALSE

FALSE

FALSE

 

Short-stories

FALSE

FALSE

FALSE

         

6 letter words

Adventure

TRUE

TRUE

TRUE

 

Detective-mystery

FALSE

FALSE

FALSE

 

Fiction

FALSE

FALSE

FALSE

 

Historical-fiction

FALSE

FALSE

FALSE

 

Love-story

TRUE

TRUE

TRUE

 

Poetry

FALSE

FALSE

FALSE

 

Sci-fi

FALSE

FALSE

FALSE

 

Short-stories

FALSE

FALSE

FALSE

         

Dictionary words

Adventure

FALSE

FALSE

FALSE

 

Detective-mystery

FALSE

TRUE

TRUE

 

Fiction

TRUE

TRUE

FALSE

 

Historical-fiction

FALSE

FALSE

TRUE

 

Love-story

FALSE

FALSE

TRUE

 

Poetry

FALSE

FALSE

FALSE

 

Sci-fi

TRUE

TRUE

TRUE

 

Short-stories

FALSE

FALSE

FALSE

         

Tone

Adventure

FALSE

FALSE

FALSE

 

Detective-mystery

TRUE

TRUE

TRUE

 

Fiction

TRUE

TRUE

TRUE

 

Historical-fiction

FALSE

FALSE

FALSE

 

Love-story

TRUE

TRUE

FALSE

 

Poetry

TRUE

TRUE

TRUE

 

Sci-fi

FALSE

FALSE

FALSE

 

Short-stories

TRUE

TRUE

TRUE

         

Mean words per sentence

Adventure

TRUE

TRUE

TRUE

 

Detective-mystery

FALSE

FALSE

FALSE

 

Fiction

TRUE

TRUE

FALSE

 

Historical-fiction

FALSE

FALSE

FALSE

 

Love-story

FALSE

FALSE

FALSE

 

Poetry

FALSE

FALSE

FALSE

 

Sci-fi

FALSE

FALSE

FALSE

 

Short-stories

FALSE

FALSE

TRUE

Whereas readability was consistent across the different approaches, the LIWC categories show a lot more variety.

Tone has the most success across this. As before the 2013 and 2018 data tend to match (but not always, as with Clout or Dictionary words) and 3,000 words, well, it does its own thing.

Tone was the most consistent throughout and, as last time, had the most significant categories, even with 3k.

Parts of speech tags (PoS) with the largest difference

The tables list the top 3 PoS that dominate in successful and unsuccessful books.

Penn data

Successful PoS 2013 | Successful PoS 2018 | Successful PoS 3k
INN – Preposition / Conjunction | INN – Preposition / Conjunction | INN – Preposition / Conjunction
DET – Determiner | DET – Determiner | DET – Determiner
NNS – Noun, plural | NNS – Noun, plural | NNS – Noun, plural

Unsuccessful PoS 2013 | Unsuccessful PoS 2018 | Unsuccessful PoS 3k
PRP – Determiner, possessive second | PRP – Determiner, possessive second | RB – Adverb
RB – Adverb | VB – Verb, infinitive | PRP – Determiner, possessive second
VB – Verb, infinitive | RB – Adverb | VB – Verb, infinitive

LIWC data

Successful PoS 2013 | Successful PoS 2018 | Successful PoS 3k
functional – Total function words | functional – Functional words | functional – Total function words
prep – Prepositions | prep – Prepositions | prep – Prepositions
article – Articles | space – Space | article – Articles

Unsuccessful PoS 2013 | Unsuccessful PoS 2018 | Unsuccessful PoS 3k
quote – Quotation marks | allpunc – All Punctuation* | adj – Common adjectives
allpunc – All Punctuation* | affect – Affective processes | adverb – Common Adverbs
affect – Affective processes | posemo – Positive emotion | affect – Affective processes

The same tags dominate all the books in the Penn treebank for successful books – prepositions (for, of, although, that), determiners (this, each, some) and plural nouns (women, books).

For unsuccessful books determiners also dominate, but in the possessive second person (mine, yours), along with adverbs (often, not, very, here) and infinitive verbs (take, live).

For LIWC it is quite similar. Function words (it, to, no, very), prepositions (to, with, above) and articles (a, an, the) dominate successful books.

For unsuccessful books it’s all punctuation, quotation marks, social words (mate, talk, they, as well as all family references) and affective processes (happy, cried), which include all emotional terms.

A high proportion of quotation marks suggests a high ratio of dialogue to action/description.

What does this tell us?

2013 v 2018 data

Overall there is more similarity than difference in the 2013 and 2018 Penn and readability results. The machine learning performance was also broadly the same, with each data set’s overall accuracy falling within the other’s 95% confidence interval.

The most successful PoS were also largely the same, as were the top 3 unsuccessful ones.

Likewise the LIWC categories generally matched in significance for both 2013 and 2018 data. The successful categories were broadly the same, as were the unsuccessful ones.

This suggests that while the original authors didn’t mention that the data was only from the previous 30 days, their results have largely held up.

The first chapter

Just judging a book by its first 3,000 words was not as accurate as analysing the whole book. The machine learning performance was barely better than a guess. 

However, the readability did match and the dominance of  successful PoS was similar to that of the full data in the 2013 and 2018 studies.

Of all the LIWC categories described in part 3, Tone was both the most significant predictor across genres and the most consistent across the different tests.

Summary

The 2018 results generally match the 2013 results, which suggests the original method holds up as a good predictor of the success or failure of those books.

The results for the first 3,000 words did not match the 2013 or 2018 data, and its machine learning performance was the weakest, which suggests that this is not an accurate way to predict a book’s success. It may be that there is a ‘sweet spot’ where the first x words correlate closely with the overall rating, but if so it is more than 3,000 words.

Successful books tend to use prepositions, determiners, nouns and function words. Unsuccessful ones skew towards quotation marks, punctuation and positive emotions (which in the LIWC are similar to affective processes).

This suggests that unsuccessful books may use shorter sentences (high punctuation rate), more dialogue (high quotation mark rate) and more adverbs, and are more emotional, particularly in positive emotions. Writers are frequently told by writing experts to avoid adverbs wherever possible.

Successful books by contrast tend to focus on the action – describing scenes and situations, hence the dominance of functional words, prepositions and articles. This makes them sound rather boring, but suggests that these bread and butter words are necessary to build a good story.

The LIWC data suggests that tone is the most reliable predictor of success. What isn’t answered is whether this is because it predominates in successful or unsuccessful books, and whether it is positive or negative emotion that matters. This is something to explore, though emotion and affect appearing in the top 3 for unsuccessful books suggests it lies there.

Punctuation tags had some use: machine learning performance was better with them. So even though punctuation tags can be hard to interpret, it is worth including them in any machine analysis, but more work is needed to interpret them.


“Success with Style” part 3: using LIWC data

Last time we replicated the original output of Success with Style, even though its methods were not listed. We managed to get the data to broadly match. Great, but now we are going to look at a different way of analysing the same text.

In part 2 we used the Penn treebank to analyse the text and its parts of speech (PoS). This time we’re using LIWC, a tool developed at the University of Texas. It has similarities to the Penn treebank in that it categorises words and has similar categories, such as prepositions.

In part 1 we looked at the original experiment and recreated it in part 2. This time we’ll use the same input data but process it through a different NLP analysis program — the LIWC.

Hypotheses

H0: There's no difference in the proportion of LIWC categories in successful and unsuccessful books, regardless of genre
HA: There is a difference in the proportion of LIWC categories in successful and unsuccessful books, and the pattern will depend on genre

H0: There's no difference in the LIWC summary values of successful and unsuccessful books, regardless of the book's genre
HB: There is a difference in the LIWC summary values of successful and unsuccessful books, and the pattern will depend on genre

 

Method

The data, the measure of success and the method were the same as in part 1, as were the p-value adjustment (p<0.05 for significance) and the machine learning algorithm. Likewise, variables with many zeroes were not transformed.

Difference in success

The R code managed to create different tags to the original. You can find the LIWC definitions at the foot of this page.

Tags per genre

LIWC Difference in proportion function-article – original data

Overall biggest difference

PoS (successful books) Definition Diff (largest difference first) PoS (Unsuccessful books) Definition Diff (largest difference first)
functional Total function words 0.003835 quote Quotation marks -0.001814
prep Prepositions 0.001758 allpunc All Punctuation* -0.001350
article Articles 0.001199 affect Affective processes -0.001231
ipron Impersonal pronouns 0.001198 social Social processes -0.001181
space Space 0.001155 posemo Positive emotion -0.001103
relativ Relativity 0.000860 ppron Personal pronouns -0.001047
number Numbers 0.000623 apostro Apostrophes -0.000999
focuspast Past focus 0.000463 female Female references -0.000963
power Power 0.000454 focuspresent Present focus -0.000929
cogproc Cognitive processes 0.000437 shehe 3rd pers singular -0.000905
period Periods/fullstop 0.000403 verb Common verbs -0.000642
comma Commas 0.000379 informal Informal language -0.000361
differ Differentiation 0.000369 exclam Exclamation marks -0.000323
otherp Other punctuation 0.000318 time Time -0.000319
parenth Parentheses (pairs) 0.000266 you 2nd person -0.000273
conj Conjunctions 0.000266 percept Perceptual processes -0.000236
quant Quantifiers 0.000257 affiliation Affiliation -0.000216
semic Semicolons 0.000254 focusfuture Future focus -0.000213
interrog Interrogatives 0.000233 sad Sadness -0.000202
colon Colons 0.000225 adj Common adjectives -0.000190
work Work 0.000197 family Family -0.000190
drives Drives 0.000163 nonflu Nonfluencies -0.000156
pronoun Total pronouns 0.000154 netspeak Netspeak -0.000154
cause Causation 0.000136 discrep Discrepancy -0.000140
anger Anger 0.000131 see See -0.000133
we 1st pers plural 0.000130 bio Biological processes -0.000130
certain Certainty 0.000125 i 1st pers singular -0.000121
compare Comparisons 0.000125 negemo Negative emotion -0.000111
they 3rd pers plural 0.000122 body Body -0.000104
death Death 0.000101 reward Reward -0.000098
tentat Tentative 0.000078 friend Friends -0.000088
ingest Ingestion 0.000060 risk Risk -0.000080
home Home 0.000055 negate Negations -0.000073
achieve Achievement 0.000038 auxverb Auxiliary verbs -0.000070
money Money 0.000016 motion Motion -0.000069
health Health 0.000011 insight Insight -0.000067
adverb Common Adverbs 0.000011 hear Hear -0.000056
leisure Leisure 0.000003 feel Feel -0.000049
swear Swear words 0.000002 assent Assent -0.000046
male Male references -0.000045
qmark Question marks -0.000035
sexual Sexual -0.000028
anx Anxiety -0.000025
dash Dashes -0.000025
relig Religion -0.000010
filler Fillers -0.000008

A positive (negative) value means that the mean PoS proportion is higher in the more (less) successful books

Unpaired t-tests

Showing results of PoS tags that have significant adjusted P-values.

PoS Definition adjusted P-value
analytic Analytical thinking 0.017
tone Emotional tone 0
mWoSen Mean Words per Sentence 0
sixletter Six letter words 0
ppron Personal pronouns 0.005
ipron Impersonal pronouns 0
article Articles 0.005
prep Prepositions 0
adj Common adjectives 0.005
number Numbers 0
affect Affective processes 0
posemo Positive emotion 0
negemo Negative emotion 0.045
sad Sadness 0.009
social Social processes 0.044
family Family 0.041
friend Friends 0
female Female references 0.026
feel Feel 0.041
bio Biological processes 0.044
affiliation Affiliation 0.017
power Power 0.017
risk Risk 0.017
focuspresent Present focus 0.02
focusfuture Future focus 0
space Space 0.009
time Time 0
informal Informal language 0
nonflu Nonfluencies 0
colon Colons 0.028
exclam Exclamation marks 0
quote Quotation marks 0.005
apostro Apostrophes 0.017

33 out of 93 tags (including punctuation) of the transformed PoS were significantly different between successful and unsuccessful books. This means that we can reject the null hypothesis (hypothesis 1) since the proportion of more than 1 PoS was significantly different between more and less successful books.

Difference in LIWC summary variables

The LIWC has its own summary variables. Some of them are proprietary, so how they’re calculated is not clear, but they rely on the PoS tags. For example, ‘tone’ is overall emotion (both the positive and negative emotion tags). Like the tags, they are expressed as a proportion of the text (ie 0.85 means 85% of the text), apart from mean words per sentence.
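To make the idea of a category proportion concrete, here is a minimal Python sketch (the analysis itself was run through LIWC and R). The word lists are toy examples copied from the Examples column of the LIWC definitions table at the foot of this page; they are not the real LIWC dictionary, which is proprietary and far larger. The function name `category_proportions` and the sample sentence are made up for illustration.

```python
import re

# Toy word lists taken from the "Examples" column of the LIWC definitions table.
# These are NOT the real LIWC dictionary, only a stand-in for illustration.
TOY_CATEGORIES = {
    "article": {"a", "an", "the"},
    "prep": {"to", "with", "above"},
    "posemo": {"love", "nice", "sweet"},
}


def category_proportions(text, categories=TOY_CATEGORIES):
    """Return, for each category, the share of words in `text` that belong to it."""
    words = re.findall(r"[a-z']+", text.lower())
    total = len(words)
    if total == 0:
        return {name: 0.0 for name in categories}
    return {
        name: sum(word in vocab for word in words) / total
        for name, vocab in categories.items()
    }


# Hypothetical sample sentence (10 words).
sample = "The cat walked to the sweet shop with a friend"
proportions = category_proportions(sample)
print(proportions)
```

In this sample 3 of the 10 words are articles, so the article proportion is 0.3, which is the kind of figure the tables on this page compare between successful and unsuccessful books.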

Variables Definition
Analytical thinking (Analytic) People low in analytical thinking tend to write and think using language in more narrative ways, focusing on the here-and-now, and personal experiences. Those high in analytical thinking perform better in college and have higher college board scores.
Clout Clout refers to the relative social status, confidence, or leadership that people display through their writing or talking. The algorithm was developed based on the results from a series of studies where people were interacting with one another.
Authenticity When people reveal themselves in an authentic or honest way, they are more personal, humble, and vulnerable.
Emotional tone (Tone) Although LIWC2015 includes both positive emotion and negative emotion dimensions, the Tone variable puts the two dimensions into a single summary variable. Numbers below 50 suggest a more negative emotional tone.
Measure Successful Unsuccessful P value Significant (p<0.05)?
Six letter words 0.1633 0.1552 0.0004 TRUE
Mean words per sentence 18.3832 17.0184 0.0007 TRUE
Dictionary words 0.8388 0.8410 0.6000 FALSE
Authentic 0.2240 0.2181 0.3900 FALSE
Analytic 0.7240 0.6939 0.0032 TRUE
Clout 0.7417 0.7499 0.3800 FALSE
Tone 0.3892 0.4486 0.0010 TRUE

Results show that the mean words per sentence was significantly higher in successful books and comparable to the figures in the original test. Likewise, the proportion of words of six letters or more was significantly higher in successful books. The tone, however, is lower in successful ones (ie uses fewer emotional words either positive or negative).

Looking further at these categories by genre:

Difference in analytical words (scaled and normalized) between more and less successful books
Difference in authenticity (scaled and normalized) between more and less successful books
Difference in clout (scaled and normalized) between more and less successful books
Difference in Dictionary Words (scaled and normalized) between more and less successful books
Difference in mean words per sentence (scaled and normalized) between more and less successful books
Difference in proportion of 6 letter words (scaled and normalized) between more and less successful books
Difference in tone (scaled and normalized) between more and less successful books

Most important variables

PoS Definition Overall relative importance
ipron Impersonal pronouns 100.00
quote Quotation marks 86.40
otherp Other punctuation 69.99
posemo Positive emotion 68.88
time Time 67.30
space Space 64.90
parenth Parentheses (pairs) 58.40
you 2nd person 56.80
adj Common adjectives 46.73
risk Risk 41.25
sixletter Six letter words 40.70
semic Semicolons 38.60
power Power 35.29
netspeak Netspeak 31.52
number Numbers 30.08
swear Swear words 28.03
period Periods/fullstop 27.75
filler Fillers 25.91
certain Certainty 25.69
death Death 25.56
mWoSen Mean words per sentence 25.03
ppron Personal pronouns 22.95
colon Colons 20.12
focuspast Past focus 19.99
body Body 18.78
tone Emotional tone 18.57
leisure Leisure 17.86
focusfuture Future focus 16.08
home Home 14.88
exclam Exclamation marks 13.08
achieve Achievement 11.90
dicWo Dictionary words 11.72
apostro Apostrophes 9.99
work Work 9.22
ingest Ingestion 7.70
health Health 6.83
relig Religion 5.91
qmark Question marks 3.93
interrog Interrogatives 2.72
hear Hear 1.48

Machine learning performance

Accuracy 95% CI Sensitivity Specificity
75.00% 67.6%-81.5% 76% 74%

Conclusion

  • The mean proportions of 33 PoS tags were significantly different between more successful and less successful books (reject null hypothesis 1)
  • Six letter word proportion, mean words per sentence, analytical words and tone were significantly different between more and less successful books (reject null hypothesis 2). Across these categories, all genres except historical fiction had a significant difference, with tone (ie both positive and negative emotion use) being significant for 5 out of the 8 genres. No category in the Penn treebank analysis had this many significant genres.
  • Six letter words, Mean words per sentence, Dictionary words, Authentic, Analytic, Clout, and Tone can be used to predict the status of the book with an accuracy reaching 75%. This is superior to the readability, mean words per sentence and mean syllables per word score of 65%. 

Overall LIWC analysis has performed better than using readability and Penn treebank analysis.

LIWC definitions

These are taken from the LIWC manual.

Abbreviation Category Examples
WC Word count
Summary Language Variables
Analytic Analytical thinking
Clout Clout
Authentic Authentic
Tone Emotional tone
WPS Words/sentence
Sixltr Words > 6 letters
Dic Dictionary words
Linguistic Dimensions
funct Total function words it, to, no, very
pronoun Total pronouns I, them, itself
ppron Personal pronouns I, them, her
i 1st pers singular I, me, mine
we 1st pers plural we, us, our
you 2nd person you, your, thou
shehe 3rd pers singular she, her, him
they 3rd pers plural they, their, they’d
ipron Impersonal pronouns it, it’s, those
article Articles a, an, the
prep Prepositions to, with, above
auxverb Auxiliary verbs am, will, have
adverb Common Adverbs very, really
conj Conjunctions and, but, whereas
negate Negations no, not, never
Other Grammar
verb Common verbs eat, come, carry
adj Common adjectives free, happy, long
compare Comparisons greater, best, after
interrog Interrogatives how, when, what
number Numbers second, thousand
quant Quantifiers few, many, much
Psychological Processes
affect Affective processes happy, cried
posemo Positive emotion love, nice, sweet
negemo Negative emotion hurt, ugly, nasty
anx Anxiety worried, fearful
anger Anger hate, kill, annoyed
sad Sadness crying, grief, sad
social Social processes mate, talk, they
family Family daughter, dad, aunt
friend Friends buddy, neighbor
female Female references girl, her, mom
male Male references boy, his, dad
cogproc Cognitive processes cause, know, ought
insight Insight think, know
cause Causation because, effect
discrep Discrepancy should, would
tentat Tentative maybe, perhaps
certain Certainty always, never
differ Differentiation hasn’t, but, else
percept Perceptual processes look, heard, feeling
see See view, saw, seen
hear Hear listen, hearing
feel Feel feels, touch
bio Biological processes eat, blood, pain
body Body cheek, hands, spit
health Health clinic, flu, pill
sexual Sexual horny, love, incest
ingest Ingestion dish, eat, pizza
drives Drives
affiliation Affiliation ally, friend, social
achieve Achievement win, success, better
power Power superior, bully
reward Reward take, prize, benefit
risk Risk danger, doubt
TimeOrient Time orientations
focuspast Past focus ago, did, talked
focuspresent Present focus today, is, now
focusfuture Future focus may, will, soon
relativ Relativity area, bend, exit
motion Motion arrive, car, go
space Space down, in, thin
time Time end, until, season
Personal concerns
work Work job, majors, xerox
leisure Leisure cook, chat, movie
home Home kitchen, landlord
money Money audit, cash, owe
relig Religion altar, church
death Death bury, coffin, kill
informal Informal language
swear Swear words fuck, damn, shit
netspeak Netspeak btw, lol, thx
assent Assent agree, OK, yes
nonflu Nonfluencies er, hm, umm
filler Fillers Imean, youknow
allpunc All Punctuation*
period Periods/fullstop .
comma Commas ,
colon Colons :
semic Semicolons ;
qmark Question marks ?
exclam Exclamation marks !
dash Dashes
quote Quotation marks
apostro Apostrophes
parenth Parentheses (pairs) ()
otherp Other punctuation

“Success with Style” part 2: recreating the original experiment

How did the team behind Success with Style develop their tests, which they claimed were statistically significant?

In part 1 we looked at the original paper and noted the lack of a hypothesis so I proposed one:

H0: There's no difference in the distribution of the proportion of PoS tags in successful and unsuccessful books, regardless of the book's genre.
HA: There is a difference in the distribution of the proportion of PoS tags in successful and unsuccessful books, and the pattern will depend on a book's genre.

I also added another:

H0: There's no difference in the Flesch-Kincaid readability of successful and unsuccessful books, regardless of the book's genre.
HA: There is a difference in the Flesch-Kincaid readability of successful and unsuccessful books, and the pattern will depend on a book's genre.

Note: since publishing I have updated some tables after noticing errors in the original data. I was caught out by Excel not always reordering all columns when sorting.

Hypotheses and data used

The original team used Fog and Flesch-Kincaid reading grade level but to save duplication of work I only used Flesch-Kincaid. However, my source data includes the Fog rating if you wish to use it. My experience has been that Flesch-Kincaid gives more accurate results. The Flesch-Kincaid readability used here is US school grade level, where the lower the value the easier it’s judged to read.

The Fog and Flesch readability indices are both in my original data if you want to run the analysis yourself; I’ll publish all data and code in the final part of this review. I also capped unreliable data for words per sentence: average words per sentence was capped at 50 (this applied to only 4 books).
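For readers who want to see how the grade level is built, here is a minimal Python sketch of the standard Flesch-Kincaid grade formula (the analysis itself was done in R, so this is an illustration rather than the code used). The function name and the `wps_cap` parameter are my own labels; the cap of 50 mirrors the one described above, and the inputs are the group means reported in the readability table later in this post.

```python
def flesch_kincaid_grade(words_per_sentence, syllables_per_word, wps_cap=50.0):
    """US school grade level; lower values are judged easier to read."""
    # Cap unreliable sentence lengths, mirroring the cap of 50 described above.
    wps = min(words_per_sentence, wps_cap)
    # Standard Flesch-Kincaid grade level formula.
    return 0.39 * wps + 11.8 * syllables_per_word - 15.59


# Group means taken from the readability table later in this post.
successful = flesch_kincaid_grade(17.8, 1.45)
unsuccessful = flesch_kincaid_grade(17.0, 1.43)
print(round(successful, 2), round(unsuccessful, 2))
```

Feeding in the group means gives roughly 8.46 and 7.91, close to the reported 8.46 and 7.98. The small gap is expected, since the mean of per-book scores need not equal the score of the mean inputs.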

The original team gathered a range of books and classed them by genre and by success or failure, based on the number of downloads over the 60 days prior to collection. We’ll use the same.

They had an equal number of books per genre and total failures and successes (758 books with 42 across multiple genres to give a total 800 books, 400 of which are failures, 400 success).

Statistical tests

For these tests I’m greatly indebted to the users of Stack Overflow and Ahmed Kamel. While I had the original ideas it was he who got them into a working R script and analysis and the analysis relies heavily on his work. I’d highly recommend Ahmed if you want help with your own statistical tests.

Statistical analysis was performed using R studio v 1.1.149. I’ve put a more detailed methodology at the end of this page. Significance uses p ≤ 0.05.

Difference in success

The R code managed to reproduce the original figures and I’ve displayed their tables and graphs as appropriate.

Tag difference per genre

Difference in proportion (all tags) cc-ls
Difference in proportion: cc-ls
Difference in proportion (all tags) md-rbs
Difference in proportion: md-rbs
Difference in proportion: sym-wdt
Difference in proportion: wp-lrb

Overall biggest difference

The data is side-by-side here, with the first two columns being the successful books and the last two the unsuccessful ones.

PoS (Successful books) Difference PoS (Unsuccessful books) Difference
INN – Preposition / Conjunction 0.005560 PRP – Determiner, possessive second -0.004326
DET – Determiner 0.003114 RB – Adverb -0.003033
NNS – Noun, plural 0.002730 VB – Verb, infinitive -0.002690
NN – Noun 0.001540 VBD – Verb, past tense -0.002665
CC – Conjunction, coordinating 0.001399 VBP – Verb, base present form -0.001630
CD – Adjective, cardinal number 0.001309 MD – Verb, modal -0.001306
WDT – Determiner, question 0.001050 FW – Foreign words -0.001169
WP – Pronoun, question 0.000558 POS – Possessive -0.000890
VBN – Verb, past/passive participle 0.000525 VBZ – Verb, present 3SG -s form -0.000392
PRPS – Determiner, possessive 0.000444 WRB – Adverb, question -0.000385
VBG – Verb, gerund 0.000259 UH – Interjection -0.000205
SYM – Symbol 0.000197 NNP – Noun, proper -0.000181
JJS – Adjective, superlative 0.000170 TO – Preposition -0.000107
JJ – Adjective 0.000083 EX – Pronoun, existential there -0.000063
WPS – Determiner, possessive & question 0.000045
JJR – Adjective, comparative 0.000041
RBR – Adverb, comparative 0.000013
RBS – Adverb, superlative 0.000003
LS – Symbol, list item 0.000002

 

A positive value means that the mean PoS proportion is higher in the more successful books, while a negative value means its proportion is higher in less successful books.

Unpaired t-tests

For those not aware of significance, the P-value is used to determine whether a result is significant and didn’t just happen by chance. Statisticians may point out that probability is chance, but for a basic overview you can find out more here.

PoS P-value adjusted P-value
CD – Adjective, cardinal number 0 0
DET – Determiner 0 0
INN – Preposition / Conjunction 0 0
JJS – Adjective, superlative 0.012 0.039
MD – Verb, modal 0.004 0.015
POS – Possessive 0 0
PRPS – Determiner, possessive 0.022 0.057
VB – Verb, infinitive 0.018 0.052
WDT – Determiner, question 0 0
WP – Pronoun, question 0.033 0.078
WRB – Adverb, question 0.001 0.004

12 out of 41 of the transformed PoS were significantly different between successful and unsuccessful books. This means that we can reject the null hypothesis (hypothesis 1) since the proportion of more than 1 PoS was significantly different between more and less successful books.

Difference in Flesch-Kincaid readability, mean words per sentence, and mean syllables per word between successful and unsuccessful books

Measure Successful Unsuccessful P value
Mean words per sentence 17.8 17 0.25
Mean syllables per word 1.45 1.43 0.005
Flesch-Kincaid readability 8.46 7.98 0.028

Results show that the mean readability grade was significantly higher (ie harder to read) in successful books compared to unsuccessful books. The same is true for the mean syllables per word, which was significantly higher in successful books compared to unsuccessful books.

The mean words per sentence was not significantly different between more and less successful books.

Looking further at readability by genre

genre FAILURE mean FAILURE SD SUCCESS mean SUCCESS SD P value Significant?
Adventure 7.54 1.83 9.76 3.86 0.0002 TRUE
Detective/mystery 6.82 1.40 7.56 2.03 0.0116 TRUE
Fiction 7.92 2.27 8.07 1.87 0.3852 FALSE
Historical-fiction 8.55 1.83 9.40 3.00 0.1247 FALSE
Love-story 7.61 1.57 8.83 3.32 0.0360 TRUE
Poetry 11.27 10.24 9.71 2.66 0.8450 FALSE
Sci-fi 6.33 1.52 6.43 1.38 0.5896 FALSE
Short-stories 8.99 2.74 7.90 2.02 0.0614 FALSE

Results show that there is a statistically significant difference in the mean readability between successful and unsuccessful books for the following genres: adventure; detective/mystery and love stories. The mean readability was significantly higher (ie, harder to read) for more successful books in those genres.

Most important variables

Definition Overall relative importance
JJ – Adjective 100.000
UH – Interjection 86.810
PRPS – Determiner, possessive 69.049
TO – Preposition 67.866
INN – Preposition / Conjunction 67.570
WP – Pronoun, question 64.431
MD – Verb, modal 60.935
RBS – Adverb, superlative 59.996
WDT – Determiner, question 59.635
PRP – Determiner, possessive second 55.813
CD – Adjective, cardinal number 54.306
NN – Noun 48.380
EX – Pronoun, existential there 42.474
SYM – Symbol 40.823
Mean syllables per word 36.230
JJS – Adjective, superlative 35.699
NNP – Noun, proper 35.674
CC – Conjunction, coordinating 33.137
VBP – Verb, base present form 32.791
VBG – Verb, gerund 29.862
VBN – Verb, past/passive participle 29.826
POS – Possessive 28.903
WRB – Adverb, question 18.980
Flesch-Kincaid readability 18.371
VB – Verb, infinitive 14.735
NNS – Noun, plural 13.609
FW – Foreign words 13.562
DET – Determiner 3.757
LS – Symbol, list item 1.202

 

This shows that the most important tag in determining success or failure is adjectives. However, it does not say whether adjectives point towards success or failure, only that they are an important tag.

Machine learning performance

Accuracy 95% CI Sensitivity Specificity
65.62% 57.7-72.9% 69% 63%

Overall accuracy is 65.62%. Sensitivity is the true positive rate and specificity is the true negative rate (ie the share of actual positives and actual negatives that the model classified correctly). Note that for all other tests I ignored punctuation tags, but I included them for machine learning as it improved performance. I left them out elsewhere, as knowing whether a right-hand bracket was important did not seem to tell me anything.
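As a rough illustration of how these three figures relate, here is a short Python sketch that derives them from a confusion matrix. The counts are hypothetical: they assume a test set of 160 books (20% of 800) split evenly into 80 successes and 80 failures, with successful books treated as the positive class. I chose them because they round to the figures in the table above; they are not taken from the R output.

```python
def classification_metrics(tp, fn, tn, fp):
    """Accuracy, sensitivity and specificity from confusion-matrix counts."""
    total = tp + fn + tn + fp
    return {
        "accuracy": (tp + tn) / total,    # share of all books classified correctly
        "sensitivity": tp / (tp + fn),    # true positive rate
        "specificity": tn / (tn + fp),    # true negative rate
    }


# Hypothetical counts for a 160-book test set (80 successes, 80 failures).
metrics = classification_metrics(tp=55, fn=25, tn=50, fp=30)
for name, value in metrics.items():
    print(f"{name}: {value:.2%}")
```

With these counts accuracy is 105/160 (65.62%), sensitivity is 55/80 (about 69%) and specificity is 50/80 (about 63%).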

Conclusion

The mean of 12 PoS tags was significantly different between more successful and less successful books. We also saw the PoS pattern was largely dependent on the genre of the book.

This means we can reject the null hypothesis and say that there is a difference in the distribution of the proportion of PoS tags in successful and unsuccessful books, and the pattern will depend on a book’s genre.

Not only that but the Flesch-Kincaid readability and mean syllables per word were significantly different between more and less successful books. This was most evident in adventure, detective/mystery and love stories, where the mean readability grade was significantly higher (ie harder to read) in more successful books.

This means we can say that there is a difference in the Flesch-Kincaid readability of successful and unsuccessful books, and the pattern will depend on a book’s genre.

Overall, the Flesch-Kincaid readability, mean words per sentence and PoS can be used to predict the status of the book with an accuracy reaching 65.6%. This is comparable to the original experiment, which gave an overall accuracy of 64.5%.

But what happens when we try it with a different PoS tool that analyses text in a different way? Next time I’ll use LIWC data.

Method

I used R to perform the analysis. When running:

  • statistical analysis was performed using R studio v 1.1.1.453.
  • the data set was split into a training (80%) and a test data set (20%). Analysis was performed on the training data set except when comparing readability across genres where the whole data was used due to the small sample size in each genre.

The average difference in various parts of speech (PoS, the linguistic tags assigned to words) was calculated between successful and unsuccessful books. I used what I think were the original methods used by the team to calculate these differences.
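The calculation behind the "Diff" columns can be sketched in a few lines of Python. This is only my reading of the method, not the R script itself: the book records, tag values and the function name `mean_difference` are all hypothetical, and the real analysis worked on the training set of the 800 books.

```python
from statistics import mean

# Hypothetical per-book tag proportions (share of all words carrying each tag).
books = [
    {"status": "success", "tags": {"INN": 0.120, "RB": 0.050}},
    {"status": "success", "tags": {"INN": 0.130, "RB": 0.052}},
    {"status": "failure", "tags": {"INN": 0.110, "RB": 0.056}},
    {"status": "failure", "tags": {"INN": 0.120, "RB": 0.054}},
]


def mean_difference(books, tag):
    """Mean proportion in successful books minus mean proportion in unsuccessful ones."""
    success = [b["tags"][tag] for b in books if b["status"] == "success"]
    failure = [b["tags"][tag] for b in books if b["status"] == "failure"]
    return mean(success) - mean(failure)


differences = {tag: mean_difference(books, tag) for tag in ("INN", "RB")}
print(differences)
```

A positive result (INN here) means the tag is more common in successful books; a negative one (RB here) means it is more common in unsuccessful books, matching how the difference tables are read.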

Detailed methodology

I laid out the broad outlines and normally this is put first in a research paper but it’s not the most engaging part. For those of you who are interested, this is the stats nitty gritty and is used in the other experiments.

Univariate statistical analysis

Variables were inspected for normality. Appropriate transformations such as log, Box-Cox, Yeo-Johnson transformations were performed so that variables can assume an approximate normal distribution. This was followed by a series of unpaired t-tests to assess whether the mean proportion of each PoS was significantly different between successful and unsuccessful books.

P-values were adjusted for false discovery rate to avoid the inflation of type I error (a ‘false positive’ error). Analysis was performed only using the training data set. Variables were scaled before performing the tests.
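To show what a false discovery rate adjustment does to a set of P-values, here is a minimal Python sketch of the Benjamini-Hochberg procedure. This is an assumption on my part: it is a common FDR method, but the exact procedure used in the R analysis is not stated here. The input P-values are made up for illustration.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted P-values in the original order."""
    m = len(p_values)
    # Indices of the P-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest P-value down, keeping adjusted values monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical raw P-values from four tests.
raw = [0.01, 0.04, 0.03, 0.005]
adjusted = benjamini_hochberg(raw)
print(adjusted)
```

Each raw P-value is scaled by the number of tests divided by its rank, so the more tags you test, the stronger the evidence each one needs to stay under 0.05.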

Machine learning algorithm

Support vector machine was used to predict the status of the book based on variables deemed important using initial univariate analysis. LibLinear SVM with L2 tuned over training data was used.

The model was tuned using 5-fold cross validation. The final predictive power of the model was assessed using the 20% test data. Performance was assessed using accuracy, sensitivity, specificity.
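The splitting scheme can be sketched in Python as follows. This is a simplified illustration, not the R code: the function names, the seed and the use of plain integer book IDs are my own, and the split here is a simple random one, whereas the real analysis may have balanced successes and failures within each split.

```python
import random


def train_test_split(items, test_fraction=0.2, seed=42):
    """Shuffle a copy of the items and hold back a fraction as the test set."""
    shuffled = items[:]
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


def k_folds(items, k=5):
    """Yield (train, validation) pairs; each item is used for validation exactly once."""
    for i in range(k):
        validation = items[i::k]
        train = [x for j, x in enumerate(items) if j % k != i]
        yield train, validation


book_ids = list(range(800))          # stand-ins for the 800 books
train, test = train_test_split(book_ids)
folds = list(k_folds(train))
print(len(train), len(test), [len(v) for _, v in folds])
```

With 800 books this gives 640 for training and 160 for the final test, and each of the 5 folds tunes on 512 books while validating on the remaining 128. The test set is never touched during tuning.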

Variables with lots of zeroes

Ten variables had a lot of zeros and were heavily skewed. Thus, they were not transformed since none of the transformation algorithms fixed such a distribution. The remaining PoS did not contain such a large number of zeros and were transformed prior to performing the unpaired t-test. The package bestNormalize was used to find the most appropriate transformation.

Three PoS were removed from the analysis (nnps, pdt and rp) since none of the novels included any of these PoS.

You can see the remaining variables and their transformation if you are keen.