The clockwork internet

A little (okay, a long) while back I described how I planned to analyse newspaper columnists to find out whether tone varies between newspapers.

Well, I have finally done it – or at least got the bulk of it out of the way: 12,000 UK newspaper editorials and columnist articles from the past decade, and counting. And it was a lot simpler than I thought, thanks to the way the internet works – and you can do it too.

[Image: clockwork parts – “Set it running” – via thirdangeluk]

Web scraping for journalists

What set me off was a talk by Paul Bradshaw from the Guardian, in which he described something called web scraping. He said it was a simple, automated way to gather information from the web – and he was right.

A little web searching and I came across a program called Web-Harvest. It is fairly straightforward, configured in XML (a simple, readable markup format used all over the web), and comes with some very good, relevant examples. There are alternatives out there, but I found this the simplest.

As soon as I started working out how to use it, I realised it was like releasing a clockwork toy – wind it up with the right terms to scrape, set it loose, and it will go out and gather them for you.
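Web-Harvest itself is driven by XML configuration files, but the clockwork idea is easy to sketch in plain Python using nothing but the standard library: point the scraper at a page, tell it which parts to keep, and let it run. Everything here is a hypothetical illustration, not the actual Web-Harvest setup – the URL and the choice of extracting every paragraph tag are my own assumptions.

```python
# A rough Python equivalent of the wind-it-up-and-go scraping idea.
# The URL and the decision to keep every <p> tag are illustrative
# assumptions, not the configuration used in the original project.
from html.parser import HTMLParser
from urllib.request import urlopen


class ColumnExtractor(HTMLParser):
    """Collects the text inside <p> tags of an article page."""

    def __init__(self):
        super().__init__()
        self.in_paragraph = False
        self.paragraphs = []

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self.in_paragraph = True
            self.paragraphs.append("")

    def handle_endtag(self, tag):
        if tag == "p":
            self.in_paragraph = False

    def handle_data(self, data):
        if self.in_paragraph:
            self.paragraphs[-1] += data


def scrape(url):
    """Fetch one page and return its article paragraphs as a list."""
    html = urlopen(url).read().decode("utf-8", errors="replace")
    parser = ColumnExtractor()
    parser.feed(html)
    return [p.strip() for p in parser.paragraphs if p.strip()]
```

Run `scrape()` in a loop over a list of article URLs and you have the automated gathering step: the program, once wound up, fetches and extracts without any further typing.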

Vorsprung durch Technik

A ‘great leap forward through technology’ indeed. When I started this project I was photographing the print editions and typing them up by hand. Then I moved to copying articles from official archives, but that too took time.

I also found that an excellent way to process the text was Excel 2010 – yes, the spreadsheet program. With a few macros and bits of Visual Basic (VBA, the programming language built into Excel) that I found after a bit of searching, I could quickly and easily process the text ready for sentiment analysis in LIWC.
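The clean-up work those macros did can be sketched in Python too. This is not the VBA from the original workflow – just a minimal stand-in showing the kind of preparation LIWC benefits from: normalising whitespace, stripping leftover HTML entities, and filtering out fragments too short to analyse. The entity pattern and the 50-word threshold are my own assumptions.

```python
# Minimal sketch of pre-LIWC text clean-up; the regexes and the
# min_words threshold are illustrative assumptions, not the
# original Excel/VBA logic.
import re


def clean_article(text):
    """Normalise whitespace and drop leftover HTML entities."""
    text = re.sub(r"&[a-z]+;", " ", text)  # e.g. &nbsp; &amp;
    text = re.sub(r"\s+", " ", text)       # collapse runs of whitespace
    return text.strip()


def word_count(text):
    """Count word-like tokens (letters and apostrophes)."""
    return len(re.findall(r"[A-Za-z']+", text))


def prepare_for_liwc(articles, min_words=50):
    """Clean each article and keep only those long enough to analyse."""
    cleaned = (clean_article(a) for a in articles)
    return [a for a in cleaned if word_count(a) >= min_words]
```

Whether you do this in a spreadsheet or a script, the point is the same: feed LIWC consistent plain text, one document per row, so the sentiment scores line up cleanly with your article metadata.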

Sample size for data – bigger is better

One of the reasons for using such a big sample is not greed (though there is always the temptation to grab ‘just a few more’). A larger sample makes the data more robust, and lets me compare more journalists – particularly those who have written for multiple papers.

Be aware that scraping is a legally grey area – not all sites allow it, and if you store articles you may need to check how copyright law applies in your jurisdiction.

If you want help and advice, I’m happy to give it. For me, now that I have run everything through LIWC and built a very big database of the output, the question is ‘so what?’ – what does it all mean?

Useful links

More on LIWC and its practical use for dating

Web-Harvest – the tool I use. It works on Windows, Mac and Linux, is easy to pick up and has some great examples

ScraperWiki – a free online web scraper

PeoplePerHour and Freelancer in case you’d rather pay someone else to do this

The Guardian data blog, which writes about this kind of thing in depth

The Internet Archive – a good source for scraping, as it has already archived newspapers and online content