Web scraping is a way to download large amounts of information from government websites for later analysis in a database. It makes possible stories that would be out of reach for the casual hunt-and-click web surfer.

An interesting pair of stories appeared a few weeks back in The Toronto Star that showed how computer-assisted reporting (CAR) can be used to add tremendous value to data already available on the Web.

For several years now, federal government departments and agencies have maintained “proactive disclosure” websites to release quarterly information on contracts; travel and hospitality expenses for senior officials and ministers; grants and contributions; and reclassification of government jobs. In terms of media coverage, the disclosures have resulted in the occasional story about expensive foreign trips or ministers not disclosing all of their expenses, but as far as I can tell, the Star is the first to assemble the disclosures of many agencies to find a wider story.

The paper pulled travel data from the websites of 20 major federal departments and compiled it into a database of about 60,000 records covering the last two years of the Liberal government and the first two of the Conservative. The paper’s analysis of 12 of those departments found that 10 of them saw more lavish spending by Conservative ministers than by their Liberal predecessors. The Prime Minister’s office was one of two that were more frugal.

Tory Natural Resources Minister Gary Lunn spent $340,000 on trips to places such as London, Paris and Australia, making him the biggest spender of any minister, Liberal or Conservative, during the four-year period, the Star reported. A follow-up story showed the Conservative government signed consulting contracts worth more than $900 million in its first two years, compared to more than $500 million for the Liberals in their last two years in power.

The story was written by the Star’s investigative editor, Kevin Donovan, while the heavy data work was done by the paper’s computer-assisted reporting specialist, Andrew Bailey.

It is not all that surprising that it took this long for an outlet to really mine the proactive disclosure data. There are a couple of reasons for that, I think. One is that the data, once it is out there for everyone, quickly becomes part of the background noise. Ho hum. Routine. So after a few initial stories, reporters move on to other, more pressing matters. The second is that the data, in the way it is published, is difficult to use. In the case of the travel expenses, it can take several mouse clicks to look up a single expense. And the expenses are presented in a variety of onscreen report formats that make it hard to consolidate the data to look for trends. There is no way to look at the expense data in one neat table. It’s a kind of opaque openness that seems calculated to make it difficult to see the big picture.

The answer to this, of course, is to automate the collection of the data by using a computer program that can do all that mouse clicking for you, then save the results to put into a database. Aron Pilhofer, editor of Interactive Newsroom Technologies at the New York Times, wrote an excellent primer on web scraping as part of Computer Assisted Reporting, A Comprehensive Primer, the Canadian CAR text I wrote recently with David McKie from CBC Radio. Aron shows how to scrape using the Perl programming language, but it can be done using other languages such as Python, or with off-the-shelf software packages.
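To give a flavour of what such a program does, here is a minimal Python sketch of the "read the page, save the rows" step. It is not the Star's method or Aron's Perl code. The sample page, the column layout and the names TableRowParser, parse_expense_rows and save_rows are all invented for illustration, and a real disclosure site would almost certainly need a parser tailored to its own markup.

```python
import sqlite3
from html.parser import HTMLParser


class TableRowParser(HTMLParser):
    """Collect the text of every <td>/<th> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []      # finished rows, each a list of cell strings
        self._row = None    # row currently being built, or None
        self._cell = None   # text fragments of the current cell, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Only keep text that appears inside a table cell.
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            if self._row:
                self.rows.append(self._row)
            self._row = None


def parse_expense_rows(html):
    """Return all table rows found in one page of HTML."""
    parser = TableRowParser()
    parser.feed(html)
    parser.close()
    return parser.rows


def save_rows(conn, records):
    """Store (official, purpose, amount) records in a hypothetical table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS expenses "
        "(official TEXT, purpose TEXT, amount REAL)"
    )
    conn.executemany(
        "INSERT INTO expenses VALUES (?, ?, ?)",
        [
            (official, purpose, float(amount.replace("$", "").replace(",", "")))
            for official, purpose, amount in records
        ],
    )
    conn.commit()


# Made-up page standing in for one onscreen expense report.
SAMPLE_PAGE = """
<table>
  <tr><th>Official</th><th>Purpose</th><th>Amount</th></tr>
  <tr><td>Example Minister</td><td>Conference</td><td>$1,250.00</td></tr>
  <tr><td>Example Deputy</td><td>Site visit</td><td>$300.50</td></tr>
</table>
"""

rows = parse_expense_rows(SAMPLE_PAGE)
header, records = rows[0], rows[1:]

conn = sqlite3.connect(":memory:")  # in-memory database for the demo
save_rows(conn, records)
total = conn.execute("SELECT SUM(amount) FROM expenses").fetchone()[0]
```

Once every page has been run through something like this, the expenses sit in one table, and questions such as "who spent the most?" become a single database query rather than hundreds of mouse clicks.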

Web scraping comes with its own set of practical and legal questions, such as ensuring your scrape isn’t misinterpreted as a denial-of-service attack, and what to do when website terms of use policies explicitly prohibit mass downloads. The first problem can be mitigated by slowing down the scrape, so your program hits the site every second or two, rather than like a machine gun. The second is a real issue, but ought to be less so with government websites on which the data is already owned by the public. Besides, if governments choose to make data available in a searchable format, rather than providing a mass download option, they are pretty much inviting scrapes.
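Slowing down a scrape can be as simple as pausing between requests. The following Python sketch shows one way to do it, under stated assumptions: the function names fetch and polite_scrape, the two-second default delay and the User-Agent text are placeholders of my own, not a recommendation from any site or library.

```python
import time
import urllib.request


def fetch(url):
    """Download one page. The User-Agent string is a placeholder."""
    request = urllib.request.Request(
        url,
        headers={"User-Agent": "newsroom-research-script (contact: you@example.com)"},
    )
    with urllib.request.urlopen(request, timeout=30) as response:
        return response.read()


def polite_scrape(urls, fetch_page=fetch, delay_seconds=2.0, sleep=time.sleep):
    """Fetch each URL in turn, pausing between requests.

    fetch_page and sleep are parameters so the loop can be tried out
    without touching the network or actually waiting.
    """
    pages = []
    for index, url in enumerate(urls):
        if index > 0:
            sleep(delay_seconds)  # wait before every request except the first
        pages.append(fetch_page(url))
    return pages
```

Calling polite_scrape with a list of report URLs would then hit the site once every couple of seconds. Identifying yourself in the User-Agent header is a courtesy that gives the site's administrators someone to contact instead of simply blocking you.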

Clever people will devise clever solutions, and come up with great stories such as Donovan and Bailey’s.