
A journalist’s guide to web scraping


As public institutions publish more data online, web scraping has become a useful tool for reporters who want to sift through large amounts of information.

By Nael Shiab

Do you remember when Twitter lost $8 billion in just a few hours earlier this year? It was because of a web scraper, a tool used by companies and by many data reporters alike.

A web scraper is simply a computer program that reads the HTML code of webpages and analyzes it. With such a program, or “bot,” it’s possible to extract data and information from websites.
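To make that concrete, here is a minimal sketch of such a program in Python; the URL and the elements it looks for are placeholders for illustration, not taken from any site mentioned in this article:

    # A minimal web scraper: download a page's HTML and pull data out of it.
    # The URL and the "td" selector below are hypothetical placeholders.
    import requests
    from bs4 import BeautifulSoup

    url = "https://example.com/public-data"  # hypothetical page
    response = requests.get(url)
    response.raise_for_status()  # stop if the download failed

    # Parse the HTML, then extract every table cell as an example.
    soup = BeautifulSoup(response.text, "html.parser")
    for cell in soup.find_all("td"):
        print(cell.get_text(strip=True))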

Let’s go back in time. Last April, Twitter was supposed to announce its quarterly financial results after the stock markets closed. Because the results were somewhat disappointing, Twitter wanted to avoid a sudden loss of confidence among traders. Unfortunately, because of a mistake, the results were published online for 45 seconds while the markets were still open.

Those 45 seconds were enough for a web-scraping bot to find the results, format them and automatically publish them on Twitter itself. (Nowadays, even bots get scoops from time to time!)

Once the tweet was published, traders went wild. It was a disaster for Twitter. Selerity, the company behind the bot, specializes in real-time analysis and became the target of heavy criticism. It explained the situation a few minutes later.

For a bot, 45 seconds is an eternity. According to the company, it took only three seconds for its bot to publish the financial results.

 

Web scraping and journalism

As more and more public institutions publish data on websites, web scraping has become an increasingly useful tool for reporters who know how to code.

For example, for a story for Journal Métro, I used a web scraper to compare the prices of 12,000 products from the Société des alcools du Québec with the prices of 10,000 products from Ontario’s LCBO.

Another time, when I was in Sudbury, I decided to investigate food inspections in restaurants. All the results of these inspections are published on the Sudbury Health Unit’s website. However, it’s impossible to download all the results at once; you can only look up restaurants one by one.


I asked for the entire database where the results are stored. After an initial refusal, I filed a freedom-of-information request, to which the Health Unit responded by asking for a $2,000 fee.

Instead of paying, I decided to code my own bot, one that would extract all the results directly from the website. Here is how it worked:

Coded in Python, my bot takes control of Google Chrome through the Selenium library. It clicks on each result for the 1,600 facilities inspected by the Health Unit, extracts the data and saves the information to an Excel file.
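The general pattern looks something like the simplified sketch below. It is an illustration, not my actual script: the URL, the CSS selectors and the saved fields are all invented.

    # A simplified sketch: Selenium drives Chrome, clicks through each
    # inspection result, and the extracted fields go into an Excel file.
    # The URL, selectors and field names here are hypothetical.
    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from openpyxl import Workbook

    driver = webdriver.Chrome()
    driver.get("https://example-health-unit.ca/inspections")  # placeholder

    wb = Workbook()
    ws = wb.active
    ws.append(["Facility", "Inspection result"])  # header row

    # Re-find the links on every pass: navigating away and back
    # invalidates element references from the previous page load.
    n_results = len(driver.find_elements(By.CSS_SELECTOR, "a.result-link"))
    for i in range(n_results):
        driver.find_elements(By.CSS_SELECTOR, "a.result-link")[i].click()
        name = driver.find_element(By.CSS_SELECTOR, ".facility-name").text
        result = driver.find_element(By.CSS_SELECTOR, ".inspection-result").text
        ws.append([name, result])
        driver.back()  # return to the list for the next facility

    wb.save("inspections.xlsx")
    driver.quit()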

Doing all of that by hand would take you weeks. For my bot, it was one night of work.


But while my bot was tirelessly churning through thousands of lines of HTML, one thought kept bothering me: what are the ethical rules of web scraping?

Do we have the right to extract any information found on the web? Where is the line between scraping and hacking? And how can you ensure that the process is transparent for both the institutions targeted and the public reading the story?

As reporters, we have to respect the highest ethical standards. Otherwise, how can readers trust the facts we report to them?

Unfortunately, the code of conduct of the Fédération professionnelle des journalistes du Québec, adopted in 1996 and amended in 2010, is showing its age and offers no clear answers to these questions.

The ethics guidelines of the Canadian Association of Journalists, although more recent, don’t shed much light on the matter either.

As Université du Québec à Montréal journalism professor Jean-Hugues Roy puts it: “These are new territories. There are new tools that push us to rethink what ethics are, and the ethics have to evolve with them.”

So I decided to find the answers myself, by contacting several data reporters across the country.

Stay tuned; the results of that survey will be published in an upcoming instalment.

Note: If you’d like to try web scraping yourself, I published a short tutorial last February. You will learn how to extract data from the Parliament of Canada website!

Nael Shiab is an MA graduate of the University of King’s College digital journalism program. He has worked as a video reporter for Radio-Canada and is currently a data reporter for Transcontinental.

Illustration photo by Colin, via Wikimedia Commons.