Access to archived pages on the Internet Archive uncertain for researchers
By Jasmine Bala for the Local News Conference
The disappearance of archived pages from the Internet Archive poses a threat to research and the preservation of news as the first draft of history, researchers heard recently during a Ryerson University conference on the state of local news.
The Internet Archive is a non-profit digital library with collections of books, movies, music and archived web pages from across the world. Its most popular feature is the Wayback Machine, which allows researchers to save webpages and search its database of archived pages. Some pages that were previously accessible, however, have disappeared.
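The Wayback Machine's captures can also be queried programmatically through the Internet Archive's public availability API, which returns the archived snapshot closest to a requested date. A minimal sketch in Python follows; the example URL and date are illustrative, and a page that has been excluded at a site owner's request will simply come back with no accessible capture.

    import requests

    def closest_snapshot(url, timestamp=None):
        # Ask the Wayback Machine availability API for the capture of `url`
        # closest to `timestamp` (YYYYMMDD); omit timestamp for the most recent.
        params = {"url": url}
        if timestamp:
            params["timestamp"] = timestamp
        resp = requests.get("https://archive.org/wayback/available",
                            params=params, timeout=30)
        resp.raise_for_status()
        # `closest` is None when no capture is publicly accessible --
        # never crawled, or no longer displayed.
        return resp.json().get("archived_snapshots", {}).get("closest")

    # Illustrative lookup, not one of the sites named in this story:
    snap = closest_snapshot("example.com", "20140101")
    if snap and snap.get("available"):
        print(snap["url"], snap["timestamp"])
    else:
        print("No accessible capture for that page and date.")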
“If you are a site owner and you ask for something not to be displayed in the Wayback Machine that has been captured via automatic global scale crawling, then it will not be accessible,” Internet Archive’s web archiving director Jefferson Bailey said during a June 3 panel presentation hosted by the Ryerson School of Journalism.
Under copyright laws, he said, the Internet Archive team is legally required to make these pages unavailable to users upon the request of the page owner.
The two-day conference brought together about 100 journalists, educators and scholars. Bailey was featured on a panel organized by the News Measures Research Project, a major research initiative led by Duke University’s Philip Napoli. The project’s goal is to identify the factors that play a role in determining the health of local journalism in different communities.
Carrie Buchanan, a journalism professor from John Carroll University in Ohio who attended the panel, said she has discovered that archived pages she was counting on as part of her research on three hyperlocal news sources in Cleveland, Ohio, have vanished from the Wayback Machine.
Buchanan noted, for example, that archived pages published by the Cleveland Heights Patch prior to 2014 have now disappeared.
In Canada, she added, the ability of site owners to remove archived pages presents a problem on a larger scale with a chain like Postmedia, which owns many of the country's daily newspapers.
“Somebody might attempt to take all of the old versions of Postmedia out of the Internet Archive…I think it’s a significant issue,” she said. “If this stuff is a public trust, maybe there needs to be some kind of movement to keep news in the public domain even though it’s privately owned.”
“It’s really not good when you think about how much of Canada’s history is in those old newspapers,” she said in an interview during the conference.
“Maybe all of that public material that was previously in the public domain could be removed, even from the Internet Archive. And that really scares me.”
Buchanan suggested that a transfer of ownership might be the reason for the disappearance of the Cleveland Heights Patch pages: AOL sold Patch to Hale Global at the beginning of 2014.
“If I were the new owners, I might not want people to see how good it used to be,” she said, noting she thinks the quality of journalism from Patch has declined. “I might not want people to know that there was very detailed local coverage by quite a few different reporters.”
Buchanan said she relies “extensively” on the Internet Archive and even though there are some challenges in using the Wayback Machine, “this is one of the great resources that we have for research into what’s happening to journalism.”
Research projects like the News Measures Research Project use the Internet Archive to curate archived pages and generate research datasets. Napoli, the project’s principal investigator, said that after creating and testing a methodology that looks at the number of news outlets in each community and the quantity and quality of the stories they produce, the team is now examining local news across 100 randomly selected communities in the United States.
Two members of Napoli’s team, Rutgers University’s Matthew Weber and Kathleen McCollough, joined him on the panel to present their latest research. They have identified all of the webpages for media outlets in those municipalities and, with the help of the Internet Archive, have created a week-long sample of content to analyze.
Bailey said that research like this is a “great example of research intent” pairing with “preservation intent.”
This project alone, Bailey said, involves gathering content from 663 local news sites. Altogether that amounts to more than two terabytes of data and 19 million documents. A “document” in this case, he added, means a URL.
While the rest of Napoli’s team has been analyzing the news content, Weber has been working on the network data, mapping hyperlink connections between sites.
“There are some really cool things you can do with the archive, in addition to just being able to utilize it to map connections that exist between websites,” he said. “It gives us this really unique snapshot of what content is and was on news media websites and related websites over time.”
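The panel did not go into Weber’s tooling, but the general technique can be sketched: pull the hyperlinks out of each archived page and build a directed graph of which sites point to which. A minimal illustration in Python, assuming the BeautifulSoup and networkx libraries and a hypothetical pages dictionary mapping each archived page’s URL to its saved HTML:

    import networkx as nx
    from bs4 import BeautifulSoup
    from urllib.parse import urlparse

    def hyperlink_graph(pages):
        # `pages` is a hypothetical dict mapping each archived page's URL
        # to its saved HTML source.
        graph = nx.DiGraph()
        for page_url, html in pages.items():
            source = urlparse(page_url).netloc
            soup = BeautifulSoup(html, "html.parser")
            for anchor in soup.find_all("a", href=True):
                target = urlparse(anchor["href"]).netloc
                if target and target != source:  # keep cross-site links only
                    graph.add_edge(source, target)
        return graph

Run over captures from different years, a graph like this offers the kind of snapshot Weber describes: which sites a community’s news outlets linked to, and how that structure changed over time.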
Napoli said he hopes to have the analysis of the 100 communities completed, and the characteristics of an unhealthy local news ecosystem identified, by the end of the year.
This story was originally published by the Local News Conference and is republished here with permission of the editors.