When Will the Internet Defeat Link Rot?
Two decades after the problem was first assessed, broken links still plague the web.
Homestar Runner's 404 page remains the best on the Internet.
A few days ago, I renewed the hosting for a single-serving joke site I haven't updated in more than a year, partially because the domain reggaehorn.com is surely worth millions, and largely because I didn't want to let a little black hole (unfathomably tiny as it may be) open up in the web.
Obviously, the loss of a completely inconsequential site isn't exactly going to ruin the internet. But as occasionally happens, it got me thinking about how such black holes, most of them more important, are already everywhere, as sites blink offline and pages get lost to the long march of site updates and lapsed hosting fees. For everyone who values the internet as a repository of information (that's all of us), link rot is a corrosive force that's left much of the web perched atop a fragmented foundation of lost sources and dead links. So what can we do about it?
The link rot problem topped the news cycle last fall, when a Harvard Law study found that the US Supreme Court has a serious problem. According to the study, "50% of the URLs found within United States Supreme Court opinions do not link to the originally cited information." A few months earlier, a Yale study found that 29 percent of websites cited in Supreme Court decisions are no longer online.
Despite the recent attention, it's hardly a new issue. The earliest mention I could find in a Google Scholar search was a 1997 IEEE Software article rather enticingly titled "Will Linking Battle Unravel the Web?"; poetically enough, a search for a non-paywalled version led me to a site whose PDF links are all offline. Another piece, a 1998 article by Jakob Nielsen called "Fighting Linkrot," implores readers to "never let any URL die" in the fight against link rot. (Nielsen also links to a wonderful old document from the World Wide Web Consortium titled "Cool URIs Don't Change," which begins with a four-line poem.)
Nielsen's piece shows just how much of a concern link rot was nearly two decades ago:
Linkrot definitely reduces the usability of the Web, being cited as one of the biggest problems in using the Web by 60% of the users in the October 1997 GVU survey. This percentage was up from "only" 50% in the April 1997 survey. Users get irritated when they attempt to go somewhere, only to get their reward snatched away at the last moment by a 404 or other incomprehensible error message.
Compare then to now, when we are far more reliant on the web as our main source of reference material, and link rot becomes less a navigation issue and more a threat to the veracity of the information we rely on. What happens if a website sourced in a Supreme Court opinion goes offline? Unlike law journals, websites aren't backed up on paper anywhere. And if dead links threaten the utility and impact of Supreme Court decisions, they also affect the value of less-scrutinized sources of information.
Let's take it to the opposite extreme: For every Geocities ASCII art page that's held on from the Web 1.0 era, how many other sites have been lost to the ether? Or, with the modern web's emphasis on promoting newer pages, what happens when users looking for reference material on a topic only find whatever blog post has been created most recently, whether or not that post is a summarization of a rehash of a story that references sources that no longer work? With each lost connection, the whole network gets a little less useful.
As it turns out, the data we've come to rely on, from blog posts to photos to reference material, is more fleeting than we'd expect in a world where a ten-minute Gmail outage causes meltdowns worldwide. Magnetic tape remains our best option for long-term storage, yet even tape doesn't last all that long, a fact that's inspired projects to preserve humanity's digital record forever.
Photo: Mike's Free GIFs, which is still hanging around
In the near term, the ubiquity of cheap cloud storage has made our data more resilient than ever. But what happens if we can't find it? In an essay last year, Felix Salmon pointed out that despite storage and hosting being unfathomably cheap now, links keep rotting, partially because the web has put less emphasis on permanence:
For one thing, the institution of the permalink is dying away as we move away from the open web; if you’re not even on the web (if, for instance, your content comes in the form of a show on Netflix), then the very concept makes no sense. What’s more, we’ve moved into a world of streams, where flow is more important than stock, and where the half-life of any given piece of content has never been shorter; that’s not a world which particularly values preserving that content for perpetuity. And of course it has never been easier to simply delete vast amounts of content at a stroke. (For instance: the Kanye West and Alec Baldwin twitter feeds.)
In effect, the modern web has shifted more of the onus of archival onto users. How often do you go back to read a year-old live blog or tweet? They're designed so that readers can quickly assimilate information to be referenced from memory, which diminishes their long-term necessity, if not their value. That doesn't even begin to address apps like Snapchat that make disposability an asset.
Sure, we treat communications now with far less gravitas than our forebears may have treated their letters, which is surely the product of the ease of communicating now. And it's not like future historians are going to have a problem figuring out what we were up to; the sheer glut of our data creation, along with massive archival projects, will see to that.
Yet the high rate of online entropy means that over time, older pieces of the web, factually useful or not, disappear in a morass of noise. And as we near the biggest internet land grab in history, the historical record of the internet (which, at this point, is pretty much the historical record of humans) will likely get more volatile, as new top-level domains inevitably get bought and built up, then collapse. Why does this keep happening? With tools like Google's caching system and the Internet Archive, you'd think the fight against link rot could be automated.
Writing in The Magazine, Chris Higgins explained that the Internet Archive, with 378 billion URLs cached in its Wayback Machine, is doing just that, as are a few others. The Internet Archive has worked with Wikipedia to cache pages as soon as they're live; those copies can then be used to automatically fix links if they break. Between that and tools like Memento, which offers a Chrome extension for fixing 404s, and similar add-ons for Firefox, there are options out there for cleaning up link rot.
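For the curious, here is roughly what an automated lookup of that kind can look like. This is a minimal sketch, not the code Wikipedia or any extension actually runs: it assumes the Wayback Machine's public availability endpoint and the JSON shape it returns, and the function names (build_query, closest_snapshot, find_archived_copy) are invented for illustration.

```python
import json
import urllib.parse
import urllib.request
from typing import Optional

# Public Wayback Machine availability endpoint (assumed stable for this sketch).
WAYBACK_API = "https://archive.org/wayback/available"


def build_query(url: str) -> str:
    """Return the availability request URL for a possibly dead link."""
    return WAYBACK_API + "?" + urllib.parse.urlencode({"url": url})


def closest_snapshot(payload: dict) -> Optional[str]:
    """Pull the archived URL out of a decoded response, if one exists."""
    closest = payload.get("archived_snapshots", {}).get("closest")
    if closest and closest.get("available"):
        return closest.get("url")
    return None


def find_archived_copy(url: str, timeout: float = 10.0) -> Optional[str]:
    """Ask the Wayback Machine for a snapshot of url (makes a network call)."""
    with urllib.request.urlopen(build_query(url), timeout=timeout) as resp:
        return closest_snapshot(json.load(resp))
```

Calling find_archived_copy on a broken link returns the address of the closest saved snapshot, or None if the page was never captured. Splitting the parsing from the network call keeps the logic easy to check offline.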
In an October blog post, Alexis Rossi, head of collections at the Internet Archive, touted the Wayback Machine's "save page now" feature, which lets you cache a page and create a stable link instantly. The Harvard researchers cited above used their study to promote the utility of Harvard Law's Perma system, which lets users create archived, permanent citations for their work.
Such systems aren't built into the fabric of the web, which means plenty of pages slip through the cracks. "What we'd really like to do is have the browsers themselves build something into the 404 page that [checks] automatically," Rossi told Higgins. "That takes a little bit of convincing."
I reached out to Mozilla, whose Firefox browser features a third-party Wayback Machine add-on, to see if building an automated link fixer into Firefox was on the table. As of now, it's not. "Mozilla has no plans to develop an automated dead link checking system, but there are numerous Firefox add-ons already available that provide similar functionality to Firefox users who are interested in it," a spokesman said in an email.
For now, the onus remains on users and webmasters (when was the last time you used that term?) to be more proactive about keeping links alive. Integrating an automated caching and link-fixing system into a browser is a tall order, and many big sites (including Motherboard) have combated the problem by instead having 404 pages redirect to the homepage, which at least doesn't leave users at a dead end.
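To make the trade-off concrete, here is a small sketch of how a site or browser might combine the two approaches: try an archived copy first, and only fall back to the homepage if none exists. Everything here is hypothetical, including the names resolve_dead_link and homepage_of; the lookup argument stands in for whatever archive query is available, and nothing suggests Motherboard or any browser works this way.

```python
from typing import Callable, Optional
from urllib.parse import urlsplit

# Hypothetical type: any function mapping a URL to an archived copy, or None.
ArchiveLookup = Callable[[str], Optional[str]]


def homepage_of(url: str) -> str:
    """Return the scheme and host of url with a root path."""
    parts = urlsplit(url)
    return f"{parts.scheme}://{parts.netloc}/"


def resolve_dead_link(url: str, status: int, lookup: ArchiveLookup) -> str:
    """Choose where to send a reader after a request for url returned status."""
    if status != 404:
        return url  # not a missing page, so leave the link alone
    archived = lookup(url)
    if archived is not None:
        return archived  # prefer a saved copy of the original page
    return homepage_of(url)  # last resort: avoid a dead end
```

The ordering is the point: the homepage redirect keeps readers from hitting a wall, but only the archived copy preserves what the link was actually citing.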
In the meantime, the Internet Archive is working with Wikipedia, WordPress, and other major sites to make sure they stay online. (I'm glad to know that even if the reggaehorn.com URL lapses, the site will live on in WordPress form.) While that certainly covers a vast portion of the web, there's a more existential question that must be asked: What happens if the Internet Archive disappears?
In a 2005 paper published in the Journal of Medical Internet Research looking at the efficacy of WebCite, a system designed to cache citations for online research journals, the authors address that very question. The answer, Gunther Eysenbach and Mathieu Trudel write, lies in making the system as open as possible, so anyone can help with maintenance or host their own backup, much like the Pirate Bay copies that float around. Beyond that, Eysenbach and Trudel suggest finding institutional backers who can take over the entire project if need be; eventually, they write, it will become so intrinsic to publishing that publishers will make sure it lives on. So far, it has.
As will the Internet Archive, at least for the foreseeable future. But as Eysenbach and Trudel point out, a permanent archival system requires support across the board, not just the admirable efforts of major internet non-profits. And until a standardized, automated internet-wide archival system gets developed—assuming that's even possible—lapsed hosting fees, content migration, and site redesigns will continue to knock links offline.