It’s Time to Archive the Internet Archive

Publishers are suing the Internet Archive for its emergency library, putting the whole project in danger.
Headquarters of Internet Archive, located in Richmond District, San Francisco, California. Credit: Girl2k/Wikimedia Commons

Four of the world's largest publishers have sued the Internet Archive, claiming its open-access digital library infringes their copyrights on a massive scale. The move puts the internet’s most important archive in danger, and has at least prompted some data hoarders to start talking about archiving the Internet Archive, and what that would even look like.

Last week, Hachette Book Group, Inc., HarperCollins Publishers LLC, John Wiley & Sons, Inc., and Penguin Random House LLC filed a copyright infringement lawsuit against the Internet Archive and five ‘Doe’ defendants, claiming that the Internet Archive is a piracy site.


In March, the Internet Archive set up a new service, called the "National Emergency Library," for people cut off from library and educational access by COVID-19. Nearly 1.4 million books are available in full for anyone to download and read, without a waitlist, until the end of June or the end of the coronavirus pandemic crisis in the US, according to an announcement on its site.

While damages haven't been set, the publishers could claim up to $150,000 in statutory damages per infringement for each of the 1.4 million copyrighted works in the emergency library. They're also demanding a preliminary and permanent injunction barring the Internet Archive, and anyone involved with it, from reproducing and distributing more works, and an order that all current copyrighted copies on the site be destroyed, which would effectively shut down the entire library.

The existence of these works online drew the ire of the Copyright Alliance in March, which called the project "vile," as reported by TorrentFreak. Now, these publishers are taking the non-profit Internet Archive to court over it.

"IA’s actions grossly exceed legitimate library services, do violence to the Copyright Act, and constitute willful digital piracy on an industrial scale," the publishers state in the complaint. “IA creates nothing. IA plays no role in the hard work of researching, writing, or publishing the works or, for that matter, in creating or sustaining the overall publishing ecosystem and its distinct partnerships and markets."

The move puts one of the internet’s largest repositories of knowledge in peril. Over on the DataHoarder subreddit, threads have been started about what it would take to archive the archive, which holds dozens of petabytes of data and is constantly growing (there have been attempts to simply understand the sheer amount of data the archive holds). Academics have been saying for years that the Internet Archive must be made more resilient by creating backups of the backups and storing them in other locations. When Donald Trump was elected president, the Internet Archive announced it was making a backup in Canada. Egypt’s Bibliotheca Alexandrina once had a backup of the Internet Archive’s Wayback Machine, but it has not been updated in years.

There have also been exploratory attempts to do distributed backups of the Internet Archive, most notably by Archive Team, a group of archivists who back up notable or imperiled websites and databases. This project, called INTERNETARCHIVE.BAK, was described as “an experimental project to see the feasibility and issues with making a backup of the Internet Archive.” That project has been dormant since 2016.

In response to the lawsuit, Internet Archive founder Brewster Kahle wrote a short post acknowledging it. "As a library, the Internet Archive acquires books and lends them, as libraries have always done," Kahle wrote. "Publishers suing libraries for lending books, in this case protected digitized versions, and while schools and libraries are closed, is not in anyone’s interest."