Libraries and Archivists Are Scanning and Uploading Books That Are Secretly in the Public Domain

A coalition of archivists, activists, and libraries is working overtime to make it easier to identify the many books that are secretly in the public domain, digitize them, and make them freely available online to everyone. The people behind the effort are now hoping to upload these books to the Internet Archive, one of the largest digital archives on the internet.

As it currently stands, all books published in the U.S. before 1924 are in the public domain, meaning they’re publicly owned and can be freely used and copied. Books published in 1964 and after are still in copyright, and by law will be for 95 years from their publication date.

But a copyright loophole means that up to 75 percent of books published between 1923 and 1964 are secretly in the public domain, meaning they are free to read and copy. The problem is determining which books these are, due to archaic copyright registration systems and convoluted and shifting copyright law.

As such, a coalition of libraries, volunteers, and archivists has been working overtime to identify which titles are in the public domain, digitize them, then upload them to the internet. At the heart of the effort has been the New York Public Library, which recently documented why the entire process is important, but a bit of a pain.

Back in the 1970s, the Library of Congress operated a Catalog of Copyright Entries (CCE) indicating which books had renewed copyright. Digital copies of these notices can be found in the Internet Archive and over at Stanford University.

It has long been fairly easy to tell whether a book published between 1923 and 1964 had its copyright renewed, because the renewal records were already digitized. But proving that a book hadn’t had its copyright renewed has been more difficult, New York Public Library Senior Product Manager Sean Redmond said.

“Part of the difficulty is that you’re proving a negative, that its copyright wasn’t renewed, so you’re looking for the lack of a record,” Redmond told Motherboard. “There was no way to make lists of public domain candidates.”

So as part of a massive undertaking, the NYPL recently converted many of these records to XML format, making it significantly easier to automate the process of determining which books might be candidates for being added to the public domain, the first step in ultimately making sure they’re freely available online.
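
To make the idea concrete, here is a minimal sketch of how machine-readable records let you list candidates: keep every registration that has no matching renewal. The XML element names, registration numbers, and titles below are invented for illustration and are not the NYPL's actual schema or data.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample data: the element and attribute names are invented
# for illustration and are not the NYPL's actual XML schema.
REGISTRATIONS_XML = """
<registrations>
  <entry regnum="A12345" year="1931"><title>The Quiet Harbor</title></entry>
  <entry regnum="A67890" year="1940"><title>Letters From the Plains</title></entry>
  <entry regnum="A24680" year="1952"><title>A Field Guide to Clouds</title></entry>
</registrations>
"""

# Registration numbers that appear in the renewal records (also invented).
RENEWED_REGNUMS = {"A67890"}


def unrenewed_candidates(xml_text, renewed):
    """Return (regnum, title) pairs that have no matching renewal record."""
    root = ET.fromstring(xml_text.strip())
    candidates = []
    for entry in root.iter("entry"):
        regnum = entry.get("regnum")
        if regnum not in renewed:
            candidates.append((regnum, entry.findtext("title")))
    return candidates


candidates = unrenewed_candidates(REGISTRATIONS_XML, RENEWED_REGNUMS)
for regnum, title in candidates:
    print(regnum, title)
```

A list produced this way is only a set of candidates. As the people on the project stress, each title still has to be checked by hand before anyone can treat it as public domain.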

“It’s like a shoe store going from estimating shoe sales from returns and exchanges only, to having the actual sales receipts,” Redmond said. “The public domain exists, it’s just been hard to see and this project is about shining a light on it.”

Leonard Richardson, a software developer and science fiction author whose Python matching scripts are helping expedite the process, tells Motherboard that the hard work is only just beginning.

“It’s now easy to make a list of books whose registration wasn’t renewed, but that list just makes a big to-do list for someone else,” Richardson said. “The next bit is going to be slow. For any given book, we need to convince someone who has a scan of the book that they’re allowed to make it public.”

Richardson notes that much of that heavy lifting is being done by volunteers at organizations like Project Gutenberg, a nonprofit effort to digitize and archive cultural works. These volunteers are tasked with locating a copy of the book in question, scanning it, proofing it, then putting out HTML and plain-text editions.

Gutenberg has been engaged in this process for years, though it tends to work on one book at a time. Other organizations, like the Hathi Trust Research Center, didn’t bother, because there was no way of uploading public domain works at any real scale until folks like Richardson and Redmond began streamlining and automating the process accurately.

“We need to convince Hathi and the Internet Archive that this is worth their time—that if we give them a list of 10,000 books, they won’t find 1,000 errors during the verification process,” Richardson said.

For the volunteers working on this project, the biggest development in recent weeks has been the announcement that Jason Scott of the Internet Archive will also be lending a hand in getting these public domain works online. Scott recently put out a call for volunteers on Twitter. Libraries around the country are scanning these books and uploading them to the archive.

Richardson says he’s written a matching script to point out which books in the Internet Archive collection seem like they weren’t renewed, but adds that actually clearing them will also take significant manual work. But it’s work, he says, that will have a much broader and more lasting impact than just making millions of historical works available online for free.
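
Richardson's script itself is not shown here, but the general shape of such matching can be sketched. The example below is an assumption-laden illustration, not his method: it compares a scanned book's title against a list of unrenewed titles using Python's standard `difflib`, and the function names, the 0.9 threshold, and the sample titles are all hypothetical.

```python
from difflib import SequenceMatcher


def normalize(title):
    """Lowercase a title, drop punctuation, and collapse whitespace."""
    kept = "".join(ch for ch in title.lower() if ch.isalnum() or ch.isspace())
    return " ".join(kept.split())


def similarity(a, b):
    """Return a 0..1 similarity score between two normalized titles."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()


def best_match(scan_title, unrenewed_titles, threshold=0.9):
    """Return the closest unrenewed title, or None if nothing clears the threshold.

    The threshold is an arbitrary illustrative value, not a tuned one.
    """
    best = max(unrenewed_titles, key=lambda t: similarity(scan_title, t), default=None)
    if best is not None and similarity(scan_title, best) >= threshold:
        return best
    return None


# Invented titles, for illustration only.
unrenewed = ["The Quiet Harbor", "A Field Guide to Clouds"]
print(best_match("The quiet harbor.", unrenewed))
print(best_match("Moby Dick", unrenewed))
```

Title matching alone produces false positives, since different books can share a title, which is one reason a flagged book still needs the manual clearing Richardson describes.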

“The public domain is incredibly important to the preservation of culture and to the creation of new culture,” he said.

If you want to help with this effort, you can email pdbooks@textfiles.com.