On January 1, 2023, a swath of books, films, and songs entered the public domain. The public domain is not a place—it refers to all the creative works not protected by an intellectual property law like copyright.
Creative works may not have intellectual property protections for a number of reasons. In most cases, the rights have expired or have been forfeited. Basically, no one holds the exclusive rights to these works, meaning that living artists today can sample and build off those works legally without asking anyone’s permission to do so.
That’s why the New York Public Library (NYPL) has been reviewing the U.S. Copyright Office’s official registration and renewal records to identify creative works whose copyrights were never renewed, and which have thus been overlooked as part of the public domain.
The books in question were published between 1923 and 1964, before changes to U.S. copyright law removed the requirement for rights holders to renew their copyrights. According to Greg Cram, associate general counsel and director of information policy at NYPL, an initial overview of books published in that period shows that around 65 to 75 percent of rights holders opted not to renew their copyrights.
“That’s sort of a staggering figure,” Cram told Motherboard. “That’s 25 to 35 percent of books that were renewed, while the rest were not. That’s interesting for me as we think about copyright policy going forward.”
Cram warns that since the project is still ongoing, the data may ultimately come out to be slightly more or slightly less, and that NYPL hasn’t even begun to dive into films, music, or other types of creative works. But these early findings could help lawmakers craft copyright policies from an evidence-based standpoint that wasn’t easily accessible in the past.
“Folks need to understand that this data is really important to the record of American creativity,” he added. “It is the history of American creativity. To some extent, it is a great record of American creativity, and I think that the data should be usable not just by us, by the libraries, but by everyone. I think it belongs to the people and is the people’s data.”
Making informed decisions about whether something is under copyright isn’t as straightforward as it sounds, mostly because the inquirer needs to know what questions to ask and where that data lives.
The U.S. Copyright Office and the Internet Archive collaborate to digitize these records. That digitization effort has been foundational to NYPL’s investigation, but the digital experience isn’t much different from the physical one: To navigate the records, you have to click on a picture of an antique card catalog and then sift through volumes of digitized cards without the help of Optical Character Recognition (OCR) software, which converts images of printed text into machine-readable text.
Cram says that using these tools today still requires specialized knowledge, like which drawer to open and which category to look for. Those searches can take a lot of time and produce a lot of false positives for researchers. Plus, what Cram is looking for within the records is exactly what’s missing: a renewal, or a registration to begin with. In effect, he is trying to find an absence of information.
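Once the records are machine-readable, searching for an absence becomes a comparison between two lists. The sketch below is only an illustration of that idea, not NYPL’s or DCL’s actual method: the `Registration` fields, the registration numbers, and the titles are all invented, and it assumes renewals can be matched to registrations by an exact identifier, which real historical records may not allow. A missing match in the data is also not, by itself, proof that a work is in the public domain.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Registration:
    # Hypothetical fields; real registration records carry more detail.
    reg_number: str
    title: str
    year: int


def unrenewed(registrations, renewed_reg_numbers):
    """Return registrations that have no matching renewal entry."""
    renewed = set(renewed_reg_numbers)
    return [r for r in registrations if r.reg_number not in renewed]


# Invented sample data, for illustration only.
registrations = [
    Registration("A10001", "Example Novel", 1931),
    Registration("A10002", "Another Example", 1931),
    Registration("A10003", "Third Example", 1932),
]
renewals = ["A10002"]

missing = unrenewed(registrations, renewals)
print([r.reg_number for r in missing])
```

Here two of the three sample registrations have no renewal on file, which is the kind of result a researcher would then need to verify against the original records.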
NYPL partnered with the technology firm Data Conversion Laboratory (DCL) to manage all the data for the project. Marianne Calilhanna, vice president of marketing with DCL, says the archivists started by adding OCR to all the digital copyright registration files, then used algorithms to automatically structure and sort the data.
“We started the pilot with, I think it was just around 10,000 records, and then we started to realize, okay, we can start making some rules here,” Calilhanna told Motherboard. “So we’re able to start making these conversion rules that then we can kind of put into our automation processes to start to structure this.”
DCL also had to train the algorithm to account for the three columns of a copyright record. A human eye handles that layout easily, but a computer cannot without explicit instruction.
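To see why columns trip up a computer, consider that OCR typically returns words with page coordinates, and reading them strictly top to bottom would interleave the columns. The following is a minimal sketch of one way to restore reading order, assuming three equal-width columns and hypothetical `(text, x, y)` word tuples. The article does not describe how DCL’s algorithm actually works, so this should not be read as their approach.

```python
def reading_order(words, page_width, n_columns=3):
    """Order OCR words column by column, then top to bottom.

    `words` is a list of (text, x, y) tuples; this shape is an assumption
    made for the example, not a real OCR library's output format.
    """
    col_width = page_width / n_columns

    def sort_key(word):
        _text, x, y = word
        # Clamp so a word at the far right edge stays in the last column.
        column = min(int(x // col_width), n_columns - 1)
        return (column, y, x)

    return [w[0] for w in sorted(words, key=sort_key)]


# Invented coordinates: two rows spread across three columns.
words = [
    ("Beta", 320, 10), ("Alpha", 20, 10), ("Gamma", 620, 10),
    ("alpha2", 20, 30), ("beta2", 320, 30),
]
print(reading_order(words, page_width=900))
```

Scanned pages are rarely this tidy, so a production system would likely need to detect column boundaries from the page itself rather than assume equal widths.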
“Ultimately, the output we’re creating is XML,” she added. “XML is a series of tags that tell the computer, this is a title of a book, this is the title of a journal article. This is the author of that. And then we would also apply extra metadata on top of that record.”
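As a rough illustration of what Calilhanna describes, the sketch below builds one tagged record with Python’s standard `xml.etree.ElementTree` module. The element names (`record`, `title`, `author`, `metadata`, `field`) and the sample values are assumptions made for this example; the article does not specify the schema DCL or NYPL uses.

```python
import xml.etree.ElementTree as ET


def record_to_xml(title, author, metadata):
    """Serialize one record as an XML string with hypothetical tag names."""
    record = ET.Element("record")
    ET.SubElement(record, "title").text = title
    ET.SubElement(record, "author").text = author

    # Extra metadata is layered on top of the core record.
    meta = ET.SubElement(record, "metadata")
    for name, value in metadata.items():
        ET.SubElement(meta, "field", name=name).text = str(value)

    return ET.tostring(record, encoding="unicode")


# Invented sample values, for illustration only.
xml_text = record_to_xml(
    "Example Novel", "Jane Doe", {"year": 1931, "source": "hypothetical"}
)
print(xml_text)
```

Because each piece of information sits inside a named tag, other software can pull out every title or every author without guessing where it appears on the page.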
DCL has other clients in the information sector, including the academic publishing company Elsevier, for which DCL has created deep learning and pattern detection tools to identify, process, and restructure bibliographic citations for specific repositories. Elsevier is notorious for not sharing its metadata with academic librarians, and that metadata is essentially what’s needed to make digital files discoverable and therefore accessible in the first place. But NYPL plans to make its XML open source for other libraries across the nation and the world to use.
“For us to advance the progress and knowledge, which is the goal of copyright, I think we need access to this data so that we can understand how to answer that question of how can I use this?” Cram noted. “Having the data helps get us closer to an answer for that question, which ultimately is the goal, to use works lawfully, in a way that advances knowledge.”
The U.S. Copyright Office said it remains committed to preserving all of its public records, and that its longer-term goal is to make them all available and searchable online.
“As part of our commitment to the preservation of and access to all public records, the Office has undertaken efforts to digitize print and microfilm records to make them available to a broader audience,” the U.S. Copyright Office said in a statement to Motherboard. “These historical public records include the copyright card catalog, record books, and the Catalogs of Copyright Entries (CCE). Eventually, the Office’s aim is to make these historical public records available in the new Copyright Public Records System once each collection’s digitization and metadata capture are completed.”
For many creators, the question of whether they can use something seems simple, but the answer can be really hard to figure out. Until recently, a lot of the data needed to answer it was locked away in these public records.
“That’s not great for a library,” Cram said. “It’s not great for the public and the public is hungry for it, because getting access and knowing how to use that data and knowing where to find that data is really important to answer that question of ‘Can I use it?’”