Not all superheroes wear capes, and even fewer know how to preserve hundreds of terabytes of internet history. But for the revolving cast of digital librarians in Reddit’s data hoarding community, saving as much of our digital detritus from destruction as possible is just another day on the net.
People come to the data hoarding subreddit to learn about storage setups, how to scrape data, or to float a new archival project, and the proposals can often seem like a never-ending game of one-upmanship in terms of scope. In July, a Redditor called “traal” posted a short note to r/datahoarders suggesting a hoard of all YouTube metadata, such as the title, description, thumbnail image, and subtitles. Given that YouTube hosts billions of videos and adds around 300 hours of new video every minute, this was no amateur task. Yet within a few hours, he had a response from a user called “-Archivist” who was up to tackle the project.
“-Archivist,” who told me his name is John but declined to give a last name, is a moderator on the data hoarding subreddit and one of its most active members. He also runs The Eye, a DIY archival project “dedicated towards content archival and long-term preservation.” John told me he was interested in preserving YouTube metadata because he often gets asked to save “at risk” YouTube channels whose videos are in danger of being pulled from the site.
For example, when Alex Jones and his InfoWars channel were at risk of being booted off of YouTube a few months ago, John said he downloaded the channel’s 33,000 videos and started mirroring them to the Internet Archive. Although the takedown didn’t happen at the time, Jones did end up getting booted off of YouTube earlier this month.
“This was one example in which as a community we managed to save this channel because we had pre-warning and set scripts in place to watch the channel,” John told me. “Once it went down we checked our data and realized our scripts had scraped the channel moments before it was closed so we had everything.”
It’s not just conspiracy theorists like Jones who are getting their channels pulled, however. John said YouTube’s increasingly strict content policies have also resulted in channels about drugs and guns being shut down. At the same time, creators making less-edgy content that still doesn’t fit with YouTube’s advertising model are leaving for other platforms. Although John said he has no problem storing the videos from smaller “at risk” YouTube channels every now and then, he said the number of requests to archive YouTube videos has been increasing.
“Archiving these becomes tricky when I’m commonly seeing channels with 40,000+ videos,” John told me in an email. “The problem becomes where to store these videos that are at risk but still hosted on YouTube. So the next best thing is to at least have a record of the contents to look back on as these channels die off.”
John is no stranger to large archival projects (the last time we spoke, he was knee-deep in an effort to archive all of Instagram), but storing all of YouTube’s metadata presents some unique challenges.
To begin the project, John scraped channelcrawler.com, a YouTube explorer, to get the unique IDs of over 450,000 English YouTube channels, but abandoned this method after the site kept crashing. (This drew the ire of the website’s owner, who took to Reddit to complain that John hadn’t contacted him before scraping the site, which he said “slowed it down for everybody.”) The next step was to scrape each of these YouTube channels for unique video IDs, a process John claimed took 18 hours and resulted in over 133 million video IDs.
“This is a good start, but barely scratching the surface of YouTube as a whole,” John said.
Then it came time to harvest the metadata from each of these videos. As a “quick test,” John ran a script on 100,000 videos, which spat out nearly 600,000 metadata files amounting to 12.2 gigabytes of data. “It was at this point I realized the project was going to result in millions of files very quickly,” John said. Just based on his initial scrape of 133 million videos, John was staring down the barrel of over half a billion metadata files, with no clear way to organize them.
John estimates that there may be as many as 10 billion videos on YouTube. (When I contacted YouTube the company said it doesn’t release official video counts and wouldn’t provide a ballpark figure.) Each of these videos has at least five metadata files that consist of subtitles, thumbnail images, a description, any annotations, and a JSON file containing things like the video run time, the uploader and so on. Some videos have dozens of metadata files, depending on things like how many subtitle languages are available. This means John will have to download and host at least 50 billion files, but probably far more.
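The arithmetic behind those figures can be sketched in a few lines of Python. This is only a back-of-envelope illustration: the constants are the numbers John quoted, the function names are made up for this example, and the size projection assumes the 100,000-video test batch is representative of every other video.

```python
# Figures quoted in the article; none of these are independently measured.
TEST_VIDEOS = 100_000
TEST_FILES = 600_000            # "nearly 600,000", so treat this as a ceiling
TEST_SIZE_GB = 12.2

SCRAPED_VIDEOS = 133_000_000
ESTIMATED_TOTAL_VIDEOS = 10_000_000_000   # John's upper estimate for YouTube
MIN_FILES_PER_VIDEO = 5


def files_per_video(files, videos):
    """Average number of metadata files produced per video."""
    return files / videos


def project_files(videos, per_video):
    """Projected file count for a given number of videos."""
    return round(videos * per_video)


def project_size_tb(videos):
    """Projected size in terabytes, scaling the test batch linearly."""
    size_gb = videos * (TEST_SIZE_GB / TEST_VIDEOS)
    return size_gb / 1000


if __name__ == "__main__":
    ratio = files_per_video(TEST_FILES, TEST_VIDEOS)
    print("files per video in the test batch:", ratio)
    print("files for the scraped IDs:", project_files(SCRAPED_VIDEOS, ratio))
    print("minimum files for all of YouTube:",
          project_files(ESTIMATED_TOTAL_VIDEOS, MIN_FILES_PER_VIDEO))
    print("terabytes for the scraped IDs:", project_size_tb(SCRAPED_VIDEOS))
```

At about six files per video, the 133 million scraped IDs alone work out to roughly 800 million files, which is where the "over half a billion" figure comes from.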
The problem, according to John, is not the amount of space these metadata files take up (a few hundred terabytes of storage isn’t out of the ordinary in data hoarding circles), but figuring out how to organize them in any manageable way.
“Working with this many files is insane and I can't recommend it to anyone,” John told me. “You could have 10GB made up of one file and stick it on a thumbdrive no issue, great stuff. When you have 10GB made up of 10 million files, that’s when you run into issues.”
According to John, most operating systems and tools don’t let users open directories that contain more than 50,000 files, to say nothing of tens of millions of files. This means that all sorting has to be done on a command line in a terminal, which requires extensive knowledge of the database itself, as well as being incredibly comfortable with file structures and committed to a file naming protocol.
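One common way to keep any single directory from ballooning is to spread files across nested subdirectories keyed on a hash of each video ID. The sketch below is an assumption for illustration only, not a description of John's actual layout: the `shard_path` helper, the `/archive` root and the sample ID are all hypothetical.

```python
import hashlib
from pathlib import PurePosixPath


def shard_path(root, video_id, suffix):
    """Return a two-level sharded path for one metadata file.

    The SHA-1 hex digest of the video ID picks the subdirectories, so files
    spread evenly across 256 * 256 = 65,536 buckets regardless of how the
    IDs themselves are distributed. (Hypothetical scheme, not John's.)
    """
    digest = hashlib.sha1(video_id.encode("utf-8")).hexdigest()
    return PurePosixPath(root) / digest[:2] / digest[2:4] / f"{video_id}.{suffix}"


if __name__ == "__main__":
    # "abc123DEF_-" is a made-up placeholder, not a real video ID.
    print(shard_path("/archive", "abc123DEF_-", "info.json"))
```

With 65,536 buckets, the roughly 800 million files from the first scrape would average about 12,000 per directory, under the 50,000 figure John mentioned. A collection in the tens of billions would need a third level of nesting.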
When he started the YouTube metadata project last month, John said he was using a single 8 terabyte disk that could “only” store 4.2 billion files. He told me he has since switched to a ZFS array, a type of file management system that he said has “no issue” storing the estimated 31.5 billion individual metadata files he has already collected. According to John, there’s no easy way of counting the files and displaying them in a human readable way, so he works with estimates based on downloading speeds.
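For a sense of why counting is awkward, here is a minimal sketch of a file counter that walks a directory tree one entry at a time instead of loading whole listings into memory. It is an illustration under stated assumptions, not John's tooling: `count_files` is a hypothetical helper, and on billions of files even a streaming walk like this would take a very long time, which is consistent with his decision to estimate instead.

```python
import os


def count_files(root):
    """Count regular files under root without building any full listing.

    os.scandir yields directory entries lazily, so memory use stays flat
    even when a single directory holds millions of files.
    """
    total = 0
    stack = [root]
    while stack:
        current = stack.pop()
        with os.scandir(current) as entries:
            for entry in entries:
                if entry.is_dir(follow_symlinks=False):
                    stack.append(entry.path)   # visit subdirectory later
                elif entry.is_file(follow_symlinks=False):
                    total += 1
    return total


if __name__ == "__main__":
    print(count_files("."))
```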
There’s also the issue of how to scrape all the channel IDs on YouTube in an efficient manner. YouTube has over a billion users, but how many of these users run a channel is uncertain. A conservative estimate would be tens of millions of channels, but it could be well over 100 million. A tool developed by John and his collaborators to scrape YouTube for unique channel IDs uses the site’s application programming interface (API) to pull between 35,000 and 50,000 channel IDs a day. At that rate, it would take almost a year just to scrape 10 million channels, to say nothing of extracting the metadata from them.
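The timeline follows directly from the scrape rate. As a rough sketch, using only the rates quoted above and a made-up helper name:

```python
def days_to_scrape(channels, ids_per_day):
    """Days needed to collect a number of channel IDs at a steady daily rate."""
    return channels / ids_per_day


if __name__ == "__main__":
    for rate in (35_000, 50_000):
        days = days_to_scrape(10_000_000, rate)
        print(f"10 million channels at {rate:,} IDs/day: about {days:.0f} days")
```

That puts 10 million channels at somewhere between 200 and 286 days, depending on which end of the range the tool hits. At the same pace, 100 million channels would take more than five years.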
For now, John is still in the process of making the data collection aspect of the YouTube archive as efficient as possible. Going forward, he said, the main issues are going to be figuring out who is going to store the data in the long term and how to make the data available in a searchable database. The lack of a clear use case has never stopped the data hoarding community from pursuing massive projects in the past. Like Edmund Hillary looking to Everest, data is hoarded simply because it’s there.