Redditors are no strangers to what may outwardly seem to be pointless collaborative projects. In fact, that's kind of their specialty. Earlier this year, the Place project saw thousands of users come together to draw on a giant digital canvas, but at around the same time the folks over at r/DataHoarder, a community of self-described "digital librarians," were planting the seeds for something far larger, in principle anyway.
The idea was to create a distributed archive of all of Instagram. This would require ripping every picture from every public (and many a private) account and storing them on spare hard drives and rented space in the cloud. The total size of this archive when it's finished is uncertain, but tens of millions of photos are uploaded to the platform every day, accounting for what is likely petabytes' worth of data. After eight months of work, the group has archived nearly 600 terabytes of Instagram posts. That's nothing to sneeze at, but it's a mere drop in the bucket of the total collection of all Instagram posts.
So why go to all this trouble to collect and store random people's photos? According to the archive's creators, the answer is basically 'because they're there.' But the project may also one day be of great value to historians, and may find practical use in the present as a way of preventing identity theft online—assuming Instagram doesn't manage to shut it down first.
The idea to create a distributed Instagram archive was originally posted to r/DataHoarder on January 5 by one of the subreddit's moderators, -Archivist. His real name is John (he wouldn't give his last name), he's in his late twenties, and as he told me over email, when he's not archiving Instagram, he's "archiving something else." Although John has worked on more formal archival efforts both IRL and online with Archive Team, most of his time as a digital librarian these days is dedicated to passion projects he posts to r/DataHoarder.
"My initial motivation for the Instagram archive was because nobody else was doing this," John told me over email. "I didn't start with any particular reasoning in mind or ideas as to what I'd go on to do with the collected data."
As John put it, he's "often seen as the guy with controversial archival ideas" (he's also one of the people behind the project to create a massive cam girl archive), but his idea to archive all of Instagram still took off immediately on the subreddit.
For most people, the idea of using programs to rip and store as many Instagram posts as possible might seem unbelievably mundane. But data hoarders aren't most people. This is a community where street cred is measured by the data storage capacity noted in your user flair, and even the lowliest Internet detritus is considered a bit of history worth preserving. So John had no problem finding a community of people willing to help him on this huge task—the big question was how to make it happen.
When John initially posted his idea to r/DataHoarder on January 5, he had already ripped the posts from some 3,400 accounts, representing 2.2 million files, or about 633 GB of information. This is nothing to sneeze at, but it was still just a drop in Instagram's ocean of selfies. To do this, John was using an open source program called RipMe to pull images and videos from public Instagram accounts, but actually finding these accounts was proving more difficult.
"You can go to anybody's profile and list their followers, but this list is loaded around 20 accounts at a time," John said. "So manual collection of usernames required me to scroll for hours. I initially overcame this by literally stuffing a bit of cardboard into my 'page down' key and walking away from my laptop."
One of the stipulations of the project was that it couldn't rely on Instagram's API to harvest account information, since that would be a blatant violation of the platform's terms of service. Eventually the community found what it believes to be a workaround: a few dozen lines of code that would allow them to collect the usernames of around 2 million accounts every 24 hours and put these names in a list that could be used by another program to scrape the actual images from the accounts.
The overwhelming majority of the Instagram posts in the archive were harvested from public accounts that could be accessed by anyone. But John and his fellow data hoarders were also able to scrape photos from some private accounts. First, John created an Instagram bot programmed to seek out and follow private accounts. The hope was that these accounts would follow the bot back, thus exposing the contents of their private accounts for collection in the archive. According to John, this tactic has had about a 70 percent success rate. However, Instagram only allows accounts to follow 7,500 people at a time, and John said he "got bored of this slow progress and abandoned the idea."
For a while, the entire project was being carried out by John alone. As he put it, once he figured out how to get millions of usernames, instead of a few thousand at a time, all he did was "hand the [scraping program] millions of URLs and then wait." The distributed aspect of the project only came once another member of the data hoarding community wrote some code that would allow anyone who wanted to participate in the project to check URLs against a master list to ensure the same accounts weren't being downloaded twice.
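The community's actual code isn't described in any detail, but the general idea of checking URLs against a master list is simple enough to sketch. The Python below is a hypothetical illustration, not the hoarders' program: the function names, the in-memory set standing in for the master list, and the example.com URLs are all assumptions made for the sake of the example.

```python
from __future__ import annotations


def normalize(url: str) -> str:
    """Reduce an account URL to one canonical form so trivial variants match."""
    return url.strip().lower().rstrip("/")


def claim_new(master: set[str], candidates: list[str]) -> list[str]:
    """Return the candidate URLs not yet on the master list.

    Each new URL is recorded on the master list as it is claimed, so a
    later call (or a repeat within the same batch) will skip it.
    The master list here is a plain in-memory set, which is an assumption;
    a shared project would need some common store instead.
    """
    fresh = []
    for url in candidates:
        key = normalize(url)
        if key and key not in master:
            master.add(key)
            fresh.append(key)
    return fresh
```

A volunteer would pass in a batch of account URLs and download only what comes back; anything already claimed by someone else is silently dropped from the batch.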
According to John, there are currently between 30 and 40 people involved with the Instagram archival project, and they've collectively scraped and stored around 580 TB of Instagram posts. John has collected and stored approximately 300 TB of these posts himself. He said getting involved in the project doesn't require any special hardware, just a lot of storage space.
"This can be done by anyone with very little knowledge," John said, adding that the biggest obstacle for the Instagram archive is finding a home for all this data and figuring out what to actually do with it. Although John said he has pushed some of the photos to the Internet Archive, the vast majority are stored locally on the hard drives of those helping in the archive process.
"We're still quite disorganized," John said. "I've heard of people with archives ranging from 50 GB to 50 TB asking me what to do with it all, to which I answer, 'Hold on to it, I'll get back to you…' So now I have 300 TB of other people's pictures, but what do I do with them?"
This question has riled up at least one member of the r/DataHoarder community, who was uncomfortable with the idea of a handful of people having access to a large chunk of the content on Instagram. The user even went so far as to report the project to Instagram, but according to John, the archivists aren't violating the company's terms of service, so he's not expecting a cease and desist letter any time soon.
Instagram, however, seems to disagree. A source familiar with the matter told Motherboard the distributed archive violates the social media platform's terms of service and that the company is taking steps to shut the project down.
Nevertheless, John and his fellow data hoarders are still considering different use cases for the archive, such as turning it into a searchable database to prevent catfishing, where people steal photos from others' social media accounts and use them to create fake online personas and lure people into relationships. John also said it's possible to imagine a future where Instagram doesn't exist, but the content that people posted there is still valuable to historians.
"I'm not entirely sure the archival project is important right now," John said. "Sure, when Instagram eventually goes away people of the future will be able to look back on collections like this and make cultural observations and do trend analysis. But for now, most people just stare at me with a baffled expression when I mention this kind of archive."