Tech by VICE

Motherboard Made a Tool That Archives Websites on Demand

mass_archive, a Python script, will push a webpage or URL to multiple archive services at once, hopefully making online journalism or research a bit more efficient.

by Joseph Cox
May 1 2018, 3:19pm

Image: A screenshot of mass_archive

Archiving services such as the Wayback Machine may be a staple of online journalism, but no single service works on every site. Archive.is might preserve a particular webpage that the Wayback Machine can’t, depending on what sort of restrictions the website developer has put in place. For example, someone stopped copies of MSNBC host Joy Reid’s blog, which hosted a stream of homophobic comments, from displaying in the Wayback Machine.

With that in mind, I made a tool that can push a single webpage or URL to multiple archiving sites at once, and fire back the newly minted digital copies in response. Hopefully it will help reporters and researchers more efficiently figure out which service will work best for that particular site.

Called mass_archive, the tool is written in Python and runs on the command line. Just download the script, open a terminal in the folder where the script is stored, and type “python mass_archive.py example.com”, with example.com being the page or site you want to target.
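
As a rough illustration of that command-line pattern (this is a sketch, not the actual mass_archive source), a Python script can read its target from a single positional argument. The function names `parse_args` and `normalize_url` are hypothetical, and prefixing a bare domain with `http://` is an assumption made here for illustration, not something the tool is known to do.

```python
import argparse


def parse_args(argv=None):
    """Read the page or site to archive from the command line."""
    parser = argparse.ArgumentParser(
        prog="mass_archive.py",
        description="Push one URL to several archiving services.",
    )
    # One positional argument: the page or site to target.
    parser.add_argument("url", help="page or site to archive, e.g. example.com")
    return parser.parse_args(argv)


def normalize_url(target):
    """Assumption: give a bare domain such as 'example.com' a scheme."""
    if "://" not in target:
        return "http://" + target
    return target


# Equivalent to running: python mass_archive.py example.com
args = parse_args(["example.com"])
print(normalize_url(args.url))
```

Passing an explicit list to `parse_args` keeps the example self-contained; in a real script, calling it with no arguments would read from the terminal instead.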

Got a tip? You can contact this reporter securely on Signal on +44 20 8133 5190, OTR chat on jfcox@jabber.ccc.de, or email joseph.cox@vice.com.

If the script isn’t successful at getting an archive up on the Wayback Machine, it’ll just move on to Archive.is, and then Perma.cc, another archiving service. For the last one you’ll need to create a free account, which gives you 10 archives a month. Perma.cc also records which account archived a particular page. If that sounds like a pain, just take the Perma.cc parts out of the script in a text editor. (The bit that handles requests to Archive.is uses a module from Past Pages, an open-source effort to archive the news; thanks to Past Pages for that.)
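
The try-one-then-move-on behaviour described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the code in mass_archive: `archive_everywhere` is a hypothetical helper, and the three `fake_*` functions are stand-ins that make no network requests and return placeholder links on the reserved `.example` domain.

```python
from typing import Callable, Dict, List, Optional, Tuple

# An archiver takes the target URL and returns a link to the saved copy.
Archiver = Callable[[str], str]


def archive_everywhere(
    url: str, archivers: List[Tuple[str, Archiver]]
) -> Dict[str, Optional[str]]:
    """Try every service in order; a failure never stops the later ones."""
    results: Dict[str, Optional[str]] = {}
    for name, archive in archivers:
        try:
            results[name] = archive(url)
        except Exception as exc:
            # Record the failure and move on to the next service.
            print(f"{name} failed: {exc}")
            results[name] = None
    return results


# Hypothetical stand-ins for the real services (no network access).
def fake_wayback(url: str) -> str:
    raise RuntimeError("site restrictions prevented the capture")


def fake_archive_is(url: str) -> str:
    return "https://archive-is.example/abc123"


def fake_perma_cc(url: str) -> str:
    return "https://perma-cc.example/XYZ-789"


services = [
    ("Wayback Machine", fake_wayback),
    ("Archive.is", fake_archive_is),
    ("Perma.cc", fake_perma_cc),
]
results = archive_everywhere("http://example.com", services)
for name, link in results.items():
    print(name, "->", link if link else "no archive")
```

Keeping the services in a plain list means dropping one, such as Perma.cc, is a one-line change.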

I’ve already tested the tool on sites that try to avoid being archived, such as those of malware companies, and mass_archive did the job. If you think mass_archive might be useful for you, feel free to use it.

*Each of these three services has its own rules about what websites can be archived and for what purposes; you must comply with their terms of service as well as the terms of any website that you're archiving.*

The GitHub page for the tool is here, with some installation instructions and other requirements.