The Internet Archive Can't Preserve the Web's History by Itself
The Joy Reid saga highlights the strengths and weaknesses of web archiving.
Image: J Countess/Getty Images
Michael L. Nelson works for the Web Science and Digital Libraries Research Group at Old Dominion University.
This weekend on her show AM Joy, Joy Reid stated that security experts had not been able to prove that her blog had been hacked or manipulated, and while she “genuinely does not believe [she] wrote those hateful things,” she did admit that she “can definitely understand, based on things I have tweeted and have written in the past why some people don’t believe me.” Events throughout the week included a denial by the Internet Archive that its copies had been hacked, and a growing chorus of experts publicly expressing doubt about her version of events (see articles in my research group's blog, The Atlantic, HuffPost, and The Daily Beast). This will likely be the end of the story for Joy Reid’s now defunct blog; her detractors on the right may continue to call for her removal, but her fans on the left are surely eager to put the episode behind them.
Even though the story of Joy Reid’s blog may be closing, a similar story will likely unfold again with different characters and minor variations. So what can we learn about both the capabilities and limitations of web archiving in anticipation of “next time”?
Don’t rely on screenshots
Social media is full of screenshots of web pages being used as evidence. Screenshots allow for annotation, highlighting, and circumventing character limits on Twitter, but the ease with which they are manipulated means they are unreliable and may fail to properly document their source. For pictures of kittens or your friends’ children, such provenance is probably not necessary, but if you seek to document a public figure’s malfeasance, the evidentiary threshold is higher.
In the Joy Reid saga, screenshots presented two problems. First, in the original tweets with screenshots of text passages, it was difficult to find the dates the posts were published, and direct links to the posts were not available at all. For example, for a post about NBA player Tim Hardaway called “Tim Hardaway is a homophobe and so are you,” which included lines like “most straight people cringe at the sight of two men kissing” and “I admit couldn’t go see [Brokeback Mountain] either, despite my sister’s ringing endorsement, because I didn’t want to watch two male characters having sex,” the direct link to just the article was:
But the content also appeared at the top level of the blog on 2007-02-15:
The original tweet about the Tim Hardaway post does not show the date, nor does it provide the direct link. In my own research, of the 50 or so screenshots shared on Twitter from her blog, I was initially able to infer post times for only about 12.
Second, despite denials of editing (other than annotations) of the screenshots, the Tim Hardaway tweet shows the image at the top of the blog post next to text that appears midway through the blog post; this could only happen via editing the image, and it understandably creates confusion about the discrepancy.
In summary, when sharing evidence on social media, augment screenshots with links to the live web and with links to copies of those pages in multiple web archives.
Do use multiple web archives
Reid’s lawyers sent letters to both Google and the Internet Archive in December requesting that they take down the archived blog and provide information regarding possible hacks or intrusions. The Internet Archive declined to take the blog down, and in February, someone in Reid’s orbit used the robots.txt exclusion protocol to effectively redact the copies in the Internet Archive (this is a standard, automated method for owners of live web sites to control which, if any, pages at a site should be served from the Wayback Machine).
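As a hypothetical sketch of how such a redaction works: placing a robots.txt file like the one below at the root of a site tells all crawlers to stay away, and, under the Wayback Machine's policy at the time, it also suppressed playback of pages already archived from that site. This file is my own illustration, not the actual file used on Reid's blog:

```text
# https://blog.reidreport.com/robots.txt (hypothetical example)
# "User-agent: *" applies the rule to every crawler, and
# "Disallow: /" excludes the entire site. The Wayback Machine
# then stops serving its previously captured copies of these pages.
User-agent: *
Disallow: /
```

Note that this is a voluntary protocol: it only affects archives that choose to honor robots.txt, which is exactly why other archives still served the pages.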
What Reid’s team did not appear to anticipate is that copies of her blog would appear in other web archives, one of which was the Library of Congress’s web archive, which does not honor robots.txt exclusion. In fact, three of the example blog posts her lawyers claim were fraudulent:
are contained in a 2006-01-11 archived version at the Library of Congress:
In this case, there are copies in two distinct (geographically and administratively) systems, but they are not independent observations. The important point is that while the robots.txt redacted the Internet Archive’s version of the page, it did not redact the version in the Library of Congress.
You can decide for yourself if the content contains, as Reid’s lawyers state, “jarring changes in style and substance” or “uncharacteristic HTML/graphics formatting, and font selection, such as quote offsets, paragraph separators,” but this page alone tilts the forensic evidence against Reid’s version of events. If her blog had been hacked, it would have to have been hacked in January 2006 for the web archives to have captured this page; a hack at a later date (say, 2007) would not alter the 2006-01-11 version in the web archives. You can read my detailed analysis, but the takeaway message is either:
- Reid did not see the posts in question or recognize them as fraudulent and did not remove them (as well as not changing her password), despite regularly interacting with her blog (sometimes posting 10+ times per day), or
- After the last post at 4:51pm EST on 2006-01-11 and before the archiving time of 5:17pm EST the same day, an adversary posted the content (including backdating the posts, which is possible in most blogs) and inserted links to “brokeback-committee-room.html” in other legitimate posts. Keep in mind that the Internet Archive did not have the “save page now” function until 2013, so there was no way an adversary could know in advance when the Internet Archive would crawl that page (and in 2006, unlike today, crawls were irregular and infrequent).
While possible, either scenario appears unlikely, especially when you consider the scenario would have to be repeated for every fraudulent or disputed post over a period of several years.
The importance of other web archives was briefly diminished because the Internet Archive had a URL canonicalization hole that allowed people to circumvent the robots.txt exclusion (in short, by swapping “http” with “https”, as in “https://blog.reidreport.com/”) and inspect the Internet Archive’s copies of the blog. However, the Internet Archive quickly closed this hole, and we cannot assume similar holes will be open in the future.
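To illustrate the kind of fix involved, here is a minimal Python sketch (my own illustration, not the Internet Archive's actual code) of scheme canonicalization: if an archive maps the http and https variants of a URL to a single canonical key, an exclusion applied to one variant also covers the other, closing the loophole described above.

```python
from urllib.parse import urlsplit, urlunsplit

def canonicalize(url: str) -> str:
    """Collapse http/https variants of a URL to one canonical form, so an
    exclusion rule applied to one scheme also covers the other."""
    parts = urlsplit(url)
    # Treat http and https as the same scheme for lookup purposes.
    scheme = "http" if parts.scheme in ("http", "https") else parts.scheme
    host = parts.netloc.lower()          # hostnames are case-insensitive
    path = parts.path or "/"             # empty path and "/" are equivalent
    return urlunsplit((scheme, host, path, parts.query, ""))

# Both scheme variants resolve to the same canonical key:
assert canonicalize("https://blog.reidreport.com/") == \
       canonicalize("http://Blog.Reidreport.com/")
```

Real archive canonicalization (e.g., SURT-based rules) is considerably more involved, but the principle is the same: exclusions should apply to the resource, not to one spelling of its URL.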
How can you use multiple web archives? The services archive.is and perma.cc are on-demand public web archives that allow submission of individual pages (similar to the “save page now” feature at the Internet Archive), webrecorder.io allows for the creation of personal web archives, and the Los Alamos National Laboratory Time Travel service allows for querying multiple web archives at once (for example, blog.reidreport.com is held in five different web archives other than the Internet Archive).
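As a concrete example, the Time Travel service exposes a JSON API that aggregates holdings across many public archives. The Python sketch below builds a query URL for it; the endpoint pattern follows the service's documented API, but verify the exact format against the current documentation before relying on it:

```python
def timetravel_query(url: str, dt: str) -> str:
    """Build a Memento Time Travel API query URL. `dt` is the desired
    datetime (YYYY[MM[DD[hhmmss]]]); the service responds with the
    closest captures ("mementos") it finds across multiple archives."""
    return f"http://timetravel.mementoweb.org/api/json/{dt}/{url}"

# Query for captures of the blog near 2006-01-11:
api_url = timetravel_query("http://blog.reidreport.com/", "20060111")
# Fetching api_url (e.g., with urllib.request.urlopen) returns JSON
# listing mementos from the Internet Archive, the Library of Congress,
# and other participating archives.
```

Because the aggregator queries archives run by independent organizations, a redaction at any single archive does not hide the page from this kind of lookup.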
Understand the limitations of web archives
Ultimately, Reid’s version of events was not supported by the archived pages themselves. When the Internet Archive’s redaction policy was invoked, her argument was further undermined by the existence of additional web archives. Even though Reid’s story has likely ended, it is only a matter of time before a similar story unfolds. For those who seek to hold public figures accountable, a more rigorous interaction with and presentation of archived pages will limit uncertainty. For those on the receiving end of such scrutiny, a more careful consideration of the scope (as well as the limitations) of not just the Internet Archive but all public web archives will better inform their response.