Brewster Kahle, the Librarian of 404 Billion Websites

Kahle is an engineer-turned-digital librarian who founded the Internet Archive in 1996.

May 13 2014, 3:25pm
Image: Flickr/Beatrice Murch

Brewster Kahle, founder of the Internet Archive, is a digital librarian who has been working toward universal access to knowledge since founding the site in 1996, and even before that.

An engineer who once studied artificial intelligence and co-founded web ranker Alexa, Kahle, 53, is armed with an obsession to collect everything. You’ll find part of the physical embodiment of his Alexandria-like collection in a former Christian Science church in San Francisco, where near-life-sized papier-mâché dolls of the Archive's friends and benefactors occupy the pews. There's also a repository in Richmond, California, which is filled with a million books. And to serve up the Wayback Machine, a historical backup of the web's pages that launched in 2001 and recently passed the 400 billion mark, there's a datacenter stored inside a shipping container that holds three petabytes (a petabyte is one thousand terabytes) and can process 500 requests per second.

You can’t get a library card to see the materials in person: the place to visit the Archive is online, where everything is available for free, ad infinitum. At least one hopes: While copies of the Archive's data are kept elsewhere—one mirror is at the Bibliotheca Alexandrina in Egypt—a fire at the archive in November damaged some expensive scanning equipment, for which the archive is taking donations. Kahle has so much data, he's even received some affection from the NSA.

Brewster with a printing press in 1992. Image: Flickr/Carl Malamud

This isn't easy work. Recently, Kahle successfully fought the FBI in a case where the agency requested information about a user. There are engineering challenges too, such as developing the archive's sophisticated collection of news television programming, which currently contains 564,000 shows dating back to 2009. Meanwhile, Kahle hasn't avoided ongoing debates in the digital humanities. He's questioned the wisdom of Google Books, lamented the lack of a decent loaning system for digital materials, and worried about the transition of knowledge from the non-profit sector to a private one.

I recently had the chance to speak with Kahle about the open source and non-profit web, the Internet Archive, and Open Library, which seeks to build a web page for every book ever published and loan those books out through the web.

MOTHERBOARD: For organizations that gather and store humanity's information, do you see the non-profit model as a successful one?
Brewster Kahle: I think we’re seeing a broad experimentation. Wikipedia is a public donation model; the Internet Archive is a mix of offering services and keeping the spending very low, and creative approaches to that form of sustainability. The Public Library of Science is paid when people submit their articles. I don’t know what the answer is in terms of all the funding models: There are different ideas being tried, but this non-profit structure fits well for the internet.

Deep-sea anglers, from a page in a book in Kahle's

Speaking of creative approaches, the Internet Archive’s Tumblr is rather experimental. What’s the goal of the site?
Those are all volunteers, Tumblrs in residence. What would you do with all of this material? They make something interesting, odd, fun, arty, whatever they want to do with it.

The Internet Archive now has a TV news archive, which is great because a lot of video is blocked by GEMA in Germany (GEMA is a German music rights organization that blocks many music videos on YouTube). When did it actually begin?
We started collecting in 2000. We had a version that was running in the past year, and completely re-did it and re-launched it. You can quote the videos in very specific ways, and there is a whole tool for narrowing in on what you want to quote. If you want to borrow the whole program, we put it on a DVD and lend it to the user, as a library model. 

Open Library is an open, editable online book catalogue. One could see it as an e-book library. What challenges remain for the e-book format? Do you find that writers are open to the idea of digital lending?
Most people just see it as the continuation of what a library is. It’s basically access to the long tail. That’s the big thing about the library—making access to the works of the 20th century, which are only basically available in print. 

What do you think about the digital libraries of today?
Most of them are locked up. That doesn’t work well on the internet. We have the ones that are subscription-oriented, or tied to a particular vendor like Amazon, or they’re only available if you’re in a prestigious university. We have an opportunity for everyone to learn, so let’s take advantage of that.

The Wayback Machine's earliest archived version of itself, from November 30, 2001

Meanwhile, though, news websites are still trying to set up paywalls. Do you think that model will crash?
I don’t know how—the news sites are struggling. We want to see publishing work and libraries work in the next generation. Libraries buy things and lend them, which makes sense in the digital age.

A lot of the stuff you’re not buying.
Like webpages, they weren’t for sale in the first place. But the books are donated or we buy them. We have over one million physical books in Richmond, California. They’re not very accessible; you can’t get in the stacks. It’s a good place to go in and get a tour, but it’s not good for getting particular books out. It’s meant to be the preservation facility.

The Internet Archive hit 404 billion pages, which is a bit of a joke. How did you celebrate?
People were betting what day we would hit 404 billion. Yes, we had a party!