To the internet, a webpage is just a soup of text, symbols, and whitespace. Actual content, the stuff we’re interested in as webpage consumers—such as this blog post—is a part of this soup just as much as HTML tags are. The distinction is made only when our soup is consumed by a piece of software designed to interpret and possibly render it as a webpage. Most likely, that software is a browser.
Maybe this is already obvious and intuitive, but it’s worth emphasizing that there’s not really any mystery or magic in the HTML document itself. If you were to open up the source file for this webpage, you would find these words—the aforementioned content—sharing an essentially flat landscape with a great big old mess of code.
While the processes that produce the final HTML soup have become ever more elaborate and complex, the soup itself is always there, and it always conforms to HTML specifications. It may look to be a mess, but it's a standardized, useful mess.
We're not always interested in the webpage end-result of a string of HTML. It may be the case that I'm less interested in reading this blog post than I am in analyzing it. I may want to parse it for the appearance of certain keywords, for example. For this one webpage, I could as a human user just use command-f, but across many webpages this is accomplished much more easily via automation. That is, I might write a script that scans the HTML strings representing webpages programmatically, collecting keyword statistics as it goes. This would even be necessary: considering the web as raw data requires a programmatic approach.
This is web scraping, generally. It’s a means of collecting data from the internet via the strings of HTML that determine the content and appearance of webpages.
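A bare-bones sketch of that keyword-counting idea might look like the following. The URL and keyword here are placeholders, and the requests module doing the fetching gets a proper introduction later in this post.

    import requests

    # Fetch a page's raw HTML and count how many times a keyword appears.
    # Placeholder URL and keyword; swap in real targets.
    html = requests.get('https://example.com/some-blog-post').text
    print(html.lower().count('python'))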
What might someone do with this kind of data? Some examples:
Find email addresses proximate to certain keywords for spamming purposes/lead generation.
Funnel content from a bunch of different websites into one. Imagine, for example, a single site that aggregates (illicitly, probably) raw content from a dozen other websites.
Harvest stats from government websites.
Scan listings from multiple job sites for search strings indicating gender bias.
Perform sentiment analysis on blog sites from a variety of platforms (Tumblr, WordPress, etc.).
Monitor price fluctuations among many different web retailers for a specific product.
There’s really no end to it.
The prerequisites for this Hack This are the same as for every other one that’s based on the Python programming language: Assuming you’ve already downloaded and installed Python, you should do two things. One: spend 10 minutes doing this “Hello, World” Python for non-programmers tutorial. Two: spend another five minutes doing this tutorial on using Python modules.
0.0) Scrape at your own risk
First off, there's a lot of sketchiness and perceived sketchiness around web scraping. The legality of scraping is, generally, unsettled, and some major cases have arisen in recent years involving startups whose entire business plans revolve around harvesting the websites of other businesses. For example, a few years ago Craigslist went after a website called 3Taps, which had been scraping and republishing housing listings from the classifieds giant. Eventually, 3Taps and Craigslist settled, with the former agreeing to pay $1,000,000 to the latter (which is to donate the sum to the Electronic Frontier Foundation over a 10-year period).
For our purposes, as web scraping tourists, we’re probably fine, but it’s important to keep in mind that what we’re doing can be considered unauthorized use.
0.1) Use the API, if it exists
More and more sites offer public APIs that allow us to plug directly into our desired data sources without scraping at all. You'll never need to scrape a website for weather data, for example, because weather services offer their content not just as collections of webpages but as web services that we can access in predefined ways. Likewise, you will never need to scrape Twitter or Flickr or, hell, the New York Times. All offer ready-to-go developer tools that have the handy feature of well-defined usage policies.
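The machinery is the same as what we'll use for scraping; only the response is different (typically structured JSON rather than HTML). A rough sketch, where the endpoint is a made-up placeholder rather than any real API:

    import requests

    # Hypothetical endpoint; a real API documents its own URLs, parameters,
    # and usage policies.
    data = requests.get('https://api.example.com/v1/weather?city=Portland').json()
    print(data)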
1.0) Beautiful Soup
Disclaimer: I’m learning Beautiful Soup with you. BS is a set of Python tools (a Python module, or package) for extracting data from HTML documents, but it’s hardly the only set. For one thing, it’s very possible to scrape a webpage without any specialized tools at all, and that’s how I learned to do it—downloading HTML and then parsing it using the pattern matching capabilities of regular expressions. Beautiful Soup just eats all of these details and hides them from view, which allows us to focus not on the guts of HTML parsing but on the data itself.
Beautiful Soup will allow us to specify at a relatively high level what it is exactly that we’re after in a given HTML document (or on a given webpage) and it will go after that content with some relatively efficient (compared to my wholesale downloading and pattern matching above) parsing methods.
As usual, we’ll start with a new/blank Python file. Use whatever text editor you like. I’m using Sublime Text, which costs money. Atom is a comparable freeware editor. You’ll need to install Beautiful Soup, of course. Using pip, it’s just:
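    pip install bs4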
Bs4 is the package name of the current Beautiful Soup release.
2.0) Find a target
My idea is to scrape the website of Clark County, Washington for data on current jail inmates. The county publishes a roster that includes a small amount of information: inmate name, booking date, location, and charges. I think it might be interesting to look at charges vs. total period of incarceration between the given booking date and today’s date. We might not get to that point here, but this will provide a context for our example. I also don’t feel too bad about directing a bunch of traffic-noise to the website of a jail (which is in Vancouver, Washington, near Portland, Oregon).
(The seed for this idea comes via a scraping tutorial offered at bootcamps given by the Investigative Reporters and Editors at the University of Missouri circa 2013 and 2014.)
2.1) Know your target
Even if you’re not an HTML ace, it’s worth taking a look at the raw code behind your target website, just to get a feel for things. So, just do a right-click on the page and hit “view page source.” Scanning through the resulting code will reveal that inmate records are all bunched together in a big knot about halfway through the page.
We can see that each record is formatted like so (I redacted the inmate name myself):
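The markup below is an illustration rather than the county's verbatim HTML (the values are invented), but the shape is what matters: a tr with the booking date in the third cell and the charges in the sixth, positions we'll lean on later.

    <!-- illustrative reconstruction; values are invented -->
    <tr>
      <td>[REDACTED], [REDACTED]</td>
      <td>2018-0000000</td>
      <td>9/12/2018</td>
      <td>Clark County Jail</td>
      <td>Main Floor</td>
      <td>THEFT IN THE SECOND DEGREE</td>
    </tr>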
The tag tr denotes a row in a table, while td denotes a table cell. The different units of inmate data are jammed into these cells. The booking date and the charge don’t seem to have any special identifiers. We’ll see if that matters in a minute.
3.0) Parse the HTML document
This is really pretty simple. In our script (so far), we're just importing the actual Beautiful Soup module and then we're using it to open and parse a webpage, which is accomplished as below. The resulting soup variable is now our window into the parsed webpage and the various operations that we can perform on it. "html5lib" here tells Beautiful Soup to use the specific parser called html5lib. There are a few different HTML parsers Beautiful Soup can use, and they all handle HTML a little bit differently. This one is built to parse HTML5 code in the same way that modern browsers do, and it also happens to be the one that worked best for the jail-scraping problem. (You will probably need to install it using pip, e.g. "pip install html5lib.")
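Here's that opening chunk of the script (it reappears at the top of the full script at the end of this post):

    from bs4 import BeautifulSoup
    import requests

    # Request the jail roster page, then hand its raw HTML to Beautiful Soup.
    url = 'https://www.clark.wa.gov/sheriff/jail-roster'
    response = requests.get(url)
    html = response.content

    soup = BeautifulSoup(html, "html5lib")

    # Dump the parsed document back out as nicely indented HTML.
    print(soup.prettify())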
There are a few things happening here. For one thing, note that we're using a module called requests in addition to Beautiful Soup. This is what queries the target webpage, which then responds by barfing its HTML back to us. This HTML barf is then stashed in a variable called response. The whole response contains some stuff in addition to the actual HTML that we don't really care about, so we access a property called content to get the actual payload. We then pass that barf on to Beautiful Soup, which parses it all and makes it available via our new soup variable.
Finally, we’re just printing the HTML to the screen via Beautiful Soup’s own prettify() function.
4.0) Trim it down
The parsing is already done, so everything that happens from here on out is going to be done not to the HTML itself, but to a data structure maintained by Beautiful Soup (which we access through our soup variable). Let's start by grabbing the entire table of inmates. If we look at the prettified HTML we just printed, we can see that this table starts like this:
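Judging from the find() call we'll use in a moment, the part of that opening tag that matters for us is its id (the tag may well carry other attributes too):

    <table id="jmsInfo">

We can then pluck the whole table out of the soup by that id:

    table = soup.find('table', attrs={'id': 'jmsInfo'})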
With Beautiful Soup, we can access the children of a tag by using the "contents" property. This will contain, as a list, all of the tags that exist at the next level below the specified tag. So, if we wanted to get the third cell of a table row (corresponding to the booking date), we would reach for row.contents[2]. Here is the whole script, which prints the booking date from each row of our inmate roster table followed by the associated charges:

    from bs4 import BeautifulSoup
    import requests

    # Fetch the roster page and parse it.
    url = 'https://www.clark.wa.gov/sheriff/jail-roster'
    response = requests.get(url)
    html = response.content
    soup = BeautifulSoup(html, "html5lib")

    # Grab the roster table by its id, then walk its rows.
    table = soup.find('table', attrs={'id': 'jmsInfo'})
    for row in table.findAll('tr'):  # find_all is the equivalent modern BS4 spelling
        # contents[2] is the third cell (booking date); contents[5] is the charge.
        # Skip any row, like the header, where either cell lacks a plain string.
        if row.contents[5].string and row.contents[2].string:
            print("booking date: " + row.contents[2].string + " charge: " + row.contents[5].string)