To the internet, a webpage is just a soup of text, symbols, and whitespace. Actual content, the stuff we’re interested in as webpage consumers—such as this blog post—is a part of this soup just as much as HTML tags are. The distinction is made only when our soup is consumed by a piece of software designed to interpret and possibly render it as a webpage. Most likely, that software is a browser.
Maybe this is already obvious and intuitive, but it’s worth emphasizing that there’s not really any mystery or magic in the HTML document itself. If you were to open up the source file for this webpage, you would find these words—the aforementioned content—sharing an essentially flat landscape with a great big old mess of code.
While the processes that produce the final HTML soup have become ever more elaborate and complex, the soup itself is always there, and it always conforms to HTML specifications. It may look to be a mess, but it's a standardized, useful mess.
We're not always interested in the webpage end-result of a string of HTML. It may be the case that I'm less interested in reading this blog post than I am in analyzing it. I may want to parse it for the appearance of certain keywords, for example. For this one webpage, I could as a human user just use command-f, but across many webpages this is accomplished much more easily via automation. That is, I might write a script that scans the HTML strings representing webpages programmatically, collecting keyword statistics as it goes. This would even be necessary: considering the web as raw data requires a programmatic approach.
This is web scraping, generally. It’s a means of collecting data from the internet via the strings of HTML that determine the content and appearance of webpages.
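A bare-bones sketch of that keyword-counting idea might look like the following. The URL and keyword here are placeholders, and the requests module doing the fetching gets a proper introduction later in this post.

    import requests

    # Fetch a page's raw HTML and count how many times a keyword appears.
    # Placeholder URL and keyword; swap in real targets.
    html = requests.get('https://example.com/some-blog-post').text
    print(html.lower().count('python'))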
What might someone do with this kind of data? Some examples:
Find email addresses proximate to certain keywords for spamming purposes/lead generation.
Funnel content from a bunch of different websites into one. Imagine, for example, a single site that aggregates (illicitly, probably) raw content from a dozen other websites.
Harvest stats from government websites.
Scan listings from multiple job sites for search strings indicating gender bias.
Perform sentiment analysis on blog sites from a variety of platforms (Tumblr, WordPress, etc.).
Monitor price fluctuations among many different web retailers for a specific product.
There’s really no end to it.
The prerequisites for this Hack This are the same as for every other one that’s based on the Python programming language: Assuming you’ve already downloaded and installed Python, you should do two things. One: spend 10 minutes doing this “Hello, World” Python for non-programmers tutorial. Two: spend another five minutes doing this tutorial on using Python modules.
0.0) Scrape at your own risk
First off, there's a lot of sketchiness and perceived sketchiness around web scraping. The legality of scraping is, generally, unsettled, and some major cases have arisen in recent years involving startups whose entire business plans revolve around harvesting the websites of other businesses. For example, a few years ago Craigslist went after a website called 3Taps, which had been scraping and republishing housing listings from the classifieds giant. Eventually, 3Taps and Craigslist settled, with the former agreeing to pay $1,000,000 to the latter (which is to donate the sum to the Electronic Frontier Foundation over a 10-year period).
For our purposes, as web scraping tourists, we’re probably fine, but it’s important to keep in mind that what we’re doing can be considered unauthorized use.
0.1) Use the API, if it exists
More and more sites offer public APIs that allow us to plug directly into our desired data sources without scraping at all. You'll never need to scrape a website for weather data, for example, because weather services offer their content not just as collections of webpages but as web services that we can access in predefined ways. Likewise, you will never need to scrape Twitter or Flickr or, hell, the New York Times. All offer ready-to-go developer tools that have the handy feature of well-defined usage policies.
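The machinery is the same as what we'll use for scraping; only the response is different (typically structured JSON rather than HTML). A rough sketch, where the endpoint is a made-up placeholder rather than any real API:

    import requests

    # Hypothetical endpoint; a real API documents its own URLs, parameters,
    # and usage policies.
    data = requests.get('https://api.example.com/v1/weather?city=Portland').json()
    print(data)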
1.0) Beautiful Soup
Disclaimer: I’m learning Beautiful Soup with you. BS is a set of Python tools (a Python module, or package) for extracting data from HTML documents, but it’s hardly the only set. For one thing, it’s very possible to scrape a webpage without any specialized tools at all, and that’s how I learned to do it—downloading HTML and then parsing it using the pattern matching capabilities of regular expressions. Beautiful Soup just eats all of these details and hides them from view, which allows us to focus not on the guts of HTML parsing but on the data itself.
Beautiful Soup will allow us to specify at a relatively high level what it is exactly that we’re after in a given HTML document (or on a given webpage) and it will go after that content with some relatively efficient (compared to my wholesale downloading and pattern matching above) parsing methods.
As usual, we’ll start with a new/blank Python file. Use whatever text editor you like. I’m using Sublime Text, which costs money. Atom is a comparable freeware editor. You’ll need to install Beautiful Soup, of course. Using pip, it’s just:
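    pip install bs4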
Bs4 is the package name of the current Beautiful Soup release.
2.0) Find a target
My idea is to scrape the website of Clark County, Washington for data on current jail inmates. The county publishes a roster that includes a small amount of information: inmate name, booking date, location, and charges. I think it might be interesting to look at charges vs. total period of incarceration between the given booking date and today’s date. We might not get to that point here, but this will provide a context for our example. I also don’t feel too bad about directing a bunch of traffic-noise to the website of a jail (which is in Vancouver, Washington, near Portland, Oregon).
(The seed for this idea comes via a scraping tutorial offered at bootcamps given by the Investigative Reporters and Editors at the University of Missouri circa 2013 and 2014.)
2.1) Know your target
Even if you’re not an HTML ace, it’s worth taking a look at the raw code behind your target website, just to get a feel for things. So, just do a right-click on the page and hit “view page source.” Scanning through the resulting code will reveal that inmate records are all bunched together in a big knot about halfway through the page.
We can see that each record is formatted like so (I redacted the inmate name myself):
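The markup below is an illustration rather than the county's verbatim HTML (the values are invented), but the shape is what matters: a tr with the booking date in the third cell and the charges in the sixth, positions we'll lean on later.

    <!-- illustrative reconstruction; values are invented -->
    <tr>
      <td>[REDACTED], [REDACTED]</td>
      <td>2018-0000000</td>
      <td>9/12/2018</td>
      <td>Clark County Jail</td>
      <td>Main Floor</td>
      <td>THEFT IN THE SECOND DEGREE</td>
    </tr>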
The tag tr denotes a row in a table, while td denotes a table cell. The different units of inmate data are jammed into these cells. The booking date and the charge don’t seem to have any special identifiers. We’ll see if that matters in a minute.
3.0) Parse the HTML document
This is really pretty simple. In our script (so far), we're just importing the actual Beautiful Soup module and then we're using it to open and parse a webpage, which is accomplished as below. The resulting soup variable is now our window into the parsed webpage and the various operations that we can perform on it. "html5lib" here tells Beautiful Soup to use the specific parser called html5lib. There are a few different HTML parsers Beautiful Soup can use, and they all handle HTML a little bit differently. This one is built to parse HTML5 code in the same way that modern browsers do, and it also happens to be the one that worked best for the jail-scraping problem. (You will probably need to install it using pip, e.g. "pip install html5lib.")
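Here's that opening chunk of the script (it reappears at the top of the full script at the end of this post):

    from bs4 import BeautifulSoup
    import requests

    # Request the jail roster page, then hand its raw HTML to Beautiful Soup.
    url = 'https://www.clark.wa.gov/sheriff/jail-roster'
    response = requests.get(url)
    html = response.content

    soup = BeautifulSoup(html, "html5lib")

    # Dump the parsed document back out as nicely indented HTML.
    print(soup.prettify())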
There are a few things happening here. For one thing, note that we're using a module called requests in addition to Beautiful Soup. This is what queries the target webpage, which then responds by barfing its HTML back to us. This HTML barf is then stashed in a variable called response. The whole response contains some stuff in addition to the actual HTML that we don't really care about, so we access a property called content to get the actual payload. We then pass that barf on to Beautiful Soup, which parses it all and makes it available via our new soup variable.
Finally, we’re just printing the HTML to the screen via Beautiful Soup’s own prettify() function.
4.0) Trim it down
The parsing is already done, so everything that happens from here on out is going to be done not to the HTML itself, but to a data structure maintained by Beautiful Soup (which we access through our soup variable). Let's start by grabbing the entire table of inmates. If we look at the prettified HTML we just printed, we can see that this table starts like this:
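Judging from the find() call we'll use in a moment, the part of that opening tag that matters for us is its id (the tag may well carry other attributes too):

    <table id="jmsInfo">

We can then pluck the whole table out of the soup by that id:

    table = soup.find('table', attrs={'id': 'jmsInfo'})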
With Beautiful Soup, we can access the children of a tag by using the "contents" property. This will contain, as a list, all of the tags that exist at the next level below the specified tag. So, if we wanted to get the third cell of a table row (corresponding to the booking date), we would reach for row.contents[2]. Here is the whole script, which prints the booking date from each row of our inmate roster table followed by the associated charges:

    from bs4 import BeautifulSoup
    import requests

    # Fetch the roster page and parse it.
    url = 'https://www.clark.wa.gov/sheriff/jail-roster'
    response = requests.get(url)
    html = response.content
    soup = BeautifulSoup(html, "html5lib")

    # Grab the roster table by its id, then walk its rows.
    table = soup.find('table', attrs={'id': 'jmsInfo'})
    for row in table.findAll('tr'):  # find_all is the equivalent modern BS4 spelling
        # contents[2] is the third cell (booking date); contents[5] is the charge.
        # Skip any row, like the header, where either cell lacks a plain string.
        if row.contents[5].string and row.contents[2].string:
            print("booking date: " + row.contents[2].string + " charge: " + row.contents[5].string)