Scraping, Cleaning, and Importing: an Overview

Posted on Sun 19 June 2016 in posts

The first challenge in this project was getting the OCRed directories from their home on HathiTrust into my main data frame. In text format, a single directory on HathiTrust typically consists of hundreds of individual text files--far too many to download by hand, but an easy task for Python. Once downloaded, each text file needs to be "cleaned" of HTML markup and the resulting text imported into the database. I've already applied this three-stage process of scraping, cleaning, and importing to ninety-two directories. I'll go into more detail about each step at a later time, but for now, here's a general overview of how I went about it.

In [1]:
import os
import pandas as pd

As I've discussed in an earlier blog entry, I used BeautifulSoup to create a 'catalog' data frame that contains information about each of the 111 New York City directories that I've found in HathiTrust. Among other things, the data frame keeps track of which directories have been scraped, cleaned, and imported. The most up-to-date version is stored locally as a pickle. I'll begin by loading that data:

In [2]:
df_cat = pd.read_pickle('../blogdata/catalog_overview.p')

In addition to a large number of nineteenth-century directories, the catalog data frame contains quite a few entries from the late 1700s and the early 1900s. Since those aren't of interest to me at the moment, I'll exclude them from my working data frame. I'll also limit the working data frame to the six columns relevant for understanding the process of scraping, cleaning, and importing the HathiTrust directories: the publisher, the year of publication, the total number of pages in the digital version of each directory, and the number of pages that have already been scraped, cleaned, and imported:

In [3]:
df_cat = df_cat[(df_cat.pub_year > 1799) & (df_cat.pub_year < 1900)]
select_cols = ['publisher', 'pub_year', 'total_pages', 
               'scraped', 'cleaned', 'imported']
df_cat = df_cat[select_cols]

Next, I'll give those columns more user-friendly names for display here:

In [4]:
display_names = ['Publisher', 'Year', 'Total Pages', 'Scraped', 'Cleaned', 'Imported']
df_cat.columns = display_names

Scraping HathiTrust

The scraping process grabs each OCRed page from each directory in the catalog and stores it locally as a text file with HTML markup. Scraping a single directory takes a fair amount of time: about two seconds per page. This can add up, especially when scraping directories from the second half of the nineteenth century, which grew in size along with the population of New York City. (A rough sketch of the download step appears at the end of this section.) Happily, I've already downloaded the text files for most of the nineteenth-century directories in HathiTrust, and only the directories from the 1890s are left:

In [5]:
unscraped = (df_cat.Scraped == 0)
print("\n\nNumber of directories waiting to be scraped: " + str(sum(unscraped)))
df_cat[unscraped].sort_values('Year')

Number of directories waiting to be scraped: 4
Out[5]:
    Publisher  Year  Total Pages  Scraped  Cleaned  Imported
88       Trow  1890          408        0        0         0
69       Trow  1894         1700        0        0         0
89       Trow  1895          380        0        0         0
70       Trow  1896         1750        0        0         0

It's worth noting the wide variation in the length of the four directories above. The 1890 and 1895 publications are surprisingly short at less than 500 pages each--a length more typical of directories from the 1820s. By the 1890s, New York City was a bona fide metropolis with a population of approximately 1.5 million. That, combined with the addition of advertisements, meant that most New York City address directories of that decade had 1500 pages or more.

So what's going on here? The shorter publications above are business directories, not residential ones. That's fun data to have, so there's no harm in scraping, cleaning, and importing the business directories too, but I'll exclude them from the main analysis later.
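For the curious, here's a rough sketch of the per-page download step mentioned above. The URL pattern and the file-naming scheme are hypothetical stand-ins (the real script builds its requests from the catalog's HathiTrust identifiers), but the basic shape--fetch one page, save the raw marked-up text locally, wait a couple of seconds--is the same:

import os
import time
import requests

def scrape_page(base_url, page_num, out_dir):
    # base_url and the 'seq' parameter are illustrative placeholders,
    # not necessarily HathiTrust's actual URL scheme
    url = "{}&seq={}".format(base_url, page_num)
    response = requests.get(url)
    response.raise_for_status()
    out_path = os.path.join(out_dir, "page_{:04d}.txt".format(page_num))
    with open(out_path, "w", encoding="utf-8") as f:
        f.write(response.text)   # raw page text, HTML markup and all
    time.sleep(2)                # stay polite: roughly two seconds per page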

Cleaning

As each page is scraped, the page number is recorded in the "Scraped" column. If the process is interrupted, the script picks up where it left off the next time it runs. Once all pages in a directory have been scraped, the directory is ready for cleaning.
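Here's a minimal sketch of that bookkeeping, written against the working data frame from above (in reality my script works with the full catalog). The scrape_page helper is the hypothetical one sketched in the previous section, and the progress pickle's path is likewise made up:

def scrape_directory(df_cat, idx, base_url, out_dir):
    # 'Scraped' holds the number of the last page successfully downloaded,
    # so an interrupted run simply resumes at the next page
    start = int(df_cat.at[idx, 'Scraped']) + 1
    total = int(df_cat.at[idx, 'Total Pages'])
    for page_num in range(start, total + 1):
        scrape_page(base_url, page_num, out_dir)   # sketched earlier
        df_cat.at[idx, 'Scraped'] = page_num
        df_cat.to_pickle('catalog_progress.p')     # persist progress after each page
    return df_cat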

The uncleaned directory pages exist locally as text files, full of HTML markup. Thankfully, HathiTrust was judicious and consistent in its use of "id" and "class" attributes, so removing the markup is easy using BeautifulSoup. Currently, four recently-scraped directories, all from the 1880s, are sitting in the cleaning queue:

In [6]:
unclean = (df_cat.Scraped > 0) & (df_cat.Cleaned == 0)
print("\n\nNumber of scraped directories waiting to be cleaned: " + str(sum(unclean)))
df_cat[unclean].sort_values('Year')

Number of scraped directories waiting to be cleaned: 4
Out[6]:
     Publisher  Year  Total Pages  Scraped  Cleaned  Imported
79        Trow  1880         1910     1910        0         0
80        Trow  1886         2382     2382        0         0
105       Trow  1888          254      254        0         0
106       Trow  1889          310      310        0         0

The cleaning process produces a collection of text files, each containing HTML-free, OCRed text from one scanned page. Blank pages--pages that had no OCRed text at all--are omitted during the cleaning process. As each page is processed, the "Cleaned" column is updated with its page number.
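To make that concrete, a cleaning pass might look something like the sketch below. The 'ocr-text' class name is a made-up placeholder rather than HathiTrust's actual markup--the real script keys off the id and class attributes mentioned above:

from bs4 import BeautifulSoup

def clean_page(raw_path, clean_path):
    # Strip the HTML from one scraped page; return False if the page was blank
    with open(raw_path, encoding='utf-8') as f:
        soup = BeautifulSoup(f.read(), 'html.parser')
    container = soup.find('div', class_='ocr-text')   # hypothetical selector
    text = container.get_text('\n').strip() if container else ''
    if not text:                 # blank page: nothing worth keeping
        return False
    with open(clean_path, 'w', encoding='utf-8') as f:
        f.write(text)
    return True

Returning a boolean makes it easy for the calling loop to skip blank pages, which is how they end up omitted from the cleaned set.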

Importing

Once cleaned, the files are imported line-by-line into an existing Pandas data frame (a sketch of this step appears after the table below). Fourteen directories, mostly from the 1870s, have been scraped and cleaned and are waiting to join the seventy-four directories that have already been imported into the project data frame:

In [7]:
ready_for_import = (df_cat.Cleaned > 0) & (df_cat.Imported == 0)
print("\n\nNumber of cleaned directories waiting to be imported: " + str(sum(ready_for_import)))
df_cat[ready_for_import].sort_values('Year')

Number of cleaned directories waiting to be imported: 14
Out[7]:
     Publisher  Year  Total Pages  Scraped  Cleaned  Imported
109      Jones  1805          450      450      450         0
110    Mercein  1820          452      452      452         0
30        Trow  1870         1690     1690     1690         0
68        Trow  1870         1680     1680     1680         0
31        Trow  1871         1598     1598     1598         0
32        Trow  1872         1712     1712     1712         0
33        Trow  1873         1724     1724     1724         0
101       Trow  1874          132      132      132         0
34        Trow  1876         1702     1702     1702         0
35        Trow  1876         1736     1736     1736         0
78        Trow  1876         1722     1722     1722         0
102       Trow  1876          130      130      130         0
103       Trow  1878          130      130      130         0
104       Trow  1879          156      156      156         0
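The import step itself is conceptually simple: read each cleaned page, keep every non-empty line, and append those lines as new rows. Here's a rough sketch, assuming one cleaned text file per page with a made-up naming scheme:

import os
import pandas as pd

def import_directory(clean_dir, directory_id, main_df):
    # One cleaned file per page, named like 'page_0001.txt' -- a hypothetical
    # layout used here only for illustration
    records = []
    for fname in sorted(os.listdir(clean_dir)):
        page_num = int(os.path.splitext(fname)[0].split('_')[-1])
        with open(os.path.join(clean_dir, fname), encoding='utf-8') as f:
            for line in f:
                line = line.strip()
                if line:                                   # skip empty lines
                    records.append({'directory_id': directory_id,
                                    'page': page_num,
                                    'entry': line})
    return pd.concat([main_df, pd.DataFrame(records)], ignore_index=True)

Building all of a directory's rows first and concatenating once keeps the import reasonably fast, compared with appending to the data frame line by line.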

Currently the main data frame has over six million rows, each representing one line of text from a directory. Once all of the nineteenth-century directories have been imported, the main data frame will have about ten million rows, and I'll move on to the next phase of this project: parsing the address entries into distinct columns ('last name', 'first name', 'occupation', etc.).

But before doing that, I'll need to determine which of the rows in the main data frame represent address entries and which represent other types of text. As this subset of the main data frame shows, many of the rows are in fact addresses:

In [8]:
pd.read_pickle('../blogdata/address_sample.p')
Out[8]:
entry
98500 Clark Barney, shoemaker, 86 Mulberry
98521 Clark Daniel, laborer, Eighth c. Av. C
98528 Clark Dennis, laborer, 64 Gouverneur
98529 Clark Ebenezer, strawedgings, 53 Factory
98534 Clark Edward, paper carrier, 168 Clinton
98535 Clark Edward, laborer, r. 79 W. 18th
98538 Clark Edward, (col'd) porter, 13 York
98567 Clark George, machinist, r. 37 Broome
98568 Clark George, meat.curer, 276 Mulberry
98586 Clark Hezekiah, carman, E. 25th n. Av. 1
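Just to show where this is headed, a first pass at parsing rows like these could simply split on commas. The sketch below assumes a clean "Last First, occupation, address" pattern; the real parser will have to cope with abbreviations, missing pieces, and OCR errors:

def parse_entry(entry):
    # Naive split of 'Lastname Firstname, occupation, address' -- illustration only
    parts = [p.strip() for p in entry.split(',')]
    if len(parts) < 3:
        return None                       # doesn't fit the simple pattern
    name_tokens = parts[0].split()
    return {'last_name': name_tokens[0],
            'first_name': ' '.join(name_tokens[1:]),
            'occupation': parts[1],
            'address': ', '.join(parts[2:])}

parse_entry('Clark Barney, shoemaker, 86 Mulberry')
# {'last_name': 'Clark', 'first_name': 'Barney',
#  'occupation': 'shoemaker', 'address': '86 Mulberry'}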

At the same time, many directory pages contain no addresses at all. These include the introduction and table of contents, as well as supplemental pages such as weather tables, informational essays, and advertisements. The rows produced from these pages will need to be excluded from the analysis:

In [12]:
pd.set_option('display.max_colwidth',100)
pd.read_pickle('../blogdata/suppl_sample.p')
Out[12]:
entry
939804 PUBLISHED BY E. ANTHONY–ENGRAVED BY T. DONEY.
939805 This great engraving, the fruit of four years' labor and got up at an expense of many thousand d...
939806 publisher and by his agents.
939807 The reputation it has attained from the highest sources in this country has been endorsed by the...
939808 and it staids, beyond a cavil, among the proudest monuments of the Fine Arts.
939809 The number of likenesses, of the most distinguished men of ºur country, contained in the engravi...
939810 3 with which the artist has succeeded in representing them is regºrie': an unsurpassed in the an...
939811 TRIBUTE T G A nºt gº tº a CAN Artºt.
939812 The Publisher of the Engraving of the United States set are chºr tº r, cºved from the Emperor of...
939813 medal of the weight of three ounces, inost beautifully executed, bearing ºn cue side the likenes...
939814 on the reverse a beautiful wreath of oak and laurel leaves, surrounding the motto “Recua lueri.”
939815 Accompanying the nuedal was the following dispatch :

Fortunately, there are a number of ways to use Python to isolate address entries from other types of text. I'll be talking about those in a future blog entry.
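To give a flavor of what those might look like, here's one crude heuristic: flag a row as address-like if it has the two-comma "name, occupation, location" shape and contains either a street number or one of the abbreviations ('c.', 'n.', 'r.') visible in the sample above. It's an illustration only, with plenty of false positives and negatives, and not necessarily the approach I'll end up using:

import re
import pandas as pd

# A street number (e.g. '86 Mulberry') or one of the abbreviations seen in the sample
address_like = re.compile(r'\d+\s+[A-Z]|\b[cnr]\.\s')

def looks_like_address(entry):
    return entry.count(',') >= 2 and bool(address_like.search(entry))

sample = pd.read_pickle('../blogdata/address_sample.p')
sample['is_address'] = sample['entry'].apply(looks_like_address)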