Scraping, Cleaning, and Importing: an Overview

Posted on Sun 19 June 2016 in posts

The first challenge in this project was getting the OCRed directories from their home on HathiTrust into my main data frame. In text format, a single directory on HathiTrust typically consists of hundreds of individual text files--far too many to download by hand, but an easy task for Python. Once downloaded, each text file needs to be "cleaned" of HTML markup and the resulting text imported into the database. I've already applied this three-stage process of scraping, cleaning, and importing to ninety-two directories. I'll go into more detail about each step at a later time, but for now, here's a general overview of how I went about it.

In [1]:
import os
import pandas as pd

As I've discussed in an earlier blog entry, I used BeautifulSoup to create a 'catalog' data frame that contains information about each of the 111 New York City directories that I've found in HathiTrust. Among other things, the data frame keeps track of which directories have been scraped, cleaned, and imported. The most up-to-date version is stored locally as a pickle. I'll begin by loading that data:

In [2]:
df_cat = pd.read_pickle('../blogdata/catalog_overview.p')

In addition to a large number of nineteenth-century directories, the catalog data frame contains quite a few entries from the late 1700s and the early 1900s. Since those aren't of interest to me at the moment, I'll exclude them from my working data frame. I'll also limit the working data frame to the six columns relevant for understanding the process of scraping, cleaning, and importing the HathiTrust directories: the publisher, the year of publication, the total number of pages in the digital version of each directory, and the number of pages that have already been scraped, cleaned, and imported:

In [3]:
df_cat = df_cat[(df_cat.pub_year > 1799) & (df_cat.pub_year < 1900)]
select_cols = ['publisher', 'pub_year', 'total_pages', 
               'scraped', 'cleaned', 'imported']
df_cat = df_cat[select_cols]

Next, I'll give those columns more user-friendly names for display here:

In [4]:
display_names = ['Publisher', 'Year', 'Total Pages', 'Scraped', 'Cleaned', 'Imported']
df_cat.columns = display_names

Scraping HathiTrust

The scraping process grabs each OCRed page from each directory in the catalog and stores it locally as a text file with HTML markup. Scraping a single directory takes a fair amount of time: about two seconds per page. This can add up, especially when scraping directories from the second half of the nineteenth century, which grew in size along with the population of New York City. (A rough sketch of the download step appears at the end of this section.) Happily, I've already downloaded the text files for most of the nineteenth-century directories in HathiTrust, and only the directories from the 1890s are left:

In [5]:
unscraped = (df_cat.Scraped == 0)
print("\n\nNumber of directories waiting to be scraped: " + str(sum(unscraped)))
df_cat[unscraped].sort_values('Year')

Number of directories waiting to be scraped: 4
Out[5]:
    Publisher  Year  Total Pages  Scraped  Cleaned  Imported
88       Trow  1890          408        0        0         0
69       Trow  1894         1700        0        0         0
89       Trow  1895          380        0        0         0
70       Trow  1896         1750        0        0         0

It's worth noting the wide variation in the length of the four directories above. The 1890 and 1895 publications are surprisingly short at less than 500 pages each--a length more typical of directories from the 1820s. By the 1890s, New York City was a bona fide metropolis with a population of approximately 1.5 million. That, combined with the addition of advertisements, meant that most New York City address directories of that decade had 1500 pages or more.

So what's going on here? The shorter publications above are business directories, not residential ones. That's fun data to have, so there's no harm in scraping, cleaning, and importing the business directories too, but I'll exclude them from the main analysis later.
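For the curious, here's a rough sketch of the per-page download step mentioned above. The URL pattern and the file-naming scheme are hypothetical stand-ins (the real script builds its requests from the catalog's HathiTrust identifiers), but the basic shape--fetch one page, save the raw marked-up text locally, wait a couple of seconds--is the same:

import os
import time
import requests

def scrape_page(base_url, page_num, out_dir):
    # base_url and the 'seq' parameter are illustrative placeholders,
    # not necessarily HathiTrust's actual URL scheme
    url = "{}&seq={}".format(base_url, page_num)
    response = requests.get(url)
    response.raise_for_status()
    out_path = os.path.join(out_dir, "page_{:04d}.txt".format(page_num))
    with open(out_path, "w", encoding="utf-8") as f:
        f.write(response.text)   # raw page text, HTML markup and all
    time.sleep(2)                # stay polite: roughly two seconds per page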

Cleaning

As each page is scraped, the page number is recorded in the "Scraped" column. If the process is interrupted, the script picks up where it left off the next time it runs. Once all pages in a directory have been scraped, the directory is ready for cleaning.
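Here's a minimal sketch of that bookkeeping, written against the working data frame from above (in reality my script works with the full catalog). The scrape_page helper is the hypothetical one sketched in the previous section, and the progress pickle's path is likewise made up:

def scrape_directory(df_cat, idx, base_url, out_dir):
    # 'Scraped' holds the number of the last page successfully downloaded,
    # so an interrupted run simply resumes at the next page
    start = int(df_cat.at[idx, 'Scraped']) + 1
    total = int(df_cat.at[idx, 'Total Pages'])
    for page_num in range(start, total + 1):
        scrape_page(base_url, page_num, out_dir)   # sketched earlier
        df_cat.at[idx, 'Scraped'] = page_num
        df_cat.to_pickle('catalog_progress.p')     # persist progress after each page
    return df_cat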

The uncleaned directory pages exist locally as text files, full of HTML markup. Thankfully, HathiTrust was judicious and consistent in its use of "id" and "class" attributes, so removing the markup is easy using BeautifulSoup. Currently, four recently-scraped directories, all from the 1880s, are sitting in the cleaning queue:

In [6]:
unclean = (df_cat.Scraped > 0) & (df_cat.Cleaned == 0)
print("\n\nNumber of scraped directories waiting to be cleaned: " + str(sum(unclean)))
df_cat[unclean].sort_values('Year')

Number of scraped directories waiting to be cleaned: 4
Out[6]:
     Publisher  Year  Total Pages  Scraped  Cleaned  Imported
79        Trow  1880         1910     1910        0         0
80        Trow  1886         2382     2382        0         0
105       Trow  1888          254      254        0         0
106       Trow  1889          310      310        0         0

The cleaning process produces a collection of text files, each containing HTML-free, OCRed text from one scanned page. Blank pages--pages that had no OCRed text at all--are omitted during the cleaning process. As each page is processed, the "Cleaned" column is updated with its page number.
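To make that concrete, a cleaning pass might look something like the sketch below. The 'ocr-text' class name is a made-up placeholder rather than HathiTrust's actual markup--the real script keys off the id and class attributes mentioned above:

from bs4 import BeautifulSoup

def clean_page(raw_path, clean_path):
    # Strip the HTML from one scraped page; return False if the page was blank
    with open(raw_path, encoding='utf-8') as f:
        soup = BeautifulSoup(f.read(), 'html.parser')
    container = soup.find('div', class_='ocr-text')   # hypothetical selector
    text = container.get_text('\n').strip() if container else ''
    if not text:                 # blank page: nothing worth keeping
        return False
    with open(clean_path, 'w', encoding='utf-8') as f:
        f.write(text)
    return True

Returning a boolean makes it easy for the calling loop to skip blank pages, which is how they end up omitted from the cleaned set.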

Importing

Once cleaned, the files are imported line-by-line into an existing Pandas data frame (a sketch of this step appears after the table below). Fourteen directories, mostly from the 1870s, have been scraped and cleaned and are waiting to join the seventy-four directories that have already been imported into the project data frame:

In [7]:
ready_for_import = (df_cat.Cleaned > 0) & (df_cat.Imported == 0)
print("\n\nNumber of cleaned directories waiting to be imported: " + str(sum(ready_for_import)))
df_cat[ready_for_import].sort_values('Year')

Number of cleaned directories waiting to be imported: 14
Out[7]:
     Publisher  Year  Total Pages  Scraped  Cleaned  Imported
109      Jones  1805          450      450      450         0
110    Mercein  1820          452      452      452         0
30        Trow  1870         1690     1690     1690         0
68        Trow  1870         1680     1680     1680         0
31        Trow  1871         1598     1598     1598         0
32        Trow  1872         1712     1712     1712         0
33        Trow  1873         1724     1724     1724         0
101       Trow  1874          132      132      132         0
34        Trow  1876         1702     1702     1702         0
35        Trow  1876         1736     1736     1736         0
78        Trow  1876         1722     1722     1722         0
102       Trow  1876          130      130      130         0
103       Trow  1878          130      130      130         0
104       Trow  1879          156      156      156         0
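The import step itself is conceptually simple: read each cleaned page, keep every non-empty line, and append those lines as new rows. Here's a rough sketch, assuming one cleaned text file per page with a made-up naming scheme:

import os
import pandas as pd

def import_directory(clean_dir, directory_id, main_df):
    # One cleaned file per page, named like 'page_0001.txt' -- a hypothetical
    # layout used here only for illustration
    records = []
    for fname in sorted(os.listdir(clean_dir)):
        page_num = int(os.path.splitext(fname)[0].split('_')[-1])
        with open(os.path.join(clean_dir, fname), encoding='utf-8') as f:
            for line in f:
                line = line.strip()
                if line:                                   # skip empty lines
                    records.append({'directory_id': directory_id,
                                    'page': page_num,
                                    'entry': line})
    return pd.concat([main_df, pd.DataFrame(records)], ignore_index=True)

Building all of a directory's rows first and concatenating once keeps the import reasonably fast, compared with appending to the data frame line by line.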

Currently the main data frame has over six million rows, each representing one line of text from a directory. Once all of the nineteenth-century directories have been imported, the main data frame will have about ten million rows, and I'll move on to the next phase of this project: parsing the address entries into distinct columns ('last name', 'first name', 'occupation', etc.).

But before doing that, I'll need to determine which of the rows in the main data frame represent address entries and which represent other types of text. As this subset of the main data frame shows, many of the rows are in fact addresses:

In [8]:
pd.read_pickle('../blogdata/address_sample.p')
Out[8]:
entry
98500 Clark Barney, shoemaker, 86 Mulberry
98521 Clark Daniel, laborer, Eighth c. Av. C
98528 Clark Dennis, laborer, 64 Gouverneur
98529 Clark Ebenezer, strawedgings, 53 Factory
98534 Clark Edward, paper carrier, 168 Clinton
98535 Clark Edward, laborer, r. 79 W. 18th
98538 Clark Edward, (col'd) porter, 13 York
98567 Clark George, machinist, r. 37 Broome
98568 Clark George, meat.curer, 276 Mulberry
98586 Clark Hezekiah, carman, E. 25th n. Av. 1
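Just to show where this is headed, a first pass at parsing rows like these could simply split on commas. The sketch below assumes a clean "Last First, occupation, address" pattern; the real parser will have to cope with abbreviations, missing pieces, and OCR errors:

def parse_entry(entry):
    # Naive split of 'Lastname Firstname, occupation, address' -- illustration only
    parts = [p.strip() for p in entry.split(',')]
    if len(parts) < 3:
        return None                       # doesn't fit the simple pattern
    name_tokens = parts[0].split()
    return {'last_name': name_tokens[0],
            'first_name': ' '.join(name_tokens[1:]),
            'occupation': parts[1],
            'address': ', '.join(parts[2:])}

parse_entry('Clark Barney, shoemaker, 86 Mulberry')
# {'last_name': 'Clark', 'first_name': 'Barney',
#  'occupation': 'shoemaker', 'address': '86 Mulberry'}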

At the same time, many directory pages contain no addresses at all. These include the introduction and table of contents, as well as supplemental pages such as weather tables, informational essays, and advertisements. The rows produced from these pages will need to be excluded from the analysis:

In [12]:
pd.set_option('display.max_colwidth',100)
pd.read_pickle('../blogdata/suppl_sample.p')
Out[12]:
entry
939804 PUBLISHED BY E. ANTHONY–ENGRAVED BY T. DONEY.
939805 This great engraving, the fruit of four years' labor and got up at an expense of many thousand d...
939806 publisher and by his agents.
939807 The reputation it has attained from the highest sources in this country has been endorsed by the...
939808 and it staids, beyond a cavil, among the proudest monuments of the Fine Arts.
939809 The number of likenesses, of the most distinguished men of ºur country, contained in the engravi...
939810 3 with which the artist has succeeded in representing them is regºrie': an unsurpassed in the an...
939811 TRIBUTE T G A nºt gº tº a CAN Artºt.
939812 The Publisher of the Engraving of the United States set are chºr tº r, cºved from the Emperor of...
939813 medal of the weight of three ounces, inost beautifully executed, bearing ºn cue side the likenes...
939814 on the reverse a beautiful wreath of oak and laurel leaves, surrounding the motto “Recua lueri.”
939815 Accompanying the nuedal was the following dispatch :

Fortunately, there are a number of ways to use Python to isolate address entries from other types of text. I'll be talking about those in a future blog entry.
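To give a flavor of what those might look like, here's one crude heuristic: flag a row as address-like if it has the two-comma "name, occupation, location" shape and contains either a street number or one of the abbreviations ('c.', 'n.', 'r.') visible in the sample above. It's an illustration only, with plenty of false positives and negatives, and not necessarily the approach I'll end up using:

import re
import pandas as pd

# A street number (e.g. '86 Mulberry') or one of the abbreviations seen in the sample
address_like = re.compile(r'\d+\s+[A-Z]|\b[cnr]\.\s')

def looks_like_address(entry):
    return entry.count(',') >= 2 and bool(address_like.search(entry))

sample = pd.read_pickle('../blogdata/address_sample.p')
sample['is_address'] = sample['entry'].apply(looks_like_address)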