Scraping, Cleaning, and Importing: an Overview

Posted on Sun 19 June 2016 in posts • Tagged with data science, python, projects, new york city, historical data, beautifulsoup, scraping, cleaning, importing

The first challenge in this project was retrieving the OCRed directories from their home on Hathitrust and loading them into my main data frame. In text format, a single directory on Hathitrust typically consists of hundreds of individual text files--far too many to download by hand, but an easy task for Python. Once downloaded, each text file needs to be "cleaned" of HTML markup, and the resulting text imported into the database. I've already applied this three-stage process of scraping, cleaning, and importing to ninety-two directories. I'll go into more detail about each step at a later time, but for now, here's a general overview of how I went about it.
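As a rough sketch, the three stages look something like this in Python. The base URL, volume identifier, and table schema below are simplified stand-ins for illustration, not the project's actual ones:

```python
import sqlite3

import requests
from bs4 import BeautifulSoup

# Hypothetical URL and volume ID -- the real Hathitrust page addresses
# and identifiers differ; this only sketches the pipeline's shape.
BASE_URL = "https://babel.hathitrust.org/cgi/pt"
VOLUME_ID = "example.volume.id"

def scrape_page(seq):
    """Stage 1: download one OCRed page of a directory."""
    response = requests.get(BASE_URL, params={"id": VOLUME_ID, "seq": seq})
    response.raise_for_status()
    return response.text

def clean_page(raw_html):
    """Stage 2: strip the HTML markup, keeping only the OCRed text."""
    return BeautifulSoup(raw_html, "html.parser").get_text()

def import_pages(db_path, pages):
    """Stage 3: load (seq, text) pairs into a local SQLite table."""
    conn = sqlite3.connect(db_path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS pages (seq INTEGER PRIMARY KEY, text TEXT)"
    )
    conn.executemany("INSERT OR REPLACE INTO pages VALUES (?, ?)", pages)
    conn.commit()
    conn.close()

if __name__ == "__main__":
    cleaned = [(seq, clean_page(scrape_page(seq))) for seq in range(1, 11)]
    import_pages("directories.db", cleaned)
```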


Continue reading

Building a Local Catalog

Posted on Wed 15 June 2016 in posts • Tagged with python, projects, scraping, beautifulsoup, new york city, historical data, cattle

Finding New York City directories in Hathitrust is a little like herding cattle: you find the first bunch quickly, then spend the rest of your time rounding up strays. Instead of a long list of individual directories, a Hathitrust search yields a series of record sets into which the individual directories have been organized. A record set often holds multiple directories but sometimes contains only one, and a single directory series may appear across multiple record sets. It's at this point that the search process begins to feel like a long, lonely cattle drive.
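One way to keep the herd in view is a local catalog that inverts the record-set-to-directory mapping, so a directory that turns up in more than one record set is flagged automatically. Here's a minimal sketch; the record IDs and directory titles are invented examples, not real search results:

```python
from collections import defaultdict

# Hypothetical search results: each record set maps to the directory
# volumes it contains.
record_sets = {
    "record-001": ["Doggett's Directory 1845", "Doggett's Directory 1846"],
    "record-002": ["Trow's Directory 1855"],
    "record-003": ["Doggett's Directory 1846", "Doggett's Directory 1847"],
}

# Invert the mapping: for each directory, note every record set where it
# appears, so duplicates surface as multi-entry lists.
catalog = defaultdict(list)
for record_id, directories in record_sets.items():
    for title in directories:
        catalog[title].append(record_id)

for title, sources in sorted(catalog.items()):
    print(title, "->", sources)
```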


Continue reading