Scraping, Cleaning, and Importing: an Overview

Posted on Sun 19 June 2016 in posts • Tagged with data science, python, projects, new york city, historical data, beautifulsoup, scraping, cleaning, importing

The first challenge in this project was getting the OCRed directories out of their home on HathiTrust and into my main data frame. In text format, a single directory on HathiTrust typically consists of hundreds of individual text files: far too many to download by hand, but an easy task for Python. Once downloaded, each text file needs to be "cleaned" of HTML markup, and the resulting text imported into the database. I've already applied this three-stage process of scraping, cleaning, and importing to ninety-two directories. I'll go into more detail about each step at a later time, but for now, here's a general overview of how I went about it.
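To make the three stages concrete, here is a minimal sketch of how they might fit together. It is an illustration under stated assumptions, not the code actually used for this project: the URL template is a placeholder (the real HathiTrust page address is not given here), every function name is hypothetical, the standard library's `html.parser` stands in for BeautifulSoup so the example runs without extra installs, and a plain list of dictionaries stands in for the data frame.

```python
from html.parser import HTMLParser
from urllib.request import urlopen

# Placeholder address, NOT a real HathiTrust endpoint (assumption for illustration).
PAGE_URL_TEMPLATE = "https://example.org/directory/{volume_id}/page/{seq}"


class _TextExtractor(HTMLParser):
    """Collect text nodes, skipping the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self._chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self._chunks.append(data)

    def text(self):
        return "".join(self._chunks)


def scrape_page(volume_id, seq):
    """Stage 1 (scraping): download the raw HTML for one page of a directory."""
    url = PAGE_URL_TEMPLATE.format(volume_id=volume_id, seq=seq)
    with urlopen(url) as response:
        return response.read().decode("utf-8", errors="replace")


def clean_page(html):
    """Stage 2 (cleaning): strip the markup and drop blank lines."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    lines = (line.strip() for line in parser.text().splitlines())
    return "\n".join(line for line in lines if line)


def import_page(rows, volume_id, seq, text):
    """Stage 3 (importing): append one cleaned page as a row."""
    rows.append({"volume_id": volume_id, "seq": seq, "text": text})


def process_directory(volume_id, page_count, fetch=scrape_page):
    """Run all three stages over every page of one directory."""
    rows = []
    for seq in range(1, page_count + 1):
        raw = fetch(volume_id, seq)
        import_page(rows, volume_id, seq, clean_page(raw))
    return rows
```

The `fetch` parameter is there so the pipeline can be tried offline by passing in a function that returns canned HTML. The resulting list of rows could then be handed to something like `pandas.DataFrame(rows)`, if pandas is what holds the main data frame.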

