Fun with Bar Plots: Examining the Holdings
Posted on Sun 26 June 2016 in posts • Tagged with data science, python, projects, new york city, historical data, visualization, bar plots
Using BeautifulSoup, I wrote the following script to identify and gather the holdings information for the New York City directory collection in Hathitrust. The script collected useful information about each directory, such as its title, publisher, publication year, and the total number of pages (really, files) in the digital version of the directory. With that information stored in my catalog data frame, I'm going to produce some tables to get a better sense of what I have. Specifically, I'd like to know how many directories I have for each year from 1800 to 1899, and how that collection breaks down by publisher and repository.
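As a rough sketch of the kind of tables described above, here is one way to tally holdings with pandas. The column names (`title`, `publisher`, `repository`, `year`) and the sample rows are placeholders I've made up for illustration, not the actual catalog.

```python
import pandas as pd

# Hypothetical rows standing in for the real catalog data frame;
# the column names and values are assumptions for this sketch.
catalog = pd.DataFrame(
    {
        "title": ["Directory A", "Directory B", "Directory C", "Directory D"],
        "publisher": ["Publisher X", "Publisher X", "Publisher Y", "Publisher Y"],
        "repository": ["Library 1", "Library 2", "Library 1", "Library 1"],
        "year": [1850, 1850, 1851, 1905],
    }
)

# Keep only directories published from 1800 through 1899 (inclusive).
in_range = catalog[catalog["year"].between(1800, 1899)]

# Count directories per year, filling in zero for years with no holdings
# so that gaps in the collection stay visible.
per_year = in_range["year"].value_counts().reindex(range(1800, 1900), fill_value=0)

# Break the same collection down by publisher and by repository.
by_publisher = in_range["publisher"].value_counts()
by_repository = in_range["repository"].value_counts()

print(per_year[per_year > 0])
print(by_publisher)
print(by_repository)
```

Reindexing over the full range of years is a deliberate choice here: a bar plot drawn from `per_year` would then show empty years as gaps rather than silently skipping them.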
Scraping, Cleaning, and Importing: an Overview
Posted on Sun 19 June 2016 in posts • Tagged with data science, python, projects, new york city, historical data, beautifulsoup, scraping, cleaning, importing
The first challenge in this project was getting the OCRed directories out of Hathitrust and into my main data frame. In text format, a single directory on Hathitrust typically consists of hundreds of individual text files: far too many to download by hand, but an easy task for Python. Once downloaded, each text file needs to be "cleaned" of HTML markup, and the resulting text imported into the database. I've already applied this three-stage process of scraping, cleaning, and importing to ninety-two directories. I'll go into more detail about each step at a later time, but for now, here's a general overview of how I went about it.
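To make the cleaning and importing stages concrete, here is a minimal sketch. It uses the standard library's `html.parser` instead of BeautifulSoup so that it runs without extra packages, and it skips the scraping stage entirely by working on made-up page strings. The names `TextExtractor`, `clean_page`, and `import_pages` are hypothetical, not taken from the project's code.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects the text content of a page and discards the markup."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        # Called for each run of text found between tags.
        self.parts.append(data)


def clean_page(raw_html):
    """Return the text of one downloaded page with HTML markup removed."""
    parser = TextExtractor()
    parser.feed(raw_html)
    parser.close()  # flush any text still buffered by the parser
    return "".join(parser.parts).strip()


def import_pages(pages):
    """Turn a sequence of raw pages into records ready to load into a table."""
    return [
        {"page": number, "text": clean_page(raw)}
        for number, raw in enumerate(pages, start=1)
    ]


# Made-up stand-ins for two downloaded page files.
sample_pages = [
    "<html><body><p>Hypothetical entry one</p></body></html>",
    "<div>Hypothetical &amp; entry two</div>",
]

records = import_pages(sample_pages)
for record in records:
    print(record["page"], record["text"])
```

A list of dictionaries like `records` can be handed straight to a data frame constructor, which is one plausible way to carry out the import step.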
Building a Local Catalog
Posted on Wed 15 June 2016 in posts • Tagged with python, projects, scraping, beautifulsoup, new york city, historical data, cattle
Finding New York City directories in Hathitrust is a little like herding cattle: you find the first bunch quickly, and then spend the rest of your time rounding up strays. Instead of a long list of individual directories, searching Hathitrust yields a series of record sets into which the individual directories have been organized. A record set often holds multiple directories, but sometimes contains only one, and a single directory series may appear across multiple record sets. It's at this point that the search process begins to feel like a long, lonely cattle drive.
Introducing the NYC Directory Project
Posted on Sun 12 June 2016 in posts • Tagged with data science, python, projects, new york city, historical data, demography, genealogy
The NYC Directory Project uses Python to extract data from nineteenth-century address directories that have been scanned and made publicly available on Hathitrust.org. The final result will be a demographic and geocoded dataset that can be searched and analyzed to learn more about the residents of New York City in the 1800s.