Building a Local Catalog
Posted on Wed 15 June 2016 in posts
Finding New York City directories in Hathitrust is a little like herding cattle: you find the first bunch quickly, and then spend the rest of your time rounding up strays. Instead of a long list of individual directories, searching Hathitrust yields a series of record sets into which the individual directories have been organized. A record set often holds multiple directories, but sometimes contains only one, and a single directory series may appear across multiple record sets. It's at this point that the search process begins to feel like a long, lonely cattle drive.
So before I could proceed, I first had to create my own mini-catalog: a little slice of Hathitrust, stored in a local data frame and dedicated to this project. My local catalog helps me keep track of what I've found so I don't waste time rounding up the same directories over and over again. I also use my local catalog to store bibliographic data associated with each directory--things like title, originating repository, and publication year. And as I convert the text pages stored on Hathitrust into a multi-year data frame of address entries, my local catalog helps me keep on top of where I am in the scraping, cleaning, and importing process.
Here's an example of what a record set looks like in Hathitrust.
And here's a snippet of my current local catalog (I've omitted some columns for readability):
import pandas as pd
import sys
import numpy as np
import urllib.request
import re
from bs4 import BeautifulSoup
from IPython.display import display, HTML
snippet = pd.read_pickle('../blogdata/catalog_snippet.p')
display(HTML(snippet.to_html(index=False)))
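If you were building a catalog like this from scratch, the bare bones would just be an empty data frame with the right columns. Here's a minimal sketch (I load mine from a pickle instead; the column names match the ones I use later in this post):
import pandas as pd
# an empty catalog with the columns my project tracks: bibliographic data plus
# workflow flags (scraped, cleaned, imported) -- just a sketch, not how I load mine
df_empty_cat = pd.DataFrame(columns=['source_id', 'record_id', 'cat_id', 'title',
                                     'pub_year', 'total_pages', 'repository',
                                     'scraped', 'cleaned', 'imported'])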
I've already gathered the id numbers for nineteen record sets at Hathitrust, but there's every reason to think I'll find more in the future, so I wanted a script that could mine Hathitrust, store the data in a local data frame, and add new data whenever I come across a record set I haven't seen before.
To demonstrate that process here, I've used the script to create a small local catalog that already contains data scraped from Hathitrust: five directories from the 1840s, all associated with record id number "100734456".
df_cat = pd.read_pickle('../blogdata/catalog_scrape.p')
display(HTML(df_cat.to_html(index=False)))
Now I'd like to add more directories to my local catalog, so I'll re-run the script, this time using three record numbers that I've found while searching for New York City directories in Hathitrust:
#Create list of Hathitrust page ids for directory holdings
ht_dirs = ['100734456', '008925998','012507789']
Before scraping, I check the list against the existing data frame in case I've scraped some of these record sets already. I've also added a few lines of code to anticipate the possibility that a user will run the script with an empty record list.
# return any record numbers not already in the local catalog
new_ht_dirs = np.setdiff1d(ht_dirs, df_cat.record_id)
print(new_ht_dirs)

# return any record numbers that were already in the local catalog
dup_ht_dirs = np.setdiff1d(ht_dirs, new_ht_dirs)
print(dup_ht_dirs)

# if the user ran the script with an empty record list, return an error message and exit.
if not ht_dirs :
    sys.exit('Please add record numbers to the list and rerun this script.')

# if any record ids already exist in the local catalog, let the user know which ones.
if dup_ht_dirs.size :
    print('\nAlert: The following records are already in the catalog data frame:\n')
    for rec_id in dup_ht_dirs :
        print(' ' + rec_id)

# if there are new records to be added, let the user know which ones.
if new_ht_dirs.size :
    print('\nThe following records are new and will be added to the catalog data frame:\n')
    for rec_id in new_ht_dirs :
        print(' ' + rec_id)
# if all the records already exist in the local catalog, return an error message and exit.
else :
    sys.exit('No new records to add.')
In this case, one of the three records in the list is already in the local data frame. This information is reported to the user, and only the new record numbers are carried forward.
Now that the script has verified the record numbers that will be processed, it can use those record numbers to access the relevant pages on Hathitrust. The stable URL for each set of records consists of the record number appended to a base URL, which I store here in a variable for use in the loop:
# base url for directory holdings pages
ht_url = 'https://catalog.hathitrust.org/Record/'
Next, I create a series of lists to store the data that I'll scrape from Hathitrust during the loop: the title, publication year, last page number, and catalog number of each directory in the record set, as well as the name of the repository that donated the directory to Hathitrust. I also set up lists for the record id and for the source id I'll assign later:
# create lists to store data from loop
source_id_list = []
record_id_list = []
cat_id_list = []
title_list = []
pub_year_list = []
total_pages_list = []
repository_list = []
The script now loops through the remaining items on the list, scraping information from Hathitrust and storing it in the lists I just created. Most of the data can be scraped from the main record page, but the last page number of each directory can only be obtained by following the link to each directory's page, so that part of the script gets a tiny bit tricky:
# Loop through holdings pages, scraping info about directory catalog id, title,
## year of publication, repository name, and last page (total_pages)
for rec_id in new_ht_dirs :
    i = 0

    # create full url
    myURL = ht_url + rec_id
    print('Scraping ' + myURL)

    # retrieve that html and store as string variable
    cat_html = urllib.request.urlopen(myURL).read()

    # convert html to soup
    soup = BeautifulSoup(cat_html, 'lxml')

    # in main html, loop through 'span/IndItem' to get list of pub_years
    for item in soup.find_all('span', attrs={'class' : 'IndItem'}) :
        pub_year = re.findall(r'\d{4}', item.text)
        if not pub_year :
            pub_year = 0
        else :
            pub_year = pub_year[0]
        pub_year_list.append(pub_year)
        i = i + 1

    # retrieve title from main html
    title = soup.title.text.split(': ')[-1].split(' |')[0]
    if not title :
        title = 'untitled'
    title_list.extend([title] * i)

    # create record_id_list
    record_id_list.extend([rec_id] * i)

    # ...loop through 'em/original_from' to get list of repositories
    for item in soup.find_all('em', attrs={'class' : 'original_from'}) :
        repository = item.text.split('(original from ')[-1].split(')')
        if not repository :
            repository = 'unknown'
        else :
            repository = repository[0]
        repository_list.append(repository)

    # ...loop through 'div/accessLinks' to get link to each directory
    for item in soup.find_all('div', attrs = {'id' : 'accessLinks'}) :
        # then loop through the links in that div and...
        for link in item.find_all('a') :
            # ...get the cat_id from each link and add to list
            dir_address = 'https:' + link.get('href')
            cat_id = dir_address.split('2027/')[-1]
            cat_id_list.append(cat_id)

            # follow each link and grab the 'last page' url
            dir_html = urllib.request.urlopen(dir_address)
            f = dir_html.read()
            newsoup = BeautifulSoup(f, 'lxml')
            total_pages = str(newsoup.find(id='action-go-last'))
            total_pages_parsed = total_pages.split('seq=')[-1]
            total_pages_parsed = total_pages_parsed.split(';')[0].split('"')[0]
            total_pages_list.append(total_pages_parsed)

    print(' Title: ' + title)
    print(' Holdings: ' + str(i))
In this case, the script scrapes the pages of two record sets, finding five directories associated with the first record set and one with the second.
By this point, the script has collected all of the bibliographic data and stored it in lists. But before adding that data to the local data frame, I first need to create a unique source id for each directory. Down the road, I'll use the catalog as a dictionary to look up the publication year associated with each address entry, and that unique source number will serve as the primary key:
# set up numbering for new additions to dataframe
last_source_id = max(df_cat.source_id)

# count the number of new additions
num_additions = len(cat_id_list)

i = 1
while i <= num_additions :
    source_id_list.append(last_source_id + i)
    i = i + 1
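That source id is what will eventually let me treat the catalog as a lookup table. Here's a minimal sketch of the kind of lookup I have in mind once the new rows are in the catalog (the dictionary name is mine, just for illustration):
# map each source_id to its publication year; an address entry tagged with a
# source_id can then be assigned a pub_year with a single dictionary lookup
year_lookup = df_cat.set_index('source_id')['pub_year'].to_dict()
# e.g., year_lookup[some_source_id] returns that directory's publication year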
Now that all of the data are collected, I'll convert the lists into pandas Series, specifying the data type:
source_id_pds = pd.Series(source_id_list, dtype=int)
record_id_pds = pd.Series(record_id_list, dtype=str)
cat_id_pds = pd.Series(cat_id_list, dtype=str)
title_pds = pd.Series(title_list, dtype=str)
pub_year_pds = pd.Series(pub_year_list, dtype=int)
total_pages_pds = pd.Series(total_pages_list, dtype=int)
repository_pds = pd.Series(repository_list, dtype=str)
# create empty series of correct length for scraped, cleaned, and imported
scraped_pds = cleaned_pds = imported_pds = pd.Series([0] * num_additions, dtype = int)
And now I'll combine the pandas Series into a single data frame using concat.
# create a temporary dataframe from the lists
df_temp = pd.concat([source_id_pds, record_id_pds, cat_id_pds, title_pds, pub_year_pds,
total_pages_pds, repository_pds, scraped_pds, cleaned_pds, imported_pds], axis = 1)
# name the columns
col_names = ['source_id', 'record_id', 'cat_id', 'title', 'pub_year','total_pages',
'repository', 'scraped', 'cleaned', 'imported']
df_temp.columns = col_names
Finally, I'll append the new data to the existing data:
df_cat = df_cat.append(df_temp)
df_cat = df_cat[col_names]
display(HTML(df_cat.to_html(index=False)))
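One last bit of housekeeping: because the script loads the catalog from a pickle at the start, the expanded data frame has to be written back to disk before the next session. A minimal sketch, reusing the demo pickle path from earlier in this post:
# persist the updated catalog so the next run starts from the expanded version
df_cat.to_pickle('../blogdata/catalog_scrape.p')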
Now that my local catalog is up and running, I can add new record sets to the project as I locate them, and as new ones arrive in Hathitrust. In the meantime, I can begin scraping, cleaning, and importing the newly added directories.