Building a Local Catalog
Posted on Wed 15 June 2016 in posts
Finding New York City directories in Hathitrust is a little like herding cattle: you find the first bunch quickly, and then spend the rest of your time rounding up strays. Instead of a long list of individual directories, searching Hathitrust yields a series of record sets into which the individual directories have been organized. A record set often holds multiple directories, but sometimes contains only one, and a single directory series may appear across multiple record sets. It's at this point that the search process begins to feel like a long, lonely cattle drive.
So before I could proceed, I first had to create my own mini-catalog: a little slice of Hathitrust, stored in a local data frame and dedicated to this project. My local catalog helps me keep track of what I've found so I don't waste time rounding up the same directories over and over again. I also use my local catalog to store bibliographic data associated with each directory--things like title, originating repository, and publication year. And as I convert the text pages stored on Hathitrust into a multi-year data frame of address entries, my local catalog helps me keep on top of where I am in the scraping, cleaning, and importing process.
Here's an example of what a record set looks like in Hathitrust.
And here's a snippet of my current local catalog (I've omitted some columns for readability):
import pandas as pd
import sys
import numpy as np
import urllib.request
import re
from bs4 import BeautifulSoup
from IPython.display import display, HTML
snippet = pd.read_pickle('../blogdata/catalog_snippet.p')
display(HTML(snippet.to_html(index=False)))
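If you were building a catalog like this from scratch, the bare bones would just be an empty data frame with the right columns. Here's a minimal sketch (I load mine from a pickle instead; the column names match the ones I use later in this post):
import pandas as pd
# an empty catalog with the columns my project tracks: bibliographic data plus
# workflow flags (scraped, cleaned, imported) -- just a sketch, not how I load mine
df_empty_cat = pd.DataFrame(columns=['source_id', 'record_id', 'cat_id', 'title',
                                     'pub_year', 'total_pages', 'repository',
                                     'scraped', 'cleaned', 'imported'])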
I've already gathered the id numbers for nineteen record sets at Hathitrust, but there's every reason to think I'll find more in the future, so I wanted a script that could mine Hathitrust, store the data in a local data frame, and add new data whenever I come across a record set I haven't seen before.
To demonstrate that process here, I've used the script to create a small local catalog that already contains data scraped from Hathitrust: five directories from the 1840s, all associated with record id number "100734456".
df_cat = pd.read_pickle('../blogdata/catalog_scrape.p')
display(HTML(df_cat.to_html(index=False)))
Now I'd like to add more directories to my local catalog, so I'll re-run the script, this time using three record numbers that I've found while searching for New York City directories in Hathitrust:
#Create list of Hathitrust page ids for directory holdings
ht_dirs = ['100734456', '008925998','012507789']
Before scraping, I check the list against the existing data frame in case I've scraped some of these record sets already. I've also added a few lines of code to anticipate the possibility that a user will run the script with an empty record list.
# return any record numbers not already in the local catalog
new_ht_dirs = np.setdiff1d(ht_dirs, df_cat.record_id)
print(new_ht_dirs)

# return any record numbers that were already in the local catalog
dup_ht_dirs = np.setdiff1d(ht_dirs, new_ht_dirs)
print(dup_ht_dirs)

# if the user ran the script with an empty record list, return an error message and exit.
if not ht_dirs :
    sys.exit('Please add record numbers to the list and rerun this script.')

# if any record ids already exist in the local catalog, let the user know which ones.
if dup_ht_dirs.size :
    print('\nAlert: The following records are already in the catalog data frame:\n')
    for rec_id in dup_ht_dirs :
        print(' ' + rec_id)

# if there are new records to be added, let the user know which ones.
if new_ht_dirs.size :
    print('\nThe following records are new and will be added to the catalog data frame:\n')
    for rec_id in new_ht_dirs :
        print(' ' + rec_id)
# if all the records already exist in the local catalog, return an error message and exit.
else :
    sys.exit('No new records to add.')
In this case, one of the three records in the list is already in the local data frame. This information is reported to the user, and only the new record numbers are carried forward.
Now that the script has verified the record numbers that will be processed, it can use those record numbers to access the relevant pages on Hathitrust. The stable URL for each set of records consists of the record number appended to a base URL, which I store here in a variable for use in the loop:
# base url for directory holdings pages
ht_url = 'https://catalog.hathitrust.org/Record/'
Next, I create a series of lists to store the data that I'll scrape from Hathitrust during the loop: the title, publication year, last page number, and catalog number of each directory in the record set, as well as the name of the repository that donated the directory to Hathitrust. I also set up lists for the record id and for the source id I'll assign later:
# create lists to store data from loop
source_id_list = []
record_id_list = []
cat_id_list = []
title_list = []
pub_year_list = []
total_pages_list = []
repository_list = []
The script now loops through the remaining items on the list, scraping information from Hathitrust and storing it in the lists I just created. Most of the data can be scraped from the main record page, but the last page number of each directory can only be obtained by following the link to each directory's page, so that part of the script gets a tiny bit tricky:
# Loop through holdings pages, scraping info about directory catalog id, title,
## year of publication, repository name, and last page (total_pages)
for rec_id in new_ht_dirs :
    i = 0

    # create full url
    myURL = ht_url + rec_id
    print('Scraping ' + myURL)

    # retrieve that html and store as string variable
    cat_html = urllib.request.urlopen(myURL).read()

    # convert html to soup
    soup = BeautifulSoup(cat_html, 'lxml')

    # in main html, loop through 'span/IndItem' to get list of pub_years
    for item in soup.find_all('span', attrs={'class' : 'IndItem'}) :
        pub_year = re.findall(r'\d{4}', item.text)
        if not pub_year :
            pub_year = 0
        else :
            pub_year = pub_year[0]
        pub_year_list.append(pub_year)
        i = i + 1

    # retrieve title from main html
    title = soup.title.text.split(': ')[-1].split(' |')[0]
    if not title :
        title = 'untitled'
    title_list.extend([title] * i)

    # create record_id_list
    record_id_list.extend([rec_id] * i)

    # ...loop through 'em/original_from' to get list of repositories
    for item in soup.find_all('em', attrs={'class' : 'original_from'}) :
        repository = item.text.split('(original from ')[-1].split(')')
        if not repository :
            repository = 'unknown'
        else :
            repository = repository[0]
        repository_list.append(repository)

    # ...loop through 'div/accessLinks' to get link to each directory
    for item in soup.find_all('div', attrs = {'id' : 'accessLinks'}) :
        # then loop through the links in that div and...
        for link in item.find_all('a') :
            # ...get the cat_id from each link and add to list
            dir_address = 'https:' + link.get('href')
            cat_id = dir_address.split('2027/')[-1]
            cat_id_list.append(cat_id)

            # follow each link and grab the 'last page' url
            dir_html = urllib.request.urlopen(dir_address)
            f = dir_html.read()
            newsoup = BeautifulSoup(f, 'lxml')
            total_pages = str(newsoup.find(id='action-go-last'))
            total_pages_parsed = total_pages.split('seq=')[-1]
            total_pages_parsed = total_pages_parsed.split(';')[0].split('"')[0]
            total_pages_list.append(total_pages_parsed)

    print(' Title: ' + title)
    print(' Holdings: ' + str(i))
In this case, the script scrapes the pages of two record sets, finding five directories associated with the first record set and one with the second.
By this point, the script has collected all of the bibliographic data and stored it in lists. But before adding that data to the local data frame, I first need to create a unique source id for each directory. Down the road, I'll use the catalog as a dictionary to look up the publication year associated with each address entry, and that unique source number will serve as the primary key:
# set up numbering for new additions to dataframe
last_source_id = max(df_cat.source_id)

# count the number of new additions
num_additions = len(cat_id_list)

i = 1
while i <= num_additions :
    source_id_list.append(last_source_id + i)
    i = i + 1
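That source id is what will eventually let me treat the catalog as a lookup table. Here's a minimal sketch of the kind of lookup I have in mind once the new rows are in the catalog (the dictionary name is mine, just for illustration):
# map each source_id to its publication year; an address entry tagged with a
# source_id can then be assigned a pub_year with a single dictionary lookup
year_lookup = df_cat.set_index('source_id')['pub_year'].to_dict()
# e.g., year_lookup[some_source_id] returns that directory's publication year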
Now that all of the data are collected, I'll convert the lists into pandas Series, specifying the data type:
source_id_pds = pd.Series(source_id_list, dtype=int)
record_id_pds = pd.Series(record_id_list, dtype=str)
cat_id_pds = pd.Series(cat_id_list, dtype=str)
title_pds = pd.Series(title_list, dtype=str)
pub_year_pds = pd.Series(pub_year_list, dtype=int)
total_pages_pds = pd.Series(total_pages_list, dtype=int)
repository_pds = pd.Series(repository_list, dtype=str)
# create empty series of correct length for scraped, cleaned, and imported
scraped_pds = cleaned_pds = imported_pds = pd.Series([0] * num_additions, dtype = int)
And now I'll combine the pandas Series into a single data frame using concat.
# create a temporary dataframe from the lists
df_temp = pd.concat([source_id_pds, record_id_pds, cat_id_pds, title_pds, pub_year_pds,
total_pages_pds, repository_pds, scraped_pds, cleaned_pds, imported_pds], axis = 1)
# name the columns
col_names = ['source_id', 'record_id', 'cat_id', 'title', 'pub_year','total_pages',
'repository', 'scraped', 'cleaned', 'imported']
df_temp.columns = col_names
Finally, I'll append the new data to the existing data:
df_cat = df_cat.append(df_temp)
df_cat = df_cat[col_names]
display(HTML(df_cat.to_html(index=False)))
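One last bit of housekeeping: because the script loads the catalog from a pickle at the start, the expanded data frame has to be written back to disk before the next session. A minimal sketch, reusing the demo pickle path from earlier in this post:
# persist the updated catalog so the next run starts from the expanded version
df_cat.to_pickle('../blogdata/catalog_scrape.p')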
Now that my local catalog is up and running, I can add new record sets to the project as I locate them, and as new ones arrive in Hathitrust. In the meantime, I can begin scraping, cleaning, and importing the newly added directories.