Building a Local Catalog

Posted on Wed 15 June 2016 in posts

Finding New York City directories in Hathitrust is a little like herding cattle: you find the first bunch quickly, and then spend the rest of your time rounding up strays. Instead of a long list of individual directories, searching Hathitrust yields a series of record sets into which the individual directories have been organized. A record set often holds multiple directories, but sometimes contains only one, and a single directory series may appear across multiple record sets. It's at this point that the search process begins to feel like a long, lonely cattle drive.

So before I could proceed, I first had to create my own mini-catalog: a little slice of Hathitrust, stored in a local data frame and dedicated to this project. My local catalog helps me keep track of what I've found so I don't waste time rounding up the same directories over and over again. I also use my local catalog to store bibliographic data associated with each directory--things like title, originating repository, and publication year. And as I convert the text pages stored on Hathitrust into a multi-year data frame of address entries, my local catalog helps me keep on top of where I am in the scraping, cleaning, and importing process.

Here's an example of what a record set looks like in Hathitrust.

And here's a snippet of my current local catalog (I've omitted some columns for readability):

In [1]:
import pandas as pd
import sys
import numpy as np
import urllib.request
import re
from bs4 import BeautifulSoup
from IPython.display import display, HTML

snippet = pd.read_pickle('../blogdata/catalog_snippet.p')
display(HTML(snippet.to_html(index=False)))
source_id  record_id  cat_id              title                        pub_year  total_pages  repository              scraped  cleaned  imported
11         100734456  chi.21539051        The New York City directory  1849      524          University of Chicago   524      524      524
12         100734456  chi.20865276        The New York City directory  1848      524          University of Chicago   524      524      523
13         100734456  chi.102186659       The New York City directory  1846      540          University of Chicago   540      540      540
69         000053989  mdp.39015036700683  New York city directory      1870      1680         University of Michigan  1680     1680     1680
70         000053989  mdp.39015050646465  New York city directory      1894      1700         University of Michigan  0        0        0
71         000053989  mdp.39015050646473  New York city directory      1896      1750         University of Michigan  0        0        0
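The last three columns (scraped, cleaned, and imported) are the bookkeeping counters I mentioned above. Just as an illustration of how they get used (this isn't part of the scraping script below), marking the 1849 directory as fully scraped amounts to updating its row, something like:

# illustration only: record that all 524 pages of the 1849 directory
# (cat_id 'chi.21539051') have now been scraped
snippet.loc[snippet.cat_id == 'chi.21539051', 'scraped'] = 524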

I've already gathered the id numbers for nineteen record sets at Hathitrust, but there's every reason to think I'll find more in the future, so I wanted a script that could mine Hathitrust, store the data in a local data frame, and add new data whenever I came across a record set that I hadn't seen before.

To demonstrate that process here, I've used the script to create a small local catalog that already contains data scraped from Hathitrust: five directories from the 1840s, all associated with record id number "100734456".

In [2]:
df_cat = pd.read_pickle('../blogdata/catalog_scrape.p')
display(HTML(df_cat.to_html(index=False)))
source_id  record_id  cat_id         title                        pub_year  total_pages  repository             scraped  cleaned  imported
1          100734456  chi.21539051   The New York City directory  1849      524          University of Chicago  524      524      524
2          100734456  chi.20865276   The New York City directory  1848      524          University of Chicago  524      524      523
3          100734456  chi.102186659  The New York City directory  1846      540          University of Chicago  540      540      540
4          100734456  chi.102186641  The New York City directory  1845      478          University of Chicago  478      478      477
5          100734456  chi.102186625  The New York City directory  1843      458          University of Chicago  458      458      458

Now I'd like to add more directories to my local catalog, so I'll re-run the script, this time using three record numbers that I've found while searching for New York City directories in Hathitrust:

In [3]:
# create list of Hathitrust record ids for directory holdings
ht_dirs = ['100734456', '008925998', '012507789']

Before scraping, I check the list against the existing data frame in case I've scraped some of these record sets already. I've also added a few lines of code to anticipate the possibility that a user will run the script with an empty record list.

In [4]:
# return any record numbers not already in the local catalog
new_ht_dirs = np.setdiff1d(ht_dirs, df_cat.record_id)
print(new_ht_dirs)
# return any record numbers that were already in the local catalog
dup_ht_dirs = np.setdiff1d(ht_dirs, new_ht_dirs)
print(dup_ht_dirs)
# if the user ran the script with an empty record list, return an error message and exit.
if not ht_dirs :
    sys.exit('Please add record numbers to the list and rerun this script.')
# if any record ids already exist in the local catalog, let the user know which ones.
if dup_ht_dirs.size :
    print('\nAlert: The following records are already in the catalog data frame:\n')
    [print('  ' + rec_id) for rec_id in dup_ht_dirs]
# if there are new records to be added, let the user know which ones.
if new_ht_dirs.size :
    print('\nThe following records are new and will be added to the catalog data frame:\n')
    [print('  ' + rec_id) for rec_id in new_ht_dirs]
# if all the records already exist in the local catalog, return an error message and exit.
else :
    sys.exit('No new records to add.')       
['008925998' '012507789']
['100734456']

Alert: The following records are already in the catalog data frame:

  100734456

The following records are new and will be added to the catalog data frame:

  008925998
  012507789

In this case, one of the three records in the list is already in the local data frame. This information is reported to the user and the record is eliminated from the list.

Now that the script has verified the record numbers that will be processed, it can use those record numbers to access the relevant pages on Hathitrust. The stable URL for each set of records consists of the record number appended to a base URL, which I store here in a variable for use in the loop:

In [5]:
# base url for directory holdings pages
ht_url = 'https://catalog.hathitrust.org/Record/'
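Just to make the pattern concrete, here's the full URL that the loop below will build for the first of the new record numbers (a quick sanity check, not part of the scraping itself):

# e.g. https://catalog.hathitrust.org/Record/008925998
print(ht_url + new_ht_dirs[0])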

Next, I create a series of lists to store the data that I'll scrape from Hathitrust during the loop: the title, publication year, last page number, and catalog number of each directory in the record set, as well as the name of the repository that donated the directory to Hathitrust:

In [6]:
# create lists to store data from loop
source_id_list = []
record_id_list = []
cat_id_list = []
title_list = []
pub_year_list = []
total_pages_list = []
repository_list = []

The script now loops through the remaining items on the list, scraping information from Hathitrust and storing it in the lists I just created. Most of the data can be scraped from the main record page, but the last page number of each directory can only be obtained by following the link to each directory's page, so that part of the script gets a tiny bit tricky:

In [7]:
# Loop through holdings pages, scraping info about directory catalog id, title,
## year of publication, repository name, and last page (total_pages)
for rec_id in new_ht_dirs :
    
    i = 0
    
    # create full url
    myURL = ht_url + rec_id
    print('Scraping ' + myURL)
    
    # retrieve that html and store as string variable
    cat_html = urllib.request.urlopen(myURL).read()
    
    # convert html to soup
    soup = BeautifulSoup(cat_html, 'lxml')
    
    # in main html, loop through 'span/IndItem' to get list of pub_years
    for item in soup.find_all('span', attrs={'class' : 'IndItem'}) :
        year_match = re.findall(r'\d{4}', item.text)
        if year_match :
            pub_year = year_match[0]
        else :
            pub_year = 0
        pub_year_list.append(pub_year)
        i = i + 1
    
    # retrieve title from main html
    title = soup.title.text.split(': ')[-1].split(' |')[0]
    if not title :
        title = 'untitled'
    title_list.extend([title] * i)
    
    # create record_id_list
    record_id_list.extend([rec_id] * i)
    
    
    # ...loop through 'em/original_from' to get list of repositories
    for item in soup.find_all('em', attrs={'class' : 'original_from'}):
        # take the name inside '(original from ...)', or fall back to 'unknown'
        repo_parts = item.text.split('(original from ')
        if len(repo_parts) > 1 :
            repository = repo_parts[-1].split(')')[0]
        else :
            repository = 'unknown'
        repository_list.append(repository)
    
    # ...loop through 'div/accessLinks' to get link to each directory
    for item in soup.find_all('div', attrs = {'id' : 'accessLinks'}) :
        #then loop through the links in that div and...
        for link in item.find_all('a'):
            # ...get the cat_id from each link and add to list
            dir_address = 'https:' + link.get('href')
            cat_id = dir_address.split('2027/')[-1]
            cat_id_list.append(cat_id)
            
            # follow each link and grab the 'last page' url
            dir_html = urllib.request.urlopen(dir_address)
            f = dir_html.read()
            newsoup = BeautifulSoup(f, 'lxml')
            total_pages = str(newsoup.find(id='action-go-last'))
            total_pages_parsed = total_pages.split('seq=')[-1]
            total_pages_parsed = total_pages_parsed.split(';')[0].split('"')[0]
            total_pages_list.append(total_pages_parsed)
            
            
    print('  Title: ' + title)
    print('  Holdings:  ' + str(i))
Scraping https://catalog.hathitrust.org/Record/008925998
  Title: Longworth's American almanac, New-York...
  Holdings:  5
Scraping https://catalog.hathitrust.org/Record/012507789
  Title: The directory of the city of New York, 
  Holdings:  1

In this case, the script scrapes the pages of two record sets, finding five directories associated with the first record set and one with the second.

By this point, the script has collected all of the bibliographic data and stored it in lists. But before adding that data to the local data frame, I first need to create a unique source id for each directory. Down the road, I'll use the catalog as a dictionary to look up the publication year associated with each address entry, so I want a unique source number for each directory to serve as a primary key:

In [8]:
# set up numbering for new additions to dataframe
last_source_id = max(df_cat.source_id)

# count the number of new additions
num_additions = len(cat_id_list)

i = 1

while i <= num_additions :
    source_id_list.append(last_source_id + i)
    i  = i + 1
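As an aside, here's roughly what that primary-key lookup might look like, using the directories already in df_cat; this is just a sketch of the idea, not part of the script:

# sketch only: map each source_id to its publication year, then use the
# mapping to tag a hypothetical address entry that carries source_id 3
year_lookup = dict(zip(df_cat.source_id, df_cat.pub_year))
print(year_lookup[3])   # 1846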

Now that all of the data are collected, I'll convert the lists into pandas Series, specifying the data type:

In [9]:
source_id_pds = pd.Series(source_id_list, dtype=int)
record_id_pds = pd.Series(record_id_list, dtype=str)
cat_id_pds = pd.Series(cat_id_list, dtype=str)
title_pds = pd.Series(title_list, dtype=str)
pub_year_pds = pd.Series(pub_year_list, dtype=int)
total_pages_pds = pd.Series(total_pages_list, dtype=int)
repository_pds = pd.Series(repository_list, dtype=str)

# create zero-filled series of the correct length for scraped, cleaned, and imported
scraped_pds = cleaned_pds = imported_pds = pd.Series([0] * num_additions, dtype = int)

And now I'll combine the pandas Series into a single data frame using concat.

In [10]:
# create a temporary dataframe from the series
df_temp = pd.concat([source_id_pds, record_id_pds, cat_id_pds, title_pds, pub_year_pds, 
                     total_pages_pds, repository_pds, scraped_pds, cleaned_pds, imported_pds], axis = 1)
# name the columns
col_names = ['source_id', 'record_id', 'cat_id', 'title', 'pub_year','total_pages', 
           'repository', 'scraped', 'cleaned', 'imported']

df_temp.columns = col_names

Finally, I'll append the new data to the existing data:

In [11]:
df_cat = df_cat.append(df_temp)
df_cat = df_cat[col_names]
display(HTML(df_cat.to_html(index=False)))
source_id  record_id  cat_id              title                                      pub_year  total_pages  repository             scraped  cleaned  imported
1          100734456  chi.21539051        The New York City directory                1849      524          University of Chicago  524      524      524
2          100734456  chi.20865276        The New York City directory                1848      524          University of Chicago  524      524      523
3          100734456  chi.102186659       The New York City directory                1846      540          University of Chicago  540      540      540
4          100734456  chi.102186641       The New York City directory                1845      478          University of Chicago  478      478      477
5          100734456  chi.102186625       The New York City directory                1843      458          University of Chicago  458      458      458
6          008925998  njp.32101066152420  Longworth's American almanac, New-York...  1827      564          Princeton University   0        0        0
7          008925998  njp.32101066152404  Longworth's American almanac, New-York...  1834      790          Princeton University   0        0        0
8          008925998  njp.32101080456856  Longworth's American almanac, New-York...  1837      918          Princeton University   0        0        0
9          008925998  njp.32101080456872  Longworth's American almanac, New-York...  1842      766          Princeton University   0        0        0
10         008925998  njp.32101066152396  Longworth's American almanac, New-York...  1908      356          Princeton University   0        0        0
11         012507789  chi.27275303        The directory of the city of New York,     1852      766          University of Chicago  0        0        0

Now that my local catalog is up and running, I can add new record sets to the project as I locate them, and as new ones arrive in Hathitrust. In the meantime, I can begin scraping, cleaning, and importing the newly added directories.
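One last housekeeping step: since the whole point of the catalog is that it persists between sessions, I finish a run like this by writing the updated data frame back to disk, along these lines (assuming I keep using the same pickle file the script loads at the top):

# save the updated catalog so the next run of the script starts from it
df_cat.to_pickle('../blogdata/catalog_scrape.p')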