Fun with Bar Plots: Examining the Holdings
Posted on Sun 26 June 2016 in posts
Using BeautifulSoup, I wrote a script to identify and gather the holdings information for the New York City directory collection in HathiTrust. The script collected useful information about each directory, such as its title, publisher, publication year, and the total number of pages (really, files) in the digital version of the directory. With that information stored in my catalog data frame, I'm going to produce some plots to get a better sense of what I have. Specifically, I'd like to know how many directories I have for each year from 1800 to 1899, and how the collection breaks down by publisher and repository.
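For orientation, a stripped-down scraper of that kind might look something like the sketch below. The record URLs and CSS selectors here are placeholders for illustration only, not HathiTrust's actual page structure:
import pandas as pd
import requests
from bs4 import BeautifulSoup
# hypothetical record URLs -- placeholders, not real HathiTrust catalog pages
record_urls = [
    'https://example.org/record/001',
    'https://example.org/record/002',
]
rows = []
for url in record_urls:
    soup = BeautifulSoup(requests.get(url).text, 'html.parser')
    # placeholder CSS selectors; the real pages use different markup
    rows.append({
        'title': soup.select_one('.title').get_text(strip=True),
        'publisher': soup.select_one('.publisher').get_text(strip=True),
        'pub_year': int(soup.select_one('.pub-year').get_text(strip=True)),
        'repository': soup.select_one('.repository').get_text(strip=True),
        'num_pages': len(soup.select('.page-link')),
    })
df_cat = pd.DataFrame(rows)
df_cat.to_pickle('../blogdata/catalog.p')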
To get started, after importing my libraries, I'll load the data. Although I have a few directories from the 1700s and the 1900s, I'm only interested in the nineteenth century for the moment, so I'll limit the data frame to that era:
import pandas as pd
import matplotlib.pyplot as plt
from matplotlib.ticker import AutoMinorLocator
import numpy as np
from itertools import cycle, islice
# import the data
cols = ['pub_year', 'repository', 'publisher']
df_cat = pd.read_pickle('../blogdata/catalog.p')[cols]
# limit the data frame to the 19th century
min_year, max_year = 1800, 1900
era = (df_cat.pub_year >= min_year) & (df_cat.pub_year < max_year)
df_cat = df_cat[era]
Also, I'll set some formatting variables and lists that will help me generate nice-looking plots.
# Plot Settings #########################################################
# set variables for font size
xtick_lab_fs = 14
ytick_lab_fs = 18
ylab_fs = 20
title_fs = 24
# minor x-axis ticks: each plot gets its own AutoMinorLocator instance below,
# since a locator object can't be shared across multiple axes
minor_tick_div = 5
# set the colors for the bars in Figures 2 and 3
col_sch1_list = ['#285C77', '#53C4C6', '#CBC15B', '#A147C2', '#C95E64', '#917930', '#3DB86A']
col_scheme1 = list(islice(cycle(col_sch1_list), None, len(df_cat)))
col_sch2_list = ['#E15D5D', '#C28B47', '#C1A544', '#62AA5E', '#3EC5CE', '#9CA5DE', '#5C349D']
col_scheme2 = list(islice(cycle(col_sch2_list), None, len(df_cat)))
bar_width = .75
fig_size = (25, 10)
For my first plot, I'd like to get a general look at how many directories I have for each year in the nineteenth century. I know that I don't have a directory for every year, and I'd like the plot to display the gaps in my collection. To that end, I'll reindex the data to include every year from 1800 to 1899, so that any year for which I don't have a directory shows up as a gap in the bar plot.
# expand the index to include every year from 1800 to 1899
all_years = np.arange(min_year, max_year)
new_index = pd.Index(all_years)
df_year = df_cat['pub_year'].value_counts().reindex(new_index).sort_index()
With all that out of the way, I can produce the plot and get a better sense of what I have:
# set type, bar width, and figure size for the first plot
fig1 = df_year.plot(kind = 'bar', width = bar_width, figsize = fig_size)
# set the plot title, size, and placement
fig1.set_title('Number of Directories in the New York City Directory Project by Publication Year', fontsize = title_fs)
# set the x-axis ticks, minor ticks, and tick-labels
x_lab = np.arange(min_year, max_year, 5)
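# map each labeled year onto its 0-based bar position (bar plots place bars at positions 0..N-1, not at the year values)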
fig1.set_xticks(np.interp(x_lab, df_year.index, np.arange(df_year.size)))
fig1.set_xticklabels(x_lab, fontsize = xtick_lab_fs)
fig1.xaxis.set_minor_locator(AutoMinorLocator(minor_tick_div))
# set the y-axis ticks and label
fig1.set_ylabel('number of directories', fontsize = ylab_fs)
max_y = df_cat.pub_year.value_counts().max()
y_lab = np.arange(0, max_y + 1)
fig1.set_yticks(y_lab)
fig1.set_yticklabels(y_lab, fontsize = ytick_lab_fs)
# et voila
plt.show()
As the plot shows, there are some significant gaps, particularly in the first and last quarters of the nineteenth century. Someday I'll try to fill in those gaps, possibly using the excellent collection currently being digitized by the New York Public Library. For now, I'll take solace in the fact that I have an unbroken run of directories from 1832 to 1861. Not bad.
The plot also shows that for certain years I have more than one directory, but it doesn't tell me whether those directories are duplicate volumes or competing publications. Duplicates occur in HathiTrust when two or more repositories each owned and digitized a copy of the same directory--Trow's 1876 New York City Directory, for example. Strictly speaking, these digitized versions are not duplicates of one another: the originals--now well over a century old--are frequently missing pages or vary in other significant ways, and the quality of the scans themselves differs from copy to copy. Having duplicates could therefore prove very helpful for the analysis.
To determine how many duplicates might exist, I'll reproduce the bar plot, this time grouping the directories by publisher. As before, I'll reindex the data with the full range of years so gaps will show in the bar chart.
# set up the data to group by publisher
df_year_pub = df_cat.groupby(['pub_year', 'publisher']).pub_year.count().unstack('publisher')
df_year_pub = df_year_pub.reindex(new_index).sort_index()
# set type, bar width, and figure size for the second plot
fig2 = df_year_pub.plot(kind = 'bar', width = bar_width, stacked = True, figsize = fig_size, color = col_scheme1)
# set the plot title, size, and placement
fig2.set_title('Number of Directories per Year grouped by Publisher', fontsize=title_fs)
# set the x-axis ticks, minor ticks, and tick-labels
fig2.set_xticks(np.interp(x_lab, df_year.index, np.arange(df_year.size)))
fig2.set_xticklabels(x_lab, fontsize = xtick_lab_fs)
fig2.xaxis.set_minor_locator(AutoMinorLocator(minor_tick_div))
# set the y-axis ticks and label
fig2.set_ylabel('number of directories', fontsize=ylab_fs)
fig2.set_yticks(y_lab)
fig2.set_yticklabels(y_lab, fontsize=ytick_lab_fs)
plt.show()
As the bar plot shows, the collection contains a number of "duplicate" directories. For example, there are three copies of the Longworth directory for each year from 1834 to 1836, multiple copies of most of the Doggett directories, and four copies of the 1876 Trow directory. (As I've discussed elsewhere, some of these apparent duplicates may actually be a mix of business and residential directories; I'll explore that more later.)
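Exact counts are hard to read off a stacked bar chart, so a quick cross-check on the data frame is handy; something along these lines lists every year-publisher combination with more than one copy:
# count copies per (year, publisher) pair and keep the ones with more than one
copies = df_cat.groupby(['pub_year', 'publisher']).size()
print(copies[copies > 1])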
For fun, I'll produce one more chart, this time grouping the directories by the repository that contributed them to HathiTrust:
# set up the data to group by repository
df_year_rep = df_cat.groupby(['pub_year', 'repository']).pub_year.count().unstack('repository')
df_year_rep = df_year_rep.reindex(new_index).sort_index()
# set type, bar width, and figure size for the third plot
fig3 = df_year_rep.plot(kind = 'bar', width = bar_width, stacked = True, figsize = fig_size, color = col_scheme2)
# set the plot title, size, and placement
fig3.set_title('Number of Directories per Year grouped by Repository', fontsize=title_fs)
# set the x-axis ticks, minor ticks, and tick-labels
fig3.set_xticks(np.interp(x_lab, df_year.index, np.arange(df_year.size)))
fig3.set_xticklabels(x_lab, fontsize=xtick_lab_fs)
fig3.xaxis.set_minor_locator(AutoMinorLocator(minor_tick_div))
# set the y-axis ticks and label
fig3.set_ylabel('number of directories', fontsize=ylab_fs)
fig3.set_yticks(y_lab)
fig3.set_yticklabels(y_lab, fontsize=ytick_lab_fs)
plt.show()
No single repository has contributed more than one copy of a given edition (Princeton's two contributions for 1842 are separate publications by Doggett and Longworth). An unsurprising result, but good to know.
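That claim is easy to check directly: if no repository contributed more than one copy of a given edition, then no (year, publisher, repository) combination should appear more than once in the catalog. A check along these lines would confirm it (the printed maximum should be 1):
# the maximum count per (year, publisher, repository) combination should be 1
per_edition = df_cat.groupby(['pub_year', 'publisher', 'repository']).size()
print(per_edition.max())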
I have to make a few more tweaks to the data--identifying and excluding the business directories, for example. But after that, let the parsing begin...