Analyzing my Zotero library

I have spent a long time in school and have accumulated a lot of references as a result. I keep them fairly organized in Zotero, which lets you export your library as a .csv file, so it wasn’t hard to read them into R and do some basic analysis and visualization. (You can grab the file here. If you’d like an up-to-date-ish .bib file, you can have that as well.) I’ve included some code below to show my approach to these data.

First things first, necessary packages:

library(foreign)
library(ggplot2)
library(ggthemes)
library(dplyr)
library(stringr)
library(tidyr)
library(tm)
library(wordcloud)
library(SnowballC)

Now we can load the .csv we exported from Zotero; it has a ton of variables that are mostly empty or not necessary, so the next line of code focuses only on the columns in the file that are of interest.

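# Keep text columns as character (already the default in R >= 4.0) so the string functions below behave
bib <- read.csv("bib.csv", stringsAsFactors = FALSE) %>% 
  select(Item.Type, Publication.Title, Publication.Year, Author, Title, Publisher)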

Let’s take a look at when things were published, using a basic bar chart that counts the number of library entries in each year. I use ggplot and add a bit of code to customize the look of the figure. I won’t present the code for producing subsequent figures, as they mostly reproduce this chunk.

ggplot(data = bib, aes(x = Publication.Year)) +
 geom_bar() +
 labs(x = "", y = "") +
 ggtitle("Number of Articles or Books published per Year") + 
 theme_tufte(base_family="Fira Sans Light") +
 theme(strip.background = element_blank(), 
   strip.text = element_blank(),
   axis.ticks = element_blank())

That distribution isn’t too surprising: I study political science, a pretty modern discipline, and so it makes sense that the bulk of my reference library would have been published within the last 35 years or so.

Next thing we can look at: what is this library made up of? Zotero records the kind of document in the Item.Type column, so we can easily create a summary dataframe that counts the occurrences of different kinds of documents and then plot the results.

type <- count(bib, Item.Type)
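
As a rough sketch of the plotting step (not the exact code behind the figure), the counts can go straight into a horizontal bar chart. The toy data frame and the object names toy, toy.type and p.type are made up for illustration; with the real library you would pass type instead.

```r
library(ggplot2)
library(dplyr)

# Made-up stand-in for the real library; only Item.Type matters here
toy <- data.frame(
  Item.Type = c("journalArticle", "journalArticle", "book",
                "bookSection", "journalArticle", "book"),
  stringsAsFactors = FALSE
)

# One row per item type, with the count in column n
toy.type <- count(toy, Item.Type)

# Horizontal bars ordered by count, so long type names stay readable
p.type <- ggplot(toy.type, aes(x = reorder(Item.Type, n), y = n)) +
  geom_col() +
  coord_flip() +
  labs(x = "", y = "")
p.type
```

Flipping the axes is a design choice rather than a requirement; it just keeps the type labels from overlapping.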

How about most cited authors? I’ll plot all of the authors in the library with 12 or more citations. This figure requires a little more data work, as the exported .csv file puts all authors in one column. The first step is to separate all coauthors out into their own columns:

# Split each author string on "; " into at most 10 coauthor columns (V1 to V10)
bib.a <- str_split_fixed(bib$Author, "; ", 10) %>% 
 as.data.frame(stringsAsFactors = FALSE)

Because we don’t care too much about whether anyone was a first, second, or tenth author, we gather all of these columns back into long format, and then separate first from last names:

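# Stack the 10 coauthor columns into a single name column
bib.a <- gather(bib.a, author.level, name, 1:10)
# Split "Last, First" into V1 (last name) and V2 (first name)
bib.a <- str_split_fixed(bib.a$name, ", ", 2) %>% 
 as.data.frame(stringsAsFactors = FALSE)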
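
To make the reshaping easier to follow, here is a small self-contained sketch of the same split, gather, split sequence on three made-up author strings. The names and the objects authors, wide, long and names.split are invented for illustration, and the strings assume the "Last, First; Last, First" layout that the separators above imply.

```r
library(stringr)
library(tidyr)

# Made-up author strings in the assumed "Last, First; Last, First" form
authors <- c("Doe, Jane; Roe, Richard",
             "Doe, Jane",
             "Poe, Ann; Doe, John; Roe, Richard")

# Step 1: one column per coauthor slot (3 slots are enough for the toy data)
wide <- as.data.frame(str_split_fixed(authors, "; ", 3),
                      stringsAsFactors = FALSE)

# Step 2: stack the slots into a single name column
long <- gather(wide, author.level, name, 1:3)

# Step 3: split "Last, First" into V1 (last name) and V2 (first name)
names.split <- as.data.frame(str_split_fixed(long$name, ", ", 2),
                             stringsAsFactors = FALSE)

# Unused slots come through as empty strings, which is why they are filtered out later
table(names.split$V1)
```

Three entries with three slots each give nine rows, three of which are empty padding.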

Our last step is to count the incidence of each author in the dataset, and then filter to focus only on those authors who have 12 or more citations:

top.authors <- count(bib.a, V1) %>%
 filter(n > 11) %>% 
 filter(V1 != "")

Lots of political economy, lots of international relations, some methodology. It is worth noting that this approach to summing authors doesn’t differentiate between scholars who share the same last name.
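
One possible way around the shared-surname problem, sketched here on made-up data, is to count on last and first name together. This is an assumption about how one might refine the approach, not what the figure uses, and it would still split one person whose first name is recorded inconsistently (say "J." in one entry and "Jane" in another). The objects toy.names, by.last and by.full are hypothetical.

```r
library(dplyr)

# Made-up split names: V1 = last name, V2 = first name
toy.names <- data.frame(
  V1 = c("Doe", "Doe", "Doe", "Roe", "Roe", ""),
  V2 = c("Jane", "Jane", "John", "Richard", "Richard", ""),
  stringsAsFactors = FALSE
)

# Counting on last name alone merges Jane Doe and John Doe
by.last <- count(toy.names, V1) %>%
  filter(V1 != "")

# Counting on both columns keeps them apart
by.full <- count(toy.names, V1, V2) %>%
  filter(V1 != "")

by.last
by.full
```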

We can also look at popular publications, first by journal for articles and then by press for books.

top.journals <- count(bib, Publication.Title) %>% 
 filter(n > 45) %>% 
 filter(Publication.Title != "")

top.presses <- count(bib, Publisher) %>% 
 filter(n > 9) %>% 
 filter(Publisher != "")

Let’s take a look at how journal counts vary over time for some of the most popular journals in my library:
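
A minimal sketch of how such a figure could be built, assuming one small panel per journal: keep only the entries from the most common journals, count them by journal and year, and facet the bars. The toy data, the cutoff of 2, and the object names toy, toy.top, journal.years and p.journals are made up for illustration.

```r
library(ggplot2)
library(dplyr)

# Made-up stand-in for the real library
toy <- data.frame(
  Publication.Title = c("Journal A", "Journal A", "Journal A",
                        "Journal B", "Journal B", "Journal C", ""),
  Publication.Year = c(2001, 2001, 2005, 2001, 2005, 2005, 1999),
  stringsAsFactors = FALSE
)

# Journals at or above a cutoff; 2 is arbitrary and only suits the toy data
toy.top <- count(toy, Publication.Title) %>%
  filter(n >= 2, Publication.Title != "")

# Entries per journal and year, restricted to those journals
journal.years <- toy %>%
  filter(Publication.Title %in% toy.top$Publication.Title) %>%
  count(Publication.Title, Publication.Year)

# One panel per journal, stacked vertically so the years line up
p.journals <- ggplot(journal.years, aes(x = Publication.Year, y = n)) +
  geom_col() +
  facet_wrap(~ Publication.Title, ncol = 1) +
  labs(x = "", y = "")
p.journals
```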

Finally, a word cloud of the most common words in titles:

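# Build a corpus from the titles, then clean it one step at a time;
# the pipe supplies the corpus as the first argument to each tm_map() call
pubcorpus <- Corpus(VectorSource(bib$Title)) %>%
 tm_map(content_transformer(tolower)) %>%
 tm_map(removePunctuation) %>%
 tm_map(removeWords, stopwords("english"))
# The tm_map(PlainTextDocument) step is dropped: wrapping tolower in
# content_transformer() already keeps the corpus structure intact
wordcloud(pubcorpus, max.words = 200, random.order = FALSE)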
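
If you would rather see the numbers behind the cloud, a term-document matrix gives the raw word counts. This is a sketch on three made-up titles, not part of the original analysis; the objects titles, toy.corpus, tdm and freq are hypothetical. Note that TermDocumentMatrix() by default ignores words shorter than three characters.

```r
library(tm)

# Made-up titles standing in for bib$Title
titles <- c("The Politics of Trade",
            "Trade and War",
            "War, Peace, and Trade")

# Same cleaning steps as for the word cloud
toy.corpus <- Corpus(VectorSource(titles))
toy.corpus <- tm_map(toy.corpus, content_transformer(tolower))
toy.corpus <- tm_map(toy.corpus, removePunctuation)
toy.corpus <- tm_map(toy.corpus, removeWords, stopwords("english"))

# Rows are terms, columns are titles; row sums give total counts per term
tdm <- TermDocumentMatrix(toy.corpus)
freq <- sort(rowSums(as.matrix(tdm)), decreasing = TRUE)
freq
```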