Ontario Salaries, 2016

As part of Ontario’s Open Data Initiative, the province releases a salary disclosure dataset every year. The dataset includes information about all public sector employees who make over $100k, and is fully searchable through the web page above. They also have prominent links to a .csv download; I figure they want people to explore it a little bit themselves. So I did!

First things first, import the data:

raw <- read.csv(url("https://files.ontario.ca/pssd/en-2016-pssd-compendium-20170401-utf8.csv"))

In previous mini-projects I’ve left some chart customization code in each of my ggplot chunks, but this time around I’m going to put some of the most repetitive stuff in a theme function upfront:

theme_ <- function(...) {
  theme_bw() +
    theme(
      strip.background = element_blank(), 
      strip.text = element_blank(),
      axis.ticks = element_blank(),
      axis.text.x = element_blank()
    )
}

Once the dataset is imported, it requires a little bit of manual cleaning just to make things easier down the road. I usually put all of these basic cleaning steps into one piped chunk. Each component is pretty easy to understand but basically the code below does the following:

  • Remove all of the rows where the individual in question is ‘Seconded’ to a particular Employer
  • Recode a bunch of really long strings to simpler strings (particularly within the Sector variable)
  • Create a Full name variable by combining the first and last names
  • Isolates the english part of the Job.Title variable
  • Cleans Salary.Paid and Taxable.Benefis so that they can be read as numbers
  • Removes some unnecessary variables

In general, what this does is it takes the raw data and cleans it up into a salaries dataframe, which we use for almost all subsequent work.

library(plyr) # For the mapvalues command
library(dplyr)
library(tidyr)

salaries <- raw %>%
  filter(!grepl("Seconded", Sector)) %>%
  mutate(Sector =  mapvalues(Sector, 
                             from = c("Government of Ontario - Legislative Assembly and Offices",
                                      "Government of Ontario - Judiciary",
                                      "Government of Ontario - Ministries",
                                      "Hospitals and Boards of Public Health"),
                             to = c("Legislative Assembly",
                                    "Judiciary",
                                    "Ministries",
                                    "Hospitals / Public Health Boards"))) %>%
  mutate(Full = paste(First.Name, Last.Name, " ")) %>%
  separate(Job.Title, c("Job.Title", "Other"), sep = "/") %>%
  mutate(Salary.Paid = sub("\\$","", Salary.Paid)) %>%
  mutate(Salary.Paid = as.numeric(sub("\\,","", Salary.Paid))) %>%
  mutate(Taxable.Benefits = sub("\\$","", Taxable.Benefits)) %>%
  mutate(Taxable.Benefits = as.numeric(sub("\\,","", Taxable.Benefits))) %>%
  select(-Calendar.Year, -Other)

We now have a clean dataset where every row is an Ontario public sector employee and every column is a different characteristic of that employee. Of particular interest, we know what sector employees work in, who there employer is, and how much they make on an annual basis. Let’s take a look at the data. One of the cool things about working with tidyverse packages is that your data manipulation can be piped directly into visualizations; no need for lots of small summary tables or dataframes.

First, how many employees do we have working in each Sector?

library(ggplot2)

salaries %>%
  count(Sector) %>%
  ggplot(aes(reorder(Sector, n), n, label = n)) +
  geom_text() +
  labs(x = "", y = "Count of Employees making +100k by Sector") +
  coord_flip() + 
  theme_()

How many unique employers do we have per Sector? This one is a little trickier; you basically have to use a couple of count() commands in a row to aggregate up. This seems harder to me than it should be, I must be missing something.

salaries %>%
  count(Sector, Employer) %>%
  count(Sector, Employer) %>%
  count(Sector) %>%
  ggplot(aes(reorder(Sector, n), n, label = n)) +
  geom_text() +
  labs(x = "", y = "Count of Unique Employers within each Sector") +
  coord_flip() +
  theme_()

What is the median salary per Sector? We use the scales package to display the numbers in the plot as dollars. The other thing a little different about the code below is that sometimes the geom_text() doesn’t render well within the confines of the plot. set the limits of the y axis manually to adjust the plot size so that everything fits appropriately.

library(scales)

salaries %>%
  group_by(Sector) %>%
  summarize(median = median(Salary.Paid, na.rm = T)) %>%
  ggplot(aes(reorder(Sector, median), median, label = dollar(median))) +
  geom_text() +
  scale_y_continuous(limits = c(100000, 155000)) +
  labs(x = "", y = "Median Salary, 2016 ($ CAD)") +
  coord_flip() + 
  theme_()

I’m interested in universities in particular. What is the median salary at each university in Ontario?

salaries %>%
  filter(Sector == "Universities") %>%
  group_by(Employer) %>%
  summarize(median = median(Salary.Paid, na.rm = T)) %>%
  ggplot(aes(reorder(Employer, median), median, label = dollar(median))) +
  geom_text() +
  scale_y_continuous(limits = c(100000, 155000)) +
  labs(x = "", y = "Median Salary, 2016 ($ CAD)") +
  coord_flip() + 
  theme_()

First big surprise of the analysis: Mac up at the top! I wouldn’t have predicted that.

Let’s take one more look at the data at a Sector level. I’m guessing that the more employees an Employer has, the more in salaries they end up disbursing. Seems like a simple scatterplot might be the best way to verify this hypothesis. I already know we have a lot of distinct employers, though, so a basic scatterplot will be quickly overwhelmed: we’ll probably be able to see the trend, but the more interesting details might get washed out. In this case an interactive scatter is ideal: we still get to see the overall trend, but we can pick out individual points and learn more about them if we want to. We have to a little bit of data manipulation to build our scatterplot: first, sum up salaries for each Employer:

employers <- salaries %>%
  group_by(Sector, Employer) %>%
  summarise(sum = sum(Salary.Paid, na.rm = T))

And second, merge that with a count of employees per Employer:

totals <- salaries %>% 
  count(Employer) %>%
  left_join(., employers)

We can then use highcharter to create a nice interactive scatterplot:

library(highcharter)

hchart(totals, "scatter", hcaes(x = n, y = sum, group = Sector, text = Employer)) %>%
  hc_tooltip(useHTML = TRUE,
             headerFormat = "<table>",
             pointFormat = paste("<tr><th colspan=\"1\"><b>{point.label}</b></th></tr>",
                                 "<tr><th>Employer:</th><td>{point.text}</td></tr>",
                                 "<tr><th>Total Employees:</th><td>{point.x}</td></tr>",
                                 "<tr><th>Total Salaries:</th><td>{point.y}</td></tr>"),
             footerFormat = "</table>") %>%
  hc_chart(zoomType = "xy") %>% 
  hc_add_theme(hc_theme_google())

You can drag in the plot above to zoom. The strong linear relationship we predicted definitely holds, but it’s also interesting to see which employers deviate a little from that.

Now that we’ve gotten the ball rolling with interactive charts, another way we could look at the salaries above is broken down by Sector. Which Employers within each sector spend the most on salaries, and how do they relate to big spenders in other Sectors?

hchart(top.employers, "scatter", hcaes(x = Sector, y = sum, text = Employer)) %>%
  hc_tooltip(useHTML = TRUE,
             headerFormat = "<table>",
             pointFormat = paste("<tr><th colspan=\"1\"><b>{point.label}</b></th></tr>",
                                 "<tr><th>Employer:</th><td>{point.text}</td></tr>",
                                 "<tr><th>Total Salaries:</th><td>{point.y}</td></tr>"),
             footerFormat = "</table>") %>%
  hc_chart(zoomType = "xy", type = "bar") %>% 
  hc_add_theme(hc_theme_google())

It’s interesting to see that colleges are all pretty closely clustered, but there’s a lot more variation within universities… probably driven by bigger differences in the number of people they employ.

So far we’ve only looked at Sector and Employer level information, but the dataset is actually at the individual level. There are a lot of individuals – 123,941 not counting Seconded employees – so its hard to learn a lot at the individual level without doing a bit of prioritizing. So let’s look at top earners. The code below selects the top 100 earners within each sector:

top <- salaries %>%
  group_by(Sector) %>%
  arrange(-Salary.Paid) %>%
  slice(1:100)

Which we can plot using highcharter to give us a pretty interesting look at the best-paid public servants in Ontario:

hchart(top, "scatter", hcaes(x = Sector, y = Salary.Paid, text = Employer, w = Full, v = Job.Title)) %>%
  hc_tooltip(useHTML = TRUE,
             headerFormat = "<table>",
             pointFormat = paste("<tr><th colspan=\"1\"><b>{point.label}</b></th></tr>",
                                 "<tr><th>Name:</th><td>{point.w}</td></tr>",
                                 "<tr><th>Position:</th><td>{point.v}</td></tr>",
                                 "<tr><th>Employer:</th><td>{point.text}</td></tr>",
                                 "<tr><th>Salary:</th><td>{point.y}</td></tr>"),
             footerFormat = "</table>") %>%
  hc_chart(zoomType = "xy", type = "bar") %>% 
  hc_add_theme(hc_theme_google())