Charting the Leafs’ record (2017-2018)

A quick bit of scraping and simple manipulation to pull in records-over-time for Atlantic Division teams, given we are pretty much halfway through the season.

All data courtesy of Hockey Reference.

First, identifying the right URLs to find the historical data:

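# Packages used throughout this post
library(rvest)        # read_html(), html_nodes(), html_text()
library(dplyr)        # %>%, bind_rows(), mutate(), group_by(), summarise()
library(highcharter)  # hchart() and the hc_*() helpers

base_url <- "https://www.hockey-reference.com/teams/"
# One schedule page per Atlantic Division team listed here
urls <- paste0(base_url,
  c("TOR", "TBL", "BOS", "MTL", "FLA", "DET", "BUF"),
  "/2018_games.html")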

Second, a basic scraping loop using lapply and rvest:

df_raw <- bind_rows(lapply(urls, function(x) {
  # Download each page once and reuse it for both columns
  page <- read_html(x)
  data.frame(
    url = x,
    wins = page %>%
      html_nodes(css = '.center+ td') %>%
      html_text(),
    losses = page %>%
      html_nodes(css = '.right:nth-child(11)') %>%
      html_text(),
    stringsAsFactors = FALSE)
}))
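
If one of the pages fails to load, the loop above stops with an error and nothing is returned. A minimal sketch of a more forgiving version is below. The helper name `fetch_all` and its `pause` argument are my own inventions for illustration, not part of rvest, and the one-second pause is an arbitrary choice rather than anything Hockey Reference specifies.

```r
# Hypothetical helper: apply `fetch` to each URL, pausing between requests
# and returning NULL (instead of stopping) when a single page fails.
fetch_all <- function(urls, fetch, pause = 1) {
  lapply(urls, function(u) {
    Sys.sleep(pause)  # wait `pause` seconds before each request
    tryCatch(
      fetch(u),
      error = function(e) {
        message("Skipping ", u, ": ", conditionMessage(e))
        NULL
      }
    )
  })
}
```

With rvest loaded, something like `pages <- fetch_all(urls, read_html)` should return one parsed page (or NULL) per URL, and the extraction step can then run over the non-NULL entries.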

Third, some piped commands to convert to numeric values, calculate a ‘games above/below .500’ variable, identify teams based on the URLs, and identify the game number for each row:

df <- df_raw %>%
  mutate(wins = as.numeric(wins),
         losses = as.numeric(losses),
         record = wins - losses) %>%  # games above (+) or below (-) .500
  # Map the team code embedded in each URL to a display name
  mutate(team = case_when(
    grepl("TOR", url) ~ "Toronto",
    grepl("TBL", url) ~ "Tampa Bay",
    grepl("BOS", url) ~ "Boston",
    grepl("MTL", url) ~ "Montreal",
    grepl("FLA", url) ~ "Florida",
    grepl("DET", url) ~ "Detroit",
    grepl("BUF", url) ~ "Buffalo",
    TRUE ~ NA_character_)) %>%
  group_by(team) %>%
  mutate(game_no = row_number())
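
For what it's worth, the URL-to-team step can also be done with a lookup table, which is easier to extend once more teams are involved. This is only a sketch in base R: the names `team_names` and `team_from_url` are hypothetical, and the win/loss numbers at the bottom are made up purely to show the ‘games above/below .500’ arithmetic.

```r
# Hypothetical lookup table: URL team code -> display name
team_names <- c(TOR = "Toronto", TBL = "Tampa Bay", BOS = "Boston",
                MTL = "Montreal", FLA = "Florida", DET = "Detroit",
                BUF = "Buffalo")

# Pull the three-letter code out of a schedule URL and look it up;
# codes missing from the table come back as NA
team_from_url <- function(url) {
  code <- sub(".*/teams/([A-Z]{3})/.*", "\\1", url)
  unname(team_names[code])
}

team_from_url("https://www.hockey-reference.com/teams/MTL/2018_games.html")

# Made-up cumulative win/loss columns for one team (illustration only)
wins   <- c(1, 2, 2, 3)
losses <- c(0, 0, 1, 1)
record <- wins - losses  # games above (+) or below (-) .500 after each game
record
```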

Last, a chart to display our results using highcharter:

df %>% 
  na.omit() %>%
  hchart("line", 
         hcaes(x = game_no,
               y = record,
               group = team)) %>%
  hc_tooltip(table = TRUE, sort = TRUE) %>% 
  hc_xAxis(title = list(text = "Game Number (1 to most recent)")) %>%
  hc_yAxis(title = list(text = "Number of games above (or below) .500")) %>%
  hc_colors(c("#CDD8D9", "#CDD8D9", "#CDD8D9", "#CDD8D9", "#CDD8D9", "#CDD8D9", "#003E7E")) %>%
  hc_add_theme(hc_theme_google())

Leaving us with this:

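It actually isn’t that hard to get the same data for all teams, nor is it difficult to aggregate up to the team level. Note that the code below assumes df has been rebuilt for every team, with the team abbreviation in the team column and the extra per-game columns it uses (gf, ga, shots_for, shots_against, pim, opp_pim) scraped alongside wins and losses: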

team_df <- df %>%
  na.omit() %>%
  mutate(group = ifelse(team %in% c("TOR",
                                    "TBL",
                                    "BOS",
                                    "MTL",
                                    "FLA",
                                    "DET",
                                    "BUF"), "Atlantic", "Other")) %>%
  mutate(shots_for = as.numeric(shots_for),
         pim = as.numeric(pim),
         opp_pim = as.numeric(opp_pim),
         shots_against = as.numeric(shots_against),
         net_pim = pim - opp_pim,
         net_shots = shots_for - shots_against,
         net_goals = gf - ga) %>%
  group_by(group, team) %>%
  summarise(net_shots = sum(net_shots),
            net_goals = sum(net_goals),
            net_pim = sum(net_pim),
            sum_shots_for = sum(shots_for),
            sum_shots_against = sum(shots_against),
            sum_pim = sum(pim),
            sum_opp_pim = sum(opp_pim),
            sum_gf = sum(gf),
            sum_ga = sum(ga),
            wins = max(wins),
            losses = max(losses),
            net_position = wins - losses,
            sg_ratio = round(sum_gf/sum_shots_for, 3)) %>%
  ungroup()
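
To make the aggregation step concrete, here is a small sketch of the same idea on a toy data frame, using base R's aggregate() instead of dplyr so it runs without any packages. The two teams ("AAA" and "BBB") and all of their numbers are made up for illustration; they are not real results.

```r
# Made-up game-level rows for two hypothetical teams (not real results)
games <- data.frame(
  team      = c("AAA", "AAA", "AAA", "BBB", "BBB"),
  gf        = c(3, 2, 5, 1, 4),
  ga        = c(2, 4, 1, 3, 3),
  shots_for = c(30, 25, 35, 20, 30),
  stringsAsFactors = FALSE
)

# Sum each numeric column within team, one output row per team
team_totals <- aggregate(cbind(gf, ga, shots_for) ~ team,
                         data = games, FUN = sum)

# Derived team-level measures, mirroring net_goals and sg_ratio above
team_totals$net_goals <- team_totals$gf - team_totals$ga
team_totals$sg_ratio  <- round(team_totals$gf / team_totals$shots_for, 3)
team_totals
```

Here "AAA" ends up with 10 goals for, 7 against, and 90 shots, so a net_goals of 3 and an sg_ratio of 0.111.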

I’m not going to do any real analysis; I’ll just present a few scatter charts that may be obvious but are hopefully still of some interest:

The chart above shows a given team’s goals scored on the x-axis and goals allowed on the y-axis. Two things strike me as pretty interesting. First, the outliers: Tampa Bay (TBL), New York Islanders (NYI), San Jose (SJS), Buffalo (BUF), and Arizona (ARI) are all quite different from each other, and quite different from all the other teams in the league. The Islanders and Lightning are both high-scoring teams but very different defensively, and this chart bears out the Lightning’s stellar season so far, as they are the highest-scoring and second-stingiest team in the league. The stingiest, San Jose, has only allowed 86 goals, but they don’t score many either. Arizona and Buffalo are both having terrible seasons: they are having trouble scoring and trouble stopping other teams from scoring. Bad combo.

The second interesting thing in this chart is something that looks like a negative relationship in the cluster of non-outlier teams. Based on a cursory look, it does seem that for most teams in the league, those that score more also allow fewer goals. The plot below restricts the datapoints to non-outliers, and adds a linear model (blue) and loess curve (red) to capture the trend.
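
For anyone who wants to reproduce that kind of overlay, a rough sketch with base R's lm() and loess() follows. The data here are simulated with a negative slope baked in, so they only illustrate the mechanics; they are not the league numbers behind the chart, and the variable names are my own.

```r
set.seed(1)
# Simulated goals for / goals against for 25 hypothetical teams,
# generated with a negative slope; NOT the real league numbers
gf <- round(runif(25, 100, 140))
ga <- round(200 - 0.7 * gf + rnorm(25, sd = 5))
league <- data.frame(gf = gf, ga = ga)

# Straight-line fit: the sign of the gf coefficient gives the direction
fit_lm <- lm(ga ~ gf, data = league)
coef(fit_lm)

# Loess fit: a local smoother that can bend if the trend is not linear
fit_loess <- loess(ga ~ gf, data = league)

# Scatter plot with the linear fit in blue and the loess curve in red
ord <- order(league$gf)
plot(league$gf, league$ga, xlab = "Goals for", ylab = "Goals against")
abline(fit_lm, col = "blue")
lines(league$gf[ord], predict(fit_loess)[ord], col = "red")
```

If the two lines sit close together, the straight line is probably an adequate summary; a loess curve that bends away from it would suggest the relationship is not really linear.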

The chart below shows goals per shot taken (goals divided by shots) on the x-axis, and wins on the y-axis. Unsurprisingly, there is a pretty strong positive relationship: teams that convert a higher share of their shots into goals win more as well.

The last chart compares penalty minutes accrued. It probably shouldn’t be surprising that own and opponent penalty minutes are strongly correlated, given that many penalties are offsetting. What is interesting to me is how many penalty minutes Nashville has!