Using the NHL API – Play by Play and Shift Data

Reading Time: 5 minutes

The NHL API is a fantastic, free resource for all sorts of NHL data. You can find team data, player data, and detailed game data. In my last post I showed how I was able to quickly and easily grab all the game stats for every NHL player using R. In this post I’ll share how I used a similar approach to get all the play by play and shift data from 2011 onward. This data will be the basis of my attempts to build some more advanced models for Adjusted Plus Minus and WAR/GAR.

As usual, I’ve uploaded all the code and data for this exercise to my GitHub. To start, let’s load some packages:

library(dplyr)
library(tidyr)
library(tibble)
library(jsonlite)

To get all the play by play and shift data, we first need a list of game ids to use in our API calls. For that we can access the ‘schedule’ endpoint of the API. Initially I thought that play by play data started in 2007 (it actually starts in 2011), so I set that as the start date and used today’s date as the end date. A quick and dirty for loop then extracts the game data for each game and puts it into a data frame, which I save to the ‘schedule.rds’ file (a sketch of this loop is shown below). In addition to the game id for each game, there is a host of other info including the teams involved, the scores, team records, and venue information.
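The schedule-scraping loop itself isn’t shown in this post, but a minimal sketch of it might look like the block below. The JSON structure I rely on (a ‘dates’ element containing a ‘games’ data frame per date) is my reading of the statsapi schedule endpoint, and you may need to request one season at a time if the full date range is too large; the full version is on my GitHub.

## minimal sketch of the schedule download
start_date <- "2007-09-01"   # I originally assumed play by play data began in 2007
end_date   <- Sys.Date()     # today's date

link <- paste0("https://statsapi.web.nhl.com/api/v1/schedule?startDate=",
               start_date, "&endDate=", end_date)
df <- fromJSON(link)

## each element of df$dates$games is a data frame of the games on one date;
## flatten each one and stack them into a single schedule data frame
schedule <- NULL
for (i in 1:length(df$dates$games)) {
  schedule <- bind_rows(schedule, flatten(df$dates$games[[i]]))  # bind_rows fills any missing columns with NA
}

saveRDS(schedule, '../Data/schedule.rds')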

Now that we have the full schedule, let’s use it to get all the play by play data, sometimes called the Real Time Scoring System (RTSS) data, for every game since the 2011-2012 season. This data has a wealth of information including all the important plays (goals, shots, hits, penalties, takeaways, giveaways etc.), which players were involved, and even where the play took place in the form of x and y coordinates. All of this can be found at the ‘game’ API endpoint. As before, I created a loop to go through each game (all 11,292 of them) and download the JSON data. The data needs a bit of extra parsing, so we collect it into several data frames and then consolidate it. The ‘result’, ‘about’, ‘coordinates’, and ‘team’ data is pretty straightforward and we can just column-bind it into a single data frame with one row per play. The player data is a bit more complicated: each play has several rows, one for each player involved and their role in the play (scorer, assist, goalie etc.).

What I did was consolidate all these individual player tables into a single table (after adding the event code) and then use ‘tidyr’ functions to spread each event into a single row so it can be joined to the rest of the play data.  Essentially we create a new column for each ‘role’ in a play (eg. Scorer, Assist, DrewBy etc.) and populate it with the name of the player in that role.  Finally, we join it all together, add a few additional datapoints (game id, home team, away team etc.), bind all the plays into the master data frame and save it to the ‘plays.rds’ file. A toy example of the spread step is shown below, followed by the full loop.
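To make the spread step concrete, here is a toy example with made-up data (this isn’t from the API, just an illustration of the reshaping):

## toy illustration: one row per player/role becomes one row per event
toy <- tibble(eventCode       = c("EV1", "EV1", "EV1"),
              playerType      = c("Scorer", "Assist1", "Goalie"),
              player.fullName = c("Player A", "Player B", "Player C"))

spread(toy, key = playerType, value = player.fullName)
## returns one row with columns eventCode, Assist1, Goalie and Scorer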

## Load schedule data
games <- readRDS('../Data/schedule.rds')

## Filter to 2011 and later (first season that had coordinate data)
games <- filter(games, gameDate > '2011-09-01')

## initialize dataframes
df_final <- NULL

## loop through each game and download play data
for (i in 1:length(games$gamePk)) {

  print(i) # counter
  
  ## download play by play data
  link <- paste0("https://statsapi.web.nhl.com",games$link[i])
  df <- fromJSON(link)
  
  ## extract and merge play by play data into one dataframe
  df_result <- df$liveData$plays$allPlays$result
  df_about <- df$liveData$plays$allPlays$about
  df_coord <- df$liveData$plays$allPlays$coordinates
  df_team <- df$liveData$plays$allPlays$team
  df_plays <- cbind(df_result, df_about, df_coord, df_team)
  
  ## extract and reshape the player data for each play
  df_players <- df$liveData$plays$allPlays$players
  players <- NULL
  for (j in seq_along(df_players)) {
    if (!is.null(df_players[[j]])) {
      tmp <- flatten(df_players[[j]]) %>%
             select(player.fullName, playerType) %>%
             mutate(eventCode = df_result$eventCode[j])
      ## number the assists (Assist1, Assist2) so they become distinct columns when spread
      tmp1 <- filter(tmp, playerType == 'Assist') %>% mutate(playerType = paste0(playerType, row_number()))
      tmp <- rbind(filter(tmp, playerType != 'Assist'), tmp1)
      if (length(players) == 0) { players <- tmp } else { players <- rbind(players, tmp) }
    }
  }
  
  # this is to deal with row 6515 which has duplicate 'unknown' player values
  players <- distinct(players)
  
  if (!is.null(players)) {
    ## one row per event, with a column for each player role
    players <- spread(players, key = playerType, value = player.fullName)

    df_plays <- left_join(df_plays, players, by = c("eventCode")) %>%
                mutate(gamePk = games$gamePk[i],                      # add game id
                       link = games$link[i],                          # add API link
                       gameType = games$gameType[i],                  # add game type (regular season etc.)
                       away_team = games$teams.away.team.name[i],     # add away team
                       home_team = games$teams.home.team.name[i]) %>% # add home team
                flatten()                                             # flatten any lists
  }

  
  ## add plays to final dataframe 
  if (!is.null(df_plays) && nrow(df_plays) > 0) {
  if (length(df_final) == 0) { df_final <- df_plays } else
                             { # add any columns missing in df_final dataframe
                               columns <- names(df_plays[!names(df_plays) %in% names(df_final)])
                               if (length(columns) > 0) {
                                 for (col in 1:length(columns)) {
                                   df_final <- mutate(df_final, !!columns[col] := NA)
                                 }
                               }
                               # add any columns missing in df_plays dataframe
                               columns <- names(df_final[!names(df_final) %in% names(df_plays)])
                               if (length(columns) > 0) {
                                 for (col in 1:length(columns)) {
                                   df_plays <- mutate(df_plays, !!columns[col] := NA)
                                 }
                               }
                               df_final <- rbind(df_final, df_plays) }
  }
}

# save file
saveRDS(df_final, '../Data/plays.rds')

We can check the completeness of our data by comparing it to the original game list of 11,292 games from 2011 through 2019. Overall, we are missing data for 251 games, or 2.2% of the total. We could try scraping those from the HTML game reports, but for my purposes of building some predictive models I’m OK with a few missing games.

missing_games <- games$gamePk[!games$gamePk %in% unique(df_final$gamePk)]
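For reference, the 2.2% figure quoted above comes straight from this vector:

length(missing_games) / length(games$gamePk)  # ~0.022, i.e. about 2.2% of games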

The last set of data I’ll want to get from the API is shift data, i.e. which players were on the ice at each point of every game. This data is nice and clean and easy to parse from the JSON using the following code. Update Nov 19/21: the NHL API endpoint for shift data has changed; I have updated it in the code block below.

df_shift <- NULL

## loop through each game and download shift data
for (i in 1:length(games$gamePk)) {
    
  print(i) # counter
    
  ## download shift data
  link <- paste0("https://api.nhle.com/stats/rest/en/shiftcharts?cayenneExp=gameId=",games$gamePk[i])
  df <- fromJSON(link)
  tmp <- df$data
  
  if (length(df_shift) == 0)
  { df_shift <- tmp } else
  { df_shift <- rbind(df_shift, tmp) }
   
}

# save file
saveRDS(df_shift, '../Data/shifts.rds')
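With everything saved, a quick sanity check that the three files load cleanly and have the expected scale:

## quick check on the three saved data sets
schedule <- readRDS('../Data/schedule.rds')
plays    <- readRDS('../Data/plays.rds')
shifts   <- readRDS('../Data/shifts.rds')
sapply(list(schedule = schedule, plays = plays, shifts = shifts), nrow)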

So that’s it! We now have three new data sets: the complete schedule since 2007 (~17k rows), plus play by play and shift data since 2011 (~3.5m and ~8.3m rows respectively). Lots of great data to explore. In my next post I’ll begin my attempts to build an adjusted plus minus model.

Using the NHL API – Player and Game Data

Reading Time: 4 minutes

If you are looking to do some analysis, visualization or modelling of NHL data, the NHL API is a fantastic resource. It has loads and loads of data on all aspects of the NHL including team, player and game data. There is incredibly rich play by play data since 2007 for many different game events (goals, shots, hits, faceoffs etc.), including the players involved and the on-ice coordinates where each event occurred. There’s no official documentation, but after some searching I found a couple of fantastic resources that helped me get started: Drew Hynes’ work documenting all the available API endpoints, and the excellent NHL scraper that Evolving Hockey has made public.

For this post I wanted to share some R code that I quickly wrote to grab player game by game stats for every player from every team since league inception. Initially I used this to create some ‘top 10’ animated charts for team career point leaders over time, but this data could be useful for a number of analyses and visualizations.

My basic approach was to use the API and make four sets of calls to collect all the data required:

  1. Access the ‘team’ endpoint to get data on all NHL teams
  2. Use the team data to retrieve rosters for each season of each team
  3. Use the roster data to identify the complete NHL player list and then use the ‘people’ endpoint to retrieve basic data about each player
  4. Finally use the player list to pull game by game stats for each player that has played in the NHL since 1917

The end result is just over 2 million rows of data! Let’s take a quick look at the code for each part. Please don’t judge the ‘for’ loops – I was excited (and a little lazy) and just wanted something workable and easy to follow. Maybe one day I’ll rewrite it more elegantly 🙂

First, the required packages: dplyr and tidyr for data manipulation, and jsonlite for parsing the JSON returned by the API.

library(dplyr)
library(tidyr)
library(jsonlite)

The first part, collecting the team data, is pretty straightforward. There are more than 100 teams identified, but a bunch are all-star teams and other non-league teams. After reading through the list, I realized I only needed the first 58 teams. I saved the output to a .csv for a bit of manual cleaning: a few teams were missing their starting season, and I also added the final seasons for defunct teams. I then saved the cleaned team list as ‘teams.rds’.

teamids <- paste(c(1:58), collapse = ',')
teams <- fromJSON(paste0("https://statsapi.web.nhl.com/api/v1/teams?teamId=",teamids))

df_team <- teams$teams
write.csv(df_team,"teams.csv")  # manually add the missing start years, and also add end years
df_team <- read.csv("teams.csv")

saveRDS(df_team, file = 'data/teams.rds')

Using the team list, I fetched roster data for each team for each season. Note that I requested every season from the earliest year (1917) through 2018 for each team, even though I had start and end years for each team. I was worried I might miss some data if those dates were incorrect, so I erred on the side of caution. Then I combined all the data, added the team name, team id and season to each roster year, and saved it as ‘rosters.rds’.

min_year = min(df_team$firstYearOfPlay)

rosters <- NULL
for (id in 1:max(df_team$id)) {
  for (season in min_year:2018) {
    tmp <- try(fromJSON(paste0("https://statsapi.web.nhl.com/api/v1/teams?teamId=",id,"&expand=team.roster&season=",season,season+1)), silent=TRUE)
    if (!inherits(tmp, "try-error")) {
      tmp <- flatten(as.data.frame(tmp$teams$roster$roster)) %>%
             mutate(teamId = id,
                    name = df_team$name[id],
                    season = season)
      if (length(rosters) == 0) { rosters <- tmp } else { rosters <- rbind(rosters, tmp) }
    } else warning(paste0("Did not find ",df_team$name[id]," ",season))
  }
}

saveRDS(rosters, file = 'data/rosters.rds')

The roster data let me identify every player who has played in the NHL – almost 8,000 in total. Using the API’s ‘people’ endpoint, I then fetched basic information about each player such as height, weight, position and birthplace, and saved it to ‘players.rds’.

# identify unique players and get all player data for each
player_ids <- unique(rosters$person.id)

# fetch player data
players <- NULL
for (id in player_ids) {
  tmp <- try(fromJSON(paste0("https://statsapi.web.nhl.com/api/v1/people/",id)), silent=TRUE)
  if (!grepl("error",tmp)) {
    tmp <- flatten(tmp$people)

    if (length(players) == 0) { players <- tmp } else
                              { # add any columns missing in players dataframe
                                columns <- names(tmp[!names(tmp) %in% names(players)])
                                if (length(columns) > 0) {
                                  for (col in 1:length(columns)) {
                                    players <- mutate(players, !!columns[col] := NA)
                                  }
                                }
                                # add any columns missing in tmp dataframe
                                columns <- names(players[!names(players) %in% names(tmp)])
                                if (length(columns) > 0) {
                                  for (col in 1:length(columns)) {
                                    tmp <- mutate(tmp, !!columns[col] := NA)
                                  }
                                }
                                players <- rbind(players, tmp) }
  } else warning(paste0("Did not find ", players[id]))
  print(id)
}

saveRDS(players, file = 'data/players.rds')

Now that we have a full player list, we can pull the game by game stats for every game each player has played. This is a lot of data – more than 2 million rows – so you may want to split the queries into several blocks and join them together at the end. Unfortunately rbind is a pretty slow operation, especially as the data frame gets larger; a faster list-based alternative is sketched after the code block below, and if you know of an even better way to combine data frames, please comment below.

# get game by game data for each play
games <- NULL
for (i in 1:length(players$id)) {
 print(i)
 seasons <- filter(rosters, person.id == players$id[i]) %>% select(season)
 seasons <- unique(seasons)
 for (season in seasons$season) {
   url <- paste0("https://statsapi.web.nhl.com/api/v1/people/",players$id[i],"/stats?stats=gameLog&season=",season, season+1)
   tmp <- try(fromJSON(url), silent=TRUE)
   if (!inherits(tmp, "try-error") && is.data.frame(tmp$stats[[2]][[1]])) {
     tmp <- flatten(tmp$stats[[2]][[1]]) %>%
            mutate(id = players$id[i],
                   fullName = players$fullName[i])
  
     if (length(games) == 0) { games <- tmp } else
     { # add any columns missing in players dataframe
       columns <- names(tmp[!names(tmp) %in% names(games)])
       if (length(columns) > 0) {
         for (col in 1:length(columns)) {
           games <- mutate(games, !!columns[col] := NA)
         }
       }
       # add any columns missing in tmp dataframe
       columns <- names(games[!names(games) %in% names(tmp)])
       if (length(columns) > 0) {
         for (col in 1:length(columns)) {
           tmp <- mutate(tmp, !!columns[col] := NA)
         }
       }
       games <- rbind(games, tmp) }
   }
 }
}

saveRDS(games, file = 'data/games.rds')
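As mentioned above, growing a data frame with repeated rbind calls gets slow. One alternative (a sketch only, not how the data above was actually collected) is to collect each player’s game log in a list and bind everything once at the end – dplyr::bind_rows fills in any missing columns with NA, so the manual column-matching above isn’t needed. The get_player_gamelog() helper here is just a hypothetical stand-in for the download-and-flatten code in the loop above.

## sketch: accumulate data frames in a list, bind once at the end
chunks <- list()
for (i in 1:length(players$id)) {
  tmp <- get_player_gamelog(players$id[i])   # hypothetical helper wrapping the API call above
  if (is.data.frame(tmp)) { chunks[[length(chunks) + 1]] <- tmp }
}
games <- bind_rows(chunks)   # single bind; missing columns are filled with NA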

I’ve uploaded all the code and data to my GitHub repository – have fun delving into this amazing data resource! Please share any interesting insights or analyses you get out of this data. Thanks for reading.

Retrieving Wikipedia Data for Natural Language Processing

Reading Time: 4 minutes


The internet is not just for cat videos anymore – there’s too much useful, freely available data to ignore.  In my opinion, being able to easily get data from the internet using APIs is a core skill for any data scientist or analyst. I’ve walked through how easy it is to use an API to get Strava exercise data and Twitter data; today we focus on Wikipedia to find some text data for our Natural Language Processing projects.

We’re going to use R for this project and let some great packages do the heavy lifting: WikipediR for working with the MediaWiki API, and rvest for scraping the Wikipedia pages themselves.  As usual there are a number of packages and ways to approach this task, but I found this the easiest and most straightforward.  The output of this code is a text file containing the text of all the Wikipedia pages related to a given topic, with the goal of generating a domain-specific corpus for training word vectors with Word2Vec.  That said, this code is easily modified to extract a variety of data from Wikipedia.

In addition to the two packages mentioned above, we will load the tidyverse for any data manipulation tasks.  The tidyverse is a must-have for making data manipulation easy, as in this example working with customer survey data.  We’ll also initialize some parameters, including our overall topic (Ice hockey) and the output filename.  For your particular topic, just search Wikipedia for a category page and use all the text after ‘Category:’.  The script will then extract all the pages and sub-categories associated with that topic.

library(tidyverse)            # Data Manipulation
library(WikipediR)            # Wikipedia Queries
library(rvest)                # Web Scraping

category_list <- "Ice hockey" # Set to starting category name
filename <- "data.txt"        # Output filename
total_pages <- c()
categories <- c()
text_data <- NULL

As a first step, we want to use the MediaWiki API to retrieve a list of all the sub-categories and page titles associated with those categories.  The WikipediR package provides some nice wrapper functions to access the API.

  # retrieve pages and categories
  pages <- pages_in_category("en", "wikipedia", categories = category,  properties = c("title"), type = c("page"))
  sub_cats <- pages_in_category("en", "wikipedia", categories = category,  properties = c("title"), type = c( "subcat"))

The ‘pages_in_category’ function is helpful here, returning Wikipedia sub-category and/or page data depending on its parameters.  We specify English-language Wikipedia with the first two parameters and use the previously defined category name.  The function can return a number of properties, but we only need the title.  Finally we specify the type of element we are looking for (page, sub-category or file), and assign the results to pages and sub_cats respectively.  We then use a couple of loops to extract each page and sub-category title and add them to our master lists.

  # add pages to list  
  if (length(pages$query$categorymembers) > 0 ) {
    for (i in 1:length(pages$query$categorymembers)) {
      total_pages <- c(total_pages, pages$query$categorymembers[[i]]$title)
    }
  }

  # add sub categories to list
  if (length(sub_cats$query$categorymembers) > 0 ) {
    for (i in 1:length(sub_cats$query$categorymembers)) {
      sub_cat = gsub("Category:", "", sub_cats$query$categorymembers[[i]]$title)
      categories <- c(categories, sub_cat)
      next_category_list <- c(next_category_list, sub_cat)
    }
  }

Next, we repeat the process for each of the returned sub-categories, retrieving all of the pages and sub-categories associated with each one and adding them to the master lists.  We continue ‘drilling down’ until there are no more sub-categories or pages associated with the overall topic.  A simplified sketch of the looping structure is shown below; for the entire script, please check out the code on my GitHub.  After that are the final results – that’s a lot of categories and individual pages about hockey!
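Here is that simplified sketch, stitched together from the snippets above (the version on GitHub differs in the details):

## simplified sketch of the category drill-down
while (length(category_list) > 0) {
  next_category_list <- c()
  for (category in category_list) {

    # retrieve pages and sub-categories for this category
    pages    <- pages_in_category("en", "wikipedia", categories = category, properties = c("title"), type = c("page"))
    sub_cats <- pages_in_category("en", "wikipedia", categories = category, properties = c("title"), type = c("subcat"))

    # add page titles to the master list
    if (length(pages$query$categorymembers) > 0) {
      for (i in 1:length(pages$query$categorymembers)) {
        total_pages <- c(total_pages, pages$query$categorymembers[[i]]$title)
      }
    }

    # add new sub-categories, skipping any already seen
    if (length(sub_cats$query$categorymembers) > 0) {
      for (i in 1:length(sub_cats$query$categorymembers)) {
        sub_cat <- gsub("Category:", "", sub_cats$query$categorymembers[[i]]$title)
        if (!sub_cat %in% categories) {
          categories <- c(categories, sub_cat)
          next_category_list <- c(next_category_list, sub_cat)
        }
      }
    }
  }
  category_list <- next_category_list   # drill down one level
}

total_pages <- unique(total_pages)
print(paste0("Number of Categories: ", length(categories)))
print(paste0("Number of Pages: ", length(total_pages)))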

[1] "Number of Categories: 7322"
[1] "Number of Pages: 12912"

Next we extract all the text data from each page using the rvest web scraping package.  We simply loop through the page list we just created and extract all the paragraphs (denoted by the <p> node).  We are leaving some of each web page behind, such as lists and tables, but for this project I figured nice clean paragraph and sentence data would be most useful.  All of the paragraphs for a page are pasted into a single ‘page_text’ string, which is then appended to the overall ‘text_data’ variable.

# read all page paragraph data
for (i in 1:length(total_pages)) {
  page = gsub(" ", "_", total_pages[i])
  print(paste0("Loading Page: ", page))
  web_address <- paste0("https://en.wikipedia.org/wiki/",page)
  page_html <- read_html(web_address)
  page_paragraphs <- html_nodes(page_html,"p")
  page_text <- paste(html_text(page_paragraphs), sep = '', collapse = '')
  if (is.null(text_data)) { text_data <- page_text }
  else { text_data <- paste(text_data, page_text, sep = '', collapse = '') }
}

Before we write all this juicy text data to a .txt file for future NLP tasks, we should probably do a little data preprocessing.  Let’s remove existing line breaks and replace with line breaks after each sentence so our NLP treats each sentence as an entity.  We also remove all punctuation and citations (eg. [2]) and convert everything to lower case.  You may want to modify these preprocessing steps for your project.  Perhaps you want to distinguish between ‘doors’ of a house and the band ‘Doors’, in which case you wouldn’t want to convert everything to lower case.  I also kept all numerical data as-is, to keep dates and jersey numbers, but you may want to consider removing numbers all together.

# text pre-processing
text_data <- gsub("\n", " ", text_data)       # remove the existing line breaks
text_data <- gsub("\\.", "\r\n", text_data)   # add line breaks
text_data <- gsub("\\[\\d+\\]"," ", text_data) # remove citations (eg. [2], [12])
text_data <- gsub("[[:punct:]]"," ", text_data) # remove punctuation
text_data <- text_data %>% tolower()

# save text file
write(text_data, filename)

The final step is to write this 6 million word corpus to a text file, which we will use in the next post to train word vectors using Word2Vec.

All the code for this task can be found on GitHub.  You can use it to generate text for any topic simply by changing the starting category.  Getting more familiar with the WikipediR and rvest packages will also let you modify this script to extract any number of Wikipedia pages and other data.

Thanks for reading!