If you are looking to do some analysis, visualization or modelling on NHL data, the NHL API is a fantastic resource. It has loads of data on all aspects of the NHL, including team, player and game data. There is incredibly rich play-by-play data since 2007 for many different game events (goals, shots, hits, faceoffs etc.), including the players involved and the on-ice coordinates where each event occurred. There’s no official documentation, but after some searching I was able to find a couple of great resources that helped me get started. One is Drew Hynes’s work documenting all the available API endpoints, and the other is the NHL scraper that Evolving Hockey has made public.
For this post I wanted to share some R code that I quickly wrote to grab player game by game stats for every player from every team since league inception. Initially I used this to create some ‘top 10’ animated charts for team career point leaders over time, but this data could be useful for a number of analyses and visualizations.
My basic approach was to use the API and make four sets of calls to collect all the data required:
- Access the ‘team’ endpoint to get data on all NHL teams
- Use the team data to retrieve rosters for each season of each team
- Use the roster data to identify the complete NHL player list and then use the ‘people’ endpoint to retrieve basic data about each player
- Finally use the player list to pull game by game stats for each player that has played in the NHL since 1917
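The four sets of calls above boil down to a few URL patterns. Here is a minimal sketch of helpers that build them. The function names (`season_code`, `roster_url`, `player_url`, `gamelog_url`) are made up for illustration and are not part of the API; the URL shapes are the ones used in the code later in this post, and the player id in the example is a placeholder.

```r
# Base URL used throughout this post
base_url <- "https://statsapi.web.nhl.com/api/v1"

# The season parameter joins the two calendar years, e.g. 1917 -> "19171918"
season_code <- function(start_year) {
  paste0(start_year, start_year + 1)
}

# Roster for one team in one season
roster_url <- function(team_id, start_year) {
  paste0(base_url, "/teams?teamId=", team_id,
         "&expand=team.roster&season=", season_code(start_year))
}

# Basic information about one player
player_url <- function(player_id) {
  paste0(base_url, "/people/", player_id)
}

# Game-by-game stats for one player in one season
gamelog_url <- function(player_id, start_year) {
  paste0(base_url, "/people/", player_id,
         "/stats?stats=gameLog&season=", season_code(start_year))
}

# Example: roster URL for team 1 in the 1917-18 season
roster_url(1, 1917)
```

Keeping the URL building in one place makes the loops further down a little easier to read, but it is purely a matter of taste.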
The end result is just over 2 million rows of data! Let’s take a quick look at the code for each part. Please don’t judge the ‘for’ loops – I was excited and a bit lazy, and just wanted something workable and easy to follow. Maybe one day I will rewrite it more elegantly 🙂
First, the required packages: dplyr and tidyr for data manipulation, and jsonlite for parsing the API data.
library(dplyr)
library(tidyr)
library(jsonlite)
The first part, collecting the team data, is pretty straightforward. There are more than 100 teams identified, but a bunch are all-star teams and other non-league teams. After reading through the list, I realized I only needed team ids 1 through 58. I saved the output to a .csv for a bit of manual cleaning: a few teams were missing starting season dates, and I also added the final seasons for defunct teams. I then saved the cleaned team list as ‘teams.rds’.
teamids <- paste(c(1:58), collapse = ',')
teams <- fromJSON(paste0("https://statsapi.web.nhl.com/api/v1/teams?teamId=", teamids))
df_team <- teams$teams
write.csv(df_team, "teams.csv")

# manually add the missing start years, and also add end years
df_team <- read.csv("teams.csv")
saveRDS(df_team, file = 'data/teams.rds')
Using the team list, I fetched roster data for each team for each season. Note that I tried to fetch data from the earliest year (1917) to 2018 for every team, even though I had start and end years for each team. I was worried that I might miss some data if those dates were incorrect, so I erred on the side of caution. Then I combined all the data, added the team name, team id, and season to each roster year and saved the result as ‘rosters.rds’.
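min_year <- min(df_team$firstYearOfPlay)
roster <- NULL
for (id in 1:max(df_team$id)) {
  # look the team name up by id rather than by row position
  team_name <- df_team$name[df_team$id == id][1]
  for (season in min_year:2018) {
    # the season parameter is both years joined together, e.g. 19171918
    url <- paste0("https://statsapi.web.nhl.com/api/v1/teams?teamId=", id,
                  "&expand=team.roster&season=", season, season + 1)
    tmp <- try(fromJSON(url), silent = TRUE)
    # try() returns an object of class "try-error" when the request fails
    if (!inherits(tmp, "try-error")) {
      tmp <- flatten(as.data.frame(tmp$teams$roster$roster)) %>%
        mutate(teamId = id, name = team_name, season = season)
      if (length(roster) == 0) {
        roster <- tmp
      } else {
        roster <- rbind(roster, tmp)
      }
    } else warning(paste0("Did not find ", team_name, " ", season))
  }
}
# save the combined rosters (not `df`, which is a base R function)
saveRDS(roster, file = 'data/rosters.rds')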
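Because most team and season combinations in that loop do not exist, the failure path gets exercised a lot. Here is a minimal sketch of an alternative to the `try()` check, using `tryCatch()` to turn any failed request into `NULL`. The helper name `safe_fetch` and the stand-in `failing_fetch` are hypothetical; the default fetcher is `jsonlite::fromJSON`, as used elsewhere in this post.

```r
# Hypothetical helper: returns the parsed result, or NULL if the request fails.
# `fetch` defaults to jsonlite::fromJSON (so jsonlite must be installed for
# real use) but can be swapped out, which makes the helper testable offline.
safe_fetch <- function(url, fetch = jsonlite::fromJSON) {
  tryCatch(fetch(url), error = function(e) NULL)
}

# Offline demonstration with a stand-in fetcher that always fails
failing_fetch <- function(url) stop("HTTP error 404")
res <- safe_fetch("https://example.invalid/api", fetch = failing_fetch)

# A failed request comes back as NULL, so the caller can simply skip it
if (is.null(res)) message("Request failed, skipping")
```

With a helper like this, the body of the loop could test `is.null(tmp)` instead of inspecting the class of the object returned by `try()`.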
The roster data allowed me to identify all the players who have played in the NHL, a total of almost 8,000 players. Using the API’s ‘people’ endpoint, I then fetched basic information about each player such as height, weight, position and birthplace. I saved this into ‘players.rds’.
# load the saved rosters, then identify the unique players
rosters <- readRDS('data/rosters.rds')
player_ids <- unique(rosters$person.id)

# fetch player data
players <- NULL
for (id in player_ids) {
  tmp <- try(fromJSON(paste0("https://statsapi.web.nhl.com/api/v1/people/", id)), silent = TRUE)
  # try() returns an object of class "try-error" when the request fails
  if (!inherits(tmp, "try-error")) {
    tmp <- flatten(tmp$people)
    if (length(players) == 0) {
      players <- tmp
    } else {
      # add any columns missing in players dataframe
      columns <- names(tmp[!names(tmp) %in% names(players)])
      if (length(columns) > 0) {
        for (col in 1:length(columns)) {
          players <- mutate(players, !!columns[col] := NA)
        }
      }
      # add any columns missing in tmp dataframe
      columns <- names(players[!names(players) %in% names(tmp)])
      if (length(columns) > 0) {
        for (col in 1:length(columns)) {
          tmp <- mutate(tmp, !!columns[col] := NA)
        }
      }
      players <- rbind(players, tmp)
    }
  } else warning(paste0("Did not find player ", id))
  print(id)
}
saveRDS(players, file = 'data/players.rds')
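The two column-matching loops in the code above pad each data frame with NA columns so that `rbind` will accept them. dplyr’s `bind_rows` does that padding itself, so here is a minimal sketch of the same idea on two toy records. The data frames and their values are made up for illustration and do not come from the API.

```r
library(dplyr)

# Two toy player records with different sets of columns (made-up values)
a <- data.frame(id = 1, fullName = "Player A", birthCity = "Toronto")
b <- data.frame(id = 2, fullName = "Player B", weight = 200)

# bind_rows matches columns by name and fills the gaps with NA
players_demo <- bind_rows(a, b)
print(players_demo)
```

The result has all four columns, with `weight` missing for the first row and `birthCity` missing for the second.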
Now that we have a full player list, we can load the game-by-game stats for every game each player played. This is a lot of data, more than 2 million rows, so you may want to split the queries into several blocks and join them together at the end. Unfortunately rbind is a pretty slow operation, especially as a data frame gets larger. If you know of a faster, more efficient way to join data frames, please comment below.
# get game by game data for each player
# (uses the rosters and players data frames from the previous steps)
games <- NULL
for (i in seq_along(players$id)) {
  print(i)
  seasons <- filter(rosters, person.id == players$id[i]) %>% select(season)
  seasons <- unique(seasons)
  for (season in seasons$season) {
    url <- paste0("https://statsapi.web.nhl.com/api/v1/people/", players$id[i],
                  "/stats?stats=gameLog&season=", season, season + 1)
    tmp <- try(fromJSON(url), silent = TRUE)
    # skip failed requests before looking inside the result
    if (!inherits(tmp, "try-error") && is.data.frame(tmp$stats[[2]][[1]])) {
      tmp <- flatten(tmp$stats[[2]][[1]]) %>%
        mutate(id = players$id[i], fullName = players$fullName[i])
      if (length(games) == 0) {
        games <- tmp
      } else {
        # add any columns missing in games dataframe
        columns <- names(tmp[!names(tmp) %in% names(games)])
        if (length(columns) > 0) {
          for (col in 1:length(columns)) {
            games <- mutate(games, !!columns[col] := NA)
          }
        }
        # add any columns missing in tmp dataframe
        columns <- names(games[!names(games) %in% names(tmp)])
        if (length(columns) > 0) {
          for (col in 1:length(columns)) {
            tmp <- mutate(tmp, !!columns[col] := NA)
          }
        }
        games <- rbind(games, tmp)
      }
    }
  }
}
saveRDS(games, file = 'data/games.rds')
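On the question of speed: growing a data frame with `rbind` inside a loop copies the accumulated data on every iteration. A common alternative is to collect the pieces in a list and bind them once at the end. Here is a minimal sketch of that pattern combined with splitting the ids into blocks. The names `split_into_blocks` and `fake_game_log` are hypothetical, and `fake_game_log` returns made-up data in place of a real API call.

```r
library(dplyr)

# Split a vector of ids into blocks of at most `size` elements
split_into_blocks <- function(ids, size) {
  split(ids, ceiling(seq_along(ids) / size))
}

# Stand-in for the real API call: one tiny data frame per player (made-up data)
fake_game_log <- function(id) {
  data.frame(id = id, goals = id %% 3)
}

ids <- 1:10
blocks <- split_into_blocks(ids, size = 4)

# For each block, fill a pre-allocated list and bind it once
block_results <- lapply(blocks, function(block) {
  pieces <- vector("list", length(block))
  for (i in seq_along(block)) {
    pieces[[i]] <- fake_game_log(block[i])
  }
  bind_rows(pieces)
})

# Join the blocks together at the end
games_demo <- bind_rows(block_results)
```

In a real run, each block’s result could be written out with `saveRDS` as it finishes, so that a failure part way through does not lose the earlier blocks.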
I’ve uploaded all the code and data to my GitHub repository. Have fun delving into this amazing data resource! Please share any interesting insights or analyses you get out of this data. Thanks for reading.