Where is it? Geocoding with OpenStreetMap and R

Reading Time: 2 minutes

In previous posts I’ve shared how to use APIs to retrieve webdata from a number of sources including Strava, Wikipedia, and Twitter. Recently I was looking to plot some location data, and wanted to join latitude and longitude to street addresses so they could be plotted in Tableau. The most well known geocoding service is probably Google, but I also found a number of others. I ended up settling on OSM (OpenStreetMap), which isn’t particularly fast (it limits you to one address per second), but it is free and super easy to use.

Retrieving geocoded addressess from OSM is simple and you can find the full details at the helpful Nominatim Wiki. Suppose we want to get the location of the home of Sherlock Holmes, we can simply use the following call:

https://nominatim.openstreetmap.org/search?q=221b+baker+street,+london&format=json&addressdetails=0&limit=1

What we get back will be something like this in JSON format:

[{"place_id":50843439,"licence":"Data © OpenStreetMap contributors, ODbL 1.0. https://osm.org/copyright","osm_type":"node","osm_id":3916613190,"boundingbox":["51.5237104","51.5238104","-0.1585445","-0.1584445"],"lat":"51.5237604","lon":"-0.1584945","display_name":"Sherlock Holmes Museum, 221B, Baker Street, Marylebone, Westminster, London, Greater London, England, NW1 6XE, United Kingdom","class":"tourism","type":"museum","importance":0.401,"icon":"https://nominatim.openstreetmap.org/images/mapicons/tourist_museum.p.20.png"}]

You can see embedded in all that information are the values for ‘lat’ and ‘lon’. So we can create a short script in R to loop through a set of addresses and extract the data we need. As an example I am using a sample of 100 Toronto parking tickets that have addresses and we will add the longitude and latitude. First let’s install the packages and define a function to extract the JSON data using the OSM API

### PACKAGES ###
################

library(tidyverse)   # data manipulation
library(jsonlite)    # JSON

### FUNCTIONS ###
#################

nominatim_osm <- function(address = NULL)
{
  if(suppressWarnings(is.null(address)))
    return(data.frame())
  tryCatch(
    d <- jsonlite::fromJSON( 
      gsub('\\@addr\\@', gsub('\\s+', '\\%20', address), 
           'https://nominatim.openstreetmap.org/search/@addr@?format=json&amp;addressdetails=0&amp;limit=1')
    ), error = function(c) return(data.frame())
  )
  if(length(d) == 0) return(data.frame())
  return(data.frame(lon = as.numeric(d$lon), lat = as.numeric(d$lat)))
}

Next, let’s load the address data and prep by defining the lon and lat values.

### Load Sample Data ###
sample_addresses <- read.csv("sample_addresses.csv") %>%
                    mutate(lon = 0,
                           lat = 0)

Finally we can loop through each addresses and get the longitude and latitude using our new function. Note we add the city and province to the address for the API call, and we also add a 1 second delay to comply with OSM guidelines.

### Add Geocoding ###
for (i in 1:dim(sample_addresses)[1]) {
  print(i)
  long_lat <- nominatim_osm(paste0(sample_addresses$location2[i],", Toronto, ON"))
  Sys.sleep(1)  # ensure 1 second between API call as per OSM guidelines
  if (dim(long_lat)[1] != 0) {
    sample_addresses$lon[i] = long_lat$lon
    sample_addresses$lat[i] = long_lat$lat
  }
}

Simple as that, here’s a screenshot of our final dataframe.

As usual, I’ve put all the code and data on my Github. Thanks for reading!

Using the NHL API – Play by Play and Shift Data

Reading Time: 5 minutes
Photo by Mika Baumeister on Unsplash

The NHL API is a fantastic, free resource for all sorts of NHL data.  You can find team data, player data, and all sorts of game data. In my last post I showed how I was able to quickly and easily grab all the game stats for every NHL player using R. For this post I’d share how I was able to use a similar approach to get all the play by play and shift data from 2011 onward. This data will be the basis of my attempts to build some more advanced models to understand Adjusted Plus Minus and WAR/GAR.

As usual, all my code and data for this exercise I’ve uploaded to my GitHub. To start let’s load some packages

library(dplyr)
library(tidyr)
library(tibble)
library(jsonlite)

To get all the play by play and shift data we will need a list of game ids to use in our API calls. To do that we can access the ‘schedule’ endpoint in the API. Initially I thought that play by play data started in 2007 (it actually starts in 2011), so I set it as the start date and then use today’s date as the end date. Then I have a quick and dirty for loop to extract all the game data from each game and put into a data frame. Finally I save it to the ‘schedule.rds’ file. In addition to the game id’s for each game, there is a host of other info including the teams involved, the scores, team records, and venue information.

Now that we have the full schedule, let’s use it get all the play by play data sometimes called the Real Time Scoring System (RTSS) data for every game since the 2011-2012 season. This data has a wealth of information including all important plays (goals, shots, hits, penalties, takeaways, giveaways etc.), which players were involved and even where the play took place in the form of x and y coordinates. All this can be found at the ‘game’ API endpoint. As before, I created a loop to loop through each game (all 11,292 of them) and download the JSON data. The data needs a bit of extra parsing, so we collect it into several dataframes and then consolidate it. The ‘result’, ‘about’, ‘coordinates’, and ‘team’ data is pretty straightforward and we can just column bind into a single data frame with a single row for each play. The player data is a bit more complicated, as it has several rows for each player, with a row for each player involved in the play and what their involvement was, like this:

What I did was consolidate all these individual tables into a single table (after adding the event code) and then using ‘tidy’ functions to spread each of them into one row so they can be joined the rest of the table.  Essentially we create a new column for each ‘role’ in a play (eg. Scorer, Assist, DrewBy etc.) and then populate with the player’s name for each.  Finally, we join it all together, add a few additional datapoints (game id, home team, away team etc.) and then join all the plays together into the master data frame and save it to the ‘plays.rds’ file. 

## Load schedule data
games <- readRDS('../Data/schedule.rds')

## Filter to 2011 and later (first season that had coordinate data)
games <- filter(games, gameDate > '2011-09-01')

## initialize dataframes
df_final <- NULL

## loop through each game and download play data
for (i in 1:length(games$gamePk)) {

  print(i) # counter
  
  ## download play by play data
  link <- paste0("https://statsapi.web.nhl.com",games$link[i])
  df <- fromJSON(link)
  
  ## extract and merge play by play data into one dataframe
  df_result <- df$liveData$plays$allPlays$result
  df_about <- df$liveData$plays$allPlays$about
  df_coord <- df$liveData$plays$allPlays$coordinates
  df_team <- df$liveData$plays$allPlays$team
  df_plays <- cbind(df_result, df_about, df_coord, df_team, row.Names = FALSE)
  
  ## extract play by play data
  df_players <- df$liveData$plays$allPlays$players
  players <- NULL
  for (j in 1:length(df_players)) {
    if (!is.null(df_players[[j]])) {
    tmp <- flatten(df_players[[j]]) %>% select(player.fullName,playerType) %>% mutate(eventCode = df_result$eventCode[j])
    tmp1 <- filter(tmp, playerType == 'Assist') %>% mutate(playerType = paste0(playerType,row_number()))
    tmp <- rbind(filter(tmp, playerType != 'Assist'),tmp1)
    if (length(players) == 0) {
      players <- tmp } else { players <- rbind(players, tmp) }
    }
  }
  
  # this is to deal with row 6515 which has duplicate 'unknown' player values
  players <- distinct(players)
  
  if(!is.null(players)) { players <- spread(players, key=playerType, value=player.fullName) 
  
  df_plays <- left_join(df_plays, players, by = c("eventCode")) %>%
              mutate(gamePk = games$gamePk[i],                      # add game id
                     link = games$link[i],                          # add API link
                     gameType = games$gameType[i],                  # add game type (regular season etc.)
                     away_team = games$teams.away.team.name[i],     # add away team
                     home_team = games$teams.home.team.name[i]) %>% # add home team
              flatten()                                             # flatten any lists
  }

  
  ## add plays to final dataframe 
  if (df_plays > 1) {
  if (length(df_final) == 0) { df_final <- df_plays } else
                             { # add any columns missing in df_final dataframe
                               columns <- names(df_plays[!names(df_plays) %in% names(df_final)])
                               if (length(columns) > 0) {
                                 for (col in 1:length(columns)) {
                                   df_final <- mutate(df_final, !!columns[col] := NA)
                                 }
                               }
                               # add any columns missing in df_plays dataframe
                               columns <- names(df_final[!names(df_final) %in% names(df_plays)])
                               if (length(columns) > 0) {
                                 for (col in 1:length(columns)) {
                                   df_plays <- mutate(df_plays, !!columns[col] := NA)
                                 }
                               }
                               df_final <- rbind(df_final, df_plays) }
  }
}

# save file
saveRDS(df_final, '../Data/plays.rds')

We can check the completeness of our data by comparing to the original game list of 11,292 games from 2011 through 2019. Overall, we are missing data on 251 games or 2.2% of the total. We could try scraping the data from the html game reports, but for my purposes of building some predictive models I’m ok with a few missing games.

missing_games <- games$gamePk[!games$gamePk %in% unique(df_final$gamePk)]

The last set of data I’ll want to get using the API is shift data, ie which players were on the ice at which points during every game. This data is nice and clean and easy to parse from the JSON data using the following code.

df_shift <- NULL

## loop through each game and download play data
for (i in 1:length(games$gamePk)) {
    
  print(i) # counter
    
  ## download shift data
  link <- paste0("http://www.nhl.com/stats/rest/shiftcharts?cayenneExp=gameId=",games$gamePk[i])
  df <- fromJSON(link)
  tmp <- df$data
  
  if (length(df_shift) == 0)
  { df_shift <- tmp } else
  { df_shift <- rbind(df_shift, tmp) }
   
}

# save file
saveRDS(df_shift, '../Data/shifts.rds')

So that’s it! Now we have 3 new data sets which have complete schedules from 2007 (17k rows), play by play data and shift data from 2011 (3.5m and 8.3m rows respectively). Lots of great data to explore. In my next post I’ll begin my attempts to build an adjusted plus minus model.