Where is it? Geocoding with OpenStreetMap and R

Reading Time: 2 minutes

In previous posts I’ve shared how to use APIs to retrieve webdata from a number of sources including Strava, Wikipedia, and Twitter. Recently I was looking to plot some location data, and wanted to join latitude and longitude to street addresses so they could be plotted in Tableau. The most well known geocoding service is probably Google, but I also found a number of others. I ended up settling on OSM (OpenStreetMap), which isn’t particularly fast (it limits you to one address per second), but it is free and super easy to use.

Retrieving geocoded addressess from OSM is simple and you can find the full details at the helpful Nominatim Wiki. Suppose we want to get the location of the home of Sherlock Holmes, we can simply use the following call:

https://nominatim.openstreetmap.org/search?q=221b+baker+street,+london&format=json&addressdetails=0&limit=1

What we get back will be something like this in JSON format:

[{"place_id":50843439,"licence":"Data © OpenStreetMap contributors, ODbL 1.0. https://osm.org/copyright","osm_type":"node","osm_id":3916613190,"boundingbox":["51.5237104","51.5238104","-0.1585445","-0.1584445"],"lat":"51.5237604","lon":"-0.1584945","display_name":"Sherlock Holmes Museum, 221B, Baker Street, Marylebone, Westminster, London, Greater London, England, NW1 6XE, United Kingdom","class":"tourism","type":"museum","importance":0.401,"icon":"https://nominatim.openstreetmap.org/images/mapicons/tourist_museum.p.20.png"}]

You can see embedded in all that information are the values for ‘lat’ and ‘lon’. So we can create a short script in R to loop through a set of addresses and extract the data we need. As an example I am using a sample of 100 Toronto parking tickets that have addresses and we will add the longitude and latitude. First let’s install the packages and define a function to extract the JSON data using the OSM API

### PACKAGES ###
################

library(tidyverse)   # data manipulation
library(jsonlite)    # JSON

### FUNCTIONS ###
#################

nominatim_osm <- function(address = NULL)
{
  if(suppressWarnings(is.null(address)))
    return(data.frame())
  tryCatch(
    d <- jsonlite::fromJSON( 
      gsub('\\@addr\\@', gsub('\\s+', '\\%20', address), 
           'https://nominatim.openstreetmap.org/search/@addr@?format=json&amp;addressdetails=0&amp;limit=1')
    ), error = function(c) return(data.frame())
  )
  if(length(d) == 0) return(data.frame())
  return(data.frame(lon = as.numeric(d$lon), lat = as.numeric(d$lat)))
}

Next, let’s load the address data and prep by defining the lon and lat values.

### Load Sample Data ###
sample_addresses <- read.csv("sample_addresses.csv") %>%
                    mutate(lon = 0,
                           lat = 0)

Finally we can loop through each addresses and get the longitude and latitude using our new function. Note we add the city and province to the address for the API call, and we also add a 1 second delay to comply with OSM guidelines.

### Add Geocoding ###
for (i in 1:dim(sample_addresses)[1]) {
  print(i)
  long_lat <- nominatim_osm(paste0(sample_addresses$location2[i],", Toronto, ON"))
  Sys.sleep(1)  # ensure 1 second between API call as per OSM guidelines
  if (dim(long_lat)[1] != 0) {
    sample_addresses$lon[i] = long_lat$lon
    sample_addresses$lat[i] = long_lat$lat
  }
}

Simple as that, here’s a screenshot of our final dataframe.

As usual, I’ve put all the code and data on my Github. Thanks for reading!

Using the NHL API – Play by Play and Shift Data

Reading Time: 5 minutes
Photo by Mika Baumeister on Unsplash

The NHL API is a fantastic, free resource for all sorts of NHL data.  You can find team data, player data, and all sorts of game data. In my last post I showed how I was able to quickly and easily grab all the game stats for every NHL player using R. For this post I’d share how I was able to use a similar approach to get all the play by play and shift data from 2011 onward. This data will be the basis of my attempts to build some more advanced models to understand Adjusted Plus Minus and WAR/GAR.

As usual, all my code and data for this exercise I’ve uploaded to my GitHub. To start let’s load some packages

library(dplyr)
library(tidyr)
library(tibble)
library(jsonlite)

To get all the play by play and shift data we will need a list of game ids to use in our API calls. To do that we can access the ‘schedule’ endpoint in the API. Initially I thought that play by play data started in 2007 (it actually starts in 2011), so I set it as the start date and then use today’s date as the end date. Then I have a quick and dirty for loop to extract all the game data from each game and put into a data frame. Finally I save it to the ‘schedule.rds’ file. In addition to the game id’s for each game, there is a host of other info including the teams involved, the scores, team records, and venue information.

Now that we have the full schedule, let’s use it get all the play by play data sometimes called the Real Time Scoring System (RTSS) data for every game since the 2011-2012 season. This data has a wealth of information including all important plays (goals, shots, hits, penalties, takeaways, giveaways etc.), which players were involved and even where the play took place in the form of x and y coordinates. All this can be found at the ‘game’ API endpoint. As before, I created a loop to loop through each game (all 11,292 of them) and download the JSON data. The data needs a bit of extra parsing, so we collect it into several dataframes and then consolidate it. The ‘result’, ‘about’, ‘coordinates’, and ‘team’ data is pretty straightforward and we can just column bind into a single data frame with a single row for each play. The player data is a bit more complicated, as it has several rows for each player, with a row for each player involved in the play and what their involvement was, like this:

What I did was consolidate all these individual tables into a single table (after adding the event code) and then using ‘tidy’ functions to spread each of them into one row so they can be joined the rest of the table.  Essentially we create a new column for each ‘role’ in a play (eg. Scorer, Assist, DrewBy etc.) and then populate with the player’s name for each.  Finally, we join it all together, add a few additional datapoints (game id, home team, away team etc.) and then join all the plays together into the master data frame and save it to the ‘plays.rds’ file. 

## Load schedule data
games <- readRDS('../Data/schedule.rds')

## Filter to 2011 and later (first season that had coordinate data)
games <- filter(games, gameDate > '2011-09-01')

## initialize dataframes
df_final <- NULL

## loop through each game and download play data
for (i in 1:length(games$gamePk)) {

  print(i) # counter
  
  ## download play by play data
  link <- paste0("https://statsapi.web.nhl.com",games$link[i])
  df <- fromJSON(link)
  
  ## extract and merge play by play data into one dataframe
  df_result <- df$liveData$plays$allPlays$result
  df_about <- df$liveData$plays$allPlays$about
  df_coord <- df$liveData$plays$allPlays$coordinates
  df_team <- df$liveData$plays$allPlays$team
  df_plays <- cbind(df_result, df_about, df_coord, df_team, row.Names = FALSE)
  
  ## extract play by play data
  df_players <- df$liveData$plays$allPlays$players
  players <- NULL
  for (j in 1:length(df_players)) {
    if (!is.null(df_players[[j]])) {
    tmp <- flatten(df_players[[j]]) %>% select(player.fullName,playerType) %>% mutate(eventCode = df_result$eventCode[j])
    tmp1 <- filter(tmp, playerType == 'Assist') %>% mutate(playerType = paste0(playerType,row_number()))
    tmp <- rbind(filter(tmp, playerType != 'Assist'),tmp1)
    if (length(players) == 0) {
      players <- tmp } else { players <- rbind(players, tmp) }
    }
  }
  
  # this is to deal with row 6515 which has duplicate 'unknown' player values
  players <- distinct(players)
  
  if(!is.null(players)) { players <- spread(players, key=playerType, value=player.fullName) 
  
  df_plays <- left_join(df_plays, players, by = c("eventCode")) %>%
              mutate(gamePk = games$gamePk[i],                      # add game id
                     link = games$link[i],                          # add API link
                     gameType = games$gameType[i],                  # add game type (regular season etc.)
                     away_team = games$teams.away.team.name[i],     # add away team
                     home_team = games$teams.home.team.name[i]) %>% # add home team
              flatten()                                             # flatten any lists
  }

  
  ## add plays to final dataframe 
  if (df_plays > 1) {
  if (length(df_final) == 0) { df_final <- df_plays } else
                             { # add any columns missing in df_final dataframe
                               columns <- names(df_plays[!names(df_plays) %in% names(df_final)])
                               if (length(columns) > 0) {
                                 for (col in 1:length(columns)) {
                                   df_final <- mutate(df_final, !!columns[col] := NA)
                                 }
                               }
                               # add any columns missing in df_plays dataframe
                               columns <- names(df_final[!names(df_final) %in% names(df_plays)])
                               if (length(columns) > 0) {
                                 for (col in 1:length(columns)) {
                                   df_plays <- mutate(df_plays, !!columns[col] := NA)
                                 }
                               }
                               df_final <- rbind(df_final, df_plays) }
  }
}

# save file
saveRDS(df_final, '../Data/plays.rds')

We can check the completeness of our data by comparing to the original game list of 11,292 games from 2011 through 2019. Overall, we are missing data on 251 games or 2.2% of the total. We could try scraping the data from the html game reports, but for my purposes of building some predictive models I’m ok with a few missing games.

missing_games <- games$gamePk[!games$gamePk %in% unique(df_final$gamePk)]

The last set of data I’ll want to get using the API is shift data, ie which players were on the ice at which points during every game. This data is nice and clean and easy to parse from the JSON data using the following code.

df_shift <- NULL

## loop through each game and download play data
for (i in 1:length(games$gamePk)) {
    
  print(i) # counter
    
  ## download shift data
  link <- paste0("http://www.nhl.com/stats/rest/shiftcharts?cayenneExp=gameId=",games$gamePk[i])
  df <- fromJSON(link)
  tmp <- df$data
  
  if (length(df_shift) == 0)
  { df_shift <- tmp } else
  { df_shift <- rbind(df_shift, tmp) }
   
}

# save file
saveRDS(df_shift, '../Data/shifts.rds')

So that’s it! Now we have 3 new data sets which have complete schedules from 2007 (17k rows), play by play data and shift data from 2011 (3.5m and 8.3m rows respectively). Lots of great data to explore. In my next post I’ll begin my attempts to build an adjusted plus minus model.

Using the NHL API – Player and Game Data

Reading Time: 4 minutes

If you are looking to do some analysis, visualization or modelling on NHL data, the NHL API is a fantastic resource. It has loads and loads of data on all aspects of the NHL including team, player and game data. There is incredibly rich play by play data since 2007 for many different game events (goals, shots, hits, faceoffs etc.) and includes information on the players involved and the on ice coordinates where the event occurred. There’s no official documentation, but after some searching I was able to find a couple of fantastic resources that helped me get started. One is the Drew Hynes work documenting all the available API endpoints, and the other is the fantastic NHL scraper that Evolving Hockey has made public.

For this post I wanted to share some R code that I quickly wrote to grab player game by game stats for every player from every team since league inception. Initially I used this to create some ‘top 10’ animated charts for team career point leaders over time, but this data could be useful for a number of analyses and visualizations.

My basic approach was to use the API and make four sets of calls to collect all the data required:

  1. Access the ‘team’ endpoint to get data on all NHL teams
  2. Use the team data to retrieve rosters for each season of each team
  3. Use the roster data to identify the complete NHL player list and then use the ‘people’ endpoint to retrieve basic data about each player
  4. Finally use the player list to pull game by game stats for each player that has played in the NHL since 1917

The end result is just over 2 million rows of data! Let’s take a quick look at the code for each part. Please don’t judge the ‘for’ loops – I was excited and lazy to just get something workable and easy to follow. Maybe one day I will rewrite it more elegantly 🙂

First, the required packages. Dplyr and Tidyr for data manipulation and Jsonlite for API data parsing.

library(dplyr)
library(tidyr)
library(jsonlite)

The first part of collecting the team data is pretty straightforward. There are more than 100 teams identified, but a bunch are for all star teams and other non league teams. After reading through the list, I realized I only needed the first 58 teams. I saved the output to a .csv for a bit of manual cleaning. A few teams were missing starting season dates and I also added the final seasons for defunct teams. I then saved the cleaned team list as ‘teams.rds’

teamids <- paste(c(1:58), collapse = ',')
teams <- fromJSON(paste0("https://statsapi.web.nhl.com/api/v1/teams?teamId=",teamids))

df_team <- teams$teams
write.csv(df_team,"teams.csv")  # manually add the missing start years, and also add end years
df_team <- read.csv("teams.csv")

saveRDS(df_team, file = 'data/teams.rds')

Using the team list, I fetched roster data for each team for each season. Note that I tried to fetch data for the earliest year (1917) to 2018 for each team, even though I had start and end year dates for each team. I was worried that I may miss some data if those dates were incorrect, so I erred on the side of caution. Then I combined all the data, added the team name, team id, and season to each roster year and saved as ‘rosters.rds’.

min_year = min(df_team$firstYearOfPlay)

roster <- NULL
for (id in 1:max(df_team$id)) {
  for (season in min_year:2018) {
    tmp <- try(fromJSON(paste0("https://statsapi.web.nhl.com/api/v1/teams?teamId=",id,"&amp;expand=team.roster&amp;season=",season,season+1)), silent=TRUE)
    if (!grepl("error",tmp)) {
      tmp <- flatten(as.data.frame(tmp$teams$roster$roster)) %>%
             mutate(teamId = id,
                    name = df_team$name[id],
                    season = season)
      if (length(roster) == 0) {
      roster <- tmp } else { roster <- rbind(roster, tmp) }
    } else warning(paste0("Did not find ",df_team$name[id]," ",season))
  }
}

saveRDS(df, file = 'data/rosters.rds')

Using the roster data allowed me to identify all the players who have played in the NHL, a total of almost 8000 players. Using the API’s ‘people’ endpoint, I then fetched basic information about each player like height, weight, position and birthplace. I saved this into ‘players.rds’.

# identify unique players and get all player data for each
player_ids <- unique(rosters$person.id)

# fetch player data
players <- NULL
for (id in player_ids) {
  tmp <- try(fromJSON(paste0("https://statsapi.web.nhl.com/api/v1/people/",id)), silent=TRUE)
  if (!grepl("error",tmp)) {
    tmp <- flatten(tmp$people)

    if (length(players) == 0) { players <- tmp } else
                              { # add any columns missing in players dataframe
                                columns <- names(tmp[!names(tmp) %in% names(players)])
                                if (length(columns) > 0) {
                                  for (col in 1:length(columns)) {
                                    players <- mutate(players, !!columns[col] := NA)
                                  }
                                }
                                # add any columns missing in tmp dataframe
                                columns <- names(players[!names(players) %in% names(tmp)])
                                if (length(columns) > 0) {
                                  for (col in 1:length(columns)) {
                                    tmp <- mutate(tmp, !!columns[col] := NA)
                                  }
                                }
                                players <- rbind(players, tmp) }
  } else warning(paste0("Did not find ", players[id]))
  print(id)
}

saveRDS(players, file = 'data/players.rds')

Now that we had a full player list, we can load the game by game stats for each player for every game they played. This is a lot of data, more than 2 million rows, you may want to consider splitting the queries into several blocks and join together at the end. Unfortunately rbind is a pretty slow operation, especially as a data frame gets larger, if you know of a faster, more efficient way to join data frames, please comment below.

# get game by game data for each play
games <- NULL
for (i in 1:length(players$id)) {
 print(i)
 seasons <- filter(rosters, person.id == players$id[i]) %>% select(season)
 seasons <- unique(seasons)
 for (season in seasons$season) {
   url <- paste0("https://statsapi.web.nhl.com/api/v1/people/",players$id[i],"/stats?stats=gameLog&amp;season=",season, season+1)
   tmp <- try(fromJSON(url), silent=TRUE)
   if(is.data.frame(tmp$stats[[2]][[1]])) {
     tmp <- flatten(tmp$stats[[2]][[1]]) %>%
            mutate(id = players$id[i],
                   fullName = players$fullName[i])
  
     if (length(games) == 0) { games <- tmp } else
     { # add any columns missing in players dataframe
       columns <- names(tmp[!names(tmp) %in% names(games)])
       if (length(columns) > 0) {
         for (col in 1:length(columns)) {
           games <- mutate(games, !!columns[col] := NA)
         }
       }
       # add any columns missing in tmp dataframe
       columns <- names(games[!names(games) %in% names(tmp)])
       if (length(columns) > 0) {
         for (col in 1:length(columns)) {
           tmp <- mutate(tmp, !!columns[col] := NA)
         }
       }
       games <- rbind(games, tmp) }
   }
 }
}

saveRDS(games, file = 'data/games.rds')

I’ve uploaded all the code and data to my GitHub repository, have fun delving into this amazing data resource! Please share any interesting insights or analyses you get out of this data. Thanks for reading.

Animated ‘Top 10’ NHL Scoring Charts With R

Reading Time: 5 minutes

Maybe you’ve seen some of those nifty ‘top 10’ style animated bar charts which show how categories change over time. I thought it would be cool to try and create some of my own with NHL hockey scoring data. This post shares my approach and how I implemented using R.

The first thing we need is some data, specifically season by season point totals of all the players in a team’s history. We could scrape it from directly from a sports or hockey stats website using the rvest package in R. However, in many cases that would violate the website’s terms of service, so I would recommend checking first if you go that route. There’s a better option; using the freely accessible NHL API. There’s no official documentation, so I recommend using the excellent work by Drew Hynes, who has documented many of the NHL API endpoints. There is an astounding amount of data here that the NHL makes available, including play by play and full shift data for the past 10+ seasons. It also has complete team and scoring information going all the way back to 1917 when the league was founded. In a future post I will share the code I used to fetch data from the API, but for now we will just use the final output file. Note that I have posted all code and data to my NHL GitHub repository.

First let’s install the required packages. Tidyverse is the standard workhorse for manipulating, shaping, and visualizing our data and gganimate is the additional package to turn our ggplots into animated ones. I added viridis which has a nice colour palette for our charts. Finally, we need a couple of packages (gifski and png) to render our final animated image files.

library(tidyverse) # data manipulation and plotting
library(gganimate) # chart animation
library(viridis)   # colour palettes
library(gifski)    # image rendering
library(png)       # image rendering

Next, let’s load and prepare data. The ‘games.rds’ file was built by fetching the data of all players who have played in the NHL and their game stats for each game they’ve played (eg. goals, assists, time on ice etc.). It’s a large file with over 2 million rows. This post will show how to create a chart for my favorite team the Vancouver Canucks, but you could easily modify this for your favorite team just by changing the team_name variable. So first we load the data and fix the season field to show for example ‘2018-2019’ instead of ‘2018’ just for readability. Also I added a note to add a middle initial to distinguish between the 2 Greg Adams that played for the Canucks.

team_name <- 'Vancouver Canucks'

# Load previously scraped API data
games <- readRDS('../Data/games.rds') %>%
         mutate(fullName = case_when(id == 8444894 ~ 'Greg D Adams',
                                     id == 8444898 ~ 'Greg C Adams',
                                     TRUE ~ fullName),
                season = paste0(substr(season,1,4),"-",substr(season,5,8)))

Now let’s prepare the data frame that we’re going to use for plotting, which consists of the top 10 players in career points for each Canucks season. First we’ll filter the data to Canucks players only and summarize their season totals and career cumulative points. Then we’ll use a spread and gather technique to fill zeros for all the non playing years of each player. After we gather the data again into a tidy format, we will replicate a player’s career points for each year after their retirement. Finally, we will rank all the players based on career points after each season and then filter only on the top 10 for our chart.

# Data Prep
plot_df <- filter(games, team.name == team_name) %>%
         group_by (season, fullName) %>%
         summarise(Pts = sum(stat.points)) %>%
         select(Player = fullName, Pts, season) %>%
         group_by(Player) %>%
         mutate(Total_Pts = cumsum(Pts),
                First_Season = min(season),
                Last_Season = max(season),
                Career_Pts = max(Total_Pts)) %>%
         select(-Pts) %>%
         spread(key = season, value = Total_Pts, fill = 0)

plot_df <- gather(plot_df, season, Total_Pts, 5:dim(plot_df)[2]) %>%
           mutate(Total_Pts = case_when(season > Last_Season ~ Career_Pts,
                                        TRUE ~ Total_Pts)) %>%
           select(Player, season, Total_Pts)

# Filter data to include only top 10 players for each year
plot_df <- group_by(plot_df, season) %>%
           mutate(rank = rank(-Total_Pts),
                  Value_rel = Total_Pts/Total_Pts[rank==1],
                  Value_lbl = paste0(" ",Total_Pts)) %>%
           group_by(season) %>% 
           filter(rank <=10) %>%
           ungroup()

Now that we’ve got our data ready, we’re ready to plot. First we will create a static horizontal bar plot using ggplot. It looks like a lot of code, but most of it is formatting the plot’s appearance.

# Create Static Plot
staticplot = ggplot(plot_df, aes(rank, group = Player, 
                                       fill = as.factor(Player), color = as.factor(Player))) +
                    scale_fill_viridis(discrete=TRUE) +
                    scale_color_viridis(discrete=TRUE) +
                    geom_tile(aes(y = Total_Pts/2,
                                  height = Total_Pts,
                                  width = 0.9), alpha = 0.8, color = NA) +
                    geom_text(aes(y = 0, label = paste(Player, " ")), vjust = 0.2, hjust = 1, color = "black") +
                    geom_text(aes(y=Total_Pts,label = Value_lbl, hjust=0), color = "black") +
                    coord_flip(clip = "off", expand = FALSE) +
                    scale_y_continuous(labels = scales::comma) +
                    scale_x_reverse() +
                    guides(color = FALSE, fill = FALSE) +
                    theme(axis.line=element_blank(),
                          axis.text.x=element_blank(),
                          axis.text.y=element_blank(),
                          axis.ticks=element_blank(),
                          axis.title.x=element_blank(),
                          axis.title.y=element_blank(),
                          legend.position="none",
                          panel.background=element_blank(),
                          panel.border=element_blank(),
                          panel.grid.major=element_blank(),
                          panel.grid.minor=element_blank(),
                          panel.grid.major.x = element_line( size=.1, color="grey" ),
                          panel.grid.minor.x = element_line( size=.1, color="grey" ),
                          plot.title=element_text(size=25, hjust=0.5, face="bold", colour="grey", vjust=-1),
                          plot.subtitle=element_text(size=18, hjust=0.5, face="italic", color="grey"),
                          plot.caption =element_text(size=8, hjust=0.5, face="italic", color="grey"),
                          plot.background=element_blank(),
                          plot.margin = margin(2,2, 2, 4, "cm"))

Now we create our animated plot by adding the transition_states function from gganimate and use the season field to identify the different states of our animation. Also we add a title and caption.

# Animated Plot
anim = staticplot + transition_states(season, transition_length = 4, state_length = 3) +
       view_follow(fixed_x = TRUE)  +
       labs(title = paste0(team_name," All Time Point Leaders"),  
            subtitle  =  "Season : {closest_state}",
            caption  = "Total Regular Season Pts | Data Source: www.hockey-reference.com")

Finally, we use our animated plot object and feed it into the animate function to render our final animated image file. In this case we will render to an animated gif, though other options are available. I suggest playing around with the different parameters to see how it affects the final output.

# Render Plot
animate(anim, fps = 10, duration = 45, width = 600, height = 400, end_pause = 100, detail = 1, rewind = FALSE,
        renderer = gifski_renderer(paste0(gsub(' ','_',team_name),"_alltime.gif")))

And that’s it, we’ve created a cool looking animated ‘top 10’ charts with just a few lines of R code!

Best of all the code can be easily modified to create charts for different teams or even use completely different data. There is also a lot of flexibility to customize the appearance. Enjoy, and please tag and share any charts you make as I’d love to see what you are able to create! Thanks for reading.

Retrieving Wikipedia Data for Natural Language Processing

Reading Time: 4 minutes

Encyclopedia
Old School Wikipedia

The internet is not just for cat videos anymore, there’s too much useful freely available data to ignore.  In my opinion, being able to easily get data from the internet using API’s is a core skill for any data scientist or analyst. I’ve walked through how easy it is to use an API to get Strava exercise data and Twitter data, today we focus on Wikipedia to find some text data for our Natural Language Processing projects.

We’re going to use R for this project and let some great packages do the heavy lifting; WikipediR for working with the MediaWiki API and rvest for web scraping Wikipedia pages.  As usual there are a number of packages and ways to approach this task, but I found this the easiest and most straightforward.  The output of this code is a text file with text data from all the Wikipedia pages related to a given topic, with the goal of generating a corpus on a domain specific topic for use in training word vectors with Word2Vec.   That said, this code is easily modified to extract a variety of data from Wikipedia.

In addition to the two packages mentioned above, we will load tidyverse for any data manipulation tasks.  Tidyverse is a must have for making data manipulation easy, like this example working with customer survey data.  We’ll also initialize some parameters including our overall topic (Ice hockey) and output filename.  For your particular topic, just search Wikipedia for a category page and include all the text after ‘Category:’.  This script will then extract all the pages and sub categories associated with the topic.

library(tidyverse)            # Data Manipulation
library(WikipediR)            # Wikipedia Queries
library(rvest)                # Web Scraping

category_list <- "Ice hockey" # Set to starting category name
filename <- "data.txt"        # Output filename
total_pages <- c()
categories <- c()
text_data <- as.null()

As a first step, we want to use the MediaWiki API to retrieve a list of all the sub-categories and page titles associated with those categories.  The WikipediR package provides some nice wrapper functions to access the API.

  # retrieve pages and categories
  pages <- pages_in_category("en", "wikipedia", categories = category,  properties = c("title"), type = c("page"))
  sub_cats <- pages_in_category("en", "wikipedia", categories = category,  properties = c("title"), type = c( "subcat"))

The ‘pages_in_category’ function is helpful here by returning Wikipedia sub-category and/or page data depending on the function parameters.  Here we specify English Language Wikipedia with the first two parameters and use the previously defined category name.  The function can return a number of properties, but we only need the title.  Finally we define the element we are looking for (page, sub-category or file), and we assign the list results to pages and sub_cats respectively.  We use a couple of loops to extract each page and subcategory title and add it to our master list.

  # add pages to list  
  if (length(pages$query$categorymembers) > 0 ) {
    for (i in 1:length(pages$query$categorymembers)) {
      total_pages <- c(total_pages, pages$query$categorymembers[[i]]$title)
    }
  }

  # add sub categories to list
  if (length(sub_cats$query$categorymembers) > 0 ) {
    for (i in 1:length(sub_cats$query$categorymembers)) {
      sub_cat = gsub("Category:", "", sub_cats$query$categorymembers[[i]]$title)
      categories <- c(categories, sub_cat)
      next_category_list <- c(next_category_list, sub_cat)
    }
  }

Next, we want to repeat the process with each of the returned sub-categories, to retrieve all of the pages and sub categories associated with that category and add them to the master list.  We will continue ‘drilling down’ until there are no more sub categories or pages associated with that overall topic.  To view the entire code with the looping structure, please check out the code on my github.  Below are the final results, that’s a lot of categories and individual pages about hockey!

[1] "Number of Categories: 7322"
[1] "Number of Pages: 12912"

Next we will extract all the text data from each page using the rvest web scraping package.  We simply loop through the entire page list just created, and then extract all the paragraphs (denoted by the <p> node).  We are leaving some of each web page behind, such as lists and tables, but for this project I figured just grabbing nice clean paragraph and sentence data would be most useful.  All of the paragraphs for all the pages are simply pasted into a single ‘page_text’ variable.

# read all page paragraph data
for (i in 1:length(total_pages)) {
  page = gsub(" ", "_", total_pages[i])
  print(paste0("Loading Page: ", page))
  web_address <- paste0("https://en.wikipedia.org/wiki/",page)
  page_html <- read_html(web_address)
  page_paragraphs <- html_nodes(page_html,"p")
  page_text <- paste(html_text(page_paragraphs), sep = '', collapse = '')
  for (j in 1:length(page_text)) {
    if (is.null(text_data)) { text_data <- page_text }
    else { text_data <- paste(text_data, page_text, sep = '', collapse = '') }
  }
}

Before we write all this juicy text data to a .txt file for future NLP tasks, we should probably do a little data preprocessing.  Let’s remove existing line breaks and replace with line breaks after each sentence so our NLP treats each sentence as an entity.  We also remove all punctuation and citations (eg. [2]) and convert everything to lower case.  You may want to modify these preprocessing steps for your project.  Perhaps you want to distinguish between ‘doors’ of a house and the band ‘Doors’, in which case you wouldn’t want to convert everything to lower case.  I also kept all numerical data as-is, to keep dates and jersey numbers, but you may want to consider removing numbers all together.

# text pre-processing
text_data <- gsub("\n", " ", text_data)       # remove the existing line breaks
text_data <- gsub("\\.", "\r\n", text_data)   # add line breaks
text_data <- gsub("\\[\\d\\]"," ", text_data) # remove citations
text_data <- gsub("[[:punct:]]"," ", text_data) # remove punctuation
text_data <- text_data %>% tolower()

# save text file
write(text_data, filename)

The final step it to write this 6 million word corpus to a text file that we will use in the next post to train word vectors using Word2Vec.

All the code for this task can be found in github.  You can use this to generate text for any topic simply by changing the starting topic page.  Also, getting more familiar with the WikipediR and rvest packages will allow you to modify this script to extract any number of Wikipedia pages and data.

Thanks for reading!

How to Query Twitter Data with R

Reading Time: 4 minutes

Getting Twitter data and tweets is easy in R and can be a great source of text data for NLP applications to better understand your customers or gain insight into topics that are being tweeted about.  In this brief post, I’ll walk through how to easily start searching for tweets in minutes, once you’ve setup your free Twitter developer’s account which allows access to the Twitter API.

Setup Twitter Developer Account
  1. If you’re not already a twitter user, you’ll need to signup for a new account:   www.twitter.com/signup
  2. Then you’ll need to apply for a developer account: https://developer.twitter.com/en/apply/user.  Just follow through and answer the questions about what you are planning to do with Twitter data and submit the application.  In my application I described that I would like to use the account for learning purposes.  After you submit, you’ll get an email asking to confirm your email address.  Finally, you should receive an email saying you’ve been approved (this took a couple days for me) and a link to get started.
  3. From the get started page select the first option to ‘Create an app’.
  4. Fill in the 3 required fields; the name, a description, and a website URL (you can use a personal website or even your twitter page www.twitter.com/{your twitter username}.  For our R code using the rtweet package you will also need to set the optional Callback URL to http://127.0.0.1:1410 and then submit to create the app.
  5. Now you can generate the API keys and access tokens you will need to authenticate and communicate with the Twitter API.  Select the ‘Keys and tokens’ option on the app page and click to generate an access token and access token secret.  Keep this page open, you’ll need these 4 ridiculously long codes in your R script and your app name for later.
Rtweet and User Authentication

Now that you’ve got everything you need to get started, it’s super easy to start searching for tweets in R.  There are several R packages that make it easier to interact with the Twitter API, I found the rtweet package the most current and easy to use.  If you find another that you think is better, please let me know in the comments.

# Packages
library(rtweet) # set Callback as http://127.0.0.1:1410 on Twitter App

# Enter you Twitter app info here 
consumer_key <- "  "
consumer_secret <- "  "
access_token <- "  "
access_secret <- "  "
appname = "  "

twitter_token <- create_token(
  app = appname,
  consumer_key = consumer_key,
  consumer_secret = consumer_secret,
  access_token = access_token,
  access_secret = access_secret)

Replace the blanks above with your consumer key & secret, your access token & secret, and your app name you just created.  Run this code and voila , you are ready to go!

There are a number of things you can do using the API.  In addition to searching for Tweets, you can do pretty much anything else you can do directly in Twitter, including posting new Tweets, following other users, or look at current trends on Twitter.  Although for this tutorial let’s just focus on retrieving Tweets for future text analysis.  To do some of the other tasks, a good place to start is the rtweet manual or vignette, and the Twitter developer documentation.

Below are three quick examples of how to query the Twitter database using the search_tweet function.  Because the search can return nested lists, we use the flatten function before saving with the standard write.csv function.  Note that the basic Standard API developer account only queries tweets from the past week or so, you’ll have to pony up some cash if you want access to more historical data and additional functionality.

Search Tweets by User
tweets <- search_tweets( "from:realDonaldTrump", n = 10000, include_rts = FALSE )
tweets <- flatten(tweets)
write.csv(tweets, paste0(Sys.Date(), "_Trump_tweets.csv"), row.names = FALSE)

By using the ‘from:‘ format in the search term we can search tweets for a particular user, in this case a prolific presidential tweeter.  Did he really Tweet 71 times in the past week?  Apparently so.  We set the number of tweets to retrieve to 10,000 and set to not include retweets by adding include_rts = FALSE.  The search will return a large number of fields, but you can see a sample of the most relevant ones below; date&time, user, tweet, and source.

Search Tweets by Topic
tweets <- search_tweets( "#canucks", n = 10000, include_rts = FALSE )
tweets <- flatten(tweets)
write.csv(tweets, paste0(Sys.Date(), "_Canucks_tweets.csv"), row.names = FALSE)

Similar to the query above, but we replace the search term with ‘#canucks‘, which will return mostly tweets about the Vancouver Canucks hockey team.

Stream Tweets by Topic
tweets <- stream_tweets("canada", timeout = 20)
tweets <- flatten(tweets)
write.csv(tweets, paste0(Sys.Date(), "_Canada_tweets.csv"), row.names = FALSE)

This is a neat query because it let’s you stream tweets in real-time.  For example, the query above will retrieve all tweets mentioning Canada for a period of 20 seconds.

Summary

So that’s it!  In this quick post, I shared how to get set up with the Twitter API and how to query tweets in R using the rtweet package in a quick and easy way.  You can find all the code and sample Tweet data at Github.  In my next post I’ll share how we can use this text data to do some unsupervised topic modelling using word and document vectors.  Happy Tweet Querying!

Classifying Text Data Using Machine Learning

Reading Time: 10 minutesWords on a Page

The statistics are astounding.  Data is everywhere and growing – 90% of all data was generated in the last 2 years!  Much of it is text data, every minute we generate 456,000 Tweets,  510,000 Facebook comments, 16 million text messages, and 156 million emails.  In addition, your own organization is probably collecting a significant amount of text data daily from customers and employees through a number of sources like surveys, employee comments, chat logs, and transcribed phone calls.  How can we mine all that potentially information rich data for actionable insights to improve our customer experience and process efficiency?  Manually reading comments is not going to cut it.  We want to take advantage of machine learning and modern data science techniques.  The challenge is that Natural Language Processing (NLP) can be a confusing topic for newcomers, given the vast range of approaches and techniques for answering a variety of questions about text data.  Plus, this is an area of active research so things are changing all the time.  The good news is we don’t need to understand deep neural networks and word embeddings to get started.  In this post I’d like to share a great tool for classifying text documents into a number of categories using a simple NLP technique called ‘bag of words’.  We’ll use this technique to construct a classifier that you probably encounter every day; one which can pick out spam emails from a set of emails.

How do we approach the problem of teaching a computer to understand the topic of a given text?  We could start by explicitly programming rules to parse sentences into parts of speech to identify nouns and verbs etc.,  so the computer can ‘understand’ what is being written.  But anyone who’s tried to learn a new language knows, there are always exceptions to the rules, and then exceptions to the exceptions.  Plus we need to take into account common spelling and grammatical mistakes and also the fact even correctly written sentences often are ambiguous

This morning I shot an elephant in my pajamas. How he got in my pajamas I don’t know. – Groucho Marx

It’s clear this approach, if it’s even possible, is going to be prohibitively complicated.  Instead, we can approach the problem from a different angle.

Bag of Words

By looking at each of the individual words in a given text, we can use the presence of specific words and their frequency to provide clues or evidence that a text is about a particular topic.  This is called the ‘bag of words’ approach, because it doesn’t take into account word order or grammar, just the words (or sometimes phrases) themselves.  It’s relatively simple, but also surprisingly effective.

For example, suppose we have three short text documents:

  1. The customer service was the worst.
  2. I was happy with the billing.
  3. I found the customer service excellent.

Using ‘bag of words’, sentence one is represented as “The” (2 times),  “customer” (1), “service” (1), “was” (1), “worst” (1).  We can do the same for all the documents and populate in a matrix like this:

Document the customer service was worst I happy with billing found excellent
1 2 1 1 1 1 0 0 0 0 0 0
2 1 0 0 1 0 1 1 1 1 0 0
3 1 1 1 0 0 1 0 0 0 1 1
Tokenization and Term Frequency (TF)

We call this ‘tokenization’, splitting our documents into parts like words, phrases or even letters.  In this case, we’ve split into words and done a couple of other small things like converting all to lower case (so we don’t end up with ‘the’ and ‘The’ as separate words) and removing all punctuation.  We’ve effectively turned our text documents into a numerical representation, which means we could now input this into a machine learning algorithm which uses these word frequencies as features.  We can already see just by scanning the matrix, that sentence 1 & 3 are much closer to each other than say 1 and 2.  Sentence 1 & 3 are more likely to come from the same category.

So you might be saying, ‘hold on a sec’, what about all the information we threw out by ignoring the word order, phrases and context?  Don’t we need those?  Nope.  We are going to sacrifice all that complexity for ease of implementation.  That’s ok because we are going to let the algorithm use statistical methods to tease out the topic information from the words themselves without ‘understanding’ what the write is actually saying.  That doesn’t mean we are done here, because there are a number of things we can do to help improve our classifier by further quantifying how important each word is to the document and topic.  If we consider that each non-zero entry in our matrix represents the ‘strength’ of a feature, ie how much information it provides in terms of the topic then we can evolve our method to ‘weight’ each of those occurrences with more granularity to improve the quality of our features.

Stop Words

We have a number of ‘low value’ words in our documents like ‘I’, ‘the’, ‘was’, ‘with’.  These are words that really don’t tell us anything about the topic of the document, so they add little value.  These are often called ‘stop words’, and most text classification packages in R and Python have functions to remove those automatically based on a preset dictionary of these terms.  You can also manually create a list of words that you think could be safely removed without impacting the quality of the classification model.  Once we do that we’re left with a smaller matrix that has almost the same information:

Document customer service worst happy billing found excellent
1 1 1 1 0 0 0 0
2 0 0 0 1 1 0 0
3 1 1 0 0 0 1 1

Another thing we should consider, especially if we have documents of varying lengths, is to use word proportion of total (count / # of words in document) instead of absolute counts.  This helps normalize our values which will improve algorithm performance.

Stemming

Stemming involves reducing an inflected (or derived) word to it’s base form, and is sometimes useful as part of pre-processing text data for classification.  For example, ‘billing’, ‘billed’, ‘bills’, ‘billable’ could be all reduced to ‘bill’ as part of word stemming.  The advantage is to reduce the number of unique words, and capture the similar meaning between all these words.  Lemmatization is a more complex way of doing this that requires detailed dictionaries for each word, and for that reason we will ignore this method for now despite having the cooler name.

Inverse Document Frequency (IDF)

What about words that appear in every document versus words that only appear in one or two documents, which will have more information or value in determining which category a word belongs to?  If a word is common and appears in every document it is not very useful and should have a low weight.  Conversely, unique words that show up in only a few documents are more likely to be of more value and should have a higher weight.  To incorporate this into our weighting matrix, we can use something call the inverse document frequency which weights uncommon words more highly than common words.

TF-IDF

If we multiply the term frequency weighting (TF) with the inverse document frequency (IDF) we get a commonly used weighting metric called TF-IDF.  It is a commonly used method in classification and document retrieval tasks like search engines.  Again, most text classification packages in Python and R have a function to calculate these values for you.

Pulling It All Together in R

In order to build our classifier we will want to follow a few basic steps based on the principles above.  First load our labelled data, which will have all the actual document data and the manual category label applied to each document.  This will likely be the most time consuming step, as it often requires you to manually label each document to be used to train and test the classifier.  Next we will process the data as we outlined above and create our matrix (often called a document term matrix or document frequency matrix).  Then we will train the model on a portion of our data set using a machine learning algorithm.  Then we use the remaining labelled data to test our model to see how accurately it classifies new data that it wasn’t trained on.  Finally, we output various metrics to assess model performance and make further changes to improve performance if necessary.

So, let’s make it real and apply all these steps into our spam vs. ham classification example using R.  Code below and also at my github repository.  First we load required packages and set parameters

library(quanteda)     # text classification package
library(tidyverse)    # data manipulation

# set parameters
set.seed(1912)
train_prop <- 0.7 # % of data to use for training

Next, we’ll load the text messages that we want to classify.  This data has 2 columns; one with the actual text data and the other with the spam or ham label.  There are 5574 total documents.  We also want to randomize the data for splitting into train and test sets and remove any blank rows.

# read data from csv
df <- read.table("SMSSpamCollection.txt", header=FALSE, sep="\t", quote="", stringsAsFactors=FALSE) # Ham/Spam test data

# prepare data
names(df) <- c("Label", "Text")                         # add column labels
df <- df[sample(nrow(df)),]                             # randomize data
df <- df %>% filter(Text != '') %>% filter(Label != '') # filter blank data

Many text mining packages in R and Python like our data to be data structure called a corpus, which is simply a collection of documents.  From the corpus we start by creating our document frequency matrix (dfm).  This data ends up having 5574 rows (one for each document), and 9127 columns (one for each word that appears in the corpus).  These types of matrices usually have a lot of zeros in them, and are called ‘sparse’ matrices.  We will want to do our best to reduce the number of columns or features to a more manageable number.  This will be more efficient from a memory perspective and also will help our algorithm as a large number of features can impact both performance and accuracy.  Let’s start by applying word stemming, which reduces the number of columns to 7746 by combining similar words.  We can perform another function, dfm_trim, which allows us to remove very infrequent terms and also very short documents.  This results in 1763 columns or features for our algorithm.  Finally, let’s apply the dfm_tfidf function to apply the tf-idf weighting scheme, and we can set the calculation method for both the tf (count) and idf (inverse) term.  The quanteda package allows us to do all this data preparation with only a few short lines of code!

# create document corpus  
df_corpus <- corpus(df$Text)   # convert Text to corpus 
docvars(df_corpus) <- df$Label # add classification label as docvar

# build document term matrix from corpus
df_dfm <- dfm(df_corpus, tolower = TRUE)
# stem words
df_dfm <- dfm_wordstem(df_dfm)
# remove low frequency occurence words                                
df_dfm <- dfm_trim(df_dfm, min_termfreq = 5, min_docfreq = 3)
# tf-idf weighting           
df_dfm <- dfm_tfidf(df_dfm, scheme_tf = "count", scheme_df = "inverse")

Next, we split the data into a training and testing set

# split data train/test
size <- dim(df)
train_end <- round(train_prop*size[1])
test_start <- train_end + 1
test_end <- size[1]

df_train <- df[1:train_end,]
df_test <- df[test_start:test_end,]

df_dfm_train <- df_dfm[1:train_end,]
df_dfm_test <- df_dfm[test_start:test_end,]

Most of our work is now done, building and testing the model is straightforward with only 2 lines of code.  Here we use the quanteda function textmodel_nb to use Naive Bayes algorithm to predict ham/spam.  Of course you could use a number of different algorithms here with similar results, and sometimes specific algorithms work better with specific datasets.

# build model with training set
df_classifier <- textmodel_nb(df_dfm_train, df_train$Label)

# test model with testing set
df_predictions <- predict(df_classifier, newdata = df_dfm_test)

The predict function will return a vector with the predicted category for each document in the test set.  We can compare that to the actual category labels to see how well our model is working using something called a confusion matrix.

Actual Ham Actual Spam
Predicted Ham True Negative False Negative
Predicted Spam False Positive True Positive

The most common classifier assessment metric is accuracy, which is simply the ratio of correct predictions over the total.  While useful because it is easily to calculate and interpret, it can be misleading.  Consider an example where spam represents 99% of the data, with only 1% ham (which feels like my email inbox some days).  If we build a model that predicted spam for every document then we would have a 99% accurate model, but not a very useful one, since it wouldn’t identify any ham emails.  As a result, there are a few more metrics that are commonly used.  ‘Precision’ in simple terms captures what proportion of predicted Spam is actually Spam, and ‘Recall’ captures what proportion of real Spam was predicted as Spam.

Finally, to create a single measure of model ‘goodness’, we can use something call the F1 score (sometimes F-score or F-measure) which is the harmonic mean between Precision and Recall.

An F1 score of 1 indicates perfect precision and recall while 0 is the opposite.  Let’s run the code below to see how our model performs.

conf_matrix <- table(df_predictions, df_test$Label)
accuracy <- (conf_matrix[1,1] + conf_matrix[2,2]) / sum(conf_matrix)
precision <- conf_matrix[2,2] / sum(conf_matrix[2,])
recall <- conf_matrix[2,2] / sum(conf_matrix[,2])
f1_score <- 2 * ((precision * recall) / (precision + recall))

cat("Confidence Matrix:")
conf_matrix
cat("Accuracy: ", accuracy)
cat("Precision: ", precision)
cat("Recall: ", recall)
cat("F1 Score: ", f1_score)
Confidence Matrix:              
df_predictions  ham spam
          ham  1419   11
          spam   23  219
Accuracy:  0.9796651
Precision:  0.9049587
Recall:  0.9521739
F1 Score:  0.9279661

Wow, we were able to generate a model that is 98% accurate with an F1 score of 0.93 with only a few lines of code!  I leave it to the reader to experiment with the model to see if further improvement is possible.  Generally speaking, improving the feature engineering process through the dfm matrix is often the most effective (hint:  in this case, try not stemming words).  Also, more training data is often very effective at improving performance.  Assuming we’re happy with the performance, we can now use this model to predict categories for new unseen data without any additional work.  Cool!

Summary

Hopfully this post has convinced you that you can begin building text classifiers today with only a few lines of code and some labelled data.  In this example we had more than 5000 data points, which definitely helped accuracy.  In my experience, it’s more feasible to label a few hundred documents in a few hours and that should produce a starting accuracy of about 80%.  Give it a try!  Input your data into the code above and see the results.  Let me know what you discover in the comments below.  Note that you can use data that has more than two labelled categories without any changes to the code.

From this initial model there is a lot more you can do to get more sophisticated with your text classification efforts and you are only limited by your time and willingness to learn.  The simplifying assumption of using only single ‘context-free’ words as features made it possible to quickly create such our model but will likely become a limiting factor as you try to increase accuracy even further.  You may want to experiment with including n-grams in your model, which just means sets of consecutive words.  For example bi-grams will include frequencies of all 2 word phrases in a document.  The advantage will be that you can capture some distinctions between things like ‘customer service’ and ‘service provider’, in which a bag of words model treats the word service as the same in both cases.  Creating n-grams is done at the tokenization stage and most packages allow you to specify the ‘n’.  You can also include interaction terms which look for the co-occurrence of 2 words that are not right next to each other.  The trick will be determining which of these new features will be useful to the model while discarding the rest, as n-grams and interactions will increase your matrix size exponentially.  This is where dimensionality reduction techniques will come in handy.  You may also want to learn about newer methods like word2vec or doc2vec which get better at capturing word context through a concept called word embeddings.  I will review some of these methods in future posts, but I wanted to share the basic framework in a simple manner so you could get started today.  Thanks for reading and happy text mining!

You don’t have to see the whole staircase, just take the first step – Martin Luther King Jr.

WAM 25km Post Race Analysis

Reading Time: 2 minutes

Whistler Alpine Meadows 2018. Photos By: Scott Robarts

I finished!

In previous posts I looked at historical Whistler Alpine Meadows 25km race results and applied some simple predictive modelling to help set a personal race goal of 4 hours.  Between those posts and race day, a grizzly bear decided to hang around the trails (maybe getting ready for some race day snacks), which resulted in a total revamp of the courses for all distances.  Coast Mountain Trail Series did an amazing job of making all these changes in the final 2 weeks and still pulling off an amazing event.  The new course was super technical and showed off some of the beautiful scenery around Whistler.  The chief difference though was that the new course had about half the total vertical acsent/descent of the original course.  I suspected that this would result in much faster times and my original goal probably wouldn’t be challenging enough.  Looking at the swarm plots for all three years shows that’s exactly what happened, as 2018 racers were significantly faster than previous years.

WAM 25km Results by Year
Results for the 25km race

Here are the same results in a violin plot with the 25%, 50% and 75% quartiles identified.  You can see median is a lot lower than 2016 or 2017 and sits just over 200 minutes.  My time of 3 hrs and 17 minutes (197 minutes) was just faster than the median.

25km Results with quartiles

Let’s see how I did against my age group of 40-49, where I show up as the purple dot among the light blues.  Definitely not in the top half, as I finished 27th out of 38 in this group.  Room for improvement in my race management, I faded quite a bit in the last 5 km as my legs began to cramp, and a number of runners passed me.  I think I can do a better job of staying fully fueled during the race to minimize fatigue.

25km results for M 40-49

All in all it was a great event, and a ton of fun.  Look forward to next year!

Getting started analyzing customer survey data using R

Reading Time: 6 minutesHow to listen to our customers

Most organizations use surveys to quantify customer satisfaction and loyalty.  It’s helpful to spend as little time as possible preparing and summarizing the results, so you can get straight to looking for actionable insights to improve customer experience.  However, customer survey data is usually a pain to work with, it’s structured in a way that doesn’t lend itself to easy analysis.  In this post I’m going to show how easy it is use R to quickly summarize data.

Messy vs. Tidy Data

The main challenge to overcome is that most survey data is ‘messy’, or ‘short and fat’.  Each survey completion is a single row, with columns representing information about the respondent and also their survey responses.  I’ll review why this can be a problem, but first let’s get some data work with.  Survey data is commonly shared as a .csv file, which can be loaded easily into R

df <- read.csv('My_Survey_Results.csv')

Since I don’t have any data ready for this analysis, let’s just generate some random survey data

set.seed(69)

id <- 1:2000
gender <- sample(c("Male","Female"),2000,replace=TRUE)
age <- sample(c("18-24","25-34","35-54","55-64","65+"),2000,replace=TRUE,prob=c(0.15,0.20,0.3,0.15,0.2))
province <- sample(c("BC","AB","ON","QC"),2000,replace=TRUE,prob=c(0.2,0.2,0.35,0.25))
product <- sample(c("Bag of Glass","Bass-O-Matic","Happy Fun Ball","Little Chocolate Donuts"),2000,replace=TRUE,prob=c(0.1,0.5,0.2,0.2))
Q1 <- sample(c("Very satisfied","Somewhat satisfied","Neither satisfied nor dissatisfied","Somewhat dissatisfied","Very dissatisfied"),2000, replace=TRUE)
Q2 <- sample(c("Very satisfied","Somewhat satisfied","Neither satisfied nor dissatisfied","Somewhat dissatisfied","Very dissatisfied"),2000, replace=TRUE)
Q3 <- sample(c("Strongly agree","Agree","Neither agree nor disagree","Disagree","Strongly disagree"),2000, replace=TRUE)
Q4 <- sample(c("Extremely helpful","Very helpful","Somewhat helpful","Not so helpful","Not at all helpful"),2000, replace=TRUE)
Q5 <- sample(0:10,2000,replace=TRUE)

df <- data.frame(id,gender,age,province,product,Q1,Q2,Q3,Q4,Q5)

We use the set.seed function to ensure we can replicate the same random data again in the future.  Then we use the sample function to generate various fields and load it all into a dataframe.  The end result is something like this:

We’ve got 5 columns with respondent data, including survey id, gender, age group, province, and product purchased.  The remaining 5 columns are the actual survey responses, Q1-Q4 are ranked on various Likert scales, and Q5 is a satisfaction score between 0 and 10 that we’ll use to calculate an NPS score.  If we want to start tabulating response data in Excel, we need to build a separate pivot for each question which is kind of a pain.  We need a simpler way to structure the data for analysis.

Luckily there are a set of fantastic packages in R for data manipulation called dplyr, and tidyr, which are bundled into a collection of packages called the tidyverse.  These were written by Hadley Wickham, who also wrote a great paper on the benefits of tidy data.  First, we want to load the package and then let’s tidy the data

# load library
library(tidyverse)

# convert to tidy data
df_tidy <- gather(df, question, response, 6:dim(df)[2])

We use the gather function to create the magic.  The first parameter specifies the dataframe to use (df), the next 2 parameters define the names for our new tidy data columns (question and response).  Finally, we want to specify the columns to tidy (in this case the questions start in column 6, so let’s gather all columns starting at 6 to the end of the dataframe).  Using the dim function here to define how many columns are in the dataframe will help if we ever decide to add or remove questions from our survey.  No changes to the code will be necessary.  So what’s going to happen when we execute this code?  Easiest to understand by looking directly at the result:

It essentially creates a new row for each question and response pair.  Let’s take a look sorted by id

Here we can see 5 rows for each survey id, representing the answers to each of the 5 questions.  What happens to the first 5 rows that have the respondent info (id, gender etc.) that we didn’t include in our gather function?  They just get repeated five times in their existing columns.  We did that so we have that information available when we want to create a crosstab or analyze our response data by any of those factors.  Overall, our dataframe went from 2000 rows by 10 columns (short and fat) to 10000 rows by 7 columns (tall and thin).  As you can see we could add any number of new questions to our survey and our tidy data will only grow in height not width, since every response will be added in the question and response columns.

Summarize Results

So what if we want to summarize the results for each of our questions, to see how many people responded in each category.  We can do that with the following code

df_summary <- group_by(df_tidy, question, response) %>%
             summarise(n = n()) %>%
             mutate(perc = n / sum(n)) %>%
             arrange(desc(n), .by_group = TRUE)

Again, let’s walk through each dplyr command to see what’s happening.  First we are going to group the data by question and response using the group_by function (which works the same way as the SQL command of same name).  Then we can use the handy pipe operator (%>%) to pass the result to the next function will we use to summarise the grouped data.  We will create a new variable named ‘n’ and assign it the count for each group by using the n() function.  Since I’d like to also know the percentage, I’m also going to add a mutate function to do just that.  Finally, to make it easier to identify the most popular answers, I’m going to arrange my results in a descending order by group (question).

Pretty cool, with only a few short lines of code we are able to summarize our entire survey dataset!  If we add or remove questions or change the possible responses, it will adjust the output accordingly.  If we want to use the results for just one question, use the filter function

filter(df_summary, question == 'Q4')
Crosstabs

Another common thing we might want to do is create a crosstab, a chart showing the relationship between two variables.  For example, say we want to understand if there’s a difference between how people answer Q1 among different provinces.

df_summary <- filter(df_tidy, question == 'Q1') %>%
              group_by(province, response) %>%
              summarise(n = n()) %>%
              mutate(perc = n / sum(n)) %>%
              select(-n) %>%
              spread(key = province, value = perc, fill = 0)

To generate this chart we group and summarize as before, including adding the percentage column.  Then we want to pivot the data back to a ‘short and wide’ format by using the spread function.  It requires a few parameters, a key to identify what to label the new columns (in this case province), and then a value parameter to indicate which column to populate the table with (in this case the percentage).  I also added the optional fill parameter, which specifies what to do if there are no values for a particular cell in the table, in this case it will put a zero.  Using the fill parameter avoids the possibility of ending up with NA values in your table.

Here we can see that it looks like BC respondents are more likely to be very dissatisfied, however we would likely want to check significance with a hypothesis test.  To do that you will also need the sample sizes for each percentage.  This is a straightforward change to the above code, by removing the mutate function and using n for the value column in the spread function.  Practice by making the changes and observing the results.

By using the gather and spread functions together, we are able to easily manipulate data for any number of analyses and visualizations.

Net Promoter Score (NPS)

The last thing we’ll do is calculate the Net Promoter Score.  NPS is a common measure of customer loyalty.  Calculating the NPS for a particular question is very straightforward, we can reuse the code above to summarize the customer satisfaction question Q5.  Then we can use those results to determine the % of promoters and detractors.  NPS is simply the difference between the two.

# summarize Q5 results
df_summary <- filter(df_tidy, question == 'Q5') %>%
              group_by(response) %>%
              summarise(n = n()) %>%
              mutate(perc = n / sum(n)) %>%
              arrange(desc(as.numeric(response)))

# calculate promoters, detractors and NPS
promoters <- sum(df_summary$perc[df_summary$response %in% c(9,10)])
detractors <- sum(df_summary$perc[df_summary$response %in% c(0:6)])
nps <- promoters - detractors
Summary

R is a wonderfully powerful and easy to use tool for analyzing customer survey data (not to mention it’s free!).  This post has covered the following:

  • Transform survey data into tidy data for easy analysis
  • Generate summary counts and percentage splits for each question
  • Create a crosstab to compare a relationship between 2 variables
  • Calculate an NPS score

I’ve only scratched the surface of what we can do with R to derive insights from our data, hopefully this post has provided some tips on getting started.  Any questions, improvements or ideas for future posts, please comment below.  Thanks for reading!

Using Regression to model race performance in Python

Reading Time: 5 minutesIn this post I’ll cover how to do the following in Python:

  • Use the Seaborn library to plot data and trendlines
  • Generate a regression equation using the polyfit function
  • Use the regression model to predict future race times
  • Review how to improve model performance

This is the third and final post in a series on how to visualize and analyze race and personal running data with the goal of estimating future performance.  In the first part I did a bit of exploratory analysis of Whistler Alpine Meadows 25km distance race data to help set an overall goal for finishing time and required pace to achieve that goal.  In the second post I dived into how to use the Strava API to retrieve my activity data that we will use in this final post to build a simple model that can estimate race finishing time.

Using Seaborn to plot polynomial regression line

First let’s load in our data from the .csv file we saved in our last post, so we don’t need to reload the data from the API.  Reading a .csv file is easy using the pandas function

splits = pd.read_csv('18-08-25 New Activity Splits.csv')

Before we return to plotting the data, let’s take another quick look at the data.  Last time we plotted the ‘moving time’ vs. the elevation change, but there is also an ‘elapsed time’ in the data.  Let’s investigate further by creating and plotting a new variable which is the difference between these two times.

splits['time_diff'] = splits['elapsed_time'] - splits['moving_time']

plt.plot( 'elevation_difference', 'time_diff', data=splits, linestyle='', marker='o', markersize=3, alpha=0.1, color="blue")

In most cases the elapsed time and moving time are close, but there are a significant number of points where they are different.  What causes this?  Time spent stationary or with little movement is captured in elapsed time but not moving time.  This confirms what I’ve noticed when logging an activity through Strava, especially on steep or twisty trails where Strava is fooled into thinking you’ve stopped.  For this analysis, I’m going to use elapsed time, even if it means that the few cases where I actually ‘stopped’ for an extended period of time will be included in data.  Using elapsed time will provide a more conservative and realistic estimate of my pace.

Last time we plotted the data using the matplotlib plot function.  This time let’s use the awesome Seaborn library to produce some nicer plots and include some trendlines and confidence intervals, using the function regplot.

sns.regplot(x = 'elevation_difference', y = 'elapsed_time', data = splits ,order = 2)
plt.title('Running Pace vs. Elevation Change', fontsize=18, fontweight="bold")
plt.xticks(fontsize=16)
plt.yticks(fontsize=16)
plt.xlabel('Elevation Change (m)', fontsize=18)
plt.ylabel('1km Pace (sec)', fontsize=18)

Notice we used the parameter order to specify which order polynomial to try and fit to the data.  I used 2 in this case, which produces a nice parabola which approximates the data pretty well.  As a stats refresher, the equation for a second degree polynomial (also known as a quadratic) is y = ax² + bx + c.  The light blue cone represents the 95% confidence interval, which is calculated using the bootstrap method.  One drawback of this plot is it doesn’t allow us the flexibility of setting the various visual parameters that the matplotlib plot functions does.  Specifically, I’d like to make the individual points look like those in the first plot by changing the alpha level to better show the point density.  Luckily, Python makes this easy by allowing us to combine 2 plot functions onto one plot.  I use the plot function to plot the individual points, and the regplot function to plot the trendline and confidence interval.  Use ‘scatter=None’ to suppress plotting the individual points in the regplot.

plt.plot( 'elevation_difference', 'elapsed_time', data=splits, linestyle='', marker='o', markersize=5, alpha=0.1, color="blue")
sns.regplot(x = 'elevation_difference', y = 'elapsed_time', scatter=None, data = splits ,order = 2)
plt.title('Running Pace vs. Elevation Change', fontsize=18, fontweight="bold")
plt.xticks(fontsize=16)
plt.yticks(fontsize=16)
plt.xlabel('Elevation Change (m)', fontsize=18)
plt.ylabel('1km Pace (sec)', fontsize=18)

Using Polyfit to generate the equation for the fitted model

So here’s the main drawback of using regplot, there’s no ability to have it provide the coefficients for the fitted lines and confidence intervals.  If anyone knows how to do this, I would love to hear about it in the comments!  So let’s rely on a Numpy function, polyfit, to give the equation

coeff = np.polyfit(splits['elevation_difference'], splits['elapsed_time'], 2)

That will produce the following coefficient array (in decreasing order):

array([7.40646826e-03, 6.30941912e-01, 3.74015634e+02])

So our complete equation is:  y = 0.0074*x² + 0.6310*x + 374

Apply equation to WAM course profile to estimate total time

Finally, let’s apply our model to the WAM course profile, which I manually created as a .csv file.  Then we calculate the time using the coefficients from the polyfit function above.

# Load WAM course data
WAM = pd.read_csv('WAM_25k_course.csv')

# Calculate estimated time for each km based on elevation change
WAM['estimated_time'] = coeff[0]*WAM['elevation']**2 + coeff[1]*WAM['elevation'] + coeff[2]

This is what the overall data looks like:

    km  elevation  estimated_time
0    1          0      374.015634
1    2          0      374.015634
2    3         13      383.469572
3    4         18      387.772284
4    5         68      451.167193
5    6        203      807.309992
6    7        158      658.599529
7    8         32      401.789998
8    9         27      396.450381
9   10        141      610.226439
10  11        190      761.268101
11  12        310     1281.369227
12  13       -120      404.955747
13  14        -23      363.421991
14  15        -78      369.863117
15  16         24      393.424365
16  17        -43      360.579691
17  18        -60      362.822405
18  19        -16      365.816619
19  20        -93      379.396580
20  21       -167      475.207328
21  22       -181      502.458454
22  23       -165      471.551317
23  24       -128      414.602645
24  25        -79      370.394991

Adding up the all the times and converting to minutes

WAM['estimated_time'].sum() / 60

This gives an estimated time of 202 minutes (3 hrs and 22 minutes).  That would be an amazing time!  But I suspect that it’s a bit optimistic as it uses a number of runs done on smooth road or track which will be much faster than a trail run.  To try and get a more accurate estimate, I went and manually classified my runs over the last year as either ‘trail’, ‘road’, or ‘track’ and entered the information in the description field of the activity on Strava.  After retrieving only the classified data again using the Strava API, I use the code below to recalculate my estimated finishing time

splits_trail = splits[splits['description'] == 'Trail']
coeff_trail = np.polyfit(splits_trail['elevation_difference'], splits_trail['elapsed_time'], 2)
WAM['estimated_time_trail'] = coeff_trail[0]*WAM['elevation']**2 + coeff_trail[1]*WAM['elevation'] + coeff_trail[2]
WAM['estimated_time_trail'].sum() / 60

This time I get an estimated finish time of 242 minutes (4 hrs and 2 minutes), which is almost exactly my goal of finishing in the middle of the pack!

Final Thoughts

This has been an interesting exercise and provided quite a bit of insight through some exploratory data analysis and some simple modelling that was relatively quick and easy to do.  This is always a good approach, as it allows you to iterate quickly to understand the process and data more fully, before diving into more complicated and time consuming modelling techniques.  Our next step would likely be to build a more complex regression model and/or another popular machine learning algorithm like Random Forest which can utilize other potential factors in estimating pace.  We already identified that the type of surface is almost certainly a factor in estimating performance.  There are some other hypothesized factors that we could add to train our model to see if it improves performance:

  • Fatigue estimate (split completed at beginning, middle or end of activity)
  • Temperature (hot day vs cold day)
  • More granular terrain classifications (ie smooth trail vs. technical trail)

Perhaps I will tackle this in a future post, but for now you have a solid set of tools to do some pretty cool analysis of your own activities.  We learned how to scrape race data from the web and retrieve data using an API, some creative ways to to visualize that data, and finally how to build a simple regression model to predict future performance.  Pretty cool!