Using the NHL API – Player and Game Data

Reading Time: 4 minutes

If you are looking to do some analysis, visualization or modelling on NHL data, the NHL API is a fantastic resource. It has a huge amount of data on all aspects of the NHL, including team, player and game data. There is incredibly rich play-by-play data going back to 2007 for many different game events (goals, shots, hits, faceoffs etc.), including the players involved and the on-ice coordinates where each event occurred. There’s no official documentation, but after some searching I found a couple of fantastic resources that helped me get started. One is Drew Hynes’ work documenting all the available API endpoints, and the other is the fantastic NHL scraper that Evolving Hockey has made public.

For this post I wanted to share some R code that I quickly wrote to grab player game by game stats for every player from every team since league inception. Initially I used this to create some ‘top 10’ animated charts for team career point leaders over time, but this data could be useful for a number of analyses and visualizations.

My basic approach was to use the API and make four sets of calls to collect all the data required:

  1. Access the ‘team’ endpoint to get data on all NHL teams
  2. Use the team data to retrieve rosters for each season of each team
  3. Use the roster data to identify the complete NHL player list and then use the ‘people’ endpoint to retrieve basic data about each player
  4. Finally use the player list to pull game by game stats for each player that has played in the NHL since 1917

The end result is just over 2 million rows of data! Let’s take a quick look at the code for each part. Please don’t judge the ‘for’ loops – I was excited (and a little lazy) and just wanted something workable and easy to follow. Maybe one day I will rewrite it more elegantly 🙂

First, the required packages: dplyr and tidyr for data manipulation, and jsonlite for parsing the API’s JSON responses.

library(dplyr)
library(tidyr)
library(jsonlite)

The first part, collecting the team data, is pretty straightforward. There are more than 100 teams identified, but a bunch are all-star teams and other non-league teams. After reading through the list, I realized I only needed the first 58 teams. I saved the output to a .csv for a bit of manual cleaning: a few teams were missing starting season dates, and I also added the final seasons for defunct teams. I then saved the cleaned team list as ‘teams.rds’.

teamids <- paste(c(1:58), collapse = ',')
teams <- fromJSON(paste0("https://statsapi.web.nhl.com/api/v1/teams?teamId=",teamids))

df_team <- teams$teams
write.csv(df_team,"teams.csv")  # manually add the missing start years, and also add end years
df_team <- read.csv("teams.csv")

saveRDS(df_team, file = 'data/teams.rds')

Using the team list, I fetched roster data for each team for each season. Note that I tried to fetch data from the earliest year (1917) to 2018 for every team, even though I had start and end years for each one. I was worried that I might miss some data if those dates were incorrect, so I erred on the side of caution. Then I combined all the data, added the team name, team id, and season to each roster year, and saved it as ‘rosters.rds’.

min_year = min(df_team$firstYearOfPlay)

roster <- NULL
for (id in 1:max(df_team$id)) {
  for (season in min_year:2018) {
    tmp <- try(fromJSON(paste0("https://statsapi.web.nhl.com/api/v1/teams?teamId=",id,"&expand=team.roster&season=",season,season+1)), silent=TRUE)
    if (!inherits(tmp, "try-error")) {
      tmp <- flatten(as.data.frame(tmp$teams$roster$roster)) %>%
             mutate(teamId = id,
                    name = df_team$name[id],
                    season = season)
      if (length(roster) == 0) {
      roster <- tmp } else { roster <- rbind(roster, tmp) }
    } else warning(paste0("Did not find ",df_team$name[id]," ",season))
  }
}

saveRDS(roster, file = 'data/rosters.rds')

Using the roster data allowed me to identify all the players who have played in the NHL, a total of almost 8000 players. Using the API’s ‘people’ endpoint, I then fetched basic information about each player like height, weight, position and birthplace. I saved this into ‘players.rds’.

# identify unique players and get all player data for each
rosters <- readRDS('data/rosters.rds')
player_ids <- unique(rosters$person.id)

# fetch player data
players <- NULL
for (id in player_ids) {
  tmp <- try(fromJSON(paste0("https://statsapi.web.nhl.com/api/v1/people/",id)), silent=TRUE)
  if (!inherits(tmp, "try-error")) {
    tmp <- flatten(tmp$people)

    if (length(players) == 0) { players <- tmp } else
                              { # add any columns missing in players dataframe
                                columns <- names(tmp[!names(tmp) %in% names(players)])
                                if (length(columns) > 0) {
                                  for (col in 1:length(columns)) {
                                    players <- mutate(players, !!columns[col] := NA)
                                  }
                                }
                                # add any columns missing in tmp dataframe
                                columns <- names(players[!names(players) %in% names(tmp)])
                                if (length(columns) > 0) {
                                  for (col in 1:length(columns)) {
                                    tmp <- mutate(tmp, !!columns[col] := NA)
                                  }
                                }
                                players <- rbind(players, tmp) }
  } else warning(paste0("Did not find ", players[id]))
  print(id)
}

saveRDS(players, file = 'data/players.rds')

Now that we have a full player list, we can load the game-by-game stats for every game each player has played. This is a lot of data – more than 2 million rows – so you may want to consider splitting the queries into several blocks and joining them together at the end. Unfortunately rbind is a pretty slow operation, especially as a data frame gets larger; if you know of a faster, more efficient way to join data frames, please comment below.
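One option I haven’t tested here: accumulate each piece in a list and do a single bind at the end with dplyr’s bind_rows, which is much faster than repeated rbind calls and also fills any mismatched columns with NA (so the manual column-matching code below wouldn’t be needed). A minimal sketch of the pattern, with a stand-in data frame in place of the real API call:

library(dplyr)

# collect each per-player data frame in a list, then bind once at the end
chunks <- list()
for (i in 1:3) {
  # stand-in for the flattened game log built inside the real loop below
  chunks[[i]] <- data.frame(playerId = i, points = i * 2)
}
games_fast <- bind_rows(chunks)  # one bind; missing columns are filled with NA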

# get game by game data for each player
games <- NULL
for (i in 1:length(players$id)) {
 print(i)
 seasons <- filter(rosters, person.id == players$id[i]) %>% select(season)
 seasons <- unique(seasons)
 for (season in seasons$season) {
   url <- paste0("https://statsapi.web.nhl.com/api/v1/people/",players$id[i],"/stats?stats=gameLog&season=",season, season+1)
   tmp <- try(fromJSON(url), silent=TRUE)
   if (!inherits(tmp, "try-error") && is.data.frame(tmp$stats[[2]][[1]])) {
     tmp <- flatten(tmp$stats[[2]][[1]]) %>%
            mutate(id = players$id[i],
                   fullName = players$fullName[i])
  
     if (length(games) == 0) { games <- tmp } else
     { # add any columns missing in players dataframe
       columns <- names(tmp[!names(tmp) %in% names(games)])
       if (length(columns) > 0) {
         for (col in 1:length(columns)) {
           games <- mutate(games, !!columns[col] := NA)
         }
       }
       # add any columns missing in tmp dataframe
       columns <- names(games[!names(games) %in% names(tmp)])
       if (length(columns) > 0) {
         for (col in 1:length(columns)) {
           tmp <- mutate(tmp, !!columns[col] := NA)
         }
       }
       games <- rbind(games, tmp) }
   }
 }
}

saveRDS(games, file = 'data/games.rds')

I’ve uploaded all the code and data to my GitHub repository – have fun delving into this amazing data resource! Please share any interesting insights or analyses you get out of this data. Thanks for reading.

Animated ‘Top 10’ NHL Scoring Charts With R

Reading Time: 5 minutes

Maybe you’ve seen some of those nifty ‘top 10’ style animated bar charts which show how categories change over time. I thought it would be cool to try and create some of my own with NHL hockey scoring data. This post shares my approach and how I implemented using R.

The first thing we need is some data, specifically season-by-season point totals for all the players in a team’s history. We could scrape it directly from a sports or hockey stats website using the rvest package in R. However, in many cases that would violate the website’s terms of service, so I would recommend checking first if you go that route. There’s a better option: the freely accessible NHL API. There’s no official documentation, so I recommend the excellent work by Drew Hynes, who has documented many of the NHL API endpoints. The amount of data the NHL makes available is astounding, including play-by-play and full shift data for the past 10+ seasons, along with complete team and scoring information going all the way back to the league’s founding in 1917. In a future post I will share the code I used to fetch data from the API, but for now we will just use the final output file. Note that I have posted all code and data to my NHL GitHub repository.

First let’s load the required packages. The tidyverse is the standard workhorse for manipulating, shaping, and visualizing our data, and gganimate is the additional package that turns our ggplots into animated ones. I added viridis, which has a nice colour palette for our charts. Finally, we need a couple of packages (gifski and png) to render our final animated image files.

library(tidyverse) # data manipulation and plotting
library(gganimate) # chart animation
library(viridis)   # colour palettes
library(gifski)    # image rendering
library(png)       # image rendering

Next, let’s load and prepare the data. The ‘games.rds’ file was built by fetching data for every player who has played in the NHL, along with their stats for each game they’ve played (e.g. goals, assists, time on ice). It’s a large file with over 2 million rows. This post will show how to create a chart for my favorite team, the Vancouver Canucks, but you could easily modify it for your favorite team just by changing the team_name variable. First we load the data and fix the season field to show, for example, ‘2018-2019’ instead of the API’s ‘20182019’, just for readability. I also added a middle initial to distinguish between the two Greg Adamses who played for the Canucks.

team_name <- 'Vancouver Canucks'

# Load previously scraped API data
games <- readRDS('../Data/games.rds') %>%
         mutate(fullName = case_when(id == 8444894 ~ 'Greg D Adams',
                                     id == 8444898 ~ 'Greg C Adams',
                                     TRUE ~ fullName),
                season = paste0(substr(season,1,4),"-",substr(season,5,8)))

Now let’s prepare the data frame that we’re going to use for plotting, which consists of the top 10 players in career points after each Canucks season. First we’ll filter the data to Canucks players only and summarise their season totals and cumulative career points. Then we’ll use a spread and gather technique to fill in zeros for each player’s non-playing years. After we gather the data back into a tidy format, we carry a player’s career points forward for each year after their retirement. Finally, we rank all the players on career points after each season and keep only the top 10 for our chart.

# Data Prep
plot_df <- filter(games, team.name == team_name) %>%
         group_by (season, fullName) %>%
         summarise(Pts = sum(stat.points)) %>%
         select(Player = fullName, Pts, season) %>%
         group_by(Player) %>%
         mutate(Total_Pts = cumsum(Pts),
                First_Season = min(season),
                Last_Season = max(season),
                Career_Pts = max(Total_Pts)) %>%
         select(-Pts) %>%
         spread(key = season, value = Total_Pts, fill = 0)

plot_df <- gather(plot_df, season, Total_Pts, 5:dim(plot_df)[2]) %>%
           mutate(Total_Pts = case_when(season > Last_Season ~ Career_Pts,
                                        TRUE ~ Total_Pts)) %>%
           select(Player, season, Total_Pts)

# Filter data to include only top 10 players for each year
plot_df <- group_by(plot_df, season) %>%
           mutate(rank = rank(-Total_Pts),
                  Value_rel = Total_Pts/Total_Pts[rank==1],
                  Value_lbl = paste0(" ",Total_Pts)) %>%
           group_by(season) %>% 
           filter(rank <=10) %>%
           ungroup()

Now that we’ve got our data ready, we’re ready to plot. First we will create a static horizontal bar plot using ggplot. It looks like a lot of code, but most of it is formatting the plot’s appearance.

# Create Static Plot
staticplot = ggplot(plot_df, aes(rank, group = Player, 
                                       fill = as.factor(Player), color = as.factor(Player))) +
                    scale_fill_viridis(discrete=TRUE) +
                    scale_color_viridis(discrete=TRUE) +
                    geom_tile(aes(y = Total_Pts/2,
                                  height = Total_Pts,
                                  width = 0.9), alpha = 0.8, color = NA) +
                    geom_text(aes(y = 0, label = paste(Player, " ")), vjust = 0.2, hjust = 1, color = "black") +
                    geom_text(aes(y=Total_Pts,label = Value_lbl, hjust=0), color = "black") +
                    coord_flip(clip = "off", expand = FALSE) +
                    scale_y_continuous(labels = scales::comma) +
                    scale_x_reverse() +
                    guides(color = FALSE, fill = FALSE) +
                    theme(axis.line=element_blank(),
                          axis.text.x=element_blank(),
                          axis.text.y=element_blank(),
                          axis.ticks=element_blank(),
                          axis.title.x=element_blank(),
                          axis.title.y=element_blank(),
                          legend.position="none",
                          panel.background=element_blank(),
                          panel.border=element_blank(),
                          panel.grid.major=element_blank(),
                          panel.grid.minor=element_blank(),
                          panel.grid.major.x = element_line( size=.1, color="grey" ),
                          panel.grid.minor.x = element_line( size=.1, color="grey" ),
                          plot.title=element_text(size=25, hjust=0.5, face="bold", colour="grey", vjust=-1),
                          plot.subtitle=element_text(size=18, hjust=0.5, face="italic", color="grey"),
                          plot.caption =element_text(size=8, hjust=0.5, face="italic", color="grey"),
                          plot.background=element_blank(),
                          plot.margin = margin(2,2, 2, 4, "cm"))

Now we create our animated plot by adding the transition_states function from gganimate, using the season field to identify the different states of our animation. We also add a title and caption.

# Animated Plot
anim = staticplot + transition_states(season, transition_length = 4, state_length = 3) +
       view_follow(fixed_x = TRUE)  +
       labs(title = paste0(team_name," All Time Point Leaders"),  
            subtitle  =  "Season : {closest_state}",
            caption  = "Total Regular Season Pts | Data Source: www.hockey-reference.com")

Finally, we feed our animated plot object into the animate function to render the final animated image file. In this case we will render to an animated gif, though other options are available. I suggest playing around with the different parameters to see how they affect the final output.

# Render Plot
animate(anim, fps = 10, duration = 45, width = 600, height = 400, end_pause = 100, detail = 1, rewind = FALSE,
        renderer = gifski_renderer(paste0(gsub(' ','_',team_name),"_alltime.gif")))
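If you would rather have a video file than a gif, gganimate can also render to mp4 through the av package. Roughly (an untested sketch – the av package needs to be installed):

# render the same animation to an mp4 instead of a gif
animate(anim, fps = 10, duration = 45, width = 600, height = 400, end_pause = 100,
        renderer = av_renderer(paste0(gsub(' ','_',team_name), "_alltime.mp4")))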

And that’s it – we’ve created a cool-looking animated ‘top 10’ chart with just a few lines of R code!

Best of all, the code can easily be modified to create charts for different teams or even use completely different data, and there is a lot of flexibility to customize the appearance. Enjoy, and please tag and share any charts you make as I’d love to see what you are able to create! Thanks for reading.

Retrieving Wikipedia Data for Natural Language Processing

Reading Time: 4 minutes


The internet is not just for cat videos anymore – there’s too much useful, freely available data to ignore.  In my opinion, being able to easily get data from the internet using APIs is a core skill for any data scientist or analyst.  I’ve walked through how easy it is to use an API to get Strava exercise data and Twitter data; today we focus on Wikipedia to find some text data for our Natural Language Processing projects.

We’re going to use R for this project and let some great packages do the heavy lifting: WikipediR for working with the MediaWiki API and rvest for web scraping Wikipedia pages.  As usual there are a number of packages and ways to approach this task, but I found this the easiest and most straightforward.  The output of this code is a text file containing the text from all the Wikipedia pages related to a given topic, with the goal of generating a domain-specific corpus for use in training word vectors with Word2Vec.  That said, this code is easily modified to extract a variety of data from Wikipedia.

In addition to the two packages mentioned above, we will load tidyverse for any data manipulation tasks.  Tidyverse is a must-have for making data manipulation easy, like this example working with customer survey data.  We’ll also initialize some parameters, including our overall topic (Ice hockey) and the output filename.  For your particular topic, just search Wikipedia for a category page and include all the text after ‘Category:’.  This script will then extract all the pages and sub-categories associated with the topic.

library(tidyverse)            # Data Manipulation
library(WikipediR)            # Wikipedia Queries
library(rvest)                # Web Scraping

category_list <- "Ice hockey" # Set to starting category name
filename <- "data.txt"        # Output filename
total_pages <- c()
categories <- c()
text_data <- NULL

As a first step, we want to use the MediaWiki API to retrieve a list of all the sub-categories and page titles associated with those categories.  The WikipediR package provides some nice wrapper functions to access the API.

  # retrieve pages and sub-categories for the current category
  # ('category' is the category being processed on each pass of the loop – see the full script on GitHub)
  pages <- pages_in_category("en", "wikipedia", categories = category,  properties = c("title"), type = c("page"))
  sub_cats <- pages_in_category("en", "wikipedia", categories = category,  properties = c("title"), type = c("subcat"))

The ‘pages_in_category’ function is helpful here by returning Wikipedia sub-category and/or page data depending on the function parameters.  Here we specify English Language Wikipedia with the first two parameters and use the previously defined category name.  The function can return a number of properties, but we only need the title.  Finally we define the element we are looking for (page, sub-category or file), and we assign the list results to pages and sub_cats respectively.  We use a couple of loops to extract each page and subcategory title and add it to our master list.

  # add pages to list  
  if (length(pages$query$categorymembers) > 0 ) {
    for (i in 1:length(pages$query$categorymembers)) {
      total_pages <- c(total_pages, pages$query$categorymembers[[i]]$title)
    }
  }

  # add sub categories to list
  if (length(sub_cats$query$categorymembers) > 0 ) {
    for (i in 1:length(sub_cats$query$categorymembers)) {
      sub_cat = gsub("Category:", "", sub_cats$query$categorymembers[[i]]$title)
      categories <- c(categories, sub_cat)
      next_category_list <- c(next_category_list, sub_cat)
    }
  }

Next, we want to repeat the process with each of the returned sub-categories to retrieve all of the pages and sub-categories associated with them, adding those to the master lists as well.  We continue ‘drilling down’ until there are no more sub-categories or pages associated with the overall topic.  To view the entire code with the looping structure, please check out the code on my GitHub.  Below are the final results – that’s a lot of categories and individual pages about hockey!

[1] "Number of Categories: 7322"
[1] "Number of Pages: 12912"

Next we will extract all the text data from each page using the rvest web scraping package.  We simply loop through the entire page list we just created and extract all the paragraphs (denoted by the <p> node).  We are leaving some of each web page behind, such as lists and tables, but for this project I figured just grabbing nice clean paragraph and sentence data would be most useful.  All of the paragraphs for a page are pasted into a single ‘page_text’ variable, which is then appended to the running ‘text_data’ corpus.

# read all page paragraph data
for (i in 1:length(total_pages)) {
  page = gsub(" ", "_", total_pages[i])
  print(paste0("Loading Page: ", page))
  web_address <- paste0("https://en.wikipedia.org/wiki/",page)
  page_html <- read_html(web_address)
  page_paragraphs <- html_nodes(page_html,"p")
  page_text <- paste(html_text(page_paragraphs), sep = '', collapse = '')
  # append this page's text to the running corpus
  if (is.null(text_data)) { text_data <- page_text }
  else { text_data <- paste(text_data, page_text, sep = '', collapse = '') }
}

Before we write all this juicy text data to a .txt file for future NLP tasks, we should probably do a little preprocessing.  Let’s remove the existing line breaks and instead add a line break after each sentence, so our NLP treats each sentence as an entity.  We also remove citations (e.g. [2]) and all punctuation, and convert everything to lower case.  You may want to modify these preprocessing steps for your project.  Perhaps you want to distinguish between the ‘doors’ of a house and the band ‘Doors’, in which case you wouldn’t want to convert everything to lower case.  I also kept all numerical data as-is, to keep dates and jersey numbers, but you may want to consider removing numbers altogether.

# text pre-processing
text_data <- gsub("\n", " ", text_data)       # remove the existing line breaks
text_data <- gsub("\\.", "\r\n", text_data)   # add line breaks
text_data <- gsub("\\[\\d\\]"," ", text_data) # remove citations
text_data <- gsub("[[:punct:]]"," ", text_data) # remove punctuation
text_data <- text_data %>% tolower()

# save text file
write(text_data, filename)

The final step is to write this 6-million-word corpus to a text file that we will use in the next post to train word vectors using Word2Vec.

All the code for this task can be found on GitHub.  You can use it to generate text for any topic simply by changing the starting category.  Also, getting more familiar with the WikipediR and rvest packages will allow you to modify this script to extract any number of Wikipedia pages and data.

Thanks for reading!

How to Query Twitter Data with R

Reading Time: 4 minutes

Getting Twitter data and tweets is easy in R, and they can be a great source of text data for NLP applications to better understand your customers or gain insight into the topics being tweeted about.  In this brief post, I’ll walk through how to easily start searching for tweets in minutes, once you’ve set up the free Twitter developer account that allows access to the Twitter API.

Setup Twitter Developer Account
  1. If you’re not already a Twitter user, you’ll need to sign up for a new account:   www.twitter.com/signup
  2. Then you’ll need to apply for a developer account: https://developer.twitter.com/en/apply/user.  Just follow through and answer the questions about what you are planning to do with Twitter data and submit the application.  In my application I described that I would like to use the account for learning purposes.  After you submit, you’ll get an email asking you to confirm your email address.  Finally, you should receive an email saying you’ve been approved (this took a couple of days for me) with a link to get started.
  3. From the get started page select the first option to ‘Create an app’.
  4. Fill in the 3 required fields: the name, a description, and a website URL (you can use a personal website or even your Twitter page, www.twitter.com/{your twitter username}).  For our R code using the rtweet package you will also need to set the optional Callback URL to http://127.0.0.1:1410 and then submit to create the app.
  5. Now you can generate the API keys and access tokens you will need to authenticate and communicate with the Twitter API.  Select the ‘Keys and tokens’ option on the app page and click to generate an access token and access token secret.  Keep this page open – you’ll need these 4 ridiculously long codes in your R script, along with your app name, later.
Rtweet and User Authentication

Now that you’ve got everything you need to get started, it’s super easy to start searching for tweets in R.  There are several R packages that make it easier to interact with the Twitter API; I found the rtweet package the most current and easiest to use.  If you find another that you think is better, please let me know in the comments.

# Packages
library(rtweet) # set Callback as http://127.0.0.1:1410 on Twitter App

# Enter you Twitter app info here 
consumer_key <- "  "
consumer_secret <- "  "
access_token <- "  "
access_secret <- "  "
appname = "  "

twitter_token <- create_token(
  app = appname,
  consumer_key = consumer_key,
  consumer_secret = consumer_secret,
  access_token = access_token,
  access_secret = access_secret)

Replace the blanks above with your consumer key & secret, your access token & secret, and the name of the app you just created.  Run this code and voilà, you are ready to go!

There are a number of things you can do using the API.  In addition to searching for Tweets, you can do pretty much anything else you can do directly in Twitter, including posting new Tweets, following other users, or looking at current trends (a couple of quick examples are sketched below).  For this tutorial, though, let’s just focus on retrieving Tweets for future text analysis.  To explore the other tasks, a good place to start is the rtweet manual or vignette, and the Twitter developer documentation.
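For example, posting a tweet, following an account, or pulling current trends looks roughly like this (a quick sketch I haven’t run here – check the rtweet documentation for the exact arguments; the account name is just a placeholder):

# publish a new tweet from the authenticated account
post_tweet("Posting to Twitter from R with rtweet!")

# follow another account (placeholder username)
post_follow("rstudio")

# current trending topics for a location
trends <- get_trends("canada")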

Below are three quick examples of how to query the Twitter database using the search_tweets function.  Because the search can return nested lists, we use the flatten function before saving with the standard write.csv function.  Note that the basic Standard API developer account only queries tweets from the past week or so; you’ll have to pony up some cash if you want access to more historical data and additional functionality.

Search Tweets by User
tweets <- search_tweets( "from:realDonaldTrump", n = 10000, include_rts = FALSE )
tweets <- flatten(tweets)
write.csv(tweets, paste0(Sys.Date(), "_Trump_tweets.csv"), row.names = FALSE)

By using the ‘from:’ format in the search term we can search tweets for a particular user, in this case a prolific presidential tweeter.  Did he really Tweet 71 times in the past week?  Apparently so.  We set the number of tweets to retrieve to 10,000 and exclude retweets by adding include_rts = FALSE.  The search returns a large number of fields; the most relevant ones include date & time, user, tweet text, and source.

Search Tweets by Topic
tweets <- search_tweets( "#canucks", n = 10000, include_rts = FALSE )
tweets <- flatten(tweets)
write.csv(tweets, paste0(Sys.Date(), "_Canucks_tweets.csv"), row.names = FALSE)

Similar to the query above, but we replace the search term with ‘#canucks‘, which will return mostly tweets about the Vancouver Canucks hockey team.

Stream Tweets by Topic
tweets <- stream_tweets("canada", timeout = 20)
tweets <- flatten(tweets)
write.csv(tweets, paste0(Sys.Date(), "_Canada_tweets.csv"), row.names = FALSE)

This is a neat query because it lets you stream tweets in real time.  For example, the query above will retrieve all tweets mentioning Canada over a period of 20 seconds.

Summary

So that’s it!  In this quick post, I shared how to get set up with the Twitter API and how to query tweets in R using the rtweet package in a quick and easy way.  You can find all the code and sample Tweet data on GitHub.  In my next post I’ll share how we can use this text data to do some unsupervised topic modelling using word and document vectors.  Happy Tweet Querying!

Classifying Text Data Using Machine Learning

Reading Time: 10 minutes

The statistics are astounding.  Data is everywhere and growing – 90% of all data was generated in the last 2 years!  Much of it is text data, every minute we generate 456,000 Tweets,  510,000 Facebook comments, 16 million text messages, and 156 million emails.  In addition, your own organization is probably collecting a significant amount of text data daily from customers and employees through a number of sources like surveys, employee comments, chat logs, and transcribed phone calls.  How can we mine all that potentially information rich data for actionable insights to improve our customer experience and process efficiency?  Manually reading comments is not going to cut it.  We want to take advantage of machine learning and modern data science techniques.  The challenge is that Natural Language Processing (NLP) can be a confusing topic for newcomers, given the vast range of approaches and techniques for answering a variety of questions about text data.  Plus, this is an area of active research so things are changing all the time.  The good news is we don’t need to understand deep neural networks and word embeddings to get started.  In this post I’d like to share a great tool for classifying text documents into a number of categories using a simple NLP technique called ‘bag of words’.  We’ll use this technique to construct a classifier that you probably encounter every day; one which can pick out spam emails from a set of emails.

How do we approach the problem of teaching a computer to understand the topic of a given text?  We could start by explicitly programming rules to parse sentences into parts of speech, identifying nouns and verbs and so on, so the computer can ‘understand’ what is being written.  But as anyone who’s tried to learn a new language knows, there are always exceptions to the rules, and then exceptions to the exceptions.  Plus, we need to take into account common spelling and grammatical mistakes, and also the fact that even correctly written sentences are often ambiguous:

This morning I shot an elephant in my pajamas. How he got in my pajamas I don’t know. – Groucho Marx

It’s clear this approach, if it’s even possible, is going to be prohibitively complicated.  Instead, we can approach the problem from a different angle.

Bag of Words

By looking at each of the individual words in a given text, we can use the presence of specific words and their frequency to provide clues or evidence that a text is about a particular topic.  This is called the ‘bag of words’ approach, because it doesn’t take into account word order or grammar, just the words (or sometimes phrases) themselves.  It’s relatively simple, but also surprisingly effective.

For example, suppose we have three short text documents:

  1. The customer service was the worst.
  2. I was happy with the billing.
  3. I found the customer service excellent.

Using ‘bag of words’, sentence one is represented as “the” (2 times), “customer” (1), “service” (1), “was” (1), “worst” (1).  We can do the same for all the documents and populate a matrix like this:

Document  the  customer  service  was  worst  I  happy  with  billing  found  excellent
       1    2         1        1    1      1  0      0     0        0      0          0
       2    1         0        0    1      0  1      1     1        1      0          0
       3    1         1        1    0      0  1      0     0        0      1          1
Tokenization and Term Frequency (TF)

We call this ‘tokenization’: splitting our documents into parts like words, phrases or even letters.  In this case, we’ve split into words and done a couple of other small things like converting everything to lower case (so we don’t end up with ‘the’ and ‘The’ as separate words) and removing all punctuation.  We’ve effectively turned our text documents into a numerical representation, which means we can now feed them into a machine learning algorithm that uses these word frequencies as features.  Just by scanning the matrix, we can already see that sentences 1 and 3 are much closer to each other than, say, 1 and 2, so they are more likely to come from the same category.
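As a quick illustration, here’s a sketch of how you could build this matrix with the quanteda package (which we’ll use later in the post; the function names assume a reasonably recent quanteda release):

library(quanteda)

docs <- c("The customer service was the worst.",
          "I was happy with the billing.",
          "I found the customer service excellent.")

toks <- tokens(docs, remove_punct = TRUE)  # split each document into words, drop punctuation
toks <- tokens_tolower(toks)               # normalise case so 'The' and 'the' count as one word
example_dfm <- dfm(toks)                   # document-feature matrix of word counts
example_dfm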

So you might be saying, ‘hold on a sec’ – what about all the information we threw out by ignoring word order, phrases and context?  Don’t we need those?  Nope.  We are going to sacrifice all that complexity for ease of implementation.  That’s ok, because we are going to let the algorithm use statistical methods to tease out the topic information from the words themselves without ‘understanding’ what the writer is actually saying.  That doesn’t mean we are done here, because there are a number of things we can do to improve our classifier by further quantifying how important each word is to the document and topic.  If we consider that each non-zero entry in our matrix represents the ‘strength’ of a feature, i.e. how much information it provides about the topic, then we can evolve our method to ‘weight’ each of those occurrences with more granularity and improve the quality of our features.

Stop Words

We have a number of ‘low value’ words in our documents like ‘I’, ‘the’, ‘was’, ‘with’.  These are words that really don’t tell us anything about the topic of the document, so they add little value.  These are often called ‘stop words’, and most text classification packages in R and Python have functions to remove those automatically based on a preset dictionary of these terms.  You can also manually create a list of words that you think could be safely removed without impacting the quality of the classification model.  Once we do that we’re left with a smaller matrix that has almost the same information:

Document  customer  service  worst  happy  billing  found  excellent
       1         1        1      1      0        0      0          0
       2         0        0      0      1        1      0          0
       3         1        1      0      0        0      1          1

Another thing we should consider, especially if we have documents of varying lengths, is to use each word’s proportion of the total (count / number of words in the document) instead of absolute counts.  This helps normalize our values, which will improve algorithm performance.
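Both the stop word removal above and this proportion weighting are one-liners in quanteda.  Continuing with the example_dfm sketched earlier (again assuming a recent quanteda release):

# drop common low-value words using a built-in English stop word list
example_dfm <- dfm_remove(example_dfm, stopwords("english"))

# convert raw counts to within-document proportions to normalise for document length
example_dfm <- dfm_weight(example_dfm, scheme = "prop")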

Stemming

Stemming involves reducing an inflected (or derived) word to its base form, and is sometimes useful as part of pre-processing text data for classification.  For example, ‘billing’, ‘billed’, ‘bills’ and ‘billable’ could all be reduced to ‘bill’ as part of word stemming.  The advantage is that it reduces the number of unique words and captures the similar meaning shared by all these words.  Lemmatization is a more complex way of doing this that requires detailed dictionaries for each word, so we will ignore that method for now despite it having the cooler name.

Inverse Document Frequency (IDF)

What about words that appear in every document versus words that only appear in one or two documents – which will carry more information or value in determining which category a document belongs to?  If a word is common and appears in every document it is not very useful and should have a low weight.  Conversely, unique words that show up in only a few documents are more likely to be of value and should have a higher weight.  To incorporate this into our weighting matrix, we can use something called the inverse document frequency, which weights uncommon words more highly than common words.

TF-IDF

If we multiply the term frequency weighting (TF) by the inverse document frequency (IDF) we get a commonly used weighting metric called TF-IDF, which shows up in classification and document retrieval tasks like search engines.  Again, most text classification packages in Python and R have a function to calculate these values for you.
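For reference, one common formulation is shown below; packages differ in the exact counting and smoothing variants they use (quanteda’s dfm_tfidf, used later in this post, exposes these choices through its scheme_tf and scheme_df arguments):

\[ \text{tf-idf}(t, d) = \text{tf}(t, d) \times \log\left(\frac{N}{n_t}\right) \]

where tf(t, d) is the (raw or normalized) frequency of term t in document d, N is the total number of documents, and n_t is the number of documents containing t.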

Pulling It All Together in R

In order to build our classifier we will want to follow a few basic steps based on the principles above.  First load our labelled data, which will have all the actual document data and the manual category label applied to each document.  This will likely be the most time consuming step, as it often requires you to manually label each document to be used to train and test the classifier.  Next we will process the data as we outlined above and create our matrix (often called a document term matrix or document frequency matrix).  Then we will train the model on a portion of our data set using a machine learning algorithm.  Then we use the remaining labelled data to test our model to see how accurately it classifies new data that it wasn’t trained on.  Finally, we output various metrics to assess model performance and make further changes to improve performance if necessary.

So, let’s make it real and apply all these steps to our spam vs. ham classification example using R.  The code is below and also at my GitHub repository.  First we load the required packages and set parameters.

library(quanteda)     # text classification package
library(tidyverse)    # data manipulation

# set parameters
set.seed(1912)
train_prop <- 0.7 # % of data to use for training

Next, we’ll load the text messages that we want to classify.  This data has 2 columns: one with the actual text data and the other with the spam or ham label.  There are 5574 total documents.  We also want to randomize the data for splitting into train and test sets, and remove any blank rows.

# read data from csv
df <- read.table("SMSSpamCollection.txt", header=FALSE, sep="\t", quote="", stringsAsFactors=FALSE) # Ham/Spam test data

# prepare data
names(df) <- c("Label", "Text")                         # add column labels
df <- df[sample(nrow(df)),]                             # randomize data
df <- df %>% filter(Text != '') %>% filter(Label != '') # filter blank data

Many text mining packages in R and Python like our data to be in a data structure called a corpus, which is simply a collection of documents.  From the corpus we start by creating our document frequency matrix (dfm).  This data ends up having 5574 rows (one for each document) and 9127 columns (one for each word that appears in the corpus).  These types of matrices usually have a lot of zeros in them and are called ‘sparse’ matrices.  We will want to do our best to reduce the number of columns or features to a more manageable number.  This is more efficient from a memory perspective and also helps our algorithm, as a large number of features can impact both performance and accuracy.  Let’s start by applying word stemming, which reduces the number of columns to 7746 by combining similar words.  We can then use another function, dfm_trim, which allows us to remove very infrequent terms and also very short documents.  This results in 1763 columns or features for our algorithm.  Finally, we use the dfm_tfidf function to apply the tf-idf weighting scheme, setting the calculation method for both the tf (count) and idf (inverse) terms.  The quanteda package allows us to do all this data preparation with only a few short lines of code!

# create document corpus  
df_corpus <- corpus(df$Text)   # convert Text to corpus 
docvars(df_corpus) <- df$Label # add classification label as docvar

# build document term matrix from corpus
df_dfm <- dfm(df_corpus, tolower = TRUE)
# stem words
df_dfm <- dfm_wordstem(df_dfm)
# remove low frequency occurrence words
df_dfm <- dfm_trim(df_dfm, min_termfreq = 5, min_docfreq = 3)
# tf-idf weighting           
df_dfm <- dfm_tfidf(df_dfm, scheme_tf = "count", scheme_df = "inverse")

Next, we split the data into a training and testing set

# split data train/test
size <- dim(df)
train_end <- round(train_prop*size[1])
test_start <- train_end + 1
test_end <- size[1]

df_train <- df[1:train_end,]
df_test <- df[test_start:test_end,]

df_dfm_train <- df_dfm[1:train_end,]
df_dfm_test <- df_dfm[test_start:test_end,]

Most of our work is now done; building and testing the model is straightforward with only 2 lines of code.  Here we use the quanteda function textmodel_nb to apply the Naive Bayes algorithm to predict ham/spam.  Of course you could use a number of different algorithms here with similar results, and sometimes specific algorithms work better with specific datasets.

# build model with training set
df_classifier <- textmodel_nb(df_dfm_train, df_train$Label)

# test model with testing set
df_predictions <- predict(df_classifier, newdata = df_dfm_test)

The predict function will return a vector with the predicted category for each document in the test set.  We can compare that to the actual category labels to see how well our model is working using something called a confusion matrix.

                  Actual Ham        Actual Spam
Predicted Ham     True Negative     False Negative
Predicted Spam    False Positive    True Positive

The most common classifier assessment metric is accuracy, which is simply the ratio of correct predictions over the total.  While useful because it is easy to calculate and interpret, it can be misleading.  Consider an example where spam represents 99% of the data, with only 1% ham (which feels like my email inbox some days).  If we built a model that predicted spam for every document then we would have a 99% accurate model, but not a very useful one, since it wouldn’t identify any ham emails.  As a result, there are a few more metrics that are commonly used.  ‘Precision’ in simple terms captures what proportion of predicted Spam is actually Spam, and ‘Recall’ captures what proportion of real Spam was predicted as Spam.

Finally, to create a single measure of model ‘goodness’, we can use something called the F1 score (sometimes F-score or F-measure), which is the harmonic mean of Precision and Recall.
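In terms of the confusion matrix counts above, these metrics work out to:

\[ \text{Precision} = \frac{TP}{TP + FP}, \qquad \text{Recall} = \frac{TP}{TP + FN}, \qquad F_1 = 2 \times \frac{\text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}} \]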

An F1 score of 1 indicates perfect precision and recall while 0 is the opposite.  Let’s run the code below to see how our model performs.

conf_matrix <- table(df_predictions, df_test$Label)
accuracy <- (conf_matrix[1,1] + conf_matrix[2,2]) / sum(conf_matrix)
precision <- conf_matrix[2,2] / sum(conf_matrix[2,])
recall <- conf_matrix[2,2] / sum(conf_matrix[,2])
f1_score <- 2 * ((precision * recall) / (precision + recall))

cat("Confidence Matrix:")
conf_matrix
cat("Accuracy: ", accuracy)
cat("Precision: ", precision)
cat("Recall: ", recall)
cat("F1 Score: ", f1_score)
Confusion Matrix:
df_predictions  ham spam
          ham  1419   11
          spam   23  219
Accuracy:  0.9796651
Precision:  0.9049587
Recall:  0.9521739
F1 Score:  0.9279661

Wow, we were able to generate a model that is 98% accurate with an F1 score of 0.93 with only a few lines of code!  I leave it to the reader to experiment with the model to see if further improvement is possible.  Generally speaking, improving the feature engineering process through the dfm matrix is often the most effective (hint:  in this case, try not stemming words).  Also, more training data is often very effective at improving performance.  Assuming we’re happy with the performance, we can now use this model to predict categories for new unseen data without any additional work.  Cool!

Summary

Hopefully this post has convinced you that you can begin building text classifiers today with only a few lines of code and some labelled data.  In this example we had more than 5000 data points, which definitely helped accuracy.  In my experience, it’s feasible to label a few hundred documents in a few hours, and that should produce a starting accuracy of about 80%.  Give it a try!  Input your data into the code above and see the results.  Let me know what you discover in the comments below.  Note that you can use data that has more than two labelled categories without any changes to the code.

From this initial model there is a lot more you can do to get more sophisticated with your text classification efforts – you are only limited by your time and willingness to learn.  The simplifying assumption of using only single ‘context-free’ words as features made it possible to create such a model quickly, but will likely become a limiting factor as you try to push accuracy even further.  You may want to experiment with including n-grams in your model, which just means sets of consecutive words; for example, bi-grams include the frequencies of all 2-word phrases in a document.  The advantage is that you can capture some distinctions between things like ‘customer service’ and ‘service provider’, where a bag of words model treats the word ‘service’ the same in both cases.  Creating n-grams is done at the tokenization stage and most packages allow you to specify the ‘n’ (see the short sketch below).  You can also include interaction terms, which look for the co-occurrence of 2 words that are not right next to each other.  The trick will be determining which of these new features are useful to the model while discarding the rest, as n-grams and interactions will increase your matrix size dramatically.  This is where dimensionality reduction techniques come in handy.  You may also want to learn about newer methods like word2vec or doc2vec, which get better at capturing word context through a concept called word embeddings.  I will review some of these methods in future posts, but I wanted to share the basic framework in a simple manner so you could get started today.  Thanks for reading and happy text mining!
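As a rough sketch of what the bi-gram tokenization step could look like in quanteda (function names assume a recent release):

library(quanteda)

toks <- tokens("the customer service was the worst", remove_punct = TRUE)
tokens_ngrams(toks, n = 2)  # "the_customer" "customer_service" "service_was" "was_the" "the_worst"
# a dfm built from these tokens counts bi-gram frequencies as features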

You don’t have to see the whole staircase, just take the first step – Martin Luther King Jr.

Getting started analyzing customer survey data using R

Reading Time: 6 minutes

Most organizations use surveys to quantify customer satisfaction and loyalty.  It’s helpful to spend as little time as possible preparing and summarizing the results, so you can get straight to looking for actionable insights to improve customer experience.  However, customer survey data is usually a pain to work with – it’s structured in a way that doesn’t lend itself to easy analysis.  In this post I’m going to show how easy it is to use R to quickly summarize survey data.

Messy vs. Tidy Data

The main challenge to overcome is that most survey data is ‘messy’, or ‘short and fat’.  Each survey completion is a single row, with columns representing information about the respondent as well as their survey responses.  I’ll review why this can be a problem, but first let’s get some data to work with.  Survey data is commonly shared as a .csv file, which can be loaded easily into R

df <- read.csv('My_Survey_Results.csv')

Since I don’t have any data ready for this analysis, let’s just generate some random survey data

set.seed(69)

id <- 1:2000
gender <- sample(c("Male","Female"),2000,replace=TRUE)
age <- sample(c("18-24","25-34","35-54","55-64","65+"),2000,replace=TRUE,prob=c(0.15,0.20,0.3,0.15,0.2))
province <- sample(c("BC","AB","ON","QC"),2000,replace=TRUE,prob=c(0.2,0.2,0.35,0.25))
product <- sample(c("Bag of Glass","Bass-O-Matic","Happy Fun Ball","Little Chocolate Donuts"),2000,replace=TRUE,prob=c(0.1,0.5,0.2,0.2))
Q1 <- sample(c("Very satisfied","Somewhat satisfied","Neither satisfied nor dissatisfied","Somewhat dissatisfied","Very dissatisfied"),2000, replace=TRUE)
Q2 <- sample(c("Very satisfied","Somewhat satisfied","Neither satisfied nor dissatisfied","Somewhat dissatisfied","Very dissatisfied"),2000, replace=TRUE)
Q3 <- sample(c("Strongly agree","Agree","Neither agree nor disagree","Disagree","Strongly disagree"),2000, replace=TRUE)
Q4 <- sample(c("Extremely helpful","Very helpful","Somewhat helpful","Not so helpful","Not at all helpful"),2000, replace=TRUE)
Q5 <- sample(0:10,2000,replace=TRUE)

df <- data.frame(id,gender,age,province,product,Q1,Q2,Q3,Q4,Q5)

We use the set.seed function to ensure we can replicate the same random data again in the future.  Then we use the sample function to generate the various fields and load it all into a dataframe, with one row per survey respondent.

We’ve got 5 columns of respondent data: survey id, gender, age group, province, and product purchased.  The remaining 5 columns are the actual survey responses: Q1-Q4 are ranked on various Likert scales, and Q5 is a satisfaction score between 0 and 10 that we’ll use to calculate an NPS score.  If we want to start tabulating response data in Excel, we need to build a separate pivot for each question, which is kind of a pain.  We need a simpler way to structure the data for analysis.

Luckily there is a set of fantastic R packages for data manipulation – dplyr and tidyr – which are bundled into a collection of packages called the tidyverse.  These were written by Hadley Wickham, who also wrote a great paper on the benefits of tidy data.  First, we load the package, and then let’s tidy the data

# load library
library(tidyverse)

# convert to tidy data
df_tidy <- gather(df, question, response, 6:dim(df)[2])

We use the gather function to work the magic.  The first parameter specifies the dataframe to use (df), and the next 2 parameters define the names for our new tidy data columns (question and response).  Finally, we specify the columns to tidy (in this case the questions start in column 6, so let’s gather all columns from 6 to the end of the dataframe).  Using the dim function here to determine how many columns are in the dataframe will help if we ever decide to add or remove questions from our survey – no changes to the code will be necessary.  So what happens when we execute this code?  It’s easiest to understand by looking directly at the result.

It essentially creates a new row for each question and response pair.  Let’s take a look at the data sorted by id.

Here we can see 5 rows for each survey id, representing the answers to each of the 5 questions.  What happens to the first 5 columns that hold the respondent info (id, gender etc.) that we didn’t include in our gather function?  They just get repeated five times in their existing columns.  We did that so we have that information available when we want to create a crosstab or analyze our response data by any of those factors.  Overall, our dataframe went from 2000 rows by 10 columns (short and fat) to 10000 rows by 7 columns (tall and thin).  As you can see, we could add any number of new questions to our survey and our tidy data will only grow in height, not width, since every response will be added in the question and response columns.

Summarize Results

So what if we want to summarize the results for each of our questions, to see how many people responded in each category?  We can do that with the following code

df_summary <- group_by(df_tidy, question, response) %>%
             summarise(n = n()) %>%
             mutate(perc = n / sum(n)) %>%
             arrange(desc(n), .by_group = TRUE)

Again, let’s walk through each dplyr command to see what’s happening.  First we group the data by question and response using the group_by function (which works the same way as the SQL command of the same name).  Then we use the handy pipe operator (%>%) to pass the result to the next function, which we use to summarise the grouped data.  We create a new variable named ‘n’ and assign it the count for each group using the n() function.  Since I’d also like to know the percentage, I add a mutate function to do just that.  Finally, to make it easier to identify the most popular answers, I arrange the results in descending order within each group (question).

Pretty cool – with only a few short lines of code we are able to summarize our entire survey dataset!  If we add or remove questions or change the possible responses, the output adjusts accordingly.  If we want the results for just one question, we use the filter function

filter(df_summary, question == 'Q4')
Crosstabs

Another common thing we might want to do is create a crosstab, a chart showing the relationship between two variables.  For example, say we want to understand if there’s a difference between how people answer Q1 among different provinces.

df_summary <- filter(df_tidy, question == 'Q1') %>%
              group_by(province, response) %>%
              summarise(n = n()) %>%
              mutate(perc = n / sum(n)) %>%
              select(-n) %>%
              spread(key = province, value = perc, fill = 0)

To generate this chart we group and summarize as before, including adding the percentage column.  Then we pivot the data back to a ‘short and wide’ format using the spread function.  It requires a few parameters: a key to identify what to label the new columns (in this case province), and a value parameter to indicate which column to populate the table with (in this case the percentage).  I also added the optional fill parameter, which specifies what to do if there are no values for a particular cell in the table – in this case it will put a zero.  Using the fill parameter avoids the possibility of ending up with NA values in your table.

Here we can see that BC respondents appear more likely to be very dissatisfied; however, we would want to check significance with a hypothesis test.  To do that you will also need the sample sizes behind each percentage.  This is a straightforward change to the above code: remove the mutate function and use n as the value column in the spread function (a sketch of this change follows below).  Practice by making the changes and observing the results.
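A sketch of that counts version, using the same df_tidy as above:

# crosstab of response counts (instead of percentages) by province for Q1
df_counts <- filter(df_tidy, question == 'Q1') %>%
             group_by(province, response) %>%
             summarise(n = n()) %>%
             spread(key = province, value = n, fill = 0)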

By using the gather and spread functions together, we are able to easily manipulate data for any number of analyses and visualizations.

Net Promoter Score (NPS)

The last thing we’ll do is calculate the Net Promoter Score.  NPS is a common measure of customer loyalty.  Calculating the NPS for a particular question is very straightforward: we can reuse the code above to summarize the customer satisfaction question Q5, then use those results to determine the % of promoters (scores of 9 or 10) and detractors (scores of 0 to 6).  NPS is simply the difference between the two.

# summarize Q5 results
df_summary <- filter(df_tidy, question == 'Q5') %>%
              group_by(response) %>%
              summarise(n = n()) %>%
              mutate(perc = n / sum(n)) %>%
              arrange(desc(as.numeric(response)))

# calculate promoters, detractors and NPS
promoters <- sum(df_summary$perc[df_summary$response %in% c(9,10)])
detractors <- sum(df_summary$perc[df_summary$response %in% c(0:6)])
nps <- promoters - detractors
Summary

R is a wonderfully powerful and easy-to-use tool for analyzing customer survey data (not to mention it’s free!).  This post has covered the following:

  • Transform survey data into tidy data for easy analysis
  • Generate summary counts and percentage splits for each question
  • Create a crosstab to compare a relationship between 2 variables
  • Calculate an NPS score

I’ve only scratched the surface of what we can do with R to derive insights from our data; hopefully this post has provided some tips on getting started.  If you have any questions, improvements or ideas for future posts, please comment below.  Thanks for reading!