Getting started analyzing customer survey data using R

Reading Time: 6 minutes

How to listen to our customers

Most organizations use surveys to quantify customer satisfaction and loyalty.  It’s helpful to spend as little time as possible preparing and summarizing the results, so you can get straight to looking for actionable insights to improve customer experience.  However, customer survey data is usually a pain to work with, it’s structured in a way that doesn’t lend itself to easy analysis.  In this post I’m going to show how easy it is use R to quickly summarize data.

Messy vs. Tidy Data

The main challenge to overcome is that most survey data is ‘messy’, or ‘short and fat’.  Each survey completion is a single row, with columns representing information about the respondent and also their survey responses.  I’ll review why this can be a problem, but first let’s get some data work with.  Survey data is commonly shared as a .csv file, which can be loaded easily into R

df <- read.csv('My_Survey_Results.csv')

Since I don’t have any data ready for this analysis, let’s just generate some random survey data

set.seed(69)

id <- 1:2000
gender <- sample(c("Male","Female"),2000,replace=TRUE)
age <- sample(c("18-24","25-34","35-54","55-64","65+"),2000,replace=TRUE,prob=c(0.15,0.20,0.3,0.15,0.2))
province <- sample(c("BC","AB","ON","QC"),2000,replace=TRUE,prob=c(0.2,0.2,0.35,0.25))
product <- sample(c("Bag of Glass","Bass-O-Matic","Happy Fun Ball","Little Chocolate Donuts"),2000,replace=TRUE,prob=c(0.1,0.5,0.2,0.2))
Q1 <- sample(c("Very satisfied","Somewhat satisfied","Neither satisfied nor dissatisfied","Somewhat dissatisfied","Very dissatisfied"),2000, replace=TRUE)
Q2 <- sample(c("Very satisfied","Somewhat satisfied","Neither satisfied nor dissatisfied","Somewhat dissatisfied","Very dissatisfied"),2000, replace=TRUE)
Q3 <- sample(c("Strongly agree","Agree","Neither agree nor disagree","Disagree","Strongly disagree"),2000, replace=TRUE)
Q4 <- sample(c("Extremely helpful","Very helpful","Somewhat helpful","Not so helpful","Not at all helpful"),2000, replace=TRUE)
Q5 <- sample(0:10,2000,replace=TRUE)

df <- data.frame(id,gender,age,province,product,Q1,Q2,Q3,Q4,Q5)

We use the set.seed function to ensure we can replicate the same random data again in the future.  Then we use the sample function to generate various fields and load it all into a dataframe.  The end result is something like this:

We’ve got 5 columns with respondent data, including survey id, gender, age group, province, and product purchased.  The remaining 5 columns are the actual survey responses, Q1-Q4 are ranked on various Likert scales, and Q5 is a satisfaction score between 0 and 10 that we’ll use to calculate an NPS score.  If we want to start tabulating response data in Excel, we need to build a separate pivot for each question which is kind of a pain.  We need a simpler way to structure the data for analysis.

Luckily there are a set of fantastic packages in R for data manipulation called dplyr, and tidyr, which are bundled into a collection of packages called the tidyverse.  These were written by Hadley Wickham, who also wrote a great paper on the benefits of tidy data.  First, we want to load the package and then let’s tidy the data

# load library
library(tidyverse)

# convert to tidy data
df_tidy <- gather(df, question, response, 6:dim(df)[2])

We use the gather function to create the magic.  The first parameter specifies the dataframe to use (df), the next 2 parameters define the names for our new tidy data columns (question and response).  Finally, we want to specify the columns to tidy (in this case the questions start in column 6, so let’s gather all columns starting at 6 to the end of the dataframe).  Using the dim function here to define how many columns are in the dataframe will help if we ever decide to add or remove questions from our survey.  No changes to the code will be necessary.  So what’s going to happen when we execute this code?  Easiest to understand by looking directly at the result:

It essentially creates a new row for each question and response pair.  Let’s take a look sorted by id

Here we can see 5 rows for each survey id, representing the answers to each of the 5 questions.  What happens to the first 5 rows that have the respondent info (id, gender etc.) that we didn’t include in our gather function?  They just get repeated five times in their existing columns.  We did that so we have that information available when we want to create a crosstab or analyze our response data by any of those factors.  Overall, our dataframe went from 2000 rows by 10 columns (short and fat) to 10000 rows by 7 columns (tall and thin).  As you can see we could add any number of new questions to our survey and our tidy data will only grow in height not width, since every response will be added in the question and response columns.

Summarize Results

So what if we want to summarize the results for each of our questions, to see how many people responded in each category.  We can do that with the following code

df_summary <- group_by(df_tidy, question, response) %>%
             summarise(n = n()) %>%
             mutate(perc = n / sum(n)) %>%
             arrange(desc(n), .by_group = TRUE)

Again, let’s walk through each dplyr command to see what’s happening.  First we are going to group the data by question and response using the group_by function (which works the same way as the SQL command of same name).  Then we can use the handy pipe operator (%>%) to pass the result to the next function will we use to summarise the grouped data.  We will create a new variable named ‘n’ and assign it the count for each group by using the n() function.  Since I’d like to also know the percentage, I’m also going to add a mutate function to do just that.  Finally, to make it easier to identify the most popular answers, I’m going to arrange my results in a descending order by group (question).

Pretty cool, with only a few short lines of code we are able to summarize our entire survey dataset!  If we add or remove questions or change the possible responses, it will adjust the output accordingly.  If we want to use the results for just one question, use the filter function

filter(df_summary, question == 'Q4')
Crosstabs

Another common thing we might want to do is create a crosstab, a chart showing the relationship between two variables.  For example, say we want to understand if there’s a difference between how people answer Q1 among different provinces.

df_summary <- filter(df_tidy, question == 'Q1') %>%
              group_by(province, response) %>%
              summarise(n = n()) %>%
              mutate(perc = n / sum(n)) %>%
              select(-n) %>%
              spread(key = province, value = perc, fill = 0)

To generate this chart we group and summarize as before, including adding the percentage column.  Then we want to pivot the data back to a ‘short and wide’ format by using the spread function.  It requires a few parameters, a key to identify what to label the new columns (in this case province), and then a value parameter to indicate which column to populate the table with (in this case the percentage).  I also added the optional fill parameter, which specifies what to do if there are no values for a particular cell in the table, in this case it will put a zero.  Using the fill parameter avoids the possibility of ending up with NA values in your table.

Here we can see that it looks like BC respondents are more likely to be very dissatisfied, however we would likely want to check significance with a hypothesis test.  To do that you will also need the sample sizes for each percentage.  This is a straightforward change to the above code, by removing the mutate function and using n for the value column in the spread function.  Practice by making the changes and observing the results.

By using the gather and spread functions together, we are able to easily manipulate data for any number of analyses and visualizations.

Net Promoter Score (NPS)

The last thing we’ll do is calculate the Net Promoter Score.  NPS is a common measure of customer loyalty.  Calculating the NPS for a particular question is very straightforward, we can reuse the code above to summarize the customer satisfaction question Q5.  Then we can use those results to determine the % of promoters and detractors.  NPS is simply the difference between the two.

# summarize Q5 results
df_summary <- filter(df_tidy, question == 'Q5') %>%
              group_by(response) %>%
              summarise(n = n()) %>%
              mutate(perc = n / sum(n)) %>%
              arrange(desc(as.numeric(response)))

# calculate promoters, detractors and NPS
promoters <- sum(df_summary$perc[df_summary$response %in% c(9,10)])
detractors <- sum(df_summary$perc[df_summary$response %in% c(0:6)])
nps <- promoters - detractors
Summary

R is a wonderfully powerful and easy to use tool for analyzing customer survey data (not to mention it’s free!).  This post has covered the following:

  • Transform survey data into tidy data for easy analysis
  • Generate summary counts and percentage splits for each question
  • Create a crosstab to compare a relationship between 2 variables
  • Calculate an NPS score

I’ve only scratched the surface of what we can do with R to derive insights from our data, hopefully this post has provided some tips on getting started.  Any questions, improvements or ideas for future posts, please comment below.  Thanks for reading!

Leave a Reply

Your email address will not be published. Required fields are marked *