Using Regression to model race performance in Python

Reading Time: 5 minutes

In this post I’ll cover how to do the following in Python:

  • Use the Seaborn library to plot data and trendlines
  • Generate a regression equation using the polyfit function
  • Use the regression model to predict future race times
  • Review how to improve model performance

This is the third and final post in a series on how to visualize and analyze race and personal running data with the goal of estimating future performance.  In the first part I did a bit of exploratory analysis of Whistler Alpine Meadows 25km distance race data to help set an overall goal for finishing time and required pace to achieve that goal.  In the second post I dived into how to use the Strava API to retrieve my activity data that we will use in this final post to build a simple model that can estimate race finishing time.

Using Seaborn to plot polynomial regression line

First let’s load in our data from the .csv file we saved in our last post, so we don’t need to reload the data from the API.  Reading a .csv file is easy using the pandas function

splits = pd.read_csv('18-08-25 New Activity Splits.csv')

Before we return to plotting the data, let’s take another quick look at the data.  Last time we plotted the ‘moving time’ vs. the elevation change, but there is also an ‘elapsed time’ in the data.  Let’s investigate further by creating and plotting a new variable which is the difference between these two times.

splits['time_diff'] = splits['elapsed_time'] - splits['moving_time']

plt.plot( 'elevation_difference', 'time_diff', data=splits, linestyle='', marker='o', markersize=3, alpha=0.1, color="blue")

In most cases the elapsed time and moving time are close, but there are a significant number of points where they are different.  What causes this?  Time spent stationary or with little movement is captured in elapsed time but not moving time.  This confirms what I’ve noticed when logging an activity through Strava, especially on steep or twisty trails where Strava is fooled into thinking you’ve stopped.  For this analysis, I’m going to use elapsed time, even if it means that the few cases where I actually ‘stopped’ for an extended period of time will be included in data.  Using elapsed time will provide a more conservative and realistic estimate of my pace.

Last time we plotted the data using the matplotlib plot function.  This time let’s use the awesome Seaborn library to produce some nicer plots and include some trendlines and confidence intervals, using the function regplot.

sns.regplot(x = 'elevation_difference', y = 'elapsed_time', data = splits ,order = 2)
plt.title('Running Pace vs. Elevation Change', fontsize=18, fontweight="bold")
plt.xlabel('Elevation Change (m)', fontsize=18)
plt.ylabel('1km Pace (sec)', fontsize=18)

Notice we used the parameter order to specify which order polynomial to try and fit to the data.  I used 2 in this case, which produces a nice parabola which approximates the data pretty well.  As a stats refresher, the equation for a second degree polynomial (also known as a quadratic) is y = ax² + bx + c.  The light blue cone represents the 95% confidence interval, which is calculated using the bootstrap method.  One drawback of this plot is it doesn’t allow us the flexibility of setting the various visual parameters that the matplotlib plot functions does.  Specifically, I’d like to make the individual points look like those in the first plot by changing the alpha level to better show the point density.  Luckily, Python makes this easy by allowing us to combine 2 plot functions onto one plot.  I use the plot function to plot the individual points, and the regplot function to plot the trendline and confidence interval.  Use ‘scatter=None’ to suppress plotting the individual points in the regplot.

plt.plot( 'elevation_difference', 'elapsed_time', data=splits, linestyle='', marker='o', markersize=5, alpha=0.1, color="blue")
sns.regplot(x = 'elevation_difference', y = 'elapsed_time', scatter=None, data = splits ,order = 2)
plt.title('Running Pace vs. Elevation Change', fontsize=18, fontweight="bold")
plt.xlabel('Elevation Change (m)', fontsize=18)
plt.ylabel('1km Pace (sec)', fontsize=18)

Using Polyfit to generate the equation for the fitted model

So here’s the main drawback of using regplot, there’s no ability to have it provide the coefficients for the fitted lines and confidence intervals.  If anyone knows how to do this, I would love to hear about it in the comments!  So let’s rely on a Numpy function, polyfit, to give the equation

coeff = np.polyfit(splits['elevation_difference'], splits['elapsed_time'], 2)

That will produce the following coefficient array (in decreasing order):

array([7.40646826e-03, 6.30941912e-01, 3.74015634e+02])

So our complete equation is:  y = 0.0074*x² + 0.6310*x + 374

Apply equation to WAM course profile to estimate total time

Finally, let’s apply our model to the WAM course profile, which I manually created as a .csv file.  Then we calculate the time using the coefficients from the polyfit function above.

# Load WAM course data
WAM = pd.read_csv('WAM_25k_course.csv')

# Calculate estimated time for each km based on elevation change
WAM['estimated_time'] = coeff[0]*WAM['elevation']**2 + coeff[1]*WAM['elevation'] + coeff[2]

This is what the overall data looks like:

    km  elevation  estimated_time
0    1          0      374.015634
1    2          0      374.015634
2    3         13      383.469572
3    4         18      387.772284
4    5         68      451.167193
5    6        203      807.309992
6    7        158      658.599529
7    8         32      401.789998
8    9         27      396.450381
9   10        141      610.226439
10  11        190      761.268101
11  12        310     1281.369227
12  13       -120      404.955747
13  14        -23      363.421991
14  15        -78      369.863117
15  16         24      393.424365
16  17        -43      360.579691
17  18        -60      362.822405
18  19        -16      365.816619
19  20        -93      379.396580
20  21       -167      475.207328
21  22       -181      502.458454
22  23       -165      471.551317
23  24       -128      414.602645
24  25        -79      370.394991

Adding up the all the times and converting to minutes

WAM['estimated_time'].sum() / 60

This gives an estimated time of 202 minutes (3 hrs and 22 minutes).  That would be an amazing time!  But I suspect that it’s a bit optimistic as it uses a number of runs done on smooth road or track which will be much faster than a trail run.  To try and get a more accurate estimate, I went and manually classified my runs over the last year as either ‘trail’, ‘road’, or ‘track’ and entered the information in the description field of the activity on Strava.  After retrieving only the classified data again using the Strava API, I use the code below to recalculate my estimated finishing time

splits_trail = splits[splits['description'] == 'Trail']
coeff_trail = np.polyfit(splits_trail['elevation_difference'], splits_trail['elapsed_time'], 2)
WAM['estimated_time_trail'] = coeff_trail[0]*WAM['elevation']**2 + coeff_trail[1]*WAM['elevation'] + coeff_trail[2]
WAM['estimated_time_trail'].sum() / 60

This time I get an estimated finish time of 242 minutes (4 hrs and 2 minutes), which is almost exactly my goal of finishing in the middle of the pack!

Final Thoughts

This has been an interesting exercise and provided quite a bit of insight through some exploratory data analysis and some simple modelling that was relatively quick and easy to do.  This is always a good approach, as it allows you to iterate quickly to understand the process and data more fully, before diving into more complicated and time consuming modelling techniques.  Our next step would likely be to build a more complex regression model and/or another popular machine learning algorithm like Random Forest which can utilize other potential factors in estimating pace.  We already identified that the type of surface is almost certainly a factor in estimating performance.  There are some other hypothesized factors that we could add to train our model to see if it improves performance:

  • Fatigue estimate (split completed at beginning, middle or end of activity)
  • Temperature (hot day vs cold day)
  • More granular terrain classifications (ie smooth trail vs. technical trail)

Perhaps I will tackle this in a future post, but for now you have a solid set of tools to do some pretty cool analysis of your own activities.  We learned how to scrape race data from the web and retrieve data using an API, some creative ways to to visualize that data, and finally how to build a simple regression model to predict future performance.  Pretty cool!

Using the Strava API to retrieve activity data

Reading Time: 6 minutes

In this article I’ll cover how to do the following with Python:

  • A brief description of APIs
  • Use the Requests library to retrieve your training data from Strava
  • Visualize running pace vs. elevation change

This is the second part on a series on how to use Python to visualize and analyze race and personal running data with the goal of estimating future performance.  In the first part I did a bit of exploratory analysis of Whistler Alpine Meadows 25km distance race data to help set an overall goal for finishing time and required pace.  In this post I will see how that compares to my actual current run pacing using data retrieved using the Strava API.

API overview

API stands for ‘Application Programming Interface’, which allows for different servers and applications to talk and share information with each other.  A couple of good articles here and here explaining the basics of how they work.  So why do we care?  In the last post we just retrieved our race result data from a static table of results on an html webpage.  Sometimes we can directly download a .txt or .csv file from a location on the web.  But in many cases the data is stored in a database on an organization’s server and only bits are retrieved and presented as html pages as required.  This may not be convenient or easy to access for analysis purposes.  In the case of Strava for example, we could log into our account and then look at our activity feed and click  on each activity and then use that page to scrape data using Python.  But wouldn’t it be nicer if we could just request the data directly and receive it in a nice and easy to use format?  APIs to the rescue!  The Strava API allows us to do just that, by providing specific urls and a format for authenticating ourselves and requesting data to be returned.  Many organizations make some or all of their data available through an API; like Twitter, YouTube, Spotify, and NASA to name a few.  Let’s focus on the specific task of retrieving our activity data from the Strava API.

Connecting to Strava API and retrieving data

In order to access the Strava API, we need to create an app on the Strava website and then we will be provided with our access token and client secret to authorize requests for our data.  Here is a good article to walk you through this process.  Next, we will want to familiarize ourselves with what type of data is available and how to request it, there is good documentation on the Strava website.  Most organizations usually provide syntax and examples for sending requests to the API.

In order to model our running pace vs. elevation, we want to get all the 1km split times and elevation changes for all of our runs.  We will do that in 2 steps:  1) get a list of the activity ids for all of our runs and 2) get all the split data for each of the activity ids from step 1.  The Python requests library makes it easy to retrieve this data from the API so we can populate in a dataframe.  First let’s initialize a dataframe to hold the data we want (the activity id and the activity type) and set the parameters for the API GET request

# Initialize the dataframe
col_names = ['id','type']
activities = pd.DataFrame(columns=col_names)

access_token = "access_token=xxxxxxxx" # replace with your access token here
url = ""

Now, we will set up a loop to retrieve the list of activities in increments of 50 and populate the dataframe.

page = 1

while True:
    # get page of activities from Strava
    r = requests.get(url + '?' + access_token + '&per_page=50' + '&page=' + str(page))
    r = r.json()

    # if no results then exit loop
    if (not r):
    # otherwise add new data to dataframe
    for x in range(len(r)):
        activities.loc[x + (page-1)*50,'id'] = r[x]['id']
        activities.loc[x + (page-1)*50,'type'] = r[x]['type']

    # increment page
    page += 1

Note that all the instructions to the API are embedded in the url passed to the requests.get function.  There’s the base url for the activity data, then a question mark followed by our specific request parameters.  We need to pass our access token for authenticating ourselves to the Strava API, plus tell it how many results to retrieve per page and which page to retrieve.  When run with my access token it produced a dataframe with 219 activities (which matches my total on my Strava page, woo hoo!).  The first few lines of my dataframe look like this:


           id      type
0  1788816844       Run
1  1787519321  Crossfit
2  1783902835       Run
3  1779024087  Crossfit
4  1775031422       Run

My activities break down is as follows


Let’s include only run activities, and then send API requests to retrieve the 1 km split times and elevation changes for each of those runs.  The request url will be similar to the previous one, except we only need to include the specific activity id we want data from and of course our access token.

# filter to only runs
runs = activities[activities.type == 'Run']

# initialize dataframe for split data
col_names = ['average_speed','distance','elapsed_time','elevation_difference','moving_time','pace_zone', 'split','id','date']
splits = pd.DataFrame(columns=col_names)

# loop through each activity id and retrieve data
for run_id in runs['id']:
    # Load activity data
    r = requests.get(url + '/' + str(run_id) + '?' + access_token)
    r = r.json()

    # Extract Activity Splits
    activity_splits = pd.DataFrame(r['splits_metric']) 
    activity_splits['id'] = run_id
    activity_splits['date'] = r['start_date']
    # Add to total list of splits
    splits = pd.concat([splits, activity_splits])

That’s all there is to it!  In the next section we will clean the data and do some basic analysis.

Visualize pace data vs. elevation

Below is an example of the split data for a single activity, our complete dataset is a concatenation of all splits for all activities.  As you can see we have a partial split (less than 1km) of 333.6m at the end of this particular run.  This could skew our results, so we will want to filter out all splits that are not around 1 km in length.


   average_speed  distance  elapsed_time  elevation_difference  moving_time  \
0           2.79    1003.1           359                   2.1          359   
1           2.75     998.7           380                   1.4          363   
2           2.79    1000.2           362                  12.9          358   
3           2.67     999.1           374                 -10.7          374   
4           2.70     999.6           370                  19.0          370   
5           2.79    1000.1           359                 -20.4          359   
6           2.71    1001.7           370                  -3.5          370   
7           2.80     333.6           119                  -1.7          119   

   pace_zone  split         id                  date  
0          2      1  174116621  2014-08-01T23:30:41Z  
1          2      2  174116621  2014-08-01T23:30:41Z  
2          2      3  174116621  2014-08-01T23:30:41Z  
3          2      4  174116621  2014-08-01T23:30:41Z  
4          2      5  174116621  2014-08-01T23:30:41Z  
5          2      6  174116621  2014-08-01T23:30:41Z  
6          2      7  174116621  2014-08-01T23:30:41Z  
7          2      8  174116621  2014-08-01T23:30:41Z

Histogram of all split distances.

As you can see from the data, most of our splits are around 1km, but there are a number that are quite a bit less.  Let’s filter only those 1000m +/- 50m (5%).

# Filter to only those within +/-50m of 1000m
splits = splits[(splits.distance > 950) & (splits.distance < 1050)]

That reduces our sample from 1747 to 1561 data points, still a fair number for our analysis.  Finally, let’s take a look at a scatter plot of elevation change vs. pace.

plt.plot( 'elevation_difference', 'moving_time', data=splits, linestyle='', marker='o', markersize=3, alpha=0.1, color="blue")

Since there are a lot of overlapping points, a helpful tip is to use the  transparency parameter alpha.  This will help highlight the concentration of points in a particular region of a graph, otherwise you will just end up with a solid blob of points in the middle of this graph.  The data is pretty noisy, which I suspect is due to the various additional factors in this data that we didn’t account for like terrain (road vs. trail) or temperature (hot vs. cool) or fatigue (beginning of a run vs. the end).  It would be great to be able to classify runs or splits by the terrain (track, road, easy trail, technical trail), fingers crossed this will be a future Strava improvement.  Perhaps for a future post I’ll take a shot at manually classifying my activities to see if we can improve the model, but for now let’s work with what we’ve got.  Not surprisingly, there still is a noticeable trend in the data.  The fastest times are on a slight decline with times increasing on either side, with inclines having a greater reduction on speed than declines.

For my next and final post in this series, I’ll apply some regression modelling to this data to allow for estimation of pace for each km of the WAM course and finally calculate an overall estimated race time.  Let’s see how achievable my goal of 4 hours really is!


Trail Racing – An Exploratory Data Analysis

Reading Time: 6 minutes

In this article I’ll cover how to do the following with Python:

  • Scraping past race results from the web using Requests and Beautiful Soup libraries
  • Prepare and clean the data for analysis
  • Visualize results with the Matplotlib and Seaborn libraries
  • Set an audacious race goal (no need for Python here!)

This year I signed myself up for 25km distance of the Whistler Alpine Meadows trail race, located in the stunning scenery around Whistler BC.  Runners can choose from four distances; 12km, 25km, 55km or a monster 110km.  I’ve run a few road races from 5k up to a full marathon, but I’ve never raced a 25km on technical trails with an insane vertical of 1500m!  Yikes!  For training motivation, I would like to set a wholly unreasonable goal for myself.  Seems like a good chance to use some newly acquired Python skills to pull together and analyze some past race data from the web.

Web scraping past race results

Coast Mountain Trail Series, the race organizers, post all past race results.  I’m going to use that to download the data for the 25km distance of the 2016 and 2017 WAM races.  But instead of manually copying and pasting the data into a file for analysis, I’m going to use the terrific Requests package and Beautiful Soup package in Python, designed specifically for extracting data from websites.  Let’s first import the packages we need

import numpy as np # matrix functions
import pandas as pd # data manipulation
import requests # http requests
from bs4 import BeautifulSoup # html parsing

Next, I’ll want to define a function to extract a table of results from webpage.  This function uses the get function to retrieve the webpage, then creates a BeautifulSoup object and extracts the html table containing the results.  Finally we convert to a dataframe.

def read_results(url):
    res = requests.get(url) 
    soup = BeautifulSoup(res.content,'lxml') 
    table = soup.find_all('table')[0]  
    df = pd.read_html(str(table)) 
    return pd.DataFrame(df[0]) 

Let’s use our new function to extract the 2016 and 2017 results.

url = "" # 2016 data
df_2016 = read_results(url)

url = "" # 2017 data
df_2017 = read_results(url)
Prepare and clean the data

For ease of analysis, I’ll complete a couple of other steps to prepare the data.  I’ll add a ‘Year’ field to indicate which year’s race each result is from and then concatenate the two tables into a single result table.  Next, the time format is a bit awkward, so I’d like to convert to minutes to make plotting and calculations easier.  Finally, I’ll save a local copy of the data, so I can be sure to be able to reproduce the analysis at a later date.

df_2016['Year'] = 2016
df_2017['Year'] = 2017

df = pd.concat([df_2016, df_2017]) # combine the two dataframes

def convert_to_minutes(row):
    return sum(i*j for i, j in zip(map(float, row['Time'].split(':')), [60, 1, 1/60]))
df['Minutes'] = df.apply(convert_to_minutes, axis=1) # add a minutes field

df.to_csv('WAM Results.csv') # save a local copy of the data

Luckily the dataset is nice and ‘clean’, there are no missing values or mis-coded fields to contend with so there isn’t any more data preparation to do before analysis.

Visualize using Matplotlib and Seaborn

The first thing we can try is a basic histogram from the Matplotlib library to get a visual representation of the distribution of the finishing times.

import matplotlib.pyplot as plt # plotting

plt.figure(num=None, figsize=(8, 6), dpi=80, facecolor='w', edgecolor='k')
plt.hist(df['Minutes'], facecolor='blue', alpha=0.5)
plt.title('2016 & 2017 Whistler Alpine Meadows Times', fontsize=18, fontweight="bold")
plt.xlabel('Time (min)', fontsize=18)
plt.ylabel('Frequency', fontsize=18)

This gives us a bit of information, the average time is around 275 minutes and most racers finish somewhere between 175-375 minutes.  But that’s about it.  I’m going to try using the Seaborn library to create more visually appealing plots that also provide much more information.  First, let’s see if there’s any difference between 2016 and 2017 results.  Stacking two histograms on one plot is an option.  A better solution is multiple boxplots:

import seaborn as sns # plotting

plt.figure(num=None, figsize=(8, 6), dpi=80, facecolor='w', edgecolor='k')
sns.boxplot(x="Year", y="Minutes", data=df)
plt.title('WAM Results by Year', fontsize=18, fontweight="bold")
plt.xlabel("Year", fontsize=18)
plt.ylabel("Minutes", fontsize=18)

As the boxplot shows the median and first and third quartiles of the data, it is easier to interpret than the histogram above. A good description of how box plots are constructed is here.  However, a boxplot still obscures some of the underlying distribution of the data and it’s hard to plot if we want to compare two dimensions simultaneously (say year and gender) without plotting a large number of boxplots.  Below are a couple of other great options that are just as easy to create with Seaborn.  First up, the violin plot:

plt.figure(num=None, figsize=(8, 6), dpi=80, facecolor='w', edgecolor='k')
sns.violinplot(x="Year", y="Minutes", data=df, inner='quartile')
plt.title('WAM Results by Year', fontsize=18, fontweight="bold")
plt.xlabel("Year", fontsize=18)
plt.ylabel("Minutes", fontsize=18)
plt.savefig("WAM ViolinPlot.png")

Note that the only line of code that’s chaning is the actual sns. plot command, the rest is just for formatting and annotation.  By specifying the ‘inner’ parameter as quartile, it shows the median (dashed lines) and first and third quartiles (dotted lines) similar to the boxplot.  So here we can see the distribution of the data much more clearly (think of it as a probability distribution curve turned sideways and mirrored – the ‘fatness’ of the plot shows the number of observations at that point).  There’s an interesting skew in the 2017 results towards faster runners.  Since this is a fairly small dataset, we can take it even one step further and plot each individual runner and also distinguish gender with the swarmplot:

What an awesome improvement over the boxplot!  We can still see the shape of the distributions, but also see how individual racers fared identified by gender.  And that skew that we saw in the violin plots is clearly visible here, look at that group of 15 runners that finished together in just under 250 minutes in 2017.  And the lead pack of 3 men and 1 woman who finished well ahead of everyone else.  What’s great is we’ve already learned quite a bit about the data with only a few plots, and we haven’t calculated any specific measures yet like mean or standard deviation.  There’s a lot we can learn just from visualizing data in new ways.

Finally, I’d like to understand how runners in different age groups fared, so let’s combine the violin and swarmplot on one plot with the code below:

# subset only men's results
men = df.loc[df['Gender'] == 'M']
men['Age Group'] = men['Age Group'].cat.remove_unused_categories()
# plot violin and swarm plots by age group
plt.figure(num=None, figsize=(8, 6), dpi=80, facecolor='w', edgecolor='k')
sns.violinplot(x="Age Group", y="Minutes", data=men, color='lightblue', inner='quartile')
sns.swarmplot(x="Age Group", y="Minutes", data=men, color='darkblue')
plt.title('Mens WAM Results by Age Group', fontsize=18, fontweight="bold")
plt.xlabel("Age Group", fontsize=18)
plt.ylabel("Minutes", fontsize=18)
plt.savefig("WAM Mens SwarmPlot.png")

Similarly for the women

So, finally let’s calculate the actual median and quartile times for my age group (M 40-49) and figure out what km pace those correspond to

# subset my age group results
group_times = df['Minutes'].where(df['Age Group'] == 'M 40-49').dropna()
# 25, 50 and 75 percentiles for total time and calculated per km pace
np.round(np.percentile(group_times, [25, 50, 75]), 1)
np.round(np.percentile(group_times, [25, 50, 75]) / 25, 1)
Finishing Group Total Time Pace Required
To beat 7 hr cutoff 7 hr 16:48 min / km
In the top 75% 4 hrs, 35 min 11:00 min / km
In the top 50% 4 hrs, 1 min 9:39 min / km
In the top 25% 3 hrs, 39 min 8:45 min / km

Based on historical results, looks like I need to finish in 4 hours (at 9:39 pace) to finish in the top half of competitors for my age group.  That seems like a nice round number and a good stretch goal for my first trail race!

For my next post I’ll figure out how I can assess my training for the race and how close I am to achieving my goal on race day.  The course is gnarly mix of flats, climbs and descents.  I certainly won’t be running at a constant 9:39 min/km pace.  I’ll show you how easy it is to use the Strava API to download all your training data which is pretty neat.  Then I will build a crude model that estimates my WAM finishing time based on my typical pace over various elevation changes.  Pretty nerdy, but I like it.  Let’s see what the results are!