Using Regression to model race performance in Python


In this post I’ll cover how to do the following in Python:

  • Use the Seaborn library to plot data and trendlines
  • Generate a regression equation using the polyfit function
  • Use the regression model to predict future race times
  • Review how to improve model performance

This is the third and final post in a series on how to visualize and analyze race and personal running data, with the goal of estimating future performance.  In the first part I did a bit of exploratory analysis of Whistler Alpine Meadows 25km distance race data to help set an overall goal for finishing time and the required pace to achieve that goal.  In the second post I dug into how to use the Strava API to retrieve my activity data, which we will use in this final post to build a simple model that estimates race finishing time.

Using Seaborn to plot a polynomial regression line

First let’s load in our data from the .csv file we saved in the last post, so we don’t need to reload the data from the API.  Reading a .csv file is easy using the pandas read_csv function:

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np

splits = pd.read_csv('18-08-25 New Activity Splits.csv')

Before we return to plotting, let’s take another quick look at the data.  Last time we plotted the ‘moving time’ vs. the elevation change, but there is also an ‘elapsed time’ in the data.  Let’s investigate further by creating and plotting a new variable: the difference between these two times.

splits['time_diff'] = splits['elapsed_time'] - splits['moving_time']

plt.plot('elevation_difference', 'time_diff', data=splits, linestyle='', marker='o', markersize=3, alpha=0.1, color="blue")

In most cases the elapsed time and moving time are close, but there are a significant number of points where they differ.  What causes this?  Time spent stationary or with little movement is captured in elapsed time but not moving time.  This confirms what I’ve noticed when logging an activity through Strava, especially on steep or twisty trails where Strava is fooled into thinking you’ve stopped.  For this analysis, I’m going to use elapsed time, even if it means that the few cases where I actually ‘stopped’ for an extended period of time will be included in the data.  Using elapsed time will provide a more conservative and realistic estimate of my pace.
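As a quick sanity check on that gap, you can summarize time_diff directly.  Here’s a minimal sketch on a toy DataFrame with made-up split values (your own splits data will differ):

```python
import pandas as pd

# Toy stand-in for the Strava splits export (illustrative values only)
splits = pd.DataFrame({
    'elapsed_time': [380, 372, 405, 610, 366],  # seconds per km split
    'moving_time':  [378, 372, 390, 540, 365],
})

# The difference captures time Strava judged as stationary
splits['time_diff'] = splits['elapsed_time'] - splits['moving_time']

print(splits['time_diff'].describe())
```

The describe output makes it easy to see whether most differences are small with a long tail of larger stops, which is the pattern the scatter plot suggests.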

Last time we plotted the data using the matplotlib plot function.  This time let’s use the awesome Seaborn library to produce some nicer plots and include some trendlines and confidence intervals, using the function regplot.

sns.regplot(x='elevation_difference', y='elapsed_time', data=splits, order=2)
plt.title('Running Pace vs. Elevation Change', fontsize=18, fontweight="bold")
plt.xticks(fontsize=16)
plt.yticks(fontsize=16)
plt.xlabel('Elevation Change (m)', fontsize=18)
plt.ylabel('1km Pace (sec)', fontsize=18)

Notice we used the parameter order to specify which order of polynomial to fit to the data.  I used 2 in this case, which produces a nice parabola that approximates the data pretty well.  As a stats refresher, the equation for a second-degree polynomial (also known as a quadratic) is y = ax² + bx + c.  The light blue cone represents the 95% confidence interval, which is calculated using the bootstrap method.  One drawback of this plot is that it doesn’t give us the flexibility to set the various visual parameters that the matplotlib plot function does.  Specifically, I’d like to make the individual points look like those in the first plot by changing the alpha level to better show the point density.  Luckily, Python makes this easy by allowing us to combine two plot functions in one figure: I use the plot function to plot the individual points, and the regplot function to plot the trendline and confidence interval.  Use ‘scatter=None’ to suppress plotting the individual points in the regplot.

plt.plot('elevation_difference', 'elapsed_time', data=splits, linestyle='', marker='o', markersize=5, alpha=0.1, color="blue")
sns.regplot(x='elevation_difference', y='elapsed_time', scatter=None, data=splits, order=2)
plt.title('Running Pace vs. Elevation Change', fontsize=18, fontweight="bold")
plt.xticks(fontsize=16)
plt.yticks(fontsize=16)
plt.xlabel('Elevation Change (m)', fontsize=18)
plt.ylabel('1km Pace (sec)', fontsize=18)

Using polyfit to generate the equation for the fitted model

So here’s the main drawback of using regplot: there’s no way to have it return the coefficients for the fitted line or the confidence intervals.  If anyone knows how to do this, I would love to hear about it in the comments!  So let’s rely on a NumPy function, polyfit, to give us the equation:

coeff = np.polyfit(splits['elevation_difference'], splits['elapsed_time'], 2)

That will produce the following coefficient array (highest power first):

array([7.40646826e-03, 6.30941912e-01, 3.74015634e+02])

So our complete equation is:  y = 0.0074x² + 0.6310x + 374
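Since regplot won’t hand back its bootstrap intervals, you can reproduce the idea yourself by bootstrapping polyfit.  Here’s a minimal sketch on synthetic data (generated from the equation above plus noise, since the exact splits aren’t reproduced here) that gives percentile confidence intervals for each coefficient:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for the splits data: the fitted quadratic plus noise
x = rng.uniform(-150, 150, size=400)
y = 0.0074 * x**2 + 0.631 * x + 374 + rng.normal(0, 30, size=400)

# Bootstrap: refit the polynomial on resampled data many times
boot_coeffs = []
for _ in range(1000):
    idx = rng.integers(0, len(x), size=len(x))
    boot_coeffs.append(np.polyfit(x[idx], y[idx], 2))
boot_coeffs = np.array(boot_coeffs)

# 95% percentile interval for each coefficient (a, b, c)
ci_low, ci_high = np.percentile(boot_coeffs, [2.5, 97.5], axis=0)
for name, lo, hi in zip('abc', ci_low, ci_high):
    print(f'{name}: [{lo:.4f}, {hi:.4f}]')
```

This is the same resampling idea behind the shaded cone in the regplot, just applied to the coefficients instead of the fitted curve.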

Apply equation to WAM course profile to estimate total time

Finally, let’s apply our model to the WAM course profile, which I manually created as a .csv file.  Then we calculate the time using the coefficients from the polyfit function above.

# Load WAM course data
WAM = pd.read_csv('WAM_25k_course.csv')

# Calculate estimated time for each km based on elevation change
WAM['estimated_time'] = coeff[0]*WAM['elevation']**2 + coeff[1]*WAM['elevation'] + coeff[2]

This is what the overall data looks like:

    km  elevation  estimated_time
0    1          0      374.015634
1    2          0      374.015634
2    3         13      383.469572
3    4         18      387.772284
4    5         68      451.167193
5    6        203      807.309992
6    7        158      658.599529
7    8         32      401.789998
8    9         27      396.450381
9   10        141      610.226439
10  11        190      761.268101
11  12        310     1281.369227
12  13       -120      404.955747
13  14        -23      363.421991
14  15        -78      369.863117
15  16         24      393.424365
16  17        -43      360.579691
17  18        -60      362.822405
18  19        -16      365.816619
19  20        -93      379.396580
20  21       -167      475.207328
21  22       -181      502.458454
22  23       -165      471.551317
23  24       -128      414.602645
24  25        -79      370.394991
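As an aside, writing out the coefficients by hand works, but np.polyval evaluates the polynomial directly from the coefficient array, which is less error-prone if you later change the polynomial order.  A minimal sketch, with a few made-up elevation values standing in for the course profile:

```python
import numpy as np
import pandas as pd

# Coefficients as returned by polyfit (highest power first)
coeff = np.array([7.40646826e-03, 6.30941912e-01, 3.74015634e+02])

# Stand-in for the WAM course profile loaded from the .csv
WAM = pd.DataFrame({'elevation': [0, 13, 203, -120]})

# np.polyval expects decreasing powers -- exactly the order polyfit returns
WAM['estimated_time'] = np.polyval(coeff, WAM['elevation'])
print(WAM)
```

Swapping the order parameter between polyfit and polyval then stays consistent automatically.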

Adding up all the times and converting to minutes:

WAM['estimated_time'].sum() / 60

This gives an estimated time of 202 minutes (3 hrs and 22 minutes).  That would be an amazing time!  But I suspect it’s a bit optimistic, as it uses a number of runs done on smooth road or track, which will be much faster than a trail run.  To get a more accurate estimate, I manually classified my runs over the last year as either ‘trail’, ‘road’, or ‘track’ and entered that information in the description field of each activity on Strava.  After retrieving only the classified data again using the Strava API, I used the code below to recalculate my estimated finishing time:

splits_trail = splits[splits['description'] == 'Trail']
coeff_trail = np.polyfit(splits_trail['elevation_difference'], splits_trail['elapsed_time'], 2)
WAM['estimated_time_trail'] = coeff_trail[0]*WAM['elevation']**2 + coeff_trail[1]*WAM['elevation'] + coeff_trail[2]
WAM['estimated_time_trail'].sum() / 60

This time I get an estimated finish time of 242 minutes (4 hrs and 2 minutes), which is almost exactly my goal of finishing in the middle of the pack!
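The minutes-to-h:mm conversion quoted above is easy to get wrong by hand, so here’s a small helper for it (plain Python, nothing Strava-specific):

```python
def minutes_to_hmm(total_minutes):
    """Convert decimal minutes to an 'Xh Ym' string, rounding to the nearest minute."""
    hours, minutes = divmod(round(total_minutes), 60)
    return f"{hours}h {minutes:02d}m"

print(minutes_to_hmm(202))  # -> 3h 22m
print(minutes_to_hmm(242))  # -> 4h 02m
```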

Final Thoughts

This has been an interesting exercise, and it provided quite a bit of insight through some exploratory data analysis and some simple modelling that was relatively quick and easy to do.  This is always a good approach, as it allows you to iterate quickly and understand the process and data more fully before diving into more complicated and time-consuming modelling techniques.  Our next step would likely be to build a more complex regression model and/or use another popular machine learning algorithm, like a random forest, which can utilize other potential factors in estimating pace.  We already identified that the type of surface is almost certainly a factor in estimating performance.  There are some other hypothesized factors we could add to train our model to see if it improves performance:

  • Fatigue estimate (split completed at beginning, middle or end of activity)
  • Temperature (hot day vs cold day)
  • More granular terrain classifications (e.g. smooth trail vs. technical trail)
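As a taste of what that next step might look like, here’s a minimal sketch using scikit-learn’s RandomForestRegressor on synthetic data.  The feature names (fatigue_km, temperature) are hypothetical stand-ins for the factors listed above, and the fake pace formula just reuses the quadratic from earlier with added penalties:

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(42)
n = 500

# Synthetic training data mimicking per-km splits with extra features
X = pd.DataFrame({
    'elevation_difference': rng.uniform(-150, 150, n),  # m gained/lost in the km
    'fatigue_km': rng.uniform(0, 25, n),                # km already covered
    'temperature': rng.uniform(5, 30, n),               # degrees C
})

# Fake pace: quadratic in elevation, plus fatigue and heat penalties, plus noise
y = (0.0074 * X['elevation_difference']**2 + 0.63 * X['elevation_difference']
     + 374 + 2 * X['fatigue_km'] + 1.5 * X['temperature'] + rng.normal(0, 20, n))

model = RandomForestRegressor(n_estimators=200, random_state=0)
model.fit(X, y)

# Predict pace for one hypothetical late-race, hot, uphill kilometre
km = pd.DataFrame({'elevation_difference': [120], 'fatigue_km': [22], 'temperature': [28]})
print(model.predict(km))
```

The appeal of a tree-based model here is that it handles interactions (e.g. fatigue hurting more on climbs) without us having to specify the functional form up front.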

Perhaps I will tackle this in a future post, but for now you have a solid set of tools to do some pretty cool analysis of your own activities.  We learned how to scrape race data from the web and retrieve data using an API, some creative ways to visualize that data, and finally how to build a simple regression model to predict future performance.  Pretty cool!
