Trail Racing – An Exploratory Data Analysis

Reading Time: 6 minutes

In this article I’ll cover how to do the following with Python:

  • Scrape past race results from the web using the Requests and Beautiful Soup libraries
  • Prepare and clean the data for analysis
  • Visualize results with the Matplotlib and Seaborn libraries
  • Set an audacious race goal (no need for Python here!)

This year I signed myself up for the 25km distance of the Whistler Alpine Meadows trail race, located in the stunning scenery around Whistler, BC.  Runners can choose from four distances: 12km, 25km, 55km or a monster 110km.  I’ve run a few road races from 5k up to a full marathon, but I’ve never raced 25km on technical trails with an insane vertical gain of 1500m!  Yikes!  For training motivation, I’d like to set a wholly unreasonable goal for myself.  Seems like a good chance to use some newly acquired Python skills to pull together and analyze some past race data from the web.

Web scraping past race results

Coast Mountain Trail Series, the race organizers, post all past race results.  I’m going to use those pages to download the data for the 25km distance of the 2016 and 2017 WAM races.  But instead of manually copying and pasting the data into a file for analysis, I’m going to use two Python packages: the terrific Requests package to fetch the pages, and Beautiful Soup, designed specifically for extracting data from HTML.  Let’s first import the packages we need.

import numpy as np # matrix functions
import pandas as pd # data manipulation
import requests # http requests
from bs4 import BeautifulSoup # html parsing

Next, I’ll define a function to extract a table of results from a webpage.  This function uses requests.get to retrieve the page, then creates a BeautifulSoup object and extracts the first HTML table, which contains the results.  Finally, it converts the table to a pandas DataFrame.

def read_results(url):
    res = requests.get(url)  # fetch the page
    soup = BeautifulSoup(res.content, 'lxml')  # parse the HTML
    table = soup.find_all('table')[0]  # the first table on the page holds the results
    return pd.read_html(str(table))[0]  # read_html returns a list of DataFrames; take the first

Let’s use our new function to extract the 2016 and 2017 results.

url = "http://www.racesplitter.com/races/204FD653A" # 2016 data
df_2016 = read_results(url)

url = "http://www.racesplitter.com/races/EC4DC29E6" # 2017 data
df_2017 = read_results(url)

Prepare and clean the data

For ease of analysis, I’ll complete a couple of other steps to prepare the data.  I’ll add a ‘Year’ field to indicate which year’s race each result is from, then concatenate the two tables into a single result table.  Next, the time format is a bit awkward, so I’ll convert it to minutes to make plotting and calculations easier.  Finally, I’ll save a local copy of the data so I can reproduce the analysis at a later date.

df_2016['Year'] = 2016
df_2017['Year'] = 2017

df = pd.concat([df_2016, df_2017]) # combine the two dataframes

def convert_to_minutes(row):
    # split 'H:MM:SS' into [hours, minutes, seconds] and weight each part,
    # e.g. '4:35:00' -> 4*60 + 35*1 + 0/60 = 275.0 minutes
    return sum(i*j for i, j in zip(map(float, row['Time'].split(':')), [60, 1, 1/60]))

df['Minutes'] = df.apply(convert_to_minutes, axis=1) # add a minutes field

df.to_csv('WAM Results.csv') # save a local copy of the data

Luckily the dataset is nice and ‘clean’: there are no missing values or mis-coded fields to contend with, so there isn’t any more data preparation to do before analysis.
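
If you want to verify that for yourself, a quick check along these lines (my own sketch, not part of the original steps) does the job:

print(df.isnull().sum()) # count of missing values in each column
print(df.dtypes) # confirm each field parsed as the expected type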

Visualize using Matplotlib and Seaborn

The first thing we can try is a basic histogram from the Matplotlib library to get a visual representation of the distribution of the finishing times.

import matplotlib.pyplot as plt # plotting

plt.figure(num=None, figsize=(8, 6), dpi=80, facecolor='w', edgecolor='k')
plt.hist(df['Minutes'], facecolor='blue', alpha=0.5)
plt.title('2016 & 2017 Whistler Alpine Meadows Times', fontsize=18, fontweight="bold")
plt.xticks(fontsize=16)
plt.yticks(fontsize=16)
plt.xlabel('Time (min)', fontsize=18)
plt.ylabel('Frequency', fontsize=18)
plt.show()

This gives us a bit of information: the average time is around 275 minutes and most racers finish somewhere between 175 and 375 minutes.  But that’s about it.  I’m going to try using the Seaborn library to create more visually appealing plots that also provide much more information.  First, let’s see if there’s any difference between the 2016 and 2017 results.  Stacking two histograms on one plot is an option, but a better solution is multiple boxplots:

import seaborn as sns # plotting

plt.figure(num=None, figsize=(8, 6), dpi=80, facecolor='w', edgecolor='k')
sns.boxplot(x="Year", y="Minutes", data=df)
plt.title('WAM Results by Year', fontsize=18, fontweight="bold")
plt.xticks(fontsize=16)
plt.yticks(fontsize=16)
plt.xlabel("Year", fontsize=18)
plt.ylabel("Minutes", fontsize=18)

As the boxplot shows the median and the first and third quartiles of the data, it is easier to interpret than the histogram above.  However, a boxplot still obscures some of the underlying distribution of the data, and it’s hard to compare two dimensions simultaneously (say, year and gender) without plotting a large number of boxplots.  Below are a couple of other great options that are just as easy to create with Seaborn.  First up, the violin plot:

plt.figure(num=None, figsize=(8, 6), dpi=80, facecolor='w', edgecolor='k')
sns.violinplot(x="Year", y="Minutes", data=df, inner='quartile')
plt.title('WAM Results by Year', fontsize=18, fontweight="bold")
plt.xticks(fontsize=16)
plt.yticks(fontsize=16)
plt.xlabel("Year", fontsize=18)
plt.ylabel("Minutes", fontsize=18)
plt.savefig("WAM ViolinPlot.png")

Note that the only line of code that’s changing is the actual sns plot call; the rest is just formatting and annotation.  By specifying the ‘inner’ parameter as quartile, the plot shows the median (dashed lines) and first and third quartiles (dotted lines), similar to the boxplot.  So here we can see the distribution of the data much more clearly (think of it as a probability distribution curve turned sideways and mirrored – the ‘fatness’ of the plot shows the number of observations at that point).  There’s an interesting skew in the 2017 results towards faster runners.  Since this is a fairly small dataset, we can take it one step further and plot each individual runner, distinguishing gender, with the swarmplot:

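Here’s a minimal sketch of that swarmplot call, reusing the formatting from the plots above (the title and output filename are my own guesses):

plt.figure(num=None, figsize=(8, 6), dpi=80, facecolor='w', edgecolor='k')
sns.swarmplot(x="Year", y="Minutes", hue="Gender", data=df) # one point per runner, coloured by gender
plt.title('WAM Results by Year and Gender', fontsize=18, fontweight="bold")
plt.xticks(fontsize=16)
plt.yticks(fontsize=16)
plt.xlabel("Year", fontsize=18)
plt.ylabel("Minutes", fontsize=18)
plt.savefig("WAM SwarmPlot.png")
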
What an awesome improvement over the boxplot!  We can still see the shape of the distributions, but also see how individual racers fared, identified by gender.  And the skew that we saw in the violin plots is clearly visible here: look at the group of 15 runners that finished together in just under 250 minutes in 2017, and the lead pack of 3 men and 1 woman who finished well ahead of everyone else.  What’s great is we’ve already learned quite a bit about the data with only a few plots, and we haven’t calculated any specific measures yet like mean or standard deviation.  There’s a lot we can learn just from visualizing data in new ways.

Finally, I’d like to understand how runners in different age groups fared, so let’s combine the violin and swarmplot on one plot with the code below:

# subset only men's results (copy so we don't modify a view of df)
men = df.loc[df['Gender'] == 'M'].copy()
# treat Age Group as categorical and drop any categories left empty by the filter
men['Age Group'] = men['Age Group'].astype('category').cat.remove_unused_categories()
# plot violin and swarm plots by age group
plt.figure(num=None, figsize=(8, 6), dpi=80, facecolor='w', edgecolor='k')
sns.violinplot(x="Age Group", y="Minutes", data=men, color='lightblue', inner='quartile')
sns.swarmplot(x="Age Group", y="Minutes", data=men, color='darkblue')
plt.title("Men's WAM Results by Age Group", fontsize=18, fontweight="bold")
plt.xticks(fontsize=16)
plt.yticks(fontsize=16)
plt.xlabel("Age Group", fontsize=18)
plt.ylabel("Minutes", fontsize=18)
plt.savefig("WAM Mens SwarmPlot.png")

Similarly for the women:
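
Reproducing that is just the men’s code with the gender filter flipped; a sketch along those lines, assuming women are coded ‘F’ and with my own colour choices:

# subset only women's results (same steps as for the men)
women = df.loc[df['Gender'] == 'F'].copy()
women['Age Group'] = women['Age Group'].astype('category').cat.remove_unused_categories()
# plot violin and swarm plots by age group
plt.figure(num=None, figsize=(8, 6), dpi=80, facecolor='w', edgecolor='k')
sns.violinplot(x="Age Group", y="Minutes", data=women, color='lightpink', inner='quartile')
sns.swarmplot(x="Age Group", y="Minutes", data=women, color='darkred')
plt.title("Women's WAM Results by Age Group", fontsize=18, fontweight="bold")
plt.xticks(fontsize=16)
plt.yticks(fontsize=16)
plt.xlabel("Age Group", fontsize=18)
plt.ylabel("Minutes", fontsize=18)
plt.savefig("WAM Womens SwarmPlot.png")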

So, finally, let’s calculate the actual median and quartile times for my age group (M 40-49) and figure out what per-km pace those correspond to.

# subset my age group's results
group_times = df.loc[df['Age Group'] == 'M 40-49', 'Minutes']
# 25th, 50th and 75th percentile finishing times (minutes)
print(np.round(np.percentile(group_times, [25, 50, 75]), 1))
# the corresponding per-km pace over the 25km course (min/km)
print(np.round(np.percentile(group_times, [25, 50, 75]) / 25, 1))
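
Those paces come out in decimal minutes, so getting the mm:ss values shown in the table below takes one more conversion step.  Here’s a small helper I’d use for that (fmt_pace is my own hypothetical name, not from the original analysis):

def fmt_pace(minutes_per_km):
    # convert decimal minutes (e.g. 9.65) into a 'mm:ss' pace string (e.g. '9:39')
    total_seconds = int(round(minutes_per_km * 60))
    m, s = divmod(total_seconds, 60)
    return f"{m}:{s:02d}"

# e.g. a 4 hr 1 min (241.3 minute) finish over 25km: fmt_pace(241.3 / 25) -> '9:39'
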
Finishing Group        Total Time      Pace Required
To beat 7 hr cutoff    7 hr            16:48 min / km
In the top 75%         4 hrs, 35 min   11:00 min / km
In the top 50%         4 hrs, 1 min     9:39 min / km
In the top 25%         3 hrs, 39 min    8:45 min / km

Based on historical results, it looks like I need to finish in about 4 hours (at a 9:39 min/km pace) to place in the top half of competitors in my age group.  That seems like a nice round number and a good stretch goal for my first trail race!

For my next post, I’ll figure out how to assess my training for the race and how close I am to achieving my goal on race day.  The course is a gnarly mix of flats, climbs and descents, so I certainly won’t be running at a constant 9:39 min/km pace.  I’ll show you how easy it is to use the Strava API to download all your training data, which is pretty neat.  Then I’ll build a crude model that estimates my WAM finishing time based on my typical pace over various elevation changes.  Pretty nerdy, but I like it.  Let’s see what the results are!
