In this article I’ll cover the following with Python:
- A brief description of APIs
- Using the Requests library to retrieve your training data from Strava
- Visualizing running pace vs. elevation change
This is the second part in a series on how to use Python to visualize and analyze race and personal running data, with the goal of estimating future performance. In the first part I did a bit of exploratory analysis of Whistler Alpine Meadows 25km race data to help set an overall goal for finishing time and required pace. In this post I will see how that compares to my actual current run pacing, using data retrieved via the Strava API.
API stands for ‘Application Programming Interface’; APIs allow different servers and applications to talk to and share information with each other. There are a couple of good articles here and here explaining the basics of how they work. So why do we care? In the last post we retrieved our race result data from a static table of results on an HTML webpage. Sometimes we can directly download a .txt or .csv file from a location on the web, but in many cases the data is stored in a database on an organization’s server, and only bits of it are retrieved and presented as HTML pages as required. This may not be convenient or easy to access for analysis purposes. In the case of Strava, for example, we could log into our account, look at our activity feed, click on each activity, and then scrape data from each page using Python. But wouldn’t it be nicer if we could just request the data directly and receive it in a nice, easy-to-use format? APIs to the rescue! The Strava API allows us to do just that, by providing specific URLs and a format for authenticating ourselves and requesting data to be returned. Many organizations make some or all of their data available through an API; Twitter, YouTube, Spotify, and NASA, to name a few. Let’s focus on the specific task of retrieving our activity data from the Strava API.
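To make the “easy-to-use format” point concrete: an API typically returns structured JSON rather than an HTML page to scrape. Here is a minimal sketch with a hypothetical payload (the ids are illustrative, not real API output):

```python
import json

# Hypothetical JSON payload, similar in shape to what an activities API might return
payload = '[{"id": 1788816844, "type": "Run"}, {"id": 1787519321, "type": "Crossfit"}]'
activities = json.loads(payload)

# Structured data: no HTML parsing needed
print(activities[0]["type"])  # → Run
```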
Connecting to the Strava API and retrieving data
In order to access the Strava API, we need to create an app on the Strava website, after which we are provided with an access token and client secret to authorize requests for our data. Here is a good article to walk you through this process. Next, we want to familiarize ourselves with what type of data is available and how to request it; there is good documentation on the Strava website. Most organizations provide syntax and examples for sending requests to their API.
In order to model our running pace vs. elevation, we want to get the 1km split times and elevation changes for all of our runs. We will do that in two steps: 1) get a list of the activity ids for all of our runs, and 2) get all the split data for each of the activity ids from step 1. The Python requests library makes it easy to retrieve this data from the API so we can populate a dataframe. First, let’s initialize a dataframe to hold the data we want (the activity id and the activity type) and set the parameters for the API GET request:
```python
import requests
import pandas as pd

# Initialize the dataframe
col_names = ['id', 'type']
activities = pd.DataFrame(columns=col_names)

access_token = "access_token=xxxxxxxx"  # replace with your access token here
url = "https://www.strava.com/api/v3/activities"
```
Now, we will set up a loop to retrieve the list of activities in increments of 50 and populate the dataframe.
```python
page = 1
while True:
    # get page of activities from Strava
    r = requests.get(url + '?' + access_token + '&per_page=50' + '&page=' + str(page))
    r = r.json()

    # if no results then exit loop
    if not r:
        break

    # otherwise add new data to dataframe
    for x in range(len(r)):
        activities.loc[x + (page - 1) * 50, 'id'] = r[x]['id']
        activities.loc[x + (page - 1) * 50, 'type'] = r[x]['type']

    # increment page
    page += 1
```
Note that all the instructions to the API are embedded in the URL passed to the requests.get function: there’s the base URL for the activity data, then a question mark followed by our specific request parameters. We need to pass our access token to authenticate ourselves to the Strava API, plus tell it how many results to retrieve per page and which page to retrieve. When run with my access token, this produced a dataframe with 219 activities (which matches the total on my Strava page, woo hoo!). The first few lines of my dataframe look like this:
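For readers who want to see exactly how that query string is put together, here is a small sketch using the standard library (the token is a placeholder; requests can also build this for you via its params argument):

```python
from urllib.parse import urlencode

base_url = "https://www.strava.com/api/v3/activities"
# Placeholder token; the real one comes from your Strava app settings
params = {"access_token": "xxxxxxxx", "per_page": 50, "page": 1}

# base URL + '?' + key=value pairs joined by '&'
full_url = base_url + "?" + urlencode(params)
print(full_url)
# → https://www.strava.com/api/v3/activities?access_token=xxxxxxxx&per_page=50&page=1
```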
```
activities.head()

           id      type
0  1788816844       Run
1  1787519321  Crossfit
2  1783902835       Run
3  1779024087  Crossfit
4  1775031422       Run
```
My activities break down into a mix of runs and Crossfit workouts.
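A quick way to get that breakdown is pandas’ value_counts; here a small sample dataframe stands in for the real activities dataframe built above:

```python
import pandas as pd

# Sample data standing in for the `activities` dataframe retrieved earlier
activities = pd.DataFrame({'type': ['Run', 'Crossfit', 'Run', 'Crossfit', 'Run']})

# Count activities by type
counts = activities['type'].value_counts()
print(counts['Run'], counts['Crossfit'])  # → 3 2
```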
Let’s include only run activities, and then send API requests to retrieve the 1 km split times and elevation changes for each of those runs. The request url will be similar to the previous one, except we only need to include the specific activity id we want data from and of course our access token.
```python
# filter to only runs
runs = activities[activities.type == 'Run']

# initialize dataframe for split data
col_names = ['average_speed', 'distance', 'elapsed_time', 'elevation_difference',
             'moving_time', 'pace_zone', 'split', 'id', 'date']
splits = pd.DataFrame(columns=col_names)

# loop through each activity id and retrieve data
for run_id in runs['id']:
    # Load activity data
    print(run_id)
    r = requests.get(url + '/' + str(run_id) + '?' + access_token)
    r = r.json()

    # Extract Activity Splits
    activity_splits = pd.DataFrame(r['splits_metric'])
    activity_splits['id'] = run_id
    activity_splits['date'] = r['start_date']

    # Add to total list of splits
    splits = pd.concat([splits, activity_splits])
```
That’s all there is to it! In the next section we will clean the data and do some basic analysis.
Visualize pace data vs. elevation
Below is an example of the split data for a single activity; our complete dataset is a concatenation of the splits for all activities. As you can see, there is a partial split (less than 1 km) of 333.6 m at the end of this particular run. Partial splits like this could skew our results, so we will want to filter out all splits that are not close to 1 km in length.
```
activity_splits

   average_speed  distance  elapsed_time  elevation_difference  moving_time  \
0           2.79    1003.1           359                   2.1          359
1           2.75     998.7           380                   1.4          363
2           2.79    1000.2           362                  12.9          358
3           2.67     999.1           374                 -10.7          374
4           2.70     999.6           370                  19.0          370
5           2.79    1000.1           359                 -20.4          359
6           2.71    1001.7           370                  -3.5          370
7           2.80     333.6           119                  -1.7          119

   pace_zone  split         id                  date
0          2      1  174116621  2014-08-01T23:30:41Z
1          2      2  174116621  2014-08-01T23:30:41Z
2          2      3  174116621  2014-08-01T23:30:41Z
3          2      4  174116621  2014-08-01T23:30:41Z
4          2      5  174116621  2014-08-01T23:30:41Z
5          2      6  174116621  2014-08-01T23:30:41Z
6          2      7  174116621  2014-08-01T23:30:41Z
7          2      8  174116621  2014-08-01T23:30:41Z
```
Histogram of all split distances.
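A histogram like that can be produced with matplotlib’s hist; the sketch below uses synthetic distances in place of the real splits['distance'] column:

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so this runs headless
import matplotlib.pyplot as plt
import numpy as np

# Synthetic split distances standing in for splits['distance']:
# mostly ~1000 m splits plus a handful of short partial splits
rng = np.random.default_rng(42)
distances = np.concatenate([rng.normal(1000, 15, 400), rng.uniform(100, 900, 40)])

counts, bin_edges, _ = plt.hist(distances, bins=40)
plt.xlabel("Split distance (m)")
plt.ylabel("Number of splits")
```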
As you can see from the data, most of our splits are around 1 km, but a number are quite a bit shorter. Let’s keep only those within 1000 m +/- 50 m (5%).
```python
# Filter to only those within +/-50m of 1000m
splits = splits[(splits.distance > 950) & (splits.distance < 1050)]
```
That reduces our sample from 1747 to 1561 data points, still a fair number for our analysis. Finally, let’s take a look at a scatter plot of elevation change vs. pace.
```python
import matplotlib.pyplot as plt

plt.plot('elevation_difference', 'moving_time', data=splits,
         linestyle='', marker='o', markersize=3, alpha=0.1, color="blue")
```
Since there are a lot of overlapping points, a helpful tip is to use the transparency parameter alpha. This highlights the concentration of points in a particular region of the graph; otherwise you just end up with a solid blob of points in the middle. The data is pretty noisy, which I suspect is due to additional factors we didn’t account for, like terrain (road vs. trail), temperature (hot vs. cool), or fatigue (the beginning of a run vs. the end). It would be great to be able to classify runs or splits by terrain (track, road, easy trail, technical trail); fingers crossed this will be a future Strava improvement. Perhaps in a future post I’ll take a shot at manually classifying my activities to see if we can improve the model, but for now let’s work with what we’ve got. Not surprisingly, there is still a noticeable trend in the data: the fastest times occur on a slight decline, with times increasing on either side, and inclines slowing pace more than declines do.
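One note before modelling: moving_time is in seconds per split, so for interpretability it can be handy to convert each split to pace in minutes per km. A small sketch, with sample numbers standing in for the real splits dataframe:

```python
import pandas as pd

# Sample values standing in for the real `splits` dataframe
splits = pd.DataFrame({'moving_time': [359, 363, 358],
                       'distance': [1003.1, 998.7, 1000.2]})

# pace (min/km) = time in minutes divided by distance in km
splits['pace_min_km'] = (splits['moving_time'] / 60) / (splits['distance'] / 1000)
print(splits['pace_min_km'].round(2).tolist())
```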
For my next and final post in this series, I’ll apply some regression modelling to this data to allow for estimation of pace for each km of the WAM course and finally calculate an overall estimated race time. Let’s see how achievable my goal of 4 hours really is!