Using the Strava API to retrieve activity data

In this article I’ll cover the following, using Python:

  • A brief description of APIs
  • Use the Requests library to retrieve your training data from Strava
  • Visualize running pace vs. elevation change

This is the second part of a series on how to use Python to visualize and analyze race and personal running data, with the goal of estimating future performance.  In the first part I did a bit of exploratory analysis of Whistler Alpine Meadows 25km race data to help set an overall goal for finishing time and required pace.  In this post I will see how that compares to my actual current run pacing, using data retrieved through the Strava API.

API overview

API stands for ‘Application Programming Interface’, a defined way for different servers and applications to talk to each other and share information.  There are a couple of good articles here and here explaining the basics of how they work.

So why do we care?  In the last post we retrieved our race result data from a static table of results on an html webpage.  Sometimes we can directly download a .txt or .csv file from a location on the web.  But in many cases the data is stored in a database on an organization’s server, and only bits of it are retrieved and presented as html pages as required.  This may not be convenient or easy to access for analysis purposes.  In the case of Strava, for example, we could log into our account, look at our activity feed, click on each activity and then scrape data from that page using Python.  But wouldn’t it be nicer if we could just request the data directly and receive it in a nice, easy-to-use format?

APIs to the rescue!  The Strava API allows us to do just that by providing specific urls and a format for authenticating ourselves and requesting data to be returned.  Many organizations make some or all of their data available through an API, including Twitter, YouTube, Spotify and NASA, to name a few.  Let’s focus on the specific task of retrieving our activity data from the Strava API.
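
To make the idea concrete, here is a minimal sketch of the two halves of an API exchange: building a request url with query parameters, and decoding the JSON text that comes back.  The response string is made up for illustration (the ids and types are borrowed from the output shown later in this post), and no network call is made.

```python
import json
from urllib.parse import urlencode

# Base url of the resource we want, plus the parameters of our request
base_url = "https://www.strava.com/api/v3/activities"
params = {"per_page": 2, "page": 1}
request_url = base_url + "?" + urlencode(params)
print(request_url)

# A made-up example of the JSON text an API might send back
response_text = (
    '[{"id": 1788816844, "type": "Run"},'
    ' {"id": 1787519321, "type": "Crossfit"}]'
)

# json.loads turns that text into ordinary Python lists and dictionaries
activities = json.loads(response_text)
for activity in activities:
    print(activity["id"], activity["type"])
```

The Requests library used below takes care of both steps for us, but it helps to know that underneath it is just a url going out and structured text coming back.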

Connecting to Strava API and retrieving data

In order to access the Strava API, we need to create an app on the Strava website, after which we will be provided with an access token and client secret to authorize requests for our data.  Here is a good article to walk you through this process.  Next, we will want to familiarize ourselves with what type of data is available and how to request it; there is good documentation on the Strava website.  Most organizations provide syntax and examples for sending requests to their API.
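
One caveat, raised in the comments below: in October 2018 Strava moved from ‘forever’ tokens to short-lived tokens that expire after a few hours.  The following is a rough sketch of how a refresh might look, not a tested implementation.  The endpoint and field names reflect my understanding of Strava’s OAuth documentation and should be checked against the current docs; the helper names are my own.

```python
import time

# Assumed token endpoint; verify against Strava's authentication docs
TOKEN_URL = "https://www.strava.com/oauth/token"


def token_is_expired(expires_at, now=None, margin=60):
    """Return True if the token expires within `margin` seconds.

    expires_at is assumed to be a Unix timestamp in seconds.
    """
    now = time.time() if now is None else now
    return expires_at - now <= margin


def build_refresh_payload(client_id, client_secret, refresh_token):
    """Form fields for a refresh request (field names are assumptions)."""
    return {
        "client_id": client_id,
        "client_secret": client_secret,
        "grant_type": "refresh_token",
        "refresh_token": refresh_token,
    }


def refresh_access_token(client_id, client_secret, refresh_token):
    """POST the refresh request and return the decoded JSON response."""
    import requests  # imported here so the helpers above work without it

    payload = build_refresh_payload(client_id, client_secret, refresh_token)
    response = requests.post(TOKEN_URL, data=payload, timeout=30)
    response.raise_for_status()
    # Expected to contain a new access token and its expiry time
    return response.json()
```

If requests start failing with an authorization error after working earlier in the day, an expired token is the first thing to check.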

In order to model our running pace vs. elevation, we want to get all the 1km split times and elevation changes for all of our runs.  We will do that in 2 steps:  1) get a list of the activity ids for all of our runs and 2) get all the split data for each of the activity ids from step 1.  The Python requests library makes it easy to retrieve this data from the API so we can load it into a dataframe.  First, let’s initialize a dataframe to hold the data we want (the activity id and the activity type) and set the parameters for the API GET request.

import matplotlib.pyplot as plt
import pandas as pd
import requests

# Initialize the dataframe
col_names = ['id','type']
activities = pd.DataFrame(columns=col_names)

access_token = "access_token=xxxxxxxx" # replace with your access token here
url = "https://www.strava.com/api/v3/activities"

Now, we will set up a loop to retrieve the list of activities in increments of 50 and populate the dataframe.

page = 1

while True:
    
    # get page of activities from Strava
    r = requests.get(url + '?' + access_token + '&per_page=50' + '&page=' + str(page))
    r.raise_for_status()  # stop with a clear HTTP error (e.g. an invalid or expired token)
    r = r.json()

    # if no results then exit loop
    if not r:
        break
    
    # otherwise add new data to dataframe
    for x in range(len(r)):
        activities.loc[x + (page-1)*50,'id'] = r[x]['id']
        activities.loc[x + (page-1)*50,'type'] = r[x]['type']

    # increment page
    page += 1

Note that all the instructions to the API are embedded in the url passed to the requests.get function.  There’s the base url for the activity data, then a question mark followed by our specific request parameters.  We need to pass our access token for authenticating ourselves to the Strava API, plus tell it how many results to retrieve per page and which page to retrieve.  When run with my access token it produced a dataframe with 219 activities (which matches my total on my Strava page, woo hoo!).  The first few lines of my dataframe look like this:

activities.head()

           id      type
0  1788816844       Run
1  1787519321  Crossfit
2  1783902835       Run
3  1779024087  Crossfit
4  1775031422       Run
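
Several readers in the comments hit a KeyError: 0 in the loop above.  That is what happens when the API returns an error message (a dictionary) instead of a list of activities, so r[0] does not exist.  Below is a sketch of the same pagination with an explicit check for that case.  It also sends the token as a Bearer header, which one commenter found necessary.  The function names are my own, and here the token is the bare token string, without the "access_token=" prefix.

```python
def collect_activities(get_page, per_page=50):
    """Request pages until an empty one comes back; return a list of dicts.

    get_page(page, per_page) is supplied by the caller and must return
    the decoded JSON for that page.
    """
    rows = []
    page = 1
    while True:
        batch = get_page(page, per_page)
        # Errors come back as a dict with a "message", not a list
        if isinstance(batch, dict):
            raise RuntimeError("API error: " + str(batch.get("message", batch)))
        if not batch:
            break
        rows.extend({"id": item["id"], "type": item["type"]} for item in batch)
        page += 1
    return rows


def strava_page_getter(token):
    """Build a get_page function that sends the token as a Bearer header."""
    import requests  # imported lazily so collect_activities can be tried offline

    def get_page(page, per_page):
        response = requests.get(
            "https://www.strava.com/api/v3/activities",
            headers={"Authorization": "Bearer " + token},
            params={"per_page": per_page, "page": page},
            timeout=30,
        )
        return response.json()

    return get_page
```

With a valid token, pd.DataFrame(collect_activities(strava_page_getter("xxxxxxxx"))) should give the same two-column dataframe as before.  Keeping the HTTP call separate from the paging logic also makes it easy to try the loop with canned data.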

My activity breakdown is as follows:

activities['type'].value_counts().plot(kind='bar')

Let’s include only run activities, and then send API requests to retrieve the 1 km split times and elevation changes for each of those runs.  The request url will be similar to the previous one, except we only need to include the specific activity id we want data from and of course our access token.

# filter to only runs
runs = activities[activities.type == 'Run']

# initialize dataframe for split data
col_names = ['average_speed','distance','elapsed_time','elevation_difference','moving_time','pace_zone', 'split','id','date']
splits = pd.DataFrame(columns=col_names)

# loop through each activity id and retrieve data
for run_id in runs['id']:
    
    # Load activity data
    print(run_id)
    r = requests.get(url + '/' + str(run_id) + '?' + access_token)
    r.raise_for_status()  # stop on HTTP errors instead of failing later on a missing key
    r = r.json()

    # Extract Activity Splits
    activity_splits = pd.DataFrame(r['splits_metric']) 
    activity_splits['id'] = run_id
    activity_splits['date'] = r['start_date']
    
    # Add to total list of splits
    splits = pd.concat([splits, activity_splits])
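
The extraction step in the loop above can also be pulled out into a small function, which makes it easier to test on a single response.  This is a sketch with a hypothetical helper name.  It assumes some activities might come back without a 'splits_metric' entry, which I have not verified, and the sample values are taken from the split table shown further down.

```python
def extract_splits(activity):
    """Return one row per metric split, tagged with activity id and start date."""
    rows = []
    # Fall back to an empty list if the activity has no split data
    for split in activity.get("splits_metric") or []:
        row = dict(split)  # copy so the original response is not modified
        row["id"] = activity["id"]
        row["date"] = activity["start_date"]
        rows.append(row)
    return rows


# A cut-down activity response, using values from this post
sample_activity = {
    "id": 174116621,
    "start_date": "2014-08-01T23:30:41Z",
    "splits_metric": [
        {"split": 1, "distance": 1003.1, "moving_time": 359, "elevation_difference": 2.1},
        {"split": 2, "distance": 998.7, "moving_time": 363, "elevation_difference": 1.4},
    ],
}

rows = extract_splits(sample_activity)
print(len(rows), rows[0]["id"], rows[0]["date"])
```

Collecting the rows from every run in one list and calling pd.DataFrame on it once at the end is an alternative to concatenating dataframes inside the loop.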

That’s all there is to it!  In the next section we will clean the data and do some basic analysis.

Visualize pace data vs. elevation

Below is an example of the split data for a single activity; our complete dataset is a concatenation of all splits for all activities.  As you can see, we have a partial split (less than 1km) of 333.6m at the end of this particular run.  This could skew our results, so we will want to filter out all splits that are not around 1 km in length.

activity_splits

   average_speed  distance  elapsed_time  elevation_difference  moving_time  \
0           2.79    1003.1           359                   2.1          359   
1           2.75     998.7           380                   1.4          363   
2           2.79    1000.2           362                  12.9          358   
3           2.67     999.1           374                 -10.7          374   
4           2.70     999.6           370                  19.0          370   
5           2.79    1000.1           359                 -20.4          359   
6           2.71    1001.7           370                  -3.5          370   
7           2.80     333.6           119                  -1.7          119   

   pace_zone  split         id                  date  
0          2      1  174116621  2014-08-01T23:30:41Z  
1          2      2  174116621  2014-08-01T23:30:41Z  
2          2      3  174116621  2014-08-01T23:30:41Z  
3          2      4  174116621  2014-08-01T23:30:41Z  
4          2      5  174116621  2014-08-01T23:30:41Z  
5          2      6  174116621  2014-08-01T23:30:41Z  
6          2      7  174116621  2014-08-01T23:30:41Z  
7          2      8  174116621  2014-08-01T23:30:41Z

Histogram of all split distances.

As you can see from the data, most of our splits are around 1km, but there are a number that are quite a bit shorter.  Let’s keep only those within 1000m +/- 50m (5%).

# Filter to only those within +/-50m of 1000m
splits = splits[(splits.distance > 950) & (splits.distance < 1050)]
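
Because the remaining splits are all close to 1 km, moving_time can stand in for pace in seconds per km.  If you want the exact figure, divide by the actual distance.  Here is a small plain-Python sketch of that calculation using three splits from the table above; the helper names are my own.

```python
def pace_seconds_per_km(moving_time, distance):
    """Convert a split's moving time (s) and distance (m) to seconds per km."""
    return moving_time / distance * 1000


def format_pace(seconds_per_km):
    """Format a pace in seconds per km as m:ss."""
    minutes, seconds = divmod(round(seconds_per_km), 60)
    return f"{minutes}:{seconds:02d}"


# (moving_time, distance) pairs taken from the sample activity
sample = [(359, 1003.1), (363, 998.7), (119, 333.6)]

for moving_time, distance in sample:
    # Same filter as above: keep only splits within 50m of 1000m
    if 950 < distance < 1050:
        pace = pace_seconds_per_km(moving_time, distance)
        print(distance, format_pace(pace))
```

In pandas the same normalization is a single expression: splits['moving_time'] / splits['distance'] * 1000.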

Filtering reduces our sample from 1747 to 1561 data points, still a fair number for our analysis.  Finally, let’s take a look at a scatter plot of elevation change vs. pace.

plt.plot('elevation_difference', 'moving_time', data=splits, linestyle='', marker='o', markersize=3, alpha=0.1, color="blue")

Since there are a lot of overlapping points, a helpful tip is to use the transparency parameter alpha.  This highlights the concentration of points in a particular region of the graph; otherwise you will just end up with a solid blob of points in the middle.

The data is pretty noisy, which I suspect is due to additional factors that we didn’t account for, like terrain (road vs. trail), temperature (hot vs. cool) or fatigue (beginning of a run vs. the end).  It would be great to be able to classify runs or splits by terrain (track, road, easy trail, technical trail); fingers crossed this will be a future Strava improvement.  Perhaps for a future post I’ll take a shot at manually classifying my activities to see if we can improve the model, but for now let’s work with what we’ve got.

Despite the noise, there is still a noticeable trend in the data.  The fastest times are on a slight decline, with times increasing on either side, and inclines reducing speed more than declines.
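
One way to see a trend through noise like this is to group the splits into elevation bins and look at the median time in each bin.  The sketch below does that in plain Python on the seven full splits from the sample activity above.  The 10m bin width is an arbitrary choice, and a single run is far too little data to draw conclusions from; it only shows the mechanics.

```python
from collections import defaultdict
from statistics import median


def median_time_by_elevation(splits, bin_width=10):
    """Group (elevation_difference, moving_time) pairs into elevation bins.

    Returns {bin_start: median moving_time}, ordered by bin_start.
    """
    bins = defaultdict(list)
    for elevation, moving_time in splits:
        # Floor division puts e.g. -3.5 into the bin that starts at -10
        bin_start = int(elevation // bin_width) * bin_width
        bins[bin_start].append(moving_time)
    return {start: median(times) for start, times in sorted(bins.items())}


# (elevation_difference, moving_time) for the full splits of the sample run
sample = [
    (2.1, 359), (1.4, 363), (12.9, 358), (-10.7, 374),
    (19.0, 370), (-20.4, 359), (-3.5, 370),
]

for bin_start, med in median_time_by_elevation(sample).items():
    print(bin_start, med)
```

Run over the full set of splits, a table like this gives a rough, outlier-resistant summary of how time per km changes with elevation.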

For my next and final post in this series, I’ll apply some regression modelling to this data to allow for estimation of pace for each km of the WAM course and finally calculate an overall estimated race time.  Let’s see how achievable my goal of 4 hours really is!

 

12 thoughts on “Using the Strava API to retrieve activity data”

  1. Hello, I really enjoyed the article. However, I’m having trouble with the Strava API. I have my access token but am unable to retrieve my activities using the get method you include. I receive the below message. I am 100% certain that my access token is correct.

    I’ve found this very frustrating. Is it also necessary to use OAuth to get this information? I would be very grateful if you could help me.

    https://www.strava.com/api/v3/activities?access_token=xxxxxxxxxxxxxxxxxxxxxxxx&per_page=20&page=1

    {
    "message": "Authorization Error",
    "errors": [
    {
    "resource": "Athlete",
    "field": "access_token",
    "code": "invalid"
    }
    ]
    }

    1. Hi Thomas, that is strange. I just tried again with the url above using my access token and it seems to work fine. I did notice in the Strava API documentation that in Oct of 2018 they changed from ‘forever’ tokens to ‘short lived’ tokens that expire after a few hours. Is it possible that your token expired? How about if you try with a different endpoint, like your athlete info: https://www.strava.com/api/v3/athlete?access_token=xxxxxxxxx ?

      Beyond that, I’m not sure what else to try, Googling the problem didn’t yield any further insight for me. Let me know if you get it to work, otherwise Stack Overflow might be of help. Good luck!

  2. Hi Michael,

    Thanks for the post. It is really helpful. If I fetch all types of activities for the users of my application, would Strava ever restrict me from making many requests? If yes, in which cases does this rate limit apply?

  3. Thank you for creating this!

    I’m trying to execute this but keep running into a KeyError: 0 after the Activities.loc[x + (page-1)*40,'id'] = r[x]['id']

    Any idea what I am doing wrong?

    I haven’t had any luck on Stack overflow with this issue – thought I’d go directly to the source!

    Thanks!
    Ryan

  4. What made it work for me was to change the get request to this:
    page = 1
    url = "https://www.strava.com/api/v3/activities"
    headers = {'Authorization': 'Bearer ' + access_token}
    r = requests.get(url + '?' + '&per_page=50' + '&page=' + str(page), headers = headers)

  5. Thanks for the article! I’m a little stuck when it comes to the line activity_splits = pd.DataFrame(my_dataset['splits_metric']). I keep getting the error “list indices must be integers, not str”. Any idea where this would come from?

  6. Hi,

    I’m getting the following error:

    KeyError Traceback (most recent call last)
    in
    24 # otherwise add new data to dataframe
    25 for x in range(len(r)):
    ---> 26 activities.loc[x + (page-1)*50,'id'] = r[x]['id']
    27 activities.loc[x + (page-1)*50,'type'] = r[x]['type']
    28

    KeyError: 0

    Any idea why?

    Thanks
