I’ve been an avid runner for as long as I can remember. From playing tag with friends in the neighborhood to running cross country in college, running has been deeply ingrained in who I am as a person. A lot of people see running as a chore or something boring that can easily be ignored, but there is a lot of joy to be found in exploring the outdoors. Don’t get me wrong, running isn’t for everyone.
My family has a history of bad health problems. From heart and kidney failure to the rarer Osler-Weber-Rendu disease (OWRD), I will face many health complications as I grow older. One of the biggest health concerns in the world is OBESITY. There are plenty of studies that show that being overweight greatly increases your chances of developing or worsening other health conditions, like hypertension.
As a runner and someone who is learning that I can’t just eat whatever I want anymore, I wanted to see how many calories are burned during a general run and create a model that would predict said calories in any given run.
Give Me that Data
I personally don’t log my running activity. I’ll generally keep track of time and distance, but since getting out of high school, it’s just been a way to stay active. So I looked around and pulled a general log of data from Kaggle.com. This runner clocked very similar times to my own but tracked information all the way back to 2018. It was perfect. The logs were mostly recorded automatically through a Garmin watch, with a few entered manually.
Load the Data
We have a pretty clean data set with over 600 entries. However, after an initial once-over, there are still some things that need to be cleaned before it can be used.
We set our index column to the date and sort by ascending date. A look at the data set with .info shows that most of our columns are objects…cool…we need to change them to floats so that I can use them; otherwise they would all be high-cardinality features that would have to be removed. We also check for NaNs and any other weird entries, like the “--” entries in the Avg Pace column. I’m assuming those are values the runner didn’t have, and since there were only two in the whole data set, I chose to drop those.
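The cleaning steps above could look something like this. It’s a sketch, not the exact code: the column names “Date” and “Avg Pace” and the “MM:SS” pace format are assumptions about the Kaggle log’s layout.

```python
import pandas as pd


def pace_to_minutes(pace):
    """Convert an 'MM:SS' pace string into a float number of minutes."""
    minutes, seconds = pace.split(":")
    return int(minutes) + int(seconds) / 60


def clean_log(df):
    # Drop the handful of rows where the watch logged no pace ("--")
    df = df[df["Avg Pace"] != "--"].copy()
    # Parse the dates, index by them, and sort oldest to newest
    df["Date"] = pd.to_datetime(df["Date"])
    df = df.set_index("Date").sort_index()
    # Turn the 'MM:SS' pace objects into usable floats
    df["Avg Pace"] = df["Avg Pace"].map(pace_to_minutes)
    return df
```

The same `pace_to_minutes` idea extends to any of the other time-formatted object columns.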
So after changing all my ‘time’ objects to actual usable numbers, we had a usable data set.
Now let’s drop the ‘Title’ feature, since it’s a high-cardinality feature, and I chose to do some manual OneHotEncoding for the ‘Activity Type’ feature since there were only three categories. I also added a change-in-heart-rate feature, taking a baseline heart rate of 60 BPM and comparing that to the Max HR.
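The feature engineering might be sketched like this, with the caveat that the exact column names (“Title”, “Activity Type”, “Max HR”) and category labels are guesses based on the description above:

```python
import pandas as pd


def engineer_features(df):
    # 'Title' is free text with a unique value per run, so drop it
    df = df.drop(columns=["Title"])
    # Manual one-hot encoding: one 0/1 column per activity category
    for activity in ["Cardio", "Running", "Treadmill Running"]:
        df[activity] = (df["Activity Type"] == activity).astype(int)
    df = df.drop(columns=["Activity Type"])
    # Heart-rate change: how far Max HR climbs above a 60 BPM baseline
    df["HR Change"] = df["Max HR"] - 60
    return df
```

With only three categories, doing the encoding by hand keeps everything in plain pandas and avoids needing a pipeline later.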
Train the model
The first step to creating a model is to figure out your target variable and separate it from the data frame. You can’t train a model by giving it the answers; that’s called leakage.
Next we split the data into our training set, our validation set, and our test set. We use the training set to train our model, and the other two to confirm our scores. Since the data set isn’t super large, I chose to do an 80:10:10 split. I split by row count rather than by date because the number of runs varied a lot from year to year.
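Here’s one way the split and target separation above could look; treating ‘Calories’ as the target column name is an assumption:

```python
def split_by_length(df, target="Calories"):
    """80:10:10 train/val/test split by row count (data is date-sorted)."""
    n = len(df)
    train_end = int(n * 0.8)
    val_end = int(n * 0.9)
    parts = [df.iloc[:train_end], df.iloc[train_end:val_end], df.iloc[val_end:]]

    # Pull the target out of each split so the model never sees the answers
    def xy(part):
        return part.drop(columns=[target]), part[target]

    return [xy(part) for part in parts]
```

Because the frame is already sorted by date, slicing by position still keeps the validation and test sets at the chronological end.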
We get a baseline MAE and train the models. I chose to try four different models: LinearRegression, Ridge, RandomForestRegressor, and XGBRegressor. Since I manually did the OneHotEncoding and changed all my ‘Time’ objects to floats, there was no need for a pipeline.
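A minimal sketch of the baseline-plus-models comparison might look like this (XGBRegressor is left as a commented line since xgboost is a separate install from scikit-learn):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.metrics import mean_absolute_error


def evaluate_models(X_train, y_train, X_val, y_val):
    """Return validation MAE for a mean baseline and each candidate model."""
    # Baseline: always predict the mean of the training target
    baseline_pred = np.full(len(y_val), y_train.mean())
    scores = {"baseline": mean_absolute_error(y_val, baseline_pred)}
    models = {
        "linear": LinearRegression(),
        "ridge": Ridge(),
        "forest": RandomForestRegressor(n_estimators=100, random_state=42),
        # "xgb": XGBRegressor(),  # from xgboost, if it's installed
    }
    for name, model in models.items():
        model.fit(X_train, y_train)
        scores[name] = mean_absolute_error(y_val, model.predict(X_val))
    return scores
```

Any model worth keeping should beat the mean-prediction baseline by a wide margin.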
Scoring and Hypertuning!!!
We look at MAEs to see which model is doing the best, and it looks like the RandomForestRegressor comes out on top, so we’ll push forward with that model.
We want to see what features are most important to our model and which are least.
Hmmm, ‘Number of Laps’ is pretty high in the first run of the model. I looked at the R² value for the model with and without that column, and the model performs better without it, so let’s remove it. I did choose to leave the Cardio, Running, and Treadmill Running features in, as removing those hurt the model’s score.
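Ranking the forest’s built-in importances is a one-liner; a sketch of the helper used for the check above might be:

```python
import pandas as pd


def ranked_importances(model, feature_names):
    """Sort a fitted tree model's feature_importances_ from high to low."""
    return pd.Series(
        model.feature_importances_, index=feature_names
    ).sort_values(ascending=False)
```

Worth remembering that impurity-based importances can be misleading for correlated features, which is why re-scoring the model with the column actually dropped is the more honest test.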
With our model improved, let’s do some hyperparameter tuning. I chose to use a random grid search and got a good estimate for the n_estimators and max_depth of my model. We stick them in the RandomForestRegressor model and our model improves, just a little.
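The random grid search can be done with scikit-learn’s RandomizedSearchCV; the parameter ranges below are illustrative guesses, not the ones actually used:

```python
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV


def tune_forest(X_train, y_train):
    """Randomly sample n_estimators / max_depth combos and keep the best."""
    search = RandomizedSearchCV(
        RandomForestRegressor(random_state=42),
        param_distributions={
            "n_estimators": list(range(50, 301, 50)),
            "max_depth": list(range(3, 21)),
        },
        n_iter=5,                           # combos to try at random
        scoring="neg_mean_absolute_error",  # match the MAE metric above
        cv=3,
        random_state=42,
    )
    search.fit(X_train, y_train)
    return search.best_params_
```

A random search samples the grid instead of exhausting it, which is usually plenty for two parameters and much faster than GridSearchCV.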
Our model seems pretty decent. It’s off by about 38 calories, which is half a large egg, practically nothing!
Let’s Look at Some Pictures!
There are a lot of ways to visualize data. Some people enjoy just looking at hard numbers, but I personally enjoy some pictures, as they can help explain the features and how they interact with each other.
First, a correlation matrix, where we see that Distance, Time, and Calories all have a very large correlation with each other, which is to be expected.
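Computing the matrix is a single pandas call; a sketch, with the column names assumed:

```python
import pandas as pd


def correlation_matrix(df, cols=("Distance", "Time", "Calories")):
    """Pairwise Pearson correlations between the chosen numeric columns."""
    return df[list(cols)].corr()
```

The result can be passed straight to something like seaborn’s `heatmap` for the picture itself.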
Let’s also look at some partial dependence plots to see how calories burned change as each feature increases. Again, as expected, as Distance, Time, and Average Heart Rate increase, so do the calories burned.
I also looked at SHAP values as well. Here Time and Average Heart Rate push the predicted calories upward, while Distance pushes them downward? That’s a little unexpected. The downward pull from Distance makes more sense when you remember how strongly Distance and Time are correlated: once Time is accounted for, Distance adds little new information, so the model leans on it less. Overall, running longer still equals more calories burned.
What’s the Conclusion?
So our model is pretty good at predicting THIS runner’s burned calories, and it could work as a general guideline for the amount of calories burned in a run. Of course, every body is different, and if I had found a larger dataset I could have made a more accurate model. I have a feeling age, weight, and maybe gender would play a huge factor as well. Your largest contributors to burned calories are time, distance, and heart rate. Find a good balance that works for you and, most importantly, watch what foods go into your body.