I spent a ton of time engineering (over-engineering actually) my features and running backtests until I was sure that I had a world beater. When my model was finally put into production, it spent a year doing basically nothing. Ultimately, I think it produced a slightly negative cumulative return.
I count myself lucky. Thinking back, I realize that my model was massively overfit and I’m fortunate that the bets it recommended did not blow up and cost my firm significant amounts of money.
Recognizing When A Model Is Overfit
Overfitting is when we fit our model so closely to the existing data that it loses the ability to generalize. Models that generalize well can adapt reasonably successfully to new data, especially observations that are unlike any the model has seen so far. An overfit model cannot generalize, so it is highly likely to perform erratically (and probably badly) once it is put into production and truly goes out of sample. The most common causes of overfitting are:
- Spurious correlations: if we look hard enough, we will find strong correlations. For example, we might find that the price of Bitcoin is highly correlated to the price of pizza in Zimbabwe. But that’s most likely due to chance and randomness, not anything real, and we would be foolish to bet money on such a correlation. If we fit our model with a bunch of factors that are spuriously correlated to the thing that we are trying to predict, it will not generalize well.
- Overuse of the test set: This one is really hard to completely avoid. If a model doesn’t work well on our test set (the test set is the portion of our data that we hold out so that we can assess how the model generalizes on new data), then we will tweak it until we find a configuration that works well on both the training set and test set. The implication of doing so is that the test set score is no longer an unbiased estimate of our model’s out of sample performance. After all, once we start making modeling decisions using the hold out set (a.k.a. the test set), can we really still consider it held out?
- A biased training set: Our training data will rarely be truly representative of the population that we are trying to model. So we should be aware that, at some point, we are virtually guaranteed to run into data that our model finds completely unfamiliar. And while we should do our best to match the characteristics of our sample to those of the overall population, we should also know the areas where the sample falls short, because data from these areas represents the greatest risk to our model. If our sample is representative of only a small portion of our population, then our model will perform poorly over time.
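To make the spurious correlation point concrete, here is a minimal sketch using NumPy. Everything in it is an assumption made for illustration: the target and all 1,000 candidate factors are pure random noise, and the helper name `pearson_with_target` is hypothetical. The idea is that if we search enough unrelated factors, one of them will look correlated in our sample, and that relationship should mostly vanish on fresh data.

```python
import numpy as np

rng = np.random.default_rng(0)

def pearson_with_target(features, target):
    """Pearson correlation of each feature column with the target."""
    f = features - features.mean(axis=0)
    t = target - target.mean()
    return (f.T @ t) / (np.linalg.norm(f, axis=0) * np.linalg.norm(t))

n_features = 1000

# "Historical" sample: the target and every candidate factor are pure noise.
features = rng.normal(size=(100, n_features))
target = rng.normal(size=100)

# Search for the factor that looks most correlated in sample.
in_sample = pearson_with_target(features, target)
best = int(np.argmax(np.abs(in_sample)))

# Fresh data from the same (noise) process: check the same factor again.
new_features = rng.normal(size=(400, n_features))
new_target = rng.normal(size=400)
out_of_sample = pearson_with_target(new_features, new_target)[best]

print(f"best factor in sample:      {in_sample[best]:+.2f}")
print(f"same factor on fresh data:  {out_of_sample:+.2f}")
```

The factor we "discovered" was selected precisely because it happened to line up with the target in our sample, so its in-sample correlation overstates any real relationship (here there is none at all).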
How Cross Validation Helps
Cross validation is a technique that allows us to produce test-set-like scoring metrics using only the training set. That is, it allows us to simulate the effects of “going out of sample” using just our training data, so we can get a sense of how well our model generalizes.
Without cross validation, the traditional model training process looks like this:
[Figure: Traditional train-test split]
We train on the blue part until we feel like our model is ready to face the wild. Then we score it on the test set (the gold part). The drawback of the traditional way is that we only get one shot at things. The moment we test our model on the test set, we’ve compromised our test data. And if our test set results were terrible, what then? Would we really be alright with throwing away all those hours of work or would we just start optimizing our results for the test set?
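For reference, the traditional process can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: the data is synthetic, the 80/20 split ratio is arbitrary, and `fit_ols` and `r2` are hypothetical helper names, not functions from any library.

```python
import numpy as np

rng = np.random.default_rng(42)

# Synthetic data (an assumption): y is a linear function of X plus noise.
X = rng.normal(size=(200, 3))
y = X @ np.array([1.5, -2.0, 0.5]) + rng.normal(scale=1.0, size=200)

# Shuffle once, then hold out 20% of the rows as the test set.
idx = rng.permutation(len(X))
n_test = len(X) // 5
test_idx, train_idx = idx[:n_test], idx[n_test:]

def fit_ols(X, y):
    """Least-squares coefficients, intercept first."""
    A = np.column_stack([np.ones(len(X)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coef

def r2(coef, X, y):
    """R² of the fitted coefficients on the given data."""
    pred = np.column_stack([np.ones(len(X)), X]) @ coef
    ss_res = np.sum((y - pred) ** 2)
    ss_tot = np.sum((y - y.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

coef = fit_ols(X[train_idx], y[train_idx])
train_r2 = r2(coef, X[train_idx], y[train_idx])  # in-sample score
test_r2 = r2(coef, X[test_idx], y[test_idx])     # the one shot at the test set

print(f"train R²: {train_r2:.3f}, test R²: {test_r2:.3f}")
```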
If only there were a way to simulate how our model might perform on the test set without actually using the test set. There is! And it’s called cross validation.
We will focus on a specific type of cross validation called K-folds (so when I merely say cross validation, I mean K-folds cross validation). K-folds cross validation splits our training data into K folds (folds = subsections). We then train and test our model K times so that each and every fold gets a chance to be the pseudo test set, which we call the validation set. Let’s use some visuals to get a better understanding of what’s going on:
[Figure: 3-Folds Cross Validation]
Say we are developing a predictive linear regression model and are using R² as our primary scoring metric. We have some data that we have split into a training set and a test set. Our primary concern is the accuracy of the out of sample predictions of our model (how well it generalizes). Thus, we’ve decided to not look at the test set until the very end (so we can give our model an intellectually honest grade).
But as we tweak and refine our model, we still want to get a sense of how the changes we are making might affect its out of sample performance. So we cross validate:
- We decide to run 3-folds cross validation, meaning that we split our training data into 3 folds of equal size.
- In Run 1 of our cross validation, Fold 1 is held out (the pink rectangle labeled Hold Out 1). So we train on the training data (blue rectangle) that is not in Fold 1, and then validate on Fold 1. This means that we fit our model using the non-Fold 1 training data, and then calculate and record how well we predicted the observations of the dependent variable in Fold 1. Crucially, in Run 1 we did not use any of the data in Fold 1 during the training of our model, so the R² calculated using Fold 1 is sort of like an out of sample R².
- In Run 2, Fold 2 is held out. Now Fold 1, which was previously our validation set, has become part of our training set. We fit our model using the data in Fold 1 and Fold 3, and score it using Fold 2 (via R²).
- After Run 3 concludes, we now have three R² values (since each fold gets a turn to be held out). The average of the three R²s gives us a decent estimate of the out of sample R² of our model.
The most important thing to remember about how cross validation works is that in each run, the scoring metrics it reports are calculated on just the fold that was held out.
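The three runs described above can be written out by hand. The sketch below is one possible implementation, not a definitive one: the training data is synthetic, and `fit_ols`, `r2`, and `k_fold_r2` are hypothetical helpers written for this example. Note that each score is computed only on the fold that was held out of that run.

```python
import numpy as np

def fit_ols(X, y):
    """Least-squares coefficients, intercept first."""
    A = np.column_stack([np.ones(len(X)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coef

def r2(coef, X, y):
    """R² of the fitted coefficients on the given data."""
    pred = np.column_stack([np.ones(len(X)), X]) @ coef
    ss_res = np.sum((y - pred) ** 2)
    ss_tot = np.sum((y - y.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

def k_fold_r2(X, y, k=3):
    """Return one held-out R² per fold."""
    folds = np.array_split(np.arange(len(X)), k)
    scores = []
    for i, val_idx in enumerate(folds):
        # Train on every fold except fold i ...
        train_idx = np.concatenate([f for j, f in enumerate(folds) if j != i])
        coef = fit_ols(X[train_idx], y[train_idx])
        # ... and score only on the fold that was held out.
        scores.append(float(r2(coef, X[val_idx], y[val_idx])))
    return scores

# Synthetic training set (an assumption); the test set is never touched here.
rng = np.random.default_rng(1)
X_train = rng.normal(size=(150, 3))
y_train = X_train @ np.array([1.5, -2.0, 0.5]) + rng.normal(size=150)

scores = k_fold_r2(X_train, y_train, k=3)
cv_r2 = float(np.mean(scores))
print("R² per fold:", [round(s, 3) for s in scores])
print("cross validation R²:", round(cv_r2, 3))
```

In practice there is usually no need to write this loop ourselves. For example, scikit-learn provides `KFold` and `cross_val_score` in `sklearn.model_selection`, which handle the splitting and scoring.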
A few tips when employing cross validation:
- It’s important to keep in mind that the cross validation score (such as R²) of our model tends to be an optimistic estimate of its performance on the test set, just as the performance on the test set tends to be an optimistic estimate of the model’s true ability to generalize.
- If a change we made to the model increases the training score (which is estimated in sample) but decreases the cross validation score, then that’s a good sign that we are overfitting the model.
- Having too few folds hamstrings the model because too much of the training data gets held out in each cross validation run. For example, with 2 folds, half the training data is held out, so the model is fit on only the remaining half (resulting in scoring metrics that are lower than they should be). An under-fit model generalizes just as badly as an overfit one.
- It’s good practice to shuffle the data before the train-test split in case the data was sorted. If it were sorted in some way and we neglected to shuffle it, then the split would produce biased data sets, neither of which would be representative of the actual population.
- Once we’re done refining the model via cross validation, we should refit the model on the entire training set before testing it on the test set.
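Several of these tips can be seen in one small experiment. The sketch below is an assumption-laden illustration, not a recipe: the data is synthetic (a straight line plus noise, deliberately sorted), polynomial degree stands in for "a change we made to the model", and `r2` and `train_and_cv_r2` are hypothetical helpers. It shuffles before splitting, compares the training score with the cross validation score, and only touches the test set once, after refitting on the entire training set.

```python
import numpy as np
from numpy.polynomial import Polynomial

rng = np.random.default_rng(7)

# Synthetic, deliberately sorted data (an assumption): a line plus noise.
x = np.linspace(-1.0, 1.0, 40)
y = 2.0 * x + rng.normal(scale=0.5, size=x.size)

# Shuffle before splitting so neither set covers just one end of the range.
order = rng.permutation(x.size)
test_idx, train_idx = order[:10], order[10:]
x_train, y_train = x[train_idx], y[train_idx]
x_test, y_test = x[test_idx], y[test_idx]

def r2(y_true, y_pred):
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return float(1.0 - ss_res / ss_tot)

def train_and_cv_r2(x, y, degree, k=3):
    """Return (in-sample R², mean held-out-fold R²) for a polynomial fit."""
    train_score = r2(y, Polynomial.fit(x, y, degree)(x))
    folds = np.array_split(np.arange(x.size), k)
    cv_scores = []
    for i, val in enumerate(folds):
        trn = np.concatenate([f for j, f in enumerate(folds) if j != i])
        fold_fit = Polynomial.fit(x[trn], y[trn], degree)
        cv_scores.append(r2(y[val], fold_fit(x[val])))
    return train_score, float(np.mean(cv_scores))

# A more flexible model always looks at least as good in sample;
# the cross validation score is what reveals the overfitting.
results = {d: train_and_cv_r2(x_train, y_train, d) for d in (1, 3, 15)}
for d, (train_score, cv_score) in results.items():
    print(f"degree {d:>2}: train R² = {train_score:.3f}, CV R² = {cv_score:.3f}")

# Choose by cross validation score, refit on the ENTIRE training set,
# and only then score once on the test set.
best_degree = max(results, key=lambda d: results[d][1])
final_model = Polynomial.fit(x_train, y_train, best_degree)
test_r2 = r2(y_test, final_model(x_test))
print(f"chosen degree: {best_degree}, test R² = {test_r2:.3f}")
```

The degree-15 model should show the highest training score and a much worse cross validation score, which is the warning sign described in the tips above.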
“Understanding Cross Validation | How Cross Validation Helps Us Avoid The Pitfalls Of Overfitting” – Tony Yiu