Data leakage occurs when a statistical model is trained with information that would not actually be available when using the model to make predictions.
Data leakage makes the results during training and validation look much better than what is observed once the model is deployed, producing overly optimistic performance estimates and possibly an entirely invalid predictive model.
There is no single recipe for eliminating data leakage, but some practices help to avoid it:
- Don’t use future data to predict the past. Although this sounds obvious, it is a very common mistake when validating models, especially with cross-validation. When training on time-series data, always use a validation strategy that respects temporal order.
- Prepare the data within cross-validation folds. Another common mistake is to perform data preparation steps, such as normalization or outlier removal, on the whole dataset before splitting it for validation; statistics computed on the held-out data then leak into training.
- Investigate IDs. It’s easy to dismiss IDs as randomly generated values, but sometimes they encode information about the target variable. If they are leaky, it’s best to remove them from any sort of model.
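The first practice can be sketched with scikit-learn's `TimeSeriesSplit` (one possible choice of temporal validation strategy, not the only one): every fold trains only on observations that come before the ones it validates on.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# 20 observations ordered in time; the row index acts as the timestamp.
X = np.arange(20).reshape(-1, 1)
y = np.arange(20)

# Each split trains on a prefix of the series and validates on the
# observations that immediately follow it, so the model never sees
# the future during validation.
tscv = TimeSeriesSplit(n_splits=4)
for train_idx, test_idx in tscv.split(X):
    # The entire training window precedes the test window.
    assert train_idx.max() < test_idx.min()
```

A plain shuffled K-fold on the same data would mix future and past observations across folds, which is exactly the leak described above.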
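For the second practice, a minimal sketch using scikit-learn (the dataset here is synthetic, and `StandardScaler` stands in for any preprocessing step): wrapping the preprocessing and the model in one pipeline ensures the scaler is re-fit inside each cross-validation fold, on training data only.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=200, random_state=0)

# Because the scaler lives inside the pipeline, cross_val_score re-fits
# it on the training portion of every fold; the held-out fold never
# influences the normalization statistics.
model = make_pipeline(StandardScaler(), LogisticRegression())
scores = cross_val_score(model, X, y, cv=5)
```

Calling `StandardScaler().fit_transform(X)` on the full dataset before splitting would instead compute the mean and variance using the validation rows, leaking information into training.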