What is data leakage in the context of data analysis? What problems may arise from it? Which strategies can be applied to avoid it?

1 Answer

Data leakage occurs when a statistical model is trained with information that would not actually be available when the model is used to make predictions.

Data leakage makes results during model training and validation look much better than what is observed once the model is deployed. The performance estimates are overly optimistic, and the predictive model may turn out to be entirely invalid.

There is no single recipe to eliminate data leakage, but some practices help to avoid it:

  • Don’t use future data to predict the past. Although this sounds obvious, it’s a very common mistake when validating models, especially with cross-validation, where random shuffling can place later observations in the training folds. When training on time-series data, always use a validation strategy that respects time order, so that each validation set comes after the data the model was trained on.
  • Prepare the data within cross-validation folds. Another common mistake is to apply data preparation steps, such as normalization or outlier removal, to the whole dataset before splitting it to validate the model. This leaks information from the validation data into training. Fit such steps on the training portion of each fold only, then apply them to the held-out portion.
  • Investigate IDs. It’s easy to dismiss IDs as randomly generated values, but sometimes they encode information about the target variable. If they are leaky, it’s best to exclude them from the model.
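
As a rough illustration of the first two practices, here is a minimal sketch in plain Python (standard library only). The helper names `expanding_window_splits`, `fit_scaler` and `apply_scaler`, and the toy `series`, are made up for this example and are not taken from any library. It assumes the rows are already sorted by time. Each fold trains only on earlier rows, and the normalization statistics are computed from the training rows of that fold alone.

```python
from statistics import mean, pstdev


def expanding_window_splits(n_samples, n_folds):
    """Yield (train_idx, test_idx) pairs in which every test index
    comes after every train index (rows are assumed sorted by time)."""
    fold_size = n_samples // (n_folds + 1)
    for k in range(1, n_folds + 1):
        train_end = k * fold_size
        # The last fold absorbs any leftover rows.
        test_end = train_end + fold_size if k < n_folds else n_samples
        yield list(range(train_end)), list(range(train_end, test_end))


def fit_scaler(values):
    """Learn normalization statistics from the given (training) values."""
    mu = mean(values)
    sigma = pstdev(values) or 1.0  # avoid dividing by zero on constant data
    return mu, sigma


def apply_scaler(values, mu, sigma):
    """Normalize values with statistics learned elsewhere."""
    return [(v - mu) / sigma for v in values]


# Toy data, ordered by time (hypothetical values for illustration only).
series = [float(v) for v in range(1, 13)]

folds = []
for train_idx, test_idx in expanding_window_splits(len(series), n_folds=3):
    train = [series[i] for i in train_idx]
    test = [series[i] for i in test_idx]

    # Fit the preparation step on the training rows of this fold only...
    mu, sigma = fit_scaler(train)
    train_scaled = apply_scaler(train, mu, sigma)
    # ...and reuse the same statistics on the held-out rows.
    test_scaled = apply_scaler(test, mu, sigma)

    # A model would be fitted on train_scaled and scored on test_scaled here.
    folds.append((train_idx, test_idx, mu))
```

The leaky alternative would be to call `fit_scaler(series)` once on the whole dataset before splitting, which lets the later rows influence how the earlier rows are scaled. Libraries such as scikit-learn offer ready-made tools for the same idea; the sketch only shows the principle.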
