Avoiding Overfitting¶
The ultimate goal of machine learning is to make accurate predictions on unseen data. EvalML aims to help you build a model that will perform as you expect once it is deployed into the real world.
One of the benefits of using EvalML to build models is that it provides data checks to ensure you are building pipelines that will perform reliably in the future. This page describes the various ways EvalML helps you avoid overfitting to your data.
[1]:
import evalml
Detecting Label Leakage¶
A common problem is having features that include information from your label in your training data. By default, EvalML will provide a warning when it detects this may be the case.
Let’s set up a simple example to demonstrate what this looks like:
[2]:
import pandas as pd
from evalml.data_checks import LabelLeakageDataCheck
X = pd.DataFrame({
    "leaked_feature": [6, 6, 10, 5, 5, 11, 5, 10, 11, 4],
    "leaked_feature_2": [3, 2.5, 5, 2.5, 3, 5.5, 2, 5, 5.5, 2],
    "valid_feature": [3, 1, 3, 2, 4, 6, 1, 3, 3, 11]
})
y = pd.Series([1, 1, 0, 1, 1, 0, 1, 0, 0, 1])
label_leakage_check = LabelLeakageDataCheck()
messages = label_leakage_check.validate(X, y)
for message in messages:
    print(message)
Column 'leaked_feature' is 95.0% or more correlated with the target
Column 'leaked_feature_2' is 95.0% or more correlated with the target
In the example above, EvalML warned about the input features leaked_feature and leaked_feature_2, which are both very closely correlated with the label we are trying to predict.
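Once leakage has been flagged, a typical follow-up is to drop the flagged columns before training. Below is a minimal sketch using the column names from the example above; it uses pandas directly rather than an EvalML utility:

# Drop the columns flagged by the data check before fitting any pipelines
X_clean = X.drop(columns=["leaked_feature", "leaked_feature_2"])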
The second way to find features that may be leaking label information is to look at the top features of the model after running an AutoML search. As we can see below, the top features in our model are the 2 leaked features.
[3]:
automl = evalml.AutoMLSearch(
    problem_type='binary',
    max_pipelines=1,
    allowed_model_families=["linear_model"],
)
automl.search(X, y)
best_pipeline = automl.best_pipeline
best_pipeline.fit(X, y)
best_pipeline.feature_importance
Column 'leaked_feature' is 95.0% or more correlated with the target
Column 'leaked_feature_2' is 95.0% or more correlated with the target
Generating pipelines to search over...
*****************************
* Beginning pipeline search *
*****************************
Optimizing for Log Loss Binary.
Lower score is better.
Searching up to 1 pipelines.
Allowed model families: linear_model
✔ Mode Baseline Binary Classification... 0%| | Elapsed:00:00
✔ Optimization finished 0%| | Elapsed:00:00
[3]:
            feature  importance
0    leaked_feature         0.0
1  leaked_feature_2         0.0
2     valid_feature         0.0
Perform cross-validation for pipeline evaluation¶
By default, EvalML performs 3-fold cross validation when building pipelines. This means that it evaluates each pipeline 3 times using different sets of data for training and testing. In each trial, the data used for testing has no overlap with the data used for training.
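For illustration, that default is roughly equivalent to passing a standard stratified 3-fold splitter explicitly. The sketch below is only for intuition; the exact splitter EvalML constructs by default may differ between releases:

from sklearn.model_selection import StratifiedKFold

# Explicitly request a 3-fold stratified split, similar in spirit to the default behavior
automl = evalml.AutoMLSearch(
    problem_type='binary',
    data_split=StratifiedKFold(n_splits=3, shuffle=True, random_state=0),
    max_pipelines=1,
)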
While this is a good baseline approach, you can pass your own cross validation object to be used during modeling. The cross validation object can be any of the CV methods defined in scikit-learn or use a compatible API.
For example, if we wanted to do a time series split:
[4]:
from sklearn.model_selection import TimeSeriesSplit
X, y = evalml.demos.load_breast_cancer()
automl = evalml.AutoMLSearch(
    problem_type='binary',
    data_split=TimeSeriesSplit(n_splits=6),
    max_pipelines=1
)
automl.search(X, y)
Generating pipelines to search over...
*****************************
* Beginning pipeline search *
*****************************
Optimizing for Log Loss Binary.
Lower score is better.
Searching up to 1 pipelines.
Allowed model families: random_forest, catboost, xgboost, linear_model
✔ Mode Baseline Binary Classification... 0%| | Elapsed:00:00
✔ Optimization finished 0%| | Elapsed:00:00
If we describe the one pipeline we built, we can see the scores for each of the 6 splits as determined by the cross-validation object we provided. We can also see that the number of training examples per fold increases because we used TimeSeriesSplit.
[5]:
automl.describe_pipeline(0)
************************************************
* Mode Baseline Binary Classification Pipeline *
************************************************
Problem Type: Binary Classification
Model Family: Baseline
Pipeline Steps
==============
1. Baseline Classifier
* strategy : random_weighted
Training
========
Training for Binary Classification problems.
Total training time (including CV): 0.1 seconds
Cross Validation
----------------
Log Loss Binary Accuracy Binary Balanced Accuracy Binary F1 Precision AUC MCC Binary # Training # Testing
0 0.880 0.358 0.500 0.000 0.000 0.500 0.000 83.0 81.0
1 0.697 0.469 0.500 0.000 0.000 0.500 0.000 164.0 81.0
2 0.697 0.333 0.500 0.000 0.000 0.500 0.000 245.0 81.0
3 0.664 0.716 0.500 0.835 0.716 0.500 0.000 326.0 81.0
4 0.619 0.790 0.500 0.883 0.790 0.500 0.000 407.0 81.0
5 0.611 0.741 0.500 0.851 0.741 0.500 0.000 488.0 81.0
mean 0.695 0.568 0.500 0.428 0.374 0.500 0.000 - -
std 0.098 0.205 0.000 0.469 0.411 0.000 0.000 - -
coef of var 0.141 0.361 0.000 1.096 1.097 0.000 inf - -
Detect unstable pipelines¶
When we perform cross validation, we are trying to generate an estimate of pipeline performance. EvalML does this by taking the mean of the score across the folds. If the performance varies greatly across the folds, it is an indication that the estimated value may be unreliable.
To protect the user against this, EvalML checks how much the pipeline's performance varies between the different folds. EvalML triggers a warning if the coefficient of variation of the pipeline's scores (the standard deviation divided by the mean) exceeds 0.2.
This warning will appear in the pipeline rankings under high_variance_cv.
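For intuition, here is how that check could be computed by hand from a pipeline's fold scores. This is a sketch with made-up scores, not EvalML's internal implementation; details such as the degrees of freedom used for the standard deviation may differ:

import numpy as np

# Hypothetical per-fold scores for a single pipeline (chosen so the check trips)
fold_scores = np.array([0.55, 0.70, 0.95])

coef_of_variation = fold_scores.std() / fold_scores.mean()
high_variance_cv = coef_of_variation > 0.2
print(coef_of_variation, high_variance_cv)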
[6]:
automl.rankings
[6]:
   id                                  pipeline_name     score  high_variance_cv                                         parameters
0   0  Mode Baseline Binary Classification Pipeline  0.694742             False  {'Baseline Classifier': {'strategy': 'random_w...
Create holdout for model validation¶
EvalML offers a method to quickly create a holdout validation set. A holdout validation set is data that is not used during the process of optimizing or training the model. You should only use this validation set once you’ve picked the final model you’d like to use.
Below, we create a holdout set containing 20% of our data:
[7]:
X, y = evalml.demos.load_breast_cancer()
X_train, X_holdout, y_train, y_holdout = evalml.preprocessing.split_data(X, y, test_size=.2)
[8]:
automl = evalml.AutoMLSearch(problem_type='binary',
                             objective="f1",
                             max_pipelines=3)
automl.search(X_train, y_train)
Generating pipelines to search over...
*****************************
* Beginning pipeline search *
*****************************
Optimizing for F1.
Greater score is better.
Searching up to 3 pipelines.
Allowed model families: random_forest, catboost, xgboost, linear_model
✔ Mode Baseline Binary Classification... 0%| | Elapsed:00:00
✔ CatBoost Classifier w/ Simple Imput... 33%|███▎ | Elapsed:00:22
✔ Logistic Regression Classifier w/ S... 67%|██████▋ | Elapsed:00:24
✔ Optimization finished 67%|██████▋ | Elapsed:00:24
Then we can retrain the best pipeline on all of our training data and see how it performs compared to the estimate:
[9]:
pipeline = automl.best_pipeline
pipeline.fit(X_train, y_train)
pipeline.score(X_holdout, y_holdout, ["f1"])
[9]:
OrderedDict([('F1', 0.9793103448275863)])
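To make that comparison explicit, you can pull the cross-validated estimate for the best pipeline out of the rankings and compare it with the holdout score above. A sketch, assuming the rankings are sorted best-first with the score column holding the mean cross-validated F1:

# Mean cross-validated F1 for the best pipeline, to compare against the holdout F1
cv_estimate = automl.rankings["score"].iloc[0]
print(cv_estimate)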