Guardrails

EvalML provides guardrails to help you build the highest-performing model. These utility functions help you deal with overfitting, abnormal data, and missing data, and can be found under evalml/guardrails/utils. This page covers the abnormal-data and missing-data guardrails. You can find an in-depth look into overfitting guardrails here.

Missing Data

Missing data, meaning rows with NaN values, poses many challenges for machine learning pipelines. In the worst case, many algorithms simply will not run with missing data. EvalML pipelines contain imputation components to ensure that doesn’t happen. Imputation works by approximating missing values from existing values. However, if most of a column is missing, a large share of that column ends up approximated from a small share of observed values, which can leave the column without useful information for machine learning pipelines. The detect_highly_null() guardrail alerts you to this potential problem by returning the columns whose fraction of missing values meets or exceeds the threshold.

[1]:
import numpy as np
import pandas as pd

from evalml.guardrails.utils import detect_highly_null

X = pd.DataFrame(
    [
        [1, 2, 3],
        [0, 4, np.nan],
        [1, 4, np.nan],
        [9, 4, np.nan],
        [8, 6, np.nan]
    ]
)

detect_highly_null(X, percent_threshold=0.8)
[1]:
{2: 0.8}
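
The check itself can be approximated in a few lines of pandas. The following is a minimal sketch, not EvalML's implementation: highly_null_sketch is a hypothetical helper, and it assumes the threshold comparison is inclusive, as the output above suggests. It also illustrates why such a column is a concern: mean imputation fills column 2 entirely from its single observed value.

```python
import numpy as np
import pandas as pd


def highly_null_sketch(X, percent_threshold=0.5):
    """Hypothetical helper: map each column to its fraction of missing
    values, keeping only columns at or above percent_threshold."""
    null_fraction = X.isnull().mean()  # per-column share of NaN entries
    flagged = null_fraction[null_fraction >= percent_threshold]
    return flagged.to_dict()


X = pd.DataFrame(
    [
        [1, 2, 3],
        [0, 4, np.nan],
        [1, 4, np.nan],
        [9, 4, np.nan],
        [8, 6, np.nan],
    ]
)

flagged = highly_null_sketch(X, percent_threshold=0.8)
print(flagged)

# Mean imputation replaces the four NaNs in column 2 with the mean of its
# single observed value, so the imputed column is constant.
imputed = X[2].fillna(X[2].mean())
print(imputed.nunique())
```

A constant column carries no signal a model can learn from, which is why it can be worth reviewing heavily imputed columns before training.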

Abnormal Data

EvalML provides two utility functions to check for abnormal data: detect_outliers() and detect_id_columns().

ID Columns

ID columns in your dataset provide little to no benefit to a machine learning pipeline, because the pipeline cannot extrapolate useful information from unique identifiers. detect_id_columns() therefore alerts you if such columns exist.

[2]:
from evalml.guardrails.utils import detect_id_columns

X = pd.DataFrame(
    [[0, 53, 6325, 5], [1, 90, 6325, 10], [2, 90, 18, 20]],
    columns=['user_number', 'cost', 'revenue', 'id']
)

display(X)
print(detect_id_columns(X, threshold=0.95))
   user_number  cost  revenue  id
0            0    53     6325   5
1            1    90     6325  10
2            2    90       18  20
{'id': 1.0, 'user_number': 0.95}
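
One simple way to spot candidate ID columns is to combine a name check with a uniqueness check. The sketch below is a simplified heuristic for illustration only, not the scoring that detect_id_columns() uses; id_column_candidates is a hypothetical helper, and it returns plain column names rather than scores.

```python
import pandas as pd


def id_column_candidates(X):
    """Hypothetical helper: return columns that look like identifiers,
    either by name or because every value is unique."""
    candidates = []
    for col in X.columns:
        name = str(col).lower()
        named_like_id = name == "id" or name.endswith("_id")
        all_unique = X[col].nunique() == len(X)
        if named_like_id or all_unique:
            candidates.append(col)
    return candidates


X = pd.DataFrame(
    [[0, 53, 6325, 5], [1, 90, 6325, 10], [2, 90, 18, 20]],
    columns=["user_number", "cost", "revenue", "id"],
)

candidates = id_column_candidates(X)
print(candidates)
```

On a dataset this small, uniqueness alone is weak evidence, since any column can be unique by chance. Treat flagged columns as candidates to review rather than columns to drop automatically.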

Outliers

Outliers are observations that differ significantly from other observations in the same sample. Many machine learning pipelines perform worse if outliers are left in the training set, because they are not representative of the data. detect_outliers() uses Isolation Forests to flag the rows that can be considered outliers.

Below we generate a random dataset with some outliers.

[3]:
data = np.random.randn(100, 100)
X = pd.DataFrame(data=data)

# Overwrite four rows with higher-variance noise so they stand out as outliers
X.iloc[3, :] = pd.Series(np.random.randn(100) * 10)
X.iloc[25, :] = pd.Series(np.random.randn(100) * 20)
X.iloc[55, :] = pd.Series(np.random.randn(100) * 100)
X.iloc[72, :] = pd.Series(np.random.randn(100) * 100)

We then use detect_outliers() to rediscover these outliers.

[4]:
from evalml.guardrails.utils import detect_outliers

detect_outliers(X)
[4]:
[3, 25, 55, 72]
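
For intuition about how an Isolation Forest separates such rows, the sketch below uses scikit-learn's IsolationForest directly. This is an assumption made for illustration: the text above does not say which implementation or settings detect_outliers() relies on, and the seed, the number of trees, and the choice to report the four lowest-scoring rows are all arbitrary.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import IsolationForest

rng = np.random.RandomState(0)  # fixed seed so the run is repeatable
X = pd.DataFrame(rng.randn(100, 100))

# Overwrite four rows with much higher-variance noise.
for row, scale in [(3, 10), (25, 20), (55, 100), (72, 100)]:
    X.iloc[row, :] = rng.randn(100) * scale

values = X.to_numpy()
forest = IsolationForest(n_estimators=100, random_state=0)
forest.fit(values)

# score_samples returns lower values for rows that are easier to isolate,
# that is, more anomalous.
scores = pd.Series(forest.score_samples(values), index=X.index)
most_anomalous = sorted(scores.nsmallest(4).index.tolist())
print(most_anomalous)
```

Ranking by score avoids committing to a fixed cutoff. In practice the number of rows to inspect, or a contamination setting, would have to be chosen for the dataset at hand.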