EvalML provides data checks to help guide you in achieving the highest performing model. These utility functions help deal with problems such as overfitting, abnormal data, and missing data. These data checks can be found under evalml/data_checks. Below we will cover examples of data checks for missing and abnormal data.
Missing data or rows with NaN values pose many challenges for machine learning pipelines. In the worst case, many algorithms simply will not run with missing data! EvalML pipelines contain imputation components to ensure that doesn’t happen. Imputation works by approximating missing values with existing values. However, if a column contains a high percentage of missing values, a large portion of that column would be approximated from only a small amount of real data, potentially leaving the column with little useful information for machine learning pipelines. By using the HighlyNullDataCheck() data check, EvalML will alert you to this potential problem by returning the columns that meet or exceed the missing values threshold.
[1]:
import numpy as np
import pandas as pd
from evalml.data_checks import HighlyNullDataCheck

X = pd.DataFrame([[1, 2, 3], [0, 4, np.nan], [1, 4, np.nan], [9, 4, np.nan], [8, 6, np.nan]])
null_check = HighlyNullDataCheck(pct_null_threshold=0.8)
for message in null_check.validate(X):
    print(message.message)
Column '2' is 80.0% or more null
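To see why a mostly-null column is a problem, here is a minimal sketch of mean imputation using pandas (illustrative only; EvalML pipelines handle imputation with their own components). Every NaN is replaced with a value estimated from the few real observations, so most of the column ends up carrying the same approximated value.

import numpy as np
import pandas as pd

X = pd.DataFrame({'mostly_null': [1.0, 3.0, np.nan, np.nan, np.nan,
                                  np.nan, np.nan, np.nan, np.nan, np.nan]})
# Mean imputation: fill every NaN with the column mean (2.0 here),
# so 80% of the column becomes a single approximated value.
X['mostly_null'] = X['mostly_null'].fillna(X['mostly_null'].mean())
print(X['mostly_null'].values)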
EvalML provides two data checks for detecting abnormal data: OutliersDataCheck() and IDColumnsDataCheck().
ID columns in your dataset provide little to no benefit to a machine learning pipeline, as the pipeline cannot extrapolate useful information from unique identifiers. Thus, IDColumnsDataCheck() reminds you if these columns exist. In the example below, the 'user_number' and 'id' columns are both identified as potentially being unique identifiers that should be removed.
[2]:
from evalml.data_checks import IDColumnsDataCheck

X = pd.DataFrame([[0, 53, 6325, 5], [1, 90, 6325, 10], [2, 90, 18, 20]],
                 columns=['user_number', 'cost', 'revenue', 'id'])
id_col_check = IDColumnsDataCheck(id_threshold=0.9)
for message in id_col_check.validate(X):
    print(message.message)
Column 'id' is 90.0% or more likely to be an ID column
Column 'user_number' is 90.0% or more likely to be an ID column
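A common follow-up is to drop the flagged columns before training. A minimal sketch using the DataFrame above:

# Remove the columns identified as likely ID columns
X = X.drop(columns=['user_number', 'id'])
print(X.columns.tolist())  # ['cost', 'revenue']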
Outliers are observations that differ significantly from other observations in the same sample. Many machine learning pipelines suffer in performance if outliers are not dropped from the training set, as they are not representative of the data. OutliersDataCheck() uses Isolation Forests to notify you if a sample can be considered an outlier.
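As a rough sketch of the underlying idea, here is how an Isolation Forest flags outliers using scikit-learn directly. This is only a conceptual illustration with made-up data, not EvalML's exact implementation.

import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.RandomState(0)
data = rng.randn(100, 2)
data[10] = [8, 8]  # plant one obvious outlier at row 10

# Ask the forest to flag roughly 1% of samples as outliers;
# predict() returns -1 for outliers and 1 for inliers.
forest = IsolationForest(contamination=0.01, random_state=0).fit(data)
print(np.where(forest.predict(data) == -1)[0])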
Below we generate a random dataset with some outliers.
[3]:
data = np.random.randn(100, 100)
X = pd.DataFrame(data=data)

# generate some outliers in rows 3, 25, 55, and 72
X.iloc[3, :] = pd.Series(np.random.randn(100) * 10)
X.iloc[25, :] = pd.Series(np.random.randn(100) * 20)
X.iloc[55, :] = pd.Series(np.random.randn(100) * 100)
X.iloc[72, :] = pd.Series(np.random.randn(100) * 100)
We then utilize OutliersDataCheck() to rediscover these outliers.
[4]:
from evalml.data_checks import OutliersDataCheck

outliers_check = OutliersDataCheck()
for message in outliers_check.validate(X):
    print(message.message)
Row '3' is likely to have outlier data
Row '25' is likely to have outlier data
Row '55' is likely to have outlier data
Row '72' is likely to have outlier data
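Since outliers can hurt pipeline performance, a natural next step is to drop the flagged rows from the training set. A minimal sketch using the rows reported above:

# Drop the rows flagged as outliers before training
X = X.drop(index=[3, 25, 55, 72])
print(X.shape)  # (96, 100)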
If you would prefer to write your own data check, you can do so by extending the DataCheck class and implementing the validate(self, X, y) method. Below, we’ve created a new DataCheck, ZeroVarianceDataCheck.
[5]:
from evalml.data_checks import DataCheck
from evalml.data_checks.data_check_message import DataCheckError


class ZeroVarianceDataCheck(DataCheck):
    def validate(self, X, y):
        if not isinstance(X, pd.DataFrame):
            X = pd.DataFrame(X)
        error_msg = "Column '{}' has zero variance"
        # Flag any column whose values are all identical
        return [DataCheckError(error_msg.format(column), self.name)
                for column in X.columns
                if len(X[column].unique()) == 1]
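As a quick sanity check, here is a small usage sketch of the custom check on made-up data; the 'constant' column below should be the only one flagged.

X = pd.DataFrame({'constant': [1, 1, 1, 1], 'varied': [1, 2, 3, 4]})
zero_var_check = ZeroVarianceDataCheck()
for message in zero_var_check.validate(X, y=None):
    print(message.message)
# expected output: Column 'constant' has zero variance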