[1]:
import evalml

Default DataChecks in AutoML

By default, AutoML runs the series of data checks in DefaultDataChecks when automl.search() is called, to verify that the inputs are valid before it searches and fits pipelines. Currently, DefaultDataChecks contains HighlyNullDataCheck(), IDColumnsDataCheck(), and LabelLeakageDataCheck(). You can see the other available data checks under evalml/data_checks.

If the data checks return any warnings, those warnings will be logged, but the search will continue. However, if the data checks return any error messages, automl.search() will raise a ValueError and quit before searching. This allows users to address any issues before running the potentially time-intensive search process.
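
The warning-versus-error behavior described above can be sketched in plain Python. This is only an illustration of the control flow, not evalml's implementation: `MessageType`, `Message`, and `gate_search` are hypothetical stand-ins invented for this example.

```python
import logging
from dataclasses import dataclass
from enum import Enum

logger = logging.getLogger("data_checks_sketch")


class MessageType(Enum):
    """Hypothetical stand-in for the two severities a data check can report."""
    WARNING = "warning"
    ERROR = "error"


@dataclass
class Message:
    """Hypothetical stand-in for a data check message."""
    message: str
    message_type: MessageType


def gate_search(messages):
    """Log every message, then raise ValueError if any of them is an error."""
    for m in messages:
        if m.message_type is MessageType.WARNING:
            logger.warning(m.message)
        else:
            logger.error(m.message)
    if any(m.message_type is MessageType.ERROR for m in messages):
        raise ValueError("Data checks raised errors; fix the data before searching.")
    return True  # only warnings (or nothing) were reported, so the search may proceed
```

With only warnings in the list, `gate_search` returns `True`; a single error in the list is enough to stop before any expensive work starts.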

Below, we have some data that contain many null values and an ID-like column. When we run the search, DefaultDataChecks logs two warnings: “Column ‘D’ is 95.0% or more null” and “Column ‘id’ is 100.0% or more likely to be an ID column”.

[2]:
import pandas as pd
import numpy as np

X = pd.DataFrame(np.random.random((100, 5)), columns=["A", "B", "C", "D", "id"])
X.loc[:11, 'A'] = np.nan
X.loc[:9, 'B'] = np.nan
X.loc[:30, 'C'] = np.nan
X.loc[:95, 'D'] = np.nan
X.loc[:, 'id'] = range(100)
y = pd.Series([0,1]*50)

automl = evalml.AutoMLSearch(problem_type='binary', max_pipelines=1)
automl.search(X, y)
Column 'D' is 95.0% or more null
Column 'id' is 100.0% or more likely to be an ID column
Generating pipelines to search over...
*****************************
* Beginning pipeline search *
*****************************

Optimizing for Log Loss Binary.
Lower score is better.

Searching up to 1 pipelines.
Allowed model families: catboost, linear_model, xgboost, random_forest

✔ Mode Baseline Binary Classification...     0%|          | Elapsed:00:00
✔ Optimization finished                      0%|          | Elapsed:00:00

To see the exact warning and/or error messages that the data checks returned, we can inspect automl.data_check_results.

[3]:
for message in automl.data_check_results:
    print(message.message)
Column 'D' is 95.0% or more null
Column 'id' is 100.0% or more likely to be an ID column
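
To see why only column 'D' triggers the null warning, the fraction of missing values per column can be computed directly with pandas. This is a minimal sketch, not the HighlyNullDataCheck implementation; the 0.95 threshold is an assumption inferred from the wording of the warning message.

```python
import numpy as np
import pandas as pd

# Rebuild the null pattern from the example above (the non-null values do not matter).
X = pd.DataFrame(np.ones((100, 4)), columns=["A", "B", "C", "D"])
X.loc[:11, "A"] = np.nan  # .loc slices are label-based and inclusive: rows 0 through 11
X.loc[:9, "B"] = np.nan
X.loc[:30, "C"] = np.nan
X.loc[:95, "D"] = np.nan

# Fraction of missing values in each column.
null_fraction = X.isnull().mean()

# Assumed threshold, inferred from the "95.0% or more null" message.
threshold = 0.95
highly_null = null_fraction[null_fraction >= threshold].index.tolist()

print(null_fraction)
print(highly_null)  # ['D']
```

Because `.loc` slicing includes the end label, `X.loc[:95, 'D']` blanks 96 of the 100 rows, which puts 'D' just over the assumed threshold while 'A', 'B', and 'C' stay well below it.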

Using your own DataCheck with AutoML

If you’d prefer to pass in your own data check, you can do so by passing in a DataChecks object as the value for the data_checks parameter. Here, we’ve implemented our own custom data check, which returns a DataCheckError object for each column that has zero variance.

[4]:
from evalml.data_checks import DataCheck, DataChecks
from evalml.data_checks.data_check_message import DataCheckError

class ZeroVarianceDataCheck(DataCheck):
    """Report an error for every column that holds a single distinct value."""

    def validate(self, X, y):
        if not isinstance(X, pd.DataFrame):
            X = pd.DataFrame(X)
        error_msg = "Column '{}' has zero variance"
        # unique() counts NaN as a distinct value.
        return [DataCheckError(error_msg.format(column), self.name)
                for column in X.columns if len(X[column].unique()) == 1]
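
The condition used by the check above can be tried on its own with plain pandas. The sketch below uses made-up column names and also shows one possible alternative, `nunique()`, which ignores NaN by default; which behavior is preferable depends on whether a constant column with missing values should count as zero variance.

```python
import numpy as np
import pandas as pd

# Hypothetical data: one constant column, one constant column with a gap, one varied column.
X = pd.DataFrame({
    "no_var": [1, 1, 1, 1],
    "const_with_nan": [1.0, np.nan, 1.0, 1.0],
    "varied": [0, 1, 2, 3],
})

# Same condition as the check above: exactly one distinct value, with NaN counted as a value.
flagged_unique = [c for c in X.columns if len(X[c].unique()) == 1]

# Alternative: nunique() drops NaN by default, so the gap is ignored.
flagged_nunique = [c for c in X.columns if X[c].nunique() == 1]

print(flagged_unique)   # ['no_var']
print(flagged_nunique)  # ['no_var', 'const_with_nan']
```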

If we now call search(), our error message will be logged and a ValueError will be raised:

[5]:
data_checks = DataChecks(data_checks=[ZeroVarianceDataCheck()])

X = pd.DataFrame({'no_var': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
                  'any_average_col': [2, 0, 1, 2, 1, 2, 0, 1, 2, 1],
                  'another_average_col': [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]})
y = pd.Series([0,1,1,0,0,0,1,1,0,0])

automl = evalml.AutoMLSearch(problem_type='binary', max_pipelines=1)
automl.search(X, y, data_checks=data_checks)

Column 'no_var' has zero variance
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-5-5f0233162825> in <module>
      7
      8 automl = evalml.AutoMLSearch(problem_type='binary', max_pipelines=1)
----> 9 automl.search(X, y, data_checks=data_checks)

~/checkouts/readthedocs.org/user_builds/feature-labs-inc-evalml/envs/v0.11.0/lib/python3.7/site-packages/evalml/automl/automl_search.py in search(self, X, y, data_checks, feature_types, raise_errors, show_iteration_plot)
    318                     logger.error(message)
    319             if any([message.message_type == DataCheckMessageType.ERROR for message in self._data_check_results]):
--> 320                 raise ValueError("Data checks raised some warnings and/or errors. Please see `self.data_check_results` for more information or pass data_checks=EmptyDataChecks() to search() to disable data checking.")
    321
    322         if self.allowed_pipelines is None:

ValueError: Data checks raised some warnings and/or errors. Please see `self.data_check_results` for more information or pass data_checks=EmptyDataChecks() to search() to disable data checking.

Again, we can inspect automl.data_check_results to help us begin to address the issues raised by the data checks.

[6]:
for message in automl.data_check_results:
    print(message.message)
Column 'no_var' has zero variance

Disabling DataChecks

If you’d prefer not to run any data checks before running search, you can provide an EmptyDataChecks instance to search() instead.

[7]:
from evalml.data_checks import EmptyDataChecks

X = pd.DataFrame({'no_var': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
                  'any_average_col': [2, 0, 1, 2, 1, 2, 0, 1, 2, 1],
                  'another_average_col': [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]})
y = pd.Series([0,1,1,0,0,0,1,1,0,0])
automl = evalml.AutoMLSearch(problem_type='binary', max_pipelines=1)
automl.search(X, y, data_checks=EmptyDataChecks())
Generating pipelines to search over...
*****************************
* Beginning pipeline search *
*****************************

Optimizing for Log Loss Binary.
Lower score is better.

Searching up to 1 pipelines.
Allowed model families: catboost, linear_model, xgboost, random_forest

✔ Mode Baseline Binary Classification...     0%|          | Elapsed:00:00
✔ Optimization finished                      0%|          | Elapsed:00:00

Even though we are using the same data as above, no data checks are run, so the input that previously raised an error is accepted and the search proceeds.