[1]:
import evalml
Default DataChecks in AutoML
By default, AutoML runs the series of data checks in DefaultDataChecks when automl.search() is called, to verify that the inputs are valid before running the search and fitting pipelines. Currently, DefaultDataChecks contains HighlyNullDataCheck(), IDColumnsDataCheck(), and LabelLeakageDataCheck(). You can see the other available data checks under evalml/data_checks.
If the data checks return any warnings, those warnings are logged and the search continues. However, if the data checks return any error messages, automl.search() raises a ValueError and quits before searching. This lets users address any issues before running the potentially time-intensive search process.
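The warning-versus-error behavior described above can be sketched in plain Python. This is a minimal illustration, not EvalML's implementation: the `Message` class, the `gate_search` function, and the logger name are hypothetical stand-ins introduced only for this example.

```python
import logging
from dataclasses import dataclass

logger = logging.getLogger("data_check_sketch")  # hypothetical logger name


@dataclass
class Message:
    """Hypothetical stand-in for a data check message."""
    text: str
    is_error: bool


def gate_search(messages):
    """Log every message, then refuse to continue if any message is an error."""
    for m in messages:
        if m.is_error:
            logger.error(m.text)
        else:
            logger.warning(m.text)
    if any(m.is_error for m in messages):
        raise ValueError("Data checks raised errors; the search was not started.")
    return True  # only warnings (or nothing) were found, so the search may proceed
```

With only warnings in the list, `gate_search` returns and the search would continue; a single error message is enough to stop it.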
Below, we have some data that contains many null values and an ID-like column. When we try to run the search below, DefaultDataChecks logs the warnings “Column 'D' is 95.0% or more null” and “Column 'id' is 100.0% or more likely to be an ID column”.
[2]:
import pandas as pd
import numpy as np
X = pd.DataFrame(np.random.random((100, 5)), columns=["A", "B", "C", "D", "id"])
X.loc[:11, 'A'] = np.nan
X.loc[:9, 'B'] = np.nan
X.loc[:30, 'C'] = np.nan
X.loc[:95, 'D'] = np.nan
X.loc[:, 'id'] = range(100)
y = pd.Series([0,1]*50)
automl = evalml.AutoMLSearch(problem_type='binary', max_pipelines=1)
automl.search(X, y)
Column 'D' is 95.0% or more null
Column 'id' is 100.0% or more likely to be an ID column
Generating pipelines to search over...
*****************************
* Beginning pipeline search *
*****************************
Optimizing for Log Loss Binary.
Lower score is better.
Searching up to 1 pipelines.
Allowed model families: catboost, linear_model, xgboost, random_forest
✔ Mode Baseline Binary Classification... 0%| | Elapsed:00:00
✔ Optimization finished 0%| | Elapsed:00:00
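To see why only column 'D' triggers the null warning, the per-column null fraction can be computed directly with pandas. This is a sketch that does not use EvalML. The 0.95 threshold is taken from the logged message, and comparing with `>=` is an assumption about how the check applies it.

```python
import numpy as np
import pandas as pd

# Rebuild the null pattern from the example above (the 'id' column is omitted).
X = pd.DataFrame(np.random.random((100, 4)), columns=["A", "B", "C", "D"])
X.loc[:11, "A"] = np.nan   # label-based slices are inclusive: 12 rows
X.loc[:9, "B"] = np.nan    # 10 rows
X.loc[:30, "C"] = np.nan   # 31 rows
X.loc[:95, "D"] = np.nan   # 96 rows

# Fraction of null entries in each column.
null_fraction = X.isnull().mean()

threshold = 0.95  # assumed from the warning text
highly_null = null_fraction[null_fraction >= threshold].index.tolist()
print(null_fraction)
print(highly_null)
```

Because `.loc` slices include the end label, `X.loc[:95, "D"]` covers 96 of the 100 rows, which puts 'D' just above the threshold while the other columns stay well below it.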
To access the exact warning and/or error messages that the data checks returned, we can use automl.data_check_results.
[3]:
for message in automl.data_check_results:
    print(message.message)
Column 'D' is 95.0% or more null
Column 'id' is 100.0% or more likely to be an ID column
Using your own DataCheck with AutoML
If you’d prefer to pass in your own data check, you can do so by passing in a DataChecks object as the value for the data_checks parameter. Here, we’ve implemented our own custom data check, which returns a list of DataCheckError objects if any columns have zero variance.
[4]:
from evalml.data_checks import DataCheck, DataChecks
from evalml.data_checks.data_check_message import DataCheckError
class ZeroVarianceDataCheck(DataCheck):
    def validate(self, X, y):
        # Accept any array-like input by converting it to a DataFrame
        if not isinstance(X, pd.DataFrame):
            X = pd.DataFrame(X)
        error_msg = "Column '{}' has zero variance"
        # Return one DataCheckError per column that holds a single unique value
        return [DataCheckError(error_msg.format(column), self.name)
                for column in X.columns if len(X[column].unique()) == 1]
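The uniqueness test at the heart of this check can be tried on its own with plain pandas before wiring it into a DataCheck. The helper name `zero_variance_columns` is hypothetical and exists only for this sketch. It uses `nunique(dropna=False)`, which counts NaN as a value just as `len(X[column].unique())` does.

```python
import pandas as pd


def zero_variance_columns(X):
    """Return the names of columns that hold exactly one distinct value.

    NaN is counted as a value, so a column such as [1.0, NaN] is not flagged.
    """
    return [c for c in X.columns if X[c].nunique(dropna=False) == 1]


X = pd.DataFrame({
    "no_var": [1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
    "any_average_col": [2, 0, 1, 2, 1, 2, 0, 1, 2, 1],
    "another_average_col": [0, 1, 2, 3, 4, 5, 6, 7, 8, 9],
})
print(zero_variance_columns(X))
```

Whether NaN should count as a distinct value is a design choice: with the default `nunique()` (which drops NaN), a column containing one repeated value plus some missing entries would also be reported as having zero variance.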
If we now call search(), our error message will be logged and a ValueError will be raised:
[5]:
data_checks = DataChecks(data_checks=[ZeroVarianceDataCheck()])
X = pd.DataFrame({'no_var': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
                  'any_average_col': [2, 0, 1, 2, 1, 2, 0, 1, 2, 1],
                  'another_average_col': [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]})
y = pd.Series([0,1,1,0,0,0,1,1,0,0])
automl = evalml.AutoMLSearch(problem_type='binary', max_pipelines=1)
automl.search(X, y, data_checks=data_checks)
Column 'no_var' has zero variance
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-5-5f0233162825> in <module>
7
8 automl = evalml.AutoMLSearch(problem_type='binary', max_pipelines=1)
----> 9 automl.search(X, y, data_checks=data_checks)
~/checkouts/readthedocs.org/user_builds/feature-labs-inc-evalml/envs/v0.11.0/lib/python3.7/site-packages/evalml/automl/automl_search.py in search(self, X, y, data_checks, feature_types, raise_errors, show_iteration_plot)
318 logger.error(message)
319 if any([message.message_type == DataCheckMessageType.ERROR for message in self._data_check_results]):
--> 320 raise ValueError("Data checks raised some warnings and/or errors. Please see `self.data_check_results` for more information or pass data_checks=EmptyDataChecks() to search() to disable data checking.")
321
322 if self.allowed_pipelines is None:
ValueError: Data checks raised some warnings and/or errors. Please see `self.data_check_results` for more information or pass data_checks=EmptyDataChecks() to search() to disable data checking.
Again, we can access automl.data_check_results to help us begin to address the issues raised by the data checks.
[6]:
for message in automl.data_check_results:
    print(message.message)
Column 'no_var' has zero variance
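One way to address this particular error is to drop the constant column before searching again. The sketch below uses only pandas and is one possible remedy, not a step the library performs for you; the variable names `keep` and `X_clean` are illustrative.

```python
import pandas as pd

X = pd.DataFrame({
    "no_var": [1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
    "any_average_col": [2, 0, 1, 2, 1, 2, 0, 1, 2, 1],
    "another_average_col": [0, 1, 2, 3, 4, 5, 6, 7, 8, 9],
})

# Boolean mask over columns: True where a column has more than one distinct value.
keep = X.nunique(dropna=False) > 1

# Keep only the columns that vary.
X_clean = X.loc[:, keep]
print(list(X_clean.columns))
```

Passing a frame like `X_clean` to the search in place of `X` would leave the zero-variance check with nothing to report for this data.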
Disabling DataChecks
If you’d prefer not to run any data checks before running the search, you can provide an EmptyDataChecks instance to search() instead.
[7]:
from evalml.data_checks import EmptyDataChecks
X = pd.DataFrame({'no_var': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
                  'any_average_col': [2, 0, 1, 2, 1, 2, 0, 1, 2, 1],
                  'another_average_col': [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]})
y = pd.Series([0,1,1,0,0,0,1,1,0,0])
automl = evalml.AutoMLSearch(problem_type='binary', max_pipelines=1)
automl.search(X, y, data_checks=EmptyDataChecks())
Generating pipelines to search over...
*****************************
* Beginning pipeline search *
*****************************
Optimizing for Log Loss Binary.
Lower score is better.
Searching up to 1 pipelines.
Allowed model families: catboost, linear_model, xgboost, random_forest
✔ Mode Baseline Binary Classification... 0%| | Elapsed:00:00
✔ Optimization finished 0%| | Elapsed:00:00
Even though we are using the same data as above, no data checks are run, so the input no longer raises an error and the search proceeds.