Setting up pipeline search¶
Designing the right machine learning pipeline and picking the best parameters is a time-consuming process that relies on a mix of data science intuition and trial and error. EvalML streamlines the process of selecting the best modeling algorithms and parameters, so data scientists can focus their energy where it is most needed.
How it works¶
EvalML selects and tunes machine learning pipelines built from numerous steps, including encoding categorical data, imputing missing values, feature selection, feature scaling, and finally the machine learning model itself. As EvalML tunes pipelines, it uses the objective function selected and configured by the user to guide its search.
At each iteration, EvalML uses cross-validation to estimate each pipeline’s performance. If a pipeline has high variance across cross-validation folds, EvalML raises a warning, since such a pipeline may not perform reliably in the future.
EvalML is designed to work well out of the box. However, it provides numerous options, described below, for you to control the search.
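For example, a full search might look roughly like the sketch below, which assumes the breast cancer demo dataset bundled with evalml.demos; the sections below cover each option in more detail.
[ ]:
from evalml import AutoMLSearch
from evalml.demos import load_breast_cancer

# Load a demo binary classification dataset; any feature matrix X and target y would work here
X, y = load_breast_cancer()

# Run the search with default settings; each candidate pipeline is scored
# with cross-validation against the chosen objective
automl = AutoMLSearch(problem_type='binary')
automl.search(X, y)

# Pipelines whose scores vary widely across folds are flagged in the rankings
automl.rankings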
Selecting problem type¶
EvalML supports both classification and regression problems. You select your problem type when initializing AutoMLSearch by passing the corresponding problem_type argument.
[1]:
import evalml
from evalml import AutoMLSearch
[2]:
AutoMLSearch(problem_type='binary')
Using default limit of max_pipelines=5.
[2]:
<evalml.automl.automl_search.AutoMLSearch at 0x7f424ecdef10>
[3]:
AutoMLSearch(problem_type='multiclass')
Using default limit of max_pipelines=5.
[3]:
<evalml.automl.automl_search.AutoMLSearch at 0x7f424ecf67d0>
[4]:
AutoMLSearch(problem_type='regression')
Using default limit of max_pipelines=5.
[4]:
<evalml.automl.automl_search.AutoMLSearch at 0x7f424ecf72d0>
Setting the Objective Function¶
The only required parameter to start searching for pipelines is the objective function. Most domain-specific objective functions require you to specify parameters based on your business assumptions. You can do this before you initialize your pipeline search. For example:
[5]:
from evalml.objectives import FraudCost
fraud_objective = FraudCost(
    retry_percentage=.5,
    interchange_fee=.02,
    fraud_payout_percentage=.75,
    amount_col='amount'
)
AutoMLSearch(problem_type='binary', objective=fraud_objective, optimize_thresholds=True)
Using default limit of max_pipelines=5.
[5]:
<evalml.automl.automl_search.AutoMLSearch at 0x7f424ecfddd0>
Evaluate on Additional Objectives¶
Additional objectives can be scored during the evaluation process. To add another objective, use the additional_objectives parameter in AutoMLSearch. The results of these additional objectives will then appear in the output of describe_pipeline.
[6]:
from evalml.objectives import FraudCost
fraud_objective = FraudCost(
    retry_percentage=.5,
    interchange_fee=.02,
    fraud_payout_percentage=.75,
    amount_col='amount'
)
AutoMLSearch(problem_type='binary', objective='AUC', additional_objectives=[fraud_objective], optimize_thresholds=False)
Using default limit of max_pipelines=5.
[6]:
<evalml.automl.automl_search.AutoMLSearch at 0x7f424ec93410>
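After running a search configured this way, the additional objectives should appear when describing a pipeline. A sketch of that follow-up step (assuming X and y are already loaded and X contains the 'amount' column required by FraudCost):
[ ]:
# Run the search, then describe the top-ranked pipeline; scores for the
# additional objectives are reported alongside the primary objective
automl = AutoMLSearch(problem_type='binary', objective='AUC',
                      additional_objectives=[fraud_objective])
automl.search(X, y)
automl.describe_pipeline(automl.rankings.iloc[0]["id"])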
Selecting Model Types¶
By default, all model types are considered. You can control which model types to search with the allowed_model_families parameter:
[7]:
automl = AutoMLSearch(problem_type='binary',
                      objective="f1",
                      allowed_model_families=["random_forest"])
Using default limit of max_pipelines=5.
After initialization, you can view the pipelines that will be included in the search:
[8]:
automl.allowed_pipelines
You can see a list of all supported model families like this:
[9]:
evalml.list_model_families("binary") # `binary` for binary classification and `multiclass` for multiclass classification
[9]:
[<ModelFamily.LINEAR_MODEL: 'linear_model'>,
<ModelFamily.CATBOOST: 'catboost'>,
<ModelFamily.XGBOOST: 'xgboost'>,
<ModelFamily.RANDOM_FOREST: 'random_forest'>]
[10]:
evalml.list_model_families("regression")
[10]:
[<ModelFamily.LINEAR_MODEL: 'linear_model'>,
<ModelFamily.CATBOOST: 'catboost'>,
<ModelFamily.XGBOOST: 'xgboost'>,
<ModelFamily.RANDOM_FOREST: 'random_forest'>]
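These values can be fed back into the search. The sketch below assumes allowed_model_families also accepts ModelFamily values in place of their string names:
[ ]:
from evalml.model_family import ModelFamily

# Restrict the search to two of the families listed above
automl = AutoMLSearch(problem_type='binary',
                      objective="f1",
                      allowed_model_families=[ModelFamily.RANDOM_FOREST,
                                              ModelFamily.XGBOOST])
automl.allowed_pipelines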
Limiting Search Time¶
You can limit the search time by specifying a maximum number of pipelines and/or a maximum amount of time. EvalML won’t build new pipelines after the maximum time has passed or the maximum number of pipelines has been built. If a limit is not set, then a maximum of 5 pipelines will be built.
The maximum search time can be specified as an integer in seconds or as a string in seconds, minutes, or hours.
[11]:
AutoMLSearch(problem_type='binary',
             objective="f1",
             max_pipelines=5,
             max_time=60)

AutoMLSearch(problem_type='binary',
             objective="f1",
             max_time="1 minute")
[11]:
<evalml.automl.automl_search.AutoMLSearch at 0x7f4228e917d0>
Early Stopping¶
You can also limit search time by providing a patience value for early stopping. With a patience value, EvalML will stop searching when the best objective score has not improved for n iterations. The patience value must be a positive integer. You can also provide a tolerance value: EvalML will only count a score as an improvement over the best score if the difference is greater than the tolerance percentage.
[12]:
from evalml.demos import load_diabetes
X, y = load_diabetes()
automl = AutoMLSearch(problem_type='regression', objective="MSE", patience=2, tolerance=0.01, max_pipelines=10)
automl.search(X, y)
Generating pipelines to search over...
*****************************
* Beginning pipeline search *
*****************************
Optimizing for MSE.
Lower score is better.
Searching up to 10 pipelines.
Allowed model families: linear_model, catboost, xgboost, random_forest
✔ Mean Baseline Regression Pipeline: 0%| | Elapsed:00:00
✔ CatBoost Regressor w/ Simple Imputer: 10%|█ | Elapsed:00:03
✔ Linear Regressor w/ Simple Imputer ... 20%|██ | Elapsed:00:03
✔ Random Forest Regressor w/ Simple I... 30%|███ | Elapsed:00:04
✔ XGBoost Regressor w/ Simple Imputer: 40%|████ | Elapsed:00:04
2 iterations without improvement. Stopping search early...
✔ Optimization finished 40%|████ | Elapsed:00:04
[13]:
automl.rankings
[13]:
|   | id | pipeline_name | score | high_variance_cv | parameters |
|---|----|---------------|-------|------------------|------------|
| 0 | 2 | Linear Regressor w/ Simple Imputer + Standard ... | 3027.144520 | False | {'Simple Imputer': {'impute_strategy': 'most_f... |
| 1 | 1 | CatBoost Regressor w/ Simple Imputer | 3279.699820 | False | {'Simple Imputer': {'impute_strategy': 'most_f... |
| 2 | 3 | Random Forest Regressor w/ Simple Imputer | 3301.475890 | False | {'Simple Imputer': {'impute_strategy': 'most_f... |
| 3 | 4 | XGBoost Regressor w/ Simple Imputer | 3960.404464 | False | {'Simple Imputer': {'impute_strategy': 'most_f... |
| 4 | 0 | Mean Baseline Regression Pipeline | 5943.716736 | False | {'Baseline Regressor': {'strategy': 'mean'}} |
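From the rankings, you can pull a pipeline back out of the search to inspect or retrain it. A sketch of that step (best_pipeline and get_pipeline are assumed available here; see the API reference for details):
[ ]:
# The pipeline with the best primary objective score
best = automl.best_pipeline

# Fetch or describe any pipeline by the id column shown in the rankings
pipeline = automl.get_pipeline(automl.rankings.iloc[0]["id"])
automl.describe_pipeline(automl.rankings.iloc[0]["id"])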
Control Cross Validation¶
EvalML cross-validates each model it tests during its search. By default, it uses 3-fold cross-validation. You can optionally provide your own cross-validation method.
[14]:
from sklearn.model_selection import StratifiedKFold
automl = AutoMLSearch(problem_type='binary',
                      objective="f1",
                      data_split=StratifiedKFold(5))
Using default limit of max_pipelines=5.
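The custom splitter is then used to evaluate every candidate pipeline during the search. A sketch of running it, again assuming the breast cancer demo dataset:
[ ]:
from evalml.demos import load_breast_cancer

# Each candidate pipeline is now scored with the 5-fold stratified splitter above
X, y = load_breast_cancer()
automl.search(X, y)
automl.rankings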