Exploring search results

After finishing a pipeline search, we can inspect the results. First, let’s build a search over 5 different pipelines to explore.

[1]:
import evalml
from evalml import AutoMLSearch

X, y = evalml.demos.load_breast_cancer()

automl = AutoMLSearch(problem_type='binary',
                      objective="f1",
                      max_pipelines=5)

automl.search(X, y)
Generating pipelines to search over...
*****************************
* Beginning pipeline search *
*****************************

Optimizing for F1.
Greater score is better.

Searching up to 5 pipelines.
Allowed model families: xgboost, linear_model, catboost, random_forest

✔ Mode Baseline Binary Classification...     0%|          | Elapsed:00:00
✔ CatBoost Classifier w/ Simple Imput...    20%|██        | Elapsed:00:22
✔ Logistic Regression Classifier w/ S...    40%|████      | Elapsed:00:23
✔ Random Forest Classifier w/ Simple ...    60%|██████    | Elapsed:00:25
✔ XGBoost Classifier w/ Simple Imputer:     80%|████████  | Elapsed:00:25
✔ Optimization finished                     80%|████████  | Elapsed:00:25

View Rankings

A summary of all the pipelines built can be returned as a pandas DataFrame, sorted by score. Based on our objective function, EvalML knows whether a higher or lower score is better.

[2]:
automl.rankings
[2]:
id pipeline_name score high_variance_cv parameters
0 2 Logistic Regression Classifier w/ Simple Imput... 0.980447 False {'Simple Imputer': {'impute_strategy': 'most_f...
1 1 CatBoost Classifier w/ Simple Imputer 0.976333 False {'Simple Imputer': {'impute_strategy': 'most_f...
2 4 XGBoost Classifier w/ Simple Imputer 0.970577 False {'Simple Imputer': {'impute_strategy': 'most_f...
3 3 Random Forest Classifier w/ Simple Imputer 0.966629 False {'Simple Imputer': {'impute_strategy': 'most_f...
4 0 Mode Baseline Binary Classification Pipeline 0.771060 False {'Baseline Classifier': {'strategy': 'random_w...
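
Since the rankings are a plain pandas DataFrame, they can be filtered and queried like any other. As a minimal sketch, here is how one might pull the id of the top-ranked pipeline programmatically; the `rankings` DataFrame below is a hand-built stand-in for `automl.rankings` using the values from the table above (a real run's scores will differ slightly):

```python
import pandas as pd

# Stand-in for automl.rankings, copied from the table above.
rankings = pd.DataFrame({
    'id': [2, 1, 4, 3, 0],
    'pipeline_name': [
        'Logistic Regression Classifier w/ Simple Imputer',
        'CatBoost Classifier w/ Simple Imputer',
        'XGBoost Classifier w/ Simple Imputer',
        'Random Forest Classifier w/ Simple Imputer',
        'Mode Baseline Binary Classification Pipeline',
    ],
    'score': [0.980447, 0.976333, 0.970577, 0.966629, 0.771060],
})

# The table is sorted by score, so the first row is the best pipeline.
best_id = rankings.iloc[0]['id']
print(best_id)  # 2

# Pipelines scoring above a threshold:
good = rankings[rankings['score'] > 0.97]
print(len(good))  # 3
```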

Describe Pipeline

Each pipeline is given an id. We can get more information about any particular pipeline using that id. Here, we will get more information about the pipeline with id = 1.

[3]:
automl.describe_pipeline(1)
*****************************************
* CatBoost Classifier w/ Simple Imputer *
*****************************************

Problem Type: Binary Classification
Model Family: CatBoost

Pipeline Steps
==============
1. Simple Imputer
         * impute_strategy : most_frequent
         * fill_value : None
2. CatBoost Classifier
         * n_estimators : 1000
         * eta : 0.03
         * max_depth : 6
         * bootstrap_type : None

Training
========
Training for Binary Classification problems.
Total training time (including CV): 22.6 seconds

Cross Validation
----------------
               F1  Accuracy Binary  Balanced Accuracy Binary  Precision   AUC  Log Loss Binary  MCC Binary # Training # Testing
0           0.967            0.958                     0.949      0.951 0.995            0.106       0.910    379.000   190.000
1           0.983            0.979                     0.975      0.975 0.994            0.082       0.955    379.000   190.000
2           0.979            0.974                     0.976      0.991 0.990            0.093       0.944    380.000   189.000
mean        0.976            0.970                     0.967      0.973 0.993            0.094       0.936          -         -
std         0.008            0.011                     0.015      0.020 0.003            0.012       0.024          -         -
coef of var 0.009            0.011                     0.016      0.021 0.003            0.128       0.025          -         -
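
The mean, std, and coef of var rows in the cross-validation table are simple summaries of the per-fold scores. As a quick sanity check on the F1 column, a sketch with the three fold values hard-coded from the printout above:

```python
import statistics

# Per-fold F1 scores from the CV table above.
f1_folds = [0.967, 0.983, 0.979]

mean = statistics.mean(f1_folds)
std = statistics.stdev(f1_folds)  # sample standard deviation across folds
coef_of_var = std / mean          # relative spread, unitless

print(round(mean, 3))         # 0.976
print(round(std, 3))          # 0.008
print(round(coef_of_var, 3))  # 0.009
```

These match the summary rows reported for F1 in the table.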

Get Pipeline

We can also get the pipeline object for any pipeline by its id:

[4]:
automl.get_pipeline(1)
[4]:
<evalml.pipelines.utils.make_pipeline.<locals>.GeneratedPipeline at 0x7fb8a8591748>

Get best pipeline

If we specifically want to get the best pipeline, there is a convenient accessor for that:

[5]:
automl.best_pipeline
[5]:
<evalml.pipelines.utils.make_pipeline.<locals>.GeneratedPipeline at 0x7fb84acf0c88>

Feature Importance

We can get the importance associated with each feature of the resulting pipeline:

[6]:
pipeline = automl.get_pipeline(1)
pipeline.fit(X, y)
pipeline.feature_importance
[6]:
feature importance
0 worst texture 11.284484
1 worst concave points 8.105820
2 worst perimeter 7.989750
3 worst radius 7.872959
4 worst area 7.851859
5 mean concave points 7.098117
6 mean texture 5.919744
7 worst smoothness 4.831183
8 worst concavity 4.665609
9 area error 4.204315
10 compactness error 2.616379
11 worst symmetry 2.088257
12 mean concavity 2.015011
13 radius error 1.948857
14 concave points error 1.875384
15 mean compactness 1.824212
16 perimeter error 1.705293
17 mean smoothness 1.697390
18 worst fractal dimension 1.629284
19 mean radius 1.585845
20 mean area 1.479429
21 smoothness error 1.460370
22 fractal dimension error 1.370231
23 mean perimeter 1.306400
24 texture error 1.065599
25 mean fractal dimension 1.000343
26 worst compactness 0.967447
27 mean symmetry 0.960967
28 symmetry error 0.912462
29 concavity error 0.666999
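
`pipeline.feature_importance` is also a pandas DataFrame, so the usual DataFrame operations apply. A minimal sketch of selecting the most important features, using a hand-built stand-in for the first few rows shown above (real values come from the fitted pipeline):

```python
import pandas as pd

# Stand-in for pipeline.feature_importance, from the first rows above.
fi = pd.DataFrame({
    'feature': ['worst texture', 'worst concave points', 'worst perimeter',
                'worst radius', 'worst area'],
    'importance': [11.284484, 8.105820, 7.989750, 7.872959, 7.851859],
})

# Top 3 features by importance.
top3 = fi.nlargest(3, 'importance')['feature'].tolist()
print(top3)  # ['worst texture', 'worst concave points', 'worst perimeter']
```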

We can also create a bar plot of the feature importances:

[7]:
pipeline.graph_feature_importance()