Simply examining a model’s performance metrics is not enough to select a model and promote it for use in a production setting. While developing an ML algorithm, it is important to understand how the model behaves on the data, to examine the key factors influencing its predictions, and to consider where it may be deficient. What “success” means for an ML project depends first and foremost on the user’s domain expertise.
EvalML includes a variety of tools for understanding models.
First, let’s train a pipeline on some data.
[1]:
import evalml

class RFBinaryClassificationPipeline(evalml.pipelines.BinaryClassificationPipeline):
    component_graph = ['Simple Imputer', 'Random Forest Classifier']

X, y = evalml.demos.load_breast_cancer()

pipeline = RFBinaryClassificationPipeline({})
pipeline.fit(X, y)
print(pipeline.score(X, y, objectives=['log_loss_binary']))
OrderedDict([('Log Loss Binary', 0.038403828027876195)])
We can get the importance associated with each feature of the resulting pipeline.
[2]:
pipeline.feature_importance
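The property returns the features ranked by importance. Assuming it comes back as a pandas DataFrame with feature and importance columns (as in recent EvalML releases), the top features can be pulled out directly, as in this small sketch:

# Hedged sketch: assumes feature_importance returns a pandas DataFrame
# with "feature" and "importance" columns, as in recent EvalML releases.
importance = pipeline.feature_importance
print(importance.sort_values('importance', ascending=False).head())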
We can also create a bar plot of the feature importances.
[3]:
pipeline.graph_feature_importance()
We can also compute and plot the permutation importance of the pipeline.
[4]:
evalml.pipelines.calculate_permutation_importance(pipeline, X, y, 'log_loss_binary')
[5]:
evalml.pipelines.graph_permutation_importance(pipeline, X, y, 'log_loss_binary')
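Conceptually, permutation importance shuffles one column at a time and measures how much the pipeline’s score degrades; the larger the degradation, the more the pipeline relies on that feature. The following is a minimal hand-rolled illustration of that idea (not EvalML’s implementation), using two columns from this dataset:

import numpy as np

# Minimal illustration of the idea behind permutation importance:
# shuffle one column and see how much the log loss worsens.
rng = np.random.RandomState(0)
baseline = pipeline.score(X, y, objectives=['log_loss_binary'])['Log Loss Binary']
for col in ['worst concave points', 'worst radius']:
    X_shuffled = X.copy()
    X_shuffled[col] = rng.permutation(X_shuffled[col].values)
    shuffled = pipeline.score(X_shuffled, y, objectives=['log_loss_binary'])['Log Loss Binary']
    print(col, shuffled - baseline)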
For binary or multiclass classification, we can view a confusion matrix of the classifier’s predictions.
[6]:
y_pred = pipeline.predict(X)
evalml.pipelines.graph_utils.graph_confusion_matrix(y, y_pred)
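If the raw counts behind the plot are needed, they can also be computed directly with scikit-learn; this is shown purely for illustration and is not part of the EvalML call above:

# Raw confusion-matrix counts via scikit-learn, for comparison with the plot.
from sklearn.metrics import confusion_matrix
print(confusion_matrix(y, y_pred, labels=['malignant', 'benign']))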
For binary classification, we can view the precision-recall curve of the pipeline.
[7]:
# get the predicted probabilities associated with the "true" label y = y.map({'malignant': 0, 'benign': 1}) y_pred_proba = pipeline.predict_proba(X)["benign"] evalml.pipelines.graph_utils.graph_precision_recall_curve(y, y_pred_proba)
For binary and multiclass classification, we can view the Receiver Operating Characteristic (ROC) curve of the pipeline.
[8]:
# get the predicted probabilities associated with the "benign" label y_pred_proba = pipeline.predict_proba(X)["benign"] evalml.pipelines.graph_utils.graph_roc_curve(y, y_pred_proba)
We can explain why the model made an individual prediction with the explain_prediction function. This uses the Shapley Additive Explanations (SHAP) algorithm to identify the top features that explain the predicted value.
This function can explain both classification and regression models - all you need to do is provide the pipeline, the input features (which must correspond to one row of the input data), and the training data. The function returns a table that you can print, summarizing the top 3 most positive and negative contributing features to the predicted value.
In the example below, we explain the prediction for the data point at index 3 in the dataset. We see that the worst concave points feature increased the estimated probability that the tumor is malignant by 20% while the worst radius feature decreased the probability the tumor is malignant by 5%.
[9]:
from evalml.pipelines.prediction_explanations import explain_prediction

table = explain_prediction(pipeline=pipeline, input_features=X.iloc[3:4],
                           training_data=X, include_shap_values=True)
print(table)
Positive Label

    Feature Name           Contribution to Prediction   SHAP Value
    ==============================================================
    worst concave points               ++                  0.200
    mean concave points                 +                  0.110
    mean concavity                      +                  0.080
    worst area                          -                 -0.030
    worst perimeter                     -                 -0.050
    worst radius                        -                 -0.050
The interpretation of the table is the same for regression problems - but the SHAP value now corresponds to the change in the estimated value of the dependent variable rather than a change in probability. For multiclass classification problems, a table will be output for each possible class.
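To explain several predictions, one option is simply to loop over single-row slices with the same explain_prediction call shown above (a small sketch, using only the arguments already demonstrated):

# Print explanations for the first three rows, one at a time.
for i in range(3):
    print(f"Row {i}")
    print(explain_prediction(pipeline=pipeline, input_features=X.iloc[i:i+1], training_data=X))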
This functionality is currently not supported for XGBoost models or CatBoost multiclass classifiers.