EvalML Components and Pipelines

EvalML searches and trains multiple machine learning pipelines in order to find the best one for your data. Each pipeline is made up of various components that can learn from the data, transform the data, and ultimately predict labels given new data. Below we'll show an example of an EvalML pipeline. You can find a more in-depth look into components or learn how you can construct and use your own pipelines.
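To make the component interface concrete before looking at a full pipeline, here is a minimal sketch of using a single transformer component on its own. This assumes SimpleImputer is importable from evalml.pipelines.components and follows the usual fit/transform convention; exact import paths can vary between EvalML versions.

import numpy as np
import pandas as pd
from evalml.pipelines.components import SimpleImputer

# A tiny feature matrix with one missing value.
X_demo = pd.DataFrame({'a': [1.0, np.nan, 3.0], 'b': [4.0, 5.0, 6.0]})

# Transformer components learn from data with fit() and modify it with transform().
imputer = SimpleImputer(impute_strategy='mean')
imputer.fit(X_demo)
X_imputed = imputer.transform(X_demo)  # the NaN in column 'a' becomes the column mean, 2.0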

XGBoost Pipeline

The EvalML XGBoost Pipeline is made up of four different components: a one-hot encoder, a missing value imputer, a feature selector and an XGBoost estimator. We can see them here by calling .graph():

[1]:
from evalml.demos import load_breast_cancer
from evalml.pipelines import XGBoostPipeline

X, y = load_breast_cancer()

objective = 'recall'
parameters = {
    'Simple Imputer': {
        'impute_strategy': 'mean'
    },
    'RF Classifier Select From Model': {
        'percent_features': 0.5,
        'number_features': X.shape[1],
        'n_estimators': 20,
        'max_depth': 5
    },
    'XGBoost Classifier': {
        'n_estimators': 20,
        'eta': 0.5,
        'min_child_weight': 5,
        'max_depth': 10
    }
}

xgp = XGBoostPipeline(objective=objective, parameters=parameters, random_state=5)
xgp.graph()
[1]:
[Pipeline graph: One Hot Encoder → Simple Imputer → RF Classifier Select From Model → XGBoost Classifier]

From the above graph we can see each component and its parameters. Each component takes in data and feeds its output to the next. You can see more detailed information by calling .describe():

[2]:
xgp.describe()
***********************************
* XGBoost Classification Pipeline *
***********************************

Supported Problem Types: Binary Classification, Multiclass Classification
Model Family: XGBoost Classifier
Objective to Optimize: Recall (greater is better)

Pipeline Steps
==============
1. One Hot Encoder
         * top_n : 10
2. Simple Imputer
         * impute_strategy : mean
         * fill_value : None
3. RF Classifier Select From Model
         * percent_features : 0.5
         * threshold : -inf
4. XGBoost Classifier
         * eta : 0.5
         * max_depth : 10
         * min_child_weight : 5
         * n_estimators : 20
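
The same details are available programmatically. As a minimal sketch, assuming the pipeline exposes its resolved configuration through a parameters attribute keyed by component name (as in recent EvalML releases):

for component_name, component_parameters in xgp.parameters.items():
    print(component_name, component_parameters)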

You can then fit and score an individual pipeline:

[3]:
xgp.fit(X, y)
xgp.score(X, y)
[3]:
(0.9971988795518207, {})
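
Keep in mind that scoring on the same data used for fitting gives an optimistic estimate of performance. Here is a minimal sketch of evaluating on a holdout split instead, using scikit-learn's train_test_split; the (score, other_scores) tuple unpacking matches the output shape shown above.

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=5)

xgp.fit(X_train, y_train)
score, _ = xgp.score(X_test, y_test)  # score() returns (objective_score, other_scores)
predictions = xgp.predict(X_test)     # class predictions for unseen rows
print(score)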