EvalML Components and Pipelines¶
EvalML searches and trains multiple machine learning pipelines in order to find the best one for your data. Each pipeline is made up of various components that can learn from the data, transform the data, and ultimately predict labels given new data. Below we'll show an example of an EvalML pipeline. You can find a more in-depth look into components or learn how you can construct and use your own pipelines.
XGBoost Pipeline¶
The EvalML XGBoost Pipeline
is made up of four different components: a one-hot encoder, a missing value imputer, a feature selector, and an XGBoost estimator. To initialize a pipeline, you need a parameters dictionary.
Parameters¶
The parameters dictionary needs to be a two-layered dictionary: the first layer maps each component name to that component's parameters dictionary, and each component parameters dictionary maps parameter names to parameter values. An example is shown below, and component parameters can be found here.
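As a conceptual sketch of this two-layered structure (not EvalML internals; the component and parameter names below are illustrative), the dictionary can be navigated like any nested Python dictionary:

```python
# Two-layered parameters dictionary:
# component name -> {parameter name -> parameter value}
parameters = {
    'Simple Imputer': {'impute_strategy': 'mean'},
    'XGBoost Classifier': {'n_estimators': 20, 'eta': 0.5},
}

# First layer: look up a component by name.
imputer_params = parameters['Simple Imputer']

# Second layer: look up a parameter for that component.
strategy = imputer_params['impute_strategy']
print(strategy)  # mean

# Iterating gives each component paired with its own parameters.
for component_name, component_params in parameters.items():
    print(component_name, component_params)
```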
[1]:
from evalml.demos import load_breast_cancer
from evalml.pipelines import XGBoostBinaryPipeline
X, y = load_breast_cancer()
parameters = {
    'Simple Imputer': {
        'impute_strategy': 'mean'
    },
    'RF Classifier Select From Model': {
        "percent_features": 0.5,
        "number_features": X.shape[1],
        "n_estimators": 20,
        "max_depth": 5
    },
    'XGBoost Classifier': {
        "n_estimators": 20,
        "eta": 0.5,
        "min_child_weight": 5,
        "max_depth": 10,
    }
}
xgp = XGBoostBinaryPipeline(parameters=parameters, random_state=5)
xgp.graph()
[1]:
From the graph above we can see each component and its parameters. Each component takes in data and feeds it to the next. You can see more detailed information by calling .describe():
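The feed-forward behavior described above can be sketched with plain Python. The classes below are illustrative stand-ins, not EvalML's actual components: each transformer hands its output to the next component, and the final estimator produces predictions.

```python
class FillMissing:
    """Toy transformer: replaces None with a fill value (stands in for an imputer)."""
    def __init__(self, fill_value=0):
        self.fill_value = fill_value

    def transform(self, rows):
        return [[self.fill_value if v is None else v for v in row] for row in rows]


class ThresholdEstimator:
    """Toy estimator: predicts 1 when the row sum exceeds a threshold."""
    def __init__(self, threshold=10):
        self.threshold = threshold

    def predict(self, rows):
        return [1 if sum(row) > self.threshold else 0 for row in rows]


def run_pipeline(components, rows):
    # Each component transforms the data and feeds its output to the next;
    # the last component turns the transformed data into predictions.
    for component in components[:-1]:
        rows = component.transform(rows)
    return components[-1].predict(rows)


preds = run_pipeline(
    [FillMissing(fill_value=5), ThresholdEstimator(threshold=10)],
    [[None, 3], [8, 7]],
)
print(preds)  # [0, 1]
```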
[2]:
xgp.describe()
******************************************
* XGBoost Binary Classification Pipeline *
******************************************
Problem Type: Binary Classification
Model Family: XGBoost
Pipeline Steps
==============
1. Simple Imputer
* impute_strategy : mean
* fill_value : None
2. One Hot Encoder
* top_n : 10
* categories : None
* drop : None
* handle_unknown : ignore
* handle_missing : error
3. XGBoost Classifier
* eta : 0.5
* max_depth : 10
* min_child_weight : 5
* n_estimators : 20
You can then fit and score an individual pipeline with an objective. An objective can be either a string representation of an EvalML objective or an EvalML objective class. You can find more objectives here.
[3]:
xgp.fit(X, y)
xgp.score(X, y, objectives=['f1'])
[21:06:58] WARNING: ../src/learner.cc:1061: Starting in XGBoost 1.3.0, the default evaluation metric used with the objective 'binary:logistic' was changed from 'error' to 'logloss'. Explicitly set eval_metric if you'd like to restore the old behavior.
/home/docs/checkouts/readthedocs.org/user_builds/feature-labs-inc-evalml/envs/v0.11.0/lib/python3.7/site-packages/xgboost/sklearn.py:888: UserWarning: The use of label encoder in XGBClassifier is deprecated and will be removed in a future release. To remove this warning, do the following: 1) Pass option use_label_encoder=False when constructing XGBClassifier object; and 2) Encode your labels (y) as integers starting with 0, i.e. 0, 1, 2, ..., [num_class - 1].
warnings.warn(label_encoder_deprecation_msg, UserWarning)
[3]:
OrderedDict([('F1', 0.9916434540389972)])
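The F1 value reported above is the harmonic mean of precision and recall. As a quick worked sketch of what this metric computes, with made-up labels rather than the breast cancer data:

```python
def f1_score(y_true, y_pred):
    # Count true positives, false positives, and false negatives.
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    # F1 is the harmonic mean of precision and recall.
    return 2 * precision * recall / (precision + recall)


y_true = [1, 1, 0, 1, 0, 1]
y_pred = [1, 0, 0, 1, 1, 1]
print(f1_score(y_true, y_pred))  # 0.75
```

Here precision and recall are both 3/4, so F1 is 0.75. A score near 1.0, like the one above, indicates the pipeline's predictions closely match the true labels on this data.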