Components are the lowest level of building blocks in EvalML. Each component represents a fundamental operation to be applied to data.
All components accept parameters as keyword arguments to their __init__ methods. These parameters can be used to configure behavior.
Each component class definition must include a human-readable name for the component. Additionally, each component class may expose parameters for AutoML search by defining a hyperparameter_ranges attribute containing the parameters in question.
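For example, you can inspect these attributes on any component. Below is a quick sketch using the SimpleImputer introduced later in this section (the exact contents of parameters depend on the component's defaults):

from evalml.pipelines.components import SimpleImputer

# Components expose their configured parameters and searchable ranges as attributes
imp = SimpleImputer(impute_strategy="median")
print(imp.name)                             # "Simple Imputer"
print(imp.parameters)                       # includes {'impute_strategy': 'median', ...}
print(SimpleImputer.hyperparameter_ranges)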
EvalML splits components into two categories: transformers and estimators.
Transformers subclass the Transformer class, and define a fit method to learn information from training data and a transform method to apply a learned transformation to new data.
For example, an imputer is configured with the desired impute strategy to follow, for instance the mean value. The imputer's fit method would learn the mean from the training data, and the transform method would fill the learned mean value in for any missing values in new data.
All transformers can execute fit and transform separately or in one step by calling fit_transform. Defining a custom fit_transform method can facilitate useful performance optimizations in some cases.
[1]:
import numpy as np
import pandas as pd

from evalml.pipelines.components import SimpleImputer

X = pd.DataFrame([[1, 2, 3], [1, np.nan, 3]])
display(X)
[2]:
imp = SimpleImputer(impute_strategy="mean")
X = imp.fit_transform(X)
display(X)
Below is a list of all transformers included with EvalML:
[3]:
from evalml.pipelines.components.utils import all_components, Estimator, Transformer

for component in all_components():
    if issubclass(component, Transformer):
        print(f"Transformer: {component.name}")
Transformer: Delayed Feature Transformer
Transformer: Text Featurization Component
Transformer: LSA Transformer
Transformer: Drop Null Columns Transformer
Transformer: DateTime Featurization Component
Transformer: PCA Transformer
Transformer: Select Columns Transformer
Transformer: Drop Columns Transformer
Transformer: Standard Scaler
Transformer: Imputer
Transformer: Per Column Imputer
Transformer: Simple Imputer
Transformer: RF Regressor Select From Model
Transformer: RF Classifier Select From Model
Transformer: Target Encoder
Transformer: One Hot Encoder
Each estimator wraps an ML algorithm. Estimators subclass the Estimator class, and define a fit method to learn information from training data and a predict method for generating predictions from new data. Classification estimators should also define a predict_proba method for generating predicted probabilities.
Estimator classes each define a model_family attribute indicating what type of model is used.
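For instance, here is a quick sketch inspecting the model_family of the LogisticRegressionClassifier used below (the printed value reflects EvalML's ModelFamily enum):

from evalml.pipelines.components import LogisticRegressionClassifier

# Each estimator class declares the family of models it belongs to
print(LogisticRegressionClassifier.model_family)  # ModelFamily.LINEAR_MODEL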
Here’s an example of using the LogisticRegressionClassifier estimator to fit and predict on a simple dataset:
[4]:
from evalml.pipelines.components import LogisticRegressionClassifier

clf = LogisticRegressionClassifier()

# Reuse the imputed feature matrix X from the cells above
y = [1, 0]

clf.fit(X, y)
clf.predict(X)
0    0
1    0
dtype: int64
Below is a list of all estimators included with EvalML:
[5]:
from evalml.pipelines.components.utils import all_components, Estimator, Transformer

for component in all_components():
    if issubclass(component, Estimator):
        print(f"Estimator: {component.name}")
Estimator: Stacked Ensemble Regressor
Estimator: Stacked Ensemble Classifier
Estimator: Decision Tree Regressor
Estimator: Baseline Regressor
Estimator: Extra Trees Regressor
Estimator: XGBoost Regressor
Estimator: CatBoost Regressor
Estimator: Random Forest Regressor
Estimator: Linear Regressor
Estimator: Elastic Net Regressor
Estimator: Decision Tree Classifier
Estimator: LightGBM Classifier
Estimator: Baseline Classifier
Estimator: Extra Trees Classifier
Estimator: Elastic Net Classifier
Estimator: CatBoost Classifier
Estimator: XGBoost Classifier
Estimator: Random Forest Classifier
Estimator: Logistic Regression Classifier
EvalML allows you to easily create your own custom components by following the steps below.
Your transformer must inherit from the correct subclass: in this case, Transformer, for components that transform data. Next, we will use EvalML's DropNullColumns as an example.
[6]:
import pandas as pd

from evalml.pipelines.components import Transformer

class DropNullColumns(Transformer):
    """Transformer to drop features whose percentage of NaN values exceeds a specified threshold"""
    name = "Drop Null Columns Transformer"
    hyperparameter_ranges = {}

    def __init__(self, pct_null_threshold=1.0, random_state=0, **kwargs):
        """Initializes a transformer to drop features whose percentage of NaN values exceeds a specified threshold.

        Arguments:
            pct_null_threshold (float): The percentage of NaN values in an input feature to drop.
                Must be a value between [0, 1] inclusive. If equal to 0.0, will drop columns with any null values.
                If equal to 1.0, will drop columns with all null values. Defaults to 1.0.
        """
        if pct_null_threshold < 0 or pct_null_threshold > 1:
            raise ValueError("pct_null_threshold must be a float between 0 and 1, inclusive.")
        parameters = {"pct_null_threshold": pct_null_threshold}
        parameters.update(kwargs)

        self._cols_to_drop = None
        super().__init__(parameters=parameters,
                         component_obj=None,
                         random_state=random_state)

    def fit(self, X, y=None):
        pct_null_threshold = self.parameters["pct_null_threshold"]
        if not isinstance(X, pd.DataFrame):
            X = pd.DataFrame(X)
        percent_null = X.isnull().mean()
        if pct_null_threshold == 0.0:
            null_cols = percent_null[percent_null > 0]
        else:
            null_cols = percent_null[percent_null >= pct_null_threshold]
        self._cols_to_drop = list(null_cols.index)
        return self

    def transform(self, X, y=None):
        """Transforms data X by dropping columns that exceed the threshold of null values.

        Arguments:
            X (pd.DataFrame): Data to transform
            y (pd.Series, optional): Targets

        Returns:
            pd.DataFrame: Transformed X
        """
        if not isinstance(X, pd.DataFrame):
            X = pd.DataFrame(X)
        return X.drop(columns=self._cols_to_drop)
For a transformer, you must provide a name class attribute containing a human-readable name for the component.
Likewise, there are several methods you need to override, as Transformer is an abstract base class:
__init__() - the __init__() method of your transformer will need to call super().__init__() and pass in three parameters: a parameters dictionary holding the parameters to the component, the component_obj, and the random_state value. You can see that component_obj is set to None above; we will discuss component_obj in depth later on.
fit() - the fit() method is responsible for fitting your component on training data.
transform() - after fitting a component, the transform() method will take in new data and transform it accordingly. Note: fit() must be called before transform().
You can also call or override fit_transform(), which combines fit() and transform() into one method.
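As an illustration of that last point, here is a minimal sketch (using a hypothetical, stateless DoubleColumns transformer, not part of EvalML) where overriding fit_transform() skips a redundant fitting pass:

import pandas as pd

from evalml.pipelines.components import Transformer

class DoubleColumns(Transformer):
    """Hypothetical transformer that doubles every value; it learns nothing from training data."""
    name = "Double Columns Transformer"
    hyperparameter_ranges = {}

    def __init__(self, random_state=0, **kwargs):
        parameters = {}
        parameters.update(kwargs)
        super().__init__(parameters=parameters,
                         component_obj=None,
                         random_state=random_state)

    def fit(self, X, y=None):
        # Stateless: there is nothing to learn
        return self

    def transform(self, X, y=None):
        if not isinstance(X, pd.DataFrame):
            X = pd.DataFrame(X)
        return X * 2

    def fit_transform(self, X, y=None):
        # Since fit() is a no-op, go straight to transform() instead of fitting first
        return self.transform(X, y)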
Your estimator must inherit from the correct subclass: in this case, Estimator, for components that predict new target values. Next, we will use EvalML's BaselineRegressor as an example.
[7]:
import numpy as np
import pandas as pd

from evalml.model_family import ModelFamily
from evalml.pipelines.components.estimators import Estimator
from evalml.problem_types import ProblemTypes

class BaselineRegressor(Estimator):
    """Regressor that predicts using the specified strategy.

    This is useful as a simple baseline regressor to compare with other regressors.
    """
    name = "Baseline Regressor"
    hyperparameter_ranges = {}
    model_family = ModelFamily.BASELINE
    supported_problem_types = [ProblemTypes.REGRESSION]

    def __init__(self, strategy="mean", random_state=0, **kwargs):
        """Baseline regressor that uses a simple strategy to make predictions.

        Arguments:
            strategy (str): method used to predict. Valid options are "mean", "median". Defaults to "mean".
            random_state (int, np.random.RandomState): seed for the random number generator
        """
        if strategy not in ["mean", "median"]:
            raise ValueError("'strategy' parameter must equal either 'mean' or 'median'")
        parameters = {"strategy": strategy}
        parameters.update(kwargs)

        self._prediction_value = None
        self._num_features = None
        super().__init__(parameters=parameters,
                         component_obj=None,
                         random_state=random_state)

    def fit(self, X, y=None):
        if y is None:
            raise ValueError("Cannot fit Baseline regressor if y is None")
        if not isinstance(y, pd.Series):
            y = pd.Series(y)

        if self.parameters["strategy"] == "mean":
            self._prediction_value = y.mean()
        elif self.parameters["strategy"] == "median":
            self._prediction_value = y.median()
        self._num_features = X.shape[1]
        return self

    def predict(self, X):
        return pd.Series([self._prediction_value] * len(X))

    @property
    def feature_importance(self):
        """Returns importance associated with each feature. Since baseline regressors do not use input features to calculate predictions, returns an array of zeroes.

        Returns:
            np.ndarray (float): an array of zeroes
        """
        return np.zeros(self._num_features)
For an estimator, you must provide the following class attributes:
name - a human-readable name for the component
model_family - the EvalML model_family that this component belongs to
supported_problem_types - a list of the EvalML problem_types that this component supports
Model families and problem types include:
[8]:
from evalml.model_family import ModelFamily
from evalml.problem_types import ProblemTypes

print("Model Families:\n", [m.value for m in ModelFamily])
print("Problem Types:\n", [p.value for p in ProblemTypes])
Model Families:
 ['random_forest', 'xgboost', 'lightgbm', 'linear_model', 'catboost', 'extra_trees', 'ensemble', 'decision_tree', 'baseline', 'none']
Problem Types:
 ['binary', 'multiclass', 'regression', 'time series regression']
Likewise, there are several methods you need to override:
__init__() - the __init__() method of your estimator will need to call super().__init__() and pass in three parameters: a parameters dictionary holding the parameters to the component, the component_obj, and the random_state value.
fit() - the fit() method is responsible for fitting your component on training data.
predict() - after fitting a component, the predict() method will take in new data and predict new target values. Note: fit() must be called before predict().
feature_importance - feature_importance is a Python property that returns a list of importances associated with each feature.
If your estimator handles classification problems, it also requires an additional method:
predict_proba() - this method predicts probability estimates for classification labels.
The component_obj parameter is used for wrapping third-party objects and using them in a component implementation. If you're using a component_obj, you will need to define __init__() and pass in the relevant object, which must itself implement the required methods mentioned above. However, if the component_obj does not follow EvalML component conventions, you may need to override methods as needed. Below is an example of EvalML's LinearRegressor.
[9]:
from sklearn.linear_model import LinearRegression as SKLinearRegression

from evalml.model_family import ModelFamily
from evalml.pipelines.components.estimators import Estimator
from evalml.problem_types import ProblemTypes

class LinearRegressor(Estimator):
    """Linear Regressor."""
    name = "Linear Regressor"
    model_family = ModelFamily.LINEAR_MODEL
    supported_problem_types = [ProblemTypes.REGRESSION]

    def __init__(self, fit_intercept=True, normalize=False, n_jobs=-1, random_state=0, **kwargs):
        parameters = {
            'fit_intercept': fit_intercept,
            'normalize': normalize,
            'n_jobs': n_jobs
        }
        parameters.update(kwargs)
        linear_regressor = SKLinearRegression(**parameters)
        super().__init__(parameters=parameters,
                         component_obj=linear_regressor,
                         random_state=random_state)

    @property
    def feature_importance(self):
        return self._component_obj.coef_
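Conversely, when the wrapped object does not follow EvalML's conventions, you can bridge the two APIs by overriding methods yourself. Here is a minimal sketch with a hypothetical ThirdPartyScaler (not a real library) whose methods are named learn and apply rather than fit and transform:

import pandas as pd

from evalml.pipelines.components import Transformer

class ThirdPartyScaler:
    """Hypothetical third-party object whose API uses learn/apply instead of fit/transform."""
    def learn(self, X):
        self.means_ = X.mean()

    def apply(self, X):
        return X - self.means_

class MeanCenterer(Transformer):
    """Sketch of wrapping a non-conforming object by overriding fit() and transform()."""
    name = "Mean Centerer"
    hyperparameter_ranges = {}

    def __init__(self, random_state=0, **kwargs):
        parameters = {}
        parameters.update(kwargs)
        super().__init__(parameters=parameters,
                         component_obj=ThirdPartyScaler(),
                         random_state=random_state)

    def fit(self, X, y=None):
        if not isinstance(X, pd.DataFrame):
            X = pd.DataFrame(X)
        # Bridge EvalML's fit() convention to the wrapped object's learn()
        self._component_obj.learn(X)
        return self

    def transform(self, X, y=None):
        if not isinstance(X, pd.DataFrame):
            X = pd.DataFrame(X)
        return self._component_obj.apply(X)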
hyperparameter_ranges is a dictionary mapping the parameter name (str) to an allowed range (SkOpt Space) for that parameter. Both lists and skopt.space.Categorical values are accepted for categorical spaces.
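For example, here is a sketch of ranges for a few parameters (assuming scikit-optimize is installed; the parameter names are illustrative only):

from skopt.space import Integer, Real

hyperparameter_ranges = {
    "max_depth": Integer(1, 10),                             # integer-valued range
    "learning_rate": Real(0.0001, 0.1),                      # continuous range
    "impute_strategy": ["mean", "median", "most_frequent"],  # categorical, as a plain list
}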
AutoML will perform a search over the allowed ranges for each parameter to select models which produce optimal performance within those ranges. AutoML gets the allowed ranges for each component from the component’s hyperparameter_ranges class attribute. Any component parameter you add an entry for in hyperparameter_ranges will be included in the AutoML search. If parameters are omitted, AutoML will use the default value in all pipelines.
[10]:
from sklearn.linear_model import LinearRegression as SKLinearRegression

from evalml.model_family import ModelFamily
from evalml.pipelines.components.estimators import Estimator
from evalml.problem_types import ProblemTypes

class LinearRegressor(Estimator):
    """Linear Regressor."""
    name = "Linear Regressor"
    hyperparameter_ranges = {
        'fit_intercept': [True, False],
        'normalize': [True, False]
    }
    model_family = ModelFamily.LINEAR_MODEL
    supported_problem_types = [ProblemTypes.REGRESSION]

    def __init__(self, fit_intercept=True, normalize=False, n_jobs=-1, random_state=0, **kwargs):
        parameters = {
            'fit_intercept': fit_intercept,
            'normalize': normalize,
            'n_jobs': n_jobs
        }
        parameters.update(kwargs)
        linear_regressor = SKLinearRegression(**parameters)
        super().__init__(parameters=parameters,
                         component_obj=linear_regressor,
                         random_state=random_state)

    @property
    def feature_importance(self):
        return self._component_obj.coef_
Once you have a component defined in EvalML, you can generate a string of Python code that recreates this component, which can then be saved and run elsewhere with EvalML. generate_component_code requires a component instance as the input. This method works for custom components as well, although it won't return the code required to define the custom component.
[11]:
from evalml.pipelines.components import LogisticRegressionClassifier
from evalml.pipelines.components.utils import generate_component_code

lr = LogisticRegressionClassifier(C=5)
code = generate_component_code(lr)
print(code)
from evalml.pipelines.components.estimators.classifiers.logistic_regression import LogisticRegressionClassifier

logisticRegressionClassifier = LogisticRegressionClassifier(**{'penalty': 'l2', 'C': 5, 'n_jobs': -1, 'multi_class': 'auto', 'solver': 'lbfgs'})
[12]:
# This string can then be copied and pasted into a separate window and executed as Python code
exec(code)
logisticRegressionClassifier
LogisticRegressionClassifier(penalty='l2', C=5, n_jobs=-1, multi_class='auto', solver='lbfgs')
[13]:
# Custom component
import pandas as pd

from evalml.pipelines.components import Transformer
from evalml.pipelines.components.utils import generate_component_code

class MyDropNullColumns(Transformer):
    """Transformer to drop features whose percentage of NaN values exceeds a specified threshold"""
    name = "My Drop Null Columns Transformer"
    hyperparameter_ranges = {}

    def __init__(self, pct_null_threshold=1.0, random_state=0, **kwargs):
        """Initializes a transformer to drop features whose percentage of NaN values exceeds a specified threshold.

        Arguments:
            pct_null_threshold (float): The percentage of NaN values in an input feature to drop.
                Must be a value between [0, 1] inclusive. If equal to 0.0, will drop columns with any null values.
                If equal to 1.0, will drop columns with all null values. Defaults to 1.0.
        """
        if pct_null_threshold < 0 or pct_null_threshold > 1:
            raise ValueError("pct_null_threshold must be a float between 0 and 1, inclusive.")
        parameters = {"pct_null_threshold": pct_null_threshold}
        parameters.update(kwargs)

        self._cols_to_drop = None
        super().__init__(parameters=parameters,
                         component_obj=None,
                         random_state=random_state)

    def fit(self, X, y=None):
        pct_null_threshold = self.parameters["pct_null_threshold"]
        if not isinstance(X, pd.DataFrame):
            X = pd.DataFrame(X)
        percent_null = X.isnull().mean()
        if pct_null_threshold == 0.0:
            null_cols = percent_null[percent_null > 0]
        else:
            null_cols = percent_null[percent_null >= pct_null_threshold]
        self._cols_to_drop = list(null_cols.index)
        return self

    def transform(self, X, y=None):
        """Transforms data X by dropping columns that exceed the threshold of null values.

        Arguments:
            X (pd.DataFrame): Data to transform
            y (pd.Series, optional): Targets

        Returns:
            pd.DataFrame: Transformed X
        """
        if not isinstance(X, pd.DataFrame):
            X = pd.DataFrame(X)
        return X.drop(columns=self._cols_to_drop)

myDropNull = MyDropNullColumns()
print(generate_component_code(myDropNull))
myDropNullColumnsTransformer = MyDropNullColumns(**{'pct_null_threshold': 1.0})
EvalML expects the following from custom classification component implementations:
Classification targets are integers ranging from 0 to n-1.
For classification estimators, the order of predict_proba's columns must match the order of the target, and the column names must be integers ranging from 0 to n-1.
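To make these conventions concrete, here is a hedged sketch of a predict_proba helper that follows them, using a hypothetical two-class majority baseline (majority_class and num_classes stand in for values a real estimator would learn in fit()):

import numpy as np
import pandas as pd

def majority_predict_proba(X, majority_class, num_classes):
    """Hypothetical sketch: assign all probability mass to the majority class learned during fit."""
    proba = np.zeros((len(X), num_classes))
    proba[:, majority_class] = 1.0
    # Column names are the integers 0..n-1, in the same order as the target classes
    return pd.DataFrame(proba, columns=range(num_classes))

print(majority_predict_proba(pd.DataFrame({"a": [1, 2, 3]}), majority_class=1, num_classes=2))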