Components#

Components are the lowest level of building blocks in EvalML. Each component represents a fundamental operation to be applied to data.

All components accept parameters as keyword arguments to their __init__ methods. These parameters can be used to configure behavior.

Each component class definition must include a human-readable name for the component. Additionally, each component class may expose parameters for AutoML search by defining a hyperparameter_ranges attribute containing the parameters in question.
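
As a rough, library-free illustration of these conventions (keyword-argument parameters, a human-readable name, and an optional hyperparameter_ranges attribute), consider the sketch below. SketchComponent and ScalerSketch are hypothetical names used only for illustration. They are not EvalML classes, and real components subclass EvalML's own base classes.

```python
class SketchComponent:
    """Hypothetical stand-in for a component base class (not EvalML's)."""

    name = None  # human-readable name, set by each subclass
    hyperparameter_ranges = {}  # parameters exposed to a search, if any

    def __init__(self, parameters=None, random_seed=0):
        self.parameters = dict(parameters or {})
        self.random_seed = random_seed


class ScalerSketch(SketchComponent):
    name = "Scaler Sketch"
    # Only `factor` is exposed for tuning; `offset` keeps its configured value.
    hyperparameter_ranges = {"factor": [0.5, 1.0, 2.0]}

    def __init__(self, factor=1.0, offset=0.0, random_seed=0, **kwargs):
        # Collect every keyword argument into a single parameters dictionary.
        parameters = {"factor": factor, "offset": offset}
        parameters.update(kwargs)
        super().__init__(parameters=parameters, random_seed=random_seed)


scaler = ScalerSketch(factor=2.0)
print(scaler.name, scaler.parameters)
```

Keeping all configuration in one parameters dictionary is what makes it possible to inspect, copy, or re-create a component from its settings alone.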

EvalML splits components into two categories: transformers and estimators.

Transformers#

Transformers subclass the Transformer class, and define a fit method to learn information from training data and a transform method to apply a learned transformation to new data.

For example, an imputer is configured with an impute strategy, such as the mean. The imputer's fit method learns the mean from the training data, and its transform method fills in the learned mean for any missing values in new data.

All transformers can execute fit and transform separately or in one step by calling fit_transform. Defining a custom fit_transform method can facilitate useful performance optimizations in some cases.

[1]:
import numpy as np
import pandas as pd
from evalml.pipelines.components import SimpleImputer

X = pd.DataFrame([[1, 2, 3], [1, np.nan, 3]])
display(X)
   0    1  2
0  1  2.0  3
1  1  NaN  3
[2]:
import woodwork as ww

imp = SimpleImputer(impute_strategy="mean")

X.ww.init()
X = imp.fit_transform(X)
display(X)
   0    1  2
0  1  2.0  3
1  1  2.0  3
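
To make the split between fit and transform concrete, here is a minimal sketch in plain Python with no EvalML or pandas dependency. MeanImputerSketch is a hypothetical class written for this illustration. It is not how SimpleImputer is implemented, and it works on lists of rows rather than DataFrames.

```python
import math


class MeanImputerSketch:
    """Library-free illustration of the fit/transform contract."""

    def __init__(self):
        self._means = None

    def fit(self, rows):
        # Learn one mean per column, ignoring missing (NaN) entries.
        n_cols = len(rows[0])
        self._means = []
        for j in range(n_cols):
            present = [row[j] for row in rows if not math.isnan(row[j])]
            self._means.append(sum(present) / len(present))
        return self

    def transform(self, rows):
        if self._means is None:
            raise RuntimeError("fit() must be called before transform()")
        # Fill each missing entry with the mean learned during fit.
        return [
            [self._means[j] if math.isnan(v) else v for j, v in enumerate(row)]
            for row in rows
        ]

    def fit_transform(self, rows):
        return self.fit(rows).transform(rows)


train = [[1.0, 2.0, 3.0], [1.0, float("nan"), 3.0]]
imputer = MeanImputerSketch()
filled = imputer.fit_transform(train)

# New data is filled with the means learned from the training data.
new_filled = imputer.transform([[float("nan"), float("nan"), float("nan")]])
print(filled)
print(new_filled)
```

Note that transform never recomputes the means: the values applied to new data are exactly those learned from the training data.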

Below is a list of all transformers included with EvalML:

[3]:
from evalml.pipelines.components.utils import all_components, Estimator, Transformer

for component in all_components():
    if issubclass(component, Transformer):
        print(f"Transformer: {component.name}")
Transformer: Time Series Regularizer
Transformer: Drop NaN Rows Transformer
Transformer: Replace Nullable Types Transformer
Transformer: Drop Rows Transformer
Transformer: URL Featurizer
Transformer: Email Featurizer
Transformer: Log Transformer
Transformer: STL Decomposer
Transformer: Polynomial Decomposer
Transformer: DFS Transformer
Transformer: Time Series Featurizer
Transformer: Natural Language Featurizer
Transformer: LSA Transformer
Transformer: Drop Null Columns Transformer
Transformer: DateTime Featurizer
Transformer: PCA Transformer
Transformer: Linear Discriminant Analysis Transformer
Transformer: Select Columns By Type Transformer
Transformer: Select Columns Transformer
Transformer: Drop Columns Transformer
Transformer: Oversampler
Transformer: Undersampler
Transformer: Standard Scaler
Transformer: Time Series Imputer
Transformer: Target Imputer
Transformer: Imputer
Transformer: KNN Imputer
Transformer: Per Column Imputer
Transformer: Simple Imputer
Transformer: RFE Selector with RF Regressor
Transformer: RFE Selector with RF Classifier
Transformer: RF Regressor Select From Model
Transformer: RF Classifier Select From Model
Transformer: Ordinal Encoder
Transformer: Label Encoder
Transformer: Target Encoder
Transformer: One Hot Encoder

Estimators#

Each estimator wraps an ML algorithm. Estimators subclass the Estimator class, and define a fit method to learn information from training data and a predict method for generating predictions from new data. Classification estimators should also define a predict_proba method for generating predicted probabilities.

Estimator classes each define a model_family attribute indicating what type of model is used.

Here’s an example of using the LogisticRegressionClassifier estimator to fit and predict on a simple dataset:

[4]:
from evalml.pipelines.components import LogisticRegressionClassifier

clf = LogisticRegressionClassifier()

# X is the imputed DataFrame produced in the transformer example above
y = [1, 0]

clf.fit(X, y)
clf.predict(X)
[4]:
0    0
1    0
dtype: int64

Below is a list of all estimators included with EvalML:

[5]:
from evalml.pipelines.components.utils import all_components, Estimator, Transformer

for component in all_components():
    if issubclass(component, Estimator):
        print(f"Estimator: {component.name}")
Estimator: Stacked Ensemble Regressor
Estimator: Stacked Ensemble Classifier
Estimator: Vowpal Wabbit Regressor
Estimator: VARMAX Regressor
Estimator: ARIMA Regressor
Estimator: Exponential Smoothing Regressor
Estimator: SVM Regressor
Estimator: Prophet Regressor
Estimator: Multiseries Time Series Baseline Regressor
Estimator: Time Series Baseline Estimator
Estimator: Decision Tree Regressor
Estimator: Baseline Regressor
Estimator: Extra Trees Regressor
Estimator: XGBoost Regressor
Estimator: CatBoost Regressor
Estimator: Random Forest Regressor
Estimator: LightGBM Regressor
Estimator: Linear Regressor
Estimator: Elastic Net Regressor
Estimator: Vowpal Wabbit Multiclass Classifier
Estimator: Vowpal Wabbit Binary Classifier
Estimator: SVM Classifier
Estimator: KNN Classifier
Estimator: Decision Tree Classifier
Estimator: LightGBM Classifier
Estimator: Baseline Classifier
Estimator: Extra Trees Classifier
Estimator: Elastic Net Classifier
Estimator: CatBoost Classifier
Estimator: XGBoost Classifier
Estimator: Random Forest Classifier
Estimator: Logistic Regression Classifier

Defining Custom Components#

EvalML allows you to easily create your own custom components by following the steps below.

Custom Transformers#

Your transformer must inherit from the correct base class, which is Transformer for components that transform data. We will use EvalML’s DropNullColumns as an example.

[6]:
from evalml.pipelines.components import Transformer
from evalml.utils import (
    infer_feature_types,
)


class DropNullColumns(Transformer):
    """Transformer to drop features whose percentage of NaN values exceeds a specified threshold"""

    name = "Drop Null Columns Transformer"
    hyperparameter_ranges = {}

    def __init__(self, pct_null_threshold=1.0, random_seed=0, **kwargs):
        """Initializes a transformer to drop features whose percentage of NaN values exceeds a specified threshold.

        Args:
            pct_null_threshold (float): The percentage of NaN values in an input feature to drop.
                Must be a value between [0, 1] inclusive. If equal to 0.0, will drop columns with any null values.
                If equal to 1.0, will drop columns with all null values. Defaults to 1.0.
        """
        if pct_null_threshold < 0 or pct_null_threshold > 1:
            raise ValueError(
                "pct_null_threshold must be a float between 0 and 1, inclusive."
            )
        parameters = {"pct_null_threshold": pct_null_threshold}
        parameters.update(kwargs)

        self._cols_to_drop = None
        super().__init__(
            parameters=parameters, component_obj=None, random_seed=random_seed
        )

    def fit(self, X, y=None):
        """Fits DropNullColumns component to data

        Args:
            X (pd.DataFrame): The input training data of shape [n_samples, n_features]
            y (pd.Series, optional): The target training data of length [n_samples]

        Returns:
            self
        """
        pct_null_threshold = self.parameters["pct_null_threshold"]
        X_t = infer_feature_types(X)
        percent_null = X_t.isnull().mean()
        if pct_null_threshold == 0.0:
            null_cols = percent_null[percent_null > 0]
        else:
            null_cols = percent_null[percent_null >= pct_null_threshold]
        self._cols_to_drop = list(null_cols.index)
        return self

    def transform(self, X, y=None):
        """Transforms data X by dropping columns that exceed the threshold of null values.

        Args:
            X (pd.DataFrame): Data to transform
            y (pd.Series, optional): Ignored.

        Returns:
            pd.DataFrame: Transformed X
        """
        X_t = infer_feature_types(X)
        # Drop by column name through the Woodwork accessor so typing information is kept.
        return X_t.ww.drop(columns=self._cols_to_drop)
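
The threshold rule used in fit can be checked by hand with a small, library-free sketch. null_columns is a hypothetical helper written for this illustration. It mirrors the comparison logic in the fit method above, using None for missing values and a plain dictionary of columns instead of a DataFrame.

```python
def null_columns(columns, pct_null_threshold):
    """Return the names of columns whose share of None values meets the threshold."""
    to_drop = []
    for name, values in columns.items():
        pct_null = sum(v is None for v in values) / len(values)
        if pct_null_threshold == 0.0:
            # A threshold of 0.0 means "drop columns with any null value".
            drop = pct_null > 0
        else:
            drop = pct_null >= pct_null_threshold
        if drop:
            to_drop.append(name)
    return to_drop


data = {
    "all_null": [None, None, None, None],
    "half_null": [1, None, 3, None],
    "no_null": [1, 2, 3, 4],
}
print(null_columns(data, 1.0))  # ['all_null']
print(null_columns(data, 0.5))  # ['all_null', 'half_null']
print(null_columns(data, 0.0))  # ['all_null', 'half_null']
```

The special case for 0.0 matters: without it, a column with no nulls would satisfy `0 >= 0` and be dropped.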

Required fields#

  • name: A human-readable name.

  • modifies_features: A boolean that specifies whether this component modifies (subsets or transforms) the features variable during transform.

  • modifies_target: A boolean that specifies whether this component modifies (subsets or transforms) the target variable during transform.

Required methods#

Likewise, there are select methods you need to override as Transformer is an abstract base class:

  • __init__(): The __init__() method of your transformer will need to call super().__init__() and pass three parameters in: a parameters dictionary holding the parameters to the component, the component_obj, and the random_seed value. You can see that component_obj is set to None above and we will discuss component_obj in depth later on.

  • fit(): The fit() method is responsible for fitting your component on training data. It should return the component object.

  • transform(): After fitting a component, the transform() method will take in new data and transform accordingly. It should return a pandas dataframe with woodwork initialized. Note: a component must call fit() before transform().

You can also call or override fit_transform() that combines fit() and transform() into one method.

Custom Estimators#

Your estimator must inherit from the correct base class, which is Estimator for components that predict new target values. We will use EvalML’s BaselineRegressor as an example.

[7]:
import numpy as np
import pandas as pd

from evalml.model_family import ModelFamily
from evalml.pipelines.components.estimators import Estimator
from evalml.problem_types import ProblemTypes
from evalml.utils import infer_feature_types


class BaselineRegressor(Estimator):
    """Regressor that predicts using the specified strategy.

    This is useful as a simple baseline regressor to compare with other regressors.
    """

    name = "Baseline Regressor"
    hyperparameter_ranges = {}
    model_family = ModelFamily.BASELINE
    supported_problem_types = [
        ProblemTypes.REGRESSION,
        ProblemTypes.TIME_SERIES_REGRESSION,
    ]

    def __init__(self, strategy="mean", random_seed=0, **kwargs):
        """Baseline regressor that uses a simple strategy to make predictions.

        Args:
            strategy (str): Method used to predict. Valid options are "mean", "median". Defaults to "mean".
            random_seed (int): Seed for the random number generator. Defaults to 0.

        """
        if strategy not in ["mean", "median"]:
            raise ValueError(
                "'strategy' parameter must equal either 'mean' or 'median'"
            )
        parameters = {"strategy": strategy}
        parameters.update(kwargs)

        self._prediction_value = None
        self._num_features = None
        super().__init__(
            parameters=parameters, component_obj=None, random_seed=random_seed
        )

    def fit(self, X, y=None):
        if y is None:
            raise ValueError("Cannot fit Baseline regressor if y is None")
        X = infer_feature_types(X)
        y = infer_feature_types(y)

        if self.parameters["strategy"] == "mean":
            self._prediction_value = y.mean()
        elif self.parameters["strategy"] == "median":
            self._prediction_value = y.median()
        self._num_features = X.shape[1]
        return self

    def predict(self, X):
        X = infer_feature_types(X)
        predictions = pd.Series([self._prediction_value] * len(X))
        return infer_feature_types(predictions)

    @property
    def feature_importance(self):
        """Returns importance associated with each feature. Since baseline regressors do not use input features to calculate predictions, returns an array of zeroes.

        Returns:
            np.ndarray (float): An array of zeroes

        """
        return np.zeros(self._num_features)

Required fields#

  • name: A human-readable name.

  • model_family: The EvalML model_family that this component belongs to.

  • supported_problem_types: A list of EvalML problem_types that this component supports.

  • modifies_features: A boolean that specifies whether the return value from predict or predict_proba should be used as features.

  • modifies_target: A boolean that specifies whether the return value from predict or predict_proba should be used as the target variable.

Model families and problem types include:

[8]:
from evalml.model_family import ModelFamily
from evalml.problem_types import ProblemTypes

print("Model Families:\n", [m.value for m in ModelFamily])
print("Problem Types:\n", [p.value for p in ProblemTypes])
Model Families:
 ['k_neighbors', 'random_forest', 'svm', 'xgboost', 'lightgbm', 'linear_model', 'catboost', 'extra_trees', 'ensemble', 'decision_tree', 'exponential_smoothing', 'arima', 'varmax', 'baseline', 'prophet', 'vowpal_wabbit', 'none']
Problem Types:
 ['binary', 'multiclass', 'regression', 'time series regression', 'time series binary', 'time series multiclass', 'multiseries time series regression']

Required methods#

  • __init__() - the __init__() method of your estimator will need to call super().__init__() and pass three parameters in: a parameters dictionary holding the parameters to the component, the component_obj, and the random_seed value.

  • fit() - the fit() method is responsible for fitting your component on training data.

  • predict() - after fitting a component, the predict() method will take in new data and predict new target values. Note: a component must call fit() before predict().

  • feature_importance - feature_importance is a Python property that returns a list of importances associated with each feature.

If your estimator handles classification problems, it also requires an additional method:

  • predict_proba() - this method predicts probability estimates for classification labels

Components Wrapping Third-Party Objects#

The component_obj parameter is used for wrapping third-party objects and using them in component implementation. If you’re using a component_obj you will need to define __init__() and pass in the relevant object that has also implemented the required methods mentioned above. However, if the component_obj does not follow EvalML component conventions, you may need to override methods as needed. Below is an example of EvalML’s LinearRegressor.

[9]:
from sklearn.linear_model import LinearRegression as SKLinearRegression

from evalml.model_family import ModelFamily
from evalml.pipelines.components.estimators import Estimator
from evalml.problem_types import ProblemTypes


class LinearRegressor(Estimator):
    """Linear Regressor."""

    name = "Linear Regressor"
    model_family = ModelFamily.LINEAR_MODEL
    supported_problem_types = [ProblemTypes.REGRESSION]

    def __init__(
        self, fit_intercept=True, normalize=False, n_jobs=-1, random_seed=0, **kwargs
    ):
        parameters = {
            "fit_intercept": fit_intercept,
            "normalize": normalize,
            "n_jobs": n_jobs,
        }
        parameters.update(kwargs)
        linear_regressor = SKLinearRegression(**parameters)
        super().__init__(
            parameters=parameters,
            component_obj=linear_regressor,
            random_seed=random_seed,
        )

    @property
    def feature_importance(self):
        return self._component_obj.coef_
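
The wrapping pattern can be sketched without any third-party dependency. In the sketch below, ThirdPartyMeanModel stands in for an external model, and WrappingEstimatorSketch is a hypothetical base class that forwards fit and predict to the wrapped object. Whether EvalML's Estimator delegates in exactly this way is an assumption made for the illustration.

```python
class ThirdPartyMeanModel:
    """Stand-in for a third-party model exposing fit/predict (hypothetical)."""

    def fit(self, X, y):
        self.mean_ = sum(y) / len(y)
        return self

    def predict(self, X):
        return [self.mean_ for _ in X]


class WrappingEstimatorSketch:
    """Hypothetical base class that forwards calls to a wrapped object."""

    def __init__(self, parameters, component_obj, random_seed=0):
        self.parameters = parameters
        self._component_obj = component_obj
        self.random_seed = random_seed

    def fit(self, X, y):
        # Default behaviour: delegate to the wrapped object.
        self._component_obj.fit(X, y)
        return self

    def predict(self, X):
        return self._component_obj.predict(X)


class MeanRegressorSketch(WrappingEstimatorSketch):
    name = "Mean Regressor Sketch"

    def __init__(self, random_seed=0, **kwargs):
        # The subclass only builds the wrapped object; fit/predict are inherited.
        super().__init__(
            parameters=dict(kwargs),
            component_obj=ThirdPartyMeanModel(),
            random_seed=random_seed,
        )


reg = MeanRegressorSketch().fit([[0], [1], [2]], [1.0, 2.0, 3.0])
preds = reg.predict([[5], [6]])
print(preds)
```

With this design, a subclass only needs to override a method when the wrapped object's interface differs from what the base class expects.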

Hyperparameter Ranges for AutoML#

hyperparameter_ranges is a dictionary mapping the parameter name (str) to an allowed range (SkOpt Space) for that parameter. Both lists and skopt.space.Categorical values are accepted for categorical spaces.

AutoML will perform a search over the allowed ranges for each parameter to select models which produce optimal performance within those ranges. AutoML gets the allowed ranges for each component from the component’s hyperparameter_ranges class attribute. Any component parameter you add an entry for in hyperparameter_ranges will be included in the AutoML search. If parameters are omitted, AutoML will use the default value in all pipelines.
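
The effect of listing or omitting a parameter can be illustrated with a simplified, library-free sketch. The parameter names, the propose_parameters helper, and the use of uniform random sampling are all assumptions made for this example. EvalML's AutoML uses its own tuners and SkOpt spaces rather than this procedure, and plain lists stand in for categorical spaces here.

```python
import random

# Hypothetical defaults and tunable ranges for an imaginary component.
default_parameters = {"impute_strategy": "mean", "fill_value": None}
hyperparameter_ranges = {"impute_strategy": ["mean", "median", "most_frequent"]}


def propose_parameters(defaults, ranges, rng):
    """Pick a value for each listed parameter; keep defaults for the rest."""
    proposal = dict(defaults)
    for name, choices in ranges.items():
        proposal[name] = rng.choice(choices)
    return proposal


rng = random.Random(0)
proposals = [
    propose_parameters(default_parameters, hyperparameter_ranges, rng)
    for _ in range(5)
]
for proposal in proposals:
    print(proposal)
```

Only impute_strategy varies between proposals; fill_value has no entry in hyperparameter_ranges, so every proposal keeps its default.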

Generate Component Code#

Once you have a component defined in EvalML, you can generate Python code, as a string, that recreates the component. This code can then be saved and run elsewhere with EvalML. generate_component_code requires a component instance as its input. It works for custom components as well, although it won’t return the code required to define the custom component class.

[10]:
from evalml.pipelines.components import LogisticRegressionClassifier
from evalml.pipelines.components.utils import generate_component_code

lr = LogisticRegressionClassifier(C=5)
code = generate_component_code(lr)
print(code)
from evalml.pipelines.components.estimators.classifiers.logistic_regression_classifier import LogisticRegressionClassifier

logisticRegressionClassifier = LogisticRegressionClassifier(**{'penalty': 'l2', 'C': 5, 'n_jobs': -1, 'multi_class': 'auto', 'solver': 'lbfgs'})
[11]:
# this string can then be copied and pasted into a separate window and executed as Python code
exec(code)
[12]:
# We can also do this for custom components
from evalml.pipelines.components.utils import generate_component_code

myDropNull = DropNullColumns()
print(generate_component_code(myDropNull))
dropNullColumnsTransformer = DropNullColumns(**{'pct_null_threshold': 1.0})
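
The idea behind this kind of code generation can be sketched in a few lines of plain Python. generate_code_sketch and DropNullSketch are hypothetical names for this illustration. This is not EvalML's implementation of generate_component_code, only a minimal version of the same idea: build a source string from the class name and the parameters dictionary.

```python
class DropNullSketch:
    """Hypothetical component with a parameters dict (not an EvalML class)."""

    def __init__(self, pct_null_threshold=1.0):
        self.parameters = {"pct_null_threshold": pct_null_threshold}


def generate_code_sketch(component, variable_name):
    """Build a source string that re-creates `component` from its parameters."""
    class_name = type(component).__name__
    return f"{variable_name} = {class_name}(**{component.parameters!r})"


original = DropNullSketch(pct_null_threshold=0.5)
code = generate_code_sketch(original, "restored")
print(code)

# The generated string does not contain the class definition, so it must be
# executed in a namespace where the class is already available.
namespace = {"DropNullSketch": DropNullSketch}
exec(code, namespace)
restored = namespace["restored"]
print(restored.parameters)
```

This also shows why generated code for a custom component is incomplete on its own: the class definition has to be supplied separately wherever the string is executed.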

Expectations for Custom Classification Components#

EvalML expects the following from custom classification component implementations:

  • Classification targets will range from 0 to n-1 and are integers.

  • For classification estimators, the order of predict_proba’s columns must match the order of the target, and the column names must be integers ranging from 0 to n-1
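
These expectations can be illustrated with a library-free sketch. encode_targets and PriorClassifierSketch are hypothetical names, and the dictionaries returned by predict_proba stand in for the integer-named columns of a DataFrame. The encoding step is shown only to make the 0 to n-1 convention visible and is an assumption about how labels might be prepared before they reach a component.

```python
def encode_targets(y):
    """Map arbitrary labels to integers 0..n-1 (sorted order) and return the classes."""
    classes = sorted(set(y))
    label_to_int = {label: i for i, label in enumerate(classes)}
    return [label_to_int[label] for label in y], classes


class PriorClassifierSketch:
    """Hypothetical classifier that predicts the class frequencies seen during fit."""

    def fit(self, X, y):
        # Targets are assumed to be integers in the range 0..n-1.
        n_classes = max(y) + 1
        counts = [0] * n_classes
        for label in y:
            counts[label] += 1
        self._priors = [count / len(y) for count in counts]
        return self

    def predict_proba(self, X):
        # One mapping per row; keys are the integer class labels 0..n-1, in order.
        return [dict(enumerate(self._priors)) for _ in X]

    def predict(self, X):
        best = max(range(len(self._priors)), key=self._priors.__getitem__)
        return [best for _ in X]


y_raw = ["cat", "dog", "cat", "bird"]
y_encoded, classes = encode_targets(y_raw)
clf = PriorClassifierSketch().fit([[0]] * 4, y_encoded)
proba = clf.predict_proba([[0]])
print(y_encoded, classes)
print(proba)
```

Because the probability keys follow the same 0 to n-1 order as the encoded targets, a predicted column index can be mapped back to its original label through the classes list.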