Components are the lowest level of building blocks in EvalML. Each component represents a fundamental operation to be applied to data.
All components accept parameters as keyword arguments to their __init__ methods. These parameters can be used to configure behavior.
Each component class definition must include a human-readable name for the component. Additionally, each component class may expose parameters for AutoML search by defining a hyperparameter_ranges attribute containing the parameters in question.
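For example, you can inspect these attributes on any component. Below is a quick sketch using the SimpleImputer introduced later in this section (the exact contents of parameters depend on the component's defaults):

from evalml.pipelines.components import SimpleImputer

# Components expose their configured parameters and searchable ranges as attributes
imp = SimpleImputer(impute_strategy="median")
print(imp.name)                             # "Simple Imputer"
print(imp.parameters)                       # includes {'impute_strategy': 'median', ...}
print(SimpleImputer.hyperparameter_ranges)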
EvalML splits components into two categories: transformers and estimators.
Transformers subclass the Transformer class, and define a fit method to learn information from training data and a transform method to apply a learned transformation to new data.
For example, an imputer is configured with the desired impute strategy to follow, for instance the mean value. The imputer's fit method would learn the mean from the training data, and the transform method would fill the learned mean value in for any missing values in new data.
All transformers can execute fit and transform separately or in one step by calling fit_transform. Defining a custom fit_transform method can facilitate useful performance optimizations in some cases.
[1]:
import numpy as np
import pandas as pd

from evalml.pipelines.components import SimpleImputer

X = pd.DataFrame([[1, 2, 3], [1, np.nan, 3]])
display(X)
[2]:
imp = SimpleImputer(impute_strategy="mean")
X = imp.fit_transform(X)
display(X)
Below is a list of all transformers included with EvalML:
[3]:
from evalml.pipelines.components.utils import all_components, Estimator, Transformer

for component in all_components():
    if issubclass(component, Transformer):
        print(f"Transformer: {component.name}")
Transformer: Delayed Feature Transformer
Transformer: Text Featurization Component
Transformer: LSA Transformer
Transformer: Drop Null Columns Transformer
Transformer: DateTime Featurization Component
Transformer: PCA Transformer
Transformer: Select Columns Transformer
Transformer: Drop Columns Transformer
Transformer: Standard Scaler
Transformer: Imputer
Transformer: Per Column Imputer
Transformer: Simple Imputer
Transformer: RF Regressor Select From Model
Transformer: RF Classifier Select From Model
Transformer: Target Encoder
Transformer: One Hot Encoder
Each estimator wraps an ML algorithm. Estimators subclass the Estimator class, and define a fit method to learn information from training data and a predict method for generating predictions from new data. Classification estimators should also define a predict_proba method for generating predicted probabilities.
Estimator classes each define a model_family attribute indicating what type of model is used.
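For instance, here is a quick sketch inspecting the model_family of the LogisticRegressionClassifier used below (the printed value reflects EvalML's ModelFamily enum):

from evalml.pipelines.components import LogisticRegressionClassifier

# Each estimator class declares the family of models it belongs to
print(LogisticRegressionClassifier.model_family)  # ModelFamily.LINEAR_MODEL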
Here’s an example of using the LogisticRegressionClassifier estimator to fit and predict on a simple dataset:
[4]:
from evalml.pipelines.components import LogisticRegressionClassifier

clf = LogisticRegressionClassifier()

# Reuse the imputed feature matrix X from the cells above
y = [1, 0]

clf.fit(X, y)
clf.predict(X)
0    0
1    0
dtype: int64
Below is a list of all estimators included with EvalML:
[5]:
from evalml.pipelines.components.utils import all_components, Estimator, Transformer

for component in all_components():
    if issubclass(component, Estimator):
        print(f"Estimator: {component.name}")
Estimator: Stacked Ensemble Regressor
Estimator: Stacked Ensemble Classifier
Estimator: Decision Tree Regressor
Estimator: Baseline Regressor
Estimator: Extra Trees Regressor
Estimator: XGBoost Regressor
Estimator: CatBoost Regressor
Estimator: Random Forest Regressor
Estimator: Linear Regressor
Estimator: Elastic Net Regressor
Estimator: Decision Tree Classifier
Estimator: LightGBM Classifier
Estimator: Baseline Classifier
Estimator: Extra Trees Classifier
Estimator: Elastic Net Classifier
Estimator: CatBoost Classifier
Estimator: XGBoost Classifier
Estimator: Random Forest Classifier
Estimator: Logistic Regression Classifier
EvalML allows you to easily create your own custom components by following the steps below.
Your transformer must inherit from the correct subclass: in this case, Transformer, for components that transform data. Next, we will use EvalML's DropNullColumns as an example.
[6]:
import pandas as pd

from evalml.pipelines.components import Transformer

class DropNullColumns(Transformer):
    """Transformer to drop features whose percentage of NaN values exceeds a specified threshold"""
    name = "Drop Null Columns Transformer"
    hyperparameter_ranges = {}

    def __init__(self, pct_null_threshold=1.0, random_state=0, **kwargs):
        """Initializes a transformer to drop features whose percentage of NaN values exceeds a specified threshold.

        Arguments:
            pct_null_threshold (float): The percentage of NaN values in an input feature to drop.
                Must be a value between [0, 1] inclusive. If equal to 0.0, will drop columns with any null values.
                If equal to 1.0, will drop columns with all null values. Defaults to 1.0.
        """
        if pct_null_threshold < 0 or pct_null_threshold > 1:
            raise ValueError("pct_null_threshold must be a float between 0 and 1, inclusive.")
        parameters = {"pct_null_threshold": pct_null_threshold}
        parameters.update(kwargs)

        self._cols_to_drop = None
        super().__init__(parameters=parameters,
                         component_obj=None,
                         random_state=random_state)

    def fit(self, X, y=None):
        pct_null_threshold = self.parameters["pct_null_threshold"]
        if not isinstance(X, pd.DataFrame):
            X = pd.DataFrame(X)
        percent_null = X.isnull().mean()
        if pct_null_threshold == 0.0:
            null_cols = percent_null[percent_null > 0]
        else:
            null_cols = percent_null[percent_null >= pct_null_threshold]
        self._cols_to_drop = list(null_cols.index)
        return self

    def transform(self, X, y=None):
        """Transforms data X by dropping columns that exceed the threshold of null values.

        Arguments:
            X (pd.DataFrame): Data to transform
            y (pd.Series, optional): Targets

        Returns:
            pd.DataFrame: Transformed X
        """
        if not isinstance(X, pd.DataFrame):
            X = pd.DataFrame(X)
        return X.drop(columns=self._cols_to_drop)
For a transformer, you must provide a name class attribute containing a human-readable name for the component.
Likewise, there are several methods you need to override, as Transformer is an abstract base class:
__init__() - the __init__() method of your transformer will need to call super().__init__() and pass in three parameters: a parameters dictionary holding the parameters to the component, the component_obj, and the random_state value. You can see that component_obj is set to None above; we will discuss component_obj in depth later on.
fit() - the fit() method is responsible for fitting your component on training data.
transform() - after fitting a component, the transform() method will take in new data and transform it accordingly. Note: fit() must be called before transform().
You can also call or override fit_transform(), which combines fit() and transform() into one method.
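As an illustration of that last point, here is a minimal sketch (using a hypothetical, stateless DoubleColumns transformer, not part of EvalML) where overriding fit_transform() skips a redundant fitting pass:

import pandas as pd

from evalml.pipelines.components import Transformer

class DoubleColumns(Transformer):
    """Hypothetical transformer that doubles every value; it learns nothing from training data."""
    name = "Double Columns Transformer"
    hyperparameter_ranges = {}

    def __init__(self, random_state=0, **kwargs):
        parameters = {}
        parameters.update(kwargs)
        super().__init__(parameters=parameters,
                         component_obj=None,
                         random_state=random_state)

    def fit(self, X, y=None):
        # Stateless: there is nothing to learn
        return self

    def transform(self, X, y=None):
        if not isinstance(X, pd.DataFrame):
            X = pd.DataFrame(X)
        return X * 2

    def fit_transform(self, X, y=None):
        # Since fit() is a no-op, go straight to transform() instead of fitting first
        return self.transform(X, y)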
Your estimator must inherit from the correct subclass: in this case, Estimator, for components that predict new target values. Next, we will use EvalML's BaselineRegressor as an example.
[7]:
import numpy as np
import pandas as pd

from evalml.model_family import ModelFamily
from evalml.pipelines.components.estimators import Estimator
from evalml.problem_types import ProblemTypes

class BaselineRegressor(Estimator):
    """Regressor that predicts using the specified strategy.

    This is useful as a simple baseline regressor to compare with other regressors.
    """
    name = "Baseline Regressor"
    hyperparameter_ranges = {}
    model_family = ModelFamily.BASELINE
    supported_problem_types = [ProblemTypes.REGRESSION]

    def __init__(self, strategy="mean", random_state=0, **kwargs):
        """Baseline regressor that uses a simple strategy to make predictions.

        Arguments:
            strategy (str): method used to predict. Valid options are "mean", "median". Defaults to "mean".
            random_state (int, np.random.RandomState): seed for the random number generator
        """
        if strategy not in ["mean", "median"]:
            raise ValueError("'strategy' parameter must equal either 'mean' or 'median'")
        parameters = {"strategy": strategy}
        parameters.update(kwargs)

        self._prediction_value = None
        self._num_features = None
        super().__init__(parameters=parameters,
                         component_obj=None,
                         random_state=random_state)

    def fit(self, X, y=None):
        if y is None:
            raise ValueError("Cannot fit Baseline regressor if y is None")
        if not isinstance(y, pd.Series):
            y = pd.Series(y)

        if self.parameters["strategy"] == "mean":
            self._prediction_value = y.mean()
        elif self.parameters["strategy"] == "median":
            self._prediction_value = y.median()
        self._num_features = X.shape[1]
        return self

    def predict(self, X):
        return pd.Series([self._prediction_value] * len(X))

    @property
    def feature_importance(self):
        """Returns importance associated with each feature. Since baseline regressors do not use input features to calculate predictions, returns an array of zeroes.

        Returns:
            np.ndarray (float): an array of zeroes
        """
        return np.zeros(self._num_features)
For an estimator, you must provide the following class attributes:
name - a human-readable name for the component
model_family - the EvalML model_family that this component belongs to
supported_problem_types - a list of the EvalML problem_types that this component supports
Model families and problem types include:
[8]:
from evalml.model_family import ModelFamily
from evalml.problem_types import ProblemTypes

print("Model Families:\n", [m.value for m in ModelFamily])
print("Problem Types:\n", [p.value for p in ProblemTypes])
Model Families:
 ['random_forest', 'xgboost', 'lightgbm', 'linear_model', 'catboost', 'extra_trees', 'ensemble', 'decision_tree', 'baseline', 'none']
Problem Types:
 ['binary', 'multiclass', 'regression', 'time series regression']
Likewise, there are several methods you need to override:
__init__() - the __init__() method of your estimator will need to call super().__init__() and pass in three parameters: a parameters dictionary holding the parameters to the component, the component_obj, and the random_state value.
fit() - the fit() method is responsible for fitting your component on training data.
predict() - after fitting a component, the predict() method will take in new data and predict new target values. Note: fit() must be called before predict().
feature_importance - feature_importance is a Python property that returns a list of importances associated with each feature.
If your estimator handles classification problems, it also requires an additional method:
predict_proba() - this method predicts probability estimates for classification labels.
The component_obj parameter is used for wrapping third-party objects and using them in a component implementation. If you're using a component_obj, you will need to define __init__() and pass in the relevant object, which must itself implement the required methods mentioned above. However, if the component_obj does not follow EvalML component conventions, you may need to override methods as needed. Below is an example of EvalML's LinearRegressor.
[9]:
from sklearn.linear_model import LinearRegression as SKLinearRegression

from evalml.model_family import ModelFamily
from evalml.pipelines.components.estimators import Estimator
from evalml.problem_types import ProblemTypes

class LinearRegressor(Estimator):
    """Linear Regressor."""
    name = "Linear Regressor"
    model_family = ModelFamily.LINEAR_MODEL
    supported_problem_types = [ProblemTypes.REGRESSION]

    def __init__(self, fit_intercept=True, normalize=False, n_jobs=-1, random_state=0, **kwargs):
        parameters = {
            'fit_intercept': fit_intercept,
            'normalize': normalize,
            'n_jobs': n_jobs
        }
        parameters.update(kwargs)
        linear_regressor = SKLinearRegression(**parameters)
        super().__init__(parameters=parameters,
                         component_obj=linear_regressor,
                         random_state=random_state)

    @property
    def feature_importance(self):
        return self._component_obj.coef_
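Conversely, when the wrapped object does not follow EvalML's conventions, you can bridge the two APIs by overriding methods yourself. Here is a minimal sketch with a hypothetical ThirdPartyScaler (not a real library) whose methods are named learn and apply rather than fit and transform:

import pandas as pd

from evalml.pipelines.components import Transformer

class ThirdPartyScaler:
    """Hypothetical third-party object whose API uses learn/apply instead of fit/transform."""
    def learn(self, X):
        self.means_ = X.mean()

    def apply(self, X):
        return X - self.means_

class MeanCenterer(Transformer):
    """Sketch of wrapping a non-conforming object by overriding fit() and transform()."""
    name = "Mean Centerer"
    hyperparameter_ranges = {}

    def __init__(self, random_state=0, **kwargs):
        parameters = {}
        parameters.update(kwargs)
        super().__init__(parameters=parameters,
                         component_obj=ThirdPartyScaler(),
                         random_state=random_state)

    def fit(self, X, y=None):
        if not isinstance(X, pd.DataFrame):
            X = pd.DataFrame(X)
        # Bridge EvalML's fit() convention to the wrapped object's learn()
        self._component_obj.learn(X)
        return self

    def transform(self, X, y=None):
        if not isinstance(X, pd.DataFrame):
            X = pd.DataFrame(X)
        return self._component_obj.apply(X)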
hyperparameter_ranges is a dictionary mapping the parameter name (str) to an allowed range (SkOpt Space) for that parameter. Both lists and skopt.space.Categorical values are accepted for categorical spaces.
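For example, here is a sketch of ranges for a few parameters (assuming scikit-optimize is installed; the parameter names are illustrative only):

from skopt.space import Integer, Real

hyperparameter_ranges = {
    "max_depth": Integer(1, 10),                             # integer-valued range
    "learning_rate": Real(0.0001, 0.1),                      # continuous range
    "impute_strategy": ["mean", "median", "most_frequent"],  # categorical, as a plain list
}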
AutoML will perform a search over the allowed ranges for each parameter to select models which produce optimal performance within those ranges. AutoML gets the allowed ranges for each component from the component’s hyperparameter_ranges class attribute. Any component parameter you add an entry for in hyperparameter_ranges will be included in the AutoML search. If parameters are omitted, AutoML will use the default value in all pipelines.
[10]:
from sklearn.linear_model import LinearRegression as SKLinearRegression

from evalml.model_family import ModelFamily
from evalml.pipelines.components.estimators import Estimator
from evalml.problem_types import ProblemTypes

class LinearRegressor(Estimator):
    """Linear Regressor."""
    name = "Linear Regressor"
    hyperparameter_ranges = {
        'fit_intercept': [True, False],
        'normalize': [True, False]
    }
    model_family = ModelFamily.LINEAR_MODEL
    supported_problem_types = [ProblemTypes.REGRESSION]

    def __init__(self, fit_intercept=True, normalize=False, n_jobs=-1, random_state=0, **kwargs):
        parameters = {
            'fit_intercept': fit_intercept,
            'normalize': normalize,
            'n_jobs': n_jobs
        }
        parameters.update(kwargs)
        linear_regressor = SKLinearRegression(**parameters)
        super().__init__(parameters=parameters,
                         component_obj=linear_regressor,
                         random_state=random_state)

    @property
    def feature_importance(self):
        return self._component_obj.coef_
Once you have a component defined in EvalML, you can generate a string of Python code that recreates this component, which can then be saved and run elsewhere with EvalML. generate_component_code requires a component instance as the input. This method works for custom components as well, although it won't return the code required to define the custom component.
[11]:
from evalml.pipelines.components import LogisticRegressionClassifier
from evalml.pipelines.components.utils import generate_component_code

lr = LogisticRegressionClassifier(C=5)
code = generate_component_code(lr)
print(code)
from evalml.pipelines.components.estimators.classifiers.logistic_regression import LogisticRegressionClassifier

logisticRegressionClassifier = LogisticRegressionClassifier(**{'penalty': 'l2', 'C': 5, 'n_jobs': -1, 'multi_class': 'auto', 'solver': 'lbfgs'})
[12]:
# This string can then be copied and pasted into a separate window and executed as Python code
exec(code)
logisticRegressionClassifier
LogisticRegressionClassifier(penalty='l2', C=5, n_jobs=-1, multi_class='auto', solver='lbfgs')
[13]:
# Custom component
import pandas as pd

from evalml.pipelines.components import Transformer
from evalml.pipelines.components.utils import generate_component_code

class MyDropNullColumns(Transformer):
    """Transformer to drop features whose percentage of NaN values exceeds a specified threshold"""
    name = "My Drop Null Columns Transformer"
    hyperparameter_ranges = {}

    def __init__(self, pct_null_threshold=1.0, random_state=0, **kwargs):
        """Initializes a transformer to drop features whose percentage of NaN values exceeds a specified threshold.

        Arguments:
            pct_null_threshold (float): The percentage of NaN values in an input feature to drop.
                Must be a value between [0, 1] inclusive. If equal to 0.0, will drop columns with any null values.
                If equal to 1.0, will drop columns with all null values. Defaults to 1.0.
        """
        if pct_null_threshold < 0 or pct_null_threshold > 1:
            raise ValueError("pct_null_threshold must be a float between 0 and 1, inclusive.")
        parameters = {"pct_null_threshold": pct_null_threshold}
        parameters.update(kwargs)

        self._cols_to_drop = None
        super().__init__(parameters=parameters,
                         component_obj=None,
                         random_state=random_state)

    def fit(self, X, y=None):
        pct_null_threshold = self.parameters["pct_null_threshold"]
        if not isinstance(X, pd.DataFrame):
            X = pd.DataFrame(X)
        percent_null = X.isnull().mean()
        if pct_null_threshold == 0.0:
            null_cols = percent_null[percent_null > 0]
        else:
            null_cols = percent_null[percent_null >= pct_null_threshold]
        self._cols_to_drop = list(null_cols.index)
        return self

    def transform(self, X, y=None):
        """Transforms data X by dropping columns that exceed the threshold of null values.

        Arguments:
            X (pd.DataFrame): Data to transform
            y (pd.Series, optional): Targets

        Returns:
            pd.DataFrame: Transformed X
        """
        if not isinstance(X, pd.DataFrame):
            X = pd.DataFrame(X)
        return X.drop(columns=self._cols_to_drop)

myDropNull = MyDropNullColumns()
print(generate_component_code(myDropNull))
myDropNullColumnsTransformer = MyDropNullColumns(**{'pct_null_threshold': 1.0})
EvalML expects the following from custom classification component implementations:
Classification targets are integers ranging from 0 to n-1.
For classification estimators, the order of predict_proba's columns must match the order of the target, and the column names must be integers ranging from 0 to n-1.
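To make these conventions concrete, here is a hedged sketch of a predict_proba helper that follows them, using a hypothetical two-class majority baseline (majority_class and num_classes stand in for values a real estimator would learn in fit()):

import numpy as np
import pandas as pd

def majority_predict_proba(X, majority_class, num_classes):
    """Hypothetical sketch: assign all probability mass to the majority class learned during fit."""
    proba = np.zeros((len(X), num_classes))
    proba[:, majority_class] = 1.0
    # Column names are the integers 0..n-1, in the same order as the target classes
    return pd.DataFrame(proba, columns=range(num_classes))

print(majority_predict_proba(pd.DataFrame({"a": [1, 2, 3]}), majority_class=1, num_classes=2))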