data_splitters
Data splitter classes.
Package Contents
Classes Summary
KFold | Wrapper class for sklearn's KFold splitter.
NoSplit | Does not split the training data into training and validation sets.
StratifiedKFold | Wrapper class for sklearn's Stratified KFold splitter.
TimeSeriesSplit | Rolling Origin Cross Validation for time series problems.
TrainingValidationSplit | Split the training data into training and validation sets.
Contents
- class evalml.preprocessing.data_splitters.KFold(n_splits=5, *, shuffle=False, random_state=None)
Wrapper class for sklearn’s KFold splitter.
Methods
get_metadata_routing | Get metadata routing of this object.
get_n_splits | Returns the number of splitting iterations in the cross-validator.
is_cv | Returns whether or not the data splitter is a cross-validation data splitter.
split | Generate indices to split data into training and test set.
- get_metadata_routing(self)
Get metadata routing of this object.
Please check User Guide on how the routing mechanism works.
- Returns
routing – A MetadataRequest encapsulating routing information.
- Return type
MetadataRequest
- get_n_splits(self, X=None, y=None, groups=None)
Returns the number of splitting iterations in the cross-validator.
- Parameters
X (object) – Always ignored, exists for compatibility.
y (object) – Always ignored, exists for compatibility.
groups (object) – Always ignored, exists for compatibility.
- Returns
n_splits – Returns the number of splitting iterations in the cross-validator.
- Return type
int
- property is_cv(self)
Returns whether or not the data splitter is a cross-validation data splitter.
- Returns
If the splitter is a cross-validation data splitter
- Return type
bool
- split(self, X, y=None, groups=None)
Generate indices to split data into training and test set.
- Parameters
X (array-like of shape (n_samples, n_features)) – Training data, where n_samples is the number of samples and n_features is the number of features.
y (array-like of shape (n_samples,), default=None) – The target variable for supervised learning problems.
groups (array-like of shape (n_samples,), default=None) – Group labels for the samples used while splitting the dataset into train/test set.
- Yields
train (ndarray) – The training set indices for that split.
test (ndarray) – The testing set indices for that split.
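The index generation described above can be illustrated without any dependencies. The following is a minimal sketch of unshuffled k-fold splitting, assuming the usual convention that the first `n_samples % n_splits` folds receive one extra sample. The function name `kfold_indices` is hypothetical and this is not the library's implementation.

```python
def kfold_indices(n_samples, n_splits=5):
    """Yield (train, test) index lists for unshuffled k-fold splitting.

    Illustrative sketch only; `kfold_indices` is a hypothetical helper.
    """
    if not 2 <= n_splits <= n_samples:
        raise ValueError("n_splits must be between 2 and n_samples")
    base, extra = divmod(n_samples, n_splits)
    start = 0
    for fold in range(n_splits):
        # The first `extra` folds receive one additional sample.
        size = base + (1 if fold < extra else 0)
        stop = start + size
        test = list(range(start, stop))
        # Everything outside the current test block is training data.
        train = list(range(0, start)) + list(range(stop, n_samples))
        yield train, test
        start = stop


folds = list(kfold_indices(10, n_splits=5))
first_train, first_test = folds[0]
```

Each sample lands in exactly one test fold, so the test sets together cover the whole dataset.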
- class evalml.preprocessing.data_splitters.NoSplit(random_seed=0)
Does not split the training data into training and validation sets.
All data is passed as the training set; the test data is simply an array of None. This splitter is intended for future unsupervised learning and should not be used in any of the currently supported pipelines.
- Parameters
random_seed (int) – The seed to use for random sampling. Defaults to 0. Not used.
Methods
get_metadata_routing | Get metadata routing of this object.
get_n_splits | Return the number of splits of this object.
is_cv | Returns whether or not the data splitter is a cross-validation data splitter.
split | Divide the data into training and testing sets, where the testing set is empty.
- get_metadata_routing(self)
Get metadata routing of this object.
Please check User Guide on how the routing mechanism works.
- Returns
routing – A MetadataRequest encapsulating routing information.
- Return type
MetadataRequest
- static get_n_splits()
Return the number of splits of this object.
- Returns
Always returns 0.
- Return type
int
- property is_cv(self)
Returns whether or not the data splitter is a cross-validation data splitter.
- Returns
If the splitter is a cross-validation data splitter
- Return type
bool
- class evalml.preprocessing.data_splitters.StratifiedKFold(n_splits=5, *, shuffle=False, random_state=None)
Wrapper class for sklearn’s Stratified KFold splitter.
Methods
get_metadata_routing | Get metadata routing of this object.
get_n_splits | Returns the number of splitting iterations in the cross-validator.
is_cv | Returns whether or not the data splitter is a cross-validation data splitter.
split | Generate indices to split data into training and test set.
- get_metadata_routing(self)
Get metadata routing of this object.
Please check User Guide on how the routing mechanism works.
- Returns
routing – A MetadataRequest encapsulating routing information.
- Return type
MetadataRequest
- get_n_splits(self, X=None, y=None, groups=None)
Returns the number of splitting iterations in the cross-validator.
- Parameters
X (object) – Always ignored, exists for compatibility.
y (object) – Always ignored, exists for compatibility.
groups (object) – Always ignored, exists for compatibility.
- Returns
n_splits – Returns the number of splitting iterations in the cross-validator.
- Return type
int
- property is_cv(self)
Returns whether or not the data splitter is a cross-validation data splitter.
- Returns
If the splitter is a cross-validation data splitter
- Return type
bool
- split(self, X, y, groups=None)
Generate indices to split data into training and test set.
- Parameters
X (array-like of shape (n_samples, n_features)) – Training data, where n_samples is the number of samples and n_features is the number of features. Note that providing y is sufficient to generate the splits and hence np.zeros(n_samples) may be used as a placeholder for X instead of actual training data.
y (array-like of shape (n_samples,)) – The target variable for supervised learning problems. Stratification is done based on the y labels.
groups (object) – Always ignored, exists for compatibility.
- Yields
train (ndarray) – The training set indices for that split.
test (ndarray) – The testing set indices for that split.
Notes
Randomized CV splitters may return different results for each call of split. You can make the results identical by setting random_state to an integer.
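To make the idea of stratification concrete, here is a minimal sketch that assigns samples to folds class by class in round-robin order, so each fold ends up with a similar class mix. It uses only the labels, which matches the note above that y alone is sufficient to generate the splits. The helper name `stratified_kfold_indices` is hypothetical, and the exact fold assignment is an assumption for illustration; it is not expected to reproduce the indices the wrapped sklearn splitter returns.

```python
from collections import Counter, defaultdict


def stratified_kfold_indices(y, n_splits=5):
    """Yield (train, test) index lists with per-class round-robin folds.

    Illustrative sketch only; not the library's allocation scheme.
    """
    # Group sample indices by class label, preserving order.
    by_class = defaultdict(list)
    for index, label in enumerate(y):
        by_class[label].append(index)

    # Deal each class's samples across the folds like cards.
    fold_of = [0] * len(y)
    for indices in by_class.values():
        for position, index in enumerate(indices):
            fold_of[index] = position % n_splits

    for fold in range(n_splits):
        test = [i for i, f in enumerate(fold_of) if f == fold]
        train = [i for i, f in enumerate(fold_of) if f != fold]
        yield train, test


# Eight samples of class 0 and four of class 1, split four ways.
y = [0] * 8 + [1] * 4
splits = list(stratified_kfold_indices(y, n_splits=4))
test_class_counts = [Counter(y[i] for i in test) for _, test in splits]
```

With these labels, every test fold holds two samples of class 0 and one of class 1, mirroring the 2:1 ratio of the full dataset.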
- class evalml.preprocessing.data_splitters.TimeSeriesSplit(max_delay=0, gap=0, forecast_horizon=None, time_index=None, n_series=None, n_splits=3)
Rolling Origin Cross Validation for time series problems.
The max_delay, gap, and forecast_horizon parameters are only used to validate that the requested split size is not too small given these parameters.
- Parameters
max_delay (int) – Max delay value for feature engineering. Time series pipelines create delayed features from existing features. This process will introduce NaNs into the first max_delay number of rows. The splitter uses the last max_delay number of rows from the previous split as the first max_delay number of rows of the current split to avoid “throwing out” more data than is necessary. Defaults to 0.
gap (int) – Number of time units separating the data used to generate features and the data to forecast on. Defaults to 0.
forecast_horizon (int, None) – Number of time units to forecast. Used for parameter validation. If an integer, will set the size of the cv splits. Defaults to None.
time_index (str) – Name of the column containing the datetime information used to order the data. Defaults to None.
n_splits (int) – Number of data splits to make. Defaults to 3.
Example
>>> import numpy as np
>>> import pandas as pd
>>> from evalml.preprocessing.data_splitters import TimeSeriesSplit
...
>>> X = pd.DataFrame([i for i in range(10)], columns=["First"])
>>> y = pd.Series([i for i in range(10)])
...
>>> ts_split = TimeSeriesSplit(n_splits=4)
>>> generator_ = ts_split.split(X, y)
...
>>> first_split = next(generator_)
>>> assert (first_split[0] == np.array([0, 1])).all()
>>> assert (first_split[1] == np.array([2, 3])).all()
...
>>> second_split = next(generator_)
>>> assert (second_split[0] == np.array([0, 1, 2, 3])).all()
>>> assert (second_split[1] == np.array([4, 5])).all()
...
>>> third_split = next(generator_)
>>> assert (third_split[0] == np.array([0, 1, 2, 3, 4, 5])).all()
>>> assert (third_split[1] == np.array([6, 7])).all()
...
>>> fourth_split = next(generator_)
>>> assert (fourth_split[0] == np.array([0, 1, 2, 3, 4, 5, 6, 7])).all()
>>> assert (fourth_split[1] == np.array([8, 9])).all()
Methods
get_metadata_routing | Get metadata routing of this object.
get_n_splits | Get the number of data splits.
is_cv | Returns whether or not the data splitter is a cross-validation data splitter.
split | Get the time series splits.
- get_metadata_routing(self)
Get metadata routing of this object.
Please check User Guide on how the routing mechanism works.
- Returns
routing – A MetadataRequest encapsulating routing information.
- Return type
MetadataRequest
- get_n_splits(self, X=None, y=None, groups=None)
Get the number of data splits.
- Parameters
X (pd.DataFrame, None) – Features to split.
y (pd.DataFrame, None) – Target variable to split. Defaults to None.
groups – Ignored but kept for compatibility with sklearn API. Defaults to None.
- Returns
Number of splits.
- property is_cv(self)
Returns whether or not the data splitter is a cross-validation data splitter.
- Returns
If the splitter is a cross-validation data splitter
- Return type
bool
- split(self, X, y=None, groups=None)
Get the time series splits.
X and y are assumed to be sorted in ascending time order. This method can handle passing in empty or None X and y data but note that X and y cannot be None or empty at the same time.
- Parameters
X (pd.DataFrame, None) – Features to split.
y (pd.DataFrame, None) – Target variable to split. Defaults to None.
groups – Ignored but kept for compatibility with sklearn API. Defaults to None.
- Yields
Iterator of (train, test) indices tuples.
- Raises
ValueError – If one of the proposed splits would be empty.
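The rolling-origin pattern can be sketched in a few lines of plain Python. The version below assumes an expanding training window and equally sized test windows of `n_samples // (n_splits + 1)` rows, which reproduces the indices in the class example for 10 rows and `n_splits=4`. The function name `rolling_origin_indices` is hypothetical, and the sketch deliberately ignores max_delay, gap, and forecast_horizon.

```python
def rolling_origin_indices(n_samples, n_splits=3):
    """Yield (train, test) index lists for rolling-origin cross validation.

    Illustrative sketch only; assumes rows are already in ascending time order.
    """
    test_size = n_samples // (n_splits + 1)
    if test_size == 0:
        # Analogous to refusing a split that would be empty.
        raise ValueError("Not enough rows for the requested number of splits")
    first_test_start = n_samples - n_splits * test_size
    for test_start in range(first_test_start, n_samples, test_size):
        # Train on everything before the origin, test on the next window.
        train = list(range(0, test_start))
        test = list(range(test_start, test_start + test_size))
        yield train, test


splits = list(rolling_origin_indices(10, n_splits=4))
```

Because the training window only ever grows forward in time, no test row precedes a row used to train on it.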
- class evalml.preprocessing.data_splitters.TrainingValidationSplit(test_size=None, train_size=None, shuffle=False, stratify=None, random_seed=0)
Split the training data into training and validation sets.
- Parameters
test_size (float) – Fraction of data points to include in the validation set. Defaults to the complement of train_size if train_size is set, and 0.25 otherwise.
train_size (float) – Fraction of data points to include in the training set. Defaults to the complement of test_size.
shuffle (boolean) – Whether to shuffle the data before splitting. Defaults to False.
stratify (list) – Splits the data in a stratified fashion, using this argument as class labels. Defaults to None.
random_seed (int) – The seed to use for random sampling. Defaults to 0.
Examples
>>> import numpy as np
>>> import pandas as pd
>>> from evalml.preprocessing.data_splitters import TrainingValidationSplit
...
>>> X = pd.DataFrame([i for i in range(10)], columns=["First"])
>>> y = pd.Series([i for i in range(10)])
...
>>> tv_split = TrainingValidationSplit()
>>> split_ = next(tv_split.split(X, y))
>>> assert (split_[0] == np.array([0, 1, 2, 3, 4, 5, 6])).all()
>>> assert (split_[1] == np.array([7, 8, 9])).all()
...
>>> tv_split = TrainingValidationSplit(test_size=0.5)
>>> split_ = next(tv_split.split(X, y))
>>> assert (split_[0] == np.array([0, 1, 2, 3, 4])).all()
>>> assert (split_[1] == np.array([5, 6, 7, 8, 9])).all()
...
>>> tv_split = TrainingValidationSplit(shuffle=True)
>>> split_ = next(tv_split.split(X, y))
>>> assert (split_[0] == np.array([9, 1, 6, 7, 3, 0, 5])).all()
>>> assert (split_[1] == np.array([2, 8, 4])).all()
...
>>> y = pd.Series([i % 3 for i in range(10)])
>>> tv_split = TrainingValidationSplit(shuffle=True, stratify=y)
>>> split_ = next(tv_split.split(X, y))
>>> assert (split_[0] == np.array([1, 9, 3, 2, 8, 6, 7])).all()
>>> assert (split_[1] == np.array([0, 4, 5])).all()
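The size arithmetic behind the unshuffled cases can be sketched as follows. This assumes the validation count is rounded up and the training count rounded down, which is consistent with the 7/3 split the default produces on 10 rows. The helper `train_validation_indices` is hypothetical, and shuffling and stratification are not modeled.

```python
import math


def train_validation_indices(n_samples, test_size=None, train_size=None):
    """Return (train, validation) index lists for an unshuffled split.

    Illustrative sketch only; the rounding rules are an assumption.
    """
    if test_size is None and train_size is None:
        test_size = 0.25  # documented default when neither size is given
    n_test = None if test_size is None else math.ceil(test_size * n_samples)
    n_train = None if train_size is None else math.floor(train_size * n_samples)
    # Whichever size was not given becomes the complement of the other.
    if n_train is None:
        n_train = n_samples - n_test
    if n_test is None:
        n_test = n_samples - n_train
    train = list(range(n_train))
    validation = list(range(n_train, n_train + n_test))
    return train, validation


default_split = train_validation_indices(10)
half_split = train_validation_indices(10, test_size=0.5)
```

Here `default_split` holds rows 0 to 6 for training and rows 7 to 9 for validation, and `half_split` divides the rows five and five.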
Methods
get_metadata_routing | Get metadata routing of this object.
get_n_splits | Return the number of splits of this object.
is_cv | Returns whether or not the data splitter is a cross-validation data splitter.
split | Divide the data into training and testing sets.
- get_metadata_routing(self)
Get metadata routing of this object.
Please check User Guide on how the routing mechanism works.
- Returns
routing – A MetadataRequest encapsulating routing information.
- Return type
MetadataRequest
- static get_n_splits()
Return the number of splits of this object.
- Returns
Always returns 1.
- Return type
int
- property is_cv(self)
Returns whether or not the data splitter is a cross-validation data splitter.
- Returns
If the splitter is a cross-validation data splitter
- Return type
bool