In this demo, we will show you how to use EvalML to build models which use text data.
[1]:
import evalml
from evalml import AutoMLSearch
We will use a dataset of SMS text messages, some of which are categorized as spam and others that are not (“ham”). This dataset is originally from Kaggle, but has been modified to produce a slightly more even distribution of spam to ham.
[2]:
from urllib.request import urlopen
import pandas as pd

input_data = urlopen('https://featurelabs-static.s3.amazonaws.com/spam_text_messages_modified.csv')
data = pd.read_csv(input_data)

X = data.drop(['Category'], axis=1)
y = data['Category']

display(X.head())
The ham vs spam distribution of the data is 3:1, so any machine learning model must get above 75% accuracy in order to perform better than a trivial baseline model which simply classifies everything as ham.
[3]:
y.value_counts(normalize=True)
ham     0.750084
spam    0.249916
Name: Category, dtype: float64
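The 75% threshold mentioned above is simply the share of the most frequent class. A minimal sketch of that calculation follows, using only the standard library. The helper name `majority_baseline_accuracy` and the toy labels are hypothetical and chosen to mirror the 3:1 ratio of the real data.

```python
from collections import Counter


def majority_baseline_accuracy(labels):
    """Accuracy of a model that always predicts the most frequent label."""
    counts = Counter(labels)
    # most_common(1) returns a list with one (label, count) pair
    _, top_count = counts.most_common(1)[0]
    return top_count / len(labels)


# Hypothetical toy labels with the same 3:1 ham-to-spam ratio as the dataset
toy_labels = ["ham"] * 75 + ["spam"] * 25
baseline = majority_baseline_accuracy(toy_labels)
print(f"Baseline accuracy: {baseline}")
```

Any model worth keeping should score above this number on data it has not seen.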
In order to validate the results of the pipeline creation and optimization process, we will save some of our data as a holdout set.
[4]:
X_train, X_holdout, y_train, y_holdout = evalml.preprocessing.split_data(X, y, test_size=0.2, random_state=0)
EvalML uses Woodwork to automatically detect which columns are text columns, so you can run search normally, as you would if there were no text data.
[5]:
import woodwork as ww

X_train_dt = ww.DataTable(X_train)
y_train_dc = ww.DataColumn(y_train)
Because the spam/ham labels are binary, we will use AutoMLSearch(problem_type='binary'). When we call .search(), the search for the best pipeline will begin.
[6]:
automl = AutoMLSearch(problem_type='binary', max_batches=1, optimize_thresholds=True)
automl.search(X_train_dt, y_train_dc)
Generating pipelines to search over...
*****************************
* Beginning pipeline search *
*****************************

Optimizing for Log Loss Binary.
Lower score is better.

Searching up to 1 batches for a total of 9 pipelines.
Allowed model families: decision_tree, xgboost, random_forest, lightgbm, linear_model, extra_trees, catboost

Batch 1: (1/9) Mode Baseline Binary Classification P... Elapsed:00:00
        Starting cross validation
        Finished cross validation - mean Log Loss Binary: 8.638
Batch 1: (2/9) Decision Tree Classifier w/ Imputer +... Elapsed:00:00
        Starting cross validation
        Finished cross validation - mean Log Loss Binary: 0.718
Batch 1: (3/9) LightGBM Classifier w/ Imputer + Text... Elapsed:00:12
        Starting cross validation
        Finished cross validation - mean Log Loss Binary: 0.176
        High coefficient of variation (cv >= 0.2) within cross validation scores. LightGBM Classifier w/ Imputer + Text Featurization Component may not perform as estimated on unseen data.
Batch 1: (4/9) Extra Trees Classifier w/ Imputer + T... Elapsed:00:24
        Starting cross validation
        Finished cross validation - mean Log Loss Binary: 0.212
Batch 1: (5/9) Elastic Net Classifier w/ Imputer + T... Elapsed:00:36
        Starting cross validation
        Finished cross validation - mean Log Loss Binary: 0.512
Batch 1: (6/9) CatBoost Classifier w/ Imputer + Text... Elapsed:00:47
        Starting cross validation
        Finished cross validation - mean Log Loss Binary: 0.523
Batch 1: (7/9) XGBoost Classifier w/ Imputer + Text ... Elapsed:00:58
        Starting cross validation
        Finished cross validation - mean Log Loss Binary: 0.130
Batch 1: (8/9) Random Forest Classifier w/ Imputer +... Elapsed:01:09
        Starting cross validation
        Finished cross validation - mean Log Loss Binary: 0.127
Batch 1: (9/9) Logistic Regression Classifier w/ Imp... Elapsed:01:22
        Starting cross validation
        Finished cross validation - mean Log Loss Binary: 0.167

Search finished after 01:34
Best pipeline: Random Forest Classifier w/ Imputer + Text Featurization Component
Best pipeline Log Loss Binary: 0.126659
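The Mode Baseline's log loss of 8.638 is far larger than every other score in the log. The sketch below suggests why: a model that predicts “ham” with full confidence receives a very large penalty on each spam message. It is a hand-written illustration, not EvalML's implementation. The clipping value `eps=1e-15` is an assumption, and the four toy labels are hypothetical and only mirror the 3:1 class ratio.

```python
import math


def binary_log_loss(y_true, p_positive, eps=1e-15):
    """Mean binary log loss; probabilities are clipped to [eps, 1 - eps].

    eps=1e-15 is an assumed clipping value, not taken from EvalML.
    """
    total = 0.0
    for y, p in zip(y_true, p_positive):
        p = min(max(p, eps), 1 - eps)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)


# Hypothetical labels: 1 = spam, 0 = ham, in a 3:1 ham-to-spam ratio
y_true = [0, 0, 0, 1]
# A mode baseline always predicts ham, i.e. probability 0 for spam
p_spam = [0.0, 0.0, 0.0, 0.0]

loss = binary_log_loss(y_true, p_spam)
print(f"Log loss of the always-ham baseline: {loss:.3f}")
```

Under these assumptions the result lands close to the value reported for the Mode Baseline, which is why log loss separates the baseline from real models much more sharply than accuracy does.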
Once the fitting process is done, we can see all of the pipelines that were searched.
[7]:
automl.rankings
To select the best pipeline, we can run:
[8]:
best_pipeline = automl.best_pipeline
You can get more details about any pipeline, including how it performed on other objective functions.
[9]:
automl.describe_pipeline(automl.rankings.iloc[0]["id"])
**********************************************************************
* Random Forest Classifier w/ Imputer + Text Featurization Component *
**********************************************************************

Problem Type: binary
Model Family: Random Forest

Pipeline Steps
==============
1. Imputer
         * categorical_impute_strategy : most_frequent
         * numeric_impute_strategy : mean
         * categorical_fill_value : None
         * numeric_fill_value : None
2. Text Featurization Component
         * text_columns : ['Message']
3. Random Forest Classifier
         * n_estimators : 100
         * max_depth : 6
         * n_jobs : -1

Training
========
Training for binary problems.
Total training time (including CV): 12.3 seconds

Cross Validation
----------------
             Log Loss Binary  MCC Binary    AUC  Precision     F1  Balanced Accuracy Binary  Accuracy Binary  # Training  # Testing
0                      0.130       0.860  0.986      0.939  0.892                     0.915            0.949    1594.000    797.000
1                      0.122       0.899  0.981      0.942  0.923                     0.943            0.962    1594.000    797.000
2                      0.128       0.888  0.981      0.947  0.915                     0.934            0.959    1594.000    797.000
mean                   0.127       0.882  0.982      0.943  0.910                     0.931            0.957           -          -
std                    0.004       0.020  0.003      0.004  0.016                     0.014            0.007           -          -
coef of var            0.033       0.023  0.003      0.004  0.018                     0.015            0.007           -          -
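The summary rows of the cross-validation table can be reproduced from the three fold scores. The sketch below does so for the Log Loss Binary column using the standard library. Using the sample standard deviation (dividing by n - 1) is an assumption, chosen because it matches the rounded values in the table. The 0.2 threshold comes from the warning printed in the search log.

```python
import statistics

# Log Loss Binary for the three cross-validation folds, copied from the table
fold_log_loss = [0.130, 0.122, 0.128]

mean = statistics.mean(fold_log_loss)
std = statistics.stdev(fold_log_loss)  # sample standard deviation (n - 1)
coef_of_var = std / mean

print(f"mean        {mean:.3f}")
print(f"std         {std:.3f}")
print(f"coef of var {coef_of_var:.3f}")

# The search log warns when the coefficient of variation reaches 0.2
high_variation = coef_of_var >= 0.2
print(f"High variation across folds: {high_variation}")
```

A low coefficient of variation, as seen here, indicates that the pipeline scored consistently across folds.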
[10]:
best_pipeline.graph()
Notice above that there is a Text Featurization Component as the second step in the pipeline. The Woodwork DataTable passed to AutoMLSearch identifies 'Message' as a text column, and the Text Featurization Component converts this text into numerical values that can be handled by the estimator.
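To give a sense of what converting text into numerical values means, here is a deliberately simple sketch. The features below (character count, word count, digit ratio, uppercase ratio) are hypothetical and hand-picked for illustration. They are not the features the Text Featurization Component computes, and the example messages are invented.

```python
def simple_text_features(message):
    """Turn one message into a few numeric features (illustrative only)."""
    n_chars = len(message)
    words = message.split()
    digits = sum(ch.isdigit() for ch in message)
    uppers = sum(ch.isupper() for ch in message)
    return {
        "char_count": n_chars,
        "word_count": len(words),
        # Guard against division by zero for empty messages
        "digit_ratio": digits / n_chars if n_chars else 0.0,
        "upper_ratio": uppers / n_chars if n_chars else 0.0,
    }


# Invented example messages, not taken from the dataset
examples = ["WIN 100 now", "see you at lunch"]
for text in examples:
    print(text, "->", simple_text_features(text))
```

Each message becomes a fixed-length row of numbers, which is the form an estimator such as a random forest can consume.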
Finally, we retrain the best pipeline on all of the training data and evaluate it on the holdout set.
[11]:
best_pipeline.fit(X_train, y_train)
GeneratedPipeline(parameters={'Imputer':{'categorical_impute_strategy': 'most_frequent', 'numeric_impute_strategy': 'mean', 'categorical_fill_value': None, 'numeric_fill_value': None}, 'Text Featurization Component':{'text_columns': ['Message']}, 'Random Forest Classifier':{'n_estimators': 100, 'max_depth': 6, 'n_jobs': -1},})
Now we can score the pipeline on the holdout data using the core objectives for binary classification problems.
[12]:
scores = best_pipeline.score(X_holdout, y_holdout, objectives=evalml.objectives.get_core_objectives('binary'))
print(f'Accuracy Binary: {scores["Accuracy Binary"]}')
Accuracy Binary: 0.9732441471571907
As you can see, this model performs well on this dataset, even on unseen data: its holdout accuracy of about 97% is far above the 75% achieved by the trivial baseline.
To demonstrate the importance of text-specific modeling, let’s train a model with the same dataset, without letting AutoMLSearch detect the text column. We can change this by explicitly setting the data type of the 'Message' column in Woodwork.
[13]:
X_train_categorical = ww.DataTable(X_train, logical_types={'Message': 'Categorical'})
[14]:
automl_no_text = AutoMLSearch(problem_type='binary', max_batches=1, optimize_thresholds=True)
automl_no_text.search(X_train_categorical, y_train_dc)
Batch 1: (1/9) Mode Baseline Binary Classification P... Elapsed:00:00
        Starting cross validation
        Finished cross validation - mean Log Loss Binary: 8.638
Batch 1: (2/9) Decision Tree Classifier w/ Imputer +... Elapsed:00:00
        Starting cross validation
        Finished cross validation - mean Log Loss Binary: 0.562
Batch 1: (3/9) LightGBM Classifier w/ Imputer + One ... Elapsed:00:00
        Starting cross validation
        Finished cross validation - mean Log Loss Binary: 0.562
Batch 1: (4/9) Extra Trees Classifier w/ Imputer + O... Elapsed:00:00
        Starting cross validation
        Finished cross validation - mean Log Loss Binary: 0.561
Batch 1: (5/9) Elastic Net Classifier w/ Imputer + O... Elapsed:00:02
        Starting cross validation
        Finished cross validation - mean Log Loss Binary: 0.562
Batch 1: (6/9) CatBoost Classifier w/ Imputer           Elapsed:00:02
        Starting cross validation
        Finished cross validation - mean Log Loss Binary: 0.621
Batch 1: (7/9) XGBoost Classifier w/ Imputer + One H... Elapsed:00:03
        Starting cross validation
        Finished cross validation - mean Log Loss Binary: 0.562
Batch 1: (8/9) Random Forest Classifier w/ Imputer +... Elapsed:00:03
        Starting cross validation
        Finished cross validation - mean Log Loss Binary: 0.561
Batch 1: (9/9) Logistic Regression Classifier w/ Imp... Elapsed:00:05
        Starting cross validation
        Finished cross validation - mean Log Loss Binary: 0.559

Search finished after 00:05
Best pipeline: Logistic Regression Classifier w/ Imputer + One Hot Encoder + Standard Scaler
Best pipeline Log Loss Binary: 0.559496
As before, we can look at the rankings and pick the best pipeline.
[15]:
automl_no_text.rankings
[16]:
best_pipeline_no_text = automl_no_text.best_pipeline
Here, changing the data type of the text column removed the Text Featurization Component from the pipeline.
[17]:
best_pipeline_no_text.graph()
[18]:
automl_no_text.describe_pipeline(automl_no_text.rankings.iloc[0]["id"])
*********************************************************************************
* Logistic Regression Classifier w/ Imputer + One Hot Encoder + Standard Scaler *
*********************************************************************************

Problem Type: binary
Model Family: Linear

Pipeline Steps
==============
1. Imputer
         * categorical_impute_strategy : most_frequent
         * numeric_impute_strategy : mean
         * categorical_fill_value : None
         * numeric_fill_value : None
2. One Hot Encoder
         * top_n : 10
         * features_to_encode : None
         * categories : None
         * drop : None
         * handle_unknown : ignore
         * handle_missing : error
3. Standard Scaler
4. Logistic Regression Classifier
         * penalty : l2
         * C : 1.0
         * n_jobs : -1
         * multi_class : auto
         * solver : lbfgs

Training
========
Training for binary problems.
Total training time (including CV): 0.4 seconds

Cross Validation
----------------
             Log Loss Binary  MCC Binary    AUC  Precision     F1  Balanced Accuracy Binary  Accuracy Binary  # Training  # Testing
0                      0.558       0.061  0.508      1.000  0.010                     0.503            0.752    1594.000    797.000
1                      0.561       0.000  0.503      0.000  0.000                     0.500            0.750    1594.000    797.000
2                      0.560       0.087  0.506      1.000  0.020                     0.505            0.752    1594.000    797.000
mean                   0.559       0.049  0.506      0.667  0.010                     0.503            0.751           -          -
std                    0.001       0.045  0.002      0.577  0.010                     0.003            0.001           -          -
coef of var            0.002       0.903  0.004      0.866  0.997                     0.005            0.001           -          -
[19]:
# train on the full training data
best_pipeline_no_text.fit(X_train, y_train)

# get standard performance metrics on holdout data
scores = best_pipeline_no_text.score(X_holdout, y_holdout, objectives=evalml.objectives.get_core_objectives('binary'))
print(f'Accuracy Binary: {scores["Accuracy Binary"]}')
Accuracy Binary: 0.7525083612040134
Without the Text Featurization Component, the 'Message' column was treated as a categorical column, so the conversion of this text to numerical features happened in the One Hot Encoder. The best pipeline encoded the 10 most frequent “categories” of these texts, meaning that 10 text messages were one-hot encoded and all the others were dropped. This removed almost all of the information from the dataset: with a holdout accuracy of about 75%, best_pipeline_no_text is essentially no better than the trivial baseline of predicting “ham” in every case.
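The information loss described above can be seen in a small sketch of top-n one-hot encoding. This is a hand-written illustration using the standard library, not EvalML's One Hot Encoder. The helper `one_hot_top_n` and the generated messages are hypothetical, and the messages are all distinct to mimic free-form text.

```python
from collections import Counter


def one_hot_top_n(values, top_n=10):
    """One-hot encode only the top_n most frequent values.

    Values outside the top_n produce an all-zero row.
    """
    kept = [value for value, _ in Counter(values).most_common(top_n)]
    rows = [[1 if value == category else 0 for category in kept] for value in values]
    return rows, kept


# Hypothetical data: 1000 distinct messages, as free-form text tends to be
messages = [f"hypothetical message {i}" for i in range(1000)]
encoded, kept = one_hot_top_n(messages, top_n=10)

# Count rows that carry no information at all after encoding
all_zero = sum(1 for row in encoded if not any(row))
print(f"{all_zero} of {len(messages)} rows are all zeros")
```

When nearly every message is unique, almost every row is encoded as zeros, so the estimator has nothing to distinguish spam from ham.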