Using Text Data with EvalML

This demo shows how to use EvalML to build models that use text data.

[1]:

import evalml
from evalml import AutoMLSearch

Dataset

We will use a dataset of SMS text messages, some of which are categorized as spam and others which are not (“ham”). The dataset originally comes from Kaggle and has been modified to produce a slightly more even distribution of spam to ham.

[2]:

from urllib.request import urlopen
import pandas as pd

input_data = urlopen('https://featurelabs-static.s3.amazonaws.com/spam_text_messages_modified.csv')
data = pd.read_csv(input_data)

X = data.drop(['Category'], axis=1)
y = data['Category']

display(X.head())

	Message
0	Free entry in 2 a wkly comp to win FA Cup fina...
1	FreeMsg Hey there darling it's been 3 week's n...
2	WINNER!! As a valued network customer you have...
3	Had your mobile 11 months or more? U R entitle...
4	SIX chances to win CASH! From 100 to 20,000 po...

The ham-to-spam ratio in the data is about 3:1, so a machine learning model must exceed 75% accuracy to outperform a trivial baseline model that simply classifies every message as ham.

[3]:

y.value_counts(normalize=True)

[3]:

ham     0.750084
spam    0.249916
Name: Category, dtype: float64
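The 75% figure follows directly from the class proportions: a model that always predicts the majority class is correct exactly as often as that class appears. A minimal sketch of that arithmetic, using a hypothetical toy label list with the same 3:1 ratio rather than the real dataset (the function name is illustrative, not part of EvalML):

```python
from collections import Counter


def majority_baseline_accuracy(labels):
    """Accuracy of a model that always predicts the most common label."""
    counts = Counter(labels)
    _, majority_count = counts.most_common(1)[0]
    return majority_count / len(labels)


# Hypothetical labels with the same 3:1 ham-to-spam ratio as the dataset.
toy_labels = ["ham"] * 3 + ["spam"]
baseline = majority_baseline_accuracy(toy_labels)
print(baseline)  # 0.75
```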

Search for best pipeline

In order to validate the results of the pipeline creation and optimization process, we will save some of our data as a holdout set.

[4]:

X_train, X_holdout, y_train, y_holdout = evalml.preprocessing.split_data(X, y, test_size=0.2, random_state=0)
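For readers curious about what a holdout split involves, the sketch below shows one common approach: sampling the same fraction from each class so that the holdout keeps the ham/spam ratio. It is an illustration under that assumption only and is not taken from the implementation of `split_data`; the function name and toy labels are hypothetical.

```python
import random
from collections import defaultdict


def stratified_holdout_indices(labels, test_size=0.2, seed=0):
    """Return (train_idx, holdout_idx), holding out test_size of each class."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for i, label in enumerate(labels):
        by_class[label].append(i)

    train_idx, holdout_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)  # randomize which rows of this class are held out
        n_holdout = round(len(indices) * test_size)
        holdout_idx.extend(indices[:n_holdout])
        train_idx.extend(indices[n_holdout:])
    return sorted(train_idx), sorted(holdout_idx)


# Hypothetical labels with a 3:1 ham-to-spam ratio.
toy_labels = ["ham"] * 75 + ["spam"] * 25
train_idx, holdout_idx = stratified_holdout_indices(toy_labels)
holdout_spam = sum(1 for i in holdout_idx if toy_labels[i] == "spam")
print(len(train_idx), len(holdout_idx), holdout_spam)
```

Because each class is sampled separately, the holdout in this toy example contains spam in the same proportion as the full list.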

EvalML uses Woodwork to automatically detect which columns are text columns, so you can run search normally, as you would if there were no text data.

[5]:

import woodwork as ww

X_train_dt = ww.DataTable(X_train)
y_train_dc = ww.DataColumn(y_train)

Because the spam/ham labels are binary, we will use AutoMLSearch(problem_type='binary'). When we call .search(), the search for the best pipeline will begin.

[6]:

automl = AutoMLSearch(problem_type='binary',
                      max_batches=1,
                      optimize_thresholds=True)

automl.search(X_train_dt, y_train_dc)

Generating pipelines to search over...
*****************************
* Beginning pipeline search *
*****************************

Optimizing for Log Loss Binary.
Lower score is better.

Searching up to 1 batches for a total of 9 pipelines.
Allowed model families: decision_tree, xgboost, random_forest, lightgbm, linear_model, extra_trees, catboost

Batch 1: (1/9) Mode Baseline Binary Classification P... Elapsed:00:00
Starting cross validation
Finished cross validation - mean Log Loss Binary: 8.638
Batch 1: (2/9) Decision Tree Classifier w/ Imputer +... Elapsed:00:00
Starting cross validation
Finished cross validation - mean Log Loss Binary: 0.718
Batch 1: (3/9) LightGBM Classifier w/ Imputer + Text... Elapsed:00:12
Starting cross validation
Finished cross validation - mean Log Loss Binary: 0.176
High coefficient of variation (cv >= 0.2) within cross validation scores. LightGBM Classifier w/ Imputer + Text Featurization Component may not perform as estimated on unseen data.
Batch 1: (4/9) Extra Trees Classifier w/ Imputer + T... Elapsed:00:24
Starting cross validation
Finished cross validation - mean Log Loss Binary: 0.212
Batch 1: (5/9) Elastic Net Classifier w/ Imputer + T... Elapsed:00:36
Starting cross validation
Finished cross validation - mean Log Loss Binary: 0.512
Batch 1: (6/9) CatBoost Classifier w/ Imputer + Text... Elapsed:00:47
Starting cross validation
Finished cross validation - mean Log Loss Binary: 0.523
Batch 1: (7/9) XGBoost Classifier w/ Imputer + Text ... Elapsed:00:58
Starting cross validation
Finished cross validation - mean Log Loss Binary: 0.130
Batch 1: (8/9) Random Forest Classifier w/ Imputer +... Elapsed:01:09
Starting cross validation
Finished cross validation - mean Log Loss Binary: 0.127
Batch 1: (9/9) Logistic Regression Classifier w/ Imp... Elapsed:01:22
Starting cross validation
Finished cross validation - mean Log Loss Binary: 0.167

Search finished after 01:34
Best pipeline: Random Forest Classifier w/ Imputer + Text Featurization Component
Best pipeline Log Loss Binary: 0.126659
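The search optimizes log loss, which penalizes confident wrong predictions heavily. The sketch below is a hand-written version of the standard binary log loss formula, not EvalML's objective class. It assumes probabilities are clipped at `1e-15` before taking the logarithm, as some library implementations do; that clipping value and the toy labels are assumptions. Under that assumption, a model that always predicts ham with full confidence scores roughly 8.6 on a 3:1 dataset, which is in the neighborhood of the baseline score in the log above, although the output itself does not say how the baseline score arises.

```python
import math


def binary_log_loss(y_true, p_spam, eps=1e-15):
    """Mean negative log-likelihood; y_true is 1 for spam and 0 for ham."""
    total = 0.0
    for y, p in zip(y_true, p_spam):
        p = min(max(p, eps), 1 - eps)  # clip so that log(0) never occurs
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)


# Hypothetical labels with a 3:1 ham-to-spam ratio.
y_true = [0, 0, 0, 1]

always_ham = binary_log_loss(y_true, [0.0] * 4)    # fully confident "ham"
class_prior = binary_log_loss(y_true, [0.25] * 4)  # predicts the spam rate

print(round(always_ham, 3))   # about 8.635
print(round(class_prior, 3))  # about 0.562
```

The second value is a useful reference point: a model that has learned nothing beyond the class proportions scores about 0.562, so only scores clearly below that indicate the features carry signal.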

View rankings and select pipeline

Once the fitting process is done, we can see all of the pipelines that were searched.

[7]:

automl.rankings

[7]:

	id	pipeline_name	score	validation_score	percent_better_than_baseline	high_variance_cv	parameters
0	7	Random Forest Classifier w/ Imputer + Text Fea...	0.126659	0.129962	98.533748	False	{'Imputer': {'categorical_impute_strategy': 'm...
1	6	XGBoost Classifier w/ Imputer + Text Featuriza...	0.129805	0.137287	98.497327	False	{'Imputer': {'categorical_impute_strategy': 'm...
2	8	Logistic Regression Classifier w/ Imputer + Te...	0.167302	0.133515	98.063250	False	{'Imputer': {'categorical_impute_strategy': 'm...
3	2	LightGBM Classifier w/ Imputer + Text Featuriz...	0.176034	0.250505	97.962165	True	{'Imputer': {'categorical_impute_strategy': 'm...
4	3	Extra Trees Classifier w/ Imputer + Text Featu...	0.211965	0.187199	97.546217	False	{'Imputer': {'categorical_impute_strategy': 'm...
5	4	Elastic Net Classifier w/ Imputer + Text Featu...	0.512199	0.518497	94.070613	False	{'Imputer': {'categorical_impute_strategy': 'm...
6	5	CatBoost Classifier w/ Imputer + Text Featuriz...	0.523369	0.542935	93.941300	False	{'Imputer': {'categorical_impute_strategy': 'm...
7	1	Decision Tree Classifier w/ Imputer + Text Fea...	0.717984	0.608398	91.688374	False	{'Imputer': {'categorical_impute_strategy': 'm...
8	0	Mode Baseline Binary Classification Pipeline	8.638305	8.623860	0.000000	False	{'Baseline Classifier': {'strategy': 'mode'}}
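The `percent_better_than_baseline` column appears to be the relative improvement over the baseline score for a lower-is-better objective. The formula below is inferred from the values in the table rather than taken from EvalML's source, so treat it as a sketch; the function name is illustrative.

```python
def percent_better_than_baseline(score, baseline_score):
    """Relative improvement over the baseline, in percent (lower score is better)."""
    return 100 * (baseline_score - score) / baseline_score


# Scores copied from the rankings table: Random Forest vs. the mode baseline.
improvement = percent_better_than_baseline(0.126659, 8.638305)
print(round(improvement, 2))  # about 98.53
```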

To select the best pipeline, we can run:

[8]:

best_pipeline = automl.best_pipeline

Describe pipeline

You can get more details about any pipeline, including how it performed on other objective functions.

[9]:

automl.describe_pipeline(automl.rankings.iloc[0]["id"])

**********************************************************************
* Random Forest Classifier w/ Imputer + Text Featurization Component *
**********************************************************************

Problem Type: binary
Model Family: Random Forest

Pipeline Steps
==============
1. Imputer
         * categorical_impute_strategy : most_frequent
         * numeric_impute_strategy : mean
         * categorical_fill_value : None
         * numeric_fill_value : None
2. Text Featurization Component
         * text_columns : ['Message']
3. Random Forest Classifier
         * n_estimators : 100
         * max_depth : 6
         * n_jobs : -1

Training
========
Training for binary problems.
Total training time (including CV): 12.3 seconds

Cross Validation
----------------
             Log Loss Binary  MCC Binary   AUC  Precision    F1  Balanced Accuracy Binary  Accuracy Binary # Training # Testing
0                      0.130       0.860 0.986      0.939 0.892                     0.915            0.949   1594.000   797.000
1                      0.122       0.899 0.981      0.942 0.923                     0.943            0.962   1594.000   797.000
2                      0.128       0.888 0.981      0.947 0.915                     0.934            0.959   1594.000   797.000
mean                   0.127       0.882 0.982      0.943 0.910                     0.931            0.957          -         -
std                    0.004       0.020 0.003      0.004 0.016                     0.014            0.007          -         -
coef of var            0.033       0.023 0.003      0.004 0.018                     0.015            0.007          -         -
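The `coef of var` row is the standard deviation of the fold scores divided by their mean. The sketch below recomputes it for the Log Loss Binary column, assuming a sample standard deviation (which is consistent with the table) and the `cv >= 0.2` threshold quoted in the high-variance warning during search. The function name is illustrative.

```python
from statistics import mean, stdev


def coefficient_of_variation(scores):
    """Sample standard deviation of the scores divided by their mean."""
    return stdev(scores) / mean(scores)


# Per-fold Log Loss Binary values copied from the table above.
fold_log_loss = [0.130, 0.122, 0.128]
cv = coefficient_of_variation(fold_log_loss)
is_high_variance = cv >= 0.2  # threshold taken from the search warning

print(round(cv, 3), is_high_variance)  # about 0.033, False
```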

[10]:

best_pipeline.graph()

Notice above that the second step in the pipeline is a Text Featurization Component. Because the Woodwork DataTable passed to AutoMLSearch identifies 'Message' as a text column, the search adds this component, which converts the text into numerical values that the estimator can handle.
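To make "converting text into numerical values" concrete, here is a minimal sketch that turns a message into a few numeric features. These hand-picked features are hypothetical stand-ins chosen for illustration; they are not the features the Text Featurization Component computes.

```python
def simple_text_features(message):
    """Map a string to a small dictionary of numeric features."""
    words = message.split()
    n_chars = len(message)
    return {
        "n_chars": n_chars,
        "n_words": len(words),
        "mean_word_length": sum(len(w) for w in words) / len(words) if words else 0.0,
        "digit_fraction": sum(c.isdigit() for c in message) / n_chars if n_chars else 0.0,
        "upper_fraction": sum(c.isupper() for c in message) / n_chars if n_chars else 0.0,
    }


# A made-up spam-like message, not a row from the dataset.
features = simple_text_features("WINNER!! Call 0800 now")
print(features)
```

Even crude features like the share of digits or uppercase letters vary meaningfully between messages, which is what gives an estimator something to learn from.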

Evaluate on holdout

Finally, we retrain the best pipeline on all of the training data and evaluate it on the holdout set.

[11]:

best_pipeline.fit(X_train, y_train)

[11]:

GeneratedPipeline(parameters={'Imputer':{'categorical_impute_strategy': 'most_frequent', 'numeric_impute_strategy': 'mean', 'categorical_fill_value': None, 'numeric_fill_value': None}, 'Text Featurization Component':{'text_columns': ['Message']}, 'Random Forest Classifier':{'n_estimators': 100, 'max_depth': 6, 'n_jobs': -1},})

Now we can score the pipeline on the holdout data using the core objectives for binary classification problems.

[12]:

scores = best_pipeline.score(X_holdout, y_holdout, objectives=evalml.objectives.get_core_objectives('binary'))
print(f'Accuracy Binary: {scores["Accuracy Binary"]}')

Accuracy Binary: 0.9732441471571907

As you can see, this model reaches about 97% accuracy on the holdout data, well above the 75% baseline, so it performs well even on unseen data.

Why encode text this way?

To demonstrate the importance of text-specific modeling, let’s train a model on the same dataset without letting AutoMLSearch detect the text column. We can do this by explicitly setting the logical type of the 'Message' column in Woodwork.

[13]:

X_train_categorical = ww.DataTable(X_train, logical_types={'Message': 'Categorical'})

[14]:

automl_no_text = AutoMLSearch(problem_type='binary',
                              max_batches=1,
                              optimize_thresholds=True)

automl_no_text.search(X_train_categorical, y_train_dc)

Generating pipelines to search over...
*****************************
* Beginning pipeline search *
*****************************

Optimizing for Log Loss Binary.
Lower score is better.

Searching up to 1 batches for a total of 9 pipelines.
Allowed model families: decision_tree, xgboost, random_forest, lightgbm, linear_model, extra_trees, catboost

Batch 1: (1/9) Mode Baseline Binary Classification P... Elapsed:00:00
        Starting cross validation
        Finished cross validation - mean Log Loss Binary: 8.638
Batch 1: (2/9) Decision Tree Classifier w/ Imputer +... Elapsed:00:00
        Starting cross validation
        Finished cross validation - mean Log Loss Binary: 0.562
Batch 1: (3/9) LightGBM Classifier w/ Imputer + One ... Elapsed:00:00
        Starting cross validation
        Finished cross validation - mean Log Loss Binary: 0.562
Batch 1: (4/9) Extra Trees Classifier w/ Imputer + O... Elapsed:00:00
        Starting cross validation
        Finished cross validation - mean Log Loss Binary: 0.561
Batch 1: (5/9) Elastic Net Classifier w/ Imputer + O... Elapsed:00:02
        Starting cross validation
        Finished cross validation - mean Log Loss Binary: 0.562
Batch 1: (6/9) CatBoost Classifier w/ Imputer           Elapsed:00:02
        Starting cross validation
        Finished cross validation - mean Log Loss Binary: 0.621
Batch 1: (7/9) XGBoost Classifier w/ Imputer + One H... Elapsed:00:03
        Starting cross validation
        Finished cross validation - mean Log Loss Binary: 0.562
Batch 1: (8/9) Random Forest Classifier w/ Imputer +... Elapsed:00:03
        Starting cross validation
        Finished cross validation - mean Log Loss Binary: 0.561
Batch 1: (9/9) Logistic Regression Classifier w/ Imp... Elapsed:00:05
        Starting cross validation
        Finished cross validation - mean Log Loss Binary: 0.559

Search finished after 00:05
Best pipeline: Logistic Regression Classifier w/ Imputer + One Hot Encoder + Standard Scaler
Best pipeline Log Loss Binary: 0.559496

As before, we can look at the rankings and pick the best pipeline.

[15]:

automl_no_text.rankings

[15]:

	id	pipeline_name	score	validation_score	percent_better_than_baseline	high_variance_cv	parameters
0	8	Logistic Regression Classifier w/ Imputer + On...	0.559496	0.558255	93.523087	False	{'Imputer': {'categorical_impute_strategy': 'm...
1	3	Extra Trees Classifier w/ Imputer + One Hot En...	0.561279	0.559761	93.502436	False	{'Imputer': {'categorical_impute_strategy': 'm...
2	7	Random Forest Classifier w/ Imputer + One Hot ...	0.561398	0.561302	93.501058	False	{'Imputer': {'categorical_impute_strategy': 'm...
3	6	XGBoost Classifier w/ Imputer + One Hot Encoder	0.562219	0.561991	93.491564	False	{'Imputer': {'categorical_impute_strategy': 'm...
4	1	Decision Tree Classifier w/ Imputer + One Hot ...	0.562424	0.561668	93.489187	False	{'Imputer': {'categorical_impute_strategy': 'm...
5	2	LightGBM Classifier w/ Imputer + One Hot Encoder	0.562451	0.561991	93.488872	False	{'Imputer': {'categorical_impute_strategy': 'm...
6	4	Elastic Net Classifier w/ Imputer + One Hot En...	0.562465	0.562004	93.488716	False	{'Imputer': {'categorical_impute_strategy': 'm...
7	5	CatBoost Classifier w/ Imputer	0.620998	0.622235	92.811110	False	{'Imputer': {'categorical_impute_strategy': 'm...
8	0	Mode Baseline Binary Classification Pipeline	8.638305	8.623860	0.000000	False	{'Baseline Classifier': {'strategy': 'mode'}}

[16]:

best_pipeline_no_text = automl_no_text.best_pipeline

Here, changing the data type of the text column removed the Text Featurization Component from the pipeline.

[17]:

best_pipeline_no_text.graph()

[18]:

automl_no_text.describe_pipeline(automl_no_text.rankings.iloc[0]["id"])

*********************************************************************************
* Logistic Regression Classifier w/ Imputer + One Hot Encoder + Standard Scaler *
*********************************************************************************

Problem Type: binary
Model Family: Linear

Pipeline Steps
==============
1. Imputer
         * categorical_impute_strategy : most_frequent
         * numeric_impute_strategy : mean
         * categorical_fill_value : None
         * numeric_fill_value : None
2. One Hot Encoder
         * top_n : 10
         * features_to_encode : None
         * categories : None
         * drop : None
         * handle_unknown : ignore
         * handle_missing : error
3. Standard Scaler
4. Logistic Regression Classifier
         * penalty : l2
         * C : 1.0
         * n_jobs : -1
         * multi_class : auto
         * solver : lbfgs

Training
========
Training for binary problems.
Total training time (including CV): 0.4 seconds

Cross Validation
----------------
             Log Loss Binary  MCC Binary   AUC  Precision    F1  Balanced Accuracy Binary  Accuracy Binary # Training # Testing
0                      0.558       0.061 0.508      1.000 0.010                     0.503            0.752   1594.000   797.000
1                      0.561       0.000 0.503      0.000 0.000                     0.500            0.750   1594.000   797.000
2                      0.560       0.087 0.506      1.000 0.020                     0.505            0.752   1594.000   797.000
mean                   0.559       0.049 0.506      0.667 0.010                     0.503            0.751          -         -
std                    0.001       0.045 0.002      0.577 0.010                     0.003            0.001          -         -
coef of var            0.002       0.903 0.004      0.866 0.997                     0.005            0.001          -         -

[19]:

# train on the full training data
best_pipeline_no_text.fit(X_train, y_train)

# get standard performance metrics on holdout data
scores = best_pipeline_no_text.score(X_holdout, y_holdout, objectives=evalml.objectives.get_core_objectives('binary'))
print(f'Accuracy Binary: {scores["Accuracy Binary"]}')

Accuracy Binary: 0.7525083612040134

Without the Text Featurization Component, the 'Message' column was treated as a categorical column, so the conversion of this text to numerical features happened in the One Hot Encoder. The best pipeline encoded the 10 most frequent “categories” of these texts, meaning that 10 distinct text messages were one-hot encoded and all the others were dropped. This removed almost all of the information from the dataset: at roughly 75% holdout accuracy, best_pipeline_no_text did essentially no better than the trivial baseline of predicting “ham” in every case.
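The effect of one-hot encoding only the most frequent values can be sketched in a few lines. This is a simplified illustration of the idea, not EvalML's One Hot Encoder; the function name and the toy messages are hypothetical, and options such as `handle_unknown` are not modeled.

```python
from collections import Counter


def one_hot_top_n(values, top_n=10):
    """One-hot encode the top_n most frequent values; all others become zero rows."""
    top = [value for value, _ in Counter(values).most_common(top_n)]
    return [[1 if value == t else 0 for t in top] for value in values]


# Toy data: 100 unique messages plus one message that repeats three times.
messages = [f"message number {i}" for i in range(100)] + ["Sorry, I'll call later"] * 3
encoded = one_hot_top_n(messages, top_n=10)
all_zero_rows = sum(1 for row in encoded if not any(row))

print(len(encoded[0]), all_zero_rows)  # 10 columns, 91 rows with no information
```

Because nearly every message is unique, almost every row ends up as all zeros, leaving the estimator with nothing to distinguish spam from ham.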
