Text Classification Quickstart#

Deepchecks NLP tests your models during development and research and before deployment to production. Using the testing package reduces model failures and saves test development time. In this quickstart guide, you will learn how to use the deepchecks NLP package to analyze and evaluate text classification tasks. If you are interested in a multilabel classification task, refer to our Multilabel Quickstart. We will cover the following steps:

  1. Creating a TextData object and auto calculating properties

  2. Running the built-in suites and inspecting the results

  3. Spotlighting two interesting checks - Embeddings Drift and Under-Annotated Segments

To run deepchecks for NLP, you need the following for both your train and test data:

  1. Your text data - a list of strings, each string is a single sample (can be a sentence, paragraph, document, etc.).

  2. Your labels - either a Text Classification label or a Token Classification label. These are not needed for checks that don’t require labels (such as the Embeddings Drift check or most data integrity checks), but are needed for many other checks.

  3. Your model’s predictions (see Supported Tasks and Formats for info on supported formats). These are needed only for the model related checks, shown in the Model Evaluation section of this guide.
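As a rough illustration of the three inputs listed above, the sketch below builds them as plain Python lists and checks that they line up one-to-one. The helper name `check_aligned` and the sample values are made up for this example and are not part of deepchecks.

```python
from typing import Optional, Sequence


def check_aligned(texts: Sequence[str],
                  labels: Optional[Sequence[str]] = None,
                  predictions: Optional[Sequence[str]] = None) -> int:
    """Return the number of samples after checking the optional inputs.

    Hypothetical helper for illustration only; not a deepchecks function.
    """
    n = len(texts)
    for name, values in (("labels", labels), ("predictions", predictions)):
        # Labels and predictions are optional, but when given they need
        # exactly one entry per text sample.
        if values is not None and len(values) != n:
            raise ValueError(f"{name} has {len(values)} entries, expected {n}")
    return n


texts = ["I love this!", "This is so frustrating."]  # one string per sample
labels = ["happiness", "anger"]                      # one class label per sample
predictions = ["happiness", "sadness"]               # one predicted class per sample

print(check_aligned(texts, labels, predictions))
```

Running a check like this before building the data objects makes a length mismatch fail early, with a clear message, rather than deep inside a suite run.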

If you don’t have deepchecks installed yet:

import sys
!{sys.executable} -m pip install 'deepchecks[nlp]' -U --quiet #--user

Some properties calculated by deepchecks.nlp require additional packages to be installed. You can install them by running:

import sys
!{sys.executable} -m pip install 'deepchecks[nlp-properties]' -U --quiet #--user

Setting Up#

Load Data#

For the purpose of this guide, we’ll use a small subset of the tweet emotion dataset. This dataset contains tweets, each labeled with one of four emotions: Anger, Happiness, Optimism, or Sadness.

from deepchecks.nlp import TextData
from deepchecks.nlp.datasets.classification import tweet_emotion

train, test = tweet_emotion.load_data(data_format='DataFrame')
include_properties and include_embeddings are incompatible with data_format="Dataframe". loading only original text data.
   text                                                user_age  gender  days_on_platform  user_region         label
2  No but that's so cute. Atsu was probably shy a...   24.97     Male    2729              Middle East/Africa  happiness
3  Rooneys fucking untouchable isn't he? Been fuc...   21.66     Male    1376              Asia Pacific        anger
7  Tiller and breezy should do a collab album. Ra...   37.29     Female  3853              Americas            happiness
8  @user broadband is shocking regretting signing...   15.39     Female  1831              Europe              anger
9  @user Look at those teef! #growl                    54.37     Female  4619              Europe              anger

We can see that we have the tweet text itself, the label (the emotion) and then some additional metadata columns.

Create TextData Objects#

We can now create a TextData object for each of the train and test dataframes. This object is used to pass your data to the deepchecks checks.

To create a TextData object, the only required argument is the text itself, but passing only the text will prevent many checks from running. In this example we also pass the label, define the task type, and supply the metadata columns (the other columns in the dataframe), which we’ll use later in the guide.

train = TextData(train.text, label=train['label'], task_type='text_classification',
                 metadata=train.drop(columns=['label', 'text']))
test = TextData(test.text, label=test['label'], task_type='text_classification',
                metadata=test.drop(columns=['label', 'text']))

Calculating Properties#

Some of deepchecks’ checks use properties of the text samples for various calculations. Deepchecks has a wide variety of such properties: some are simple, while others rely on external models and are heavier to run. For the checks to be able to access the properties, they must be stored within the TextData object. You can read more about properties in the Property Guide.

# Properties can either be calculated directly by deepchecks
# or imported from other sources in the appropriate format.

# import torch
# device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
# train.calculate_builtin_properties(
#   include_long_calculation_properties=True, device=device
# )
# test.calculate_builtin_properties(
#   include_long_calculation_properties=True, device=device
# )

In this example, though, we’ll use pre-calculated properties:

train_properties, test_properties = tweet_emotion.load_properties()

train.set_properties(train_properties, categorical_properties=['Language'])
test.set_properties(test_properties, categorical_properties=['Language'])

   Text Length  Average Word Length  Max Word Length  % Special Characters  Language  Sentiment  Subjectivity  Toxicity  Fluency   Formality
0  94           4.277778             8                0.021277              en        0.0        0.75          0.009497  0.349153  0.204132
1  102          6.923077             18               0.049020              en        -0.8       0.90          0.995803  0.176892  0.036639
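To make the simpler property names more concrete, here is a minimal sketch that computes three of them by hand for one tweet from the sample data. The definitions used (character count, whitespace-separated words) are assumptions made for this example; the built-in deepchecks properties may handle punctuation or tokenization differently, so the numbers need not match. The function name `simple_properties` is hypothetical.

```python
from statistics import mean
from typing import Dict


def simple_properties(text: str) -> Dict[str, float]:
    """Approximate a few simple text properties (assumed definitions)."""
    words = text.split()  # whitespace-separated words, punctuation kept
    lengths = [len(word) for word in words]
    return {
        "Text Length": len(text),  # number of characters
        "Average Word Length": mean(lengths) if lengths else 0.0,
        "Max Word Length": max(lengths) if lengths else 0,
    }


props = simple_properties("Look at those teef! #growl")
for name, value in props.items():
    print(name, value)
```

Properties that rely on external models, such as Toxicity or Fluency, cannot be reproduced this simply, which is why pre-calculated values are convenient in a quickstart.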

Running the Deepchecks Default Suites#

Deepchecks comes with a set of pre-built suites that run a collection of checks on your data, along with their default conditions and thresholds. You can read more about customizing and creating your own suites in the Customizations Guide. In this guide we’ll use three suites: the data integrity suite, the train test validation suite, and the model evaluation suite. You can also run all the checks at once using the full_suite.

Data Integrity#

We will start by running a preliminary integrity check to validate the text formatting. It is recommended to do this step before model training, as the results may indicate that additional data engineering is required.

We’ll do that using the data_integrity pre-built suite.

from deepchecks.nlp.suites import data_integrity

data_integrity_suite = data_integrity()
data_integrity_suite.run(train, test)

Integrity #1: Unknown Tokens#

First up (in the “Didn’t Pass” tab) we see that the Unknown Tokens check has returned a problem.

Looking at the result, we can see that the check assumed (by default) that we will use the bert-base-uncased tokenizer for our NLP model. If that is the case, many words in the dataset contain characters (such as emojis or Korean characters) that the tokenizer does not recognize. This is an important insight, as BERT tokenizers are very common. You can configure the tokenizer used by this check by passing it to the check’s constructor, and you can configure the threshold for the allowed percentage of unknown tokens by modifying the check’s condition.
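The idea behind this check can be sketched without any tokenizer library. The toy example below uses a hand-made vocabulary and whitespace splitting instead of a real tokenizer such as bert-base-uncased, so it only illustrates the concept. The names `unknown_word_ratio`, `vocab`, and `max_ratio` are illustrative and are not deepchecks API.

```python
from typing import Iterable, Set


def unknown_word_ratio(texts: Iterable[str], vocab: Set[str]) -> float:
    """Share of whitespace-separated words whose lowercase form is not in vocab.

    Toy stand-in for a real tokenizer's unknown-token handling.
    """
    total = 0
    unknown = 0
    for text in texts:
        for word in text.split():
            total += 1
            if word.lower() not in vocab:
                unknown += 1
    return unknown / total if total else 0.0


vocab = {"i", "am", "so", "happy", "today"}  # made-up vocabulary
tweets = ["I am so happy today", "so happy 😀", "오늘 happy"]

ratio = unknown_word_ratio(tweets, vocab)
max_ratio = 0.05  # hypothetical threshold, playing the role of a check condition
print(f"unknown ratio: {ratio:.2f}, passes: {ratio <= max_ratio}")
```

Here the emoji and the Korean word are the only out-of-vocabulary items, so 2 of 10 words are unknown and the hypothetical 5% threshold is exceeded.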

Integrity #2: Text Outliers#

In the “Didn’t Pass” tab, the Text Property Outliers check result lets us derive several insights by hovering over the different values and inspecting the outlier texts:

  1. Hashtags (‘#…’) are usually several words written together without spaces - we might consider splitting them before feeding the tweet to a model.

  2. In some instances users deliberately misspell words, for example ‘!’ instead of the letter ‘l’ or ‘okayyyyyyyyyy’.

  3. The majority of the data is in English but not all. If we want a classifier that is multilingual we should collect more data, otherwise we may consider dropping tweets in other languages from our dataset before training our model.
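The first two observations suggest simple preprocessing steps. The sketch below shows one possible approach using regular expressions: splitting CamelCase hashtags into words and collapsing long runs of a repeated character. Both helpers are hypothetical examples, not part of deepchecks, and the rules are deliberately simple (an all-lowercase hashtag, for instance, is left unsplit).

```python
import re


def split_hashtag(tag: str) -> str:
    """Split a CamelCase hashtag into words, dropping the leading '#'."""
    body = tag.lstrip("#")
    # Insert a space wherever a lowercase letter or digit is followed
    # by an uppercase letter.
    return re.sub(r"(?<=[a-z0-9])(?=[A-Z])", " ", body)


def squeeze_repeats(word: str, max_run: int = 2) -> str:
    """Collapse runs of one character longer than max_run down to max_run."""
    pattern = r"(.)\1{%d,}" % max_run
    return re.sub(pattern, lambda m: m.group(1) * max_run, word)


print(split_hashtag("#NotHappyAtAll"))
print(squeeze_repeats("okayyyyyyyyyy"))
```

Whether such normalization helps depends on the model: a tokenizer trained on social media text may already handle these patterns, so it is worth comparing results with and without it.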

Integrity #3: Property-Label Correlation (Shortcut Learning)#

In the “Passed” tab we can see the Property-Label Correlation check, which verifies that the data does not contain any shortcuts the model can fixate on during the learning process. In our case, there is no indication that this problem exists in our dataset. For more information about shortcut learning see: https://towardsdatascience.com/shortcut-learning-how-and-why-models-cheat-1b37575a159

Train Test Validation#

The next suite, the train_test_validation suite, serves to validate our split and compare the two datasets. The two datasets can be your training and validation/test sets, in which case you’d want to run this suite after the split was made but before training. They can also be, for example, your training and inference data, in which case the suite is useful for validating that the inference data is similar enough to the training data.

from deepchecks.nlp.suites import train_test_validation

train_test_validation().run(train, test)

Label Drift#

This check, which appears in the “Didn’t Pass” tab, shows a significant change in the label distribution: the label “optimism” is far more common in the test dataset, while the other labels declined. This happened because we split on time, so the topics covered by the tweets in the test dataset may correspond to specific trends or events that happened later. Let’s investigate!
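One simple way to quantify a shift like this is to compare the share of each label in the two datasets. The sketch below does that on made-up label lists and summarizes the difference with the total variation distance. This measure is used here only as an easy-to-read stand-in; the drift score that deepchecks reports may be computed differently. The function names are hypothetical.

```python
from collections import Counter
from typing import Dict, Sequence


def label_shares(labels: Sequence[str]) -> Dict[str, float]:
    """Fraction of samples carrying each label."""
    n = len(labels)
    if n == 0:
        return {}
    return {label: count / n for label, count in Counter(labels).items()}


def total_variation(p: Dict[str, float], q: Dict[str, float]) -> float:
    """Half the sum of absolute differences between two distributions."""
    keys = set(p) | set(q)
    return 0.5 * sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in keys)


# Made-up labels for illustration; not the real tweet emotion split.
train_labels = ["anger"] * 4 + ["happiness"] * 3 + ["optimism"] * 1 + ["sadness"] * 2
test_labels = ["anger"] * 2 + ["happiness"] * 2 + ["optimism"] * 5 + ["sadness"] * 1

train_dist = label_shares(train_labels)
test_dist = label_shares(test_labels)
for label in sorted(set(train_dist) | set(test_dist)):
    print(label, train_dist.get(label, 0.0), test_dist.get(label, 0.0))
print("total variation distance:", round(total_variation(train_dist, test_dist), 3))
```

A value of 0 means identical label distributions and 1 means no overlap at all, so the per-label table is what shows which class drives the change.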

Model Evaluation#

The suite below, the model_evaluation suite, is designed to be run after a model has been trained. It requires model predictions, which can be supplied via the relevant arguments of the run function.

train_preds, test_preds = tweet_emotion.load_precalculated_predictions(
    pred_format='predictions', as_train_test=True)
train_probas, test_probas = tweet_emotion.load_precalculated_predictions(
    pred_format='probabilities', as_train_test=True)

from deepchecks.nlp.suites import model_evaluation

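result = model_evaluation().run(train, test,
                                train_predictions=train_preds,
                                test_predictions=test_preds,
                                train_probabilities=train_probas,
                                test_probabilities=test_probas)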
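To make the relationship between the two prediction formats concrete, the sketch below derives class predictions from per-class probabilities by taking the most probable class in each row. The class order, the probability values, and the helper name `argmax_predictions` are assumptions made for this example; see Supported Tasks and Formats for the formats deepchecks actually expects.

```python
from typing import List, Sequence


def argmax_predictions(probabilities: Sequence[Sequence[float]],
                       classes: Sequence[str]) -> List[str]:
    """Pick the most probable class for each row of probabilities."""
    predictions = []
    for row in probabilities:
        if len(row) != len(classes):
            raise ValueError("each row needs one probability per class")
        # Index of the largest probability in this row.
        best = max(range(len(row)), key=row.__getitem__)
        predictions.append(classes[best])
    return predictions


classes = ["anger", "happiness", "optimism", "sadness"]  # assumed class order
probas = [
    [0.70, 0.10, 0.10, 0.10],
    [0.05, 0.15, 0.60, 0.20],
]

print(argmax_predictions(probas, classes))
```

Keeping both formats around is useful because some checks need only the predicted class, while others rely on the full probability vector.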