Text Classification Quickstart#

Deepchecks NLP tests your models during model development/research and before deploying to production. Using our testing package reduces model failures and saves test development time. In this quickstart guide, you will learn how to use the deepchecks NLP package to analyze and evaluate text classification tasks. If you are interested in a multilabel classification task, you can refer to our Multilabel Quickstart. We will cover the following steps:

Creating a TextData object and auto calculating properties
Running the built-in suites and inspecting the results
Spotlighting two interesting checks - Embeddings Drift and Under Annotated Segments

To run deepchecks for NLP, you need the following for both your train and test data:

Your text data - a list of strings, each string is a single sample (can be a sentence, paragraph, document, etc.).
Your labels - either a Text Classification label or a Token Classification label. These are not needed for checks that don’t require labels (such as the Embeddings Drift check or most data integrity checks), but are needed for many other checks.
Your model’s predictions (see Supported Tasks and Formats for info on supported formats). These are needed only for the model related checks, shown in the Model Evaluation section of this guide.

If you don’t have deepchecks installed yet:

import sys
!{sys.executable} -m pip install 'deepchecks[nlp]' -U --quiet #--user

Some properties calculated by deepchecks.nlp require additional packages to be installed. You can install them by running:

import sys
!{sys.executable} -m pip install 'deepchecks[nlp-properties]' -U --quiet #--user

Setting Up#

Load Data#

For the purpose of this guide, we’ll use a small subset of the tweet emotion dataset. This dataset contains tweets and their corresponding emotion - Anger, Happiness, Optimism, and Sadness.

from deepchecks.nlp import TextData
from deepchecks.nlp.datasets.classification import tweet_emotion

train, test = tweet_emotion.load_data(data_format='DataFrame')
train.head()

include_properties and include_embeddings are incompatible with data_format="Dataframe". loading only original text data.

	text	user_age	gender	days_on_platform	user_region	label
2	No but that's so cute. Atsu was probably shy a...	24.97	Male	2729	Middle East/Africa	happiness
3	Rooneys fucking untouchable isn't he? Been fuc...	21.66	Male	1376	Asia Pacific	anger
7	Tiller and breezy should do a collab album. Ra...	37.29	Female	3853	Americas	happiness
8	@user broadband is shocking regretting signing...	15.39	Female	1831	Europe	anger
9	@user Look at those teef! #growl	54.37	Female	4619	Europe	anger

We can see that we have the tweet text itself, the label (the emotion) and then some additional metadata columns.

Create TextData Objects#

We can now create a TextData object for the train and test dataframes. This object is used to pass your data to the deepchecks checks.

To create a TextData object, the only required argument is the text itself, but passing only the text will prevent multiple checks from running. In this example we’ll pass the label and define the task type and finally define the metadata columns (the other columns in the dataframe) which we’ll use later on in the guide.

train = TextData(train.text, label=train['label'], task_type='text_classification',
                 metadata=train.drop(columns=['label', 'text']))
test = TextData(test.text, label=test['label'], task_type='text_classification',
                metadata=test.drop(columns=['label', 'text']))

Calculating Properties#

Some of deepchecks’ checks use properties of the text samples for various calculations. Deepchecks has a wide variety of such properties, some simple and some that rely on external models and are heavier to run. In order for deepchecks’ checks to be able to access the properties, they must be stored within the TextData object. You can read more about properties in the Property Guide.

# properties can be either calculated directly by Deepchecks
# or imported from other sources in appropriate format

# import torch
# device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
# train.calculate_builtin_properties(
#   include_long_calculation_properties=True, device=device
# )
# test.calculate_builtin_properties(
#   include_long_calculation_properties=True,  device=device
# )

In this example though we’ll use pre-calculated properties:

train_properties, test_properties = tweet_emotion.load_properties()

train.set_properties(train_properties, categorical_properties=['Language'])
test.set_properties(test_properties, categorical_properties=['Language'])

train.properties.head(2)

	Text Length	Average Word Length	Max Word Length	% Special Characters	Language	Sentiment	Subjectivity	Toxicity	Fluency	Formality
0	94	4.277778	8	0.021277	en	0.0	0.75	0.009497	0.349153	0.204132
1	102	6.923077	18	0.049020	en	-0.8	0.90	0.995803	0.176892	0.036639
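
To make the simpler properties concrete, here is a minimal standard-library sketch that computes four of the columns shown above for a single string. The function name `simple_text_properties` is hypothetical, and the definition of a "special" character used here (anything that is not alphanumeric, whitespace, or ASCII punctuation) is an assumption for illustration. Deepchecks' built-in properties may be defined differently.

```python
import string


def simple_text_properties(text: str) -> dict:
    """Compute a few simple, model-free properties of one text sample."""
    words = text.split()
    word_lengths = [len(word) for word in words]
    # Assumption: a "special" character is neither alphanumeric, whitespace,
    # nor ASCII punctuation (so emojis count as special).
    special = [
        ch for ch in text
        if not (ch.isalnum() or ch.isspace() or ch in string.punctuation)
    ]
    return {
        "Text Length": len(text),
        "Average Word Length": sum(word_lengths) / len(word_lengths) if words else 0.0,
        "Max Word Length": max(word_lengths, default=0),
        "% Special Characters": len(special) / len(text) if text else 0.0,
    }


# A short tweet-like sample ending in three emoji characters.
example = simple_text_properties("No sober weekend \U0001F642\U0001F642\U0001F642")
print(example)
```

Properties such as Toxicity, Fluency, or Formality rely on external models and cannot be reproduced with a few lines like this.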

Running the Deepchecks Default Suites#

Deepchecks comes with a set of pre-built suites that can be used to run a set of checks on your data, along with their default conditions and thresholds. You can read more about customizing and creating your own suites in the Customizations Guide. In this guide we’ll be using 3 suites - the data integrity suite, the train test validation suite and the model evaluation suite. You can also run all the checks at once using the full_suite.

Data Integrity#

We will start by running a preliminary integrity check to validate the text formatting. It is recommended to do this step before model training, as the results may indicate that additional data engineering is required.

We’ll do that using the data_integrity pre-built suite.

from deepchecks.nlp.suites import data_integrity

data_integrity_suite = data_integrity()
data_integrity_suite.run(train, test)

Data Integrity Suite

Status	Check	Condition	More Info
✖	Text Property Outliers - Train Dataset	Outlier ratio in all properties is less or equal than 5%	Found 2 properties with outlier ratios above threshold. Property with highest ratio is Toxicity with outlier ratio of 16%
✖	Text Property Outliers - Test Dataset	Outlier ratio in all properties is less or equal than 5%	Found 1 properties with outlier ratios above threshold. Property with highest ratio is Toxicity with outlier ratio of 16.43%
✖	Unknown Tokens - Train Dataset	Ratio of unknown words is less than 0%	Ratio was 0.79%
✖	Unknown Tokens - Test Dataset	Ratio of unknown words is less than 0%	Ratio was 0.68%

Conditions Summary

Status	Condition	More Info
✖	Outlier ratio in all properties is less or equal than 5%	Found 2 properties with outlier ratios above threshold. Property with highest ratio is Toxicity with outlier ratio of 16%

More Info	Properties
No outliers found.	Text Length, Subjectivity, Fluency
Outliers found but not shown in graphs (n_show_top=5).	Average Word Length, % Special Characters

Conditions Summary

Status	Condition	More Info
✖	Outlier ratio in all properties is less or equal than 5%	Found 1 properties with outlier ratios above threshold. Property with highest ratio is Toxicity with outlier ratio of 16.43%

More Info	Properties
No outliers found.	Text Length, Subjectivity, Fluency
Outliers found but not shown in graphs (n_show_top=5).	Average Word Length, % Special Characters

Status	Check	Condition	More Info
✓	Under Annotated Property Segments - Train Dataset	The relative performance of weakest segment is greater than 80% of average model performance.	Under annotated properties segments check is skipped since your data annotation ratio is > 95.0%. Try increasing the annotation_ratio_threshold parameter.
✓	Under Annotated Property Segments - Test Dataset	The relative performance of weakest segment is greater than 80% of average model performance.	Under annotated properties segments check is skipped since your data annotation ratio is > 95.0%. Try increasing the annotation_ratio_threshold parameter.
✓	Under Annotated Meta Data Segments - Train Dataset	The relative performance of weakest segment is greater than 80% of average model performance.	Under annotated metadata segments check is skipped since your data annotation ratio is > 95.0%. Try increasing the annotation_ratio_threshold parameter.
✓	Under Annotated Meta Data Segments - Test Dataset	The relative performance of weakest segment is greater than 80% of average model performance.	Under annotated metadata segments check is skipped since your data annotation ratio is > 95.0%. Try increasing the annotation_ratio_threshold parameter.
✓	Property-Label Correlation - Train Dataset	Properties' Predictive Power Score is less than 0.3	Passed for 10 relevant columns
✓	Property-Label Correlation - Test Dataset	Properties' Predictive Power Score is less than 0.3	Passed for 10 relevant columns
✓	Text Duplicates - Train Dataset	Duplicate data ratio is less or equal to 5%	Found 0.04% duplicate data
✓	Text Duplicates - Test Dataset	Duplicate data ratio is less or equal to 5%	Found 0.05% duplicate data
✓	Special Characters - Train Dataset	Ratio of samples containing more than 20% special characters is below 5%	Found 1 samples with special char ratio above threshold
✓	Special Characters - Test Dataset	Ratio of samples containing more than 20% special characters is below 5%	Found 3 samples with special char ratio above threshold
✓	Conflicting Labels - Train Dataset	Ambiguous sample ratio is less or equal to 0%	Ratio of samples with conflicting labels: 0%
✓	Conflicting Labels - Test Dataset	Ambiguous sample ratio is less or equal to 0%	Ratio of samples with conflicting labels: 0%
✓	Frequent Substrings - Train Dataset	No more than 1 substrings with ratio above 0.05	Found 0 substrings with ratio above threshold
✓	Frequent Substrings - Test Dataset	No more than 1 substrings with ratio above 0.05	Found 0 substrings with ratio above threshold

Conditions Summary

Status	Condition	More Info
✓	Duplicate data ratio is less or equal to 5%	Found 0.04% duplicate data

Text	Sample IDs	Number of Samples
Blood is boiling	427, 884	2

Conditions Summary

Status	Condition	More Info
✓	Duplicate data ratio is less or equal to 5%	Found 0.05% duplicate data

Text	Sample IDs	Number of Samples
A good head and a good heart a...	726, 1774	2

Conditions Summary

Status	Condition	More Info
✓	Ratio of samples containing more than 20% special characters is below 5%	Found 1 samples with special char ratio above threshold

Sample ID	% of Special Characters	Special Characters	Text
2585	0.24	['😂', '🏽', '☀', '️', '🌥']	@user This is ☀️this is clouded sky 🌥😂😂😂👍🏽😁👏🏽
1980	0.18	['😏', '😒', '😠', '😡', '😤']	She doesn't know how to smile! So be it! #pissed 😏😒😠😡😤👀👄👊👎🙍💔
439	0.18	['😭', '😂']	@user @user 😂😂😭😭😭 resentment
616	0.15	['🙂']	No sober weekend 🙂🙂🙂
2203	0.14	['😄', '☔', '️']	@user But smiling 😄☔️

Conditions Summary

Status	Condition	More Info
✓	Ratio of samples containing more than 20% special characters is below 5%	Found 3 samples with special char ratio above threshold

Sample ID	% of Special Characters	Special Characters	Text
420	0.52	['✨']	Trying to think positive, and not let this situation discourage me ✨✨✨✨✨✨✨✨✨✨✨✨✨✨✨✨✨✨✨✨✨✨✨✨✨✨✨✨✨✨✨✨✨
127	0.24	['️', '☹', '💘']	@user awe man, when are you free then? ☹️️☹️️☹️️💘💘💘
1649	0.20	['🤦', '🏻', '\u200d', '♀', '️']	Tuesday night and the wine is coming out. Just got home from work if that explains it. 🤦🏻‍♀️🤦🏻‍♀️🤦🏻‍
789	0.17	['╯', '°', '┻', '□', '）']	He's seriously so frustrating sometimes! (╯°□°）╯︵ ┻━┻ #ugh
624	0.17	['😏', '😂', '😭']	Forever raging 😏😂😭

Integrity #1: Unknown Tokens#

First up (in the “Didn’t Pass” tab) we see that the Unknown Tokens check has returned a problem.

Looking at the result, we can see that it assumed (by default) that we’re going to use the bert-base-uncased tokenizer for our NLP model, and that if that’s the case there are many words in the dataset that contain characters (such as emojis, or Korean characters) that are unrecognized by the tokenizer. This is an important insight, as BERT tokenizers are very common. You can configure the tokenizer used by this check by passing the tokenizer to the check’s constructor, and can also configure the threshold for the percent of unknown tokens allowed by modifying the check’s condition.
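
The idea behind the reported ratio can be illustrated with a small standard-library sketch. It is not the deepchecks implementation: the real check relies on a tokenizer, whereas this sketch assumes a plain whitespace split and a toy vocabulary set. The function name `unknown_word_ratio` and the sample data are hypothetical.

```python
def unknown_word_ratio(texts, vocabulary):
    """Return the share of whitespace-separated words missing from `vocabulary`."""
    total = 0
    unknown = 0
    for text in texts:
        for word in text.lower().split():
            total += 1
            if word not in vocabulary:
                unknown += 1
    return unknown / total if total else 0.0


# Toy vocabulary and samples (illustrative only).
toy_vocabulary = {"look", "at", "those", "no", "sober", "weekend"}
toy_texts = ["look at those teef", "no sober weekend \U0001F642"]

ratio = unknown_word_ratio(toy_texts, toy_vocabulary)
print(f"Ratio of unknown words: {ratio:.2%}")
```

With a condition that requires the ratio to be 0%, as in the suite output above, any sample containing an emoji or a misspelled word is enough to fail it.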

Integrity #2: Text Outliers#

In the “Didn’t Pass” tab, by looking at the Text Property Outliers check result we can derive several insights by hovering over the different values and inspecting the outlier texts:

Hashtags (‘#…’) are usually several words written together without spaces - we might consider splitting them before feeding the tweet to a model.
In some instances users deliberately misspell words, for example ‘!’ instead of the letter ‘l’ or ‘okayyyyyyyyyy’.
The majority of the data is in English but not all. If we want a classifier that is multilingual we should collect more data, otherwise we may consider dropping tweets in other languages from our dataset before training our model.
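
As one possible way to act on the hashtag observation, the sketch below splits CamelCase hashtags into separate words using only the standard library. The helper names `split_hashtag` and `expand_hashtags` are hypothetical. The regular expression handles ASCII letters and digits only, and an all-lowercase hashtag such as `#growl` is left as a single word because there is no casing to split on.

```python
import re

# Matches a capitalized or lowercase word, a run of capitals (acronym), or digits.
_WORD = re.compile(r"[A-Z]?[a-z]+|[A-Z]+(?![a-z])|\d+")


def split_hashtag(tag: str) -> list:
    """Split a CamelCase hashtag such as '#NoSoberWeekend' into its words."""
    return _WORD.findall(tag.lstrip("#"))


def expand_hashtags(text: str) -> str:
    """Replace every hashtag in `text` with its space-separated words."""
    def _replace(match):
        words = split_hashtag(match.group(1))
        # Keep the original hashtag if nothing could be split out of it.
        return " ".join(words) if words else match.group(0)

    return re.sub(r"#(\w+)", _replace, text)


print(expand_hashtags("So tired #NoSoberWeekend #ugh"))
```

Whether splitting hashtags helps depends on the model and tokenizer, so it is worth comparing results with and without this preprocessing step.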

Integrity #3: Property-Label Correlation (Shortcut Learning)#

In the “Passed” tab we can see the Property-Label Correlation check, which verifies that the data does not contain any shortcuts the model can fixate on during the learning process. In our case we can see no indication that this problem exists in our dataset. For more information about shortcut learning see: https://towardsdatascience.com/shortcut-learning-how-and-why-models-cheat-1b37575a159

Train Test Validation#

The next suite, the train_test_validation suite, serves to validate our split and compare the two datasets. These splits can be either your training and val / test sets, in which case you’d want to run this suite after the split was made but before training, or for example your training and inference data, in which case the suite is useful for validating that the inference data is similar enough to the training data.

from deepchecks.nlp.suites import train_test_validation

train_test_validation().run(train, test)

Train Test Validation Suite

Status	Check	Condition	More Info
✖	Label Drift	Label drift score < 0.15	Label's drift score Cramer's V is 0.22

Status	Check	Condition	More Info
✓	Property Drift	categorical drift score < 0.2 and numerical drift score < 0.2	Passed for 10 columns out of 10 columns. Found column "Language" has the highest categorical drift score: 9.17E-3 Found column "Formality" has the highest numerical drift score: 0.08
✓	Train Test Samples Mix	Percentage of test data samples that appear in train data is less or equal to 5%	No samples mix found

Check	Reason
Text Embeddings Drift	Functionality requires embeddings, but the the TextData object had none. To use this functionality, use the set_embeddings method to set your own embeddings with a numpy.array or use TextData.calculate_builtin_embeddings to add the default deepchecks embeddings.

Label Drift#

This check, appearing in the “Didn’t Pass” tab, lets us see that we have some significant change in the distribution of the label - the label “optimism” is suddenly way more common in the test dataset, while other labels declined. This happened because we split on time, so the topics covered by the tweets in the test dataset may correspond to specific trends or events that happened later in time. Let’s investigate!
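
The drift score reported above is a Cramer's V value. As a rough illustration of what that number measures, the sketch below computes an uncorrected Cramer's V between the label distributions of two datasets, using a 2 x K contingency table (dataset by label). The function name `cramers_v` is hypothetical, and deepchecks may apply corrections or other handling that this sketch does not.

```python
from collections import Counter
from math import sqrt


def cramers_v(train_labels, test_labels):
    """Uncorrected Cramer's V between dataset membership and label value."""
    train_counts, test_counts = Counter(train_labels), Counter(test_labels)
    categories = sorted(set(train_counts) | set(test_counts))
    n_train, n_test = len(train_labels), len(test_labels)
    n = n_train + n_test
    if n_train == 0 or n_test == 0 or len(categories) < 2:
        return 0.0

    chi2 = 0.0
    for category in categories:
        column_total = train_counts[category] + test_counts[category]
        for observed, row_total in (
            (train_counts[category], n_train),
            (test_counts[category], n_test),
        ):
            expected = row_total * column_total / n
            chi2 += (observed - expected) ** 2 / expected

    # With two rows (train and test), min(rows - 1, columns - 1) is 1.
    return sqrt(chi2 / n)


same = cramers_v(["anger"] * 50 + ["optimism"] * 50, ["anger"] * 50 + ["optimism"] * 50)
disjoint = cramers_v(["anger"] * 10, ["optimism"] * 10)
print(same, disjoint)
```

A value of 0 means the label distributions are identical, and a value of 1 means the two datasets share no labels at all.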

Model Evaluation#

The suite below, the model_evaluation suite, is designed to be run after a model has been trained and requires model predictions which can be supplied via the relevant arguments in the run function.

train_preds, test_preds = tweet_emotion.load_precalculated_predictions(
    pred_format='predictions', as_train_test=True)
train_probas, test_probas = tweet_emotion.load_precalculated_predictions(
    pred_format='probabilities', as_train_test=True)

from deepchecks.nlp.suites import model_evaluation

result = model_evaluation().run(train, test,
                                train_predictions=train_preds,
                                test_predictions=test_preds,
                                train_probabilities=train_probas,
                                test_probabilities=test_probas)
result.show()

Model Evaluation Suite

Status	Check	Condition	More Info
✖	Train Test Performance	Train-Test scores relative degradation is less than 0.1	10 scores failed. Found max degradation of 72.96% for metric Recall and class optimism.
!	Property Segments Performance - Test Dataset	The relative performance of weakest segment is greater than 80% of average model performance.	Found a segment with accuracy score of 0.525 in comparison to an average score of 0.708 in sampled data.
!	Metadata Segments Performance - Test Dataset	The relative performance of weakest segment is greater than 80% of average model performance.	Found a segment with accuracy score of 0.305 in comparison to an average score of 0.708 in sampled data.

Status	Check	Condition	More Info
✓	Prediction Drift	Prediction drift score < 0.15	Found model prediction Cramer's V drift score of 0.04
✓	Property Segments Performance - Train Dataset	The relative performance of weakest segment is greater than 80% of average model performance.	WeakSegmentsPerformance was unable to train an error model to find weak segments.Try supplying additional properties.
✓	Metadata Segments Performance - Train Dataset	The relative performance of weakest segment is greater than 80% of average model performance.	WeakSegmentsPerformance was unable to train an error model to find weak segments.Try supplying additional metadata.

OK! We have many important issues being surfaced by this suite. Let’s dive into the individual checks:

Model Eval #1: Train Test Performance#

We can immediately see in the “Didn’t Pass” tab that there has been significant degradation in the Recall on class “optimism”. This is very likely a result of the severe label drift we saw after running the previous suite.
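
To show what a relative degradation figure means, here is a minimal sketch that computes per-class recall and the relative drop from train to test. It assumes degradation is defined as (train score - test score) / train score, which matches the wording of the condition but is not taken from the deepchecks source. The function names and the toy scores are hypothetical.

```python
def recall_per_class(y_true, y_pred):
    """Recall for each class that appears in `y_true`."""
    recalls = {}
    for cls in sorted(set(y_true)):
        support = sum(1 for truth in y_true if truth == cls)
        hits = sum(1 for truth, pred in zip(y_true, y_pred) if truth == cls and pred == cls)
        recalls[cls] = hits / support
    return recalls


def relative_degradation(train_score, test_score):
    """Relative drop from the train score to the test score (assumed definition)."""
    if train_score == 0:
        return 0.0
    return (train_score - test_score) / train_score


# Toy scores (illustrative only): recall falls from 0.8 on train to 0.2 on test.
drop = relative_degradation(0.8, 0.2)
print(f"Relative degradation: {drop:.0%}")
print(recall_per_class(["a", "a", "b"], ["a", "b", "b"]))
```

Under this definition, a threshold of 0.1 means a metric may lose at most 10% of its train value before the condition fails.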

Model Eval #2: Segment Performance#

Also in the “Didn’t Pass” tab we can see the two segment performance checks - Property Segment Performance and Metadata Segment Performance. These use the metadata columns of user related information OR our calculated properties to try and automatically detect significant data segments on which our model performs badly.

In this case we can see that both checks have found issues in the test dataset:

The Property Segment Performance check has found that we’re getting very poor results on low toxicity samples. That probably means that our model is using the toxicity of the text to infer the “anger” label, and is having a harder time with other, more benign text samples.
The Metadata Segment Performance check has found that we have trouble predicting correct results for new users from the Americas. That’s 5% of our dataset, so we’d better investigate that further.

You’ll note that these two issues occur only in the test data, and so the results of these checks for the training data appear in the “Passed” tab.
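
The comparison behind the segment condition can be sketched in a simplified form. The version below only groups samples by a single categorical column and compares each group's accuracy to the overall accuracy. The deepchecks checks search for segments automatically and can combine several features, which this sketch does not attempt. All names and data here are hypothetical.

```python
from collections import defaultdict


def accuracy_by_segment(segments, y_true, y_pred):
    """Accuracy of the predictions within each segment value."""
    hits = defaultdict(int)
    counts = defaultdict(int)
    for segment, truth, pred in zip(segments, y_true, y_pred):
        counts[segment] += 1
        hits[segment] += int(truth == pred)
    return {segment: hits[segment] / counts[segment] for segment in counts}


def weakest_segment(segments, y_true, y_pred):
    """Return (segment, segment accuracy, overall accuracy) for the worst segment."""
    per_segment = accuracy_by_segment(segments, y_true, y_pred)
    overall = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
    worst = min(per_segment, key=per_segment.get)
    return worst, per_segment[worst], overall


# Toy data (illustrative only).
regions = ["Americas", "Americas", "Europe", "Europe"]
truth = ["anger", "happiness", "anger", "anger"]
preds = ["anger", "anger", "anger", "anger"]

segment, segment_accuracy, overall_accuracy = weakest_segment(regions, truth, preds)
passes = segment_accuracy / overall_accuracy > 0.8
print(segment, segment_accuracy, overall_accuracy, passes)
```

In this toy example the weakest segment reaches about 67% of the overall accuracy, so a condition requiring more than 80% would not pass.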

Model Eval #3: Prediction Drift#

We note that the Prediction Drift (here in the “Passed” tab) shows no issue. Given that we already know that there is significant Label Drift, this means we have Concept Drift - the labels corresponding to our samples have changed, while the model continues to predict the same labels. You can learn more about the different types of drift and how deepchecks detects them in our Drift Guide.

Running Individual Checks#

Checks can also be run individually. In this section, we’ll show two of the more interesting checks and how you can run them stand-alone and add conditions to them. You can learn more about customizing suites, checks and conditions in our Customizations Guide.

Embeddings Drift#

In order to run the Embeddings Drift check you must have text embeddings loaded to both datasets. You can read more about using embeddings in deepchecks NLP in our Embeddings Guide.

In this example, we have the embeddings already pre-calculated:

from deepchecks.nlp.datasets.classification.tweet_emotion import load_embeddings

train_embeddings, test_embeddings = load_embeddings()

train.set_embeddings(train_embeddings)
test.set_embeddings(test_embeddings)

You can also calculate the embeddings using deepchecks, either using an open-source sentence-transformer or using OpenAI’s embedding API.

# train.calculate_builtin_embeddings()
# test.calculate_builtin_embeddings()

from deepchecks.nlp.checks import TextEmbeddingsDrift

check = TextEmbeddingsDrift()
res = check.run(train, test)
res.show()

n_jobs value -1 overridden to 1 by setting random_state. Use no seed for parallelism.
n_jobs value -1 overridden to 1 by setting random_state. Use no seed for parallelism.

Embeddings Drift

Here we can see some clusters that distinctly contain more samples from train or more samples from test. For example, if we look at the greenish cluster in the middle (by hovering on the samples and reading the tweets) we see it’s full of inspirational quotes and sayings, and belongs mostly to the test dataset. That is the source of the drastic increase in optimistic labels!

There are of course also other noteworthy clusters, such as the greenish cluster on the right that contains tweets about a terror attack in Bangladesh, which belongs solely to the test data.

Under Annotated Segments#

Another noteworthy check is the Under Annotated Segments check, which explores our data and automatically identifies segments where the data is under-annotated - meaning that the ratio of missing labels is higher. To this check we’ll also add a condition that will fail if an under-annotated segment of significant size is found.

from deepchecks.nlp.checks import UnderAnnotatedPropertySegments
test_under = tweet_emotion.load_under_annotated_data()

check = UnderAnnotatedPropertySegments(
    segment_minimum_size_ratio=0.1
).add_condition_segments_relative_performance_greater_than()

check.run(test_under)

Under Annotated Property Segments

Conditions Summary

Status	Condition	More Info
!	The relative performance of weakest segment is greater than 80% of average model performance.	Found a segment with annotation ratio of 0.565 in comparison to an average score of 0.899 in sampled data.

For example, here the check detected that we have a lot of missing annotations for samples that are informal and not very fluent. Could it be that our annotators have a problem annotating these samples and prefer not to deal with them? If these samples are important for us, we may have to put special focus on annotating this segment.
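
The annotation ratio itself is simple to compute once the segments are known. The sketch below assumes missing labels are represented as `None` and that each sample has already been assigned to a named segment. Both are assumptions for illustration, since the deepchecks check discovers the segments on its own. The function name and data are hypothetical.

```python
from collections import defaultdict


def annotation_ratio_by_segment(segments, labels):
    """Share of samples with a label (not None) within each segment."""
    annotated = defaultdict(int)
    counts = defaultdict(int)
    for segment, label in zip(segments, labels):
        counts[segment] += 1
        annotated[segment] += int(label is not None)
    return {segment: annotated[segment] / counts[segment] for segment in counts}


# Toy data (illustrative only): one informal sample is missing its label.
style = ["informal", "informal", "formal", "formal"]
labels = [None, "anger", "sadness", "happiness"]

ratios = annotation_ratio_by_segment(style, labels)
print(ratios)
```

A segment whose ratio falls well below the dataset average is a candidate for targeted annotation work.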

Note

You can find the full list of available NLP checks in the nlp.checks API documentation.
