The TextData Object#

The TextData object is a container for your textual data, labels, and relevant metadata for NLP tasks, and is the basic building block of the deepchecks.nlp subpackage. To use any functionality of the deepchecks.nlp subpackage, you first need to create a TextData object. The TextData object provides easy access to the metadata, embeddings, and properties relevant for training and validating ML models.

Class Properties#

The main properties are:

  • raw_text - The raw text data, a list of strings representing the raw text of each sample. Each sample can be a sentence, paragraph, or a document, depending on the task.

  • label - The labels for the text data samples.

  • task_type - The task type, which must be one of text_classification, token_classification, or None. See the Supported Tasks Guide for more information about supported formats.

TextData API Reference#


TextData wraps together the raw text data and the labels for the NLP task.

Creating a TextData#

The default TextData constructor expects a sequence of raw text strings or a sequence of tokenized text. The rest of the arguments are optional, but if you have labels for your data you should pass them to the constructor, since many checks require the dataset labels in order to run.

Defining task_type

If you define labels, you must also define the task_type so deepchecks will know how to parse the labels.

>>> from deepchecks.nlp import TextData
>>> raw_text = ["This is an example.", "Another example here."]
>>> labels = ["positive", "negative"]
>>> task_type = "text_classification"
>>> text_data = TextData(raw_text=raw_text, label=labels, task_type=task_type)

Tokenized Text#

If you have tokenized text, you can also create a TextData object from it rather than using the raw_text argument:

>>> # A tokenized example with named entities and locations
>>> tokenized_text = [["Dan", "lives", "in", "New", "York", "."], ["He", "works", "at", "Google", "."]]
>>> labels = [["B-PER", "O", "O", "B-LOC", "I-LOC", "O"], ["O", "O", "O", "B-ORG", "O"]]
>>> # Per-token labels require the token classification task type
>>> text_data = TextData(tokenized_text=tokenized_text, label=labels, task_type="token_classification")
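
In the example above, each sample has exactly one label per token. As a minimal sketch of a sanity check you might run before building a token classification TextData, the snippet below compares token and label counts for each sample. The helper name check_token_label_alignment is hypothetical and is not part of deepchecks.

```python
def check_token_label_alignment(tokenized_text, labels):
    """Return the indices of samples whose token and label counts differ.

    Hypothetical helper for illustration only; not a deepchecks function.
    """
    if len(tokenized_text) != len(labels):
        raise ValueError(
            f"Got {len(tokenized_text)} samples but {len(labels)} label sequences"
        )
    return [
        i
        for i, (tokens, sample_labels) in enumerate(zip(tokenized_text, labels))
        if len(tokens) != len(sample_labels)
    ]


tokenized_text = [["Dan", "lives", "in", "New", "York", "."], ["He", "works", "at", "Google", "."]]
labels = [["B-PER", "O", "O", "B-LOC", "I-LOC", "O"], ["O", "O", "O", "B-ORG", "O"]]

# An empty list means every token in every sample has a matching label.
misaligned = check_token_label_alignment(tokenized_text, labels)
print(misaligned)  # []
```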

If you’re running deepchecks on a token classification task, it is recommended to use the tokenized_text argument instead of the raw_text argument. If you pass raw_text to the constructor instead, deepchecks will break the text into tokens for you, using Python's default str.split() method.
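
To see why passing pre-tokenized text can matter, the short sketch below uses only Python's built-in str.split() on a raw-text version of the first tokenized sample. It illustrates plain whitespace splitting and does not call deepchecks; the raw_text strings are assumed for the example.

```python
# Raw-text versions of the tokenized samples shown earlier (assumed for illustration).
raw_text = ["Dan lives in New York.", "He works at Google."]

# str.split() with no arguments splits on runs of whitespace only,
# so punctuation stays attached to the preceding word.
tokens = [sample.split() for sample in raw_text]

print(tokens[0])       # ['Dan', 'lives', 'in', 'New', 'York.']
print(len(tokens[0]))  # 5, while the hand-tokenized sample has 6 tokens
```

Because "York." is kept as a single token, per-token labels written for the six-token version would no longer line up with the whitespace-split tokens.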

Useful Functions#

Describe data#

The describe() function is a great way to get a quick overview of your dataset. Calling the function will display the label distribution, the distribution of the calculated text properties and statistical information. You can use the function in the following way:

>>> text_data.describe()

Calculate Default Properties#

The calculate_builtin_properties function calculates text properties for the TextData object. If you pass the include_properties parameter, only the properties you list are calculated; if you pass the ignore_properties parameter, all default properties except the ones you list are calculated. To calculate all the default properties, call the function without either parameter:

>>> text_data.calculate_builtin_properties()

To learn more about how deepchecks uses properties and how you can calculate or set them yourself, see the Text Properties Guide.

Add Metadata#

You can add metadata to the TextData object:

>>> text_data.set_metadata(metadata_df, categorical_metadata_columns)

To learn more about how deepchecks uses metadata, see the Text Metadata Guide.
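
The set_metadata call above uses two names that are not defined on this page. As a rough sketch of what they could look like, the snippet below builds a small pandas DataFrame with one row per sample and lists which of its columns are categorical. The column names and values are hypothetical, and the assumption that metadata rows must line up one-to-one with the samples should be checked against the Text Metadata Guide.

```python
import pandas as pd

raw_text = ["This is an example.", "Another example here."]

# Hypothetical metadata: one row per sample, in the same order as raw_text.
metadata_df = pd.DataFrame(
    {
        "source": ["forum", "email"],  # categorical column (made-up values)
        "user_age": [34, 27],          # numeric column (made-up values)
    }
)

# Names of the metadata columns that should be treated as categorical.
categorical_metadata_columns = ["source"]

# Assumed requirement: exactly one metadata row per text sample.
assert len(metadata_df) == len(raw_text)

# With a TextData object built from raw_text, the call shown above would then be:
# text_data.set_metadata(metadata_df, categorical_metadata_columns)
```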


You can sample a subset of the TextData object:

>>> text_data.sample(10000)

Working with Class Parameters#

You can work directly with the TextData object to inspect its raw text, tokenized text, and labels:

>>> text_data.raw_text
["This is an example.", "Another example here."]
>>> text_data.tokenized_text
[["This", "is", "an", "example."], ["Another", "example", "here."]]
>>> text_data.label
["positive", "negative"]

Get its internal metadata and properties DataFrames:

>>> text_data.metadata