The TextData Object#
The TextData is a container for your textual data, labels, and relevant metadata for NLP tasks, and is a basic building block in the deepchecks.nlp subpackage. In order to use any functionality of the deepchecks.nlp subpackage, you first need to create a TextData object. The TextData object enables easy access to metadata, embeddings, and properties relevant for training and validating ML models.
Class Properties#
The main properties are:
raw_text - The raw text data, a list of strings representing the raw text of each sample. Each sample can be a sentence, paragraph, or a document, depending on the task.
label - The labels for the text data samples.
task_type - The task type; must be either text_classification, token_classification, or None. See the Supported Tasks Guide for more information about the supported formats.
TextData API Reference#
TextData - Wraps together the raw text data and the labels for the NLP task.
Creating a TextData#
The default TextData constructor expects to get a sequence of raw text strings or tokenized text. The rest of the arguments are optional, but if you have labels for your data you should define them in the constructor, as many checks require the dataset labels in order to run.
Defining task_type
If you define labels, you must also define the task_type so that deepchecks will know how to parse the labels.
>>> from deepchecks.nlp import TextData
>>> raw_text = ["This is an example.", "Another example here."]
>>> labels = ["positive", "negative"]
>>> task_type = "text_classification"
>>> text_data = TextData(raw_text=raw_text, label=labels, task_type=task_type)
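A quick sanity check that the object was created with the expected number of samples (this sketch assumes TextData's n_samples property):
>>> text_data.n_samples
2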
Tokenized Text#
If you have tokenized text, you can also create a TextData object from it, using the tokenized_text argument rather than the raw_text argument:
>>> # A tokenized example with named entities and locations
>>> tokenized_text = [["Dan", "lives", "in", "New", "York", "."], ["He", "works", "at", "Google", "."]]
>>> labels = [["B-PER", "O", "O", "B-LOC", "I-LOC", "O"], ["O", "O", "O", "B-ORG", "O"]]
>>> text_data = TextData(tokenized_text=tokenized_text, label=labels, task_type="token_classification")
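In token classification each token must receive exactly one tag, so the label sequences must align with the token sequences; a quick illustrative check:
>>> all(len(tokens) == len(tags) for tokens, tags in zip(tokenized_text, labels))
True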
If you’re running deepchecks on a token classification task, it is recommended to use the tokenized_text argument instead of the raw_text argument. If you pass raw_text to the constructor, deepchecks will break the text into tokens for you using Python's default str.split() method.
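For reference, the default str.split() tokenization is simple whitespace splitting:
>>> "Dan lives in New York .".split()
['Dan', 'lives', 'in', 'New', 'York', '.']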
Useful Functions#
Describe data#
The describe() function is a great way to get a quick overview of your dataset. Calling the function will display the label distribution, the distribution of the calculated text properties, and statistical information. You can use the function in the following way:
>>> text_data.describe()
Calculate Default Properties#
To calculate all the default properties, call calculate_builtin_properties without passing the include_properties parameter. If you pass either the include_properties or the ignore_properties parameter, then only the specified properties will be calculated or ignored, respectively. You can calculate the default text properties for the TextData object using:
>>> text_data.calculate_builtin_properties()
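For example, to calculate only a chosen subset of properties (the property names below assume deepchecks' built-in property list; see the guide linked below for the full list):
>>> text_data.calculate_builtin_properties(include_properties=["Text Length", "Sentiment"])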
To learn more about how deepchecks uses properties and how you can calculate or set them yourself, see the Text Properties Guide.
Add Metadata#
You can add metadata to the TextData object:
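>>> # Illustrative setup (hypothetical column names): the metadata is a
>>> # pandas DataFrame with one row per text sample
>>> import pandas as pd
>>> metadata_df = pd.DataFrame({
...     "source": ["twitter", "reddit"],
...     "user_age": [31, 24],
... })
>>> categorical_metadata_columns = ["source"]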
>>> text_data.set_metadata(metadata_df, categorical_metadata_columns)
To learn more about how deepchecks uses metadata, see the Text Metadata Guide.
Sample#
You can sample a subset of the TextData object:
>>> text_data.sample(10000)
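Since sample() returns a new TextData object rather than sampling in place, you will usually want to keep the returned object:
>>> # the original text_data is unchanged
>>> sampled_data = text_data.sample(10000)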
Working with Class Parameters#
You can work directly with the TextData object to inspect its defined raw text, tokenized text, and label:
>>> text_data.raw_text
["This is an example.", "Another example here."]
>>> text_data.tokenized_text
[["This", "is", "an", "example."], ["Another", "example", "here."]]
>>> text_data.label
["positive", "negative"]
Get its internal metadata and properties DataFrames:
>>> text_data.metadata  # available after set_metadata() has been called
>>> text_data.properties  # available after properties were calculated or set