TextData

class TextData

TextData wraps the raw text data together with the labels for an NLP task.

The TextData class holds metadata, along with methods for easily accessing the metadata relevant to training or validating ML models.

Parameters
raw_text : t.Sequence[str], default: None

The raw text data, a sequence of strings representing the raw text of each sample. If not given, tokenized_text must be given, and raw_text will be created from it by joining the tokens with spaces.

tokenized_text : t.Sequence[t.Sequence[str]], default: None

The tokenized text data, a sequence of sequences of strings representing the tokenized text of each sample. Only relevant for task_type ‘token_classification’. If not given, raw_text must be given, and tokenized_text will be created from it by splitting the text by spaces.
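The fallback between raw_text and tokenized_text described above can be sketched in plain Python. This is only an illustration of the documented behavior, not the library's code: `resolve_text` is a hypothetical helper, and splitting on literal single spaces is an assumption based on the wording "splitting the text by spaces".

```python
from typing import List, Optional, Sequence, Tuple


def resolve_text(
    raw_text: Optional[Sequence[str]] = None,
    tokenized_text: Optional[Sequence[Sequence[str]]] = None,
) -> Tuple[List[str], List[List[str]]]:
    """Hypothetical helper: derive whichever text form was not supplied."""
    if raw_text is None and tokenized_text is None:
        raise ValueError("Either raw_text or tokenized_text must be given.")
    if raw_text is None:
        # Build the raw text by joining each sample's tokens with spaces.
        raw_text = [" ".join(tokens) for tokens in tokenized_text]
    if tokenized_text is None:
        # Build the tokens by splitting each sample on spaces (assumed: literal " ").
        tokenized_text = [sample.split(" ") for sample in raw_text]
    return list(raw_text), [list(tokens) for tokens in tokenized_text]


raw_from_tokens, _ = resolve_text(tokenized_text=[["Berlin", "is", "large"]])
_, tokens_from_raw = resolve_text(raw_text=["hello world"])
```

Here `raw_from_tokens` holds one joined string per sample and `tokens_from_raw` holds one token list per sample, so both forms always have the same number of samples.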

label : t.Optional[TTextLabel], default: None

The label for the text data. Can be either a text_classification label or a token_classification label. If None, the label is not set.

  • text_classification label - For text classification, the accepted label format differs between the multilabel and single-label cases. For single-label data, the label should be passed as a sequence of labels, with one entry per sample that can be either a string or an integer. For multilabel data, the label should be passed as a sequence of sequences, where the sequence for each sample is a binary vector whose i-th entry indicates whether the i-th label is present in that sample.

  • token_classification label - For token classification, the accepted label format is the IOB format or a format similar to it. The label must be a sequence of sequences of strings or integers, with each sequence corresponding to a sample in the tokenized text and having exactly the same length as that sample's tokenized text.
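To make the three label shapes concrete, here is a minimal sketch with one example of each and a small structural check per format. The helper names (`is_single_label`, `is_multilabel`, `token_labels_match`) and the sample data are hypothetical; they only mirror the rules stated above and are not part of the TextData API.

```python
from typing import Sequence


def is_single_label(label: Sequence) -> bool:
    """One entry per sample, each a string or an integer."""
    return all(isinstance(entry, (str, int)) for entry in label)


def is_multilabel(label: Sequence) -> bool:
    """One binary vector per sample, all vectors of equal length."""
    if len(label) == 0 or any(isinstance(entry, (str, int)) for entry in label):
        return False
    rows = [list(entry) for entry in label]
    width = len(rows[0])
    return all(
        len(row) == width and all(value in (0, 1) for value in row)
        for row in rows
    )


def token_labels_match(tokenized_text: Sequence[Sequence[str]], label: Sequence) -> bool:
    """One label sequence per sample, exactly as long as that sample's tokens."""
    if len(tokenized_text) != len(label):
        return False
    return all(len(tokens) == len(tags) for tokens, tags in zip(tokenized_text, label))


single_label = ["positive", "negative", "positive"]   # text_classification, single label
multilabel = [[1, 0, 0], [0, 1, 1]]                   # text_classification, multilabel
tokens = [["Alice", "lives", "in", "Paris"]]          # tokenized_text for one sample
iob_label = [["B-PER", "O", "O", "B-LOC"]]            # token_classification, IOB tags
```

In the multilabel example, the second sample carries labels 1 and 2 but not label 0; in the IOB example, each tag lines up with the token at the same position.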

task_type : str, default: None

The task type for the text data. Can be either ‘text_classification’ or ‘token_classification’. Must be set if label is provided.

name : t.Optional[str], default: None

The name of the dataset. If None, the dataset name will be defined when running it within a check.

metadata : t.Optional[t.Union[pd.DataFrame, str]], default: None

Metadata for the samples. Metadata must be given as a pandas DataFrame or as a path to a CSV file that can be read into a pandas DataFrame, with each row representing a sample and each column representing a metadata field. If None, no metadata is set. The number of rows in the metadata DataFrame must be equal to the number of samples in the dataset, and the order of the rows must be the same as the order of the samples in the dataset. For more on metadata, see the NLP Metadata Guide.
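The row-alignment requirement can be illustrated without pandas. The sketch below uses the standard-library `csv` module as a stand-in for a DataFrame; `load_sample_table`, `check_row_alignment`, and the column names `user_age` and `country` are hypothetical and are not the library's loader or validation code.

```python
import csv
import io
from typing import Dict, List


def load_sample_table(csv_text: str) -> List[Dict[str, str]]:
    """Parse CSV text into one dict per row, keyed by column name."""
    return list(csv.DictReader(io.StringIO(csv_text)))


def check_row_alignment(rows: List[Dict[str, str]], n_samples: int) -> None:
    """Raise if the table does not have exactly one row per sample."""
    if len(rows) != n_samples:
        raise ValueError(
            f"Expected {n_samples} rows (one per sample), got {len(rows)}."
        )


samples = ["Great product", "Arrived late"]
metadata_csv = "user_age,country\n34,DE\n27,FR\n"

metadata_rows = load_sample_table(metadata_csv)
check_row_alignment(metadata_rows, len(samples))

# Row i describes sample i, so the two can be paired by position.
paired = list(zip(samples, metadata_rows))
```

Because rows are matched to samples purely by position, reordering either side without the other would silently attach metadata to the wrong sample; the count check alone cannot catch that.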

categorical_metadata : t.Optional[t.List[str]], default: None

The names of the categorical metadata columns. If None, categorical metadata columns are automatically inferred. Only relevant if metadata is not None.

properties : t.Optional[t.Union[pd.DataFrame, str]], default: None

The text properties for the samples. Properties must be given as a pandas DataFrame or as a path to a CSV file that can be read into a pandas DataFrame, with each row representing a sample and each column representing a property. If None, no properties are set. The number of rows in the properties DataFrame must be equal to the number of samples in the dataset, and the order of the rows must be the same as the order of the samples in the dataset. To calculate the default properties, call TextData.calculate_builtin_properties after creating the TextData object. For more on properties, see the NLP Properties Guide.

categorical_properties : t.Optional[t.List[str]], default: None

The names of the categorical properties columns. Should be given only for custom properties, not for any of the built-in properties. If None, categorical properties columns are automatically inferred for custom properties.

embeddings : t.Optional[Union[np.ndarray, pd.DataFrame, str]], default: None

The text embeddings for the samples. Embeddings must be given as a numpy array (or a path to an .npy file containing a numpy array) of shape (N, E), where N is the number of samples in the TextData object and E is the number of embedding dimensions. The rows of the numpy array must be in the same order as the samples in the TextData. If None, no embeddings are set.

To use the built-in embeddings, call TextData.calculate_builtin_embeddings after creating the TextData object. For more on embeddings, see the Text Embeddings Guide.
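The (N, E) shape requirement can be sketched with nested lists standing in for a numpy array. The helpers `embeddings_shape` and `check_embeddings` are hypothetical and dependency-free; they show the constraint described above rather than how the library validates its input.

```python
from typing import List, Sequence, Tuple


def embeddings_shape(embeddings: Sequence[Sequence[float]]) -> Tuple[int, int]:
    """Return (N, E) for a rectangular nested sequence; raise if rows differ in length."""
    n_rows = len(embeddings)
    if n_rows == 0:
        return 0, 0
    n_dims = len(embeddings[0])
    if any(len(row) != n_dims for row in embeddings):
        raise ValueError("All embedding vectors must have the same length E.")
    return n_rows, n_dims


def check_embeddings(
    embeddings: Sequence[Sequence[float]], n_samples: int
) -> Tuple[int, int]:
    """Raise unless there is exactly one embedding vector per sample."""
    n_rows, n_dims = embeddings_shape(embeddings)
    if n_rows != n_samples:
        raise ValueError(f"Expected {n_samples} embedding rows, got {n_rows}.")
    return n_rows, n_dims


samples: List[str] = ["first sample", "second sample", "third sample"]
embeddings = [
    [0.1, 0.2, 0.3, 0.4],  # vector for samples[0]
    [0.5, 0.6, 0.7, 0.8],  # vector for samples[1]
    [0.9, 1.0, 1.1, 1.2],  # vector for samples[2]
]
shape = check_embeddings(embeddings, len(samples))
```

Here N is 3 (one row per sample) and E is 4 (the length of each vector).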

__init__(raw_text: Optional[Sequence[str]] = None, tokenized_text: Optional[Sequence[Sequence[str]]] = None, label: Optional[Union[Sequence[Union[int, str, Tuple[Union[int, str]]]], Sequence[Sequence[Union[str, int]]], Sequence[None]]] = None, task_type: Optional[str] = None, name: Optional[str] = None, embeddings: Optional[Union[DataFrame, ndarray, str]] = None, metadata: Optional[DataFrame] = None, categorical_metadata: Optional[List[str]] = None, properties: Optional[DataFrame] = None, categorical_properties: Optional[List[str]] = None)
__new__(*args, **kwargs)

Attributes

TextData.categorical_metadata

Return categorical metadata column names.

TextData.categorical_properties

Return categorical properties names.

TextData.embeddings

Return the embeddings of the dataset.

TextData.label

Return the label defined in the dataset.

TextData.metadata

Return the metadata of the dataset.

TextData.n_samples

Return number of samples in the dataset.

TextData.name

TextData.numerical_metadata

Return numeric metadata column names.

TextData.numerical_properties

Return numerical properties names.

TextData.properties

Return the properties of the dataset.

TextData.task_type

Return the task type.

TextData.text

Return sequence of raw text samples.

TextData.tokenized_text

Return sequence of tokenized text samples.

Methods

TextData.calculate_builtin_embeddings([...])

Calculate the built-in embeddings of the dataset.

TextData.calculate_builtin_properties([...])

Calculate the default properties of the dataset.

TextData.cast_to_dataset(obj)

Verify Dataset or transform to Dataset.

TextData.copy([rows_to_use])

Create a copy of this Dataset with new data.

TextData.describe([n_properties_to_show, ...])

Provide holistic view of the data.

TextData.get_original_text_indexes()

Return the original indexes of the text samples.

TextData.get_sample_at_original_index(index)

Return the text sample at the original index.

TextData.has_label()

Return True if label was set.

TextData.head([n_samples, model_classes])

Return a copy of the dataset as a pandas DataFrame with the first n_samples samples.

TextData.is_multi_label_classification()

Check if the dataset is multi-label.

TextData.is_sampled(n_samples)

Return True if the number of samples in the dataset will decrease when sampled with n_samples samples.

TextData.label_for_display([model_classes])

Return the label defined in the dataset in a format that can be displayed.

TextData.label_for_print([model_classes])

Return the label defined in the dataset in a format that can be printed nicely.

TextData.len_when_sampled(n_samples)

Return the number of samples in the sampled dataframe if this dataset is sampled with n_samples samples.

TextData.sample(n_samples[, replace, ...])

Create a copy of the dataset object, with the internal data being a sample of the original data.

TextData.save_properties(path)

Save the dataset properties to csv.

TextData.set_embeddings(embeddings[, verbose])

Set the embeddings of the dataset.

TextData.set_metadata(metadata[, ...])

Set the metadata of the dataset.

TextData.set_properties(properties[, ...])

Set the properties of the dataset.

TextData.validate_textdata_compatibility(...)

Verify that all provided datasets share the same label name and task types.