TextData

class TextData

TextData wraps the raw text data together with the labels for an NLP task.

The TextData class holds metadata, along with methods for easily accessing the metadata relevant to training or validating ML models.

Parameters
raw_text : t.Sequence[str], default: None

The raw text data, a sequence of strings representing the raw text of each sample. If not given, tokenized_text must be given, and raw_text will be created from it by joining the tokens with spaces.

tokenized_text : t.Sequence[t.Sequence[str]], default: None

The tokenized text data, a sequence of sequences of strings representing the tokenized text of each sample. Only relevant for task_type ‘token_classification’. If not given, raw_text must be given, and tokenized_text will be created from it by splitting the text by spaces.
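The fallback between raw_text and tokenized_text described above can be sketched in plain Python. This is a minimal illustration of the documented join/split behavior, not the library's implementation; fill_missing_text is a hypothetical helper name introduced only for this example.

```python
from typing import List, Optional, Sequence, Tuple


def fill_missing_text(
    raw_text: Optional[Sequence[str]] = None,
    tokenized_text: Optional[Sequence[Sequence[str]]] = None,
) -> Tuple[List[str], List[List[str]]]:
    """Hypothetical helper: derive whichever representation is missing."""
    if raw_text is None and tokenized_text is None:
        raise ValueError("Either raw_text or tokenized_text must be given")
    if raw_text is None:
        # Join the tokens of each sample with spaces.
        raw_text = [" ".join(tokens) for tokens in tokenized_text]
    if tokenized_text is None:
        # Split the text of each sample by spaces.
        tokenized_text = [sample.split(" ") for sample in raw_text]
    return list(raw_text), [list(tokens) for tokens in tokenized_text]


raw, tokens = fill_missing_text(raw_text=["the cat sat", "hello world"])
raw_from_tokens, _ = fill_missing_text(tokenized_text=[["a", "b"]])
```

Here tokens holds one list of space-separated words per sample, and raw_from_tokens holds the tokens of each sample joined back into a single string.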

label : t.Optional[TTextLabel], default: None

The label for the text data. Can be either a text_classification label or a token_classification label. If None, the label is not set.

  • text_classification label - For text classification the accepted label format differs between multilabel and single label cases. For single label data, the label should be passed as a sequence of labels, with one entry per sample that can be either a string or an integer. For multilabel data, the label should be passed as a sequence of sequences, with the sequence for each sample being a binary vector, representing the presence of the i-th label in that sample.

  • token_classification label - For token classification the accepted label format is the IOB format or one similar to it. The label must be a sequence of sequences of strings or integers, with each sequence corresponding to a sample in the tokenized text and having exactly the same length as the corresponding tokenized text.
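The three label formats above can be illustrated with small literal examples. The sample values and the two checking functions are hypothetical and only mirror the constraints stated in the text; they are not part of the TextData API.

```python
# Single-label text classification: one string or integer per sample.
single_label = ["positive", "negative", "positive"]

# Multilabel text classification: one binary vector per sample, where
# position i marks the presence of the i-th label in that sample.
multilabel = [[1, 0, 1], [0, 0, 1], [1, 1, 0]]

# Token classification: one IOB-style tag per token of each sample.
tokenized_text = [["Alice", "visited", "Paris"], ["Hello", "world"]]
token_labels = [["B-PER", "O", "B-LOC"], ["O", "O"]]


def token_labels_match(tokens, labels) -> bool:
    """Hypothetical check: one label sequence per sample, one label per token."""
    return len(tokens) == len(labels) and all(
        len(sample_tokens) == len(sample_labels)
        for sample_tokens, sample_labels in zip(tokens, labels)
    )


def is_binary_multilabel(labels) -> bool:
    """Hypothetical check: every sample is a 0/1 vector of the same length."""
    widths = {len(row) for row in labels}
    return len(widths) == 1 and all(v in (0, 1) for row in labels for v in row)
```

A token label sequence that is shorter or longer than its tokenized sample fails the first check, which corresponds to the length requirement stated for token_classification labels.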

task_type : str, default: None

The task type for the text data. Can be either ‘text_classification’ or ‘token_classification’. Must be set if label is provided.
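The rule stated above (two accepted values, required whenever a label is passed) can be written as a short validation sketch. The function name, the constant, and the error messages are assumptions made for illustration; the library's own validation may differ.

```python
from typing import Any, Optional

# The two task types named in the parameter description.
VALID_TASK_TYPES = ("text_classification", "token_classification")


def check_task_type(task_type: Optional[str], label: Any = None) -> Optional[str]:
    """Hypothetical validator mirroring the documented task_type rule."""
    if task_type is None:
        if label is not None:
            # A label without a task type cannot be interpreted.
            raise ValueError("task_type must be set when label is provided")
        return None
    if task_type not in VALID_TASK_TYPES:
        raise ValueError(f"task_type must be one of {VALID_TASK_TYPES}")
    return task_type
```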

dataset_name : t.Optional[str], default: None

The name of the dataset. If None, the dataset name will be defined when running it within a check.

index : t.Optional[t.Sequence[int]], default: None

The index of the samples. If None, the index is set to np.arange(len(raw_text)).

metadata : t.Optional[pd.DataFrame], default: None

Metadata for the samples. If None, no metadata is set. If a DataFrame is given, it must contain the same number of samples as raw_text and have an identical index.

properties : t.Optional[Union[pd.DataFrame, str]], default: None

The text properties for the samples. If None, no properties are set. If ‘auto’, the properties are calculated using the default properties. If a DataFrame is given, it must contain the properties for each sample, with the same number of samples as the raw text and an identical index.

device : str, default: None

The device to use to calculate the text properties.

__init__(raw_text: Optional[Sequence[str]] = None, tokenized_text: Optional[Sequence[Sequence[str]]] = None, label: Optional[Union[Sequence[Union[Tuple[int, str], Tuple[Tuple[int, str]]]], Sequence[Sequence[Union[str, int]]]]] = None, task_type: Optional[str] = None, dataset_name: Optional[str] = None, index: Optional[Sequence[Any]] = None, metadata: Optional[DataFrame] = None, properties: Optional[Union[DataFrame, str]] = None, device: Optional[str] = None)
__new__(*args, **kwargs)

Attributes

TextData.is_multilabel

Return True if label is multilabel.

TextData.label

Return the label defined in the dataset.

TextData.metadata

Return the metadata of the dataset.

TextData.metadata_types

Return the metadata types of the dataset.

TextData.n_samples

Return the number of samples in the dataset.

TextData.name

TextData.properties

Return the properties of the dataset.

TextData.properties_types

Return the property types of the dataset.

TextData.task_type

Return the task type.

TextData.text

Return sequence of raw text samples.

TextData.tokenized_text

Return sequence of tokenized text samples.

TextData.index

Methods

TextData.calculate_default_properties([...])

Calculate the default properties of the dataset.

TextData.cast_to_dataset(obj)

Verify Dataset or transform to Dataset.

TextData.copy([rows_to_use])

Create a copy of this Dataset with new data.

TextData.datasets_share_task_type(*datasets)

Verify that all provided datasets share the same label name and task types.

TextData.get_raw_sample(index)

Get the raw text of a sample.

TextData.get_tokenized_sample(index)

Get the tokenized text of a sample.

TextData.has_label()

Return True if label was set.

TextData.head([n_samples])

Return a copy of the dataset as a pandas DataFrame with the first n_samples samples.

TextData.is_sampled(n_samples)

Return True if the number of samples in the dataset will decrease when it is sampled with n_samples samples.

TextData.len_when_sampled(n_samples)

Return the number of samples in the sampled DataFrame when this dataset is sampled with n_samples samples.
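The relationship between is_sampled and len_when_sampled can be sketched as two standalone functions. This is one plausible reading of the descriptions above rather than the library's code: the n_total argument stands in for the dataset's n_samples attribute, and treating n_samples=None as "no sampling" is an assumption.

```python
from typing import Optional


def len_when_sampled(n_total: int, n_samples: Optional[int]) -> int:
    """Assumed logic: sampling can never return more rows than exist."""
    if n_samples is None:
        # Assumption: None means the dataset is not sampled at all.
        return n_total
    return min(n_total, n_samples)


def is_sampled(n_total: int, n_samples: Optional[int]) -> bool:
    """True only if sampling would reduce the number of samples."""
    return len_when_sampled(n_total, n_samples) < n_total
```

Under this reading, asking for at least as many samples as the dataset holds leaves its size unchanged, so is_sampled returns False.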

TextData.reindex(index)

Reindex the TextData with a new index.

TextData.sample(n_samples[, replace, ...])

Create a copy of the dataset object, with the internal data being a sample of the original data.

TextData.set_metadata(metadata[, metadata_types])

Set the metadata of the dataset.

TextData.set_properties(properties[, ...])

Set the properties of the dataset.