TextData

class TextData

TextData wraps together the raw text data and the labels for the NLP task.

The TextData class contains metadata and methods for easily accessing the information relevant to training or validating ML models.

Parameters

raw_text : t.Sequence[str], default: None

The raw text data, a sequence of strings representing the raw text of each sample. If not given, tokenized_text must be given, and raw_text will be created from it by joining the tokens with spaces.

tokenized_text : t.Sequence[t.Sequence[str]], default: None

The tokenized text data, a sequence of sequences of strings representing the tokenized text of each sample. Only relevant for task_type ‘token_classification’. If not given, raw_text must be given, and tokenized_text will be created from it by splitting the text by spaces.

label : t.Optional[TTextLabel], default: None

The label for the text data. Can be either a text_classification label or a token_classification label. If None, the label is not set.

text_classification label - For text classification the accepted label format differs between multilabel and single label cases. For single label data, the label should be passed as a sequence of labels, with one entry per sample that can be either a string or an integer. For multilabel data, the label should be passed as a sequence of sequences, with the sequence for each sample being a binary vector, representing the presence of the i-th label in that sample.
token_classification label - For token classification the accepted label format is the IOB format or a similar one. The label must be a sequence of sequences of strings or integers, with each sequence corresponding to a sample in the tokenized text and having exactly the same length as that sample's tokenized text.
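
The three label layouts described above can be illustrated with plain Python data. This sketch is not part of the library: `check_token_labels` is a hypothetical helper that only demonstrates the length constraint for token classification, and the sample values are invented for illustration.

```python
from typing import List, Sequence, Union

# Single-label text classification: one string or integer per sample.
single_label = ["positive", "negative", "positive"]

# Multilabel text classification: one binary vector per sample;
# position i marks whether the i-th label is present in that sample.
multilabel = [[1, 0, 1], [0, 0, 1], [1, 1, 0]]

# Token classification: one IOB-style tag per token in each sample.
tokenized_text = [["Alice", "lives", "in", "Paris"], ["Hello", "world"]]
token_labels = [["B-PER", "O", "O", "B-LOC"], ["O", "O"]]


def check_token_labels(
    tokens: Sequence[Sequence[str]],
    labels: Sequence[Sequence[Union[str, int]]],
) -> List[int]:
    """Hypothetical helper: return the positions of samples whose number
    of labels differs from their number of tokens."""
    if len(tokens) != len(labels):
        raise ValueError("tokens and labels must contain the same number of samples")
    return [
        i
        for i, (sample_tokens, sample_labels) in enumerate(zip(tokens, labels))
        if len(sample_tokens) != len(sample_labels)
    ]


# An empty list means every sample has exactly one label per token.
mismatched = check_token_labels(tokenized_text, token_labels)
```

A check of this kind is useful before constructing the dataset, since a label sequence that is shorter or longer than its tokenized sample does not satisfy the format described above.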

task_type : str, default: None

The task type for the text data. Can be either ‘text_classification’ or ‘token_classification’. Must be set if label is provided.

dataset_name : t.Optional[str], default: None

The name of the dataset. If None, the dataset name will be defined when running it within a check.

index : t.Optional[t.Sequence[int]], default: None

The index of the samples. If None, the index is set to np.arange(len(raw_text)).

metadata : t.Optional[pd.DataFrame], default: None

Metadata for the samples. If None, no metadata is set. If a DataFrame is given, it must contain the same number of samples as raw_text and have an identical index.

properties : t.Optional[Union[pd.DataFrame, str]], default: None

The text properties for the samples. If None, no properties are set. If ‘auto’, the default properties are calculated. If a DataFrame is given, it must contain the properties for each sample, with the same number of samples as the raw text and an identical index.

device : str, default: None

The device to use to calculate the text properties.
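
The fallback rules described above for raw_text, tokenized_text and index can be sketched in plain Python. This illustrates the documented behaviour only and is not the library's code: `derive_text_fields` is a hypothetical name, and a plain list stands in for np.arange(len(raw_text)).

```python
from typing import List, Optional, Sequence, Tuple


def derive_text_fields(
    raw_text: Optional[Sequence[str]] = None,
    tokenized_text: Optional[Sequence[Sequence[str]]] = None,
) -> Tuple[List[str], List[List[str]], List[int]]:
    """Hypothetical helper mirroring the documented defaults."""
    if raw_text is None and tokenized_text is None:
        raise ValueError("Either raw_text or tokenized_text must be given")
    if raw_text is None:
        # Build the raw text by joining each sample's tokens with spaces.
        raw_text = [" ".join(sample_tokens) for sample_tokens in tokenized_text]
    if tokenized_text is None:
        # Build the tokens by splitting each sample on spaces.
        tokenized_text = [sample.split(" ") for sample in raw_text]
    # Default index: 0, 1, ..., n_samples - 1.
    index = list(range(len(raw_text)))
    return list(raw_text), [list(t) for t in tokenized_text], index


raw, tokens, index = derive_text_fields(
    tokenized_text=[["Hello", "world"], ["Good", "morning", "all"]]
)
```

Note that joining and then splitting on spaces only round-trips when no token itself contains a space, which is one reason to pass both fields explicitly when a custom tokenizer is used.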

__init__(raw_text: Optional[Sequence[str]] = None, tokenized_text: Optional[Sequence[Sequence[str]]] = None, label: Optional[Union[Sequence[Union[Tuple[int, str], Tuple[Tuple[int, str]]]], Sequence[Sequence[Union[str, int]]]]] = None, task_type: Optional[str] = None, dataset_name: Optional[str] = None, index: Optional[Sequence[Any]] = None, metadata: Optional[DataFrame] = None, properties: Optional[Union[DataFrame, str]] = None, device: Optional[str] = None)

__new__(*args, **kwargs)

Attributes

`TextData.is_multilabel`	Return True if label is multilabel.
`TextData.label`	Return the label defined in the dataset.
`TextData.metadata`	Return the metadata of the dataset.
`TextData.metadata_types`	Return the metadata types of the dataset.
`TextData.n_samples`	Return number of samples in the dataset.
`TextData.name`
`TextData.properties`	Return the properties of the dataset.
`TextData.properties_types`	Return the property types of the dataset.
`TextData.task_type`	Return the task type.
`TextData.text`	Return sequence of raw text samples.
`TextData.tokenized_text`	Return sequence of tokenized text samples.
`TextData.index`

Methods

`TextData.calculate_default_properties`([...])	Calculate the default properties of the dataset.
`TextData.cast_to_dataset`(obj)	Verify Dataset or transform to Dataset.
`TextData.copy`([rows_to_use])	Create a copy of this Dataset with new data.
`TextData.datasets_share_task_type`(*datasets)	Verify that all provided datasets share same label name and task types.
`TextData.get_raw_sample`(index)	Get the raw text of a sample.
`TextData.get_tokenized_sample`(index)	Get the tokenized text of a sample.
`TextData.has_label`()	Return True if label was set.
`TextData.head`([n_samples])	Return a copy of the dataset as a pandas DataFrame with the first n_samples samples.
`TextData.is_sampled`(n_samples)	Return True if the dataset number of samples will decrease when sampled with n_samples samples.
`TextData.len_when_sampled`(n_samples)	Return the number of samples in the sampled dataset if this dataset is sampled with n_samples samples.
`TextData.reindex`(index)	Reindex the TextData with a new index.
`TextData.sample`(n_samples[, replace, ...])	Create a copy of the dataset object, with the internal data being a sample of the original data.
`TextData.set_metadata`(metadata[, metadata_types])	Set the metadata of the dataset.
`TextData.set_properties`(properties[, ...])	Set the properties of the dataset.
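
One plausible reading of the is_sampled and len_when_sampled descriptions above is sketched below with standalone functions. This is an assumption, not the library's implementation: it takes the dataset size as an explicit argument and assumes sampling without replacement, so a sample can never be larger than the dataset.

```python
def len_when_sampled(dataset_size: int, n_samples: int) -> int:
    """Assumed semantics: size of the result when sampling n_samples
    rows without replacement from a dataset of dataset_size rows."""
    return min(dataset_size, n_samples)


def is_sampled(dataset_size: int, n_samples: int) -> bool:
    """Assumed semantics: True only if sampling would shrink the dataset."""
    return dataset_size > n_samples


# A 100-sample dataset sampled down to 10 rows shrinks ...
shrinks = is_sampled(100, 10)
# ... while a 5-sample dataset asked for 10 rows keeps all 5.
kept = len_when_sampled(5, 10)
```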
