TextData
- class TextData
TextData wraps the raw text data together with the labels for the NLP task.
The TextData class contains metadata and methods for easily accessing the metadata relevant to training or validating ML models.
- Parameters
- raw_text : t.Sequence[str], default: None
The raw text data, a sequence of strings representing the raw text of each sample. If not given, tokenized_text must be given, and raw_text will be created from it by joining the tokens with spaces.
- tokenized_text : t.Sequence[t.Sequence[str]], default: None
The tokenized text data, a sequence of sequences of strings representing the tokenized text of each sample. Only relevant for task_type ‘token_classification’. If not given, raw_text must be given, and tokenized_text will be created from it by splitting the text by spaces.
- label : t.Optional[TTextLabel], default: None
The label for the text data. Can be either a text_classification label or a token_classification label. If None, the label is not set.
text_classification label - For text classification, the accepted label format differs between the multilabel and single-label cases. For single-label data, the label should be passed as a sequence of labels, with one entry per sample that can be either a string or an integer. For multilabel data, the label should be passed as a sequence of sequences, where the sequence for each sample is a binary vector whose i-th entry indicates whether the i-th label is present in that sample.
token_classification label - For token classification, the accepted label format is IOB or a similar format. The label must be a sequence of sequences of strings or integers, with each sequence corresponding to a sample in the tokenized text and having exactly the same length as that sample's tokenized text.
- task_type : str, default: None
The task type for the text data. Can be either ‘text_classification’ or ‘token_classification’. Must be set if label is provided.
- name : t.Optional[str], default: None
The name of the dataset. If None, the dataset name will be defined when running it within a check.
- metadata : t.Optional[t.Union[pd.DataFrame, str]], default: None
Metadata for the samples. Metadata must be given as a pandas DataFrame or a path to a pandas DataFrame compatible csv file, with the rows representing each sample and columns representing the different metadata columns. If None, no metadata is set. The number of rows in the metadata DataFrame must be equal to the number of samples in the dataset, and the order of the rows must be the same as the order of the samples in the dataset. For more on metadata, see the NLP Metadata Guide.
- categorical_metadata : t.Optional[t.List[str]], default: None
The names of the categorical metadata columns. If None, categorical metadata columns are automatically inferred. Only relevant if metadata is not None.
- properties : t.Optional[t.Union[pd.DataFrame, str]], default: None
The text properties for the samples. Properties must be given as either a pandas DataFrame or a path to a pandas DataFrame compatible csv file, with the rows representing each sample and columns representing the different properties. If None, no properties are set. The number of rows in the properties DataFrame must be equal to the number of samples in the dataset, and the order of the rows must be the same as the order of the samples in the dataset. In order to calculate the default properties, use the TextData.calculate_builtin_properties function after the creation of the TextData object. For more on properties, see the NLP Properties Guide.
- categorical_properties : t.Optional[t.List[str]], default: None
The names of the categorical properties columns. If None, categorical properties columns are automatically inferred. Only relevant if properties is not None.
- embeddings : t.Optional[Union[np.ndarray, pd.DataFrame, str]], default: None
The text embeddings for the samples. Embeddings must be given as a numpy array (or a path to an .npy file containing a numpy array) of shape (N, E), where N is the number of samples in the TextData object and E is the number of embedding dimensions. The numpy array must be in the same order as the samples in the TextData. If None, no embeddings are set.
In order to use the default embeddings, use the TextData.calculate_default_embeddings function after the creation of the TextData object. For more on embeddings, see the Text Embeddings Guide.
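The text and label formats described in the parameter list can be sketched in plain Python. This is a minimal illustration of the documented conventions, not the library's implementation: the helper names `derive_text_fields` and `check_token_labels` are hypothetical, and using `str.split()` (which splits on any whitespace) is an assumption about what "splitting the text by spaces" means.

```python
from typing import List, Optional, Sequence, Tuple, Union


def derive_text_fields(
    raw_text: Optional[Sequence[str]] = None,
    tokenized_text: Optional[Sequence[Sequence[str]]] = None,
) -> Tuple[List[str], List[List[str]]]:
    """Hypothetical helper: fill in whichever text field was not given."""
    if raw_text is None and tokenized_text is None:
        raise ValueError("Either raw_text or tokenized_text must be given.")
    if raw_text is None:
        # Documented fallback: join the tokens of each sample with spaces.
        raw_text = [" ".join(tokens) for tokens in tokenized_text]
    if tokenized_text is None:
        # Documented fallback: split each sample by spaces (assumed: str.split()).
        tokenized_text = [sample.split() for sample in raw_text]
    return list(raw_text), [list(tokens) for tokens in tokenized_text]


def check_token_labels(
    tokenized_text: Sequence[Sequence[str]],
    label: Sequence[Sequence[Union[str, int]]],
) -> None:
    """Hypothetical helper: one tag sequence per sample, one tag per token."""
    if len(label) != len(tokenized_text):
        raise ValueError("label must have one entry per sample.")
    for index, (tokens, tags) in enumerate(zip(tokenized_text, label)):
        if len(tokens) != len(tags):
            raise ValueError(
                f"Sample {index}: {len(tokens)} tokens but {len(tags)} tags."
            )


# Single-label text classification: one string or integer per sample.
single_label = ["positive", "negative"]

# Multilabel text classification: one binary vector per sample, where
# position i marks the presence of the i-th label.
multilabel = [[1, 0, 1], [0, 0, 1]]

# Token classification: IOB-style tags aligned with the tokens.
tokens = [["Alice", "visited", "Paris"], ["Hello", "world"]]
iob_tags = [["B-PER", "O", "B-LOC"], ["O", "O"]]

raw_text, tokenized_text = derive_text_fields(tokenized_text=tokens)
check_token_labels(tokenized_text, iob_tags)  # passes: lengths match
```

With real data, values shaped like `raw_text`, `tokenized_text`, and one of the label variables above would be passed to the constructor together with the matching `task_type`.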
- __init__(raw_text: Optional[Sequence[str]] = None, tokenized_text: Optional[Sequence[Sequence[str]]] = None, label: Optional[Union[Sequence[Union[int, str, Tuple[Union[int, str]]]], Sequence[Sequence[Union[str, int]]]]] = None, task_type: Optional[str] = None, name: Optional[str] = None, embeddings: Optional[Union[DataFrame, ndarray, str]] = None, metadata: Optional[DataFrame] = None, categorical_metadata: Optional[List[str]] = None, properties: Optional[DataFrame] = None, categorical_properties: Optional[List[str]] = None)
- __new__(*args, **kwargs)
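The metadata, properties, and embeddings parameters share one constraint: one row per sample, in the same order as the samples. The sketch below checks that constraint before construction, with plain Python lists standing in for the DataFrame or numpy array. `check_aligned` and `embedding_shape` are hypothetical helpers and the sample values are made up for illustration; none of this is part of the library.

```python
from typing import Dict, List, Sequence, Tuple


def check_aligned(n_samples: int, rows: Sequence, what: str) -> None:
    """Hypothetical helper: require exactly one row per sample."""
    if len(rows) != n_samples:
        raise ValueError(f"{what} has {len(rows)} rows, expected {n_samples}.")


def embedding_shape(embeddings: Sequence[Sequence[float]]) -> Tuple[int, int]:
    """Hypothetical helper: return (N, E) and reject ragged rows."""
    n_rows = len(embeddings)
    n_dims = len(embeddings[0]) if n_rows else 0
    if any(len(row) != n_dims for row in embeddings):
        raise ValueError("All embedding rows must have the same length E.")
    return n_rows, n_dims


raw_text = ["great product", "arrived late", "would buy again"]

# One metadata row per sample, in the same order as raw_text.
metadata: List[Dict[str, object]] = [
    {"user_age": 31, "channel": "web"},
    {"user_age": 45, "channel": "app"},
    {"user_age": 27, "channel": "web"},
]

# Embeddings of shape (N, E): N = 3 samples, E = 2 dimensions.
embeddings = [[0.1, 0.9], [0.7, 0.2], [0.3, 0.8]]

check_aligned(len(raw_text), metadata, "metadata")
check_aligned(len(raw_text), embeddings, "embeddings")
shape = embedding_shape(embeddings)
```

Running a check like this first turns a row-count mismatch into an explicit error that names the offending input.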
Attributes
Return categorical metadata column names.
Return categorical properties names.
Return the metadata of the dataset.
Return the label defined in the dataset.
Return the metadata of the dataset.
Return number of samples in the dataset.
Return the properties of the dataset.
Return the task type.
Return sequence of raw text samples.
Return sequence of tokenized text samples.
Methods
Calculate the default properties of the dataset.
Calculate the default properties of the dataset.
Verify Dataset or transform to Dataset.
Create a copy of this Dataset with new data.
Return the original indexes of the text samples.
Return True if label was set.
Return a copy of the dataset as a pandas DataFrame with the first n_samples samples.
Check if the dataset is multi-label.
Return True if the number of samples in the dataset will decrease when sampled with n_samples samples.
Return the label defined in the dataset in a format that can be displayed.
Return the number of samples in the sampled DataFrame when this dataset is sampled with n_samples samples.
Create a copy of the dataset object, with the internal data being a sample of the original data.
Save the dataset properties to csv.
Set the metadata of the dataset.
Set the metadata of the dataset.
Set the properties of the dataset.
Verify that all provided datasets share the same label name and task type.
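Two of the method descriptions above concern sampling: one returns the number of samples after sampling with n_samples, the other reports whether sampling would shrink the dataset. The sketch below shows one plausible reading of those descriptions. It assumes sampling without replacement, so the result is capped at the dataset size. The function names are hypothetical and this is not the library's code.

```python
def length_when_sampled(total_samples: int, n_samples: int) -> int:
    """Hypothetical helper: dataset size after sampling n_samples rows."""
    # Assumption: sampling is without replacement, so the result cannot
    # exceed the number of samples actually available.
    return min(total_samples, n_samples)


def shrinks_when_sampled(total_samples: int, n_samples: int) -> bool:
    """Hypothetical helper: True if sampling would drop at least one sample."""
    return length_when_sampled(total_samples, n_samples) < total_samples


# A small dataset is returned whole; a large one is cut down to n_samples.
small = length_when_sampled(total_samples=500, n_samples=10_000)
large = length_when_sampled(total_samples=50_000, n_samples=10_000)
```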