TextEmbeddingsDrift

class TextEmbeddingsDrift

Calculate drift between the train and test datasets using a model trained to distinguish between their embeddings.

This check detects drift between the model embeddings of the train and test data. To do so, it trains a domain classifier: a model that learns to tell train samples from test samples. The better the classifier separates the two datasets, the stronger the drift.
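The domain classifier idea described above can be sketched in plain Python. This is an illustrative stand-in and not the deepchecks implementation: the logistic regression model, the helper names (`auc`, `fit_logistic`, `domain_classifier_drift`) and the mapping from AUC to a drift value, `max(2 * AUC - 1, 0)`, are all assumptions made for the example.

```python
import math
import random


def auc(labels, scores):
    """Chance that a random 'test' row scores above a random 'train' row (ties count half)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        return 0.5  # AUC is undefined with a single class; treat as "no signal"
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def _score(weights, bias, row):
    return bias + sum(w * x for w, x in zip(weights, row))


def _sigmoid(z):
    z = max(-30.0, min(30.0, z))  # clamp to avoid overflow in exp
    return 1.0 / (1.0 + math.exp(-z))


def fit_logistic(rows, labels, lr=0.1, epochs=200):
    """Batch gradient descent for logistic regression (assumed stand-in model)."""
    dim, n = len(rows[0]), len(rows)
    weights, bias = [0.0] * dim, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * dim, 0.0
        for row, y in zip(rows, labels):
            err = _sigmoid(_score(weights, bias, row)) - y
            grad_b += err
            for j in range(dim):
                grad_w[j] += err * row[j]
        bias -= lr * grad_b / n
        weights = [w - lr * g / n for w, g in zip(weights, grad_w)]
    return weights, bias


def domain_classifier_drift(train_emb, test_emb, test_size=0.3, seed=42):
    """Label train rows 0 and test rows 1, fit a classifier, score it on held-out rows."""
    rng = random.Random(seed)
    rows = [(e, 0) for e in train_emb] + [(e, 1) for e in test_emb]
    rng.shuffle(rows)
    n_eval = max(1, int(len(rows) * test_size))
    eval_rows, fit_rows = rows[:n_eval], rows[n_eval:]
    weights, bias = fit_logistic([r for r, _ in fit_rows], [y for _, y in fit_rows])
    scores = [_score(weights, bias, r) for r, _ in eval_rows]
    held_out_auc = auc([y for _, y in eval_rows], scores)
    # Assumed mapping: AUC 0.5 (chance) -> 0 drift, AUC 1.0 -> drift of 1.
    return max(2.0 * held_out_auc - 1.0, 0.0)
```

Under these assumptions, two samples drawn from the same distribution should give a value near 0, while clearly separated samples should give a value near 1.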

To save time and improve the domain classifier’s performance, the check first reduces the number of embedding dimensions. It uses UMAP by default, but can also use PCA or skip dimension reduction entirely (see dimension_reduction_method).
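To make the dimension reduction step concrete, here is a minimal sketch of PCA reduced to a single component, computed with power iteration in plain Python. It only illustrates the idea of projecting embeddings onto fewer dimensions; the function names are hypothetical, the check itself relies on full PCA or UMAP implementations, and the fixed starting vector assumes it is not orthogonal to the leading component.

```python
import math


def first_principal_component(rows, iterations=100):
    """Return (column means, unit vector along the direction of largest variance)."""
    n, dim = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(dim)]
    centered = [[r[j] - means[j] for j in range(dim)] for r in rows]
    # Sample covariance matrix of the centered data.
    cov = [
        [sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(dim)]
        for i in range(dim)
    ]
    # Power iteration: repeatedly multiply by the covariance matrix and renormalize.
    vec = [1.0 / math.sqrt(dim)] * dim
    for _ in range(iterations):
        nxt = [sum(cov[i][j] * vec[j] for j in range(dim)) for i in range(dim)]
        norm = math.sqrt(sum(x * x for x in nxt))
        vec = [x / norm for x in nxt]
    return means, vec


def project(rows, means, vec):
    """Reduce each row to one coordinate along the principal direction."""
    return [sum((r[j] - means[j]) * vec[j] for j in range(len(vec))) for r in rows]
```

For example, points scattered along the line y = 2x are reduced from two coordinates to one with almost no loss of variance.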

For more information about embeddings in deepchecks, see Text Embeddings Guide.

Parameters
sample_size : int, default: 10_000

Maximum number of rows to use from each dataset for the training and evaluation of the domain classifier.

random_state : int, default: 42

Random seed for the check.

test_size : float, default: 0.3

Fraction of the combined datasets to use for the evaluation of the domain classifier.

dimension_reduction_method : str, default: ‘auto’

Dimension reduction method to use for the check. Dimension reduction lowers the number of embedding dimensions so that the domain classifier trains more efficiently on the data. The two supported methods are PCA and UMAP. While UMAP yields better results (especially visually), it is much slower than PCA. Supported values:
- ‘auto’ (default): Automatically choose the best method for the data. Uses UMAP if with_display is True, otherwise uses PCA for a faster calculation. Doesn’t use dimension reduction at all if the number of embedding dimensions is less than 30.
- ‘pca’: Use PCA for dimension reduction.
- ‘umap’: Use UMAP for dimension reduction.
- ‘none’: Don’t use dimension reduction.

num_samples_in_display : int, default: 500

Number of samples to display in the check display scatter plot.
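The rules listed for the ‘auto’ value of dimension_reduction_method can be written as a small decision function. This is a sketch of the documented behavior only; the function name `resolve_dimension_reduction` and its arguments are hypothetical and are not part of the deepchecks API.

```python
SUPPORTED_METHODS = {"auto", "pca", "umap", "none"}


def resolve_dimension_reduction(method, n_dims, with_display):
    """Return the method that would effectively be applied: 'pca', 'umap' or 'none'."""
    if method not in SUPPORTED_METHODS:
        raise ValueError(f"Unsupported dimension reduction method: {method!r}")
    if method != "auto":
        return method  # an explicit choice is used as given
    if n_dims < 30:
        return "none"  # few dimensions: skip reduction entirely
    # UMAP when a display is produced, PCA otherwise for a faster calculation.
    return "umap" if with_display else "pca"
```

For instance, 768-dimensional embeddings with with_display set to False would resolve to ‘pca’ under these rules.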

__init__(sample_size: int = 10000, random_state: int = 42, test_size: float = 0.3, dimension_reduction_method: str = 'auto', num_samples_in_display: int = 500, **kwargs)
__new__(*args, **kwargs)
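A possible way to use the constructor is sketched below. The import path `deepchecks.nlp.checks`, the helper name `run_embeddings_drift`, and the assumption that `run` takes the test dataset as its second positional argument are not stated on this page and should be checked against the installed version. The datasets are assumed to already carry embeddings, as described in the Text Embeddings Guide.

```python
# Constructor arguments taken from the signature above; values are illustrative.
CHECK_PARAMS = {
    "sample_size": 5_000,
    "random_state": 42,
    "test_size": 0.3,
    "dimension_reduction_method": "pca",  # faster than 'umap'
    "num_samples_in_display": 200,
}


def run_embeddings_drift(train_dataset, test_dataset):
    """Build the check, attach the drift condition with its defaults, and run it."""
    # Imported lazily so this module loads even where deepchecks is not installed.
    # The import path is an assumption.
    from deepchecks.nlp.checks import TextEmbeddingsDrift

    check = TextEmbeddingsDrift(**CHECK_PARAMS)
    check.add_condition_overall_drift_value_less_than()
    return check.run(train_dataset, test_dataset)
```

Choosing ‘pca’ here trades some visual quality in the scatter plot for a shorter run time, following the parameter description above.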

Methods

TextEmbeddingsDrift.add_condition(name, ...)

Add new condition function to the check.

TextEmbeddingsDrift.add_condition_overall_drift_value_less_than([...])

Add condition: require the overall drift value to be less than a threshold.

TextEmbeddingsDrift.clean_conditions()

Remove all conditions from this check instance.

TextEmbeddingsDrift.conditions_decision(result)

Run conditions on given result.

TextEmbeddingsDrift.config([...])

Return check configuration (conditions' configuration not yet supported).

TextEmbeddingsDrift.from_config(conf[, ...])

Return check object from a CheckConfig object.

TextEmbeddingsDrift.from_json(conf[, ...])

Deserialize check instance from JSON string.

TextEmbeddingsDrift.metadata([with_doc_link])

Return check metadata.

TextEmbeddingsDrift.name()

Name of class in split camel case.

TextEmbeddingsDrift.params([show_defaults])

Return parameters to show when printing the check.

TextEmbeddingsDrift.remove_condition(index)

Remove given condition by index.

TextEmbeddingsDrift.run(train_dataset, ...)

Run check.

TextEmbeddingsDrift.run_logic(context)

Run check.

TextEmbeddingsDrift.to_json([indent, ...])

Serialize check instance to JSON string.