NLP Embeddings#

Embeddings are a way to represent text as a vector of numbers. The vector is a representation of the text in a latent space, in which texts with similar meanings are represented by similar vectors.

Embeddings are usually extracted from one of the final layers of a trained neural network. This can either be a model that was trained on the specific task at hand (e.g. sentiment analysis), or a model that was trained on a different task but is known to produce good embeddings (e.g. GPT).
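For illustration, here is a minimal sketch of extracting such embeddings with the open-source sentence-transformers library (the model name and texts are just examples):

from sentence_transformers import SentenceTransformer

# Load a general-purpose embedding model (model name is illustrative)
model = SentenceTransformer("all-MiniLM-L6-v2")

texts = [
    "The movie was fantastic, I loved every minute of it.",
    "An absolutely wonderful film - highly recommended.",
    "The package arrived two weeks late and damaged.",
]

# encode() returns a numpy array of shape (len(texts), embedding_dim)
embeddings = model.encode(texts)
print(embeddings.shape)  # e.g. (3, 384) for all-MiniLM-L6-v2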

What Are Embeddings Used For?#

Embeddings are used by some of Deepchecks' checks to produce a meaningful representation of the data and insights on it, since some computations (for example, drift) cannot be performed directly on the text. Inspecting the distribution of the embeddings, or the distance between the embeddings of different texts, can help uncover potential problems in the way the datasets were built, or hint at the model's expected performance on unseen data.

Examples of specific scenarios in which using embeddings may come in handy:

  1. Detecting drift in the text - If the distribution of the embeddings of the training data is different from the distribution of the embeddings of the test data, it may indicate that the test data is not representative of the training data, and that the model’s performance on the test data may be lower than expected.

  2. Investigating low test performance - By comparing similar texts on which the model doesn’t perform well, we can try to understand what the model is missing. For example, if the model performs well on news articles but poorly on scientific articles, it may indicate that the training dataset is biased towards news articles and that the model does not generalize well to scientific articles.

  3. Finding conflicting annotations - Clean data is critical for training a good model. Mistakes in the annotations (labels) of the data can lead to a model that does not perform well. By finding similar texts (using embeddings) that have different annotations, we can find potential annotation mistakes and fix them, as shown in the sketch after this list.
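As an illustration of the third scenario, here is a minimal sketch (with made-up embeddings and labels) of flagging near-duplicate texts whose labels disagree, using cosine similarity between the embedding vectors:

import numpy as np

# Illustrative data: a small embeddings matrix of shape (N, E) and one label per sample
embeddings = np.array([[0.9, 0.1], [0.88, 0.12], [0.1, 0.95]])
labels = np.array(["positive", "negative", "negative"])

# Cosine similarity between every pair of samples
normalized = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
similarity = normalized @ normalized.T

# Flag pairs of very similar texts that were given different labels
threshold = 0.95
for i in range(len(labels)):
    for j in range(i + 1, len(labels)):
        if similarity[i, j] > threshold and labels[i] != labels[j]:
            print(f"Samples {i} and {j} look similar but have different labels")

In practice, a stricter similarity threshold and an approximate nearest-neighbor search would typically be used for large datasets.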

Using Embeddings in Checks#

Whether you are Using Deepchecks to Calculate Embeddings or using your own model’s embeddings, the process of using them in the checks is the same. In order to use the embeddings of your text in a check, the embeddings should already be part of the TextData object.

Using Deepchecks to Calculate Embeddings#

If you don’t have model embeddings for your text, you can use deepchecks to calculate them for you. deepchecks currently supports using the open-source sentence-transformers library to calculate the embeddings, or OpenAI’s paid API.

Calculating your embeddings is done by calling the calculate_builtin_embeddings method of the TextData object. This method will calculate the embeddings and add them to the TextData object.

In the following example, we will calculate the built-in embeddings in order to use the TextEmbeddingsDrift check:

from deepchecks.nlp.checks import TextEmbeddingsDrift
from deepchecks.nlp import TextData

# Initialize the TextData object (text is a sequence of raw text samples)
text_data = TextData(text)

# Calculate the built-in embeddings
text_data.calculate_builtin_embeddings()

# Run the check
TextEmbeddingsDrift().run(text_data)

Note that any use of the deepchecks.nlp.TextData.calculate_builtin_embeddings() method will overwrite the existing embeddings.

Currently, deepchecks supports either the all-MiniLM-L6-v2 model (default) from the sentence-transformers library, or OpenAI’s text-embedding-ada-002 model. You can choose which model to use by setting the model parameter to either miniLM or open_ai.

The embeddings are automatically saved to a local CSV file so they can be used later. You can change the location and name of the file by using the file_path parameter.
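For example, a sketch of selecting the OpenAI model and a custom file location, based on the parameters described above (the file name is illustrative, and using open_ai requires an OpenAI API key):

from deepchecks.nlp import TextData

text_data = TextData(text)

# Use OpenAI's embedding model instead of the default, and choose where to save the file
# (parameter values here are illustrative)
text_data.calculate_builtin_embeddings(model='open_ai', file_path='my_embeddings.csv')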

Using Your Own Embeddings#

Whether you saved the deepchecks-calculated embeddings for this dataset to reuse them later, or calculated embeddings with your own model, you can set them on the TextData object using one of the following methods:

  1. When initializing the TextData object, pass your pre-calculated embeddings to the embeddings parameter.

  2. After the initialization, call the set_embeddings method of the TextData object.

In both methods, you can pass the embeddings as a numpy array, or as a path to an .npy file. For the correct format of the embeddings, see the Pre-Calculated Embeddings Format section.

In the following example, we will pass pre-calculated embeddings to the TextData object in order to use the TextEmbeddingsDrift check:

from deepchecks.nlp.checks import TextEmbeddingsDrift
from deepchecks.nlp import TextData

# Option 1: Initialize the TextData object with the embeddings:
text_data = TextData(text, embeddings=embeddings)

# Option 2: Initialize the TextData object and then set the embeddings:
text_data = TextData(text)
text_data.set_embeddings(embeddings)

# Run the check
TextEmbeddingsDrift().run(text_data)

Pre-Calculated Embeddings Format#

The embeddings should be a numpy.ndarray of shape (N, E), where N is the number of samples in the TextData object and E is the embedding dimension. The rows of the numpy.ndarray must be in the same order as the samples in the TextData object.
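For example, a quick sanity check on pre-calculated embeddings loaded from an .npy file might look like this (the file name is illustrative, and text is the same sequence of texts used to build the TextData object):

import numpy as np

# Load pre-calculated embeddings; expected shape is (N, E)
embeddings = np.load('my_embeddings.npy')

# One row per sample, in the same order as the texts passed to TextData
assert embeddings.ndim == 2
assert embeddings.shape[0] == len(text)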