Embeddings are a way to represent text as a vector of numbers. The vector places the text in a latent space in which texts with similar meanings are represented by similar vectors.
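To make "similar vectors" concrete, one common way to compare two embeddings is cosine similarity. The sketch below uses tiny hand-made 3-dimensional vectors as stand-ins for real embeddings, which typically have hundreds of dimensions. The vectors, the example sentences, and the variable names are illustrative assumptions, not the output of any real model.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Made-up toy vectors standing in for the embeddings of three texts.
great_movie = [0.9, 0.1, 0.2]    # "The movie was great"
loved_film = [0.8, 0.2, 0.1]     # "I loved this film"
tax_deadline = [0.1, 0.9, 0.7]   # "Taxes are due in April"

sim_related = cosine_similarity(great_movie, loved_film)
sim_unrelated = cosine_similarity(great_movie, tax_deadline)

# The two movie reviews point in nearly the same direction,
# so their similarity is higher than that of the unrelated pair.
print(sim_related > sim_unrelated)  # True
```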
Embeddings are usually extracted from one of the final layers of a trained neural network model. This can be a model that was trained on the specific task at hand (e.g. sentiment analysis), or a model that was trained on a different task but is known to be good at extracting embeddings (e.g. GPT).
What Are Embeddings Used For?
Some Deepchecks checks use embeddings to obtain a meaningful numeric representation of the data, since some computations (for example, drift) cannot be performed directly on raw text. Inspecting the distribution of the embeddings, or the distances between the embeddings of different texts, can help uncover potential problems in the way the datasets were built, or hint at the model’s expected performance on unseen data.
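As a rough illustration of comparing two embedding distributions, the sketch below measures how far apart the mean (centroid) embeddings of two datasets are. This is a deliberately crude signal computed on made-up 2-dimensional vectors, and it is not how the Deepchecks drift checks compute their score. The function names and data are assumptions for illustration only.

```python
import math

def centroid(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]

def centroid_distance(a, b):
    """Euclidean distance between the centroids of two embedding sets."""
    return math.dist(centroid(a), centroid(b))

# Made-up 2-dimensional "embeddings" (real ones have far more dimensions).
train = [[0.0, 0.0], [0.2, 0.0], [0.0, 0.2], [0.2, 0.2]]
test_similar = [[0.1, 0.0], [0.1, 0.2], [0.0, 0.1], [0.2, 0.1]]
test_shifted = [[1.0, 1.0], [1.2, 1.0], [1.0, 1.2], [1.2, 1.2]]

print(centroid_distance(train, test_similar))  # close to 0: the centroids coincide
print(centroid_distance(train, test_shifted))  # about 1.41: the test set has moved
```

Two datasets can share a centroid and still be distributed very differently, so a small distance here does not rule out drift. It only shows the kind of comparison that becomes possible once text is represented as vectors.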
Examples of specific scenarios in which embeddings may come in handy:
Detecting drift in the text - If the distribution of the embeddings of the training data is different from the distribution of the embeddings of the test data, it may indicate that the test data is not representative of the training data, and that the model’s performance on the test data may be lower than expected.
Investigating low test performance - By comparing similar texts on which the model doesn’t perform well, we can try to understand what the model is missing. For example, if the model performs well on news articles but poorly on scientific articles, it may indicate that the model was trained on a dataset that is biased towards news articles, and that it does not generalize well to scientific articles.
Finding conflicting annotations - Clean data is critical for training a good model. Mistakes in the annotations (labels) of the data can lead to a model that performs poorly. By finding similar texts (using embeddings) that have different annotations, we can find potential annotation mistakes and fix them.
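The conflicting-annotations scenario can be sketched as a pairwise search: flag every pair of samples whose embeddings are nearly identical but whose labels differ. The embeddings, labels, and the 0.95 threshold below are made up for illustration, and the helper name find_conflicting_pairs is hypothetical. This is not the Deepchecks implementation.

```python
import math
from itertools import combinations

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))

def find_conflicting_pairs(embeddings, labels, threshold=0.95):
    """Return index pairs (i, j) that are highly similar but labeled differently."""
    conflicts = []
    for i, j in combinations(range(len(embeddings)), 2):
        if labels[i] == labels[j]:
            continue  # same label, nothing to flag
        if cosine_similarity(embeddings[i], embeddings[j]) >= threshold:
            conflicts.append((i, j))
    return conflicts

# Made-up 2-dimensional "embeddings"; row i belongs to labels[i].
embeddings = [
    [0.90, 0.10],  # sample 0
    [0.88, 0.12],  # sample 1: almost identical to sample 0
    [0.10, 0.90],  # sample 2
    [0.12, 0.88],  # sample 3: almost identical to sample 2
]
labels = ["positive", "negative", "negative", "negative"]

print(find_conflicting_pairs(embeddings, labels))  # [(0, 1)]
```

The pairwise loop is quadratic in the number of samples, so it is only suitable for small datasets. Larger datasets would need a nearest-neighbor index instead.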
Using Embeddings in Checks
Whether you use Deepchecks to calculate the embeddings or bring your own model’s embeddings, the process of using them in the checks is the same.
To use the embeddings of your text in a check, the embeddings should already be part of the TextData object.
Using Deepchecks to Calculate Embeddings
If you don’t have model embeddings for your text, you can use deepchecks to calculate them for you. deepchecks currently supports calculating embeddings with either the open-source sentence-transformers library or the paid API of OpenAI.
To calculate the embeddings, call the calculate_builtin_embeddings method of the TextData object. This method calculates the embeddings and adds them to the TextData object.
In the following example, we calculate the built-in embeddings in order to use the TextEmbeddingsDrift check:
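    from deepchecks.nlp import TextData
    from deepchecks.nlp.checks import TextEmbeddingsDrift

    # Initialize the TextData objects (train_text and test_text are your lists of raw strings)
    train_data = TextData(train_text)
    test_data = TextData(test_text)

    # Calculate the built-in embeddings for each dataset
    train_data.calculate_builtin_embeddings()
    test_data.calculate_builtin_embeddings()

    # Run the check; drift is measured between the two datasets
    TextEmbeddingsDrift().run(train_data, test_data)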
Note that any call to the deepchecks.nlp.TextData.calculate_builtin_embeddings() method will overwrite the existing embeddings.
Currently, deepchecks supports either the all-MiniLM-L6-v2 model (the default) from the sentence-transformers library or OpenAI’s text-embedding-ada-002 model. You can choose which model to use through the corresponding argument of calculate_builtin_embeddings.
The embeddings are automatically saved to a local CSV file so they can be used later. You can change the location and name of the file through the corresponding argument of the same method.
Using Your Own Embeddings
Whether you saved the deepchecks embeddings for this dataset earlier to save time, or calculated embeddings with your own model, you can attach them to the TextData object in one of the following ways:
When initializing the TextData object, pass your pre-calculated embeddings to the embeddings parameter.
After the initialization, call the set_embeddings method of the TextData object.
With either method, you can pass the embeddings as a numpy array or as a path to an .npy file. For the correct format of the embeddings, see the Pre-Calculated Embeddings Format section.
In the following example, we pass pre-calculated embeddings to the TextData object in order to use the TextEmbeddingsDrift check:
    from deepchecks.nlp import TextData
    from deepchecks.nlp.checks import TextEmbeddingsDrift

    # Option 1: Initialize the TextData object with the embeddings
    train_data = TextData(train_text, embeddings=train_embeddings)

    # Option 2: Initialize the TextData object and then set the embeddings
    test_data = TextData(test_text)
    test_data.set_embeddings(test_embeddings)

    # Run the check; drift is measured between the two datasets
    TextEmbeddingsDrift().run(train_data, test_data)
Pre-Calculated Embeddings Format
The embeddings should be a numpy.ndarray of shape (N, E), where N is the number of samples in the TextData object and E is the number of embedding dimensions.
The rows of the numpy.ndarray must be in the same order as the samples in the TextData object.
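A quick sanity check before handing embeddings to a TextData object might look like the sketch below. It uses only NumPy. The helper name validate_embeddings, the sample texts, and the random stand-in vectors are assumptions for illustration and are not part of Deepchecks.

```python
import os
import tempfile

import numpy as np

def validate_embeddings(embeddings, n_samples):
    """Check that embeddings form an (N, E) array with one row per sample."""
    embeddings = np.asarray(embeddings)
    if embeddings.ndim != 2:
        raise ValueError(f"expected a 2-D array, got {embeddings.ndim} dimension(s)")
    if embeddings.shape[0] != n_samples:
        raise ValueError(
            f"expected {n_samples} rows (one per sample), got {embeddings.shape[0]}"
        )
    return embeddings

texts = ["first sample", "second sample", "third sample"]

# Random stand-in vectors with E = 8; row i must correspond to texts[i].
rng = np.random.default_rng(seed=0)
embeddings = validate_embeddings(rng.normal(size=(len(texts), 8)), len(texts))

# Round-trip through an .npy file, the on-disk format accepted for embeddings.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "embeddings.npy")
    np.save(path, embeddings)
    loaded = np.load(path)

print(loaded.shape)  # (3, 8)
```

A shape check cannot detect rows that are in the wrong order, so if you shuffle or filter the texts, apply exactly the same indexing to the embeddings array.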