NLP Embeddings#
Embeddings are a way to represent text as a vector of numbers. The vector is a representation of the text in the latent space, in which text with similar meaning is represented by similar vectors.
Embeddings are usually extracted from one of the final layers of a trained neural network model. This can be a model that was trained on the specific task at hand (e.g. sentiment analysis), or a model that was trained on a different task but is known to be good at extracting embeddings (e.g. GPT).
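The idea that similar texts have similar vectors can be illustrated with cosine similarity, a common way to compare embeddings. The following is a minimal sketch: the three-dimensional vectors are invented for illustration and do not come from a real model, and cosine_similarity is a hypothetical helper, not part of Deepchecks.

```python
import math


def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical toy embeddings (real embeddings have hundreds of dimensions).
toy_embeddings = {
    "The movie was great": [0.9, 0.1, 0.2],
    "I loved this film": [0.8, 0.2, 0.3],
    "Quarterly revenue fell": [0.1, 0.9, 0.7],
}

# Two texts with a similar meaning should score close to 1.
similar = cosine_similarity(
    toy_embeddings["The movie was great"],
    toy_embeddings["I loved this film"],
)
# Two unrelated texts should score noticeably lower.
dissimilar = cosine_similarity(
    toy_embeddings["The movie was great"],
    toy_embeddings["Quarterly revenue fell"],
)
print(f"similar: {similar:.3f}, dissimilar: {dissimilar:.3f}")
```

With these toy vectors, the two movie reviews score much higher than the review paired with the finance sentence.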
What Are Embeddings Used For?#
Some of the Deepchecks checks use embeddings to produce a meaningful representation of the data, because some computations (for example, drift) cannot be performed directly on raw text. Inspecting the distribution of the embeddings, or the distance between the embeddings of different texts, can help uncover potential problems in the way the datasets were built, or hint at the model’s expected performance on unseen data.
Examples of specific scenarios in which embeddings may come in handy:
Detecting drift in the text - If the distribution of the embeddings of the training data is different from the distribution of the embeddings of the test data, it may indicate that the test data is not representative of the training data, and that the model’s performance on the test data may be lower than expected.
Investigating low test performance - By comparing similar texts on which the model doesn’t perform well, we can try to understand what the model is missing. For example, if the model performs well on news articles but poorly on scientific articles, it may indicate that the model was trained on a dataset that is biased towards news articles, and that the model does not generalize well to scientific articles.
Finding conflicting annotations - Clean data is critical for training a good model. Mistakes in the annotations (labels) of the data can lead to a model that does not perform well. By finding similar texts (using embeddings) with different annotations, we can find potential annotation mistakes and fix them.
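As a rough illustration of the last scenario, the sketch below flags pairs of samples whose embeddings are nearly identical but whose labels differ. It is not how Deepchecks implements its checks: find_conflicting_annotations, the 0.95 threshold, and the toy vectors are all assumptions made for this example.

```python
import math
from itertools import combinations


def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def find_conflicting_annotations(embeddings, labels, threshold=0.95):
    """Return index pairs with similar embeddings but different labels.

    The threshold is an arbitrary value chosen for this sketch; a real
    dataset would need it tuned.
    """
    conflicts = []
    for i, j in combinations(range(len(embeddings)), 2):
        if labels[i] == labels[j]:
            continue  # same label, nothing to flag
        if cosine_similarity(embeddings[i], embeddings[j]) >= threshold:
            conflicts.append((i, j))
    return conflicts


# Hypothetical toy embeddings and labels, invented for illustration.
embeddings = [
    [0.90, 0.10, 0.20],  # "The movie was great"
    [0.88, 0.12, 0.21],  # "The movie was really great"
    [0.10, 0.90, 0.70],  # "Terrible plot and acting"
]
labels = ["positive", "negative", "negative"]

conflicts = find_conflicting_annotations(embeddings, labels)
print(conflicts)  # pairs worth a manual review
```

Here samples 0 and 1 are almost identical in the embedding space yet carry opposite labels, so the pair is flagged for manual review. The pairwise loop is quadratic in the number of samples, so a larger dataset would call for a nearest-neighbor index instead.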
Using Embeddings in Checks#
Whether you are Using Deepchecks to Calculate Embeddings or using your own model’s embeddings, the process of using them in the checks is the same.
In order to use the embeddings of your text in a check, the embeddings should already be part of the TextData object.
Using Deepchecks to Calculate Embeddings#
If you don’t have model embeddings for your text, you can use deepchecks to calculate the embeddings for you.
deepchecks currently supports using the open-source sentence-transformers library to calculate the embeddings, or the paid API of OpenAI.
Calculating your embeddings is done by calling the calculate_default_embeddings method of the TextData object. This method will calculate the embeddings and add them to the TextData object.
In the following example, we will calculate the default embeddings in order to use the TextEmbeddingsDrift check:
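from deepchecks.nlp import TextData
from deepchecks.nlp.checks import TextEmbeddingsDrift
# Initialize the TextData objects (train_text and test_text are lists of strings)
train_data = TextData(train_text)
test_data = TextData(test_text)
# Calculate the default embeddings for both datasets
train_data.calculate_default_embeddings()
test_data.calculate_default_embeddings()
# Run the check: drift is measured between the two datasets
TextEmbeddingsDrift().run(train_data, test_data)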
Note that any use of the deepchecks.nlp.TextData.calculate_default_embeddings() method will override the existing embeddings.
Currently, deepchecks supports either the all-MiniLM-L6-v2 model (default) from the sentence-transformers library, or OpenAI’s text-embedding-ada-002 model. You can choose which model to use by setting the model parameter to either miniLM or open_ai.
The embeddings are automatically saved to a local CSV file so they can be used later. You can change the location and name of the file by using the file_path parameter.
Using Your Own Embeddings#
Whether you saved the deepchecks embeddings for this dataset somewhere to save time, or you used your own model, you can set the embeddings of the TextData object by using one of the following methods:
When initializing the TextData object, pass your pre-calculated embeddings to the embeddings parameter.
After the initialization, call the set_embeddings method of the TextData object.
In both methods, you can pass the embeddings as a numpy array, or as a path to an .npy file. For the correct format of the embeddings, see the Pre-Calculated Embeddings Format section.
In the following example, we will pass pre-calculated embeddings to the TextData object in order to use the TextEmbeddingsDrift check:
from deepchecks.nlp import TextData
from deepchecks.nlp.checks import TextEmbeddingsDrift
# Option 1: Initialize the TextData object with the embeddings:
train_data = TextData(train_text, embeddings=train_embeddings)
# Option 2: Initialize the TextData object and then set the embeddings:
test_data = TextData(test_text)
test_data.set_embeddings(test_embeddings)
# Run the check: drift is measured between the two datasets
TextEmbeddingsDrift().run(train_data, test_data)
Pre-Calculated Embeddings Format#
The embeddings should be a numpy.ndarray of shape (N, E), where N is the number of samples in the TextData object and E is the number of embedding dimensions.
The numpy.ndarray must be in the same order as the samples in the TextData object.
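Before passing your own embeddings, it can help to verify that they match the format described above. The following is a minimal sketch using only NumPy: validate_embeddings is a hypothetical helper written for this example, not a Deepchecks function, and the zero-filled array merely stands in for real model output.

```python
import numpy as np


def validate_embeddings(embeddings, n_samples):
    """Check that embeddings form an (N, E) array with one row per sample."""
    embeddings = np.asarray(embeddings)
    if embeddings.ndim != 2:
        raise ValueError(
            f"Expected a 2-D array of shape (N, E), got {embeddings.ndim} dimension(s)"
        )
    if embeddings.shape[0] != n_samples:
        raise ValueError(
            f"Expected {n_samples} rows (one per sample), got {embeddings.shape[0]}"
        )
    return embeddings


texts = ["first sample", "second sample", "third sample"]

# Placeholder values: N = 3 samples, E = 4 embedding dimensions.
embeddings = np.zeros((len(texts), 4), dtype=np.float32)

checked = validate_embeddings(embeddings, n_samples=len(texts))
print(checked.shape)
```

A shape check like this cannot confirm that the rows are in the same order as the samples. That depends on how the embeddings were produced, so it is safest to compute them from the same list of texts that is passed to TextData.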