TextData.calculate_builtin_embeddings

TextData.calculate_builtin_embeddings(model: str = 'miniLM', file_path: str = 'embeddings.npy', device: Optional[str] = None, long_sample_behaviour: str = 'average+warn', open_ai_batch_size: int = 500)

Calculate the built-in embeddings of the dataset.

Parameters
model : str, default: ‘miniLM’

The model to use for calculating the embeddings. Possible values are:
  • ‘miniLM’: the miniLM model from the sentence-transformers library.

  • ‘open_ai’: the ADA embedding model from the openai library. Requires an API key.

file_path : str, default: ‘embeddings.npy’

The path to save the embeddings to.

device : str, default: None

The device to use for calculating the embeddings. If None, the default device will be used.

long_sample_behaviour : str, default: ‘average+warn’

How to handle samples that are too long for the model. Averaging is done as described in https://github.com/openai/openai-cookbook/blob/main/examples/Embedding_long_inputs.ipynb. This currently applies only to the ‘open_ai’ model, as the ‘miniLM’ model can handle long samples on its own.

Options are:
  • ‘average+warn’ (default): average the embeddings of the chunks and warn if the sample is too long.

  • ‘average’: average the embeddings of the chunks.

  • ‘truncate’: truncate the sample to the maximum length.

  • ‘raise’: raise an error if the sample is too long.

  • ‘nan’: return an embedding vector of nans for each sample that is too long.
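The averaging options above follow the OpenAI cookbook recipe linked earlier: a long sample is split into chunks that fit the model's context, each chunk is embedded separately, and the chunk embeddings are combined by a length-weighted average that is then L2-normalized. The sketch below illustrates that recipe in plain Python; `embed_chunk` and `MAX_CHUNK_LEN` are stand-ins, not part of the deepchecks API:

```python
import math

MAX_CHUNK_LEN = 4  # illustrative context limit; real embedding models allow thousands of tokens

def embed_chunk(tokens):
    """Stand-in for a real embedding call: returns a small dummy vector."""
    return [float(sum(tokens)), float(len(tokens))]

def average_long_sample(tokens, max_len=MAX_CHUNK_LEN):
    # Split the token sequence into consecutive chunks of at most max_len tokens
    chunks = [tokens[i:i + max_len] for i in range(0, len(tokens), max_len)]
    embeddings = [embed_chunk(chunk) for chunk in chunks]
    weights = [len(chunk) for chunk in chunks]
    total = sum(weights)
    # Length-weighted average of the chunk embeddings, dimension by dimension
    avg = [sum(w * emb[d] for w, emb in zip(weights, embeddings)) / total
           for d in range(len(embeddings[0]))]
    # L2-normalize the averaged vector, as in the cookbook recipe
    norm = math.sqrt(sum(x * x for x in avg)) or 1.0
    return [x / norm for x in avg]
```

The length weighting ensures a short trailing chunk does not pull the average as strongly as the full-size chunks before it.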

open_ai_batch_size : int, default: 500

The number of samples to send to the OpenAI API in each batch. Reduce this value if you get errors from the API.
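A minimal usage sketch, assuming a `TextData` object built from a list of raw strings (the corpus below is illustrative, and the ‘open_ai’ call additionally requires an OpenAI API key to be configured in the environment):

```python
from deepchecks.nlp import TextData

# Hypothetical corpus; any list of strings works
raw_text = ['first document', 'second, somewhat longer document']

dataset = TextData(raw_text)

# Compute miniLM embeddings on the default device and
# save them to the default 'embeddings.npy' file
dataset.calculate_builtin_embeddings(model='miniLM')

# Or use OpenAI's ADA embeddings, averaging over-long samples
# silently and sending smaller batches to avoid API errors
dataset.calculate_builtin_embeddings(
    model='open_ai',
    long_sample_behaviour='average',
    open_ai_batch_size=100,
)
```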