TextData.calculate_builtin_embeddings
- TextData.calculate_builtin_embeddings(model: str = 'miniLM', file_path: str = 'embeddings.npy', device: Optional[str] = None, long_sample_behaviour: str = 'average+warn', open_ai_batch_size: int = 500)
Calculate the built-in embeddings of the dataset.
- Parameters
- model : str, default: 'miniLM'
The model to use for calculating the embeddings. Possible values are: 'miniLM': uses the miniLM model from the sentence-transformers library. 'open_ai': uses OpenAI's ADA embedding model via the openai library; requires an API key.
- file_path : str, default: 'embeddings.npy'
The path to save the embeddings to.
- device : str, default: None
The device to use for calculating the embeddings. If None, the default device will be used.
- long_sample_behaviour : str, default: 'average+warn'
How to handle samples that exceed the model's maximum input length. Averaging is done as described in https://github.com/openai/openai-cookbook/blob/main/examples/Embedding_long_inputs.ipynb (see the sketch after the options list). Currently applies only to the 'open_ai' model, as the 'miniLM' model can handle long samples.
- Options are:
‘average+warn’ (default): average the embeddings of the chunks and warn if the sample is too long.
‘average’: average the embeddings of the chunks.
‘truncate’: truncate the sample to the maximum length.
‘raise’: raise an error if the sample is too long.
‘nan’: return an embedding vector of nans for each sample that is too long.
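A minimal sketch of the chunk-and-average strategy from the linked cookbook notebook, assuming the ADA model's 8191-token context limit. `embed_long_sample`, `embed_fn`, and `MAX_TOKENS` are illustrative helpers, not part of the deepchecks API:

```python
import numpy as np

MAX_TOKENS = 8191  # assumed context limit for the ADA model


def embed_long_sample(tokens, embed_fn, behaviour="average"):
    """Sketch of handling an over-long tokenized sample.

    `embed_fn` is a hypothetical helper that maps a token chunk to an
    embedding vector (standing in for the actual API call).
    """
    if len(tokens) <= MAX_TOKENS:
        return embed_fn(tokens)
    if behaviour == "truncate":
        return embed_fn(tokens[:MAX_TOKENS])
    if behaviour == "raise":
        raise ValueError("sample exceeds the maximum input length")
    if behaviour == "nan":
        # Return a vector of NaNs with the same shape as a real embedding.
        return np.full_like(embed_fn(tokens[:MAX_TOKENS]), np.nan)
    # 'average' / 'average+warn': embed each chunk, average the chunk
    # embeddings weighted by chunk length, and re-normalize to unit norm.
    chunks = [tokens[i:i + MAX_TOKENS] for i in range(0, len(tokens), MAX_TOKENS)]
    embeddings = np.array([embed_fn(chunk) for chunk in chunks])
    weights = np.array([len(chunk) for chunk in chunks], dtype=float)
    avg = np.average(embeddings, axis=0, weights=weights)
    return avg / np.linalg.norm(avg)
```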
- open_ai_batch_size : int, default: 500
The number of samples to send to OpenAI in each batch. Reduce this value if you receive errors from the OpenAI API.
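A minimal usage sketch, assuming a TextData object built from raw text samples (the sample strings and file names are illustrative):

```python
from deepchecks.nlp import TextData

data = TextData(raw_text=['first sample', 'second sample'])

# Default: compute miniLM embeddings on the default device and save them
# to 'embeddings.npy'.
data.calculate_builtin_embeddings()

# OpenAI ADA embeddings with a smaller batch size; requires an API key to
# be configured for the openai library.
data.calculate_builtin_embeddings(
    model='open_ai',
    file_path='ada_embeddings.npy',
    long_sample_behaviour='truncate',
    open_ai_batch_size=100,
)
```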