
.. DO NOT EDIT.
.. THIS FILE WAS AUTOMATICALLY GENERATED BY SPHINX-GALLERY.
.. TO MAKE CHANGES, EDIT THE SOURCE PYTHON FILE:
.. "nlp/auto_checks/train_test_validation/plot_embeddings_drift.py"
.. LINE NUMBERS ARE GIVEN BELOW.

.. rst-class:: sphx-glr-example-title

.. _sphx_glr_nlp_auto_checks_train_test_validation_plot_embeddings_drift.py:

.. _nlp__embeddings_drift:

Embeddings Drift
*******************

This notebook provides an overview of using and understanding the embeddings drift check.

**Structure:**

* `What Is Embeddings Drift? <#what-is-embeddings-drift>`__
* `Loading the Data <#load-data>`__
* `Run the Check <#run-check>`__

What Is Embeddings Drift?
==============================

Drift is simply a change in the distribution of data over time, and it is also one of the top reasons why a machine learning model's performance degrades over time.

In unstructured data such as text, we cannot measure the drift of the data directly, as there's no "distribution" to measure. To measure the drift, we can instead use the model's embeddings as a proxy for the data distribution.

For more on embeddings, see the :ref:`Text Embeddings Guide `.

This check detects embeddings drift by using :ref:`a domain classifier `. For more information on drift, see the :ref:`Drift Guide `.

How Does This Check Work?
=========================

This check detects embeddings drift by using :ref:`a domain classifier `, that is, a classifier trained to tell train samples apart from test samples, and uses the AUC score of the classifier as the basis for the measure of drift. For efficiency, the check first reduces the dimensionality of the embeddings, and then trains the classifier on the reduced embeddings. By default, the check uses UMAP for dimensionality reduction, but you can also use PCA by setting the `dimension_reduction_method` parameter to `pca`.

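To make the mechanism concrete, the following is a minimal sketch of the domain-classifier idea, not the check's actual implementation. It assumes NumPy and scikit-learn are installed, swaps in PCA and logistic regression as stand-ins for whatever reducer and classifier the check uses internally, and rescales the AUC with ``max(2 * AUC - 1, 0)``, which is an illustrative choice. The function name ``domain_classifier_drift`` and the synthetic embeddings are hypothetical.

.. code-block:: python

    import numpy as np
    from sklearn.decomposition import PCA
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import roc_auc_score
    from sklearn.model_selection import train_test_split


    def domain_classifier_drift(train_emb, test_emb, n_components=10, random_state=42):
        """Return (auc, drift_score) for two embedding matrices of shape (n, d)."""
        # Stack both samples and label each row by its origin: 0 = train, 1 = test.
        X = np.vstack([train_emb, test_emb])
        y = np.concatenate([np.zeros(len(train_emb)), np.ones(len(test_emb))])

        # Reduce dimensionality first, for efficiency (PCA here, as an assumption).
        n_components = min(n_components, X.shape[1])
        X_reduced = PCA(n_components=n_components, random_state=random_state).fit_transform(X)

        # Hold out part of the data so the AUC is measured on unseen rows.
        X_fit, X_eval, y_fit, y_eval = train_test_split(
            X_reduced, y, test_size=0.3, stratify=y, random_state=random_state
        )
        clf = LogisticRegression(max_iter=1000).fit(X_fit, y_fit)
        auc = roc_auc_score(y_eval, clf.predict_proba(X_eval)[:, 1])

        # Map AUC to [0, 1]: 0.5 (indistinguishable) -> 0, 1.0 (separable) -> 1.
        return auc, max(2 * auc - 1, 0.0)


    # Synthetic stand-ins for real embeddings (hypothetical data).
    rng = np.random.default_rng(0)
    train_emb = rng.normal(size=(500, 32))
    same_dist = rng.normal(size=(500, 32))
    shifted = rng.normal(loc=1.0, size=(500, 32))

    auc_same, drift_same = domain_classifier_drift(train_emb, same_dist)
    auc_shifted, drift_shifted = domain_classifier_drift(train_emb, shifted)
    print(f"same distribution:    AUC={auc_same:.2f}, drift={drift_same:.2f}")
    print(f"shifted distribution: AUC={auc_shifted:.2f}, drift={drift_shifted:.2f}")

An AUC near 0.5 means the classifier cannot tell the two samples apart, which indicates little or no drift. An AUC near 1.0 means the samples are easily separable, which indicates strong drift.
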
The check also provides a scatter plot of the embeddings, which is a 2D representation of the embeddings space. This is achieved by further reducing the dimensionality, using UMAP.

How To Use Embeddings in Deepchecks?
====================================

See how to calculate default embeddings or set your own embeddings in the :ref:`Embeddings Guide `.

.. GENERATED FROM PYTHON SOURCE LINES 52-55

.. code-block:: default

    from deepchecks.nlp.datasets.classification import tweet_emotion
    from deepchecks.nlp.checks import TextEmbeddingsDrift

.. GENERATED FROM PYTHON SOURCE LINES 56-61

Load Data
==========

For this example, we'll use the tweet emotion dataset, which is a dataset of tweets labeled with one of four emotions: happiness, anger, sadness and optimism.

.. GENERATED FROM PYTHON SOURCE LINES 61-68

.. code-block:: default

    train_ds, test_ds = tweet_emotion.load_data()
    train_embeddings, test_embeddings = tweet_emotion.load_embeddings(as_train_test=True)

    # Set the embeddings in the datasets:
    train_ds.set_embeddings(train_embeddings)
    test_ds.set_embeddings(test_embeddings)

.. GENERATED FROM PYTHON SOURCE LINES 69-70

Let's see what our data looks like:

.. GENERATED FROM PYTHON SOURCE LINES 70-72

.. code-block:: default

    train_ds.head()

.. list-table::
   :header-rows: 1

   * -
     - text
     - label
     - user_age
     - gender
     - days_on_platform
     - user_region
   * - 0
     - No but that's so cute. Atsu was probably shy a...
     - happiness
     - 24.97
     - Male
     - 2729
     - Middle East/Africa
   * - 1
     - Rooneys fucking untouchable isn't he? Been fuc...
     - anger
     - 21.66
     - Male
     - 1376
     - Asia Pacific
   * - 2
     - Tiller and breezy should do a collab album. Ra...
     - happiness
     - 37.29
     - Female
     - 3853
     - Americas
   * - 3
     - @user broadband is shocking regretting signing...
     - anger
     - 15.39
     - Female
     - 1831
     - Europe
   * - 4
     - @user Look at those teef! #growl
     - anger
     - 54.37
     - Female
     - 4619
     - Europe

.. GENERATED FROM PYTHON SOURCE LINES 73-75

Run Check
===============================

.. GENERATED FROM PYTHON SOURCE LINES 77-78

As there's natural drift in this dataset, we can expect to see some drift in the data:

.. GENERATED FROM PYTHON SOURCE LINES 78-83

.. code-block:: default

    check = TextEmbeddingsDrift()
    result = check.run(train_dataset=train_ds, test_dataset=test_ds)
    result

.. rst-class:: sphx-glr-script-out

 .. code-block:: none

    n_jobs value -1 overridden to 1 by setting random_state. Use no seed for parallelism.
    n_jobs value -1 overridden to 1 by setting random_state. Use no seed for parallelism.

[Interactive check output: Embeddings Drift]

.. GENERATED FROM PYTHON SOURCE LINES 84-93

Observing the results
----------------------

We can see that the check found drift in the data. Moreover, we can investigate the drift by looking at the scatter plot, which is a 2D representation of the embeddings space. There are a few clusters in the graph that contain more tweets from the test dataset than from the train dataset. This is a sign of drift in the data.

By hovering over the points, we can see the actual tweets in the dataset. For example, there are clusters of tweets about motivational quotes, which are more common in the test dataset than in the train dataset.

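As described in the section on how the check works, PCA can be used instead of the default UMAP for the dimensionality reduction step. A minimal sketch, assuming ``train_ds`` and ``test_ds`` are the datasets prepared in the Load Data section (with embeddings already set), and assuming the check's raw output can be inspected through ``result.value``:

.. code-block:: python

    from deepchecks.nlp.checks import TextEmbeddingsDrift

    # Use PCA rather than the default UMAP for dimensionality reduction.
    check = TextEmbeddingsDrift(dimension_reduction_method='pca')

    # train_ds and test_ds come from the Load Data section above.
    result = check.run(train_dataset=train_ds, test_dataset=test_ds)

    # Print the raw check value instead of rendering the interactive display.
    print(result.value)

The measured drift may differ somewhat between the two reduction methods, since the classifier is trained on a different low-dimensional representation in each case.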