
.. DO NOT EDIT.
.. THIS FILE WAS AUTOMATICALLY GENERATED BY SPHINX-GALLERY.
.. TO MAKE CHANGES, EDIT THE SOURCE PYTHON FILE:
.. "nlp/auto_checks/train_test_validation/plot_embeddings_drift.py"
.. LINE NUMBERS ARE GIVEN BELOW.

.. only:: html

    .. note::
        :class: sphx-glr-download-link-note

        :ref:`Go to the end `
        to download the full example code

.. rst-class:: sphx-glr-example-title

.. _sphx_glr_nlp_auto_checks_train_test_validation_plot_embeddings_drift.py:

.. _nlp__embeddings_drift:

Embeddings Drift
*******************

This notebook provides an overview of how to use and understand the embeddings drift check.

**Structure:**

* `What Is Embeddings Drift? <#what-is-embeddings-drift>`__
* `Loading the Data <#load-data>`__
* `Run the Check <#run-check>`__

What Is Embeddings Drift?
==============================

Drift is a change in the distribution of data over time, and it is one of the top reasons why a machine learning model's performance degrades over time.

Unstructured data such as text has no directly measurable "distribution", so we cannot measure the drift of the raw data. Instead, we can use the model's embeddings as a proxy for the data distribution.

For more on embeddings, see the :ref:`Text Embeddings Guide `.

This check detects embeddings drift by using :ref:`a domain classifier `.
For more information on drift, see the :ref:`Drift Guide `.

How Does This Check Work?
=========================

This check detects embeddings drift by using :ref:`a domain classifier `, that is, a classifier trained to tell train samples apart from test samples. The AUC score of that classifier is the basis for the drift measure.
For efficiency, the check first reduces the dimensionality of the embeddings, and then trains the classifier on the reduced embeddings. By default, the check uses UMAP for dimensionality reduction, but you can also use PCA by setting the `dimension_reduction_method` parameter to `pca`.
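The domain-classifier idea described above can be illustrated with a minimal, standard-library-only sketch. This is not the deepchecks implementation: the dimensionality-reduction step (UMAP or PCA) is omitted, a simple centroid-difference direction stands in for the real classifier, and the helper names (``auc``, ``centroid``, ``domain_classifier_auc``, ``fake_embeddings``) as well as the Gaussian "embeddings" are hypothetical and exist only for illustration.

```python
import random
from statistics import fmean


def auc(pos_scores, neg_scores):
    """Chance that a random positive outscores a random negative (ties count 0.5)."""
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            wins += 1.0 if p > n else 0.5 if p == n else 0.0
    return wins / (len(pos_scores) * len(neg_scores))


def centroid(vectors):
    """Per-dimension mean of a list of equal-length vectors."""
    return [fmean(column) for column in zip(*vectors)]


def domain_classifier_auc(train_emb, test_emb, seed=0):
    """Fit a toy train-vs-test classifier on half the data, score the other half."""
    rng = random.Random(seed)
    train, test = list(train_emb), list(test_emb)
    rng.shuffle(train)
    rng.shuffle(test)
    half_train, half_test = len(train) // 2, len(test) // 2

    # "Training": the direction pointing from the train centroid to the test centroid.
    c_train = centroid(train[:half_train])
    c_test = centroid(test[:half_test])
    direction = [b - a for a, b in zip(c_train, c_test)]

    def score(vector):
        # A higher score means "looks more like the test set".
        return sum(d * x for d, x in zip(direction, vector))

    # Evaluation on the held-out halves; test samples are the positive class.
    return auc([score(v) for v in test[half_test:]],
               [score(v) for v in train[half_train:]])


def fake_embeddings(n, dim, mean, rng):
    """Hypothetical stand-in for real embeddings: Gaussian noise around `mean`."""
    return [[rng.gauss(mean, 1.0) for _ in range(dim)] for _ in range(n)]


rng = random.Random(42)
reference = fake_embeddings(200, 8, 0.0, rng)
same_auc = domain_classifier_auc(reference, fake_embeddings(200, 8, 0.0, rng))
shifted_auc = domain_classifier_auc(reference, fake_embeddings(200, 8, 2.0, rng))
print(f"no drift: AUC = {same_auc:.2f}, drift: AUC = {shifted_auc:.2f}")
```

When the two sets come from the same distribution, the classifier cannot do better than chance and the AUC stays close to 0.5. When the second set is shifted, the classifier separates the two sets easily and the AUC approaches 1, which is why the AUC can serve as the basis for a drift measure.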

The check also provides a scatter plot of the embeddings, which is a 2D representation of the embedding space. This is achieved by further reducing the dimensionality, using UMAP.

How To Use Embeddings in Deepchecks?
====================================

See how to calculate the default embeddings or set your own embeddings in the :ref:`Embeddings Guide `.

.. GENERATED FROM PYTHON SOURCE LINES 52-55

.. code-block:: default

    from deepchecks.nlp.datasets.classification import tweet_emotion
    from deepchecks.nlp.checks import TextEmbeddingsDrift

.. GENERATED FROM PYTHON SOURCE LINES 56-61

Load Data
==========

For this example, we'll use the tweet emotion dataset, a dataset of tweets each labeled with one of four emotions: happiness, anger, sadness and optimism.

.. GENERATED FROM PYTHON SOURCE LINES 61-68

.. code-block:: default

    train_ds, test_ds = tweet_emotion.load_data()
    train_embeddings, test_embeddings = tweet_emotion.load_embeddings(as_train_test=True)

    # Set the embeddings in the datasets:
    train_ds.set_embeddings(train_embeddings)
    test_ds.set_embeddings(test_embeddings)

.. GENERATED FROM PYTHON SOURCE LINES 69-70

Let's see what our data looks like:

.. GENERATED FROM PYTHON SOURCE LINES 70-72

.. code-block:: default

    train_ds.head()

.. csv-table::
   :header: "", "text", "label", "user_age", "gender", "days_on_platform", "user_region"

   0, "No but that's so cute. Atsu was probably shy a...", happiness, 24.97, Male, 2729, Middle East/Africa
   1, "Rooneys fucking untouchable isn't he? Been fuc...", anger, 21.66, Male, 1376, Asia Pacific
   2, "Tiller and breezy should do a collab album. Ra...", happiness, 37.29, Female, 3853, Americas
   3, "@user broadband is shocking regretting signing...", anger, 15.39, Female, 1831, Europe
   4, "@user Look at those teef! #growl", anger, 54.37, Female, 4619, Europe


.. GENERATED FROM PYTHON SOURCE LINES 73-75

Run Check
===============================

.. GENERATED FROM PYTHON SOURCE LINES 77-78

Because this dataset contains natural drift, we can expect the check to detect some drift in the data:

.. GENERATED FROM PYTHON SOURCE LINES 78-83

.. code-block:: default

    check = TextEmbeddingsDrift()
    result = check.run(train_dataset=train_ds, test_dataset=test_ds)
    result

.. rst-class:: sphx-glr-script-out

 .. code-block:: none

    n_jobs value -1 overridden to 1 by setting random_state. Use no seed for parallelism.
    n_jobs value -1 overridden to 1 by setting random_state. Use no seed for parallelism.

[Check output: interactive "Embeddings Drift" display]


.. GENERATED FROM PYTHON SOURCE LINES 84-93

Observing the results
----------------------

We can see that the check found drift in the data. Moreover, we can investigate the drift by looking at the scatter plot, which is a 2D representation of the embedding space. A few clusters in the graph contain more tweets from the test dataset than from the train dataset, which is a sign of drift in the data. By hovering over the points, we can see the actual tweets in the dataset and find, for example, clusters of tweets about motivational quotes, which are more common in the test dataset than in the train dataset.

.. rst-class:: sphx-glr-timing

   **Total running time of the script:** (0 minutes 25.385 seconds)

.. _sphx_glr_download_nlp_auto_checks_train_test_validation_plot_embeddings_drift.py:

.. only:: html

  .. container:: sphx-glr-footer sphx-glr-footer-example

    .. container:: sphx-glr-download sphx-glr-download-python

      :download:`Download Python source code: plot_embeddings_drift.py `

    .. container:: sphx-glr-download sphx-glr-download-jupyter

      :download:`Download Jupyter notebook: plot_embeddings_drift.ipynb `

.. only:: html

 .. rst-class:: sphx-glr-signature

    `Gallery generated by Sphinx-Gallery `_