.. DO NOT EDIT.
.. THIS FILE WAS AUTOMATICALLY GENERATED BY SPHINX-GALLERY.
.. TO MAKE CHANGES, EDIT THE SOURCE PYTHON FILE:
.. "user-guide/nlp/auto_quickstarts/plot_text_classification.py"
.. LINE NUMBERS ARE GIVEN BELOW.

.. only:: html

    .. note::
        :class: sphx-glr-download-link-note

        :ref:`Go to the end ` to download the full example code

.. rst-class:: sphx-glr-example-title

.. _sphx_glr_user-guide_nlp_auto_quickstarts_plot_text_classification.py:


Test NLP Classification Tasks - Quickstart
******************************************

To run deepchecks for NLP, all you need for both your train and test data is:

1. Your text data - a list of strings, where each string is a single sample (a sentence, paragraph, document, etc.).
2. Your labels - either a :ref:`Text Classification ` label or a :ref:`Token Classification ` label.
3. Your model's predictions (see :doc:`Supported Tasks ` for info on the supported formats).

If you don't have deepchecks installed yet:

.. code:: python

    import sys
    !{sys.executable} -m pip install deepchecks[nlp] -U --quiet #--user

Some properties calculated by ``deepchecks.nlp`` require additional packages to be installed. You can
install them by running:

.. code:: python

    import sys
    !{sys.executable} -m pip install langdetect>=1.0.9 textblob>=0.17.1 -U --quiet #--user

Finally, we'll be using a CatBoost model in this guide, so we'll also need to install it:

.. code:: python

    import sys
    !{sys.executable} -m pip install catboost -U --quiet #--user

.. GENERATED FROM PYTHON SOURCE LINES 38-42

Load Data & Create TextData Objects
===================================

For the purpose of this guide we'll use a small subset of the `tweet emotion `__ dataset:

.. GENERATED FROM PYTHON SOURCE LINES 42-51

.. code-block:: default

    # Imports
    from deepchecks.nlp import TextData
    from deepchecks.nlp.datasets.classification import tweet_emotion

    # Load Data
    train, test = tweet_emotion.load_data(data_format='DataFrame')
    train.head()
.. list-table::
   :header-rows: 1

   * -
     - text
     - user_age
     - gender
     - days_on_platform
     - user_region
     - label
   * - 2
     - No but that's so cute. Atsu was probably shy a...
     - 24.97
     - Male
     - 2729
     - Middle East/Africa
     - happiness
   * - 3
     - Rooneys fucking untouchable isn't he? Been fuc...
     - 21.66
     - Male
     - 1376
     - Asia Pacific
     - anger
   * - 7
     - Tiller and breezy should do a collab album. Ra...
     - 37.29
     - Female
     - 3853
     - Americas
     - happiness
   * - 8
     - @user broadband is shocking regretting signing...
     - 15.39
     - Female
     - 1831
     - Europe
     - anger
   * - 9
     - @user Look at those teef! #growl
     - 54.37
     - Female
     - 4619
     - Europe
     - anger
.. GENERATED FROM PYTHON SOURCE LINES 52-63

We can see that we have the tweet text itself, the label (the emotion) and some additional metadata columns.

We can now create a :class:`TextData ` object for the train and test dataframes. This object is used to pass your data to the deepchecks checks.

To create a TextData object, the only required argument is the text itself, but passing only the text will prevent multiple checks from running. In this example we'll pass the label as well, and also provide metadata (the other columns in the dataframe) which we'll use later on in the guide. Finally, we'll also explicitly set the index.

.. note::

    The label column is optional, but if provided you must also pass the ``task_type`` argument, so that deepchecks
    will know how to interpret the label column.

.. GENERATED FROM PYTHON SOURCE LINES 64-74

.. code-block:: default

    train = TextData(train.text, label=train['label'], task_type='text_classification',
                     index=train.index, metadata=train.drop(columns=['label', 'text']))
    test = TextData(test.text, label=test['label'], task_type='text_classification',
                    index=test.index, metadata=test.drop(columns=['label', 'text']))

.. GENERATED FROM PYTHON SOURCE LINES 75-84

Building a Model
================

In this example we'll train a very basic model for simplicity, using a CatBoostClassifier trained over the embeddings of the tweets. In this case these embeddings were created using the OpenAI GPT-3 model.

If you want to reproduce this kind of basic model for your own task, you can calculate your own embeddings, or use our :func:`calculate_embeddings_for_text ` function to create embeddings from a generic model. Note that in order to run it you need either an OpenAI API key or to have HuggingFace's ``transformers`` package installed. A minimal sketch of computing embeddings with a generic open-source model is shown right after the training code below.

.. GENERATED FROM PYTHON SOURCE LINES 84-98

.. code-block:: default

    from sklearn.metrics import roc_auc_score
    from catboost import CatBoostClassifier

    # Load Embeddings and Split to Train and Test
    embeddings = tweet_emotion.load_embeddings()
    train_embeddings, test_embeddings = embeddings[train.index, :], embeddings[test.index, :]

    # Train a small CatBoost classifier on the train embeddings
    model = CatBoostClassifier(max_depth=2, n_estimators=50, random_state=42)
    model.fit(train_embeddings, train.label, verbose=0)

    # Evaluate with macro-averaged one-vs-rest ROC AUC on the test set
    print(roc_auc_score(test.label, model.predict_proba(test_embeddings),
                        multi_class="ovr", average="macro"))

.. rst-class:: sphx-glr-script-out

 .. code-block:: none

    0.8775586575490586
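For illustration only, here is a minimal sketch of what "calculating your own embeddings" could look like with a generic open-source model. It is not how the precomputed embeddings above were created; the ``sentence-transformers`` package, the ``all-MiniLM-L6-v2`` model name and the ``my_embeddings`` variable are assumptions made for this sketch:

.. code-block:: python

    # A sketch, assuming the sentence-transformers package is installed
    # (pip install sentence-transformers). Any model that turns a list of strings
    # into an array of shape (n_samples, embedding_dim) would work the same way.
    from sentence_transformers import SentenceTransformer

    encoder = SentenceTransformer('all-MiniLM-L6-v2')  # example model name
    my_embeddings = encoder.encode(list(train.text), show_progress_bar=False)
    print(my_embeddings.shape)  # (n_samples, embedding_dim)

Such an array, aligned with the order of your samples, can then be used in place of the precomputed embeddings loaded above.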
.. GENERATED FROM PYTHON SOURCE LINES 99-121

Running Deepchecks
==================

Now that we have our data and model, we can run our first checks! We'll run two types of checks:

1. `Model Evaluation Checks`_ - checks to run once we have trained our model.
2. `Data Integrity Checks`_ - checks to run on our dataset, before we train our model.

Additionally, ``deepchecks.nlp`` currently has one `Train-Test Validation` check - the :class:`Label Drift ` check (a minimal sketch of running it is shown right after the prediction drift output below). You can read more about when you should use deepchecks :ref:`here `.

Model Evaluation Checks
-----------------------

We'll start by running the :class:`PredictionDrift ` check, which will let us know if there has been a significant change in the model's predictions between the train and test data. Such a change may imply that something has changed in the data distribution between the train and test data in a way that affects the model's predictions.

We'll also add a condition to the check, which will make it fail if the drift score is higher than 0.1.

.. GENERATED FROM PYTHON SOURCE LINES 121-136

.. code-block:: default

    # Start by computing the predictions for the train and test data:
    train_preds, train_probas = model.predict(train_embeddings), model.predict_proba(train_embeddings)
    test_preds, test_probas = model.predict(test_embeddings), model.predict_proba(test_embeddings)

    # Run the check
    from deepchecks.nlp.checks import PredictionDrift
    check = PredictionDrift().add_condition_drift_score_less_than(0.1)
    result = check.run(train, test, train_predictions=list(train_preds), test_predictions=list(test_preds))

    # Note: the result can be saved as html using result.save_as_html()
    # or exported to json using result.to_json()
    result.show()
*[Interactive check output: Prediction Drift]*
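As mentioned above, ``deepchecks.nlp`` also has a single train-test validation check, Label Drift. A minimal sketch of running it on the same ``TextData`` objects - this is not part of the original example, and it assumes the check is importable as ``LabelDrift`` from ``deepchecks.nlp.checks``:

.. code-block:: python

    # A sketch: compare the label distributions of the train and test TextData objects.
    # No model predictions are needed, since both objects were created with labels.
    from deepchecks.nlp.checks import LabelDrift

    label_drift_result = LabelDrift().run(train, test)
    label_drift_result.show()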


.. GENERATED FROM PYTHON SOURCE LINES 137-145

Going back to the prediction drift output, we can see that the check passed and that the drift score is quite low.

Next, we'll run the :class:`MetadataSegmentsPerformance ` check, which evaluates the model's performance on different segments of the metadata that we provided earlier when creating the :class:`TextData ` objects, and reports back any segments with significantly lower performance than the rest of the data.

.. GENERATED FROM PYTHON SOURCE LINES 145-153

.. code-block:: default

    from deepchecks.nlp.checks import MetadataSegmentsPerformance

    # Run the check on the test data, passing the model's predictions and probabilities
    check = MetadataSegmentsPerformance()
    result = check.run(test, predictions=list(test_preds), probabilities=test_probas)
    result.show()
*[Interactive check output: Metadata Segments Performance]*


.. GENERATED FROM PYTHON SOURCE LINES 154-187

As we can see, the check found a segment with significantly lower performance than the rest of the data: in the first tab of the display there is a large segment of young European users on which the model performs noticeably worse. Perhaps there is some language gap here? We should probably collect and annotate more data from this segment.

Properties
^^^^^^^^^^

Properties are one-dimensional values that are extracted from the text. Among other uses, they can be used to segment the data, similarly to the metadata segments we saw in the previous check.

Before we can run the :class:`PropertySegmentsPerformance ` check, we need to make sure that our :class:`TextData ` objects have the properties that we want to use. Properties can be added to the TextData objects in one of the following ways:

1. Calculated automatically by deepchecks. Deepchecks has a set of predefined properties that can be calculated automatically. They can be added to the TextData object either by passing ``properties='auto'`` to the TextData constructor, or by calling the :meth:`calculate_default_properties ` method anytime later.
2. You can calculate your own properties and then add them to the TextData object. This can be done by passing a DataFrame of properties to the TextData ``properties`` argument, or by calling the :meth:`set_properties ` method anytime later with such a DataFrame (a minimal sketch is shown after the check output below).

.. note::

    Some of the default properties require additional packages to be installed. If you want to use them, you can install them by running ``pip install deepchecks[nlp-properties]``. Additionally, some properties that use the ``transformers`` package are computationally expensive and may take a long time to calculate. If you have a GPU or a similar device, you can use it by installing the appropriate package versions and passing a ``device`` argument to the ``TextData`` constructor or to the ``calculate_default_properties`` method.

.. GENERATED FROM PYTHON SOURCE LINES 187-199

.. code-block:: default

    # Calculate the default properties for the train and test data
    train.calculate_default_properties()
    test.calculate_default_properties()

    # Run the check
    from deepchecks.nlp.checks import PropertySegmentsPerformance
    check = PropertySegmentsPerformance(segment_minimum_size_ratio=0.07)
    result = check.run(test, predictions=list(test_preds), probabilities=test_probas)
    result.show()
*[Interactive check output: Property Segments Performance]*
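As described in option 2 of the properties list above, you can also compute properties yourself and attach them with ``set_properties``. A minimal, standalone sketch, assuming ``set_properties`` accepts a DataFrame with one row per sample in the same order as the text; the ``text_length`` property and the ``demo_*`` names are illustrative only and are kept separate from the train/test objects used in this guide:

.. code-block:: python

    import pandas as pd
    from deepchecks.nlp import TextData

    # A tiny standalone example, separate from the guide's train/test objects
    samples = ["I love this!", "This is awful...", "Not sure how I feel about it."]
    demo_data = TextData(samples, label=['happiness', 'anger', 'sadness'],
                         task_type='text_classification')

    # A hypothetical custom property: the character length of each sample
    demo_properties = pd.DataFrame({'text_length': [len(s) for s in samples]})
    demo_data.set_properties(demo_properties)

Property-based checks such as PropertySegmentsPerformance and PropertyLabelCorrelation will then use these user-defined properties just like the built-in ones.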


.. GENERATED FROM PYTHON SOURCE LINES 200-215

Returning to the Property Segments Performance output, we can see that the check found some segments with lower performance compared to the rest of the dataset. It seems that the model has a harder time predicting the emotions in the "neutral-positive" sentiment range (in our case, between around 0 and 0.45).

Data Integrity Checks
---------------------

The previous checks were all about the model's performance. Now we'll run a check that attempts to find instances of shortcut learning - cases in which the label can be predicted from simple aspects of the data, which is often an indication that the model has picked up on information that won't generalize to the real world.

This check is the :class:`PropertyLabelCorrelation ` check, which computes the correlation between the properties and the labels, and reports back any properties that have a high correlation with the labels.

.. GENERATED FROM PYTHON SOURCE LINES 215-222

.. code-block:: default

    from deepchecks.nlp.checks import PropertyLabelCorrelation

    # Run the check on the test data; it uses the properties calculated earlier and the labels
    check = PropertyLabelCorrelation(n_top_features=10)
    result = check.run(test)
    result.show()
*[Interactive check output: Property-Label Correlation]*
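Before reading the result, note that - as mentioned in the code comment earlier - any check result can be exported for sharing. A minimal sketch, assuming ``save_as_html`` accepts a target file path (the file name here is arbitrary):

.. code-block:: python

    # Save the interactive display to a standalone HTML file, or export the result data as JSON
    result.save_as_html('property_label_correlation_report.html')
    result_json = result.to_json()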


.. GENERATED FROM PYTHON SOURCE LINES 223-228

In this case, the check didn't flag any properties as having a suspiciously high correlation with the labels. Apart from the sentiment property, which is expected to be highly related to the emotion of the tweet, the other properties have very low correlation with the label.

You can find the full list of available NLP checks in the :mod:`nlp.checks api documentation `.


.. rst-class:: sphx-glr-timing

   **Total running time of the script:** ( 0 minutes 53.647 seconds)


.. _sphx_glr_download_user-guide_nlp_auto_quickstarts_plot_text_classification.py:

.. only:: html

  .. container:: sphx-glr-footer sphx-glr-footer-example

    .. container:: sphx-glr-download sphx-glr-download-python

      :download:`Download Python source code: plot_text_classification.py `

    .. container:: sphx-glr-download sphx-glr-download-jupyter

      :download:`Download Jupyter notebook: plot_text_classification.ipynb `

.. only:: html

 .. rst-class:: sphx-glr-signature

    `Gallery generated by Sphinx-Gallery `_