.. DO NOT EDIT.
.. THIS FILE WAS AUTOMATICALLY GENERATED BY SPHINX-GALLERY.
.. TO MAKE CHANGES, EDIT THE SOURCE PYTHON FILE:
.. "nlp/auto_tutorials/quickstarts/plot_multi_label_classification.py"
.. LINE NUMBERS ARE GIVEN BELOW.

.. only:: html

    .. note::
        :class: sphx-glr-download-link-note

        :ref:`Go to the end `
        to download the full example code

.. rst-class:: sphx-glr-example-title

.. _sphx_glr_nlp_auto_tutorials_quickstarts_plot_multi_label_classification.py:

.. _nlp__multilabel_quickstart:

NLP Multi Label Classification Quickstart
*****************************************

In this quickstart guide, we will go over using the deepchecks NLP package to analyze and evaluate a text
multi label classification task. If you are interested in a regular multiclass classification task, you can
refer to our :ref:`Multiclass Quickstart `. We will cover the following:

1. `Creating a TextData object and auto calculating properties <#setting-up>`__
2. `Running the built-in suites <#running-the-deepchecks-default-suites>`__
3. `Running individual checks <#running-individual-checks>`__

To run deepchecks for NLP, you need the following for both your train and test data:

1. Your text data - a list of strings, where each string is a single sample (a sentence, paragraph, document, etc.).
2. Your labels and predictions in the :ref:`correct format ` (Optional).
3. :ref:`Metadata `, :ref:`Properties ` or :ref:`Embeddings ` for the provided text data (Optional).

If you don't have deepchecks installed yet:

.. code:: python

    import sys
    !{sys.executable} -m pip install deepchecks[nlp] -U --quiet #--user

Some properties calculated by ``deepchecks.nlp`` require additional packages to be installed. You can
install them by running:

.. code:: python

    import sys
    !{sys.executable} -m pip install deepchecks[nlp-properties] -U --quiet #--user

Setting Up
==========

Load Data
---------

For the purpose of this guide, we'll use a small subset of the
`just dance `__ comment analysis dataset.
The dataset contains comments, metadata and labels for a multilabel category classification use case on
YouTube comments.

.. GENERATED FROM PYTHON SOURCE LINES 48-57

.. code-block:: default


    from deepchecks.nlp import TextData
    from deepchecks.nlp.datasets.classification import just_dance_comment_analysis

    data = just_dance_comment_analysis.load_data(data_format='DataFrame', as_train_test=False)
    metadata_cols = ['likes', 'dateComment']
    data.head(2)


.. rst-class:: sphx-glr-script-out

 .. code-block:: none

    include_properties and include_embeddings are incompatible with data_format="Dataframe"

.. code-block:: none

       Aesthetics and Appeal  Affect and Emotion  Anticipation  Bodily image and Appearance  Comfort  ...  Usability  User Differences  dateComment  likes  originalText
    0                    0.0                 1.0           0.0                          0.0      0.0  ...        0.0               0.0   2009-07-12    1.0  Thanks, Leeanna made it. and put it on my chan...
    1                    0.0                 1.0           0.0                          0.0      0.0  ...        1.0               0.0   2009-07-12    1.0  omg i love this video ! \nthanks for uploading...

    2 rows × 45 columns
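Each label column in the table above is a 0/1 indicator, so a single comment can carry several labels or none at all. As a rough illustration of what that means in practice, the sketch below counts samples per label and labels per sample. The rows are made-up toy values (only the column names are borrowed from the table), and ``label_summary`` is a hypothetical helper, not part of deepchecks.

```python
from collections import Counter

# Toy rows shaped like the label columns above: one 0/1 indicator per label.
# The values are invented for illustration and are not taken from the dataset.
rows = [
    {"Affect and Emotion": 1, "Comfort": 0, "Usability": 0},
    {"Affect and Emotion": 1, "Comfort": 0, "Usability": 1},
    {"Affect and Emotion": 0, "Comfort": 0, "Usability": 0},
]


def label_summary(rows):
    """Return (number of samples per label, number of active labels per sample)."""
    per_label = Counter()
    per_sample = []
    for row in rows:
        # Collect the names of the labels switched on for this sample.
        active = [name for name, flag in row.items() if flag == 1]
        per_label.update(active)
        per_sample.append(len(active))
    return per_label, per_sample


per_label, per_sample = label_summary(rows)
print(dict(per_label))
print(per_sample)
```

A quick summary like this can help spot labels with very few positive samples before running any checks.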



.. GENERATED FROM PYTHON SOURCE LINES 58-64

Create TextData Objects
------------------------

Deepchecks' :ref:`TextData ` object contains the text samples, labels, and possibly also
properties and metadata. It caches results to save time between repeated computations and provides
input validation and sampling functionality.

.. GENERATED FROM PYTHON SOURCE LINES 64-71

.. code-block:: default


    label_cols = data.drop(columns=['originalText'] + metadata_cols)
    class_names = label_cols.columns.to_list()
    dataset = TextData(data['originalText'], label=label_cols.to_numpy().astype(int),
                       task_type='text_classification',
                       metadata=data[metadata_cols], categorical_metadata=[])


.. GENERATED FROM PYTHON SOURCE LINES 72-80

Calculating Properties
----------------------

Some of deepchecks' checks use properties of the text samples for various calculations. Deepchecks has a wide
variety of such properties, some simple and some that rely on external models and are heavier to run. In order
for deepchecks' checks to be able to access the properties, they must be stored within the
:ref:`TextData ` object. You can read more about properties in the
:ref:`Property Guide `.

.. GENERATED FROM PYTHON SOURCE LINES 80-91

.. code-block:: default


    # properties can be either calculated directly by Deepchecks
    # or imported from other sources in appropriate format

    # device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
    # dataset.calculate_builtin_properties(include_long_calculation_properties=True, device=device)

    properties = just_dance_comment_analysis.load_properties(as_train_test=False)
    dataset.set_properties(properties, categorical_properties=['Language'])
    dataset.properties.head(2)


.. code-block:: none

      Language  Text Length  Average Word Length  Max Word Length  % Special Characters  ...  Formality  Lexical Density  Unique Noun Count  Readability Score  Average Sentence Length
    0       en           97             4.105263                9              0.041237  ...   0.049017            89.47                7.0            102.449                      6.0
    1      NaN           73             3.866667               10              0.082192  ...        NaN           100.00                NaN                NaN                      3.0

    2 rows × 14 columns
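To give a feel for what the simpler properties measure, here is a minimal sketch that computes a few of them by hand for a single string. It is an approximation under simple assumptions (whitespace tokenization, "special" meaning neither alphanumeric nor whitespace) and is not deepchecks' implementation, so its numbers need not match the table above. ``simple_text_properties`` is a hypothetical helper.

```python
def simple_text_properties(text):
    """Compute a few simple, illustrative properties for one text sample."""
    words = text.split()  # naive whitespace tokenization (assumption)
    word_lengths = [len(word) for word in words]
    # Count characters that are neither letters/digits nor whitespace.
    special = sum(1 for ch in text if not (ch.isalnum() or ch.isspace()))
    return {
        "Text Length": len(text),
        "Average Word Length": sum(word_lengths) / len(words) if words else 0.0,
        "Max Word Length": max(word_lengths, default=0),
        "% Special Characters": special / len(text) if text else 0.0,
    }


props = simple_text_properties("omg i love this video !")
for name, value in props.items():
    print(f"{name}: {value}")
```

Properties such as Formality or Language cannot be computed this way; those are the ones that rely on external models.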



.. GENERATED FROM PYTHON SOURCE LINES 92-107

Running the deepchecks default suites
=====================================

Deepchecks comes with a set of pre-built suites that can be used to run a set of checks on your data, along
with their default conditions and thresholds. You can read more about customizing and creating your own suites in the
:ref:`Customizations Guide `.

Data Integrity
--------------

We will start with a preliminary integrity check to validate the text formatting. We recommend running this
step before splitting the data into train and test/validation sets and before training a model, since its
findings may indicate that additional data engineering is required.

We'll do that using the :mod:`data_integrity ` pre-built suite. Note that we are limiting
the number of samples to 1000 to get a quick, high-level overview of potential issues.

.. GENERATED FROM PYTHON SOURCE LINES 107-113

.. code-block:: default


    from deepchecks.nlp.suites import data_integrity

    data_integrity_suite = data_integrity(n_samples=1000)
    data_integrity_suite.run(dataset, model_classes=class_names)


.. rst-class:: sphx-glr-script-out

 .. code-block:: none

    Data Integrity Suite: | | 0/8 [Time: 00:00]
    Data Integrity Suite: |## | 2/8 [Time: 00:00, Check=Text Property Outliers]
    Data Integrity Suite: |### | 3/8 [Time: 00:01, Check=Text Duplicates]
    Data Integrity Suite: |#### | 4/8 [Time: 00:02, Check=Conflicting Labels]
    Data Integrity Suite: |##### | 5/8 [Time: 00:02, Check=Special Characters]
    Data Integrity Suite: |###### | 6/8 [Time: 00:04, Check=Unknown Tokens]
    Data Integrity Suite: |####### | 7/8 [Time: 00:09, Check=Under Annotated Property Segments]
    6 fits failed out of a total of 6.
    The score on these train-test partitions for these parameters will be set to nan.
    If these failures are not expected, you can try to debug them by setting error_score='raise'.
Below are more details about the failures: -------------------------------------------------------------------------------- 6 fits failed with the following error: Traceback (most recent call last): File "/home/runner/work/deepchecks/deepchecks/venv/lib/python3.9/site-packages/sklearn/model_selection/_validation.py", line 680, in _fit_and_score estimator.fit(X_train, y_train, **fit_params) File "/home/runner/work/deepchecks/deepchecks/venv/lib/python3.9/site-packages/sklearn/tree/_classes.py", line 1315, in fit super().fit( File "/home/runner/work/deepchecks/deepchecks/venv/lib/python3.9/site-packages/sklearn/tree/_classes.py", line 165, in fit X, y = self._validate_data( File "/home/runner/work/deepchecks/deepchecks/venv/lib/python3.9/site-packages/sklearn/base.py", line 578, in _validate_data X = check_array(X, **check_X_params) File "/home/runner/work/deepchecks/deepchecks/venv/lib/python3.9/site-packages/sklearn/utils/validation.py", line 800, in check_array _assert_all_finite(array, allow_nan=force_all_finite == "allow-nan") File "/home/runner/work/deepchecks/deepchecks/venv/lib/python3.9/site-packages/sklearn/utils/validation.py", line 114, in _assert_all_finite raise ValueError( ValueError: Input contains NaN, infinity or a value too large for dtype('float32'). Data Integrity Suite: |########| 8/8 [Time: 00:09, Check=Under Annotated Meta Data Segments] .. raw:: html
Data Integrity Suite
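One of the checks listed in the progress output, Conflicting Labels, looks for samples with the same text but different labels. The sketch below shows the general idea on toy data. The normalization (lowercasing and collapsing whitespace) is an assumption made for illustration and may differ from what deepchecks does, and ``find_conflicting_labels`` is a hypothetical helper.

```python
from collections import defaultdict


def find_conflicting_labels(texts, labels):
    """Return {normalized text: sample indices} for texts seen with differing labels."""
    groups = defaultdict(list)
    for idx, (text, label) in enumerate(zip(texts, labels)):
        # Naive normalization (assumption): lowercase and collapse whitespace.
        key = " ".join(text.lower().split())
        groups[key].append((idx, tuple(label)))

    conflicts = {}
    for key, members in groups.items():
        distinct_labels = {label for _, label in members}
        if len(distinct_labels) > 1:
            conflicts[key] = [idx for idx, _ in members]
    return conflicts


# Toy multilabel data: the first two texts are duplicates with different label vectors.
texts = ["Great game!", "great  game!", "Too hard", "Too hard"]
labels = [[1, 0], [0, 1], [0, 1], [0, 1]]

conflicts = find_conflicting_labels(texts, labels)
print(conflicts)
```

The last two samples are duplicates as well, but they agree on their labels, so only the first pair is reported.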


.. GENERATED FROM PYTHON SOURCE LINES 114-134

Integrity #1: Unknown Tokens
~~~~~~~~~~~~~~~~~~~~~~~~~~~~

First up (in the “Didn’t Pass” tab) we see that the Unknown Tokens check has returned a problem.

Looking at the result, we can see that it assumed (by default) that we’re going to use the bert-base-uncased
tokenizer for our NLP model, and that if that’s the case there are many words in the dataset that contain
characters (specifically here emojis) that are unrecognized by the tokenizer. This is an important insight, as
BERT tokenizers are very common.

Integrity #2: Conflicting Labels
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Looking at the Conflicting Labels check result (in the “Didn’t Pass” tab) we can see that there are 2 occurrences
of duplicate samples that have different labels. This may suggest a more severe labeling error in the dataset,
which we would want to explore further.

.. GENERATED FROM PYTHON SOURCE LINES 136-146

Train Test Validation
---------------------

The next suite, the :mod:`train_test_validation ` suite, serves to validate our split and compare the
two datasets. The two datasets can be your training and validation/test sets, in which case you'd want to run
this suite after the split was made but before training, or, for example, your training and inference data, in
which case the suite is useful for validating that the inference data is similar enough to the training data.

To run this suite we'll split the data into train and test/validation sets. We'll use a predefined split based
on comment dates.

.. GENERATED FROM PYTHON SOURCE LINES 146-155

.. code-block:: default


    from deepchecks.nlp.suites import train_test_validation

    train_ds, test_ds = just_dance_comment_analysis.load_data(
        data_format='TextData', as_train_test=True,
        include_embeddings=True, include_properties=True)

    train_test_validation(n_samples=1000).run(train_ds, test_ds,
                                              model_classes=class_names)


.. rst-class:: sphx-glr-script-out

 .. code-block:: none

    Train Test Validation Suite: | | 0/4 [Time: 00:00]
    Train Test Validation Suite: |#2 | 1/4 [Time: 00:01, Check=Property Drift]
    Train Test Validation Suite: |##5 | 2/4 [Time: 00:01, Check=Label Drift]
    Train Test Validation Suite: |###7 | 3/4 [Time: 00:02, Check=Train Test Samples Mix]
    Train Test Validation Suite: |#####| 4/4 [Time: 01:20, Check=Text Embeddings Drift]

.. raw:: html
Train Test Validation Suite
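Several of the checks in this suite measure drift, meaning how far the distribution of some quantity in the test set has moved from the train set. One common way to quantify this for a numeric property is the two-sample Kolmogorov-Smirnov statistic, sketched below on toy values. This guide does not state which statistic deepchecks applies to a given property, so treat this only as an illustration of the idea. ``ks_statistic`` is a hypothetical helper.

```python
from bisect import bisect_right


def ks_statistic(train_values, test_values):
    """Two-sample Kolmogorov-Smirnov statistic: largest gap between the empirical CDFs."""
    a, b = sorted(train_values), sorted(test_values)
    max_gap = 0.0
    # Both empirical CDFs are step functions, so the largest gap occurs at a data point.
    for x in a + b:
        cdf_a = bisect_right(a, x) / len(a)  # share of train values <= x
        cdf_b = bisect_right(b, x) / len(b)  # share of test values <= x
        max_gap = max(max_gap, abs(cdf_a - cdf_b))
    return max_gap


# Toy "Formality" scores (invented): the test set leans toward less formal text.
train_formality = [0.30, 0.35, 0.40, 0.45]
test_formality = [0.05, 0.10, 0.35, 0.40]

drift = ks_statistic(train_formality, test_formality)
print(f"KS statistic: {drift:.2f}")
```

The statistic ranges from 0 (identical empirical distributions) to 1 (no overlap), which makes it convenient to compare against a fixed threshold.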


.. GENERATED FROM PYTHON SOURCE LINES 156-180

Train Test Validation #1: Properties Drift
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Based on the different properties we have calculated for the dataset, we can now search for properties whose
distribution changes between the train and test datasets. Changes like this are especially important to look for
when monitoring your model over time, as data drift is one of the top reasons why a machine learning model’s
performance degrades over time.

In our case, we can see that the “% Special Characters” and “Formality” properties have different distributions
between train and test. Drilling further into the results, we can see that the language of the comments in the
test set is much less formal and includes more special characters (possibly emojis?) than the train set. Since
this change is quite significant, we may want to consider adding more informal comments containing special
characters to the train set before training (or retraining) our model.

Train Test Validation #2: Embedding Drift
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Similarly to the properties drift, we can also look for embedding drift between the train and test datasets.
The benefit of using embeddings on top of the properties is that they are able to detect semantic changes in
the data.

In our case, we see there are significant semantic differences between the train and test sets. Specifically, we
can see some clusters that distinctly contain more samples from the train dataset or more samples from the test
dataset. By hovering over the clusters we can read the user comments and understand how the clusters differ.

.. GENERATED FROM PYTHON SOURCE LINES 182-187

Model Evaluation
----------------

The suite below, the :mod:`model_evaluation ` suite, is designed to be run after a model has been trained and
requires model predictions, which can be supplied via the relevant arguments in the ``run`` function.

.. GENERATED FROM PYTHON SOURCE LINES 187-206

.. code-block:: default


    train_preds, test_preds = just_dance_comment_analysis.\
        load_precalculated_predictions(pred_format='predictions', as_train_test=True)
    train_probas, test_probas = just_dance_comment_analysis.\
        load_precalculated_predictions(pred_format='probabilities', as_train_test=True)

    from deepchecks.nlp.suites import model_evaluation

    suite = model_evaluation(n_samples=1000)
    result = suite.run(train_ds, test_ds,
                       train_predictions=train_preds,
                       test_predictions=test_preds,
                       train_probabilities=train_probas,
                       test_probabilities=test_probas,
                       model_classes=class_names)
    result.show()


.. rst-class:: sphx-glr-script-out

 .. code-block:: none

    Model Evaluation Suite: | | 0/4 [Time: 00:00]
    Model Evaluation Suite: |#2 | 1/4 [Time: 00:00, Check=Train Test Performance]
    Model Evaluation Suite: |##5 | 2/4 [Time: 00:00, Check=Prediction Drift]
    Model Evaluation Suite: |###7 | 3/4 [Time: 01:41, Check=Property Segments Performance]
    6 fits failed out of a total of 6.
    The score on these train-test partitions for these parameters will be set to nan.
    If these failures are not expected, you can try to debug them by setting error_score='raise'.
Below are more details about the failures: -------------------------------------------------------------------------------- 6 fits failed with the following error: Traceback (most recent call last): File "/home/runner/work/deepchecks/deepchecks/venv/lib/python3.9/site-packages/sklearn/model_selection/_validation.py", line 680, in _fit_and_score estimator.fit(X_train, y_train, **fit_params) File "/home/runner/work/deepchecks/deepchecks/venv/lib/python3.9/site-packages/sklearn/tree/_classes.py", line 1315, in fit super().fit( File "/home/runner/work/deepchecks/deepchecks/venv/lib/python3.9/site-packages/sklearn/tree/_classes.py", line 165, in fit X, y = self._validate_data( File "/home/runner/work/deepchecks/deepchecks/venv/lib/python3.9/site-packages/sklearn/base.py", line 578, in _validate_data X = check_array(X, **check_X_params) File "/home/runner/work/deepchecks/deepchecks/venv/lib/python3.9/site-packages/sklearn/utils/validation.py", line 800, in check_array _assert_all_finite(array, allow_nan=force_all_finite == "allow-nan") File "/home/runner/work/deepchecks/deepchecks/venv/lib/python3.9/site-packages/sklearn/utils/validation.py", line 114, in _assert_all_finite raise ValueError( ValueError: Input contains NaN, infinity or a value too large for dtype('float32'). 6 fits failed out of a total of 6. The score on these train-test partitions for these parameters will be set to nan. If these failures are not expected, you can try to debug them by setting error_score='raise'. 
Below are more details about the failures: -------------------------------------------------------------------------------- 6 fits failed with the following error: Traceback (most recent call last): File "/home/runner/work/deepchecks/deepchecks/venv/lib/python3.9/site-packages/sklearn/model_selection/_validation.py", line 680, in _fit_and_score estimator.fit(X_train, y_train, **fit_params) File "/home/runner/work/deepchecks/deepchecks/venv/lib/python3.9/site-packages/sklearn/tree/_classes.py", line 1315, in fit super().fit( File "/home/runner/work/deepchecks/deepchecks/venv/lib/python3.9/site-packages/sklearn/tree/_classes.py", line 165, in fit X, y = self._validate_data( File "/home/runner/work/deepchecks/deepchecks/venv/lib/python3.9/site-packages/sklearn/base.py", line 578, in _validate_data X = check_array(X, **check_X_params) File "/home/runner/work/deepchecks/deepchecks/venv/lib/python3.9/site-packages/sklearn/utils/validation.py", line 800, in check_array _assert_all_finite(array, allow_nan=force_all_finite == "allow-nan") File "/home/runner/work/deepchecks/deepchecks/venv/lib/python3.9/site-packages/sklearn/utils/validation.py", line 114, in _assert_all_finite raise ValueError( ValueError: Input contains NaN, infinity or a value too large for dtype('float32'). Model Evaluation Suite: |#####| 4/4 [Time: 01:42, Check=Metadata Segments Performance] .. raw:: html
Model Evaluation Suite
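The Train Test Performance check listed in the progress output compares per-class metrics between the train and test sets. As a minimal sketch of what a per-class recall comparison involves for multilabel predictions, the example below uses toy label matrices and a hypothetical ``per_class_recall`` helper. It is not the check's implementation, and the class names are borrowed from the dataset's columns purely for illustration.

```python
def per_class_recall(y_true, y_pred, class_names):
    """Recall per class for multilabel data given as lists of 0/1 rows."""
    recalls = {}
    for j, name in enumerate(class_names):
        positives = [i for i, row in enumerate(y_true) if row[j] == 1]
        if not positives:
            recalls[name] = None  # recall is undefined without positive samples
            continue
        hits = sum(1 for i in positives if y_pred[i][j] == 1)
        recalls[name] = hits / len(positives)
    return recalls


class_names_toy = ["Comfort", "Usability"]

# Toy data (invented): the model is perfect on train but misses "Comfort" on test.
y_train = [[1, 0], [1, 1], [0, 1], [1, 0]]
train_pred = [[1, 0], [1, 1], [0, 1], [1, 0]]
y_test = [[1, 0], [1, 1], [0, 1], [1, 0]]
test_pred = [[1, 0], [0, 1], [0, 1], [0, 0]]

train_recall = per_class_recall(y_train, train_pred, class_names_toy)
test_recall = per_class_recall(y_test, test_pred, class_names_toy)

for name in class_names_toy:
    drop = train_recall[name] - test_recall[name]
    print(f"{name}: train={train_recall[name]:.2f} test={test_recall[name]:.2f} drop={drop:.2f}")
```

Looking at the drop per class, rather than a single averaged score, is what makes it possible to single out one class as the source of a degradation.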


.. GENERATED FROM PYTHON SOURCE LINES 207-215

Model Eval #1: Train Test Performance
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

We can immediately see in the "Didn't Pass" tab that there has been significant degradation in the Recall on
class “Pain and Discomfort”. Moreover, it seems there is a general deterioration in our model performance on the
test set compared to the train set. This is likely explained by the data drift we saw in the previous suite.

.. GENERATED FROM PYTHON SOURCE LINES 217-225

Running Individual Checks
=========================

Checks can be run individually as well as within a suite. You can learn more about customizing suites, checks
and conditions in our :ref:`Customizations Guide `. In this section, we'll show you how to do that while
showcasing one of our most interesting checks - :ref:`PropertySegmentsPerformance `.

.. GENERATED FROM PYTHON SOURCE LINES 225-233

.. code-block:: default


    from deepchecks.nlp.checks import PropertySegmentsPerformance

    check = PropertySegmentsPerformance(segment_minimum_size_ratio=0.05)
    check = check.add_condition_segments_relative_performance_greater_than(0.1)
    result = check.run(test_ds, probabilities=test_probas)
    result.show()


.. raw:: html
Property Segments Performance


.. GENERATED FROM PYTHON SOURCE LINES 234-243

In the display we can see some distinct property-based segments that our model underperforms on.

By reviewing the results we can see that our model performs poorly on samples that have a low level of
Subjectivity. By looking at the "Subjectivity vs Average Sentence Length" tab, we can see that the problem is
even more severe on samples containing long sentences.

In addition to the visual display, most checks also return detailed data describing the results. This data can
be used for further analysis, to create custom visualizations, or to set custom conditions.

.. GENERATED FROM PYTHON SOURCE LINES 243-246

.. code-block:: default


    result.value['weak_segments_list'].head(3)


.. code-block:: none

       F1 Macro Score      Feature1                            Feature1 Range                 Feature2                       Feature2 Range  % of Data                                 Samples in Segment
    0        0.498002  Subjectivity   (-inf, 0.12916667014360428)                Formality  (-inf, 0.07886043190956116)       5.54  [24, 26, 62, 69, 76, 94, 104, 143, 167, 172, 1...
    1        0.502655  Subjectivity   (-inf, 0.12916667014360428)  Average Sentence Length                 (11.5, 24.5)       6.84  [5, 11, 93, 94, 104, 158, 170, 173, 176, 181, ...
    2        0.506293  Subjectivity  (-inf, 0.029166667722165585)          Lexical Density   (-inf, 98.83499908447266)       6.76  [50, 62, 69, 76, 84, 90, 94, 104, 116, 143, 15...
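Each row above describes a segment as a range of property values together with the model's score on the samples that fall inside it. The sketch below shows that idea in its simplest form: pick the samples whose property lies in a range and compare their score with the overall score. It uses invented values, plain accuracy instead of F1 Macro, and a hypothetical ``segment_score`` helper; the check's actual segment search is more involved.

```python
def segment_score(values, correct, low, high):
    """Return (mean correctness, share of data) for samples with low <= value < high."""
    # Samples with a missing property value (None) cannot be placed in any segment.
    in_segment = [c for v, c in zip(values, correct) if v is not None and low <= v < high]
    if not in_segment:
        return None, 0.0
    return sum(in_segment) / len(in_segment), len(in_segment) / len(values)


# Toy data (invented): a "Subjectivity" value per sample and whether the
# model's prediction for that sample was correct (1) or not (0).
subjectivity = [0.05, 0.10, 0.50, 0.80, 0.90, None]
correct = [0, 1, 1, 1, 1, 1]

overall = sum(correct) / len(correct)
score, share = segment_score(subjectivity, correct, float("-inf"), 0.129)
print(f"overall={overall:.2f} segment={score:.2f} share={share:.0%}")
```

A segment is worth attention when its score is clearly below the overall score and it still covers a meaningful share of the data, which is the role the ``segment_minimum_size_ratio`` argument plays in the check above.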


.. GENERATED FROM PYTHON SOURCE LINES 247-250

You can find the full list of available NLP checks in the
:mod:`nlp.checks API documentation `.


.. rst-class:: sphx-glr-timing

   **Total running time of the script:** ( 7 minutes 24.135 seconds)


.. _sphx_glr_download_nlp_auto_tutorials_quickstarts_plot_multi_label_classification.py:

.. only:: html

  .. container:: sphx-glr-footer sphx-glr-footer-example

    .. container:: sphx-glr-download sphx-glr-download-python

      :download:`Download Python source code: plot_multi_label_classification.py `

    .. container:: sphx-glr-download sphx-glr-download-jupyter

      :download:`Download Jupyter notebook: plot_multi_label_classification.ipynb `

.. only:: html

 .. rst-class:: sphx-glr-signature

    `Gallery generated by Sphinx-Gallery `_