.. DO NOT EDIT.
.. THIS FILE WAS AUTOMATICALLY GENERATED BY SPHINX-GALLERY.
.. TO MAKE CHANGES, EDIT THE SOURCE PYTHON FILE:
.. "nlp/auto_checks/model_evaluation/plot_train_test_performance.py"
.. LINE NUMBERS ARE GIVEN BELOW.

.. only:: html

    .. note::
        :class: sphx-glr-download-link-note

        :ref:`Go to the end <sphx_glr_download_nlp_auto_checks_model_evaluation_plot_train_test_performance.py>`
        to download the full example code

.. rst-class:: sphx-glr-example-title

.. _sphx_glr_nlp_auto_checks_model_evaluation_plot_train_test_performance.py:


.. _nlp__train_test_performance:

Train Test Performance for NLP Models
**************************************

This notebook provides an overview of using and understanding the train test performance check.

**Structure:**

* `What is the purpose of the check? <#what-is-the-purpose-of-the-check>`__
* `Load data & predictions <#load-data-predictions>`__
* `Run the check <#run-the-check>`__
* `Define a condition <#define-a-condition>`__
* `Using a custom scorer <#using-a-custom-scorer>`__

What is the purpose of the check?
==================================

This check compares your NLP model's performance on the train and test datasets across multiple metrics.

For Text Classification tasks the supported metrics are sklearn scorers. You may use any of the existing sklearn
scorers or create your own. For more information about the supported sklearn scorers, how to define your own
metrics, and how to use metrics with the other supported task types, see the :ref:`metrics_user_guide`.

The default scorers are F1, Precision, and Recall for Classification,
and F1 Macro, Recall Macro, and Precision Macro for Token Classification.
See more about the supported task types in the :ref:`nlp__supported_tasks` guide.

.. GENERATED FROM PYTHON SOURCE LINES 30-32

.. code-block:: default

    import numpy as np

.. GENERATED FROM PYTHON SOURCE LINES 33-35

Load data & predictions
=======================

.. GENERATED FROM PYTHON SOURCE LINES 35-41

.. code-block:: default

    from deepchecks.nlp.datasets.classification.tweet_emotion import load_data, load_precalculated_predictions

    train_dataset, test_dataset = load_data()
    train_preds, test_preds = load_precalculated_predictions('predictions')

.. GENERATED FROM PYTHON SOURCE LINES 42-47

Run the check
==============

You can select which scorers to use by passing either a list or a dict of scorers to the check;
the full list of possible scorers can be found in the :ref:`metrics_user_guide`.

.. GENERATED FROM PYTHON SOURCE LINES 47-54

.. code-block:: default

    from deepchecks.nlp.checks import TrainTestPerformance

    check = TrainTestPerformance(scorers=['recall_per_class', 'precision_per_class', 'f1_macro', 'f1_micro'])
    result = check.run(train_dataset, test_dataset, train_predictions=train_preds, test_predictions=test_preds)
    result.show()

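
The check can also be configured with a dict that maps display names to scorers, and the computed scores can be
read programmatically from the returned result. Below is a minimal sketch, assuming that ``result.value`` holds
the computed scores as a pandas DataFrame (the exact column layout may vary between deepchecks versions):

.. code-block:: default

    # A dict lets you rename the reported metrics; values may be sklearn scorer
    # strings (as here) or scorer objects.
    named_check = TrainTestPerformance(scorers={'F1 (macro)': 'f1_macro',
                                                'Recall (per class)': 'recall_per_class'})
    named_result = named_check.run(train_dataset, test_dataset,
                                   train_predictions=train_preds, test_predictions=test_preds)

    # Inspect the raw scores instead of (or in addition to) the visual display.
    print(named_result.value)
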
.. GENERATED FROM PYTHON SOURCE LINES 55-61

Define a condition
===================

We can add a condition to the check that validates that the model's performance doesn't degrade on the new data.
Let's add such a condition and see what happens when it fails:

.. GENERATED FROM PYTHON SOURCE LINES 61-66

.. code-block:: default

    check.add_condition_train_test_relative_degradation_less_than(0.15)
    result = check.run(train_dataset, test_dataset, train_predictions=train_preds, test_predictions=test_preds)
    result.show(show_additional_outputs=False)

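
When the check runs inside an automated pipeline, it is usually the condition outcome, rather than the display,
that you want to act on. A minimal sketch, assuming the standard ``CheckResult`` API (``passed_conditions()``
returning ``True`` only when all added conditions passed):

.. code-block:: default

    # Fail fast (e.g. in CI) when the relative-degradation condition does not hold.
    if not result.passed_conditions():
        raise SystemExit('TrainTestPerformance: test performance degraded by more than 15%')
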
.. GENERATED FROM PYTHON SOURCE LINES 67-68

We detected that for the class "optimism" the Recall has degraded by more than 70%!

.. GENERATED FROM PYTHON SOURCE LINES 70-74

Using a custom scorer
=======================

In addition to the built-in scorers, we can define our own scorer based on the sklearn scorer API and run it in
the check alongside other scorers:

.. GENERATED FROM PYTHON SOURCE LINES 74-82

.. code-block:: default

    from sklearn.metrics import fbeta_score, make_scorer

    fbeta_scorer = make_scorer(fbeta_score, labels=np.arange(len(set(test_dataset.label))), average=None, beta=0.2)

    check = TrainTestPerformance(scorers={'my scorer': fbeta_scorer, 'recall': 'recall_per_class'})
    result = check.run(train_dataset, test_dataset, train_predictions=train_preds, test_predictions=test_preds)
    result.show()

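
Any metric that can be wrapped with sklearn's ``make_scorer`` can be combined with the built-in scorer strings in
the same way. The sketch below adds a balanced-accuracy scorer and exports the report; the scorer name and output
file name are arbitrary, and ``save_as_html`` is assumed to be available on the result object:

.. code-block:: default

    from sklearn.metrics import balanced_accuracy_score, make_scorer

    # Wrap an additional sklearn metric and run it next to a built-in scorer string.
    balanced_acc = make_scorer(balanced_accuracy_score)
    custom_check = TrainTestPerformance(scorers={'balanced accuracy': balanced_acc,
                                                 'f1 macro': 'f1_macro'})
    custom_result = custom_check.run(train_dataset, test_dataset,
                                     train_predictions=train_preds, test_predictions=test_preds)

    # Export the interactive report, e.g. to attach it as a CI artifact.
    custom_result.save_as_html('train_test_performance.html')
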
.. rst-class:: sphx-glr-timing

   **Total running time of the script:** (0 minutes 1.055 seconds)


.. _sphx_glr_download_nlp_auto_checks_model_evaluation_plot_train_test_performance.py:

.. only:: html

  .. container:: sphx-glr-footer sphx-glr-footer-example

    .. container:: sphx-glr-download sphx-glr-download-python

      :download:`Download Python source code: plot_train_test_performance.py <plot_train_test_performance.py>`

    .. container:: sphx-glr-download sphx-glr-download-jupyter

      :download:`Download Jupyter notebook: plot_train_test_performance.ipynb <plot_train_test_performance.ipynb>`

.. only:: html

 .. rst-class:: sphx-glr-signature

    `Gallery generated by Sphinx-Gallery <https://sphinx-gallery.github.io>`_