.. DO NOT EDIT.
.. THIS FILE WAS AUTOMATICALLY GENERATED BY SPHINX-GALLERY.
.. TO MAKE CHANGES, EDIT THE SOURCE PYTHON FILE:
.. "checks_gallery/tabular/model_evaluation/plot_model_error_analysis.py"
.. LINE NUMBERS ARE GIVEN BELOW.

.. only:: html

    .. note::
        :class: sphx-glr-download-link-note

        Click :ref:`here ` to download the full example code

.. rst-class:: sphx-glr-example-title

.. _sphx_glr_checks_gallery_tabular_model_evaluation_plot_model_error_analysis.py:

Model Error Analysis
********************

This notebook provides an overview for using and understanding the model error
analysis check.

**Structure:**

* `What is Model Error Analysis? <#what-is-model-error-analysis>`__
* `Run the check <#run-the-check>`__
* `Define a condition <#define-a-condition>`__

What is Model Error Analysis?
=============================

Evaluating the model's overall performance metrics gives a good high-level
overview and can be useful for tracking model progress during training or for
comparing models. However, when it's time to fully evaluate whether a model is
fit for production, or when you're interested in a deeper understanding of your
model's performance in order to improve it or to be aware of its weaknesses,
it's recommended to look deeper at how the model performs on various segments
of the data.

The model error analysis check searches for data segments in which the model
error is significantly different from the model error of the dataset as a
whole.

Algorithm:
----------

1. Computes the per-sample loss (log-loss for classification, MSE for
   regression).
2. Trains a regression model to predict the error of the user's model, based
   on the input features.
3. Repeats stage 2 several times with various tree parameters and random
   states to ensure that the most relevant partitions for model error are
   selected.
4. The features scoring the highest feature importance for the error
   regression model are selected, and the distribution of the error vs. the
   feature values is plotted.
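The steps above can be sketched in a few lines of scikit-learn. This is a simplified illustration of the idea, not deepchecks' actual implementation; all variable names are hypothetical, and the repetition with varied tree parameters (step 3) is omitted for brevity:

.. code-block:: python

    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor
    from sklearn.model_selection import train_test_split

    X, y = make_classification(n_samples=1000, n_features=5, random_state=0)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
    model = RandomForestClassifier(random_state=0).fit(X_train, y_train)

    # Step 1: per-sample loss -- log-loss of the predicted probability
    # assigned to the true class of each sample.
    proba = model.predict_proba(X_test)
    eps = 1e-15
    per_sample_loss = -np.log(np.clip(proba[np.arange(len(y_test)), y_test], eps, 1))

    # Step 2: train a regression model to predict the per-sample error
    # from the original input features.
    error_model = RandomForestRegressor(random_state=0).fit(X_test, per_sample_loss)

    # Step 4: the features with the highest importance in the error model
    # are the ones that best segment the error into high and low values.
    top_features = np.argsort(error_model.feature_importances_)[::-1]
    print(top_features)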
The check results are shown only if the error regression model manages to
predict the error well enough (above a given R² performance threshold, defined
by the ``min_error_model_score`` parameter and set by default to 0.5). The
resulting plots show the distribution of the error for the features that are
most effective at segmenting the error into high and low values, without the
need for manual selection of segmentation features.

Related Checks:
---------------

When the important segments of the data are known in advance (when we know
that some population segments have different behaviours and business
importance, for example income levels or state of residence), it is possible
to simply look at the performance on various pre-defined segments. In
deepchecks, this can be done using the :doc:`Segment Performance ` check,
which shows the performance for segments defined by a combination of values
from two pre-defined columns.

Run the check
=============

We will run the check on the adult dataset, which can be downloaded from the
`UCI machine learning repository `_ and is also available in
``deepchecks.tabular.datasets``.

.. GENERATED FROM PYTHON SOURCE LINES 47-59

.. code-block:: default

    from deepchecks.tabular.datasets.classification import adult
    from deepchecks.tabular.checks import ModelErrorAnalysis

    train_ds, test_ds = adult.load_data(data_format='Dataset', as_train_test=True)
    model = adult.load_fitted_model()

    # We create the check with a slightly lower R² threshold to ensure that
    # the check can run on the example dataset.
    check = ModelErrorAnalysis(min_error_model_score=0.3)
    result = check.run(train_ds, test_ds, model)
    result

.. rst-class:: sphx-glr-script-out

 Out:

 .. code-block:: none

    /home/runner/work/deepchecks/deepchecks/deepchecks/tabular/dataset.py:581: UserWarning: It is recommended to initialize Dataset with categorical features by doing "Dataset(df, cat_features=categorical_list)". No categorical features were passed, therefore heuristically inferring categorical features in the data. 8 categorical features were inferred: workclass, education, education-num, marital-status, occupation, relationship, race... For full list use dataset.cat_features
    /home/runner/work/deepchecks/deepchecks/deepchecks/utils/features.py:179: UserWarning: Cannot use model's built-in feature importance on a Scikit-learn Pipeline, using permutation feature importance calculation instead
    /home/runner/work/deepchecks/deepchecks/deepchecks/utils/features.py:289: UserWarning: Calculating permutation feature importance without time limit. Expected to finish in 36 seconds

.. raw:: html
Model Error Analysis


.. GENERATED FROM PYTHON SOURCE LINES 60-63

The check has found that the features 'hours-per-week', 'age' and
'relationship' are the most predictive of differences in the model error. We
can further investigate the model performance by passing two of these columns
to the :doc:`Segment Performance ` check:

.. GENERATED FROM PYTHON SOURCE LINES 63-68

.. code-block:: default

    from deepchecks.tabular.checks import SegmentPerformance

    SegmentPerformance(feature_1='age', feature_2='relationship').run(test_ds, model)

.. raw:: html
Segment Performance


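To make concrete what a per-segment view like this computes, here is a minimal sketch with a toy DataFrame (not deepchecks' implementation; the column names and data are made up): a metric is aggregated over the grid formed by the value combinations of two columns.

.. code-block:: python

    import pandas as pd

    df = pd.DataFrame({
        'age_group': ['young', 'young', 'old', 'old', 'old', 'young'],
        'relationship': ['Husband', 'Other', 'Husband', 'Husband', 'Other', 'Other'],
        'correct': [1, 1, 0, 1, 0, 1],  # 1 if the model's prediction was correct
    })

    # Accuracy for every (age_group, relationship) segment:
    segment_accuracy = df.pivot_table(index='age_group', columns='relationship',
                                      values='correct', aggfunc='mean')
    print(segment_accuracy)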
.. GENERATED FROM PYTHON SOURCE LINES 69-71

From this we learn that the model error is exceptionally high for people with
the "Husband" or "Other" relationship status, except in the lower age groups,
for which the error is lower.

.. GENERATED FROM PYTHON SOURCE LINES 73-77

Define a condition
==================

We can define a condition that enforces that the relative difference between
the weak and strong segments is not greater than a certain ratio, for example
a ratio of 0.05.

.. GENERATED FROM PYTHON SOURCE LINES 77-81

.. code-block:: default

    check = check.add_condition_segments_performance_relative_difference_not_greater_than(0.05)
    result = check.run(train_ds, test_ds, model)
    result.show(show_additional_outputs=False)

.. rst-class:: sphx-glr-script-out

 Out:

 .. code-block:: none

    /home/runner/work/deepchecks/deepchecks/deepchecks/tabular/dataset.py:581: UserWarning: It is recommended to initialize Dataset with categorical features by doing "Dataset(df, cat_features=categorical_list)". No categorical features were passed, therefore heuristically inferring categorical features in the data. 8 categorical features were inferred: workclass, education, education-num, marital-status, occupation, relationship, race... For full list use dataset.cat_features
    /home/runner/work/deepchecks/deepchecks/deepchecks/utils/features.py:179: UserWarning: Cannot use model's built-in feature importance on a Scikit-learn Pipeline, using permutation feature importance calculation instead
    /home/runner/work/deepchecks/deepchecks/deepchecks/utils/features.py:289: UserWarning: Calculating permutation feature importance without time limit. Expected to finish in 38 seconds

.. raw:: html
Model Error Analysis


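The quantity the condition compares against the threshold can be illustrated with a small sketch. This is a hedged approximation of the idea, not deepchecks' actual condition code; the function name and scores below are hypothetical:

.. code-block:: python

    def relative_difference(weak_segment_score, strong_segment_score):
        """Relative gap between the strongest and weakest segment scores."""
        return (strong_segment_score - weak_segment_score) / abs(strong_segment_score)

    # e.g. accuracy 0.95 in the strong segment vs 0.88 in the weak one:
    gap = relative_difference(0.88, 0.95)
    print(round(gap, 4))  # 0.0737 -- greater than 0.05, so the condition would fail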
.. rst-class:: sphx-glr-timing

   **Total running time of the script:** ( 1 minutes 0.615 seconds)

.. _sphx_glr_download_checks_gallery_tabular_model_evaluation_plot_model_error_analysis.py:

.. only:: html

  .. container:: sphx-glr-footer
    :class: sphx-glr-footer-example

    .. container:: sphx-glr-download sphx-glr-download-python

      :download:`Download Python source code: plot_model_error_analysis.py `

    .. container:: sphx-glr-download sphx-glr-download-jupyter

      :download:`Download Jupyter notebook: plot_model_error_analysis.ipynb `

.. only:: html

 .. rst-class:: sphx-glr-signature

    `Gallery generated by Sphinx-Gallery `_