.. DO NOT EDIT.
.. THIS FILE WAS AUTOMATICALLY GENERATED BY SPHINX-GALLERY.
.. TO MAKE CHANGES, EDIT THE SOURCE PYTHON FILE:
.. "checks_gallery/tabular/methodology/plot_boosting_overfit.py"
.. LINE NUMBERS ARE GIVEN BELOW.

.. only:: html

    .. note::
        :class: sphx-glr-download-link-note

        Click :ref:`here <sphx_glr_download_checks_gallery_tabular_methodology_plot_boosting_overfit.py>`
        to download the full example code

.. rst-class:: sphx-glr-example-title

.. _sphx_glr_checks_gallery_tabular_methodology_plot_boosting_overfit.py:

Boosting Overfit
****************

This notebook provides an overview for using and understanding the boosting overfit check.

**Structure:**

* `What is a boosting overfit? <#what-is-a-boosting-overfit>`__
* `Generate data & model <#generate-data-model>`__
* `Run the check <#run-the-check>`__
* `Define a condition <#define-a-condition>`__

What is a Boosting Overfit?
===========================
A boosting algorithm is a machine learning algorithm that combines many weak learners to
predict a target variable. Boosting grows the ensemble iteratively: each new weak learner
is trained on the errors the ensemble made in the previous iterations. This continues
until the ensemble reaches a certain performance level or until the given maximum number
of iterations is reached.

Thanks to this mechanism, boosting algorithms are usually less prone to overfitting than
other traditional algorithms, such as single decision trees. However, the number of weak
learners in the ensemble can grow too large, making the ensemble too complex for the
amount of data it was trained on. In that case, the ensemble may overfit the training
data.

How does deepchecks detect boosting overfit?
--------------------------------------------
The check runs for a pre-defined number of iterations, and in each step it uses only the
first X estimators of the boosting model when predicting the target variable (the number
of estimators X is monotonically increasing). It plots the given score, calculated at
each iteration, for both the train dataset and the test dataset. If the relative decline
between the maximal test score achieved in any boosting iteration and the test score
achieved in the last iteration (the "full" model score) is above a given threshold
(0.05 by default), the model is considered overfitted and the default condition, if
added, will fail.
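As a minimal sketch of this logic, scikit-learn's ``staged_score`` can evaluate an
AdaBoost ensemble truncated to its first k estimators, which is the same quantity the
check samples. The synthetic data and variable names below are illustrative only, not
the check's implementation:

.. code-block:: default

    from sklearn.datasets import make_classification
    from sklearn.ensemble import AdaBoostClassifier
    from sklearn.model_selection import train_test_split

    X, y = make_classification(n_samples=2000, random_state=0)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
    model = AdaBoostClassifier(n_estimators=100, random_state=0).fit(X_train, y_train)

    # staged_score yields the accuracy of the ensemble restricted to the first
    # 1, 2, ..., n_estimators weak learners
    test_scores = list(model.staged_score(X_test, y_test))

    max_score, last_score = max(test_scores), test_scores[-1]
    percent_decline = (max_score - last_score) / max_score
    print(f'Decline from best to full model: {percent_decline:.2%}')
    # The check's default condition flags the model if this exceeds 5%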
Supported Models
----------------
Currently the check supports the following models:

*   AdaBoost (sklearn)
*   GradientBoosting (sklearn)
*   XGBoost (xgboost)
*   LGBM (lightgbm)
*   CatBoost (catboost)

Generate data & model
=====================
The dataset is the adult dataset, which can be downloaded from the UCI machine learning
repository.

Dua, D. and Graff, C. (2019). UCI Machine Learning Repository [http://archive.ics.uci.edu/ml].
Irvine, CA: University of California, School of Information and Computer Science.

.. GENERATED FROM PYTHON SOURCE LINES 56-76

.. code-block:: default


    import pandas as pd
    from sklearn.preprocessing import LabelEncoder

    from deepchecks.tabular.datasets.classification import adult
    from deepchecks.tabular import Dataset

    train_df, val_df = adult.load_data(data_format='Dataframe')

    # Run label encoder on all categorical columns
    for column in train_df.columns:
        if train_df[column].dtype == 'object':
            le = LabelEncoder()
            le.fit(pd.concat([train_df[column], val_df[column]]))
            train_df[column] = le.transform(train_df[column])
            val_df[column] = le.transform(val_df[column])

    train_ds = Dataset(train_df, label='income')
    validation_ds = Dataset(val_df, label='income')

.. GENERATED FROM PYTHON SOURCE LINES 77-80

Classification model
--------------------
We use the AdaBoost boosting algorithm with a decision tree as the weak learner.

.. GENERATED FROM PYTHON SOURCE LINES 80-86

.. code-block:: default


    from sklearn.ensemble import AdaBoostClassifier

    clf = AdaBoostClassifier(random_state=0, n_estimators=100)
    clf.fit(train_ds.data[train_ds.features], train_ds.data[train_ds.label_name])

.. rst-class:: sphx-glr-script-out

 Out:

 .. code-block:: none

    AdaBoostClassifier(n_estimators=100, random_state=0)

.. GENERATED FROM PYTHON SOURCE LINES 87-89

Run the check
=============

.. GENERATED FROM PYTHON SOURCE LINES 89-94

.. code-block:: default


    from deepchecks.tabular.checks.methodology.boosting_overfit import BoostingOverfit

    result = BoostingOverfit().run(train_ds, validation_ds, clf)
    result

**Boosting Overfit**

Check for overfit caused by using too many iterations in a gradient boosted model.

**Additional Outputs**

The check limits the boosting model to using up to N estimators each time, and plots the
accuracy calculated for each subset of estimators for both the train dataset and the
test dataset.
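The interactive plot itself is not reproduced here, but a rough equivalent can be drawn
with matplotlib and ``staged_score``, reusing the ``clf``, ``train_ds`` and
``validation_ds`` defined above. This is a sketch of the idea, not the check's own
plotting code:

.. code-block:: default

    import matplotlib.pyplot as plt

    # Accuracy of the ensemble truncated to the first k estimators, k = 1..100
    train_scores = list(clf.staged_score(train_ds.data[train_ds.features],
                                         train_ds.data[train_ds.label_name]))
    test_scores = list(clf.staged_score(validation_ds.data[validation_ds.features],
                                        validation_ds.data[validation_ds.label_name]))

    plt.plot(range(1, len(train_scores) + 1), train_scores, label='train')
    plt.plot(range(1, len(test_scores) + 1), test_scores, label='test')
    plt.xlabel('Number of estimators')
    plt.ylabel('Accuracy')
    plt.legend()
    plt.show()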


.. GENERATED FROM PYTHON SOURCE LINES 95-99

Define a condition
==================
Now, we define a condition that will fail if the percent of decline between the maximal
test score achieved in any boosting iteration and the test score achieved in the last
iteration is greater than 0.02%.

.. GENERATED FROM PYTHON SOURCE LINES 99-103

.. code-block:: default


    check = BoostingOverfit()
    check.add_condition_test_score_percent_decline_not_greater_than(0.0002)
    result = check.run(train_ds, validation_ds, clf)
    result.show(show_additional_outputs=False)
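Note that the threshold is passed as a ratio, so ``0.0002`` corresponds to 0.02%. An
illustration of how such a condition is evaluated, with made-up score values:

.. code-block:: default

    # Made-up scores, for illustration only
    max_test_score = 0.8700   # hypothetical best test score across iterations
    last_test_score = 0.8695  # hypothetical test score of the full model

    percent_decline = (max_test_score - last_test_score) / max_test_score
    condition_passes = percent_decline <= 0.0002
    print(f'decline={percent_decline:.4%}, passes={condition_passes}')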
.. rst-class:: sphx-glr-timing

   **Total running time of the script:** ( 0 minutes 14.165 seconds)

.. _sphx_glr_download_checks_gallery_tabular_methodology_plot_boosting_overfit.py:

.. only :: html

 .. container:: sphx-glr-footer
    :class: sphx-glr-footer-example

  .. container:: sphx-glr-download sphx-glr-download-python

     :download:`Download Python source code: plot_boosting_overfit.py <plot_boosting_overfit.py>`

  .. container:: sphx-glr-download sphx-glr-download-jupyter

     :download:`Download Jupyter notebook: plot_boosting_overfit.ipynb <plot_boosting_overfit.ipynb>`

.. only:: html

 .. rst-class:: sphx-glr-signature

    `Gallery generated by Sphinx-Gallery <https://sphinx-gallery.github.io>`_