.. DO NOT EDIT.
.. THIS FILE WAS AUTOMATICALLY GENERATED BY SPHINX-GALLERY.
.. TO MAKE CHANGES, EDIT THE SOURCE PYTHON FILE:
.. "checks_gallery/tabular/methodology/plot_unused_features.py"
.. LINE NUMBERS ARE GIVEN BELOW.

.. only:: html

    .. note::
        :class: sphx-glr-download-link-note

        Click :ref:`here ` to download the full example code

.. rst-class:: sphx-glr-example-title

.. _sphx_glr_checks_gallery_tabular_methodology_plot_unused_features.py:


Unused Features
***************
This notebook provides an overview for using and understanding the Unused Features check.

**Structure:**

* `How unused features affect my model? <#how-unused-features-affect-my-model>`__
* `Run the check <#run-the-check>`__
* `Define a condition <#define-a-condition>`__

How unused features affect my model?
=====================================
Having too many features can prolong training and degrade model performance, an effect
known as "the curse of dimensionality" or the "Hughes phenomenon". The volume of the
feature space grows exponentially with the number of features. When the space is too
large relative to the number of data samples, the samples are distributed very sparsely
within it. This sparsity also makes the samples look more alike: they are all far from
each other, so distances between them become less informative, which makes it harder to
cluster similar samples together and find patterns. The larger space and the reduced
contrast between samples may require more complex models, which in turn are at greater
risk of overfitting.

Features with a low contribution to the model (low feature importance) are probably just
noise and should be removed, since they increase the dimensionality without contributing
anything. Nevertheless, models may miss important features. For that reason, the Unused
Features check selects, from these low-importance features, those that have high
variance, as they may represent information that was ignored during model construction.
We may wish to manually inspect those features to make sure our model is not missing
important information.

Run the check
=============
We will run the check on the adult dataset, which can be downloaded from the
`UCI machine learning repository `_ and is
also available in ``deepchecks.tabular.datasets``.

.. GENERATED FROM PYTHON SOURCE LINES 36-47

.. code-block:: default


    from deepchecks.tabular.checks import UnusedFeatures
    from deepchecks.tabular.datasets.classification import adult

    # Load the train/test datasets and a model that was already fitted on them
    train_ds, test_ds = adult.load_data()
    model = adult.load_fitted_model()

    # Run the check with its default thresholds
    result = UnusedFeatures().run(train_ds, test_ds, model)
    result


.. rst-class:: sphx-glr-script-out

 Out:

 .. code-block:: none

    /home/runner/work/deepchecks/deepchecks/deepchecks/utils/features.py:180: UserWarning: Cannot use model's built-in feature importance on a Scikit-learn Pipeline, using permutation feature importance calculation instead
    /home/runner/work/deepchecks/deepchecks/deepchecks/utils/features.py:290: UserWarning: Calculating permutation feature importance without time limit. Expected to finish in 39 seconds


.. raw:: html

    Unused Features

    Detect features that are nearly unused by the model.

    Additional Outputs

    Features above the line are a sample of the most important features, while the features below the line are the unused features with the highest variance, as defined by the check parameters.


.. GENERATED FROM PYTHON SOURCE LINES 48-53

Controlling the variance threshold
----------------------------------
The check can be configured to use a different threshold, which controls which features
are considered "high variance". The default value is ``0.4``. We will use a stricter
value and see that fewer features are considered "high variance".

.. GENERATED FROM PYTHON SOURCE LINES 53-56

.. code-block:: default


    # Raise the variance threshold from the default 0.4 to 1.5
    result = UnusedFeatures(feature_variance_threshold=1.5).run(train_ds, test_ds, model)
    result


.. rst-class:: sphx-glr-script-out

 Out:

 .. code-block:: none

    /home/runner/work/deepchecks/deepchecks/deepchecks/utils/features.py:180: UserWarning: Cannot use model's built-in feature importance on a Scikit-learn Pipeline, using permutation feature importance calculation instead


.. raw:: html

    Unused Features

    Detect features that are nearly unused by the model.

    Additional Outputs

    Features above the line are a sample of the most important features, while the features below the line are the unused features with the highest variance, as defined by the check parameters.
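
To make the effect of the two thresholds concrete, here is a minimal sketch of the
selection logic described above. It is not the deepchecks implementation: the function
name ``split_unused_features``, the choice of ``<`` and ``>=`` as comparisons, and all
importance and variance numbers are assumptions made up for illustration, and the way
the real check measures variance is not reproduced here.

.. code-block:: python

    from typing import Dict, List, Tuple


    def split_unused_features(
        importance: Dict[str, float],
        variance: Dict[str, float],
        importance_threshold: float,
        variance_threshold: float,
    ) -> Tuple[List[str], List[str]]:
        """Return (high_variance_unused, low_variance_unused) feature names.

        Hypothetical helper: a feature is treated as "unused" when its
        importance is strictly below ``importance_threshold`` (assumption).
        """
        unused = [name for name, score in importance.items()
                  if score < importance_threshold]
        # Unused features are then split by the variance threshold (assumed >=).
        high = sorted(n for n in unused if variance[n] >= variance_threshold)
        low = sorted(n for n in unused if variance[n] < variance_threshold)
        return high, low


    # Made-up scores for five hypothetical features.
    importance = {"feat_a": 0.30, "feat_b": 0.25, "feat_c": 0.01,
                  "feat_d": 0.02, "feat_e": 0.00}
    variance = {"feat_a": 2.0, "feat_b": 1.2, "feat_c": 0.5,
                "feat_d": 0.3, "feat_e": 1.8}

    loose = split_unused_features(importance, variance, 0.05, 0.4)
    strict = split_unused_features(importance, variance, 0.05, 1.5)
    print(loose)   # (['feat_c', 'feat_e'], ['feat_d'])
    print(strict)  # (['feat_e'], ['feat_c', 'feat_d'])

In this toy data, raising the variance threshold moves ``feat_c`` from the
high-variance group to the low-variance group, which mirrors the behaviour shown by the
check output above.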


.. GENERATED FROM PYTHON SOURCE LINES 57-61

Controlling the importance threshold
------------------------------------
We can also define the importance threshold, which controls which features are
considered important. If we define it as 0, then all features are considered important.

.. GENERATED FROM PYTHON SOURCE LINES 61-64

.. code-block:: default


    # With a threshold of 0, every feature is treated as important
    result = UnusedFeatures(feature_importance_threshold=0).run(train_ds, test_ds, model)
    result


.. rst-class:: sphx-glr-script-out

 Out:

 .. code-block:: none

    /home/runner/work/deepchecks/deepchecks/deepchecks/utils/features.py:180: UserWarning: Cannot use model's built-in feature importance on a Scikit-learn Pipeline, using permutation feature importance calculation instead


.. raw:: html

    Unused Features

    Detect features that are nearly unused by the model.

    Additional Outputs

    Nothing to display
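
The warnings in the outputs above say that the importance scores come from a
permutation feature importance calculation. As a rough illustration of that idea only,
the sketch below shuffles one column at a time and records the drop in accuracy. It uses
nothing but the standard library; the helper names, the toy model, and the synthetic
data are all hypothetical and are not taken from deepchecks or scikit-learn.

.. code-block:: python

    import random
    from typing import Callable, List


    def accuracy(predict: Callable[[List[float]], int],
                 rows: List[List[float]], labels: List[int]) -> float:
        """Fraction of rows for which the prediction matches the label."""
        hits = sum(1 for row, label in zip(rows, labels) if predict(row) == label)
        return hits / len(rows)


    def permutation_importance(predict, rows, labels, n_repeats=5, seed=0):
        """Mean accuracy drop per column after shuffling that column."""
        rng = random.Random(seed)
        baseline = accuracy(predict, rows, labels)
        importances = []
        for col in range(len(rows[0])):
            drops = []
            for _ in range(n_repeats):
                # Shuffle one column while leaving the others untouched.
                shuffled = [row[col] for row in rows]
                rng.shuffle(shuffled)
                permuted = [row[:col] + [value] + row[col + 1:]
                            for row, value in zip(rows, shuffled)]
                drops.append(baseline - accuracy(predict, permuted, labels))
            importances.append(sum(drops) / n_repeats)
        return importances


    # Synthetic data: the label depends only on the first column.
    data_rng = random.Random(42)
    rows = [[data_rng.random(), data_rng.random()] for _ in range(200)]
    labels = [int(row[0] > 0.5) for row in rows]


    def toy_model(row):
        """Hypothetical model that ignores the second column entirely."""
        return int(row[0] > 0.5)


    scores = permutation_importance(toy_model, rows, labels)
    print(scores)

The second column is never read by the toy model, so shuffling it cannot change any
prediction and its score is exactly zero, while shuffling the first column costs a
large share of the accuracy. A feature like the second one is what the importance
threshold is meant to flag once the threshold is above zero.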



.. GENERATED FROM PYTHON SOURCE LINES 65-69

Define a condition
==================
We can define a condition that enforces that the number of unused features with high
variance is not greater than a given amount. The default is 5.

.. GENERATED FROM PYTHON SOURCE LINES 69-72

.. code-block:: default


    # Attach the condition (at most 5 high-variance unused features) before running
    check = UnusedFeatures().add_condition_number_of_high_variance_unused_features_not_greater_than(5)
    result = check.run(train_ds, test_ds, model)
    # Display the result without the additional outputs
    result.show(show_additional_outputs=False)


.. rst-class:: sphx-glr-script-out

 Out:

 .. code-block:: none

    /home/runner/work/deepchecks/deepchecks/deepchecks/utils/features.py:180: UserWarning: Cannot use model's built-in feature importance on a Scikit-learn Pipeline, using permutation feature importance calculation instead


.. raw:: html
    Unused Features
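
The "not greater than" wording means that a count equal to the limit still passes. The
following sketch spells out that boundary with a hypothetical stand-alone function. It
is an assumption about the semantics described in the text, not the code deepchecks
runs, and the name ``high_variance_unused_not_greater_than`` and the message format are
invented for this example.

.. code-block:: python

    from typing import List, Tuple


    def high_variance_unused_not_greater_than(
        high_variance_unused: List[str], max_features: int = 5
    ) -> Tuple[bool, str]:
        """Pass when the number of flagged features is at most ``max_features``."""
        count = len(high_variance_unused)
        passed = count <= max_features
        details = (f"Found {count} high variance unused features "
                   f"(limit: {max_features})")
        return passed, details


    # Hypothetical feature names; exactly at the limit, then one above it.
    at_limit = high_variance_unused_not_greater_than(["f1", "f2", "f3", "f4", "f5"])
    over_limit = high_variance_unused_not_greater_than(
        ["f1", "f2", "f3", "f4", "f5", "f6"]
    )
    print(at_limit)
    print(over_limit)
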


.. rst-class:: sphx-glr-timing

   **Total running time of the script:** ( 0 minutes 32.188 seconds)


.. _sphx_glr_download_checks_gallery_tabular_methodology_plot_unused_features.py:


.. only :: html

 .. container:: sphx-glr-footer
    :class: sphx-glr-footer-example


  .. container:: sphx-glr-download sphx-glr-download-python

     :download:`Download Python source code: plot_unused_features.py `


  .. container:: sphx-glr-download sphx-glr-download-jupyter

     :download:`Download Jupyter notebook: plot_unused_features.ipynb `


.. only:: html

 .. rst-class:: sphx-glr-signature

    `Gallery generated by Sphinx-Gallery `_