Feature Importance
What is Feature Importance?
Feature importance is a ranking that represents the significance of each input feature to the model’s predictions. A feature with higher importance has more influence on the model’s predictions. Feature importance can be general (averaged over all model predictions) or local (computed for a specific sample). There are many ways to calculate feature importance: some are generic and apply to any model (such as Shapley values), while others are specific to a model type (such as the Gini importance for decision trees).
Why Does Deepchecks Use Feature Importance?
Deepchecks uses your model’s feature importance for two main reasons:
To help you find issues with your model or data, as in the check UnusedFeatures
To prioritize the display according to the most relevant information the check has found (for instance, if deepchecks finds drift in many features, as in the check TrainTestFeatureDrift, it displays only the features with the highest importance)
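The prioritization described in the second point can be pictured with a small sketch. It is only an illustration of the idea, not deepchecks code: the function name top_features_by_importance, the n_top parameter, and the sample numbers are all hypothetical.

```python
def top_features_by_importance(drift_scores, importance, n_top=3):
    """Return the names of the n_top drifted features, most important first.

    drift_scores: mapping of feature name -> drift score (hypothetical input).
    importance:   mapping of feature name -> feature importance.
    """
    # Features without a known importance are treated as least important.
    ranked = sorted(
        drift_scores,
        key=lambda name: importance.get(name, 0.0),
        reverse=True,
    )
    return ranked[:n_top]


# Made-up values for illustration only.
drift_scores = {"age": 0.3, "city": 0.4, "income": 0.2, "id": 0.5}
importance = {"age": 0.5, "income": 0.3, "city": 0.15, "id": 0.05}
shown = top_features_by_importance(drift_scores, importance, n_top=2)
```

Here "id" has the largest drift score but the lowest importance, so it is not among the features selected for display.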
Note
Most checks don’t require the usage of feature importance. For those, you can shorten or even skip this phase of the calculation.
How Does Deepchecks Get Feature Importance?
There are 3 ways in which deepchecks can get your model’s feature importance:
Your Model Has a Built-in Feature Importance
First, deepchecks searches for your model’s built-in feature importance, which some scikit-learn models provide. Deepchecks looks for the attribute feature_importances_ or coef_ and uses that information if it exists.
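As a rough illustration of this lookup, the sketch below checks a model object for either attribute and falls back to None. It is not the deepchecks implementation: the helper name builtin_importance, the use of absolute coefficient values, and the stand-in model classes are assumptions made for the example.

```python
def builtin_importance(model, feature_names):
    """Return normalized importance read from the model, or None if unavailable."""
    if hasattr(model, "feature_importances_"):
        raw = list(model.feature_importances_)
    elif hasattr(model, "coef_"):
        # Assumption: the magnitude of each coefficient stands in for importance.
        raw = [abs(c) for c in model.coef_]
    else:
        return None
    total = sum(raw)
    if total == 0:
        return None
    return {name: value / total for name, value in zip(feature_names, raw)}


# Minimal stand-ins for fitted models (hypothetical, for illustration only).
class TreeLike:
    feature_importances_ = [0.25, 0.75]


class LinearLike:
    coef_ = [-3.0, 1.0]


class NoImportance:
    pass


tree_fi = builtin_importance(TreeLike(), ["feat1", "feat2"])
linear_fi = builtin_importance(LinearLike(), ["feat1", "feat2"])
missing_fi = builtin_importance(NoImportance(), ["feat1", "feat2"])
```

When neither attribute is present the helper returns None, which corresponds to the case where one of the other two sources of feature importance is needed.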
You Insert Your Own Feature Importance Data
You can do this by passing the feature_importance parameter to the run function, available in all checks and suites. Deepchecks expects this data to be a pandas.Series whose index holds the feature names and whose values are the calculated importance. In addition, deepchecks expects the feature importance to be normalized (meaning the sum of all feature importance values is 1), and will normalize it if it is not.
>>> import pandas as pd
>>> from deepchecks.tabular.checks import UnusedFeatures
>>> check = UnusedFeatures()
>>> check.run(ds_train, ds_test, model, feature_importance=pd.Series({'feat1': 0.3, 'feat2': 0.7}))
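To make the normalization requirement concrete, here is a minimal sketch that rescales raw importance values so they sum to 1. It uses a plain dict to stay dependency-free; the helper name normalize_importance is hypothetical and the code is not taken from deepchecks.

```python
def normalize_importance(importance):
    """Rescale importance values so that they sum to 1."""
    total = sum(importance.values())
    if total <= 0:
        raise ValueError("importance values must sum to a positive number")
    return {name: value / total for name, value in importance.items()}


raw = {"feat1": 3.0, "feat2": 1.0}
normalized = normalize_importance(raw)
```

With a pandas.Series named fi, the same rescaling can be written as fi / fi.sum().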
If you don’t have your feature importance precalculated, you can use deepchecks to calculate it:
>>> from deepchecks.tabular.feature_importance import calculate_feature_importance
>>> fi = calculate_feature_importance(model, ds_train)
>>> check.run(ds_train, ds_test, model, feature_importance=fi)
Deepchecks Calculates the Feature Importance for You
If the model has no built-in feature importance and you have not supplied feature importance data of your own, deepchecks calculates feature importance using scikit-learn’s permutation_importance.
You can also force this calculation by using the feature_importance_force_permutation parameter in the run function, available in all checks and suites.
>>> from deepchecks.tabular.checks import TrainTestFeatureDrift
>>> check = TrainTestFeatureDrift()
>>> check.run(ds_train, ds_test, model, feature_importance_force_permutation=True)
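For readers unfamiliar with permutation importance, the following from-scratch sketch shows the general idea: shuffle one feature column at a time and measure how much the model's score drops. It is a simplified illustration using accuracy on toy data, not scikit-learn's permutation_importance and not deepchecks code; the function names and the data are hypothetical.

```python
import random


def accuracy(predict, rows, labels):
    """Fraction of rows for which the prediction equals the label."""
    hits = sum(1 for row, label in zip(rows, labels) if predict(row) == label)
    return hits / len(rows)


def permutation_importance_sketch(predict, rows, labels, n_repeats=5, seed=0):
    """Mean drop in accuracy after shuffling each feature column in turn."""
    rng = random.Random(seed)
    baseline = accuracy(predict, rows, labels)
    importances = []
    for col in range(len(rows[0])):
        drops = []
        for _ in range(n_repeats):
            # Shuffle a copy of one column, leaving all other columns untouched.
            shuffled = [row[col] for row in rows]
            rng.shuffle(shuffled)
            permuted = [
                row[:col] + [value] + row[col + 1:]
                for row, value in zip(rows, shuffled)
            ]
            drops.append(baseline - accuracy(predict, permuted, labels))
        importances.append(sum(drops) / n_repeats)
    return importances


# Toy data: the label depends only on the first feature.
rows = [[i % 2, (i * 7) % 5] for i in range(40)]
labels = [row[0] for row in rows]


def predict(row):
    # A "model" that uses only the first feature.
    return row[0]


importances = permutation_importance_sketch(predict, rows, labels)
```

Because the toy model ignores the second feature, shuffling it leaves the score unchanged, while shuffling the first feature lowers accuracy. The cost grows with the number of features, repeats, and samples, since every shuffle requires scoring the whole dataset again.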
What if the Feature Importance Calculation Takes Too Long?
Permutation feature importance is a computationally expensive calculation that can take a long time, depending on the number of features and samples in your data.
However, except for certain checks, deepchecks does not require feature importance.
Therefore, if you want deepchecks to skip the calculation of feature importance, you can use the feature_importance_timeout parameter in the run function, available in all checks and suites.
Before running the permutation feature importance calculation, deepchecks predicts how long it will take. If the predicted time is greater than feature_importance_timeout, the calculation is skipped.
Setting this parameter to 0 ensures the calculation is always skipped.
>>> from deepchecks.tabular.checks import MultivariateDrift
>>> check = MultivariateDrift()
>>> check.run(ds_train, ds_test, model, feature_importance_timeout=0)
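The skip decision can be sketched as a comparison between a predicted runtime and the timeout. How deepchecks actually predicts the runtime is not described here, so the estimate below (time of one scoring pass multiplied by the number of features and repeats) is purely an assumption, as are the function names predict_runtime and should_skip.

```python
def predict_runtime(seconds_per_scoring_pass, n_features, n_repeats):
    """Rough estimate: one scoring pass per feature per repeat (an assumption)."""
    return seconds_per_scoring_pass * n_features * n_repeats


def should_skip(predicted_seconds, timeout):
    """Skip when the predicted runtime exceeds the timeout.

    Any positive prediction exceeds a timeout of 0, so 0 always skips.
    """
    return predicted_seconds > timeout


predicted = predict_runtime(seconds_per_scoring_pass=0.5, n_features=20, n_repeats=5)
skip_with_large_budget = should_skip(predicted, timeout=120)
skip_with_small_budget = should_skip(predicted, timeout=30)
skip_with_zero_budget = should_skip(predicted, timeout=0)
```

In this sketch a larger timeout lets the calculation run, while a small or zero timeout causes it to be skipped.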