Class Imbalance#

This notebook gives an overview of how to use and interpret the Class Imbalance check.

Structure:

What is the Class Imbalance check
Generate data
Run the check
Define a condition

What is the Class Imbalance check#

The ClassImbalance check computes the distribution of the target variable across its label classes. An uneven distribution across those classes indicates an imbalanced dataset.
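
To make the idea of a label distribution concrete, the following is a minimal sketch in plain Python that computes the share of samples in each class. The `class_distribution` helper and the sample labels are hypothetical and only illustrate the concept; this is not the deepchecks implementation.

```python
from collections import Counter


def class_distribution(labels):
    """Return the fraction of samples that belong to each label class."""
    counts = Counter(labels)
    total = len(labels)
    return {cls: n / total for cls, n in counts.items()}


# Hypothetical labels: nine samples of class 0 and one sample of class 1.
labels = [0] * 9 + [1]
distribution = class_distribution(labels)
print(distribution)
```

A distribution in which one class holds most of the mass, as in this toy example, is the kind of pattern the check is meant to surface.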

An imbalanced dataset poses its own challenges: learning the characteristics of the minority label, having only scarce minority instances to train on (or test against), and defining the right evaluation metric.
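
The evaluation-metric challenge can be illustrated with a small sketch. It assumes a hypothetical dataset with 95 majority and 5 minority samples and a trivial model that always predicts the majority class; the `accuracy` and `recall` helpers are written here for illustration and are not part of deepchecks.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def recall(y_true, y_pred, positive):
    """Fraction of samples of the `positive` class that were predicted as such."""
    preds_for_positive = [p for t, p in zip(y_true, y_pred) if t == positive]
    return sum(p == positive for p in preds_for_positive) / len(preds_for_positive)


# Hypothetical data: 95 samples of class 0 (majority), 5 of class 1 (minority).
y_true = [0] * 95 + [1] * 5
# A trivial model that always predicts the majority class.
y_pred = [0] * 100

majority_accuracy = accuracy(y_true, y_pred)
minority_recall = recall(y_true, y_pred, positive=1)
print(majority_accuracy, minority_recall)
```

Accuracy looks high even though the model never identifies a minority sample, which is why a metric that accounts for the minority class is usually a better fit for imbalanced data.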

There are many techniques to address these challenges, including artificially increasing the minority sample size (by over-sampling or using SMOTE), dropping instances from the majority class (under-sampling), using regularization, and adjusting the label class weights.
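
Two of these techniques can be sketched in plain Python: random over-sampling of the minority class and inverse-frequency class weights. The helper names and data below are hypothetical, and the weighting formula `n_samples / (n_classes * class_count)` is one common heuristic (the same one scikit-learn uses for its "balanced" class weights). SMOTE, which synthesizes new samples rather than duplicating existing ones, is not shown here.

```python
import random
from collections import Counter


def random_oversample(rows, labels, seed=0):
    """Duplicate randomly chosen samples until every class matches the largest one."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())
    out_rows, out_labels = list(rows), list(labels)
    for cls, n in counts.items():
        pool = [r for r, l in zip(rows, labels) if l == cls]
        # Draw with replacement from the class's own samples.
        extra = [rng.choice(pool) for _ in range(target - n)]
        out_rows.extend(extra)
        out_labels.extend([cls] * len(extra))
    return out_rows, out_labels


def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {cls: n_samples / (n_classes * n) for cls, n in counts.items()}


# Hypothetical data: 8 samples of class 0 and 2 samples of class 1.
rows = list(range(10))
labels = [0] * 8 + [1] * 2

over_rows, over_labels = random_oversample(rows, labels)
weights = balanced_class_weights(labels)
print(Counter(over_labels), weights)
```

Over-sampling changes the data the model sees, while class weights leave the data untouched and change how much each sample contributes to the loss; which one fits better depends on the model and the dataset.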

Imports#

from deepchecks.tabular import Dataset
from deepchecks.tabular.checks import ClassImbalance
from deepchecks.tabular.datasets.classification import lending_club

Generate data#

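# Load the Lending Club data as a single DataFrame (no train/test split)
df = lending_club.load_data(data_format='Dataframe', as_train_test=False)
# Wrap it in a deepchecks Dataset with 'loan_status' as the label column
dataset = Dataset(df, label='loan_status', features=['id', 'loan_amnt'], cat_features=[])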

Run the check#

ClassImbalance().run(dataset)

Skew the target variable and run the check#

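# Overwrite the label of a random 70% of the rows with 1 to skew the target
df.loc[df.sample(frac=0.7, random_state=0).index, 'loan_status'] = 1
dataset = Dataset(df, label='loan_status', features=['id', 'loan_amnt'], cat_features=[])
# Re-run the check on the skewed dataset
ClassImbalance().run(dataset)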

Define a condition#

A condition on the ratio between the label classes can also be added, with a manually chosen threshold (0.15 here):

ClassImbalance().add_condition_class_ratio_less_than(0.15).run(dataset)
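
As a rough sketch of what a class-ratio condition compares, the snippet below computes the ratio between the least and most frequent label and tests it against a threshold. That the condition uses this least-to-most-frequent ratio is an assumption made for illustration, as are the helper name and the sample labels; consult the deepchecks API reference for the exact semantics and the pass/fail outcome.

```python
from collections import Counter


def least_to_most_frequent_ratio(labels):
    """Ratio between the count of the rarest class and the most common class."""
    counts = Counter(labels)
    return min(counts.values()) / max(counts.values())


# Hypothetical labels: 85 samples of class 1 and 15 samples of class 0.
labels = [1] * 85 + [0] * 15
threshold = 0.15  # same value as passed to the condition above

ratio = least_to_most_frequent_ratio(labels)
print(f"ratio={ratio:.3f}, below threshold: {ratio < threshold}")
```

A ratio close to 1 means the classes are roughly balanced, while a ratio close to 0 means the rarest class is heavily outnumbered.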
