Simple Feature Contribution#

This notebook provides an overview for using and understanding the “Simple Feature Contribution Train Test” check.

Structure:

What is the purpose of the check?#

The check estimates for every image property (such as brightness, contrast etc.) its ability to predict the label by itself. This check can help find:

• A potential bias in one or both datasets, that leads to the the labels being strongly correlated with simple image properties such as color, brightness, aspect ratio and more. This is a critical problem, that will likely stay hidden without this check (as it won’t pop up when comparing model performance on train and test).

The check is based on calculating the predictive power score (PPS) of each image property. For more details you can read here how the PPS is calculated.

What is a problematic result?#

1. Image properties with a high predictive score - can indicate that there is a bias in the dataset, as a single property can predict the label successfully, using simple classic ML algorithms.

This means that a deep learning algorithm may accidentally learn these properties instead of more accurate complex abstractions. For example, in a classification dataset of wolves and dogs photographs, if only wolves are photographed in the snow, the brightness of the image may be used to predict the label “wolf” easily.

In this case, a model might not learn to discern wolf from dog by the animal’s characteristics, but by using the background color.

2. A high difference between the PPS scores of a certain image property in the train and in the test datasets - this is an indication for a drift between the relation of the property and the label and a possible bias in one of the datasets.

For example: an object detection dataset that identifies household items. In it, a pen would usually create a long and thin rectangle bounding box. If in the test dataset the pens would be angled differently, or other object are mistakenly identified as pens, the bounding boxes may have a different aspect ratio. In this case, the PPS of the train dataset will be high, while the PPS of the test dataset would be low, indicating that a bias in the train dataset does not appear in the test dataset, and could indicate the model will not be able to infer correctly on the test (or any other) dataset due to drift.

How do we calculate for different vision tasks?#

• For classification tasks, this check uses PPS to predict the class by image properties.

• For object detection tasks, this check uses PPS to predict the class of each bounding box, by the image properties of that specific bounding box. This means that for each image, this check crops all the sub-images defined by bounding boxes, and uses them as inputs as though they were regular classification dataset images.

How is the Predictive Power Score (PPS) calculated?#

The properties’ predictive score results in a numeric score between 0 (feature has no predictive power) and 1 (feature can fully predict the label alone).

The process of calculating the PPS is the following:

The properties’ predictive score results in a numeric score between 0 (feature has no predictive power) and 1 (feature can fully predict the label alone).

The process of calculating the PPS is the following:

1. Extract from the data only the label and the feature being tested

2. Drop samples with missing values

3. Keep 5000 (this is configurable parameter) samples from the data

4. Preprocess categorical columns. For the label using sklearn.LabelEncoder and for the feature using sklearn.OneHotEncoder

5. Partition the data with 4-fold cross-validation

6. Train decision tree

7. Compare the trained model’s performance with naive model’s performance as follows:

Regression: The naive model always predicts the median of the label column, the metric being used is MAE and the PPS calculation is:

Classification: The naive model always predicts the most common class of the label column, The metric being used is F1 and the PPS calculation is: :math: rac{ ext{F1 model} - ext{F1 naive}}{1 - ext{F1 naive}}

Note: all the PPS parameters can be changed by passing to the check the parameter ppscore_params

For further information about PPS you can visit the ppscore github or the following blog post: RIP correlation. Introducing the Predictive Power Score

Run the check on a classification task (MNIST)#

Imports#

import numpy as np

from deepchecks.vision.checks import SimpleFeatureContribution
from deepchecks.vision.datasets.classification.mnist import load_dataset


train_ds = load_dataset(train=True, object_type='VisionData')
test_ds = load_dataset(train=False, object_type='VisionData')


Insert bias#

Let’s see what happens when we insert bias into the dataset.

Specifically, we’re going to change the pixel values of the image depending on the label (0 to 9) so there is a correlation between brightness of image and the label (also a small correlation of the index)

from deepchecks.vision.utils.transformations import un_normalize_batch

def mnist_batch_to_images_with_bias_mod(mod):
def mnist_batch_to_images_with_bias(batch):
"""Create function which inverse the data normalization."""
tensor = batch[0]
tensor = tensor.permute(0, 2, 3, 1)
ret = un_normalize_batch(tensor, (0.1307,), (0.3081,))
# add some label/index correlation
for i, label in enumerate(batch[1]):
if i % mod != 0:
ret[i] = np.ones(ret[i].shape) * int(i % 3 + 1) * int(label)

return ret
return mnist_batch_to_images_with_bias

train_ds.batch_to_images = mnist_batch_to_images_with_bias_mod(9)
test_ds.batch_to_images = mnist_batch_to_images_with_bias_mod(2)


Run after bias#

check = SimpleFeatureContribution()
check.run(train_ds, test_ds)


Out:

Validating Input:   0%| | 0/1 [00:00<?, ? /s]

Ingesting Batches - Train Dataset:   0%|                                                                                                                                                             | 0/157 [00:00<?, ? Batch/s]
Ingesting Batches - Train Dataset:   8%|#############                                                                                                                                                | 13/157 [00:00<00:01, 129.35 Batch/s]
Ingesting Batches - Train Dataset:  17%|###########################                                                                                                                                  | 27/157 [00:00<00:00, 130.17 Batch/s]
Ingesting Batches - Train Dataset:  26%|#########################################                                                                                                                    | 41/157 [00:00<00:00, 127.80 Batch/s]
Ingesting Batches - Train Dataset:  34%|######################################################                                                                                                       | 54/157 [00:00<00:00, 128.09 Batch/s]
Ingesting Batches - Train Dataset:  43%|###################################################################                                                                                          | 67/157 [00:00<00:00, 127.96 Batch/s]
Ingesting Batches - Train Dataset:  52%|#################################################################################                                                                            | 81/157 [00:00<00:00, 129.46 Batch/s]
Ingesting Batches - Train Dataset:  60%|##############################################################################################                                                               | 94/157 [00:00<00:00, 129.19 Batch/s]
Ingesting Batches - Train Dataset:  68%|###########################################################################################################                                                  | 107/157 [00:00<00:00, 129.39 Batch/s]
Ingesting Batches - Train Dataset:  76%|########################################################################################################################                                     | 120/157 [00:00<00:00, 129.51 Batch/s]
Ingesting Batches - Train Dataset:  85%|#####################################################################################################################################                        | 133/157 [00:01<00:00, 126.85 Batch/s]
Ingesting Batches - Train Dataset:  93%|##################################################################################################################################################           | 146/157 [00:01<00:00, 126.07 Batch/s]

Ingesting Batches - Test Dataset:   0%|                                                                                                                                                             | 0/157 [00:00<?, ? Batch/s]
Ingesting Batches - Test Dataset:   1%|#                                                                                                                                                            | 1/157 [00:00<00:17,  8.97 Batch/s]
Ingesting Batches - Test Dataset:   1%|##                                                                                                                                                           | 2/157 [00:00<00:17,  8.78 Batch/s]
Ingesting Batches - Test Dataset:   2%|###                                                                                                                                                          | 3/157 [00:00<00:17,  8.72 Batch/s]
Ingesting Batches - Test Dataset:   3%|####                                                                                                                                                         | 4/157 [00:00<00:18,  8.35 Batch/s]
Ingesting Batches - Test Dataset:   3%|#####                                                                                                                                                        | 5/157 [00:00<00:18,  8.11 Batch/s]
Ingesting Batches - Test Dataset:   4%|######                                                                                                                                                       | 6/157 [00:00<00:17,  8.41 Batch/s]
Ingesting Batches - Test Dataset:   4%|######9                                                                                                                                                      | 7/157 [00:00<00:17,  8.43 Batch/s]
Ingesting Batches - Test Dataset:   5%|########                                                                                                                                                     | 8/157 [00:00<00:17,  8.57 Batch/s]
Ingesting Batches - Test Dataset:   6%|#########                                                                                                                                                    | 9/157 [00:01<00:16,  8.85 Batch/s]
Ingesting Batches - Test Dataset:   6%|##########                                                                                                                                                   | 10/157 [00:01<00:16,  9.04 Batch/s]

Computing Check:   0%| | 0/1 [00:00<?, ? Check/s]
Computing Check: 100%|#| 1/1 [00:01<00:00,  1.41s/ Check]


Simple Feature Contribution

Return the Predictive Power Score of image properties, in order to estimate their ability to predict the label.

The Predictive Power Score (PPS) is used to estimate the ability of an image property (such as brightness)to predict the label by itself. (Read more about Predictive Power Score)
In the graph above, we should suspect we have problems in our data if:
1. Train dataset PPS values are high:
A high PPS (close to 1) can mean that there's a bias in the dataset, as a single property can predict the label successfully, using simple classic ML algorithms
2. Large difference between train and test PPS (train PPS is larger):
An even more powerful indication of dataset bias, as an image property that was powerful in train
but not in test can be explained by bias in train that is not relevant to a new dataset.
3. Large difference between test and train PPS (test PPS is larger):
An anomalous value, could indicate drift in test dataset that caused a coincidental correlation to the target label.

Note - data sampling: Running on 10000 train data samples out of 60000. Sample size can be controlled with the "n_samples" parameter.

We can see that the check detected the bias we inserted, and that the brightness property of the image has a high PPS, meaning it can be used to solely predict the label.

Run the check on an Object Detection task (COCO)#

from deepchecks.vision.datasets.detection.coco import (load_dataset,
yolo_label_formatter)

train_ds = load_dataset(train=True, object_type='VisionData')
test_ds = load_dataset(train=False, object_type='VisionData')

check = SimpleFeatureContribution(per_class=False)
check.run(train_ds, test_ds)


Out:

Validating Input:   0%| | 0/1 [00:00<?, ? /s]

Ingesting Batches - Train Dataset:   0%|  | 0/2 [00:00<?, ? Batch/s]
Ingesting Batches - Train Dataset:  50%|# | 1/2 [00:01<00:01,  1.21s/ Batch]
Ingesting Batches - Train Dataset: 100%|##| 2/2 [00:02<00:00,  1.13s/ Batch]

Ingesting Batches - Test Dataset:   0%|  | 0/2 [00:00<?, ? Batch/s]
Ingesting Batches - Test Dataset:  50%|# | 1/2 [00:00<00:00,  1.07 Batch/s]
Ingesting Batches - Test Dataset: 100%|##| 2/2 [00:01<00:00,  1.01s/ Batch]

Computing Check:   0%| | 0/1 [00:00<?, ? Check/s]
Computing Check: 100%|#| 1/1 [00:00<00:00,  3.61 Check/s]


Simple Feature Contribution

Return the Predictive Power Score of image properties, in order to estimate their ability to predict the label.

The Predictive Power Score (PPS) is used to estimate the ability of an image property (such as brightness)to predict the label by itself. (Read more about Predictive Power Score)
In the graph above, we should suspect we have problems in our data if:
1. Train dataset PPS values are high:
A high PPS (close to 1) can mean that there's a bias in the dataset, as a single property can predict the label successfully, using simple classic ML algorithms
2. Large difference between train and test PPS (train PPS is larger):
An even more powerful indication of dataset bias, as an image property that was powerful in train
but not in test can be explained by bias in train that is not relevant to a new dataset.
3. Large difference between test and train PPS (test PPS is larger):
An anomalous value, could indicate drift in test dataset that caused a coincidental correlation to the target label.

Insert bias#

Let’s now see what happens when we insert bias into the dataset.

Specifically, we’re going to change the pixel values of parts of the image (where the bounding boxes are), so there is a correlation between brightness of image and the label

# Increase the pixel values of all bounding boxes by the labels value:
def coco_batch_to_images_with_bias_mod(mod):
def coco_batch_to_images_with_bias(batch):
import numpy as np
ret = [np.array(x) for x in batch[0]]
for i, labels in enumerate(train_ds.batch_to_labels(batch)):
if i % mod != 0:
for label in labels:
x, y, w, h = np.array(label[1:]).astype(int)
ret[i][y:y+h, x:x+w] = ret[i][y:y+h, x:x+w].clip(min=200) * int(label[0])
return ret
return coco_batch_to_images_with_bias

train_ds.batch_to_images = coco_batch_to_images_with_bias_mod(12)
test_ds.batch_to_images = coco_batch_to_images_with_bias_mod(2)


Re-run after bias#

check = SimpleFeatureContribution(per_class=False)
check.run(train_ds, test_ds)


Out:

Validating Input:   0%| | 0/1 [00:00<?, ? /s]

Ingesting Batches - Train Dataset:   0%|  | 0/2 [00:00<?, ? Batch/s]
Ingesting Batches - Train Dataset:  50%|# | 1/2 [00:01<00:01,  1.25s/ Batch]
Ingesting Batches - Train Dataset: 100%|##| 2/2 [00:02<00:00,  1.16s/ Batch]

Ingesting Batches - Test Dataset:   0%|  | 0/2 [00:00<?, ? Batch/s]
Ingesting Batches - Test Dataset:  50%|# | 1/2 [00:00<00:00,  1.09 Batch/s]
Ingesting Batches - Test Dataset: 100%|##| 2/2 [00:01<00:00,  1.01 Batch/s]

Computing Check:   0%| | 0/1 [00:00<?, ? Check/s]
Computing Check: 100%|#| 1/1 [00:00<00:00,  3.76 Check/s]


Simple Feature Contribution

Return the Predictive Power Score of image properties, in order to estimate their ability to predict the label.

The Predictive Power Score (PPS) is used to estimate the ability of an image property (such as brightness)to predict the label by itself. (Read more about Predictive Power Score)
In the graph above, we should suspect we have problems in our data if:
1. Train dataset PPS values are high:
A high PPS (close to 1) can mean that there's a bias in the dataset, as a single property can predict the label successfully, using simple classic ML algorithms
2. Large difference between train and test PPS (train PPS is larger):
An even more powerful indication of dataset bias, as an image property that was powerful in train
but not in test can be explained by bias in train that is not relevant to a new dataset.
3. Large difference between test and train PPS (test PPS is larger):
An anomalous value, could indicate drift in test dataset that caused a coincidental correlation to the target label.

We can see that the check detected the bias we inserted, and that the brightness property of the image has a high PPS, meaning it can be used to solely predict the label.

Define a condition#

We can define on our check a condition that will validate that our pps scores aren’t too high. The check has 2 possible built-in conditions:

add_condition_feature_pps_difference_not_greater_than - Validate that the difference in the PPS between train and test is not larger than defined amount (default 0.2)

add_condition_feature_pps_in_train_not_greater_than - Validate that the PPS scores on train dataset are not exceeding a defined amount (default 0.7)

Let’s add the conditions, and re-run the check:

check = SimpleFeatureContribution(per_class=False).add_condition_feature_pps_difference_not_greater_than(0.1) \
result = check.run(train_dataset=train_ds, test_dataset=test_ds)


Out:

Validating Input:   0%| | 0/1 [00:00<?, ? /s]

Ingesting Batches - Train Dataset:   0%|  | 0/2 [00:00<?, ? Batch/s]
Ingesting Batches - Train Dataset:  50%|# | 1/2 [00:01<00:01,  1.19s/ Batch]
Ingesting Batches - Train Dataset: 100%|##| 2/2 [00:02<00:00,  1.13s/ Batch]

Ingesting Batches - Test Dataset:   0%|  | 0/2 [00:00<?, ? Batch/s]
Ingesting Batches - Test Dataset:  50%|# | 1/2 [00:00<00:00,  1.13 Batch/s]
Ingesting Batches - Test Dataset: 100%|##| 2/2 [00:01<00:00,  1.03 Batch/s]

Computing Check:   0%| | 0/1 [00:00<?, ? Check/s]
Computing Check: 100%|#| 1/1 [00:00<00:00,  3.77 Check/s]

Simple Feature Contribution

Total running time of the script: ( 0 minutes 18.038 seconds)

Gallery generated by Sphinx-Gallery