Using Deepchecks Vision With a Few Lines of Code#

Deepchecks Vision is built to validate your data and model, however complex they may be. That said, sometimes there is no need to write a full-blown ClassificationData or DetectionData. For a simple classification task, quite a few checks can be run with only a few lines of code. In this tutorial, we will show you how to run all checks that do not require a model on a simple classification task.

This is ideal, for example, when you receive a new dataset for a classification task. Running these checks on the dataset before you even start training gives you a quick idea of what the dataset looks like and what potential issues it contains.

Downloading the Data#

For this example we’ll use a small sample of the RGB EuroSAT dataset. The EuroSAT dataset is based on Sentinel-2 satellite images covering 13 spectral bands and consists of 10 classes with 27,000 labeled and geo-referenced samples.

Citations:

[1] Eurosat: A novel dataset and deep learning benchmark for land use and land cover classification. Patrick Helber, Benjamin Bischke, Andreas Dengel, Damian Borth. IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, 2019.

[2] Introducing EuroSAT: A Novel Dataset and Deep Learning Benchmark for Land Use and Land Cover Classification. Patrick Helber, Benjamin Bischke, Andreas Dengel. 2018 IEEE International Geoscience and Remote Sensing Symposium, 2018.

import urllib.request
import zipfile

import numpy as np

# Download the sample archive into the current working directory.
url = 'https://figshare.com/ndownloader/files/34912884'
urllib.request.urlretrieve(url, 'EuroSAT_data.zip')

# Extract the archive into the ./EuroSAT folder.
with zipfile.ZipFile('EuroSAT_data.zip', 'r') as zip_ref:
    zip_ref.extractall('EuroSAT')

Loading a Simple Classification Dataset#

A simple classification dataset is an image dataset structured in the following way:

  • root/
    • train/
      • class1/
        • image1.jpeg
    • test/
      • class1/
        • image1.jpeg
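Before loading, it can help to confirm that the extracted folder really follows this layout. The following is a minimal sketch using only the standard library. `count_images_per_class` is a hypothetical helper written for this tutorial, not part of Deepchecks, and it assumes every class folder sits directly under `train/` or `test/`.

```python
from pathlib import Path


def count_images_per_class(root, split, image_extension="jpg"):
    """Count the images in each root/<split>/<class>/ folder, keyed by class name.

    Hypothetical helper for illustration only; it is not a Deepchecks function.
    """
    split_dir = Path(root) / split
    counts = {}
    # Every sub-directory of the split folder is treated as one class.
    for class_dir in sorted(p for p in split_dir.iterdir() if p.is_dir()):
        # Only files with the requested extension are counted as images.
        counts[class_dir.name] = sum(1 for _ in class_dir.glob(f"*.{image_extension}"))
    return counts
```

For the data downloaded above, `count_images_per_class('./EuroSAT/euroSAT/', 'train')` should return one entry per class folder. A class that is missing from one split, or that shows zero images because of an extension mismatch, is worth fixing before running any checks.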

from deepchecks.vision.simple_classification_data import load_dataset

train_ds = load_dataset('./EuroSAT/euroSAT/', train=True, object_type='VisionData', image_extension='jpg')
test_ds = load_dataset('./EuroSAT/euroSAT/', train=False, object_type='VisionData', image_extension='jpg')

Running Deepchecks’ full suite#

That’s it, we have just defined the classification data object and are ready to run the train_test_validation suite:

from deepchecks.vision.suites import train_test_validation

suite = train_test_validation()
result = suite.run(train_ds, test_ds)

Out:

Validating Input:   0%| | 0/1 [00:00<?, ? /s]
Ingesting Batches - Train Dataset:  97%|############################## | 30/31 [00:01<00:00, 15.35 Batch/s]
Ingesting Batches - Test Dataset:  94%|##############################  | 30/32 [00:01<00:00, 15.11 Batch/s]
Computing Checks:  62%|#####   | 5/8 [00:03<00:01,  1.95 Check/s, Check=Image Dataset Drift]
Calculating permutation feature importance. Expected to finish in 1 seconds
Computing Checks:  88%|####### | 7/8 [00:04<00:00,  1.60 Check/s, Check=New Labels]

Observing the Results:#

The results can be saved as an HTML file with the following code:

result.save_as_html('output.html')

Out:

'output.html'

Or, if working inside a notebook, the output can be displayed directly by evaluating the result object as the last expression in a cell:

result

Train Test Validation Suite

The suite is composed of various checks such as: Similar Image Leakage, Image Dataset Drift, Image Property Drift, etc.
Each check may contain conditions (which will result in a pass, fail, warning, or error status) as well as other outputs such as plots or tables.
Suites, checks and conditions can all be modified. Read more about custom suites.


Conditions Summary

| Status | Check | Condition | More Info |
| --- | --- | --- | --- |
| Fail | Similar Image Leakage | Number of similar images between train and test is not greater than 0 | Number of similar images between train and test datasets: 18 |
| Fail | Simple Feature Contribution | Train-Test properties' Predictive Power Score difference is not greater than 0.2 | Features with PPS difference above threshold: {'RMS Contrast': '0.34'} |
| Pass | Train Test Label Drift | PSI <= 0.15 and Earth Mover's Distance <= 0.075 for label drift | |
| Pass | Image Property Drift | Earth Mover's Distance <= 0.1 for image properties drift | |
| Pass | New Labels | Percentage of new labels in the test set not above 0.5%. | |

Check With Conditions Output

Similar Image Leakage

Check for images in training that are similar to images in test.

Conditions Summary

| Status | Condition | More Info |
| --- | --- | --- |
| Fail | Number of similar images between train and test is not greater than 0 | Number of similar images between train and test datasets: 18 |

Additional Outputs

Similar Images

Total number of test samples with similar images in train: 18

[Figure: samples of similar images from Train and Test]

Train Test Label Drift

Calculate label drift between train dataset and test dataset, using statistical measures.

Conditions Summary

| Status | Condition | More Info |
| --- | --- | --- |
| Pass | PSI <= 0.15 and Earth Mover's Distance <= 0.075 for label drift | |

Additional Outputs
The Drift score is a measure for the difference between two distributions. In this check, drift is measured for the distribution of the following label properties: ['Samples Per Class'].
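
As a rough illustration of what a drift score over 'Samples Per Class' measures, the sketch below computes the Population Stability Index (PSI) between two sets of per-class counts. It is not the Deepchecks implementation: the function names, the epsilon floor used for empty classes, and the example counts are all assumptions made for illustration.

```python
import math


def class_proportions(counts, classes, epsilon=1e-6):
    """Convert per-class counts to proportions, floored at epsilon.

    The epsilon floor is an assumed smoothing choice that avoids log(0)
    when a class is missing from one of the datasets.
    """
    total = sum(counts.get(c, 0) for c in classes)
    return [max(counts.get(c, 0) / total, epsilon) for c in classes]


def population_stability_index(train_counts, test_counts):
    """PSI = sum over classes of (q - p) * ln(q / p), p = train share, q = test share."""
    classes = sorted(set(train_counts) | set(test_counts))
    p = class_proportions(train_counts, classes)
    q = class_proportions(test_counts, classes)
    return sum((qi - pi) * math.log(qi / pi) for pi, qi in zip(p, q))


# Hypothetical counts, not taken from the EuroSAT sample.
train_counts = {"Forest": 50, "SeaLake": 50}
test_counts = {"Forest": 90, "SeaLake": 10}
psi = population_stability_index(train_counts, test_counts)
```

Identical class proportions give a PSI of 0, and the score grows as the proportions diverge. The condition above passes as long as the PSI stays at or below 0.15.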

Image Property Drift

Calculate drift between train dataset and test dataset per image property, using statistical measures.

Conditions Summary

| Status | Condition | More Info |
| --- | --- | --- |
| Pass | Earth Mover's Distance <= 0.1 for image properties drift | |

Additional Outputs
The Drift score is a measure for the difference between two distributions. In this check, drift is measured for the distribution of the following image properties: ['Area', 'Aspect Ratio', 'Brightness', 'Mean Blue Relative Intensity', 'Mean Green Relative Intensity', 'Mean Red Relative Intensity', 'RMS Contrast'].

Simple Feature Contribution

Return the Predictive Power Score of image properties, in order to estimate their ability to predict the label.

Conditions Summary

| Status | Condition | More Info |
| --- | --- | --- |
| Fail | Train-Test properties' Predictive Power Score difference is not greater than 0.2 | Features with PPS difference above threshold: {'RMS Contrast': '0.34'} |

Additional Outputs
The Predictive Power Score (PPS) is used to estimate the ability of an image property (such as brightness) to predict the label by itself. (Read more about Predictive Power Score)
In the graph above, we should suspect we have problems in our data if:
1. Train dataset PPS values are high:
A high PPS (close to 1) can mean that there's a bias in the dataset, as a single property can predict the label successfully, using simple classic ML algorithms.
2. Large difference between train and test PPS (train PPS is larger):
An even more powerful indication of dataset bias, as an image property that was powerful in train but not in test can be explained by bias in train that is not relevant to a new dataset.
3. Large difference between test and train PPS (test PPS is larger):
An anomalous value, could indicate drift in test dataset that caused a coincidental correlation to the target label.

Check Without Conditions Output

Heatmap Comparison

Check if the average image brightness (or bbox location if applicable) is similar between train and test set.

Additional Outputs

Other Checks That Weren't Displayed

| Check | Reason |
| --- | --- |
| Train Test Prediction Drift - Train Dataset | DeepchecksNotSupportedError: Check is irrelevant for Datasets without model |
| Image Dataset Drift | Nothing found |
| New Labels | Nothing found |


Understanding the Results:#

Looking at the results we see two checks whose conditions have failed:

  1. Similar Image Leakage

  2. Simple Feature Contribution

The first has clearly failed due to the naturally occurring similarity between different ocean / lake images, and the prevailing green of some forest images. We may wish to remove some of these near-duplicate images, but for this dataset the similarity makes sense.
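
To see why visually uniform scenes can trip this check, here is a minimal sketch of one common way to detect near-duplicate images: an average hash compared by Hamming distance. Whether Deepchecks uses exactly this method is an assumption, as are all function names. Images are represented as small grayscale grids (lists of rows) that are assumed to be already downscaled.

```python
def average_hash(gray_pixels):
    """Return a tuple of bits: 1 where a pixel is brighter than the image mean."""
    flat = [value for row in gray_pixels for value in row]
    mean = sum(flat) / len(flat)
    return tuple(1 if value > mean else 0 for value in flat)


def hamming_distance(hash_a, hash_b):
    """Count the positions at which two equal-length hashes differ."""
    return sum(a != b for a, b in zip(hash_a, hash_b))


def find_similar_pairs(train_images, test_images, max_distance=0):
    """Return (train_index, test_index) pairs whose hashes differ in at most max_distance bits."""
    train_hashes = [average_hash(img) for img in train_images]
    pairs = []
    for test_idx, img in enumerate(test_images):
        test_hash = average_hash(img)
        for train_idx, train_hash in enumerate(train_hashes):
            if hamming_distance(train_hash, test_hash) <= max_distance:
                pairs.append((train_idx, test_idx))
    return pairs


# Hypothetical 2x2 grayscale tiles: two flat "water" tiles and one textured tile.
water_train = [[10, 10], [10, 10]]
water_test = [[12, 12], [12, 12]]
textured_train = [[0, 255], [255, 0]]
similar = find_similar_pairs([water_train, textured_train], [water_test])
```

Both flat tiles hash to all zeros because no pixel is brighter than its own image mean, so they are reported as similar even though their pixel values differ. Large stretches of open water or dense forest behave in much the same way.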

The second failure is more interesting. The Simple Feature Contribution check computes various simple image properties and checks whether the image label can be inferred from the property values using a simple model (for example, a Classification Tree). The ability to predict the label using these properties is measured by the Predictive Power Score (PPS), and this measure is compared between the training and test datasets. In this case, the condition alerts us to the fact that the PPS for the “RMS Contrast” property was significantly higher in the training dataset than in the test dataset.
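
The comparison step of this condition can be sketched as follows. This is a simplified illustration, not the Deepchecks code: the function name is hypothetical, the difference is assumed to be train PPS minus test PPS, and the individual scores are made-up values chosen only so that their difference matches the 0.34 reported above.

```python
def pps_difference_above_threshold(train_pps, test_pps, threshold=0.2):
    """Return {property: train PPS - test PPS} for differences above the threshold.

    Hypothetical helper; a property missing from test_pps is treated as PPS 0.
    """
    flagged = {}
    for prop, train_score in train_pps.items():
        diff = train_score - test_pps.get(prop, 0.0)
        if diff > threshold:
            flagged[prop] = round(diff, 2)
    return flagged


# Made-up PPS values per image property, for illustration only.
train_pps = {"RMS Contrast": 0.40, "Brightness": 0.10}
test_pps = {"RMS Contrast": 0.06, "Brightness": 0.08}
failures = pps_difference_above_threshold(train_pps, test_pps)
```

With these values only “RMS Contrast” exceeds the 0.2 threshold, which mirrors the shape of the failure message in the conditions summary.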

We’ll show the relevant plot again for ease of discussion:

check_idx = np.where([result.results[i].check.name() == 'Simple Feature Contribution'
                      for i in range(len(result.results))])[0][0]
result.results[check_idx]


Here we can see the plot dedicated to the PPS of the property RMS Contrast, which measures the contrast in the image by calculating the standard deviation of its grayscale pixel values. This plot shows us that specifically for the classes “Forest” and “SeaLake” (the same culprits from the Similar Image Leakage condition), the contrast is a great predictor, but only in the training data! This means we have a critical problem: our model may learn to classify these classes using only the contrast, without actually learning anything about the image content. We can now go on and fix this issue (perhaps by adding train augmentations, or enriching our training set) even before we start thinking about what model to train for the task.
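
To make the RMS Contrast property concrete, here is a minimal sketch that computes it as the standard deviation of grayscale pixel values, as described above. The grayscale conversion uses the ITU-R BT.601 luma weights, which is an assumption: the conversion Deepchecks applies is not stated here. The function names and the flat list-of-RGB-tuples input format are also chosen only for illustration.

```python
from statistics import pstdev


def to_grayscale(rgb_pixels):
    """Convert (r, g, b) tuples to luma values using BT.601 weights (assumed)."""
    return [0.299 * r + 0.587 * g + 0.114 * b for r, g, b in rgb_pixels]


def rms_contrast(rgb_pixels):
    """RMS contrast: population standard deviation of the grayscale values."""
    return pstdev(to_grayscale(rgb_pixels))


# Hypothetical pixels: a flat dark-blue "water" patch and a black/white patch.
flat_patch = [(10, 40, 90)] * 4
mixed_patch = [(0, 0, 0), (255, 255, 255)] * 2
flat_contrast = rms_contrast(flat_patch)
mixed_contrast = rms_contrast(mixed_patch)
```

The flat patch has zero contrast while the black/white patch has a high one. Classes dominated by uniform textures therefore cluster at low contrast values, which is what lets a simple model separate them using this single property.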
