.. DO NOT EDIT.
.. THIS FILE WAS AUTOMATICALLY GENERATED BY SPHINX-GALLERY.
.. TO MAKE CHANGES, EDIT THE SOURCE PYTHON FILE:
.. "checks_gallery/vision/train_test_validation/plot_similar_image_leakage.py"
.. LINE NUMBERS ARE GIVEN BELOW.

.. only:: html

    .. note::
        :class: sphx-glr-download-link-note

        Click :ref:`here ` to download the full example code

.. rst-class:: sphx-glr-example-title

.. _sphx_glr_checks_gallery_vision_train_test_validation_plot_similar_image_leakage.py:

.. _plot_vision_similar_image_leakage:

Similar Image Leakage
***************************

This notebook provides an overview of how to use and understand the "Similar Image Leakage" check.

**Structure:**

* `What is the purpose of the check? <#what-is-the-purpose-of-the-check>`__
* `Run the check <#run-the-check>`__
* `Define a condition <#define-a-condition>`__

What is the purpose of the check?
=================================
The check helps identify whether the training dataset contains any images that are similar to images in the
test dataset. Such a situation is nearly always a case of leakage, because we can expect the model to have an
easier time predicting correctly on an image that is similar to one in the training set than it would in the
"real world". This may mean that the metrics we're seeing for the test data are too optimistic, and that we
should remove those similar images from the test set.

How is similarity calculated?
-------------------------------------
The similarity is calculated using an image hash known as Average Hash. This hash compresses the image using
the following algorithm:

#. Resize the image to a very compact form (the check default is 8x8).
#. Compute the average of the image pixels.
#. For each pixel, replace the value with the boolean result of ``pixel_value >= image_average``.

With the default size, we end up with a representation of the image that is 64 bits (8 bytes) long, but still
contains some real information about the image content.
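
The three steps above can be sketched in plain Python. This is only an illustration, not the check's
implementation: it assumes a grayscale image given as a list of rows whose height and width are multiples of
the hash size, and it resizes by block averaging, which may differ from the resizing method the check uses.
The name ``average_hash`` is chosen here for the example.

.. code-block:: python

    def average_hash(pixels, hash_size=8):
        """Return the average hash of a grayscale image as a flat list of 0/1 bits.

        Assumes len(pixels) and len(pixels[0]) are multiples of hash_size.
        """
        height, width = len(pixels), len(pixels[0])
        block_h, block_w = height // hash_size, width // hash_size

        # Step 1: shrink to hash_size x hash_size by averaging each block.
        small = []
        for by in range(hash_size):
            row = []
            for bx in range(hash_size):
                block = [
                    pixels[y][x]
                    for y in range(by * block_h, (by + 1) * block_h)
                    for x in range(bx * block_w, (bx + 1) * block_w)
                ]
                row.append(sum(block) / len(block))
            small.append(row)

        # Step 2: compute the average of the shrunken image.
        flat = [value for row in small for value in row]
        mean = sum(flat) / len(flat)

        # Step 3: one bit per pixel, set when the pixel is at least the average.
        return [1 if value >= mean else 0 for value in flat]


    # A 16x16 horizontal gradient: dark on the left, bright on the right.
    gradient = [[x * 16 for x in range(16)] for _ in range(16)]
    bits = average_hash(gradient)
    print(len(bits))  # 64 bits, i.e. 8 bytes
    print(bits[:8])   # [0, 0, 0, 0, 1, 1, 1, 1]

The left half of the gradient falls below the average and the right half above it, so each row of the hash
keeps the coarse "dark left, bright right" structure while discarding the exact pixel values.
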
We then check for similar images by searching for test images whose hash is close to the hash of a training
image, where distance is defined as the Hamming distance between the binary vectors of the hashed images.

Note about default parameters
--------------------------------------
Similarity between images depends on the purpose of the dataset. Sometimes we're training a model to find
large differences between images (e.g. people vs. dogs) and sometimes we're training it to find small
differences (e.g. different types of trees). Moreover, sometimes our images come from real-world datasets,
where they were taken by different people in different locations. In some use cases, though, the images are
"cleaner", such as ones taken under a microscope or from the same security camera with the same background.

The check's default parameters are set to match real-world RGB photos and their differences.

If the differences in your dataset are more subtle, consider tuning the *hash_size* and
*similarity_threshold* parameters of this check.
The *hash_size* parameter controls the size of the hashed image. A larger hash_size makes it possible to
detect finer differences between images (and results in fewer images being considered similar).
The *similarity_threshold* parameter controls the ratio of pixels that need to be different in order for two
images to be considered "different". A lower similarity_threshold will define fewer images as "similar".

Run the check
===============

.. GENERATED FROM PYTHON SOURCE LINES 58-64

.. code-block:: default

    from deepchecks.vision.checks import SimilarImageLeakage
    from deepchecks.vision.datasets.detection.coco import load_dataset

    train_ds = load_dataset(train=True, object_type='VisionData', shuffle=False)
    test_ds = load_dataset(train=False, object_type='VisionData', shuffle=False)

.. GENERATED FROM PYTHON SOURCE LINES 65-70

.. code-block:: default

    check = SimilarImageLeakage()
    result = check.run(train_ds, test_ds)
    result

.. rst-class:: sphx-glr-script-out

 .. code-block:: none

    Validating Input: | | 0/1 [Time: 00:00]
    Validating Input: |#####| 1/1 [Time: 00:00]
    Ingesting Batches - Train Dataset: | | 0/2 [Time: 00:00]
    Ingesting Batches - Train Dataset: |##5 | 1/2 [Time: 00:00]
    Ingesting Batches - Train Dataset: |#####| 2/2 [Time: 00:00]
    Ingesting Batches - Train Dataset: |#####| 2/2 [Time: 00:00]
    Ingesting Batches - Test Dataset: | | 0/2 [Time: 00:00]
    Ingesting Batches - Test Dataset: |##5 | 1/2 [Time: 00:00]
    Ingesting Batches - Test Dataset: |#####| 2/2 [Time: 00:00]
    Ingesting Batches - Test Dataset: |#####| 2/2 [Time: 00:00]
    Computing Check: | | 0/1 [Time: 00:00]
    Computing Check: |#####| 1/1 [Time: 00:00]

.. raw:: html

    Similar Image Leakage

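
The comparison step described under "How is similarity calculated?" can be sketched as follows. This is a
minimal illustration under stated assumptions, not the check's internals: hashes are plain lists of 0/1 bits,
two images count as similar when the share of differing bits is below the threshold, and the value ``0.1`` is
only an example value. The exact rule and default used by the check may differ, and ``is_similar`` is a
hypothetical helper name.

.. code-block:: python

    def hamming_distance(hash_a, hash_b):
        """Number of positions at which two equal-length bit lists differ."""
        if len(hash_a) != len(hash_b):
            raise ValueError("hashes must have the same length")
        return sum(a != b for a, b in zip(hash_a, hash_b))


    def is_similar(hash_a, hash_b, similarity_threshold=0.1):
        """Assumed rule: similar when the share of differing bits is below the threshold."""
        return hamming_distance(hash_a, hash_b) / len(hash_a) < similarity_threshold


    train_hash = [0, 0, 0, 0, 1, 1, 1, 1] * 8  # a 64-bit hash
    near_copy = list(train_hash)
    near_copy[0] = 1                           # flip 1 of the 64 bits
    unrelated = [1, 0] * 32                    # differs from train_hash in half of the bits

    print(hamming_distance(train_hash, near_copy))  # 1
    print(is_similar(train_hash, near_copy))        # True
    print(is_similar(train_hash, unrelated))        # False

Under this assumed rule, lowering ``similarity_threshold`` shrinks the number of bits that may differ, so fewer
pairs are reported as similar, which matches the behavior described in the note on default parameters.
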
.. GENERATED FROM PYTHON SOURCE LINES 71-72

To display the results in an IDE like PyCharm, you can use the following code:

.. GENERATED FROM PYTHON SOURCE LINES 72-74

.. code-block:: default

    # result.show_in_window()

.. GENERATED FROM PYTHON SOURCE LINES 75-76

The result will be displayed in a new window.

.. GENERATED FROM PYTHON SOURCE LINES 78-79

As we can see, no similar images were found.

.. GENERATED FROM PYTHON SOURCE LINES 81-85

Insert training images into test
---------------------------------
Let's now see what happens when we insert some of the training images into the test set, after changing
their brightness slightly.

.. GENERATED FROM PYTHON SOURCE LINES 85-112

.. code-block:: default

    from copy import copy

    import numpy as np
    from PIL import Image

    from deepchecks.vision.utils.test_utils import get_modified_dataloader

    test_ds_modified = copy(test_ds)


    def get_modification_func():
        other_dataset = train_ds.data_loader.dataset

        def mod_func(orig_dataset, idx):
            if idx in range(5):
                # Run only on the first 5 images
                data, label = other_dataset[idx]
                # Add some brightness by adding 50 to all pixels
                return Image.fromarray(np.clip(np.array(data, dtype=np.uint16) + 50, 0, 255).astype(np.uint8)), label
            else:
                return orig_dataset[idx]

        return mod_func


    test_ds_modified._data_loader = get_modified_dataloader(test_ds, get_modification_func())

.. GENERATED FROM PYTHON SOURCE LINES 113-115

Re-run after introducing the similar images
--------------------------------------------

.. GENERATED FROM PYTHON SOURCE LINES 115-120

.. code-block:: default

    check = SimilarImageLeakage()
    result = check.run(train_ds, test_ds_modified)
    result

.. rst-class:: sphx-glr-script-out

 .. code-block:: none

    Validating Input: | | 0/1 [Time: 00:00]
    Validating Input: |#####| 1/1 [Time: 00:00]
    Ingesting Batches - Train Dataset: | | 0/2 [Time: 00:00]
    Ingesting Batches - Train Dataset: |##5 | 1/2 [Time: 00:00]
    Ingesting Batches - Train Dataset: |#####| 2/2 [Time: 00:00]
    Ingesting Batches - Train Dataset: |#####| 2/2 [Time: 00:00]
    Ingesting Batches - Test Dataset: | | 0/64 [Time: 00:00]
    Ingesting Batches - Test Dataset: |########### | 11/64 [Time: 00:00]
    Ingesting Batches - Test Dataset: |####################### | 23/64 [Time: 00:00]
    Ingesting Batches - Test Dataset: |#################################### | 36/64 [Time: 00:00]
    Ingesting Batches - Test Dataset: |################################################## | 50/64 [Time: 00:00]
    Ingesting Batches - Test Dataset: |############################################################### | 63/64 [Time: 00:00]
    Ingesting Batches - Test Dataset: |################################################################| 64/64 [Time: 00:00]
    Computing Check: | | 0/1 [Time: 00:00]
    Computing Check: |#####| 1/1 [Time: 00:00]
    Computing Check: |#####| 1/1 [Time: 00:00]

.. raw:: html

    Similar Image Leakage

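
A uniform brightness shift like the one used in this section tends to leave an average hash largely
unchanged: adding the same constant to every pixel raises the image average by that constant too, so each
``pixel_value >= image_average`` comparison gives the same answer, except where clipping at 255 interferes.
The sketch below illustrates this on tiny, already-resized pixel lists; ``threshold_bits`` and ``brighten``
are names made up for this example and are not part of the check.

.. code-block:: python

    def threshold_bits(pixels):
        """Steps 2 and 3 of the hash: compare every pixel with the image average."""
        mean = sum(pixels) / len(pixels)
        return [1 if p >= mean else 0 for p in pixels]


    def brighten(pixels, amount=50):
        """Add `amount` to every pixel and clip the result to the 0-255 range."""
        return [min(255, max(0, p + amount)) for p in pixels]


    # No pixel exceeds 205, so adding 50 never clips: the bits are identical.
    unclipped = [10, 40, 90, 120, 150, 180, 200, 205]
    print(threshold_bits(unclipped) == threshold_bits(brighten(unclipped)))  # True

    # The pixel at 255 cannot get brighter, so the average rises by less than 50
    # and a pixel just below the old average ends up above the new one.
    clipped = [0, 100, 110, 255]
    print(threshold_bits(clipped))            # [0, 0, 0, 1]
    print(threshold_bits(brighten(clipped)))  # [0, 0, 1, 1]

Even in the clipped case only one bit changes, so a comparison that tolerates a small number of differing bits
would still treat the brightened image as similar to its source.
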
.. GENERATED FROM PYTHON SOURCE LINES 121-127

We can see that the check detected the five images from the training set that we introduced into the test set.

Define a condition
==================
We can add a condition to the check that validates that no similar images were found. By default, no similar
images are allowed at all, but this can be modified as shown here.

.. GENERATED FROM PYTHON SOURCE LINES 127-131

.. code-block:: default

    check = SimilarImageLeakage().add_condition_similar_images_less_or_equal(3)
    result = check.run(train_dataset=train_ds, test_dataset=test_ds_modified)
    result.show(show_additional_outputs=False)

.. rst-class:: sphx-glr-script-out

 .. code-block:: none

    Validating Input: | | 0/1 [Time: 00:00]
    Validating Input: |#####| 1/1 [Time: 00:00]
    Ingesting Batches - Train Dataset: | | 0/2 [Time: 00:00]
    Ingesting Batches - Train Dataset: |##5 | 1/2 [Time: 00:00]
    Ingesting Batches - Train Dataset: |#####| 2/2 [Time: 00:00]
    Ingesting Batches - Train Dataset: |#####| 2/2 [Time: 00:00]
    Ingesting Batches - Test Dataset: | | 0/64 [Time: 00:00]
    Ingesting Batches - Test Dataset: |########### | 11/64 [Time: 00:00]
    Ingesting Batches - Test Dataset: |####################### | 23/64 [Time: 00:00]
    Ingesting Batches - Test Dataset: |#################################### | 36/64 [Time: 00:00]
    Ingesting Batches - Test Dataset: |################################################## | 50/64 [Time: 00:00]
    Ingesting Batches - Test Dataset: |############################################################### | 63/64 [Time: 00:00]
    Ingesting Batches - Test Dataset: |################################################################| 64/64 [Time: 00:00]
    Computing Check: | | 0/1 [Time: 00:00]
    Computing Check: |#####| 1/1 [Time: 00:00]
    Computing Check: |#####| 1/1 [Time: 00:00]

.. raw:: html

    Similar Image Leakage

.. rst-class:: sphx-glr-timing

   **Total running time of the script:** ( 0 minutes 4.187 seconds)


.. _sphx_glr_download_checks_gallery_vision_train_test_validation_plot_similar_image_leakage.py:

.. only:: html

  .. container:: sphx-glr-footer sphx-glr-footer-example

    .. container:: sphx-glr-download sphx-glr-download-python

      :download:`Download Python source code: plot_similar_image_leakage.py `

    .. container:: sphx-glr-download sphx-glr-download-jupyter

      :download:`Download Jupyter notebook: plot_similar_image_leakage.ipynb `


.. only:: html

 .. rst-class:: sphx-glr-signature

    `Gallery generated by Sphinx-Gallery `_