.. DO NOT EDIT.
.. THIS FILE WAS AUTOMATICALLY GENERATED BY SPHINX-GALLERY.
.. TO MAKE CHANGES, EDIT THE SOURCE PYTHON FILE:
.. "checks_gallery/vision/distribution/plot_image_dataset_drift.py"
.. LINE NUMBERS ARE GIVEN BELOW.

.. only:: html

    .. note::
        :class: sphx-glr-download-link-note

        Click :ref:`here <sphx_glr_download_checks_gallery_vision_distribution_plot_image_dataset_drift.py>` to download the full example code

.. rst-class:: sphx-glr-example-title

.. _sphx_glr_checks_gallery_vision_distribution_plot_image_dataset_drift.py:

Image Dataset Drift
===================

This notebook provides an overview for using and understanding the image dataset drift check, which detects drift in simple image properties between the train and test datasets.

**Structure:**

* `What Is Dataset Drift? <#what-is-dataset-drift>`__
* `How Does the ImageDatasetDrift Check Work? <#how-does-the-imagedatasetdrift-check-work>`__
* `Which Image Properties Are Used? <#which-image-properties-are-used>`__
* `Loading The Data <#loading-the-data>`__
* `Run The Check <#run-the-check>`__

What Is Dataset Drift?
----------------------

Data drift is simply a change in the distribution of data over time. It is also one of the top reasons that a machine learning model's performance degrades over time.

Specifically, a whole-dataset drift, or multivariate dataset drift, occurs when there is a change in the relation between the input features.

Causes of data drift include:

* Natural drift in the data, such as lighting (brightness) changes between summer and winter.
* Upstream process changes, such as a camera being replaced with one that has a different lens, which makes images sharper.
* Data quality issues, such as a malfunctioning camera that always returns a black image.
* Data pipeline errors, such as a change in the image augmentations done in preprocessing.

In the context of machine learning, drift between the training set and the test set (which is not due to augmentation) will likely make the model prone to errors.
In other words, if the model was trained on data that differs from the current test data, it will probably make more mistakes when predicting the target variable.

How Does the ImageDatasetDrift Check Work?
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

There are many methods to detect feature drift. Some of them are statistical methods that aim to measure the difference between the distributions of two given sets. These methods are better suited to univariate distributions, and are primarily used to detect drift between two subsets of a single feature.

Measuring multivariate data drift is a bit more challenging. In the image dataset drift check, the multivariate drift is measured by training a classifier that detects which samples come from a known distribution, and defining the drift by the accuracy of this classifier.

Practically, the check concatenates the train and the test sets, and assigns label 0 to samples that come from the training set and label 1 to those from the test set. Then, we train a binary classifier of type `Histogram-based Gradient Boosting Classification Tree <https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.HistGradientBoostingClassifier.html>`_, and measure the drift score using the AUC of this classifier.

As the classifier is a tree model that cannot run on the images themselves, the check calculates properties for each image (such as brightness, aspect ratio, etc.) and uses them as input features to the classifier.

Which Image Properties Are Used?
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

============================== ==========
Property name                  What is it
============================== ==========
Aspect Ratio                   Ratio between height and width of image (height / width)
Area                           Area of image in pixels (height * width)
Brightness                     Average intensity of image pixels; color channels are weighted according to the RGB-to-grayscale formula
RMS Contrast                   Contrast of image, calculated as the standard deviation of its pixels
Mean Red Relative Intensity    Mean over all pixels of the red channel, scaled to their relative intensity in comparison to the other channels [r / (r + g + b)]
Mean Green Relative Intensity  Mean over all pixels of the green channel, scaled to their relative intensity in comparison to the other channels [g / (r + g + b)]
Mean Blue Relative Intensity   Mean over all pixels of the blue channel, scaled to their relative intensity in comparison to the other channels [b / (r + g + b)]
============================== ==========

.. GENERATED FROM PYTHON SOURCE LINES 83-85

Imports
-------

.. GENERATED FROM PYTHON SOURCE LINES 85-91

.. code-block:: default

    import numpy as np

    from deepchecks.vision.checks import ImageDatasetDrift
    from deepchecks.vision.datasets.detection.coco import load_dataset

.. GENERATED FROM PYTHON SOURCE LINES 92-94

Loading the data
----------------

.. GENERATED FROM PYTHON SOURCE LINES 94-99

.. code-block:: default

    train_ds = load_dataset(train=True, object_type='VisionData')
    test_ds = load_dataset(train=False, object_type='VisionData')

.. GENERATED FROM PYTHON SOURCE LINES 100-104

Run the check
-------------

Without drift
^^^^^^^^^^^^^

.. GENERATED FROM PYTHON SOURCE LINES 104-108

.. code-block:: default

    check = ImageDatasetDrift()
    check.run(train_dataset=train_ds, test_dataset=test_ds)

.. rst-class:: sphx-glr-script-out

 Out:

 .. code-block:: none

    Image Dataset Drift

    Calculate drift between the entire train and test datasets (based on image properties) using a trained model.

    Additional Outputs

    Nothing to display



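The properties in the table above are all cheap per-image computations. A minimal sketch of how they could be computed with plain NumPy follows; this is our own illustrative code, not the exact deepchecks implementation, and the grayscale weights assume the common ITU-R BT.601 formula.

```python
import numpy as np

def image_properties(img: np.ndarray) -> dict:
    """Simple drift-relevant properties for an H x W x 3 RGB image."""
    h, w, _ = img.shape
    pixels = img.astype(float)

    # Brightness: grayscale intensity with ITU-R BT.601 channel weights.
    gray = pixels @ np.array([0.299, 0.587, 0.114])

    # Per-pixel relative channel intensities, e.g. r / (r + g + b).
    totals = pixels.sum(axis=-1)
    totals[totals == 0] = 1.0  # guard against division by zero on black pixels
    relative = pixels / totals[..., None]

    return {
        'aspect_ratio': h / w,
        'area': h * w,
        'brightness': gray.mean(),
        'rms_contrast': gray.std(),
        'mean_red_relative_intensity': relative[..., 0].mean(),
        'mean_green_relative_intensity': relative[..., 1].mean(),
        'mean_blue_relative_intensity': relative[..., 2].mean(),
    }

# A solid mid-gray 100x200 image: zero contrast, equal channel shares.
props = image_properties(np.full((100, 200, 3), 128, dtype=np.uint8))
print(props)
```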
.. GENERATED FROM PYTHON SOURCE LINES 109-112

Insert drift
^^^^^^^^^^^^

Now, we will define a custom data object that will insert a drift into the training set.

.. GENERATED FROM PYTHON SOURCE LINES 112-127

.. code-block:: default

    from deepchecks.vision.datasets.detection.coco import COCOData


    def add_brightness(img):
        reverse = 255 - img
        addition_of_brightness = (reverse * 0.07).astype(int)
        return img + addition_of_brightness


    class DriftedCOCO(COCOData):
        def batch_to_images(self, batch):
            return [add_brightness(np.array(img)) for img in batch[0]]

.. GENERATED FROM PYTHON SOURCE LINES 128-134

.. code-block:: default

    train_dataloader = load_dataset(train=True, object_type='DataLoader')
    test_dataloader = load_dataset(train=False, object_type='DataLoader')

    drifted_train_ds = DriftedCOCO(train_dataloader)
    test_ds = COCOData(test_dataloader)

.. GENERATED FROM PYTHON SOURCE LINES 135-137

Run the check again
^^^^^^^^^^^^^^^^^^^

.. GENERATED FROM PYTHON SOURCE LINES 137-139

.. code-block:: default

    check = ImageDatasetDrift()
    check.run(train_dataset=drifted_train_ds, test_dataset=test_ds)

.. rst-class:: sphx-glr-script-out

 Out:

 .. code-block:: none

    Image Dataset Drift

    Calculate drift between the entire train and test datasets (based on image properties) using a trained model.

    Additional Outputs

    The shown features are the image properties (brightness, aspect ratio, etc.) that are most important for the domain classifier - the classifier trained to distinguish between the train and test datasets. The percents of explained dataset difference are the importance values for the features, calculated using `permutation_importance`.

    Main features contributing to drift

    * showing only the top 3 columns, you can change it using n_top_columns param


.. rst-class:: sphx-glr-timing

   **Total running time of the script:** ( 0 minutes 13.185 seconds)

.. _sphx_glr_download_checks_gallery_vision_distribution_plot_image_dataset_drift.py:

.. only:: html

  .. container:: sphx-glr-footer
    :class: sphx-glr-footer-example

    .. container:: sphx-glr-download sphx-glr-download-python

      :download:`Download Python source code: plot_image_dataset_drift.py <plot_image_dataset_drift.py>`

    .. container:: sphx-glr-download sphx-glr-download-jupyter

      :download:`Download Jupyter notebook: plot_image_dataset_drift.ipynb <plot_image_dataset_drift.ipynb>`

.. only:: html

 .. rst-class:: sphx-glr-signature

    `Gallery generated by Sphinx-Gallery <https://sphinx-gallery.github.io>`_