Data Properties#

Properties are one-dimensional values extracted from the images, labels or predictions. For example, brightness is an image property, and bounding box area is a label property (for detection tasks). Deepchecks includes built-in properties and supports implementing your own.

What Are Properties Used For?#

Properties are used by some of Deepchecks’ checks (e.g. train-test drift) to extract meaningful features from the data, since some quantities, such as drift, are difficult to compute directly on the images. Inspecting the distribution of a property’s values (e.g. to notice that some images are extremely dark, or that the aspect ratio of images differs between the train and test sets) can help uncover potential problems in the way the datasets were built, or hint at the model’s expected performance on unseen data.

Examples of specific scenarios in which measuring properties may come in handy:

  1. Investigating low test performance - detecting high drift in certain properties may help you pinpoint the causes of the model’s lower performance on the test data.

  2. Generalizability on new data - a drift in significant data properties may indicate a lower ability of the model to predict accurately on the new (different) unlabeled data.

  3. Find weak segments - the properties can be used to segment the data and test for low-performing segments. If found, a weak segment may indicate an underrepresented segment or an area where the data quality is worse.

  4. Find obscure relations between the data and the targets - the model training might be affected by properties we are not aware of, and that aren’t the core attributes we want it to learn. For example, in a classification dataset of wolf and dog photographs, if only wolves are photographed in the snow, the brightness of the image may be enough to predict the label “wolf” easily. In this case, a model might not learn to tell a wolf from a dog by the animal’s characteristics, but by the background color.
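The first two scenarios come down to comparing the distribution of a property between two datasets. The sketch below is a minimal illustration under stated assumptions, not Deepchecks' own drift computation: the images are synthetic, the per-image pixel mean stands in for brightness, and `ks_statistic` is a hypothetical hand-written two-sample Kolmogorov-Smirnov statistic.

```python
import numpy as np


def mean_brightness(images):
    """Return one brightness proxy (mean pixel value) per image."""
    return np.array([float(img.mean()) for img in images])


def ks_statistic(a, b):
    """Two-sample Kolmogorov-Smirnov statistic: largest gap between empirical CDFs."""
    a, b = np.sort(a), np.sort(b)
    pooled = np.concatenate([a, b])
    # Fraction of each sample that is <= every pooled value
    cdf_a = np.searchsorted(a, pooled, side="right") / a.size
    cdf_b = np.searchsorted(b, pooled, side="right") / b.size
    return float(np.max(np.abs(cdf_a - cdf_b)))


rng = np.random.default_rng(0)
# Synthetic stand-ins for real datasets: the "test" images are darker on purpose.
train_images = [rng.integers(100, 256, size=(32, 32, 3)) for _ in range(50)]
test_images = [rng.integers(0, 156, size=(32, 32, 3)) for _ in range(50)]

drift = ks_statistic(mean_brightness(train_images), mean_brightness(test_images))
print(f"Brightness drift (KS statistic): {drift:.2f}")
```

A statistic near 0 means the two distributions look alike, while a value near 1 means they barely overlap, which is the kind of signal that would prompt a closer look at how the test set was collected.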

Deepchecks’ Built-in Properties#

We divide the properties by the data that they are based on: images, labels or predictions. You can either use the built-in properties or implement your own and pass them to the relevant checks.

Image Properties#

The built-in image properties are:

| Property name | What is it |
| --- | --- |
| Aspect Ratio | Ratio between height and width of image (height / width) |
| Area | Area of image in pixels (height * width) |
| Brightness | Average intensity of image pixels. Color channels have different weights according to RGB-to-Grayscale formula |
| RMS Contrast | Contrast of image, calculated by standard deviation of pixels |
| Mean Red Relative Intensity | Mean over all pixels of the red channel, scaled to their relative intensity in comparison to the other channels [r / (r + g + b)]. |
| Mean Green Relative Intensity | Mean over all pixels of the green channel, scaled to their relative intensity in comparison to the other channels [g / (r + g + b)]. |
| Mean Blue Relative Intensity | Mean over all pixels of the blue channel, scaled to their relative intensity in comparison to the other channels [b / (r + g + b)]. |
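To make the definitions above concrete, here is a rough NumPy sketch of how such values could be computed for a single image. It is not the library's implementation: the grayscale weights (ITU-R BT.601), the choice to measure contrast on the grayscale image, the handling of all-black pixels, and the helper name `describe_image` are all assumptions made for illustration.

```python
import numpy as np

# Assumed grayscale weights (ITU-R BT.601); the library's formula may differ.
GRAY_WEIGHTS = np.array([0.299, 0.587, 0.114])


def describe_image(image):
    """Compute the properties listed above for one (H, W, 3) RGB image."""
    height, width = image.shape[:2]
    pixels = image.astype(float)
    gray = pixels @ GRAY_WEIGHTS  # weighted channel sum, shape (H, W)

    channel_sum = pixels.sum(axis=2)
    valid = channel_sum > 0  # skip all-black pixels to avoid dividing by zero
    if valid.any():
        relative = (pixels[valid] / channel_sum[valid][:, None]).mean(axis=0)
    else:
        relative = np.full(3, np.nan)

    return {
        "Aspect Ratio": height / width,
        "Area": height * width,
        "Brightness": float(gray.mean()),
        "RMS Contrast": float(gray.std()),
        "Mean Red Relative Intensity": float(relative[0]),
        "Mean Green Relative Intensity": float(relative[1]),
        "Mean Blue Relative Intensity": float(relative[2]),
    }


# A uniform reddish image, 2 pixels high and 4 pixels wide
image = np.zeros((2, 4, 3), dtype=np.uint8)
image[..., 0], image[..., 1], image[..., 2] = 200, 50, 50

props = describe_image(image)
for name, value in props.items():
    print(f"{name}: {value:.3f}")
```

Because the example image is uniform, its contrast is zero and its red relative intensity is 200 / 300, which makes the output easy to check by hand.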

Label & Prediction Properties#

The built-in label and prediction properties are:

| Property name | What is it |
| --- | --- |
| Samples Per Class | Abundance of each class in the data |
| Bounding Box Area | Area of bounding boxes in pixels (height * width) for object detection |
| Number of Bounding Boxes Per Image | Number of bounding boxes in a single image for object detection |
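As an illustration of these three properties, the sketch below computes them on a small batch of detection labels. It assumes each label is an array of shape (N, 5) laid out as [class_id, x, y, w, h], and it uses NumPy arrays and hypothetical function names to stay self-contained; the library's own implementation and label format may differ.

```python
from collections import Counter

import numpy as np


def samples_per_class(labels):
    """Count how many boxes of each class appear across all images."""
    counts = Counter()
    for label in labels:
        counts.update(int(class_id) for class_id in label[:, 0])
    return dict(counts)


def bbox_areas(labels):
    """Return, per image, the list of box areas (width * height)."""
    return [(label[:, 3] * label[:, 4]).tolist() for label in labels]


def boxes_per_image(labels):
    """Return the number of bounding boxes in each image."""
    return [label.shape[0] for label in labels]


# Assumed layout per row: [class_id, x, y, w, h]
labels = [
    np.array([[0, 10, 10, 20, 30], [1, 5, 5, 10, 10]], dtype=float),
    np.zeros((0, 5)),  # an image with no objects
    np.array([[1, 0, 0, 4, 5]], dtype=float),
]

print(samples_per_class(labels))
print(bbox_areas(labels))
print(boxes_per_image(labels))
```

Note that an image without objects still contributes an entry (an empty list or a zero), so every output stays aligned with the batch.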

Property Structure#

All property types have a similar structure, which is a dictionary with 3 keys:

  • name - The name of the property

  • method - The callable function that calculates the property’s value. It accepts the relevant data and returns a list of values.

  • output_type - Relates to the method’s return values list, and is one of the following:

    • continuous - For numeric values with continuous nature

    • discrete - For numeric values with discrete nature or non-numeric values

    • class_id - Means the output is of class ids. In this case we will try to translate the ids into their corresponding class labels.

Each dictionary is a single property, and the checks accept a list of those dictionaries. For example:

def mean_image(images):
  return [image.mean() for image in images]

properties = [
  {'name': 'My Image Mean', 'method': mean_image, 'output_type': 'continuous'}
]
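Since a property is a plain dictionary, a typo in a key or in the output type is easy to miss. The following sketch shows one way to sanity-check a list of properties before handing it to a check. `validate_properties` is a hypothetical helper written for this illustration, not part of Deepchecks, and the library performs its own validation.

```python
REQUIRED_KEYS = {"name", "method", "output_type"}
VALID_OUTPUT_TYPES = {"continuous", "discrete", "class_id"}


def validate_properties(properties):
    """Raise ValueError if any property dictionary is malformed."""
    for prop in properties:
        missing = REQUIRED_KEYS - set(prop)
        if missing:
            raise ValueError(f"Property is missing keys: {sorted(missing)}")
        if not callable(prop["method"]):
            raise ValueError(f"'method' of {prop['name']!r} must be callable")
        if prop["output_type"] not in VALID_OUTPUT_TYPES:
            raise ValueError(
                f"'output_type' of {prop['name']!r} must be one of "
                f"{sorted(VALID_OUTPUT_TYPES)}"
            )


def mean_image(images):
    return [image.mean() for image in images]


properties = [
    {"name": "My Image Mean", "method": mean_image, "output_type": "continuous"}
]

validate_properties(properties)  # passes silently when everything is well-formed
```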

The Method’s Input#

Each property is built for the specific data type that it runs on, and receives its deepchecks-expected format, as demonstrated in Deepchecks’ format. Note that prediction and label-based properties are not interchangeable due to their slightly different format, even if they calculate similar values.

The Method’s Output#

Each property function must return a sequence of the same length as the input object. This is used later to couple each sample with its property values. For image properties we expect each image to generate a single property value, which results in a list of primitive values of the same length as the number of images. For label and prediction properties, on the other hand, each label or prediction may have multiple primitive values (for example, the area of each bounding box), so the returned list may contain either primitive values or lists of primitive values per label/prediction.
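The two output shapes described above can be sketched as follows. The example assumes images are (H, W, C) NumPy arrays and detection labels are (N, 5) arrays laid out as [class_id, x, y, w, h]; `run_property` is a hypothetical helper that only checks the length requirement.

```python
import numpy as np


def image_area(images):
    """Image property: one primitive value per image."""
    return [img.shape[0] * img.shape[1] for img in images]


def box_widths(labels):
    """Label property: a list of values per image, one per bounding box."""
    # Assumed label layout: [class_id, x, y, w, h]
    return [label[:, 3].tolist() for label in labels]


def run_property(method, batch):
    """Hypothetical helper: run a property and verify one entry per sample."""
    values = method(batch)
    if len(values) != len(batch):
        raise ValueError(f"Expected {len(batch)} entries, got {len(values)}")
    return values


images = [np.zeros((4, 6, 3)), np.zeros((2, 2, 3))]
labels = [np.array([[0, 1, 1, 2, 3], [1, 0, 0, 5, 5]]), np.zeros((0, 5))]

areas = run_property(image_area, images)   # flat list, one value per image
widths = run_property(box_widths, labels)  # nested list, one inner list per image
print(areas, widths)
```

In both cases the outer list has exactly one entry per sample; only the label property nests a second list inside each entry.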

Customizing the Checks’ Properties#

By default, checks using properties will use the built-in properties. Those default properties can be overridden in one of two ways:

  1. Properties - a list of properties (in the format specified above) to be calculated on the data during the check, passed to the check init. See the Property Structure section.

  2. Pre-Calculated Properties - a dictionary with the results of pre-calculated properties per sample, passed to the check run. See the Pre-Calculated Properties section.

Properties Demonstration#

We will demonstrate the 3 drift checks (one for each property type) and implement the properties to pass to them.

Image Property#

from typing import List

import numpy as np
from skimage.color import rgb2gray

from deepchecks.vision.checks.distribution import ImagePropertyDrift


def aspect_ratio(images: List[np.ndarray]) -> List[float]:
  """Return the height-to-width ratio of each image in the batch."""
  # Each image is expected to be an (H, W, C) array
  return [img.shape[0] / img.shape[1] for img in images]

def brightness(images: List[np.ndarray]) -> List[float]:
  """Calculate brightness on each image in the batch."""
  # If grayscale (single channel), average the pixels directly
  if images[0].shape[2] == 1:
      return [img.mean() for img in images]
  else:
      return [rgb2gray(img).mean() for img in images]


properties = [
  {'name': 'Aspect Ratio', 'method': aspect_ratio, 'output_type': 'continuous'},
  {'name': 'Brightness', 'method': brightness, 'output_type': 'continuous'}
]

check = ImagePropertyDrift(alternative_image_properties=properties)

Label Property#

For label properties, the input varies according to the task type you are running. In this example we implement properties that apply to the Detection task type.

from typing import List

import torch

from deepchecks.vision.checks.distribution import TrainTestLabelDrift

def number_of_labels(labels: List[torch.Tensor]) -> List[int]:
  """Return a list containing the number of detections per sample in batch."""
  return [label.shape[0] for label in labels]

def classes_in_labels(labels: List[torch.Tensor]) -> List[List[int]]:
  """Return a list containing the classes in batch."""
  # The class id is read from the first of the 5 values of each label row
  return [label.reshape((-1, 5))[:, 0].tolist() for label in labels]


properties = [
  {'name': 'Labels Per Sample', 'method': number_of_labels, 'output_type': 'discrete'},
  {'name': 'Classes Appearance', 'method': classes_in_labels, 'output_type': 'class_id'}
]

check = TrainTestLabelDrift(label_properties=properties)

Prediction Property#

The input of a prediction property, like that of a label property, varies according to the task type you are running. In this example we implement properties that apply to the Detection task type.

from typing import List

import torch

from deepchecks.vision.checks.distribution import TrainTestPredictionDrift

def classes_of_predictions(predictions: List[torch.Tensor]) -> List[List[int]]:
  """Return a list containing the classes in batch."""
  return [tensor.reshape((-1, 6))[:, -1].tolist() for tensor in predictions]

def bbox_area(predictions: List[torch.Tensor]) -> List[List[float]]:
  """Return a list containing the area of bboxes per image in batch."""
  return [(prediction.reshape((-1, 6))[:, 2] * prediction.reshape((-1, 6))[:, 3]).tolist()
           for prediction in predictions]


properties = [
  {'name': 'Classes in Predictions', 'method': classes_of_predictions, 'output_type': 'class_id'},
  {'name': 'Bounding Box Area', 'method': bbox_area, 'output_type': 'continuous'}
]

check = TrainTestPredictionDrift(prediction_properties=properties)

Pre-Calculated Properties#

Properties can be calculated and saved ahead of time and then passed to the check. This is useful when calculating them on the fly is not practical, for example when it demands extra computing resources that are not always available. It also allows using metadata as properties, such as the camera type for images or the annotator identity for labels.

To use this option, pass the static properties to the check.run argument train_properties in the case of the train dataset or a single dataset, and to the argument test_properties in the case of the test set.

The expected format for the static properties is the following nested dictionary:
  • sample index (int):
    • properties input type (PropertiesInputType):
      • property name (str):
        • property values per sample (list)

The values per sample are given as a list to support the case of object detection, where there might be multiple bounding boxes per image.
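The nesting described above can be sketched with plain dictionaries. To keep the example runnable without the library, `InputKind` below is a stand-in for Deepchecks' PropertiesInputType, and its member names are assumptions; in real code the keys should come from the library's own enum. `build_static_properties` and the metadata fields are likewise hypothetical.

```python
from enum import Enum


class InputKind(Enum):
    """Stand-in for Deepchecks' PropertiesInputType; member names are assumed."""
    IMAGES = "images"
    LABELS = "labels"


def build_static_properties(image_props, label_props):
    """Nest values as: sample index -> input type -> property name -> list."""
    static_props = {}
    for index, (img, lbl) in enumerate(zip(image_props, label_props)):
        static_props[index] = {
            # One value per image, wrapped in a list
            InputKind.IMAGES: {name: [value] for name, value in img.items()},
            # Possibly several values per image, one per bounding box
            InputKind.LABELS: {name: list(values) for name, values in lbl.items()},
        }
    return static_props


# Hypothetical metadata: one camera id per image, one annotator id per bounding box
image_props = [{"camera_id": 3}, {"camera_id": 7}]
label_props = [{"annotator_id": [1, 1]}, {"annotator_id": [2]}]

static_props = build_static_properties(image_props, label_props)
print(static_props[0])
```

Image-level values are wrapped in single-element lists so that every leaf of the structure is a list, whether it holds one value or one value per bounding box.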

Code Example#

import pandas as pd
import numpy as np
from deepchecks.vision.checks import ImagePropertyOutliers
from deepchecks.vision.datasets.detection.coco import load_dataset
from deepchecks.vision.utils import static_properties_from_df

train_data = load_dataset(train=True, object_type='VisionData')

# Say we have a dataframe with previously calculated properties
df = pd.DataFrame({'property1': np.random.random(train_data.num_samples),
                   'property2': np.random.random(train_data.num_samples)})

static_props = static_properties_from_df(df, image_cols=('property1', 'property2'))


check = ImagePropertyOutliers()
result = check.run(train_data, train_properties=static_props)