API Reference - OutlierSampleDetection

Outlier Sample Detection#

This notebook provides an overview for using and understanding the Outlier Sample Detection check.

Structure:

How deepchecks detects outliers
Prepare data
Run the check
Define a condition

How deepchecks detects outliers#

Outlier Sample Detection searches for outlier samples (jointly across all features) using the LoOP (Local Outlier Probabilities) algorithm. LoOP is a robust method for detecting outliers across multiple variables: it compares the density in the area of a sample with the densities in the areas of its nearest neighbors (see the LoOP paper for further details).
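The density comparison described above can be sketched in plain Python. This is a minimal illustration that follows the formulas in the LoOP paper, not the deepchecks implementation; the function name `loop_scores`, its arguments, and the toy data are assumptions made for this example.

```python
import math

def loop_scores(dist, k, extent=3):
    """Local outlier probabilities from a square distance matrix (list of lists)."""
    n = len(dist)
    # k nearest neighbors of each sample, excluding the sample itself
    neighbors = [
        sorted((j for j in range(n) if j != i), key=lambda j: dist[i][j])[:k]
        for i in range(n)
    ]
    # Probabilistic set distance: extent times the root-mean-square
    # distance from a sample to its neighbors (a proxy for inverse density)
    pdist = [
        extent * math.sqrt(sum(dist[i][j] ** 2 for j in neighbors[i]) / k)
        for i in range(n)
    ]
    # PLOF: a sample's own pdist relative to the mean pdist of its neighbors
    plof = []
    for i in range(n):
        mean_neighbor_pdist = sum(pdist[j] for j in neighbors[i]) / k
        plof.append(pdist[i] / mean_neighbor_pdist - 1.0 if mean_neighbor_pdist > 0 else 0.0)
    # Normalization factor aggregated over the whole dataset
    nplof = extent * math.sqrt(sum(p ** 2 for p in plof) / n)
    if nplof == 0:
        return [0.0] * n
    # Map the normalized PLOF to a probability in [0, 1]
    return [max(0.0, math.erf(p / (nplof * math.sqrt(2)))) for p in plof]

# Hypothetical one-dimensional data with one isolated point
points = [0.0, 1.0, 2.0, 3.0, 4.0, 50.0]
dist = [[abs(a - b) for b in points] for a in points]
scores = loop_scores(dist, k=2)
print(scores)
```

In this toy example the isolated point gets the largest score, and samples that are denser than their neighbors are clipped to 0.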

LoOP relies on a distance matrix. Our implementation uses the Gower distance, which averages the per-feature distances between two samples. For a numeric feature, the distance is the absolute difference divided by the range of the feature; for a categorical feature, it is an indicator of whether the values differ: 0 if they are the same and 1 otherwise (see link for further details).
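As a rough illustration of the Gower distance described above, the following sketch computes it for a single pair of samples. The function name `gower_distance`, the `ranges` and `cat_mask` arguments, and the sample values are hypothetical and not part of the deepchecks API.

```python
def gower_distance(a, b, ranges, cat_mask):
    """Average per-feature distance between two samples a and b."""
    per_feature = []
    for x, y, value_range, is_cat in zip(a, b, ranges, cat_mask):
        if is_cat:
            # Categorical feature: 0 if the values match, 1 otherwise
            per_feature.append(0.0 if x == y else 1.0)
        elif value_range == 0:
            # Constant numeric feature: contributes no distance
            per_feature.append(0.0)
        else:
            # Numeric feature: absolute difference scaled by the feature range
            per_feature.append(abs(x - y) / value_range)
    return sum(per_feature) / len(per_feature)

# Hypothetical samples: two numeric features followed by one categorical feature
ranges = [10.0, 4.0, None]       # range (max - min) of each numeric feature
cat_mask = [False, False, True]  # marks which features are categorical
d = gower_distance([2.0, 1.0, "red"], [7.0, 3.0, "blue"], ranges, cat_mask)
print(d)
```

Because each per-feature distance lies between 0 and 1, the averaged distance does too, which keeps features with very different scales comparable.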

Imports#

import pandas as pd
from sklearn.datasets import load_iris

from deepchecks.tabular import Dataset
from deepchecks.tabular.checks import OutlierSampleDetection

Prepare data#

iris = pd.DataFrame(load_iris().data)
iris.describe()

	0	1	2	3
count	150.000000	150.000000	150.000000	150.000000
mean	5.843333	3.057333	3.758000	1.199333
std	0.828066	0.435866	1.765298	0.762238
min	4.300000	2.000000	1.000000	0.100000
25%	5.100000	2.800000	1.600000	0.300000
50%	5.800000	3.000000	4.350000	1.300000
75%	6.400000	3.300000	5.100000	1.800000
max	7.900000	4.400000	6.900000	2.500000

Add an outlier:

outlier_sample = [1, 10, 50, 100]
iris.loc[len(iris.index)] = outlier_sample
print(iris.tail())
modified_iris = Dataset(iris, cat_features=[])

       0     1     2      3
146  6.3   2.5   5.0    1.9
147  6.5   3.0   5.2    2.0
148  6.2   3.4   5.4    2.3
149  5.9   3.0   5.1    1.8
150  1.0  10.0  50.0  100.0

Run the Check#

We set the nearest_neighbors_percent and extent_parameter parameters of the LoOP algorithm.

check = OutlierSampleDetection(nearest_neighbors_percent=0.01, extent_parameter=3)
check.run(modified_iris)

Outlier Sample Detection

	Outlier Probability Score	0	1	2	3
150	1.00	1.00	10.00	50.00	100.00
41	0.67	4.50	2.30	1.30	0.30
108	0.50	6.70	2.50	5.80	1.80
109	0.50	7.20	3.60	6.10	2.50
22	0.44	4.60	3.60	1.00	0.20

Define a condition#

Now we define a condition that enforces that the ratio of outlier samples in our dataset is at most 0.001 (0.1%), where a sample counts as an outlier if its outlier probability score exceeds 0.9.

check = OutlierSampleDetection()
check.add_condition_outlier_ratio_less_or_equal(max_outliers_ratio=0.001, outlier_score_threshold=0.9)
check.run(modified_iris)

Outlier Sample Detection

Conditions Summary

Status	Condition	More Info
!	Ratio of samples exceeding the outlier score threshold 0.9 is less or equal to 0.1%	0.6% of dataset samples above outlier threshold

	Outlier Probability Score	0	1	2	3
150	1.00	1.00	10.00	50.00	100.00
41	0.67	4.50	2.30	1.30	0.30
108	0.50	6.70	2.50	5.80	1.80
109	0.50	7.20	3.60	6.10	2.50
22	0.44	4.60	3.60	1.00	0.20
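The logic behind this condition can be sketched as follows. It is a simplified illustration rather than the deepchecks source: the function name `outlier_ratio_condition` and the score values are hypothetical, and the strict `>` comparison is an assumption based on the word "exceeding" in the condition text.

```python
def outlier_ratio_condition(scores, max_outliers_ratio, outlier_score_threshold):
    """Return the ratio of samples above the threshold and whether the condition passes."""
    # Assumption: a sample is an outlier when its score is strictly above the threshold
    n_outliers = sum(score > outlier_score_threshold for score in scores)
    ratio = n_outliers / len(scores)
    return ratio, ratio <= max_outliers_ratio

# Hypothetical scores: 150 ordinary samples plus one sample scored 1.0
scores = [0.2] * 150 + [1.0]
ratio, passed = outlier_ratio_condition(
    scores, max_outliers_ratio=0.001, outlier_score_threshold=0.9
)
print(f"{ratio:.2%} of samples above threshold, condition passed: {passed}")
```

With one outlier among 151 samples, the ratio is well above the allowed 0.1%, so the condition does not pass, matching the status shown in the summary above.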
