class OutlierSampleDetection

Detects outliers in a dataset using the LoOP algorithm.

The LoOP (Local Outlier Probabilities) algorithm is a robust method for detecting outliers across multiple variables. It compares the density around a sample with the densities around its nearest neighbors. The output depends strongly on the number of nearest neighbors, k; it is recommended to select a k that represents the maximum cluster size that should still be considered “outliers”. See https://www.dbs.ifi.lmu.de/Publikationen/Papers/LoOP1649.pdf for more details. LoOP relies on a distance matrix. This implementation uses the Gower distance, which measures the distance between two samples based on their numeric and categorical features. See https://statisticaloddsandends.wordpress.com/2021/02/23/what-is-gowers-distance/ for further details.
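To make the Gower distance concrete, here is a minimal pure-Python sketch. It is not the code this check uses. It assumes the common textbook definition: a numeric feature contributes its absolute difference divided by the feature's range, a categorical feature contributes 0 on a match and 1 otherwise, and the result is the mean over features. The function name, the `numeric_ranges` argument, and the sample values are all hypothetical, and missing values are not handled.

```python
from typing import Optional, Sequence, Union

Value = Union[float, str]


def gower_distance(
    a: Sequence[Value],
    b: Sequence[Value],
    numeric_ranges: Sequence[Optional[float]],
) -> float:
    """Gower distance between two samples (simplified, no missing values).

    numeric_ranges[i] holds max - min of feature i over the dataset for a
    numeric feature, or None when feature i is categorical.
    """
    total = 0.0
    for x, y, feature_range in zip(a, b, numeric_ranges):
        if feature_range is None:
            # Categorical feature: 0 when the categories match, 1 otherwise.
            total += 0.0 if x == y else 1.0
        elif feature_range > 0:
            # Numeric feature: absolute difference scaled by the feature range.
            total += abs(x - y) / feature_range
        # A constant numeric feature (range 0) contributes nothing.
    return total / len(numeric_ranges)


# Hypothetical samples with features (age, income, city).
sample_a = (30, 50_000, "Paris")
sample_b = (40, 70_000, "Rome")
ranges = (50, 100_000, None)  # assumed dataset-wide ranges; city is categorical

distance = gower_distance(sample_a, sample_b, ranges)
```

Because every per-feature term lies in [0, 1], the distance also lies in [0, 1], which lets numeric and categorical columns be mixed without further scaling.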

columns: Union[Hashable, List[Hashable]], default: None

Columns to check. If none are given, all columns except the ignored ones are checked.

ignore_columns: Union[Hashable, List[Hashable]], default: None

Columns to ignore. If none are given, the columns to check are determined by the columns parameter.

nearest_neighbors_percent: float, default: 0.01

Share of the dataset, given as a fraction (0.01 means 1%), to use as k, the number of nearest neighbors for the LoOP outlier detection. It is recommended to select a value that represents the maximum cluster size that should still be considered “outliers”.

extent_parameter: int, default: 3

Extent parameter for the LoOP algorithm.

n_samples: int, default: 5_000

Number of samples to use for this check.

n_to_show: int, default: 5

Number of samples with the highest outlier scores to show (taken from the sampled data).

random_state: int, default: 42

Random seed for all check internals.

timeout: int, default: 10

Check will be interrupted if it takes more than this number of seconds. If 0, check will not be interrupted.
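To show how nearest_neighbors_percent, extent_parameter and n_to_show fit together, here is a rough sketch of LoOP scoring on a precomputed distance matrix. It follows the formulas of the LoOP paper linked above and is not this check's implementation. The helper names and the toy data are hypothetical, and the rule that k is the given fraction of the rows (at least 1) is an assumption.

```python
import math
from typing import List, Sequence


def neighbors_from_percent(n_rows: int, nearest_neighbors_percent: float) -> int:
    """Assumed rule: k is the given fraction of the rows, and at least 1."""
    return max(int(nearest_neighbors_percent * n_rows), 1)


def loop_scores(dist: Sequence[Sequence[float]], k: int, extent: int = 3) -> List[float]:
    """Local Outlier Probability in [0, 1] for each row of a distance matrix."""
    n = len(dist)
    if n < 2:
        return [0.0] * n
    k = min(k, n - 1)  # a point is never its own neighbor

    # Indices of the k nearest neighbors of every point.
    neighbors = [
        sorted((j for j in range(n) if j != i), key=lambda j: dist[i][j])[:k]
        for i in range(n)
    ]
    # Probabilistic set distance: extent times the root mean squared
    # distance from a point to its neighbors.
    pdist = [
        extent * math.sqrt(sum(dist[i][j] ** 2 for j in neighbors[i]) / k)
        for i in range(n)
    ]
    # PLOF: a point's set distance relative to the mean set distance of
    # its neighbors, minus 1. Values above 0 mean a sparser neighborhood.
    plof = []
    for i in range(n):
        mean_neighbor_pdist = sum(pdist[j] for j in neighbors[i]) / k
        plof.append(pdist[i] / mean_neighbor_pdist - 1.0 if mean_neighbor_pdist > 0 else 0.0)

    # Normalization factor over the whole dataset, scaled by the extent.
    nplof = extent * math.sqrt(sum(p * p for p in plof) / n)
    if nplof == 0:
        return [0.0] * n
    return [max(0.0, math.erf(p / (nplof * math.sqrt(2)))) for p in plof]


# Toy one-dimensional data: four close points and one distant point.
points = [0.0, 0.1, 0.2, 0.3, 5.0]
dist = [[abs(p - q) for q in points] for p in points]

k = neighbors_from_percent(len(points), 0.4)
scores = loop_scores(dist, k, extent=3)

# Indices of the n_to_show samples with the highest outlier scores.
n_to_show = 2
top = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:n_to_show]
```

In this formulation the extent cancels out of the PLOF ratio and only enters through the normalization factor, so a larger extent lowers every score and makes the detector more conservative.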

__init__(columns: Optional[Union[Hashable, List[Hashable]]] = None, ignore_columns: Optional[Union[Hashable, List[Hashable]]] = None, nearest_neighbors_percent: float = 0.01, extent_parameter: int = 3, n_samples: int = 5000, n_to_show: int = 5, random_state: int = 42, timeout: int = 10, **kwargs)
__new__(*args, **kwargs)


OutlierSampleDetection.add_condition(name, ...)

Add new condition function to the check.


Add condition - no elements over outlier threshold are allowed.


Add condition - ratio of samples over outlier score is less or equal to the threshold.


Remove all conditions from this check instance.


Run conditions on given result.


Return check configuration (conditions' configuration not yet supported).

OutlierSampleDetection.from_config(conf[, ...])

Return check object from a CheckConfig object.

OutlierSampleDetection.from_json(conf[, ...])

Deserialize check instance from JSON string.


Return check metadata.


Name of class in split camel case.


Return parameters to show when printing the check.


Remove given condition by index.

OutlierSampleDetection.run(dataset[, model, ...])

Run check.

OutlierSampleDetection.run_logic(context, ...)

Run check.

OutlierSampleDetection.to_json([indent, ...])

Serialize check instance to JSON string.
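The two outlier conditions described in the table above can be pictured with the sketch below. It is an illustration in plain Python, not the library's code: the function names, the parameter names and the example scores are hypothetical, and it assumes that "over" means strictly greater than the score threshold.

```python
from typing import Sequence


def outlier_ratio(scores: Sequence[float], score_threshold: float) -> float:
    """Fraction of samples whose outlier score exceeds score_threshold."""
    if not scores:
        return 0.0
    return sum(s > score_threshold for s in scores) / len(scores)


def ratio_condition_passes(
    scores: Sequence[float], score_threshold: float, max_ratio: float
) -> bool:
    """Pass when the ratio of outliers is less than or equal to max_ratio."""
    return outlier_ratio(scores, score_threshold) <= max_ratio


def no_outliers_condition_passes(
    scores: Sequence[float], score_threshold: float
) -> bool:
    """Pass only when no sample is over the score threshold."""
    return outlier_ratio(scores, score_threshold) == 0.0


# Hypothetical outlier scores for four samples.
example_scores = [0.05, 0.10, 0.92, 0.30]

ratio = outlier_ratio(example_scores, score_threshold=0.7)
passes_loose = ratio_condition_passes(example_scores, 0.7, max_ratio=0.25)
passes_strict = ratio_condition_passes(example_scores, 0.7, max_ratio=0.10)
passes_none = no_outliers_condition_passes(example_scores, 0.7)
```

The "no outliers" condition is the special case of the ratio condition with a maximum ratio of 0.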