data_integrity

This module contains all data integrity checks.

Classes

ColumnsInfo

Return the role and logical type of each column.

MixedNulls

Search for various types of null values, including string representations of null.

StringMismatch

Detect different variants of string categories.

MixedDataTypes

Detect columns which contain a mix of numerical and string values.

IsSingleValue

Check if there are columns which have only a single unique value in all rows.

SpecialCharacters

Search columns for values that contain only special characters.

StringLengthOutOfBounds

Detect strings whose length is much longer or shorter than the identified "normal" string lengths.

DataDuplicates

Check for duplicate samples in the dataset.

ConflictingLabels

Find samples that have exactly the same feature values but different labels.

ClassImbalance

Check if a dataset is imbalanced by looking at the target variable distribution.

OutlierSampleDetection

Detect outliers in a dataset using the LoOP (Local Outlier Probability) algorithm.

FeatureLabelCorrelation

Return the PPS (Predictive Power Score) of all features in relation to the label.

FeatureFeatureCorrelation

Check for pairwise correlation between the features.

IdentifierLabelCorrelation

Check if identifiers (Index/Date) can be used to predict the label.

PercentOfNulls

Compute the percentage of null values in each column.
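To make a few of the descriptions above concrete, the following is a minimal sketch in plain Python of the ideas behind PercentOfNulls, IsSingleValue, and ConflictingLabels. It is not the library's implementation and does not use its API. The helper names (`is_null`, `percent_of_nulls`, `single_value_columns`, `conflicting_labels`), the `NULL_STRINGS` set, and the sample rows are all hypothetical. Counting string spellings of null follows the MixedNulls description and is an assumption of this sketch.

```python
from collections import defaultdict

# Hypothetical set of string spellings treated as null (an assumption of this sketch).
NULL_STRINGS = {"", "null", "none", "nan", "n/a", "na"}


def is_null(value):
    """Return True for None, a float NaN, or a string that spells a null."""
    if value is None:
        return True
    if isinstance(value, float) and value != value:  # NaN is never equal to itself
        return True
    if isinstance(value, str) and value.strip().lower() in NULL_STRINGS:
        return True
    return False


def percent_of_nulls(rows, columns):
    """Fraction of null-like values in each column."""
    total = len(rows)
    return {
        col: (sum(is_null(row[col]) for row in rows) / total if total else 0.0)
        for col in columns
    }


def single_value_columns(rows, columns):
    """Columns that hold exactly one unique value across all rows."""
    return [col for col in columns if len({row[col] for row in rows}) == 1]


def conflicting_labels(rows, feature_columns, label_column):
    """Group rows by identical feature values; keep groups with more than one label."""
    labels_by_features = defaultdict(set)
    for row in rows:
        key = tuple(row[col] for col in feature_columns)
        labels_by_features[key].add(row[label_column])
    return {key: labels for key, labels in labels_by_features.items() if len(labels) > 1}


# Made-up sample data, for illustration only.
rows = [
    {"city": "Paris", "country": "FR", "age": 31, "label": 1},
    {"city": "Paris", "country": "FR", "age": 31, "label": 0},
    {"city": "null", "country": "FR", "age": None, "label": 1},
    {"city": "Lyon", "country": "FR", "age": 45, "label": 0},
]
features = ["city", "country", "age"]

null_share = percent_of_nulls(rows, features)
constant = single_value_columns(rows, features)
conflicts = conflicting_labels(rows, features, "label")
```

In this sample, `country` is the only constant column, and the two identical Paris rows carry different labels, so they form the single conflicting group. The checks listed on this page may apply different defaults (for example, in how nulls are counted), so treat the sketch as an illustration of the concepts only.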