data_integrity

data_integrity(columns: Optional[Union[Hashable, List[Hashable]]] = None, ignore_columns: Optional[Union[Hashable, List[Hashable]]] = None, n_top_columns: Optional[int] = None, n_samples: Optional[int] = None, random_state: int = 42, n_to_show: int = 5, **kwargs) → Suite

Suite for detecting integrity issues within a single dataset.

List of Checks:

Check Example                  API Reference
Is Single Value                IsSingleValue
Special Characters             SpecialCharacters
Mixed Nulls                    MixedNulls
Mixed Data Types               MixedDataTypes
String Mismatch                StringMismatch
Data Duplicates                DataDuplicates
String Length Out Of Bounds    StringLengthOutOfBounds
Conflicting Labels             ConflictingLabels
Outlier Sample Detection       OutlierSampleDetection
Feature Label Correlation      FeatureLabelCorrelation
Identifier Label Correlation   IdentifierLabelCorrelation
Feature Feature Correlation    FeatureFeatureCorrelation

Parameters
columns: Union[Hashable, List[Hashable]], default: None

The columns to be checked. If None, all columns will be checked except the ones in ignore_columns.

ignore_columns: Union[Hashable, List[Hashable]], default: None

The columns to be ignored. If None, no columns will be ignored.

n_top_columns: int, optional

Number of columns to show, ordered by feature importance (date, index, and label columns come first). Check dependent.

n_samples: int, default: None

Number of samples to use for checks that sample data. If None, each check uses its own default n_samples.

random_state: int, default: 42

Random seed for all checks.

n_to_show: int, default: 5

Number of top results to show. Check dependent.

**kwargs: dict

Additional arguments to pass to the checks.

Returns
Suite

A suite for detecting integrity issues within a single dataset.

Examples

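>>> import pandas as pd
>>> from deepchecks.tabular.suites import data_integrity
>>> # Hypothetical toy data; run() needs at least one dataset to work on.
>>> df = pd.DataFrame({'a': [1, 2, 3], 'b': ['x', 'y', 'z'], 'c': [0.1, 0.2, 0.3]})
>>> suite = data_integrity(columns=['a', 'b', 'c'], n_samples=1_000_000)
>>> result = suite.run(df)
>>> result.show()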
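
The interaction between columns and ignore_columns described under Parameters can be pictured with a small standalone sketch. This is not the deepchecks implementation: resolve_columns and _as_list are hypothetical helpers, and applying ignore_columns even when columns is given is an assumption made only for illustration.

```python
from typing import Hashable, List, Optional, Sequence, Union

ColumnSpec = Optional[Union[Hashable, List[Hashable]]]


def _as_list(spec: ColumnSpec) -> List[Hashable]:
    """Normalize None, a single column name, or a list of names to a list."""
    if spec is None:
        return []
    if isinstance(spec, list):
        return list(spec)
    return [spec]


def resolve_columns(
    all_columns: Sequence[Hashable],
    columns: ColumnSpec = None,
    ignore_columns: ColumnSpec = None,
) -> List[Hashable]:
    """Hypothetical helper: pick the columns a check would inspect."""
    ignored = set(_as_list(ignore_columns))
    # columns=None means "start from every column in the dataset".
    candidates = list(all_columns) if columns is None else _as_list(columns)
    # Assumption: ignored columns are dropped in both cases.
    return [c for c in candidates if c not in ignored]


available = ["id", "age", "city", "target"]
print(resolve_columns(available))                           # every column
print(resolve_columns(available, ignore_columns="id"))      # all but "id"
print(resolve_columns(available, columns=["age", "city"]))  # only the two named
```

A single column name and a list of names are both accepted, mirroring the Union[Hashable, List[Hashable]] type in the signature.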

run(self, train_dataset: Optional[Union[Dataset, DataFrame]] = None, test_dataset: Optional[Union[Dataset, DataFrame]] = None, model: Optional[BasicModel] = None, feature_importance: Optional[Series] = None, feature_importance_force_permutation: bool = False, feature_importance_timeout: int = 120, with_display: bool = True, y_pred_train: Optional[ndarray] = None, y_pred_test: Optional[ndarray] = None, y_proba_train: Optional[ndarray] = None, y_proba_test: Optional[ndarray] = None, run_single_dataset: Optional[str] = None, model_classes: Optional[List] = None) → SuiteResult

Run all checks.

Parameters
train_dataset: Optional[Union[Dataset, pd.DataFrame]], default: None

Object representing the data an estimator was fitted on.

test_dataset: Optional[Union[Dataset, pd.DataFrame]], default: None

Object representing the data an estimator predicts on.

model: Optional[BasicModel], default: None

A scikit-learn-compatible fitted estimator instance.

run_single_dataset: Optional[str], default: None

'Train' or 'Test' to run single-dataset checks on only that dataset, or None to run them on both train and test.

feature_importance: pd.Series, default: None

Manually supplied feature importance values.

feature_importance_force_permutation: bool, default: False

Force calculation of permutation feature importance.

feature_importance_timeout: int, default: 120

Timeout in seconds for the permutation feature importance calculation.

y_pred_train: Optional[np.ndarray], default: None

Array of the model predictions over the train dataset.

y_pred_test: Optional[np.ndarray], default: None

Array of the model predictions over the test dataset.

y_proba_train: Optional[np.ndarray], default: None

Array of the model prediction probabilities over the train dataset.

y_proba_test: Optional[np.ndarray], default: None

Array of the model prediction probabilities over the test dataset.

model_classes: Optional[List], default: None

For classification: list of classes known to the model.

Returns
SuiteResult

The results of all initialized checks.