train_test_validation#

train_test_validation(columns: Optional[Union[Hashable, List[Hashable]]] = None, ignore_columns: Optional[Union[Hashable, List[Hashable]]] = None, n_top_columns: Optional[int] = None, n_samples: Optional[int] = None, random_state: int = 42, n_to_show: int = 5, **kwargs) → Suite[source]#

Suite for validating the correctness of the train-test split, including distribution, leakage, and integrity checks.

List of Checks#

Check                                 API Reference

Datasets Size Comparison              DatasetsSizeComparison
New Label                             NewLabelTrainTest
New Category                          CategoryMismatchTrainTest
String Mismatch Comparison            StringMismatchComparison
Date Train Test Leakage Duplicates    DateTrainTestLeakageDuplicates
Date Train Test Leakage Overlap       DateTrainTestLeakageOverlap
Index Leakage                         IndexTrainTestLeakage
Train Test Samples Mix                TrainTestSamplesMix
Feature Label Correlation Change      FeatureLabelCorrelationChange
Train Test Feature Drift              TrainTestFeatureDrift
Train Test Label Drift                TrainTestLabelDrift
Whole Dataset Drift                   WholeDatasetDrift
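
The returned Suite can be customized before running. A minimal sketch, assuming the standard deepchecks Suite API (printing a suite lists its checks with their indices, remove() drops a check by index, and add() appends one):

    from deepchecks.tabular.suites import train_test_validation
    from deepchecks.tabular.checks import TrainTestSamplesMix

    suite = train_test_validation()
    print(suite)                      # lists the checks above with their indices
    suite.remove(0)                   # drop the check at index 0
    suite.add(TrainTestSamplesMix())  # append a check instance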

Parameters
columns : Union[Hashable, List[Hashable]], default: None

The columns to be checked. If None, all columns are checked except those listed in ignore_columns.

ignore_columns : Union[Hashable, List[Hashable]], default: None

The columns to be ignored. If None, no columns are ignored.

n_top_columns : int, optional

Number of columns to show, ordered by feature importance (date, index, and label columns come first) (check dependent).

n_samples : int, default: None

Number of samples to use for checks that sample data. If None, each check uses its own default n_samples.

random_state : int, default: 42

Random seed for all checks.

n_to_show : int, default: 5

Number of top results to show (check dependent).

**kwargs : dict

Additional arguments to pass to the checks.

Returns
Suite

A suite for validating the correctness of the train-test split, including distribution, leakage, and integrity checks.

Examples

>>> from deepchecks.tabular.suites import train_test_validation
>>> suite = train_test_validation(columns=['a', 'b', 'c'], n_samples=1_000_000)
>>> result = suite.run(train_dataset=train_dataset, test_dataset=test_dataset)
>>> result.show()
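
The doctest above assumes train_dataset and test_dataset already exist. A minimal sketch of constructing them, where the column names 'a', 'b', 'c' and the label 'target' are illustrative placeholders:

    import pandas as pd
    from deepchecks.tabular import Dataset
    from deepchecks.tabular.suites import train_test_validation

    # Toy DataFrames standing in for real train/test splits.
    train_df = pd.DataFrame({'a': [1, 2, 3, 4], 'b': [1, 0, 1, 0],
                             'c': [0.1, 0.2, 0.3, 0.4], 'target': [0, 1, 0, 1]})
    test_df = pd.DataFrame({'a': [5, 6], 'b': [1, 0],
                            'c': [0.5, 0.6], 'target': [1, 0]})

    # Wrap the DataFrames so the suite knows which column is the label.
    train_dataset = Dataset(train_df, label='target')
    test_dataset = Dataset(test_df, label='target')

    suite = train_test_validation(columns=['a', 'b', 'c'])
    result = suite.run(train_dataset=train_dataset, test_dataset=test_dataset)
    result.show()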
run(self, train_dataset: Optional[Union[Dataset, DataFrame]] = None, test_dataset: Optional[Union[Dataset, DataFrame]] = None, model: Optional[BasicModel] = None, feature_importance: Optional[Series] = None, feature_importance_force_permutation: bool = False, feature_importance_timeout: int = 120, with_display: bool = True, y_pred_train: Optional[ndarray] = None, y_pred_test: Optional[ndarray] = None, y_proba_train: Optional[ndarray] = None, y_proba_test: Optional[ndarray] = None) → SuiteResult#

Run all checks.

Parameters
train_dataset : Optional[Union[Dataset, pd.DataFrame]], default: None

Object representing the data the estimator was fitted on.

test_dataset : Optional[Union[Dataset, pd.DataFrame]], default: None

Object representing the data the estimator predicts on.

model : Optional[BasicModel], default: None

A scikit-learn-compatible fitted estimator instance.

feature_importance : Optional[pd.Series], default: None

Pass manually computed feature importance.

feature_importance_force_permutation : bool, default: False

Force calculation of permutation feature importance.

feature_importance_timeout : int, default: 120

Timeout in seconds for the permutation feature importance calculation.

y_pred_train : Optional[np.ndarray], default: None

Array of the model's predictions over the train dataset.

y_pred_test : Optional[np.ndarray], default: None

Array of the model's predictions over the test dataset.

y_proba_train : Optional[np.ndarray], default: None

Array of the model's prediction probabilities over the train dataset.

y_proba_test : Optional[np.ndarray], default: None

Array of the model's prediction probabilities over the test dataset.

features_importance : Optional[pd.Series], default: None

Pass manually computed feature importance.

Deprecated since version 0.8.1: use 'feature_importance' instead.

Returns
SuiteResult

All results from all initialized checks.
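
A sketch of supplying precomputed predictions so that run() does not need to invoke the model itself. The toy data and placeholder column names mirror the earlier example, and DecisionTreeClassifier is just one example of a scikit-learn-compatible estimator; the checks in this particular suite mostly operate on the data alone, so the prediction arguments are shown purely to illustrate the run() API:

    import pandas as pd
    from sklearn.tree import DecisionTreeClassifier
    from deepchecks.tabular import Dataset
    from deepchecks.tabular.suites import train_test_validation

    # Toy train/test splits with placeholder column names.
    train_df = pd.DataFrame({'a': [1, 2, 3, 4], 'b': [1, 0, 1, 0],
                             'c': [0.1, 0.2, 0.3, 0.4], 'target': [0, 1, 0, 1]})
    test_df = pd.DataFrame({'a': [5, 6], 'b': [1, 0],
                            'c': [0.5, 0.6], 'target': [1, 0]})

    clf = DecisionTreeClassifier(random_state=42)
    clf.fit(train_df[['a', 'b', 'c']], train_df['target'])

    # Precompute predictions and probabilities up front.
    y_pred_train = clf.predict(train_df[['a', 'b', 'c']])
    y_pred_test = clf.predict(test_df[['a', 'b', 'c']])
    y_proba_train = clf.predict_proba(train_df[['a', 'b', 'c']])
    y_proba_test = clf.predict_proba(test_df[['a', 'b', 'c']])

    suite = train_test_validation()
    result = suite.run(
        train_dataset=Dataset(train_df, label='target'),
        test_dataset=Dataset(test_df, label='target'),
        y_pred_train=y_pred_train,
        y_pred_test=y_pred_test,
        y_proba_train=y_proba_train,
        y_proba_test=y_proba_test,
    )
    result.show()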