Data Duplicates#

This notebook provides an overview of how to use and understand the data duplicates check.

Structure:

Why data duplicates?
Load Data
Run the Check
Define a Condition

from datetime import datetime

import pandas as pd

from deepchecks.tabular.datasets.classification.phishing import load_data

Why data duplicates?#

The DataDuplicates check finds multiple instances of identical samples in the Dataset. Duplicate samples increase the weight the model gives to those samples. If the duplicates are intentional (e.g. the result of deliberate oversampling, or a dataset that by its nature contains identical-looking samples), this may be valid. If they are a hidden issue that we do not expect, however, they may indicate a problem in the data pipeline that requires attention.
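
To make "identical samples" and their share of the data concrete, here is a minimal plain-Python sketch that uses only the standard library. It is not how deepchecks implements the check: the helper names `duplicate_ratio` and `most_duplicated` are hypothetical, the rows are toy data, and the ratio is assumed to mean "the fraction of rows that repeat an earlier, identical row".

```python
from collections import Counter


def duplicate_ratio(rows):
    """Fraction of rows that repeat an earlier, identical row (assumed definition)."""
    rows = [tuple(row) for row in rows]
    if not rows:
        return 0.0
    n_unique = len(set(rows))
    return 1 - n_unique / len(rows)


def most_duplicated(rows, n_to_show=5):
    """Return the most frequent rows that occur more than once, with their counts."""
    counts = Counter(tuple(row) for row in rows)
    return [(row, count) for row, count in counts.most_common(n_to_show) if count > 1]


# Toy data (not taken from the phishing dataset): the first row appears three times.
samples = [
    ("com", 94, 0),
    ("net", 102, 0),
    ("com", 94, 0),
    ("com", 94, 0),
]

print(duplicate_ratio(samples))   # 0.5: two of the four rows repeat an earlier row
print(most_duplicated(samples))   # [(('com', 94, 0), 3)]
```

In this toy example a model trained on `samples` would see the `("com", 94, 0)` row three times, which is the extra weight described above.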

Load Data#

phishing_dataset = load_data(as_train_test=False, data_format='DataFrame')
phishing_dataset

	target	month	scrape_date	ext	urlLength	...	specialChars	scriptLength	sbr	bscr	sscr
0	0	1	2019-01-01	net	102	...	9419	23919	0.736286	0.289940	2.539442
1	0	1	2019-01-01	country	154	...	2735	794	0.049015	0.168838	0.290311
2	0	1	2019-01-01	net	171	...	27798	83817	0.811049	0.268985	2.412174
3	0	1	2019-01-01	com	94	...	9087	19427	0.569824	0.266536	2.137889
4	0	1	2019-01-01	other	95	...	39	0	0.000000	0.193069	0.000000
...	...	...	...	...	...	...	...	...	...	...	...
11345	0	1	2020-01-15	country	89	...	971	1866	0.625302	0.213266	2.932029
11346	0	1	2020-01-15	other	107	...	3185	4228	0.291069	0.214348	1.357928
11347	0	1	2020-01-15	com	112	...	0	0	0.000000	0.000000	0.000000
11348	0	1	2020-01-15	html	111	...	0	0	0.000000	0.000000	0.000000
11349	0	1	2020-01-15	html	97	...	25	0	0.000000	0.167785	0.000000

11350 rows × 25 columns

Run the Check#

from deepchecks.tabular.checks import DataDuplicates

DataDuplicates().run(phishing_dataset)

# With Check Parameters
# ---------------------
# The ``DataDuplicates`` check can also be restricted to a specific subset of columns
# via ``columns``, or run on all columns except those listed in ``ignore_columns``:

DataDuplicates(columns=["entropy", "numParams"]).run(phishing_dataset)

Out:

/home/runner/work/deepchecks/deepchecks/deepchecks/tabular/dataset.py:886: UserWarning:

Received a "pandas.DataFrame" instance. It is recommended to pass a "deepchecks.tabular.Dataset" instance by doing "Dataset(dataframe)"

/home/runner/work/deepchecks/deepchecks/deepchecks/tabular/dataset.py:581: UserWarning:

It is recommended to initialize Dataset with categorical features by doing "Dataset(df, cat_features=categorical_list)". No categorical features were passed, therefore heuristically inferring categorical features in the data.
10 categorical features were inferred: target, month, ext, numParams, num_%20, num_@, has_ip... For full list use dataset.cat_features

Data Duplicates

DataDuplicates(ignore_columns=["scrape_date"], n_to_show=10).run(phishing_dataset)
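
The effect of `columns` and `ignore_columns` can be illustrated without deepchecks: rows are compared only on the selected columns, so two rows that differ in an ignored column still count as duplicates. The sketch below is a standard-library illustration under that assumption. `select_columns` and `count_duplicates` are hypothetical helpers, the rows are toy data, and raising an error when both arguments are given is a choice made for this sketch rather than documented library behaviour.

```python
from collections import Counter


def select_columns(row, columns=None, ignore_columns=None):
    """Project a dict row onto the columns that take part in the comparison."""
    if columns is not None and ignore_columns is not None:
        raise ValueError("pass either columns or ignore_columns, not both")
    if columns is not None:
        keys = list(columns)
    elif ignore_columns is not None:
        ignored = set(ignore_columns)
        keys = sorted(k for k in row if k not in ignored)
    else:
        keys = sorted(row)
    return tuple((k, row[k]) for k in keys)


def count_duplicates(rows, columns=None, ignore_columns=None):
    """Count rows that repeat an earlier row on the selected columns."""
    counts = Counter(select_columns(r, columns, ignore_columns) for r in rows)
    return sum(c - 1 for c in counts.values())


# Toy rows: the first two differ only in scrape_date.
rows = [
    {"scrape_date": "2019-01-01", "ext": "com", "urlLength": 94},
    {"scrape_date": "2020-01-15", "ext": "com", "urlLength": 94},
    {"scrape_date": "2020-01-15", "ext": "net", "urlLength": 102},
]

print(count_duplicates(rows))                                  # 0
print(count_duplicates(rows, ignore_columns=["scrape_date"]))  # 1
print(count_duplicates(rows, columns=["ext"]))                 # 1
```

With pandas, the same comparison on a subset of columns can be expressed with `DataFrame.duplicated(subset=...)`.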

Define a Condition#

Now we define a condition that enforces a duplicate ratio of 0. A condition is deepchecks’ way of validating model and data quality, and it lets you know if anything goes wrong.

check = DataDuplicates()
check.add_condition_ratio_not_greater_than(0)
result = check.run(phishing_dataset)
result.show(show_additional_outputs=False)
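
Conceptually, a condition like the one above maps the value computed by the check to a pass/fail outcome. The following standard-library sketch shows that idea only. `ConditionOutcome` and `ratio_not_greater_than` are hypothetical names and are not deepchecks classes or functions, and the message format is made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class ConditionOutcome:
    """Hypothetical container for the result of evaluating a condition."""
    passed: bool
    details: str


def ratio_not_greater_than(duplicate_ratio, max_ratio=0.0):
    """Pass when the duplicate ratio does not exceed max_ratio."""
    passed = duplicate_ratio <= max_ratio
    details = f"Found {duplicate_ratio:.2%} duplicate data (allowed: {max_ratio:.2%})"
    return ConditionOutcome(passed, details)


print(ratio_not_greater_than(0.0))                   # passes: no duplicates at all
print(ratio_not_greater_than(0.02))                  # fails: 2% duplicates, 0% allowed
print(ratio_not_greater_than(0.02, max_ratio=0.05))  # passes: within the 5% allowance
```

A threshold of 0 is the strictest setting: a single duplicated row is enough to fail. A small positive threshold may be more practical when some duplication is expected, for example after intentional oversampling.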
