Data Duplicates#
This notebook provides an overview of using and understanding the data duplicates check:
Structure:
Why data duplicates?
Load Data
Run the Check
Define a Condition
from datetime import datetime
import pandas as pd
from deepchecks.tabular.datasets.classification.phishing import load_data
Why data duplicates?#
The DataDuplicates
check finds multiple instances of identical samples in the
Dataset. Duplicate samples increase the weight the model gives to those samples.
If these duplicates are there intentionally (e.g. as a result of deliberate
oversampling, or because the dataset by its nature contains identical-looking samples),
this may be valid; however, if this is a hidden issue we are not expecting,
it may indicate a problem in the data pipeline that requires attention.
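As a rough intuition only (this is not the check's actual implementation), the duplicate ratio for a plain pandas DataFrame can be approximated with pandas' duplicated, which flags every row that exactly repeats an earlier one:
import pandas as pd

# Toy frame with one exact duplicate row (row 1 repeats row 0)
df = pd.DataFrame({'a': [1, 1, 2], 'b': ['x', 'x', 'y']})
duplicate_ratio = df.duplicated().mean()  # fraction of rows flagged as repeats
print(duplicate_ratio)  # 0.33... here; the check reports a similar ratio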
Load Data#
phishing_dataset = load_data(as_train_test=False, data_format='DataFrame')
phishing_dataset
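Before handing the frame to the check, a quick look at its shape and first rows (plain pandas, nothing deepchecks-specific) can help set expectations:
print(phishing_dataset.shape)
phishing_dataset.head()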
Run the Check#
from deepchecks.tabular.checks import DataDuplicates
DataDuplicates().run(phishing_dataset)
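Besides the rendered display, run() returns a CheckResult whose value attribute holds the computed result programmatically; for this check it is expected to be the ratio of duplicate samples (worth verifying against your deepchecks version):
result = DataDuplicates().run(phishing_dataset)
print(result.value)  # expected: the duplicate ratio found in the data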
With Check Parameters#
The DataDuplicates check can also run on a specific subset of columns (or, alternatively, on all columns except those passed as ignore_columns):
DataDuplicates(columns=["entropy", "numParams"]).run(phishing_dataset)
DataDuplicates(ignore_columns=["scrape_date"], n_to_show=10).run(phishing_dataset)
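For intuition, the column-subset run above corresponds roughly to pandas' duplicated with a subset argument (again, not the check's actual implementation):
# Rough pandas analogue of the columns=["entropy", "numParams"] run above
print(phishing_dataset.duplicated(subset=['entropy', 'numParams']).mean())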
Define a Condition#
Now we define a condition that enforces the duplicate ratio to be 0. A condition is deepchecks' way to validate model and data quality, and it lets you know if anything goes wrong.
check = DataDuplicates()
check.add_condition_ratio_less_or_equal(0)
result = check.run(phishing_dataset)
result.show(show_additional_outputs=False)
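To act on the outcome in code (for example, failing a CI step), the CheckResult can be queried for its condition results; passed_conditions() is the relevant method in recent deepchecks versions (verify against the version you use):
# Fail fast if the zero-duplicates condition did not pass
if not result.passed_conditions():
    print('Duplicate ratio condition failed - inspect the data pipeline')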