Data Duplicates#

This notebook provides an overview of how to use and understand the data duplicates check.

Structure:

- Why data duplicates?
- Load Data
- Run the Check
- Define a Condition

from datetime import datetime

import pandas as pd

from deepchecks.tabular.datasets.classification.phishing import load_data

Why data duplicates?#

The DataDuplicates check finds multiple instances of identical samples in the Dataset. Duplicate samples increase the weight the model gives to those samples. If the duplicates are intentional (e.g. as a result of deliberate oversampling, or because the dataset naturally contains identical-looking samples), this may be valid. If they are unexpected, however, they may indicate a problem in the data pipeline that requires attention.
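
To make the idea concrete, here is a minimal sketch in plain pandas of how a duplicate ratio can be computed. The toy DataFrame and the `duplicate_ratio` helper are illustrative assumptions and are not part of deepchecks; the check's internal computation may differ in detail.

```python
import pandas as pd


def duplicate_ratio(df: pd.DataFrame) -> float:
    """Fraction of rows that repeat an earlier row (hypothetical helper)."""
    if len(df) == 0:
        return 0.0
    # duplicated() marks every occurrence after the first one as True.
    return float(df.duplicated().mean())


# Toy data, made up for illustration: rows 0, 2 and 4 are identical.
toy = pd.DataFrame(
    {
        "urlLength": [102, 154, 102, 94, 102],
        "numDigits": [8, 60, 8, 10, 8],
    }
)

ratio = duplicate_ratio(toy)  # 2 of the 5 rows repeat an earlier row

# Group identical rows to see how often each distinct sample occurs.
counts = toy.groupby(list(toy.columns)).size().sort_values(ascending=False)
```

Here `ratio` counts only the repeats, so a sample that appears three times contributes two duplicate rows.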

Load Data#

phishing_dataset = load_data(as_train_test=False, data_format='DataFrame')
phishing_dataset
target month scrape_date ext urlLength numDigits numParams num_%20 num_@ entropy has_ip hasHttp hasHttps urlIsLive dsr dse bodyLength numTitles numImages numLinks specialChars scriptLength sbr bscr sscr
0 0 1 2019-01-01 net 102 8 0 0 0 -4.384032 0 True False False 4921 191 32486 3 5 330 9419 23919 0.736286 0.289940 2.539442
1 0 1 2019-01-01 country 154 60 0 2 0 -3.566515 0 True False False 0 0 16199 0 4 39 2735 794 0.049015 0.168838 0.290311
2 0 1 2019-01-01 net 171 5 11 0 0 -4.608755 0 True False False 5374 104 103344 18 9 302 27798 83817 0.811049 0.268985 2.412174
3 0 1 2019-01-01 com 94 10 0 0 0 -4.548921 0 True False False 6107 466 34093 11 43 199 9087 19427 0.569824 0.266536 2.137889
4 0 1 2019-01-01 other 95 11 0 0 0 -4.717188 0 True False False 3819 928 202 1 0 0 39 0 0.000000 0.193069 0.000000
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
11345 0 1 2020-01-15 country 89 7 0 0 0 -4.254491 0 True False False 0 0 4117 5 0 1 971 1866 0.625302 0.213266 2.932029
11346 0 1 2020-01-15 other 107 13 0 0 0 -4.758879 0 True False False 9073 1882 17788 47 58 645 3185 4228 0.291069 0.214348 1.357928
11347 0 1 2020-01-15 com 112 10 0 0 0 -4.723014 0 True False False 2640 1011 0 0 0 0 0 0 0.000000 0.000000 0.000000
11348 0 1 2020-01-15 html 111 3 0 0 0 -4.289384 0 True False False 2291 265 0 0 0 0 0 0 0.000000 0.000000 0.000000
11349 0 1 2020-01-15 html 97 0 0 0 0 -4.304523 0 True False False 6273 298 149 1 0 0 25 0 0.000000 0.167785 0.000000

11350 rows × 25 columns



Run the Check#

from deepchecks.tabular.checks import DataDuplicates

DataDuplicates().run(phishing_dataset)

# With check parameters:
# The DataDuplicates check can also run on a specific subset of columns (columns),
# or on all columns except those listed in ignore_columns.

DataDuplicates(columns=["entropy", "numParams"]).run(phishing_dataset)
DataDuplicates(ignore_columns=["scrape_date"], n_to_show=10).run(phishing_dataset)
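
The effect of restricting or ignoring columns can be sketched with plain pandas. The toy DataFrame below is made up for illustration (only the column names are borrowed from the phishing dataset), and the pandas calls are an analogy to the `columns` and `ignore_columns` parameters, not the check's actual implementation.

```python
import pandas as pd

# Toy data, made up for illustration.
toy = pd.DataFrame(
    {
        "scrape_date": ["2019-01-01", "2019-01-02", "2019-01-03"],
        "entropy": [-4.38, -4.38, -3.57],
        "numParams": [0, 0, 0],
    }
)

# Comparing all columns finds no duplicates, because scrape_date differs.
dup_all = int(toy.duplicated().sum())

# Compare only a subset of columns, similar in spirit to columns=[...].
dup_subset = int(toy.duplicated(subset=["entropy", "numParams"]).sum())

# Compare all columns except some, similar in spirit to ignore_columns=[...].
dup_ignoring = int(toy.drop(columns=["scrape_date"]).duplicated().sum())
```

Columns such as timestamps or identifiers tend to make every row unique, which is why excluding them can reveal duplicates that a full-row comparison misses.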

Define a Condition#

Now we define a condition that requires the ratio of duplicates to be 0. A condition is deepchecks’ way of validating model and data quality, and it lets you know if anything goes wrong.

check = DataDuplicates()
check.add_condition_ratio_less_or_equal(0)
result = check.run(phishing_dataset)
result.show(show_additional_outputs=False)
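
Conceptually, a condition like this compares the measured duplicate ratio against a threshold. The following is a rough sketch under that assumption; `passes_duplicate_condition` is a hypothetical helper written in plain pandas and does not reproduce how deepchecks evaluates conditions.

```python
import pandas as pd


def passes_duplicate_condition(df: pd.DataFrame, max_ratio: float = 0.0) -> bool:
    """Hypothetical helper: True when the duplicate ratio is at most max_ratio."""
    ratio = float(df.duplicated().mean()) if len(df) else 0.0
    return ratio <= max_ratio


# Toy data, made up for illustration.
clean = pd.DataFrame({"a": [1, 2, 3]})
dirty = pd.DataFrame({"a": [1, 1, 3]})  # one of three rows is a repeat

clean_ok = passes_duplicate_condition(clean)
dirty_ok = passes_duplicate_condition(dirty)
dirty_ok_lenient = passes_duplicate_condition(dirty, max_ratio=0.5)
```

A threshold of 0 fails on any duplicate at all; a looser threshold may suit datasets where some repetition is expected.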
