.. DO NOT EDIT.
.. THIS FILE WAS AUTOMATICALLY GENERATED BY SPHINX-GALLERY.
.. TO MAKE CHANGES, EDIT THE SOURCE PYTHON FILE:
.. "tabular/auto_checks/data_integrity/plot_data_duplicates.py"
.. LINE NUMBERS ARE GIVEN BELOW.

.. only:: html

    .. note::
        :class: sphx-glr-download-link-note

        :ref:`Go to the end <sphx_glr_download_tabular_auto_checks_data_integrity_plot_data_duplicates.py>`
        to download the full example code

.. rst-class:: sphx-glr-example-title

.. _sphx_glr_tabular_auto_checks_data_integrity_plot_data_duplicates.py:

.. _tabular__data_duplicates:

Data Duplicates
***************

This notebook provides an overview of using and understanding the data duplicates check:

**Structure:**

* `Why data duplicates? <#why-data-duplicates>`__
* `Load Data <#load-data>`__
* `Run the Check <#run-the-check>`__
* `Define a Condition <#define-a-condition>`__

.. GENERATED FROM PYTHON SOURCE LINES 19-26

.. code-block:: default

    from datetime import datetime

    import pandas as pd

    from deepchecks.tabular.datasets.classification.phishing import load_data

.. GENERATED FROM PYTHON SOURCE LINES 27-38

Why data duplicates?
====================

The ``DataDuplicates`` check finds multiple instances of identical samples in the
Dataset. Duplicate samples increase the weight the model gives to those samples.
If these duplicates are there intentionally (e.g. as a result of deliberate
oversampling, or because the dataset naturally contains identical-looking samples),
this may be valid; however, if this is a hidden issue we're not expecting to occur,
it may be an indicator of a problem in the data pipeline that requires attention.

Load Data
=========

.. GENERATED FROM PYTHON SOURCE LINES 38-43

.. code-block:: default

    phishing_dataset = load_data(as_train_test=False, data_format='DataFrame')
    phishing_dataset
.. rst-class:: sphx-glr-script-out

 .. code-block:: none

           target  month  scrape_date      ext  urlLength  numDigits  numParams  num_%20  num_@    entropy  has_ip  hasHttp  hasHttps  urlIsLive   dsr   dse  bodyLength  numTitles  numImages  numLinks  specialChars  scriptLength       sbr      bscr      sscr
    0           0      1   2019-01-01      net        102          8          0        0      0  -4.384032       0     True     False      False  4921   191       32486          3          5       330          9419         23919  0.736286  0.289940  2.539442
    1           0      1   2019-01-01  country        154         60          0        2      0  -3.566515       0     True     False      False     0     0       16199          0          4        39          2735           794  0.049015  0.168838  0.290311
    2           0      1   2019-01-01      net        171          5         11        0      0  -4.608755       0     True     False      False  5374   104      103344         18          9       302         27798         83817  0.811049  0.268985  2.412174
    3           0      1   2019-01-01      com         94         10          0        0      0  -4.548921       0     True     False      False  6107   466       34093         11         43       199          9087         19427  0.569824  0.266536  2.137889
    4           0      1   2019-01-01    other         95         11          0        0      0  -4.717188       0     True     False      False  3819   928         202          1          0         0            39             0  0.000000  0.193069  0.000000
    ...       ...    ...          ...      ...        ...        ...        ...      ...    ...        ...     ...      ...       ...        ...   ...   ...         ...        ...        ...       ...           ...           ...       ...       ...       ...
    11345       0      1   2020-01-15  country         89          7          0        0      0  -4.254491       0     True     False      False     0     0        4117          5          0         1           971          1866  0.625302  0.213266  2.932029
    11346       0      1   2020-01-15    other        107         13          0        0      0  -4.758879       0     True     False      False  9073  1882       17788         47         58       645          3185          4228  0.291069  0.214348  1.357928
    11347       0      1   2020-01-15      com        112         10          0        0      0  -4.723014       0     True     False      False  2640  1011           0          0          0         0             0             0  0.000000  0.000000  0.000000
    11348       0      1   2020-01-15     html        111          3          0        0      0  -4.289384       0     True     False      False  2291   265           0          0          0         0             0             0  0.000000  0.000000  0.000000
    11349       0      1   2020-01-15     html         97          0          0        0      0  -4.304523       0     True     False      False  6273   298         149          1          0         0            25             0  0.000000  0.167785  0.000000

    [11350 rows x 25 columns]
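Before running the check, you can get a quick sense of exact duplicates with plain
pandas. This is a minimal sketch (not part of the check's API, using only the
DataFrame loaded above) that counts fully identical rows:

.. code-block:: default

    # Mark every row that has at least one exact duplicate; keep=False flags
    # all members of each duplicate group, not just later occurrences.
    dup_mask = phishing_dataset.duplicated(keep=False)
    print(f'{dup_mask.sum()} rows belong to duplicate groups '
          f'({dup_mask.mean():.2%} of the data)')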
.. GENERATED FROM PYTHON SOURCE LINES 44-46

Run the Check
=============

.. GENERATED FROM PYTHON SOURCE LINES 46-58

.. code-block:: default

    from deepchecks.tabular.checks import DataDuplicates

    DataDuplicates().run(phishing_dataset)

    # With Check Parameters
    # ---------------------
    # The ``DataDuplicates`` check can also check duplication over a specific subset
    # of columns (or, alternatively, over all columns except those listed in
    # ``ignore_columns``):

    DataDuplicates(columns=["entropy", "numParams"]).run(phishing_dataset)
.. Output: "Data Duplicates" check result (interactive HTML display omitted)
.. GENERATED FROM PYTHON SOURCE LINES 59-62

.. code-block:: default

    DataDuplicates(ignore_columns=["scrape_date"], n_to_show=10).run(phishing_dataset)
.. Output: "Data Duplicates" check result (interactive HTML display omitted)
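Besides the rendered display, the returned ``CheckResult`` can be consumed
programmatically. A hedged sketch, assuming ``result.value`` holds the duplicate
ratio for this check (as the ratio-based condition in the next section suggests):

.. code-block:: default

    result = DataDuplicates(ignore_columns=["scrape_date"]).run(phishing_dataset)
    # Assumption: for DataDuplicates, `result.value` is the fraction of
    # duplicate samples found in the dataset.
    print(f'duplicate ratio: {result.value:.4f}')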
.. GENERATED FROM PYTHON SOURCE LINES 63-68

Define a Condition
==================

Now we define a condition that enforces the ratio of duplicates to be 0. A
condition is deepchecks' way to validate model and data quality, and it lets
you know if anything goes wrong.

.. GENERATED FROM PYTHON SOURCE LINES 68-73

.. code-block:: default

    check = DataDuplicates()
    check.add_condition_ratio_less_or_equal(0)
    result = check.run(phishing_dataset)
    result.show(show_additional_outputs=False)
.. Output: "Data Duplicates" check result (interactive HTML display omitted)
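In an automated pipeline you typically want a pass/fail signal rather than a
rendered report. A minimal sketch, assuming ``CheckResult.passed_conditions()``
reports whether all added conditions passed, and using an illustrative 5%
tolerance instead of the strict 0 above:

.. code-block:: default

    check = DataDuplicates()
    # Illustrative threshold: tolerate up to 5% duplicates.
    check.add_condition_ratio_less_or_equal(0.05)
    result = check.run(phishing_dataset)
    # Fail loudly (e.g. in CI) when the condition does not hold.
    if not result.passed_conditions():
        raise ValueError('DataDuplicates condition failed')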
.. rst-class:: sphx-glr-timing

**Total running time of the script:** (0 minutes 2.108 seconds)

.. _sphx_glr_download_tabular_auto_checks_data_integrity_plot_data_duplicates.py:

.. only:: html

    .. container:: sphx-glr-footer sphx-glr-footer-example

        .. container:: sphx-glr-download sphx-glr-download-python

            :download:`Download Python source code: plot_data_duplicates.py <plot_data_duplicates.py>`

        .. container:: sphx-glr-download sphx-glr-download-jupyter

            :download:`Download Jupyter notebook: plot_data_duplicates.ipynb <plot_data_duplicates.ipynb>`

.. only:: html

    .. rst-class:: sphx-glr-signature

        `Gallery generated by Sphinx-Gallery <https://sphinx-gallery.github.io>`_