.. DO NOT EDIT.
.. THIS FILE WAS AUTOMATICALLY GENERATED BY SPHINX-GALLERY.
.. TO MAKE CHANGES, EDIT THE SOURCE PYTHON FILE:
.. "tabular/auto_checks/data_integrity/plot_data_duplicates.py"
.. LINE NUMBERS ARE GIVEN BELOW.

.. only:: html

    .. note::
        :class: sphx-glr-download-link-note

        :ref:`Go to the end <sphx_glr_download_tabular_auto_checks_data_integrity_plot_data_duplicates.py>`
        to download the full example code

.. rst-class:: sphx-glr-example-title

.. _sphx_glr_tabular_auto_checks_data_integrity_plot_data_duplicates.py:


.. _tabular__data_duplicates:

Data Duplicates
***************

This notebook provides an overview of how to use and understand the data duplicates check.

**Structure:**

* `Why data duplicates? <#why-data-duplicates>`__
* `Load Data <#load-data>`__
* `Run the Check <#run-the-check>`__
* `Define a Condition <#define-a-condition>`__

.. GENERATED FROM PYTHON SOURCE LINES 19-26

.. code-block:: default


    from datetime import datetime

    import pandas as pd

    from deepchecks.tabular.datasets.classification.phishing import load_data


.. GENERATED FROM PYTHON SOURCE LINES 27-38

Why data duplicates?
====================

The ``DataDuplicates`` check finds multiple instances of identical samples in the
Dataset. Duplicate samples increase the weight the model gives to those samples.
If the duplicates are intentional (e.g. the result of deliberate oversampling, or
because the dataset naturally contains identical-looking samples), this may be valid.
If they are a hidden issue that we do not expect, however, they may indicate a
problem in the data pipeline that requires attention.

Load Data
=========

.. GENERATED FROM PYTHON SOURCE LINES 38-43

.. code-block:: default


    phishing_dataset = load_data(as_train_test=False, data_format='DataFrame')
    phishing_dataset

.. csv-table::
    :header: "", "target", "month", "scrape_date", "ext", "urlLength", "numDigits", "numParams", "num_%20", "num_@", "entropy", "has_ip", "hasHttp", "hasHttps", "urlIsLive", "dsr", "dse", "bodyLength", "numTitles", "numImages", "numLinks", "specialChars", "scriptLength", "sbr", "bscr", "sscr"

    0, 0, 1, 2019-01-01, net, 102, 8, 0, 0, 0, -4.384032, 0, True, False, False, 4921, 191, 32486, 3, 5, 330, 9419, 23919, 0.736286, 0.289940, 2.539442
    1, 0, 1, 2019-01-01, country, 154, 60, 0, 2, 0, -3.566515, 0, True, False, False, 0, 0, 16199, 0, 4, 39, 2735, 794, 0.049015, 0.168838, 0.290311
    2, 0, 1, 2019-01-01, net, 171, 5, 11, 0, 0, -4.608755, 0, True, False, False, 5374, 104, 103344, 18, 9, 302, 27798, 83817, 0.811049, 0.268985, 2.412174
    3, 0, 1, 2019-01-01, com, 94, 10, 0, 0, 0, -4.548921, 0, True, False, False, 6107, 466, 34093, 11, 43, 199, 9087, 19427, 0.569824, 0.266536, 2.137889
    4, 0, 1, 2019-01-01, other, 95, 11, 0, 0, 0, -4.717188, 0, True, False, False, 3819, 928, 202, 1, 0, 0, 39, 0, 0.000000, 0.193069, 0.000000
    ..., ..., ..., ..., ..., ..., ..., ..., ..., ..., ..., ..., ..., ..., ..., ..., ..., ..., ..., ..., ..., ..., ..., ..., ..., ...
    11345, 0, 1, 2020-01-15, country, 89, 7, 0, 0, 0, -4.254491, 0, True, False, False, 0, 0, 4117, 5, 0, 1, 971, 1866, 0.625302, 0.213266, 2.932029
    11346, 0, 1, 2020-01-15, other, 107, 13, 0, 0, 0, -4.758879, 0, True, False, False, 9073, 1882, 17788, 47, 58, 645, 3185, 4228, 0.291069, 0.214348, 1.357928
    11347, 0, 1, 2020-01-15, com, 112, 10, 0, 0, 0, -4.723014, 0, True, False, False, 2640, 1011, 0, 0, 0, 0, 0, 0, 0.000000, 0.000000, 0.000000
    11348, 0, 1, 2020-01-15, html, 111, 3, 0, 0, 0, -4.289384, 0, True, False, False, 2291, 265, 0, 0, 0, 0, 0, 0, 0.000000, 0.000000, 0.000000
    11349, 0, 1, 2020-01-15, html, 97, 0, 0, 0, 0, -4.304523, 0, True, False, False, 6273, 298, 149, 1, 0, 0, 25, 0, 0.000000, 0.167785, 0.000000

11350 rows × 25 columns
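
To make the notion of a duplicate sample concrete, the following is a minimal pandas
sketch, independent of deepchecks, that computes the fraction of rows which exactly
repeat an earlier row. The toy DataFrame and the helper name ``duplicate_ratio`` are
illustrative assumptions; the check's own definition of the ratio may differ in detail.

.. code-block:: python

    import pandas as pd


    def duplicate_ratio(df: pd.DataFrame) -> float:
        """Return the fraction of rows that exactly repeat an earlier row."""
        if len(df) == 0:
            return 0.0
        # duplicated() marks every occurrence after the first one as True.
        return float(df.duplicated().sum()) / len(df)


    # Toy data (not the phishing dataset): rows 2 and 4 repeat row 0.
    toy = pd.DataFrame(
        {
            "urlLength": [102, 154, 102, 94, 102],
            "numDigits": [8, 60, 8, 10, 8],
        }
    )
    ratio = duplicate_ratio(toy)
    print(f"{ratio:.2f}")

The same helper could be applied to ``phishing_dataset`` for a quick sanity check
before running the full check.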



.. GENERATED FROM PYTHON SOURCE LINES 44-46

Run the Check
=============

.. GENERATED FROM PYTHON SOURCE LINES 46-58

.. code-block:: default


    from deepchecks.tabular.checks import DataDuplicates

    DataDuplicates().run(phishing_dataset)

    # With Check Parameters
    # ---------------------
    # The ``DataDuplicates`` check can also be restricted to a specific subset of
    # columns, or alternatively use all columns except those in ``ignore_columns``:

    DataDuplicates(columns=["entropy", "numParams"]).run(phishing_dataset)


.. GENERATED FROM PYTHON SOURCE LINES 59-62

.. code-block:: default


    DataDuplicates(ignore_columns=["scrape_date"], n_to_show=10).run(phishing_dataset)
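
The effect of choosing which columns take part in the comparison can be sketched
with plain pandas. This is only an analogy to the ``columns`` and ``ignore_columns``
parameters used above, not the check's actual implementation, and the toy values
are made up for illustration.

.. code-block:: python

    import pandas as pd

    # Toy data: the first two rows differ only in scrape_date.
    toy = pd.DataFrame(
        {
            "scrape_date": ["2019-01-01", "2019-01-02", "2019-01-03"],
            "entropy": [-4.38, -4.38, -3.57],
            "numParams": [0, 0, 2],
        }
    )

    # Comparing every column finds no repeats, because scrape_date always differs.
    all_columns = int(toy.duplicated().sum())

    # Comparing only chosen columns (analogous to ``columns``).
    subset = int(toy.duplicated(subset=["entropy", "numParams"]).sum())

    # Comparing all columns except some (analogous to ``ignore_columns``).
    kept = [c for c in toy.columns if c not in {"scrape_date"}]
    ignoring = int(toy.duplicated(subset=kept).sum())

    print(all_columns, subset, ignoring)

Ignoring a column such as a scrape timestamp is useful when it is expected to differ
between otherwise identical samples.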


.. GENERATED FROM PYTHON SOURCE LINES 63-68

Define a Condition
==================

Now we define a condition that enforces the ratio of duplicates to be 0. A condition
is deepchecks' way of validating model and data quality, and it lets you know if
anything goes wrong.

.. GENERATED FROM PYTHON SOURCE LINES 68-73

.. code-block:: default


    check = DataDuplicates()
    check.add_condition_ratio_less_or_equal(0)
    result = check.run(phishing_dataset)
    result.show(show_additional_outputs=False)
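
Conceptually, a condition of this kind compares the measured duplicate ratio against
a threshold. The sketch below mimics that idea with plain pandas; the helper
``ratio_within_limit`` is hypothetical and is not part of the deepchecks API.

.. code-block:: python

    import pandas as pd


    def ratio_within_limit(df: pd.DataFrame, max_ratio: float) -> bool:
        """Hypothetical helper: True when the duplicate ratio is at most max_ratio."""
        ratio = float(df.duplicated().sum()) / len(df) if len(df) else 0.0
        return ratio <= max_ratio


    # Toy data: one of four rows repeats an earlier row, so the ratio is 0.25.
    toy = pd.DataFrame({"a": [1, 1, 2, 3], "b": ["x", "x", "y", "z"]})

    strict = ratio_within_limit(toy, 0)       # fails: 0.25 > 0
    lenient = ratio_within_limit(toy, 0.25)   # passes: 0.25 <= 0.25
    print(strict, lenient)

A threshold of 0, as used in the condition above, tolerates no duplicates at all;
a looser threshold may suit datasets where some repetition is expected.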


.. rst-class:: sphx-glr-timing

   **Total running time of the script:** ( 0 minutes 3.280 seconds)


.. _sphx_glr_download_tabular_auto_checks_data_integrity_plot_data_duplicates.py:

.. only:: html

  .. container:: sphx-glr-footer sphx-glr-footer-example

    .. container:: sphx-glr-download sphx-glr-download-python

      :download:`Download Python source code: plot_data_duplicates.py <plot_data_duplicates.py>`

    .. container:: sphx-glr-download sphx-glr-download-jupyter

      :download:`Download Jupyter notebook: plot_data_duplicates.ipynb <plot_data_duplicates.ipynb>`


.. only:: html

 .. rst-class:: sphx-glr-signature

    `Gallery generated by Sphinx-Gallery <https://sphinx-gallery.github.io>`_