.. DO NOT EDIT.
.. THIS FILE WAS AUTOMATICALLY GENERATED BY SPHINX-GALLERY.
.. TO MAKE CHANGES, EDIT THE SOURCE PYTHON FILE:
.. "checks_gallery/tabular/data_integrity/plot_data_duplicates.py"
.. LINE NUMBERS ARE GIVEN BELOW.

.. only:: html

    .. note::
        :class: sphx-glr-download-link-note

        Click :ref:`here ` to download the full example code

.. rst-class:: sphx-glr-example-title

.. _sphx_glr_checks_gallery_tabular_data_integrity_plot_data_duplicates.py:

.. _plot_tabular_data_duplicates:

Data Duplicates
***************

This notebook provides an overview of how to use and understand the data duplicates check.

**Structure:**

* `Why data duplicates? <#why-data-duplicates>`__
* `Load Data <#load-data>`__
* `Run the Check <#run-the-check>`__
* `Define a Condition <#define-a-condition>`__

.. GENERATED FROM PYTHON SOURCE LINES 19-26

.. code-block:: default

    from datetime import datetime

    import pandas as pd

    from deepchecks.tabular.datasets.classification.phishing import load_data

.. GENERATED FROM PYTHON SOURCE LINES 27-38

Why data duplicates?
====================

The ``DataDuplicates`` check finds multiple instances of identical samples in the
Dataset. Duplicate samples increase the weight the model gives to those samples.
If the duplicates are intentional (e.g. the result of deliberate oversampling, or
a dataset that naturally contains identical-looking samples), this may be valid.
If they are a hidden issue that we do not expect, they may indicate a problem in
the data pipeline that requires attention.

Load Data
=========

.. GENERATED FROM PYTHON SOURCE LINES 38-43

.. code-block:: default

    phishing_dataset = load_data(as_train_test=False, data_format='DataFrame')
    phishing_dataset

.. raw:: html

           target  month  scrape_date      ext  urlLength  ...  specialChars  scriptLength       sbr      bscr      sscr
    0           0      1   2019-01-01      net        102  ...          9419         23919  0.736286  0.289940  2.539442
    1           0      1   2019-01-01  country        154  ...          2735           794  0.049015  0.168838  0.290311
    2           0      1   2019-01-01      net        171  ...         27798         83817  0.811049  0.268985  2.412174
    3           0      1   2019-01-01      com         94  ...          9087         19427  0.569824  0.266536  2.137889
    4           0      1   2019-01-01    other         95  ...            39             0  0.000000  0.193069  0.000000
    ...       ...    ...          ...      ...        ...  ...           ...           ...       ...       ...       ...
    11345       0      1   2020-01-15  country         89  ...           971          1866  0.625302  0.213266  2.932029
    11346       0      1   2020-01-15    other        107  ...          3185          4228  0.291069  0.214348  1.357928
    11347       0      1   2020-01-15      com        112  ...             0             0  0.000000  0.000000  0.000000
    11348       0      1   2020-01-15     html        111  ...             0             0  0.000000  0.000000  0.000000
    11349       0      1   2020-01-15     html         97  ...            25             0  0.000000  0.167785  0.000000

11350 rows × 25 columns
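
To make the idea from *Why data duplicates?* concrete, here is a minimal, library-free sketch of how a duplicate ratio could be computed. It assumes that the ratio is the share of rows that are extra copies of an identical row; the text above does not spell out the exact formula, so treat this as an illustration and not as the check's implementation. The helper ``duplicate_ratio`` and the sample rows are hypothetical.

```python
from collections import Counter


def duplicate_ratio(rows):
    """Share of rows that are extra copies of an identical row.

    Assumption for illustration: ratio = 1 - (unique rows / total rows).
    """
    rows = [tuple(r) for r in rows]
    if not rows:
        return 0.0
    return 1 - len(set(rows)) / len(rows)


# Hypothetical toy samples: (ext, urlLength, target)
samples = [
    ("net", 102, 0),
    ("com", 94, 0),
    ("net", 102, 0),  # exact copy of the first row
    ("net", 102, 0),  # another exact copy
]

# How often each distinct sample occurs. A sample that appears three
# times contributes three times to any per-sample training loss.
counts = Counter(samples)
ratio = duplicate_ratio(samples)

print(counts.most_common(1))
print(ratio)
```

With four rows and two distinct samples, half of the rows are redundant copies, which is the kind of imbalance the check is meant to surface.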



.. GENERATED FROM PYTHON SOURCE LINES 44-46

Run the Check
=============

.. GENERATED FROM PYTHON SOURCE LINES 46-58

.. code-block:: default

    from deepchecks.tabular.checks import DataDuplicates

    DataDuplicates().run(phishing_dataset)

    # With Check Parameters
    # ---------------------
    # The ``DataDuplicates`` check can also look for duplicates in a specific subset of
    # columns (or, alternatively, in all columns except those listed in ``ignore_columns``):

    DataDuplicates(columns=["entropy", "numParams"]).run(phishing_dataset)

.. raw:: html

Data Duplicates


.. GENERATED FROM PYTHON SOURCE LINES 59-62

.. code-block:: default

    DataDuplicates(ignore_columns=["scrape_date"], n_to_show=10).run(phishing_dataset)

.. raw:: html

Data Duplicates
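
The effect of the ``columns`` and ``ignore_columns`` parameters used above can be sketched without any library: rows are compared only on the selected columns, so two rows that differ solely in an ignored column count as duplicates. The helpers ``select_columns`` and ``count_duplicates`` and the toy rows are hypothetical, and rejecting both parameters at once is an assumption of this sketch, not documented behavior of the check.

```python
def select_columns(row, columns=None, ignore_columns=None):
    """Build a hashable key from the columns that take part in the comparison."""
    if columns is not None and ignore_columns is not None:
        # Assumption of this sketch: the two options are mutually exclusive.
        raise ValueError("pass either columns or ignore_columns, not both")
    if columns is not None:
        keys = [k for k in row if k in columns]
    elif ignore_columns is not None:
        keys = [k for k in row if k not in ignore_columns]
    else:
        keys = list(row)
    return tuple((k, row[k]) for k in keys)


def count_duplicates(rows, columns=None, ignore_columns=None):
    """Count rows whose selected columns repeat an earlier row."""
    seen = set()
    duplicates = 0
    for row in rows:
        key = select_columns(row, columns, ignore_columns)
        if key in seen:
            duplicates += 1
        else:
            seen.add(key)
    return duplicates


# Hypothetical toy rows; the first two differ only in scrape_date.
rows = [
    {"scrape_date": "2019-01-01", "ext": "net", "urlLength": 102},
    {"scrape_date": "2020-01-15", "ext": "net", "urlLength": 102},
    {"scrape_date": "2020-01-15", "ext": "com", "urlLength": 94},
]

all_cols = count_duplicates(rows)
no_date = count_duplicates(rows, ignore_columns=["scrape_date"])
ext_only = count_duplicates(rows, columns=["ext"])
```

Comparing all columns finds no duplicates here, while ignoring ``scrape_date`` or restricting the comparison to ``ext`` each reveals one.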


.. GENERATED FROM PYTHON SOURCE LINES 63-68

Define a Condition
==================

Now we define a condition that enforces a duplicate ratio of 0. A condition is
deepchecks' way to validate model and data quality, and it lets you know if
anything goes wrong.

.. GENERATED FROM PYTHON SOURCE LINES 68-73

.. code-block:: default

    check = DataDuplicates()
    check.add_condition_ratio_less_or_equal(0)
    result = check.run(phishing_dataset)
    result.show(show_additional_outputs=False)

.. raw:: html

Data Duplicates
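
Conceptually, a "ratio less or equal" condition is a threshold comparison on the value the check reports. The sketch below shows that logic on made-up counts. The function ``ratio_condition_passes`` is hypothetical and the ratio formula is an assumption; this is not the deepchecks implementation.

```python
def ratio_condition_passes(n_samples, n_unique, max_ratio=0.0):
    """Return (passed, ratio) for a 'duplicate ratio <= max_ratio' rule.

    Assumption for illustration: ratio = 1 - (unique rows / total rows).
    """
    ratio = 0.0 if n_samples == 0 else 1 - n_unique / n_samples
    return ratio <= max_ratio, ratio


# Made-up counts: 4 rows, of which 2 are distinct.
strict = ratio_condition_passes(4, 2, max_ratio=0.0)
lenient = ratio_condition_passes(4, 2, max_ratio=0.5)
clean = ratio_condition_passes(4, 4, max_ratio=0.0)
```

A threshold of 0, as in the example above, tolerates no duplicates at all; a looser threshold may be more practical when some duplication is expected, for instance after deliberate oversampling.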


.. rst-class:: sphx-glr-timing

   **Total running time of the script:** ( 0 minutes 2.610 seconds)

.. _sphx_glr_download_checks_gallery_tabular_data_integrity_plot_data_duplicates.py:

.. only:: html

  .. container:: sphx-glr-footer sphx-glr-footer-example

    .. container:: sphx-glr-download sphx-glr-download-python

      :download:`Download Python source code: plot_data_duplicates.py `

    .. container:: sphx-glr-download sphx-glr-download-jupyter

      :download:`Download Jupyter notebook: plot_data_duplicates.ipynb `

.. only:: html

  .. rst-class:: sphx-glr-signature

    `Gallery generated by Sphinx-Gallery `_