.. DO NOT EDIT.
.. THIS FILE WAS AUTOMATICALLY GENERATED BY SPHINX-GALLERY.
.. TO MAKE CHANGES, EDIT THE SOURCE PYTHON FILE:
.. "user-guide/tabular/auto_quickstarts/plot_quick_data_integrity.py"
.. LINE NUMBERS ARE GIVEN BELOW.

.. only:: html

    .. note::
        :class: sphx-glr-download-link-note

        :ref:`Go to the end <sphx_glr_download_user-guide_tabular_auto_quickstarts_plot_quick_data_integrity.py>`
        to download the full example code

.. rst-class:: sphx-glr-example-title

.. _sphx_glr_user-guide_tabular_auto_quickstarts_plot_quick_data_integrity.py:


.. _quick_data_integrity:

Quickstart - Data Integrity Suite
*********************************

The deepchecks integrity suite is relevant any time you have data that you wish to validate,
whether it is a fresh batch of data or a dataset you are about to split or use for training.
Here we use the avocado prices dataset (:mod:`deepchecks.tabular.datasets.regression.avocado`)
to demonstrate how to run the suite with only a few lines of code
and to see what kinds of insights it can find.

.. code-block:: bash

    # Before we start, if you don't have deepchecks installed yet, run:
    import sys
    !{sys.executable} -m pip install deepchecks -U --quiet

    # or install using pip from your python environment

.. GENERATED FROM PYTHON SOURCE LINES 24-26

Load and Prepare Data
====================================================

.. GENERATED FROM PYTHON SOURCE LINES 26-32

.. code-block:: default


    from deepchecks.tabular import datasets

    # load data
    data = datasets.regression.avocado.load_data(data_format='DataFrame', as_train_test=False)

.. GENERATED FROM PYTHON SOURCE LINES 33-34

Insert a few typical problems into the dataset for demonstration.

.. GENERATED FROM PYTHON SOURCE LINES 34-50

.. code-block:: default


    import pandas as pd

    def add_dirty_data(df):
        # change strings
        df.loc[df[df['type'] == 'organic'].sample(frac=0.18).index,'type'] = 'Organic'
        df.loc[df[df['type'] == 'organic'].sample(frac=0.01).index,'type'] = 'ORGANIC'
        # add duplicates
        df = pd.concat([df, df.sample(frac=0.156)], axis=0, ignore_index=True)
        # add column with single value
        df['Is Ripe'] = True
        return df

    dirty_df = add_dirty_data(data)

.. GENERATED FROM PYTHON SOURCE LINES 51-60

Run Deepchecks for Data Integrity
====================================

Create a Dataset Object
------------------------

Create a deepchecks Dataset, including the relevant metadata (label, date, index, etc.).
Check out :class:`deepchecks.tabular.Dataset` to see all of the columns and types
that can be declared.

.. GENERATED FROM PYTHON SOURCE LINES 60-70

.. code-block:: default


    from deepchecks.tabular import Dataset

    # Categorical features can be heuristically inferred, however we
    # recommend stating them explicitly to avoid misclassification.

    # Metadata attributes are optional. Some checks will run only if specific attributes are declared.

    ds = Dataset(dirty_df, cat_features=['type'], datetime_name='Date', label='AveragePrice')

.. GENERATED FROM PYTHON SOURCE LINES 71-80

Run the Deepchecks Suite
--------------------------

Validate your data with the :func:`deepchecks.tabular.suites.data_integrity` suite.
It runs on a single dataset, so you can run it on any batch of data
(e.g. train data, test data, or a new batch of data that recently arrived).

Check out the :doc:`when should you use ` deepchecks guide for more info
about the existing suites and when to use them.

.. GENERATED FROM PYTHON SOURCE LINES 80-90

.. code-block:: default


    from deepchecks.tabular.suites import data_integrity

    # Run Suite:
    integ_suite = data_integrity()
    suite_result = integ_suite.run(ds)
    # Note: the result can be saved as html using suite_result.save_as_html()
    # or exported to json using suite_result.to_json()
    suite_result.show()

.. rst-class:: sphx-glr-script-out

.. code-block:: none

    Data Integrity Suite: |            | 0/12 [Time: 00:00]
    Data Integrity Suite: |##          | 2/12 [Time: 00:00, Check=Special Characters]
    Data Integrity Suite: |####        | 4/12 [Time: 00:00, Check=Mixed Data Types]
    Data Integrity Suite: |######      | 6/12 [Time: 00:00, Check=Data Duplicates]
    Data Integrity Suite: |########    | 8/12 [Time: 00:00, Check=Conflicting Labels]
    Data Integrity Suite: |##########  | 10/12 [Time: 00:03, Check=Feature Label Correlation]
    Data Integrity Suite: |############| 12/12 [Time: 00:03, Check=Identifier Label Correlation]

.. raw:: html

    Data Integrity Suite
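
One of the problems injected into the data earlier is the set of case variants of ``organic`` in the ``type`` column. The idea behind spotting such variants can be sketched in plain Python by grouping strings that collapse to the same normalized form. This is only an illustration under simple assumptions (lowercasing and trimming whitespace); ``find_string_variants`` is a hypothetical helper, not part of deepchecks, and the suite's own check may normalize strings differently.

```python
from collections import defaultdict


def find_string_variants(values):
    """Group distinct strings that collapse to the same lowercase form.

    Hypothetical helper for illustration only; not a deepchecks API.
    """
    groups = defaultdict(set)
    for value in values:
        groups[value.strip().lower()].add(value)
    # Keep only base forms that appear under more than one spelling.
    return {
        base: sorted(spellings)
        for base, spellings in groups.items()
        if len(spellings) > 1
    }


# Made-up sample mirroring the variants injected by add_dirty_data.
types = ["organic", "Organic", "ORGANIC", "conventional", "organic"]
variants = find_string_variants(types)
print(variants)
```

Only ``organic`` is reported here, because ``conventional`` appears under a single spelling.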


.. GENERATED FROM PYTHON SOURCE LINES 91-93

We can inspect the suite outputs and see that there are a few problems we'd like to fix.
We'll now fix them and check that they're resolved by re-running those specific checks.

.. GENERATED FROM PYTHON SOURCE LINES 96-99

Run a Single Check
-------------------
We can run a single check on a dataset and see the results.

.. GENERATED FROM PYTHON SOURCE LINES 99-105

.. code-block:: default


    from deepchecks.tabular.checks import IsSingleValue, DataDuplicates

    # first let's see how the check runs:
    IsSingleValue().run(ds)

.. raw:: html

    Single Value in Column


.. GENERATED FROM PYTHON SOURCE LINES 106-112

.. code-block:: default


    # we can also add a condition:
    single_value_with_condition = IsSingleValue().add_condition_not_single_value()
    result = single_value_with_condition.run(ds)
    result.show()

.. raw:: html

    Single Value in Column


.. GENERATED FROM PYTHON SOURCE LINES 113-117

.. code-block:: default


    # We can also inspect and use the result's value:
    result.value

.. rst-class:: sphx-glr-script-out

.. code-block:: none


    {'Date': 169, 'AveragePrice': 259, 'Total Volume': 18237, '4046': 17702, '4225': 18103, '4770': 12071, 'Total Bags': 18097, 'Small Bags': 17321, 'Large Bags': 15082, 'XLarge Bags': 5588, 'type': 4, 'year': 4, 'region': 54, 'Is Ripe': 1}

.. GENERATED FROM PYTHON SOURCE LINES 118-120

Now let's remove the single-value column and rerun the check (note that we operate directly on
the ``data`` attribute, which stores the dataframe inside the Dataset).

.. GENERATED FROM PYTHON SOURCE LINES 120-125

.. code-block:: default


    ds.data.drop('Is Ripe', axis=1, inplace=True)
    result = single_value_with_condition.run(ds)
    result.show()

.. raw:: html

    Single Value in Column
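
The ``result.value`` dictionary printed above (captured before ``Is Ripe`` was dropped) maps each column to its number of unique values, so it can also drive the cleanup programmatically. A minimal sketch in plain Python, using the printed output as a literal; ``unique_counts`` and ``single_value_columns`` are illustrative names, not deepchecks attributes.

```python
# Unique-value counts copied from the ``result.value`` output shown above.
unique_counts = {
    'Date': 169, 'AveragePrice': 259, 'Total Volume': 18237, '4046': 17702,
    '4225': 18103, '4770': 12071, 'Total Bags': 18097, 'Small Bags': 17321,
    'Large Bags': 15082, 'XLarge Bags': 5588, 'type': 4, 'year': 4,
    'region': 54, 'Is Ripe': 1,
}

# A column with exactly one unique value carries no information.
single_value_columns = [col for col, n in unique_counts.items() if n == 1]
print(single_value_columns)
```

In the tutorial flow, a list like this could be passed to ``ds.data.drop(columns=...)`` instead of hard-coding the column name.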


.. GENERATED FROM PYTHON SOURCE LINES 126-135

.. code-block:: default


    # Alternatively, we can fix the dataframe directly and create a new dataset.
    # Let's also fix the duplicate values:
    dirty_df.drop_duplicates(inplace=True)
    dirty_df.drop('Is Ripe', axis=1, inplace=True)
    ds = Dataset(dirty_df, cat_features=['type'], datetime_name='Date', label='AveragePrice')
    result = DataDuplicates().add_condition_ratio_less_or_equal(0).run(ds)
    result.show()

.. raw:: html

    Data Duplicates
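
To make the condition above concrete, the sketch below computes a duplicate ratio in plain Python as ``1 - unique_rows / total_rows``. That formula is an assumption made for illustration and deepchecks may define the ratio differently; ``duplicate_ratio`` is a hypothetical helper and the sample rows are made up.

```python
def duplicate_ratio(rows):
    """Share of rows that repeat another row: 1 - unique / total.

    Hypothetical helper for illustration only; not a deepchecks API.
    """
    rows = [tuple(row) for row in rows]
    if not rows:
        return 0.0
    return 1 - len(set(rows)) / len(rows)


# Made-up sample rows: (date, type, average price).
rows = [
    ("2015-12-27", "organic", 1.33),
    ("2015-12-27", "organic", 1.33),  # exact duplicate of the first row
    ("2015-12-20", "conventional", 1.35),
    ("2015-12-13", "conventional", 0.93),
]
ratio = duplicate_ratio(rows)
print(ratio)

# Dropping duplicates (keeping first occurrences) brings the ratio to zero,
# which is what a "ratio less or equal to 0" condition requires.
deduplicated = list(dict.fromkeys(rows))
print(duplicate_ratio(deduplicated))
```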


.. GENERATED FROM PYTHON SOURCE LINES 136-142

Rerun Suite on the Fixed Dataset
---------------------------------
Finally, we'll choose to keep the multiple spellings of "organic", as they represent different sources.
So we'll customize the suite by removing the condition from the relevant check (or by deleting the check completely).
Alternatively, we can customize it by creating a new Suite with the desired checks and conditions.
See :doc:`/user-guide/general/customizations/examples/plot_create_a_custom_suite` for more info.

.. GENERATED FROM PYTHON SOURCE LINES 142-146

.. code-block:: default


    # let's inspect the suite's structure
    integ_suite

.. rst-class:: sphx-glr-script-out

.. code-block:: none


    Data Integrity Suite: [
        0: IsSingleValue
            Conditions:
                0: Does not contain only a single value
        1: SpecialCharacters
            Conditions:
                0: Ratio of samples containing solely special character is less or equal to 0.1%
        2: MixedNulls
            Conditions:
                0: Number of different null types is less or equal to 1
        3: MixedDataTypes
            Conditions:
                0: Rare data types in column are either more than 10% or less than 1% of the data
        4: StringMismatch
            Conditions:
                0: No string variants
        5: DataDuplicates
            Conditions:
                0: Duplicate data ratio is less or equal to 0%
        6: StringLengthOutOfBounds
            Conditions:
                0: Ratio of string length outliers is less or equal to 0%
        7: ConflictingLabels
            Conditions:
                0: Ambiguous sample ratio is less or equal to 0%
        8: OutlierSampleDetection
        9: FeatureLabelCorrelation(ppscore_params={}, random_state=42)
            Conditions:
                0: Features' Predictive Power Score is less than 0.8
        10: FeatureFeatureCorrelation
            Conditions:
                0: Not more than 0 pairs are correlated above 0.9
        11: IdentifierLabelCorrelation(ppscore_params={})
            Conditions:
                0: Identifier columns PPS is less or equal to 0
    ]

.. GENERATED FROM PYTHON SOURCE LINES 147-151

.. code-block:: default


    # and remove the condition from the StringMismatch check
    # (index 4 in the structure printed above):
    integ_suite[4].clean_conditions()

.. GENERATED FROM PYTHON SOURCE LINES 152-153

Now we can re-run the suite using:

.. GENERATED FROM PYTHON SOURCE LINES 153-155

.. code-block:: default


    res = integ_suite.run(ds)

.. rst-class:: sphx-glr-script-out

.. code-block:: none

    Data Integrity Suite: |            | 0/12 [Time: 00:00]
    Data Integrity Suite: |##          | 2/12 [Time: 00:00, Check=Special Characters]
    Data Integrity Suite: |####        | 4/12 [Time: 00:00, Check=Mixed Data Types]
    Data Integrity Suite: |#######     | 7/12 [Time: 00:00, Check=String Length Out Of Bounds]
    Data Integrity Suite: |#########   | 9/12 [Time: 00:03, Check=Outlier Sample Detection]
    Data Integrity Suite: |##########  | 10/12 [Time: 00:03, Check=Feature Label Correlation]
    Data Integrity Suite: |############| 12/12 [Time: 00:03, Check=Identifier Label Correlation]

.. GENERATED FROM PYTHON SOURCE LINES 156-165

and all of the conditions will pass.

*Note: the check we manipulated will still run as part of the Suite. However, it won't appear
in the Conditions Summary, since it no longer has any conditions defined on it.
You can still see its display results in the Additional Outputs section.*

For more info about working with conditions, see the detailed
:doc:`/user-guide/general/customizations/examples/plot_configure_check_conditions` guide.


.. rst-class:: sphx-glr-timing

   **Total running time of the script:** ( 0 minutes 9.669 seconds)


.. _sphx_glr_download_user-guide_tabular_auto_quickstarts_plot_quick_data_integrity.py:

.. only:: html

  .. container:: sphx-glr-footer sphx-glr-footer-example

    .. container:: sphx-glr-download sphx-glr-download-python

      :download:`Download Python source code: plot_quick_data_integrity.py `

    .. container:: sphx-glr-download sphx-glr-download-jupyter

      :download:`Download Jupyter notebook: plot_quick_data_integrity.ipynb `


.. only:: html

 .. rst-class:: sphx-glr-signature

    `Gallery generated by Sphinx-Gallery `_