.. DO NOT EDIT.
.. THIS FILE WAS AUTOMATICALLY GENERATED BY SPHINX-GALLERY.
.. TO MAKE CHANGES, EDIT THE SOURCE PYTHON FILE:
.. "user-guide/tabular/auto_tutorials/plot_quick_data_integrity.py"
.. LINE NUMBERS ARE GIVEN BELOW.

.. only:: html

    .. note::
        :class: sphx-glr-download-link-note

        Click :ref:`here ` to download the full example code

.. rst-class:: sphx-glr-example-title

.. _sphx_glr_user-guide_tabular_auto_tutorials_plot_quick_data_integrity.py:

Data Integrity Suite on Avocado Sales Data - Quickstart
*******************************************************

The deepchecks integrity suite is relevant any time you have data that you wish to validate:
whether it's a fresh batch of data, or data you are about to split or use for training.
Here we'll use the avocado prices dataset to demonstrate how you can run the suite with only a
few simple lines of code, and see what kind of insights it can find.

.. code-block:: bash

    # Before we start, if you don't have deepchecks installed yet,
    # make sure to run:
    pip install deepchecks -U --quiet #--user

.. GENERATED FROM PYTHON SOURCE LINES 19-21

Load and Prepare Data
=====================

.. GENERATED FROM PYTHON SOURCE LINES 21-27

.. code-block:: default

    from deepchecks.tabular import datasets

    # load data
    data = datasets.regression.avocado.load_data(data_format='DataFrame', as_train_test=False)

.. GENERATED FROM PYTHON SOURCE LINES 28-29

Insert a few typical problems into the dataset for demonstration.

.. GENERATED FROM PYTHON SOURCE LINES 29-45

.. code-block:: default

    import pandas as pd

    def add_dirty_data(df):
        # change strings
        df.loc[df[df['type'] == 'organic'].sample(frac=0.18).index, 'type'] = 'Organic'
        df.loc[df[df['type'] == 'organic'].sample(frac=0.01).index, 'type'] = 'ORGANIC'
        # add duplicates
        df = pd.concat([df, df.sample(frac=0.156)], axis=0, ignore_index=True)
        # add column with single value
        df['Is Ripe'] = True
        return df

    dirty_df = add_dirty_data(data)

..
.. GENERATED FROM PYTHON SOURCE LINES 46-54

Run Deepchecks for Data Integrity
=================================

Define a Dataset Object
------------------------

Create a deepchecks Dataset, including the relevant metadata (label, date, index, etc.).
Check out :class:`deepchecks.tabular.Dataset` to see all of the metadata that can be declared.

.. GENERATED FROM PYTHON SOURCE LINES 54-62

.. code-block:: default

    from deepchecks.tabular import Dataset

    # We explicitly state the categorical features, otherwise they will be automatically
    # inferred, which may not work perfectly and is not recommended.
    # The label can be passed as a column name or as a separate pd.Series / pd.DataFrame
    ds = Dataset(dirty_df, cat_features=['type'], datetime_name='Date', label='AveragePrice')

.. GENERATED FROM PYTHON SOURCE LINES 63-72

Run the Deepchecks Suite
--------------------------

Validate your data with the :class:`deepchecks.tabular.suites.data_integrity` suite.
It runs on a single dataset, so you can run it on any batch of data
(e.g. train data, test data, or a new batch of data that recently arrived).

Check out the :doc:`when should you use ` deepchecks guide for more info
about the existing suites and when to use them.

.. GENERATED FROM PYTHON SOURCE LINES 72-79

.. code-block:: default

    from deepchecks.tabular.suites import data_integrity

    # Run Suite:
    integ_suite = data_integrity()
    integ_suite.run(ds)

.. rst-class:: sphx-glr-script-out

 Out:

 .. code-block:: none

    Data Integrity Suite: (progress-bar and interactive report output not shown)
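To build intuition for what the suite is doing, here is a rough, library-independent sketch:
a suite is essentially an ordered list of checks run against the same data, each producing its
own result. The toy rows and check functions below are illustrative only, not deepchecks APIs.

```python
# Toy rows standing in for the avocado table (values are illustrative)
rows = [
    {"type": "organic", "region": "Albany", "AveragePrice": 1.33},
    {"type": "Organic", "region": "Albany", "AveragePrice": 1.33},
    {"type": "conventional", "region": "Boston", "AveragePrice": None},
]

def check_nulls(rows):
    """Count missing values per column."""
    counts = {}
    for row in rows:
        for col, val in row.items():
            if val is None:
                counts[col] = counts.get(col, 0) + 1
    return counts

def check_case_variants(rows):
    """Find string values that differ only by letter case."""
    seen = {}
    for row in rows:
        for val in row.values():
            if isinstance(val, str):
                seen.setdefault(val.lower(), set()).add(val)
    return {k: sorted(v) for k, v in seen.items() if len(v) > 1}

# A "suite" is just an ordered list of checks run on the same data
suite = [check_nulls, check_case_variants]
report = {check.__name__: check(rows) for check in suite}
print(report)
```

The real suite adds conditions, display logic, and many more checks, but the control flow is
the same: iterate over checks, collect per-check results into one report.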

.. GENERATED FROM PYTHON SOURCE LINES 80-82

We can inspect the suite outputs and see that there are a few problems we'd like to fix.
We'll now fix them and check that they're resolved by re-running those specific checks.

.. GENERATED FROM PYTHON SOURCE LINES 85-88

Run a Single Check
-------------------

We can run a single check on a dataset and see the results.

.. GENERATED FROM PYTHON SOURCE LINES 88-94

.. code-block:: default

    from deepchecks.tabular.checks import IsSingleValue, DataDuplicates

    # first let's see how the check runs:
    IsSingleValue().run(ds)
Single Value in Column


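The core computation behind a single-value check is simple: count the distinct values in each
column and flag any column with exactly one. A minimal sketch in plain Python (toy data, not
the deepchecks implementation):

```python
# Toy columns mirroring the tutorial's situation (values are illustrative)
columns = {
    "type": ["organic", "Organic", "conventional"],
    "year": [2015, 2016, 2017],
    "Is Ripe": [True, True, True],
}

# Count distinct values per column
nunique = {col: len(set(values)) for col, values in columns.items()}

# A column with exactly one distinct value carries no information
single_value_cols = [col for col, n in nunique.items() if n == 1]

print(nunique)
print(single_value_cols)  # ['Is Ripe']
```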
.. GENERATED FROM PYTHON SOURCE LINES 95-101

.. code-block:: default

    # we can also add a condition:
    single_value_with_condition = IsSingleValue().add_condition_not_single_value()
    result = single_value_with_condition.run(ds)
    result
Single Value in Column


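Conceptually, a condition is just a pass/fail predicate evaluated on the check's computed value.
A hedged sketch of that idea (the function below is hypothetical, not the deepchecks API):

```python
# The check's "value": distinct-value counts per column (toy numbers)
check_value = {"type": 4, "year": 4, "Is Ripe": 1}

def condition_not_single_value(value):
    """Hypothetical condition: pass only if no column has exactly one distinct value."""
    failing = [col for col, n in value.items() if n == 1]
    return (len(failing) == 0, failing)

passed, failing = condition_not_single_value(check_value)
print(passed, failing)  # False ['Is Ripe']
```

Attaching a condition to a check does not change what the check computes; it only adds a
verdict on top of the computed value, which is what makes suites usable in CI-style validation.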
.. GENERATED FROM PYTHON SOURCE LINES 102-106

.. code-block:: default

    # We can also inspect and use the result's value:
    result.value

.. rst-class:: sphx-glr-script-out

 Out:

 .. code-block:: none

    {'Date': 169, 'AveragePrice': 259, 'Total Volume': 18237, '4046': 17702, '4225': 18103, '4770': 12071, 'Total Bags': 18097, 'Small Bags': 17321, 'Large Bags': 15082, 'XLarge Bags': 5588, 'type': 4, 'year': 4, 'region': 54, 'Is Ripe': 1}

.. GENERATED FROM PYTHON SOURCE LINES 107-109

Now let's remove the single-value column and rerun (notice that we're directly using the
``data`` attribute that stores the dataframe inside the Dataset).

.. GENERATED FROM PYTHON SOURCE LINES 109-114

.. code-block:: default

    ds.data.drop('Is Ripe', axis=1, inplace=True)
    result = single_value_with_condition.run(ds)
    result
Single Value in Column


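Why does dropping the column flip the condition's verdict? Once the offending column is gone,
no entry in the check's value has a distinct-value count of one, so the predicate passes. A toy
sketch of that state change (illustrative names, not deepchecks code):

```python
# Distinct-value counts before the fix (toy numbers)
check_value = {"type": 4, "year": 4, "Is Ripe": 1}

def not_single_value(value):
    """Pass only if every column has more than one distinct value."""
    return all(n > 1 for n in value.values())

before = not_single_value(check_value)

check_value.pop("Is Ripe")   # analogous to dropping the column from the dataset
after = not_single_value(check_value)

print(before, after)  # False True
```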
.. GENERATED FROM PYTHON SOURCE LINES 115-124

.. code-block:: default

    # Alternatively, we can fix the dataframe directly and create a new dataset.
    # Let's also fix the duplicate values:
    dirty_df.drop_duplicates(inplace=True)
    dirty_df.drop('Is Ripe', axis=1, inplace=True)
    ds = Dataset(dirty_df, cat_features=['type'], datetime_name='Date', label='AveragePrice')
    result = DataDuplicates().add_condition_ratio_not_greater_than(0).run(ds)
    result
Data Duplicates


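The duplicate-ratio idea that the condition above tests can be sketched in plain Python: count
the rows that are exact copies of an earlier row, divide by the total, and require the ratio to
be 0. The rows below are toy values, not the real avocado data:

```python
# Toy rows: tuples of (type, region, price); one row is an exact duplicate
rows = [
    ("organic", "Albany", 1.33),
    ("conventional", "Boston", 1.08),
    ("organic", "Albany", 1.33),   # duplicate of the first row
    ("organic", "Chicago", 1.52),
]

# Number of rows beyond the first occurrence of each distinct row
n_duplicates = len(rows) - len(set(rows))
duplicate_ratio = n_duplicates / len(rows)
print(duplicate_ratio)  # 0.25

# After de-duplicating (order-preserving), the 0% condition would pass
deduped = list(dict.fromkeys(rows))
assert (len(deduped) - len(set(deduped))) / len(deduped) == 0
```

This mirrors ``drop_duplicates`` followed by the ``DataDuplicates`` check with a 0% threshold.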
.. GENERATED FROM PYTHON SOURCE LINES 125-131

Rerun Suite on the Fixed Dataset
---------------------------------

Finally, we'll choose to keep the multiple spellings of "organic", as they represent different
sources. So we'll customize the suite by removing the condition from the relevant check (or by
deleting the check completely). Alternatively, we can customize the suite by creating a new
Suite with the desired checks and conditions.
See :doc:`/user-guide/general/customizations/examples/customizing-suites` for more info.

.. GENERATED FROM PYTHON SOURCE LINES 131-135

.. code-block:: default

    # let's inspect the suite's structure
    integ_suite

.. rst-class:: sphx-glr-script-out

 Out:

 .. code-block:: none

    Data Integrity Suite: [
        0: IsSingleValue
            Conditions:
                0: Does not contain only a single value
        1: SpecialCharacters
            Conditions:
                0: Ratio of entirely special character samples not greater than 0.1%
        2: MixedNulls
            Conditions:
                0: Not more than 1 different null types
        3: MixedDataTypes
            Conditions:
                0: Rare data types in column are either more than 10% or less than 1% of the data
        4: StringMismatch
            Conditions:
                0: No string variants
        5: DataDuplicates
            Conditions:
                0: Duplicate data ratio is not greater than 0%
        6: StringLengthOutOfBounds
            Conditions:
                0: Ratio of string length outliers is not greater than 0%
        7: ConflictingLabels
            Conditions:
                0: Ambiguous sample ratio is not greater than 0%
        8: OutlierSampleDetection
        9: FeatureLabelCorrelation(ppscore_params={})
            Conditions:
                0: Features' Predictive Power Score is not greater than 0.8
    ]

.. GENERATED FROM PYTHON SOURCE LINES 136-140

.. code-block:: default

    # and remove the "No string variants" condition from the StringMismatch check,
    # which is at index 4 in the listing above:
    integ_suite[4].clean_conditions()

.. GENERATED FROM PYTHON SOURCE LINES 141-142

Now we can re-run the suite using:

.. GENERATED FROM PYTHON SOURCE LINES 142-144

.. code-block:: default

    res = integ_suite.run(ds)

.. rst-class:: sphx-glr-script-out

 Out:

 .. code-block:: none

    Data Integrity Suite: (progress-bar output truncated)

.. only:: html

    .. container:: sphx-glr-download sphx-glr-download-python

        :download:`Download Python source code: plot_quick_data_integrity.py `
    .. container:: sphx-glr-download sphx-glr-download-jupyter

        :download:`Download Jupyter notebook: plot_quick_data_integrity.ipynb `

.. only:: html

    .. rst-class:: sphx-glr-signature

        `Gallery generated by Sphinx-Gallery `_