Quickstart - Data Integrity Suite (Avocado Sales Data)#

The deepchecks integrity suite is relevant any time you have data that you wish to validate, whether it’s a fresh batch of data or a dataset you are about to split or use for training. Here we’ll use the avocado prices dataset (deepchecks.tabular.datasets.regression.avocado) to demonstrate how you can run the suite with only a few lines of code and what kinds of insights it can surface.

# Before we start, if you don't have deepchecks installed yet,
# make sure to run:
pip install deepchecks -U --quiet #--user

Load and Prepare Data#

from deepchecks.tabular import datasets

# load data
data = datasets.regression.avocado.load_data(data_format='DataFrame', as_train_test=False)

Insert a few typical problems into the dataset for demonstration.

import pandas as pd

def add_dirty_data(df):
    # change strings
    df.loc[df[df['type'] == 'organic'].sample(frac=0.18).index,'type'] = 'Organic'
    df.loc[df[df['type'] == 'organic'].sample(frac=0.01).index,'type'] = 'ORGANIC'
    # add duplicates
    df = pd.concat([df, df.sample(frac=0.156)], axis=0, ignore_index=True)
    # add column with single value
    df['Is Ripe'] = True
    return df

dirty_df = add_dirty_data(data)
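
As a quick sanity check, the injected problems can be confirmed with plain pandas before running any suite. The sketch below is illustrative only: it uses a small hypothetical frame (`toy`) rather than the avocado data, and it passes a fixed `random_state`, which the function above does not set, so that the numbers are reproducible.

```python
import pandas as pd

# Hypothetical stand-in for the avocado data: 100 unique rows.
toy = pd.DataFrame({
    "type": ["organic"] * 50 + ["conventional"] * 50,
    "AveragePrice": [1.0 + i / 100 for i in range(100)],
})

# Same kind of steps as add_dirty_data, with random_state added (an assumption).
organic_idx = toy[toy["type"] == "organic"].sample(frac=0.18, random_state=0).index
toy.loc[organic_idx, "type"] = "Organic"
toy = pd.concat([toy, toy.sample(frac=0.156, random_state=0)], axis=0, ignore_index=True)
toy["Is Ripe"] = True

spellings = set(toy["type"].unique())            # distinct spellings of the category
n_duplicates = int(toy.duplicated().sum())       # rows that repeat an earlier row
single_value_cols = [c for c in toy.columns if toy[c].nunique() == 1]

print(spellings)
print(f"{n_duplicates} duplicate rows out of {len(toy)}")
print("single-value columns:", single_value_cols)
```

Note that sampling 15.6% of the rows and appending them means duplicates make up roughly 0.156 / 1.156, or about 13.5%, of the resulting frame.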

Run Deepchecks for Data Integrity#

Define a Dataset Object#

Create a deepchecks Dataset, including the relevant metadata (label, date, index, etc.). Check out deepchecks.tabular.Dataset to see all of the columns and types that can be declared.

from deepchecks.tabular import Dataset

# We state the categorical features explicitly. Otherwise they are inferred
# automatically, which may be less accurate.

# The label can be passed as a column name or a separate pd.Series / pd.DataFrame

ds = Dataset(dirty_df, cat_features=['type'], datetime_name='Date', label='AveragePrice')

Run the Deepchecks Suite#

Validate your data with the deepchecks.tabular.suites.data_integrity() suite. It runs on a single dataset, so you can run it on any batch of data (e.g. train data, test data, or a new batch that recently arrived).

Check out the “when should you use deepchecks” guide for more information about the existing suites and when to use them.

from deepchecks.tabular.suites import data_integrity

# Run Suite:
integ_suite = data_integrity()
suite_result = integ_suite.run(ds)
# Note: the result can be saved as html using suite_result.save_as_html()
# or exported to json using suite_result.to_json()
suite_result.show()


Data Integrity Suite

We can inspect the suite outputs and see that there are a few problems we’d like to fix. We’ll now fix them and check that they’re resolved by re-running those specific checks.

Run a Single Check#

We can run a single check on a dataset, and see the results.

from deepchecks.tabular.checks import IsSingleValue, DataDuplicates

# first let's see how the check runs:
IsSingleValue().run(ds)
Single Value in Column

# we can also add a condition:
single_value_with_condition = IsSingleValue().add_condition_not_single_value()
result = single_value_with_condition.run(ds)
result.show()
Single Value in Column

# We can also inspect and use the result's value:
result.value


{'Date': 169, 'AveragePrice': 259, 'Total Volume': 18237, '4046': 17702, '4225': 18103, '4770': 12071, 'Total Bags': 18097, 'Small Bags': 17321, 'Large Bags': 15082, 'XLarge Bags': 5588, 'type': 4, 'year': 4, 'region': 54, 'Is Ripe': 1}
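
Because this value is a plain dictionary mapping column names to their number of unique values, it can be processed like any other Python object. A minimal sketch, using a hand-copied subset of the dictionary shown above (the variable names are illustrative):

```python
# Subset of the dictionary printed above, copied by hand for illustration.
unique_counts = {
    "Date": 169,
    "AveragePrice": 259,
    "type": 4,
    "year": 4,
    "region": 54,
    "Is Ripe": 1,
}

# Columns with exactly one distinct value carry no information.
single_value_columns = [col for col, n_unique in unique_counts.items() if n_unique == 1]
print(single_value_columns)  # ['Is Ripe']
```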

Now let’s remove the single-value column and rerun the check. Note that we operate directly on the data attribute, which stores the DataFrame inside the Dataset.

ds.data.drop('Is Ripe', axis=1, inplace=True)
result = single_value_with_condition.run(ds)
result.show()
Single Value in Column

# Alternatively we can fix the dataframe directly, and create a new dataset.
# Let's also fix the duplicate values:
dirty_df.drop_duplicates(inplace=True)
dirty_df.drop('Is Ripe', axis=1, inplace=True)
ds = Dataset(dirty_df, cat_features=['type'], datetime_name='Date', label='AveragePrice')
result = DataDuplicates().add_condition_ratio_less_or_equal(0).run(ds)
result.show()
Data Duplicates
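
The duplicate ratio that this condition constrains can be approximated with plain pandas. The sketch below rests on an assumption about what the ratio means (rows that repeat an earlier row, divided by all rows); it is not deepchecks' implementation, and it uses a small hypothetical frame.

```python
import pandas as pd

# Hypothetical frame: the Albany row appears three times.
toy = pd.DataFrame({
    "region": ["Albany", "Albany", "Boston", "Albany"],
    "AveragePrice": [1.33, 1.33, 1.10, 1.33],
})

ratio_before = toy.duplicated().mean()      # 2 of 4 rows repeat an earlier row
deduped = toy.drop_duplicates()             # keeps the first occurrence of each row
ratio_after = deduped.duplicated().mean()

print(ratio_before, ratio_after)
```

A condition of "less or equal to 0" only passes once every repeated row is gone, which is why the frame has to be deduplicated before the check is rerun.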

Rerun Suite on the Fixed Dataset#

Finally, we’ll choose to keep the multiple spellings of “organic”, as they represent different sources. So we’ll customize the suite by removing the condition from the relevant check (or by deleting the check completely). Alternatively, we can customize it by creating a new Suite with the desired checks and conditions. See Create a Custom Suite for more info.
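
To see why these spellings are flagged in the first place, the idea behind a string-variant check can be sketched in plain Python: values are reduced to a normalized base form, and any base form shared by more than one distinct spelling is reported. The normalization used here (lowercasing and dropping non-alphanumeric characters) and the helper name `find_string_variants` are assumptions for illustration, not necessarily what the StringMismatch check does internally.

```python
import re
from collections import defaultdict


def find_string_variants(values):
    """Group distinct strings that share the same normalized base form."""
    groups = defaultdict(set)
    for value in values:
        # Assumed normalization: lowercase, then strip non-alphanumerics.
        base = re.sub(r"[^a-z0-9]", "", value.lower())
        groups[base].add(value)
    # Keep only base forms that have more than one spelling.
    return {base: sorted(variants) for base, variants in groups.items() if len(variants) > 1}


variants = find_string_variants(["organic", "Organic", "ORGANIC", "conventional"])
print(variants)  # {'organic': ['ORGANIC', 'Organic', 'organic']}
```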

# let's inspect the suite's structure
integ_suite


Data Integrity Suite: [
    0: IsSingleValue
                    0: Does not contain only a single value
    1: SpecialCharacters
                    0: Ratio of samples containing solely special character is less or equal to 0.1%
    2: MixedNulls
                    0: Number of different null types is less or equal to 1
    3: MixedDataTypes
                    0: Rare data types in column are either more than 10% or less than 1% of the data
    4: StringMismatch
                    0: No string variants
    5: DataDuplicates
                    0: Duplicate data ratio is less or equal to 0%
    6: StringLengthOutOfBounds
                    0: Ratio of string length outliers is less or equal to 0%
    7: ConflictingLabels
                    0: Ambiguous sample ratio is less or equal to 0%
    8: OutlierSampleDetection
    9: FeatureLabelCorrelation(ppscore_params={})
                    0: Features' Predictive Power Score is less than 0.8
]

# and remove the condition:
# (StringMismatch is at index 4 in the structure printed above)
integ_suite[4].clean_conditions()

Now we can re-run the suite using:

res = integ_suite.run(ds)


and all of the conditions will pass.

Note: the check we manipulated will still run as part of the suite. However, it won’t appear in the Conditions Summary, since it no longer has any conditions defined on it. You can still see its display results in the Additional Outputs section.

For more info about working with conditions, see the detailed /user-guide/general/customizations/examples/plot_configure_checks_conditions guide.
