Preparing Your Tabular Data for Deepchecks Monitoring

What You Need to Get Through the Tutorial

In order to start monitoring your tabular data and model using Deepchecks, you will need the following prerequisites:

  • Data that can be loaded into a pandas DataFrame. This can come from a csv file, a database connection, or any other source.

  • A timestamp column in your data. This column identifies the time of each sample and is used to monitor the data over time. In most cases, the time of the model prediction is a good choice.

  • A working Python environment with deepchecks and deepchecks-client installed (a quick check is sketched below). See the quickstart guide for additional details.

All prerequisites fulfilled? Great! Let's get started.
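
A minimal way to check the environment prerequisite, using only the Python standard library (this sketch assumes both packages were installed from PyPI under these names):

from importlib.metadata import version

# Raises PackageNotFoundError if either package is missing
print('deepchecks:', version('deepchecks'))
print('deepchecks-client:', version('deepchecks-client'))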

Preparing Your Data

In this short tutorial we'll go over the steps required to prepare your data for Deepchecks Monitoring, which include:

  1. Preparing the Reference Data (Optional)

  2. Creating a Data Schema

  3. Preparing the Production Data

  4. Supplying Model Predictions (Optional)

After this tutorial you will have a ready-to-go setup for monitoring your data and model using Deepchecks. See the Setup Guide for a follow-up tutorial on setting up your monitoring system.

In this tutorial we will use the Lending Club loan data, which is stored in two csv files: one containing the data used for model training (reference data) and the other containing the production data. It is preferable to run this tutorial on your own data or on a dataset you are familiar with.

Preparing the Reference Data (Optional)

Reference data represents the data used for model training and is required in order to run checks that compare the production data to the reference data. An example of such a check is the Feature Drift check.

We will load the reference data from a csv file and use it to create a Dataset object, which is used to create the data schema and to upload the reference data to the monitoring system.

import pandas as pd

train_df = pd.read_csv('https://figshare.com/ndownloader/files/39316160')
train_df.head(2)
      issue_d sub_grade       term home_ownership  ...  emp_length  loan_status  time_to_earliest_cr_line
0  2014-01-01        D1  60 months       MORTGAGE  ...         2.0            1                  478656.0
1  2014-01-01        C4  60 months       MORTGAGE  ...         3.0            1                  541728.0

[2 rows x 27 columns]


So what do we have? Let’s note the special columns in our data:

  1. issue_d - the timestamp of the sample (unnecessary for reference data, but required for production data)

  2. id - the id of the loan application

  3. loan_status - our label, the final status of the loan: 0 means the loan was paid in full, and 1 means it defaulted.

All the other columns are features that can be used by our model to predict whether the user will default or not.
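
Before building the Dataset, a quick illustrative peek at the label balance can help confirm that the label column is what we expect (a sketch, not a required step):

# Share of fully paid (0) vs. defaulted (1) loans in the reference data
train_df['loan_status'].value_counts(normalize=True)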

To create a Dataset object we must specify the name of the label column and which features are categorical. If the data contains a datetime column, index column, or other columns that are not features, we also need to pass a features argument containing the feature column names.

from deepchecks.tabular import Dataset

features = train_df.columns.drop(['id', 'issue_d', 'loan_status'])
cat_features = ['sub_grade', 'home_ownership', 'term', 'purpose', 'application_type', 'verification_status',
                'addr_state', 'initial_list_status']
ref_dataset = Dataset(train_df, cat_features=cat_features, features=features, label='loan_status')
ref_dataset
--------- Dataset Description ----------

                                   Column           DType                 Kind Additional Info
0                             loan_status         integer
1                               sub_grade          string  Categorical Feature
2                                    term          string  Categorical Feature
3                          home_ownership          string  Categorical Feature
4                          fico_range_low        floating    Numerical Feature
5                               total_acc        floating    Numerical Feature
6                                 pub_rec        floating    Numerical Feature
7                              revol_util        floating    Numerical Feature
8                              annual_inc        floating    Numerical Feature
9                                int_rate        floating    Numerical Feature
10                                    dti        floating    Numerical Feature
11                                purpose          string  Categorical Feature
12                               mort_acc        floating    Numerical Feature
13                              loan_amnt        floating    Numerical Feature
14                       application_type          string  Categorical Feature
15                            installment        floating    Numerical Feature
16                    verification_status          string  Categorical Feature
17                   pub_rec_bankruptcies        floating    Numerical Feature
18                             addr_state          string  Categorical Feature
19                    initial_list_status          string  Categorical Feature
20                        fico_range_high        floating    Numerical Feature
21                              revol_bal        floating    Numerical Feature
22                               open_acc        floating    Numerical Feature
23                             emp_length        floating    Numerical Feature
24               time_to_earliest_cr_line        floating    Numerical Feature
25                                issue_d          string       Dataset Column
26                                     id         integer       Dataset Column


----------- Dataset Content ------------

                    loan_status       sub_grade            term  home_ownership  ...      emp_length  time_to_earliest_cr_line         issue_d              id
0                             1              D1       60 months        MORTGAGE  ...             2.0                  478656.0      2014-01-01        11024793
1                             1              C4       60 months        MORTGAGE  ...             3.0                  541728.0      2014-01-01        10596078
2                             1              A4       36 months            RENT  ...             1.0                  657590.4      2014-01-01        10775616
3                             1              D1       60 months        MORTGAGE  ...            11.0                  328838.4      2014-01-01        10765610
4                             1              C3       36 months        MORTGAGE  ...             2.0                  305164.8      2014-01-01        10794837
...                         ...             ...             ...             ...  ...             ...                       ...             ...             ...
236841                        0              C2       36 months            RENT  ...             0.0                  276220.8      2015-12-01        67476992
236842                        0              D2       36 months        MORTGAGE  ...            11.0                  533779.2      2015-08-01        56130981
236843                        0              B1       36 months        MORTGAGE  ...            11.0                  376185.6      2016-02-01        71502396
236844                        1              C3       60 months            RENT  ...            11.0                  867801.6      2016-06-01        83875883
236845                        1              E5       60 months            RENT  ...             3.0                  286588.8      2015-05-01        49197629

[236846 rows x 27 columns]

Creating the Data Schema

The schema file contains the description of the data (features and additional data) associated with a model version and is used by the monitoring system to validate the production data. It is highly recommended to review the created schema file before moving forward to creating the model version.

from deepchecks_client import create_schema, read_schema

schema_file_path = 'schema_file.yaml'
create_schema(dataset=ref_dataset, schema_output_file=schema_file_path)
read_schema(schema_file_path)
# Note: to conveniently change the auto-inferred schema, it's recommended to edit the textual file
# with an editor of your choice.
# After editing, you can use the `read_schema` function to verify that the syntax of your updated schema is valid.
Schema was successfully generated and saved to schema_file.yaml.

{'additional_data': {'id': 'integer', 'issue_d': 'categorical'}, 'features': {'addr_state': 'categorical', 'annual_inc': 'numeric', 'application_type': 'categorical', 'dti': 'numeric', 'emp_length': 'numeric', 'fico_range_high': 'numeric', 'fico_range_low': 'numeric', 'home_ownership': 'categorical', 'initial_list_status': 'categorical', 'installment': 'numeric', 'int_rate': 'numeric', 'loan_amnt': 'numeric', 'mort_acc': 'numeric', 'open_acc': 'numeric', 'pub_rec': 'numeric', 'pub_rec_bankruptcies': 'numeric', 'purpose': 'categorical', 'revol_bal': 'numeric', 'revol_util': 'numeric', 'sub_grade': 'categorical', 'term': 'categorical', 'time_to_earliest_cr_line': 'numeric', 'total_acc': 'numeric', 'verification_status': 'categorical'}}
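
If you prefer to adjust the auto-inferred schema programmatically rather than in an editor, a minimal sketch using PyYAML (assuming it is installed; the emp_length change below is purely illustrative) could look like this:

import yaml

with open(schema_file_path) as f:
    schema = yaml.safe_load(f)

# Illustrative edit: treat emp_length as categorical instead of numeric
schema['features']['emp_length'] = 'categorical'

with open(schema_file_path, 'w') as f:
    yaml.safe_dump(schema, f)

# Verify that the edited schema is still valid
read_schema(schema_file_path)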

Preparing the Production Data

To prepare the production data we will take a closer look at the index and datetime columns, which are required for production data but not for reference data.

The index is the global identifier for a sample in the deepchecks system and is used in various displays as well as for future updates of the sample. It is crucial to provide meaningful values for this column. In our case we will use the id column as the index.

The timestamps represent either the time the sample was observed or the time the model prediction took place. They should be provided in Unix timestamp format (seconds since 1970-01-01 00:00:00 UTC). In our case we will use the issue_d column and convert it to the required format.

from time import time

prod_data = pd.read_csv('https://figshare.com/ndownloader/files/39316157', parse_dates=['issue_d'])
# Convert pandas datetime format to unix timestamp (seconds since epoch)
prod_data['issue_d'] = prod_data['issue_d'].astype('int64') // 10 ** 9
# Verify that the index column is unique and that the datetime column is in the correct format
assert prod_data['id'].is_unique
assert prod_data['issue_d'].min() > 0 and prod_data['issue_d'].max() < int(time())
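
As an optional readback (a sketch), the unix seconds can be decoded back to dates to confirm the conversion looks sensible:

# Decode back to datetimes and inspect the covered time range
pd.to_datetime(prod_data['issue_d'], unit='s').agg(['min', 'max'])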

Supplying Model Predictions

If we wish to also monitor the model's behaviour, we need to provide the model's predictions for both the reference and production data in the required format, and optionally also the model's feature importance.

Currently, model predictions are only supported for regression and classification tasks. For classification tasks, it is preferable to provide the predicted probabilities per class rather than the predicted classes themselves.

# Loading the model (CatBoost Classifier)
import joblib
from urllib.request import urlopen

with urlopen('https://figshare.com/ndownloader/files/39316172') as f:
    model = joblib.load(f)

# Extracting feature importance - optional
feature_importance = pd.Series(model.feature_importances_ / sum(model.feature_importances_),
                               index=model.feature_names_)

# Predicting on the reference data and production data
ref_predictions = model.predict_proba(train_df[features].fillna('NONE'))
prod_predictions = model.predict_proba(prod_data[features].fillna('NONE'))
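
A few light sanity checks can be run before uploading (a sketch; the shape assertions assume the binary classification setup used in this tutorial):

# Each probability array should have one row per sample and one column per class
assert ref_predictions.shape == (len(train_df), 2)
assert prod_predictions.shape == (len(prod_data), 2)
# The normalized feature importance should sum to 1
assert abs(feature_importance.sum() - 1) < 1e-6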
