Preparing Your Tabular Data for Deepchecks Monitoring#

What You Need to Get Through the Tutorial#

In order to start monitoring your tabular data and model with Deepchecks, you will need the following prerequisites:

  • Data which can be loaded into a pandas DataFrame. This can come from a csv file, a database connection, or any other source.

  • A timestamp column in your data. This column identifies the time of each sample and is used to monitor the data over time. In most cases the time of the model prediction is a good choice (see the sketch after this list).

  • A working Python environment with deepchecks and deepchecks-client installed. See the quickstart guide for additional details.
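
If your data does not yet include a timestamp column, you can attach one at prediction time. Below is a minimal sketch, assuming samples are logged in batches as predictions are made (the DataFrame and column names here are hypothetical):

import pandas as pd
from time import time

# Hypothetical batch of samples about to be sent to the model
batch = pd.DataFrame({'feature_a': [0.3, 1.7]})
# Use the prediction time, in Unix seconds, as the timestamp for the whole batch
batch['timestamp'] = int(time())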

All prerequisites fulfilled? Great! Let’s get started.

Preparing Your Data#

In this short tutorial we’ll go over the steps required to prepare your data for Deepchecks Monitoring, which include:

  1. Preparing the Reference Data (Optional)

  2. Creating a Data Schema

  3. Preparing the Production Data

  4. Supplying Model Predictions (Optional)

After this tutorial you will have a ready-to-go setup for monitoring your data and model with Deepchecks. See the Setup Guide for a follow-up tutorial on setting up your monitoring system.

In this tutorial we will use the Lending Club loan data, which is stored in two csv files: one containing the data used for model training (reference data) and the other containing the production data. It is preferable to run this tutorial on your own data, or on data you are familiar with.

Preparing the Reference Data (Optional)#

Reference data represents the data used for model training and is required in order to run checks that compare production data to it, such as the Feature Drift check.

We will load the reference data from a csv file and use it to create a Dataset object, which is used both to create the data schema and to upload the reference data to the monitoring system.

import pandas as pd

# Load the reference (training) data and peek at the first two rows
train_df = pd.read_csv('https://figshare.com/ndownloader/files/39316160')
train_df.head(2)
      issue_d sub_grade       term home_ownership  ...  open_acc  emp_length  loan_status  time_to_earliest_cr_line
0  2014-01-01        D1  60 months       MORTGAGE  ...      11.0         2.0            1                  478656.0
1  2014-01-01        C4  60 months       MORTGAGE  ...       8.0         3.0            1                  541728.0

[2 rows x 27 columns]


So what do we have? Let’s note the special columns in our data:

  1. issue_d - the timestamp of the sample (optional for reference data, but required for production data)

  2. id - the id of the loan application

  3. loan_status - our label, the final status of the loan: 0 means “paid in full” and 1 means the loan defaulted.

All the other columns are features that our model can use to predict whether the borrower will default.

In order to create a Dataset object we must specify the name of the label column and which features are categorical. If the data contains a datetime column, an index column, or other columns which are not features, we also need to pass a features argument containing the feature column names.

from deepchecks.tabular import Dataset

# Features are all columns except the index, timestamp and label columns
features = train_df.columns.drop(['id', 'issue_d', 'loan_status'])
cat_features = ['sub_grade', 'home_ownership', 'term', 'purpose', 'application_type', 'verification_status',
                'addr_state', 'initial_list_status']
ref_dataset = Dataset(train_df, cat_features=cat_features, features=features, label='loan_status')
ref_dataset
--------- Dataset Description ----------

                                   Column           DType                 Kind Additional Info
0                             loan_status         integer
1                               sub_grade          string  Categorical Feature
2                                    term          string  Categorical Feature
3                          home_ownership          string  Categorical Feature
4                          fico_range_low        floating    Numerical Feature
5                               total_acc        floating    Numerical Feature
6                                 pub_rec        floating    Numerical Feature
7                              revol_util        floating    Numerical Feature
8                              annual_inc        floating    Numerical Feature
9                                int_rate        floating    Numerical Feature
10                                    dti        floating    Numerical Feature
11                                purpose          string  Categorical Feature
12                               mort_acc        floating    Numerical Feature
13                              loan_amnt        floating    Numerical Feature
14                       application_type          string  Categorical Feature
15                            installment        floating    Numerical Feature
16                    verification_status          string  Categorical Feature
17                   pub_rec_bankruptcies        floating    Numerical Feature
18                             addr_state          string  Categorical Feature
19                    initial_list_status          string  Categorical Feature
20                        fico_range_high        floating    Numerical Feature
21                              revol_bal        floating    Numerical Feature
22                               open_acc        floating    Numerical Feature
23                             emp_length        floating    Numerical Feature
24               time_to_earliest_cr_line        floating    Numerical Feature
25                                issue_d          string       Dataset Column
26                                     id         integer       Dataset Column


----------- Dataset Content ------------

                    loan_status       sub_grade            term  home_ownership  ...      emp_length  time_to_earliest_cr_line         issue_d              id
0                             1              D1       60 months        MORTGAGE  ...             2.0                  478656.0      2014-01-01        11024793
1                             1              C4       60 months        MORTGAGE  ...             3.0                  541728.0      2014-01-01        10596078
2                             1              A4       36 months            RENT  ...             1.0                  657590.4      2014-01-01        10775616
3                             1              D1       60 months        MORTGAGE  ...            11.0                  328838.4      2014-01-01        10765610
4                             1              C3       36 months        MORTGAGE  ...             2.0                  305164.8      2014-01-01        10794837
...                         ...             ...             ...             ...  ...             ...                       ...             ...             ...
236841                        0              C2       36 months            RENT  ...             0.0                  276220.8      2015-12-01        67476992
236842                        0              D2       36 months        MORTGAGE  ...            11.0                  533779.2      2015-08-01        56130981
236843                        0              B1       36 months        MORTGAGE  ...            11.0                  376185.6      2016-02-01        71502396
236844                        1              C3       60 months            RENT  ...            11.0                  867801.6      2016-06-01        83875883
236845                        1              E5       60 months            RENT  ...             3.0                  286588.8      2015-05-01        49197629

[236846 rows x 27 columns]
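
Before moving on, it can help to verify that the Dataset object picked up the intended columns. A minimal sketch using the Dataset’s cat_features and label_name attributes:

# Sanity-check that the categorical features and label were set as intended
assert set(ref_dataset.cat_features) == set(cat_features)
assert ref_dataset.label_name == 'loan_status'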

Creating the Data Schema#

The schema file contains a description of the data (features and additional data) associated with a model version, and is used by the monitoring system to validate the production data. It is highly recommended to review the created schema file before moving forward to creating the model version.

from deepchecks_client import create_schema, read_schema

schema_file_path = 'schema_file.yaml'
create_schema(dataset=ref_dataset, schema_output_file=schema_file_path)
read_schema(schema_file_path)
# Note: to conveniently change the auto-inferred schema, it's recommended to edit the yaml file with a
# text editor of your choice.
# After editing, you can use the `read_schema` function to verify that the syntax of your updated schema is valid.
Schema was successfully generated and saved to schema_file.yaml.

{'additional_data': {'id': 'integer', 'issue_d': 'categorical'}, 'features': {'addr_state': 'categorical', 'annual_inc': 'numeric', 'application_type': 'categorical', 'dti': 'numeric', 'emp_length': 'numeric', 'fico_range_high': 'numeric', 'fico_range_low': 'numeric', 'home_ownership': 'categorical', 'initial_list_status': 'categorical', 'installment': 'numeric', 'int_rate': 'numeric', 'loan_amnt': 'numeric', 'mort_acc': 'numeric', 'open_acc': 'numeric', 'pub_rec': 'numeric', 'pub_rec_bankruptcies': 'numeric', 'purpose': 'categorical', 'revol_bal': 'numeric', 'revol_util': 'numeric', 'sub_grade': 'categorical', 'term': 'categorical', 'time_to_earliest_cr_line': 'numeric', 'total_acc': 'numeric', 'verification_status': 'categorical'}}
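
If you prefer to adjust the inferred types programmatically rather than in an editor, here is a minimal sketch assuming PyYAML is available (the schema is a plain yaml file, so any yaml library will do); the emp_length change below is purely illustrative:

import yaml

# Load the auto-inferred schema, tweak a column type, and save it back
with open(schema_file_path) as f:
    schema = yaml.safe_load(f)
schema['features']['emp_length'] = 'categorical'  # illustrative change only
with open(schema_file_path, 'w') as f:
    yaml.safe_dump(schema, f)

# Confirm the edited schema is still syntactically valid
read_schema(schema_file_path)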

Preparing the Production Data#

In order to prepare the production data, we will take a closer look at the index and datetime columns, which are required for production data but not for reference data.

The index is the global identifier for a sample in the Deepchecks system. It is used in various displays and for future updates to the sample, so it is crucial to provide meaningful values for this column. In our case we will use the id column as the index.

The timestamps represent either the time the sample was observed or the time the model prediction took place. They should be provided in Unix timestamp format (seconds since 1970-01-01 00:00:00 UTC). In our case we will use the issue_d column and convert it to the required format.

from time import time

prod_data = pd.read_csv('https://figshare.com/ndownloader/files/39316157', parse_dates=['issue_d'])
# Convert pandas datetime format to unix timestamp
prod_data['issue_d'] = prod_data['issue_d'].astype(int) // 10 ** 9
# Verify that the id column (used as the sample index) is unique and that the
# converted timestamps fall within a sensible range
assert prod_data['id'].is_unique
assert prod_data['issue_d'].min() > 0 and prod_data['issue_d'].max() < int(time())
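
To sanity-check the conversion, the Unix timestamps can be round-tripped back into datetimes with pandas:

# Convert a few Unix timestamps back to datetimes to eyeball the result
pd.to_datetime(prod_data['issue_d'].head(), unit='s')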

Supplying Model Predictions#

If we wish to also monitor the model’s behaviour, we need to provide the model’s predictions for both the reference and production data in the required format, and optionally also the model’s feature importance.

Currently, model predictions are only supported for regression and classification tasks. For classification tasks, it is preferable to provide the predicted probabilities per class rather than the predicted classes themselves.

# Loading the model (CatBoost Classifier)
import joblib
from urllib.request import urlopen

with urlopen('https://figshare.com/ndownloader/files/39316172') as f:
    model = joblib.load(f)

# Extracting feature importance - optional
feature_importance = pd.Series(model.feature_importances_ / sum(model.feature_importances_), index=model.feature_names_)

# Predicting probabilities on the reference data and production data
ref_predictions = model.predict_proba(train_df[features].fillna('NONE'))
prod_predictions = model.predict_proba(prod_data[features].fillna('NONE'))
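
Before handing these predictions to the monitoring system, it is worth checking that they follow the expected per-class probability format. A minimal sketch using numpy:

import numpy as np

# For this binary task, each prediction row holds two class probabilities that sum to 1
assert ref_predictions.shape == (len(train_df), 2)
assert prod_predictions.shape == (len(prod_data), 2)
assert np.allclose(ref_predictions.sum(axis=1), 1.0)
assert np.allclose(prod_predictions.sum(axis=1), 1.0)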
