Preparing Your Tabular Data for Deepchecks Monitoring#
What You Need to Get Through the Tutorial#
In order to start monitoring your tabular data and model using Deepchecks, you will need the following pre-requisites:
Data which can be loaded into a pandas DataFrame. The source can be a csv file, a database connection or anything else pandas can read.
A timestamp column in your data. This column will be used to identify the time of the sample and will be used to monitor the data over time. In most cases, the time of the model prediction will be a good choice.
A working Python environment with deepchecks and deepchecks-client installed. See the quickstart guide for additional details.
All the pre-requisites are fulfilled? Great! Let’s get started.
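As a quick sanity check for the first two pre-requisites, the sketch below loads rows from a database into a pandas DataFrame and confirms that the timestamp column parses cleanly. The in-memory SQLite database, the `predictions` table and the `prediction_time` column are placeholders invented for this illustration; they are not part of the tutorial's dataset.

```python
import sqlite3

import pandas as pd

# Hypothetical in-memory database standing in for your own data source.
conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE predictions (id INTEGER, amount REAL, prediction_time TEXT)')
conn.executemany(
    'INSERT INTO predictions VALUES (?, ?, ?)',
    [(1, 1000.0, '2023-01-01 10:00:00'), (2, 2500.0, '2023-01-02 12:30:00')],
)

# Any source works as long as the result is a DataFrame.
df = pd.read_sql_query('SELECT * FROM predictions', conn)
conn.close()

# errors='coerce' turns unparseable values into NaT so they can be counted.
parsed = pd.to_datetime(df['prediction_time'], errors='coerce')
n_bad = int(parsed.isna().sum())
print(f'{len(df)} rows loaded, {n_bad} unparseable timestamps')
```

If the count of unparseable timestamps is not zero, it is usually easier to fix the source column before going any further.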
Preparing Your Data#
In this short tutorial we’ll go over the steps required to prepare your data for Deepchecks Monitoring:
Preparing the Reference Data (Optional)
Creating the Data Schema
Preparing the Production Data
Supplying Model Predictions (Optional)
After this tutorial you will have a ready-to-go setup for monitoring your data and model using Deepchecks. See Setup Guide for a follow-up tutorial on setting up your monitoring system.
In this tutorial we will use the Lending Club loan data, which is stored in two csv files: one containing the data used for model training (reference data) and the other containing the production data. It is preferable to run this tutorial on your own data or on a dataset you are familiar with.
Preparing the Reference Data (Optional)#
Reference data represents the data used for model training and is required in order to run checks which compare the production data to the reference data. An example of such a check is the Feature Drift check.
We will load the reference data from a csv file and use it to create a Dataset object, which is then used to create the data schema and to upload the reference data to the monitoring system.
import pandas as pd
train_df = pd.read_csv('https://figshare.com/ndownloader/files/39316160')
train_df.head(2)
So what do we have? Let’s note the special columns in our data:
issue_d - The timestamp of the sample (This is unnecessary for reference data, but is required for production data)
id - the id of the loan application
loan_status - Our label, which is the final status of the loan. 0 means “paid in full”, and 1 means the loan defaulted.
All the other columns are features that can be used by our model to predict whether the user will default or not.
In order to create a Dataset object we must specify the name of the label column and which features are categorical. If the data contains a datetime column, index column or other columns which are not features, we need to also pass a features argument containing the features column names.
from deepchecks.tabular import Dataset
features = train_df.columns.drop(['id', 'issue_d', 'loan_status'])
cat_features = ['sub_grade', 'home_ownership', 'term', 'purpose', 'application_type', 'verification_status',
                'addr_state', 'initial_list_status']
ref_dataset = Dataset(train_df, cat_features=cat_features, features=features, label='loan_status')
ref_dataset
--------- Dataset Description ----------
Column DType Kind Additional Info
0 loan_status integer
1 sub_grade string Categorical Feature
2 term string Categorical Feature
3 home_ownership string Categorical Feature
4 fico_range_low floating Numerical Feature
5 total_acc floating Numerical Feature
6 pub_rec floating Numerical Feature
7 revol_util floating Numerical Feature
8 annual_inc floating Numerical Feature
9 int_rate floating Numerical Feature
10 dti floating Numerical Feature
11 purpose string Categorical Feature
12 mort_acc floating Numerical Feature
13 loan_amnt floating Numerical Feature
14 application_type string Categorical Feature
15 installment floating Numerical Feature
16 verification_status string Categorical Feature
17 pub_rec_bankruptcies floating Numerical Feature
18 addr_state string Categorical Feature
19 initial_list_status string Categorical Feature
20 fico_range_high floating Numerical Feature
21 revol_bal floating Numerical Feature
22 open_acc floating Numerical Feature
23 emp_length floating Numerical Feature
24 time_to_earliest_cr_line floating Numerical Feature
25 issue_d string Dataset Column
26 id integer Dataset Column
----------- Dataset Content ------------
loan_status sub_grade term home_ownership ... emp_length time_to_earliest_cr_line issue_d id
0 1 D1 60 months MORTGAGE ... 2.0 478656.0 2014-01-01 11024793
1 1 C4 60 months MORTGAGE ... 3.0 541728.0 2014-01-01 10596078
2 1 A4 36 months RENT ... 1.0 657590.4 2014-01-01 10775616
3 1 D1 60 months MORTGAGE ... 11.0 328838.4 2014-01-01 10765610
4 1 C3 36 months MORTGAGE ... 2.0 305164.8 2014-01-01 10794837
... ... ... ... ... ... ... ... ... ...
236841 0 C2 36 months RENT ... 0.0 276220.8 2015-12-01 67476992
236842 0 D2 36 months MORTGAGE ... 11.0 533779.2 2015-08-01 56130981
236843 0 B1 36 months MORTGAGE ... 11.0 376185.6 2016-02-01 71502396
236844 1 C3 60 months RENT ... 11.0 867801.6 2016-06-01 83875883
236845 1 E5 60 months RENT ... 3.0 286588.8 2015-05-01 49197629
[236846 rows x 27 columns]
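Since `cat_features` has to be listed by hand, it can help to cross-check the list against the column dtypes. The following is a minimal sketch on a small made-up frame rather than the Lending Club data. The helper `guess_categorical` is hypothetical: it flags non-numeric columns and low-cardinality integer columns, and its cardinality threshold is an arbitrary assumption. Treat the result as a starting point for review, not as what Deepchecks infers internally.

```python
import pandas as pd


def guess_categorical(df, max_unique=10):
    """Return column names that look categorical (heuristic, hypothetical helper)."""
    guessed = []
    for col in df.columns:
        series = df[col]
        if not pd.api.types.is_numeric_dtype(series):
            # Strings and other non-numeric columns are treated as categorical.
            guessed.append(col)
        elif pd.api.types.is_integer_dtype(series) and series.nunique() <= max_unique:
            # Integer columns with few distinct values are likely encoded categories.
            guessed.append(col)
    return guessed


# Made-up data for illustration only.
toy = pd.DataFrame({
    'term': ['36 months', '60 months', '36 months'],
    'int_rate': [7.5, 12.1, 9.9],
    'n_dependents': [0, 1, 0],
    'id': [1, 2, 3],
})
print(guess_categorical(toy, max_unique=2))
```

Identifier columns such as `id` only escape the heuristic when they have more distinct values than the threshold, so the output still needs a manual look.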
Creating the Data Schema#
The schema file contains the description of the data (features and additional data) associated with a model version and is used by the monitoring system to validate the production data. It is highly recommended to review the created schema file before moving on to creating the model version.
from deepchecks_client import create_schema, read_schema
schema_file_path = 'schema_file.yaml'
create_schema(dataset=ref_dataset, schema_output_file=schema_file_path)
read_schema(schema_file_path)
# Note: for conveniently changing the auto-inferred schema it's recommended to edit the textual file with an
# app of your choice.
# After editing, you can use the `read_schema` function to verify the validity of the syntax in your updated schema.
Schema was successfully generated and saved to schema_file.yaml.
{'additional_data': {'id': 'integer', 'issue_d': 'categorical'}, 'features': {'addr_state': 'categorical', 'annual_inc': 'numeric', 'application_type': 'categorical', 'dti': 'numeric', 'emp_length': 'numeric', 'fico_range_high': 'numeric', 'fico_range_low': 'numeric', 'home_ownership': 'categorical', 'initial_list_status': 'categorical', 'installment': 'numeric', 'int_rate': 'numeric', 'loan_amnt': 'numeric', 'mort_acc': 'numeric', 'open_acc': 'numeric', 'pub_rec': 'numeric', 'pub_rec_bankruptcies': 'numeric', 'purpose': 'categorical', 'revol_bal': 'numeric', 'revol_util': 'numeric', 'sub_grade': 'categorical', 'term': 'categorical', 'time_to_earliest_cr_line': 'numeric', 'total_acc': 'numeric', 'verification_status': 'categorical'}}
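Because the monitoring system validates production data against this schema, a quick local comparison can catch missing columns before anything is uploaded. The sketch below assumes only the dictionary layout printed above (`features` and `additional_data` keys mapping column names to types). The helper `missing_schema_columns` and the sample inputs are hypothetical and do not reproduce Deepchecks' own validation.

```python
def missing_schema_columns(schema, columns):
    """Return schema column names that are absent from `columns`, sorted."""
    expected = set(schema.get('features', {})) | set(schema.get('additional_data', {}))
    return sorted(expected - set(columns))


# Trimmed-down schema in the same layout as the output above.
schema = {
    'additional_data': {'id': 'integer', 'issue_d': 'categorical'},
    'features': {'annual_inc': 'numeric', 'term': 'categorical'},
}

# Column names of a hypothetical production frame that lacks `term`.
prod_columns = ['id', 'issue_d', 'annual_inc']
print(missing_schema_columns(schema, prod_columns))
```

With a real DataFrame, `prod_data.columns` could be passed as the second argument.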
Preparing the Production Data#
In order to prepare the production data we will take a closer look at index and datetime columns which are required for production data but not for reference data.
The index is the global identifier for a sample in the deepchecks system and is used in various displays as well as for future updates of the sample. It is crucial to provide meaningful values for this column. In our case we will use the id column as the index.
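Since the index identifies a sample for later updates, repeated values are worth catching early. A minimal sketch of checking a candidate index column with pandas follows. The tiny frame is made up for illustration, and keeping the first occurrence is only one possible policy.

```python
import pandas as pd

# Made-up sample with one repeated id.
samples = pd.DataFrame({
    'id': [101, 102, 102, 103],
    'loan_amnt': [5000, 7200, 7200, 1500],
})

# keep=False marks every row that shares its id with another row.
duplicated_rows = samples[samples['id'].duplicated(keep=False)]
print(duplicated_rows)

# One possible policy: keep the first occurrence of each id.
deduplicated = samples.drop_duplicates(subset='id', keep='first')
print(deduplicated['id'].is_unique)
```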
The timestamp represents either the time the sample was observed or the time the model prediction took place. It should be provided in Unix timestamp format (seconds since 1970-01-01 00:00:00 UTC). In our case we will use the issue_d column and convert it to the required format.
from time import time
prod_data = pd.read_csv('https://figshare.com/ndownloader/files/39316157', parse_dates=['issue_d'])
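# Convert pandas datetimes to Unix timestamps in seconds. This assumes the column is stored with
# nanosecond resolution; int64 avoids overflow on platforms where the default int is 32-bit.
prod_data['issue_d'] = prod_data['issue_d'].astype('int64') // 10 ** 9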
# Verify that the index column is unique and that the datetime column is in the correct format
assert prod_data['id'].is_unique
assert prod_data['issue_d'].min() > 0 and prod_data['issue_d'].max() < int(time())
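The integer cast above relies on the column being stored with nanosecond resolution. A resolution-independent alternative, sketched here on made-up dates, subtracts the Unix epoch and floor-divides by one second, which is the approach described in the pandas time series documentation. The timezone-naive dates are treated as UTC, which is an assumption you should check against your own data.

```python
import pandas as pd

# Made-up issue dates; timezone-naive values are interpreted as UTC here.
issue_dates = pd.Series(pd.to_datetime(['2014-01-01', '2015-12-01']))

epoch = pd.Timestamp('1970-01-01')
unix_seconds = (issue_dates - epoch) // pd.Timedelta('1s')
print(unix_seconds.tolist())  # [1388534400, 1448928000]
```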
Supplying Model Predictions#
If we wish to also monitor the model’s behaviour we need to provide the model’s predictions for both the reference and production data in the required format and optionally also the model feature importance.
Currently, model predictions are only supported for regression and classification tasks. For classification tasks, it is preferable to provide the predicted probabilities per class rather than the predicted classes themselves.
# Loading the model (CatBoost Classifier)
import joblib
from urllib.request import urlopen
with urlopen('https://figshare.com/ndownloader/files/39316172') as f:
    model = joblib.load(f)
# Extracting feature importance - optional
feature_importance = pd.Series(model.feature_importances_ / sum(model.feature_importances_), index=model.feature_names_)
# Predicting on the reference data and production data
ref_predictions = model.predict_proba(train_df[features].fillna('NONE'))
prod_predictions = model.predict_proba(prod_data[features].fillna('NONE'))
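`predict_proba` returns one row per sample and one column per class. The sketch below uses a made-up probability array rather than the CatBoost model's output. It shows two basic sanity checks and how class predictions can be recovered from the probabilities, which is one reason the probabilities are the more informative input. The last lines mirror the normalisation applied to the feature importance above, again on invented numbers.

```python
import numpy as np

# Made-up probabilities for three samples and two classes (0 and 1).
proba = np.array([[0.9, 0.1], [0.35, 0.65], [0.5, 0.5]])

# Each row should be a probability distribution over the classes.
assert proba.ndim == 2
assert np.allclose(proba.sum(axis=1), 1.0)

# Predicted class = index of the highest probability (ties go to the lower index).
predicted_classes = proba.argmax(axis=1)
print(predicted_classes.tolist())  # [0, 1, 0]

# Normalising raw importances so that they sum to 1.
raw_importance = np.array([2.0, 6.0, 12.0])
normalized = raw_importance / raw_importance.sum()
print(normalized)
```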