Use Cases - Classifying Malicious URLs#

This notebook demonstrates how the deepchecks package can help you validate your basic data science workflow right out of the box!

The scenario is a real business use case: You work as a data scientist at a cyber security startup, and the company wants to provide the clients with a tool to automatically detect phishing attempts performed through emails and warn clients about them. The idea is to scan emails and determine for each web URL they include whether it points to a phishing-related web page or not.

Since phishing attempts are an always-adapting effort, static blacklists or whitelists composed of good or bad URLs seen in the past are simply not enough to make a good filtering system for the future. The way the company chose to deal with this challenge is to have you train a Machine Learning model to generalize what a phishing URL looks like from historic data!

To enable you to do this, the company’s security team has collected a set of benign (meaning OK, or Kosher) URLs and phishing URLs observed during 2019 (not necessarily in clients’ emails). They have also written a script that extracts features they believe should help discern phishing URLs from benign ones.

These features are divided into three subsets:

  • String Characteristics - Extracted from the URL string itself.

  • Domain Characteristics - Extracted by interacting with the domain provider.

  • Web Page Characteristics - Extracted from the content of the web page the URL points to.

The string characteristics are based on the way URLs are structured, and what their different parts do. Here is an informative illustration. You can read more at Mozilla’s What is a URL article. We’ll see the specific features soon.

from IPython.core.display import HTML
from IPython.display import Image

Image(url= "https://developer.mozilla.org/en-US/docs/Learn/Common_questions/What_is_a_URL/mdn-url-all.png")
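
To make the string characteristics more concrete, here is a minimal sketch of how a few of them could be derived from a URL with Python's standard library. It is not the security team's actual extraction script; the function name `string_features` and the example URL are hypothetical, and the exact counting rules (for instance, counting empty query parameters) are assumptions.

```python
from urllib.parse import urlsplit, parse_qsl


def string_features(url: str) -> dict:
    """Compute a few string-based features of a URL (illustrative only)."""
    parts = urlsplit(url)
    # Assumption: empty values such as "?a=&b=1" still count as parameters.
    params = parse_qsl(parts.query, keep_blank_values=True)
    return {
        "urlLength": len(url),                          # characters in the URL
        "numDigits": sum(ch.isdigit() for ch in url),   # digits anywhere in the URL
        "numParams": len(params),                       # query parameters
        "num_%20": url.count("%20"),                    # encoded spaces
        "num_@": url.count("@"),                        # '@' characters
    }


# Hypothetical URL used only for demonstration.
features = string_features("http://example.com/a%20b/page1.php?id=42&ref=mail")
print(features)
```

The keys mirror the column names used in the dataset, so the result can be compared with the feature list further down.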


(Note: This is a slightly synthetic dataset based on a great project by Rohith Ramakrishnan and others, accompanied by a blog post. The authors have released it under an open license per our request, and for that we are very grateful to them.)

Installing requirements

import sys
!{sys.executable} -m pip install deepchecks --quiet

Loading the data#

OK, let’s take a look at the data!

import numpy as np
import pandas as pd
import sklearn

import deepchecks

pd.set_option('display.max_columns', 45)
SEED = 832
np.random.seed(SEED)
from deepchecks.tabular.datasets.classification.phishing import load_data
df = load_data(data_format='dataframe', as_train_test=False)
df.shape

Out:

(11350, 25)
df.head(5)
target month scrape_date ext urlLength numDigits numParams num_%20 num_@ entropy has_ip hasHttp hasHttps urlIsLive dsr dse bodyLength numTitles numImages numLinks specialChars scriptLength sbr bscr sscr
0 0 1 2019-01-01 net 102 8 0 0 0 -4.384032 0 True False False 4921 191 32486 3 5 330 9419 23919 0.736286 0.289940 2.539442
1 0 1 2019-01-01 country 154 60 0 2 0 -3.566515 0 True False False 0 0 16199 0 4 39 2735 794 0.049015 0.168838 0.290311
2 0 1 2019-01-01 net 171 5 11 0 0 -4.608755 0 True False False 5374 104 103344 18 9 302 27798 83817 0.811049 0.268985 2.412174
3 0 1 2019-01-01 com 94 10 0 0 0 -4.548921 0 True False False 6107 466 34093 11 43 199 9087 19427 0.569824 0.266536 2.137889
4 0 1 2019-01-01 other 95 11 0 0 0 -4.717188 0 True False False 3819 928 202 1 0 0 39 0 0.000000 0.193069 0.000000


Here is the actual list of features:

df.columns

Out:

Index(['target', 'month', 'scrape_date', 'ext', 'urlLength', 'numDigits',
       'numParams', 'num_%20', 'num_@', 'entropy', 'has_ip', 'hasHttp',
       'hasHttps', 'urlIsLive', 'dsr', 'dse', 'bodyLength', 'numTitles',
       'numImages', 'numLinks', 'specialChars', 'scriptLength', 'sbr', 'bscr',
       'sscr'],
      dtype='object')

Feature List#

And here is a short explanation of each:

Feature Name | Feature Group | Description
target | Meta Features | 0 if the URL is benign, 1 if it is related to phishing
month | Meta Features | The month this URL was first encountered, as an int
scrape_date | Meta Features | The exact date this URL was first encountered
ext | String Characteristics | The domain extension
urlLength | String Characteristics | The number of characters in the URL
numDigits | String Characteristics | The number of digits in the URL
numParams | String Characteristics | The number of query parameters in the URL
num_%20 | String Characteristics | The number of ‘%20’ substrings in the URL
num_@ | String Characteristics | The number of @ characters in the URL
entropy | String Characteristics | The entropy of the URL
has_ip | String Characteristics | True if the URL string contains an IP address
hasHttp | Domain Characteristics | True if the URL’s domain supports http
hasHttps | Domain Characteristics | True if the URL’s domain supports https
urlIsLive | Domain Characteristics | The URL was live at the time of scraping
dsr | Domain Characteristics | The number of days since domain registration
dse | Domain Characteristics | The number of days since domain registration expired
bodyLength | Web Page Characteristics | The number of characters in the URL’s web page
numTitles | Web Page Characteristics | The number of HTML titles (H1/H2/…) in the page
numImages | Web Page Characteristics | The number of images in the page
numLinks | Web Page Characteristics | The number of links in the page
specialChars | Web Page Characteristics | The number of special characters in the page
scriptLength | Web Page Characteristics | The number of characters in scripts embedded in the page
sbr | Web Page Characteristics | The ratio of scriptLength to bodyLength (= scriptLength / bodyLength)
bscr | Web Page Characteristics | The ratio of specialChars to bodyLength (= specialChars / bodyLength)
sscr | Web Page Characteristics | The ratio of scriptLength to specialChars (= scriptLength / specialChars)
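
The three ratio features can be recomputed from the raw page counts, which is a handy sanity check. The sketch below uses the values of the first row shown by df.head(5) above (bodyLength 32486, specialChars 9419, scriptLength 23919). The helper names `safe_ratio` and `page_ratios` are hypothetical, and mapping a zero denominator to 0.0 is an assumption, chosen because rows whose page counts are all 0 show ratios of 0 in the data.

```python
def safe_ratio(numerator: float, denominator: float) -> float:
    # Assumption: a zero denominator yields 0.0 rather than an error or NaN.
    return numerator / denominator if denominator else 0.0


def page_ratios(body_length: int, special_chars: int, script_length: int) -> dict:
    """Recompute the ratio features from the raw page counts."""
    return {
        "sbr": safe_ratio(script_length, body_length),     # scriptLength / bodyLength
        "bscr": safe_ratio(special_chars, body_length),    # specialChars / bodyLength
        "sscr": safe_ratio(script_length, special_chars),  # scriptLength / specialChars
    }


# Values taken from the first row of the df.head(5) output.
ratios = page_ratios(body_length=32486, special_chars=9419, script_length=23919)
# A page with no content at all.
empty = page_ratios(body_length=0, special_chars=0, script_length=0)
print(ratios, empty)
```

The results agree with the sbr, bscr and sscr values of that row to six decimal places.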

Data Integrity with Deepchecks!#

The nice thing about the deepchecks package is that we can already use it out of the box! Instead of running a single check, we use a pre-defined test suite to run a host of data validation checks.

We think it’s valuable to start off with these types of suites as there are various issues we can identify at the get go just by looking at raw data.

We will first import the appropriate factory function from the deepchecks.tabular.suites module - in this case, an integrity suite tailored for a single dataset (as opposed to a split into train and test sets, for example) - and use it to create a new suite object:

from deepchecks.tabular.suites import single_dataset_integrity

integ_suite = single_dataset_integrity()

We will now run that suite on our data. While running on the native DataFrame is possible in some cases, it is recommended to wrap it in a deepchecks.tabular.Dataset object instead, to give the package a bit more context: which column is the label, whether there is a datetime column (here we set set_datetime_from_dataframe_index=True), and which features are categorical (here we pass cat_features=[] explicitly).

dataset = deepchecks.tabular.Dataset(df=df, label='target',
                                     set_datetime_from_dataframe_index=True, cat_features=[])
integ_suite.run(dataset)

Out:


Single Dataset Integrity Suite

The suite is composed of various checks such as: String Length Out Of Bounds, Outlier Sample Detection, Mixed Nulls, etc...
Each check may contain conditions (which will result in pass / fail / warning / error) as well as other outputs such as plots or tables.
Suites, checks and conditions can all be modified. Read more about custom suites.


Conditions Summary

Status | Check | Condition | More Info
Fail | Single Value in Column | Does not contain only a single value | Found columns with a single value: ['has_ip', 'urlIsLive']
Warning | Data Duplicates | Duplicate data ratio is not greater than 0% | Found 0.0088% duplicate data
Pass | Mixed Nulls | Not more than 1 different null types |
Pass | Mixed Data Types | Rare data types in column are either more than 10% or less than 1% of the data |
Pass | String Mismatch | No string variants |
Pass | String Length Out Of Bounds | Ratio of outliers not greater than 0% string length outliers |
Pass | Special Characters | Ratio of entirely special character samples not greater than 0.1% |
Pass | Conflicting Labels | Ambiguous sample ratio is not greater than 0% |

Check With Conditions Output

Single Value in Column

Check if there are columns which have only a single unique value in all rows.

Conditions Summary
Status | Condition | More Info
Fail | Does not contain only a single value | Found columns with a single value: ['has_ip', 'urlIsLive']
Additional Outputs
The following columns have only one unique value
 | has_ip | urlIsLive
Single unique value | 0 | False

Data Duplicates

Checks for duplicate samples in the dataset.

Conditions Summary
Status | Condition | More Info
Warning | Duplicate data ratio is not greater than 0% | Found 0.0088% duplicate data
Additional Outputs
0.0088% of data samples are duplicates.
Each row in the table shows an example of duplicate data and the number of times it appears.
Instances | Number of Duplicates | target | month | scrape_date | ext | urlLength | numDigits | numParams | num_%20 | num_@ | entropy | has_ip | hasHttp | hasHttps | urlIsLive | dsr | dse | bodyLength | numTitles | numImages | numLinks | specialChars | scriptLength | sbr | bscr | sscr
4696, 4719 | 2 | 0 | 6 | 2019-06-06 | other | 123 | 28 | 4 | 0 | 0 | -4.91 | 0 | True | False | False | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.00 | 0.00 | 0.00

Check Without Conditions Output


Other Checks That Weren't Displayed

Check | Reason
Outlier Sample Detection - Train Dataset | UFuncTypeError: Cannot cast ufunc 'true_divide' output from dtype('O') to dtype('float64') with casting rule 'same_kind'
Mixed Nulls | Nothing found
Mixed Data Types | Nothing found
String Mismatch | Nothing found
String Length Out Of Bounds | Nothing found
Special Characters | Nothing found
Conflicting Labels | Nothing found


Understanding the checks’ results!#

Ok, so we’ve got some interesting results! Even though this is quite a tidy dataset, even without any preprocessing, deepchecks has found two columns (has_ip and urlIsLive) that contain only a single value, as well as a pair of duplicate rows.

We also get a nice list of all checks that turned out ok, and what each check is about.

So nothing dramatic, but we will be sure to drop those useless columns. :)
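
As an illustration of what acting on these two findings could look like in pandas, here is a minimal sketch on a toy DataFrame (hypothetical values, not the tutorial data). In this tutorial the preprocessing is handled by the pipeline introduced below, so this only shows the idea behind the two checks.

```python
import pandas as pd

# Toy data: two constant columns and one duplicated row (rows 1 and 3).
toy = pd.DataFrame({
    "urlLength": [102, 154, 171, 154],
    "has_ip": [0, 0, 0, 0],
    "urlIsLive": [False, False, False, False],
    "target": [0, 1, 0, 1],
})

# Columns holding a single unique value carry no information for a model.
single_value_cols = [col for col in toy.columns if toy[col].nunique(dropna=False) == 1]

# Drop the constant columns, then drop exact duplicate rows.
cleaned = toy.drop(columns=single_value_cols).drop_duplicates()
print(single_value_cols)
print(cleaned)
```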

Preprocessing#

Let’s split the data into train and test sets first. Since we want to examine how well a model can generalize from the past to the future, we’ll simply assign the first months of the dataset to the training set, and the last few months to the test set.

raw_train_df = df[df.month <= 9]
len(raw_train_df)

Out:

8626
raw_test_df = df[df.month > 9]
len(raw_test_df)

Out:

2724

Ok! Let’s process the data real quick and see how some baseline classifiers perform!

We’ll just set the scrape date as our index, drop a few useless columns, one-hot encode our categorical ext column and scale all numeric data:

from deepchecks.tabular.datasets.classification.phishing import \
    get_url_preprocessor

pipeline = get_url_preprocessor()

Now we’ll fit on and transform the raw train dataframe:

train_df = pipeline.fit_transform(raw_train_df)
train_X = train_df.drop('target', axis=1)
train_y = train_df['target']
train_X.head(3)
urlLength numDigits numParams num_%20 num_@ entropy hasHttp hasHttps dsr dse bodyLength numTitles numImages numLinks specialChars scriptLength sbr bscr sscr ext_com ext_country ext_html ext_info ext_net ext_other ext_php
scrape_date
2019-01-01 -0.271569 -0.329581 -0.327303 -0.089699 -0.068846 0.314615 0.239243 -0.241671 0.280235 -0.356485 -0.125958 -0.255521 -0.264688 1.393957 -0.059321 -0.068217 0.753133 0.753298 -0.054849 -0.859105 -0.434899 -0.401599 -0.035733 3.553473 -0.426577 -0.226517
2019-01-01 0.917509 2.357675 -0.327303 5.663025 -0.068846 2.991389 0.239243 -0.241671 -1.093947 -0.629844 -0.254032 -0.344488 -0.290751 -0.358447 -0.269256 -0.282689 -1.087302 -0.414405 -0.174310 -0.859105 2.299385 -0.401599 -0.035733 -0.281415 -0.426577 -0.226517
2019-01-01 1.306246 -0.484615 6.957823 -0.089699 -0.068846 -0.421190 0.239243 -0.241671 0.406734 -0.480999 0.431238 0.189313 -0.160433 1.225340 0.517939 0.487306 0.953338 0.551243 -0.061609 -0.859105 -0.434899 -0.401599 -0.035733 3.553473 -0.426577 -0.226517


And apply the same fitted preprocessing pipeline (with the fitted scaler, for example) to the test dataframe:

test_df = pipeline.transform(raw_test_df)
test_X = test_df.drop('target', axis=1)
test_y = test_df['target']
test_X.head(3)
urlLength numDigits numParams num_%20 num_@ entropy hasHttp hasHttps dsr dse bodyLength numTitles numImages numLinks specialChars scriptLength sbr bscr sscr ext_com ext_country ext_html ext_info ext_net ext_other ext_php
scrape_date
2019-10-01 -0.500238 -0.691327 -0.327303 -0.089699 -0.068846 0.956667 0.239243 -0.241671 -1.093947 -0.629844 -0.381413 -0.344488 -0.395006 -0.593305 -0.355159 -0.290053 -1.218560 -2.042381 -0.189730 -0.859105 2.299385 -0.401599 -0.035733 -0.281415 -0.426577 -0.226517
2019-10-01 0.002834 0.238877 -0.327303 -0.089699 -0.068846 -0.498665 0.239243 -0.241671 -1.093947 -0.629844 10.879221 -0.136899 1.533700 0.153424 9.579742 8.281871 0.509814 0.087470 -0.034532 1.164002 -0.434899 -0.401599 -0.035733 -0.281415 -0.426577 -0.226517
2019-10-01 -0.614572 0.342233 -0.327303 -0.089699 -0.068846 -0.030503 0.239243 -0.241671 -0.247266 -0.266319 -0.200150 -0.314833 -0.082243 -0.448777 -0.127258 -0.174697 0.020147 0.559584 -0.098683 1.164002 -0.434899 -0.401599 -0.035733 -0.281415 -0.426577 -0.226517


from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

hyperparameters = {'penalty': 'l2', 'fit_intercept': True, 'random_state': SEED, 'C': 0.009}
logreg = LogisticRegression(**hyperparameters)
logreg.fit(train_X, train_y);
pred_y = logreg.predict(test_X)
accuracy_score(test_y, pred_y)

Out:

0.9698972099853157

Ok, so we’ve got a nice accuracy score from the get go! Let’s see what deepchecks can tell us about our model…

from deepchecks.tabular.suites import train_test_validation

# Create the suite object that is run below
vsuite = train_test_validation()

Now that we have separate train and test DataFrames, we will create two deepchecks.tabular.Dataset objects so that this suite and the next one can treat the train and test dataframes according to their roles. Notice that here we pass the label as a column instead of a column name, because we’ve separated the feature DataFrame from the target.

ds_train = deepchecks.tabular.Dataset(df=train_X, label=train_y, set_datetime_from_dataframe_index=True,
                                      cat_features=[])
ds_test = deepchecks.tabular.Dataset(df=test_X, label=test_y, set_datetime_from_dataframe_index=True, cat_features=[])

Now we just have to provide the run method of the suite object with both the model and the Dataset objects.

vsuite.run(model=logreg, train_dataset=ds_train, test_dataset=ds_test)

Out:


Train Test Validation Suite

The suite is composed of various checks such as: Identifier Leakage, Train Test Label Drift, Train Test Samples Mix, etc...
Each check may contain conditions (which will result in pass / fail / warning / error) as well as other outputs such as plots or tables.
Suites, checks and conditions can all be modified. Read more about custom suites.


Conditions Summary

Status | Check | Condition | More Info
Fail | Date Train-Test Leakage (overlap) | Date leakage ratio is not greater than 0% | Found 100% leaked dates
Pass | Train Test Drift | PSI <= 0.2 and Earth Mover's Distance <= 0.1 |
Pass | Train Test Label Drift | PSI <= 0.2 and Earth Mover's Distance <= 0.1 for label drift |
Pass | Whole Dataset Drift | Drift value is not greater than 0.25 |
Pass | Dominant Frequency Change | Change in ratio of dominant value in data is not greater than 25% |
Pass | Category Mismatch Train Test | Ratio of samples with a new category is not greater than 0% |
Pass | New Label Train Test | Number of new label values is not greater than 0 |
Pass | String Mismatch Comparison | No new variants allowed in test data |
Pass | Datasets Size Comparison | Test-Train size ratio is not smaller than 0.01 |
Pass | Date Train-Test Leakage (duplicates) | Date leakage ratio is not greater than 0% |
Pass | Single Feature Contribution Train-Test | Train-Test features' Predictive Power Score difference is not greater than 0.2 |
Pass | Single Feature Contribution Train-Test | Train features' Predictive Power Score is not greater than 0.7 |
Pass | Train Test Samples Mix | Percentage of test data samples that appear in train data not greater than 10% |
Pass | Identifier Leakage - Train Dataset | Identifier columns PPS is not greater than 0 |
Pass | Identifier Leakage - Test Dataset | Identifier columns PPS is not greater than 0 |

Check With Conditions Output

Train Test Drift

Calculate drift between train dataset and test dataset per feature, using statistical measures.

Conditions Summary
Status | Condition | More Info
Pass | PSI <= 0.2 and Earth Mover's Distance <= 0.1 |
Additional Outputs
The Drift score is a measure for the difference between two distributions, in this check - the test and train distributions.
The check shows the drift score and distributions for the features, sorted by feature importance and showing only the top 5 features, according to feature importance.
If available, the plot titles also show the feature importance (FI) rank.
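
To give a feel for the PSI threshold used in this condition, here is a minimal sketch of the Population Stability Index computed over pre-binned counts. It is not deepchecks' implementation: the function name `psi`, the bin counts, and the small `eps` used to avoid taking the logarithm of zero are all assumptions made for illustration.

```python
import math


def psi(expected_counts, actual_counts, eps=1e-6):
    """Population Stability Index between two binned distributions."""
    expected_total = sum(expected_counts)
    actual_total = sum(actual_counts)
    score = 0.0
    for expected, actual in zip(expected_counts, actual_counts):
        # Convert counts to fractions; clip at eps so empty bins do not break the log.
        e_frac = max(expected / expected_total, eps)
        a_frac = max(actual / actual_total, eps)
        score += (a_frac - e_frac) * math.log(a_frac / e_frac)
    return score


# Hypothetical bin counts for one feature in train (expected) and test (actual).
same = psi([50, 30, 20], [50, 30, 20])
shifted = psi([50, 30, 20], [20, 30, 50])
print(same, shifted)
```

Identical distributions give a PSI of 0, while the reversed distribution in the second call lands well above the 0.2 threshold named in the condition.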

Train Test Label Drift

Calculate label drift between train dataset and test dataset, using statistical measures.

Conditions Summary
Status | Condition | More Info
Pass | PSI <= 0.2 and Earth Mover's Distance <= 0.1 for label drift |
Additional Outputs
The Drift score is a measure for the difference between two distributions, in this check - the test and train distributions.
The check shows the drift score and distributions for the label.

Whole Dataset Drift

Calculate drift between the entire train and test datasets using a model trained to distinguish between them.

Conditions Summary
Status | Condition | More Info
Pass | Drift value is not greater than 0.25 |
Additional Outputs
The shown features are the features that are most important for the domain classifier - the domain_classifier trained to distinguish between the train and test datasets.
The percents of explained dataset difference are the importance values for the feature calculated using `permutation_importance`.

Main features contributing to drift

* showing only the top 3 columns, you can change it using n_top_columns param

Datasets Size Comparison

Verify test dataset size comparing it to the train dataset size.

Conditions Summary
Status | Condition | More Info
Pass | Test-Train size ratio is not smaller than 0.01 |
Additional Outputs
 | Train | Test
Size | 8626 | 2724

Date Train-Test Leakage (overlap)

Check test data that is dated earlier than latest date in train.

Conditions Summary
Status | Condition | More Info
Fail | Date leakage ratio is not greater than 0% | Found 100% leaked dates
Additional Outputs
100% of test data dates before last training data date (2020/01/15 00:00:00.000000 )

Single Feature Contribution Train-Test

Return the Predictive Power Score of all features, in order to estimate each feature's ability to predict the label.

Conditions Summary
Status | Condition | More Info
Pass | Train-Test features' Predictive Power Score difference is not greater than 0.2 |
Pass | Train features' Predictive Power Score is not greater than 0.7 |
Additional Outputs
The Predictive Power Score (PPS) is used to estimate the ability of a feature to predict the label by itself. (Read more about Predictive Power Score)
In the graph above, we should suspect we have problems in our data if:
1. Train dataset PPS values are high:
Can indicate that this feature's success in predicting the label is actually due to data leakage,
meaning that the feature holds information that is based on the label to begin with.
2. Large difference between train and test PPS (train PPS is larger):
An even more powerful indication of data leakage, as a feature that was powerful in train but not in test
can be explained by leakage in train that is not relevant to a new dataset.
3. Large difference between test and train PPS (test PPS is larger):
An anomalous value, could indicate drift in test dataset that caused a coincidental correlation to the target label.

Train Test Samples Mix

Detect samples in the test data that appear also in training data.

Conditions Summary
Status | Condition | More Info
Pass | Percentage of test data samples that appear in train data not greater than 10% |
Additional Outputs
0.29% (8 / 2724) of test data samples appear in train data
  urlLength numDigits numParams num_%20 num_@ entropy hasHttp hasHttps dsr dse bodyLength numTitles numImages numLinks specialChars scriptLength sbr bscr sscr ext_com ext_country ext_html ext_info ext_net ext_other ext_php target
Train indices: 2019-01-02 00:00:00, 2019-02-0.. Tot. 2 Test indices: 2019-11-20 00:00:00, 2019-11-2.. Tot. 2 0.85 -0.43 2.32 -0.09 -0.07 -1.53 0.24 -0.24 -1.09 -0.63 -0.38 -0.34 -0.40 -0.59 -0.36 -0.29 -1.22 -2.04 -0.19 -0.86 2.30 -0.40 -0.04 -0.28 -0.43 -0.23 0
Train indices: 2019-01-06 00:00:00 Test indices: 2019-11-06 00:00:00 -0.41 -0.02 2.98 -0.09 -0.07 -0.42 0.24 -0.24 -1.09 -0.63 -0.38 -0.34 -0.40 -0.59 -0.36 -0.29 -1.22 -2.04 -0.19 -0.86 -0.43 -0.40 -0.04 -0.28 -0.43 4.41 0
Train indices: 2019-09-24 00:00:00 Test indices: 2019-10-02 00:00:00 -0.18 -0.59 4.97 -0.09 -0.07 -0.03 0.24 -0.24 -0.71 -0.52 -0.38 -0.31 -0.40 -0.59 -0.36 -0.29 -1.22 -2.04 -0.19 -0.86 -0.43 -0.40 -0.04 -0.28 -0.43 4.41 0
Train indices: 2019-08-15 00:00:00 Test indices: 2019-12-02 00:00:00 -0.09 -0.54 4.97 -0.09 -0.07 -0.04 0.24 -0.24 -0.71 -0.52 -0.38 -0.31 -0.40 -0.59 -0.36 -0.29 -1.22 -2.04 -0.19 -0.86 -0.43 -0.40 -0.04 -0.28 -0.43 4.41 0
Train indices: 2019-04-09 00:00:00, 2019-05-1.. Tot. 4 Test indices: 2019-12-03 00:00:00 0.21 0.70 2.32 -0.09 -0.07 -1.40 0.24 -0.24 -1.09 -0.63 -0.38 -0.34 -0.40 -0.59 -0.36 -0.29 -1.22 -2.04 -0.19 -0.86 -0.43 -0.40 -0.04 -0.28 2.34 -0.23 0
Train indices: 2019-04-01 00:00:00 Test indices: 2019-12-14 00:00:00 0.21 0.70 2.32 -0.09 -0.07 -1.35 0.24 -0.24 -1.09 -0.63 -0.38 -0.34 -0.40 -0.59 -0.36 -0.29 -1.22 -2.04 -0.19 -0.86 -0.43 -0.40 -0.04 -0.28 2.34 -0.23 0
Train indices: 2019-05-01 00:00:00 Test indices: 2019-11-26 00:00:00 1.65 -0.28 2.32 -0.09 -0.07 -1.92 0.24 -0.24 -1.09 -0.63 -0.38 -0.34 -0.40 -0.59 -0.36 -0.29 -1.22 -2.04 -0.19 -0.86 -0.43 -0.40 -0.04 -0.28 2.34 -0.23 0
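
The idea behind this check can be sketched in a few lines of plain Python: treat every sample as a tuple of its feature and label values, and count the test samples that also occur in the train set. This is only an illustration under that assumption, with hypothetical toy rows and a hypothetical function name `samples_mix`; it is not how deepchecks implements the check.

```python
def samples_mix(train_rows, test_rows):
    """Return (ratio, leaked): the share of test rows that also appear in train."""
    if not test_rows:
        return 0.0, []
    # A set of tuples gives constant-time membership tests.
    train_set = {tuple(row) for row in train_rows}
    leaked = [row for row in test_rows if tuple(row) in train_set]
    return len(leaked) / len(test_rows), leaked


# Hypothetical toy rows: (feature_1, feature_2, target).
train_rows = [(0.85, -0.43, 0), (0.21, 0.70, 0), (1.65, -0.28, 0)]
test_rows = [(0.21, 0.70, 0), (-0.50, -0.69, 1), (0.00, 0.24, 0), (-0.61, 0.34, 0)]

ratio, leaked = samples_mix(train_rows, test_rows)
print(f"{ratio:.2%} ({len(leaked)} / {len(test_rows)}) of test data samples appear in train data")
```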

Check Without Conditions Output


Other Checks That Weren't Displayed

Check | Reason
Index Train Test Leakage | There is no index defined to use. Did you pass a DataFrame instead of a Dataset?
Dominant Frequency Change | Nothing found
Category Mismatch Train Test | Nothing found
New Label Train Test | Nothing found
String Mismatch Comparison | Nothing found
Date Train-Test Leakage (duplicates) | Nothing found
Identifier Leakage - Train Dataset | Nothing found
Identifier Leakage - Test Dataset | Nothing found


Understanding the checks’ results!#

Whoa! It looks like we have some time leakage!

The Conditions Summary section showed that the Date Train-Test Leakage (overlap) check was the only failed check. The Additional Outputs section helped us understand that the latest date in the train set belongs to January 2020!

It seems some entries from January 2020 made their way into the train set. We assumed the month column was enough to split the data (which it would have been, had all the data indeed been from 2019), but as in real life, things were a bit messy. We’ll adjust our preprocessing real quick, and with methodological errors out of the way we’ll get to checking our model’s performance.

It is also worth mentioning that deepchecks found that urlLength is the only feature that can, on its own, predict the target with some measure of success. This is worth investigating!
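
The logic of the failed overlap check is easy to reproduce by hand: find the latest date in the train set and measure how many test dates fall before it. The sketch below does this on a handful of hypothetical dates; the function name `date_overlap_ratio` and the use of a strict "earlier than" comparison are assumptions, not deepchecks' exact implementation.

```python
from datetime import date


def date_overlap_ratio(train_dates, test_dates):
    """Share of test dates that are earlier than the latest train date."""
    latest_train = max(train_dates)
    leaked = sum(d < latest_train for d in test_dates)
    return leaked / len(test_dates)


# Hypothetical dates: one stray January 2020 entry sits in the train set.
train_dates = [date(2019, 1, 1), date(2019, 9, 30), date(2020, 1, 15)]
test_dates = [date(2019, 10, 1), date(2019, 11, 20), date(2019, 12, 14)]

ratio_with_stray = date_overlap_ratio(train_dates, test_dates)
ratio_clean = date_overlap_ratio([d for d in train_dates if d.year == 2019], test_dates)
print(ratio_with_stray, ratio_clean)
```

A single stray 2020 entry in the train set is enough to push the ratio to 100%, which matches what the check reported; once it is removed, the ratio drops to zero.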

Adjusting our preprocessing and refitting the model#

Let’s drop every row from 2020 from the raw dataframe and take it from there:

df = df[~df['scrape_date'].str.contains('2020')]
df.shape

Out:

(10896, 25)
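
One thing to keep in mind with this kind of fix: reassigning df does not change DataFrames that were derived from it earlier, such as raw_train_df and raw_test_df, so the split has to be repeated on the filtered data for the change to reach the pipeline. A minimal sketch on a toy DataFrame (hypothetical values, with the same column names as the tutorial data):

```python
import pandas as pd

toy = pd.DataFrame({
    "scrape_date": ["2019-01-01", "2019-09-30", "2019-10-01", "2019-12-14", "2020-01-15"],
    "month": [1, 9, 10, 12, 1],
    "target": [0, 1, 0, 1, 0],
})

# Split taken before filtering: the January 2020 row lands in train.
stale_train = toy[toy.month <= 9]

# Filter out 2020, then repeat the split on the filtered frame.
filtered = toy[~toy["scrape_date"].str.contains("2020")]
fresh_train = filtered[filtered.month <= 9]
fresh_test = filtered[filtered.month > 9]

print(len(stale_train), len(fresh_train), len(fresh_test))
```

In the toy example the stale split still contains the 2020 row, while the repeated split does not, and every train date precedes every test date.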
pipeline = get_url_preprocessor()
train_df = pipeline.fit_transform(raw_train_df)
train_X = train_df.drop('target', axis=1)
train_y = train_df['target']
train_X.head(3)
urlLength numDigits numParams num_%20 num_@ entropy hasHttp hasHttps dsr dse bodyLength numTitles numImages numLinks specialChars scriptLength sbr bscr sscr ext_com ext_country ext_html ext_info ext_net ext_other ext_php
scrape_date
2019-01-01 -0.271569 -0.329581 -0.327303 -0.089699 -0.068846 0.314615 0.239243 -0.241671 0.280235 -0.356485 -0.125958 -0.255521 -0.264688 1.393957 -0.059321 -0.068217 0.753133 0.753298 -0.054849 -0.859105 -0.434899 -0.401599 -0.035733 3.553473 -0.426577 -0.226517
2019-01-01 0.917509 2.357675 -0.327303 5.663025 -0.068846 2.991389 0.239243 -0.241671 -1.093947 -0.629844 -0.254032 -0.344488 -0.290751 -0.358447 -0.269256 -0.282689 -1.087302 -0.414405 -0.174310 -0.859105 2.299385 -0.401599 -0.035733 -0.281415 -0.426577 -0.226517
2019-01-01 1.306246 -0.484615 6.957823 -0.089699 -0.068846 -0.421190 0.239243 -0.241671 0.406734 -0.480999 0.431238 0.189313 -0.160433 1.225340 0.517939 0.487306 0.953338 0.551243 -0.061609 -0.859105 -0.434899 -0.401599 -0.035733 3.553473 -0.426577 -0.226517


test_df = pipeline.transform(raw_test_df)
test_X = test_df.drop('target', axis=1)
test_y = test_df['target']
test_X.head(3)
urlLength numDigits numParams num_%20 num_@ entropy hasHttp hasHttps dsr dse bodyLength numTitles numImages numLinks specialChars scriptLength sbr bscr sscr ext_com ext_country ext_html ext_info ext_net ext_other ext_php
scrape_date
2019-10-01 -0.500238 -0.691327 -0.327303 -0.089699 -0.068846 0.956667 0.239243 -0.241671 -1.093947 -0.629844 -0.381413 -0.344488 -0.395006 -0.593305 -0.355159 -0.290053 -1.218560 -2.042381 -0.189730 -0.859105 2.299385 -0.401599 -0.035733 -0.281415 -0.426577 -0.226517
2019-10-01 0.002834 0.238877 -0.327303 -0.089699 -0.068846 -0.498665 0.239243 -0.241671 -1.093947 -0.629844 10.879221 -0.136899 1.533700 0.153424 9.579742 8.281871 0.509814 0.087470 -0.034532 1.164002 -0.434899 -0.401599 -0.035733 -0.281415 -0.426577 -0.226517
2019-10-01 -0.614572 0.342233 -0.327303 -0.089699 -0.068846 -0.030503 0.239243 -0.241671 -0.247266 -0.266319 -0.200150 -0.314833 -0.082243 -0.448777 -0.127258 -0.174697 0.020147 0.559584 -0.098683 1.164002 -0.434899 -0.401599 -0.035733 -0.281415 -0.426577 -0.226517


logreg.fit(train_X, train_y)

Out:

LogisticRegression(C=0.009, random_state=832)
pred_y = logreg.predict(test_X)
accuracy_score(test_y, pred_y)

Out:

0.9698972099853157

Deepchecks’ Performance Checks#

Ok! Now that we’re back on track, let’s run some performance checks to see how we did.

from deepchecks.tabular.suites import model_evaluation
msuite = model_evaluation()
ds_train = deepchecks.tabular.Dataset(df=train_X, label=train_y, set_datetime_from_dataframe_index=True, cat_features=[])
ds_test = deepchecks.tabular.Dataset(df=test_X, label=test_y, set_datetime_from_dataframe_index=True, cat_features=[])
msuite.run(model=logreg, train_dataset=ds_train, test_dataset=ds_test)

Out:


Model Evaluation Suite

The suite is composed of various checks such as: Roc Report, Regression Systematic Error, Calibration Score, etc...
Each check may contain conditions (which will result in pass / fail / warning / error) as well as other outputs such as plots or tables.
Suites, checks and conditions can all be modified. Read more about custom suites.


Conditions Summary

| Status | Check | Condition | More Info |
| --- | --- | --- | --- |
| Fail | Simple Model Comparison | Model performance gain over simple model is not less than 10% | Found metrics with gain below threshold: {'F1': {0: '2.34%', 1: '4.65%'}} |
| Warning | Model Error Analysis | The performance difference of the detected segments must not be greater than 5% | Found change in Accuracy in features above threshold: {'urlLength': '31.2%'} |
| Pass | Performance Report | Train-Test scores relative degradation is not greater than 0.1 | |
| Pass | ROC Report - Train Dataset | AUC score for all the classes is not less than 0.7 | |
| Pass | ROC Report - Test Dataset | AUC score for all the classes is not less than 0.7 | |
| Pass | Unused Features | Number of high variance unused features is not greater than 5 | |
| Pass | Model Inference Time - Train Dataset | Average model inference time for one sample is not greater than 0.001 | |
| Pass | Model Inference Time - Test Dataset | Average model inference time for one sample is not greater than 0.001 | |

Check With Conditions Output

Performance Report

Summarize given scores on a dataset and model.

Conditions Summary
| Status | Condition | More Info |
| --- | --- | --- |
| Pass | Train-Test scores relative degradation is not greater than 0.1 | |
Additional Outputs

ROC Report - Train Dataset

Calculate the ROC curve for each class.

Conditions Summary
| Status | Condition | More Info |
| --- | --- | --- |
| Pass | AUC score for all the classes is not less than 0.7 | |
Additional Outputs
The marked points are the optimal threshold cut-off points. They are determined using Youden's index defined as sensitivity + specificity - 1

ROC Report - Test Dataset

Calculate the ROC curve for each class.

Conditions Summary
| Status | Condition | More Info |
| --- | --- | --- |
| Pass | AUC score for all the classes is not less than 0.7 | |
Additional Outputs
The marked points are the optimal threshold cut-off points. They are determined using Youden's index defined as sensitivity + specificity - 1

Simple Model Comparison

Compare given model score to simple model score (according to given model type).

Conditions Summary
| Status | Condition | More Info |
| --- | --- | --- |
| Fail | Model performance gain over simple model is not less than 10% | Found metrics with gain below threshold: {'F1': {0: '2.34%', 1: '4.65%'}} |
Additional Outputs

Model Error Analysis

Find features that best split the data into segments of high and low model error.

Conditions Summary
| Status | Condition | More Info |
| --- | --- | --- |
| Warning | The performance difference of the detected segments must not be greater than 5% | Found change in Accuracy in features above threshold: {'urlLength': '31.2%'} |
Additional Outputs
The following graphs show the distribution of error for top features that are most useful for distinguishing high error samples from low error samples.

Unused Features

Detect features that are nearly unused by the model.

Conditions Summary
| Status | Condition | More Info |
| --- | --- | --- |
| Pass | Number of high variance unused features is not greater than 5 | |
Additional Outputs
Features above the line are a sample of the most important features, while the features below the line are the unused features with highest variance, as defined by check parameters

Model Inference Time - Train Dataset

Measure model average inference time (in seconds) per sample.

Conditions Summary
| Status | Condition | More Info |
| --- | --- | --- |
| Pass | Average model inference time for one sample is not greater than 0.001 | |
Additional Outputs
Average model inference time for one sample (in seconds): 1.09e-06

Model Inference Time - Test Dataset

Measure model average inference time (in seconds) per sample.

Conditions Summary
| Status | Condition | More Info |
| --- | --- | --- |
| Pass | Average model inference time for one sample is not greater than 0.001 | |
Additional Outputs
Average model inference time for one sample (in seconds): 9.9e-07

Check Without Conditions Output

Confusion Matrix Report - Train Dataset

Calculate the confusion matrix of the model on the given dataset.

Additional Outputs

Confusion Matrix Report - Test Dataset

Calculate the confusion matrix of the model on the given dataset.

Additional Outputs

Calibration Metric - Train Dataset

Calculate the calibration curve with brier score for each class.

Additional Outputs
Calibration curves (also known as reliability diagrams) compare how well the probabilistic predictions of a binary classifier are calibrated. It plots the true frequency of the positive label against its predicted probability, for binned predictions.
The Brier score metric may be used to assess how well a classifier is calibrated. For more info, please visit https://en.wikipedia.org/wiki/Brier_score

Calibration Metric - Test Dataset

Calculate the calibration curve with brier score for each class.

Additional Outputs
Calibration curves (also known as reliability diagrams) compare how well the probabilistic predictions of a binary classifier are calibrated. It plots the true frequency of the positive label against its predicted probability, for binned predictions.
The Brier score metric may be used to assess how well a classifier is calibrated. For more info, please visit https://en.wikipedia.org/wiki/Brier_score

Other Checks That Weren't Displayed

| Check | Reason |
| --- | --- |
| Regression Systematic Error - Train Dataset | Check is relevant for models of type ['regression'], but received model of type 'binary' |
| Regression Systematic Error - Test Dataset | Check is relevant for models of type ['regression'], but received model of type 'binary' |
| Regression Error Distribution - Train Dataset | Check is relevant for models of type ['regression'], but received model of type 'binary' |
| Regression Error Distribution - Test Dataset | Check is relevant for models of type ['regression'], but received model of type 'binary' |
| Boosting Overfit | Check is relevant for Boosting models of type ('AdaBoostClassifier', 'GradientBoostingClassifier', 'LGBMClassifier', 'XGBClassifier', 'CatBoostClassifier', 'AdaBoostRegressor', 'GradientBoostingRegressor', 'LGBMRegressor', 'XGBRegressor', 'CatBoostRegressor'), but received model of type LogisticRegression |


Understanding the checks’ results!#

This time two conditions did not pass:

  • Simple Model Comparison - This check makes sure our model outperforms a very simple model to some degree. Having it fail means we might have a serious problem.

  • Model Error Analysis - This check analyses model errors and looks for a segmentation of our data that is informative for error analysis. It seems to have found one using the urlLength feature, with accuracy differing by 31.2% between the detected segments. We’ll look into it soon enough.
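As a rough illustration of what segmenting data by error means, the sketch below splits a handful of hypothetical samples on a `url_length` threshold and compares accuracy in the two segments. The data, the threshold of 50 and the helper `segment_accuracy` are all made up for illustration; deepchecks finds its segments with its own error model, which this sketch does not reproduce.

```python
# Hypothetical (url_length, true_label, predicted_label) triples.
samples = [
    (20, 0, 0), (25, 0, 0), (30, 0, 0), (35, 0, 0), (40, 1, 1),
    (80, 1, 0), (90, 1, 0), (95, 0, 0), (110, 1, 1), (120, 1, 0),
]


def segment_accuracy(rows):
    """Fraction of rows whose prediction matches the true label."""
    return sum(true == pred for _, true, pred in rows) / len(rows)


THRESHOLD = 50  # illustrative cut-off, not taken from the check's output
short_urls = [row for row in samples if row[0] < THRESHOLD]
long_urls = [row for row in samples if row[0] >= THRESHOLD]

acc_short = segment_accuracy(short_urls)
acc_long = segment_accuracy(long_urls)
print(f"short: {acc_short:.2f}, long: {acc_long:.2f}, gap: {acc_short - acc_long:.2f}")
```

A large gap between the two segments is the kind of signal the check reports: the model behaves very differently on one part of the feature's range.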

Looking at the F1 metric plots for both our model and a simple one, we see that their performance is almost identical! How can this be? Fortunately, the confusion matrices automagically generated for both the training and test sets help us understand what has happened.

Our evidently over-regularized classifier was over-impressed by the majority class (0, or non-malicious URL), and predicted a value of 0 for almost all samples in both the train and the test set, which yielded a seemingly-impressive 97% accuracy on the test set just due to the imbalanced nature of the problem.

deepchecks also generated plots for F1, precision and recall on both the train and test sets as part of the performance report. These show recall scores of almost zero for both sets, which tells the same story.
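To make the accuracy trap concrete, here is a minimal sketch in plain Python. The label counts are hypothetical (chosen to give the roughly 3% positive rate implied by the 97% accuracy above), and the all-zero predictions are a stand-in for a classifier that always predicts the majority class, not the model trained in this notebook.

```python
# Hypothetical imbalanced test set: 970 benign (0) and 30 phishing (1) URLs.
y_true = [0] * 970 + [1] * 30

# Stand-in for an over-regularized model that always predicts the majority class.
y_pred = [0] * len(y_true)

# Confusion matrix counts, with class 1 (phishing) as the positive class.
tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))

accuracy = (tp + tn) / len(y_true)
recall = tp / (tp + fn) if (tp + fn) else 0.0
precision = tp / (tp + fp) if (tp + fp) else 0.0
f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0

print(f"accuracy={accuracy:.2f} recall={recall:.2f} f1={f1:.2f}")
# prints: accuracy=0.97 recall=0.00 f1=0.00
```

Accuracy looks excellent while recall and F1 for the phishing class are zero, which is why the per-class metrics and confusion matrices are the more useful view here.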

Trying out a different classifier#

So let’s throw something with a bit more expressive power at the problem - a decision tree!

from sklearn.tree import DecisionTreeClassifier

model = DecisionTreeClassifier(criterion='entropy', splitter='random', random_state=SEED)
model.fit(train_X, train_y)
msuite.run(model=model, train_dataset=ds_train, test_dataset=ds_test)

Out:

Model Evaluation Suite

The suite is composed of various checks such as: Roc Report, Regression Systematic Error, Calibration Score, etc...
Each check may contain conditions (which will result in pass / fail / warning / error) as well as other outputs such as plots or tables.
Suites, checks and conditions can all be modified. Read more about custom suites.


Conditions Summary

| Status | Check | Condition | More Info |
| --- | --- | --- | --- |
| Fail | Performance Report | Train-Test scores relative degradation is not greater than 0.1 | F1 for class 1 (train=1 test=0.82); Precision for class 1 (train=1 test=0.79); Recall for class 1 (train=1 test=0.85) |
| Warning | Unused Features | Number of high variance unused features is not greater than 5 | Found number of unused high variance features above threshold: ['scriptLength', 'sscr', 'ext_info', 'numImages', 'num_@', 'hasHttp'] |
| Pass | ROC Report - Train Dataset | AUC score for all the classes is not less than 0.7 | |
| Pass | ROC Report - Test Dataset | AUC score for all the classes is not less than 0.7 | |
| Pass | Simple Model Comparison | Model performance gain over simple model is not less than 10% | |
| Pass | Model Inference Time - Train Dataset | Average model inference time for one sample is not greater than 0.001 | |
| Pass | Model Inference Time - Test Dataset | Average model inference time for one sample is not greater than 0.001 | |

Check With Conditions Output

Performance Report

Summarize given scores on a dataset and model.

Conditions Summary
| Status | Condition | More Info |
| --- | --- | --- |
| Fail | Train-Test scores relative degradation is not greater than 0.1 | F1 for class 1 (train=1 test=0.82); Precision for class 1 (train=1 test=0.79); Recall for class 1 (train=1 test=0.85) |
Additional Outputs

ROC Report - Train Dataset

Calculate the ROC curve for each class.

Conditions Summary
| Status | Condition | More Info |
| --- | --- | --- |
| Pass | AUC score for all the classes is not less than 0.7 | |
Additional Outputs
The marked points are the optimal threshold cut-off points. They are determined using Youden's index defined as sensitivity + specificity - 1

ROC Report - Test Dataset

Calculate the ROC curve for each class.

Conditions Summary
| Status | Condition | More Info |
| --- | --- | --- |
| Pass | AUC score for all the classes is not less than 0.7 | |
Additional Outputs
The marked points are the optimal threshold cut-off points. They are determined using Youden's index defined as sensitivity + specificity - 1

Simple Model Comparison

Compare given model score to simple model score (according to given model type).

Conditions Summary
| Status | Condition | More Info |
| --- | --- | --- |
| Pass | Model performance gain over simple model is not less than 10% | |
Additional Outputs

Unused Features

Detect features that are nearly unused by the model.

Conditions Summary
| Status | Condition | More Info |
| --- | --- | --- |
| Warning | Number of high variance unused features is not greater than 5 | Found number of unused high variance features above threshold: ['scriptLength', 'sscr', 'ext_info', 'numImages', 'num_@', 'hasHttp'] |
Additional Outputs
Features above the line are a sample of the most important features, while the features below the line are the unused features with highest variance, as defined by check parameters

Model Inference Time - Train Dataset

Measure model average inference time (in seconds) per sample.

Conditions Summary
| Status | Condition | More Info |
| --- | --- | --- |
| Pass | Average model inference time for one sample is not greater than 0.001 | |
Additional Outputs
Average model inference time for one sample (in seconds): 1.07e-06

Model Inference Time - Test Dataset

Measure model average inference time (in seconds) per sample.

Conditions Summary
| Status | Condition | More Info |
| --- | --- | --- |
| Pass | Average model inference time for one sample is not greater than 0.001 | |
Additional Outputs
Average model inference time for one sample (in seconds): 1.01e-06

Check Without Conditions Output

Confusion Matrix Report - Train Dataset

Calculate the confusion matrix of the model on the given dataset.

Additional Outputs

Confusion Matrix Report - Test Dataset

Calculate the confusion matrix of the model on the given dataset.

Additional Outputs

Calibration Metric - Train Dataset

Calculate the calibration curve with brier score for each class.

Additional Outputs
Calibration curves (also known as reliability diagrams) compare how well the probabilistic predictions of a binary classifier are calibrated. It plots the true frequency of the positive label against its predicted probability, for binned predictions.
The Brier score metric may be used to assess how well a classifier is calibrated. For more info, please visit https://en.wikipedia.org/wiki/Brier_score

Calibration Metric - Test Dataset

Calculate the calibration curve with brier score for each class.

Additional Outputs
Calibration curves (also known as reliability diagrams) compare how well the probabilistic predictions of a binary classifier are calibrated. It plots the true frequency of the positive label against its predicted probability, for binned predictions.
The Brier score metric may be used to assess how well a classifier is calibrated. For more info, please visit https://en.wikipedia.org/wiki/Brier_score

Other Checks That Weren't Displayed

| Check | Reason |
| --- | --- |
| Model Error Analysis | Unable to train meaningful error model (r^2 score: -0.01) |
| Regression Systematic Error - Train Dataset | Check is relevant for models of type ['regression'], but received model of type 'binary' |
| Regression Systematic Error - Test Dataset | Check is relevant for models of type ['regression'], but received model of type 'binary' |
| Regression Error Distribution - Train Dataset | Check is relevant for models of type ['regression'], but received model of type 'binary' |
| Regression Error Distribution - Test Dataset | Check is relevant for models of type ['regression'], but received model of type 'binary' |
| Boosting Overfit | Check is relevant for Boosting models of type ('AdaBoostClassifier', 'GradientBoostingClassifier', 'LGBMClassifier', 'XGBClassifier', 'CatBoostClassifier', 'AdaBoostRegressor', 'GradientBoostingRegressor', 'LGBMRegressor', 'XGBRegressor', 'CatBoostRegressor'), but received model of type DecisionTreeClassifier |


Boosting our model!#

To try and solve the overfitting issue, let’s throw at the problem an ensemble model that is a bit more resilient to overfitting than a single decision tree: a gradient-boosted ensemble of them!

from sklearn.ensemble import GradientBoostingClassifier

model = GradientBoostingClassifier(n_estimators=250, random_state=SEED, max_depth=20, subsample=0.8, loss='exponential')
model.fit(train_X, train_y)
msuite.run(model=model, train_dataset=ds_train, test_dataset=ds_test)

Out:

Model Evaluation Suite

The suite is composed of various checks such as: Roc Report, Regression Systematic Error, Calibration Score, etc...
Each check may contain conditions (which will result in pass / fail / warning / error) as well as other outputs such as plots or tables.
Suites, checks and conditions can all be modified. Read more about custom suites.


Conditions Summary

| Status | Check | Condition | More Info |
| --- | --- | --- | --- |
| Fail | Performance Report | Train-Test scores relative degradation is not greater than 0.1 | F1 for class 1 (train=1 test=0.87); Precision for class 1 (train=1 test=0.89); Recall for class 1 (train=1 test=0.86) |
| Warning | Unused Features | Number of high variance unused features is not greater than 5 | Found number of unused high variance features above threshold: ['sscr', 'ext_info', 'ext_country', 'ext_html', 'ext_other', 'num_@', 'hasHttps', 'hasHttp', 'numLinks', 'ext_php'] |
| Pass | ROC Report - Train Dataset | AUC score for all the classes is not less than 0.7 | |
| Pass | ROC Report - Test Dataset | AUC score for all the classes is not less than 0.7 | |
| Pass | Simple Model Comparison | Model performance gain over simple model is not less than 10% | |
| Pass | Boosting Overfit | Test score over iterations doesn't decline by more than 5% from the best score | |
| Pass | Model Inference Time - Train Dataset | Average model inference time for one sample is not greater than 0.001 | |
| Pass | Model Inference Time - Test Dataset | Average model inference time for one sample is not greater than 0.001 | |

Check With Conditions Output

Performance Report

Summarize given scores on a dataset and model.

Conditions Summary
| Status | Condition | More Info |
| --- | --- | --- |
| Fail | Train-Test scores relative degradation is not greater than 0.1 | F1 for class 1 (train=1 test=0.87); Precision for class 1 (train=1 test=0.89); Recall for class 1 (train=1 test=0.86) |
Additional Outputs

ROC Report - Train Dataset

Calculate the ROC curve for each class.

Conditions Summary
| Status | Condition | More Info |
| --- | --- | --- |
| Pass | AUC score for all the classes is not less than 0.7 | |
Additional Outputs
The marked points are the optimal threshold cut-off points. They are determined using Youden's index defined as sensitivity + specificity - 1

ROC Report - Test Dataset

Calculate the ROC curve for each class.

Conditions Summary
| Status | Condition | More Info |
| --- | --- | --- |
| Pass | AUC score for all the classes is not less than 0.7 | |
Additional Outputs
The marked points are the optimal threshold cut-off points. They are determined using Youden's index defined as sensitivity + specificity - 1

Simple Model Comparison

Compare given model score to simple model score (according to given model type).

Conditions Summary
| Status | Condition | More Info |
| --- | --- | --- |
| Pass | Model performance gain over simple model is not less than 10% | |
Additional Outputs

Boosting Overfit

Check for overfit caused by using too many iterations in a gradient boosted model.

Conditions Summary
| Status | Condition | More Info |
| --- | --- | --- |
| Pass | Test score over iterations doesn't decline by more than 5% from the best score | |
Additional Outputs
The check limits the boosting model to using up to N estimators each time, and plotting the Accuracy calculated for each subset of estimators for both the train dataset and the test dataset.

Unused Features

Detect features that are nearly unused by the model.

Conditions Summary
| Status | Condition | More Info |
| --- | --- | --- |
| Warning | Number of high variance unused features is not greater than 5 | Found number of unused high variance features above threshold: ['sscr', 'ext_info', 'ext_country', 'ext_html', 'ext_other', 'num_@', 'hasHttps', 'hasHttp', 'numLinks', 'ext_php'] |
Additional Outputs
Features above the line are a sample of the most important features, while the features below the line are the unused features with highest variance, as defined by check parameters

Model Inference Time - Train Dataset

Measure model average inference time (in seconds) per sample.

Conditions Summary
| Status | Condition | More Info |
| --- | --- | --- |
| Pass | Average model inference time for one sample is not greater than 0.001 | |
Additional Outputs
Average model inference time for one sample (in seconds): 2.47e-05

Model Inference Time - Test Dataset

Measure model average inference time (in seconds) per sample.

Conditions Summary
| Status | Condition | More Info |
| --- | --- | --- |
| Pass | Average model inference time for one sample is not greater than 0.001 | |
Additional Outputs
Average model inference time for one sample (in seconds): 2.477e-05

Check Without Conditions Output

Confusion Matrix Report - Train Dataset

Calculate the confusion matrix of the model on the given dataset.

Additional Outputs

Confusion Matrix Report - Test Dataset

Calculate the confusion matrix of the model on the given dataset.

Additional Outputs

Calibration Metric - Train Dataset

Calculate the calibration curve with brier score for each class.

Additional Outputs
Calibration curves (also known as reliability diagrams) compare how well the probabilistic predictions of a binary classifier are calibrated. It plots the true frequency of the positive label against its predicted probability, for binned predictions.
The Brier score metric may be used to assess how well a classifier is calibrated. For more info, please visit https://en.wikipedia.org/wiki/Brier_score

Calibration Metric - Test Dataset

Calculate the calibration curve with brier score for each class.

Additional Outputs
Calibration curves (also known as reliability diagrams) compare how well the probabilistic predictions of a binary classifier are calibrated. It plots the true frequency of the positive label against its predicted probability, for binned predictions.
The Brier score metric may be used to assess how well a classifier is calibrated. For more info, please visit https://en.wikipedia.org/wiki/Brier_score

Other Checks That Weren't Displayed

| Check | Reason |
| --- | --- |
| Model Error Analysis | Unable to train meaningful error model (r^2 score: -5.96E-3) |
| Regression Systematic Error - Train Dataset | Check is relevant for models of type ['regression'], but received model of type 'binary' |
| Regression Systematic Error - Test Dataset | Check is relevant for models of type ['regression'], but received model of type 'binary' |
| Regression Error Distribution - Train Dataset | Check is relevant for models of type ['regression'], but received model of type 'binary' |
| Regression Error Distribution - Test Dataset | Check is relevant for models of type ['regression'], but received model of type 'binary' |


Understanding the checks’ results!#

Again, deepchecks supplied some interesting insights, chief among them a considerable performance degradation between the train and test sets. Compared with the decision tree, the degradation was mitigated only slightly: test F1 for class 1 rose from 0.82 to 0.87, while the train score stayed at 1.
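The failed Performance Report condition can be checked by hand from the class 1 F1 scores in the suite output. The sketch assumes relative degradation is computed as (train - test) / train, which matches the wording of the condition but is not taken from the deepchecks source; `relative_degradation` is a hypothetical helper.

```python
def relative_degradation(train_score, test_score):
    """Drop from train to test as a fraction of the train score (assumed definition)."""
    return (train_score - test_score) / train_score


THRESHOLD = 0.1  # from the condition text: "not greater than 0.1"

# Class 1 F1 scores (train, test) reported by the suite for the two models.
f1_scores = {
    "decision tree": (1.0, 0.82),
    "gradient boosting": (1.0, 0.87),
}

for name, (train, test) in f1_scores.items():
    drop = relative_degradation(train, test)
    status = "fail" if drop > THRESHOLD else "pass"
    print(f"{name}: degradation={drop:.2f} -> {status}")
```

Under that assumption both models exceed the 0.1 threshold (0.18 and 0.13), so the boosted ensemble narrows the gap without closing it.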

However, for a boosted model we also get a pretty cool Boosting Overfit check that plots the accuracy of the model over increasing boosting iterations. It suggests we might have a minor case of overfitting here: train set accuracy peaks rather early on, and while test set performance improves for a little while longer, it shows some degradation starting from iteration 135.

This at least points to possible value in adjusting the n_estimators parameter, either reducing it or increasing it, to see whether the degradation continues or the trend shifts.
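One simple way to act on this is to look at the test score per boosting iteration and find where it peaks. The sketch below does that on a hypothetical list of scores (not the values from the plot); for a fitted scikit-learn `GradientBoostingClassifier`, such a list could be built from its `staged_predict` method. The helper `best_iteration` is illustrative, and the 5% comparison only loosely mirrors the condition text.

```python
def best_iteration(test_scores):
    """Return (1-based iteration, score) at which the test score first peaks."""
    best_idx = max(range(len(test_scores)), key=lambda i: test_scores[i])
    return best_idx + 1, test_scores[best_idx]


# Hypothetical test accuracy after 1, 2, ... boosting iterations.
test_scores = [0.90, 0.93, 0.95, 0.96, 0.955, 0.95, 0.948]

iteration, peak = best_iteration(test_scores)

# Relative decline of the final score from the best one.
decline = (peak - test_scores[-1]) / peak
within_limit = decline <= 0.05  # 5% limit, as in the condition text

print(f"best iteration: {iteration}, peak: {peak:.3f}, within limit: {within_limit}")
```

If the peak sits well before the last iteration, a smaller n_estimators is worth trying; if the score is still climbing at the end, a larger one is.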

Wrapping it all up!#

We haven’t got a decent model yet, but deepchecks provides us with numerous tools to help us navigate our development and make better feature engineering and model selection decisions, by making critical issues in data drift, overfitting, leakage, feature importance and model calibration readily accessible.

And this is just what deepchecks can do out of the box, with the prebuilt checks and suites! There is a lot more potential in the way the package lends itself to easy customization and creation of checks and suites tailored to your needs. We will touch upon some such advanced uses in future guides.

We, however, hope this example can already provide you with a good starting point for getting some immediate benefit out of using deepchecks! Have fun, and reach out to us if you need assistance! :)

Total running time of the script: ( 0 minutes 34.309 seconds)
