Use Cases - Classifying Malicious URLs

This notebook demonstrates how the deepchecks package can help you validate your basic data science workflow right out of the box!

The scenario is a real business use case: You work as a data scientist at a cyber security startup, and the company wants to provide the clients with a tool to automatically detect phishing attempts performed through emails and warn clients about them. The idea is to scan emails and determine for each web URL they include whether it points to a phishing-related web page or not.

Since phishing is a constantly adapting effort, static blacklists or whitelists composed of good or bad URLs seen in the past are simply not enough to build a good filtering system for the future. The way the company chose to deal with this challenge is to have you train a Machine Learning model that generalizes what a phishing URL looks like from historic data!

To enable you to do this, the company’s security team has collected a set of benign (meaning OK, or Kosher) URLs and phishing URLs observed during 2019 (not necessarily in clients’ emails). They have also written a script that extracts features they believe should help discern phishing URLs from benign ones.

These features are divided into three subsets:

String Characteristics - Extracted from the URL string itself.
Domain Characteristics - Extracted by interacting with the domain provider.
Web Page Characteristics - Extracted from the content of the web page the URL points to.

The string characteristics are based on the way URLs are structured, and on what their different parts do. Here is an informative illustration. You can read more at Mozilla’s What is a URL article. We’ll see the specific features soon.

from IPython.display import Image

Image(url= "https://developer.mozilla.org/en-US/docs/Learn/Common_questions/What_is_a_URL/mdn-url-all.png")
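To make the anatomy in the illustration concrete, here is a minimal sketch that splits a URL with Python's standard urllib.parse module and derives a few string-level quantities. The function name string_features and the sample URL are made up for illustration. The exact definitions used by the security team's script are not given here (for example the sign and log base of the entropy, or how an IP address is detected), so the choices below are assumptions rather than the dataset's actual recipe.

```python
import math
import re
from collections import Counter
from urllib.parse import parse_qsl, urlsplit


def string_features(url: str) -> dict:
    """Compute a few illustrative URL-string features (hypothetical helper)."""
    parts = urlsplit(url)
    counts = Counter(url)
    total = len(url)
    # Shannon entropy of the character distribution, in bits (assumption:
    # the dataset reports negative values, so its convention differs).
    entropy = -sum((c / total) * math.log2(c / total) for c in counts.values())
    host = parts.hostname or ""
    return {
        "scheme": parts.scheme,
        "host": host,
        "urlLength": total,
        "numDigits": sum(ch.isdigit() for ch in url),
        "numParams": len(parse_qsl(parts.query, keep_blank_values=True)),
        "num_%20": url.count("%20"),
        "num_@": url.count("@"),
        "entropy": entropy,
        # Assumption: only a dotted IPv4 host counts as "contains an IP".
        "has_ip": bool(re.fullmatch(r"(\d{1,3}\.){3}\d{1,3}", host)),
    }


# Made-up URL, used only to exercise the function.
features = string_features("http://192.168.0.1/login%20page?user=a&token=123")
print(features)
```

The split into scheme, host, path and query is what makes features such as the number of query parameters well defined; counting characters on the raw string covers the rest.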

(Note: This is a slightly synthetic dataset based on a great project by Rohith Ramakrishnan and others, accompanied by a blog post. The authors have released it under an open license per our request, and for that we are very grateful to them.)

Installing requirements

import sys
!{sys.executable} -m pip install deepchecks --quiet

Loading the data

OK, let’s take a look at the data!

import numpy as np
import pandas as pd
import sklearn

import deepchecks

pd.set_option('display.max_columns', 45)
SEED = 832
np.random.seed(SEED)

from deepchecks.tabular.datasets.classification.phishing import load_data

df = load_data(data_format='dataframe', as_train_test=False)

df.shape

Out:

(11350, 25)

df.head(5)

	month	scrape_date	ext	urlLength	numDigits	numParams	num_%20	entropy	hasHttp	hasHttps	urlIsLive	dsr	dse	bodyLength	numTitles	numImages	numLinks	specialChars	scriptLength	sbr	bscr	sscr
0	1	2019-01-01	net	102	8	0	0	-4.384032	True	False	False	4921	191	32486	3	5	330	9419	23919	0.736286	0.289940	2.539442
1	1	2019-01-01	country	154	60	0	2	-3.566515	True	False	False	0	0	16199	0	4	39	2735	794	0.049015	0.168838	0.290311
2	1	2019-01-01	net	171	5	11	0	-4.608755	True	False	False	5374	104	103344	18	9	302	27798	83817	0.811049	0.268985	2.412174
3	1	2019-01-01	com	94	10	0	0	-4.548921	True	False	False	6107	466	34093	11	43	199	9087	19427	0.569824	0.266536	2.137889
4	1	2019-01-01	other	95	11	0	0	-4.717188	True	False	False	3819	928	202	1	0	0	39	0	0.000000	0.193069	0.000000

Here is the actual list of features:

df.columns

Out:

Index(['target', 'month', 'scrape_date', 'ext', 'urlLength', 'numDigits',
       'numParams', 'num_%20', 'num_@', 'entropy', 'has_ip', 'hasHttp',
       'hasHttps', 'urlIsLive', 'dsr', 'dse', 'bodyLength', 'numTitles',
       'numImages', 'numLinks', 'specialChars', 'scriptLength', 'sbr', 'bscr',
       'sscr'],
      dtype='object')

Feature List

And here is a short explanation of each:

Feature Name	Feature Group	Description
target	Meta Features	0 if the URL is benign, 1 if it is related to phishing
month	Meta Features	The month this URL was first encountered, as an int
scrape_date	Meta Features	The exact date this URL was first encountered
ext	String Characteristics	The domain extension
urlLength	String Characteristics	The number of characters in the URL
numDigits	String Characteristics	The number of digits in the URL
numParams	String Characteristics	The number of query parameters in the URL
num_%20	String Characteristics	The number of ‘%20’ substrings in the URL
num_@	String Characteristics	The number of @ characters in the URL
entropy	String Characteristics	The entropy of the URL
has_ip	String Characteristics	True if the URL string contains an IP address
hasHttp	Domain Characteristics	True if the URL’s domain supports http
hasHttps	Domain Characteristics	True if the URL’s domain supports https
urlIsLive	Domain Characteristics	The URL was live at the time of scraping
dsr	Domain Characteristics	The number of days since domain registration
dse	Domain Characteristics	The number of days since domain registration expired
bodyLength	Web Page Characteristics	The number of characters in the URL’s web page
numTitles	Web Page Characteristics	The number of HTML titles (H1/H2/…) in the page
numImages	Web Page Characteristics	The number of images in the page
numLinks	Web Page Characteristics	The number of links in the page
specialChars	Web Page Characteristics	The number of special characters in the page
scriptLength	Web Page Characteristics	The number of characters in scripts embedded in the page
sbr	Web Page Characteristics	The ratio of scriptLength to bodyLength (= scriptLength / bodyLength)
bscr	Web Page Characteristics	The ratio of specialChars to bodyLength (= specialChars / bodyLength)
sscr	Web Page Characteristics	The ratio of scriptLength to specialChars (= scriptLength / specialChars)
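The three ratio features can be checked by hand against the first row of the df.head(5) output shown earlier. The sketch below does that arithmetic; the helper name ratio is hypothetical, and returning 0.0 for a zero denominator is an assumption (it is consistent with rows in this page whose bodyLength is 0 and whose ratios are shown as 0.00, but the extraction script itself is not shown).

```python
def ratio(numerator: float, denominator: float) -> float:
    """Return numerator / denominator, or 0.0 when the denominator is 0 (assumed)."""
    return numerator / denominator if denominator else 0.0


# Values copied from row 0 of the df.head(5) output.
body_length, special_chars, script_length = 32486, 9419, 23919

sbr = ratio(script_length, body_length)     # scriptLength / bodyLength
bscr = ratio(special_chars, body_length)    # specialChars / bodyLength
sscr = ratio(script_length, special_chars)  # scriptLength / specialChars

print(round(sbr, 6), round(bscr, 6), round(sscr, 6))
```

The rounded results agree with the sbr, bscr and sscr values printed for that row (0.736286, 0.289940 and 2.539442).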

Data Integrity with Deepchecks!

The nice thing about the deepchecks package is that we can already use it out of the box! Instead of running a single check, we use a pre-defined test suite to run a host of data validation checks.

We think it’s valuable to start off with these types of suites, as there are various issues we can identify from the get-go just by looking at the raw data.

We will first import the appropriate factory function from the deepchecks.tabular.suites module (in this case, an integrity suite tailored for a single dataset, as opposed to a train/test split, for example) and use it to create a new suite object:

from deepchecks.tabular.suites import single_dataset_integrity

integ_suite = single_dataset_integrity()

We will now run that suite on our data. Running directly on a native DataFrame is possible in some cases, but it is recommended to wrap it in a deepchecks.tabular.Dataset object instead. This gives the package a bit more context: which column is the label, whether there is a datetime column (we have one, as an index, so we’ll set set_datetime_from_dataframe_index=True), and whether there are any categorical features (we have none after one-hot encoding them, so we’ll set cat_features=[] explicitly).

dataset = deepchecks.tabular.Dataset(df=df, label='target',
                                     set_datetime_from_dataframe_index=True, cat_features=[])
integ_suite.run(dataset)

Out:


Single Dataset Integrity Suite

The suite is composed of various checks such as: String Length Out Of Bounds, Outlier Sample Detection, Mixed Nulls, etc...
Each check may contain conditions (which will result in pass / fail / warning / error , represented by ✓ / ✖ / ! / ⁈ ) as well as other outputs such as plots or tables.
Suites, checks and conditions can all be modified. Read more about custom suites.

Conditions Summary

Status	Check	Condition	More Info
✖	Single Value in Column	Does not contain only a single value	Found columns with a single value: ['has_ip', 'urlIsLive']
!	Data Duplicates	Duplicate data ratio is not greater than 0%	Found 0.0088% duplicate data
✓	Mixed Nulls	Not more than 1 different null types
✓	Mixed Data Types	Rare data types in column are either more than 10% or less than 1% of the data
✓	String Mismatch	No string variants
✓	String Length Out Of Bounds	Ratio of outliers not greater than 0% string length outliers
✓	Special Characters	Ratio of entirely special character samples not greater than 0.1%
✓	Conflicting Labels	Ambiguous sample ratio is not greater than 0%

Check With Conditions Output

Single Value in Column

Check if there are columns which have only a single unique value in all rows.

Conditions Summary

Status	Condition	More Info
✖	Does not contain only a single value	Found columns with a single value: ['has_ip', 'urlIsLive']

Additional Outputs

The following columns have only one unique value

	has_ip	urlIsLive
Single unique value	0	False
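As a rough illustration of what this check looks for, the sketch below scans a handful of rows and reports every column that holds exactly one distinct value. It is a plain-Python approximation written for this walkthrough, not deepchecks' implementation; the function name single_value_columns and the toy rows are made up.

```python
def single_value_columns(rows: list) -> dict:
    """Map each column that holds exactly one distinct value to that value."""
    distinct: dict = {}
    for row in rows:
        for column, value in row.items():
            distinct.setdefault(column, set()).add(value)
    return {
        column: next(iter(values))
        for column, values in distinct.items()
        if len(values) == 1
    }


# Toy rows (illustrative only): two constant columns and one varying column.
rows = [
    {"urlLength": 102, "has_ip": 0, "urlIsLive": False},
    {"urlLength": 154, "has_ip": 0, "urlIsLive": False},
    {"urlLength": 171, "has_ip": 0, "urlIsLive": False},
]

constant = single_value_columns(rows)
print(constant)
```

A column like this carries no information for a model, which is why the condition flags has_ip and urlIsLive in the summary above.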

Data Duplicates

Instances	Number of Duplicates	target	month	scrape_date	ext	urlLength	numDigits	numParams	num_%20	num_@	entropy	has_ip	hasHttp	hasHttps	urlIsLive	dsr	dse	bodyLength	numTitles	numImages	numLinks	specialChars	scriptLength	sbr	bscr	sscr
4696, 4719	2	0	6	2019-06-06	other	123	28	4	0	0	-4.91	0	True	False	False	0	0	0	0	0	0	0	0	0.00	0.00	0.00
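The duplicate ratio reported in the conditions summary can be reproduced with simple arithmetic: the table shows one row that occurs twice (instances 4696 and 4719), so one row is redundant out of the 11350 rows reported by df.shape. The sketch below counts redundant copies in a list of rows and then repeats that arithmetic. The function name duplicate_ratio and the toy rows are hypothetical, and counting "copies beyond the first" is an assumption about how the ratio is defined.

```python
from collections import Counter


def duplicate_ratio(rows: list) -> float:
    """Fraction of rows that repeat an earlier, identical row."""
    # Sorting by key makes each row hashable and order-independent.
    counts = Counter(tuple(sorted(row.items())) for row in rows)
    redundant = sum(count - 1 for count in counts.values())
    return redundant / len(rows)


# Toy rows (illustrative only): the first and last rows are identical.
toy_rows = [
    {"urlLength": 123, "numDigits": 28},
    {"urlLength": 94, "numDigits": 10},
    {"urlLength": 95, "numDigits": 11},
    {"urlLength": 123, "numDigits": 28},
]
toy_ratio = duplicate_ratio(toy_rows)

# One redundant row among the 11350 rows of this dataset.
dataset_ratio = 1 / 11350
print(f"{dataset_ratio:.4%}")
```

Formatted as a percentage with four decimals, 1 / 11350 gives 0.0088%, which matches the "Found 0.0088% duplicate data" message.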

Check	Reason
Outlier Sample Detection - Train Dataset	UFuncTypeError: Cannot cast ufunc 'true_divide' output from dtype('O') to dtype('float64') with casting rule 'same_kind'
Mixed Nulls	Nothing found
Mixed Data Types	Nothing found
String Mismatch	Nothing found
String Length Out Of Bounds	Nothing found
Special Characters	Nothing found
Conflicting Labels	Nothing found

scrape_date	urlLength	numDigits	numParams	num_%20	num_@	entropy	hasHttp	hasHttps	dsr	dse	bodyLength	numTitles	numImages	numLinks	specialChars	scriptLength	sbr	bscr	sscr	ext_com	ext_country	ext_html	ext_info	ext_net	ext_other	ext_php
2019-01-01	-0.271569	-0.329581	-0.327303	-0.089699	-0.068846	0.314615	0.239243	-0.241671	0.280235	-0.356485	-0.125958	-0.255521	-0.264688	1.393957	-0.059321	-0.068217	0.753133	0.753298	-0.054849	-0.859105	-0.434899	-0.401599	-0.035733	3.553473	-0.426577	-0.226517
2019-01-01	0.917509	2.357675	-0.327303	5.663025	-0.068846	2.991389	0.239243	-0.241671	-1.093947	-0.629844	-0.254032	-0.344488	-0.290751	-0.358447	-0.269256	-0.282689	-1.087302	-0.414405	-0.174310	-0.859105	2.299385	-0.401599	-0.035733	-0.281415	-0.426577	-0.226517
2019-01-01	1.306246	-0.484615	6.957823	-0.089699	-0.068846	-0.421190	0.239243	-0.241671	0.406734	-0.480999	0.431238	0.189313	-0.160433	1.225340	0.517939	0.487306	0.953338	0.551243	-0.061609	-0.859105	-0.434899	-0.401599	-0.035733	3.553473	-0.426577	-0.226517

scrape_date	urlLength	numDigits	numParams	num_%20	num_@	entropy	hasHttp	hasHttps	dsr	dse	bodyLength	numTitles	numImages	numLinks	specialChars	scriptLength	sbr	bscr	sscr	ext_com	ext_country	ext_html	ext_info	ext_net	ext_other	ext_php
2019-10-01	-0.500238	-0.691327	-0.327303	-0.089699	-0.068846	0.956667	0.239243	-0.241671	-1.093947	-0.629844	-0.381413	-0.344488	-0.395006	-0.593305	-0.355159	-0.290053	-1.218560	-2.042381	-0.189730	-0.859105	2.299385	-0.401599	-0.035733	-0.281415	-0.426577	-0.226517
2019-10-01	0.002834	0.238877	-0.327303	-0.089699	-0.068846	-0.498665	0.239243	-0.241671	-1.093947	-0.629844	10.879221	-0.136899	1.533700	0.153424	9.579742	8.281871	0.509814	0.087470	-0.034532	1.164002	-0.434899	-0.401599	-0.035733	-0.281415	-0.426577	-0.226517
2019-10-01	-0.614572	0.342233	-0.327303	-0.089699	-0.068846	-0.030503	0.239243	-0.241671	-0.247266	-0.266319	-0.200150	-0.314833	-0.082243	-0.448777	-0.127258	-0.174697	0.020147	0.559584	-0.098683	1.164002	-0.434899	-0.401599	-0.035733	-0.281415	-0.426577	-0.226517

Status	Check	Condition	More Info
✖	Date Train-Test Leakage (overlap)	Date leakage ratio is not greater than 0%	Found 100% leaked dates
✓	Train Test Drift	PSI <= 0.2 and Earth Mover's Distance <= 0.1
✓	Train Test Label Drift	PSI <= 0.2 and Earth Mover's Distance <= 0.1 for label drift
✓	Whole Dataset Drift	Drift value is not greater than 0.25
✓	Dominant Frequency Change	Change in ratio of dominant value in data is not greater than 25%
✓	Category Mismatch Train Test	Ratio of samples with a new category is not greater than 0%
✓	New Label Train Test	Number of new label values is not greater than 0
✓	String Mismatch Comparison	No new variants allowed in test data
✓	Datasets Size Comparison	Test-Train size ratio is not smaller than 0.01
✓	Date Train-Test Leakage (duplicates)	Date leakage ratio is not greater than 0%
✓	Single Feature Contribution Train-Test	Train-Test features' Predictive Power Score difference is not greater than 0.2
✓	Single Feature Contribution Train-Test	Train features' Predictive Power Score is not greater than 0.7
✓	Train Test Samples Mix	Percentage of test data samples that appear in train data not greater than 10%
✓	Identifier Leakage - Train Dataset	Identifier columns PPS is not greater than 0
✓	Identifier Leakage - Test Dataset	Identifier columns PPS is not greater than 0
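The failed "Date Train-Test Leakage (overlap)" condition in the summary above is about test samples whose dates are not strictly later than the training period. A minimal sketch of that idea follows; it assumes the ratio is the share of test dates that fall on or before the latest train date, which may differ in detail from the library's definition. The function name date_overlap_ratio and all dates are made up for illustration.

```python
from datetime import date


def date_overlap_ratio(train_dates: list, test_dates: list) -> float:
    """Share of test dates that fall on or before the latest train date."""
    latest_train = max(train_dates)
    leaked = sum(1 for d in test_dates if d <= latest_train)
    return leaked / len(test_dates)


# Toy dates (illustrative only).
train = [date(2019, 1, 5), date(2019, 6, 1), date(2019, 9, 30)]
clean_test = [date(2019, 10, 1), date(2019, 11, 15)]
mixed_test = [date(2019, 3, 1), date(2019, 11, 15)]

clean_ratio = date_overlap_ratio(train, clean_test)
mixed_ratio = date_overlap_ratio(train, mixed_test)
print(clean_ratio, mixed_ratio)
```

A strictly time-ordered split yields a ratio of 0, while any test sample dated inside the training period pushes the ratio above the 0% threshold used by the condition.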

	urlLength	numDigits	numParams	num_%20	num_@	entropy	hasHttp	hasHttps	dsr	dse	bodyLength	numTitles	numImages	numLinks	specialChars	scriptLength	sbr	bscr	sscr	ext_com	ext_country	ext_html	ext_info	ext_net	ext_other	ext_php
Train indices: 2019-01-02 00:00:00, 2019-02-0.. Tot. 2 Test indices: 2019-11-20 00:00:00, 2019-11-2.. Tot. 2	0.85	-0.43	2.32	-0.09	-0.07	-1.53	0.24	-0.24	-1.09	-0.63	-0.38	-0.34	-0.40	-0.59	-0.36	-0.29	-1.22	-2.04	-0.19	-0.86	2.30	-0.40	-0.04	-0.28	-0.43	-0.23
Train indices: 2019-01-06 00:00:00 Test indices: 2019-11-06 00:00:00	-0.41	-0.02	2.98	-0.09	-0.07	-0.42	0.24	-0.24	-1.09	-0.63	-0.38	-0.34	-0.40	-0.59	-0.36	-0.29	-1.22	-2.04	-0.19	-0.86	-0.43	-0.40	-0.04	-0.28	-0.43	4.41
Train indices: 2019-09-24 00:00:00 Test indices: 2019-10-02 00:00:00	-0.18	-0.59	4.97	-0.09	-0.07	-0.03	0.24	-0.24	-0.71	-0.52	-0.38	-0.31	-0.40	-0.59	-0.36	-0.29	-1.22	-2.04	-0.19	-0.86	-0.43	-0.40	-0.04	-0.28	-0.43	4.41
Train indices: 2019-08-15 00:00:00 Test indices: 2019-12-02 00:00:00	-0.09	-0.54	4.97	-0.09	-0.07	-0.04	0.24	-0.24	-0.71	-0.52	-0.38	-0.31	-0.40	-0.59	-0.36	-0.29	-1.22	-2.04	-0.19	-0.86	-0.43	-0.40	-0.04	-0.28	-0.43	4.41
Train indices: 2019-04-09 00:00:00, 2019-05-1.. Tot. 4 Test indices: 2019-12-03 00:00:00	0.21	0.70	2.32	-0.09	-0.07	-1.40	0.24	-0.24	-1.09	-0.63	-0.38	-0.34	-0.40	-0.59	-0.36	-0.29	-1.22	-2.04	-0.19	-0.86	-0.43	-0.40	-0.04	-0.28	2.34	-0.23
Train indices: 2019-04-01 00:00:00 Test indices: 2019-12-14 00:00:00	0.21	0.70	2.32	-0.09	-0.07	-1.35	0.24	-0.24	-1.09	-0.63	-0.38	-0.34	-0.40	-0.59	-0.36	-0.29	-1.22	-2.04	-0.19	-0.86	-0.43	-0.40	-0.04	-0.28	2.34	-0.23
Train indices: 2019-05-01 00:00:00 Test indices: 2019-11-26 00:00:00	1.65	-0.28	2.32	-0.09	-0.07	-1.92	0.24	-0.24	-1.09	-0.63	-0.38	-0.34	-0.40	-0.59	-0.36	-0.29	-1.22	-2.04	-0.19	-0.86	-0.43	-0.40	-0.04	-0.28	2.34	-0.23

Check	Reason
Index Train Test Leakage	There is no index defined to use. Did you pass a DataFrame instead of a Dataset?
Dominant Frequency Change	Nothing found
Category Mismatch Train Test	Nothing found
New Label Train Test	Nothing found
String Mismatch Comparison	Nothing found
Date Train-Test Leakage (duplicates)	Nothing found
Identifier Leakage - Train Dataset	Nothing found
Identifier Leakage - Test Dataset	Nothing found
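The drift conditions in the Train Test Validation summary are stated in terms of PSI (Population Stability Index) with a threshold of 0.2. The sketch below computes the textbook PSI between two sets of category counts, i.e. the sum over categories of (actual share - expected share) * ln(actual share / expected share). It is an illustration of the formula only: the function name psi, the eps floor for empty categories, and the toy counts are assumptions, and the library's own binning and smoothing may differ.

```python
import math


def psi(expected_counts: list, actual_counts: list, eps: float = 1e-6) -> float:
    """Population Stability Index between two aligned lists of category counts."""
    expected_total = sum(expected_counts)
    actual_total = sum(actual_counts)
    value = 0.0
    for expected, actual in zip(expected_counts, actual_counts):
        # Floor the shares at eps so empty categories do not break the log.
        e_share = max(expected / expected_total, eps)
        a_share = max(actual / actual_total, eps)
        value += (a_share - e_share) * math.log(a_share / e_share)
    return value


# Toy category counts (illustrative only).
train_counts = [50, 30, 20]
same_shape_test = [100, 60, 40]  # same proportions, larger sample
shifted_test = [20, 30, 50]      # mass moved between categories

stable = psi(train_counts, same_shape_test)
drifted = psi(train_counts, shifted_test)
print(round(stable, 4), round(drifted, 4))
```

Identical proportions give a PSI of 0 regardless of sample size, while the shifted distribution lands well above the 0.2 threshold.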

Status	Check	Condition	More Info
✖	Simple Model Comparison	Model performance gain over simple model is not less than 10%	Found metrics with gain below threshold: {'F1': {0: '2.34%', 1: '4.65%'}}
!	Model Error Analysis	The performance difference of the detected segments must not be greater than 5%	Found change in Accuracy in features above threshold: {'urlLength': '31.2%'}
✓	Performance Report	Train-Test scores relative degradation is not greater than 0.1
✓	ROC Report - Train Dataset	AUC score for all the classes is not less than 0.7
✓	ROC Report - Test Dataset	AUC score for all the classes is not less than 0.7
✓	Unused Features	Number of high variance unused features is not greater than 5
✓	Model Inference Time - Train Dataset	Average model inference time for one sample is not greater than 0.001
✓	Model Inference Time - Test Dataset	Average model inference time for one sample is not greater than 0.001
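The "Simple Model Comparison" condition in the table above compares the trained model with a naive baseline and requires a gain of at least 10%. One common way to define such a gain, assumed here and not confirmed by this page, is the share of the remaining headroom that the model closes: (model score - simple score) / (perfect score - simple score). The function name relative_gain and the scores below are made up for illustration.

```python
def relative_gain(model_score: float, simple_score: float,
                  perfect_score: float = 1.0) -> float:
    """Share of the gap between the simple model and a perfect score that is closed."""
    headroom = perfect_score - simple_score
    if headroom == 0:
        # The simple model is already perfect, so there is nothing to gain.
        return 0.0
    return (model_score - simple_score) / headroom


# Toy F1 scores (illustrative only, not taken from the run above).
weak_gain = relative_gain(model_score=0.81, simple_score=0.80)
strong_gain = relative_gain(model_score=0.95, simple_score=0.80)
print(f"{weak_gain:.2%}", f"{strong_gain:.2%}")
```

Under this definition a model that barely beats a strong baseline fails the 10% condition even when its absolute score looks good, which is the situation the ✖ row describes.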

Check	Reason
Regression Systematic Error - Train Dataset	Check is relevant for models of type ['regression'], but received model of type 'binary'
Regression Systematic Error - Test Dataset	Check is relevant for models of type ['regression'], but received model of type 'binary'
Regression Error Distribution - Train Dataset	Check is relevant for models of type ['regression'], but received model of type 'binary'
Regression Error Distribution - Test Dataset	Check is relevant for models of type ['regression'], but received model of type 'binary'
Boosting Overfit	Check is relevant for Boosting models of type ('AdaBoostClassifier', 'GradientBoostingClassifier', 'LGBMClassifier', 'XGBClassifier', 'CatBoostClassifier', 'AdaBoostRegressor', 'GradientBoostingRegressor', 'LGBMRegressor', 'XGBRegressor', 'CatBoostRegressor'), but received model of type LogisticRegression

Status	Check	Condition	More Info
✖	Performance Report	Train-Test scores relative degradation is not greater than 0.1	F1 for class 1 (train=1 test=0.82) Precision for class 1 (train=1 test=0.79) Recall for class 1 (train=1 test=0.85)
!	Unused Features	Number of high variance unused features is not greater than 5	Found number of unused high variance features above threshold: ['scriptLength', 'sscr', 'ext_info', 'numImages', 'num_@', 'hasHttp']
✓	ROC Report - Train Dataset	AUC score for all the classes is not less than 0.7
✓	ROC Report - Test Dataset	AUC score for all the classes is not less than 0.7
✓	Simple Model Comparison	Model performance gain over simple model is not less than 10%
✓	Model Inference Time - Train Dataset	Average model inference time for one sample is not greater than 0.001
✓	Model Inference Time - Test Dataset	Average model inference time for one sample is not greater than 0.001
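The failed "Performance Report" condition in the table above limits the relative degradation between train and test scores to 0.1. Assuming the degradation is computed as (train - test) / train, which this page does not spell out, the class 1 scores reported in that row can be checked directly. The function name relative_degradation is hypothetical.

```python
def relative_degradation(train_score: float, test_score: float) -> float:
    """Relative drop from train to test: (train - test) / train (assumed formula)."""
    if train_score == 0:
        return 0.0
    return (train_score - test_score) / train_score


# Class 1 test scores from the Performance Report row (train=1 for all three).
reported_test_scores = {"F1": 0.82, "Precision": 0.79, "Recall": 0.85}

degradation = {
    name: relative_degradation(1.0, test_score)
    for name, test_score in reported_test_scores.items()
}
failing = [name for name, value in degradation.items() if value > 0.1]
print({name: round(value, 2) for name, value in degradation.items()}, failing)
```

All three metrics drop by more than 10% relative to a perfect train score, which is consistent with the condition failing and hints at overfitting.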

Check	Reason
Model Error Analysis	Unable to train meaningful error model (r^2 score: -0.01)
Regression Systematic Error - Train Dataset	Check is relevant for models of type ['regression'], but received model of type 'binary'
Regression Systematic Error - Test Dataset	Check is relevant for models of type ['regression'], but received model of type 'binary'
Regression Error Distribution - Train Dataset	Check is relevant for models of type ['regression'], but received model of type 'binary'
Regression Error Distribution - Test Dataset	Check is relevant for models of type ['regression'], but received model of type 'binary'
Boosting Overfit	Check is relevant for Boosting models of type ('AdaBoostClassifier', 'GradientBoostingClassifier', 'LGBMClassifier', 'XGBClassifier', 'CatBoostClassifier', 'AdaBoostRegressor', 'GradientBoostingRegressor', 'LGBMRegressor', 'XGBRegressor', 'CatBoostRegressor'), but received model of type DecisionTreeClassifier

Status	Check	Condition	More Info
✖	Performance Report	Train-Test scores relative degradation is not greater than 0.1	F1 for class 1 (train=1 test=0.87) Precision for class 1 (train=1 test=0.89) Recall for class 1 (train=1 test=0.86)
!	Unused Features	Number of high variance unused features is not greater than 5	Found number of unused high variance features above threshold: ['sscr', 'ext_info', 'ext_country', 'ext_html', 'ext_other', 'num_@', 'hasHttps', 'hasHttp', 'numLinks', 'ext_php']
✓	ROC Report - Train Dataset	AUC score for all the classes is not less than 0.7
✓	ROC Report - Test Dataset	AUC score for all the classes is not less than 0.7
✓	Simple Model Comparison	Model performance gain over simple model is not less than 10%
✓	Boosting Overfit	Test score over iterations doesn't decline by more than 5% from the best score
✓	Model Inference Time - Train Dataset	Average model inference time for one sample is not greater than 0.001
✓	Model Inference Time - Test Dataset	Average model inference time for one sample is not greater than 0.001
