Use Cases - Classifying Malicious URLs

This notebook demonstrates how the deepchecks package can help you validate your basic data science workflow right out of the box!

The scenario is a real business use case: You work as a data scientist at a cyber security startup, and the company wants to provide the clients with a tool to automatically detect phishing attempts performed through emails and warn clients about them. The idea is to scan emails and determine for each web URL they include whether it points to a phishing-related web page or not.

Since phishing attempts are an ever-adapting effort, static black lists or white lists composed of good or bad URLs seen in the past are simply not enough to make a good filtering system for the future. The way the company chose to deal with this challenge is to have you train a Machine Learning model to generalize what a phishing URL looks like from historic data!

To enable you to do this, the company’s security team has collected a set of benign (meaning OK, or Kosher) URLs and phishing URLs observed during 2019 (not necessarily in clients’ emails). They have also written a script extracting features they believe should help discern phishing URLs from benign ones.

These features are divided into three subsets:

String Characteristics - Extracted from the URL string itself.
Domain Characteristics - Extracted by interacting with the domain provider.
Web Page Characteristics - Extracted from the content of the web page the URL points to.

The string characteristics are based on the way URLs are structured, and on what their different parts do. Here is an informative illustration. You can read more in Mozilla’s What is a URL article. We’ll see the specific features soon.

from IPython.display import Image

Image(url="https://developer.mozilla.org/en-US/docs/Learn/Common_questions/What_is_a_URL/mdn-url-all.png")
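
To make the structure in the illustration concrete, here is a minimal sketch that splits a URL into its parts with Python's standard urllib.parse module. The example URL is made up for illustration and is not taken from the dataset.

```python
from urllib.parse import urlsplit, parse_qs

# A made-up URL, used only to illustrate the parts of a URL
url = "https://www.example.com:443/path/to/page.html?key1=value1&key2=value2#anchor"

parts = urlsplit(url)
print(parts.scheme)    # https
print(parts.hostname)  # www.example.com
print(parts.port)      # 443
print(parts.path)      # /path/to/page.html
print(parts.query)     # key1=value1&key2=value2
print(parts.fragment)  # anchor

# The query string can be broken down further into individual parameters
params = parse_qs(parts.query)
print(sorted(params))  # ['key1', 'key2']
```

Counting the entries of the parsed query is one plausible way to obtain a feature such as the number of query parameters.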

(Note: This is a slightly synthetic dataset based on a great project by Rohith Ramakrishnan and others, accompanied by a blog post. The authors have released it under an open license per our request, and for that we are very grateful to them.)

Installing requirements

import sys
!{sys.executable} -m pip install deepchecks --quiet

Loading the data

OK, let’s take a look at the data!

import numpy as np
import pandas as pd
import sklearn

import deepchecks

pd.set_option('display.max_columns', 45)
SEED = 832
np.random.seed(SEED)

from deepchecks.tabular.datasets.classification.phishing import load_data

df = load_data(data_format='dataframe', as_train_test=False)

df.shape

(11350, 25)

df.head(5)

	month	scrape_date	ext	urlLength	numDigits	numParams	num_%20	entropy	hasHttp	hasHttps	urlIsLive	dsr	dse	bodyLength	numTitles	numImages	numLinks	specialChars	scriptLength	sbr	bscr	sscr
0	1	2019-01-01	net	102	8	0	0	-4.384032	True	False	False	4921	191	32486	3	5	330	9419	23919	0.736286	0.289940	2.539442
1	1	2019-01-01	country	154	60	0	2	-3.566515	True	False	False	0	0	16199	0	4	39	2735	794	0.049015	0.168838	0.290311
2	1	2019-01-01	net	171	5	11	0	-4.608755	True	False	False	5374	104	103344	18	9	302	27798	83817	0.811049	0.268985	2.412174
3	1	2019-01-01	com	94	10	0	0	-4.548921	True	False	False	6107	466	34093	11	43	199	9087	19427	0.569824	0.266536	2.137889
4	1	2019-01-01	other	95	11	0	0	-4.717188	True	False	False	3819	928	202	1	0	0	39	0	0.000000	0.193069	0.000000

Here is the actual list of features:

df.columns

Index(['target', 'month', 'scrape_date', 'ext', 'urlLength', 'numDigits',
       'numParams', 'num_%20', 'num_@', 'entropy', 'has_ip', 'hasHttp',
       'hasHttps', 'urlIsLive', 'dsr', 'dse', 'bodyLength', 'numTitles',
       'numImages', 'numLinks', 'specialChars', 'scriptLength', 'sbr', 'bscr',
       'sscr'],
      dtype='object')

Feature List

And here is a short explanation of each:

Feature Name	Feature Group	Description
target	Meta Features	0 if the URL is benign, 1 if it is related to phishing
month	Meta Features	The month this URL was first encountered, as an int
scrape_date	Meta Features	The exact date this URL was first encountered
ext	String Characteristics	The domain extension
urlLength	String Characteristics	The number of characters in the URL
numDigits	String Characteristics	The number of digits in the URL
numParams	String Characteristics	The number of query parameters in the URL
num_%20	String Characteristics	The number of ‘%20’ substrings in the URL
num_@	String Characteristics	The number of @ characters in the URL
entropy	String Characteristics	The entropy of the URL
has_ip	String Characteristics	True if the URL string contains an IP address
hasHttp	Domain Characteristics	True if the URL’s domain supports http
hasHttps	Domain Characteristics	True if the URL’s domain supports https
urlIsLive	Domain Characteristics	The URL was live at the time of scraping
dsr	Domain Characteristics	The number of days since domain registration
dse	Domain Characteristics	The number of days since domain registration expired
bodyLength	Web Page Characteristics	The number of characters in the URL’s web page
numTitles	Web Page Characteristics	The number of HTML titles (H1/H2/…) in the page
numImages	Web Page Characteristics	The number of images in the page
numLinks	Web Page Characteristics	The number of links in the page
specialChars	Web Page Characteristics	The number of special characters in the page
scriptLength	Web Page Characteristics	The number of characters in scripts embedded in the page
sbr	Web Page Characteristics	The ratio of scriptLength to bodyLength (= scriptLength / bodyLength)
bscr	Web Page Characteristics	The ratio of specialChars to bodyLength (= specialChars / bodyLength)
sscr	Web Page Characteristics	The ratio of scriptLength to specialChars (= scriptLength / specialChars)
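
As a rough sketch of how features like these could be computed, the snippet below derives a few of the string characteristics from a URL and recomputes the three ratio features from the first row shown by df.head(5). The helper names are hypothetical, and the exact definitions used by the dataset's authors are not stated in this guide. In particular, the entropy values in the data are negative, so the sign convention or normalization there evidently differs from the plain Shannon entropy assumed here.

```python
import math
from collections import Counter
from urllib.parse import urlsplit, parse_qsl


def shannon_entropy(text):
    """Shannon entropy of the character distribution, in bits (assumed definition)."""
    counts = Counter(text)
    total = len(text)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def string_features(url):
    """Hypothetical helper computing a few string characteristics of a URL."""
    return {
        "urlLength": len(url),
        "numDigits": sum(ch.isdigit() for ch in url),
        "numParams": len(parse_qsl(urlsplit(url).query)),
        "num_%20": url.count("%20"),
        "num_@": url.count("@"),
        "entropy": shannon_entropy(url),
    }


# A made-up URL, not taken from the dataset
features = string_features("http://example.com/a%20b?id=42&ref=x")
print(features)

# Ratio features, recomputed from the first row of df.head(5) above
body_length, special_chars, script_length = 32486, 9419, 23919
sbr = script_length / body_length     # scriptLength / bodyLength
bscr = special_chars / body_length    # specialChars / bodyLength
sscr = script_length / special_chars  # scriptLength / specialChars
print(round(sbr, 6), round(bscr, 6), round(sscr, 6))
```

The recomputed ratios agree with the sbr, bscr and sscr values displayed for that row, which confirms the formulas given in the table.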

Data Integrity with Deepchecks!

The nice thing about the deepchecks package is that we can already use it out of the box! Instead of running a single check, we use a pre-defined test suite to run a host of data validation checks.

We think it’s valuable to start off with these types of suites as there are various issues we can identify at the get go just by looking at raw data.

We will first import the appropriate factory function from the deepchecks.tabular.suites module - in this case, an integrity suite tailored for a single dataset (as opposed to a division into train and test sets, for example) - and use it to create a new suite object:

from deepchecks.tabular.suites import data_integrity

integ_suite = data_integrity()

We will now run that suite on our data. While running on the native DataFrame is possible in some cases, it is recommended to wrap it with the deepchecks.tabular.Dataset object instead, to give the package a bit more context, namely what is the label column, and whether we have a datetime column (we have, as an index, so we’ll set set_datetime_from_dataframe_index=True), or any categorical features (we have none after one-hot encoding them, so we’ll set cat_features=[] explicitly).

dataset = deepchecks.tabular.Dataset(df=df, label='target',
                                     set_datetime_from_dataframe_index=True, cat_features=[])
integ_suite.run(dataset)

Data Integrity Suite:
|            | 0/12 [Time: 00:00]
Data Integrity Suite:
|███         | 3/12 [Time: 00:00, Check=Mixed Nulls]
Data Integrity Suite:
|██████      | 6/12 [Time: 00:00, Check=Data Duplicates]
Data Integrity Suite:
|████████    | 8/12 [Time: 00:01, Check=Conflicting Labels]
Data Integrity Suite:
|██████████  | 10/12 [Time: 00:07, Check=Feature Label Correlation]
Data Integrity Suite:
|████████████| 12/12 [Time: 00:07, Check=Identifier Label Correlation]

Data Integrity Suite

Understanding the checks’ results!

Ok, so we’ve got some interesting results! Even though this is quite a tidy dataset without even any preprocessing, deepchecks has found a couple of columns (has_ip and urlIsLive) containing only a single value and a couple of duplicate values.

We also get a nice list of all checks that turned out ok, and what each check is about.

So nothing dramatic, but we will be sure to drop those useless columns. :)
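
The two findings above are easy to picture on a toy table. The following plain-Python sketch uses made-up rows (not the real dataset) and is not deepchecks' own implementation; it simply flags columns that hold a single value and counts exact duplicate rows.

```python
from collections import Counter

# Made-up rows for illustration; not taken from the phishing dataset
rows = [
    {"urlLength": 102, "has_ip": False, "urlIsLive": False},
    {"urlLength": 154, "has_ip": False, "urlIsLive": False},
    {"urlLength": 102, "has_ip": False, "urlIsLive": False},  # duplicate of row 0
]

# A column is "single value" when every row holds the same value in it
single_value_columns = [
    col for col in rows[0]
    if len({row[col] for row in rows}) == 1
]

# Count rows that repeat an earlier row exactly
row_counts = Counter(tuple(sorted(row.items())) for row in rows)
num_duplicates = sum(count - 1 for count in row_counts.values())

print(single_value_columns)  # ['has_ip', 'urlIsLive']
print(num_duplicates)        # 1
```

A column with a single value carries no information a model could use, which is why dropping it is safe.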

Preprocessing

Let’s split the data into train and test sets first. Since we want to examine how well a model can generalize from the past to the future, we’ll simply assign the first months of the dataset to the training set, and the last few months to the test set.

raw_train_df = df[df.month <= 9]
len(raw_train_df)

raw_test_df = df[df.month > 9]
len(raw_test_df)

Ok! Let’s process the data real quick and see how some baseline classifiers perform!

We’ll just set the scrape date as our index, drop a few useless columns, one-hot encode our categorical ext column and scale all numeric data:

from deepchecks.tabular.datasets.classification.phishing import \
    get_url_preprocessor

pipeline = get_url_preprocessor()

Now we’ll fit on and transform the raw train dataframe:

train_df = pipeline.fit_transform(raw_train_df)
train_X = train_df.drop('target', axis=1)
train_y = train_df['target']
train_X.head(3)

	urlLength	numDigits	numParams	num_%20	num_@	entropy	hasHttp	hasHttps	dsr	dse	bodyLength	numTitles	numImages	numLinks	specialChars	scriptLength	sbr	bscr	sscr	ext_com	ext_country	ext_html	ext_info	ext_net	ext_other	ext_php
scrape_date
2019-01-01	-0.271569	-0.329581	-0.327303	-0.089699	-0.068846	0.314615	0.239243	-0.241671	0.280235	-0.356485	-0.125958	-0.255521	-0.264688	1.393957	-0.059321	-0.068217	0.753133	0.753298	-0.054849	-0.859105	-0.434899	-0.401599	-0.035733	3.553473	-0.426577	-0.226517
2019-01-01	0.917509	2.357675	-0.327303	5.663025	-0.068846	2.991389	0.239243	-0.241671	-1.093947	-0.629844	-0.254032	-0.344488	-0.290751	-0.358447	-0.269256	-0.282689	-1.087302	-0.414405	-0.174310	-0.859105	2.299385	-0.401599	-0.035733	-0.281415	-0.426577	-0.226517
2019-01-01	1.306246	-0.484615	6.957823	-0.089699	-0.068846	-0.421190	0.239243	-0.241671	0.406734	-0.480999	0.431238	0.189313	-0.160433	1.225340	0.517939	0.487306	0.953338	0.551243	-0.061609	-0.859105	-0.434899	-0.401599	-0.035733	3.553473	-0.426577	-0.226517

And apply the same fitted preprocessing pipeline (with the fitted scaler, for example) to the test dataframe:

test_df = pipeline.transform(raw_test_df)
test_X = test_df.drop('target', axis=1)
test_y = test_df['target']
test_X.head(3)

	urlLength	numDigits	numParams	num_%20	num_@	entropy	hasHttp	hasHttps	dsr	dse	bodyLength	numTitles	numImages	numLinks	specialChars	scriptLength	sbr	bscr	sscr	ext_com	ext_country	ext_html	ext_info	ext_net	ext_other	ext_php
scrape_date
2019-10-01	-0.500238	-0.691327	-0.327303	-0.089699	-0.068846	0.956667	0.239243	-0.241671	-1.093947	-0.629844	-0.381413	-0.344488	-0.395006	-0.593305	-0.355159	-0.290053	-1.218560	-2.042381	-0.189730	-0.859105	2.299385	-0.401599	-0.035733	-0.281415	-0.426577	-0.226517
2019-10-01	0.002834	0.238877	-0.327303	-0.089699	-0.068846	-0.498665	0.239243	-0.241671	-1.093947	-0.629844	10.879221	-0.136899	1.533700	0.153424	9.579742	8.281871	0.509814	0.087470	-0.034532	1.164002	-0.434899	-0.401599	-0.035733	-0.281415	-0.426577	-0.226517
2019-10-01	-0.614572	0.342233	-0.327303	-0.089699	-0.068846	-0.030503	0.239243	-0.241671	-0.247266	-0.266319	-0.200150	-0.314833	-0.082243	-0.448777	-0.127258	-0.174697	0.020147	0.559584	-0.098683	1.164002	-0.434899	-0.401599	-0.035733	-0.281415	-0.426577	-0.226517
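
The pipeline is fitted on the train set only because statistics such as a scaler's mean and standard deviation should not be influenced by test data. Here is a minimal plain-Python sketch of that idea, with a hypothetical fit_scaler/transform pair and made-up numbers; the actual preprocessor returned by get_url_preprocessor may differ in its details.

```python
from statistics import mean, pstdev


def fit_scaler(values):
    """Learn the scaling statistics from the training values only."""
    return mean(values), pstdev(values)


def transform(values, stats):
    """Standardize values using previously fitted statistics."""
    mu, sigma = stats
    return [(v - mu) / sigma for v in values]


# Made-up urlLength values, for illustration only
train_lengths = [90, 100, 110]
test_lengths = [100, 130]

stats = fit_scaler(train_lengths)             # fitted on train only
train_scaled = transform(train_lengths, stats)
test_scaled = transform(test_lengths, stats)  # reuses the train statistics

print(train_scaled)
print(test_scaled)
```

Because the test values are scaled with the train statistics, a test value far outside the training range shows up as a large standardized value instead of being silently normalized away.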

from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

hyperparameters = {'penalty': 'l2', 'fit_intercept': True, 'random_state': SEED, 'C': 0.009}

logreg = LogisticRegression(**hyperparameters)
logreg.fit(train_X, train_y);
pred_y = logreg.predict(test_X)

accuracy_score(test_y, pred_y)

0.9698972099853157

Ok, so we’ve got a nice accuracy score from the get go! Let’s see what deepchecks can tell us about our model…

from deepchecks.tabular.suites import train_test_validation

vsuite = train_test_validation()

Now that we have separate train and test DataFrames, we will create two deepchecks.tabular.Dataset objects so that this suite and the next one can treat the train and test dataframes according to their roles. Notice that here we pass the label as a column instead of a column name, because we’ve separated the feature DataFrame from the target.

ds_train = deepchecks.tabular.Dataset(df=train_X, label=train_y, set_datetime_from_dataframe_index=True,
                                      cat_features=[])
ds_test = deepchecks.tabular.Dataset(df=test_X, label=test_y, set_datetime_from_dataframe_index=True, cat_features=[])

Dataframe index has duplicate indexes, setting index to [0,1..,n-1].
Dataframe index has duplicate indexes, setting index to [0,1..,n-1].

Now we just have to provide the run method of the suite object with both the model and the Dataset objects.

vsuite.run(model=logreg, train_dataset=ds_train, test_dataset=ds_test)

Train Test Validation Suite:
|            | 0/12 [Time: 00:00]
Train Test Validation Suite:
|█████       | 5/12 [Time: 00:00, Check=Date Train Test Leakage Duplicates]
Train Test Validation Suite:
|████████    | 8/12 [Time: 00:01, Check=Train Test Samples Mix]
Train Test Validation Suite:
|█████████   | 9/12 [Time: 00:02, Check=Feature Label Correlation Change]
Train Test Validation Suite:
|██████████  | 10/12 [Time: 00:04, Check=Feature Drift]
Train Test Validation Suite:
|████████████| 12/12 [Time: 00:05, Check=Multivariate Drift]

Train Test Validation Suite

Understanding the checks’ results!

Whoa! It looks like we have some time leakage!

The Conditions Summary section showed that the Date Train-Test Leakage (overlap) check was the only failed check. The Additional Outputs section helped us understand that the latest date in the train set belongs to January 2020!

It seems some entries from January 2020 made their way into the train set. We assumed the month column was enough to split the data with (which it would have been, had all the data indeed been from 2019), but as in real life, things were a bit messy. We’ll adjust our preprocessing real quick, and with methodological errors out of the way we’ll get to checking our model’s performance.
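
To see why splitting on the month alone lets the leak through, consider this small plain-Python sketch. The dates are made up; only the logic mirrors the split above. A row scraped in January 2020 has a month value of 1, so a month-based rule sends it to the train set even though it is later than every test row.

```python
from datetime import date

# Made-up scrape dates for illustration; the last one is from January 2020
scrape_dates = [
    date(2019, 1, 1),
    date(2019, 6, 15),
    date(2019, 10, 1),
    date(2019, 12, 31),
    date(2020, 1, 5),
]

# Month-only split, as in the preprocessing above
train_by_month = [d for d in scrape_dates if d.month <= 9]
test_by_month = [d for d in scrape_dates if d.month > 9]

# The latest train date is later than the earliest test date: leakage
leak = max(train_by_month) > min(test_by_month)
print(max(train_by_month), min(test_by_month), leak)

# Splitting on the full date avoids the problem
cutoff = date(2019, 10, 1)
train_by_date = [d for d in scrape_dates if d < cutoff]
test_by_date = [d for d in scrape_dates if d >= cutoff]
print(max(train_by_date) < min(test_by_date))
```

Splitting on a full-date cutoff is one alternative; this guide takes the simpler route of dropping the 2020 rows.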

It is also worth mentioning that deepchecks found that urlLength is the only feature that alone can predict the target with some measure of success. This is worth investigating!

Adjusting our preprocessing and refitting the model

Let’s just drop any row from 2020 from the raw dataframe and take it all from there:

df = df[~df['scrape_date'].str.contains('2020')]
df.shape

(10896, 25)

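# Re-split after dropping the 2020 rows, so the leaked entries do not reach the train set
raw_train_df = df[df.month <= 9]
raw_test_df = df[df.month > 9]

pipeline = get_url_preprocessor()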

train_df = pipeline.fit_transform(raw_train_df)
train_X = train_df.drop('target', axis=1)
train_y = train_df['target']
train_X.head(3)

	urlLength	numDigits	numParams	num_%20	num_@	entropy	hasHttp	hasHttps	dsr	dse	bodyLength	numTitles	numImages	numLinks	specialChars	scriptLength	sbr	bscr	sscr	ext_com	ext_country	ext_html	ext_info	ext_net	ext_other	ext_php
scrape_date
2019-01-01	-0.271569	-0.329581	-0.327303	-0.089699	-0.068846	0.314615	0.239243	-0.241671	0.280235	-0.356485	-0.125958	-0.255521	-0.264688	1.393957	-0.059321	-0.068217	0.753133	0.753298	-0.054849	-0.859105	-0.434899	-0.401599	-0.035733	3.553473	-0.426577	-0.226517
2019-01-01	0.917509	2.357675	-0.327303	5.663025	-0.068846	2.991389	0.239243	-0.241671	-1.093947	-0.629844	-0.254032	-0.344488	-0.290751	-0.358447	-0.269256	-0.282689	-1.087302	-0.414405	-0.174310	-0.859105	2.299385	-0.401599	-0.035733	-0.281415	-0.426577	-0.226517
2019-01-01	1.306246	-0.484615	6.957823	-0.089699	-0.068846	-0.421190	0.239243	-0.241671	0.406734	-0.480999	0.431238	0.189313	-0.160433	1.225340	0.517939	0.487306	0.953338	0.551243	-0.061609	-0.859105	-0.434899	-0.401599	-0.035733	3.553473	-0.426577	-0.226517

test_df = pipeline.transform(raw_test_df)
test_X = test_df.drop('target', axis=1)
test_y = test_df['target']
test_X.head(3)

	urlLength	numDigits	numParams	num_%20	num_@	entropy	hasHttp	hasHttps	dsr	dse	bodyLength	numTitles	numImages	numLinks	specialChars	scriptLength	sbr	bscr	sscr	ext_com	ext_country	ext_html	ext_info	ext_net	ext_other	ext_php
scrape_date
2019-10-01	-0.500238	-0.691327	-0.327303	-0.089699	-0.068846	0.956667	0.239243	-0.241671	-1.093947	-0.629844	-0.381413	-0.344488	-0.395006	-0.593305	-0.355159	-0.290053	-1.218560	-2.042381	-0.189730	-0.859105	2.299385	-0.401599	-0.035733	-0.281415	-0.426577	-0.226517
2019-10-01	0.002834	0.238877	-0.327303	-0.089699	-0.068846	-0.498665	0.239243	-0.241671	-1.093947	-0.629844	10.879221	-0.136899	1.533700	0.153424	9.579742	8.281871	0.509814	0.087470	-0.034532	1.164002	-0.434899	-0.401599	-0.035733	-0.281415	-0.426577	-0.226517
2019-10-01	-0.614572	0.342233	-0.327303	-0.089699	-0.068846	-0.030503	0.239243	-0.241671	-0.247266	-0.266319	-0.200150	-0.314833	-0.082243	-0.448777	-0.127258	-0.174697	0.020147	0.559584	-0.098683	1.164002	-0.434899	-0.401599	-0.035733	-0.281415	-0.426577	-0.226517

logreg.fit(train_X, train_y)

LogisticRegression(C=0.009, random_state=832)

pred_y = logreg.predict(test_X)

accuracy_score(test_y, pred_y)

0.9698972099853157

Deepchecks’ Performance Checks

Ok! Now that we’re back on track, let’s run some performance checks to see how we did.

from deepchecks.tabular.suites import model_evaluation

msuite = model_evaluation()

ds_train = deepchecks.tabular.Dataset(df=train_X, label=train_y, set_datetime_from_dataframe_index=True, cat_features=[])
ds_test = deepchecks.tabular.Dataset(df=test_X, label=test_y, set_datetime_from_dataframe_index=True, cat_features=[])

Dataframe index has duplicate indexes, setting index to [0,1..,n-1].
Dataframe index has duplicate indexes, setting index to [0,1..,n-1].

msuite.run(model=logreg, train_dataset=ds_train, test_dataset=ds_test)

Model Evaluation Suite:
|           | 0/11 [Time: 00:00]
Model Evaluation Suite:
|█          | 1/11 [Time: 00:00, Check=Train Test Performance]
Model Evaluation Suite:
|████       | 4/11 [Time: 00:00, Check=Prediction Drift]
Model Evaluation Suite:
|████       | 4/11 [Time: 00:16, Check=Weak Segments Performance]
Model Evaluation Suite:
|██████     | 6/11 [Time: 01:12, Check=Weak Segments Performance]
Model Evaluation Suite:
|█████████  | 9/11 [Time: 01:12, Check=Unused Features]

Model Evaluation Suite

Understanding the checks’ results!

Two of the checks in this suite deserve a closer look:

Simple Model Comparison - This check makes sure our model outperforms a very simple model to some degree. Having it fail means we might have a serious problem.
Model Error Analysis - This check analyses model errors and tries to find a way to segment our data in a way that is informative to error analysis. It seems that it found a valuable way to segment our data, error-wise, using the urlLength feature. We’ll look into it soon enough.

Looking at the metric plots for F1 for both our model and a simple one, we see that their performance is almost identical! How can this be? Fortunately, the confusion matrices automagically generated for both the training and test sets help us understand what has happened.

Our evidently over-regularized classifier was over-impressed by the majority class (0, or non-malicious URL), and predicted a value of 0 for almost all samples in both the train and the test set, which yielded a seemingly-impressive 97% accuracy on the test set just due to the imbalanced nature of the problem.

deepchecks also generated plots for F1, precision and recall on both the train and test set, as part of the performance report, and these also help us see recall scores are almost zero for both sets and understand what happened.
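
The effect is easy to reproduce with a toy example. In the plain-Python sketch below, the label counts are made up (97 benign, 3 phishing) and the "model" always predicts the majority class. Accuracy looks excellent while recall for the phishing class is zero.

```python
# Made-up, imbalanced labels: 97 benign (0) and 3 phishing (1) URLs
y_true = [0] * 97 + [1] * 3
# A degenerate classifier that always predicts the majority class
y_pred = [0] * 100

tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))

accuracy = (tp + tn) / len(y_true)
# Guard against division by zero when nothing is predicted positive
precision = tp / (tp + fp) if (tp + fp) else 0.0
recall = tp / (tp + fn) if (tp + fn) else 0.0
f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0

print(accuracy, recall, f1)  # 0.97 0.0 0.0
```

This is why accuracy alone is a poor yardstick for imbalanced problems, and why comparing against a trivial baseline is informative.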

Trying out a different classifier

So let’s throw something a bit more rich in expressive power at the problem - a decision tree!

from sklearn.tree import DecisionTreeClassifier

model = DecisionTreeClassifier(criterion='entropy', splitter='random', random_state=SEED)
model.fit(train_X, train_y)
msuite.run(model=model, train_dataset=ds_train, test_dataset=ds_test)

Model Evaluation Suite:
|           | 0/11 [Time: 00:00]
Model Evaluation Suite:
|█          | 1/11 [Time: 00:00, Check=Train Test Performance]
Model Evaluation Suite:
|████       | 4/11 [Time: 00:00, Check=Prediction Drift]
Model Evaluation Suite:
|████       | 4/11 [Time: 00:13, Check=Weak Segments Performance]
Model Evaluation Suite:
|██████     | 6/11 [Time: 00:58, Check=Weak Segments Performance]
Model Evaluation Suite:
|█████████  | 9/11 [Time: 00:58, Check=Unused Features]

Model Evaluation Suite

Boosting our model!

To try and solve the overfitting issue, let’s throw at the problem an ensemble model that has a bit more resilience to overfitting than a decision tree: a gradient-boosted ensemble of them!

from sklearn.ensemble import GradientBoostingClassifier

model = GradientBoostingClassifier(n_estimators=250, random_state=SEED, max_depth=20, subsample=0.8, loss='exponential')
model.fit(train_X, train_y)
msuite.run(model=model, train_dataset=ds_train, test_dataset=ds_test)

Model Evaluation Suite:
|           | 0/11 [Time: 00:00]
Model Evaluation Suite:
|█          | 1/11 [Time: 00:00, Check=Train Test Performance]
Model Evaluation Suite:
|██         | 2/11 [Time: 00:01, Check=Roc Report]
Model Evaluation Suite:
|███        | 3/11 [Time: 00:01, Check=Confusion Matrix Report]
Model Evaluation Suite:
|████       | 4/11 [Time: 00:01, Check=Prediction Drift]
Model Evaluation Suite:
|█████      | 5/11 [Time: 00:01, Check=Simple Model Comparison]
Model Evaluation Suite:
|██████     | 6/11 [Time: 01:02, Check=Weak Segments Performance]
Model Evaluation Suite:
|███████    | 7/11 [Time: 01:02, Check=Calibration Score]
Model Evaluation Suite:
|█████████  | 9/11 [Time: 01:02, Check=Unused Features]
Model Evaluation Suite:
|██████████ | 10/11 [Time: 01:05, Check=Boosting Overfit]

Model Evaluation Suite

Understanding the checks’ results!

Again, deepchecks supplied some interesting insights, including a considerable performance degradation between the train and test sets. We can see that the degradation in performance between the train and test set that we witnessed before was mitigated only very little.

However, for a boosted model we get a pretty cool Boosting Overfit check that plots the accuracy of the model along increasing boosting iterations of the model. This can help us see that we might have a minor case of overfitting here, as train set accuracy is achieved rather early on, and while test set performance improves for a little while longer, it shows some degradation starting from iteration 135.

This at least points to possible value in adjusting the n_estimators parameter, either reducing it or increasing it to see if the degradation continues or perhaps the trend shifts.
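
One simple way to act on such a plot is to pick the number of boosting iterations at which the test score peaks. The sketch below does this in plain Python on a made-up list of per-iteration test accuracies; the numbers are illustrative and are not the ones from this run, and best_iteration is a hypothetical helper. With scikit-learn, per-iteration predictions for a fitted GradientBoostingClassifier can be obtained from its staged_predict method.

```python
def best_iteration(test_scores):
    """Return the 1-based boosting iteration with the highest test score."""
    best_index = max(range(len(test_scores)), key=lambda i: test_scores[i])
    return best_index + 1


# Made-up test accuracies after 1, 2, ... boosting iterations (illustration only)
test_scores = [0.90, 0.93, 0.95, 0.96, 0.955, 0.95, 0.94]

n_best = best_iteration(test_scores)
print(n_best)  # 4

# Iterations beyond the peak only lowered the test score in this toy series
degradation = test_scores[n_best - 1] - test_scores[-1]
print(round(degradation, 3))
```

In practice the peak should be chosen on a separate validation set, so that the test set remains an unbiased estimate of performance.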

Wrapping it all up!

We haven’t got a decent model yet, but deepchecks provides us with numerous tools to help us navigate our development and make better feature engineering and model selection decisions, by easily making critical issues in data drift, overfitting, leakage, feature importance and model calibration readily accessible.

And this is just what deepchecks can do out of the box, with the prebuilt checks and suites! There is a lot more potential in the way the package lends itself to easy customization and creation of checks and suites tailored to your needs. We will touch upon some such advanced uses in future guides.

We, however, hope this example can already provide you with a good starting point for getting some immediate benefit out of using deepchecks! Have fun, and reach out to us if you need assistance! :)

Total running time of the script: (3 minutes 54.179 seconds)
