.. DO NOT EDIT. .. THIS FILE WAS AUTOMATICALLY GENERATED BY SPHINX-GALLERY. .. TO MAKE CHANGES, EDIT THE SOURCE PYTHON FILE: .. "user-guide/tabular/auto_tutorials/plot_phishing_urls.py" .. LINE NUMBERS ARE GIVEN BELOW. .. only:: html .. note:: :class: sphx-glr-download-link-note Click :ref:`here ` to download the full example code .. rst-class:: sphx-glr-example-title .. _sphx_glr_user-guide_tabular_auto_tutorials_plot_phishing_urls.py: Use Cases - Classifying Malicious URLs ************************************** This notebook demonstrates how the ``deepchecks`` package can help you validate your basic data science workflow right out of the box! The scenario is a real business use case: You work as a data scientist at a cyber security startup, and the company wants to provide its clients with a tool to automatically detect phishing attempts performed through emails and warn clients about them. The idea is to scan emails and determine for each web URL they include whether it points to a phishing-related web page or not. Since phishing is an ever-adapting effort, static blacklists or whitelists composed of good or bad URLs seen in the past are simply not enough to make a good filtering system for the future. The way the company chose to deal with this challenge is to have you train a Machine Learning model that generalizes what a phishing URL looks like from historic data! To enable you to do this, the company's security team has collected a set of benign (meaning OK, or kosher) URLs and phishing URLs observed during 2019 (not necessarily in clients' emails). They have also written a script extracting features they believe should help discern phishing URLs from benign ones. These features are divided into three subsets: * String Characteristics - Extracted from the URL string itself. * Domain Characteristics - Extracted by interacting with the domain provider. * Web Page Characteristics - Extracted from the content of the web page the URL points to. 
The string characteristics are based on the way URLs are structured, and on what their different parts do. Here is an informative illustration. You can read more at Mozilla's `What is a URL `__ article. We'll see the specific features soon. .. GENERATED FROM PYTHON SOURCE LINES 42-48 .. code-block:: default from IPython.core.display import HTML from IPython.display import Image Image(url= "https://developer.mozilla.org/en-US/docs/Learn/Common_questions/What_is_a_URL/mdn-url-all.png") .. raw:: html


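To make these parts concrete, here is a small sketch (not part of the tutorial's feature-extraction script) using Python's standard ``urllib.parse`` to split a URL into the components shown in the illustration; the URL itself is an arbitrary example:

```python
from urllib.parse import urlparse, parse_qs

url = "https://www.example.com:443/en-US/docs/page?key=value&lang=en#syntax"
parts = urlparse(url)

print(parts.scheme)           # protocol, e.g. 'https'
print(parts.hostname)         # domain name, e.g. 'www.example.com'
print(parts.port)             # port, e.g. 443
print(parts.path)             # path to the resource
print(parse_qs(parts.query))  # query parameters as a dict
print(parts.fragment)         # anchor, e.g. 'syntax'
```

Several of the string characteristics we will meet below (number of query parameters, '@' signs, and so on) can be read off these parts or off the raw string directly.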
.. GENERATED FROM PYTHON SOURCE LINES 49-56 (Note: This is a slightly synthetic dataset based on `a great project `__ by `Rohith Ramakrishnan `__ and others, accompanied by a `blog post `__. The authors have released it under an open license per our request, and for that we are very grateful to them.) .. GENERATED FROM PYTHON SOURCE LINES 58-64 **Installing requirements** .. code:: python import sys !{sys.executable} -m pip install deepchecks --quiet .. GENERATED FROM PYTHON SOURCE LINES 66-69 Loading the data ================ OK, let's take a look at the data! .. GENERATED FROM PYTHON SOURCE LINES 69-78 .. code-block:: default import numpy as np import pandas as pd import sklearn import deepchecks pd.set_option('display.max_columns', 45); SEED=832; np.random.seed(SEED); .. GENERATED FROM PYTHON SOURCE LINES 79-82 .. code-block:: default from deepchecks.tabular.datasets.classification.phishing import load_data .. GENERATED FROM PYTHON SOURCE LINES 83-86 .. code-block:: default df = load_data(data_format='dataframe', as_train_test=False) .. GENERATED FROM PYTHON SOURCE LINES 87-90 .. code-block:: default df.shape .. rst-class:: sphx-glr-script-out .. code-block:: none (11350, 25) .. GENERATED FROM PYTHON SOURCE LINES 91-94 .. code-block:: default df.head(5) .. raw:: html
target month scrape_date ext urlLength numDigits numParams num_%20 num_@ entropy has_ip hasHttp hasHttps urlIsLive dsr dse bodyLength numTitles numImages numLinks specialChars scriptLength sbr bscr sscr
0 0 1 2019-01-01 net 102 8 0 0 0 -4.384032 0 True False False 4921 191 32486 3 5 330 9419 23919 0.736286 0.289940 2.539442
1 0 1 2019-01-01 country 154 60 0 2 0 -3.566515 0 True False False 0 0 16199 0 4 39 2735 794 0.049015 0.168838 0.290311
2 0 1 2019-01-01 net 171 5 11 0 0 -4.608755 0 True False False 5374 104 103344 18 9 302 27798 83817 0.811049 0.268985 2.412174
3 0 1 2019-01-01 com 94 10 0 0 0 -4.548921 0 True False False 6107 466 34093 11 43 199 9087 19427 0.569824 0.266536 2.137889
4 0 1 2019-01-01 other 95 11 0 0 0 -4.717188 0 True False False 3819 928 202 1 0 0 39 0 0.000000 0.193069 0.000000


.. GENERATED FROM PYTHON SOURCE LINES 95-96 Here is the actual list of features: .. GENERATED FROM PYTHON SOURCE LINES 96-99 .. code-block:: default df.columns .. rst-class:: sphx-glr-script-out .. code-block:: none Index(['target', 'month', 'scrape_date', 'ext', 'urlLength', 'numDigits', 'numParams', 'num_%20', 'num_@', 'entropy', 'has_ip', 'hasHttp', 'hasHttps', 'urlIsLive', 'dsr', 'dse', 'bodyLength', 'numTitles', 'numImages', 'numLinks', 'specialChars', 'scriptLength', 'sbr', 'bscr', 'sscr'], dtype='object') .. GENERATED FROM PYTHON SOURCE LINES 100-133 Feature List ------------ And here is a short explanation of each: ============= ========================= ======================================================================= Feature Name Feature Group Description ============= ========================= ======================================================================= target Meta Features 0 if the URL is benign, 1 if it is related to phishing month Meta Features The month this URL was first encountered, as an int scrape_date Meta Features The exact date this URL was first encountered ext String Characteristics The domain extension urlLength String Characteristics The number of characters in the URL numDigits String Characteristics The number of digits in the URL numParams String Characteristics The number of query parameters in the URL num_%20 String Characteristics The number of '%20' substrings in the URL num_@ String Characteristics The number of @ characters in the URL entropy String Characteristics The entropy of the URL has_ip String Characteristics True if the URL string contains an IP address hasHttp Domain Characteristics True if the url's domain supports http hasHttps Domain Characteristics True if the url's domain supports https urlIsLive Domain Characteristics The URL was live at the time of scraping dsr Domain Characteristics The number of days since domain registration dse Domain Characteristics The number of days since domain registration expired bodyLength Web Page Characteristics The number of characters in the URL's web page numTitles Web Page Characteristics The number of HTML titles (H1/H2/...) in the page numImages Web Page Characteristics The number of images in the page numLinks Web Page Characteristics The number of links in the page specialChars Web Page Characteristics The number of special characters in the page scriptLength Web Page Characteristics The number of characters in scripts embedded in the page sbr Web Page Characteristics The ratio of scriptLength to bodyLength (`= scriptLength / bodyLength`) bscr Web Page Characteristics The ratio of specialChars to bodyLength (`= specialChars / bodyLength`) sscr Web Page Characteristics The ratio of scriptLength to specialChars (`= scriptLength / specialChars`) ============= ========================= ======================================================================= .. GENERATED FROM PYTHON SOURCE LINES 135-148 Data Integrity with Deepchecks! =============================== The nice thing about the ``deepchecks`` package is that we can already use it out of the box! Instead of running a single check, we use a pre-defined test suite to run a host of data validation checks. We think it's valuable to start off with these types of suites as there are various issues we can identify at the get-go just by looking at raw data. We will first import the appropriate factory function from the ``deepchecks.suites`` module - in this case, an integrity suite tailored for a single dataset (as opposed to a division into a train and test, for example) - and use it to create a new suite object: .. GENERATED FROM PYTHON SOURCE LINES 148-153 .. code-block:: default from deepchecks.tabular.suites import single_dataset_integrity integ_suite = single_dataset_integrity() .. rst-class:: sphx-glr-script-out .. code-block:: none the single_dataset_integrity suite is deprecated, use the data_integrity suite instead .. 
GENERATED FROM PYTHON SOURCE LINES 154-160 We will now run that suite on our data. While running on the native DataFrame is possible in some cases, it is recommended to wrap it with the ``deepchecks.tabular.Dataset`` object instead, to give the package a bit more context: which column is the label, whether we have a datetime column (we do, as an index, so we'll set ``set_datetime_from_dataframe_index=True``), and whether there are any categorical features (we'll have none after one-hot encoding them, so we set ``cat_features=[]`` explicitly). .. GENERATED FROM PYTHON SOURCE LINES 160-165 .. code-block:: default dataset = deepchecks.tabular.Dataset(df=df, label='target', set_datetime_from_dataframe_index=True, cat_features=[]) integ_suite.run(dataset) .. rst-class:: sphx-glr-script-out .. code-block:: none Data Integrity Suite: | | 0/12 [Time: 00:00] Data Integrity Suite: |#### | 4/12 [Time: 00:00, Check=Mixed Data Types] Data Integrity Suite: |######## | 8/12 [Time: 00:01, Check=Conflicting Labels] Data Integrity Suite: |########## | 10/12 [Time: 00:05, Check=Feature Label Correlation] Data Integrity Suite: |############| 12/12 [Time: 00:05, Check=Identifier Label Correlation] .. raw:: html
Data Integrity Suite


.. GENERATED FROM PYTHON SOURCE LINES 166-182 Understanding the checks' results! ================================== Ok, so we've got some interesting results! Even though this is quite a tidy dataset without any preprocessing, ``deepchecks`` has found a couple of columns (``has_ip`` and ``urlIsLive``) containing only a single value, and a couple of duplicate values. We also get a nice list of all checks that turned out ok, and what each check is about. So nothing dramatic, but we will be sure to drop those useless columns. :) Preprocessing ============= Let's split the data into train and test sets first. Since we want to examine how well a model can generalize from the past to the future, we'll simply assign the first months of the dataset to the training set, and the last few months to the test set. .. GENERATED FROM PYTHON SOURCE LINES 182-186 .. code-block:: default raw_train_df = df[df.month <= 9] len(raw_train_df) .. rst-class:: sphx-glr-script-out .. code-block:: none 8626 .. GENERATED FROM PYTHON SOURCE LINES 187-191 .. code-block:: default raw_test_df = df[df.month > 9] len(raw_test_df) .. rst-class:: sphx-glr-script-out .. code-block:: none 2724 .. GENERATED FROM PYTHON SOURCE LINES 192-196 Ok! Let's process the data real quick and see how some baseline classifiers perform! We'll just set the scrape date as our index, drop a few useless columns, one-hot encode our categorical ``ext`` column and scale all numeric data: .. GENERATED FROM PYTHON SOURCE LINES 196-202 .. code-block:: default from deepchecks.tabular.datasets.classification.phishing import \ get_url_preprocessor pipeline = get_url_preprocessor() .. GENERATED FROM PYTHON SOURCE LINES 203-204 Now we'll fit the pipeline on, and transform, the raw train dataframe: .. GENERATED FROM PYTHON SOURCE LINES 204-210 .. code-block:: default train_df = pipeline.fit_transform(raw_train_df) train_X = train_df.drop('target', axis=1) train_y = train_df['target'] train_X.head(3) .. raw:: html
urlLength numDigits numParams num_%20 num_@ entropy hasHttp hasHttps dsr dse bodyLength numTitles numImages numLinks specialChars scriptLength sbr bscr sscr ext_com ext_country ext_html ext_info ext_net ext_other ext_php
scrape_date
2019-01-01 -0.271569 -0.329581 -0.327303 -0.089699 -0.068846 0.314615 0.239243 -0.241671 0.280235 -0.356485 -0.125958 -0.255521 -0.264688 1.393957 -0.059321 -0.068217 0.753133 0.753298 -0.054849 -0.859105 -0.434899 -0.401599 -0.035733 3.553473 -0.426577 -0.226517
2019-01-01 0.917509 2.357675 -0.327303 5.663025 -0.068846 2.991389 0.239243 -0.241671 -1.093947 -0.629844 -0.254032 -0.344488 -0.290751 -0.358447 -0.269256 -0.282689 -1.087302 -0.414405 -0.174310 -0.859105 2.299385 -0.401599 -0.035733 -0.281415 -0.426577 -0.226517
2019-01-01 1.306246 -0.484615 6.957823 -0.089699 -0.068846 -0.421190 0.239243 -0.241671 0.406734 -0.480999 0.431238 0.189313 -0.160433 1.225340 0.517939 0.487306 0.953338 0.551243 -0.061609 -0.859105 -0.434899 -0.401599 -0.035733 3.553473 -0.426577 -0.226517


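``get_url_preprocessor`` ships with deepchecks and its internals aren't shown here, but the steps just described can be sketched with plain pandas: one-hot encode the categorical column, z-score the numerics, and keep the statistics fitted on the train set for reuse on the test set. This is an illustrative sketch only; the toy frame and its columns merely stand in for the real data:

```python
import pandas as pd

def fit_transform(df: pd.DataFrame):
    """One-hot encode 'ext' and z-score the numeric columns; return the
    transformed frame plus the fitted statistics for reuse on test data."""
    encoded = pd.get_dummies(df, columns=['ext'], prefix='ext')
    num_cols = [c for c in encoded.columns if not c.startswith('ext_')]
    means, stds = encoded[num_cols].mean(), encoded[num_cols].std()
    encoded[num_cols] = (encoded[num_cols] - means) / stds
    return encoded, (means, stds)

# Toy stand-in for the raw data (columns reduced for brevity).
raw = pd.DataFrame({
    'urlLength': [102, 154, 171, 94],
    'entropy': [-4.38, -3.57, -4.61, -4.55],
    'ext': ['net', 'country', 'net', 'com'],
})
train, (means, stds) = fit_transform(raw)
```

The key design point is that the returned ``means`` and ``stds`` are fitted on the train set only and must be the ones applied to the test set.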
.. GENERATED FROM PYTHON SOURCE LINES 211-213 And apply the same fitted preprocessing pipeline (with the fitted scaler, for example) to the test dataframe: .. GENERATED FROM PYTHON SOURCE LINES 213-219 .. code-block:: default test_df = pipeline.transform(raw_test_df) test_X = test_df.drop('target', axis=1) test_y = test_df['target'] test_X.head(3) .. raw:: html
urlLength numDigits numParams num_%20 num_@ entropy hasHttp hasHttps dsr dse bodyLength numTitles numImages numLinks specialChars scriptLength sbr bscr sscr ext_com ext_country ext_html ext_info ext_net ext_other ext_php
scrape_date
2019-10-01 -0.500238 -0.691327 -0.327303 -0.089699 -0.068846 0.956667 0.239243 -0.241671 -1.093947 -0.629844 -0.381413 -0.344488 -0.395006 -0.593305 -0.355159 -0.290053 -1.218560 -2.042381 -0.189730 -0.859105 2.299385 -0.401599 -0.035733 -0.281415 -0.426577 -0.226517
2019-10-01 0.002834 0.238877 -0.327303 -0.089699 -0.068846 -0.498665 0.239243 -0.241671 -1.093947 -0.629844 10.879221 -0.136899 1.533700 0.153424 9.579742 8.281871 0.509814 0.087470 -0.034532 1.164002 -0.434899 -0.401599 -0.035733 -0.281415 -0.426577 -0.226517
2019-10-01 -0.614572 0.342233 -0.327303 -0.089699 -0.068846 -0.030503 0.239243 -0.241671 -0.247266 -0.266319 -0.200150 -0.314833 -0.082243 -0.448777 -0.127258 -0.174697 0.020147 0.559584 -0.098683 1.164002 -0.434899 -0.401599 -0.035733 -0.281415 -0.426577 -0.226517


.. GENERATED FROM PYTHON SOURCE LINES 220-223 .. code-block:: default from sklearn.linear_model import LogisticRegression; from sklearn.metrics import accuracy_score; hyperparameters = {'penalty': 'l2', 'fit_intercept': True, 'random_state': SEED, 'C': 0.009} .. GENERATED FROM PYTHON SOURCE LINES 224-229 .. code-block:: default logreg = LogisticRegression(**hyperparameters) logreg.fit(train_X, train_y); pred_y = logreg.predict(test_X) .. GENERATED FROM PYTHON SOURCE LINES 230-233 .. code-block:: default accuracy_score(test_y, pred_y) .. rst-class:: sphx-glr-script-out .. code-block:: none 0.9698972099853157 .. GENERATED FROM PYTHON SOURCE LINES 234-236 Ok, so we've got a nice accuracy score from the get go! Let's see what deepchecks can tell us about our model... .. GENERATED FROM PYTHON SOURCE LINES 236-239 .. code-block:: default from deepchecks.tabular.suites import train_test_validation .. GENERATED FROM PYTHON SOURCE LINES 240-243 .. code-block:: default vsuite = train_test_validation() .. GENERATED FROM PYTHON SOURCE LINES 244-247 Now that we have separate train and test DataFrames, we will create two ``deepchecks.tabular.Dataset`` objects to enable this suite and the next one to run addressing the train and test dataframes according to their role. Notice that here we pass the label as a column instead of a column name, because we've seperated the feature DataFrame from the target. .. GENERATED FROM PYTHON SOURCE LINES 247-252 .. code-block:: default ds_train = deepchecks.tabular.Dataset(df=train_X, label=train_y, set_datetime_from_dataframe_index=True, cat_features=[]) ds_test = deepchecks.tabular.Dataset(df=test_X, label=test_y, set_datetime_from_dataframe_index=True, cat_features=[]) .. GENERATED FROM PYTHON SOURCE LINES 253-255 Now we just have to provide the ``run`` method of the suite object with both the model and the ``Dataset`` objects. .. GENERATED FROM PYTHON SOURCE LINES 255-258 .. 
code-block:: default vsuite.run(model=logreg, train_dataset=ds_train, test_dataset=ds_test) .. rst-class:: sphx-glr-script-out .. code-block:: none Train Test Validation Suite: | | 0/12 [Time: 00:00] Train Test Validation Suite: |######## | 8/12 [Time: 00:01, Check=Train Test Samples Mix] Train Test Validation Suite: |######### | 9/12 [Time: 00:01, Check=Feature Label Correlation Change] Train Test Validation Suite: |########## | 10/12 [Time: 00:03, Check=Train Test Feature Drift] Train Test Validation Suite: |############| 12/12 [Time: 00:03, Check=Whole Dataset Drift] .. raw:: html
Train Test Validation Suite


.. GENERATED FROM PYTHON SOURCE LINES 259-280 Understanding the checks' results! ================================== Whoa! It looks like we have some time leakage! The ``Conditions`` Summary section showed that the ``Date Train-Test Leakage (overlap)`` check was the only failed check. The ``Additional Outputs`` section helped us understand that the latest date in the train set belongs to January 2020! It seems some entries from January 2020 made their way into the train set. We assumed the ``month`` column was enough to split the data with (which it would have been, had all the data indeed been from 2019), but as in real life, things were a bit messy. We'll adjust our preprocessing real quick, and with methodological errors out of the way we'll get to checking our model's performance. It is also worth mentioning that deepchecks found that ``urlLength`` is the only feature that alone can predict the target with some measure of success. This is worth investigating! Adjusting our preprocessing and refitting the model --------------------------------------------------- Let's just drop any row from 2020 from the raw dataframe, redo the train/test split, and take it all from there. .. GENERATED FROM PYTHON SOURCE LINES 280-284 .. code-block:: default df = df[~df['scrape_date'].str.contains('2020')] df.shape .. rst-class:: sphx-glr-script-out .. code-block:: none (10896, 25) .. GENERATED FROM PYTHON SOURCE LINES 285-288 .. code-block:: default # re-split so the train set no longer contains the 2020 rows raw_train_df = df[df.month <= 9] raw_test_df = df[df.month > 9] pipeline = get_url_preprocessor() .. GENERATED FROM PYTHON SOURCE LINES 289-295 .. code-block:: default train_df = pipeline.fit_transform(raw_train_df) train_X = train_df.drop('target', axis=1) train_y = train_df['target'] train_X.head(3) .. raw:: html
urlLength numDigits numParams num_%20 num_@ entropy hasHttp hasHttps dsr dse bodyLength numTitles numImages numLinks specialChars scriptLength sbr bscr sscr ext_com ext_country ext_html ext_info ext_net ext_other ext_php
scrape_date
2019-01-01 -0.271569 -0.329581 -0.327303 -0.089699 -0.068846 0.314615 0.239243 -0.241671 0.280235 -0.356485 -0.125958 -0.255521 -0.264688 1.393957 -0.059321 -0.068217 0.753133 0.753298 -0.054849 -0.859105 -0.434899 -0.401599 -0.035733 3.553473 -0.426577 -0.226517
2019-01-01 0.917509 2.357675 -0.327303 5.663025 -0.068846 2.991389 0.239243 -0.241671 -1.093947 -0.629844 -0.254032 -0.344488 -0.290751 -0.358447 -0.269256 -0.282689 -1.087302 -0.414405 -0.174310 -0.859105 2.299385 -0.401599 -0.035733 -0.281415 -0.426577 -0.226517
2019-01-01 1.306246 -0.484615 6.957823 -0.089699 -0.068846 -0.421190 0.239243 -0.241671 0.406734 -0.480999 0.431238 0.189313 -0.160433 1.225340 0.517939 0.487306 0.953338 0.551243 -0.061609 -0.859105 -0.434899 -0.401599 -0.035733 3.553473 -0.426577 -0.226517


.. GENERATED FROM PYTHON SOURCE LINES 296-302 .. code-block:: default test_df = pipeline.transform(raw_test_df) test_X = test_df.drop('target', axis=1) test_y = test_df['target'] test_X.head(3) .. raw:: html
urlLength numDigits numParams num_%20 num_@ entropy hasHttp hasHttps dsr dse bodyLength numTitles numImages numLinks specialChars scriptLength sbr bscr sscr ext_com ext_country ext_html ext_info ext_net ext_other ext_php
scrape_date
2019-10-01 -0.500238 -0.691327 -0.327303 -0.089699 -0.068846 0.956667 0.239243 -0.241671 -1.093947 -0.629844 -0.381413 -0.344488 -0.395006 -0.593305 -0.355159 -0.290053 -1.218560 -2.042381 -0.189730 -0.859105 2.299385 -0.401599 -0.035733 -0.281415 -0.426577 -0.226517
2019-10-01 0.002834 0.238877 -0.327303 -0.089699 -0.068846 -0.498665 0.239243 -0.241671 -1.093947 -0.629844 10.879221 -0.136899 1.533700 0.153424 9.579742 8.281871 0.509814 0.087470 -0.034532 1.164002 -0.434899 -0.401599 -0.035733 -0.281415 -0.426577 -0.226517
2019-10-01 -0.614572 0.342233 -0.327303 -0.089699 -0.068846 -0.030503 0.239243 -0.241671 -0.247266 -0.266319 -0.200150 -0.314833 -0.082243 -0.448777 -0.127258 -0.174697 0.020147 0.559584 -0.098683 1.164002 -0.434899 -0.401599 -0.035733 -0.281415 -0.426577 -0.226517


.. GENERATED FROM PYTHON SOURCE LINES 303-306 .. code-block:: default logreg.fit(train_X, train_y) .. rst-class:: sphx-glr-script-out .. code-block:: none LogisticRegression(C=0.009, random_state=832) .. GENERATED FROM PYTHON SOURCE LINES 307-310 .. code-block:: default pred_y = logreg.predict(test_X) .. GENERATED FROM PYTHON SOURCE LINES 311-314 .. code-block:: default accuracy_score(test_y, pred_y) .. rst-class:: sphx-glr-script-out .. code-block:: none 0.9698972099853157 .. GENERATED FROM PYTHON SOURCE LINES 315-318 Deepchecks' Performance Checks ============================== Ok! Now that we're back on track, let's run some performance checks to see how we did. .. GENERATED FROM PYTHON SOURCE LINES 318-321 .. code-block:: default from deepchecks.tabular.suites import model_evaluation .. GENERATED FROM PYTHON SOURCE LINES 322-325 .. code-block:: default msuite = model_evaluation() .. GENERATED FROM PYTHON SOURCE LINES 326-330 .. code-block:: default ds_train = deepchecks.tabular.Dataset(df=train_X, label=train_y, set_datetime_from_dataframe_index=True, cat_features=[]) ds_test = deepchecks.tabular.Dataset(df=test_X, label=test_y, set_datetime_from_dataframe_index=True, cat_features=[]) .. GENERATED FROM PYTHON SOURCE LINES 331-334 .. code-block:: default msuite.run(model=logreg, train_dataset=ds_train, test_dataset=ds_test) .. rst-class:: sphx-glr-script-out .. 
code-block:: none Model Evaluation Suite: | | 0/12 [Time: 00:00] Model Evaluation Suite: |# | 1/12 [Time: 00:00, Check=Train Test Performance] Model Evaluation Suite: |#### | 4/12 [Time: 00:00, Check=Train Test Prediction Drift] One or more of the test scores are non-finite: [ nan nan] Model Evaluation Suite: |#### | 4/12 [Time: 00:13, Check=Weak Segments Performance] Model Evaluation Suite: |###### | 6/12 [Time: 00:35, Check=Weak Segments Performance] Model Evaluation Suite: |########### | 11/12 [Time: 00:35, Check=Boosting Overfit] .. raw:: html
Model Evaluation Suite


.. GENERATED FROM PYTHON SOURCE LINES 335-363 Understanding the checks' results! ================================== Two checks are especially worth a closer look: * ``Simple Model Comparison`` - This check makes sure our model outperforms a very simple model to some degree. Having it fail means we might have a serious problem. * ``Model Error Analysis`` - This check analyses model errors and tries to find a way to segment our data in a way that is informative to error analysis. It seems that it found a valuable way to segment our data, error-wise, using the ``urlLength`` feature. We'll look into it soon enough. Looking at the metric plots for F1 for both our model and a simple one, we see their performance is almost identical! How can this be? Fortunately the confusion matrices automagically generated for both the training and test sets help us understand what has happened. Our evidently over-regularized classifier was over-impressed by the majority class (0, or non-malicious URL), and predicted a value of 0 for almost all samples in both the train and the test set, which yielded a seemingly-impressive 97% accuracy on the test set just due to the imbalanced nature of the problem. ``deepchecks`` also generated plots for F1, precision and recall on both the train and test set, as part of the performance report, and these also help us see that recall scores are almost zero for both sets and understand what happened. Trying out a different classifier ================================= So let's throw something a bit richer in expressive power at the problem - a decision tree! .. GENERATED FROM PYTHON SOURCE LINES 363-370 .. code-block:: default from sklearn.tree import DecisionTreeClassifier model = DecisionTreeClassifier(criterion='entropy', splitter='random', random_state=SEED) model.fit(train_X, train_y) msuite.run(model=model, train_dataset=ds_train, test_dataset=ds_test) .. rst-class:: sphx-glr-script-out .. 
code-block:: none Model Evaluation Suite: | | 0/12 [Time: 00:00] Model Evaluation Suite: |# | 1/12 [Time: 00:00, Check=Train Test Performance] Model Evaluation Suite: |#### | 4/12 [Time: 00:00, Check=Train Test Prediction Drift] One or more of the test scores are non-finite: [ nan nan] Model Evaluation Suite: |###### | 6/12 [Time: 00:14, Check=Weak Segments Performance] .. raw:: html
Model Evaluation Suite


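Before moving on, the accuracy-vs-recall trap described above is easy to demonstrate in isolation. The labels below are synthetic, merely mimicking the dataset's imbalance, not the tutorial's actual data:

```python
import numpy as np

# Synthetic labels mimicking the imbalance: ~97% benign (0), ~3% phishing (1).
y_true = np.array([0] * 97 + [1] * 3)
y_pred = np.zeros_like(y_true)  # always predict 'benign', like the over-regularized model

accuracy = (y_true == y_pred).mean()
# Recall on the phishing class: of all true phishing URLs, how many were flagged?
recall = ((y_pred == 1) & (y_true == 1)).sum() / (y_true == 1).sum()

print(accuracy, recall)  # high accuracy, zero recall
```

This is exactly why the suite's F1, precision and recall plots are more telling here than raw accuracy.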
.. GENERATED FROM PYTHON SOURCE LINES 371-376 Boosting our model! =================== To try and solve the overfitting issue, let's throw at the problem an ensemble model that has a bit more resilience to overfitting than a decision tree: a gradient-boosted ensemble of them! .. GENERATED FROM PYTHON SOURCE LINES 376-383 .. code-block:: default from sklearn.ensemble import GradientBoostingClassifier model = GradientBoostingClassifier(n_estimators=250, random_state=SEED, max_depth=20, subsample=0.8, loss='exponential') model.fit(train_X, train_y) msuite.run(model=model, train_dataset=ds_train, test_dataset=ds_test) .. rst-class:: sphx-glr-script-out .. code-block:: none Model Evaluation Suite: | | 0/12 [Time: 00:00] Model Evaluation Suite: |# | 1/12 [Time: 00:00, Check=Train Test Performance] Model Evaluation Suite: |## | 2/12 [Time: 00:01, Check=Roc Report] Model Evaluation Suite: |### | 3/12 [Time: 00:01, Check=Confusion Matrix Report] Model Evaluation Suite: |#### | 4/12 [Time: 00:01, Check=Train Test Prediction Drift] Model Evaluation Suite: |##### | 5/12 [Time: 00:01, Check=Simple Model Comparison] One or more of the test scores are non-finite: [ nan nan] Model Evaluation Suite: |###### | 6/12 [Time: 00:43, Check=Weak Segments Performance] Model Evaluation Suite: |####### | 7/12 [Time: 00:44, Check=Calibration Score] Model Evaluation Suite: |########### | 11/12 [Time: 00:47, Check=Boosting Overfit] .. raw:: html
Model Evaluation Suite


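As an aside, the iteration-by-iteration accuracy curve that the ``Boosting Overfit`` check draws can also be produced manually with scikit-learn's ``staged_predict``. The data below is synthetic, standing in for the tutorial's train/test frames, and the sketch is illustrative rather than the check's actual implementation:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score

# Synthetic stand-in data for train_X/train_y and test_X/test_y.
rng = np.random.RandomState(832)
X = rng.randn(400, 5)
y = (X[:, 0] + 0.5 * rng.randn(400) > 0).astype(int)
X_train, y_train = X[:300], y[:300]
X_test, y_test = X[300:], y[300:]

model = GradientBoostingClassifier(n_estimators=50, random_state=832)
model.fit(X_train, y_train)

# Accuracy after each boosting iteration -- the curve the Boosting Overfit
# check visualizes. A flat or declining tail on the test curve suggests
# lowering n_estimators (or using early stopping via n_iter_no_change).
test_curve = [accuracy_score(y_test, pred) for pred in model.staged_predict(X_test)]
best_iteration = int(np.argmax(test_curve)) + 1
```

Inspecting where ``best_iteration`` falls relative to ``n_estimators`` is one quick way to decide whether the ensemble size is worth tuning.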
.. GENERATED FROM PYTHON SOURCE LINES 384-416 Understanding the checks' results! ================================== Again, ``deepchecks`` supplied some interesting insights, including a considerable performance degradation between the train and test sets. The degradation we witnessed before was only slightly mitigated. However, for a boosted model we get a pretty cool *Boosting Overfit* check that plots the accuracy of the model along increasing boosting iterations. This can help us see that we might have a minor case of overfitting here, as peak train-set accuracy is achieved rather early on, and while test-set performance improves for a little while longer, it shows some degradation starting from iteration 135. This at least points to possible value in adjusting the ``n_estimators`` parameter, either reducing it, or increasing it to see whether the degradation continues or the trend shifts. Wrapping it all up! =================== We haven't got a decent model yet, but ``deepchecks`` provides us with numerous tools to help us navigate our development and make better feature engineering and model selection decisions, by making critical issues in data drift, overfitting, leakage, feature importance and model calibration readily accessible. And this is just what ``deepchecks`` can do out of the box, with the prebuilt checks and suites! There is a lot more potential in the way the package lends itself to easy customization and creation of checks and suites tailored to your needs. We will touch upon some such advanced uses in future guides. We hope, however, that this example can already provide you with a good starting point for getting some immediate benefit out of using deepchecks! Have fun, and reach out to us if you need assistance! :) .. rst-class:: sphx-glr-timing **Total running time of the script:** ( 2 minutes 9.015 seconds) .. 
_sphx_glr_download_user-guide_tabular_auto_tutorials_plot_phishing_urls.py: .. only:: html .. container:: sphx-glr-footer sphx-glr-footer-example .. container:: sphx-glr-download sphx-glr-download-python :download:`Download Python source code: plot_phishing_urls.py ` .. container:: sphx-glr-download sphx-glr-download-jupyter :download:`Download Jupyter notebook: plot_phishing_urls.ipynb ` .. only:: html .. rst-class:: sphx-glr-signature `Gallery generated by Sphinx-Gallery `_