load_data

load_data(data_format: str = 'Dataset', as_train_test: bool = True) -> Union[Tuple, Dataset, DataFrame]

Loads and returns the phishing URL dataset (classification).

The phishing URL dataset is a slightly synthetic dataset of URLs, some regular and some used for phishing.

The dataset is based on the great project by Rohith Ramakrishnan and others, accompanied by a blog post. The authors have released it under an open license per our request, and for that we are very grateful to them.

This dataset is licensed under the Creative Commons Zero v1.0 Universal (CC0 1.0).

The typical ML task for this dataset is to build a model that predicts whether a URL is part of a phishing attack.

Dataset Shape:

==============  ============
Property        Value
==============  ============
Samples Total   11.35K
Dimensionality  25
Features        real, string
Targets         boolean
==============  ============

Description:

============  ===========  ==========================================================================
Column name   Column Role  Description
============  ===========  ==========================================================================
target        Label        0 if the URL is benign, 1 if it is related to phishing
month         Data         The month this URL was first encountered, as an int
scrape_date   Date         The exact date this URL was first encountered
ext           Feature      The domain extension
urlLength     Feature      The number of characters in the URL
numDigits     Feature      The number of digits in the URL
numParams     Feature      The number of query parameters in the URL
num_%20       Feature      The number of ‘%20’ substrings in the URL
num_@         Feature      The number of @ characters in the URL
entropy       Feature      The entropy of the URL
has_ip        Feature      True if the URL string contains an IP address
hasHttp       Feature      True if the URL’s domain supports http
hasHttps      Feature      True if the URL’s domain supports https
urlIsLive     Feature      Whether the URL was live at the time of scraping
dsr           Feature      The number of days since domain registration
dse           Feature      The number of days since domain registration expired
bodyLength    Feature      The number of characters in the URL’s web page
numTitles     Feature      The number of HTML titles (H1/H2/…) in the page
numImages     Feature      The number of images in the page
numLinks      Feature      The number of links in the page
specialChars  Feature      The number of special characters in the page
scriptLength  Feature      The number of characters in scripts embedded in the page
sbr           Feature      The ratio of scriptLength to bodyLength (= scriptLength / bodyLength)
bscr          Feature      The ratio of specialChars to bodyLength (= specialChars / bodyLength)
sscr          Feature      The ratio of scriptLength to specialChars (= scriptLength / specialChars)
============  ===========  ==========================================================================
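
To make the column definitions concrete, the sketch below recomputes the URL-string columns and the three ratio columns from raw inputs using only the standard library. It is an illustration, not the code that produced the dataset. The helper names (`shannon_entropy`, `url_features`, `safe_ratio`, `page_ratios`) are hypothetical, `entropy` is assumed to be the Shannon entropy of the URL's characters, and returning 0.0 for a zero denominator is also an assumption, since the description does not say how such rows are handled.

```python
import math
from collections import Counter
from urllib.parse import parse_qsl, urlsplit


def shannon_entropy(text):
    """Shannon entropy, in bits per character, of the characters in text.

    Assumption: the dataset's 'entropy' column may use a different definition.
    """
    if not text:
        return 0.0
    total = len(text)
    counts = Counter(text)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def url_features(url):
    """Compute the columns that depend only on the URL string."""
    query = urlsplit(url).query
    return {
        "urlLength": len(url),
        "numDigits": sum("0" <= ch <= "9" for ch in url),
        "numParams": len(parse_qsl(query, keep_blank_values=True)),
        "num_%20": url.count("%20"),
        "num_@": url.count("@"),
        "entropy": shannon_entropy(url),
    }


def safe_ratio(numerator, denominator):
    """Return numerator / denominator, or 0.0 if the denominator is zero.

    Assumption: the zero-denominator convention is chosen for this sketch only.
    """
    return numerator / denominator if denominator else 0.0


def page_ratios(body_length, script_length, special_chars):
    """Compute sbr, bscr and sscr using the formulas given in the table."""
    return {
        "sbr": safe_ratio(script_length, body_length),
        "bscr": safe_ratio(special_chars, body_length),
        "sscr": safe_ratio(script_length, special_chars),
    }
```

For example, `page_ratios(body_length=200, script_length=50, special_chars=20)` gives an `sbr` of 0.25 and an `sscr` of 2.5. Values computed this way may differ from those stored in the dataset if the original project used other conventions.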

Parameters
data_format : str, default: 'Dataset'

The format of the returned value. Can be 'Dataset' or 'Dataframe'. 'Dataset' returns the data as a Dataset object; 'Dataframe' returns the data as a pandas DataFrame object.

as_train_test : bool, default: True

If True, the returned data is split into train and test sets exactly as it was when the toy model was trained. The first return value is the train data and the second is the test data. To get this model, call the load_fitted_model() function. If False, a single object is returned.

Returns
dataset : Union[deepchecks.Dataset, pd.DataFrame]

The data object, in the format given by the data_format parameter. Returned when as_train_test is False.

train, test : Tuple[Union[deepchecks.Dataset, pd.DataFrame], Union[deepchecks.Dataset, pd.DataFrame]]

Returned when as_train_test is True. A tuple of two objects holding the dataset split into train and test sets.
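
The two return shapes can be handled as in the minimal sketch below. The wrapper names `load_phishing_splits` and `load_phishing_full` are hypothetical, and the import path `deepchecks.tabular.datasets.classification.phishing` is an assumption that may not match every deepchecks release. The import is placed inside each function so the definitions can be loaded without deepchecks installed; calling them requires deepchecks and, most likely, network access to fetch the data.

```python
def load_phishing_splits(data_format="Dataframe"):
    """Return (train, test) for the phishing URL dataset."""
    # Assumed import path; adjust it to match the installed deepchecks version.
    from deepchecks.tabular.datasets.classification import phishing

    # With as_train_test=True the result is a 2-tuple: train first, test second.
    train, test = phishing.load_data(data_format=data_format, as_train_test=True)
    return train, test


def load_phishing_full(data_format="Dataframe"):
    """Return the whole phishing URL dataset as a single object."""
    # Assumed import path; adjust it to match the installed deepchecks version.
    from deepchecks.tabular.datasets.classification import phishing

    # With as_train_test=False the result is one Dataset or DataFrame.
    return phishing.load_data(data_format=data_format, as_train_test=False)
```

Passing `data_format="Dataset"` to either wrapper should return Dataset objects instead of pandas DataFrames, as described under Parameters.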