phishing

The phishing dataset is a slightly synthetic collection of URLs, some benign and some used for phishing.

The dataset is based on the great project by Rohith Ramakrishnan and others, accompanied by a blog post. The authors have released it under an open license per our request, and for that we are very grateful to them.

This dataset is licensed under the Creative Commons Zero v1.0 Universal (CC0 1.0).

The typical ML task for this dataset is to build a model that predicts whether a URL is part of a phishing attack.

Dataset Shape
Property	Value
Samples Total	11.35K
Dimensionality	25
Features	real, string
Targets	boolean

Dataset Description
Column name	Column Role	Description
target	Label	0 if the URL is benign, 1 if it is related to phishing
month	Data	The month this URL was first encountered, as an int
scrape_date	Date	The exact date this URL was first encountered
ext	Feature	The domain extension
urlLength	Feature	The number of characters in the URL
numDigits	Feature	The number of digits in the URL
numParams	Feature	The number of query parameters in the URL
num_%20	Feature	The number of ‘%20’ substrings in the URL
num_@	Feature	The number of @ characters in the URL
entropy	Feature	The entropy of the URL
has_ip	Feature	True if the URL string contains an IP address
hasHttp	Feature	True if the URL’s domain supports HTTP
hasHttps	Feature	True if the URL’s domain supports HTTPS
urlIsLive	Feature	True if the URL was live at the time of scraping
dsr	Feature	The number of days since domain registration
dse	Feature	The number of days since domain registration expired
bodyLength	Feature	The number of characters in the URL’s web page
numTitles	Feature	The number of HTML headings (H1/H2/…) in the page
numImages	Feature	The number of images in the page
numLinks	Feature	The number of links in the page
specialChars	Feature	The number of special characters in the page
scriptLength	Feature	The number of characters in scripts embedded in the page
sbr	Feature	The ratio of scriptLength to bodyLength (= scriptLength / bodyLength)
bscr	Feature	The ratio of specialChars to bodyLength (= specialChars / bodyLength)
sscr	Feature	The ratio of scriptLength to specialChars (= scriptLength / specialChars)
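
As an illustration of how the URL-level and ratio columns above relate to raw inputs, the following is a minimal sketch rather than the code used to build the dataset. The helper names (`shannon_entropy`, `url_features`, `page_ratios`) are hypothetical. It also assumes that `entropy` is character-level Shannon entropy in bits and that `has_ip` means a dotted-quad IPv4 pattern appears anywhere in the string. Neither definition is stated in the table.

```python
import math
import re
from collections import Counter
from urllib.parse import parse_qsl, urlsplit

# Assumption: any dotted-quad IPv4 pattern in the string counts as an IP address.
_IPV4 = re.compile(r"(?<!\d)(?:\d{1,3}\.){3}\d{1,3}(?!\d)")


def shannon_entropy(text):
    """Character-level Shannon entropy in bits (assumed definition of `entropy`)."""
    if not text:
        return 0.0
    total = len(text)
    counts = Counter(text)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def url_features(url):
    """Compute the string-based URL columns for a single URL."""
    query = urlsplit(url).query
    return {
        "urlLength": len(url),
        "numDigits": sum(ch.isdigit() for ch in url),
        # Count key/value pairs in the query string, including blank values.
        "numParams": len(parse_qsl(query, keep_blank_values=True)),
        "num_%20": url.count("%20"),
        "num_@": url.count("@"),
        "entropy": shannon_entropy(url),
        "has_ip": bool(_IPV4.search(url)),
    }


def page_ratios(body_length, script_length, special_chars):
    """Compute sbr, bscr and sscr using the formulas given in the table."""

    def ratio(numerator, denominator):
        # Assumption: an undefined ratio (zero denominator) is reported as NaN.
        return numerator / denominator if denominator else float("nan")

    return {
        "sbr": ratio(script_length, body_length),
        "bscr": ratio(special_chars, body_length),
        "sscr": ratio(script_length, special_chars),
    }
```

For example, `url_features("http://192.168.0.1/a?x=1&y=2")` reports two query parameters and `has_ip` as true, and `page_ratios(100, 25, 50)` gives `sbr = 0.25`, `bscr = 0.5` and `sscr = 0.5`.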

Functions

`load_data`([data_format, as_train_test])	Load and return the phishing URL dataset (classification).
`load_fitted_model`([pretrained])	Load and return a fitted regression model to predict the target in the phishing dataset.
