phishing
The phishing URL dataset is a slightly synthetic collection of URLs: some regular and some used for phishing.
The dataset is based on the great project by Rohith Ramakrishnan and others, accompanied by a blog post. The authors have released it under an open license per our request, and for that we are very grateful to them.
This dataset is licensed under the Creative Commons Zero v1.0 Universal (CC0 1.0).
The typical ML task for this dataset is to build a model that predicts whether a URL is part of a phishing attack.
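To make the task concrete, the sketch below scores a hand-written rule as a binary classifier against the `target` label. It is only an illustration under stated assumptions: the rows are invented toy values (not taken from the dataset), `rule_baseline` and its 30-day threshold are hypothetical, and a real model would be trained on the full feature set rather than written by hand.

```python
# Toy rows that reuse column names documented in this section.
# All values are invented for illustration; they are not real dataset samples.
rows = [
    {"target": 1, "has_ip": True,  "num_@": 0, "dsr": 12},
    {"target": 1, "has_ip": False, "num_@": 1, "dsr": 400},
    {"target": 1, "has_ip": False, "num_@": 0, "dsr": 900},
    {"target": 0, "has_ip": False, "num_@": 0, "dsr": 3650},
    {"target": 0, "has_ip": False, "num_@": 0, "dsr": 20},
    {"target": 0, "has_ip": False, "num_@": 0, "dsr": 1200},
]


def rule_baseline(row, min_domain_age_days=30):
    """Hypothetical rule: flag raw IPs, '@' characters, or very young domains."""
    suspicious = (
        row["has_ip"]
        or row["num_@"] > 0
        or row["dsr"] < min_domain_age_days
    )
    return int(suspicious)


def precision_recall(y_true, y_pred):
    """Precision and recall for the positive (phishing = 1) class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == 1 and p == 1 for t, p in pairs)
    fp = sum(t == 0 and p == 1 for t, p in pairs)
    fn = sum(t == 1 and p == 0 for t, p in pairs)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


y_true = [row["target"] for row in rows]
y_pred = [rule_baseline(row) for row in rows]
precision, recall = precision_recall(y_true, y_pred)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

Reporting precision and recall separately is a deliberate choice here: for phishing detection, a missed attack and a falsely flagged benign URL usually carry different costs, which a single accuracy figure would hide.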
- Dataset Shape:

| Property | Value |
|---|---|
| Samples Total | 11.35K |
| Dimensionality | 25 |
| Features | real, string |
| Targets | boolean |
- Description:

| Column name | Column Role | Description |
|---|---|---|
| target | Label | 0 if the URL is benign, 1 if it is related to phishing |
| month | Data | The month this URL was first encountered, as an int |
| scrape_date | Date | The exact date this URL was first encountered |
| ext | Feature | The domain extension |
| urlLength | Feature | The number of characters in the URL |
| numDigits | Feature | The number of digits in the URL |
| numParams | Feature | The number of query parameters in the URL |
| num_%20 | Feature | The number of ‘%20’ substrings in the URL |
| num_@ | Feature | The number of @ characters in the URL |
| entropy | Feature | The entropy of the URL |
| has_ip | Feature | True if the URL string contains an IP address |
| hasHttp | Feature | True if the URL’s domain supports http |
| hasHttps | Feature | True if the URL’s domain supports https |
| urlIsLive | Feature | True if the URL was live at the time of scraping |
| dsr | Feature | The number of days since domain registration |
| dse | Feature | The number of days since domain registration expired |
| bodyLength | Feature | The number of characters in the URL’s web page |
| numTitles | Feature | The number of HTML titles (H1/H2/…) in the page |
| numImages | Feature | The number of images in the page |
| numLinks | Feature | The number of links in the page |
| specialChars | Feature | The number of special characters in the page |
| scriptLength | Feature | The number of characters in scripts embedded in the page |
| sbr | Feature | The ratio of scriptLength to bodyLength (= scriptLength / bodyLength) |
| bscr | Feature | The ratio of specialChars to bodyLength (= specialChars / bodyLength) |
| sscr | Feature | The ratio of scriptLength to specialChars (= scriptLength / specialChars) |
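As a rough illustration of how some of the columns above could be derived, the sketch below computes the URL-string features and the three page ratios with the Python standard library. It is not the original authors' extraction code. Several details are assumptions: `entropy` is taken to be character-level Shannon entropy in bits, `has_ip` is approximated with a loose IPv4 pattern, a zero denominator yields 0.0, and the helper names `shannon_entropy`, `url_features`, and `page_ratios` are hypothetical.

```python
import math
import re
from collections import Counter
from urllib.parse import parse_qsl, urlparse

# Loose IPv4 heuristic (assumption): four dot-separated groups of 1-3 digits.
# It does not check that each group is <= 255 and ignores IPv6.
IPV4_PATTERN = re.compile(r"(?<!\d)(?:\d{1,3}\.){3}\d{1,3}(?!\d)")


def shannon_entropy(text: str) -> float:
    """Character-level Shannon entropy in bits (assumed definition)."""
    if not text:
        return 0.0
    total = len(text)
    counts = Counter(text)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def url_features(url: str) -> dict:
    """Features that depend only on the URL string itself."""
    query = urlparse(url).query
    return {
        "urlLength": len(url),
        "numDigits": sum(ch.isdigit() for ch in url),
        "numParams": len(parse_qsl(query, keep_blank_values=True)),
        "num_%20": url.count("%20"),
        "num_@": url.count("@"),
        "entropy": shannon_entropy(url),
        "has_ip": bool(IPV4_PATTERN.search(url)),
    }


def safe_ratio(numerator: float, denominator: float) -> float:
    """Return 0.0 for a zero denominator (assumption, e.g. an empty page)."""
    return numerator / denominator if denominator else 0.0


def page_ratios(body_length: int, script_length: int, special_chars: int) -> dict:
    """Ratios following the formulas given in parentheses in the table."""
    return {
        "sbr": safe_ratio(script_length, body_length),
        "bscr": safe_ratio(special_chars, body_length),
        "sscr": safe_ratio(script_length, special_chars),
    }


features = url_features("http://192.168.0.1/login?user=a&id=42")
ratios = page_ratios(body_length=1000, script_length=250, special_chars=50)
print(features)
print(ratios)
```

The page-level counts (bodyLength, scriptLength, specialChars) are passed in as plain numbers because computing them requires fetching and parsing the page, which is outside the scope of this sketch.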
Functions
- Load and return the phishing URL dataset (classification).
- Load and return a fitted regression model to predict the target in the phishing dataset.