The phishing URL dataset is a slightly synthetic collection of URLs, some benign and some used for phishing.
The dataset is based on the great project by Rohith Ramakrishnan and others, which is accompanied by a blog post. The authors released it under an open license at our request, for which we are very grateful.
This dataset is licensed under the Creative Commons Zero v1.0 Universal (CC0 1.0).
The typical ML task for this dataset is to build a model that predicts whether a URL is part of a phishing attack.
- Dataset Shape:
0 if the URL is benign, 1 if it is related to phishing
The month this URL was first encountered, as an int
The exact date this URL was first encountered
The domain extension
The number of characters in the URL
The number of digits in the URL
The number of query parameters in the URL
The number of ‘%20’ substrings in the URL
The number of @ characters in the URL
The entropy of the URL
True if the URL string contains an IP address
True if the URL’s domain supports http
True if the URL’s domain supports https
True if the URL was live at the time of scraping
The number of days since domain registration
The number of days since domain registration expired
The number of characters in the URL’s web page
The number of HTML titles (H1/H2/…) in the page
The number of images in the page
The number of links in the page
The number of special characters in the page
The number of characters in scripts embedded in the page
The ratio of scriptLength to bodyLength (= scriptLength / bodyLength)
The ratio of specialChars to bodyLength (= specialChars / bodyLength)
The ratio of scriptLength to specialChars (= scriptLength / specialChars)
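To make the URL-string features and the ratio features above more concrete, here is a minimal Python sketch of how such values might be derived from a raw URL. It is an illustration under stated assumptions, not the procedure used to build the dataset: the function names and dictionary keys are hypothetical, entropy is assumed to be the Shannon entropy of the URL's character distribution, the IP check only looks at the host part of the URL (a narrower reading of "contains an IP address"), and ratios with a zero denominator are assumed to be 0.0.

```python
import ipaddress
import math
from collections import Counter
from urllib.parse import parse_qsl, urlsplit


def shannon_entropy(text: str) -> float:
    """Shannon entropy, in bits per character, of the character distribution.

    Assumption: the dataset's entropy feature may be defined differently.
    """
    if not text:
        return 0.0
    total = len(text)
    counts = Counter(text)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def host_is_ip(url: str) -> bool:
    """True if the host part of the URL parses as an IPv4 or IPv6 address.

    Assumption: this checks only the host, not the whole URL string.
    """
    try:
        host = urlsplit(url).hostname
        if host is None:
            return False
        ipaddress.ip_address(host)
        return True
    except ValueError:
        # Raised for malformed URLs and for hosts that are not IP addresses.
        return False


def safe_ratio(numerator: float, denominator: float) -> float:
    """Ratio of two counts; returns 0.0 when the denominator is zero (assumption)."""
    return numerator / denominator if denominator else 0.0


def url_features(url: str) -> dict:
    """Compute illustrative URL-string features; the keys are hypothetical names."""
    query = urlsplit(url).query
    return {
        "length": len(url),
        "digits": sum(ch.isdigit() for ch in url),
        "query_params": len(parse_qsl(query, keep_blank_values=True)),
        "pct20": url.count("%20"),
        "at_signs": url.count("@"),
        "entropy": shannon_entropy(url),
        "has_ip": host_is_ip(url),
    }


if __name__ == "__main__":
    example = "http://192.168.0.1/login?user=a%20b&next=home"
    for name, value in url_features(example).items():
        print(f"{name}: {value}")
```

The page-level features (body length, script length, number of images and links) require fetching and parsing the page, so they are left out of this sketch; safe_ratio only shows how the three ratio features could be computed once those counts are available.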
Load and return the phishing URL dataset (classification).
Load and return a fitted regression model to predict the target in the phishing dataset.