- load_data(data_format: str = 'Dataset', as_train_test: bool = True) -> Union[Tuple, Dataset, DataFrame]
Loads and returns the phishing URL dataset (classification).
The phishing URL dataset is a slightly synthetic dataset of URLs, some regular and some used for phishing.
The dataset is based on the great project by Rohith Ramakrishnan and others, accompanied by a blog post. The authors have released it under an open license per our request, and for that we are very grateful to them.
This dataset is licensed under the Creative Commons Zero v1.0 Universal (CC0 1.0).
The typical ML task for this dataset is to build a model that predicts whether a URL is part of a phishing attack.
- Dataset Shape:
0 if the URL is benign, 1 if it is related to phishing
The month this URL was first encountered, as an int
The exact date this URL was first encountered
The domain extension
The number of characters in the URL
The number of digits in the URL
The number of query parameters in the URL
The number of ‘%20’ substrings in the URL
The number of @ characters in the URL
The entropy of the URL
True if the URL string contains an IP address
True if the URL's domain supports HTTP
True if the URL's domain supports HTTPS
True if the URL was live at the time of scraping
The number of days since domain registration
The number of days since domain registration expired
The number of characters in the URL’s web page
The number of HTML titles (H1/H2/…) in the page
The number of images in the page
The number of links in the page
The number of special characters in the page
The number of characters in scripts embedded in the page
The ratio of scriptLength to bodyLength (= scriptLength / bodyLength)
The ratio of specialChars to bodyLength (= specialChars / bodyLength)
The ratio of scriptLength to specialChars (= scriptLength / specialChars)
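The URL-string counts and page ratios described above can be illustrated with a short sketch. This is not the dataset authors' extraction code: the function names, the dictionary keys, the IPv4 regular expression, the use of Shannon entropy over characters, and the choice to return 0.0 for a zero denominator are all assumptions made for illustration.

```python
import math
import re
from collections import Counter
from urllib.parse import parse_qsl, urlparse

# Loose IPv4 pattern (assumption): four dot-separated groups of 1-3 digits.
# It does not check that each group is in the range 0-255.
IPV4_PATTERN = re.compile(r"(?<!\d)(?:\d{1,3}\.){3}\d{1,3}(?!\d)")


def shannon_entropy(text: str) -> float:
    """Shannon entropy of the character distribution, in bits per character."""
    if not text:
        return 0.0
    total = len(text)
    counts = Counter(text)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def url_string_features(url: str) -> dict:
    """Compute URL-string features; the keys are hypothetical, not column names."""
    query = urlparse(url).query
    return {
        "length": len(url),
        "digits": sum(ch.isdigit() for ch in url),
        "query_params": len(parse_qsl(query, keep_blank_values=True)),
        "pct20": url.count("%20"),
        "at_signs": url.count("@"),
        "entropy": shannon_entropy(url),
        "has_ip": bool(IPV4_PATTERN.search(url)),
    }


def safe_ratio(numerator: float, denominator: float) -> float:
    """Return numerator / denominator, or 0.0 for a zero denominator (assumption)."""
    return numerator / denominator if denominator else 0.0


def page_ratios(body_length: int, script_length: int, special_chars: int) -> dict:
    """Compute the three page ratios using the formulas given in parentheses."""
    return {
        "script_to_body": safe_ratio(script_length, body_length),
        "special_to_body": safe_ratio(special_chars, body_length),
        "script_to_special": safe_ratio(script_length, special_chars),
    }
```

Calling `url_string_features("http://192.168.0.1/login?user=a%20b&id=42")` returns one dictionary of per-URL values, and `page_ratios(200, 50, 20)` returns the three ratios for a page with those counts.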
- data_format : str, default: 'Dataset'
The format of the returned value, either 'Dataset' or 'Dataframe'. 'Dataset' returns the data as a Dataset object; 'Dataframe' returns the data as a pandas DataFrame object.
- as_train_test : bool, default: True
If True, the returned data is split into train and test sets exactly as it was when the toy model was trained. The first return value is the train data and the second is the test data. To get this model, call the load_fitted_model() function. If False, a single object is returned.
- dataset : Union[deepchecks.Dataset, pd.DataFrame]
The data object, in the format selected by the data_format argument.
- train, test : Tuple[Union[deepchecks.Dataset, pd.DataFrame], Union[deepchecks.Dataset, pd.DataFrame]]
Returned if as_train_test = True. A tuple of two objects representing the dataset split into train and test sets.
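A minimal usage sketch of the two return shapes follows. It assumes the function can be imported from `deepchecks.tabular.datasets.classification.phishing`; that import path is an assumption and may differ between deepchecks versions. The wrapper name `load_phishing` and its `loader` argument are hypothetical, and `loader` exists only so the return handling can be exercised without downloading the data.

```python
def load_phishing(data_format="Dataframe", as_train_test=True, loader=None):
    """Call load_data and always return a (first, second) pair.

    Hypothetical helper: `second` is None when as_train_test is False.
    """
    if loader is None:
        # Assumed import path; adjust it to your installed deepchecks version.
        from deepchecks.tabular.datasets.classification import phishing
        loader = phishing.load_data

    result = loader(data_format=data_format, as_train_test=as_train_test)
    if as_train_test:
        train, test = result  # documented order: train first, then test
        return train, test
    return result, None  # a single Dataset or DataFrame
```

With the defaults, `load_phishing()` would return the train and test data as pandas DataFrames, while `load_phishing(as_train_test=False)` would return the full data followed by `None`.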