I stumbled across this dataset on phishing websites during my search for useful and meaningful datasets in the field of cyber security. Since I think that AI is highly underestimated in cyber-security research, I thought I'd give this dataset a try to see whether it is a challenge. If not, then it is just another entry in my dataset exploration series.

It originates from research by Rami Mohammad and others:
Mohammad, Rami, McCluskey, T.L. and Thabtah, Fadi (2012) An Assessment of Features Related to Phishing Websites using an Automated Technique. In: International Conference For Internet Technology And Secured Transactions. ICITST 2012. IEEE, London, UK, pp. 492-497. ISBN 978-1-4673-5325-0

Mohammad, Rami, Thabtah, Fadi Abdeljaber and McCluskey, T.L. (2014) Predicting phishing websites based on self-structuring neural network. Neural Computing and Applications, 25 (2). pp. 443-458. ISSN 0941-0643

Mohammad, Rami, McCluskey, T.L. and Thabtah, Fadi Abdeljaber (2014) Intelligent Rule based Phishing Websites Classification. IET Information Security, 8 (3). pp. 153-160. ISSN 1751-8709

import time
import numpy as np
import pandas as pd
from scipy.io import arff
from io import StringIO
import matplotlib.pyplot as plt
import sklearn
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split, KFold
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
import xgboost

input_file_path = "./data/PhishingData.arff"
input_data, input_meta = arff.loadarff(input_file_path)
input_data_df = pd.DataFrame(input_data)

display(input_data_df.head(3))
display(input_data_df.tail(3))
        SFH  popUpWidnow  SSLfinal_State  Request_URL  URL_of_Anchor  web_traffic  URL_Length  age_of_domain  having_IP_Address  Result
0      b'1'        b'-1'            b'1'        b'-1'          b'-1'         b'1'        b'1'           b'1'               b'0'    b'0'
1     b'-1'        b'-1'           b'-1'        b'-1'          b'-1'         b'0'        b'1'           b'1'               b'1'    b'1'
2      b'1'        b'-1'            b'0'         b'0'          b'-1'         b'0'       b'-1'           b'1'               b'0'    b'1'

        SFH  popUpWidnow  SSLfinal_State  Request_URL  URL_of_Anchor  web_traffic  URL_Length  age_of_domain  having_IP_Address  Result
1350  b'-1'         b'0'           b'-1'        b'-1'          b'-1'         b'0'       b'-1'          b'-1'               b'0'    b'1'
1351   b'0'         b'0'            b'1'         b'0'           b'0'         b'0'       b'-1'           b'1'               b'0'    b'1'
1352   b'1'         b'0'            b'1'         b'1'           b'1'         b'0'       b'-1'          b'-1'               b'0'   b'-1'
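
The b'...' values appear because scipy's ARFF reader returns nominal attributes as byte strings. Here is a minimal sketch of that behaviour, assuming a tiny hand-written ARFF snippet (the relation name `toy` and its three rows are made up for illustration and are not taken from PhishingData.arff):

```python
from io import StringIO

import pandas as pd
from scipy.io import arff

# Hypothetical miniature ARFF document; only the attribute names mirror the dataset.
arff_text = """@relation toy
@attribute SFH {1,-1,0}
@attribute Result {0,1,-1}
@data
1,0
-1,1
0,-1
"""

# loadarff accepts a file-like object as well as a file path.
toy_data, toy_meta = arff.loadarff(StringIO(arff_text))
toy_df = pd.DataFrame(toy_data)

print(toy_meta.names())   # attribute names in file order
print(toy_meta.types())   # attribute types as reported by scipy
print(toy_df)
```

Every cell in `toy_df` is a byte string, which is why a decoding step is needed before the values can be treated as categories or integers.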

Since all entries are byte strings, it’s time to change that:

for col in input_data_df:
    if col == "Result":
        # the target: decode the byte string and convert it to an integer
        temp = list(map(lambda x: int(x.decode('UTF-8')), input_data_df[col]))
        input_data_df[col] = temp
    else:
        # a feature: decode the byte string and store it as a categorical
        temp = list(map(lambda x: x.decode('UTF-8'), input_data_df[col]))
        input_data_df[col] = temp
        input_data_df[col] = pd.Categorical(input_data_df[col])

display(input_data_df.head(2))
display(input_data_df.tail(2))
        SFH  popUpWidnow  SSLfinal_State  Request_URL  URL_of_Anchor  web_traffic  URL_Length  age_of_domain  having_IP_Address  Result
0         1           -1               1           -1             -1            1           1              1                  0       0
1        -1           -1              -1           -1             -1            0           1              1                  1       1

        SFH  popUpWidnow  SSLfinal_State  Request_URL  URL_of_Anchor  web_traffic  URL_Length  age_of_domain  having_IP_Address  Result
1351      0            0               1            0              0            0          -1              1                  0       1
1352      1            0               1            1              1            0          -1             -1                  0      -1
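
The same decoding can probably be written more compactly with the pandas string accessor. This is only a sketch on a made-up three-row frame; the helper name `decode_bytes_frame` is my own invention and not part of any library:

```python
import pandas as pd

# Hypothetical miniature frame mimicking what the ARFF loader returns:
# every cell is a byte string.
raw_df = pd.DataFrame({
    "SFH": [b"1", b"-1", b"0"],
    "Result": [b"0", b"1", b"-1"],
})


def decode_bytes_frame(df, target="Result"):
    """Decode byte-string cells: the target becomes int, features become categorical."""
    out = pd.DataFrame(index=df.index)
    for col in df.columns:
        decoded = df[col].str.decode("utf-8")  # vectorized bytes -> str
        if col == target:
            out[col] = decoded.astype(int)
        else:
            out[col] = pd.Categorical(decoded)
    return out


decoded_df = decode_bytes_frame(raw_df)
print(decoded_df.dtypes)
```

The behaviour should match the loop above; the only real difference is that the per-element lambda is replaced by `Series.str.decode`.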

Let’s one-hot-encode it and see if the classes are distinguishable visually:

input_data_df_categorical = input_data_df.copy(deep=True)
for col in input_data_df:
    if col != "Result":
        dummies = pd.get_dummies(input_data_df_categorical[col], prefix=('categorical_'+col))
        input_data_df_categorical.drop(col, inplace=True, axis=1)
        input_data_df_categorical = pd.concat([input_data_df_categorical, dummies], axis=1)
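
As an aside, the column-by-column loop above can likely be collapsed into a single `pd.get_dummies` call. The sketch below uses a made-up four-row frame (only the column names are borrowed from the dataset) and asks for integer dummies via `dtype=int`, which is my choice and not something the loop does:

```python
import pandas as pd

# Hypothetical miniature frame; the values are invented for illustration.
toy_df = pd.DataFrame({
    "SFH": pd.Categorical(["1", "-1", "0", "1"]),
    "web_traffic": pd.Categorical(["0", "1", "1", "-1"]),
    "Result": [0, 1, 1, -1],
})

# Encode every column except the target in one call.
feature_cols = [c for c in toy_df.columns if c != "Result"]
encoded_df = pd.get_dummies(
    toy_df,
    columns=feature_cols,
    prefix=["categorical_" + c for c in feature_cols],
    dtype=int,  # 0/1 integers instead of booleans
)

# Sanity check: within each original feature exactly one dummy is set per row.
for col in feature_cols:
    group = encoded_df.filter(like="categorical_" + col + "_")
    assert (group.sum(axis=1) == 1).all()

print(encoded_df.columns.tolist())
```

The untouched `Result` column stays in the frame, so it still has to be separated from the features before any model sees them.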



# split data into X and y; drop the target from X so the label does not leak into the features
y_df = input_data_df["Result"].copy(deep=True)
X_df = input_data_df_categorical.drop(columns="Result")

plt.figure(figsize=(11,9))
for Class in input_data_df['Result'].unique():
    plt.plot(X_df[input_data_df['Result'] == Class].values[0], label=Class)
plt.title("Examples of all classes (unscaled)")
plt.legend()
plt.show()

That looks way too simple, and indeed it is way too easy on a scaled dataset. Read my commentary below!

In this paper:
Mohammad, Rami, Thabtah, Fadi Abdeljaber and McCluskey, T.L. (2014) Predicting phishing websites based on self-structuring neural network. Neural Computing and Applications, 25 (2). pp. 443-458. ISSN 0941-0643

they claim that their best neural network reaches a test set accuracy of 92.48%. In one section they also present really bad results for traditional algorithms. I seriously question the authors' skill sets, provided that this is the same dataset that they used. It seems that a free Udacity course on machine learning basics from quite some years ago provides a much better education in terms of ML applicability than many universities. I see this in quite a lot of papers!
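
For anyone who wants to check such a claim themselves, a quick hold-out comparison of two traditional classifiers could look roughly like this. It is a sketch under loud assumptions: the function name `baseline_accuracies`, the 80/20 split, and the model settings are my own picks, and the demo runs on synthetic random data, so the numbers it prints say nothing about the phishing dataset or the paper.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier


def baseline_accuracies(X, y, test_size=0.2, seed=0):
    """Fit two off-the-shelf classifiers and return their hold-out accuracy."""
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=test_size, random_state=seed, stratify=y
    )
    models = {
        "decision_tree": DecisionTreeClassifier(random_state=seed),
        "random_forest": RandomForestClassifier(n_estimators=100, random_state=seed),
    }
    scores = {}
    for name, model in models.items():
        model.fit(X_train, y_train)
        scores[name] = accuracy_score(y_test, model.predict(X_test))
    return scores


# Synthetic stand-in (NOT the phishing data): nine features in {-1, 0, 1}
# and a three-class target derived from the first two of them.
rng = np.random.default_rng(0)
X_demo = rng.integers(-1, 2, size=(300, 9))
y_demo = np.sign(X_demo[:, 0] + X_demo[:, 1])

scores = baseline_accuracies(X_demo, y_demo)
print(scores)
```

On the real data one would pass the one-hot-encoded features (without the Result column) as `X` and the Result column as `y`.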

Update: my rant on “scientific papers” and dissertations