An empirical study on feature importance and model performance for phishing website detection using a random forest classifier

Noor Sabah Asker; Essa Ibrahim Essa

doi:10.33545/2707661X.2024.v5.i2a.85

2024, Vol. 5, Issue 2, Part A

Abstract:

One of the most dangerous threats to internet users is fake websites that imitate real ones in order to obtain private data. This paper discusses the detection of phishing websites using a Random Forest classifier. The dataset has 10,000 samples with 48 features derived from URLs, including basic features such as the number of dots, subdomain level, URL length, the presence of certain special characters, and other such predictors.
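As a rough illustration of what URL-derived features of this kind can look like, the following minimal sketch computes a few of them from a raw URL string. The feature names are taken from this abstract, but the exact definitions used in the dataset are not given here, so every formula below is an assumption. In particular, the subdomain level is approximated as the number of host labels beyond an assumed two-label registered domain. The function name `extract_url_features` is hypothetical.

```python
from urllib.parse import urlsplit, parse_qsl


def extract_url_features(url: str) -> dict:
    """Compute a few simple lexical features from a URL (assumed definitions)."""
    parts = urlsplit(url)
    host = parts.hostname or ""
    labels = [label for label in host.split(".") if label]
    return {
        # Number of '.' characters in the full URL.
        "NumDots": url.count("."),
        # Assumption: host labels beyond a two-label registered domain.
        "SubdomainLevel": max(len(labels) - 2, 0),
        # Total number of characters in the URL.
        "UrlLength": len(url),
        # Number of '-' characters in the full URL.
        "NumDash": url.count("-"),
        # Number of key=value pairs in the query string.
        "NumQueryComponents": len(parse_qsl(parts.query, keep_blank_values=True)),
        # 1 if the scheme is anything other than https, else 0.
        "NoHttps": int(parts.scheme != "https"),
    }


example = extract_url_features("http://login.secure-bank.example.com/a.php?id=1&x=2")
print(example)
```

A real pipeline would likely need a public-suffix list to handle domains such as `example.co.uk`, which this simplification miscounts.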

In this paper, we also carried out extensive exploratory data analysis to determine the distribution and relevance of each variable. Notable findings include the following: the average URL length was 70 characters; the average number of dots was approximately 2.45; and features such as the absence of HTTPS and the use of insecure forms were quite frequent. These features proved very useful in differentiating between legitimate websites and phishing websites.

This dataset was used for training and testing the Random Forest classifier, which obtained an accuracy of 98.2% and an F1 score of 0.98; an accuracy of 99.22% and an F1-score of 98.22% are also reported for the model. The confusion matrix shows good performance, with 970 true negatives and 994 true positives, and few errors: 18 false positives and 18 false negatives. These findings demonstrate the consistency of the model and its ability to accurately distinguish between phishing and legitimate sites.
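The relationship between the confusion-matrix counts and the summary metrics can be checked with a short calculation. The sketch below applies the standard definitions of accuracy, precision, recall, and F1 to the four counts quoted above. The function name `metrics_from_confusion` is hypothetical, and treating phishing as the positive class is an assumption.

```python
def metrics_from_confusion(tp: int, tn: int, fp: int, fn: int) -> dict:
    """Derive standard binary classification metrics from confusion-matrix counts."""
    total = tp + tn + fp + fn
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return {
        "total": total,
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# Counts as stated in the abstract.
m = metrics_from_confusion(tp=994, tn=970, fp=18, fn=18)
print(m["total"])            # 2000
print(m["accuracy"])         # 0.982
print(round(m["f1"], 4))     # 0.9822
```

With these counts the test set contains 2,000 samples, the accuracy is 98.2%, and the F1 score is about 98.22%.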

Feature importance analysis suggested that features such as ‘NumDots’, ‘SubdomainLevel’, ‘UrlLength’, ‘NumDash’, and ‘NumQueryComponents’ are among the most important for classifying URLs as phishing. ‘NoHttps’, ‘InsecureForms’, ‘PctExtHyperlinks’, and other features related to webpage security and the presence of suspicious elements also contributed substantially to the model’s predictive capacity.

This research highlights the benefits of using machine learning techniques, particularly Random Forest classifiers, to improve cybersecurity defenses against phishing threats. Future work will investigate the expandability of the system and further improvements using more sophisticated learning algorithms to enhance detection performance.

Pages: 01-10

International Journal of Communication and Information Technology

How to cite this article:

Noor Sabah Asker, Essa Ibrahim Essa. An empirical study on feature importance and model performance for phishing website detection using a random forest classifier. Int J Commun Inf Technol 2024;5(2):01-10. DOI: 10.33545/2707661X.2024.v5.i2a.85

P-ISSN: 2707-661X, E-ISSN: 2707-6628
