K-Gen PhishGuard: an Ensemble Approach for Phishing Detection with K-Means and Genetic Algorithm

Keywords

AdaBoost; ensemble learning; feature selection; genetic algorithm; K-means clustering; machine learning; phishing detection

How to Cite

K-Gen PhishGuard: an Ensemble Approach for Phishing Detection with K-Means and Genetic Algorithm. (2025). Al-Khwarizmi Engineering Journal, 21(2), 117-135. https://doi.org/10.22153/kej.2025.04.011

Abstract

Phishing detection is a critical problem in cybersecurity, and a central challenge is combining machine learning with an efficient feature selection method to identify malicious websites precisely. This research presents a two-phase phishing detection system that combines unsupervised feature selection with supervised classification. In the first phase, a genetic algorithm (GA) identifies the best set of features, which the K-means clustering algorithm then uses to divide the dataset into groups with similar traits. In the second phase, the GA identifies the best set of features within each group to enhance classification. Finally, a voting ensemble combines Support Vector Machine (SVM), Random Forest (RF), Extreme Gradient Boosting (XGBoost) and Adaptive Boosting (AdaBoost) models, and their predictions are aggregated using a soft voting mechanism. The study uses the web page phishing detection dataset, which consists of 11,430 URLs with 87 features. The voting ensemble achieves an accuracy of 99% with feature selection, compared with 77.3% without it. GA-optimised feature selection significantly improves model performance, reducing computational complexity and improving key metrics such as accuracy, precision and F1-score. Additionally, the performance across four clusters demonstrates the positive impact of K-means clustering on classification accuracy for specific data groups. The results show that integrating feature selection with ensemble learning is effective for phishing detection and demonstrate the scalability and efficiency of the approach in real-world applications.
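
As an illustration of the soft voting step only, the following minimal Python sketch averages the class probabilities reported by several classifiers and picks the most probable class for each sample. It is not the authors' implementation: the function name `soft_vote`, the equal model weights and the example probabilities are assumptions made for this sketch, and the four probability tables merely stand in for the outputs of the SVM, RF, XGBoost and AdaBoost models.

```python
from typing import List, Sequence


def soft_vote(probas: Sequence[Sequence[Sequence[float]]]) -> List[int]:
    """Average per-class probabilities across models and return the argmax.

    probas[m][i][c] is model m's probability that sample i belongs to class c.
    All models are weighted equally (an assumption of this sketch).
    """
    n_models = len(probas)
    n_samples = len(probas[0])
    n_classes = len(probas[0][0])
    labels = []
    for i in range(n_samples):
        # Mean probability of each class over all models for sample i.
        mean = [
            sum(probas[m][i][c] for m in range(n_models)) / n_models
            for c in range(n_classes)
        ]
        # Index of the class with the highest mean probability.
        labels.append(max(range(n_classes), key=mean.__getitem__))
    return labels


# Hypothetical outputs of four classifiers on two URLs.
# Columns are [P(legitimate), P(phishing)]; the values are made up.
example = [
    [[0.60, 0.40], [0.80, 0.20]],  # stands in for SVM
    [[0.30, 0.70], [0.90, 0.10]],  # stands in for RF
    [[0.45, 0.55], [0.60, 0.40]],  # stands in for XGBoost
    [[0.20, 0.80], [0.30, 0.70]],  # stands in for AdaBoost
]
predictions = soft_vote(example)
print(predictions)  # [1, 0]
```

In the first sample one model favours the legitimate class, yet the averaged phishing probability (0.6125) is higher, so the ensemble labels the URL as phishing. This is the difference between soft voting and a simple majority count of hard labels.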

Creative Commons License

This work is licensed under a Creative Commons Attribution 4.0 International License.

Copyright (c) 2025 Al-Khwarizmi Engineering Journal