Classifying SMS as Spam or Ham: Leveraging NLP and Machine Learning Techniques.

Authors :: Dharrao, Deepak
Gaikwad, Pratik
Gawai, Shailesh V.
Bongale, Anupkumar M.
Patel, Kishan
Singh, Aniket
Source :: International Journal of Safety & Security Engineering; Feb2024, Vol. 14 Issue 1, p289-296, 8p
Publication Year :: 2024
Abstract: In an era dominated by mobile communication, Short Message Service (SMS) plays a pivotal role in interpersonal interactions. However, the surge in unsolicited spam messages necessitates effective differentiation mechanisms. This exploratory data analysis (EDA) uses a dataset from the UCI Machine Learning Repository to discern key characteristics distinguishing spam from legitimate (ham) messages. Employing Natural Language Processing (NLP) techniques, namely text vectorization (bag-of-words (BOW) and TF-IDF), a Naïve Bayes algorithm, and sentiment analysis, this investigation uncovers patterns and peculiarities specific to spam content. The findings highlight distinct differences in lexicon usage, message structure, and linguistic markers between spam and legitimate messages. For instance, spam messages often exhibit aggressive language and use unconventional structures. Specific examples of such language patterns and structural anomalies are provided, offering a more nuanced understanding of the study's outcomes. Rooted in data-driven insights, this study lays the foundation for future endeavours in developing robust, NLP-powered spam detection mechanisms to preserve the essence of personal communication in the SMS sphere. Evaluating the model on a test dataset of 5,572 SMS messages yielded noteworthy results. The model achieved a precision of 98% for legitimate messages and 100% for spam, meaning that no legitimate message was misclassified as spam. However, recall for spam messages was notably lower, at 85%. This suggests potential challenges in detecting certain types of spam, emphasizing the need for further refinement of the model. The respective F1-scores for ham and spam messages were 99% and 92%, shedding light on the model's overall efficacy. These performance metrics, together with an overall accuracy of 98%, prompt deeper reflection on the practical implications of the results and point to areas for future research and enhancement in spam detection mechanisms within the dynamic landscape of mobile communication. [ABSTRACT FROM AUTHOR]
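
The reported spam F1-score can be checked against the reported precision and recall, since F1 is the harmonic mean of the two. The sketch below is not taken from the paper; the helper names f1_score and recall_from_f1 are hypothetical, and the inputs are the rounded percentages quoted in the abstract, so results match only up to rounding.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (standard F1 definition)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def recall_from_f1(precision: float, f1: float) -> float:
    """Solve F1 = 2PR / (P + R) for R, given P and F1."""
    return f1 * precision / (2 * precision - f1)


# Values below are the rounded figures quoted in the abstract.
spam_f1 = f1_score(precision=1.00, recall=0.85)
# The abstract does not state ham recall; it is inferred here from
# ham precision (98%) and ham F1 (99%).
ham_recall = recall_from_f1(precision=0.98, f1=0.99)

print(f"spam F1: {spam_f1:.3f}")
print(f"implied ham recall: {ham_recall:.3f}")
```

The spam F1 comes out at about 0.919, which rounds to the reported 92%. The implied ham recall is approximately 1.0 (marginally above, because the inputs are rounded), which agrees with the reported 100% spam precision: if no ham message is labeled as spam, every ham message is recovered.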