Sunday, November 20, 2022

Detect a Spam Email using Natural Language Processing

A spam email may also be a phishing email that targets unsuspecting users. Phishing emails usually carry subject lines designed to lure users into opening them. For example, a person desperately in need of money opens her inbox and finds an email with the subject line "You just won $1,000!" from acds@tr3smar1as.n3t. Tempted, she opens it, and the body reads: "The Tres Marias is giving away $1,000, clicked this link to claimed your price hurry, th1s is offer is being given to first fiv3 claimerz". Once she clicks the link, a msfvenom payload or a virus is downloaded to her PC, and the PC is compromised.

Examining the example email further, the subject line already sounds suspicious: it is a compelling line that would arouse curiosity even if the unsuspecting recipient were not in need of money. The sender's address is obviously fake; no reputable company would swap letters for digits in its domain name. Lastly, the email body is full of spelling and grammatical errors. A desperate person would tend to ignore these signs and click the link anyway.

To prevent this, a spam detection program should automatically delete emails like these as a first line of defense.

In this post, I created a simple spam email detector in Python using Natural Language Processing. It consists of two programs. The first creates the model: it trains, tests, and saves the model for deployment. The second is an example of how to deploy the model and use it to detect spam emails.

The dataset I used to train and test the model has two columns: (1) the subject together with the email body, and (2) a spam indicator. The model achieved 99% accuracy, so it is usable and ready for actual deployment.
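For reference, the two-column layout can be sketched as follows. The column names `text` and `spam` are the ones the training code reads from emails.csv; the two rows are made-up illustrations, and treating 1 as spam and 0 as not spam is an assumption about the encoding.

```python
import csv
import io

# Hypothetical two-row sample in the same layout as emails.csv:
# a `text` column (subject plus body) and a `spam` column
# (assumed here: 1 = spam, 0 = not spam).
SAMPLE_CSV = '''text,spam
"Subject: You just won $1,000! Click this link to claim your prize",1
"Subject: Meeting notes Here are the notes from Monday",0
'''

def load_rows(csv_text):
    """Parse CSV text into a list of (text, label) tuples."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [(row["text"], int(row["spam"])) for row in reader]

rows = load_rows(SAMPLE_CSV)
for text, label in rows:
    print(label, text[:40])
```

The text field is quoted because it can contain commas, as in "$1,000".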


Here is the code:

1. Train and Test the model:

import string
import warnings

import joblib
import numpy as np
import pandas as pd
from nltk import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import KFold, cross_val_score, train_test_split
from sklearn.naive_bayes import MultinomialNB

warnings.filterwarnings('ignore')

# Requires the NLTK tokenizer and stop-word data,
# e.g. nltk.download('punkt') and nltk.download('stopwords').
stemmer = PorterStemmer()
stop_words = set(stopwords.words('english'))  # build once, not once per word

# emails.csv has two columns: 'text' (subject and body) and 'spam' (the label).
df = pd.read_csv('emails.csv')

def count_words(text):
    """Return the number of tokens in the text."""
    return len(word_tokenize(text))

df['count'] = df['text'].apply(count_words)

def process_text(text):
    """Strip punctuation, then drop English stop words."""
    no_punc = ''.join(char for char in text if char not in string.punctuation)
    return ' '.join(word for word in no_punc.split() if word.lower() not in stop_words)

df['text'] = df['text'].apply(process_text)

def stemming(text):
    """Stem word by word (iterating over the string itself would yield single characters)."""
    return ' '.join(stemmer.stem(word) for word in text.split())

df['text'] = df['text'].apply(stemming)

vectorizer = CountVectorizer(
    ngram_range=(1, 3),   # unigrams, bigrams, and trigrams
    stop_words="english",
    lowercase=False,      # PorterStemmer already lowercases each word
)
message_bow = vectorizer.fit_transform(df['text'])

# Hold out 20% of the emails for testing.
X_train, X_test, y_train, y_test = train_test_split(message_bow, df['spam'], test_size=0.20)
nb = MultinomialNB()
nb.fit(X_train, y_train)
y_pred = nb.predict(X_test)
print("Accuracy on the held-out test set is :", np.mean(y_pred == y_test) * 100, " %")

# Save the trained model and the fitted vectorizer for deployment.
filename = "model.sav"
bow = 'vect.sav'
joblib.dump(nb, filename)
joblib.dump(vectorizer, bow)

# Estimate accuracy with 5-fold cross-validation on the full dataset.
kfold = KFold(n_splits=5, shuffle=True)
print("Accuracy using Cross Validation is :", np.mean(cross_val_score(nb, message_bow, df['spam'], cv=kfold, scoring="accuracy")) * 100, " %")
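To make the bag-of-words and Naive Bayes steps less of a black box, here is a minimal from-scratch sketch that uses only the standard library. It is not the scikit-learn implementation: the helper names (`ngrams`, `train_nb`, `predict_nb`), the four toy messages, the lowercasing, and the add-one smoothing are all assumptions chosen for illustration.

```python
import math
from collections import Counter

def ngrams(text, max_n=3):
    """Return all word n-grams of length 1..max_n (compare ngram_range=(1, 3))."""
    words = text.lower().split()
    return [" ".join(words[i:i + n])
            for n in range(1, max_n + 1)
            for i in range(len(words) - n + 1)]

def train_nb(samples):
    """Count n-gram occurrences per class from (text, label) pairs."""
    counts = {}        # label -> Counter of n-gram counts
    docs = Counter()   # label -> number of training messages
    for text, label in samples:
        counts.setdefault(label, Counter()).update(ngrams(text))
        docs[label] += 1
    vocab = set().union(*counts.values())
    return counts, docs, vocab

def predict_nb(model, text):
    """Pick the label with the highest log-probability (add-one smoothing)."""
    counts, docs, vocab = model
    total_docs = sum(docs.values())
    best_label, best_score = None, -math.inf
    for label, grams in counts.items():
        score = math.log(docs[label] / total_docs)    # class prior
        denom = sum(grams.values()) + len(vocab)
        for gram in ngrams(text):
            if gram in vocab:                          # skip n-grams never seen in training
                score += math.log((grams[gram] + 1) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Toy training data (made up); 1 = spam, 0 = not spam.
toy = [
    ("win free money now", 1),
    ("claim your free prize", 1),
    ("meeting agenda for monday", 0),
    ("please review the project report", 0),
]
model = train_nb(toy)
print(predict_nb(model, "free prize money"))        # 1
print(predict_nb(model, "monday project meeting"))  # 0
```

Each message is scored against every class by adding the log-probabilities of its n-grams, which is why words such as "free" that appear only in spam messages pull the prediction toward spam.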

2. Sample Model Deployment:

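import string
import warnings

import joblib
import pandas as pd
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer

warnings.filterwarnings('ignore')

# Requires the NLTK stop-word data, e.g. nltk.download('stopwords').
stemmer = PorterStemmer()
stop_words = set(stopwords.words('english'))

filename = "model.sav"
bow = 'vect.sav'

# Read the email to classify from the first cell of the CSV file.
df = pd.read_csv('email 1.csv')
x = df.iloc[0, 0]

# The preprocessing must match what was applied to the training data.
def process_text(text):
    """Strip punctuation, then drop English stop words."""
    no_punc = ''.join(char for char in text if char not in string.punctuation)
    return ' '.join(word for word in no_punc.split() if word.lower() not in stop_words)

def stemming(text):
    """Stem word by word."""
    return ' '.join(stemmer.stem(word) for word in text.split())

x = stemming(process_text(x))

# Load the fitted vectorizer and convert the email to bag-of-words features.
vectorizer = joblib.load(bow)
message_bow = vectorizer.transform([x])

# Load the trained model and predict the spam indicator.
loaded_model = joblib.load(filename)
result = loaded_model.predict(message_bow)
print(result)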
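One way to package the steps above for reuse is a small helper that preprocesses the text, vectorizes it, and maps the prediction to a boolean. This is a sketch under assumptions: `classify_email` is a hypothetical name, the label 1 is assumed to mean spam, and the stub classes only stand in for the objects loaded from vect.sav and model.sav so that the snippet runs on its own.

```python
def classify_email(text, vectorizer, model, preprocess=lambda t: t):
    """Return True if `model` labels `text` as spam (assumed label: 1).

    `vectorizer` needs a transform(list_of_str) method and `model` a
    predict(features) method, like the objects saved by the training program.
    `preprocess` should be the same cleaning that was used at training time.
    """
    features = vectorizer.transform([preprocess(text)])
    return int(model.predict(features)[0]) == 1


# Stand-in objects (hypothetical) so the sketch runs without the saved files.
class StubVectorizer:
    def transform(self, texts):
        # One "feature" per message: how many times the word "won" appears.
        return [[t.lower().split().count("won")] for t in texts]


class StubModel:
    def predict(self, features):
        # Flag a message as spam (1) when the feature count is non-zero.
        return [1 if row[0] > 0 else 0 for row in features]


vec, clf = StubVectorizer(), StubModel()
print(classify_email("You just won $1,000!", vec, clf))     # True
print(classify_email("Lunch at noon tomorrow?", vec, clf))  # False
```

With the real objects, the stubs would be replaced by the results of `joblib.load('vect.sav')` and `joblib.load('model.sav')`, and `preprocess` by a function that applies `process_text` followed by `stemming`.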
