A spam email may also be a phishing email, one that targets unsuspecting users. Phishing emails usually carry subject lines designed to lure users into opening them. For example, suppose a person desperately in need of money opens her inbox and finds an email with the subject line "You just won $1,000!" from acds@tr3smar1as.n3t. The offer is very tempting, so she opens it. The email reads: "The Tres Marias is giving away $1,000, clicked this link to claimed your price hurry, th1s is offer is being given to first fiv3 claimerz". Once she clicks the link, a msfvenom payload or a virus is downloaded to her PC, and her PC is compromised.
Looking at the example email more closely, the subject line is already suspicious: it is a compelling line that would attract curiosity even if the receiver were not in need of money. The sender's address is obviously fake. No reputable company would replace letters with look-alike digits in its domain name. Lastly, the email body is full of grammatical errors. A desperate person would tend to ignore all of this and click the link anyway.
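The sender check described above can be sketched in a few lines. This is only a rough heuristic of my own for illustration, not part of the model in this post: the function name `has_lookalike_digits` is hypothetical, and the rule (a digit sandwiched between two letters inside a domain label) is an assumption that will also flag some legitimate domains.

```python
import re

def has_lookalike_digits(address: str) -> bool:
    """Return True if any label of the sender's domain has a digit between two letters."""
    # Everything after the last "@" is treated as the domain.
    domain = address.rsplit("@", 1)[-1].lower()
    # Check each dot-separated label, including the top-level one.
    return any(
        re.search(r"[a-z]\d[a-z]", label) is not None
        for label in domain.split(".")
    )

print(has_lookalike_digits("acds@tr3smar1as.n3t"))
print(has_lookalike_digits("info@example.com"))
```

A rule like this could at most serve as one extra signal next to the text classifier, since spammers can simply choose a clean-looking domain.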
To prevent this from happening, a spam detection program should automatically delete emails like these as a first line of defense.
In this post, I build a simple spam email detector in Python using Natural Language Processing. It consists of two programs. The first creates the model: it trains, tests, and saves it for deployment. The second is an example of how to load the saved model and use it to detect spam emails.
The dataset I used to train and test the model has two columns: text (the subject together with the email body) and spam (the spam indicator). The model achieved 99% accuracy, so it is usable and ready for actual deployment.
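To make the two-column layout concrete, here is a minimal sketch that parses a tiny sample in the same shape. The rows are made up for illustration (the real emails.csv is not shown in this post), and encoding the indicator as 1 for spam and 0 for not spam is an assumption.

```python
import csv
import io

# Hypothetical rows; the real emails.csv is not reproduced here.
# Assumption: the spam column holds 1 for spam and 0 for not spam.
SAMPLE_CSV = '''text,spam
"Subject: You just won $1,000! Click this link to claim your prize",1
"Subject: Meeting notes Attached are the notes from the meeting",0
'''

rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))
texts = [row["text"] for row in rows]          # subject plus body
labels = [int(row["spam"]) for row in rows]    # spam indicator

for text, label in zip(texts, labels):
    print(label, text[:40])
```

The programs in this post read the same layout with `pd.read_csv` and then use `df['text']` as the input and `df['spam']` as the target.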
Here is the code:
1. Train and Test the model:
    # word_tokenize and stopwords need their NLTK data,
    # downloaded once with nltk.download().
    import string
    import warnings

    import joblib
    import numpy as np
    import pandas as pd
    from nltk import word_tokenize
    from nltk.corpus import stopwords
    from nltk.stem import PorterStemmer
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.metrics import accuracy_score
    from sklearn.model_selection import KFold, cross_val_score, train_test_split
    from sklearn.naive_bayes import MultinomialNB

    warnings.filterwarnings('ignore')

    stemmer = PorterStemmer()
    # Build the stop word set once instead of reloading the list for every word.
    stop_words = set(stopwords.words('english'))

    df = pd.read_csv('emails.csv')

    def count_words(text):
        """Return the number of tokens in the text."""
        return len(word_tokenize(text))

    df['count'] = df['text'].apply(count_words)

    def process_text(text):
        """Remove punctuation and English stop words."""
        no_punc = ''.join(char for char in text if char not in string.punctuation)
        return ' '.join(word for word in no_punc.split() if word.lower() not in stop_words)

    df['text'] = df['text'].apply(process_text)

    def stemming(text):
        """Stem word by word; iterating over the string itself would visit characters."""
        return ' '.join(stemmer.stem(word) for word in text.split())

    df['text'] = df['text'].apply(stemming)

    # Bag of words over unigrams, bigrams and trigrams.
    vectorizer = CountVectorizer(
        ngram_range=(1, 3),
        stop_words="english",
        lowercase=False,
    )
    message_bow = vectorizer.fit_transform(df['text'])

    X_train, X_test, y_train, y_test = train_test_split(message_bow, df['spam'], test_size=0.20)

    nb = MultinomialNB()
    nb.fit(X_train, y_train)

    y_pred = nb.predict(X_test)
    print("Accuracy on the test split is :", accuracy_score(y_test, y_pred) * 100, " %")

    # Save the model and the fitted vectorizer for deployment.
    filename = "model.sav"
    bow = 'vect.sav'
    joblib.dump(nb, filename)
    joblib.dump(vectorizer, bow)

    kfold = KFold(n_splits=5, shuffle=True)
    print("Accuracy using Cross Validation is :", np.mean(cross_val_score(nb, message_bow, df['spam'], cv=kfold, scoring="accuracy")) * 100, " %")

2. Sample Model Deployment:

    import string
    import warnings

    import joblib
    import pandas as pd
    from nltk.corpus import stopwords
    from nltk.stem import PorterStemmer

    warnings.filterwarnings('ignore')

    stemmer = PorterStemmer()
    stop_words = set(stopwords.words('english'))

    filename = "model.sav"
    bow = 'vect.sav'

    # Read the email to classify from the first cell of the CSV file.
    df = pd.read_csv('email 1.csv')
    x = df.iloc[0, 0]

    # The preprocessing below must match what was applied during training.
    def process_text(text):
        """Remove punctuation and English stop words."""
        no_punc = ''.join(char for char in text if char not in string.punctuation)
        return ' '.join(word for word in no_punc.split() if word.lower() not in stop_words)

    def stemming(text):
        """Stem word by word; iterating over the string itself would visit characters."""
        return ' '.join(stemmer.stem(word) for word in text.split())

    x = stemming(process_text(x))

    # Load the saved vectorizer and model, then classify the email.
    vectorizer = joblib.load(bow)
    message_bow = vectorizer.transform([x])

    loaded_model = joblib.load(filename)
    result = loaded_model.predict(message_bow)
    print(result)
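Both programs rely on `CountVectorizer` with `ngram_range=(1, 3)`, which represents each email by how often every one-, two-, and three-word sequence occurs in it. The sketch below shows that idea in plain Python. It is a simplified illustration, not how scikit-learn implements it: the helper name `ngram_counts` is hypothetical, and splitting on whitespace is an assumption (the real vectorizer applies its own tokenization and stop word filtering).

```python
from collections import Counter

def ngram_counts(text, n_min=1, n_max=3):
    """Count every run of n_min to n_max consecutive words in the text."""
    tokens = text.split()  # assumption: simple whitespace tokenization
    counts = Counter()
    for n in range(n_min, n_max + 1):
        for i in range(len(tokens) - n + 1):
            counts[" ".join(tokens[i:i + n])] += 1
    return counts

counts = ngram_counts("claim prize claim prize now")
for ngram, count in sorted(counts.items()):
    print(count, ngram)
```

Counts like these are the features the Naive Bayes model sees, which is why the deployment program has to reuse the vectorizer saved during training rather than fit a new one.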