
Please review and then let me know if you will be able to accurately perform an analysis on the attached spambase.csv dataset. This course is called DBST 667: Data Mining. This is a graduate level research project. All required instructions are attached.


Due date: 10/23/2020
Page length: 25 single-spaced pages. Note that the introduction, table of contents, and reference pages are not counted.
APA Citation: Use at least 12 APA citation references from the UMGC library. Use only full-text articles from scholarly journals. Limit the results to dates from 2015 to 2020.
Data Analysis on a SPAM Database
Instructions: Complete details of the project requirements and rubric are located at: file:///D:/UMUC/UMUC/DBST667/PRO/Project/Instructions%20for%20the%20Research%20Project.htm/DBST667/ResearchProject/Instructions%20for%20the%20Research%20Project.htm Use the above instructions to develop a project on Data Analysis on a SPAM Database using the attached spambase.csv dataset.
Draft Project background: E-mail spam is one of the common issues of the current internet era. These emails carry varying degrees of damaging or commercial content, ranging from multiple copies of the same content to product marketing or pornographic content. Most of the time, these emails expose customers to financial fraud. To control this situation, many researchers and IT experts are working to identify spam emails and place them in a separate folder, which helps everyday users to be extra cautious when dealing with such emails. Previous work in this direction is based on data mining algorithms such as Random Forest and Neural Network. In this research, the Naïve Bayes and J48 decision tree algorithms should be used, along with clustering techniques, and their performance on the SPAMBASE dataset should be analyzed. The dataset was retrieved from the UCI library link. The performance of the algorithms is compared using different statistical parameters. The algorithms should be executed with the help of the Weka tool.
Topic: Data Analysis on a SPAM Database
Thesis statement: I will analyze the spambase dataset and will try to predict ways of identifying and filtering high-spam emails. A comparative performance analysis will be conducted using the Naïve Bayes and J48 decision tree algorithms, along with some clustering techniques. The Weka data mining software package will be used during this research to aid performance analysis. The algorithms involved will be compared using the statistical data generated.
Instructions for the Research Project: file:///D:/UMUC/UMUC/DBST667/PRO/Project/Instructions%20for%20the%20Research%20Project.htm/DBST667/ResearchProject/Instructions%20for%20the%20Research%20Project.htm
Sample project breakdown:
Introduction
Background
Methodology
Dataset
Naive Bayes classifier
Result and Discussions
Naïve Bayes Classifier Output
J48 Decision Tree
EM
Simple K means
Comparison
Conclusion
References
Deliverables: Use a combination of the Project background, thesis statement, and the Instructions for the Research Project above to write a graduate-level research project on Data Analysis on a SPAM Database. Provide data tables, R scripts, and screenshots of all test cases that you use. Use the Sample project breakdown above as a guide to help you develop your table of contents. Do not limit your table of contents to what is in the Sample project breakdown; try to add more content to it.
NB: In the above Instructions for the Research Project, please disregard the following only: “ IV. Next step. Please select your topic for research project and post 1-2 paragraphs summary (abstract) on your intended topic as a New topic in this Conference. Please change the title of your post with the title of your project.”
Warning: Wikipedia and sites such as blogs are not scholarly sources and will result in a zero grade for APA citation and references. You should use the UMGC library to perform graduate-level research for scholarly articles from peer-reviewed sources. Therefore, use only references from the UMGC library for all research in this project.
This research paper will be submitted to Turnitin; therefore, similarity index should be less than 10%. UMGC Libray: https://sites.umgc.edu/library/index.cfm?_ga=2.26659579.75802204.1590034987-36080969.1569956532 If it prompts you to login, use this Username: saku Password: chk@25SA


Neha answered on Oct 25 2021
67832 - Research paper/ML coding.R
# Loading packages
library(e1071)
library(caTools)
library(dplyr)
# library(RWeka)
library(party)
library(factoextra)
library(caret)
# Importing spambase data
spambase <- read.csv('spambase.csv', header = TRUE, sep = ',')
# Structure
str(spambase)
# converting spam variable integer datatype to factor
spambase$spam <- as.factor(spambase$spam)
# ============== Naive Bayes ===================
# Splitting data into train and test data
# sample.split() expects the outcome vector, not the whole data frame;
# passing the data frame would sample over columns instead of rows
set.seed(1234)
split <- sample.split(spambase$spam, SplitRatio = 0.7)
train_cl <- subset(spambase, split == TRUE)
test_cl <- subset(spambase, split == FALSE)
# Fitting Naive Bayes Model to training dataset
set.seed(120) # Setting Seed
classifier_cl <- naiveBayes(spam ~ ., data = train_cl)
classifier_cl
# Predicting on test data
y_pred <- predict(classifier_cl, newdata = test_cl)
# Confusion Matrix (caret expects predictions in rows and reference in columns)
cm <- table(Predicted = y_pred, Actual = test_cl$spam)
# print confusion matrix
print(cm)
# Model Evaluation
confusionMatrix(cm)
# ========================= K means =====================
# Dropping the class label and scaling the features
df <- subset(spambase, select = -spam) %>% scale()
# Finding Optimal number of clusters
fviz_nbclust(x = df,
FUNcluster = kmeans,
method = 'wss'
)
fviz_nbclust(x = df,
FUNcluster = kmeans,
method = "silhouette"
)
# Building KMeans Cluster model
km1 <- kmeans(x=df, centers = 4, nstart = 25)
# Print model
print(km1)
# Calculating means for each cluster for original data
aggregate(spambase[, -ncol(spambase)],
by = list(cluster = km1$cluster),
FUN = mean) %>% head()
# Showing clusters
head(km1$cluster)
# Finding clustering size
table(km1$cluster)
# Making cluster plot using fviz_cluster function
# (use the scaled matrix df that km1 was fitted on, not the unscaled data)
fviz_cluster(km1, data = df)
# ============== SVM ===================
# Splitting data into train and test data
# sample.split() expects the outcome vector, not the whole data frame
set.seed(1234)
split <- sample.split(spambase$spam, SplitRatio = 0.7)
train_svm <- subset(spambase, split == TRUE)
test_svm <- subset(spambase, split == FALSE)
# Fitting SVM to the Training set
classifier <- svm(formula = spam ~ .,
data = train_svm,
type = 'C-classification',
kernel = 'linear')
# Printing model
print(classifier)
# Predicting the Test set results (the last column, spam, is dropped)
y_pred <- predict(classifier, newdata = test_svm[, -ncol(test_svm)])
# Making the Confusion Matrix (predictions in rows, reference in columns)
cm <- table(Predicted = y_pred, Actual = test_svm$spam)
confusionMatrix(cm)
# ============== Decision Tree ===================
# Performing Decision Tree Model
# (reuses the train_svm / test_svm split created in the SVM section)
DT.model <- ctree(spam ~ ., data = train_svm)
# print model
print(DT.model)
# Plot the model
plot(DT.model,
main = 'Decision Tree',
col = rainbow(7))
# Predict test data based on model
predict_tree <- predict(DT.model,
test_svm, type = "response")
# Evaluating model accuracy using confusion matrix
# (predictions in rows, reference in columns)
DT.cm <- table(Predicted = predict_tree, Actual = test_svm$spam)
confusionMatrix(DT.cm)
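# ============== Logistic Regression ===================
# Keep spam as a factor so glm() can model it as a binary outcome
spambase$spam <- as.factor(spambase$spam)
# Splitting data into train and test data
set.seed(1234)
split <- sample.split(spambase$spam, SplitRatio = 0.7)
train_l <- subset(spambase, split == TRUE)
test_l <- subset(spambase, split == FALSE)
# Fitting Logistic regression to the Training set
# family = binomial is required; the default (gaussian) fits a linear model
glm.model <- glm(formula = spam ~ .,
data = train_l,
family = binomial)
# Printing model
print(glm.model)
# Predicting the Test set results as probabilities of the second factor level
prob_pred <- predict(glm.model, newdata = test_l, type = "response")
# Converting probabilities to class labels with a 0.5 cut-off
y_pred <- factor(ifelse(prob_pred > 0.5, levels(test_l$spam)[2], levels(test_l$spam)[1]),
levels = levels(test_l$spam))
# Making the Confusion Matrix (predictions in rows, reference in columns)
cm <- table(Predicted = y_pred, Actual = test_l$spam)
confusionMatrix(cm)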
67832 - Research paper/report-pqn2k1ej.docx
Abstract
Spam emails waste resources on Simple Mail Transfer Protocol (SMTP) servers, which have to process large volumes of unwanted messages. A share of the spam sent between the fourth quarter and the following first quarter also contained malicious code and malware. To handle the threats posed by spam effectively, leading email providers such as Outlook, Gmail, and Yahoo Mail combine multiple machine learning techniques, such as neural networks, in their spam filters. These techniques learn to identify spam and phishing messages by analysing large numbers of such emails across a huge collection of systems. Because machine learning adapts to changing conditions, the Gmail and Yahoo Mail spam filters do more than check junk emails against pre-existing rules.
Contents
Abstract
Introduction
Background
Methodology
Dataset
Naive Bayes classifier
Support Vector Machine
Decision tree
Result and Discussions
Naïve Bayes Classifier Output
Random Forest Classifier
Support Vector Machine
Comparison
Conclusion
References
Introduction
A major problem in recent years is unsolicited commercial email, known as spam, which fills inboxes in bulk. A person who sends spam messages is called a spammer. Spammers collect email addresses from websites, chat rooms, and viruses. Spam prevents users from making full and efficient use of their time, storage capacity, and network bandwidth. In large volumes, it has a disruptive effect on the memory of email servers, communication bandwidth, CPU power, and user time. The amount of spam in inboxes grows every year and accounts for approximately 78% of global email traffic.
Users who receive spam emails they did not request find them very irritating during working hours. Spam also causes financial loss: users can fall victim to fraud and other Internet scams run by spammers who pose as reputable companies in order to persuade individuals to disclose sensitive personal information such as passwords, credit card numbers, and bank verification numbers (Thomas, K., Grier, C., Ma, J., Paxson, V., & Song, D., 2011).
According to a report from Kaspersky Lab, the volume of spam email sent fell to a 12-year low, dropping below 50% of email traffic for the first time since 2003. In 2015 the share of spam fell to about 50%, and in July 2015 the figure fell further to 46%, according to the antivirus software developer Symantec. The decline was attributed to a reduction in the number of major botnets responsible for sending spam emails in the millions.
The number of spam emails detected by the lab was between 3,000,000 and 6,000,000. As 2015 drew to a close, however, the volume of spam escalated. A further report from the same lab showed that spam messages carrying pernicious attachments such as malicious macros, JavaScript, ransomware, or other malware started to increase at the end of 2015 (Subramaniam, T., Jalab, H. A., & Taqa, A. Y.).
This trend continued in 2016, and by March the volume of spam email had increased fourfold compared with 2015. The lab put the volume of spam emails at 22,890,956. By then spam had risen to an average of 60% of email traffic for the first four months of the year. Later statistics showed that spam accounted for 56% of all email traffic globally, and that the most familiar types were dating spam and healthcare spam.
Spam emails waste resources on Simple Mail Transfer Protocol (SMTP) servers, which have to process large volumes of unwanted messages. A share of the spam sent between the fourth quarter and the following first quarter also contained malicious code and malware. To handle the threats posed by spam effectively, leading email providers such as Outlook, Gmail, and Yahoo Mail combine multiple machine learning techniques, such as neural networks, in their spam filters. These techniques learn to identify spam and phishing messages by analysing large numbers of such emails across a huge collection of systems. Because machine learning adapts to changing conditions, the Gmail and Yahoo Mail spam filters do more than check junk emails against pre-existing rules (Tretyakov, K., 2004).
The filters also generate new rules themselves, based on what they have learned, as the spam filtering operation continues. In practice, this means that only about one message in 1,000 evades the spam filter. Google's detection model also incorporates a tool known as Google Safe Browsing to identify websites with malicious uniform resource locators (URLs).
Delivery of suspicious emails is deliberately delayed so that a deeper and wider examination can be performed as more messages arrive and the algorithms are updated in real time. Approximately 0.5% of emails are affected by this deliberate delay. Although many email spam filtering methods exist, this report discusses the state-of-the-art approaches. The following paragraphs explain the categories of spam filtering techniques that have been widely applied to the email spam problem.
Content-based filtering technique: Content-based filtering is generally used to create automatic filtering rules and to classify emails using machine learning approaches such as support vector machines, neural networks, Naïve Bayes classification, and k-nearest neighbour. The method analyses the words and phrases in an email, their distribution, and their occurrences, and uses them to generate rules for filtering incoming spam.
Case-based spam filtering method: Case-based (or sample-based) filtering is a very popular method for filtering spam. First, all emails, both spam and non-spam, are extracted from each user's inbox using a collection model. Next, pre-processing steps transform the emails through the client interface: feature extraction, selection and grouping of the email data, and evaluation. The data is then classified into two vector sets. Finally, a machine learning algorithm is trained and tested on the data sets to decide whether incoming emails are spam or non-spam (Zhang, L., Zhu, J., & Yao, T., 2004).
Rule-based or heuristic spam filtering technique: This approach uses existing rules or heuristics to assess a large number of patterns, usually regular expressions, against a chosen message. Each matching pattern increases the message's score, and the score is reduced when a pattern does not correspond. A message whose score surpasses a specific threshold is declared spam; otherwise it is counted as valid. Some ranking rules never change over time, while others need constant updating to cope effectively with spammers, who continuously introduce new spam messages that try to slip past email filters unnoticed. A good example of such a filter is SpamAssassin (Almeida, T. A., & Yamakami, A., 2012).
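The scoring idea behind rule-based filtering can be sketched in a few lines of base R. This is a minimal illustration only: the patterns, their scores, and the threshold of 3 are hypothetical values chosen for the example and are not taken from SpamAssassin or any real rule set.

```r
# Hypothetical rules: each regular expression adds (or subtracts) points
# when it matches the message. Patterns and scores are illustrative only.
rules <- data.frame(
  pattern = c("free money", "click here", "viagra", "meeting agenda"),
  score   = c(2.5, 1.5, 3.0, -2.0),
  stringsAsFactors = FALSE
)

# Sum the scores of all rules whose pattern matches the message.
score_message <- function(message, rules) {
  hits <- vapply(rules$pattern,
                 function(p) grepl(p, message, ignore.case = TRUE),
                 logical(1))
  sum(rules$score[hits])
}

# A message is flagged as spam when its score exceeds the threshold.
is_spam <- function(message, rules, threshold = 3) {
  score_message(message, rules) > threshold
}

is_spam("Click here for FREE MONEY now", rules)
is_spam("Meeting agenda attached, click here to view", rules)
```

The first message matches two positive rules and crosses the threshold; the second matches one positive and one negative rule and stays below it. Raising or lowering the threshold trades missed spam against wrongly blocked mail.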
Previous likeness based spam filtering technique: This approach uses memory-based, or instance-based, machine learning methods that classify incoming emails by their resemblance to stored examples. The attributes of an email are used to create a multi-dimensional vector space in which new instances are plotted as points. Each new instance is then assigned to the most popular class among its closest training instances. The k-nearest neighbour (kNN) algorithm is used to filter spam emails in this way.
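A minimal base R sketch of the nearest-neighbour vote described above follows. The two features and the six training points are made up for illustration (they are not spambase columns), and Euclidean distance with k = 3 is an assumed choice.

```r
# Toy training set: two made-up features per email (for example, the frequency
# of the word "free" and of the character "!"), with known labels.
train_x <- matrix(c(0.9, 0.8,
                    0.8, 0.9,
                    0.7, 0.7,
                    0.1, 0.0,
                    0.0, 0.2,
                    0.2, 0.1),
                  ncol = 2, byrow = TRUE)
train_y <- factor(c("spam", "spam", "spam", "ham", "ham", "ham"))

knn_predict <- function(new_x, train_x, train_y, k = 3) {
  # Euclidean distance from the new point to every stored example
  d <- sqrt(rowSums(sweep(train_x, 2, new_x)^2))
  # Labels of the k closest stored examples
  nearest <- train_y[order(d)[seq_len(k)]]
  # Majority vote among those neighbours
  names(which.max(table(nearest)))
}

knn_predict(c(0.85, 0.75), train_x, train_y)
knn_predict(c(0.05, 0.10), train_x, train_y)
```

An odd k avoids ties in a two-class problem. On real data the features would normally be scaled first so that no single attribute dominates the distance.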
Adaptive spam filtering technique: This technique detects and filters spam by grouping emails into different classes. It divides an email corpus into groups, each of which has an emblematic text. Each incoming email is compared with every group, and a percentage of similarity is produced to decide the group to which the email most probably belongs.
Many researchers have proposed classification techniques for spam email that successfully assign messages to groups. These methods include support vector machines, case-based techniques, probabilistic methods, artificial immune systems, and artificial neural networks. The literature shows that these classification methods can be used for spam filtering through content-based filtering, which identifies certain features. Once the presence of these features in an email has been ascertained, the probability of each characteristic is measured against a threshold value. Email messages that exceed the threshold value are classified as spam. Artificial neural networks are non-linear models that try to imitate the functions of biological neural networks (Subramaniam, T., Jalab, H. A., & Taqa, A. Y., 2010).
A neural network consists of simple processing components, called neurons, and carries out its computational operations by processing information. A few researchers have employed neural networks to classify unwanted emails as spam using content-based filtering. These techniques decide the properties of a message by computing the rate of occurrence of keywords or patterns in the email. Researchers have generally used the multilayer perceptron neural network as the classifier for filtering spam, but most of them have not used the radial basis function neural network for classification (Dhanaraj, S., & Karthikeyani, V., 2013).
The support vector machine (SVM) has proved itself over the years as one of the most powerful and efficient state-of-the-art classification techniques for the spam email problem. SVMs are supervised learning models that analyse data and identify patterns used for categorisation and for exploring the relationships between variables. The SVM algorithm is well suited to identifying patterns and classifying them into a specific class or group. SVMs can be trained easily, and according to some researchers they outperform many other popular email spam classification methods. This is because, during training, the SVM uses data from the email corpus. For high-dimensional data, however, the efficacy and strength of the SVM can diminish over time because of the computational complexity of the processed data. The SVM is a good classifier because it handles sparse data formats and achieves satisfactory precision values. It also provides high classification accuracy. It is considered an example of kernel methods, one of the central areas of machine learning.
The decision tree is another machine learning algorithm that has been applied successfully to email spam filtering. Compared with the support vector machine, the decision tree needs more effort from users to train on the data sets. A decision tree can perform very well when trained on email corpus data. Its overall performance does not depend on the relationships that exist among the parameters. A major benefit of the decision tree is its capacity to assign unambiguous values to problems, decisions, and the result of each decision.
This reduces vagueness in the decision-making process. Another advantage of the decision tree over other machine learning techniques is that it lays out all the options, follows each option to its end, and allows straightforward evaluation among the different nodes of the tree. Despite these advantages, the decision tree has a few drawbacks; for example, without appropriate pruning it becomes difficult to control the growth of the tree.
A decision tree is a non-parametric machine learning algorithm that is very flexible and therefore prone to overfitting the training data. This can make the decision tree a poor classifier and limit its classification accuracy. Several types of decision tree can be applied to email spam filtering, including logistic model tree induction, the NBTree classifier, and the J48 decision tree algorithm.
Another efficient machine learning algorithm is Naïve Bayes, which can be applied as an email spam filtering classifier. In the context of classifying an email, it makes the strong assumption that the words in the email are independent of each other. It is a desirable technique for email spam filtering because of its simplicity and ease of implementation compared with conditional models such as logistic regression. It also needs less training data.
It is a very scalable classifier: no bottleneck is created by an increase in the number of predictors or data points. Naïve Bayes can be used for classification problems with two or more classes and to make probabilistic predictions. It handles both continuous and discrete data effectively and is not sensitive to irrelevant features. The algorithm is predominantly popular in open-source and commercial spam filters, because it needs little training time and evaluates quickly when detecting and filtering spam emails.
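The word-independence assumption can be made concrete with a small worked example in base R. This is a sketch under stated assumptions: the six-message corpus is invented, add-one (Laplace) smoothing is an assumed choice, and the helper names `word_probs` and `posterior_spam` are hypothetical. It is separate from the `naiveBayes()` call in the script, which models the numeric spambase features.

```r
# Toy corpus (made up for illustration): each message is a vector of words.
spam_msgs <- list(c("free", "money", "now"), c("free", "offer"), c("win", "money"))
ham_msgs  <- list(c("meeting", "now"), c("project", "report"), c("lunch", "offer"))

vocab <- unique(unlist(c(spam_msgs, ham_msgs)))

# P(word | class) with Laplace (add-one) smoothing
word_probs <- function(msgs, vocab) {
  counts <- table(factor(unlist(msgs), levels = vocab))
  (counts + 1) / (sum(counts) + length(vocab))
}
p_word_spam <- word_probs(spam_msgs, vocab)
p_word_ham  <- word_probs(ham_msgs, vocab)

# Posterior probability of spam, treating words as independent given the class.
# Log probabilities are summed to avoid numerical underflow.
posterior_spam <- function(words, prior_spam = 0.5) {
  words <- words[words %in% vocab]   # ignore words never seen in training
  log_spam <- log(prior_spam)     + sum(log(p_word_spam[words]))
  log_ham  <- log(1 - prior_spam) + sum(log(p_word_ham[words]))
  1 / (1 + exp(log_ham - log_spam))
}

posterior_spam(c("free", "money"))
posterior_spam(c("project", "meeting"))
```

Training only requires counting words per class, which is why the method scales well as the number of features grows.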
Naïve Bayes filters need training, which can be provided by a previous set of spam and non-spam emails. The filter keeps a record of the changes that take place in each word. The algorithm can be applied to spam messages in large data sets with many features and attributes. Stochastic optimization techniques such as evolutionary algorithms can also be applied to spam filtering, because they do not need sophisticated mathematical computation. They also handle a population of generated solutions and identify the individuals that represent the optimal solution to the problem.
Several existing works integrate the genetic algorithm with neural networks, which has enhanced the performance of the neural network algorithm. An evolutionary computation method related to the genetic algorithm is particle swarm optimization (PSO), a technique that can optimize continuous non-linear functions and classification techniques. Particle swarm optimization was inspired by the social behaviour of animals such as fish and birds. It has been applied in areas such as swarm robotics, signal processing, neural networks, data mining, and telecommunications. The algorithm operates on a large population of particles and, unlike the genetic algorithm, has no crossover or mutation calculations. Each particle has a position and a velocity and represents a potential solution, which makes the algorithm easy to implement (Cormack, G. V., Hidalgo, J. M. G., & Sánz, E. P., 2007).
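The position and velocity updates that define particle swarm optimization can be sketched as follows. This is a minimal base R illustration, not a spam filter: the sphere function is an assumed stand-in objective, and the inertia weight `w` and acceleration coefficients `c1` and `c2` are commonly used textbook values rather than values from the source.

```r
set.seed(42)

# Objective to minimise: the sphere function, whose minimum is at the origin
sphere <- function(x) sum(x^2)

pso <- function(f, dim = 2, n_particles = 20, n_iter = 100,
                w = 0.7, c1 = 1.5, c2 = 1.5) {
  # Random initial positions and zero initial velocities
  pos <- matrix(runif(n_particles * dim, -5, 5), nrow = n_particles)
  vel <- matrix(0, nrow = n_particles, ncol = dim)
  pbest <- pos                          # best position seen by each particle
  pbest_val <- apply(pos, 1, f)
  gbest <- pbest[which.min(pbest_val), ]  # best position seen by the swarm

  for (i in seq_len(n_iter)) {
    r1 <- matrix(runif(n_particles * dim), nrow = n_particles)
    r2 <- matrix(runif(n_particles * dim), nrow = n_particles)
    gmat <- matrix(gbest, nrow = n_particles, ncol = dim, byrow = TRUE)
    # Velocity update: inertia + pull towards personal best + pull towards global best
    vel <- w * vel + c1 * r1 * (pbest - pos) + c2 * r2 * (gmat - pos)
    pos <- pos + vel
    val <- apply(pos, 1, f)
    # Record any particle that improved on its personal best
    improved <- val < pbest_val
    pbest[improved, ] <- pos[improved, ]
    pbest_val[improved] <- val[improved]
    gbest <- pbest[which.min(pbest_val), ]
  }
  list(best = gbest, value = min(pbest_val))
}

result <- pso(sphere)
result$value
```

Note that the loop contains no crossover or mutation step; particles move only under the influence of their own best position and the swarm's best position. In a spam filtering setting the objective would instead be something like a classifier's validation error as a function of its parameters.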
Automatic email filtering is the most effective approach to spam filtering; it has been implemented successfully and frustrates the malicious intentions of spammers. Most email spam used to be stopped effectively by blocking emails that originated from specified addresses or by removing messages with specific subject lines. Spammers now use more sophisticated techniques to get past filtering methods, such as using arbitrary sender addresses or inserting haphazard characters at the beginning of the subject line.
Many filters combine machine learning techniques with application-specific knowledge in the form of hand-coded rules, and the constantly changing attributes of spam have motivated many studies on the subject. Spam results in the unproductive use of resources on Simple Mail Transfer Protocol servers, which have to process large amounts of unsolicited email.
A share of the spam sent between the fourth quarter and the following first quarter also contained malicious code and malware. To handle the threats posed by spam effectively, leading email providers such as Outlook, Gmail, and Yahoo Mail combine multiple machine learning techniques, such as neural networks, in their spam filters (Deshpande, V. P., Erbacher, R. F., & Harris, C., 2007).
Background    
This report studies the architecture of email spam filtering. The main aim of spam filtering is to reduce the volume of unsolicited email to a minimum. Email filtering is the process of organising emails according to definite criteria. Email filters are generally used to manage incoming mail, to detect and eliminate emails that contain malicious code such as viruses or malware, and to filter spam. The flow of email is governed by basic protocols such as SMTP.
Different types of agents are involved, including the mail user agent (MUA). Examples of mail user agents are Mozilla Thunderbird, Microsoft Outlook, Mutt, Elm, KMail, Eudora, Balsa, and Pine. Spam filters can be deployed at strategic places on both server and client systems. Internet service providers deploy spam filters at every layer of the network in front of the mail server. A firewall is a network security system that monitors and manages all incoming and outgoing network traffic on the basis of predetermined security rules (Biggio, B., Fumera, G., Pillai, I., & Roli, F., 2011).
The email server can serve as an integrated anti-spam and antivirus solution that provides better protection for all email at the network perimeter. Filters can also be implemented on client systems, where they are installed as add-ons on computers to act as intermediaries between endpoint devices. Filters block unsolicited or suspicious emails that would threaten network security if they entered the computer system. At the email level, users can customise the spam filter so that it blocks spam according to the conditions they set.
How the Gmail, Yahoo and Outlook email spam filters work
Gmail, Yahoo Mail, and Outlook.com use different formulas and methods for filtering spam. They use these methods to deliver only valid emails to users and to filter out illegitimate messages. These filters can also erroneously block useful and authentic messages. It has been reported that around 20% of authorisation-based emails fail to reach the recipient's inbox. Email providers have also designed mechanisms for their anti-spam filters to curb the dangers posed to users by phishing, email-borne malware, and ransomware (Deshpande, V. P., Erbacher, R. F., & Harris, C., 2011).
Each company uses its own mechanism to decide the risk level of an incoming email. Examples of such mechanisms include sender policy frameworks, recipient verification tools, whitelists and blacklists, and satisfactory spam limits. These mechanisms can be applied to single or multiple users as required. When the satisfactory spam threshold is set low, more spam evades the filter and enters users' inboxes. When the threshold is set high, some important emails may be isolated unless the administrator redirects them. This section discusses how Yahoo Mail, Outlook, and Gmail operate their anti-spam filters.
Email spam filtering process
An email message consists of two main components: the header and the body. The header provides broad information about the email, including the sender address, the receiver address, and the subject. The body is the heart of the message. It holds variable content with no predefined structure; examples include web pages, audio, video, HTML markup, files, images, and analogue data. The header comprises fields such as the sender address, the recipient address, and the timestamp recorded when the message was sent on by intermediary servers to the message transport agents, which act as a post office for organising mail. The header usually starts with a "From" line and is modified each time the message moves from one server to another through an intermediate server.
Headers allow the user to view the route an email took and the time each server took to handle it. The available information must pass through some processing before a classifier can use it for filtering. The main stages in extracting data from an email message are as follows.
Pre-processing: This is the first stage, which is executed when an incoming message is received. Tokenization is performed in this step.
Tokenization: This process extracts the words present in the body of the email. It transforms the message into its meaningful parts, dividing the email into a sequence of representative symbols known as tokens. Some authors also emphasise that these representative symbols are extracted from the subject and the header as well as from the main body of the email.
Feature selection: This stage follows pre-processing. In feature selection we apply a reduction method so that the informative fragments of the email message are represented compactly as a feature vector.
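The stages above can be illustrated with a short Python sketch. It is only a minimal example under stated assumptions: the function names (tokenize, select_features, to_vector), the regular expression used for tokens and the choice of document frequency as the selection criterion are illustrative and are not taken from the project itself.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def select_features(messages, top_k):
    """Keep the top_k tokens ranked by document frequency.

    Document frequency is an assumed, simple selection criterion;
    other criteria (for example information gain) could be used instead.
    """
    doc_freq = Counter()
    for message in messages:
        doc_freq.update(set(tokenize(message)))
    # Sort by descending frequency, then alphabetically so ties are stable.
    ranked = sorted(doc_freq.items(), key=lambda item: (-item[1], item[0]))
    return [token for token, _ in ranked[:top_k]]

def to_vector(text, vocabulary):
    """Represent one message as token counts over the selected vocabulary."""
    counts = Counter(tokenize(text))
    return [counts[token] for token in vocabulary]

# Hypothetical messages used only for illustration.
messages = ["Win free money now", "Free offer: win a prize", "Meeting at noon"]
vocabulary = select_features(messages, 3)
vector = to_vector("Free free win", vocabulary)
```

With these three made-up messages the selected vocabulary is "free", "win" and "a", so the message "Free free win" becomes the vector [2, 1, 0].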
Methodology    
In recent times the classification of spam email has been handled by machine learning algorithms, whose intention is to find the differences between spam and non-spam emails. Learning algorithms are able to achieve this with an adaptive and automatic technique. Instead of depending on hand-coded rules, which are very susceptible to the perpetually varying characteristics of email messages, machine learning methods obtain information from a set of messages provided to them and then use the resulting information to classify new messages as they are received.
According to the authors, machine learning algorithms are able to perform more effectively and efficiently on the basis of their past experience. In this section we review some of the well-known machine learning techniques that can be used for spam detection.
Clustering is an important method that deals with grouping a set of patterns into related classes. It is an approach for dividing objects or cases into comparatively similar collections, which are known as clusters.
Clustering has been adopted by many users because of its efficiency, and it is applied in various fields. Clustering algorithms are unsupervised learning tools, and they are applied to email datasets that usually carry true labels. Given appropriate representations, different clustering algorithms are able to group the emails in a dataset into spam and non-spam clusters. Authors have also shown in their research that clustering can be very efficient for email spam. Their results were notable because the performance was better than that of the existing semi-supervised techniques, demonstrating that clustering is a formidable technique for filtering spam emails.
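As an illustration of how a clustering algorithm can group emails, the following Python sketch implements plain k-means. It is a sketch under assumptions rather than the method used in this project: the paper does not name a specific clustering algorithm at this point, and the two numeric features per email (for example, the frequency of a word such as "free" and of a character such as "$") and the starting centroids are hypothetical.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def mean(points):
    """Component-wise mean of a non-empty list of feature vectors."""
    count = len(points)
    return [sum(column) / count for column in zip(*points)]

def k_means(points, initial_centroids, iterations=10):
    """Plain k-means with caller-supplied initial centroids."""
    centroids = [list(c) for c in initial_centroids]
    assignments = []
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        assignments = [
            min(range(len(centroids)),
                key=lambda j: squared_distance(point, centroids[j]))
            for point in points
        ]
        # Update step: move each centroid to the mean of its members.
        for j in range(len(centroids)):
            members = [p for p, a in zip(points, assignments) if a == j]
            if members:
                centroids[j] = mean(members)
    return assignments, centroids

# Hypothetical feature vectors: three ham-like and three spam-like emails.
emails = [[0.1, 0.0], [0.2, 0.1], [0.0, 0.2],
          [3.0, 2.5], [2.8, 3.1], [3.2, 2.9]]
labels, centres = k_means(emails, [emails[0], emails[3]])
```

Because clustering is unsupervised, the cluster numbers carry no meaning on their own. If true labels are available, as noted above, they can be compared with the cluster assignments afterwards to judge how well the spam and non-spam emails were separated.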
Dataset    
For the given scenario of building a filtering method for spam and ham emails, the dataset I selected is stored as a file. It is a text file in which the data is present as plain text rather than as a CSV file. The R code allows us to read the file easily, store it as data in a variable and later perform different types of classification or filtering on it. After the data was imported into the code, I performed data cleaning and pre-processing before applying any filtering technique. The dataset contains thousands of lines, and each of them has "spam" or "ham" written before it. We could apply the filtering methods directly without cleaning the data, but this would not give accurate results, and spam and ham messages could be classified wrongly. Cleaning the dataset requires removing any duplicates or redundancy from the data file before the filtering techniques are applied. The different techniques are then applied to the cleaned lines to filter out the spam messages from the ham messages.
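The cleaning step described above can be sketched as follows. The example is written in Python rather than R, and it rests on assumptions that the text does not confirm: each line is assumed to hold a label ("spam" or "ham") followed by a tab and the message text, and the function name load_labeled_lines is hypothetical.

```python
def load_labeled_lines(lines):
    """Parse 'label<TAB>text' lines, dropping blanks, bad labels and duplicates.

    The tab-separated layout is an assumption made for this sketch.
    """
    seen = set()
    records = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip empty lines
        label, _, text = line.partition("\t")
        label = label.lower()
        text = " ".join(text.split())  # collapse repeated whitespace
        if label not in ("spam", "ham") or not text:
            continue  # skip malformed lines
        key = (label, text.lower())
        if key in seen:
            continue  # skip duplicates, ignoring letter case
        seen.add(key)
        records.append((label, text))
    return records

# Hypothetical input lines used only for illustration.
sample = [
    "ham\tSee you at noon\n",
    "spam\tWIN a prize now\n",
    "spam\twin a  prize now\n",
    "\n",
]
cleaned = load_labeled_lines(sample)
```

In this made-up sample the second spam line is treated as a duplicate of the first, so only two records remain after cleaning.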
Naive Bayes classifier    
Naive Bayes is a classification method from the supervised learning group. It is also a statistical technique for classification. It acts as a fundamental probabilistic model that lets us capture uncertainty about the model in a principled manner by determining the probabilities of the outcomes. It can be used to solve predictive and analytical problems. The classifier is named after Thomas Bayes. It offers practical learning algorithms in which prior knowledge and observed data can be combined, and it provides a useful viewpoint from which to understand and evaluate different learning algorithms. It computes explicit probabilities for hypotheses, and it is robust to noise in the input data. It is a straightforward probabilistic classifier founded on Bayes' theorem together with strong independence assumptions: the features are assumed to be independent of one another given the class. This independence assumption is very restrictive, but it was made to simplify the computation, and it is the reason the algorithm is tagged "naive". Despite this, the algorithm is effective and robust. There has been an upsurge in its adoption because it is simple, computationally efficient and gives satisfactory performance on real-world problems. As a result of these qualities, the classifier is applied to spam email filtering, sentiment analysis, spam review detection, recommender systems and other online applications.
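To make the idea concrete, the following Python sketch shows a small multinomial Naive Bayes classifier with add-one (Laplace) smoothing. It is an illustrative sketch only and not the Weka implementation used for the project results; the function names, the smoothing choice and the tiny training set are assumptions.

```python
import math
from collections import Counter

def train_naive_bayes(documents):
    """Count classes and tokens from a list of (label, tokens) pairs."""
    class_counts = Counter()
    token_counts = {}
    vocabulary = set()
    for label, tokens in documents:
        class_counts[label] += 1
        token_counts.setdefault(label, Counter()).update(tokens)
        vocabulary.update(tokens)
    return {"class_counts": class_counts,
            "token_counts": token_counts,
            "vocabulary": vocabulary}

def classify(model, tokens):
    """Return the label with the highest log posterior score."""
    total_docs = sum(model["class_counts"].values())
    vocab_size = len(model["vocabulary"])
    best_label, best_score = None, -math.inf
    for label, doc_count in model["class_counts"].items():
        counts = model["token_counts"][label]
        total_tokens = sum(counts.values())
        # Log prior of the class.
        score = math.log(doc_count / total_docs)
        # Naive assumption: tokens are independent given the class, so
        # their (add-one smoothed) log likelihoods are simply summed.
        for token in tokens:
            score += math.log((counts[token] + 1) / (total_tokens + vocab_size))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Hypothetical training data used only for illustration.
training = [
    ("spam", ["win", "free", "prize"]),
    ("spam", ["free", "money", "now"]),
    ("ham", ["meeting", "at", "noon"]),
    ("ham", ["lunch", "at", "noon"]),
]
model = train_naive_bayes(training)
```

Working with logarithms avoids numerical underflow when many small probabilities are multiplied, and the smoothing prevents a token that was never seen in one class from forcing that class's probability to zero.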
Support Vector Machine
The support vector machine (SVM) is also a supervised learning algorithm, and it has been reported to perform better than some other learning algorithms. It can be defined as a group of algorithms proposed to solve problems of classification and regression. Training an SVM amounts to solving a quadratic programming problem with linear equality and inequality constraints, and the resulting model separates the different groups by means of a hyperplane.
It makes use of decision boundaries. The algorithm may not be as fast as other classification methods, but it draws its strength from its high accuracy, which comes from its capacity to model multidimensional borderlines that are not straightforward or sequential. The method is not easily susceptible to situations in which the model is disproportionately complex, that is, where it has many parameters relative to the number of observations. These qualities make it well suited to applications such as digital handwriting recognition, speaker recognition and text categorization.
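The idea of a separating hyperplane can be sketched in Python as shown below. This is a simplified illustration and not the solver used by standard SVM tools: instead of solving the quadratic programming problem directly, it minimises the regularised hinge loss of a linear SVM with sub-gradient descent. The function names, the learning rate, the regularisation constant and the toy data (centred feature values, labels +1 for spam and -1 for ham) are all assumptions.

```python
def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def train_linear_svm(X, y, lam=0.01, lr=0.1, epochs=50):
    """Fit w and b by sub-gradient descent on the regularised hinge loss."""
    w = [0.0] * len(X[0])
    b = 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            margin = yi * (dot(w, xi) + b)
            # The gradient of the L2 penalty shrinks the weights slightly.
            w = [wj * (1.0 - lr * lam) for wj in w]
            if margin < 1:
                # The point is inside the margin or misclassified,
                # so the hyperplane is pushed away from it.
                w = [wj + lr * yi * xij for wj, xij in zip(w, xi)]
                b += lr * yi
    return w, b

def svm_predict(w, b, x):
    """Return +1 or -1 depending on the side of the hyperplane."""
    return 1 if dot(w, x) + b >= 0 else -1

# Hypothetical, linearly separable toy data (label +1 = spam, -1 = ham).
X = [[2.0, 1.0], [3.0, 2.0], [-2.0, -1.0], [-3.0, -2.0]]
y = [1, 1, -1, -1]
w, b = train_linear_svm(X, y)
```

After training on this toy data, the point [1.0, 1.0] falls on the spam side of the hyperplane and [-1.0, -1.0] falls on the ham side. A kernel SVM, which can model borderlines that are not straight, would need a different formulation from this linear sketch.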
Decision tree
The decision tree algorithm can be defined as a type of classifier whose learned pattern has a tree structure. Decision tree induction is an important technique for gaining knowledge through classification. Each node in a decision tree is either a leaf node, which specifies the value of the target feature, or a decision node, which specifies a test to be carried out on the value of a feature, with one branch and subtree for each outcome of the test.
A decision tree solves a classification problem by starting at the root of the tree and moving through it until a leaf node is reached, which gives the result of the classification. Decision tree learning is an approach that we can apply to spam filtering. The aim is to produce a decision tree model and train it to forecast the value of a goal variable from a set of input variables. Each inner node of the tree corresponds to one of the input variables in the data. Each leaf denotes a value of the goal variable given the values of the input variables along the path from the root to that leaf.
A tree can be learned by breaking the source set down into subsets on the basis of a test on a feature value. This procedure is repeated on each resulting subset, which is why it is called recursive partitioning. The recursion terminates when all the examples in the subset at a node have the same goal value. Another stopping criterion is that further division of the set no longer improves the predictions.
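A single step of the recursive partitioning described above can be sketched in Python. The sketch chooses the threshold on one numeric feature that gives the largest information gain. It is a simplification under assumptions: the text does not state which splitting criterion is used, C4.5-style learners such as Weka's J48 use the gain ratio rather than plain information gain, and the function names and feature values are hypothetical.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())

def information_gain(values, labels, threshold):
    """Entropy reduction from splitting on value <= threshold."""
    left = [lab for v, lab in zip(values, labels) if v <= threshold]
    right = [lab for v, lab in zip(values, labels) if v > threshold]
    if not left or not right:
        return 0.0  # the split does not separate anything
    weighted = (len(left) * entropy(left)
                + len(right) * entropy(right)) / len(labels)
    return entropy(labels) - weighted

def best_threshold(values, labels):
    """Return the candidate threshold with the highest information gain."""
    candidates = sorted(set(values))[:-1]  # the largest value cannot split
    if not candidates:
        return None  # all values are equal, so no split is possible
    return max(candidates,
               key=lambda t: information_gain(values, labels, t))

# Hypothetical frequencies of one feature in six emails.
values = [0.0, 0.1, 0.2, 1.5, 2.0, 3.0]
labels = ["ham", "ham", "ham", "spam", "spam", "spam"]
split = best_threshold(values, labels)
```

For this made-up feature the best threshold is 0.2, which separates the two classes perfectly. A full tree learner would apply the same step again to each resulting subset until a stopping criterion is met.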

Result and Discussions    
Naïve Bayes Classifier Output
Random Forest Classifier
Support Vector Machine
Comparison    
Machine learning techniques and algorithms are applied extensively in the field of spam email filtering. Authors have already done substantial work to improve the efficiency and effectiveness of email spam filters that use machine learning classifiers to classify emails either as valid messages or as unwanted spam messages. Machine learning classifiers have the capability to recognize distinguishing characteristics of the content or body of a message. Significant work has also been done using techniques that do not have the ability to adapt to different conditions, and on problems that are exclusive to some fields, such as identifying messages that are hidden inside an image. Machine learning algorithms were designed to learn about target groups that are static (Blanzieri & Bryl, 2008).
Researchers still face many problems. The open research problems in the field of spam email filtering are listed below.
· There is an absence of efficient strategies for managing security threats against spam email filters. Such attacks can be exploratory, indiscriminate, targeted or causative.
· Current spam filtering techniques are unable to handle the concept drift phenomenon effectively.
· Most current spam email filters do not possess the capacity to learn in real time.
· Conventional spam email classification techniques are no longer viable in a real-time environment characterised by evolving data streams and concept drift.
· Many spam filtering methods fail to reduce their false positive rate.
· More efficient image spam filters need to be developed. Most spam filters can only classify spam messages that are in the form of text. However, many spammers send spam emails as images with embedded text, and the filters are unable to detect such images and keep them out of the inbox.
· Scalable, adaptive and integrated filters need to be designed and implemented, applying ontology and the semantic web to create new filtering techniques for spam email.
· There is a lack of filters with the capability to update the feature space dynamically. Most existing spam filters are not able to add or delete features incrementally without rebuilding the model, so they cannot keep abreast of current trends in spam email.
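Several of the problems listed above, in particular the false positive rate, are judged with standard evaluation measures. The following Python sketch shows how these measures can be computed from the actual and predicted labels of a classifier. It is an illustrative sketch only: the function names and the example labels are hypothetical and are not results from this project.

```python
def confusion_counts(actual, predicted, positive="spam"):
    """Count true/false positives and negatives for the positive class."""
    tp = fp = tn = fn = 0
    for a, p in zip(actual, predicted):
        if p == positive and a == positive:
            tp += 1  # spam correctly flagged
        elif p == positive:
            fp += 1  # ham wrongly flagged as spam
        elif a == positive:
            fn += 1  # spam that reached the inbox
        else:
            tn += 1  # ham correctly delivered
    return tp, fp, tn, fn

def evaluate(actual, predicted, positive="spam"):
    """Return accuracy, precision, recall and false positive rate."""
    tp, fp, tn, fn = confusion_counts(actual, predicted, positive)

    def ratio(numerator, denominator):
        return numerator / denominator if denominator else 0.0

    return {
        "accuracy": ratio(tp + tn, tp + fp + tn + fn),
        "precision": ratio(tp, tp + fp),
        "recall": ratio(tp, tp + fn),
        "false_positive_rate": ratio(fp, fp + tn),
    }

# Hypothetical labels used only for illustration.
actual = ["spam", "spam", "spam", "ham", "ham", "ham", "ham", "ham"]
predicted = ["spam", "spam", "ham", "spam", "ham", "ham", "ham", "ham"]
scores = evaluate(actual, predicted)
```

In this made-up example one of the five ham messages is flagged as spam, giving a false positive rate of 0.2 and an accuracy of 0.75. For a spam filter the false positive rate deserves particular attention, because a legitimate message that is filtered out may never be seen by the user.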
Conclusion    
A huge volume of spam email has very disturbing effects on the memory space of email servers, communication bandwidth, CPU power and user time. The amount of spam in inboxes is increasing every year, and it is responsible for approximately 78% of overall email traffic globally. Users who receive spam emails that they did not request find them very irritating during their working hours. Spam also results in financial loss for many users, who can be trapped as victims of fraudulent practices and other Internet scams by spammers who send emails pretending to be from reputable companies, with the intention of persuading individuals to disclose sensitive personal information such as bank verification numbers, credit card numbers and passwords.
The trend in spam email was sustained in 2016, and by March of that year the volume of spam emails had increased to four times that witnessed in 2015. The volume of spam emails discovered by the lab was 22,890,956. By that time the amount of spam had skyrocketed to an average of 60% for the first four months of the year. Later statistics showed that spam messages accounted for 56% of all email traffic globally, and the most familiar types of spam were dating spam and healthcare spam. Spam email wastes resources on Simple Mail Transfer Protocol (SMTP) servers, as they have to process a large amount of unwanted email.
Training a support vector machine amounts to solving a quadratic programming problem with linear equality and inequality constraints, and the resulting model separates the different groups by means of hyperplanes. It makes use of decision boundaries. The algorithm may not be as fast as other classification methods, but it draws its strength from its high accuracy, which comes from its capacity to model multidimensional borderlines that are not straightforward or sequential. The decision tree algorithm is a type of classifier whose learned pattern has a tree structure (Khorsi, 2007).
Decision tree induction is a distinct and important technique for gaining information through classification of a dataset. Each node in a decision tree is either a leaf node, which specifies the value of the target feature, or a decision node, which specifies a test to be carried out on the value of a feature, with one branch and subtree for each outcome of the test. A decision tree solves a classification problem by starting at the root of the tree and moving through it until a leaf node is reached, which gives the output of the classification. Decision tree learning is therefore an approach that we can apply to spam filtering.
References
Zhang, L., Zhu, J., & Yao, T. (2004). An evaluation of statistical spam filtering techniques. ACM Transactions on Asian Language Information Processing (TALIP), 3(4), 243-269.
https://dl.acm.org/doi/abs/10.1145/1039621.1039625
Blanzieri, E., & Bryl, A. (2008). A survey of learning-based techniques of email spam filtering. Artificial Intelligence Review, 29(1), 63-92.
https://link.springer.com/article/10.1007/s10462-009-9109-6
Khorsi, A. (2007). An overview of content-based spam filtering techniques. Informatica, 31(3).
Tretyakov, K. (2004, May). Machine learning techniques in spam filtering. In Data Mining Problem-oriented Seminar, MTAT (Vol. 3, No. 177, pp. 60-79). Citeseer.
https://courses.cs.ut.ee/2004/dm-seminar-spring/uploads/Main/P06.pdf
Subramaniam, T., Jalab, H. A., & Taqa, A. Y. (2010). Overview of textual anti-spam filtering techniques. International Journal of Physical Sciences, 5(12), 1869-1882.
https://academicjournals.org/journal/IJPS/article-abstract/6D3313D32098
Biggio, B., Fumera, G., Pillai, I., & Roli, F. (2011). A survey and experimental evaluation of image spam filtering techniques. Pattern Recognition Letters, 32(10), 1436-1446.
https://www.sciencedirect.com/science/article/abs/pii/S0167865511000936
Almeida, T. A., & Yamakami, A. (2012). Advances in spam filtering techniques. In Computational Intelligence for Privacy and Security (pp. 199-214). Springer, Berlin, Heidelberg.
https://link.springer.com/chapter/10.1007/978-3-642-25237-2_12
Deshpande, V. P., Erbacher, R. F., & Harris, C. (2007, June). An evaluation of Naïve Bayesian anti-spam filtering techniques. In 2007 IEEE SMC Information Assurance and Security Workshop (pp. 333-340). IEEE.
https://ieeexplore.ieee.org/abstract/document/4267579/
Cormack, G. V., Hidalgo, J. M. G., & Sánz, E. P. (2007, July). Feature engineering for mobile (SMS) spam filtering. In Proceedings of the 30th annual international ACM SIGIR conference on Research and development in information retrieval (pp. 871-872).
https://dl.acm.org/doi/abs/10.1145/1277741.1277951
Dhanaraj, S., & Karthikeyani, V. (2013, February). A study on e-mail image spam filtering techniques. In 2013 international conference on pattern recognition, informatics and mobile engineering (pp. 49-55). IEEE.
https://ieeexplore.ieee.org/abstract/document/6496446
Thomas, K., Grier, C., Ma, J., Paxson, V., & Song, D. (2011, May). Design and evaluation of a real-time URL spam filtering service. In 2011 IEEE Symposium on Security and Privacy (pp. 447-462). IEEE.
https://ieeexplore.ieee.org/abstract/document/5958045/