
I failed this assignment, and the lecturer gave me one more chance. This is a very important unit for me.


NIT3202 Data Analytics for Cyber Security Assignment

Objectives
• To apply skills and knowledge acquired throughout the semester in classification algorithms and the machine learning process.
• To rationalize the use of machine learning algorithms to process large volumes of data effectively and efficiently.
• To demonstrate the ability to use R to perform email classification tasks that are common for corporate security analysts.
• To scientifically conduct and document machine learning experiments for analytics purposes.

Due Date: 2pm, Friday, Week 12

This assignment consists of a report worth 25 marks. Delays caused by a student's own computer downtime cannot be accepted as a valid reason for late submission without penalty. Students must plan their work to allow for both scheduled and unscheduled downtime.

Submission instructions: You must submit an electronic copy of all your assignment files via the VU Collaborate Dropbox. You must include your report and attach your source files as an appendix inside the report document. Assignments will not be accepted through any other manner of submission. Students should note that email and paper-based submissions will ordinarily be rejected.

Late submissions: Submissions received after the due date are penalized at a rate of 5% (of the full mark) per day, no exceptions. Submissions more than 5 days late are penalized at 100% of the full mark. It is the student's responsibility to ensure that they understand the submission instructions. If you have ANY difficulties, ask the Lecturer/Tutor for assistance (prior to the submission date).

Copying, Plagiarism Notice: This is an individual assignment. You are not permitted to work as part of a group when writing this assignment.
The University's policy on plagiarism can be viewed online at https://policy.vu.edu.au/view.current.php?id=27

Overview
The popularity of social media networks, such as Twitter, has led to an increasing number of spamming activities. Researchers have employed various machine learning methods to detect Twitter spam. In this assignment, you are required to classify spam tweets using the provided datasets. The features have been extracted and clearly structured in JSON format. The extracted features can be categorized into two groups, user profile-based features and tweet content-based features, as summarized in Table 1. The provided training dataset and testing datasets are listed in Table 2 and Table 3 respectively. In the testing datasets, the ratio of spam to non-spam is 1:1 in Dataset 1, while the ratio is 1:19 in Dataset 2. In most previous work, the testing datasets are nearly evenly distributed. However, in the real world only around 5% of tweets on Twitter are spam, which means that testing Dataset 2 simulates the real-world scenario. You are required to classify spam tweets, evaluate the classifiers' performance and compare the Dataset 1 and Dataset 2 outcomes by conducting experiments.

Twitter Spam Detection Work Flow

Problem Statement
This is an individual assessment task. Each student is required to submit a report of approximately 2,000-2,500 words along with exhibits that support findings with respect to the provided spam and non-spam messages. This report should consist of:
• Overview of classifiers and evaluation metrics
• Construction of data sets, identification of features and the process of conducting classification
• Technical findings of experiment results
• Justified discussion of the performance evaluation outcomes for different classifiers

To demonstrate your achievement of these goals, you must write a report of at least 2,000 words (2,500 words maximum). Your report should consist of the following chapters:

1. A proper title which matches the contents of your report.
2. Your name and student number in the author line.
3. An executive summary which summarizes your findings. (You may find hints on writing good executive summaries at http://unilearning.uow.edu.au/report/4bi1.html.)
4. An introduction chapter which lists the classification algorithms of your choice (at least 5 algorithms), the features used for classification, the performance evaluation metrics (at least 5 evaluation metrics), a brief summary of your findings, and the organization of the rest of your report. (You may find hints on features used for classification in the Twitter Developer Documentation: https://dev.twitter.com/overview/api)
5. A literature review chapter which surveys the latest academic papers regarding the classifiers and performance evaluation metrics of your choice. For each classifier and performance evaluation metric, you are advised to identify and cite at least one paper published in ACM or IEEE journals or conference proceedings. The aim of this part of the report is to demonstrate a deep and thorough understanding of the existing body of knowledge encompassing multiple classification techniques for security data analytics; specifically, your argument should explain why machine learning algorithms should be used rather than human readers. (Please read the hints on this web page before writing this chapter: http://www.uq.edu.au/student-services/learning/literature-review.)
6. A technical demonstration chapter which consists of fully explained screenshots from your experiments conducted in R. That is, you should explain each step of the classification procedure and the performance results for your classifiers. Note that the classifiers you present in the literature review should be the ones you conduct experiments with.
7. A performance evaluation chapter which evaluates the performance of the classifiers.
You should analyze each classifier's performance with respect to the performance metrics of your choice. In addition, you should compare the performance results in terms of evaluation metrics, e.g., accuracy, false positive rate, recall, F-measure, speed and so on, for the selected classifiers and datasets.
8. A conclusions chapter which summarizes the major findings of the study (you should use at least 5 evaluation metrics to evaluate and compare the performance of the different classifiers, and you can present your experimental results in the form of tables and plots), discusses whether the results match your hypotheses prior to the experiments, and recommends the best-performing classification algorithm.
9. A bibliography list of all cited papers and other resources. You must use in-text citations in Harvard style, and each citation must correspond to a bibliography entry. There must be no bibliography entries that are not cited in the report. (See https://www.vu.edu.au/library/get-help/referencing/referencing-guides.)

Marking Rubric

Scientific Writing in Introduction and Conclusion
• Proficient (above 80%): Use appropriate language and genre to extend the knowledge of a range of audiences.
• Average (60-79%): Use discipline-specific language and genres to address gaps of a self-selected audience. Apply innovatively the knowledge developed to a different context.
• Satisfactory (50-59%): Use some discipline-specific language and prescribed genre to demonstrate understanding from a stated perspective and for a specified audience. Apply to different contexts the knowledge developed.
• Below Expectation (0-50%): Fail to demonstrate understanding for lecturer/teacher as audience. Fail to apply to a similar context the knowledge developed.
Out of 7 marks.

Literature Review
• Proficient (above 80%): Collect and record self-determined information from self-selected sources, choosing or devising an appropriate methodology with self-structured guidelines. Organize information using student-determined structures and management of processes. Generate questions/aims/hypotheses based on literature.
• Average (60-79%): Collect and record self-determined information/data from self-selected sources, choosing an appropriate methodology based on structured guidelines. Organize information/data using student-determined structures, and manage the processes within the parameters set by the guidelines. Generate questions/aims/hypotheses framed within structured guidelines.
• Satisfactory (50-59%): Collect and record required information/data from self-selected sources using one of several prescribed methodologies. Organize information/data using recommended structures. Manage self-determined processes with multiple possible pathways. Respond to questions/tasks generated from a closed inquiry.
• Below Expectation (0-50%): Fail to collect required information or data from the prescribed source. Fail to organize information/data using prescribed structures. Fail to respond to questions/tasks arising explicitly from a closed inquiry.
Out of 7 marks.

Technical Demonstration
• Proficient (above 80%): Provide fully explained screenshots with R script. Explain each step of the classification procedure and the performance results in detail. The entire demo is clear, correct and covers all findings.
• Average (60-79%): Provide fully explained screenshots with R script. Explain each step of the classification procedure and the performance results. The entire demo is clear, but there are some mistakes.
• Satisfactory (50-59%): Provide screenshots with R script. Explain each step of the classification procedure and the performance results, but many parts of the demo are not clear enough and/or contain major flaws or mistakes.
• Below Expectation (0-50%): No screenshots and explanations provided.
Out of 7 marks.

Performance Evaluation
• Proficient (above 80%): Evaluate information/data and the inquiry process rigorously based on the latest literature. Reflect insightfully to renew others' processes. Construct and use one testing data set and two training data sets. 5 classifiers work correctly. 5 evaluation metrics are applied to analyze the performance of the classifiers.
• Average (60-79%): Evaluate information/data and the inquiry process comprehensively within the scope of the given literature. Reflect insightfully to renew others' processes. Construct and use one testing data set and two training data sets. 4 classifiers work correctly. 4 evaluation metrics are applied to analyze the performance of the classifiers.
• Satisfactory (50-59%): Evaluate information/data and reflect on the inquiry process based on the given literature. Use only one testing data set. Fewer than 4 classifiers work correctly. Fewer than 4 evaluation metrics are applied to analyze the performance of the classifiers.
• Below Expectation (0-50%): Fail to evaluate information/data and to reflect on the inquiry process. Use one or no testing dataset. Fewer than 2 classifiers work correctly. Fewer than 2 evaluation metrics are applied to analyze the performance of the classifiers.
Out of 7 marks.

Reference
• Proficient (above 80%): More than 10 bibliographic items (all of them academic papers and at least 1 item per classifier/ at
• Average (60-79%): More than 10 bibliographic items (most of them academic papers and
Answered Same Day: Feb 24, 2021 (NIT3202)


Aditya Kumar answered on Feb 25, 2021
Comparison of Different Classifiers for Filtering Spam Twitter Accounts
Introduction: In the last decade, social media has become an integral part of our daily lives. Platforms such as Facebook, Twitter and Instagram have grown considerably in this context. Many companies and institutions now want to mine text data from different online sources to learn what people really think about their organization. Among these online sources, social platforms have gained the highest priority, and Twitter is one of the first choices that most organizations make. In this report, I have a set of Twitter data indicating which accounts are spam and which are not, along with detailed information about each account, consisting of "Account Age", "No of Followers", "No of Following", "No of User Favourites", "No of Lists" and "No of Tweets", and also some detailed information about the content of those accounts, such as "No of Retweets", "No of Tweet Favourites", "No of Hash Tag", "No of User Mention", "No of URLs", "No of Characters" and "No of Digits". I have applied 5 algorithms to classify whether an account is spam or not: Logistic Regression, Naïve Bayes, Support Vector Machine (SVM), k-Nearest Neighbours (k-NN) and Decision Tree.
Literature Review: In [1], the authors found that people generally tweet about trending topics, but in most cases it is hard to find what the topic is actually about; this work is one step towards better generalization of trending topics with higher accuracy for information retrieval.
In [2], different evaluation metrics that may be used for classification purposes are discussed.
[3] is based on Logistic Regression, a well-known machine learning algorithm for classification when there are only two levels in the response variable.
In [4], the authors analysed the data characteristics which may affect the performance of Naïve Bayes.
In [5], a detailed analysis is given of which Naïve Bayes classifier would be best for spam filtering.
In [6], detailed information on the Support Vector Machine (SVM) is provided.
In [7], a Decision Tree approach is used to solve land cover mapping problems.
In [8], the k-NN algorithm is used for a breast cancer classification problem.
In [9], a k-NN classifier is used to predict a water pollution index.
In [10], a new k-NN classifier for Big Data based on efficient data pruning is introduced.
Technical Demonstration: In this work, I have used R for all the coding needed to perform the classification. As the given data is in .txt format, the read.table() function is used to import the datasets into R. Next, variable names are assigned to the corresponding columns of the datasets. The training dataset has also been attached, as this allows the variables to be called directly instead of prefixing the data frame name every time.
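The import step described above can be sketched roughly as follows. The inline sample data, separator and column names are illustrative assumptions, not the actual dataset's:

```r
# Sketch of the import step: read a whitespace-separated .txt file with
# read.table(), assign descriptive column names, then attach the frame.
# The inline sample and column names are illustrative assumptions.
tmp <- tempfile(fileext = ".txt")
writeLines(c("120 35 10 2 200 spam",
             "800 500 300 40 1500 nonspam"), tmp)

train <- read.table(tmp, header = FALSE, stringsAsFactors = TRUE)
colnames(train) <- c("AccountAge", "NoFollowers", "NoFollowing",
                     "NoLists", "NoTweets", "Class")
attach(train)   # allows referring to AccountAge etc. without train$
```

In a real session the path would point to the provided training file, and the column names would follow Table 1 of the brief.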
Next, a thorough check was done for any missing data in the training dataset, as missing values could mislead the analysis.
There were no missing values in the data, which means the analysis can be carried out without any missing-value treatment.
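A minimal sketch of that missing-value check, with a toy data frame standing in for the imported training data:

```r
# Sketch of the missing-value check. `train` here is a toy stand-in for
# the imported training dataset.
train <- data.frame(AccountAge = c(120, 800, 45),
                    NoTweets   = c(200, 1500, 30))

missing_per_col <- colSums(is.na(train))   # NA count per column
any_missing     <- any(missing_per_col > 0)
# If any_missing were TRUE, na.omit(train) would drop incomplete rows.
```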
1. Logistic Regression: In this dataset, the response variable has only two levels, Spam and Not Spam, so logistic regression can be applied: this model is capable of classifying data whose two levels have the binomial property, meaning the occurrence of one level implies the non-occurrence of the other. Here, a Twitter account cannot be spam and not spam simultaneously, hence the binomial property holds. The following screenshot shows an initial Logistic Regression model.
In this block of code, all the features have been included to check whether they are important for predicting the class. The model result is as follows:
Here, the rightmost column of the table shows the p-value of each predictor. If it is more than 0.05, the predictor is not statistically significant at the 95% confidence level for predicting the class of the response variable. Hence, I keep only those predictors with p-values less than 0.05. The final model is as follows:
And the model output is as follows:
Here, in the final model, all the predictors are statistically significant at the 95% confidence level for predicting the class of the response variable.
So we can see that for the Logistic Regression approach, "Account Age", "No of Tweets", "No of Retweets", "No of Hash Tag", "No of URLs", "No of Characters" and "No of Digits" are the predictor variables which play a vital role in classifying whether a Twitter account is spam or not.
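Since the screenshots are not reproduced here, the two-step procedure (fit a full model, then refit on the predictors with p < 0.05) can be sketched as below. The synthetic data and variable names are assumptions, not the assignment's dataset:

```r
# Sketch of the two-step logistic-regression fit: full model first, then
# keep only predictors whose p-value is below 0.05. Synthetic data
# stands in for the Twitter features.
set.seed(1)
n <- 200
toy <- data.frame(AccountAge = rnorm(n),
                  NoURLs     = rnorm(n),
                  NoLists    = rnorm(n))   # deliberately uninformative
toy$Spam <- rbinom(n, 1, plogis(1.5 * toy$AccountAge - 1.2 * toy$NoURLs))

full  <- glm(Spam ~ ., data = toy, family = binomial)
pvals <- summary(full)$coefficients[-1, "Pr(>|z|)"]  # drop the intercept
keep  <- names(pvals)[pvals < 0.05]                  # significant predictors

final <- glm(reformulate(keep, response = "Spam"),
             data = toy, family = binomial)
```

The same pattern, applied to the full feature set, is what produces the reduced model described above.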
Here, the above code performs predictions based on the final Logistic model on the two given test datasets and calculates the 5 metrics for both predictions. The ROC curves for the two test datasets are given below:
The metric values for both test sets are provided together at the end.
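The five metrics can be derived directly from the confusion matrix. A rough sketch, with simulated predictions standing in for the model output:

```r
# Sketch of the evaluation step: derive accuracy, precision, recall,
# F-measure and false positive rate from a confusion matrix. The real
# predictions come from the fitted model; simulated ones stand in here.
set.seed(2)
actual <- rbinom(100, 1, 0.5)                     # 1 = spam, 0 = non-spam
prob   <- ifelse(actual == 1, 0.85, 0.15) + rnorm(100, sd = 0.1)
pred   <- as.integer(prob > 0.5)                  # 0.5 decision threshold

TP <- sum(pred == 1 & actual == 1); TN <- sum(pred == 0 & actual == 0)
FP <- sum(pred == 1 & actual == 0); FN <- sum(pred == 0 & actual == 1)

accuracy  <- (TP + TN) / length(actual)
precision <- TP / (TP + FP)
recall    <- TP / (TP + FN)                       # true positive rate
f_measure <- 2 * precision * recall / (precision + recall)
fpr       <- FP / (FP + TN)                       # false positive rate
```

For the ROC curves themselves, a dedicated package such as pROC is a common choice, though the metrics above need only base R.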
2. Support Vector Machine (SVM): A Support Vector Machine (SVM) is a supervised machine learning algorithm that can be employed for classification purposes. SVMs are based on the idea of finding a hyperplane that best divides a dataset into two classes.
The following code shows the model for this algorithm:
All the predictor variables have been used here except "No of Tweet Favourites", as this column does not contain any values other than 0, and hence it is unnecessary for...
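A sketch of the SVM step, including dropping the constant column, assuming the e1071 package (its svm() function is a common choice in R). The toy data and column names are illustrative assumptions:

```r
# Sketch of the SVM step: drop the zero-variance column mentioned above,
# then fit an SVM with e1071::svm(). Toy data stands in for the dataset.
library(e1071)
set.seed(3)
n <- 60
toy <- data.frame(NoURLs            = rnorm(n),
                  NoHashTag         = rnorm(n),
                  NoTweetFavourites = rep(0, n))  # constant, as in the text
toy$Spam <- factor(ifelse(toy$NoURLs + rnorm(n, sd = 0.3) > 0,
                          "spam", "nonspam"))

# Identify and remove zero-variance predictors before fitting
preds    <- setdiff(names(toy), "Spam")
constant <- vapply(toy[preds], function(x) length(unique(x)) == 1, logical(1))
toy      <- toy[, c(preds[!constant], "Spam")]

fit  <- svm(Spam ~ ., data = toy, kernel = "linear")
pred <- predict(fit, toy)
```

Removing constant columns before fitting also avoids the scaling warnings svm() issues for zero-variance features.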