SIT719 Security and Privacy Issues in Analytics
Distinction/Higher Distinction Task 5.1: End-to-end project delivery on cyber-security data analytics

Overview

Do you know what an end-to-end data science project is? See the lifecycle of an end-to-end data science project in Figure 1. If you are building a data science application for security analysis, your problem will relate to cybersecurity and your data analysis needs to follow the steps below. See the task description for detailed instructions.

[Figure 1: Data Science Lifecycle (source: http://sudeep.co/)]

In this Distinction/Higher Distinction task, you will experiment with machine learning classification algorithms. Please see more details in the task description. Before attempting this task, please make sure you are already up to date with all Credit and Pass tasks.

Task Description

Instructions: Suppose you are working in an organization as a security analyst. You need to conduct an end-to-end project on "cyber-attack classification in the network traffic database". To complete the project, you follow the steps in Figure 1. Some of the steps have already been solved for you by the teaching team (no action needed), and the remaining ones (highlighted in blue in the brief) you need to complete yourself to submit this task.

Step 1: Business Understanding (Problem Definition)
Your task is to develop a 5-class machine learning-based classification model to identify normal network traffic and attack classes.

Step 2: Data Gathering (Identify the source of data)
In industry, you need to communicate with your manager, client, other stakeholders and/or IT team to understand the source of data and to gather it. Here, the teaching team has already gathered the data for you. Please see the link below to obtain the NSL-KDD dataset with 5 classes.
Dataset link (CloudDeakin): https://d2l.deakin.edu.au/d2l/le/content/881325/home?ou=881325
If you are interested in learning more about the dataset, please visit: https://www.unb.ca/cic/datasets/nsl.html
A starting example code for the 5-class classification is also given for your benefit, where some of the steps are already implemented. Please see the following link to access it: https://d2l.deakin.edu.au/d2l/le/content/881325/home?ou=881325

Step 3: Data Cleaning (Filtering anomalous data)
You need to take care of missing values and inconsistent data. In week 2, you learnt how to deal with missing values and manipulate a database. This has already been taken care of for this dataset, so no action is needed for this task.

Step 4: Data Exploration (Understanding the data)
Here, you need to do the following tasks and write the results in your report (a loading-and-exploration sketch is given after Step 5):
1. Identify the attribute names (header).
2. Check the length of the train and test datasets.
3. Check the total number of samples that belong to each of the five classes of the training dataset.

Step 5: Feature Engineering (Select important features)
You may need to do feature extraction or selection during your data analysis process. Some fundamental concepts of feature engineering using Python are discussed at: https://jakevdp.github.io/PythonDataScienceHandbook/05.04-feature-engineering.html
Here, the relevant feature engineering has already been done for you in the sample code.
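As a reference for Steps 2 and 4, a minimal loading-and-exploration sketch is shown below. The file names (KDDTrain+.txt, KDDTest+.txt) and the assumption that the files carry no header row follow the public NSL-KDD distribution and may differ from the CloudDeakin copy.

import pandas as pd

# Column names from the NSL-KDD documentation: 41 features plus the outcome and difficulty fields.
columns = ['duration', 'protocol_type', 'service', 'flag', 'src_bytes', 'dst_bytes',
           'land', 'wrong_fragment', 'urgent', 'hot', 'num_failed_logins', 'logged_in',
           'num_compromised', 'root_shell', 'su_attempted', 'num_root',
           'num_file_creations', 'num_shells', 'num_access_files', 'num_outbound_cmds',
           'is_host_login', 'is_guest_login', 'count', 'srv_count', 'serror_rate',
           'srv_serror_rate', 'rerror_rate', 'srv_rerror_rate', 'same_srv_rate',
           'diff_srv_rate', 'srv_diff_host_rate', 'dst_host_count', 'dst_host_srv_count',
           'dst_host_same_srv_rate', 'dst_host_diff_srv_rate', 'dst_host_same_src_port_rate',
           'dst_host_srv_diff_host_rate', 'dst_host_serror_rate', 'dst_host_srv_serror_rate',
           'dst_host_rerror_rate', 'dst_host_srv_rerror_rate', 'outcome', 'difficulty']

train_df = pd.read_csv('KDDTrain+.txt', names=columns)   # file name is an assumption
test_df = pd.read_csv('KDDTest+.txt', names=columns)     # file name is an assumption

print(train_df.columns.tolist())            # Step 4.1: attribute names
print(len(train_df), len(test_df))          # Step 4.2: lengths of the train and test sets
print(train_df['outcome'].value_counts())   # Step 4.3: samples per raw label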
Step 6: Predictive Modelling (Prediction of the classes)
The DecisionTreeClassifier has been implemented for you. Now you need to implement other techniques and compare them. Please do the following tasks:
1. Implement at least 5 benchmark classification algorithms.
2. Tune the parameters, if applicable, to obtain a good solution.
3. Obtain the confusion matrix for each of the scenarios.
4. Calculate the performance measures for each of the classification algorithms, including Accuracy (%), Precision (%), Recall (%), F-Score (%) and False Alarm rate, FPR (%). (A sketch of deriving these measures from a confusion matrix is given after Step 8.)

You need to compare the results following the table below. Create one table for each algorithm:

Attack Class    Accuracy (%)    Precision (%)    Recall (%)    ...
DoS
Normal
Probe
R2L
U2R

Finally, summarize the results in a table similar to the one below:

Algorithms    Accuracy (%)    Precision (%)    Recall (%)    ...
Alg 1
Alg 2
...

Your results need to be comparable against benchmark algorithms; for example, see the results reported in the recent article "An Adaptive Ensemble Machine Learning Model for Intrusion Detection", published in IEEE Access, July 2019.

Step 7: Data Visualization
Perform the following tasks:
1. Visualize and compare the accuracy of the different algorithms.
2. Plot the confusion matrix for each scenario.

Step 8: Results Delivery
Once you have completed the data analysis task for your security project, you need to deliver the outcome. Results can be delivered as a product/tool/web app, through a presentation, or by submitting a report. Here, you need to write a report (at least 2000 words) based on the outcome and results you obtained by performing the above steps. The report should describe the algorithms used, their working principles, key parameters, and the results. Results should cover all the key performance measures and comparative results in the form of tables, graphs, etc. Compile everything into a PDF and submit it through OnTrack. You also need to include the Python script in your submission.
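For item 4 of Step 6, note that scikit-learn's classification_report gives precision, recall and F-score but not the false-alarm rate, so the per-class measures can instead be derived from the multi-class confusion matrix. The sketch below shows one possible way to do this, treating each class one-vs-rest; the function name per_class_metrics is illustrative, not part of the provided sample code.

from sklearn.metrics import confusion_matrix

def per_class_metrics(y_true, y_pred, labels):
    # Treat each class as one-vs-rest to obtain TP, FP, FN, TN from the confusion matrix.
    cm = confusion_matrix(y_true, y_pred, labels=labels)
    total = cm.sum()
    results = {}
    for i, label in enumerate(labels):
        tp = cm[i, i]
        fn = cm[i, :].sum() - tp
        fp = cm[:, i].sum() - tp
        tn = total - tp - fn - fp
        precision = tp / (tp + fp) if (tp + fp) else 0.0
        recall = tp / (tp + fn) if (tp + fn) else 0.0
        f_score = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
        fpr = fp / (fp + tn) if (fp + tn) else 0.0
        accuracy = (tp + tn) / total          # one-vs-rest accuracy for this class
        results[label] = {'Accuracy (%)': 100 * accuracy, 'Precision (%)': 100 * precision,
                          'Recall (%)': 100 * recall, 'F-Score (%)': 100 * f_score,
                          'FPR (%)': 100 * fpr}
    return results

# Example usage: per_class_metrics(y_test, y_pred, ['Dos', 'Normal', 'Probe', 'R2L', 'U2R'])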
Answer (Sandeep Kumar, Jun 08 2021)
The list of columns (features) for the NSL-KDD dataset is:
'duration', 'protocol_type', 'service', 'flag', 'src_bytes', 'dst_bytes', 'land', 'wrong_fragment', 'urgent', 'hot', 'num_failed_logins', 'logged_in', 'num_compromised', 'root_shell', 'su_attempted', 'num_root', 'num_file_creations', 'num_shells', 'num_access_files', 'num_outbound_cmds', 'is_host_login', 'is_guest_login', 'count', 'srv_count', 'serror_rate', 'srv_serror_rate', 'rerror_rate', 'srv_rerror_rate', 'same_srv_rate', 'diff_srv_rate', 'srv_diff_host_rate', 'dst_host_count', 'dst_host_srv_count', 'dst_host_same_srv_rate', 'dst_host_diff_srv_rate', 'dst_host_same_src_port_rate', 'dst_host_srv_diff_host_rate', 'dst_host_serror_rate', 'dst_host_srv_serror_rate', 'dst_host_rerror_rate', 'dst_host_srv_rerror_rate', 'outcome', 'difficulty'
The training set has 125973 rows, while the testing set has 22543 rows.
Also, the training set has 23 possible outcomes, namely: normal, neptune, warezclient, ipsweep, portsweep, teardrop, nmap, satan, smurf, pod, back, guess_passwd, ftp_write, multihop, rootkit, buffer_overflow, imap, warezmaster, phf, land, loadmodule, spy, perl, while the testing set has 38 possible outcomes:
neptune, normal, saint, mscan, guess_passwd, smurf, apache2, satan, buffer_overflow, back, warezmaster, snmpgetattack, processtable, pod, httptunnel, nmap, ps, snmpguess, ipsweep, mailbomb, portsweep, multihop, named, sendmail, loadmodule, xterm, worm, teardrop, rootkit, xlock, perl, land, xsnoop, sqlattack, ftp_write, imap, udpstorm, phf.
For preprocessing, the labels must first be extracted. As the lists above show, the testing set contains 17 attack types that are absent from the training data, so we need more general labels to train the model for the classification task.
All of the attack types appearing in the training and testing sets have been condensed into four general attack categories which, together with normal traffic, form the 5 classes used in this report:
· Denial of Service (DoS) attacks
· Remote to Local (R2L) attacks
· User to Root (U2R) attacks
· Probe attacks
· Normal (non-attack) traffic
Our models, namely the decision tree, AdaBoost, gradient boosting, random forest and extra trees classifiers, will classify the traffic into these five classes, indicating whether it is normal or belongs to one of the four attack categories above. We will also use the five classes to analyse the results and calculate performance metrics for each general attack type.
The next section replaces the current outcome field with a Class field that takes one of the following values (a sketch of this mapping is shown after the list):
· Normal
· Dos
· R2L
· U2R
· Probe
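The sketch below shows one way to build this Class field with pandas. The exact grouping of raw labels into the four attack categories is an assumption based on the grouping commonly used for NSL-KDD (a few labels, such as worm and httptunnel, are placed differently in some references), and train_df/test_df are assumed to be DataFrames with an 'outcome' column as loaded earlier.

dos = {'back', 'land', 'neptune', 'pod', 'smurf', 'teardrop', 'apache2',
       'udpstorm', 'processtable', 'mailbomb'}
probe = {'ipsweep', 'nmap', 'portsweep', 'satan', 'mscan', 'saint'}
r2l = {'ftp_write', 'guess_passwd', 'imap', 'multihop', 'phf', 'spy',
       'warezclient', 'warezmaster', 'sendmail', 'named', 'snmpgetattack',
       'snmpguess', 'xlock', 'xsnoop', 'httptunnel', 'worm'}
u2r = {'buffer_overflow', 'loadmodule', 'perl', 'rootkit', 'ps', 'sqlattack', 'xterm'}

def to_class(outcome):
    # Collapse a raw outcome label into one of the five general classes.
    if outcome == 'normal':
        return 'Normal'
    if outcome in dos:
        return 'Dos'
    if outcome in probe:
        return 'Probe'
    if outcome in r2l:
        return 'R2L'
    if outcome in u2r:
        return 'U2R'
    raise ValueError('Unmapped outcome label: ' + outcome)

train_df['Class'] = train_df['outcome'].apply(to_class)
test_df['Class'] = test_df['outcome'].apply(to_class)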
Feature Engineering
For continuous features we use the MinMaxScaler provided by the scikit-learn library. The scaler is fitted on the training set values only and is then used to scale both the training and testing sets; the minmax_scale_values helper function does this. For the discrete features we use one-hot encoding, which the encode_text function implements. A sketch of both helpers is shown below.
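The following is a sketch only; the signatures of minmax_scale_values and encode_text are assumptions, since the original sample code is not reproduced here, but the logic follows the description above: fit the MinMaxScaler on the training values only, transform both sets, and one-hot encode the discrete columns while keeping the train and test columns aligned.

import pandas as pd
from sklearn.preprocessing import MinMaxScaler

def minmax_scale_values(train_df, test_df, column):
    # Fit on the training values only, then scale both sets with the same scaler.
    scaler = MinMaxScaler()
    train_df[column] = scaler.fit_transform(train_df[[column]]).ravel()
    test_df[column] = scaler.transform(test_df[[column]]).ravel()
    return train_df, test_df

def encode_text(train_df, test_df, column):
    # One-hot encode a discrete column; align the test columns with the train columns.
    train_dummies = pd.get_dummies(train_df[column], prefix=column)
    test_dummies = pd.get_dummies(test_df[column], prefix=column)
    test_dummies = test_dummies.reindex(columns=train_dummies.columns, fill_value=0)
    train_df = pd.concat([train_df.drop(columns=[column]), train_dummies], axis=1)
    test_df = pd.concat([test_df.drop(columns=[column]), test_dummies], axis=1)
    return train_df, test_df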
The Model
In order to address the imbalance of samples representing each attack type in the training data, and the model's inability to learn about new attack types from the existing ones alone, we apply several classifier-based machine learning algorithms provided by the scikit-learn library: the decision tree, AdaBoost, gradient boosting, random forest and extra trees classifiers. A minimal training-and-evaluation sketch follows this paragraph.
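The sketch below fits the five classifiers and prints a confusion matrix for each. It assumes the preprocessing above has produced feature matrices X_train/X_test and class labels y_train/y_test; apart from random_state=17 for the decision tree (stated in the next paragraph), the hyper-parameters shown are scikit-learn defaults rather than the tuned values used in the report, and the extra-trees model is assumed to be the ensemble variant.

from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import (AdaBoostClassifier, GradientBoostingClassifier,
                              RandomForestClassifier, ExtraTreesClassifier)
from sklearn.metrics import accuracy_score, confusion_matrix

# X_train, X_test, y_train, y_test are assumed to come from the preprocessing steps above.
models = {
    'Decision Tree': DecisionTreeClassifier(random_state=17),
    'AdaBoost': AdaBoostClassifier(),
    'Gradient Boosting': GradientBoostingClassifier(),
    'Random Forest': RandomForestClassifier(),
    'Extra Trees': ExtraTreesClassifier(),
}
labels = ['Dos', 'Normal', 'Probe', 'R2L', 'U2R']

for name, model in models.items():
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    print(name, 'accuracy:', accuracy_score(y_test, y_pred))
    print(confusion_matrix(y_test, y_pred, labels=labels))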
1) Decision Tree Classifier: Decision trees are a non-parametric supervised learning technique used for both classification and regression. The objective is to build a model that predicts the value of a target variable by learning simple decision rules from the data features. The model is easy to interpret, can be readily visualized, and requires little to no data engineering, which makes it well suited to multi-class problems such as the 5-class classification in this assignment. The classifier takes as input an array of shape (number of samples, number of features); its key parameter here is random_state, which is set to 17 for this model.
2) Random Forest Classifier: A random forest classifier is a meta-estimator that fits a number of decision tree classifiers on...