This is an assignment that tests machine learning skill. The task is to follow all the required procedures of a typical machine learning fraud-detection project and to design at least 3 different machine learning models (random forest, logistic regression, and any other good one), then compare and contrast them using at least 3 different evaluation metrics to decide on the best model for this dataset. It requires a write-up, but I can do that on my own. I just need the code with line-by-line comments and adequately labelled plots, covering everything from data exploration to the final step. Please do adequate data balancing (e.g. using SMOTE), feature selection, etc., to achieve the best model, and present the results with clear, attractive plots.
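A minimal sketch of such a pipeline is shown below, assuming a credit-card-style fraud table loaded from a file named fraud.csv with a binary Class column; the file name, column names, model choices, and random seeds are illustrative assumptions, not part of the assignment.

```python
# Sketch only: file name, column names, model choices, and seeds are assumptions.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, f1_score, average_precision_score
from imblearn.over_sampling import SMOTE  # pip install imbalanced-learn

df = pd.read_csv("fraud.csv")                    # hypothetical file name
X, y = df.drop(columns=["Class"]), df["Class"]   # hypothetical target column

# Hold out a stratified test set before any balancing so evaluation stays honest.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

scaler = StandardScaler().fit(X_train)
X_train_s, X_test_s = scaler.transform(X_train), scaler.transform(X_test)

# Oversample the minority (fraud) class on the training split only.
X_bal, y_bal = SMOTE(random_state=42).fit_resample(X_train_s, y_train)

models = {
    "Random Forest": RandomForestClassifier(n_estimators=300, random_state=42),
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Gradient Boosting": GradientBoostingClassifier(random_state=42),
}

# Compare the three models on three metrics computed from the untouched test set.
for name, model in models.items():
    model.fit(X_bal, y_bal)
    proba = model.predict_proba(X_test_s)[:, 1]
    preds = model.predict(X_test_s)
    print(f"{name}: ROC-AUC={roc_auc_score(y_test, proba):.3f}  "
          f"F1={f1_score(y_test, preds):.3f}  "
          f"PR-AUC={average_precision_score(y_test, proba):.3f}")
```

Feature selection (e.g. SelectKBest or the random forest's feature importances) and the labelled plots the brief asks for (class-balance bar chart, ConfusionMatrixDisplay, RocCurveDisplay) would be layered on top of this same loop.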


Machine Learning, 2022S: HW5
CS 5033 - Machine Learning: Homework 5, Spring 2022
Due: Monday, April 11, 2022

In this homework we will be practicing with techniques on regularization, model assessment, and model selection. In this direction we will work with two different datasets from the UCI repository. Below I assume that you will be using scikit-learn.

Exercise 1 – Preprocessing for Regression (20 points). We will be using a UCI superconductivity data set that is available at: https://archive.ics.uci.edu/ml/datasets/Superconductivty+Data. This dataset has 21,263 examples using 81 real-valued attributes, with a real-valued target variable. For this assignment, we will use the file train.csv and we can ignore the file unique_m.csv. The target value that we are predicting is the critical temperature in Kelvins, which is the last column in the dataset.
(a) Remove 20% of the examples and keep them for testing. You may assume that all examples are independent, so it does not matter which 20% you remove. However, the testing data should not be used until after a model has been selected.
(b) Split the remaining examples into training (75%) and validation (25%). Thus, you will train with 60% of the full dataset (75% of 80%) and validate with 20% of the full dataset (25% of 80%).

Exercise 2 – Regression (30 points). We will be using the dataset from Exercise 1. For this problem please use the scikit-learn method sklearn.linear_model.ElasticNet (https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.ElasticNet.html).
(a) Fit an elastic net model to the training data with each possible combination of the following L1 and L2 regularization weights:
• λ1 = 0, 10^-6, 10^-5, 10^-4, 10^-3, 10^-2, 10^-1, 1
• λ2 = 0, 10^-6, 10^-5, 10^-4, 10^-3, 10^-2, 10^-1, 1
If you check the documentation, you'll see that the input arguments for ElasticNet do not include λ1 and λ2. Instead, they include alpha and l1_ratio. λ1 is alpha * l1_ratio, and λ2 is alpha * (1 - l1_ratio).
(b) For each model trained in step (a), make a prediction for each training example, using the predict method of sklearn.linear_model.ElasticNet, and calculate the mean squared error (MSE) on the training examples. Report these values in your write-up. As a reminder, the mean squared error (MSE) on a dataset of size m is

\mathrm{MSE} = \frac{1}{m} \sum_{i=1}^{m} \bigl( y_i - h(x_i) \bigr)^2,

where, as usual, h(x_i) is the prediction of our hypothesis and y_i is the correct (ground-truth) value as defined in the dataset.
(c) This time, for each model trained in step (a), make a prediction for each validation example and calculate the mean squared error on the validation examples. Report these values in your write-up.
(d) Which model (i.e., pair of λ1 and λ2) performed best on the training data? Which model performed best on the validation data? Report this in your write-up.
(e) Find the best hyperparameter set (pair of λ1 and λ2) on the validation data. Train a model with the same λ1 and λ2 on both the training and validation data. Using this model, make predictions for each testing example and calculate the mean squared error on the test examples. Report this value in your write-up.
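A minimal sketch of the split in Exercise 1 and the grid in Exercise 2 is given below, assuming train.csv has been downloaded to the working directory; the random seeds, max_iter value, and the tiny-alpha fallback for the (λ1, λ2) = (0, 0) pair are assumptions, and the (alpha, l1_ratio) mapping simply follows the note in part (a).

```python
# Sketch only: seeds, max_iter, and the near-zero alpha fallback are assumptions.
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import ElasticNet
from sklearn.metrics import mean_squared_error

df = pd.read_csv("train.csv")                          # superconductivity data
X, y = df.iloc[:, :-1].values, df.iloc[:, -1].values   # critical temperature is the last column

# Exercise 1: 20% test, then 75%/25% train/validation on the remainder.
X_rest, X_test, y_rest, y_test = train_test_split(X, y, test_size=0.20, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X_rest, y_rest, test_size=0.25, random_state=0)

lambdas = [0, 1e-6, 1e-5, 1e-4, 1e-3, 1e-2, 1e-1, 1]
results = []
for lam1 in lambdas:
    for lam2 in lambdas:
        alpha = lam1 + lam2            # lam1 = alpha*l1_ratio, lam2 = alpha*(1 - l1_ratio)
        if alpha == 0:
            # scikit-learn advises against alpha=0; a tiny value approximates plain least squares.
            model = ElasticNet(alpha=1e-12, l1_ratio=0.0, max_iter=10000)
        else:
            model = ElasticNet(alpha=alpha, l1_ratio=lam1 / alpha, max_iter=10000)
        model.fit(X_train, y_train)
        train_mse = mean_squared_error(y_train, model.predict(X_train))   # part (b)
        val_mse = mean_squared_error(y_val, model.predict(X_val))         # part (c)
        results.append((lam1, lam2, train_mse, val_mse))

# Part (e): refit the pair with the lowest validation MSE on train + validation, then test.
lam1, lam2, _, _ = min(results, key=lambda r: r[3])
alpha = lam1 + lam2
final = ElasticNet(alpha=alpha if alpha > 0 else 1e-12,
                   l1_ratio=(lam1 / alpha) if alpha > 0 else 0.0, max_iter=10000)
final.fit(np.vstack([X_train, X_val]), np.concatenate([y_train, y_val]))
print("Test MSE:", mean_squared_error(y_test, final.predict(X_test)))
```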
Exercise 3 – Preprocessing for Classification (20 points). Here we will do the same thing as we did for Exercise 1, but for a different dataset. In particular, we will be using a UCI simulated electrical grid stability data set that is available here: https://archive.ics.uci.edu/ml/datasets/Electrical+Grid+Stability+Simulated+Data+. This dataset has 10,000 examples using 11 real-valued attributes, with a binary target (stable vs. unstable). The target value that you are predicting is the last column in the dataset.
(a) Remove columns 5 and 13 (labeled p1 and stab); p1 is non-predictive, and stab is a target column that is exactly correlated with the binary target you are trying to predict (if this column is negative, the system is stable).
(b) Change the target variable to a number. If the value is stable, change it to 1, and if the value is unstable, change it to 0.
(c) Remove 20% of the examples and keep them for testing. You may assume that all examples are independent, so it does not matter which 20% you remove. However, the testing data should not be used until after a model has been selected.
(d) Split the remaining examples into training (75%) and validation (25%). Thus, you will train with 60% of the full dataset (75% of 80%) and validate with 20% of the full dataset (25% of 80%).

Exercise 4 – Logistic Regression (30 points). For this problem you can use the scikit-learn method sklearn.linear_model.LogisticRegression.
(a) Fit a logistic regression model with L2 regularization (use the default value of λ) and another logistic regression model with no regularization. Note that, per the documentation, by default the function sklearn.linear_model.LogisticRegression performs L2 regularization, and in order not to use regularization we need to pass penalty='none' as a parameter in the creation of the model.
(b) Using the two models created in part (a), make a prediction for each validation example. What is the empirical risk (using the 0-1 loss) of each model on the validation set, and what is the confusion matrix of each model (again, on the validation set)?
(c) Which model performed better on the validation data? Report this in your write-up. We consider a model better than another one if the empirical risk of the first model is lower than the empirical risk of the second model. In case the empirical risk is a tie, compare the models using precision and determine which one is better. Now, train a new logistic regression model on the training and validation data using whichever setting created the best model (i.e., is it beneficial to use L2 regularization in this dataset, or not?) in (a)-(b). Make a prediction for each testing example. Report the empirical risk on the test set and the confusion matrix that corresponds to the predictions of this last model on the test data.
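A minimal sketch of Exercises 3 and 4 is given below, assuming the grid-stability data is saved locally as Data_for_UCI_named.csv with the UCI column names (p1, stab, and the categorical target stabf); the file path, seeds, and max_iter are assumptions. Note that recent scikit-learn releases request the unregularized model with penalty=None, whereas older releases use the string 'none' mentioned in part (a).

```python
# Sketch only: file path, column names, seeds, and the penalty=None spelling are assumptions
# (older scikit-learn versions expect penalty='none' instead).
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import zero_one_loss, confusion_matrix

df = pd.read_csv("Data_for_UCI_named.csv")

# Exercise 3(a)-(b): drop the non-predictive p1 column and the redundant stab column,
# then encode the target: stable -> 1, unstable -> 0.
y = (df["stabf"] == "stable").astype(int)
X = df.drop(columns=["p1", "stab", "stabf"])

# Exercise 3(c)-(d): 20% test, then 75%/25% train/validation on the remainder.
X_rest, X_test, y_rest, y_test = train_test_split(X, y, test_size=0.20, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X_rest, y_rest, test_size=0.25, random_state=0)

# Exercise 4(a): default L2-regularized model vs. an unregularized model.
models = {
    "L2 (default)": LogisticRegression(max_iter=5000),
    "no regularization": LogisticRegression(penalty=None, max_iter=5000),
}

# Exercise 4(b): empirical risk (0-1 loss) and confusion matrix on the validation set.
for name, model in models.items():
    model.fit(X_train, y_train)
    val_pred = model.predict(X_val)
    print(name, "validation 0-1 risk:", zero_one_loss(y_val, val_pred))
    print(confusion_matrix(y_val, val_pred))

# Exercise 4(c): refit the better configuration on train + validation and report test results
# (the L2 default is assumed to win here; swap in penalty=None if it does not).
best = LogisticRegression(max_iter=5000)
best.fit(pd.concat([X_train, X_val]), pd.concat([y_train, y_val]))
test_pred = best.predict(X_test)
print("test 0-1 risk:", zero_one_loss(y_test, test_pred))
print(confusion_matrix(y_test, test_pred))
```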
CS 5033 – Machine Learning: Rubric for Projects
Dimitris Diochnos, Spring 2022

1 Overview
Each project is worth 100 points, judged by the following criteria.

Criterion – Max Points
Names and Course Levels – 2
Spelling – 3
Project Domain – 7
Hypotheses – 8
Learning Methods – 8
Contributions – 2
Summary and Future Work – 6
Related Work / Literature Review – 10
Organization – 8
Metrics – 8
Checkpoint – 10
Experiments – 12
Analysis and Explanations – 8
Implementation and learning – 8
Total points: 100

2 Breakdown of Points
We have the following.

Names and Course Levels (max 2 pts).
• Names and course levels for each member of the group given on the first page of the write-up gives you 2 points.
• If this information is left unclear, you get 0 points, but you also risk getting 0 points on the project in general (because I may not be able to do the mapping, and only one person per group should submit the project).

Spelling (max 3 pts).
• All text spelled correctly gives you 3 points.
• One spelling error gives you 2 points.
• Anything else gives you 0 points.

Project Domain (max 7 pts).
• Effective communication of the domain gives you 7 points.
• If the domain is unclear, you get 4 points.
• If the domain is not mentioned, you get 0 points.

Hypotheses (max 8 pts).
• The hypotheses are clearly stated in the write-up, gives you 8 points.
• For each hypothesis not stated out of the n that you need to deliver, 8/n points are subtracted.

Learning Methods (max 8 pts).
• Effective communication of the learning methods, why they were chosen, and what they are, gives you 8 points.
• Learning methods unclear gives you 5 points.
• No methods mentioned gives you 0 points.

Contributions (max 2 pts).
• Contribution of each member of the group is clear (which learning method was implemented), gives you 2 points. (If you are working alone you get all 2 points.)
• Unclear description gives you 1 point.
• No description gives you 0 points.

Summary and Future Work (max 6 pts).
• Effective summary of the results and thoughtful future work gives you 6 points.
• Missing future work gives you 4 points.
• Missing summary gives you 3 points.
• Missing both a summary and ideas for future work gives you 0 points.

Related Work / Literature Review (max 10 pts).
• Discussion of related work as expected by each member of the group gives you 10 points.
• Omission of 1 paper from related work gives you 5 points.
• Anything else gives you 0 points.
IMPORTANT: Related work counts as refereed papers or book chapters. You cannot cite some post that someone made somewhere online and think that this is an actual reference – it is not. Some top or good conferences from which you can find papers are (this is not a complete list, but it can give you an idea): NeurIPS, ICML, AAAI, AISTATS, COLT, ALT, IROS, IJCAI, KDD, ECAI, ECML/PKDD, AAMAS, UAI, ICLR, SDM, ICMLA, ICDM, ICDE, SIGIR. Similarly, a non-exhaustive list of some good journals from which you can draw papers for your references: Machine Learning (ML), Journal of Machine Learning Research (JMLR), Proceedings of Machine Learning Research (these are conference papers but from good conferences – PMLR), Artificial Intelligence (AI), Journal of Artificial Intelligence Research (JAIR), Annals of Mathematics and Artificial Intelligence (AMAI), Journal of Field Robotics (JFR), Communications of the ACM (CACM).

Organization (max 8 pts).
• Organization of the presentation (in whatever form is chosen) is clear, gives you 8 points.
• Organization is confusing, gives you 3 points.

Metrics (max 8 pts).
• Metrics used to evaluate results were explained well, gives you 8 points.
• Metrics were confusing, or reasoning for the selection of said metrics was confusing, gives you 4 points.
• Metrics not explained gives you 0 points.

Checkpoint (max 5 pts).
• Checkpoint submitted on time gives you 5 points.
• Checkpoint submitted, but not on time, gives you 2 points.
• Checkpoint not submitted gives you 0 points.

Experiments (max 12 pts).
• Experiments were clearly explained and learning methods shown for all members of the group, gives you 12 points.
• Some confusion on the experiments gives you 9 points.
• Missing experiments for one group member,