Homework Assignment #5 (Individual): Using SVMs and PCA with new data: The Palmer Penguins Dataset

The assignment is attached. Part 1 can be skipped.


Homework Assignment #5 (Individual)
Using SVMs and PCA with new data: The Palmer Penguins Dataset

Put your name here.
Put your GitHub username here.

Goals for this homework assignment

By the end of this assignment, you should be able to:

- Use git to track your work and turn in your assignment
- Read in data and prepare it for modeling
- Build, fit, and evaluate an SVC model of data
- Use PCA to reduce the number of important features
- Build, fit, and evaluate an SVC model of PCA-transformed data
- Systematically investigate the effects of the number of PCA components on an SVC model of data

Assignment instructions:

Work through the following assignment, making sure to follow all of the directions and answer all of the questions.

There are 47 points (+2 bonus points) possible on this assignment. Point values for each part are included in the section headers.

This assignment is due at 11:59 pm on Friday, December 3. It should be pushed to your repo (see Part 1) and submitted to D2L.

Imports

It's useful to put all of the imports you need for this assignment in one place. Read through the assignment to figure out which imports you'll need, or add them here as you go.

In [ ]: # Put all necessary imports here

1. Add to your Git repository to track your progress on your assignment (4 points)

As usual, for this assignment, you're going to add it to the cmse202-f21-turnin repository you created in class so that you can track your progress on the assignment and preserve the final version that you turn in.
To do this, do the following:

1. Navigate to your cmse202-f21-turnin repository and create a new directory called hw-05.
2. Move this notebook into that new directory in your repository, then add it and commit it to your repository.
3. Finally, to test that everything is working, "git push" the file so that it ends up in your GitHub repository.

Important: Make sure you've added your Professor and your TA as collaborators to your "turnin" repository with "Read" access so that we can see your assignment (you should have done this in the previous homework assignment).

Also important: Make sure that the version of this notebook that you are working on is the same one that you just added to your repository! If you are working on a different copy of the notebook, none of your changes will be tracked!

If everything went as intended, the file should now show up on your GitHub account in the "cmse202-f21-turnin" repository inside the hw-05 directory that you just created. Periodically, you'll be asked to commit your changes to the repository and push them to the remote GitHub location. Of course, you can always commit your changes more often than that, if you wish. It can be good to get into a habit of committing your changes any time you make a significant modification, or when you stop working on the project for a bit.

Do this: Before you move on, put the command that your instructor should run to clone your repository in the markdown cell below.

# Put the command for cloning your repository here!

2. Loading a new dataset: The Palmer Penguins data (8 points)
We've seen the iris dataset a number of times in the course so far, and it has a number of nice features that make it useful for practicing some of the machine learning methods in use today. Recently, however, a new dataset was suggested as a possible replacement or alternative for the iris data: the "Palmer Penguins" dataset. Perhaps you've already seen it! This dataset also has some nice properties that make it a good playground for experimenting with machine learning tools. You can learn more about the dataset on its website.

Since the goal for this assignment is to practice using the SVM and PCA tools we've covered in class, we're going to use this relatively simple dataset and avoid any complicated data wrangling headaches!

The data

The penguins dataset is pretty straightforward, but you'll need to download the data and give yourself some time to get familiar with it.

Do This: To get started, you'll need to download the following file:

https://raw.githubusercontent.com/msu-cmse-courses/cmse202-F21-data/main/data/penguins_size.csv

Once you've downloaded the data, open the file using a text editor or other tool on your computer and take a look at the data to get a sense for the information it contains. You'll probably also want to read through the information on the palmerpenguins website to get a sense for what the values correspond to. The website talks about two different versions of the data, a simplified one and a "raw" one with more values. Which one are you working with?

2.1 Load the data

Task 2.1 (2 points): Read the penguins_size.csv file into your notebook.
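As a minimal sketch of the loading step, the snippet below wraps `pd.read_csv` in a small helper and runs it on an inline sample so it works without network access. The helper name `load_penguins`, the sample rows, and the column names in the sample are assumptions made for illustration; check them against the real file.

```python
import io

import pandas as pd

# URL quoted in the assignment; reading it directly requires network access.
DATA_URL = (
    "https://raw.githubusercontent.com/msu-cmse-courses/"
    "cmse202-F21-data/main/data/penguins_size.csv"
)


def load_penguins(source):
    """Hypothetical helper: read a penguins CSV from a path, URL, or file-like object."""
    return pd.read_csv(source)


# Inline stand-in for the real file. The column names and values here are
# assumptions for illustration, not a copy of the actual dataset.
sample_csv = io.StringIO(
    "species,island,culmen_length_mm,culmen_depth_mm,"
    "flipper_length_mm,body_mass_g,sex\n"
    "Adelie,Torgersen,39.1,18.7,181,3750,MALE\n"
    "Adelie,Torgersen,,,,,\n"
    "Gentoo,Biscoe,46.1,13.2,211,4500,FEMALE\n"
)

penguins = load_penguins(sample_csv)
print(penguins.shape)   # (rows, columns)
print(penguins.head())  # empty fields are read in as NaN
```

With network access, `load_penguins(DATA_URL)` or a local path to the downloaded file could be passed instead of the inline sample.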
For the purposes of this assignment, we're going to use "species" as the class that we'll be trying to predict with our classification model. To make this clear, you should rename the species column to class. The species column should currently have the following class labels:

"Adelie"
"Chinstrap"
"Gentoo"

Once you've loaded in the data and changed the species column to class, display the DataFrame to make sure it looks reasonable. You should have 7 columns and 344 rows.

Dataset website: https://allisonhorst.github.io/palmerpenguins/

In [ ]: # Put your code here

2.2 Relabeling the classes

To simplify the process of modeling the penguin data, we should convert the class labels from strings to integers. For example, rather than Adelie, we can consider this to be class "0".

Task 2.2 (2 points): Replace all of the strings in your "class" column with integers based on the following:

original label    replaced label
Adelie            0
Chinstrap         1
Gentoo            2

Once you've replaced the labels, display your DataFrame and confirm that it looks correct.

In [ ]: # Put your code here

2.3 Removing rows with missing data

At this point, you've hopefully noticed that some of the rows seem to be missing data values, as indicated by the presence of NaN values. Since we don't necessarily know what to replace these values with, let's play it safe and remove all of the rows that have NaN in any of the column entries. This should help ensure that we don't end up with errors or confusing results when we try to classify the data.

Task 2.3 (1 point): Remove all of the rows that contain a NaN in any column.
Make sure you actually store this new version of your dataframe, either in the original variable name or in a new variable name. If everything went as intended, you should find that you have 334 rows left over.

In [ ]: # Put your code here

2.4 Separating the "features" from the "labels"

As we've seen when working with sklearn, it can be much easier to work with the data if we have separate variables that store the features and the labels.

Task 2.4 (1 point): Split your DataFrame so that you have two separate DataFrames, one called features, which contains all of the penguin features, and one called labels, which contains all of the new penguin integer labels you just created.

In [ ]: # Put your code here

Question 2.1 (1 point): How balanced is your set of penguin classes? Does it matter for the set of classes to be balanced? Why or why not? (You might need to write a bit of code to figure out how balanced your set of penguin classes is.)

✎ Erase this and put your answer here.

2.5 Dropping the non-numeric features

The last thing we should probably do before you move on to building your classifier model is to drop the two categorical (i.e. non-numeric) features from our set of features to avoid confusing or complicating the model.

Task 2.5 (1 point): Drop the two non-numeric columns from your new features dataframe. You should end up with your final four features, which should all have floating point values. Display your new features dataframe to make sure this is true.

In [ ]: # Put your code here

STOP: Pause to commit your changes to your Git repository!
Take a moment to save your notebook, commit the changes to your Git repository using the commit message "Committing Part 2", and push the changes to GitHub.

3. Building an SVC model (4 points)

Now, to tackle this classification problem, we will use a support vector machine just like we've done previously (e.g. in the Day 19 and Day 20 assignments). Of course, we could easily replace this with any sklearn classifier we choose, but for now we will just use an SVC with a linear kernel.

3.1 Splitting the data

But first, we need to split our data into training and testing data!

Task 3.1 (1 point): Split your data into a training and testing set, with the training set representing 75% of your data. For reproducibility, set the random_state argument to 314159. Print the lengths to show you have the right number of entries.
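The preparation steps from Part 2 and the split from Task 3.1 can be sketched end to end as follows. This is one possible approach, not the official solution: it runs on a small synthetic table rather than the real file, the column names and all values are assumptions for illustration, and `labels` is kept as a pandas Series (a one-column DataFrame would also work with sklearn).

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the penguins table. Column names and values are
# assumptions for illustration only, not the real data.
rng = np.random.default_rng(0)
n = 42
df = pd.DataFrame({
    "species": rng.choice(["Adelie", "Chinstrap", "Gentoo"], size=n),
    "island": rng.choice(["Biscoe", "Dream", "Torgersen"], size=n),
    "culmen_length_mm": rng.normal(44, 5, size=n),
    "culmen_depth_mm": rng.normal(17, 2, size=n),
    "flipper_length_mm": rng.normal(200, 14, size=n),
    "body_mass_g": rng.normal(4200, 800, size=n),
    "sex": rng.choice(["MALE", "FEMALE"], size=n),
})
df.loc[[3, 17], "culmen_length_mm"] = np.nan  # inject two rows with missing data

# Task 2.1: rename the target column to "class".
df = df.rename(columns={"species": "class"})

# Task 2.2: map the string labels to integers.
df["class"] = df["class"].map({"Adelie": 0, "Chinstrap": 1, "Gentoo": 2})

# Task 2.3: drop every row containing a NaN and store the result.
df = df.dropna()

# Task 2.4: separate the labels from the features.
labels = df["class"]
features = df.drop(columns="class")

# Question 2.1: count the rows per class to judge the balance.
print(labels.value_counts())

# Task 2.5: keep only the numeric feature columns.
features = features.select_dtypes(include="number")

# Task 3.1: 75% training split with the random_state given in the assignment.
train_X, test_X, train_y, test_y = train_test_split(
    features, labels, train_size=0.75, random_state=314159
)
print(len(train_X), len(test_X), len(train_y), len(test_y))
```

On the real data, the same calls would apply after reading the CSV; `select_dtypes(include="number")` is used here in place of naming the two categorical columns explicitly, which is an equally valid choice.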
Answered 5 days after Nov 22, 2021

Sathishkumar answered on Nov 27 2021
134 Votes