I need to write 3 programs for a lab. Python Lab Question 1: The spam datafile contains 4601 emails,...

Question

I need to write 3 programs for a lab.

Python Lab

Question 1

The spam datafile contains 4601 emails, 1813 of which are spam. The file has 57 features that include indicators for the presence of 54 keywords (e.g. free, deal, ! etc.), counts for capitalized characters etc., and a numeric spam variable for whether each email is tagged as spam by a human reader (the spam column is 1 for spam, 0 for important emails). Use the CSV attachment named Spam. You must predict the probability that a message is spam or not.

Requirements

1) Partition the data into a training set (with 70% of the observations) and a testing set (with 30% of the observations), using the random state of 12345 for cross validation. (10 points)
2) On the partitioned data, build the best KNN model. Show the accuracy numbers. (Hint: What is the best value of k? How do you decide the ‘best k’?) (15 points)
3) On the partitioned data, build the best logistic regression model. Show the accuracy numbers. (15 points)
4) Based on the results of k-nearest neighbor and logistic regression, what is the best model to classify the data? Provide an explanation to support your argument.

Question 2

The Scikit-learn library has several built-in datasets. In this exercise, we will use the Diabetes dataset.

First, import the datasets module of the scikit-learn library, then call the load_diabetes() function to load the dataset into a variable that we name diabetes. This dataset contains physiological data for 442 patients and, as the corresponding target, an indicator of the disease progression after a year. The physiological data occupy the first 10 columns, respectively:

• Age
• Sex
• Body Mass Index
• Blood Pressure
• S1, S2, S3, S4, S5, S6 (six blood serum measurements)

These measurements can be obtained by calling the data attribute. For example, we look at the 10 values for the first patient.
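The loading step described above might look like the following minimal sketch. It assumes scikit-learn is installed; the variable name first_patient is only illustrative. Note that load_diabetes() returns mean-centered, scaled features by default, so the printed values are not raw ages or blood pressures.

```python
from sklearn import datasets

# Load the built-in Diabetes dataset into a variable named diabetes.
diabetes = datasets.load_diabetes()

# The data attribute holds one row per patient and one column per measurement.
print(diabetes.data.shape)

# The 10 physiological values for the first patient
# (first_patient is an illustrative name, not part of the assignment).
first_patient = diabetes.data[0]
print(first_patient)
```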
As for the indicators of the progress of the disease, that is, the values that must correspond to the results of your predictions, these are obtainable by means of the target attribute.

Partition the 442 patients into a training set (composed of the first 422 patients) and a test set (the last 20 patients). (10 points)

Once the model is trained (let’s say using sklearn), you can get the ten coefficients calculated for each physiological variable using the coef_ attribute of the predictive model. How to do this is up to you. I am only showing you the sample result of this:

If you apply the test set to the linear regression prediction model, you will get a series of targets to be compared with the values actually observed.

A good indicator of how close the prediction is to perfect is the variance score: the closer it is to 1, the better the prediction. Since 0.58 is not close to 1, we will examine a single physiological factor at a time. For example, you can start from the age:

Draw the linear correlation between the ages of patients and the disease progression in the form of a scatterplot and the associated line. Something like:

Since we have 10 physiological factors within the diabetes dataset, to get a more complete picture of the whole training set you can make a linear regression for every physiological feature, creating 10 models and seeing the result for each of them through a linear chart (Hint: Use a For loop). (20 points)

Which physiological factors from the above calculation show correlation with the target? Explain your reasoning. If you combine the physiological factors that you deem influential on the target, what model do you get? Show graph and associated numbers.

Question 3

A person is playing a guessing game in which they have 3 guesses to figure out the computer’s secret number, which will be between 1 and 20 inclusive (use randint to generate the number).
• If they guess the number correctly on the first guess, the program should stop making them guess and they should get 10 points.
• If they guess the number correctly on the second guess, the program should stop making them guess and they should get 5 points.
• If they guess the number correctly on the third guess, they should get 1 point.

After an incorrect guess, tell the user if their guess was too high or too low.

If they fail to get the number correct after 3 guesses, they get 0 points. Be sure to tell them what score they got, and show the computer’s secret number.

Ask the user if they want to continue playing the game. If they choose to play again, start at the beginning.
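For Question 1, one possible workflow is sketched below. It is not a finished answer: the Spam CSV is not available here, so make_classification generates stand-in data (an assumption), and the candidate k values, the 5-fold cross-validation, and the feature scaling are illustrative choices rather than requirements from the lab. With the real file, X would hold the 57 feature columns and y the spam column.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Stand-in data (assumption): replace with the features and the spam
# column read from the Spam CSV.
X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

# 70% training / 30% testing split with the random state from the lab.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=12345
)

# KNN: pick k by 5-fold cross-validation on the training set only,
# so the test set is not used to choose the model.
knn = make_pipeline(StandardScaler(), KNeighborsClassifier())
search = GridSearchCV(
    knn,
    {"kneighborsclassifier__n_neighbors": list(range(1, 22, 2))},
    cv=5,
    scoring="accuracy",
)
search.fit(X_train, y_train)
best_k = search.best_params_["kneighborsclassifier__n_neighbors"]
knn_accuracy = search.score(X_test, y_test)

# Logistic regression on the same split.
logreg = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
logreg.fit(X_train, y_train)
logreg_accuracy = logreg.score(X_test, y_test)

# Predicted probability that each test message belongs to class 1 (spam).
spam_probability = logreg.predict_proba(X_test)[:, 1]

print(f"best k: {best_k}, KNN test accuracy: {knn_accuracy:.3f}")
print(f"logistic regression test accuracy: {logreg_accuracy:.3f}")
```

Choosing k by cross-validation on the training set is one reasonable way to decide the "best k"; comparing the two test accuracies afterwards gives the numbers asked for in requirement 4.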
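For Question 2, the split and the per-feature loop could be sketched as follows. This assumes scikit-learn's LinearRegression is the model being used and leaves the scatterplots out; the dictionary single_scores is an illustrative name. The score method here returns the coefficient of determination, which is the value the lab text compares against 1.

```python
from sklearn import datasets
from sklearn.linear_model import LinearRegression

diabetes = datasets.load_diabetes()

# First 422 patients for training, last 20 for testing.
X_train, X_test = diabetes.data[:-20], diabetes.data[-20:]
y_train, y_test = diabetes.target[:-20], diabetes.target[-20:]

# One model on all 10 physiological features.
model = LinearRegression().fit(X_train, y_train)
print(model.coef_)                  # the ten coefficients
print(model.score(X_test, y_test))  # score on the 20 test patients

# One model per feature (the "for loop" hint); plotting is omitted here.
single_scores = {}
for i, name in enumerate(diabetes.feature_names):
    single = LinearRegression().fit(X_train[:, [i]], y_train)
    single_scores[name] = single.score(X_test[:, [i]], y_test)

# Features listed from highest to lowest test score.
for name, score in sorted(single_scores.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name}: {score:.3f}")
```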
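For Question 3, the rules above can be sketched as below. The function names play_round and main and the read_guess/show parameters are hypothetical; passing the input and output functions in just makes a round easy to check without typing. Non-numeric input is not handled in this sketch.

```python
from random import randint

# Points awarded for a correct guess on attempt 1, 2 and 3.
POINTS = {1: 10, 2: 5, 3: 1}


def play_round(secret, read_guess=input, show=print):
    """Play one round of up to 3 guesses and return the score."""
    score = 0
    for attempt in range(1, 4):
        guess = int(read_guess(f"Guess {attempt} of 3 (1-20): "))
        if guess == secret:
            score = POINTS[attempt]
            show("Correct!")
            break  # stop asking once the number is found
        show("Too high." if guess > secret else "Too low.")
    show(f"Your score: {score}. The secret number was {secret}.")
    return score


def main():
    """Keep playing rounds until the user declines."""
    while True:
        play_round(randint(1, 20))  # randint includes both 1 and 20
        if input("Play again? (y/n): ").strip().lower() != "y":
            break

# Call main() to play interactively.
```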
