I need to write 3 programs for a lab.
Python Lab

Question 1

The spam data file contains 4601 emails, 1813 of which are spam. The file has 57 features that include indicators for the presence of 54 keywords (e.g. free, deal, !), counts of capitalized characters, etc., and a numeric spam variable for whether each email was tagged as spam by a human reader (the spam column is 1 for spam, 0 for important emails). Use the CSV attachment named Spam. You must predict the probability that a message is spam or not.

Requirements
1) Partition the data into a training set (with 70% of the observations) and a testing set (with 30% of the observations), using a random state of 12345 for cross validation. (10 points)
2) On the partitioned data, build the best KNN model. Show the accuracy numbers. (Hint: What is the best value of k? How do you decide the ‘best k’?) (15 points)
3) On the partitioned data, build the best logistic regression model. Show the accuracy numbers. (15 points)
4) Based on the results of k-nearest neighbor and logistic regression, what is the best model to classify the data? Provide an explanation to support your argument.

Question 2

The Scikit-learn library has several built-in datasets. In this exercise, we will use the Diabetes dataset. First, import the datasets module of the scikit-learn library, then call the load_diabetes() function to load the dataset into a variable named diabetes. This dataset contains physiological data for 442 patients and, as the corresponding target, an indicator of the disease progression after a year. The physiological data occupies the first 10 columns, respectively:
• Age
• Sex
• Body Mass Index
• Blood Pressure
• S1, S2, S3, S4, S5, S6 (six blood serum measurements)

These measurements can be obtained by calling the data attribute. For example, we look at the 10 values for the first patient.
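The loading step described above can be sketched as follows. This is only a minimal illustration, assuming scikit-learn is installed; the variable name first_patient is my own and not part of the assignment. Note that scikit-learn returns these ten features mean-centered and scaled by default, so the printed values will not look like raw ages or blood pressures.

```python
from sklearn import datasets

# Load the built-in Diabetes dataset (bundled with scikit-learn).
diabetes = datasets.load_diabetes()

# The data attribute holds one row per patient and one column per feature.
print(diabetes.data.shape)

# The 10 physiological values for the first patient
# (first_patient is an illustrative name, not from the assignment).
first_patient = diabetes.data[0]
print(first_patient)

# The matching disease-progression value comes from the target attribute.
print(diabetes.target[0])
```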
As for the indicators of the progress of the disease, that is, the values that your predictions must match, these are available through the target attribute.

Partition the 442 patients into a training set (composed of the first 422 patients) and a test set (the last 20 patients). (10 points)

Once the model is trained (let’s say using sklearn), you can get the ten coefficients calculated for each physiological variable using the coef_ attribute of the predictive model. How you do this is up to you. I am only showing you the sample result of this:

If you apply the test set to the linear regression prediction model, you will get a series of targets to be compared with the values actually observed. A good indicator of how close the prediction is to perfect is the variance score (R²): the closer it is to 1, the better the prediction. Since 0.58 is not close to 1, we will examine a single physiological factor at a time. For example, you can start with age: draw the linear correlation between the ages of the patients and the disease progression in the form of a scatterplot and the associated line. Something like:

Since there are 10 physiological factors in the diabetes dataset, you can get a more complete picture of the whole training set by making a linear regression for every physiological feature, creating 10 models and viewing the result for each of them in a linear chart (Hint: Use a for loop). (20 points)

Which physiological factors from the above calculation show correlation with the target? Explain your reasoning. If you combine the physiological factors that you deem influential on the target, what model do you get? Show the graph and associated numbers.

Question 3

A person is playing a guessing game in which they have 3 guesses to figure out the computer’s secret number, which will be between 1 and 20 inclusive (use randint to generate the number).
• If they guess the number correctly on the first guess, the program should stop making them guess and they should get 10 points.
• If they guess the number correctly on the second guess, the program should stop making them guess and they should get 5 points.
• If they guess the number correctly on the third guess, they should get 1 point.

After an incorrect guess, tell the user if their guess was too high or too low. If they fail to get the number correct after 3 guesses, they get 0 points. Be sure to tell them what score they got, and show the computer’s secret number. Ask the user if they want to continue playing the game. If they choose to play again, start at the beginning.
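For the guessing game, here is one possible sketch of the rules listed above. The function names (play_round, ask_for_guess, main), the POINTS table, and the prompt wording are all my own choices rather than anything the assignment prescribes. The scoring logic is kept separate from input() so it can be checked without typing anything.

```python
from random import randint

# Points per guess number, taken from the rules: 1st -> 10, 2nd -> 5, 3rd -> 1.
POINTS = {1: 10, 2: 5, 3: 1}


def play_round(secret, read_guess, say=print):
    """Play one round of up to 3 guesses and return the score (10, 5, 1 or 0)."""
    score = 0
    for attempt in (1, 2, 3):
        guess = read_guess(attempt)
        if guess == secret:
            score = POINTS[attempt]
            break  # stop asking as soon as the number is found
        say("Too high." if guess > secret else "Too low.")
    say(f"Your score: {score}. The secret number was {secret}.")
    return score


def ask_for_guess(attempt):
    """Read one guess from the keyboard (no validation of non-numeric input)."""
    return int(input(f"Guess #{attempt} (1-20): "))


def main():
    """Keep starting new rounds until the user declines to play again."""
    while True:
        play_round(randint(1, 20), ask_for_guess)
        if input("Play again? (y/n): ").strip().lower() != "y":
            break
```

Calling main() starts the interactive game. Typing something that is not a whole number will raise a ValueError in this sketch; handling that is left out to keep it short.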