Python Homework Questions (partially SQL)

1. The following code is relevant to this question:

Question 1a: What is the overall purpose of this code? What does this code hope to predict? On what factors does it base its predictions (i.e., what are the independent/predictor variables of the model)?

Question 1b: What is the purpose of lines 31 through 36? Why do we apply such code to passenger class and sex, but not to other variables such as age and fare paid?

Question 1c: What does line 38 do? (Be specific.)

Question 1d: What is the purpose of lines 39 and 40? What are we trying to avoid by using lines 39 and 40?

2. The following code is relevant to this question:

Question 2a: Suppose lines 52, 53, and 55 produce the following results. Interpret lines 52, 53, and 55 and their output in the context of this question. Be thorough in your explanation. What does this imply about survival probabilities?

Question 2b: Suppose we next run the following pieces of code:
Interpret this output in the context of this question.

Your Answer:

Question 3: A data analyst performs a linear regression analysis on a data set. Suppose the analysis computes an r² value on a randomly selected subset of the data that is used as a training set. The analysis also computes an r² value on the remainder of the data set, which is used as a testing set. The r² value on the training data set was 0.7813. The r² value on the testing data set was 0.7754. Assume that both the training data set and the testing data set have large sample sizes (i.e., there is no problem with the samples being too small).
(a) What should the analyst conclude about the predictive ability of the model?
(b) If the r² value on the testing data set was 0.082 (instead of 0.7754), what should the analyst conclude?

Your Answer:

Question 4: Suppose there is a Pandas data frame called weeklyEmployeeInfo. A column of the data frame titled “Base Hourly” lists the hourly base pay of each employee (e.g., an entry of 12 in this column means $12 per hour).
A column titled “HoursWorked” lists the total number of hours worked by the employee over the past week. An employee is paid their hourly base pay for up to the first 40 hours of work. Any work beyond the first 40 hours is paid at a rate of 1.5 times the hourly base pay. Write code using NumPy and/or Pandas functions to compute the total of all wages to be paid to the employees this week.

Your Answer:

Question 5: Define a Python function that converts Celsius temperatures to Fahrenheit. Then use the function to convert each of the temperatures -40.0C, -39.0C, -38.0C, …, 49.0C, 50.0C. Try to be efficient in the code you write. Print the converted temperatures to the console window. (The formula for converting Celsius (C) to Fahrenheit (F) is F = 1.8 * C + 32.)

Your Answer:

Question 6: Suppose you have a SQL database with a table called Customers. The first few records of the table are shown below.

Part a: How can you find all information about customers from the city of Berlin using SQL code?

Part b: What set of SQL commands should you use to get a count of the number of customer IDs for each country, ordered from most customer IDs to least? The desired output is similar in structure to the output below, which indicates that 13 customer IDs list USA as the country in the corresponding record, 11 customer IDs list France as the country in the corresponding record, etc.

Part c: Same as Part b, except you should return only the countries that rank 6th to 10th in terms of most customer IDs.

Your Answer:

Question 7: Suppose an online retail store offers thousands of products. Your manager wants you to find out how many of the products it offers are produced by Vizio. Unfortunately, when using the retailer’s search function, you discover that it displays only up to five search results at a time. However, you notice that the URLs for all the products offered by the retailer have the following format:
http://website.com/products?productID=00001
http://website.com/products?productID=00002
…
http://website.com/products?productID=16721

The last valid productID appears to be 16721. You want to count how many of these product pages contain the brand name “Vizio” somewhere on the page. First, describe your general strategy for scraping this data, including which Python packages you would use. Second, write code to accomplish this task.

Your Answer:

Question 8:

Question 8a: Suppose the star in the center of the figure is a point you’d like to classify as part of either Class A or Class B. If k=3, how would you classify the star point? If k=6, how would you classify the star point? (You may assume the data has already been scaled.)

Question 8b: Why is it usually important to scale data in the k-nearest neighbors algorithm?
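
Illustration for Question 8b: the following is a minimal sketch, not a model answer, of how feature scale can change which neighbor is nearest. All data values, the feature names (age, income), and the helper functions nearest_label and standardize are made up for illustration; it uses only the Python standard library and k=1 to keep the example small.

```python
import math
import statistics

# Hypothetical training data: (age in years, income in dollars) with a class label.
train = [
    ((25.0, 50_000.0), "A"),
    ((27.0, 52_000.0), "A"),
    ((24.0, 48_000.0), "A"),
    ((60.0, 50_500.0), "B"),
    ((62.0, 51_500.0), "B"),
    ((58.0, 49_000.0), "B"),
]
query = (26.0, 50_600.0)  # hypothetical point to classify


def nearest_label(points, target):
    """Return the label of the training point closest to target (Euclidean distance)."""
    return min(points, key=lambda item: math.dist(item[0], target))[1]


def standardize(value, means, stdevs):
    """Rescale each feature to zero mean and unit standard deviation."""
    return tuple((v - m) / s for v, m, s in zip(value, means, stdevs))


# Column-wise mean and (population) standard deviation of the training features.
columns = list(zip(*(features for features, _ in train)))
means = [statistics.mean(col) for col in columns]
stdevs = [statistics.pstdev(col) for col in columns]

scaled_train = [(standardize(f, means, stdevs), label) for f, label in train]
scaled_query = standardize(query, means, stdevs)

raw_result = nearest_label(train, query)
scaled_result = nearest_label(scaled_train, scaled_query)

print("1-NN on raw features:   ", raw_result)
print("1-NN on scaled features:", scaled_result)
```

With these made-up numbers, the raw distances are driven almost entirely by the income column, so the query is matched to a Class B point despite being close in age to the Class A points. After standardization both features contribute comparably and the nearest point is in Class A.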