Solve assignment 1 (Data normalisation in Python)
Assignment 1
June 30, 2021

Individual Assignment 1
Due at 9am on Saturday, July 10
Total: 10 points

• Instructions: Open a new notebook file and write code to answer the following questions. Use a new code block for each question, and begin each code block with a comment indicating the question number (e.g. # 1., or # 2.1). Each code block you write must be executed to show its results.
• How to Turn In:
  – Make sure that all the code blocks are executed.
  – Within Jupyter notebook, go to File -> Download as -> HTML (.html).
  – Go to your download folder and find the .html file. Open it to check that all your results are there.
  – Upload the .html file to Canvas.

1 Normalizing and Rescaling Data

The range of values of raw data varies widely. For example, if the heights of baseball players are measured in inches, the values should be around 75. Meanwhile, if we also have the annual income (measured in dollars) of each baseball player, those values are in the millions. Yet many statistical models compute the distance between two points (e.g. two baseball players) by the Euclidean distance. If one of the dimensions has a broad range of values, the distance will be governed by this particular dimension. Therefore, the range of all features should be normalized so that each dimension contributes approximately proportionately to the final distance.

Normalizing and rescaling data is a very important pre-processing step for raw data in data analytics. To this end, this question asks you to standardize the heights based on the following equation.

$$\mathrm{height\_s} = \frac{\mathrm{height} - \mathrm{mean}}{\mathrm{std}}$$

You need to:
1. Find the mean and standard deviation of the heights across all the baseball players. The mean and standard deviation of a numpy array can be computed by np.mean() and np.std(), respectively.
2. Subtract the mean from each height.
3. Divide the values from the second step (with the mean already subtracted) by the standard deviation.
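The three steps above can be sketched with NumPy as follows. This is a minimal illustration on a made-up array, not the assignment's data; `values` and `values_s` are hypothetical names. Note that `np.std()` computes the population standard deviation (`ddof=0`) by default.

```python
import numpy as np

# Hypothetical sample data, not the heights from the assignment.
values = np.array([1.0, 2.0, 3.0, 4.0, 5.0])

# Step 1: mean and standard deviation (np.std defaults to ddof=0).
mean = np.mean(values)
std = np.std(values)

# Steps 2 and 3: subtract the mean, then divide by the standard deviation.
values_s = (values - mean) / std

print(values_s)
```

A quick sanity check is that the standardized array has mean 0 and standard deviation 1.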
The original height values are given as height = [72, 78, 69, 71, 76, 79]. Store the resulting numpy array after standardization as height_s and print it out. Write your code and then execute it to show your results. NOTE: your output should include only the print-out of height_s; additional output will result in a deduction.

[ ]: # 1.

2 Boston Housing

The Boston Housing data contain information on census tracts in Boston for which several measurements are taken (e.g., crime rate, pupil/teacher ratio). Each row represents a town, and each column corresponds to one measurement. We are interested in how different variables affect the median value of owner-occupied homes in a tract. Hence median value is our target variable.

You need to first load the dataset and read its description from the sklearn package by executing the code provided below. After executing the next code block, answer the following questions in separate code blocks. Begin each code block with a comment indicating the question number (e.g. # 2.1.).

[ ]: from sklearn.datasets import load_boston
     boston = load_boston()
     attr = boston.data    # attributes that may affect median values
     medV = boston.target  # median value
     print(boston.feature_names)

[ ]: print( boston.DESCR )

Questions:
1. Print out the 10th row of attr.
[ ]: # 2.1
2. Print out the average number of rooms per dwelling (RM) of the 20th tract.
[ ]: # 2.2
3. Print out all observations whose median values are smaller than 6. You do not need to include the median value column in your output.
[ ]: # 2.3
4. Print out the average pupil-teacher ratio (PTRATIO) of all the towns whose median values are smaller than 20.
[ ]: # 2.4
5. Use numpy array indexing to select and print out INDUS, TAX and LSTAT of the first 5 towns.
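The selections asked for in these questions all rely on NumPy integer and boolean indexing. The sketch below shows those patterns on a small made-up array; `data`, `target`, and `names` are hypothetical stand-ins for `attr`, `medV`, and `boston.feature_names`, and the numbers are invented. It deliberately avoids `load_boston`, which was removed from scikit-learn in version 1.2, so the loading cell in this section needs an older scikit-learn release.

```python
import numpy as np

# Hypothetical stand-ins for attr, medV and boston.feature_names (invented numbers).
names = np.array(["RM", "TAX", "PTRATIO"])
data = np.array([
    [6.5, 296.0, 15.3],
    [5.9, 242.0, 17.8],
    [7.1, 222.0, 18.7],
    [6.0, 311.0, 15.2],
])
target = np.array([24.0, 18.5, 33.4, 19.9])

# Row selection: indices are zero-based, so the 2nd row is index 1.
second_row = data[1]

# Column lookup by name: find the column position, then index [row, column].
rm_col = list(names).index("RM")
rm_of_third = data[2, rm_col]

# Boolean mask: keep only the rows whose target is below a threshold.
mask = target < 20
low_rows = data[mask]

# Mean of one column over the masked rows.
pt_col = list(names).index("PTRATIO")
mean_ptratio_low = data[mask, pt_col].mean()

# Several named columns for the first 2 rows: slice the rows, then pick columns.
cols = [list(names).index(n) for n in ("RM", "PTRATIO")]
first_two = data[:2][:, cols]

print(second_row, rm_of_third, low_rows, mean_ptratio_low, first_two, sep="\n")
```

Because indexing is zero-based, the "10th row" and "20th tract" in the questions correspond to indices 9 and 19.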
[ ]: # 2.5

3 Arithmetic Computation for List Elements

Write a function my_f(a) that takes a list a as input and computes $6x^2 + 2$ for every element x in the list, so that if we execute the following code, the printed output will be [ 8 56 152]. Complete the definition of the my_f function.

[ ]: def my_f(a):
         # function definition goes here
         return

     my_f( [1,3,5] )
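One possible way to complete `my_f` is to convert the list to a NumPy array so that the arithmetic applies element-wise. This is a sketch rather than the only valid answer; it assumes that returning a NumPy array is acceptable, which is consistent with the bracketed, comma-free output shown in the question. The name `result` is introduced here for illustration.

```python
import numpy as np

def my_f(a):
    # Convert the list to an array so arithmetic is applied to every element.
    x = np.array(a)
    # Element-wise 6*x^2 + 2.
    return 6 * x**2 + 2

result = my_f([1, 3, 5])
print(result)
```

A plain Python list cannot be squared directly, which is why the conversion to an array (or a list comprehension) is needed.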