Only need question 1e and question 3i
hw12

August 6, 2021

[1]: # Initialize Otter
     import otter
     grader = otter.Notebook("hw12.ipynb")

1 Homework 12: Principal Component Analysis

In lecture we discussed how PCA can be used for dimensionality reduction. Specifically, given a high-dimensional dataset, PCA allows us to:

1. Understand the rank of the data. If k principal components capture almost all of the variance, then the data is roughly rank k.
2. Create 2D scatterplots of the data. Such plots are a rank-2 representation of our data, and allow us to visually identify clusters of similar observations.

A solid geometric understanding of PCA will help you understand why PCA is able to do these two things. In this homework, we'll build that geometric intuition and look at PCA on two datasets: one where PCA works poorly, and the other where it works pretty well.

1.1 Due Date

This assignment is due Monday, August 9th at 11:59 PM PDT.

Collaboration Policy: Data science is a collaborative activity. While you may talk with others about the homework, we ask that you write your solutions individually. If you do discuss the assignments with others, please include their names in the cell below.

Collaborators: …

1.2 Score Breakdown

Question       Points
Question 1a    1
Question 1b    1
Question 1c    1
Question 1d    1
Question 1e    1
Question 2a    2
Question 2b    1
Question 2c    1
Question 2d    3
Question 2e    2
Question 3a    1
Question 3b    1
Question 3c    1
Question 3d    2
Question 3e    2
Question 3f    2
Question 3g    1
Question 3h    2
Question 3i    2
Total          28

[2]: import pandas as pd
     import numpy as np
     import seaborn as sns
     import matplotlib.pyplot as plt
     %matplotlib inline
     import plotly.express as px

     # Note: If you're having problems with the 3D scatter plots, uncomment the two
     # lines below; you should see a version number that is at least 4.1.1.
     # import plotly
     # plotly.__version__

1.3 Question 1: PCA on 3D Data

In Question 1, our goal is to see visually how PCA is simply the process of rotating the coordinate axes of our data.

The code below reads in a 3D dataset. We have named the DataFrame surfboard because the data resembles a surfboard when plotted in 3D space.

[3]: surfboard = pd.read_csv("data3d.csv")
     surfboard.head(5)

[3]:          x         y         z
     0  0.005605  2.298191  1.746604
     1 -1.093255  2.457522  0.170309
     2  0.060946  0.473669 -0.003543
     3 -1.761945  2.151108  3.132426
     4  1.950637 -0.194469 -2.101949

The cell below will allow you to view the data as a 3D scatterplot. Rotate the data around and zoom in and out using your trackpad or the controls at the top right of the figure.

You should see that the data is an ellipsoid that looks roughly like a surfboard or a hashbrown patty (https://www.google.com/search?q=hashbrown+patty&source=lnms&tbm=isch). That is, it is pretty long in one direction, pretty wide in another direction, and relatively thin along its third dimension. We can think of these as the "length", "width", and "thickness" of the surfboard data.

Observe that the surfboard is not aligned with the x/y/z axes.

If you get an error that your browser does not support webgl, you may need to restart your kernel and/or browser.

[4]: fig = px.scatter_3d(surfboard, x='x', y='y', z='z',
                         range_x=[-10, 10], range_y=[-10, 10], range_z=[-10, 10])
     fig.show()
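If the interactive plotly figure does not render even after restarting (for example, because webgl is unavailable), a static matplotlib 3D scatter is one possible fallback. This is just a convenience sketch, not part of the assignment; it only assumes the surfboard DataFrame loaded above.

     # Optional fallback: a static 3D scatter of the same data using matplotlib.
     from mpl_toolkits.mplot3d import Axes3D  # registers the '3d' projection on older matplotlib versions
     fig = plt.figure()
     ax = fig.add_subplot(111, projection='3d')
     ax.scatter(surfboard['x'], surfboard['y'], surfboard['z'], s=5)
     ax.set_xlabel('x'); ax.set_ylabel('y'); ax.set_zlabel('z')
     plt.show()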
To give the figure a little more visual pop, the following cell makes the same plot, but also assigns a pre-determined color value (that we've arbitrarily chosen) to each point. These colors do not mean anything important; they're simply there as a visual aid. You might find it useful to use colorize_surfboard_data later in this assignment.

[5]: def colorize_surfboard_data(df):
         colors = pd.read_csv("surfboard_colors.csv", header=None).values
         df_copy = df.copy()
         df_copy.insert(loc=3, column="color", value=colors)
         return df_copy

     fig = px.scatter_3d(colorize_surfboard_data(surfboard), x='x', y='y', z='z',
                         range_x=[-10, 10], range_y=[-10, 10], range_z=[-10, 10],
                         color="color", color_continuous_scale='RdBu')
     fig.show()

1.4 Question 1a

Now that we've understood the data, let's work on understanding what PCA will do when applied to this data.

To properly perform PCA, we will first need to "center" the data so that the mean of each feature is 0.

Compute the columnwise mean of surfboard in the cell below, and store the result in surfboard_mean. You can choose to make surfboard_mean a numpy array or a series, whichever is more convenient for you. Regardless of what data type you use, surfboard_mean should have 3 means, 1 for each attribute, with the x coordinate first, then y, then z.

Then, subtract surfboard_mean from surfboard, and save the result in surfboard_centered. The order of the columns in surfboard_centered should be x, then y, then z.

[6]: surfboard_mean = np.mean(surfboard, axis=0)
     surfboard_centered = surfboard - surfboard_mean

[7]: grader.check("q1a")

[7]: q1a results: All test cases passed!

1.5 Question 1b

As you may recall from lecture, PCA is a specific application of the singular value decomposition (SVD) for matrices. If we have a data matrix $X$, we can decompose it into $U$, $\Sigma$, and $V^T$ such that $X = U \Sigma V^T$.

In the following cell, use the np.linalg.svd function (https://docs.scipy.org/doc/numpy/reference/generated/numpy.linalg.svd.html) to compute the SVD of surfboard_centered. Store the $U$, $\Sigma$, and $V^T$ matrices in u, s, and vt respectively. This is one line of simple code, exactly like what we saw in lecture.

Hint: Set the full_matrices argument of np.linalg.svd to False.

[8]: u, s, vt = np.linalg.svd(surfboard_centered, full_matrices=False)
     u, s, vt

[8]: (array([[-0.02551985, -0.02108339, -0.03408865],
             [-0.02103979, -0.0259219 ,  0.05432967],
             [-0.00283413, -0.00809889,  0.00204459],
             ...,
             [ 0.01536972, -0.00483066,  0.05673824],
             [-0.00917593,  0.0345672 ,  0.03491181],
             [-0.01701236,  0.02743128, -0.01966704]]),
      array([103.76854043,  40.38357469,  21.04757518]),
      array([[ 0.38544534, -0.67267377, -0.63161847],
             [-0.5457216 , -0.7181477 ,  0.43180066],
             [-0.74405633,  0.17825229, -0.64389929]]))

[9]: grader.check("q1b")

[9]: q1b results: All test cases passed!
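As an optional sanity check (not required by the assignment), we can confirm numerically that the decomposition behaves as expected: multiplying the factors back together should recover the centered data, and the rows of vt (the columns of $V$) should be orthonormal. A minimal sketch, assuming the u, s, and vt computed above:

     # Optional sanity check on the SVD (not part of the assignment).
     # Reconstructing the centered data from its factors:
     reconstruction = u @ np.diag(s) @ vt
     print(np.allclose(reconstruction, surfboard_centered))   # expect True

     # The rows of vt (columns of V) should be orthonormal:
     print(np.allclose(vt @ vt.T, np.eye(3)))                 # expect True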
1.6 Question 1c: Total Variance

Let's now consider the relationship between the singular values s and the variance of our data. Recall that the total variance is the sum of the variances of each column of our data.

Below, we provide code that computes the variance for each column of the data.

Note: The variances are the same for both surfboard_centered and surfboard, so we show only one to avoid redundancy.

[10]: np.var(surfboard, axis=0)

[10]: x    2.330704
      y    5.727527
      z    4.783513
      dtype: float64

The total variance of our dataset is given by the sum of these numbers.

[11]: total_variance_computed_from_data = sum(np.var(surfboard, axis=0))
      total_variance_computed_from_data

[11]: 12.841743509780109

As discussed in lecture, the total variance of the data is also equal to the sum of the squares of the singular values divided by the number of data points, that is:

$$\mathrm{Var}(X) = \frac{\sum_{i=1}^{d} \sigma_i^2}{N}$$

where $\sigma_i$ is the singular value corresponding to the $i$th principal component, $N$ is the total number of data points, and $\mathrm{Var}(X)$ is the total variance of the data.

In the cell below, compute the total variance using the formula above and store the result in the variable total_variance_computed_from_singular_values. Your result should be very close to total_variance_computed_from_data.

[12]: total_variance_computed_from_singular_values = np.sum(s**2) / surfboard.shape[0]
      total_variance_computed_from_singular_values

[12]: 12.841743509780104

[13]: grader.check("q1c")

[13]: q1c results: All test cases passed!

1.7 Question 1d: Explained Variance and Scree Plots

In the cell below, set variance_explained_by_1st_pc to the proportion of the total variance explained by the 1st principal component. Your answer should be a number between 0 and 1.

Note: This topic was discussed in this section of the PCA lecture slides (https://docs.google.com/presentation/d/1zpawVI7o2cYA_C_kSQLBjOMrFkSwMDk23JcedzrzttA/edit#slide=id.ge684cfc9d0_2_98).

[14]: variance_explained_by_1st_pc = (s[0]**2 / surfboard.shape[0]) / total_variance_computed_from_data
      variance_explained_by_1st_pc

[14]: 0.8385084140449129

[15]: grader.check("q1d")

[15]: q1d results: All test cases passed!

We can also create a scree plot that shows the proportion of variance explained by all of our principal components, ordered from most to least. An example scree plot is given below. Note that the variance explained by the first principal component matches the value we calculated above for variance_explained_by_1st_pc.

Note: If you're wondering where len(surfboard_centered) went, it got canceled out when we divided the variance of a given PC by the total variance.

[16]: plt.plot([1, 2, 3], s**2 / sum(s**2));
      plt.xticks([1, 2, 3], [1, 2, 3]);
      plt.xlabel('PC #');
      plt.ylabel('Fraction of Variance Explained');
      plt.title('Fraction of Variance Explained by each Principal Component')

[16]: Text(0.5, 1.0, 'Fraction of Variance Explained by each Principal Component')

For this small toy problem, the scree plot is not particularly useful. We'll see why scree plots are useful in practice later in this homework.

1.8 Question 1e: V as a Rotation Matrix

In lecture, we saw that the first column of $XV$ contained the first principal component values for each observation, the second column of $XV$ contained the second principal component values for each observation, and so forth.

Let's give this matrix a name: $P = XV$ is sometimes known as the "principal component matrix".

Compute the $P$ matrix for the surfboard dataset and store it in the variable surfboard_pcs.

[17]: surfboard_pcs = u @ np.diag(s)

[18]: grader.check("q1e")

[18]: q1e results:
      Trying:
          all(np.isclose(surfboard_pcs.loc[0], [-2.648, -0.851, -0.717], atol=1e-3))
      Expecting:
          True
      **********************************************************************
      Line 1, in q1e 2
      Failed example:
          all(np.isclose(surfboard_pcs.loc[0], [-2.648, -0.851, -0.717], atol=1e-3))
      Exception raised:
          Traceback (most recent call last):
            File "/opt/conda/lib/python3.8/doctest.py", line 1336, in __run
              exec(compile(example.source, filename, "single",
            File "<doctest q1e 2>", line 1, in <module>
              all(np.isclose(surfboard_pcs.loc[0], [-2.648, -0.851, -0.717], atol=1e-3))
          AttributeError: 'numpy.ndarray' object has no attribute 'loc'
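The failing check above indexes surfboard_pcs with .loc, which suggests the autograder expects a pandas DataFrame, whereas u @ np.diag(s) returns a plain NumPy array (the values themselves already match the expected first row). One possible fix, sketched below, is to compute $P = XV$ directly on the centered DataFrame, which keeps DataFrame indexing while being numerically equivalent to $U\Sigma$:

      # Possible fix for q1e (a sketch): compute P = XV on the centered DataFrame,
      # so surfboard_pcs supports DataFrame indexing like .loc[0].
      surfboard_pcs = surfboard_centered @ vt.T

      # u @ np.diag(s) yields the same values, but as a NumPy array with no .loc,
      # which is why the doctest above raised an AttributeError.

After redefining surfboard_pcs this way, re-running grader.check("q1e") should exercise the same check against a DataFrame.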
", line 1, in all(np.isclose(surfboard_pcs.loc[0], [-2.648, -0.851, -0.717], atol=1e-3)) AttributeError: 'numpy.ndarray' object has no attribute 'loc' 1.9 Visualizing the Principal Component Matrix In some sense, we can think of P as an output of the PCA procedure. It is simply a rotation of the data such that the data will now appear “axis aligned”. Specifically, for a 3d dataset, if we plot PC1, PC2, and PC3 along the x, y, and z axes of our plot, then the greatest amount of variation happens along the x-axis, the second greatest amount along the y-axis, and the smallest amount along the z-axis. To visualize this, run the cell below, which will show our data now projected onto the principal component space. Compare with your