How much would this be and how long will it take
homework-1-JeffJeon-main/homework1.pdf CS 506 Spring 2021 - HW1 Clustering and Visualization Total: 29 points Package Limitations: Problem 1: Numpy is allowed; Problem 2: You may use libraries such as Sklearn that implements clustering methods; Problem 3: Suggested packages: Folium, Pandas; Problem 4: Suggested packages: K-means in CS506 python package or K-means implemented in Problem 1. 1 Understanding K-means Clustering In this exercise, you will implement the K-means clustering algorithm. You will start on an example 2D dataset that will help you gain an intuition of how the K-means algorithm works. You will be using k means clustering.py for this part of the exercise. 1.1 Implementing K-means The K-means algorithm is a method to automatically cluster similar data ex- amples together. Concretely, you are given a training set { x1, · · · , xm } where xi ∈ Rn, and want to group the data into a few cohesive clusters. The intuition behind K -means is an iterative procedure that starts by guessing the initial centroids, and then refines this guess by repeatedly assigning examples to their closest centroids and then recomputing the centroids based on the assignments. The inner loop of the algorithm repeatedly carries out two steps: • Assigning each training example x to its closest centroid • Recomputing the mean of each centroid using the points assigned to it. The K-means algorithm will always converge to some final set of means for the centroids. Note that the converged solution may not always be ideal and depends on the initial setting of the centroids. Therefore, in practice the K- means algorithm is usually run a few times with different random initializations. One way to choose between these different solutions from different random ini- tializations is to choose the one with the lowest cost function value (distortion). You will implement the two phases of the K-means algorithm separately in the next sections. 1 1.1.1 Finding Closest Centroids [1 pt.] In the cluster assignment phase of the K-means algorithm, the algorithm assigns every training example xi to its closest centroid, given the current positions of centroids. Specifically, for every example xi we set ci := arg min j ∥∥xi − µj∥∥2 where ci is the index of the centroid that is closest to xi, and j is the position (index) of the j-th centroid. Your task is to complete the code in function find closest centroids. This function takes the data matrix samples and the locations of all centroids inside centroids and should output a one-dimensional array of clusters that holds the index (a value in {1, · · · ,K}, where K is total number of centroids) of the closest centroid to every training example. You can implement this using a loop over every training example and every centroid. Once you have completed the code in find closest centroids, you can run it and you should see the output [0, 2, 1, ] corresponding to the centroid assignments for the first 3 examples. Please take a look at Figure 1 to gain an understanding of the distribution of the data. It is two dimentional, with x1 and x2. Figure 1: Result 1 1.1.2 Computing Centroid Means [2 pts.] Given assignments of every point to a centroid, the second phase of the algorithm recomputes, for each centroid, the mean of the points that were assigned to it. 2 Specifically, for every centroid k we set µk := 1 |Ck| ∑ i∈Ck xi where Ck is the set of examples that are assigned to centroid k. Concretely, if two examples say x3 and x5 are assigned to centroid k = 2, then you should update µ2 = 1 2 (x(3) + x(5)) Once you have completed the code in get centroids, the k means clustering.py will run your code and output the centroids after the first step of K-means. The final centroids should be [[1.95399466 5.02557006] [3.04367119 1.01541041] [6.03366736 3.00052511]]. When you run the next step, the K-means code will produce a visualization that steps you through the progress of the algorithm at each iteration. Close figure multiple times to see how each step of the K-means algorithm changes the centroids and cluster assignments. At the end, your figure should look as the one displayed in Figure1 1.2 Random Initialization [1 pt.] The initial assignments of centroids for the example dataset in previous section were designed so that you will see the same figure as Figure1. In practice, a good strategy for initializing the centroids is to select random examples from the training set. In this part of the exercise, you should complete the function choose random centroids. You should randomly permute the indices of the ex- amples (using random seed 7). Then, it selects the first K examples based on the random permutation of the indices. This should allow the examples to be selected at random without the risk of selecting the same example twice. You will see how random initialization will affect the first few iterations of clustering, and also possibly, result in a different cluster assignment. 2 Working with the Algorithms • In Problem 2&3 we will be working with the AirBnB dataset, that you can also find here. Our goal is to visualize areas of the NYC with respect to the price of the AirBnb listings in those areas. From the detailed nyc listings.csv file, you will use longitude and lati- tude to cluster closeness and price to cluster for expensiveness. Note that spatial coordinates and price are in different units, so you may need to consider scaling in order to avoid arbitrary skewed results. a) [8 pts.] Find clusters using the 3 different techniques we discussed in class: k-means++, hierarchical, and GMM. Explain your data 3 http://insideairbnb.com/get-the-data.html representation and how you determined certain parameters (for example, the number of clusters in k-means++). You may use libraries such as Sklearn that implements these clustering methods. Note that you don’t need to visualize the clusters here, that’s Problem 3’s task. b) [3 pts.] List a few bullet points describing the pros and cons of the various clustering algorithms. A few hints: -Some listings contain missing values. Better strategy for this problem is to completely ignore those listings. -Pay attention to the data type of every column when you read a .csv file and convert them to the appropriate types (e.g. float or integer). 3 Data visualization a) [1pt.] Start by producing a Heatmap using the Folium package (you can install it using pip). You can use the code below to help you (hint: use Pandas to load the CSV file as a Dataframes variable ‘df’, then use the following code to generate an HTML file named ‘index.html’; open it in your browser and you’ll see the heatmap): def generateBaseMap ( d e f a u l t l o c a t i o n =[40.693943 , −73.985880]) : base map = fo l ium .Map( l o c a t i o n=d e f a u l t l o c a t i o n ) return base map base map = generateBaseMap ( ) HeatMap( data=df [ [ ’ l a t i t u d e ’ , ’ l ong i tude ’ , ’ p r i c e ’ ] ] . groupby ( [ ’ l a t i t u d e ’ , ’ l ong i tude ’ ] ) . mean ( ) . r e s e t i n d e x ( ) . va lue s . t o l i s t ( ) , r ad iu s =8, max zoom =13) . add to ( base map ) base map . save ( ’ index . html ’ ) Is this heatmap useful in order to draw conclusions about the expensive- ness of areas within NYC? If not, why? b) [2pts.] Visualize the clusters by plotting the longitude/latitude of every listing in a scatter plot. c) [2pts.] For every cluster, report the average price of the listings within this cluster. d) Bonus points [1pt.] if you provide a plot on an actual NYC map! You may use Folium or any other package for this. 4 e) [1pt.] Are the findings in agreement with what you have in mind about the cost of living for neighborhoods in NYC? If you are unfamiliar with NYC, you can consult the web. 4 Image Manipulation a) [8 pts.] Download the image found by clicking here. For this problem, you will use the K-means algorithm in the CS506 python package that you built in class to manipulate this image. Or you may use the K-means algorithm in Problem 1. The goal is to give this image as input, and return the image with like pixels combined into 10 clusters. A few hints: -There are a number of useful packages for working with images; we rec- ommend using cv2 (obtained by running pip install opencv). Using this package, you can use the line img = cv2.imread(“file.jpg”) in order to load the image as a numpy array (note that this means you will also need to import numpy). -If you follow the hint above, your data is no longer being opened from a file inside your k mean() function so you may need to tweak it a bit. -To display the image after you have run k-means, you can use the lines cv2.imshow(‘Display Window’, manipulated img) cv2.waitKey(0) cv2.destroyAllWindows() -Each pixel is represented by three features: their red, green, and blue values. You only need to tweak your algorithm to find clusters and then replace the pixels with the value of their cluster center. -The more clusters you work with, the slower this algorithm will run, so for testing it may be useful to work with only 2 clusters. Here is the starting image: 5 https://storage.needpix.com/rsynced_images/boston-1993606_1280.jpg And here is an example output: 6 Understanding K-means Clustering Implementing K-means Finding Closest Centroids [1 pt.] Computing Centroid Means [2 pts.] Random Initialization [1 pt.] Working with the Algorithms Data visualization Image Manipulation homework-1-JeffJeon-main/k_means_clustering.py from __future__ import absolute_import import random import matplotlib.pyplot as plt import numpy as np import scipy.io import scipy.misc def plot_data(samples, centroids, clusters=None): """ Plot samples and color it according to cluster centroid. :param samples: samples that need to be plotted. :param centroids: cluster centroids. :param clusters: list of clusters corresponding to each sample. """ colors = ['blue', 'green', 'gold'] assert centroids is not None if clusters is not None: sub_samples = [] for cluster_id in range(centroids[0].shape[0]):