Python Clustering and Distributed Computing assignment.
Total Points: 50

1 Objective

The objective is to make sure that:

1. You can use some popular Python machine learning libraries.
2. You are familiar with the k-means clustering algorithm and how it works.
3. You can use the matplotlib library to plot results.
4. You are familiar with map-reduce and distributed computing techniques.

2 K-Means Clustering

K-means clustering is an unsupervised learning method that groups the most similar data points together. Similarity is measured with a distance function. Unlike other methods, such as single linkage, k-means can give you slightly different results on different runs because its initial cluster centers are chosen randomly, which is why the algorithm is usually run several times before a result is reported. The scikit library implements several clustering methods and contains several test datasets on which you can experiment. For this homework, we are going to run k-means clustering on the Handwritten Digits dataset.

3 Libraries

For this assignment, you will need the following libraries:

• numpy
• scipy
• matplotlib
• pandas
• scikit

While you will not explicitly use numpy and scipy, the scikit library needs them to be installed.

3.1 scikit

The scikit library (scikit-learn, imported in Python as sklearn) contains many machine learning algorithms, including k-means, which we will be using. It also comes with a number of preloaded datasets. The easiest way to install scikit is to use pip, as follows:

    pip install -U scikit-learn

This will take a while, but at the end you should be good to go.

3.2 pandas

pandas is a library that provides many useful data structures and data analysis tools. You can get pandas from PyPI as follows:

    pip install pandas

This may require the installation of a few dependencies, like numpy, but at the end you should be good to go.
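Once the installs above have finished, a quick check like the following can confirm that everything is in place. This is only a sketch and not part of the assignment: the helper name check_libraries and the REQUIRED list are made up for illustration. Note that scikit is installed as scikit-learn but imported as sklearn.

```python
import importlib

# Import names of the libraries listed in Section 3
# (scikit-learn is imported as "sklearn").
REQUIRED = ["numpy", "scipy", "matplotlib", "pandas", "sklearn"]


def check_libraries(names=REQUIRED):
    """Return {name: version string, or None if the import fails}."""
    found = {}
    for name in names:
        try:
            module = importlib.import_module(name)
            found[name] = getattr(module, "__version__", "unknown")
        except ImportError:
            found[name] = None
    return found


status = check_libraries()
for name, version in status.items():
    print(f"{name:12s} {version if version else 'NOT INSTALLED'}")

# If scikit is available, confirm that the bundled digits dataset loads.
if status["sklearn"]:
    from sklearn.datasets import load_digits

    digits = load_digits()
    print("digits data shape:", digits.data.shape)  # (samples, features)
```

If any line reports NOT INSTALLED, rerun the corresponding pip command before starting on the specifications.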
4 Specifications

For this assignment, you will analyze the intro-level clustering program we have provided (iris.py) and adapt it to perform clustering on the Handwritten Digits dataset.

1. Simple Implementation - 30 points
   • Call this file digits.py.
   • Run the sample program on the iris dataset to familiarize yourself with scikit and clustering.
   • Load the digits dataset instead of the iris dataset. (5 points)
   • Run a Principal Component Analysis (PCA) on the dataset to reduce the number of features (components) from 64 to 2. (10 points)
   • Run k-means on this dataset to cluster the data into 10 classes. (5 points)
   • Plot the results using matplotlib. Choose a set of 10 fairly well-separated colors for the scatterplot. (10 points)
   • This is only one run of the k-means algorithm. In the real world, we run it several times and assign each point its majority label.

2. Distributed Implementation - 70 points
   • For the distributed implementation, you will need a mapper program and a reducer program.
   • mapper.py
     – For the mapper, load the same dataset, perform PCA (choosing the same features every time) and run k-means. (10 points)
     – Turn the result into a dictionary, where each key is the post-PCA data point and the value is its corresponding label. (10 points)
   • reducer.py
     – In the reducer function, collate the results. If you run several mappers, each point could have been assigned different labels.
     – Calculate the label for each point by counting how many times it is assigned to each label. Some points may have only one label across all runs, while others might have several different labels. (15 points)
     – The final label for a point is the one assigned the greatest number of times (majority poll of labels). (10 points)
     – It is possible that, due to the inherent randomness of the algorithm, the same class has different label names in different runs. That is alright.
       We’re not concerned about “correctness”; we’re more concerned about implementing a map-reduce problem.
   • plot.py
     – Plot the same graph, but with the new labels. To do this, save each point and its label to a CSV file (directly or by redirecting from stdout), then write another Python program to read that file into a list and plot it. (10 points)
     – If you wish to truly test your setup, you may set up a Hadoop cluster using the guide (posted on the class website). Test it with 1 mapper and 1 reducer first. Then, spawn 8 mappers and reduce their results to one plot. This is not necessary for your homework.
   • Include a PDF document that contains a short description of the k-means algorithm (a couple of paragraphs) and a screenshot of your output. Please write the description yourself. Copy-pasting from Wikipedia or other sites on the internet is not acceptable. (10 points)
   • Make sure your programs are documented/commented. (5 points)

5 Dataset

The Handwritten Digits dataset is a set of handwritten digits that have been reduced to pixels. You can find more information about the dataset here:

http://archive.ics.uci.edu/ml/datasets/optical+recognition+of+handwritten+digits

You do not have to get the dataset from the website. It is available automatically from the sklearn datasets.

6 Sample Output

Your result might come out looking slightly different due to the inherent randomness of k-means and the dimensions you are using to plot the result.

7 Submission

• You are required to submit the following files:
  – digits.py
  – mapper.py
  – reducer.py
  – plot.py
  – Kmeans.pdf
• If we have listed a specification and allocated points for it, you will lose points if that particular item is missing from your code, even if it is trivial.
• Your outputs will differ from mine depending on the colors you choose and due to the nature of k-means. The overall shape should look similar, though.
• Your program should load and run without issues. Each error raised by the interpreter will result in a loss of 5 points.
• You are restricted to standard Python (built-ins), sklearn, pandas, numpy and matplotlib. Use of any other library will result in a loss of 10 points per library.
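The majority poll described for reducer.py in Section 4 can be sketched using only built-ins, as shown below. This is one possible approach and not a reference solution. The function name majority_labels and the input format are assumptions: each line is taken to be a point key (for example "x,y"), a tab, and the label one mapper assigned to that point.

```python
from collections import Counter, defaultdict


def majority_labels(lines):
    """Map each point key to the label it was assigned most often.

    Assumes every non-empty line looks like "<point key>\t<label>".
    """
    votes = defaultdict(Counter)
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        key, label = line.rsplit("\t", 1)
        votes[key][label] += 1
    # most_common(1) returns [(label, count)] for the top-voted label.
    return {key: counts.most_common(1)[0][0] for key, counts in votes.items()}


# Made-up sample input: three mapper votes for one point, one for another.
sample = ["1.5,-2.0\t3", "1.5,-2.0\t3", "1.5,-2.0\t7", "0.2,4.1\t1"]
result = majority_labels(sample)
print(result)  # {'1.5,-2.0': '3', '0.2,4.1': '1'}
```

In an actual reducer you would pass sys.stdin instead of the sample list and print one point and label per line so that plot.py can read the output. As noted in Section 4, the same class may carry different label names in different runs, so this vote only counts label names as they arrive.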