Assignment 4
Clustering
Due: January 7, 2020 (6am)
ITC 534 Object-oriented Programming in Python

In this assignment, you will implement a simple k-means clustering algorithm. K-means is a popular iterative clustering algorithm which finds k groups in a given training dataset. The inputs of the algorithm are: 1) k, the number of clusters to be found, and 2) the number of iterations to be performed. Initially, the k-means algorithm starts with random cluster centers and updates them at each iteration. Figures 1 and 2 show, as an example, the initial and final cluster centers together with the data points of each cluster.

Please watch the following videos to learn the k-Means clustering algorithm:
• Andrew Ng’s Introduction to Machine Learning Course: https://www.youtube.com/watch?v=hDmNF9JG3lo
• MIT Intro. to Computational Thinking and Data Science Course: https://youtu.be/esmzYhuFnds?t=986

Additional Videos
• Louis Serrano: https://www.youtube.com/watch?v=QXOkPvFM6NU
• StatQuest: https://www.youtube.com/watch?v=4b5d3muPQmA&t=1s
• Viktor Lavrenko: https://www.youtube.com/watch?v=_aWzGGNrcic
• Visualization of kMeans Clustering: https://www.naftaliharris.com/blog/visualizing-k-means-clustering/

Figure 1. Initial random cluster centers and data points that belong to each of them.

Figure 2. Cluster centers at the 7th iteration and their data points.

Data Generation

Generate a random dataset which contains three clusters as shown in Figure 2. Use a Gaussian distribution to generate 200 data points for each cluster. The parameters of the Gaussian distribution (mean and standard deviation) for the clusters are as follows:
• Cluster 1: Mean (x=0.3, y=0.7) with standard deviation (stdx=0.06, stdy=0.06)
• Cluster 2: Mean (x=0.8, y=0.3) with standard deviation (stdx=0.15, stdy=0.15)
• Cluster 3: Mean (x=0.2, y=0.2) with standard deviation (stdx=0.10, stdy=0.10)

Client and Class Codes

In your program, you should have client code similar to the one given below.
The code first generates the training data, then creates a k-Means object and calls its fit method to perform the cluster computations.

    # Sample client code

    # Training data generation
    sample_count_per_class = 200  # number of data samples for each cluster
    training_data = generate_data(sample_count_per_class)  # generate data using a function

    # Create kMeans clustering object
    k_means = KMeans(cluster_count=3, max_iteration=7)

    # Perform kMeans clustering training, i.e., find and plot cluster centers
    k_means.fit(training_data)

Figure 3. Sample client code.

In your implementation, you should use a Point class to model 2D (x, y) points. The k-Means clustering algorithm should be modeled as a class named KMeans. The data attributes and methods of both classes are provided as UML class diagrams in Figure 4. It is mandatory to use the Point class to represent points and a list of Point objects to represent the training data. Do not use other data structures, such as those provided by numpy or pandas, in your assignment.

Figure 4. UML class diagrams for the Point and KMeans classes.

In your report, provide graphical outputs of the k-Means algorithm for k=2, k=3, and k=5. For each k, please provide the initial cluster centers (see Figure 1) and the final cluster centers (see Figure 2). For plotting, you can use the sample code given below:

    # Sample code for plotting
    import matplotlib.pyplot as plt

    # Square axes covering the unit square
    fig, ax = plt.subplots(subplot_kw={'aspect': 'equal'})
    ax.set_xlim(0, 1)
    ax.set_ylim(0, 1)
    ax.plot(0.6, 0.3, 'ro', markersize=6)   # small red circle
    ax.plot(0.4, 0.7, 'bo', markersize=12)  # large blue circle
    plt.title('Sample Plot')
    plt.draw()
    plt.savefig("output.png", bbox_inches='tight', dpi=200)  # save before showing
    plt.show()

Figure 5. Sample code for plotting.

Figure 6. Graphical output of the plotting code.

Evaluation Criteria and Grading

Code
• 20%: Compliance with submission rules and programming style, e.g., file names, formats, directory structure, naming conventions, indentation, and comments.
• 60%: Correctness of the solution.
Report
• 20%: Completeness of the report, compliance with the report format, and correctness of the content and language.

Submission Guide

Submission Files
Submit a single compressed (.zip) file to Blackboard. Name your zip file name_surname.zip. The zip file should contain all source code (under the \code directory) and the report (in PDF format, under the \report directory). Name your code files name_surname.py, etc. Name your report name_surname.pdf. Each file should start with your name, student ID, date, and a brief code summary in a comment block.

Mandatory Submission
Submission of assignments is mandatory. If you do not submit an assignment, you will fail the course.

Late Submission Policy
The maximum submission delay is two days. Late submissions will be graded on a scale of 50% of the original grade. Submission is mandatory even if you submit your assignment late.

Plagiarism
Plagiarism leads to grade F, and YÖK regulations will be applied.
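
Illustrative Sketch

As a rough illustration of the Data Generation step and of a single k-means iteration described earlier in this handout, the following minimal sketch uses only the Python standard library and a list of Point objects. It is not the required solution and does not follow the UML diagrams of Figure 4, which are not reproduced here. The attribute names x and y, the method distance_to, the helpers assign_points and update_centers, the constant CLUSTER_PARAMS, the use of random.gauss, and the choice to leave a center unchanged when no points are assigned to it are all assumptions made for this example.

```python
import math
import random


class Point:
    """A 2D point; the attribute names x and y are assumed for this sketch."""

    def __init__(self, x, y):
        self.x = x
        self.y = y

    def distance_to(self, other):
        # Euclidean distance between this point and another point
        return math.hypot(self.x - other.x, self.y - other.y)


# (mean_x, mean_y, std) for the three clusters listed under Data Generation;
# stdx and stdy are equal for every cluster, so one value is stored
CLUSTER_PARAMS = [(0.3, 0.7, 0.06), (0.8, 0.3, 0.15), (0.2, 0.2, 0.10)]


def generate_data(sample_count_per_class):
    """Draw Gaussian samples for each cluster and return a list of Point objects."""
    data = []
    for mean_x, mean_y, std in CLUSTER_PARAMS:
        for _ in range(sample_count_per_class):
            data.append(Point(random.gauss(mean_x, std), random.gauss(mean_y, std)))
    return data


def assign_points(data, centers):
    """Group the points by the index of their nearest center."""
    groups = [[] for _ in centers]
    for point in data:
        nearest = min(range(len(centers)),
                      key=lambda i: point.distance_to(centers[i]))
        groups[nearest].append(point)
    return groups


def update_centers(groups, centers):
    """Move each center to the mean of its group (assumption: an empty group keeps its old center)."""
    new_centers = []
    for group, old_center in zip(groups, centers):
        if group:
            mean_x = sum(p.x for p in group) / len(group)
            mean_y = sum(p.y for p in group) / len(group)
            new_centers.append(Point(mean_x, mean_y))
        else:
            new_centers.append(old_center)
    return new_centers


random.seed(0)  # fixed seed so repeated runs give the same data
training_data = generate_data(200)
centers = [Point(random.random(), random.random()) for _ in range(3)]
for _ in range(7):  # 7 iterations, matching the sample client code
    groups = assign_points(training_data, centers)
    centers = update_centers(groups, centers)
```

In an object-oriented solution, the two helper functions would more naturally become methods of the KMeans class, with fit running the loop and producing the plots.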