MIS 545 Lab 4: Clustering
Algorithm: K-means
Find clusters of states based on crimes

1 Overview

In this lab, we will work with the K-means clustering algorithm. The data file, crime_data.csv, can be found under the Lab 4 module on D2L. Save it in your working directory. The dataset records crime rates for the 50 US states. The crimes are murder, assault, and rape. The urban population (in millions) of each state and a predefined cluster label are also provided. This makes it a 5-dimensional dataset with 4 predefined clusters.

2 Data Packages

For Lab 4, we will use ggplot2 and animation to visualize the clustering result.

ggplot2: a plotting system for R that can produce complex multi-layered graphics. You can find an excellent ggplot2 tutorial here.

animation: a public package that provides functions for animations in statistics, covering topics in multiple areas such as data mining and machine learning. You can find the details here.

    # Install packages ggplot2 and animation
    install.packages("ggplot2")
    install.packages("animation")
    # To use a package in an R session, we need to load it via library()
    library(ggplot2)
    library(animation)

3 Data overview and Normalization

First load the data.

    # Read in csv file crime_data.csv. Only two dimensions, assault and murder, will be needed.
    crime0 <- read.csv("crime_data.csv")
    # Note the comma below
    crime <- crime0[, c('Murder','Assault')]
    # Check out the distribution of data
    plot(crime)

K-means is a clustering algorithm that assumes the input data is isotropic. In other words, it treats all features as equally important, so the clusters it finds tend to be spherical (picture balls in a 3-dimensional space). Given this, we normalize the original dataset to avoid bias due to the different scales of the features. Here, we use min-max normalization.

    # Create a normalization function
    normIt <- function(feature){
      normalized <- ((feature - min(feature)) / (max(feature) - min(feature)))
      return (normalized)
    }
    # Apply the customized function to our data, then convert it to data frame
    nor_crime <- apply(crime[,c(1,2)], 2, FUN=normIt)
    nor_crime <- as.data.frame(nor_crime)

Try the number of clusters to be 5. The kmeans() function takes the input dataset and the number of clusters. It comes from the built-in package stats.

    c1 <- kmeans(nor_crime,5)
    class(c1)
    # Analyze the result of clustering
    str(c1)
    ## cluster: indicates which cluster an observation belongs to
    ## centers: the coordinates of each cluster center

betweenss: between-cluster sum of squares, i.e. how far apart the clusters are from each other (inter-cluster separation).
withinss: within-cluster sum of squares, one value per cluster, i.e. how tightly the points of a cluster sit around its center (intra-cluster variation). It drops sharply as the number of clusters approaches the point where the distortion
converges.
tot.withinss: the sum of withinss over all clusters. It is a measure for the clustering as a whole, i.e. the total intra-cluster variation.

4 Elbow curve plot and function

Usually a good clustering returns a lower value of withinss and a higher value of betweenss. For K-means, the performance depends on the number of clusters, which we choose arbitrarily, and on the randomly initialized positions of the centers.

    # Create a function that returns the value of tot.withinss, and takes input dataset and number of clusters
    kmeans.totwithinss.k <- function(dataset, number_of_centers){
      km <- kmeans(dataset, number_of_centers)
      km$tot.withinss
    }

Call the function we customized above, kmeans.totwithinss.k.

    # Test k=3, k=5.
    # It can be seen that as the value of k increases, distortion decreases
    kmeans.totwithinss.k(nor_crime, 3)
    ## [1] 1.363057
    kmeans.totwithinss.k(nor_crime, 5)
    ## [1] 1.002086

We would like to know the values of tot.withinss for different k. Create a function that returns a series of tot.withinss values and takes maxk as input.

    # vec is a vector that contains tot.withinss values associated with k from 1 to maxk
    kmeans.distortion <- function(dataset, maxk){
      vec <- as.vector(1:maxk)
      vec[1:maxk] <- sapply(1:maxk, kmeans.totwithinss.k, dataset=dataset)
      return(vec)
    }

Plot tot.withinss in a graph to observe the relationship between distortion and the value of k.

    # Max k = 10
    maxk <- 10
    dis_vct <- kmeans.distortion(nor_crime, maxk)
    # Elbow curve
    plot(1:maxk,                     # horizontal axis
         dis_vct,                    # vertical axis
         type='b',                   # curve
         col='blue',
         xlab="Number of cluster",
         ylab="Distortion")

We can observe that the distortion decreases more slowly as the number of clusters grows. The value of distortion becomes stable beyond a certain threshold, which is the optimal value of k. Here, around k=4 or 5, the model reaches its asymptotic distortion convergence.

5 K-means and Animation

Let us apply some animation to understand how R gave us the clustered results.

    # Number of clusters, k = 4
    num_cluster = 4
    result <- kmeans.ani(nor_crime, num_cluster)

result$centers contains the mean coordinates of the points in each cluster, which are the cluster centers. In the code below, the
aggregate method counts the number of points in each cluster.

    centers <- as.data.frame(result$centers)
    counts <- aggregate(nor_crime, by=list(result$cluster), FUN=length)[, 2]

Add the cluster label to crime.

    crime$cluster <- result$cluster

6 Visualization

We can visualize the clusters over the normalized data using the ggplot() method.

    # Base layer
    plot.crime <- ggplot(data=nor_crime, aes(x=Murder, y=Assault, color=result$cluster))

Add more layers to the base plot.

    # alpha: semi-transparent points
    plot.crime + geom_point(alpha=.25, size=5) +
      # Cluster centers, colored black:
      geom_point(data=centers, aes(x=Murder, y=Assault), size=5, color='black') +
      # Cool colors for each cluster:
      scale_color_gradientn(colours=rainbow(num_cluster)) +
      # Add a title, align to the center
      theme(plot.title=element_text(hjust = 0.5)) +
      ggtitle("K-means Clusters")

Because of randomness, your results are likely to look a little different.

7 Lab 4 source.R

    if(1==1){
      # High-quality plots
      if(!require(ggplot2)){
        install.packages("ggplot2")
      }
      library(ggplot2)
      # Animation of K-means
      if(!require(animation)){
        install.packages("animation")
      }
      library(animation)
    }
    ###########################################################
    ##########              K-means                 ###########
    ###########################################################
    ################################################
    ###     Data Overview and Normalization      ###
    ################################################
    if(2==2){
      # Load in crime data
      setwd("c:\\users\\yongcheng\\desktop\\mis 545\\lab")
      getwd()
      crime0 <- read.csv("crime_data.csv")
      str(crime0)
      # Display distribution in two dimensions, murder and assault
      crime <- crime0[,c('Murder', 'Assault')]
      plot(crime)
      ### Normalization function
      normIt <- function(feature){
        normalized <- ((feature - min(feature)) / (max(feature) - min(feature)))
        return (normalized)
      }
      # Normalization, the result nor_crime is converted to data frame
      nor_crime <- apply(crime[,c(1,2)], 2, FUN=normIt)
      nor_crime <- as.data.frame(nor_crime)
      # Let us take the number of clusters to be 5.
      # kmeans() function takes the input data and the number of clusters in which
      # the data is to be clustered. The syntax is : kmeans(data, k)
      # where k is the number of cluster centers.
      c1 <- kmeans(nor_crime, 5)
      class(c1)
      # Analyzing the clustering :
      str(c1)
    }
    ############################################
    ###    Find the Optimal Value of 'k'     ###
    ############################################
    if(3==3){
      # Create a function that returns the tot.withinss value of a kmeans() result
      kmeans.totwithinss.k <- function(dataset, number_of_centers){
        km <- kmeans(dataset, number_of_centers)
        km$tot.withinss
      }
      # For k=3, tot.withinss is
      kmeans.totwithinss.k(nor_crime, 3)
      # It can be seen that as the value of k increases, distortion decreases.
      kmeans.totwithinss.k(nor_crime, 5)
      # Create a function that returns a series of withinss values
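
Because K-means starts from random centers, the elbow curve from Section 4 can change slightly from run to run. The following is a minimal sketch of one way to make it repeatable, not part of the lab's own source. It rests on assumptions that are not in the lab: R's built-in USArrests data is used as a stand-in for crime_data.csv, the seed value 545 is arbitrary, and nstart = 25 is an illustrative choice that makes kmeans() try 25 random starts per k and keep the best one.

```r
# Stand-in data (assumption): built-in USArrests instead of crime_data.csv
crime <- USArrests[, c("Murder", "Assault")]

# Min-max normalization, same formula as normIt in Section 3
normIt <- function(feature) {
  (feature - min(feature)) / (max(feature) - min(feature))
}
nor_crime <- as.data.frame(lapply(crime, normIt))

# Fix the seed so repeated runs give the same curve (seed value is arbitrary)
set.seed(545)

# tot.withinss for k = 1..maxk, using several random starts per k
maxk <- 10
dis_vct <- sapply(1:maxk, function(k) {
  kmeans(nor_crime, centers = k, nstart = 25)$tot.withinss
})

# Elbow curve
plot(1:maxk, dis_vct, type = "b", col = "blue",
     xlab = "Number of clusters", ylab = "Distortion")
```

With a fixed seed the curve is the same on every run, and multiple starts reduce the chance that a poor random initialization inflates the distortion for a particular k.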