In this lab, we will work with the K-means clustering algorithm. The data file, crime_data.csv, can be found under the Lab 4 module on D2L. Save it in your working directory. The dataset is about crime rate...


In this lab, we will work with the K-means clustering algorithm. The data file, crime_data.csv, can be found under the Lab 4 module on D2L. Save it in your working directory. The dataset is about crime rate or occurrence for the 50 states in the US. Crimes include murder, assault, and rape. The urban population (in millions) of each state and the predefined cluster are also provided. As you can see, this is a 5-dimensional dataset with 4 predefined clusters.


Lab 4_Clustering.docx

MIS 545 Lab 4: Clustering Algorithm: K-means
Find clusters of states based on crimes

1 Overview
In this lab, we will work with the K-means clustering algorithm. The data file, crime_data.csv, can be found under the Lab 4 module on D2L. Save it in your working directory. The dataset is about crime rate or occurrence for the 50 states in the US. Crimes include murder, assault, and rape. The urban population (in millions) of each state and the predefined cluster are also provided. As you can see, this is a 5-dimensional dataset with 4 predefined clusters.

2 Data Packages
For Lab 4, we will use ggplot2 and animation to visualize the clustering result.
ggplot2: a plotting system for R that can produce complex multi-layered graphics.
animation: a public package that provides functions for animations in statistics, covering topics in multiple areas such as data mining and machine learning.
# Install packages ggplot2 and animation
install.packages("ggplot2")
install.packages("animation")
# To use a package in an R session, we need to load it via library()
library(ggplot2)
library(animation)

3 Data overview and Normalization
First load the data.
# Read in csv file crime_data.csv. Only two dimensions, Murder and Assault, will be needed.
crime0 <- read.csv("crime_data.csv")
# Note the comma below
crime <- crime0[, c('Murder', 'Assault')]
# Check out the distribution of the data
plot(crime)
K-means is a clustering algorithm that assumes your input data is isotropic. In other words, it treats all features as equally important, which means the clusters it finds tend to be spherical (if in a 3-dimensional space). Given this, we would like to normalize the original dataset to avoid bias due to the measurement scale of the data. Here, we take min/max scaling as our approach.
# Create a normalization function
normIt <- function(feature){
  normalized <- ((feature - min(feature)) / (max(feature) - min(feature)))
  return (normalized)
}
# Apply the customized function to our data, then convert it to a data frame
nor_crime <- apply(crime[, c(1, 2)], 2, FUN = normIt)
nor_crime <- as.data.frame(nor_crime)
Try the number of clusters to be 5. The kmeans() function takes the input dataset and the number of clusters. It comes from the built-in package stats.
c1 <- kmeans(nor_crime, 5)
class(c1)
# Analyze the result of clustering
str(c1)
## cluster: indicates which cluster an observation belongs to
## centers: the coordinates of each cluster center
## betweenss: between-cluster sum of squares, i.e. how far apart the clusters are (inter-cluster separation)
## withinss: within-cluster sum of squares, one value per cluster, i.e. how spread out each cluster is (intra-cluster dispersion). It drops dramatically as the number of clusters approaches the point where the asymptotic distortion converges.
## tot.withinss: sum of the withinss of all the clusters. A measure for the whole system, i.e. total intra-cluster dispersion

4 Elbow curve plot and function
Usually a good clustering will return a lower value of withinss and a higher value of betweenss. For K-means, the performance depends on the number of clusters, which we choose arbitrarily, and on the randomly initialized coordinates of the centers.
# Create a function that returns the value of tot.withinss, and takes the input dataset and the number of clusters
kmeans.totwithinss.k <- function(dataset, number_of_centers){
  km <- kmeans(dataset, number_of_centers)
  km$tot.withinss
}
Call the function we customized above, kmeans.totwithinss.k
# Test k = 3, k = 5. It can be seen that as the value of k increases, distortion decreases
kmeans.totwithinss.k(nor_crime, 3)
## [1] 1.363057
kmeans.totwithinss.k(nor_crime, 5)
## [1] 1.002086
We would like to know the different values of tot.withinss. Create a function that returns a series of tot.withinss values and takes maxk as input.
# vec is a vector that contains the tot.withinss values associated with k from 1 to maxk
kmeans.distortion <- function(dataset, maxk){
  vec <- as.vector(1:maxk)
  vec[1:maxk] <- sapply(1:maxk, kmeans.totwithinss.k, dataset = dataset)
  return(vec)
}
Plot tot.withinss in a graph to observe the relationship between distortion and the value of k.
# Max k = 10
maxk <- 10
dis_vct <- kmeans.distortion(nor_crime, maxk)
# Elbow curve
plot(1:maxk,      # horizontal axis
     dis_vct,     # vertical axis
     type = 'b',  # curve
     col = 'blue',
     xlab = "Number of cluster",
     ylab = "Distortion")
We can observe that the distortion decreases more slowly as the number of clusters grows. The value of distortion becomes stable beyond a certain threshold, which is the optimal value of k. Here, around k = 4 or 5, the model reaches its asymptotic distortion convergence.

5 K-means and animation
Let us apply some animation to understand how R gave us the clustered results.
# Number of clusters, k = 4
num_cluster = 4
result <- kmeans.ani(nor_crime, num_cluster)
result$centers contains the average coordinates, which are the centers of the clusters. The second statement, aggregate(), counts the number of points in each cluster.
centers <- as.data.frame(result$centers)
counts <- aggregate(nor_crime, by = list(result$cluster), FUN = length)[, 2]
Add the cluster label to crime
crime$cluster <- result$cluster

6 Visualization
We can visualize the clusters over the normalized data using the ggplot() method.
# Base layer
plot.crime <- ggplot(data = nor_crime, aes(x = Murder, y = Assault, color = result$cluster))
Add more layers to the base plot
# alpha: semi-transparent points
plot.crime +
  geom_point(alpha = .25, size = 5) +
  # cluster centers, colored black:
  geom_point(data = centers, aes(x = Murder, y = Assault), size = 5, color = 'black') +
  # cool colors for each cluster:
  scale_color_gradientn(colours = rainbow(num_cluster)) +
  # add a title, align to the center
  theme(plot.title = element_text(hjust = 0.5)) +
  ggtitle("K-Means Clusters")
Because of randomness, your results are likely to look a little different.

7 Lab 4 source.R
if(1==1){
  # high-quality plots
  if(!require(ggplot2)){
    install.packages("ggplot2")
  }
  library(ggplot2)
  # animation of k-means
  if(!require(animation)){
    install.packages("animation")
  }
  library(animation)
}
###########################################################
########## K-means ###########
###########################################################
################################################
### Data overview and normalization ###
################################################
if(2==2){
  # load in crime data
  setwd("c:\\users\\yongcheng\\desktop\\mis 545\\lab")
  getwd()
  crime0 <- read.csv("crime_data.csv")
  str(crime0)
  # display distribution in two dimensions, murder and assault
  crime <- crime0[, c('Murder', 'Assault')]
  plot(crime)
  ### normalization function
  normIt <- function(feature){
    normalized <- ((feature - min(feature)) / (max(feature) - min(feature)))
    return (normalized)
  }
  # normalization, the result nor_crime is converted to data frame
  nor_crime <- apply(crime[, c(1, 2)], 2, FUN = normIt)
  nor_crime <- as.data.frame(nor_crime)
  # let us take the number of clusters to be 5.
  # kmeans() function takes the input data and the number of clusters in which
  # the data is to be clustered. The syntax is: kmeans(data, k)
  # where k is the number of cluster centers.
  c1 <- kmeans(nor_crime, 5)
  class(c1)
  # analyzing the clustering:
  str(c1)
}
############################################
### Find the optimal value of 'k' ###
############################################
if(3==3){
  # create a function that returns the tot.withinss value of a kmeans() result
  kmeans.totwithinss.k <- function(dataset, number_of_centers){
    km <- kmeans(dataset, number_of_centers)
    km$tot.withinss
  }
  # for k=3, withinss is
  kmeans.totwithinss.k(nor_crime, 3)
  # it can be seen that as the value of k increases, distortion decreases.
  kmeans.totwithinss.k(nor_crime, 5)
  # create a function that returns a series of withinss value
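The fields that str(c1) reports in the lab (withinss, tot.withinss, betweenss) are related to each other, and that relationship can be checked numerically. The following is only a minimal sketch: it assumes the built-in USArrests data can stand in for crime_data.csv, and the names norm_it and fit, the seed value, and nstart = 25 are illustrative choices that do not come from the lab.

```r
# Assumption: the built-in USArrests data stands in for crime_data.csv.
crime <- USArrests[, c("Murder", "Assault")]

# Min-max scaling to [0, 1], same idea as the lab's normIt()
norm_it <- function(feature) {
  (feature - min(feature)) / (max(feature) - min(feature))
}
nor_crime <- as.data.frame(lapply(crime, norm_it))

# Fix the random starting centers so that repeated runs agree;
# nstart = 25 keeps the best of 25 random initializations.
set.seed(545)
fit <- kmeans(nor_crime, centers = 4, nstart = 25)

# The total sum of squares splits into a within-cluster part
# and a between-cluster part.
fit$totss
fit$tot.withinss + fit$betweenss

# tot.withinss is the sum of the per-cluster withinss values.
sum(fit$withinss)
fit$tot.withinss
```

Because totss is fixed for a given dataset, lowering tot.withinss necessarily raises betweenss, which is why the elbow curve only needs to track one of the two.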
Answered Same Day, Aug 08, 2021


Subhanbasha answered on Aug 09 2021
We need two packages to run and visualize the clustering.
# Installing required packages
install.packages('ggplot2')
install.packages("animation")
# Calling required packages
library(ggplot2)
library(animation)
# Reading data set
iris_data <- read.csv('iris.csv')
Before clustering, check the scale of the features. K-means relies on distances, so if the features are measured on different scales they should be rescaled to a common range first.
# Data overview and distribution
plot(iris_data$Petal.Length)
plot(iris_data$Petal.Width)
The two plots above show that the features cover different ranges, so we rescale them to [0, 1] with min-max normalization. Only two features are selected for the analysis.
# Normalizing the data
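normIt <- function(feature){
  normalized <- ((feature - min(feature)) / (max(feature) - min(feature)))
  return (normalized)
}
# Select the two plotted features by name so that a non-numeric column is never picked up by position
nor_iris <- apply(iris_data[, c("Petal.Length", "Petal.Width")], 2, FUN = normIt)
nor_iris <- as.data.frame(nor_iris)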
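To see what normIt does, it can be applied to a small vector and to two columns of R's built-in iris data frame. This is a sketch only: it assumes the built-in iris data has the same Petal.Length and Petal.Width columns as iris.csv, and the names x and nor are illustrative.

```r
# Same min-max function as in the answer
normIt <- function(feature) {
  normalized <- (feature - min(feature)) / (max(feature) - min(feature))
  return(normalized)
}

# Small worked example: the minimum maps to 0, the maximum to 1
x <- c(2, 4, 6, 10)
normIt(x)   # 0.00 0.25 0.50 1.00

# Assumption: the built-in iris data frame stands in for iris.csv
nor <- as.data.frame(apply(iris[, c("Petal.Length", "Petal.Width")], 2, normIt))

# After scaling, every column spans exactly [0, 1]
sapply(nor, range)
```

Note that a constant column would make max(feature) - min(feature) equal to zero and produce NaN, so such columns should be dropped before scaling.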
After normalizing the data, we need to find the optimal number of clusters to get a good grouping.
# Finding optimal cluster value
set.seed(200)
k.max <- 10
# Cluster the normalized features; nstart = 20 keeps the best of 20 random starts for each k
wss <- sapply(1:k.max, function(k){kmeans(nor_iris, k, nstart = 20, iter.max = 20)$tot.withinss})
wss
plot(1:k.max, wss, type = "b", xlab = "Number of clusters (k)", ylab = "Within cluster sum of squares")
The plot above helps identify the optimal number of clusters. The curve flattens after k = 3, so the optimal cluster value is 3 and we can use 3...
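Once a value of k has been read off the elbow plot, the clustering itself can be fitted and inspected. The following is a hedged sketch rather than the answerer's own continuation: it assumes the built-in iris data frame matches iris.csv, and the names features, nor, and fit are illustrative.

```r
# Assumption: the built-in iris data frame stands in for iris.csv
features <- iris[, c("Petal.Length", "Petal.Width")]

# Min-max scale each feature to [0, 1]
nor <- as.data.frame(lapply(features, function(f) {
  (f - min(f)) / (max(f) - min(f))
}))

# Fit K-means with the k suggested by the elbow plot
set.seed(200)
fit <- kmeans(nor, centers = 3, nstart = 20, iter.max = 20)

# Number of observations in each cluster
fit$size

# Cross-tabulate cluster labels against the known species.
# Cluster numbers are arbitrary, so only the grouping matters.
table(cluster = fit$cluster, species = iris$Species)
```

The cross-tabulation is only possible here because iris carries a known label; for the crime data the predefined cluster column could play the same role.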