PROG8430 – Data Analysis, Modeling and Algorithms: Lecture 10 – Unsupervised Classification (Clustering)

R language quiz


PROG8430 – Data Analysis, Modeling and Algorithms
LECTURE 10 – UNSUPERVISED CLASSIFICATION: CLUSTERING

ALGORITHM SELECTION, SUMMARIZED (AND CONDENSED!)
[Flowchart adapted from Li, Hui (2017): a decision tree for choosing an algorithm. Dimension reduction leads to principal component analysis, singular value decomposition, or forward/backward selection (topic modeling vs. probabilistic). Predicting a numeric response leads to linear regression or a decision tree for speed, or random forest / neural network / gradient boosting tree for accuracy. Classification leads to logistic regression, naive Bayes or linear SVM when speed and explainability matter, or support vector machines / random forest / neural network / gradient boosting tree when accuracy matters, depending on data size, categorical variables, and whether probabilities are preferred. With no response variable, the path leads to clustering: hierarchical, k-modes, k-means, Gaussian mixture. THIS LECTURE: k-means clustering.]

K-MEANS CLUSTERING – HOW TO CLASSIFY MULTIPLE "OUTCOMES"

Clustering
Clustering (K-Means) is an unsupervised form of analysis (i.e. the data is unlabeled, with no "right" answer). Clustering is also known as segmentation, stratification, or grouping. K-Means was invented by Steinhaus and Lloyd in 1957 and came into common use in the 1980s. It is still popular because of:
◦ Speed
◦ Interpretability

Why bother clustering?
• Marketing segmentation – treat customers differently.
• Build models on sub-groups – some variables may be more predictive within some segments than others.
• Evaluate key variables – variable effects can be evaluated on sub-groups (e.g. geographic impacts).

USER STEPS IN K-MEANS
• Standardize the data – the data needs to be normalized so that variables have similar scaling.
• Choose the data to cluster around – the data that defines the clusters is often different from the data that describes the clusters.
• Choose the number of clusters – the "K" in K-Means.

Consider a single variable
Looking at the FIFA ranking points for 20 national soccer teams. Try with 2, 3, 4 clusters; try with 2, 3, … iterations.

Points  Country
1209    Estonia
1293    Benin
1433    Egypt
1454    Czech Republic
1368    Ecuador
1625    Spain
1406    Scotland
1599    Denmark
1755    Belgium
1286    Belarus
1245    Vietnam
1201    India
1367    Albania
1339    Canada
1371    Bulgaria
1548    Peru
1323    Curacao
1248    Luxembourg
1344    FYR Macedonia
1281    Haiti

K-Means Clustering – Step by Step
1. Choose the number of clusters (k = 5 in the illustrated example).
2. Arbitrarily choose cluster centre points (centroids).
3. Assign each data point to the cluster with the closest centre.
4. Each cluster recalculates its centroid based on its new set of data points.
5. Repeat steps 3 and 4 until convergence.
[Scatter-plot illustration of the iterations omitted.]
(A minimal R sketch of this loop, run on the FIFA points above, follows the standardization discussion below.)

Let's Do This with the Denver Crime Data
• Standardize the data – it needs to be standardized to have similar scaling.
• Choose the data to cluster around – for Denver I'll choose Crime and Population. NOTE – I'm only using two variables for demonstration purposes; in reality it could be many.
• Choose the number of clusters – the "K" in K-Means. We will try k = 2, 3, …, 10.

So, let's do this with Denver. Step 1: STANDARDIZE!
Two ways to re-scale (or standardize) data:
• z_i = (x_i - mean(x)) / sd(x) – yields data with an N(0,1) distribution; generally better for data with outliers.
• x_i' = (x_i - min(x)) / (max(x) - min(x)) – yields data distributed over [0, 1]; generally better for data that is tightly clustered.
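To make steps 1 to 5 concrete, here is a minimal, hand-rolled sketch of the k-means loop in R, run on the single-variable FIFA points listed above. It is a sketch only: the object names (pts, centres, cl) are illustrative, the starting centres are chosen at random, and empty-cluster handling is skipped.

# FIFA points for the 20 national teams listed above
pts <- c(1209, 1293, 1433, 1454, 1368, 1625, 1406, 1599, 1755, 1286,
         1245, 1201, 1367, 1339, 1371, 1548, 1323, 1248, 1344, 1281)

k <- 3
set.seed(8430)
centres <- sample(pts, k)                      # step 2: arbitrary starting centres

for (iter in 1:20) {                           # step 5: repeat until convergence
  # step 3: assign each point to the nearest centre
  cl <- apply(abs(outer(pts, centres, "-")), 1, which.min)
  # step 4: recompute each centre as the mean of its assigned points
  new_centres <- sapply(1:k, function(j) mean(pts[cl == j]))
  if (isTRUE(all.equal(new_centres, centres))) break
  centres <- new_centres
}

split(pts, cl)    # the resulting clusters of national-team points
centres           # the final centres

In practice you would simply call kmeans(pts, centers = 3); that built-in function is what the Denver example below uses.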
So, let's do this with Denver. Step 2: Choose variables; Step 3: Choose k
For this experiment, we will cluster around:
◦ Population
◦ Crime

#Create a quick normalization function
norm01 <- function(x) { return((x - min(x)) / (max(x) - min(x))) }
#A different normalization function
normn  <- function(x) { return((x - mean(x)) / sd(x)) }
denver$pop_minmax <- norm01(denver$pop)
denver$pop_normx  <- normn(denver$pop)
# (crime_normx is created in the same way from the crime variable)

clstrden <- kmeans(denverclstrdata, iter.max = 10, centers = 2, nstart = 10)
# iter.max – the number of iterations to try for convergence
# nstart   – the number of random starting points to try

K-means clustering with 2 clusters of sizes 30, 13

Cluster means:
   pop_normx crime_normx
1  0.3657893  -0.5312835
2 -0.8441292   1.2260388

Within cluster sum of squares by cluster:
[1] 33.070109  9.643542
 (between_SS / total_SS = 49.2 %)

Evaluating the output
1. Notice the size of the clusters (30, 13).
2. Notice the means of each centroid (centre) for each cluster.
3. Notice the "within cluster sum of squares": total_SS = between_SS + tot.withinss, where tot.withinss = sum(withinss). Here tot.withinss = 33.070109 + 9.643542 = 42.71, and between_SS / total_SS = 49.2% is the ratio of variance explained.

Choosing the number of clusters ("k") using the "elbow"
There is no hard and fast rule about the number of clusters. Ways of choosing include:
◦ business rules
◦ experience rules
◦ purpose-based rules
◦ finding the "most for the least" using the elbow
◦ the size of the clusters
Notice the point of inflection for the k value. Let's choose 3 (going to 4 gives some small cluster sizes).
Note – sometimes there is a very sharply defined 'elbow'.
Note – there are other ways (e.g. silhouette, gap analysis) to determine the number of clusters.
(A short R sketch that produces this table appears at the end of the k-means discussion, just before the k-nearest-neighbours section.)

Values of k for the Denver clusters:
k                     2     3     4     5     6     7     8     9     10
between_SS/total_SS   49%   67%   75%   81%   85%   87%   90%   91%   92%
tot. within SS        42.7  27.4  20.8  16.3  14.2  12.2  10.3  9.8   8.5

Choosing k visually
Notice the size and distribution of the clusters, and the position of the centres.
[Scatter plots of the cluster assignments omitted.]

Summarizing the clusters (k = 3)
Cluster  Pop    PopChg  Child  Lunch  IncChg  Income    Crime  n
1        6.25   9.04    25.6   35.2   26.8    $46,939   63.8   18
2        11.4   2.99    29     54.4   27.1    $40,556   73     12
3        3.98   10.9    27.5   68.7   28      $40,676   146    13

In groups of 3 or 4, answer these questions:
• Based on these summaries, what do we see?
• Can we use these to help plan social services?
• What descriptive names can we give them?

Potential issues
• Number of clusters – more art than science; there is no hard and fast rule regarding the number of clusters or even cluster size.
• Convergence – depending on the data, the algorithm may not converge to a solution.
• Interpretability – often the results are not useful.
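Before moving on to k-nearest neighbours, here is a short sketch of how the elbow table above can be produced. It reuses the denverclstrdata object from the kmeans() call earlier; the exact figures depend on the data and on the random starts, so treat the numbers as indicative rather than definitive.

set.seed(8430)
elbow <- sapply(2:10, function(k) {
  fit <- kmeans(denverclstrdata, centers = k, iter.max = 10, nstart = 10)
  c(tot_within_ss    = fit$tot.withinss,            # total within-cluster SS
    between_to_total = fit$betweenss / fit$totss)   # share of variance explained
})
colnames(elbow) <- 2:10
round(t(elbow), 3)

# plot the elbow and look for the point of inflection
plot(2:10, elbow["tot_within_ss", ], type = "b",
     xlab = "k (number of clusters)", ylab = "total within-cluster SS")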
K-NEAREST NEIGHBOURS

K-Nearest Neighbour overview
Can be used for classification or prediction. It is a very simple algorithm, with only two user inputs:
◦ the distance measurement
◦ k – the number of neighbours
It is a very "lazy learner" – it does not generalize a function; it memorizes the dataset.
It requires 3 things:
◦ a feature space (the training data)
◦ a distance metric, to compute the distance between records
◦ the value of k – the number of nearest neighbours to retrieve, from which to take the majority class
To classify an unknown record:
◦ compute its distance to the training records
◦ identify the k nearest neighbours
◦ use the class labels of those neighbours to determine the class label of the unknown record
(Source for this KNN section: ICDM, "Top Ten Data Mining Algorithms – k Nearest Neighbor Classification", December 2006.)

KNN – main decisions
Common distance metrics:
◦ Euclidean distance (continuous variables)
◦ Hamming distance (overlap metric)
◦ discrete metric (Boolean metric)
Determining the class from the list of k nearest neighbours:
◦ take the majority vote of the class labels among the k nearest neighbours, or
◦ weight each vote by a factor w = 1/d (generalized linear interpolation) or 1/d².

Choosing the value of k:
◦ if k is too small, the classifier is sensitive to noise points
◦ if k is too large, the neighbourhood may include points from other classes
◦ choose an odd value for k, to eliminate ties

KNN – example
[Illustration: an unknown point "?" plotted among square and triangle classes. With k = 1 it is assigned to the square class, with k = 3 to the triangle class, and with k = 7 to the square class.]

KNN – advantages
• A simple technique that is easily implemented; building the model is inexpensive.
• An extremely flexible classification scheme that does not involve preprocessing.
• Well suited to multi-modal classes (classes of multiple forms) and to records with multiple class labels.
• The asymptotic error rate is at most twice the Bayes rate (Cover & Hart, 1967); it can sometimes be the best method.

KNN – disadvantages
• Classifying unknown records is relatively expensive: it requires distance computations to find the k nearest neighbours and is computationally intensive, especially as the training set grows.
• Accuracy can be severely degraded by the presence of noisy or irrelevant features.
• The data needs to be standardized to be effective.
• NN classification expects the class-conditional probability to be locally constant (a bias in high dimensions).
(A short R sketch of k-NN classification appears after the comparison below.)

K-Means vs. KNN
• K-Means is a clustering algorithm; k-nearest neighbours is a classification algorithm.
• K-Means uses the distance from data points to k centroids to cluster the data into k groups; KNN calculates the k nearest data points to a point x and uses them to determine which class x belongs to.
• In K-Means the centroids are not necessarily data points; in KNN the "centroid" is the point x to be classified.
• K-Means updates each centroid on every pass using all the data in its cluster; in KNN the data point to be classified remains the same.
• K-Means must iterate over the data until the centres stop moving; KNN only requires k distance calculations.
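As a quick illustration of the k-NN procedure described above, here is a short R sketch using the knn() function from the class package (one of R's recommended packages). The iris data and the single "unknown" record are stand-ins chosen for the example, not part of the course data; note that the features are standardized first, as the disadvantages list recommends.

library(class)

train_x <- scale(iris[, 1:4])            # feature space (training data), standardized
train_y <- iris$Species                  # class labels

# one unknown record, standardized with the training centre and scale
unknown <- scale(matrix(c(6.0, 3.0, 4.5, 1.5), nrow = 1),
                 center = attr(train_x, "scaled:center"),
                 scale  = attr(train_x, "scaled:scale"))

# Euclidean distance, majority vote among the k = 7 nearest neighbours
# (an odd k helps avoid ties, as noted above)
knn(train = train_x, test = unknown, cl = train_y, k = 7)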
PROG8430 – Data Analysis, Modeling and Algorithms
LECTURE 12 – CLASSIFICATION TECHNIQUES (2): LOGISTIC REGRESSION

[Flowchart: what type of data analysis is this? Did you summarize the data? No – stop, it is not a data analysis. Did you report the summaries without interpretation? Yes – descriptive. Did you quantify whether your discoveries will hold in a new sample? No – exploratory. Are you trying to predict measurements for individuals? Yes – predictive. Are you trying to determine how changing the average of one measurement affects another? No – inferential. Is the effect you are seeking average or deterministic? Average – causal; deterministic – mechanistic.]

[The algorithm-selection flowchart from Lecture 10 (summarized and condensed, adapted from Li, Hui, 2017) is repeated here; this lecture covers classification with logistic regression.]

CLASSIFICATION – LOGISTIC REGRESSION

Blood donation dataset – dataset information
The donor database of the Blood Transfusion Service Center in Hsin-Chu City, Taiwan. The center passes their blood transfusion … [the dataset description is cut off at this point in the source].
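Since the dataset description is cut off, here is a minimal sketch of the technique named above, purely for illustration, using R's built-in mtcars data (predicting the binary transmission variable am from car weight wt). The blood-donation data would be modelled the same way once its columns are known; nothing below is specific to that dataset.

# glm() with family = binomial fits a logistic regression for a binary outcome
fit <- glm(am ~ wt, data = mtcars, family = binomial)
summary(fit)

# predicted probability of a manual transmission for a car with wt = 3.0
predict(fit, newdata = data.frame(wt = 3.0), type = "response")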
cluster="" size.="" number="" of="" clusters="" •="" depending="" on="" the="" data,="" the="" algorithm="" may="" not="" converge="" to="" a="" solution.="" convergence="" •="" often="" the="" results="" are="" not="" useful="" interpretability="" k-nearest="" neighbours="" k-nearest="" neighbour="" overview="" can="" be="" used="" for="" classification="" or="" predictions.="" very="" simple="" algorithm="" ◦="" i.e.="" only="" two="" user="" inputs:="" ◦="" distance="" measurement="" ◦="" k="" –="" i.e.="" number="" of="" neighbours="" very="" “lazy="" learner”="" ◦="" i.e.="" not="" used="" to="" generalize="" a="" function,="" but="" a="" dataset.="" requires="" 3="" things:="" ◦="" feature="" space(training="" data)="" ◦="" distance="" metric="" ◦="" to="" compute="" distance="" between="" records="" ◦="" the="" value="" of="" k="" ◦="" the="" number="" of="" nearest="" neighbors="" to="" retrieve="" from="" which="" to="" get="" majority="" class="" to="" classify="" an="" unknown="" record:="" ◦="" compute="" distance="" to="" other="" training="" records="" ◦="" identify="" k="" nearest="" neighbors="" ◦="" use="" class="" labels="" of="" nearest="" neighbors="" to="" determine="" the="" class="" label="" of="" unknown="" record="" icdm:="" top="" ten="" data="" mining="" algorithms="" k="" nearest="" neighbor="" classification="" december="" 2006="" knn="" -="" summary="" 20="" knn="" –="" main="" decisions="" common="" distance="" metrics:="" ◦="" euclidean="" distance(continuous="" distribution)="" ◦="" hamming="" distance="" (overlap="" metric)="" ◦="" discrete="" metric="" (boolean="" metric)="" determine="" the="" class="" from="" k="" nearest="" neighbor="" list="" ◦="" take="" the="" majority="" vote="" of="" class="" labels="" among="" the="" k-nearest="" neighbors="" ◦="" weighted="" factor="" w="1/d(generalized" linear="" interpolation)="" or="" 1/d2="" icdm:="" top="" ten="" data="" mining="" algorithms="" k="" nearest="" neighbor="" classification="" december="" 2006="" choosing="" the="" value="" of="" k:="" ◦="" if="" k="" is="" too="" small,="" sensitive="" to="" noise="" points="" ◦="" if="" k="" is="" too="" large,="" neighborhood="" may="" include="" points="" from="" other="" classes="" ◦="" choose="" an="" odd="" value="" for="" k,="" to="" eliminate="" ties="" ="" k="3:" ="" belongs="" to="" triangle="" class="" ="" k="7:" ="" belongs="" to="" square="" class="" icdm:="" top="" ten="" data="" mining="" algorithms="" k="" nearest="" neighbor="" classification="" december="" 2006="" ="" k="1:" ="" belongs="" to="" square="" class="" knn="" –="" example="" simple="" technique="" that="" is="" easily="" implemented="" building="" model="" is="" inexpensive="" extremely="" flexible="" classification="" scheme="" ◦="" does="" not="" involve="" preprocessing="" well="" suited="" for="" ◦="" multi-modal="" classes="" (classes="" of="" multiple="" forms)="" ◦="" records="" with="" multiple="" class="" labels="" asymptotic="" error="" rate="" at="" most="" twice="" bayes="" rate="" ◦="" cover="" &="" hart="" paper="" (1967)="" can="" sometimes="" be="" the="" best="" method="" icdm:="" top="" ten="" data="" mining="" algorithms="" k="" nearest="" neighbor="" classification="" december="" 2006="" knn="" –="" advantages="" classifying="" unknown="" records="" are="" relatively="" expensive="" ◦="" requires="" distance="" computation="" of="" k-nearest="" neighbors="" ◦="" computationally="" intensive,="" especially="" when="" the="" size="" of="" the="" training="" 
set="" grows="" accuracy="" can="" be="" severely="" degraded="" by="" the="" presence="" of="" noisy="" or="" irrelevant="" features="" data="" needs="" to="" be="" standardized="" to="" be="" effective="" nn="" classification="" expects="" class="" conditional="" probability="" to="" be="" locally="" constant="" ◦="" bias="" of="" high="" dimensions="" icdm:="" top="" ten="" data="" mining="" algorithms="" k="" nearest="" neighbor="" classification="" december="" 2006="" knn="" –="" disadvantges="" k-means="" k-nearest="" neighbors="" clustering="" algorithm="" classification="" algorithm="" uses="" distance="" from="" data="" points="" to="" k-="" centroids="" to="" cluster="" data="" into="" k-groups.="" calculates="" k="" nearest="" data="" points="" from="" data="" point="" x.="" uses="" these="" points="" to="" determine="" which="" class="" x="" belongs="" to="" centroids="" are="" not="" necessarily="" data="" points.="" “centroid”="" is="" the="" point="" x="" to="" be="" classified.="" updates="" centroid="" on="" each="" pass="" by="" calculations="" over="" all="" data="" in="" a="" class.="" data="" point="" to="" be="" classified="" remains="" the="" same.="" must="" iterate="" over="" data="" until="" center="" point="" doesn’t="" move.="" only="" requires="" k="" distance="" calculations.="" k-means="" vs.="" knn="" title="" layout="" prog8430="" –="" data="" analysis,="" modeling="" and="" algorithms="" lecture="" 12="" –="" classification="" techniques="" (2)="" logistic="" regression="" .="" did="" you="" summarize="" the="" data?="" stop!="" not="" a="" data="" analysis="" no="" did="" you="" report="" the="" summaries="" without="" interpretation?="" descriptive="" did="" you="" quantify="" whether="" your="" discoveries="" will="" hold="" in="" a="" new="" sample?="" exploratory="" no="" yes="" yes="" no="" are="" you="" trying="" to="" determine="" how="" changing="" the="" average="" of="" one="" measurement="" affects="" another?="" yes="" are="" you="" trying="" to="" predict="" measurements="" for="" individuals?="" is="" the="" effect="" you="" are="" seeking="" average="" or="" deterministic?="" inferential="" predictive="" causal="" mechanistic="" yes="" no="" no="" yes="" average="" deterministic="" summarized="" (and="" condensed!)="" algorithm="" selection="" start="" dimension="" reduction="" topic="" modeling="" probabilistic="" principal="" component="" analysis="" singular="" value="" decomposition="" forward/="" backward="" selection="" have="" responses="" predicting="" numeric="" speed="" or="" accuracy="" decision="" tree="" linear="" regression="" random="" forest="" neural="" network="" gradient="" boosting="" tree="" speed="" or="" accuracy="" explainable="" data="" is="" too="" large="" hierarchical="" categorical="" variables="" prefer="" probability="" support="" vector="" machines="" randfom="" forest="" neural="" network="" gradient="" boosting="" tree="" decision="" tree="" logistic="" regression="" naïve="" bayes="" linear="" svm="" naïve="" bayes="" hierarchical="" k-modesk-means="" gaussian="" mixture="" yes="" no="" no="" yes="" yesno="" adapted="" from="" li,="" hui="" (2017)="" yes="" yes="" noyes="" no="" yes="" yes="" yes="" no="" no="" accuracy="" speed="" yes="" no="" no="" yes="" accuracy="" speed="" classification="" logistic="" regression="" blood="" donation="" dataset="" dataset="" information="" the="" donor="" database="" of="" blood="" transfusion="" service="" center="" in="" hsin-chu="" city="" in="" taiwan.="" the="" 
center="" passes="" their="" blood="">