I need a clean copy of the code and solutions in a Word document and an R script.
BANA 200 Assignment 3: Cluster Analysis
Due Tuesday, September 1st on Canvas by 6 PM (PST)
220 Points

Recall the data cleaning assignments we did on the investments data. We showed that customers who had negative service encounters took out a lot more money than customers who had positive service encounters. One main limitation of the previous work was that it was carried out across the entire sample of data without looking to see whether different segments of customers responded differently to bad vs. good service.

The executives at the investments firm have asked you to reanalyze the data in greater detail. Specifically, the executives want to know if there are certain customer segments who are at higher risk of taking out more money. To answer this question, you will perform a cluster analysis on the data and calculate the average change in dollars (Chg good service – Chg bad service) within each segment to see if there are certain groups of customers the firm needs to make sure always have good service encounters.

You must provide all of your R code in a script file, and we must be able to replicate your results in order for you to receive full credit for this assignment. Please upload both your answers and the R file.

The dataset "hw3_data.txt" contains 1759 observations and 10 variables. The most important variables are defined below:

1. "categ": Describes whether a customer had a good or a bad customer service experience. 860 customers had a bad service experience (answered a 1 or 2 on the customer satisfaction survey), and 899 customers had a good service experience (answered a 4 or 5 on the survey). Customers answering a "3" (average service experience) were omitted from this dataset.
2. "Inv_Chg": This is the primary variable of interest to the firm. It represents the change in investment dollars for each customer from 1 month before the service encounter to 3 months after the service encounter: Inv_Chg = Inv_3M_Aft – Inv_1M_Bef.
3. "Inv_1M_Bef": The total investment dollars the customer had with the firm 1 month before the service encounter and survey.
4. "cust_age": The customer's age in years at the time of the survey.
5. "cust_tenure": How long the customer has been with the firm (measured in years) at the time of the survey.
6. "tottrans": Total monthly transactions the customer had with the firm at the time of the survey.
7. "cust_id": Customer ID number; a unique ID field used to identify each customer.

The firm has asked you to do a segmentation (i.e., cluster analysis) on the following four variables: 1) Inv_1M_Bef, 2) cust_age, 3) cust_tenure, and 4) tottrans, to see if you can identify segments of customers who are especially likely to take out a lot of money should a bad service encounter occur. If you can identify such segments, this can be used to proactively identify future customers who may also be especially at risk of disengaging after a bad customer service experience. The firm believes these four variables are especially important for clustering purposes.

Part A: Standardize Data (20 Points)
Standardize the four variables Inv_1M_Bef, cust_age, cust_tenure, and tottrans to a mean of 0 and a standard deviation of 1. Don't replace the original data. Instead, create a new data matrix called "X.scaled" that contains the four standardized variables so that they are all scaled the same way. Be sure to show your work in R.
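A minimal R sketch for Part A is below. It assumes hw3_data.txt is a tab-delimited file with a header row and that the column names match the variable list above; adjust the read.table arguments if the file is laid out differently.

# Read the raw data (adjust sep/header if the file is laid out differently)
dat <- read.table("hw3_data.txt", header = TRUE, sep = "\t")

# Standardize the four clustering variables to mean 0 and sd 1,
# leaving the original columns in dat untouched
X.scaled <- scale(dat[, c("Inv_1M_Bef", "cust_age", "cust_tenure", "tottrans")])

# Sanity check: column means should be ~0 and sds ~1
round(colMeans(X.scaled), 4)
round(apply(X.scaled, 2, sd), 4)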
Part B: Elbow Plot (20 Points)
Determine the number of clusters by first using an elbow plot. For the elbow plot, do the following:
· Set the maximum number of clusters to consider to 20.
· Set the number of starting values to 1000.
· Set the number of iterations for the algorithm to 1000.
· Set the y-axis range of the elbow plot to go from 0 to 8000.
· Paste the elbow plot below, showing the results for 1 up to 20 clusters, and interpret the results.
Based on the results, how many clusters should you retain? How do you know?

Part C: NbClust Approach to the Number of Clusters (20 Points)
Now estimate the optimal number of clusters using the "NbClust" package in R, which examines more than 20 different approaches to determining the optimal number of clusters. Create a bar plot and paste it below. Based on the "majority rules" results of NbClust and your elbow plot, how many clusters do you think there are in the data? Do the results agree? When creating the diagnostics using NbClust, set max.nc = 20 (maximum number of clusters equal to 20). Make sure to also set method = "kmeans".

Part D: Run a K-Means Cluster Analysis Using the Optimal Number of Clusters Found in Parts B and C (50 Points)
Using the optimal number of clusters determined in Parts B and C above, run your k-means cluster analysis on the standardized data "X.scaled" using 1000 different starting values and allowing for 1000 maximum iterations. Submit the following commands exactly as shown below (including the set.seed command):

set.seed(123)
results <- kmeans(X.scaled, centers=k, iter.max=1000, nstart=1000)

where k is the optimal number of clusters you determined in Parts B and C above.

Once you have run your cluster analysis, you may need to reorder your cluster numbers. The firm has asked that you relabel the cluster numbers assigned by R according to the cluster center values for the variable "Inv_1M_Bef". Because R randomly assigns the cluster labels, you can run the same algorithm twice and get different cluster number labels (e.g., "1", "2", "3", etc.) corresponding to the cluster assignments; the labels from k-means are arbitrary. To correct for this issue and to make sure the firm understands the clusters you have created, the firm is asking that you reassign the cluster labels from smallest to largest based on the cluster center values for Inv_1M_Bef (smallest = 1, largest = k).

For example, assume you have your k clusters and that k = 3. The center values for Inv_1M_Bef are: cluster 1 = 5.6, cluster 2 = 2.3, cluster 3 = 1.8. You will need to relabel not only the cluster centers in R but also the cluster number assignments, as follows: old.clust.num = results$cluster, then map old.clust.num to new.clust.num for every customer. In this example, the cluster numbers are reassigned so that cluster 1 becomes cluster 3, cluster 2 stays the same, and cluster 3 becomes cluster 1, based on ordering the Inv_1M_Bef center values from smallest to largest.

Once you have created the remapping of all cluster numbers, paste a table below with the following three columns: 1) average value of Inv_1M_Bef (the cluster center values from your k-means algorithm), 2) old cluster assignment number, and 3) new cluster assignment number. Round the average values of Inv_1M_Bef to four decimal places.

Once you have mapped the old cluster numbers to the new cluster numbers, report the frequency count of the number of customers in each new cluster by using the table command. For example, calculate table(new.clust.num) and report it below. In other words, how many customers are in each segment? Be sure to use the new cluster numbers.
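A hedged R sketch of one way to carry out Parts B–D, continuing from dat and X.scaled in the Part A sketch above; the value of k is only a placeholder and should be set to whatever the elbow plot and NbClust diagnostics indicate.

# Part B: elbow plot -- total within-cluster SS for 1 to 20 clusters
set.seed(123)
wss <- sapply(1:20, function(kk)
  kmeans(X.scaled, centers = kk, iter.max = 1000, nstart = 1000)$tot.withinss)
plot(1:20, wss, type = "b",
     xlab = "Number of clusters", ylab = "Total within-cluster sum of squares",
     ylim = c(0, 8000))

# Part C: NbClust diagnostics (majority rule across its indices)
library(NbClust)
set.seed(123)
nb <- NbClust(X.scaled, min.nc = 2, max.nc = 20, method = "kmeans")
barplot(table(nb$Best.nc[1, ]),
        xlab = "Number of clusters chosen", ylab = "Number of criteria",
        main = "Number of clusters chosen by NbClust criteria")

# Part D: final k-means run using the chosen number of clusters
k <- 3   # placeholder only -- replace with the optimal k from Parts B and C
set.seed(123)
results <- kmeans(X.scaled, centers = k, iter.max = 1000, nstart = 1000)

# Relabel clusters 1..k from smallest to largest Inv_1M_Bef center
old.clust.num <- results$cluster
ord <- order(results$centers[, "Inv_1M_Bef"])   # old labels sorted by center value
new.clust.num <- match(old.clust.num, ord)      # each customer's new label

# Mapping table: (standardized) Inv_1M_Bef center, old label, new label
map.tab <- data.frame(
  Inv_1M_Bef_center = round(results$centers[, "Inv_1M_Bef"], 4),
  old.cluster = 1:k,
  new.cluster = match(1:k, ord))
map.tab

# Frequency count of customers in each new cluster
table(new.clust.num)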
Once you have mapped the old cluster numbers to the new cluster numbers, use the aggregate function in R to calculate the cluster centers on the original data. In other words, do not report the cluster centers from your "results" object (from Part D above), because that data is standardized and hard to interpret; instead, calculate the average cluster center values for Inv_1M_Bef, cust_age, cust_tenure, and tottrans based on the original raw (unscaled) data. The firm does not want to look at averages of the scaled data because it is too hard to interpret what the results mean; rather, they want you to report the averages by new cluster number for the unscaled, raw data. Conduct this analysis and be sure to double-check your work.

Paste in a table below with the averages for the four variables, broken out by cluster number. For example, if you have k clusters, you should have k rows and five columns (the first column should be the new cluster number: 1, 2, ..., k). The other four columns are the averages of your four raw variables by new cluster number. Round all averages to two decimal places. Be sure to show all of your work in R in order to receive full credit.

Part E: Interpretation of the Segments (30 Points)
Based on the table above, describe the major differences you see in the segments. Summarize some major findings and insights. Is there anything that surprised you? Do the cluster center values look roughly the same, or do they vary a lot?

Part F: Amount of Money Lost by Cluster and Good – Bad Service (30 Points)
Now that you know which segment each customer is in, calculate the average change in dollars (average of Inv_Chg) by cluster number and by the variable categ. Hint: use the aggregate function to average Inv_Chg by 1) cluster number and 2) categ. Be sure to use the new cluster assignments. Once you have calculated the k*2 values (the change in investments for each of the k clusters and for good vs. bad service), calculate the average change within each cluster as "good" – "bad". For example, if for cluster 1 the average of Inv_Chg good = $200 and the average of Inv_Chg bad = $40, then the average change (good – bad) of Inv_Chg for cluster 1 is $200 – $40 = $160, which means the firm loses an average of $160 for each customer in this segment who has a bad (vs. good) service encounter. Repeat this calculation for all of your clusters. Report the results below, rounded to two decimal places. Your final table should have k rows and four columns: 1) new cluster number, 2) average Inv_Chg bad, 3) average Inv_Chg good, and 4) average Inv_Chg good – average Inv_Chg bad.

Are there any segments where having a bad service experience is causing the firm to lose a very large amount of money relative to having a good service encounter? In other words, where does having bad service "hurt the most"? Identify the top two clusters where the firm is losing the most money by having bad (vs. good) service. How much do the results vary by segment? Do the results look roughly the same, or does the amount of money the firm is losing vary a lot across clusters? Are there any results that don't make sense? If so, can you come up with an explanation for why?
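A hedged continuation of the sketch above covering the raw-data cluster averages in Part D and the Part F comparison. It assumes dat and new.clust.num from the earlier sketches, and it assumes categ is coded as "bad" / "good" (check unique(dat$categ) and adjust the labels if they differ).

# Part D (continued): average of each raw, unscaled variable by new cluster number
dat$new.clust.num <- new.clust.num
raw.centers <- aggregate(dat[, c("Inv_1M_Bef", "cust_age", "cust_tenure", "tottrans")],
                         by = list(new.cluster = dat$new.clust.num), FUN = mean)
raw.centers[, -1] <- round(raw.centers[, -1], 2)
raw.centers

# Part F: average Inv_Chg by new cluster number and service category
chg <- aggregate(Inv_Chg ~ new.clust.num + categ, data = dat, FUN = mean)

# Build the k-row summary table: bad, good, and good - bad per cluster
# (assumes categ is coded "bad" / "good" -- adjust to the actual labels)
bad  <- chg[chg$categ == "bad",  c("new.clust.num", "Inv_Chg")]
good <- chg[chg$categ == "good", c("new.clust.num", "Inv_Chg")]
names(bad)[2]  <- "avg_Inv_Chg_bad"
names(good)[2] <- "avg_Inv_Chg_good"
loss.tab <- merge(bad, good, by = "new.clust.num")
loss.tab$good_minus_bad <- loss.tab$avg_Inv_Chg_good - loss.tab$avg_Inv_Chg_bad
loss.tab[, -1] <- round(loss.tab[, -1], 2)
loss.tab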
Part G: Identifying Future Customers Who Might Take Out a Lot of Money
“results”="" object="" (from="" part="" d="" above)="" because="" this="" data="" is="" standardized="" and="" hard="" to="" interpret;="" rather,="" calculate="" the="" average="" cluster="" center="" values="" for="" inv_1m_bef,="" cust_age,="" cust_tenure,="" and="" tottrans="" based="" on="" the="" original="" raw="" (unscaled)="" data.="" the="" firm="" does="" not="" want="" to="" look="" at="" averages="" for="" the="" scaled="" data="" because="" it="" is="" too="" hard="" to="" interpret="" what="" the="" results="" means;="" rather,="" they="" want="" you="" to="" report="" the="" averages="" by="" new="" cluster="" number="" for="" the="" unscaled,="" raw="" data.="" conduct="" this="" analysis="" and="" be="" sure="" to="" double="" check="" your="" work.="" paste="" in="" a="" table="" below="" the="" averages="" for="" the="" four="" variables,="" broken="" out="" by="" cluster="" number.="" for="" example,="" if="" you="" have="" k="" clusters,="" you="" should="" have="" k="" rows="" and="" five="" columns="" (the="" first="" column="" should="" be="" the="" new="" cluster="" number:="" 1,2,…k).="" the="" other="" four="" columns="" are="" the="" averages="" for="" your="" four="" raw="" variables="" by="" new="" cluster="" number.="" round="" all="" averages="" to="" two="" decimal="" places.="" be="" sure="" to="" show="" all="" of="" your="" work="" in="" r="" in="" order="" to="" receive="" full="" credit.="" part="" e.="" interpretation="" of="" the="" segments="" (30="" points).="" based="" on="" the="" table="" above,="" describe="" the="" major="" differences="" you="" see="" in="" the="" segments.="" summarize="" some="" major="" findings="" and="" insights.="" is="" there="" anything="" that="" surprised="" you?="" do="" the="" cluster="" center="" values="" look="" roughly="" the="" same="" or="" do="" they="" vary="" a="" lot?="" part="" f.="" amount="" of="" money="" lost="" by="" cluster="" and="" good="" –="" bad="" service="" (30="" points).="" now="" that="" you="" know="" which="" segment="" each="" customer="" is="" in,="" calculate="" the="" average="" change="" in="" dollars="" (average="" of="" inv_chg)="" by="" cluster="" number="" and="" by="" the="" variable="" categ.="" hint:="" use="" the="" aggregate="" function="" to="" average="" inv_chg="" by="" 1)="" cluster="" number="" and="" 2)="" by="" categ.="" be="" sure="" to="" use="" the="" new="" cluster="" assignments.="" once="" you="" have="" calculated="" the="" k*2="" values="" (change="" in="" investments="" for="" each="" of="" the="" k="" clusters="" and="" for="" good="" vs.="" bad="" service),="" calculate="" the="" average="" change="" within="" each="" cluster="" as="" “good”="" –="" “bad”.="" for="" example,="" if="" for="" cluster="" 1,="" the="" average="" of="" inv_chg="" good="$200" and="" the="" average="" of="" inv_chg="" bad="$40," the="" average="" change="" (good="" –="" bad)="" of="" inv_chg="" for="" cluster="" 1="" is="" $200="" –="" $40="$160," which="" means="" the="" firm="" loses="" an="" average="" of="" $160="" for="" each="" customer="" in="" this="" segment="" having="" a="" bad="" (vs.="" good)="" service="" encounter.="" repeat="" this="" calculation="" for="" all="" of="" your="" clusters.="" report="" the="" results="" below,="" rounded="" to="" two="" decimal="" places.="" your="" final="" table="" should="" have="" k="" rows="" and="" four="" columns:="" 1)="" new="" cluster="" number,="" 2)="" average="" inv_chg="" bad,="" 3)="" average="" inv_chg="" good,="" 
4)="" average="" inv_chg="" good="" –="" average="" inv_chg="" bad.="" are="" there="" any="" segments="" where="" having="" a="" bad="" service="" experience="" is="" causing="" the="" firm="" to="" lose="" a="" very="" large="" amount="" of="" money="" relative="" to="" having="" a="" good="" service="" encounter?="" in="" other="" words,="" where="" does="" having="" bad="" service="" “hurt="" the="" most”?="" identify="" the="" top="" two="" clusters="" where="" the="" firm="" is="" losing="" the="" most="" money="" by="" having="" bad="" (vs.="" good)="" service.="" how="" much="" do="" the="" results="" vary="" by="" segment?="" do="" the="" results="" look="" roughly="" the="" same="" or="" does="" the="" amount="" of="" money="" that="" the="" firm="" is="" losing="" vary="" a="" lot="" across="" each="" cluster?="" are="" there="" any="" results="" that="" don’t="" make="" any="" sense?="" if="" so,="" can="" you="" come="" up="" with="" an="" explanation="" for="" why="" some="" results="" don’t="" make="" sense?="" part="" g.="" identifying="" future="" customers="" who="" might="" take="" out="">- kmeans(x.scaled, centers=k, iter.max=1000, nstart=1000) where k is the optimal number of clusters you determined from parts b and c above. once you have run your cluster analysis, you may need to reorder your cluster numbers. the firm has asked that you relabel the cluster numbers assigned by r, according to the cluster center values for the variable “inv_1m_bef”. because r randomly assigns the cluster labels, you can run the same algorithm twice and get different cluster number labels e.g. “1”, “2”, “3” etc. corresponding to the cluster assignments. the labels from k-means are arbitrary. in order to correct for this issue and to make sure the firm understands the clusters you have created, the firm is asking that you reassign the cluster labels from smallest to largest based on the cluster center values for inv_1m_bef (smallest = 1, largest = k). for example, assume you have your k clusters, and assume k = 3. the values for the centers for inv_1m_bef are: cluster 1 = 5.6, cluster 2 = 2.3, cluster 3 = 1.8. you will need to relabel not only the cluster centers in r but also the cluster number assignments as follows: old.clust.num = results$cluster. then map old.clust.num to new.clust.num for every customer: in the above example, the cluster numbers have been reassigned (cluster 1 becomes cluster3, cluster 2 stays the same, and cluster 3 becomes cluster 1). this is based on the values of inv_1m_bef and by ordering it from smallest to largest. once you have created the remapping of all cluster numbers, paste a table below with the following 3 columns: 1) average value of inv_1m_bef (cluster center values from your k-means algorithm), 2) old cluster assignment number, and 3) new cluster assignment number. round the average values of inv_1m_bef to four decimal places. once you have mapped the old cluster numbers to the new cluster numbers, report the frequency count of the number of customers in each new cluster number by using the table command. so for example, calculate table(new.clust.num) and report it below. in other words, how many customers are in each segment? but be sure to use the new cluster numbers assigned. once you have mapped the old customer numbers to the new cluster numbers, use the aggregate function in r to calculate the cluster centers on the original data. 
in other words, do not report the cluster centers from your “results” object (from part d above) because this data is standardized and hard to interpret; rather, calculate the average cluster center values for inv_1m_bef, cust_age, cust_tenure, and tottrans based on the original raw (unscaled) data. the firm does not want to look at averages for the scaled data because it is too hard to interpret what the results means; rather, they want you to report the averages by new cluster number for the unscaled, raw data. conduct this analysis and be sure to double check your work. paste in a table below the averages for the four variables, broken out by cluster number. for example, if you have k clusters, you should have k rows and five columns (the first column should be the new cluster number: 1,2,…k). the other four columns are the averages for your four raw variables by new cluster number. round all averages to two decimal places. be sure to show all of your work in r in order to receive full credit. part e. interpretation of the segments (30 points). based on the table above, describe the major differences you see in the segments. summarize some major findings and insights. is there anything that surprised you? do the cluster center values look roughly the same or do they vary a lot? part f. amount of money lost by cluster and good – bad service (30 points). now that you know which segment each customer is in, calculate the average change in dollars (average of inv_chg) by cluster number and by the variable categ. hint: use the aggregate function to average inv_chg by 1) cluster number and 2) by categ. be sure to use the new cluster assignments. once you have calculated the k*2 values (change in investments for each of the k clusters and for good vs. bad service), calculate the average change within each cluster as “good” – “bad”. for example, if for cluster 1, the average of inv_chg good = $200 and the average of inv_chg bad = $40, the average change (good – bad) of inv_chg for cluster 1 is $200 – $40 = $160, which means the firm loses an average of $160 for each customer in this segment having a bad (vs. good) service encounter. repeat this calculation for all of your clusters. report the results below, rounded to two decimal places. your final table should have k rows and four columns: 1) new cluster number, 2) average inv_chg bad, 3) average inv_chg good, 4) average inv_chg good – average inv_chg bad. are there any segments where having a bad service experience is causing the firm to lose a very large amount of money relative to having a good service encounter? in other words, where does having bad service “hurt the most”? identify the top two clusters where the firm is losing the most money by having bad (vs. good) service. how much do the results vary by segment? do the results look roughly the same or does the amount of money that the firm is losing vary a lot across each cluster? are there any results that don’t make any sense? if so, can you come up with an explanation for why some results don’t make sense? part g. identifying future customers who might take out a>