Assessment 2: Visualisation
Assessor:
Total __/34

Objective: The purpose of this tutorial project is to implement and explore techniques mainly covered in weeks two and three of the Foundations of Data Science subject. Some questions will require you to independently investigate R functions. This assessment focuses on a dataset containing a collection of credit card applications and the subsequent credit approval decisions (positive/successful or negative/unsuccessful).

The project is segmented into the following parts:
1. Importing data and handling variable types, variable names and missing values
2. Calculating and visualising proximity measurements
3. Visually exploring data relationships using ggplot2

Answering: Insert your response into the box below each question. Answer boxes can be enlarged if needed.

Submitting: Save your completed assessment as a PDF file using the naming convention: LastName_FirstName_MA5800_A2.pdf
Submit your PDF document to Learn JCU in the Assessment 2 area in MA5800.

Questions

Section 1. Importing

The data is publicly available at: http://archive.ics.uci.edu/ml/machine-learning-databases/credit-screening/crx.data

This dataset is interesting because it has a good mix of attributes: continuous, nominal with small numbers of values, and nominal with larger numbers of values. It contains data regarding corporate MasterCard (credit card) applications from the Commonwealth Bank during 1984. This was a time when credit approvals were done manually, and research into automation was active to improve the equity and accuracy of the credit approval process.

In the public source where the data is currently available, the variable names have been removed for confidentiality reasons; however, we provide the variable names below, based on the original publication:

Variable Name       Comments
Gender              Gender of applicant. Nominal variable with two factor levels
Age                 Age of applicant. Numeric variable
MonthlyExpenses     Monthly household expenses. Numeric variable. Units of $100
MaritalStatus       Marital status of applicant (Married, Single, Other). Nominal variable, three factor levels
HomeStatus          Home living arrangements of applicant (Renting, Own/Buying, Living with Relatives). Nominal variable, three factor levels
Occupation          Occupation category of applicant. Nominal variable with multiple factor levels
BankingInstitution  Primary banking/credit union institution used by applicant. Nominal variable
YearsEmployed       Number of years the applicant has worked in current or previous employment. Numeric variable
NoPriorDefault      Financial judgements of defaulting on a repayment. Nominal variable with two levels
Employed            Current employment status of applicant. Nominal variable, two factor levels
CreditScore         Offset normalised credit rating score: summary attribute score of tabulated values corresponding to application. Numeric variable
DriversLicense      Whether the applicant has a current driver's licence. Nominal variable, two factor levels
AccountType         Type of account in primary banking institution, e.g. savings account. Nominal variable
MonthlyIncome       Monthly disposable income. Units of $1. Summary attribute from application. Numeric variable
AccountBalance      Amount in primary account in primary banking institution. Numeric variable
Approved            Approval status of application. Nominal variable, two factor levels

Import the data using the R function read.table().

Note: in the data file, missing values are recorded using the "?" character. To import the missing values correctly in R, you will need to use the input argument na.strings = "?" in the call to read.table(). Also, note that the data file does not contain the variable names, so you can use the argument header = FALSE to tell R that the first line of the file contains data rather than variable names.

Question 1
Insert your R code used to import the data.
Insert your answer here /2

Variable names and types

Because the data file does not contain the variable names, we will need to explicitly set them using the R function names().

Use this R code to add names to the dataset:

    names(Data) <- c("Gender", "Age", "MonthlyExpenses", "MaritalStatus",
                     "HomeStatus", "Occupation", "BankingInstitution",
                     "YearsEmployed", "NoPriorDefault", "Employed",
                     "CreditScore", "DriversLicense", "AccountType",
                     "MonthlyIncome", "AccountBalance", "Approved")

When the data is imported using the function read.table(), the variable type is automatically assigned. Data types can also be manually assigned and may need to be re-assigned to perform some types of numerical analysis, such as re-coding a two-level factor into a numeric binary variable.

Use the following R code to manually define the variables:

    Data$Gender             <- as.factor(Data$Gender)
    Data$Age                <- as.numeric(Data$Age)
    Data$MonthlyExpenses    <- as.integer(Data$MonthlyExpenses)
    Data$MaritalStatus      <- as.factor(Data$MaritalStatus)
    Data$HomeStatus         <- as.factor(Data$HomeStatus)
    Data$Occupation         <- as.factor(Data$Occupation)
    Data$BankingInstitution <- as.factor(Data$BankingInstitution)
    Data$YearsEmployed      <- as.numeric(Data$YearsEmployed)
    Data$NoPriorDefault     <- as.factor(Data$NoPriorDefault)
    Data$Employed           <- as.factor(Data$Employed)
    Data$CreditScore        <- as.numeric(Data$CreditScore)
    Data$DriversLicense     <- as.factor(Data$DriversLicense)
    Data$AccountType        <- as.factor(Data$AccountType)
    Data$MonthlyIncome      <- as.integer(Data$MonthlyIncome)
    Data$AccountBalance     <- as.numeric(Data$AccountBalance)
    Data$Approved           <- as.factor(Data$Approved)

Question 2
Variables Gender and DriversLicense are both nominal binary variables, i.e., unordered factors with two levels (values). They don't need to, but they could, be represented as numeric binary variables, taking values 0 and 1. The two values for Gender stand for male and female, and the two values for DriversLicense indicate whether or not the individual has a current driver's licence. Discuss whether these variables are better interpreted as symmetric or asymmetric for the sake of credit approval analysis, justifying and/or contextualising your answer.

Insert your answer here /4

Records with missing values

There are also a few missing values. For this project, observations with missing values need to be removed. To remove observations with missing values, use the R function na.omit().

Use this R code to remove the records with missing values from the dataset:

    Data <- na.omit(Data)

Question 3
In the original data, how many missing values are there in total? How many records are removed by using the function na.omit()?

Insert your answer here /1

Section 2. Calculating and visualising proximity measurements

Question 4
The dataset contains variables with mixed types. Use the R function daisy() from the package cluster to compute a Gower dissimilarity (distance) matrix between the data records, and refer to the result as "dist". Enter the R code you used, including any libraries needed.

Insert your answer here /2

The R object produced by the function daisy() is called a dissimilarity object. It stores information efficiently, but it is not readily visualised and it is not easy to extract information from. To make the dissimilarity object easier to work with, we can convert it to a matrix.

Use this R code to convert the Gower dissimilarity object into a distance matrix:

    dist <- as.matrix(dist)

Question 5
Using the new distance matrix, what is the Gower similarity measure between the 10th and the 60th observation (row)? Answer using R command(s).

Insert your answer here /2

Because there are a large number of observations/records (rows) in the dataset, it is typical to visualise the distance matrix to gain insight into data structures.

Use the following R code to visualise the distance matrix:

    dim <- ncol(dist)  # used to define axis in image
    image(1:dim, 1:dim, dist, axes = FALSE, xlab = "", ylab = "", col = rainbow(100))

Note (optional): Additionally, you could reorder the rows and columns of the matrix according to their similarities before visualising, using a technique called clustering (which will be studied as part of other, more advanced subjects, namely Data Mining and Machine Learning):

    heatmap(dist, Rowv = TRUE, Colv = "Rowv", symm = TRUE)

Question 6
Insert the image(s) of the distance matrix below, then describe the pattern you see when visualising it (them).

Insert image here /1
Describe the image here /1

Visualising a distance matrix is one way to begin exploring the dataset. Correlation matrices between numerical data types can also be useful when exploring the data.

Question 7
Enter the R code you used to calculate the Pearson and then the Spearman correlation matrices using all numerical variables.

Insert your answer here /2

Section 3. Visually exploring data patterns and relationships

We may have preconceived notions of what to expect in some datasets. In credit card applications, we may hypothesise that approval would be aligned, for example, with account balance, monthly expenses, credit score and/or age.

Question 8
Use the ggplot2 library to produce box plots for AccountBalance, MonthlyExpenses, CreditScore and Age, segmented by approval (variable "Approved"). Insert the R code and resulting images into the table below. /4

Enter code here
Insert image here
Insert image here
Insert image here
Insert image here

Question 9
Describe the apparent patterns shown in the visualisations in Question 8.

Insert your answer here /4

Question 10
Use the ggplot2 library to produce bar plots for Employed, MaritalStatus, BankingInstitution and NoPriorDefault, all segmented by approval (variable "Approved"). Insert the R code and resulting images into the table below. /4

Insert code here
Insert image here
Insert image here
Insert image here
Insert image here

Question 11
Considering that the values "f" (false) and "t" (true) of variables Employed and NoPriorDefault mean unemployed/employed and defaulted-before/never-defaulted-before, respectively, and regardless of the meaning of the values for MaritalStatus and BankingInstitution, describe interesting relationships (if any) between these nominal variables and the approval of the application by visually inspecting the bar plots in Question 10.

Insert your answer here /4

Apparently, the strongest influencing factor in the approval of the applications is whether there has been a prior credit default. By using a contingency table, we can examine the strength of this relationship.

Question 12
Use the function table() and calculate the Simple Matching Coefficient (SMC) between NoPriorDefault and Approved, assuming that values "f" (false) and "t" (true) for NoPriorDefault are associated with "-" and "+" for Approved, respectively. Discuss the interpretation of the SMC in this scenario. Is Jaccard meaningful in this case?

Enter your answer here
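
Illustration (not part of the marked answer): the two coefficients named in the final question can be read off a 2x2 contingency table. The sketch below is a minimal example under stated assumptions: the vectors prior and approved are made-up toy data with the same level labels as the real variables, not values taken from crx.data, and the names f00, f01, f10 and f11 are introduced here only for illustration.

```r
# Toy data (hypothetical, NOT from crx.data): 8 paired observations.
prior    <- factor(c("t", "t", "f", "f", "t", "f", "t", "f"), levels = c("f", "t"))
approved <- factor(c("+", "+", "-", "-", "-", "-", "+", "+"), levels = c("-", "+"))

# 2x2 contingency table: rows = prior, columns = approved.
tab <- table(prior, approved)

# Cell counts, treating "t" and "+" as the positive (1) values.
f11 <- as.numeric(tab["t", "+"])  # both positive
f00 <- as.numeric(tab["f", "-"])  # both negative
f10 <- as.numeric(tab["t", "-"])  # mismatch: prior positive, approval negative
f01 <- as.numeric(tab["f", "+"])  # mismatch: prior negative, approval positive

# SMC counts both kinds of agreement; Jaccard ignores the 0-0 matches.
smc     <- (f11 + f00) / (f11 + f00 + f10 + f01)
jaccard <- f11 / (f11 + f10 + f01)

print(tab)
print(c(SMC = smc, Jaccard = jaccard))
```

The only difference between the two formulas is whether the negative-negative matches (f00) count as agreement, which is the point to weigh when deciding if the variables should be treated as symmetric or asymmetric.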