MIS 545 Lab 2: Data Preprocessing

For this assignment, you need to write R code to answer the questions. For each question, turn in the R code and the output. The required data file, iris, is included in the preinstalled R package datasets; you need to load it. The instructions are shown below.


Lab 2_Data Preprocessing1.pdf

MIS 545 Lab 2: Data Preprocessing

1. Overview

In this lab, we will cover fundamental data preprocessing tasks in R. After this lab, we will be able to transform raw data generated in the real world into a format that delivers business insights.

2. Install Data Packages

Packages are user-contributed modules that contain R functions, data, and code in an organized format. Thousands of packages can be downloaded from CRAN. In this lab, we will use several data sets that ship with packages. A package is installed from the R console or RStudio with the function install.packages(); this is needed only the first time. In this lab, install the packages VIM and mice.

The dataset 'sleep' is included in the package VIM. It comes from a study of the sleeping patterns of 62 mammals. The study aims to identify relationships between sleep and physical characteristics of mammals, such as brain and body weight, life span, gestation time, time sleeping, and predation and danger indices.

# install package "VIM"
install.packages("VIM")
# To use the package in an R session, we need to load it via library()
library(VIM)
# Load dataset "sleep", which comes with the package "VIM"
data(sleep, package = "VIM")
# call function head() to get a feel for the data, or type sleep to see all values
head(sleep)
# install package "mice" and load it into R
install.packages("mice")
library(mice)

3. Explore missing value in R

3.1 missing value

In R, NA stands for a missing value, NaN (Not a Number) stands for an impossible value, and the symbols Inf and -Inf stand for positive and negative infinity, respectively. To tell whether a value belongs to any of these cases, we use the functions is.na(), is.nan(), and is.infinite(). Each function returns a Boolean value, either TRUE or FALSE. Remember that R is case-sensitive: always capitalize NA and the Boolean values TRUE and FALSE.
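As a small illustration of the three test functions described above, the following sketch applies them to a toy vector. The vector x and its values are made up for this example and are not part of the lab data.

```r
# Toy vector (illustrative values, not from the sleep dataset):
# a regular number, a missing value, NaN, Inf, and -Inf
x <- c(1, NA, 0 / 0, 1 / 0, -1 / 0)

na_flags  <- is.na(x)        # TRUE for NA and also for NaN
nan_flags <- is.nan(x)       # TRUE only for NaN
inf_flags <- is.infinite(x)  # TRUE for Inf and -Inf

na_flags
nan_flags
inf_flags
```

Note that is.na() also flags NaN, so is.nan() is the function to use when the two cases must be told apart.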
Here we use "sleep" as an example to explore how missing values are distributed.

# First, we need to know how many rows are in "sleep"
nrow(sleep)
## [1] 62
# We use complete.cases() or na.omit() to see the rows without missing values
sleep[complete.cases(sleep),]
# or try na.omit(sleep)
# Count the number of rows without missing values
nrow(sleep[complete.cases(sleep),])
## [1] 42
# To reverse the condition (rows containing one or more missing values), we use the exclamation mark (!)
sleep[!complete.cases(sleep),]
nrow(sleep[!complete.cases(sleep),])
## [1] 20

We give R a Boolean value by typing TRUE or FALSE. However, R can treat them as the integers 1 and 0, respectively, which means the functions sum() and mean() can be used to count missing values and to compute their proportion.

# Check how many obs (observations) contain a missing value in column "Dream"
sum(is.na(sleep$Dream))
## [1] 12
# About 19% of obs in column Dream contain a missing value
mean(is.na(sleep$Dream))
## [1] 0.1935484
# About 32% of obs in data frame sleep contain one or more missing values
mean(!complete.cases(sleep))
## [1] 0.3225806

3.2 distribution of missing value

Checking for missing values is important, but the functions we called above reveal little about their distribution and pattern. It is more convenient to see how the missing values are distributed across the dataset. To do so, we can use the function md.pattern() to generate a matrix that shows the pattern of missing values.
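To make the idea of a missing-value pattern concrete, here is a minimal base R sketch that tabulates patterns by hand. It is not how md.pattern() is implemented; the data frame toy and all of its values are assumptions made up for this illustration.

```r
# Toy data frame (illustrative values, not the sleep dataset)
toy <- data.frame(
  a = c(1, 2, NA, 4, NA),
  b = c(NA, 2, 3, 4, NA),
  c = c(1, 2, 3, 4, 5)
)

# 1 = observed, 0 = missing, the same coding the pattern table uses
observed <- 1L - is.na(toy)

# Collapse each row into a pattern string such as "1 0 1",
# then count how many rows share each pattern
pattern <- apply(observed, 1, paste, collapse = " ")
pattern_counts <- table(pattern)
pattern_counts

# Number of missing values in each column
missing_per_column <- colSums(is.na(toy))
missing_per_column
```

Each distinct string in pattern_counts corresponds to one row of a pattern table, and missing_per_column corresponds to the per-column totals.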
# Call md.pattern(); make sure the mice package is loaded first
md.pattern(sleep)
   BodyWgt BrainWgt Pred Exp Danger Sleep Span Gest Dream NonD
42       1        1    1   1      1     1    1    1     1    1  0
 2       1        1    1   1      1     1    0    1     1    1  1
 3       1        1    1   1      1     1    1    0     1    1  1
 9       1        1    1   1      1     1    1    1     0    0  2
 2       1        1    1   1      1     0    1    1     1    0  2
 1       1        1    1   1      1     1    0    0     1    1  2
 2       1        1    1   1      1     0    1    1     0    0  3
 1       1        1    1   1      1     1    0    1     0    0  3
         0        0    0   0      0     4    4    4    12   14 38

The returned table shows the pattern of missing values in the data set sleep: 0 means the value in that column is missing, and 1 means it is present. The first column, which has no header, gives the number of observations that follow each pattern. The last column, which also has no header, gives the number of variables with missing values in that pattern. The bottom row gives the number of missing values in each column, 38 in total. For example, three distinct patterns have a missing value in column Dream, and only 1 row follows the pattern with missing values in the 3 columns Span, Dream, and NonD.

3.3 visualization of missing value

Instead of reading a table produced by md.pattern(), a visualization is often easier to interpret. The VIM package provides several functions that visualize the pattern of missing values; here we apply aggr() and marginplot() to our dataset.

Function aggr() returns a bar chart and a heat map showing the distribution of missing values. The horizontal axis lists the column names of the dataset sleep. The vertical axis indicates the number of observations that follow a particular pattern.

# call function aggr(); prop = FALSE shows counts instead of proportions
aggr(sleep, prop = FALSE, numbers = TRUE)
## you should be able to see visualizations in a graphics window

We can easily see that 42 observations are complete and that 9 observations have missing values in both the NonD and Dream columns. You can compare these results with what we got from the functions sum() and mean().

Function marginplot() returns a scatter plot, in which we can observe the relationship between two variables.
# call function marginplot(); pch sets the plotting symbol for obs, col sets the colors used in the plot
marginplot(sleep[c("Gest", "Dream")], pch=c(20), col = c("darkgray","red","blue") )
## you should be able to see visualizations in a graphics window

In the chart above, Gest appears to be negatively correlated with Dream. The red boxplot at the far left margin shows the distribution of Dream for observations whose Gest value is missing; the grey boxplot right next to it shows the distribution of Dream for observations that have a Gest value. We find that 4 observations with a Dream value are missing Gest, and that the distribution of these 4 observations is slightly higher than that of the others. In other words, the Dream value is generally greater than average when the Gest value is missing.

3.4 Correlation of missing value

After looking at the data visually, we may want to quantify the correlation among the variables instead of relying on a descriptive analysis.

# First transform the data into 1 and 0, indicating whether a cell contains a missing value
disofNa <- as.data.frame(abs(is.na(sleep)))
# select variables that come with missing values
corrna <- disofNa[c("NonD", "Dream", "Sleep", "Span", "Gest")]
# call function cor() to see the correlation coefficient matrix
cor(corrna)
            NonD      Dream      Sleep       Span        Gest
NonD  1.00000000 0.90711474 0.48626454 0.01519577 -0.14182716
Dream 0.90711474 1.00000000 0.20370138 ...
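The indicator-correlation step can be tried on a small example without loading VIM. The sketch below follows the same approach of turning missingness into 1/0 indicators and passing them to cor(); the data frame toy, its values, and the names ind and ind_cor are assumptions made up for illustration.

```r
# Toy data frame (illustrative values, not the sleep dataset)
toy <- data.frame(
  x = c(1, NA, 3, NA, 5, 6),
  y = c(NA, NA, 3, NA, 5, 6),
  z = c(1, 2, NA, 4, 5, 6)
)

# 1 = value is missing, 0 = value is present
ind <- as.data.frame(abs(is.na(toy)))

# Correlation between the missingness indicators of each pair of columns
ind_cor <- cor(ind)
round(ind_cor, 2)
```

A large positive entry means two columns tend to be missing in the same rows, as x and y are here. Every column passed to cor() needs at least one missing and one present value; an indicator that is constant has zero variance and yields NA.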
Answered Same Day, Jul 17, 2021


Suraj answered on Jul 18 2021
MIS 545 Lab 2 Assignment
Introduction: This assignment is based on visualizing the iris dataset with different types of plots, such as box plots, violin plots, and scatter plots, using R.
Steps to load dataset:
library(datasets)
data <- iris
df <- data.frame(data)
library(ggplot2)
The first plot is the box plot and the R code is given as...