PROG8430 – Data Analysis, Modeling and Algorithms: Assignment 1, Exploratory Data Analysis with ‘R’

R programming


PROG8430 – Data Analysis, Modeling and Algorithms
Assignment 1: Exploratory Data Analysis with ‘R’
DUE BEFORE FEB 14 2021; 10PM

1. Submission Guidelines

All assignments must be submitted via the econestoga course website before the due date into the assignment folder. You may make multiple submissions, but only the most current submission will be graded.

SUBMISSIONS
In the Assignment 1 Folder submit:
1. Your R Code
2. A Word document containing your answers to the questions following the format provided in the assignment (i.e. using the example assignment).

DO NOT PUT THE DOCUMENTS INTO A ZIP FILE!

All variables in your code must abide by the naming convention [variable_name]_[initials]. For example, my variable for State would be State_DM.

You may only use base R (i.e. no additional packages may be used).

THIS IS AN INDIVIDUAL ASSIGNMENT. UNAUTHORIZED COLLABORATION IS AN ACADEMIC OFFENSE. Please see the Conestoga College Academic Integrity Policy for details.

2. Grading

This assignment is worth 12.5% of your total grade in the course and you can expect it to take five to eight hours. It is out of 30 marks overall.
Assignments submitted after 10pm will be reduced by 20%.
Assignments received after 8:00am the morning after the due date will receive a mark of 0%.
Assignments which do not follow the submission instructions may have marks deducted.

3. Data

Each student will have access to the study dataset.
STUDY DATASET: PROG8430_Assign_Explore.Rdata
Appendix one contains a data dictionary for the study file.

4. Background

A survey of 2120 residents of Canada was conducted to determine the key factors associated with political engagement. A variety of variables were measured and recorded, including some tests the respondents were asked to complete. Appendix 1 contains the data dictionary for the data set. One group of respondents (“Treat”) was given additional education on political matters while the other (“Control”) was not.

All of the tasks have been completed using the examples presented in class. A careful review of your notes from the lectures should give you everything you need to complete these tasks. Additionally, example output for most of the questions has been provided in the Appendix. NOTE – The sample output is for demonstration only and your output data will not precisely match it.

All of your charts, tables and graphs should be properly labelled.

5. Assignment Tasks

Task 1. Summarizing Data (8 marks)
1. Summary Table
   a. Create a table to show the total income by each category of marital status.
   b. Which status has the highest total income?
2. Calculate the mean
   a. Calculate the mean age of respondents born in Europe.
   b. Calculate the mean age of respondents born in Europe weighted by the number of children they have.
3. Table Comparison
   a. Create a table to show the mean Score for high school graduates compared to non-high school graduates.
   b. Which has a higher Score?
4. Calculate the 21st and 51st percentiles of percentage of time taken on the test.

Task 2. Organizing Data (12 marks)
1. Pie Chart
   a. Create a pie chart showing the number of respondents from each Region.
   b. Which region contains the most respondents (remember each row of your study file represents one respondent)?
   c. Which region contains the fewest respondents?
2. Summary table
   a. Create a table that shows the percentage of respondents from each Region that graduated high school.
   b. Which region has the highest percentage of high school graduates?
   c. Which region has the lowest percentage of high school graduates?
3. Bar Chart
   a. Create a bar chart showing the mean score on the Political Awareness Test for each Region.
   b. Which Region has the lowest mean score?
   c. Which Region has the highest mean score?
4. Histogram
   a. Create a histogram with 5 bins showing the distribution of the percentage of household income going to food.
   b. Which range of values has the highest frequency?
5. Box Plots
   a. Create a sequence of box plots showing the distribution of income separated by marital status.
   b. According to the charts, which marital status has the highest income?
   c. Which marital status has the lowest income?
   d. Which marital status has the greatest variability in income?
6. Scatter Plots
   a. Create a histogram for income.
   b. Create a histogram for standardized score.
   c. Create a scatter plot showing the relationship between the income and standardized score.
   d. What conclusions, if any, can you draw from the chart?
   e. Calculate a correlation coefficient between these two variables. What conclusion do you draw from it?

Task 3. Inference (6 marks)
1. Normality
   a. Create a QQ Normal plot of the Political Awareness Test Score.
   b. Conduct a statistical test for normality on the Political Awareness Test Score.
   c. Are the Political Awareness Test Scores normally distributed?
2. Statistically Significant Differences
   a. Compare Political Awareness Test Scores between the treatment and control group using a suitable hypothesis test.
   b. Explain why you chose the test you did.
   c. Do you have strong evidence that the average votes are different?
Task 4. Professionalism and Clarity (4 marks)

APPENDIX ONE: STUDY FILE DATA

Variable    Description
id          UserID (unique to each respondent)
group       Treatment or Control group
hs.grad     Graduated High School (Y or N)
nation      Nationality (Region)
gender      M/F
age         Age in Years
m.status    Marital Status
political   Political Affiliation
n.child     Number of Children
income      Annual Household Income
food        Pct of Income to Food
housing     Pct of Income to Housing
other       Pct of Income to Other Expenses
score       Score on Political Awareness Test
scr         Standardized Score Test
time1       Pct of Time Taken on Test
time2       Time Taken on Section 1 (Standardized)
time3       Time Taken on Section 2 (Standardized)
Pol         Measure of Political Involvement

APPENDIX TWO: EXAMPLE OUTPUT – Numbers will be different for the Study File

Question 1.1
  Martial_Status  Income
1 divorced
2 married
3 never
4 widowed

Question 1.3
  HighSchool  Score
1 no
2 yes

Question 2.2
               High_School
Region         no    yes
Asia
Europe
North America
Southern

Question 2.3, Question 2.5, Question 2.6: [example charts not included]
Answered 1 day after Feb 13, 2021

Answer To: PROG8430 – Data Analysis, Modeling and Algorithms Assignment 1 Exploratory Data Analysis with ‘R’...

Sanchi answered on Feb 14 2021
#load libraries
library(data.table)
#load dataset
load("C:/Users/sanchi.kalra/Desktop/Greynodes/AS15/prog8430assignexplore-4xtnoth1-r44z4lzs.RData")
#renaming data
dataset<-PROG8430_Assign_Explore
setDT(dataset)
rm(PROG8430_Assign_Explore) #delete original data
#table to show the total income by each category of marital status
mstatus_income<-dataset[,list(income=sum(income)),keyby='m.status']
#Which status has the highest total income?
mstatus_income[income==max(income)]$m.status
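The assignment brief allows base R only, while this script relies on data.table. A minimal base-R sketch of the same summary is given here. The data frame toy_DM and its values are made up for illustration and are not the study data; only the column names m.status and income come from the data dictionary.

```r
# Toy stand-in for the study data (hypothetical values, for illustration only)
toy_DM <- data.frame(
  m.status = c("married", "never", "married", "divorced", "never"),
  income   = c(50000, 30000, 70000, 40000, 20000)
)

# Total income per marital status using base R's aggregate()
income_by_status_DM <- aggregate(income ~ m.status, data = toy_DM, FUN = sum)
names(income_by_status_DM) <- c("Marital_Status", "Income")

# Marital status with the highest total income
top_row_DM <- which.max(income_by_status_DM$Income)
top_status_DM <- as.character(income_by_status_DM$Marital_Status[top_row_DM])

print(income_by_status_DM)
print(top_status_DM)
```

On the real data the same two calls would be applied to the loaded data frame instead of toy_DM.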
#Calculate the mean age of respondents born in Europe.
dataset[nation=="Europe",list(mean_age=mean(age)),]
#Calculate the mean age of respondents born in Europe weighted by the number of children they have.
dataset[nation=="Europe",list(mean_age=weighted.mean(age,n.child)),]
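A weighted mean divides the weighted sum by the sum of the weights, here the number of children, not by the sum of the ages. The short sketch below checks base R's weighted.mean() against that definition on made-up numbers; the vectors age_DM and n.child_DM are hypothetical and only mirror the column names in the data dictionary.

```r
# Hypothetical ages and child counts for three respondents
age_DM     <- c(30, 40, 50)
n.child_DM <- c(0, 1, 3)

# Unweighted mean age: (30 + 40 + 50) / 3 = 40
plain_mean_DM <- mean(age_DM)

# Mean age weighted by number of children:
# (0*30 + 1*40 + 3*50) / (0 + 1 + 3) = 190 / 4 = 47.5
weighted_DM <- weighted.mean(age_DM, w = n.child_DM)

# The same value computed from the definition
manual_DM <- sum(n.child_DM * age_DM) / sum(n.child_DM)

print(c(plain = plain_mean_DM, weighted = weighted_DM, manual = manual_DM))
```

Respondents with no children get weight zero, so they drop out of the weighted mean entirely.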
#Create a table to show the mean Score for high school graduates compared to non-high school graduates.
mean_score_grad<-dataset[,list(mean_score=mean(score)),keyby=c("hs.grad")]
#Which has a higher Score?
mean_score_grad[mean_score==max(mean_score)]$hs.grad
#Calculate the 21st and 51st percentiles of percentage of time taken on the test.
quantile(dataset$time1, c(.21, .51))
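As a quick sanity check on how quantile() reads its probabilities, the sketch below applies the same call to a made-up vector where the answer is easy to verify by eye. The vector pct_time_DM is hypothetical; the call relies on quantile()'s default interpolation (type 7).

```r
# Hypothetical percentages 0, 1, ..., 100 (not the study data)
pct_time_DM <- 0:100

# 21st and 51st percentiles; probabilities are given as fractions
pctiles_DM <- quantile(pct_time_DM, probs = c(0.21, 0.51))

# The result is a named vector with names "21%" and "51%"
print(pctiles_DM)
```

If the real column contains missing values, adding na.rm = TRUE to the call would be needed, since quantile() stops with an error on NA input by default.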
#Create a pie chart showing the number of respondents from each Region.
# Create data for the graph.
graph_data<-dataset[,list(num_resp=.N),keyby='nation']

# Give the chart file a name.
png(file = "nation_resp.png")

# Plot the chart.
pie(graph_data$num_resp,labels=graph_data$nation,main="Number of Respondents by Region")

# Save the file.
dev.off()
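Because the brief asks for base R and for properly labelled charts, a base-R variant of the pie chart might look like the sketch below. The vector nation_DM is made-up illustration data, and the output goes to a temporary PDF file so the example runs anywhere; swapping pdf() for png() as in the script above is an equally valid choice.

```r
# Hypothetical region values for six respondents (not the study data)
nation_DM <- c("Asia", "Europe", "Europe", "North America", "Asia", "Europe")

# Count respondents per region; each row is one respondent
counts_DM <- table(nation_DM)

# Slice labels that show the region name and its count
labels_DM <- paste0(names(counts_DM), " (", counts_DM, ")")

# Draw the chart into a temporary file and close the device afterwards
out_file_DM <- tempfile(fileext = ".pdf")
pdf(out_file_DM)
pie(counts_DM, labels = labels_DM, main = "Number of Respondents by Region")
dev.off()
```

Putting the counts in the labels makes the follow-up questions about the largest and smallest regions answerable directly from the chart.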

#Which region contains the most...
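One way to answer the "most" and "fewest" questions in base R is to take which.max() and which.min() of the region counts. This is a sketch on made-up data, not the remainder of the original script; nation_DM is a hypothetical vector.

```r
# Hypothetical region values for six respondents (not the study data)
nation_DM <- c("Asia", "Europe", "Europe", "North America", "Asia", "Europe")

# Respondents per region
counts_DM <- table(nation_DM)

# which.max()/which.min() return the position of the first extreme value;
# names() recovers the region label at that position
most_DM   <- names(which.max(counts_DM))
fewest_DM <- names(which.min(counts_DM))

print(most_DM)
print(fewest_DM)
```

Note that both functions report only the first region in case of a tie, so ties are worth checking by printing counts_DM as well.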