CO4760: Exploratory Data Analysis
Assignment

Date issued: 07/11/2017
Hand in Date:
- Part A (Descriptive Analysis): 05/12/2017
- Part B (Inferential Analysis): 09/01/2018

IMPORTANT
- As work is submitted on-line, the deadline is midnight on the hand in date.
- Read the marking scheme carefully.
- This is an individual project and no group work is permitted.

I. Assignment Purpose and Overview

The purpose of this assignment is to help students better understand the exploratory data analysis techniques via working on a real dataset. You will be provided with a real dataset for which you will develop your own R code and perform the necessary techniques for descriptive and exploratory data analysis; you will conclude your analysis through critical assessment of the discovered statistical results.

In particular, in this assignment, you will
- perform descriptive data analysis on the provided data set;
- summarize and critically assess your preliminary results;
- evaluate modelling concepts based on the characteristics of the data;
- develop your own R code to obtain inferential statistics;
- make judgments and critically assess the results you will obtain to conclude the data analysis.

The design and development of the project will be divided into two phases:
- Part A – Descriptive Statistics (50%):
  - Descriptive analysis – summary of the variables
  - Visualizations of the data and variables
  - Identification of the appropriate methods to be used for inference
- Part B – Inferential Statistics (50%):
  - Application of the methods proposed in Part A
  - Assessment of obtained inferential statistics
  - Conclusions

II. Description

Two datasets are described below. You will choose one dataset for this assignment.

Dataset 1: New York Air Quality Measurements
Dataset name: airquality
Source: The data were obtained from the New York State Department of Conservation and the National Weather Service.
Storage: The dataset is stored in R.
The dataset contains daily air quality measurements in New York, May 1st, 1973 to September 30th, 1973. The dataset is formed of a data frame with 153 observations on 6 variables:
- Ozone: Mean ozone in parts per billion (ppb) from 1300 to 1500 hours at Roosevelt Island
- Solar.R: Solar radiation in Langleys (lang) in the frequency band 4000-7700 Angstroms from 0800 to 1200 hours at Central Park
- Wind: Average wind speed in miles per hour (mph) at 0700 and 1000 hours at LaGuardia Airport
- Temp: Maximum daily temperature in degrees Fahrenheit at LaGuardia Airport
- Month: Month (1-12)
- Day: Day of the month (1-31)

Dataset 2: Salaries for Professors
Dataset name: Salaries
Storage: The dataset is stored in R, in the library “car”

The dataset gives the 2008-2009 9-month academic salary for Assistant Professors, Associate Professors and Professors in a college in the U.S. The data were collected as part of the ongoing effort of the college’s administration to monitor salary differences between male and female faculty members. The dataset consists of 397 observations on 6 variables:
- rank: a factor with levels AssocProf, AsstProf, Prof
- discipline: a factor with levels A (“theoretical” departments) or B (“applied” departments)
- yrs.since.phd: years since PhD
- yrs.service: years of service
- sex: sex (male/female)
- salary: nine-month salary, in dollars

Deliverables

The assignment consists of two deliverables corresponding to Part A and Part B of the assignment respectively. For each deliverable you will hand in
1. a report document (in Word format) and
2. an R script with your code for verification

Grading Criteria

Marks will be awarded based on the criteria described below. Your R code will account for 40% of your final score, while the analysis will account for 60% of your final score. If your R code replication produces an error, marks will be deducted from your R code score.
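Since the script is replicated from a clean session, it helps to load the chosen dataset explicitly at the top of the script. The following is only a minimal sketch, not a required template: it assumes that Salaries is available from either the carData or the car package, depending on the installed version, and it skips Salaries quietly if neither is installed. Printing the structure and dimensions lets you check what was actually loaded against the description above.

```r
# Sketch of a script header intended to run in a fresh R session.
# Assumption: Salaries ships with carData (recent car releases) or with
# car itself (older releases); airquality comes with base R.

# Dataset 1: part of the base 'datasets' package, no extra library needed
data(airquality)
str(airquality)   # variable names and types
dim(airquality)   # number of observations and variables

# Dataset 2: try carData first, then fall back to car
if (requireNamespace("carData", quietly = TRUE)) {
  data("Salaries", package = "carData")
} else if (requireNamespace("car", quietly = TRUE)) {
  data("Salaries", package = "car")
}

# Inspect Salaries only if it was actually loaded
if (exists("Salaries")) {
  str(Salaries)
  dim(Salaries)
}
```

Loading data by name at the start, rather than relying on objects left over in the workspace, is one simple way to reduce the chance of a replication error.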
Make sure that your R code contains all required commands, so that it runs in a new workspace without producing an error and replicates the results described in your report. Furthermore, do not forget to comment your R code.

Within each part (A and B), aim to complete the work for each section described in the following table before moving on to the next, as you will not get full credit for later sections if there are significant defects in an earlier section. However, to the extent possible, do not simply stop if you are stuck on one part but can do later parts. In assessing the work within a section, factors such as simplicity, quality and appropriateness of comments, and quality and completeness will be considered.

Part: A-1 / B-1
Description: Understanding of the problem, identification of the appropriate methods to be employed for the analysis
Range: 0-5
Break down:
- 0: fails to understand the needs of the problem
- 3: understands the problem and identifies the objectives of the study
- 5: clear understanding of the problem, identification of the objectives and description of the methods to be used for the analysis

Part: A-2 / B-2
Description: Statistical analysis: descriptive / inferential
Range: 0-15
Break down:
- 0-5: Graphics
- 0-10: Statistical analysis
- See explanations below

Part: A-3 / B-3
Description: R code
Range: 0-20
Break down:
- R code implementation (0-10)
  - 0: does not provide any R code
  - 1-5: does not provide R code for all methods described
  - 8: provides R code to employ all methods described
  - 9-10: provides efficient R code to employ all methods described
- Commenting (0-5)
  - 0: does not comment on the R code
  - 2: provides some comments but inadequate to explain the code
  - 4: provides adequate comments in the code, clearly explaining all the steps of the code
  - 5: provides detailed comments, clearly explaining all the steps of the code and matching the commands with the statistical methods described
- Code replication (0-5)
  - 0: code replication produces an error
  - 5: code replication does not produce an error and replicates the results presented in your report

Part: A-4 / B-4
Description: Conclusions and reporting
Range: 0-10
Break down:
- Conclusions (0-7)
  - 0: no conclusions are drawn
  - 3: some conclusions are drawn
  - 5: adequate conclusions are drawn, shows understanding of the statistical results
  - 7: draws the conclusions that result from the analysis, shows understanding of the statistical results and critically assesses and interprets the results of the analysis
- Reporting (0-3)
  - 0: bad quality report
  - 1: limited quality report
  - 3: high quality report

Explanations with regard to A-2 / B-2 Statistical Analysis
- The study objectives should be clearly stated.
- The specific question your analysis aims to address should be clearly formulated.
- The hypotheses to be tested should be stated.
- Clearly indicate the independent and dependent variables.
- Justify any necessary relabelling of variables or data transformation.
- Check whether there are NA values in the data.
- Sufficient detail regarding hypothesis testing and/or modelling should be provided.
- Specify the specific statistical methods, tests and modelling methods used.
- Provide details of the proposed analysis.
- Specify and justify whether one- or two-sided statistical tests are performed.
- The presence of outliers should be noted. State your approach to handle them.
- Report exact p-values.
- Measures of central tendency and measures of variation should be reported and clearly defined.
- Justify the number of significant digits used to reflect precision.
- Figures and tables should clearly display the data and be able to stand alone. That is, all information necessary for interpretation should be included within the figure/table and legend.
- Record outputs of the analysis.

Submission of assignment work
- Anonymous marking is being used. You may include your University ID number (“G2…”) on the work. Apart from this, avoid doing anything that would allow you to be identified from your work.
- Keep a complete copy of the work you hand in.
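As an illustration of a few of the checks listed in the statistical analysis explanations (NA values, measures of central tendency and variation, outliers, exact two-sided p-values), here is a minimal sketch on the airquality dataset. The choice of the Ozone variable, the 1.5 × IQR boxplot rule for outliers, and the Ozone-Temp correlation test are assumptions made purely for illustration, not the analysis expected in your report.

```r
# Illustrative checks only; variable and test choices are assumptions.
data(airquality)

# NA values: count of missing entries per variable
na_counts <- colSums(is.na(airquality))
print(na_counts)

# Central tendency and variation for Ozone, ignoring missing values
ozone <- airquality$Ozone
ozone_summary <- c(
  mean   = mean(ozone, na.rm = TRUE),
  median = median(ozone, na.rm = TRUE),
  sd     = sd(ozone, na.rm = TRUE),
  IQR    = IQR(ozone, na.rm = TRUE)
)
print(round(ozone_summary, 2))

# Outliers: points beyond 1.5 * IQR from the hinges (the boxplot rule);
# boxplot.stats() drops NA values itself
ozone_outliers <- boxplot.stats(ozone)$out
print(ozone_outliers)

# Exact p-value from a two-sided test, here Pearson correlation
# between Ozone and Temp (incomplete pairs are dropped by cor.test)
ct <- cor.test(airquality$Ozone, airquality$Temp,
               alternative = "two.sided")
print(ct$p.value)
```

Storing intermediate results in named objects, as done here, makes it easier to quote the same numbers in the report that the script reproduces.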