Data Analysis Assignment
Individual Term Project – Data Analysis Project 30%

Each student will, individually, analyze the Individual Project Dataset (on Sakai). The dataset is from Sport Canada and it lists the board composition (size, gender composition, number of paid staff), revenues, and expenses for provincial sport organizations in three provinces. Sport Canada is looking for information regarding specific HRM factors that may help financial performance in these organizations. You have been hired as a consultant to review the dataset and provide a summary report to Sport Canada**. Please see the “notes” tab in Excel for details of the data.

• Step 1: Write the Report (include a creative title page for your report)

a. Clean the Data and Present Summaries - Approximately 4 pages:
• Remove or replace missing data and outliers.
• Run your frequency and distribution statistics (continue to remove or replace outliers until you have a nearly normal distribution for your variables).
• Following the above steps, summarize the sample by providing frequency and distribution statistics for Columns C-M.
o Create a table of the frequency and distribution statistics (5 marks)
o Write a paragraph that summarizes key findings, including how you removed and replaced missing data (5 marks)
o Conclude this section by outlining why the data is useful for providing recommendations (i.e., which data could you use to test correlation/regression and which data could you use to test for differences in mean scores?) (5 marks).
15 marks

b. Present Association Statistics - Approximately 5 pages:
In the previous part (Step 1, a), you identified which data could be used to test correlations or regression statistics to provide useful recommendations to Sport Canada.
• Start this section with an opening paragraph that identifies the columns of data you wish to test for THREE relationships (either correlation, regression, or a mix of both).
Use textbook terminology to outline why a correlation or regression calculation is appropriate with the variables you have chosen (refer to columns C-M) (6 marks).
• Provide a brief summary of your correlation/regression test results (6 marks: the appropriate test was run, and the appropriate statistics are reported for each relationship you have identified; you MUST show your work through the equations to demonstrate knowledge of course content).
• The results in this section will also be graded on how they are presented; please include scatterplots of the relationships you are reporting (3 marks).
15 marks

c. Present Comparison Statistics - Approximately 5 pages:
In the previous part (Step 1, a), you identified which data could be used to test differences in mean scores to provide useful recommendations to Sport Canada.
• Start this section with an opening paragraph that identifies the columns of data you wish to test for THREE differences in mean (as an example, will you test differences in mean score of number of paid staff in Ontario versus Alberta?). Use textbook terminology to outline why a difference test is appropriate with the variables you have chosen (refer to columns C-M) (6 marks).
• Provide a brief summary of your three difference test results (6 marks: the appropriate test was run, and the appropriate statistics are reported for each difference you have identified; make sure to identify whether the difference is significant or not based on your p value range).
• The results in this section will also be graded on how they are presented; please include a box plot of each of the three mean score comparisons that you are reporting (3 marks).
15 marks

d. Provide Conclusions and Recommendations - Approximately 2 pages:
Outline 5 conclusions and 5 recommendations based on the results you believe would be meaningful to Sport Canada.
10 marks

Total: 55 marks

Due Date: Wednesday April 14th, 2021 at 12:00pm EST
To the Sakai Drop Box

**Partnering with Debra Gassewitz at SIRC (see our guest lecture from Week 7), five (5) exemplar Individual Projects (i.e., superior mark and professionalism of project presentation) will be selected by Dr. Kerwin and the TAs to have the opportunity to present their findings and recommendations to Debra, her SIRC team, and members of Sport Canada. If chosen, Dr. Kerwin will ask for the students’ voluntary engagement in the presentation (note: this is not a requirement of the project, but rather an added incentive to conduct quality work) and ensure the students are prepared to make a presentation to these industry stakeholders.

RUNNING HEAD: Sport Canada Research Report

Sport Canada Research Report
NAME
Student Number
SPMA 3P07
Date
Dr. Shannon Kerwin

Part A
Frequency and Distribution of Data

Throughout this report, I will be focusing on Sport Canada’s National Sport Organizations (NSOs). Going through each column of the original dataset (board size, memberships, total year-end revenues, and so on) across its 53 rows of data revealed outliers that should be removed. I removed outlying values from 6 rows of data, in various columns, because they would have skewed the data. The aim is to minimize the amount of asymmetry in the graphs and data. Table 1 shows the frequency and distribution statistics for the remaining data, after it was cleaned and the outliers were removed (Refer to Table 1). Removing these outliers left gaps in the dataset. Rather than completely remove important records from the set, I took the mean of each column from the cleaned data and used that number to fill the empty gaps in the respective column.
This kept each column mean unchanged while retaining the other important information in those records. Because filling gaps with the mean also narrows the spread of a column, this is something that I had to keep in mind while going through the next steps.

Table 1

I have also included the histograms that were made along with these frequency tables to give a visual of what the table represents (Refer to Figures 1-10). With the outliers removed, the data is more symmetrical. However, most of the data is still skewed, mainly right skewed. Because the majority of the graphs are unimodal and right skewed (if not symmetric), the mean will likely be greater than the median.

[Figures 1-10: histograms]

The removal of outliers helps data become more symmetric and less skewed. Real-life distributions are usually skewed. Removing the outliers reduces distortion in the data and minimizes the “skewness” in the graphs and in the mean scores for each column. Some histograms in the figures above show potential outliers. I have left these in because they are close enough to the remainder of the data in their respective columns, because more than one organization may have had a similar value, and/or because they provide valuable information that will be discussed later in this report.

After cleaning the data, I calculated the mean, standard deviation, median and interquartile range (IQR) for each column provided in the dataset (Refer to Table 2). The histograms for Board Size, Number of Male Board Members and Memberships are roughly symmetric, so for those three we want to focus on the mean and the standard deviation. For the other graphs, which are skewed, we are better off using the median and IQR.
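As an illustration only, the cleaning and summary steps described above could be sketched in Python as follows. The report does not state which rule was used to identify outliers, so the 1.5 × IQR fences, the skewness cutoff of 0.5, the function names, and the sample board sizes are all assumptions made for this sketch, not the method or data used in this report.

```python
import statistics as st


def iqr_fences(values, k=1.5):
    """Return (lower, upper) fences at k * IQR beyond the quartiles (k=1.5 is assumed)."""
    q1, _, q3 = st.quantiles(values, n=4)
    spread = q3 - q1
    return q1 - k * spread, q3 + k * spread


def clean_column(values):
    """Blank out values outside the fences, then fill every gap with the column mean."""
    observed = [v for v in values if v is not None]
    low, high = iqr_fences(observed)
    kept = [v if v is not None and low <= v <= high else None for v in values]
    fill = st.mean([v for v in kept if v is not None])
    return [fill if v is None else v for v in kept]


def summarize(values, skew_cutoff=0.5):
    """Compute both pairs of statistics and suggest which pair to report."""
    mean, sd = st.mean(values), st.stdev(values)
    q1, median, q3 = st.quantiles(values, n=4)
    pop_sd = st.pstdev(values)
    # Moment-based skewness; a positive value indicates a right-skewed column.
    if pop_sd == 0:
        skew = 0.0
    else:
        skew = sum((v - mean) ** 3 for v in values) / len(values) / pop_sd ** 3
    # Assumed rule of thumb: roughly symmetric columns use mean/SD, skewed ones median/IQR.
    report = "mean_sd" if abs(skew) < skew_cutoff else "median_iqr"
    return {"mean": mean, "sd": sd, "median": median,
            "iqr": q3 - q1, "skew": skew, "report": report}


# Hypothetical board sizes: one missing cell (None) and one extreme value (45).
board_size = [8, 9, 10, 10, 11, 12, None, 45]
cleaned = clean_column(board_size)
print(cleaned)
print(summarize(cleaned))
```

In this sketch the extreme value and the missing cell are both replaced by the mean of the remaining values, so the column mean is unchanged while the spread becomes narrower, which is the trade-off noted earlier for mean replacement.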
Table 2

Key Findings

As mentioned previously, after analyzing the entire dataset I decided to focus only on the national sport organizations (NSOs). I chose to do this because each of the provincial regions showed a great amount of variability and many outliers, so the majority of the provincial data would be greatly skewed. If I were to choose a region and remove all of the outliers within its provincial sport organizations (PSOs), there would be limited data to work with, resulting in an unrepresentative set of statistics. Focusing on just the national sport organizations not only provides more data to work with, but also gives a full nationwide representation of Canadian sport organizations. This gives Sport Canada a larger and more reliable dataset than the PSO data would, while still allowing nationwide analysis.

Secondly, some national sport organizations had missing data in some columns. Because we cannot simply make up numbers to fill in, I used the mean of the respective column, so that each affected organization is assigned the average value across these sport organizations. This limits additional skew and helps keep the shape of the distribution close to normal. It also means we do not have to wipe out the valuable information these organizations did provide, so they can remain in the report without skewing the data.

My third key finding concerns the year of each sport organization’s data, which is very important. The data was not all provided for the same year. Some organizations share a year, but the data ranges from as early as 2014 to as recent as 2018. Within an NSO, numbers can change greatly in the span of four years due to various sponsorships, a growth in popularity, a decline in activity, etc. The data