Answer To: SHA571: Understanding and Visualizing Data Cornell University...
Pooja answered on Aug 21 2021
Table of Contents
Part 1 - Data collection plan
Part 2 - Data summary and Visualization
Part 3 – Modelling
Analysis
Dashboard
Appendix
Part 1
Part 2
Part 3
References
Part 1 - Data collection plan
Suicide is a leading cause of death in the United States. It is ranked as the second leading cause of death for the age groups 10-14, 15-24, and 25-34 years, and as the fourth leading cause for the age groups 35-44 and 45-54. Through data analysis, I want to examine the trend in suicide rates by gender. The motivating situation is the reported rise in the suicide rate from 4.1 per 100,000 in 2001 to 26.1 per 100,000 in 2017. The data can help predict suicide rates in the future. The data can also show whether there is a significant difference in the average suicide rate between males and females.
The data set on suicide rates in the United States is well suited to understanding this situation. The data is filtered to the years 2000 through 2015. The variables of concern are year, sex, and suicides/100k pop. The quantitative variables are year and suicides/100k pop. The categorical variable is sex, with the categories male and female. The variable sex will be used to analyze whether the average suicide rate is greater for males or for females. The variable time (with 1 denoting January 2000) will be used to build a regression model that can predict suicides/100k pop in the future (Chambers, 2017).
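The time variable described above can be sketched as a simple month counter. This is only an illustration: the function name `month_index` and the assumption that the index advances by one per calendar month are not taken from the original workbook.

```python
def month_index(year: int, month: int, start_year: int = 2000) -> int:
    """Return a 1-based month counter with January of start_year mapped to 1.

    Hypothetical helper: the report only states that time = 1 denotes
    January 2000, so the month-by-month counting is an assumption.
    """
    if not 1 <= month <= 12:
        raise ValueError("month must be between 1 and 12")
    if year < start_year:
        raise ValueError("year must not precede start_year")
    return (year - start_year) * 12 + month


first = month_index(2000, 1)    # January 2000
last = month_index(2015, 12)    # December 2015
print(first, last)
```

Under this assumption, the index for December 2015 equals the number of monthly observations between January 2000 and December 2015.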
The master data set is obtained from a secondary source, kaggle.com. The data is observational, as the values are simply recorded for each unit. Government agencies of various countries collect the data, which is then compiled and published on kaggle.com. The data for 2000-2015 should be a good representative of the population. Considering 12 months for each of the 16 years, a sample size of 192 is appropriate. Bias is reduced by including participants of various age groups for both males and females (Garvan, 2001).
Part 2 - Data summary and Visualization
There is an increasing trend in the average number of suicides per 100,000 population from 2000 to 2015.
Descriptive statistics: suicides/100k pop

Statistic             Value
Mean                  12.94541667
Standard Error        0.843362193
Median                7.115
Mode                  0.34
Standard Deviation    11.68596934
Sample Variance       136.5618794
Kurtosis              -0.598106571
Skewness              0.767366793
Range                 42.13
Minimum               0.27
Maximum               42.4
Sum                   2485.52
Count                 192
The average of suicides/100k pop is about 13, with a high standard deviation of 11.7. The positive skewness value (0.77), along with the histogram, indicates that the distribution of suicide rates is slightly skewed to the right. There are very few years/months with a high suicide rate (Draper & Smith, 1998).
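As a quick sanity check, the reported summary statistics can be tested for internal consistency. The sketch below uses only the values listed in the output above; the variable names are illustrative and the raw data is not needed.

```python
import math

# Values copied from the descriptive-statistics output above.
n = 192
total = 2485.52
mean = 12.94541667
median = 7.115
std_dev = 11.68596934
variance = 136.5618794
std_error = 0.843362193
minimum, maximum, value_range = 0.27, 42.4, 42.13

checks = {
    "mean = sum / count": math.isclose(total / n, mean, rel_tol=1e-6),
    "variance = sd^2": math.isclose(std_dev ** 2, variance, rel_tol=1e-6),
    "std error = sd / sqrt(n)": math.isclose(
        std_dev / math.sqrt(n), std_error, rel_tol=1e-6
    ),
    "range = max - min": math.isclose(
        maximum - minimum, value_range, rel_tol=1e-9
    ),
    # A mean well above the median is consistent with right skew.
    "mean > median": mean > median,
}

for name, ok in checks.items():
    print(f"{name}: {'ok' if ok else 'MISMATCH'}")
```

Each relationship holds for the reported figures, so the table is at least self-consistent.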
As evident from the box plot, there are no outliers in this data set.
Row Labels     Average of suicides/100k pop
female         4.593020833
male           21.2978125
Grand Total    12.94541667
The average suicide rate for males is much greater than that for females. On average, there are about 21 suicides per 100,000 population for males, compared with only about 5 per 100,000 for females.
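The group averages can be compared more precisely with a short calculation. This sketch uses only the reported means and the total count of 192; the inferred group sizes are derived from those figures and do not come from the raw data.

```python
import math

# Group means and grand mean as reported in the pivot table above.
female_mean = 4.593020833
male_mean = 21.2978125
grand_mean = 12.94541667
n_total = 192

# Solve grand_mean = w * male_mean + (1 - w) * female_mean for the male share w.
male_share = (grand_mean - female_mean) / (male_mean - female_mean)
n_male = round(male_share * n_total)
n_female = n_total - n_male

difference = male_mean - female_mean   # gap in suicides per 100k
ratio = male_mean / female_mean        # how many times higher the male rate is

print(f"implied group sizes: male={n_male}, female={n_female}")
print(f"difference: {difference:.2f} per 100k, ratio: {ratio:.2f}")
```

The grand mean sits exactly halfway between the two group means, which implies that the male and female groups have the same number of observations.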
Part 3 – Modelling
Analysis
The regression model is given by: suicides/100k pop = 3.93 + 0.0067*time + 16.73*male, where male = 1 for males and 0 for females.
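The fitted equation can be turned into a small prediction helper. This is a minimal sketch: the coefficients are those reported above, while the function name `predict_rate` and the example time values are illustrative assumptions.

```python
# Coefficients as reported in the regression equation above.
INTERCEPT = 3.93
TIME_COEF = 0.0067
MALE_COEF = 16.73


def predict_rate(time: int, male: bool) -> float:
    """Predicted suicides per 100k population for a given time index and sex."""
    return INTERCEPT + TIME_COEF * time + MALE_COEF * (1 if male else 0)


# Illustrative inputs: time = 1 is the first period, 193 is one step past
# a 192-period sample (an assumed extrapolation point).
for t in (1, 193):
    print(t, round(predict_rate(t, male=False), 2), round(predict_rate(t, male=True), 2))
```

Because the time coefficient is so small, the predictions are dominated by the intercept and the male indicator.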
About 51% of the variation in suicides/100k pop is explained by time and male. This percentage is considered fair, and it seems that the model could be improved by adding other significant variables (Chatfield, 2018).
The null hypothesis is that the model is not significant. The alternative hypothesis is that the model is significant. With F = 100.16 and p < 0.05, the null hypothesis is rejected at the 5% level of significance. There is sufficient evidence to conclude that the model is significant (Montgomery, Peck, & Vining, 2012).
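The overall F statistic and the share of explained variation are linked by a standard identity for least-squares regression with an intercept, so one can be checked against the other. The sketch below assumes n = 192 observations and 2 predictors (time and male), as reported in this analysis; the function names are illustrative.

```python
def r_squared_from_f(f_stat: float, n_obs: int, n_predictors: int) -> float:
    """Recover R^2 from the overall F statistic of a model with an intercept."""
    df_model = n_predictors
    df_resid = n_obs - n_predictors - 1
    return (f_stat * df_model) / (f_stat * df_model + df_resid)


def f_from_r_squared(r2: float, n_obs: int, n_predictors: int) -> float:
    """Inverse relationship: overall F statistic implied by a given R^2."""
    df_model = n_predictors
    df_resid = n_obs - n_predictors - 1
    return (r2 / df_model) / ((1.0 - r2) / df_resid)


r2 = r_squared_from_f(100.16, n_obs=192, n_predictors=2)
print(f"R^2 implied by F = 100.16: {r2:.4f}")
```

The implied value is roughly 0.51, which agrees with the 51% of variation reported as explained by the model.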
With each 1-month increase in time, the suicide rate increases by 0.0067 suicides per 100,000 population. However, with t = 0.63 and p > 0.05, this coefficient is not statistically significant.
For males, the suicide rate is 16.7 suicides per 100,000 population higher than for females. This difference is statistically significant, with t = 14.15 and p < 0.05.
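The reported t statistics also imply standard errors and approximate confidence intervals for the two coefficients. This is a rough sketch: it uses the coefficients from the regression equation, recovers each standard error as coefficient divided by t, and substitutes a normal critical value for the exact t critical value, which is an approximation that is reasonable only because the sample is large.

```python
from statistics import NormalDist


def standard_error(coef: float, t_stat: float) -> float:
    """Standard error implied by a coefficient and its t statistic."""
    return coef / t_stat


def approx_ci(coef: float, t_stat: float, level: float = 0.95) -> tuple[float, float]:
    """Approximate confidence interval using a normal critical value."""
    z = NormalDist().inv_cdf(0.5 + level / 2)
    se = standard_error(coef, t_stat)
    return coef - z * se, coef + z * se


time_ci = approx_ci(0.0067, 0.63)    # time coefficient and its t statistic
male_ci = approx_ci(16.73, 14.15)    # male coefficient and its t statistic

print("time:", tuple(round(x, 4) for x in time_ci))
print("male:", tuple(round(x, 2) for x in male_ci))
```

The interval for time includes zero while the interval for male does not, which matches the significance conclusions stated above.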
A limitation of the regression analysis is that it assumes normality of the residuals and equality of error variances. These assumptions appear to be violated, as observed from the normal probability plot and the time residual plot: the normal probability plot does not have the expected shape, and the points in the time residual plot are not randomly scattered.
Dashboard
Appendix
Part 1
year
time
sex
suicides/100k...