SHA571: Understanding and Visualizing Data Cornell University Understanding and Visualizing Data Course Project Instructions: This project guides you through the...

Answered Same Day, Aug 20, 2021

Answer To: SHA571: Understanding and Visualizing Data Cornell University...

Pooja answered on Aug 21 2021
Table of Contents
Part 1 - Data collection plan
Part 2 - Data summary and Visualization
Part 3 – Modelling
Analysis
Dashboard
Appendix
Part 1
Part 2
Part 3
References
Part 1 - Data collection plan
Suicide is a leading cause of death in the United States. It is ranked as the number 2 cause of death for the age groups 10-14, 15-24, and 25-34 years, and as number 4 for the age groups 35-44 and 45-54. With the help of data analysis, I want to analyze the trend in suicide rates on the basis of gender. The concern is the increase in the rate of suicides from 4.1 per 100,000 in the year 2001 to 26.1 per 100,000 in the year 2017. The data can help predict the number of suicides in the future. It can also show whether there is a significant difference in the average number of suicides between males and females.
The data set on suicide rates in the United States is useful for understanding the situation. The data is filtered to the years 2000 through 2015. The variables of concern are year, sex, and suicides/100k pop. The quantitative variables are year and suicides/100k pop. The categorical variable is sex, which is categorized as male or female. The variable sex is used to analyze whether the average number of suicides is greater for males or for females. The variable time (denoted as 1 for January 2000) helps us create a regression model that can predict suicides/100k pop in the future. Chambers, J. M. (2017).
The dataset is obtained from a secondary source, kaggle.com. The data is observational, as the values are recorded from each unit without intervention. Government agencies of various countries collect the data, which is then compiled and published on kaggle.com. The data for the years 2000-2015 would be a good representative of the population. Considering 12 months for each of the 16 years from 2000 through 2015, a sample size of 192 would be appropriate. Bias is reduced by including participants of various age groups for both males and females. Garvan, F. (2001).
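The time index and the sample size mentioned above can be checked with a short sketch. It assumes one observation per calendar month from January 2000 through December 2015; counting inclusively, that window covers 16 calendar years. The function name `time_index` and its `start_year` parameter are hypothetical and are not part of the original analysis.

```python
def time_index(year: int, month: int, start_year: int = 2000) -> int:
    """Return a 1-based month index, where January of start_year maps to 1."""
    if not 1 <= month <= 12:
        raise ValueError("month must be between 1 and 12")
    return (year - start_year) * 12 + month


first = time_index(2000, 1)    # January 2000
last = time_index(2015, 12)    # December 2015
n_months = last - first + 1    # number of months in the window
print(first, last, n_months)   # 1 192 192
```

Under this assumption the last index equals the reported count of 192.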
Part 2 - Data summary and Visualization
There is an increasing trend in the average number of suicides per 100,000 population from the year 2000 until 2015.
    suicides/100k pop

    Statistic             Value
    Mean                  12.94541667
    Standard Error        0.843362193
    Median                7.115
    Mode                  0.34
    Standard Deviation    11.68596934
    Sample Variance       136.5618794
    Kurtosis              -0.598106571
    Skewness              0.767366793
    Range                 42.13
    Minimum               0.27
    Maximum               42.4
    Sum                   2485.52
    Count                 192
The average number of suicides/100k pop is about 13, with a high standard deviation of about 11.7. The skewness value (0.77), along with the histogram, indicates that the distribution of suicides is slightly skewed to the right. There are very few years/months with a high suicide rate. Draper, N. R., & Smith, H. (1998).
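As a consistency check on the summary table, several of its entries can be recomputed from the others. The sketch below uses only the reported values; the variable names are illustrative. It relies on the standard identities mean = sum / count, standard error = standard deviation / sqrt(count), variance = standard deviation squared, and range = maximum - minimum.

```python
import math

# Values copied from the summary table in the report.
n = 192
total = 2485.52
std_dev = 11.68596934
minimum, maximum = 0.27, 42.4
mean_reported, median_reported = 12.94541667, 7.115

mean = total / n                      # should match the reported mean
std_error = std_dev / math.sqrt(n)    # should match the reported standard error
variance = std_dev ** 2               # should match the reported sample variance
value_range = maximum - minimum       # should match the reported range

# A mean well above the median is a common sign of right skew.
right_skew_hint = mean_reported > median_reported

print(mean, std_error, variance, value_range, right_skew_hint)
```

The recomputed values agree with the table, and the mean exceeding the median is in line with the positive skewness.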
As evident from the box plot, there are no outliers in this data set.
    Row Labels      Average of suicides/100k pop
    female          4.593020833
    male            21.2978125
    Grand Total     12.94541667
The average number of suicides for males is much greater than that for females. There are on average about 21 suicides per 100,000 population for males, compared with only about 5 per 100,000 population for females.
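The gap between the two group averages can be quantified directly from the table. The sketch below uses only the reported means; the variable names are illustrative. The last line assumes the two groups have equal numbers of observations, which the report does not state, although the result matches the reported grand total.

```python
# Group averages copied from the table in the report.
female_mean = 4.593020833
male_mean = 21.2978125
grand_mean_reported = 12.94541667

difference = male_mean - female_mean   # absolute gap per 100,000 population
ratio = male_mean / female_mean        # how many times higher the male rate is

# Assumption: equal group sizes, so the grand mean is the simple average.
grand_mean_if_balanced = (female_mean + male_mean) / 2

print(round(difference, 2), round(ratio, 2), round(grand_mean_if_balanced, 4))
```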
Part 3 – Modelling
Analysis
The regression model is given by: suicides/100k pop = 3.93 + 0.0067*time + 16.73*male, where male is 1 for males and 0 for females.
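The fitted equation can be turned into a small prediction helper. This is a minimal sketch using the rounded coefficients reported above; the function name `predict_rate` is hypothetical, and `male` is assumed to be a 0/1 indicator (1 for males, 0 for females).

```python
# Rounded coefficients from the reported regression equation.
INTERCEPT = 3.93
TIME_SLOPE = 0.0067
MALE_EFFECT = 16.73


def predict_rate(time: int, male: bool) -> float:
    """Predicted suicides per 100k population for a time index and sex."""
    return INTERCEPT + TIME_SLOPE * time + MALE_EFFECT * (1 if male else 0)


print(predict_rate(1, male=False))    # first time point, females
print(predict_rate(192, male=True))   # last time point, males
```

Because the coefficients are rounded, predictions are approximate, and predictions beyond the observed time range are extrapolations.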
About 51% of the variation in suicides is explained by time and male. This percentage is considered fair, and it seems that the model can be improved by adding some significant variables. Chatfield, C. (2018).
The null hypothesis is that the model is not significant; the alternative hypothesis is that the model is significant. With F = 100.16 and p < 5%, the null hypothesis is rejected at the 5% level of significance. There is sufficient evidence to conclude that the model is significant. Montgomery, D. C., Peck, E. A., & Vining, G. G. (2012).
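The reported F statistic and the share of explained variation are linked by a standard identity for a regression with an intercept, F = (R²/k) / ((1 - R²)/(n - k - 1)). The sketch below assumes n = 192 observations and k = 2 predictors (time and male); the function names are illustrative.

```python
def f_from_r2(r2: float, n: int, k: int) -> float:
    """Overall F statistic implied by R-squared, n observations, k predictors."""
    return (r2 / k) / ((1 - r2) / (n - k - 1))


def r2_from_f(f: float, n: int, k: int) -> float:
    """R-squared implied by an overall F statistic."""
    return (f * k) / (f * k + (n - k - 1))


n, k = 192, 2                           # assumed from the reported count and model
f_implied = f_from_r2(0.51, n, k)       # from the rounded 51%
r2_implied = r2_from_f(100.16, n, k)    # from the reported F
print(round(f_implied, 2), round(r2_implied, 4))
```

The reported F of 100.16 implies an R² of roughly 0.51, so the two figures are consistent up to rounding.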
With each 1-month increase, the suicide rate increases by 0.0067 suicides per 100,000 population. However, with t = 0.63 and p > 5%, this value is not considered significant.
For males, the suicide rate is 16.7 suicides per 100,000 population higher than for females. This value is considered significant, with t = 14.15 and p < 5%.
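The coefficient tests can be illustrated by backing out each standard error as coefficient / t and forming a rough 95% interval. The sketch uses the rounded coefficients from the regression equation and a normal approximation (about 1.96) in place of the exact t critical value, so the numbers are approximate. The function name `approx_ci` is hypothetical.

```python
from statistics import NormalDist

# Normal approximation to the two-sided 95% critical value (about 1.96).
Z_975 = NormalDist().inv_cdf(0.975)


def approx_ci(coef: float, t_stat: float):
    """Return (standard error, lower bound, upper bound) for a coefficient."""
    se = coef / t_stat
    margin = Z_975 * se
    return se, coef - margin, coef + margin


time_se, time_lo, time_hi = approx_ci(0.0067, 0.63)
male_se, male_lo, male_hi = approx_ci(16.73, 14.15)

print(time_lo < 0 < time_hi)   # interval for time includes zero
print(male_lo > 0)             # interval for male excludes zero
```

An interval that contains zero matches the non-significant time effect, while an interval entirely above zero matches the significant male effect.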
A limitation of regression analysis is its assumption of normality of residuals and equality of error variances. These assumptions are violated, as observed from the normal probability plot and the time residual plot: the normal probability plot does not have the expected shape, and the points in the time residual plot are not randomly scattered.
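For reference, the coordinates of a normal probability plot can be computed as follows. This is a sketch under stated assumptions: the residuals are hypothetical values for illustration only (not the report's data), the function name `normal_plot_points` is made up, and (i + 0.5) / n is one common plotting-position convention among several.

```python
from statistics import NormalDist


def normal_plot_points(residuals):
    """Pair each sorted residual with its theoretical standard-normal quantile."""
    ordered = sorted(residuals)
    n = len(ordered)
    std_normal = NormalDist()
    return [(std_normal.inv_cdf((i + 0.5) / n), r) for i, r in enumerate(ordered)]


# Hypothetical residuals, for illustration only.
example_residuals = [-3.1, -1.2, 0.4, 1.0, 7.5]
points = normal_plot_points(example_residuals)
for quantile, residual in points:
    print(round(quantile, 3), residual)
```

If the residuals were normally distributed, these points would fall close to a straight line; systematic curvature suggests a departure from normality.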
Dashboard
Appendix
Part 1
    year
    time
    sex
    suicides/100k...