University of Canberra
Introduction to Statistics/G 6540/6554
S1 2017
Sample Assignment questions
[a] Your assignment should be done individually only.
[b] Your student ID must be put on the top right-hand corner of every page.
[c] All the questions must be attempted, using Excel or another package.
[d] All graphs should be presented within the body of the assignment under the relevant questions, NOT at the end of the assignment in an appendix.
[e] Your assignment is expected to contain no more than 7 single-sided pages.
[f] A soft copy of your assignment is required to be submitted onto the Moodle site before 11:59pm Friday 14 April 2017; email submission is NOT acceptable. After the submission of your assignment, there will be a number of validation questions available on Moodle based on or related to your assignment, which MUST be answered and marked before 11:59pm Tuesday 18 April 2017.
Question 1 [Week 4 slides and Book page 386]
Of all university degrees awarded in Australia, 50% are bachelor’s degrees, 59% are earned by women, and 29% are bachelor’s degrees earned by women.
[a] What is the proportion of all degrees earned by men?
[b] What is the proportion of all degrees which are bachelor’s degrees earned by men?
[c] Are the events bachelor’s degrees and degrees earned by men independent?
Question 2 [Book Page 232, Ch7 Ex60]
The data set we use here has those observations for teenagers surveyed who used Marijuana and Other Drugs in the Book, with one made-up country added (Drug_Abuse_Outlier data).
[a] Construct a histogram to study the Other Drugs variable, with brief comments.
[b] Find the correlation coefficient between the two variables.
[c] Present a scatter plot of Other Drugs vs Marijuana with the least-squares equation and brief comments.
[d] Identify one outlier in the Y direction, and remove it from the dataset. Find the correlation between Other Drugs and Marijuana, and draw a scatter plot of Other Drugs vs Marijuana with the least-squares equation. How do the correlation and the scatter plot change after the one outlier has been removed? Why?
[e] Compare the two least-squares equations with their R-squared values, and identify which one is better. Why?
[f] Which subject has a particularly high Other Drugs value and which subject has a particularly low Other Drugs value relative to the pattern for the remaining subjects after the one outlier has been removed? Why?
Question 1
Of all university degrees awarded in Australia, 50% are bachelor’s degrees, 59% are earned by women, and 29% are bachelor’s degrees earned by women.
Let events A and B be
A={bachelor’s degrees}
B={degrees earned by women}
Then
P(A)=0.50
P(B)=0.59
P(AB)=0.29
Draw a diagram…
[a]
What is the proportion of all degrees earned by men?
Let C={degrees earned by men}
P(C)=1-P(B)=1-0.59=0.41
[b]
What is the proportion of all degrees which are bachelor’s degrees earned by men?
Use AC={bachelor’s degrees earned by men}
P(AC) = P(A) – P(AB) = 0.5-0.29 = 0.21
[c]
Are the events bachelor’s degrees and degrees earned by men independent?
A={bachelor’s degrees}
C={degrees earned by men}
P(AC) = 0.21
P(A) P(C) = 0.5 * 0.41 = 0.205 =/= P(AC) = 0.21
Hence, bachelor’s degrees and degrees earned by men are not independent.
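The arithmetic for parts [a] to [c] can be checked with a short script. This is only a sketch: the variable names (p_a, p_b, p_ab and so on) are chosen here for illustration, and the tolerance in the comparison is an assumption to absorb floating-point rounding.

```python
# Probabilities stated in the question
p_a = 0.50    # P(A): bachelor's degrees
p_b = 0.59    # P(B): degrees earned by women
p_ab = 0.29   # P(AB): bachelor's degrees earned by women

p_c = 1 - p_b          # [a] P(C): degrees earned by men
p_ac = p_a - p_ab      # [b] P(AC): bachelor's degrees earned by men
product = p_a * p_c    # equals P(AC) only if A and C are independent

# [c] compare with a small tolerance (an assumption) for floating-point rounding
independent = abs(p_ac - product) < 1e-9

print(round(p_c, 2), round(p_ac, 2), round(product, 3), independent)
# prints: 0.41 0.21 0.205 False
```

Since 0.205 differs from 0.21, the check agrees with the conclusion that the two events are not independent, although the difference is small.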
Question 2
The data set we use here has those observations for teenagers surveyed who used Marijuana and Other Drugs in the Book, with one made-up country added (Drug_Abuse_Outlier data).
[a]
Construct a histogram to study the Other Drugs variable, with brief comments.
A histogram from Excel:
Comments:
1. Location: the median is around xxx% for Other Drugs…
2. Spread: the IQR is about zzz%...
3. Shape: the distribution is not symmetric; it is right-skewed, with a few countries having very high percentages
4. Outliers: It may contain an outlier or two, given the long tail on the right-hand side. We need further analysis to check…
Note the data, with Out-country added:
[b]
Find the correlation coefficient between the two variables.
The correlation is r = 0.628
Comment:
It is a positive, moderate correlation, as its size is below 0.7 and not close to 1.
[c]
Present a scatter plot of Other Drugs vs Marijuana with the least-squares equation and brief comments.
Plot:
Comment:
It appears to be a positive, linear, moderate relationship, with one possible outlier country.
Equation:
Fitted_Other_Drugs = -1.8294+ 0.7166 * Marijuana
If Marijuana increases by 1 percentage point, the fitted Other Drugs value increases by 0.7166 percentage points.
R-squared = 0.3945, r = 0.6281 > 0 and correspondingly b1 = +0.7166 > 0
n = 12
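As a quick consistency check on the figures above, the following sketch squares r to recover R-squared and reads the slope off as the difference between two fitted values one unit apart. The function name fitted_other_drugs and the inputs 20 and 21 are illustrative assumptions, not values from the data set.

```python
r = 0.6281                    # correlation from part [b]
b0, b1 = -1.8294, 0.7166      # intercept and slope of the fitted line

# In simple linear regression, R-squared is the square of r
r_squared = r ** 2

def fitted_other_drugs(marijuana_pct):
    """Fitted Other Drugs (%) for a given Marijuana (%) value."""
    return b0 + b1 * marijuana_pct

# Slope interpretation: fitted change for a 1-unit rise in Marijuana
# (20 and 21 are arbitrary illustrative inputs)
delta = fitted_other_drugs(21.0) - fitted_other_drugs(20.0)

print(round(r_squared, 4), round(delta, 4))
```

The sign of b1 matches the sign of r, as the line above notes.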
[d]
Identify one outlier in the Y direction, and remove it from the dataset. Find the correlation between Other Drugs and Marijuana, and draw a scatter plot of Other Drugs vs Marijuana with the least-squares equation. How do the correlation and the scatter plot change after the one outlier has been removed? Why?
Use the 5 number summary and 1.5 IQR for Other Drugs (%):
Q1 | 3
Q3 | 21.75
IQR | 18.75
1.5 IQR | 28.125
Q3+1.5IQR | 49.875
Q1-1.5IQR | -25.125
So Country number 9 (i.e. Out-country, whose value of 60 is larger than 49.875) is identified as an outlier; it will be deleted next.
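The 1.5 IQR rule used here can be sketched as follows, starting from the quartiles reported above rather than from the raw data. The helper name is_outlier is a hypothetical label for illustration.

```python
q1, q3 = 3.0, 21.75           # quartiles of Other Drugs (%) reported above

iqr = q3 - q1                 # interquartile range
lower_fence = q1 - 1.5 * iqr  # values below this are flagged
upper_fence = q3 + 1.5 * iqr  # values above this are flagged

def is_outlier(value):
    """Flag a value that falls outside the 1.5 IQR fences."""
    return value < lower_fence or value > upper_fence

print(iqr, lower_fence, upper_fence, is_outlier(60))
# prints: 18.75 -25.125 49.875 True
```

Because percentages cannot be negative, only the upper fence matters in practice for this variable.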
For new data, with the outlier deleted:
r= 0.9341
 | Marijuana (%) | Other Drugs (%)
Marijuana (%) | 1 |
Other Drugs (%) | 0.93410002 | 1
Comment: It’s increased (from r = 0.6281)…
Plot:
Comment:
The relationship is still positive, but it now looks more linear and stronger, with no clear outliers…
[e]
Compare the two least-squares equations with their R square values, and identify which one is better. Why?
New_Fitted_Other_Drugs = -3.0678 + 0.6150 * Marijuana
Comment:
If Marijuana increases by 1 percentage point, the new fitted value increases by 0.6150 percentage points...
R-squared is increased… so the new model is better than the old…
SUMMARY OUTPUT

Regression Statistics
Multiple R | 0.93410002
R Square | 0.872542847
Adjusted R Square | 0.858380941
Standard Error | 3.853492185
Observations | 11

 | Coefficients | Standard Error | t Stat | P-value | Lower 95% | Upper 95%
Intercept | -3.067799158 | 2.204360594 | -1.3917 | 0.197448 | -8.0544093 | 1.91881095
Marijuana (%) | 0.615003007 | 0.07835103 | 7.849329 | 2.58E-05 | 0.4377607 | 0.79224535
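Several entries in the summary output are linked by simple formulas, which the sketch below recomputes from the reported values. It assumes a simple regression with one predictor, so that R Square is the square of Multiple R and the adjusted value uses n - 2 residual degrees of freedom. The variable names are illustrative.

```python
multiple_r = 0.93410002
n = 11                                   # observations after removing the outlier
slope, slope_se = 0.615003007, 0.07835103
intercept, intercept_se = -3.067799158, 2.204360594
old_r_square = 0.3945                    # R-squared with the outlier included

# R Square is the square of Multiple R (one-predictor case)
r_square = multiple_r ** 2

# Adjusted R Square penalises for the single predictor: df = n - 2
adj_r_square = 1 - (1 - r_square) * (n - 1) / (n - 2)

# Each t Stat is the coefficient divided by its standard error
t_slope = slope / slope_se
t_intercept = intercept / intercept_se

print(round(r_square, 4), round(adj_r_square, 4))
print(round(t_slope, 3), round(t_intercept, 3))
print("new model explains more variation:", r_square > old_r_square)
```

The recomputed values should agree with the table up to rounding, and the comparison of the two R-squared values supports preferring the model fitted without the outlier.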
[f]
Which subject has a particularly high Other Drugs value and which subject has a particularly low Other Drugs value relative to the pattern for the remaining subjects after the one outlier has been removed? Why?
[To look at their residuals… For ideas, see also Book Ch 7 Example on page 218 and Exercise 75 on page 236]
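One way to look at residuals is sketched below, using the new fitted line from part [e]. The two points in placeholder are hypothetical values invented purely to show the mechanics; they are NOT taken from the Drug_Abuse_Outlier data and must be replaced with the actual observations.

```python
b0, b1 = -3.0678, 0.6150      # new fitted line from part [e]

def residual(marijuana_pct, other_drugs_pct):
    """Observed Other Drugs (%) minus the fitted value."""
    fitted = b0 + b1 * marijuana_pct
    return other_drugs_pct - fitted

# Hypothetical placeholder points (NOT the assignment data):
# name -> (Marijuana %, Other Drugs %)
placeholder = {"Country X": (30.0, 20.0), "Country Y": (30.0, 10.0)}

residuals = {name: residual(m, o) for name, (m, o) in placeholder.items()}

# Largest positive residual: unusually high relative to the pattern;
# largest negative residual: unusually low relative to the pattern
highest = max(residuals, key=residuals.get)
lowest = min(residuals, key=residuals.get)

print(highest, lowest)
```

With the real data, the subject with the largest positive residual sits furthest above the fitted line and the one with the largest negative residual sits furthest below it.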