



University of Canberra


Introduction to Statistics/G 6540/6554



S1 2017


Sample Assignment questions

[a] Your assignment should be done individually only.
[b] Your student ID must be put on the top right-hand corner of every page.
[c] All the questions must be attempted, using Excel or another package.
[d] All graphs should be presented within the body of the assignment under the relevant questions, NOT at the end of the assignment in an appendix.
[e] Your assignment is expected to contain no more than 7 single-sided pages.
[f] A soft copy of your assignment must be submitted to the Moodle site before 11:59pm Friday 14 April 2017; email submission is NOT acceptable. After the submission of your assignment, there will be a number of validation questions available on Moodle based on or related to your assignment, which MUST be answered and marked before 11:59pm Tuesday 18 April 2017.

Question 1 [Week 4 slides and Book page 386]

Of all university degrees awarded in Australia, 50% are bachelor’s degrees, 59% are earned by women, and 29% are bachelor’s degrees earned by women.
[a] What is the proportion of all degrees earned by men?
[b] What is the proportion of all degrees which are bachelor’s degrees earned by men?
[c] Are the events bachelor’s degrees and degrees earned by men independent?

Question 2 [Book Page 232, Ch7 Ex60]


The data set we use here has those observations for teenagers surveyed who used Marijuana and Other Drugs in the Book, with one made-up country added (Drug_Abuse_Outlier data).
[a] Construct a histogram to study the Other Drugs variable, with brief comments.
[b] Find the correlation coefficient between the two variables.
[c] Present a scatter plot of Other Drugs vs Marijuana with the least-squares equation and brief comments.
[d] Identify one outlier in the Y direction, and remove it from the dataset. Find the correlation between Other Drugs and Marijuana, and draw a scatter plot of Other Drugs vs Marijuana with the least-squares equation. How do the correlation and the scatter plot change after the one outlier has been removed? Why?
[e] Compare the two least-squares equations with their R-squared values, and identify which one is better. Why?
[f] Which subject has a particularly high Other Drugs value and which subject has a particularly low Other Drugs value relative to the pattern for the remaining subjects after the one outlier has been removed? Why?

Question 1


Of all university degrees awarded in Australia, 50% are bachelor’s degrees, 59% are earned by women, and 29% are bachelor’s degrees earned by women.


Let events A and B be
A={bachelor’s degrees}
B={degrees earned by women}
Then
P(A)=0.50
P(B)=0.59
P(AB)=0.29
Draw a diagram…
[a]
What is the proportion of all degrees earned by men?


Let C={degrees earned by men}
P(C)=1-P(B)=1-0.59=0.41
[b]
What is the proportion of all degrees which are bachelor’s degrees earned by men?


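Use AC={bachelor’s degrees earned by men}
Every bachelor’s degree is earned by either a woman or a man, so A splits into the disjoint events AB and AC:
P(AC) = P(A) - P(AB) = 0.5 - 0.29 = 0.21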
[c]
Are the events bachelor’s degrees and degrees earned by men independent?

A={bachelor’s degrees}
C={degrees earned by men}
P(AC) = 0.21
P(A) P(C) = 0.5 * 0.41 = 0.205 =/= P(AC) = 0.21
Hence, bachelor’s degrees and degrees earned by men are not independent.
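The three parts of Question 1 can be re-checked numerically. The sketch below uses only the probabilities stated in the question; the variable names are illustrative, and the tolerance in the independence check is an assumption made to absorb floating-point rounding.

```python
# Probabilities given in the question
p_a = 0.50   # P(A): bachelor's degree
p_b = 0.59   # P(B): degree earned by a woman
p_ab = 0.29  # P(AB): bachelor's degree earned by a woman

p_c = 1 - p_b        # [a] complement rule: degree earned by a man
p_ac = p_a - p_ab    # [b] A is the disjoint union of AB and AC
product = p_a * p_c  # [c] the value P(AC) would take under independence

# Independence requires P(AC) == P(A) * P(C); compare with a small
# tolerance (an assumption) because the inputs are floats.
independent = abs(p_ac - product) < 1e-9

print(round(p_c, 4), round(p_ac, 4), round(product, 4), independent)
```

The gap between 0.21 and 0.205 is small, but independence needs exact equality, so the events are treated as dependent.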

Question 2


The data set we use here has those observations for teenagers surveyed who used Marijuana and Other Drugs in the Book, with one made-up country added (Drug_Abuse_Outlier data).

[a]
Construct a histogram to study the Other Drugs variable, with brief comments.


A histogram from Excel:
Comments:
1. Location: the median is around xxx% for Other Drugs…
2. Spread: the IQR is about zzz%...
3. Shape: the distribution is not symmetric; it is right-skewed, with a few countries having very high percentages
4. Outliers: there may be an outlier or two, given the long right-hand tail. We need further analysis to check…
Note the data, with Out-country added:
[b]
Find the correlation coefficient between the two variables.

The correlation is r = 0.628
Comment:
It is a moderate positive correlation, since r is below 0.7 and not close to 1.
[c]
Present a scatter plot of Other Drugs vs Marijuana with the least-squares equation and brief comments.

Plot:
Comment:
The relationship appears positive, linear, and moderate, with one possible outlier country.
Equation:
Fitted_Other_Drugs = -1.8294 + 0.7166 * Marijuana
If Marijuana increases by 1 percentage point, the fitted Other Drugs value increases by 0.7166 percentage points.
R-squared = 0.3945, r = 0.6281 > 0 and correspondingly b1 = +0.7166 > 0
n = 12
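The link between R-squared, r, and the slope can be sketched in a few lines. The coefficients and R-squared are the ones reported above; the function name and the two Marijuana values (20 and 21) are hypothetical inputs chosen only to show a one-point increase.

```python
import math

# Values reported for the fit on all 12 observations
b0, b1 = -1.8294, 0.7166   # intercept and slope
r_squared = 0.3945

def fitted_other_drugs(marijuana_pct):
    """Fitted Other Drugs (%) from the least-squares line."""
    return b0 + b1 * marijuana_pct

# In simple linear regression r has the same sign as the slope,
# and its magnitude is the square root of R-squared.
r = math.copysign(math.sqrt(r_squared), b1)

# Hypothetical inputs: a rise of one percentage point in Marijuana
change = fitted_other_drugs(21.0) - fitted_other_drugs(20.0)

print(round(r, 4), round(change, 4))
```

Recovering r from R-squared this way only works with a single predictor, which is the case here.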
[d]
Identify one outlier in the Y direction, and remove it from the dataset. Find the correlation between Other Drugs and Marijuana, and draw a scatter plot of Other Drugs vs Marijuana with the least-squares equation. How do the correlation and the scatter plot change after the one outlier has been removed? Why?


Use the 5 number summary and 1.5 IQR for Other Drugs (%):

Q1              3
Q3              21.75
IQR             18.75
1.5 IQR         28.125
Q3 + 1.5 IQR    49.875
Q1 - 1.5 IQR    -25.125


So country number 9 (i.e. Out-country, with 60 larger than 49.875) is identified as an outlier; it will be deleted next.
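The 1.5 IQR rule used here can be written out as a short check. This sketch starts from the quartiles reported above rather than from the raw data, which is not reproduced in this document; the helper name `is_outlier` is illustrative.

```python
# Quartiles reported for Other Drugs (%)
q1, q3 = 3.0, 21.75

iqr = q3 - q1
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

def is_outlier(value):
    """Flag a value that falls outside the 1.5 * IQR fences."""
    return value < lower_fence or value > upper_fence

# Out-country has Other Drugs = 60
print(iqr, lower_fence, upper_fence, is_outlier(60))
```

Because percentages cannot be negative, only the upper fence can flag a value in this data set.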
For new data, with the outlier deleted:
r = 0.9341

                   Marijuana (%)    Other Drugs (%)
Marijuana (%)      1
Other Drugs (%)    0.934100021


Comment: The correlation has increased (from r = 0.6281)…
Plot:
Comment:
The relationship is still positive, but appears more linear and stronger, with no clear outliers…
[e]
Compare the two least-squares equations with their R square values, and identify which one is better. Why?


New_Fitted_Other_Drugs = -3.0678 + 0.6150 * Marijuana
Comment:
If Marijuana increases by 1 percentage point, the new fitted value increases by 0.6150 percentage points...
R-squared has increased (from 0.3945 to 0.8725)… so the new model is better than the old…

SUMMARY OUTPUT

Regression Statistics
Multiple R           0.93410002
R Square             0.872542847
Adjusted R Square    0.858380941
Standard Error       3.853492185
Observations         11

                 Coefficients    Standard Error    t Stat      P-value     Lower 95%     Upper 95%
Intercept        -3.067799158    2.204360594       -1.3917     0.197448    -8.0544093    1.91881095
Marijuana (%)    0.615003007     0.07835103        7.849329    2.58E-05    0.4377607     0.79224535
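Several entries in the Excel summary follow from one another, which gives a quick consistency check. The sketch below uses only figures from the output above; the adjusted R-squared formula assumes a single predictor, and the variable names are illustrative.

```python
# Figures taken from the regression summary (outlier removed)
n = 11
multiple_r = 0.93410002
slope, slope_se = 0.615003007, 0.07835103
intercept, intercept_se = -3.067799158, 2.204360594

# R Square is the square of Multiple R
r_square = multiple_r ** 2

# Adjusted R Square with one predictor: 1 - (1 - R^2)(n - 1)/(n - 2)
adj_r_square = 1 - (1 - r_square) * (n - 1) / (n - 2)

# Each t Stat is the coefficient divided by its standard error
t_slope = slope / slope_se
t_intercept = intercept / intercept_se

print(round(r_square, 4), round(adj_r_square, 4))
print(round(t_slope, 3), round(t_intercept, 3))
```

The slope's P-value (2.58E-05) is far below 0.05 while the intercept's (0.197448) is not, consistent with the 95% interval for the intercept containing zero.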


[f]
Which subject has a particularly high Other Drugs value and which subject has a particularly low Other Drugs value relative to the pattern for the remaining subjects after the one outlier has been removed? Why?

[Look at their residuals… For ideas, see also Book Ch 7 Example on page 218 and Exercise 75 on page 236]
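One way to act on the residual hint is sketched below. The coefficients come from the new fitted equation; the function names are illustrative, and the three rows are hypothetical placeholders, NOT values from the Drug_Abuse_Outlier data, so they should be replaced with the 11 remaining observations.

```python
# Coefficients of the fit after removing the outlier
b0, b1 = -3.0678, 0.6150

def residual(marijuana_pct, other_drugs_pct):
    """Observed minus fitted Other Drugs (%)."""
    fitted = b0 + b1 * marijuana_pct
    return other_drugs_pct - fitted

def extreme_residuals(rows):
    """Return the (label, residual) pairs with the largest and smallest residual.

    rows: iterable of (label, marijuana_pct, other_drugs_pct).
    """
    scored = [(label, residual(x, y)) for label, x, y in rows]
    highest = max(scored, key=lambda pair: pair[1])
    lowest = min(scored, key=lambda pair: pair[1])
    return highest, lowest

# Hypothetical placeholder rows, NOT taken from the data set
rows = [("P1", 10.0, 3.0), ("P2", 20.0, 15.0), ("P3", 30.0, 12.0)]
highest, lowest = extreme_residuals(rows)
print(highest[0], lowest[0])
```

The subject with the largest positive residual sits furthest above the line (particularly high Other Drugs for its Marijuana level), and the one with the most negative residual sits furthest below it.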
May 08, 2022