This is a Data Science assignment consisting mostly of math-related questions. Please show all calculations for each question; no coding is required.
1)
We would like to know if the age of a child is related to the number of cavities he or she has. The data are shown below. If there is a significant relationship, predict the number of cavities for a child of 11.
(20 points)
| Age of child (x) | 6 | 8 | 9 | 10 | 12 | 14 |
| No. of cavities (y) | 2 | 1 | 3 | 4 | 6 | 5 |
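A short script can serve as a check on the hand calculation for question 1. The sketch below computes Pearson's r, tests it with a t statistic, and evaluates the least-squares line at x = 11. The 5% two-tailed significance level and the critical value 2.776 (df = 4, from a t table) are assumptions, since the question does not state a level; the function name `linear_fit` is illustrative only.

```python
import math

# Data from question 1
ages = [6, 8, 9, 10, 12, 14]
cavities = [2, 1, 3, 4, 6, 5]


def linear_fit(x, y):
    """Return (r, slope, intercept) using the computational sum formulas."""
    n = len(x)
    sum_x, sum_y = sum(x), sum(y)
    sum_xy = sum(a * b for a, b in zip(x, y))
    sum_xx = sum(a * a for a in x)
    sum_yy = sum(b * b for b in y)
    num = n * sum_xy - sum_x * sum_y
    den_x = n * sum_xx - sum_x ** 2
    den_y = n * sum_yy - sum_y ** 2
    r = num / math.sqrt(den_x * den_y)
    slope = num / den_x
    intercept = (sum_y - slope * sum_x) / n
    return r, slope, intercept


r, slope, intercept = linear_fit(ages, cavities)
n = len(ages)

# t statistic for H0: no linear correlation, with n - 2 degrees of freedom
t_stat = r * math.sqrt((n - 2) / (1 - r ** 2))
T_CRIT = 2.776  # assumed: two-tailed, alpha = 0.05, df = 4 (t table value)
significant = abs(t_stat) > T_CRIT

# Predict only makes sense if the relationship is significant
prediction = intercept + slope * 11

print(f"r = {r:.4f}, t = {t_stat:.4f}, significant = {significant}")
print(f"y' = {intercept:.4f} + {slope:.4f}x, prediction at x = 11: {prediction:.2f}")
```

Under these assumptions the relationship comes out significant, and the predicted value at age 11 is roughly 4.14 cavities.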
2)
Assume we gathered a random sample of the following dataset, where the independent variable (x) represents the number of hours a student studies and the dependent variable (y) represents the student's exam score. Is there a correlation between the two variables, and if so, how strong is it?
(20 points)
| Hours of study (X) | Exam score (Y) |
| 6 | 40 |
| 10 | 50 |
| 18 | 100 |
| 15 | 80 |
| 12 | 65 |
| 16 | 90 |
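As a check on the hand calculation for question 2, the following sketch computes the Pearson correlation coefficient from deviations about the means. It assumes that Pearson's r is the intended measure of correlation strength, which the question does not state explicitly.

```python
import math

# Data from question 2
hours = [6, 10, 18, 15, 12, 16]
scores = [40, 50, 100, 80, 65, 90]

n = len(hours)
mean_x = sum(hours) / n
mean_y = sum(scores) / n

# Sums of products and squares of deviations from the means
s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(hours, scores))
s_xx = sum((x - mean_x) ** 2 for x in hours)
s_yy = sum((y - mean_y) ** 2 for y in scores)

r = s_xy / math.sqrt(s_xx * s_yy)
r_squared = r ** 2  # share of the variance in scores explained by hours

print(f"r = {r:.4f}, r^2 = {r_squared:.4f}")
```

A value of r close to +1, as here (about 0.985), indicates a strong positive linear correlation.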
3)
The average age of a vehicle registered in the United States is 8 years, or 96 months. Assume that the standard deviation is 16 months. If a random sample of 36 vehicles is selected, find the probability that the mean of their ages is between 90 and 100 months. (10 points)
Hint: use the normal distribution and the z-score.
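The hint can be followed numerically as a check. This sketch standardizes the two bounds using the standard error of the mean, sigma / sqrt(n), and uses Python's standard-library `statistics.NormalDist` in place of a z table.

```python
import math
from statistics import NormalDist

# Values from question 3 (in months)
mu = 96      # population mean
sigma = 16   # population standard deviation
n = 36       # sample size

# Standard error of the sample mean
std_error = sigma / math.sqrt(n)

# z-scores for the two bounds on the sample mean
z_low = (90 - mu) / std_error
z_high = (100 - mu) / std_error

# P(90 < sample mean < 100) = Phi(z_high) - Phi(z_low)
standard_normal = NormalDist()
probability = standard_normal.cdf(z_high) - standard_normal.cdf(z_low)

print(f"z_low = {z_low:.2f}, z_high = {z_high:.2f}, P = {probability:.4f}")
```

The z-scores are -2.25 and 1.50, giving a probability of about 0.921.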
4)
Assume we gathered a random sample of the following dataset. Each column represents the weekly sales of one of two stores. We would like to decide which store (A or B) is more likely to be able to predict its weekly sales with more certainty. (20 points)
| Store A | Store B |
| 2000 | 2500 |
| 4500 | 6500 |
| 3000 | 2000 |
| 1500 | 5000 |
| 6000 | 1200 |
| 4200 | 7000 |
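The question does not say how "certainty" should be measured. One common reading, assumed here, is to compare the coefficient of variation (sample standard deviation divided by the mean) of each store: the store with the lower value has less relative spread in its weekly sales. The sketch below uses the sample standard deviation (n - 1 in the denominator), which is also an assumption.

```python
from statistics import mean, stdev

# Weekly sales from question 4
store_a = [2000, 4500, 3000, 1500, 6000, 4200]
store_b = [2500, 6500, 2000, 5000, 1200, 7000]


def coefficient_of_variation(values):
    """Sample standard deviation relative to the mean (unitless)."""
    return stdev(values) / mean(values)


cv_a = coefficient_of_variation(store_a)
cv_b = coefficient_of_variation(store_b)

# Lower relative spread is read here as "more certainty"
more_predictable = "A" if cv_a < cv_b else "B"

print(f"Store A: mean = {mean(store_a):.2f}, sd = {stdev(store_a):.2f}, CV = {cv_a:.4f}")
print(f"Store B: mean = {mean(store_b):.2f}, sd = {stdev(store_b):.2f}, CV = {cv_b:.4f}")
print(f"Lower relative variability: Store {more_predictable}")
```

With these figures Store A has the lower coefficient of variation (about 0.48 against about 0.61 for Store B).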
5)
Assume we gather a random sample of the following dataset. We are trying to predict the body fat % of a person based on his/her weight in kg.
(30 points)
a)
Find the best fitted line of the given data above.
b)
Find the R-squared value.
c)
Find the F value of the best fitted line.
d)
Why does your best fitted line predict better than the line given by the following equation?
Y = 0.5x + 3
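The dataset for question 5 is not reproduced here, so the sketch below defines only helper functions that can be applied to the weight and body fat columns once they are available. The function names are hypothetical, and the F value is computed for simple linear regression with 1 and n - 2 degrees of freedom. For part d, the comparison rests on the sum of squared errors (SSE): the least-squares line minimizes SSE, so no other line, including Y = 0.5x + 3, can have a smaller one on the same data.

```python
def fit_line(x, y):
    """Least-squares slope and intercept for y = intercept + slope * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    s_xy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    s_xx = sum((a - mean_x) ** 2 for a in x)
    slope = s_xy / s_xx
    return slope, mean_y - slope * mean_x


def sse(x, y, slope, intercept):
    """Sum of squared errors of any line on the data."""
    return sum((b - (intercept + slope * a)) ** 2 for a, b in zip(x, y))


def sst(y):
    """Total sum of squares about the mean of y."""
    mean_y = sum(y) / len(y)
    return sum((b - mean_y) ** 2 for b in y)


def r_squared(x, y, slope, intercept):
    """Share of the variation in y explained by the line."""
    return 1 - sse(x, y, slope, intercept) / sst(y)


def f_value(x, y, slope, intercept):
    """F = (SSR / 1) / (SSE / (n - 2)) for a fitted simple regression line."""
    error = sse(x, y, slope, intercept)
    regression = sst(y) - error
    return regression / (error / (len(x) - 2))
```

To compare against the line in part d, call `sse(x, y, 0.5, 3)` and `sse(x, y, slope, intercept)` on the same data; the second value will never be larger than the first.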
6)
Build a decision tree classifier based on the following dataset. There are three independent variables (a1, a2, a3) that will help with the prediction, and the ‘Classification’ column is the dependent variable. (40 points)
7)
Consider the following confusion matrix:
(10 points)
| | Predicted Yes | Predicted No |
| Actual Yes | 95 | 5 |
| Actual No | 5 | 45 |
a)
Calculate the sensitivity, precision, and accuracy for the confusion matrix.
b)
Define (give the values of) type I and type II errors in the given confusion matrix and explain the difference between the two.
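The metrics in question 7 can be checked with a few lines of arithmetic. The sketch below assumes that "Yes" is the positive class, so that a type I error is a false positive (actual No, predicted Yes) and a type II error is a false negative (actual Yes, predicted No).

```python
# Cells of the confusion matrix from question 7 ("Yes" assumed positive)
true_positive = 95    # actual Yes, predicted Yes
false_negative = 5    # actual Yes, predicted No
false_positive = 5    # actual No, predicted Yes
true_negative = 45    # actual No, predicted No

total = true_positive + false_negative + false_positive + true_negative

# Sensitivity (recall): share of actual Yes cases that were caught
sensitivity = true_positive / (true_positive + false_negative)

# Precision: share of predicted Yes cases that were correct
precision = true_positive / (true_positive + false_positive)

# Accuracy: share of all cases classified correctly
accuracy = (true_positive + true_negative) / total

type_i_errors = false_positive    # false alarms
type_ii_errors = false_negative   # misses

print(f"sensitivity = {sensitivity:.4f}")
print(f"precision   = {precision:.4f}")
print(f"accuracy    = {accuracy:.4f}")
print(f"type I = {type_i_errors}, type II = {type_ii_errors}")
```

Here sensitivity and precision are both 0.95 and accuracy is 140/150, about 0.933. The two error counts are equal only by coincidence of the data; they measure different mistakes.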