Q1 Ophthalmologists from Victoria and Western Australia have surveyed children in the Western Desert in Western Australia to assess the prevalence of trachoma. The data below come from two years of a longitudinal survey. There are six stages of trachoma, of increasing severity. Children were observed to have trachoma up to the fourth stage. The data below show the stage of trachoma, including an additional level: those with no signs of trachoma.

Table 1.1: Child trachoma in the Western Desert

Stage                    1995   2005
None                      120    259
Follicular                 88     51
Intense inflammatory        7      3
Trachomatous scarring       0      2
Trichiasis                  2      0

(a) Use R to find point estimates of the population proportion of children with no signs of trachoma, for the 1995 survey and for the 2005 survey.

(b) Using an appropriate method, find a 95% confidence interval for the 1995 proportion with no signs of trachoma.

(c) Find a point and an interval estimate of the difference between the 2005 and 1995 proportions of children with no signs of trachoma. Comment on the claim that the incidence of trachoma among children in the Western Desert in WA decreased between 1995 and 2005.

(d) Using an appropriate method, find a 95% confidence interval for the 2005 proportion of children observed to have trachoma at the intense inflammatory stage or worse.

(e) Comment on any assumptions made in your calculations so far.

(f) Enter the data into R and produce the table of frequencies.

(g) On separate graphs for the 1995 and 2005 surveys, plot the proportion of children at each stage vs the stage number (0 to 4). (Rather than use the existing table, it will be easier to re-enter the proportions and stage numbers into a new data frame.) There are two arguments to use inside the plot function which will improve these graphs:
• ylim = c(0,1) sets the range of the vertical axis as 0 to 1; this enables the two graphs to be compared more easily.
• type="l" sets the type of graph to lines rather than points, i.e. it joins the points up.
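The two plot arguments described in (g) can be sketched as follows, assuming the counts of Table 1.1 are re-entered by hand. The data frame name `trachoma` and its column names are illustrative choices, not part of the assignment.

```r
# Counts re-entered from Table 1.1; stage numbers run from 0 (None)
# to 4 (Trichiasis). Object and column names are illustrative assumptions.
trachoma <- data.frame(
  stage = 0:4,
  n1995 = c(120, 88, 7, 0, 2),
  n2005 = c(259, 51, 3, 2, 0)
)

# Convert the counts to proportions within each survey year.
trachoma$p1995 <- trachoma$n1995 / sum(trachoma$n1995)
trachoma$p2005 <- trachoma$n2005 / sum(trachoma$n2005)

# One graph per survey. ylim fixes the vertical axis at 0 to 1 so the
# two graphs are comparable; type = "l" joins the points with lines.
plot(trachoma$stage, trachoma$p1995, ylim = c(0, 1), type = "l",
     xlab = "Stage", ylab = "Proportion", main = "1995")
plot(trachoma$stage, trachoma$p2005, ylim = c(0, 1), type = "l",
     xlab = "Stage", ylab = "Proportion", main = "2005")
```

Keeping the counts and the derived proportions in one data frame makes it easy to check that each year's proportions sum to 1 before plotting.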
[2 + 3 + 5 + 3 + 3 + 4 + 4 = 24 marks]

Q2 The data in count10.csv [2, 3, 3, . . . , 0] were obtained as counts of the number of items in batches of ten, which had a particular characteristic.

(a) Describe the data (including appropriate descriptive statistics and plots).

(b) Show that for any binomial distribution, $\operatorname{var}(X) \le \operatorname{E}(X)$.

(c) A binomial distribution would be appropriate for such data if the items were independent and each was equally likely to have the characteristic. Explain why these data are apparently incompatible with the binomial distribution.

(d) The following proposals have been put forward to explain the failure of the binomial distribution to describe these data.
  i. The batches are from different sources.
  ii. The proportion with the characteristic changes over time.
Discuss briefly (a sentence or two at most) each proposal, indicating whether it could result in data like those obtained; and how it might be checked.

[6 + 2 + 2 + 4 = 14 marks]

MAST90044 Thinking and Reasoning with Data Assignment 1

Q3
(a) i. Generate 1000 independent observations on a uniform (0, 1000) distribution, i.e. each observation is equally likely to take any value in the interval (0, 1000) [runif].
  ii. Think of these data as 1000 random time-points in the time interval (0, 1000). Sort the data from smallest to largest [sort] so that you obtain the sequence of time-points from time zero forward. Plot some of the points, say the points in (0, 20), to see what your random points look like.
  iii. Obtain the gaps between the time-points, i.e. the differences between successive time-points [diff].

(b) Describe the distribution of the gaps (including appropriate descriptive statistics and plots).

(c) Theory (and common-sense?) tells us that the mean gap will be very close to 1. What is the median gap?

(d) Find an estimate and a 95% confidence interval for the probability that the gap is less than 1.
(e) Test the hypothesis that the 95th percentile of the gap distribution is equal to 3.

[6 + 5 + 2 + 3 + 4 = 20 marks]

Q4 The department of Social Engineering is facing a discrimination complaint. The plaintiff is a high-ranking employee of this department who was not promoted after the last election for prime minister. He believes that the reason for his non-promotion is that only candidates who supported the winning candidate for prime minister were promoted. Other candidates like him who did not support the winning candidate were not promoted. A summary of the available facts is as follows:

Result: Including the plaintiff, 10 employees of the department were up for promotion. Seven were promoted and of these seven, six had supported the winning candidate for prime minister. The remaining four candidates did not support the winning candidate.

Promotion Procedure: The promotion procedure is based on the scores from a standard civil service test that was taken by all 10 candidates. The scores are ranked and the promotion procedure requires that for each promotion slot, the successful candidate must be selected from those who are currently among the top three ranked candidates (including ties). This is applied sequentially until all the available promotion slots have been filled.

Plaintiff's Claim: The plaintiff was ranked number 4 and believes that this ranking should have been more than sufficient to obtain a promotion from the seven that were available. Since support of the winning candidate is not a requirement for promotion, the plaintiff claims that discrimination occurred.

Additional Information: Employees who were not candidates for promotion were asked a question regarding whether they felt a positive or negative change in their job conditions after the election was held. They were also asked whether they supported the winning candidate.

The Data

The data are presented in three tables.
Table 4.1 summarises the support by promotion results, for the 10 candidates. As can be seen from Table 4.2, the plaintiff (D) only becomes eligible for promotion after the first slot has been filled. Table 4.3 summarises the responses obtained from the employees who were not candidates for promotion.

Table 4.1: Support by Promotion

                          promoted   not promoted
supported winner                 6              0
did not support winner           1              3

Table 4.2: Candidate Ranking (1 = highest)

Candidate   A  B  C  D  E  F  G  H  I  J
Rank        1  2  3  4  5  6  7  8  8  8

Table 4.3: Contribution by Change in Job Condition

                          positive   non-positive
supported winner                 4              2
did not support winner           1             12

(a) The P-value based on the Table 4.1 data suggests there is evidence to reject the null hypothesis of independence (support and promotion). What P-value is being referred to here? What is your interpretation of this statistically significant result? Are there any other factors that should be taken into account?

(b) Although Table 4.3 can be analyzed in similar fashion, a significant result would not provide evidence of discrimination. Why?

(c) An argument presented against the plaintiff was the following: with reference to Table 4.2 and by assuming each promotion is chosen at random among the top three ranked candidates currently available, the probability that candidate D would not be one of the seven promoted is $(2/3)^6 = 0.0878$. Since this probability is larger than the minimum standard of 5%, it is argued that the plaintiff's complaint should be rejected. Comment.

(d) What is your judgement? Give your reasons.
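The arithmetic quoted in (c) can be checked directly in R. The sketch below also enters Table 4.1 as a matrix and runs Fisher's exact test on it. That choice of test is an assumption made here for illustration, since the question does not name the test behind the P-value in (a), and the object names are hypothetical.

```r
# Probability quoted in (c): D is passed over in each of six draws,
# each draw taken at random from three candidates.
p_not_promoted <- (2 / 3)^6
round(p_not_promoted, 4)

# Table 4.1 entered as a 2 x 2 matrix (rows: support, columns: outcome).
promotion <- matrix(c(6, 0,
                      1, 3),
                    nrow = 2, byrow = TRUE,
                    dimnames = list(
                      support = c("supported winner",
                                  "did not support winner"),
                      outcome = c("promoted", "not promoted")))

# Assumption: Fisher's exact test is one possible test of independence
# for a table this small; the source does not say which test was used.
fisher_result <- fisher.test(promotion)
fisher_result$p.value
```

Whether the random-selection model behind $(2/3)^6$ is itself reasonable is a separate matter that the code does not address.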