The same scientist has the feeling that the gene frequencies may vary from one race to another (e.g., Mongolian, Negroid, Caucasian) and, hence, from one geographical area to another depending on their relative components. He collected a data set from one of the Caribbean islands where Negroid and Caucasian mixtures are profound (with very little Mongolian presence). As such, he decided to work with the following mixture model:
$$p = \pi\, p^{(N)} + (1 - \pi)\, p^{(C)},$$
where $0 \le \pi \le 1$ and, for the $p^{(N)}$ (and $p^{(C)}$), the Hardy–Weinberg equilibrium holds with $q = q^{(N)}$ (and $q^{(C)}$). In this setup, he encountered the following problems:
(a) Under $H_0$: $q^{(N)} = q^{(C)}$, there are only two unknown (linearly independent) parameters ($\pi$ drops out); whereas under the alternative, $q^{(N)} \neq q^{(C)}$, there are five linearly independent ones (including $\pi$). Thus, he confidently prescribed a chi-squared test with 5 − 2 (= 3) degrees of freedom. Do you support this prescription? If not, why?
(b) He observed that there are only three (independent) cell probabilities, but five unknown parameters. So he concluded that his hypothesis was not testable in a single sample model. This smart scientist, therefore, decided to choose two different islands (for which the $\pi$ values are quite different). Using the 4 × 2 contingency table he obtained, he wanted to estimate $\pi_1$, $\pi_2$, $q^{(N)}$, and $q^{(C)}$.
He had six linearly independent cell probabilities and six unknown parameters, so that he was satisfied with the model. Under $H_0$: $q^{(N)} = q^{(C)}$, he had two parameters, whereas he had six under the alternative. Hence, he concluded that the degrees of freedom of his goodness-of-fit test would be equal to four. Being so confident this time, he carried out a volume of simulation work to examine the adequacy of the chi-squared approximation. Alas, the fit was very poor! A smarter colleague suggested that the degrees of freedom for the chi-squared approximation should be equal to 2.3, and this showed some improvement. However, he was puzzled why the number of degrees of freedom was not an integer! Anyway, as there was no theoretical foundation, in frustration, he gave up! Can you eliminate the impasse?
(c) Verify that for this mixture model there is a basic identifiability issue: if $H_0$: $q^{(N)} = q^{(C)}$ holds, then regardless of whether the $\pi_j$ are on the boundary (i.e., {0}, {1}) or not, the number of unknown parameters is equal to 2, whereas this number jumps to 4 + k when $H_0$ does not hold. Thus, the parameter point belongs to a boundary of the parameter space under $H_0$. Examine the impact of this irregularity on the asymptotics underlying the usual goodness-of-fit tests.
(d) Consider the 4 × k contingency table (for k ≥ 2) and discuss how the nonparametric test overcomes this problem.
(e) Can you justify the Hardy–Weinberg equilibrium for the mixture model from the random mating point of view?
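As a numerical aid for parts (a) to (c), the following is a minimal sketch of the mixture cell probabilities. It assumes, since this section does not state it, that the four cells are the ABO blood groups (O, A, B, AB) with the standard Hardy–Weinberg probabilities; the function names `hardy_weinberg_abo` and `mixture_cells` and the numerical allele frequencies are hypothetical, chosen only for illustration. It shows that when the two allele-frequency vectors coincide, the cell probabilities no longer depend on the mixing proportion.

```python
def hardy_weinberg_abo(q_a, q_b):
    """Cell probabilities (O, A, B, AB) under Hardy-Weinberg equilibrium.

    Assumption: the four cells are the ABO blood groups, with allele
    frequencies q_a, q_b and q_o = 1 - q_a - q_b.
    """
    q_o = 1.0 - q_a - q_b
    if min(q_a, q_b, q_o) < 0:
        raise ValueError("allele frequencies must be nonnegative and sum to 1")
    return [
        q_o ** 2,                   # O
        q_a ** 2 + 2 * q_a * q_o,   # A
        q_b ** 2 + 2 * q_b * q_o,   # B
        2 * q_a * q_b,              # AB
    ]


def mixture_cells(pi, q_n, q_c):
    """Mixture pi * p(N) + (1 - pi) * p(C) of two Hardy-Weinberg populations."""
    p_n = hardy_weinberg_abo(*q_n)
    p_c = hardy_weinberg_abo(*q_c)
    return [pi * a + (1 - pi) * b for a, b in zip(p_n, p_c)]


mixing_values = (0.0, 0.3, 1.0)

# Under H0 (identical allele frequencies) the cells do not change with pi.
same = (0.25, 0.10)  # illustrative values only
under_h0 = [mixture_cells(pi, same, same) for pi in mixing_values]

# Under the alternative the cells do change with pi.
under_h1 = [mixture_cells(pi, (0.25, 0.10), (0.15, 0.20)) for pi in mixing_values]

for pi, row0, row1 in zip(mixing_values, under_h0, under_h1):
    print(pi, [round(x, 4) for x in row0], [round(x, 4) for x in row1])
```

Printing the rows side by side makes the point of part (c) visible: the first set of cell probabilities is constant across mixing proportions, so the mixing proportion cannot be estimated under the null hypothesis, while the second set varies with it.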