Quantitative Genomics and Genetics - Spring 2021
BTRY 4830/6830; PBSB 5201.01
Midterm - available on CMS by 11AM (ET), Tues., April 20
For midterm exam, due before 11:59PM (ET) Thurs., April 22

PLEASE NOTE THE FOLLOWING INSTRUCTIONS:

1. You are to complete this exam alone. The exam is open book, so you are allowed to use any books or information available online, your own notes and your previously constructed code, etc. HOWEVER YOU ARE NOT ALLOWED TO COMMUNICATE OR IN ANY WAY ASK ANYONE FOR ASSISTANCE WITH THIS EXAM IN ANY FORM, e.g., DO NOT POST PUBLIC MESSAGES ON PIAZZA! (The only exceptions are Scott, Beulah, and Dr. Mezey, e.g., you MAY send us a private message on PIAZZA.) As a non-exhaustive list, this includes asking classmates or ANYONE else for advice or where to look for answers concerning problems; you are also not allowed to ask anyone for access to their notes or to even look at their code, whether constructed before the exam or not, etc. You are therefore only allowed to look at your own materials and materials you can access on your own. In short, work on your own! Please note that you will be violating Cornell’s honor code if you act otherwise.

2. Please pay attention to instructions and complete ALL requirements for ALL questions, e.g., some questions ask for R code, plots, AND written answers. We will give partial credit, so it is to your advantage to attempt every part of every question.

3. A complete answer to this exam will include R code answers in Rmarkdown, where you will submit your .Rmd script and associated .pdf file. Note there will be penalties for scripts that fail to compile (!!). Also, as always, you do not need to repeat code for each part (i.e., if you write a single block of code that generates the answers for some or all of the parts, that is fine, but do please label your output that answers each question!!). You should include all of your plots and written answers in this same .Rmd script with your R code.

4. The exam must be uploaded on CMS before 11:59PM (ET) Thurs., April 22. It is your responsibility to make sure that it is uploaded by then, and no excuses will be accepted (power outages, computer problems, Cornell’s internet slowed to a crawl, etc.). Remember: you are welcome to upload early! We will deduct points for exams received after this deadline (even if it is by minutes!!).

Your collaborator is interested in mapping genetic loci that can affect height in humans. They know there are loci scattered throughout the genome that can affect height, but they do not know the locations of these loci, so they have performed a GWAS experiment and they would like you to perform the analysis. They have collected data for a number of individuals sampled from a population and they have provided you scaled height phenotypes and SNP genotypes in two files (“midterm phenotypes.txt” and “midterm genotypes.txt”).

Note that for each of the SNPs, there are two total alleles, i.e., two letters for each SNP, and there are three possible states per SNP genotype: two homozygotes and a heterozygote. In the “genotypes” file, each column represents a specific SNP (column 1 = genotype 1, column 2 = genotype 2) and each consecutive pair of rows represents all of the genotype states for an individual for the entire set of SNPs (rows 1 and 2 = all of individual 1’s genotypes, rows 3 and 4 = all of individual 2’s genotypes). Also note that the genotypes in the file are listed in order along the genome such that the first genotype is ‘genotype 1’ and the last is ‘genotype N’.

1. (a) Import the scaled height data from the file “midterm phenotypes.txt” and report the sample size n.
   (b) Produce a histogram of the height phenotype data (label your plot and your axes using informative names!).
   (c) Import the genotype data from the file “midterm genotypes.txt” and report the number of genotypes N.

2.
Using the phenotype and genotype data:
   (a) For EACH of the N genotypes, calculate the MLE(β̂) for the three β parameters when applying a genetic linear regression model (with NO covariates!!). NOTE (!!): in your linear regressions, DO use the Xa and Xd codings provided in class and DO calculate the MLE(β̂) using the formula provided in class (i.e., your R code must include the formula for the MLE).
   (b) Plot a histogram for the N estimates of each parameter (i.e., your answer will be three histograms, one each for the estimates of the β̂µ’s, β̂a’s, and β̂d’s).
   (c) Why does it make sense that most of the β̂a and β̂d values are relatively close to zero (use no more than two sentences in your answer)?

3. Using the phenotype and genotype data, for each genotype, calculate p-values for the null hypothesis H0 : βa = 0 ∩ βd = 0 versus the alternative hypothesis HA : βa ≠ 0 ∪ βd ≠ 0 when applying a genetic linear regression model (again NO covariates!). NOTE (!!): in your linear regressions, DO use the Xa and Xd codings provided in class and DO NOT use the function lm() (or any other R function!) to calculate your p-values. Instead, use the formula for MLE(β̂) provided in class (i.e., use your code and / or results from question [2]!), calculate the predicted value of the phenotype ŷi for each individual i under the null and alternative, use these to calculate SSM and SSE, and use the formulas for MSM and MSE to calculate the F-statistic. You may, however, use the function pf() to calculate the p-value for each F-statistic you calculate.

4. Produce a Manhattan plot from the output of question [3] (label your plot and your axes using informative names!).

5. (a) Plot a histogram of the p-values you calculated in question [3] (i.e., not the -log p-values (!!), just plot a histogram of the p-values).
   (b) What is a possible interpretation of why this histogram deviates from a uniform distribution (use no more than two sentences in your answer)?

6. (a) Assuming a Type I error for an individual test of α = 0.05, calculate and provide the Bonferroni corrected Type I error for the entire GWAS analysis of the N genotypes.
   (b) Define Type I error and explain why your Bonferroni correction results in a lower overall Type I error compared to a case where you just used α = 0.05 to assess significance (use no more than two sentences in your answer).
   (c) Define Type II error and explain why your Bonferroni correction increases the Type II error compared to a case where you just used α = 0.05 (use no more than two sentences in your answer).
   (d) Define power and explain why your Bonferroni correction decreases power compared to a case where you just used α = 0.05 (use no more than two sentences in your answer).

7. (a) Provide a list of ALL genotype markers that have p-values that are considered significant by your Bonferroni corrected Type I error calculated in question [6] (remember: genotypes in the genotype file are in order from 1 to N!).
   (b) For the TWO most significant genotypes you identified, explain whether you believe these two genotypes are indicating the location of the same causal genotype and why or why not (use no more than two sentences in your answer).

8. Your collaborator needs some help interpreting the results of your analysis. Answer the following questions:
   (a) What is the definition of a p-value?
   (b) What is the definition of a causal polymorphism?
   (c) Even if you have measured a causal polymorphism in your GWAS, why might it NOT be possible to precisely identify which polymorphism is causal in a GWAS analysis (use no more than two sentences in your answer)?
   (d) What is an example of an ideal experiment (which need not be realistic to perform!) that would unequivocally demonstrate that a specific genotype is causal for a given phenotype (use no more than two sentences in your answer)?

9. (Tues. Lecture Question!)
Say you conduct another GWAS analysis of 1000 genotypes and find that at a Type I error rate of α = 0.05 you reject the null hypothesis 200 times. What is the False Discovery Rate (FDR) at this level of alpha?

10. (Thurs. Lecture Question!) What are the two conditions necessary for population structure to produce false positives in a GWAS analysis if you DO NOT include a covariate in your analysis (i.e., you apply the genetic linear regression to your total GWAS data WITHOUT any covariates to account for population structure)?

2.32417602345252 3.16571886906737 3.93029477702436 3.5465590663788 3.8890294771325 1.87723000435451 2.08005229467428 4.89145304658123 3.36152006012391 3.30091337085817 3.69708728442064 3.09551047669506 1.44015294758889 5.21590965090655 5.25983988702639 5.26571191870434 4.77582168355771 5.20051767663579 3.53896305948812 1.15908071599504 0.778030584655837 4.87786802724111 2.65970233064571 2.15917093468108 2.25319350771583 5.52751971353319 4.15193802885577 4.79003149639197 2.72541667066293 3.54818990197393 4.83621676819155 2.96389826345733 4.50150423637605 4.53485023189382 4.45986365694048 2.80074079005167 4.70904250357327 5.70387598967624 3.59917498557167 3.65639995594409 2.57119255179982 1.77315479368177 6.37871091776418 3.0094896166274 -1.90054167144568 0.855454193488349 2.97360340913835 3.52569910538592 4.55209716916584 4.66892300916282 3.39240606057612 5.11551270657084 3.31416386332188 5.44384672041576 6.99840843815261 3.12539724342722 2.59106832417062 4.85011682358683 3.37329159763098 3.94268284696945 4.51192000908992 3.02768259365263 5.11982404878436 2.35354185175028 4.15321324597919 7.32602528020195 4.10724108931242 2.59418479308266 4.39064420041404 3.25460716397054 4.89346843384578 3.22142628218264 3.62892553450756 3.44960958459802 4.4795963172129 3.86335438411438 3.50284515631539 4.09115159156946 3.2677504489592 4.26255868373732 2.93959346639231 2.75023263454638 4.01851839071878 1.48038543888054 4.11164667543133 4.55855080583461
3.60976016087323 4.22895629443354 5.9464226765194 4.46314368771774 3.37250320998172 2.69512907859562 2.49246987387754 4.56693290894186 5.41760144515202 1.50661349893882 5.08619029270364 3.39475020921371 2.92638730759378 4.26760116520151 2.58548417364647 -2.03128046586753 4.58630075106231 0.854715658491941 1.9588608863815 5.27122796144293 4.40297681252258 2.18629958334278 3.91267593322619 2.25937666605364 3.26002840602506 3.97916937333978 4.07920073917464 3.39981314796932 3.22661000758045 3.47932737779514 2.76195062491776 1.91608387630018 5.26838915019405 3.56202667016005 4.09395275087476 2.92538971074653 4.50675400067706 4.66681433185221 4.35159999517916 2.7014678581361 2.57335484823837 3.61154502728618 3.03475266120512 3.67192401685263 3.15500568079539 4.73273796512604 4.52366576989182 2.20458805763282 3.7377758168711 3.3407927503023 2.71487886393774 4.64976347879087 2.98833774824202 4.07199087165769 2.04098494080408 2.63547046676609 3.54310403945768 2.25250879551067 3.8430950882262 1.79463267176481 3.82943562134699 3.77252751069194 5.50888002789218 3.8216439073872 4.8573232259249 5.7052484232979 2.41182444303436 4.81038795373848 3.31663795966978 2.11708055369233 3.33129898589178 3.71961611783433 4.71764469082391 2.5875244206083 4.28546018243277 3.2690308279886 5.57537377585463 5.67144064286367 2.93832051802037 4.50061537873886 3.63761768167456 3.11907027983125 2.75707239849034 1.88018201209333 1.536615601025 2.8994721881034 2.17229446260575 2.42297462643739 4.15936858542368 2.73476529782101 2.09613115766759 3.46075952417338 4.2568216996565 -2.34887685434562 3.27227142281311 1.68897862200749 5.77027724389786 2.77871872743994 5.02529687212975 3.43741705545435 5.1844884823084 2.0063849455323 5.59246437740859 5.14828717221796 -0.101697276747919 3.45139234214342 5.06015575608223
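
As a rough illustration of the arithmetic that questions 2 and 3 describe, the following is a minimal sketch for a single SNP. It is not a model answer, and it is written in Python rather than the R the exam requires. Several things here are assumptions, not taken from the exam: the codings Xa = -1, 0, 1 and Xd = -1, 1, -1; the least-squares form β̂ = (XᵀX)⁻¹Xᵀy for the MLE; the helper names `code_genotype`, `solve`, and `regression_f_test`; and the six-individual toy data set. Check all of them against the class notes.

```python
# Illustrative sketch only: toy data, assumed codings, Python standard library.

def code_genotype(pair, counted_allele="a"):
    """Return (Xa, Xd) for one individual's two alleles at one SNP.

    Assumed coding: Xa = -1, 0, 1 and Xd = -1, 1, -1 for zero, one, or two
    copies of `counted_allele`. Which homozygote gets Xa = -1 is arbitrary.
    """
    copies = sum(1 for allele in pair if allele == counted_allele)
    xa = copies - 1
    xd = 1 if copies == 1 else -1
    return xa, xd


def solve(a, b):
    """Solve the square linear system a x = b by Gauss-Jordan elimination."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]   # augmented matrix
    for col in range(n):
        # Partial pivoting: bring the largest remaining entry onto the diagonal.
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [v - factor * p for v, p in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def regression_f_test(y, genotypes):
    """Return (beta_hat, F) for one SNP with design columns 1, Xa, Xd."""
    n = len(y)
    x = [[1.0, *code_genotype(g)] for g in genotypes]
    xtx = [[sum(row[i] * row[j] for row in x) for j in range(3)] for i in range(3)]
    xty = [sum(row[i] * yi for row, yi in zip(x, y)) for i in range(3)]
    beta_hat = solve(xtx, xty)                       # (X'X)^-1 X'y

    y_hat_alt = [sum(b * v for b, v in zip(beta_hat, row)) for row in x]
    y_hat_null = sum(y) / n                          # intercept-only fit
    ssm = sum((fit - y_hat_null) ** 2 for fit in y_hat_alt)
    sse = sum((yi - fit) ** 2 for yi, fit in zip(y, y_hat_alt))
    msm = ssm / 2            # df(M) = 3 parameters - 1
    mse = sse / (n - 3)      # df(E) = n - 3
    return beta_hat, msm / mse


# Hypothetical toy data: six individuals at one SNP, not taken from the exam files.
toy_genotypes = [("A", "A"), ("A", "A"), ("A", "a"), ("A", "a"), ("a", "a"), ("a", "a")]
toy_phenotypes = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]

beta_hat, f_stat = regression_f_test(toy_phenotypes, toy_genotypes)
print("beta_hat (mu, a, d):", [round(b, 6) for b in beta_hat])
print("F statistic:", round(f_stat, 6))
```

For this toy input the genotype-class means are 1.5, 3.5, and 5.5, so β̂a should come out near 2, β̂d near 0, and F near 16. The p-value step is left out on purpose. In R it would be `pf(F, 2, n - 3, lower.tail = FALSE)`. Note that `solve` divides by zero when XᵀX is singular, for example when a SNP has fewer than three observed genotype classes, so real data would need a check for that case.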