
Question

Machine learning assignment

Rubric:

• All data and any other additional files are available at the FUML VLE site in the Assessment section. Your submission should be a single zip file named after your exam number, Yxxxxxxx.zip, which contains a single Jupyter notebook fuml.ipynb (combining code and explanations), and a PDF fuml.pdf of the state of the same notebook after all of its code has been executed (in the same way in which PDFs of Jupyter notebooks were provided with most lectures). In case of any discrepancies between the contents and output of the Jupyter notebook and the PDF, the former will be used for marking. Make sure you do not use archive formats other than zip. Your code should assume all data files are in the same folder as the notebook from which they are accessed.

• All Python code should be in Python3. Your Jupyter notebook must run correctly when run with the Anaconda3 package under Windows via the Virtual Desktop Service (using the CS student desktop as practised in the practicals).

• Unless otherwise indicated there are no word limits on your answers.

Your exam number should be at the top of your Jupyter notebook and the corresponding PDF. You should not be otherwise identified anywhere on your submission.

1 Linear regression and regularisation (50 marks)

Download the file data.csv from the Assessment section of the FUML VLE site. The first line of the file is a header line stating the names of the variables. The remaining lines represent all data points (one data point per line). All values are separated by commas.

Your task here is to use scikit-learn to create a linear regression model for predicting the values of the variable Y listed in the last column of the data file, based on the values of all other variables.
You need to choose which type of regression to do, whether to use regularisation and if so, what kind, and how to validate your results. Alternate portions of your code with appropriate text explaining what each part of the code aims to achieve, and comment on the results, and how they inform your decisions.

I will be testing your regression model on an unseen test set and computing the R² value of the predictions. Your goal is for this R² value to be as close to 1 as possible. You do not have access to the test set, but you need to include a code cell at the end of your Jupyter notebook that attempts to test your model on a dataset stored in a file unseendata.csv and compute R² for it. The file will have exactly the same format as file data.csv, so you can use a renamed copy of data.csv to test that part of your code, and to produce the corresponding content for your PDF file (in order to demonstrate that this part of the code is in working order).

Mark distribution for this question:

20 marks  Correctness of your Python code
15 marks  Range of alternatives tested and appropriateness of the chosen regression model
15 marks  Justification given for your chosen regression model

2 Descriptive statistics, data visualisation and PCA (50 marks)

2.1 Preliminaries

The population of Square Island (Fig. 1), a territory perfectly aligned with the four cardinal directions (North, East, South, West), was established in three migration waves, which are reflected in the genetic makeup of the inhabitants.

The earliest wave preceded the other two by a long margin. It consisted of hunter-gatherers whose genetic makeup had been distributed uniformly across the whole island before the next two waves arrived.
The second migration wave entered the island through an isthmus, that is, a narrow strip of land, which temporarily connected the South-Western corner of the island with the nearest continent during a mini ice age period when the sea levels dropped. The new arrivals were farmers who started to spread slowly, breaking new ground for farming and advancing by about one mile with each generation. The third and last migration wave brought a population of seafarers to the island’s shores.

Figure 1: Map of Square Island (North is up)

We have data on the relative frequency of 7 genes (proportion of population with a given gene expressed as a number between 0 and 1) as measured at various locations on the island. These locations are spaced at equal intervals along the X coordinate (West to East) and Y coordinate (South to North) over the entire area. Genes 1 and 2 are mutually exclusive alternatives (alleles) that can appear at one specific position in the genome known as Locus 1. The same is valid for genes 3 and 4, which are the only 2 alternatives for Locus 2, and genes 5, 6 and 7, which are the three alleles that can appear in Locus 3. This means that in any given location (x,y) the relative frequencies of Gene 1 and Gene 2 add up to 1, and so do the relative frequencies of Genes 3 and 4, and Genes 5, 6 and 7. The data is available as a CSV table (see file sqisland.csv) with a header row and 9 columns, representing the attributes listed in Table 1.

So, we may know, for instance, that in a given location 70% of the population have Gene 1, and the remaining 30% have Gene 2. Similarly, the proportion of the population in that location with Gene 3 may be, say, 40%, which leaves the remaining 60% carrying Gene 4. Finally, the individuals in that location that carry one of the genes 5–7 may be split as 25% : 40% : 35%.
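The per-locus constraint described above (allele frequencies at each locus adding up to 1 in every location) can be checked before any further analysis. The following is only a minimal sketch in plain Python: the column names Gene1 to Gene7 are hypothetical, since the real header of sqisland.csv is not shown here, and the two data rows are made up for illustration (the first one reuses the 70/30, 40/60 and 25:40:35 example from the text).

```python
import csv
import io
import math

# Hypothetical column names; the real header of sqisland.csv may differ.
LOCI = {
    "Locus 1": ["Gene1", "Gene2"],
    "Locus 2": ["Gene3", "Gene4"],
    "Locus 3": ["Gene5", "Gene6", "Gene7"],
}

# Made-up sample rows standing in for the real file (illustration only).
SAMPLE_CSV = """X,Y,Gene1,Gene2,Gene3,Gene4,Gene5,Gene6,Gene7
0,0,0.70,0.30,0.40,0.60,0.25,0.40,0.35
1,0,0.65,0.35,0.45,0.55,0.30,0.30,0.40
"""


def locus_violations(rows, loci=LOCI, tol=1e-6):
    """Return (row_index, locus_name, total) for each locus whose
    allele frequencies do not sum to 1 within the tolerance."""
    bad = []
    for i, row in enumerate(rows):
        for name, genes in loci.items():
            total = sum(float(row[g]) for g in genes)
            if not math.isclose(total, 1.0, abs_tol=tol):
                bad.append((i, name, total))
    return bad


rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))
violations = locus_violations(rows)
print(violations)  # an empty list means every locus sums to 1 in every row
```

To run the same check on the real data, the rows could instead be read with csv.DictReader from an opened sqisland.csv, after adjusting the names in LOCI to match the file's actual header.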
Remember, each of the 3 migration waves brought a new population with its own specific genetic makeup (relative gene frequencies) that mixed with the existing population over time.

Column  Name          Range
1       X coordinate  x ∈ N, 0 ≤ x ≤ 9, grows from left to right on the map
2       Y coordinate  y ∈ N, 0 ≤ y ≤ 9, grows from bottom to top on the map
3       Gene 1        v ∈ R, 0 ≤ v ≤ 1
4       Gene 2        v ∈ R, 0 ≤ v ≤ 1
5       Gene 3        v ∈ R, 0 ≤ v ≤ 1
6       Gene 4        v ∈ R, 0 ≤ v ≤ 1
7       Gene 5        v ∈ R, 0 ≤ v ≤ 1
8       Gene 6        v ∈ R, 0 ≤ v ≤ 1
9       Gene 7        v ∈ R, 0 ≤ v ≤ 1

Table 1: Data file attributes

2.2 To Do

7 marks  For each of the 7 genes, produce a contour plot visualising how its relative frequency varies across the whole island. (Consider using matplotlib.pyplot.contourf.)

6 marks  Study the contour plots to form a hypothesis about the most common alleles for Locus 1 and Locus 2 in: (a) the hunter-gatherers’ population; (b) the farmers’ population.

4 marks  Describe any significant characteristics of the genetic makeup of the population of seafarers.

5 marks  Calculate and display the variance of each of the 7 gene attributes.

8 marks  Calculate the Pearson correlation between (a) Gene 1 and Gene 4; (b) Gene 1 and Gene 5. State if the null hypothesis of non-correlation can be rejected for either pair at the 5% significance level (that is, with 95% confidence). Do these results agree with your hypothesis about the genetic makeup of the farmers from the second wave?

8 marks  Apply principal component analysis (PCA) to the data consisting of the relative frequencies of Genes 1–7. Transform the data using all 7 principal components and calculate and display the variance for each of them.

4 marks  Compare the sums of variances of all 7 attributes before and after transforming the data via PCA. Comment briefly whether the result can be expected or not and why.

8 marks  Plot the first two PCA components as contour plots visualising the relative frequencies of each component across the island.
Compare the result to the contour plots for Gene 1 and Gene 3 (data before PCA). Which of the two pairs of plots do you find more helpful for the task of reconstructing the waves of migration? Do you expect the same result for a realistic data set with hundreds of genes and why?

End of examination paper
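For the PCA tasks in Section 2.2, the mechanics of transforming the data with all 7 components and comparing total variance before and after can be sketched as follows. This is an illustration under stated assumptions, not a model answer: the data are random numbers standing in for the seven gene columns (not the real sqisland.csv values), and PCA is done with a plain NumPy eigendecomposition of the covariance matrix rather than a library class such as sklearn.decomposition.PCA, which could be used instead.

```python
import numpy as np

# Synthetic stand-in for the 7 gene-frequency columns (NOT the real data).
rng = np.random.default_rng(0)
X = rng.random((100, 7))

# PCA via eigendecomposition of the sample covariance matrix.
Xc = X - X.mean(axis=0)                 # centre each attribute
cov = np.cov(Xc, rowvar=False)          # 7x7 covariance matrix (n-1 denominator)
eigvals, eigvecs = np.linalg.eigh(cov)  # eigh returns eigenvalues in ascending order
order = np.argsort(eigvals)[::-1]       # reorder components by decreasing variance
eigvecs = eigvecs[:, order]

Z = Xc @ eigvecs                        # data expressed in all 7 principal components

var_before = X.var(axis=0, ddof=1)      # variance of each original attribute
var_after = Z.var(axis=0, ddof=1)       # variance of each principal component

print(var_before.sum(), var_after.sum())
```

Whichever implementation is used, the variances being compared should use the same denominator (n or n-1) on both sides, otherwise the two sums will differ by a constant factor.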
