Machine learning assignment
Rubric:

• All data and any other additional files are available at the FUML VLE site in the Assessment section. Your submission should be a single zip file named after your exam number, Yxxxxxxx.zip, which contains a single Jupyter notebook fuml.ipynb (combining code and explanations), and a PDF fuml.pdf of the state of the same notebook after all of its code has been executed (in the same way in which PDFs of Jupyter notebooks were provided with most lectures). In case of any discrepancies between the contents and output of the Jupyter notebook and the PDF, the former will be used for marking. Make sure you do not use archive formats other than zip. Your code should assume all data files are in the same folder as the notebook from which they are accessed.

• All Python code should be in Python 3. Your Jupyter notebook must run correctly when run with the Anaconda3 package under Windows via the Virtual Desktop Service (using the CS student desktop as practised in the practicals).

• Unless otherwise indicated there are no word limits on your answers. Your exam number should be at the top of your Jupyter notebook and the corresponding PDF. You should not be otherwise identified anywhere on your submission.

1 Linear regression and regularisation (50 marks)

Download the file data.csv from the Assessment section of the FUML VLE site. The first line of the file is a header line stating the names of the variables. The remaining lines represent all data points (one data point per line). All values are separated by commas.

Your task here is to use scikit-learn to create a linear regression model for predicting the values of the variable Y listed in the last column of the data file, based on the values of all other variables. You need to choose which type of regression to do, whether to use regularisation and if so, what kind, and how to validate your results.
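One possible way to approach the choice described above is to compare a few candidate models by their cross-validated R² score. The sketch below is illustrative only: it uses synthetic data in place of data.csv, and the candidate list, the alpha values, the feature scaling step and the 5-fold split are all assumptions rather than requirements of the assignment.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for data.csv (assumption: the real file would be read
# with pandas.read_csv, taking Y from the last column).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
true_coef = np.array([3.0, -2.0, 0.0, 0.0, 1.5])
y = X @ true_coef + rng.normal(scale=0.5, size=200)

# Candidate models; the alpha values are arbitrary illustrative choices.
candidates = {
    "ols": make_pipeline(StandardScaler(), LinearRegression()),
    "ridge": make_pipeline(StandardScaler(), Ridge(alpha=1.0)),
    "lasso": make_pipeline(StandardScaler(), Lasso(alpha=0.1)),
}

# Shuffled 5-fold cross-validation, scored with R^2 on each held-out fold.
cv = KFold(n_splits=5, shuffle=True, random_state=0)
cv_scores = {
    name: cross_val_score(model, X, y, cv=cv, scoring="r2").mean()
    for name, model in candidates.items()
}

for name, score in cv_scores.items():
    print(f"{name}: mean cross-validated R^2 = {score:.3f}")

best_name = max(cv_scores, key=cv_scores.get)
print("highest mean score:", best_name)
```

Wrapping the scaler and the regressor in one pipeline means the scaling is refitted on each training fold, so no information from the held-out fold leaks into the model being scored.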
Alternate portions of your code with appropriate text explaining what each part of the code aims to achieve, and comment on the results and how they inform your decisions.

I will be testing your regression model on an unseen test set and computing the R² value of the predictions. Your goal is for this R² value to be as close to 1 as possible. You do not have access to the test set, but you need to include a code cell at the end of your Jupyter notebook that attempts to test your model on a dataset stored in a file unseendata.csv and compute R² for it. The file will have exactly the same format as file data.csv, so you can use a renamed copy of data.csv to test that part of your code, and to produce the corresponding content for your PDF file (in order to demonstrate that this part of the code is in working order).

Mark distribution for this question:

20 marks  Correctness of your Python code
15 marks  Range of alternatives tested and appropriateness of the chosen regression model
15 marks  Justification given for your chosen regression model

2 Descriptive statistics, data visualisation and PCA (50 marks)

2.1 Preliminaries

[Figure 1: Map of Square Island (North is up)]

The population of Square Island (Fig. 1), a territory perfectly aligned with the four cardinal directions (North, East, South, West), was established in three migration waves, which are reflected in the genetic makeup of the inhabitants. The earliest wave preceded the other two by a long margin. It consisted of hunter-gatherers whose genetic makeup had been distributed uniformly across the whole island before the next two waves arrived. The second migration wave entered the island through an isthmus, that is, a narrow strip of land, which temporarily connected the South-Western corner of the island with the nearest continent during a mini ice age period when the sea levels dropped.
The new arrivals were farmers who started to spread slowly, breaking new ground for farming and advancing by about one mile with each generation. The third and last migration wave brought a population of seafarers to the island’s shores.

We have data on the relative frequency of 7 genes (proportion of population with a given gene expressed as a number between 0 and 1) as measured at various locations on the island. These locations are spaced at equal intervals along the X coordinate (West to East) and Y coordinate (South to North) over the entire area.

Genes 1 and 2 are mutually exclusive alternatives (alleles) that can appear at one specific position in the genome known as Locus 1. The same holds for genes 3 and 4, which are the only 2 alternatives for Locus 2, and genes 5, 6 and 7, which are the three alleles that can appear in Locus 3. This means that in any given location (x,y) the relative frequencies of Gene 1 and Gene 2 add up to 1, and so do the relative frequencies of Genes 3 and 4, and Genes 5, 6 and 7.

The data is available as a CSV table (see file sqisland.csv) with a header row and 9 columns, representing the attributes listed in Table 1. So, we may know, for instance, that in a given location 70% of the population have Gene 1, and the remaining 30% have Gene 2. Similarly, the proportion of the population in that location with Gene 3 may be, say, 40%, which leaves the remaining 60% carrying Gene 4. Finally, the individuals in that location that carry one of the genes 5–7 may be split as 25% : 40% : 35%.

Remember, each of the 3 migration waves brought a new population with its own specific genetic makeup (relative gene frequencies) that mixed with the existing population over time.
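The per-locus constraint described above (allele frequencies at the same locus add up to 1 at every location) can be checked mechanically once the table is loaded. The following is a minimal sketch under stated assumptions: it builds a synthetic table instead of reading sqisland.csv, the column names are hypothetical, and the 10 by 10 grid follows the coordinate ranges given in Table 1.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for sqisland.csv. The column names are hypothetical;
# the real header row may differ.
rng = np.random.default_rng(1)
xs, ys = np.meshgrid(np.arange(10), np.arange(10))
n = xs.size

g1 = rng.uniform(size=n)
g3 = rng.uniform(size=n)
locus3 = rng.dirichlet(np.ones(3), size=n)  # three frequencies summing to 1

df = pd.DataFrame({
    "x": xs.ravel(), "y": ys.ravel(),
    "gene1": g1, "gene2": 1.0 - g1,
    "gene3": g3, "gene4": 1.0 - g3,
    "gene5": locus3[:, 0], "gene6": locus3[:, 1], "gene7": locus3[:, 2],
})

# Group the gene columns by locus and check that each group sums to 1
# at every location (up to floating-point tolerance).
loci = {
    "locus1": ["gene1", "gene2"],
    "locus2": ["gene3", "gene4"],
    "locus3": ["gene5", "gene6", "gene7"],
}
locus_ok = {
    name: bool(np.allclose(df[cols].sum(axis=1), 1.0))
    for name, cols in loci.items()
}
print(locus_ok)

# Reshape one gene into a 2-D grid (rows = y, columns = x), the layout
# that grid-based plotting functions expect.
grid = df.pivot(index="y", columns="x", values="gene1")
print(grid.shape)
```

A check like this is a cheap way to catch a misread column order before any statistics are computed on the data.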
Column  Name          Range
1       X coordinate  x ∈ N, 0 ≤ x ≤ 9, grows from left to right on the map
2       Y coordinate  y ∈ N, 0 ≤ y ≤ 9, grows from bottom to top on the map
3       Gene 1        v ∈ R, 0 ≤ v ≤ 1
4       Gene 2        v ∈ R, 0 ≤ v ≤ 1
5       Gene 3        v ∈ R, 0 ≤ v ≤ 1
6       Gene 4        v ∈ R, 0 ≤ v ≤ 1
7       Gene 5        v ∈ R, 0 ≤ v ≤ 1
8       Gene 6        v ∈ R, 0 ≤ v ≤ 1
9       Gene 7        v ∈ R, 0 ≤ v ≤ 1

Table 1: Data file attributes

2.2 To Do

7 marks  For each of the 7 genes, produce a contour plot visualising how its relative frequency varies across the whole island. (Consider using matplotlib.pyplot.contourf.)

6 marks  Study the contour plots to form a hypothesis about the most common alleles for Locus 1 and Locus 2 in: (a) the hunter-gatherers’ population; (b) the farmers’ population.

4 marks  Describe any significant characteristics of the genetic makeup of the population of seafarers.

5 marks  Calculate and display the variance of each of the 7 gene attributes.

8 marks  Calculate the Pearson correlation between (a) Gene 1 and Gene 4; (b) Gene 1 and Gene 5. State if the null hypothesis of non-correlation can be rejected for either pair at the 5% significance level (95% confidence). Do these results agree with your hypothesis about the genetic makeup of the farmers from the second wave?

8 marks  Apply principal component analysis (PCA) to the data consisting of the relative frequencies of Genes 1–7. Transform the data using all 7 principal components and calculate and display the variance for each of them.

4 marks  Compare the sums of variances of all 7 attributes before and after transforming the data via PCA. Comment briefly whether the result can be expected or not and why.

8 marks  Plot the first two PCA components as contour plots visualising how the value of each component varies across the island. Compare the result to the contour plots for Gene 1 and Gene 3 (data before PCA). Which of the two pairs of plots do you find more helpful for the task of reconstructing the waves of migration?
Do you expect the same result for a realistic data set with hundreds of genes, and why?

End of examination paper
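For the correlation and PCA tasks in Section 2.2, the relevant library calls can be sketched as follows. This is a hedged illustration rather than a model answer: the data is synthetic, the gene pair passed to pearsonr is chosen arbitrarily and is not one of the pairs asked for, and the use of the sample variance (ddof=1) is an assumption.

```python
import numpy as np
from scipy.stats import pearsonr
from sklearn.decomposition import PCA

# Synthetic stand-in for the seven gene-frequency columns (assumption: the
# real values would come from sqisland.csv).
rng = np.random.default_rng(2)
n = 100
g1 = rng.uniform(size=n)
g3 = rng.uniform(size=n)
locus3 = rng.dirichlet(np.ones(3), size=n)
genes = np.column_stack([g1, 1 - g1, g3, 1 - g3, locus3])

# Pearson correlation coefficient together with its two-sided p-value.
r, p_value = pearsonr(genes[:, 0], genes[:, 1])
print(f"r = {r:.3f}, p = {p_value:.3g}")

# Fit PCA with all 7 components and project the data onto them.
pca = PCA(n_components=7)
scores = pca.fit_transform(genes)

# Per-column sample variance before and after the transformation.
var_before = genes.var(axis=0, ddof=1)
var_after = scores.var(axis=0, ddof=1)
print("variance per component:", np.round(var_after, 4))
print("sum of variances before:", var_before.sum())
print("sum of variances after: ", var_after.sum())
```

The p-value returned by pearsonr is what would be compared against the chosen significance threshold when deciding whether to reject the null hypothesis of non-correlation.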