All instructions have been given in the files
FIT1045 Algorithms and programming in Python, S2-2020 Programming Assignment Assessment value: 22% (10% for Part 1 + 12% for Part 2) Due: Week 6 (Part 1), Week 11 (Part 2) Objectives The objectives of this assignment are: • To demonstrate the ability to implement algorithms using basic data structures and operations on them. • To gain experience in designing an algorithm for a given problem description and implementing that algorithm in Python. • To explain the computational problems and the challenges they pose as well as your chosen strategies and programming techniques to address them. Submission Procedure You are going to create a single Python module file regression.py based on the provided template. Submit a first version of this file at the due date of Part 1, Friday midnight of Week 6. Submit a second version of this file at the due date of Part 2, Friday midnight of Week 11. For each of the two versions: 1. Save the current version of your regression.py into a zip file called yourStudentID yourFirstName yourLastName.zip. 2. Submit this zip file to Moodle. 3. Your assignment will not be accepted unless it is a readable zip file containing a file called products.py. Important Note: Please ensure that you have read and understood the university’s policies on plagiarism and collusion available at http://www.monash.edu.au/students/policies/academic-integrity.html. You will be required to agree to these policies when you submit your assignment. A common mistake students make is to use Google to find solutions to the questions. Once you have seen a solution it is often difficult to come up with your own version. The best way to avoid making this mistake is to avoid using Google. You have been given all the tools you need in workshops. If you find you are stuck, feel free to ask for assistance on Moodle, ensuring that you do not post your code. Marks and Module Documentation This assignment will be marked both by the correctness of your module and your analysis of it, which has to be provided as part of your function documentations. Each of the assessed function of your module must contain a docstring that contains in this order: 1. A one line short summary of what the function is computing 2. A formal input/output specification of the function 3. Some usage examples of the function (in doctest style); feel free to augment and alter the examples for the assessed functions that are already provided in the template if you feel they are incomplete or otherwise insufficient (provide some explanations in this case) 1 http://www.monash.edu.au/students/policies/academic-integrity.html 4. A paragraph explaining the challenges of the problem that the function solves and your overall approach to address these challenges 5. A paragraph explaining the specific choices of programming techniques that you have used 6. [only for Part 2] A paragraph analysing the computational complexity of the function; this does not imply that a specific computational complexity such as O(n log n) is required just that the given complexity is understood and analysed For example, if scaled(row, alpha) from Lecture 6 was an assessed function (in Part 2 and you are a FIT1045 student) its documentation in the submitted module could look as follows: def scaled(row, alpha): """ Provides a scaled version of a list with numeric elements. Input : list with numeric entries (row), scaling factor (alpha) Output: new list (res) of same length with res[i]==row[i]*alpha For example: >>> scaled([1, 4, -1], 2.5) [2.5, 10.0, -2.5] >>> scaled([], -23) [] This is a list processing problem where an arithmetic operation has to be carried out for each element of an input list of unknown size. This means that a loop is required to iterate over all elements and compute their scaled version. Importantly, according the specification, the function has to provide a new list instead of mutating the input list. Thus, the scaled elements have to be accumulated in a new list. In my implementation, I chose to iterate over the input list with a for-loop and to accumulate the computed scaled entries with an augmented assignment statement (+=) in order to avoid the re-creation of each intermediate list. This solution allows a concise and problem-related code with no need to explicitly handle list indices. The computational complexity of the function is O(n) for an input list of length n. This is because the for loop is executed n times, and in each iteration there is constant time operation of computing the scaled element and a constant time operation of appending it to the result list (due to the usage of the augmented assignment statement instead of a regular assignment statement with list concatenation). """ res = [] for x in row: res += [alpha*x] return res Beyond the assessed functions, you are highly encouraged to add additional functions to the module that you use as part of your implementation of the assessed functions. In fact, such a decomposition is essential for a readable implementation of the assessed function, which in turn forms the basis of a coherent and readable analysis in the documentation. All these additional functions also must contain a minimal docstring with the items 1-3 from the list above. Additionally, you can also use their docstrings to provide additional information, which you can refer to from the analysis paragraphs of the assessed functions. This assignment has a total of 22 marks and contributes to 22% of your final mark. For each day an assignment is late, the maximum achievable mark is reduced by 10% of the total. For example, if the assignment is late by 3 days (including weekends), the highest achievable mark is 70% of 15, which is 10.5. Assignments submitted 7 days after the due date will normally not be accepted. Mark breakdowns for each of the assessed functions can be found in parentheses after each task section below. Marks are subtracted for inappropriate or incomplete discussion of the function in their docstring. Again, readable code with appropriate decomposition and variable names is the basis of a suitable analysis in the documentation. 2 Overview In this assignment we create a Python module to perform some basic data science tasks. While the instructions contain some mathematics, the main focus is on implementing the corresponding algorithms and finding a good decomposition into subproblems and functions that solve these subproblems. In particular, we want to implement a method that is able to estimate relationships in observational data and make predictions based on these relationships. For example, consider a dataset that measures the average life expectancy for various countries together with other factors that potentially have an influence on the life expectancy, such as the average body mass index (BMI), the level of schooling in that country, or the gross domestic product (GDP) (see Table 1 for an example). Can we predict the life expectancy of Rwanda? And can we predict how the life expectancy of Liberia will change if it improved its schooling to the level of Austria? Country Adult Mortality Infant Deaths BMI GDP Schooling Life expectancy Austria 10.77 4.32 25.4 4.55e+11 15.7 81.1 Georgia 18.23 15.17 27.1 1.76e+10 13.5 74.5 Liberia 29.95 74.52 24.0 3.264e+09 9.8 61.1 Mexico 30.73 17.29 28.1 1.22e+12 12.9 76.6 Rwanda 35.31 64.04 22.1 9.509e+09 10.8 ? Table 1: Example dataset on Life Expectancy. To answer these questions, we have to find the relationships between the target variable—in case of this example the life expectancy—and the explanatory variables—in this example ”Adult Mortality”, ”Infant Deaths”, etc. (see Figure 1). For this we want to develop a method that finds linear relationships between the explanatory variables and the target variable and use it to understand those relationship and make predictions on the target variable. This method is called a linear regression. The main idea of linear regression is to use data to infer a prediction function that ’explains’ a target variable through linear effects of one or more explanatory variables. Finding the relationship between one particular explanatory variable and the target variable (e.g., the relationship between schooling and life expectancy) is called a univariate regression. Finding the joint relationship of all explanatory variables with the target variable is called a multivariate regression. (a) Adult mortality vs. life expectancy. (b) Schooling vs. life expectancy. Figure 1: Visualization of the relationship between explanatory variables and the target variable. In this assignment, you will develop functions that perform univariate regression (Part I) and multivariate regression (Part 2) and use them to answer questions on a dataset on life expectancy, similar to the example above. To help you to visually check and understand your implementation, a module for plotting data and linear prediction functions is provided. 3 Part 1: Univariate Regression (10%, due in Week 6) The first task is to model a linear relationship between one explanatory variable and the target variable. The data in this assignment is always represented as a table containing m rows and n columns and a list of length m containing the corresponding target variables. The table is represented as a list with m elements, each being a list of n values, one for each explanatory variable. An example dataset with one explanatory variable x and a target variable y would be >>> x = [1,2,3] >>> y = [1,4,6] and an example dataset with two explanatory variables x(1), x(2) would be >>> data = [[1,1],[2,1],[1,3]] >>> y = [3,4,7] Task A: Optimal Slope (2 Marks) Let us now start by modelling the linear relationship between one explanatory variable x and the target variable y based on the data of m observations (x1, y1), . . . , (xm, ym). For that, let’s start out simple by modelling the relation of x and y as y = ax , (1) i.e., a straight line through the origin. For this model we want to find an optimal slope parameter a that fits the given data as good as possible. A central concept to solve this problem is the residual vector defined as r = y1 − ax1 y2 − ax2 . . . ym − axm , i.e., the m-component vector that contains for each data point the difference of the target value and the corresponding predicted value. Intuitively, for a good fit, all components of the residual vector should have a small magnitude (being all 0 for a perfect fit). A popular approach is to determine the optimal parameter value a as the one that minimises the sum of squared components of the residual vector, i.e., the quantity r · r = r21 + r22 + · · · + r2m , which we also refer to as the sum of squared residuals. With some math (that is outside the scope of this unit) we can show that for the slope parameter a that