The subject is Computational Intelligence for Data Analytics. Parts 3 and 4 need to be done.
CSE3CI – Computational Intelligence for Data Analytics
Assignment, 2021

Due Date: Monday 17th May, 9:00am, 2021
Assessment Weight: 30% of the final mark for the subject

Instructions
• This is a GROUP assignment. You are permitted to work in groups of up to three. All group members will receive the same mark. You may complete the assignment as an individual, but if you do so, you will be marked in the same way as for a group.

Plagiarism
Plagiarism is the submission of somebody else’s work in a manner that gives the impression that the work is your own. When submitting your assignment via the LMS, the following announcement will appear: Software will be used to assist in the detection of plagiarism. Students are referred to the section on ‘Academic Misconduct’ in the subject’s guideline available on LMS.

Lateness Policy
Penalties are applied to late assignments (5% of total possible marks for the task is deducted per day, accepted up to 5 days after the due date only). An assignment submitted more than five working days after the due date will not be accepted.

Submission Procedure
You are required to submit the following:
• A pdf format document containing your report.
• A zip file containing all of the Python code that you used for the assignment.
These documents are to be submitted electronically via the Learning Management System. In the case of group submissions, only one member of the group should submit, and the cover page of the report must contain the full name and Student ID of all group members. In the case of solo submissions, ensure your name and Student ID are on the cover page.

You will also be required to do a short oral presentation of your report (10 minutes max) during the scheduled lab class in Week 11. Depending on how many submissions are received, it may be necessary to schedule a second lab class during that week or Week 12.
Problem Description – Forecasting Electricity Prices
The problem is to forecast electricity price based on historical data. Let the temperature and total demand of electricity at time instant t be T(t) and D(t) respectively. The goal is to predict the recommended retail price (RRP) by using some historical data as system inputs. The historical data set consists of the following variables: T(t-2), T(t-1), T(t), D(t-2), D(t-1), D(t). The output should be a prediction of the RRP of electricity at the next time instant t+1, denoted by P(t+1).

You have been provided with real-world electricity pricing data from Queensland, Australia. There are two datasets: a training set, to be used for model development; and a test set, to be used to evaluate the performance of your models. Each dataset has the same structure. Rows correspond to successive time instants, and contain seven values: the predictor variables T(t-2), T(t-1), T(t), D(t-2), D(t-1), D(t), and the target variable P(t+1). The objective is to predict the value of P(t+1) on the basis of one or more of the six predictor variables.

There are five parts to the assignment, described below, with the approximate assessment weighting. Parts 1, 2 and 3 are based on content that has been covered up to the end of Week 5. Content for Part 4 will be covered in Weeks 6 and 7.

Part 1 – Data Preparation (approx. 5%)
The performance of many systems can be improved through careful preparation of the data. Visualising the electricity prices will reveal that there are potential outliers¹ in the dataset; i.e., observations that lie an abnormal distance from other values in a random sample from a population.
Tasks:
• Use an appropriate technique to identify and remove outliers of the output variable from the datasets (for both training and test sets).
• Provide a plot showing the price data before and after the removal of outliers.

¹ You can read more about outliers here: http://www.itl.nist.gov/div898/handbook/prc/section1/prc16.htm, http://mathworld.wolfram.com/Outlier.html

Part 2 – Linear Regression Models (approx. 8%)
Linear regression is often a good baseline against which to compare the performance of other models.
Tasks:
• Apply linear regression to the prediction of electricity prices.
• For both the training and test sets, provide the Average Relative Error.
• For both training and test sets, produce a plot showing, for each data point, how the predicted price compares with the actual price.

Part 3 – Multilayer Perceptron Models (approx. 27%)
Multilayer perceptrons can sometimes yield better performance than linear models.
Tasks:
• Experiment with the application of MLPs to predicting electricity prices. You should try varying MLPRegressor parameters such as the regularization coefficient, the number of training epochs, and the number of hidden units. Make sure that you record the training error and test error in each case. It is suggested that you use logistic units in the hidden layer, but you can use others if you wish.
• Provide results for three different MLPRegressor parameter settings.
− one of these should be the result for the best performing MLP that you were able to train;
− one should clearly demonstrate underfitting;
− one should clearly demonstrate overfitting.
For each of these cases, provide the learning parameters that you have used, as well as the training error and the test error.
• For the best-performing MLP, for both training data and test data, produce a plot showing, for each data point, how the predicted price compares with the actual price.

Part 4 – Fuzzy Forecasting System (approx. 40%)
For this part, you will develop a fuzzy forecasting system for predicting the electricity price.
(You will learn about fuzzy inferencing systems in Weeks 6 and 7.)
Tasks:
• Select appropriate values or fuzzy subsets for the linguistic variables that you will use in your fuzzy rules.
• Apply statistical analysis (correlation coefficients) and heuristics to develop a set of fuzzy rules;
• Implement your fuzzy system in Python, and produce clear plots of all membership functions involved in your system;
• Evaluate the system performance in terms of the average relative error on both training and test sets.
You may use either Mamdani-type or Sugeno-type inference, but you should include some justification for your decision.

Part 5 – Report and Presentation (approx. 20%)
This is the assignment ‘deliverable’; i.e., what you are required to submit. It should contain your results from Parts 1 to 4, put together in a clear and coherent manner. It should also clearly describe how you conducted your investigation and any design choices you made (e.g., What parameters did you experiment with when applying the MLP? What different membership functions did you experiment with in creating your fuzzy system? Why did you opt for Mamdani-type inference as opposed to Sugeno-type inference? and so on). Basically, the more thorough and systematic your analysis, the better. A summary of your overall findings should also be provided in the report.

Assessment
Approximate marks for each of Parts 1 to 5 have been indicated above. The marks for Parts 1 to 4 are based on correctness and completeness of the tasks specified. The 20% allocated for Part 5 will be based on:
• how clearly and coherently the report and presentation have been presented;
• the description and justification they provide for the design choices that have been made;
• the evidence they provide of systematic experimentation with different system parameters;
• the conclusions they make in regard to the use of the various approaches in predicting electricity prices.
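The handout asks for the Average Relative Error in Parts 2 to 4 and for "an appropriate technique" to remove outliers in Part 1, but gives a formula for neither. The sketch below shows one possible reading, using only the Python standard library. Both choices are assumptions rather than the subject's official definitions: the error is taken as the mean of |predicted − actual| / |actual|, and outliers are flagged with Tukey's 1.5 × IQR rule. The function names and the sample prices are made up for illustration.

```python
from statistics import quantiles


def average_relative_error(actual, predicted):
    """Mean of |predicted - actual| / |actual| (assumed definition).

    Raises ZeroDivisionError if any actual value is exactly zero.
    """
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    total = sum(abs(p - a) / abs(a) for a, p in zip(actual, predicted))
    return total / len(actual)


def iqr_keep_mask(values, k=1.5):
    """True for each value inside [Q1 - k*IQR, Q3 + k*IQR] (Tukey's rule)."""
    q1, _, q3 = quantiles(values, n=4)  # quartile cut points
    spread = q3 - q1
    low, high = q1 - k * spread, q3 + k * spread
    return [low <= v <= high for v in values]


# Made-up prices with one obvious spike, standing in for the P(t+1) column.
prices = [20.0, 22.0, 21.0, 25.0, 23.0, 24.0, 300.0]
mask = iqr_keep_mask(prices)
kept = [p for p, keep in zip(prices, mask) if keep]
print(f"kept {len(kept)} of {len(prices)} prices")

# Two made-up forecasts, each 10% away from the actual price.
print(average_relative_error([100.0, 200.0], [110.0, 180.0]))
```

When filtering the real datasets, the same mask would need to be applied to whole rows, so that the six predictor columns stay aligned with the prices that remain.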
PowerPoint Presentation

1
Advice on designing your fuzzy inferencing system for Part 4 of the Assignment

2
Remove outliers from both the training and test set
Removing Outliers

3
• Initially we have six input variables, T(t-2), T(t-1), T(t), D(t-2), D(t-1), D(t), and one output variable, P(t+1).
• We can calculate the correlation coefficient matrix as follows:
>> corrcoef(A)
(Refer to the Week 5 lab)
where the matrix A contains all seven columns (the six input variables and the output variable).
• You will need to examine the correlations to decide which input variables you will use to construct your fuzzy inferencing system.
• In the following, it will be assumed that T(t-2) and D(t) have been selected. But note that these are probably not very good, and you must select variables that you expect will work well.
• Then the objective of the FIS design is to find the unknown functional relationship between T(t-2), D(t) and P(t+1).
Variable selection using correlation coefficient matrix

4
Two inputs and one output
T(t-2), D(t)
Price(t+1)

5
For each variable (both input and output), choose appropriate linguistic variables, and ‘roughly’ design membership functions based on their distributions.
Designing Membership Functions

6
You will need to come up with the fuzzy rules on your own. These can be based on heuristics; i.e., rules of thumb. Here are some examples:
These are just examples, and you will need to come up with rules of your own!
Fuzzy Rules

7
• Once you have selected your fuzzy membership functions and rules, you can implement them in Scikit-fuzzy using the code from the Week 6 and 7 labs as a basis.
• It is suggested that you organise the code into a function, so that it will be easy for you to perform the inferencing over all examples in the test set.
• You should also write a function to calculate the Average Relative Error, which you will use to evaluate your system’s performance.
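The structure suggested in the bullets above (one function that performs the inferencing for a single example, a loop over all examples, and a separate error function) can be sketched as follows. This is a minimal illustration in plain Python, not Scikit-fuzzy code and not the lab code: it hand-rolls a zero-order Sugeno system so that it runs without any extra packages. Everything specific in it is hypothetical: the membership ranges, the three constant output prices, the four rules, the sample rows, and the assumed definition of the Average Relative Error. The inputs T(t-2) and D(t) simply follow the placeholder choice made on the slides.

```python
def trimf(x, a, b, c):
    """Triangular membership: feet at a and c, peak at b (a <= b <= c)."""
    if x == b:
        return 1.0
    if x <= a or x >= c:
        return 0.0
    return (x - a) / (b - a) if x < b else (c - x) / (c - b)


def three_sets(x, lo, hi):
    """Membership degrees for low / medium / high over the range [lo, hi]."""
    x = min(max(x, lo), hi)  # clamp so out-of-range inputs still fire a rule
    mid = (lo + hi) / 2
    return {
        "low": trimf(x, lo, lo, mid),
        "medium": trimf(x, lo, mid, hi),
        "high": trimf(x, mid, hi, hi),
    }


# Hypothetical Sugeno consequents: one constant price per linguistic value.
PRICE = {"low": 20.0, "medium": 35.0, "high": 60.0}


def predict_price(temp, demand):
    """Forecast P(t+1) from T(t-2) and D(t) with four illustrative rules."""
    t = three_sets(temp, 5.0, 40.0)         # assumed temperature range
    d = three_sets(demand, 4000.0, 9000.0)  # assumed demand range
    rules = [
        (d["low"], PRICE["low"]),        # IF demand is low THEN price is low
        (d["medium"], PRICE["medium"]),  # IF demand is medium THEN price is medium
        (d["high"], PRICE["high"]),      # IF demand is high THEN price is high
        # IF demand is medium AND temperature is high THEN price is high
        (min(d["medium"], t["high"]), PRICE["high"]),
    ]
    total = sum(w for w, _ in rules)  # at least 1, since demand sets sum to 1
    return sum(w * z for w, z in rules) / total  # weighted average of outputs


def average_relative_error(actual, predicted):
    """Mean of |predicted - actual| / |actual| (assumed definition)."""
    return sum(abs(p - a) / abs(a) for a, p in zip(actual, predicted)) / len(actual)


# Made-up rows of (T(t-2), D(t), P(t+1)); replace with rows from the datasets.
rows = [(22.5, 6500.0, 33.0), (12.0, 4000.0, 21.0), (40.0, 7750.0, 50.0)]
predictions = [predict_price(t, d) for t, d, _ in rows]
are = average_relative_error([p for _, _, p in rows], predictions)
print(f"Average Relative Error: {are:.3f}")
```

With Scikit-fuzzy, the body of `predict_price` would be replaced by the control-system code from the labs, while the loop and the error function could stay the same. Sugeno inference is used here only because it is short to write by hand; the choice between Mamdani and Sugeno still needs its own justification in the report.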
Implementing your fuzzy inferencing system in Scikit-fuzzy

8
• Part 4 of the assignment carries the greatest weight in terms of marks, so it is important that you can get a fuzzy system running, even if it does not perform very well.
• It is strongly suggested that you