Hello, I did my code in big data. You just have to write this report. Can you please do that?
ICT707 Data Science Practice Task 3 Semester 1, 2019

ICT707 Data Science Practice Assignment 3 Page 2 of 6

Assessment and Submission Details

Marks: 40% of the Total Assessment for the Course
Due Date: 11:59pm Friday, Exam Week 1

Submit your assignment to Blackboard Task 3. Please follow the submission instructions in Blackboard. The assignment will be marked out of a total of 100 marks and forms 40% of the total assessment for the course. ALL assignments will be checked for plagiarism automatically by the SafeAssign system provided by Blackboard. Refer to your Course Outline or the Course Web Site for a copy of the “Student Misconduct, Plagiarism and Collusion” guidelines.

Late submission will be penalised according to the policy in the course outline. Please note Saturday and Sunday are included in the count of days late.

Requests for an extension to an assignment MUST be made to the course coordinator prior to the date of submission. Requests made on the day of submission or after the submission date will only be considered in exceptional circumstances. Assignment submission extensions will only be made using the official University guidelines.

Assignment Task

This assignment consists of two deliverables:
• PySpark source code in Jupyter Notebook format (50%). All Jupyter notebook files and the data set file relating to this assignment should be contained within a folder named: Task 3- Your Name-Student Number. The folder is then to be zipped and uploaded to Blackboard.
• A report (50%). The report must be uploaded as a separate file.

Part I - PySpark source code (50%)

Important Note: For code reproduction, your code must be self-contained. That is, it should not require other libraries besides the PySpark environment we have used in the workshops.

In this component, we need to utilise Python 3 and PySpark to complete the following data analysis tasks:
1. Exploratory data analysis
2. Recommendation engine
3. Classification
4. Clustering

You need to choose a dataset from Kaggle (https://www.kaggle.com/datasets) to complete these tasks. Remember to include the data set file in your source code submission.

Task I.1: Exploratory data analysis
This subtask requires you to explore your dataset by
• reporting its number of rows and columns,
• doing the data cleaning (missing values or duplicated records) if necessary,
• summarising 3 columns with plots (e.g. bar chart, histogram, boxplot, etc.).

Task I.2: Recommendation engine
This subtask requires you to implement a recommender system based on collaborative filtering with the Alternating Least Squares (ALS) algorithm. You need to include
• Model training and predictions
• Model evaluation using MSE

Task I.3: Classification
This subtask requires you to implement a classification system using logistic regression with the LogisticRegressionWithLBFGS class. You need to include
• Logistic Regression model training
• Model evaluation

Task I.4: Clustering
This subtask requires you to implement a clustering system using K-means. You need to include
• Model training
• Model evaluation

Part II – Report (50%)

You are required to write a report explaining the theory underlying the key concepts around the design and implementation of your code. Finally, you are to include all code in .py format in the appendices of the report. Note that the code will not count towards the word count.

Your report should follow the following template:

Table of Contents
1.0 Introduction
2.0 Key System Concepts
2.1 Machine learning pipelines. Explain key steps in machine learning pipelines and how they were applied in your code.
2.2 Collaborative filtering. Explain Collaborative filtering principles and how they were applied in your code.
2.3 Logistic regression. Explain Logistic regression principles and how they were applied in your code.
2.4 K-Means. Explain K-Means principles and how they were applied in your code.
4.0 Conclusion
References
Appendices

The marking rubrics are viewable on Blackboard.

Report Format

Your report should be 1000 to 1500 words. The report MUST be formatted using the following guidelines:
• Title Page – Must not contain headers, footers, or page numbering. Include your name as the report’s author.
• Header – Report title
• Footer – Your name and the page number
• Paragraph text – 12 point Calibri, single line spacing
• Headings – Arial in an appropriate type size
• Margins – 2.5cm on all margins
• Page numbering – Introduction and onwards to use conventional numerals (1, 2, 3, 4), starting at page 1 from the introduction.
• The report is to be created as a single Microsoft Word document (version 2007 or later). No other format is acceptable, and submitting in another format will result in the deduction of marks.

Please follow the conventions detailed in: Summers, J. & Smith, B., 2014, Communication Skills Handbook, 4th Ed, Wiley, Australia.

Referencing

The report is to include at least 5 appropriate references, and these references should follow the Harvard method of referencing. Note that ALL references should be from journal articles, conference papers, technical papers or a recognized expert in the field. DO NOT use Wikipedia as a reference. The use of unqualified references will result in the deduction of marks.

Assignment Return and Release of Grades

Assignment grades will be available on Blackboard two weeks after submission. Details of marking will also be accessible via online rubrics on Blackboard. Where an assignment is undergoing investigation for alleged plagiarism or collusion, the grade for the assignment and the assignment itself will be withheld until the investigation has concluded.

Assignment Advice

This assignment will take several weeks to complete and will require a good understanding of machine learning and PySpark for successful completion. It is imperative that students take heed of the following points in relation to doing this assignment:
1. Ensure that you clearly understand the requirements for the assignment: what must be done and what the deliverables are.
2. If you do not understand any of the assignment requirements, please ASK your tutor.
3. Each time you work on any aspect of the assignment, reread the assignment requirements to ensure that what is required is clearly understood.
4. We have practiced nearly all coding tasks in DataCamp before. If you have any difficulty, redoing the practices in DataCamp is recommended.
5. Prior to submitting your code, you should ensure not only that it executes as required, but also that it looks professional. It is expected that you adhere to Python standards for naming and indenting. All methods should be adequately documented such that another programmer examining your code will readily know what the code is doing.

Plagiarism and Collusion Advice

1. All work must be submitted through SafeAssign.
2. SafeAssign will pick up any similarities between work online as well as work from other students (in this semester and previous semesters).
3. Please make sure you reference your work properly. If you are using any material from the internet or any books from the library, you need to cite the work correctly. Failure to do so will result in possible cases of Academic Misconduct.
4. Please do not share your work with other students. Do not give anyone your files to have a look. SafeAssign will pick up collusion, but keep in mind the percentages for Collusion may not report accurately until all student assignments have been submitted. Both the person copying and the person providing will potentially be held accountable.
5. You can submit a draft assignment through SafeAssign before making the actual submission.
6. If you need any advice or are unsure about referencing, please speak with ATMC Administration for assistance.

End of Assignment
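
A note on the "Model evaluation using MSE" step in Task I.2: the metric is the average squared gap between actual and predicted ratings over the (user, item) pairs that have both. The sketch below is plain Python, not PySpark, and only illustrates the arithmetic. The function name, the dictionary layout and the rating values are all made up for the example and do not come from the assignment or from any dataset.

```python
def mean_squared_error(actual, predicted):
    """Return the MSE over the (user, item) keys present in both dicts."""
    shared = actual.keys() & predicted.keys()
    if not shared:
        raise ValueError("no overlapping (user, item) pairs to evaluate")
    total = sum((actual[key] - predicted[key]) ** 2 for key in shared)
    return total / len(shared)


# Hypothetical ratings keyed by (user_id, item_id); the values are invented.
actual = {(1, 10): 4.0, (1, 20): 2.0, (2, 10): 5.0}
# Predictions may cover extra pairs; those without an actual rating are ignored.
predicted = {(1, 10): 3.5, (1, 20): 2.5, (2, 10): 4.0, (2, 30): 3.0}

mse = mean_squared_error(actual, predicted)
print(mse)
```

Only pairs that appear in both dictionaries are scored, which mirrors the usual practice of joining predictions to held-out ratings before averaging the squared errors.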