Essential information:
You must follow the requirements explained in the reassessment outline and report template above.
Your submission should contain your report plus 1 to 3 data files (one file per research question):
- A data file containing the final version of the data that you used for analysis (ARFF or CSV format)
- A completed report (a Word document or PDF)
Failure to submit all of the above will result in a mark of zero for the whole assessment.
COM00148M
Department of Computer Science
BIG DATA ANALYTICS REASSESSMENT BRIEF

Author: Dr Rob Alexander
Assessment Type: Summative open - resit
Release: Following marks release of Summative Assessment
Submission: See ‘Assignments’ section in Canvas for submission deadline
Feedback: Within 20 working days of submission
Weighting: 100%

Resit Guidance
Your reassessment for Big Data Analytics is by resubmission of your original assessment. From your first attempt feedback you will now be aware of which elements you were stronger on. A good strategy would be to work on any of your answers that achieved less than 5/10. For each of these answers, consider what you have been asked to add or improve. Remember that the word count for these questions still stands, so you may have to rewrite all or part of that section. Be careful not to lose the points that you were awarded marks for in your first attempt. Consider how you can add evidence to your answers; further reading and referencing (IEEE) may help.

Suggestions:
For Q1: your feedback may imply you have not actually posed three questions; if so, reword them so this is clearer. Alternatively, you may have listed three questions but not justified them, in which case you need to justify them with domain research. What makes them interesting and meaningful? Consider whether you need to rewrite them, and the impact this may have on the rest of your report if you do.
For Q2, Q3 and Q4: these ask you to consider “How, What and Why”. Does your answer tell us this? Remember, adding citations to support your decisions will provide evidence and credibility to your responses.
Q5 is looking for an appropriate summary of your results with some discussion of how you arrived at them, not just a report of technical data. How did you interpret the results? Perhaps look at published academic papers to see how their authors presented and arrived at their results. Citations are required to support how you arrived at these results.
Q6 asks whether your results answer your three questions or leave you with more questions. If they do not answer your questions, what could you change if you were to repeat this exercise? Do your findings mirror or disagree with the published research?
Q7, Q9 and Q10 are research-based questions linked to the scenario; you should be able to answer these with further reading and the lessons taught in the module. Remember to cite the sources you used to demonstrate that you have undertaken the research required; this will also add evidence and credibility to your answers and may help you achieve more marks.
Q8 is a little trickier and perhaps not as easy to resolve. Remember, adding citations to support your decisions will provide evidence and credibility to your responses. I would suggest you break it down into two parts. Consider:
· What data do you have left once you have cleaned and fixed it?
· How would you represent that in a database?
· How would you convert the data so that it is readable in Weka?
Finally, revisit your answers where you achieved 5 or more and think about how you can add evidence and credibility. If you haven't already, consider adding citations if you completed further reading to support your responses.

Whilst we recognise that you have been given an opportunity to improve your work by adding to or amending it, you may feel that this is not necessary throughout. This could be for the following reasons:
· In some sections you were awarded pass marks in your first submission, so you have decided to focus solely on the weaker areas of your work
· You were penalised in your first submission for lateness and want to resubmit your original work without changing anything, as the non-penalised mark was a pass.
Given this, you should present all the changes and additions in your resubmitted work in BLUE text and clearly indicate any additional diagrams/figures.
Additional citations and references should also be in BLUE text. You should add a note to the end of your submission if any changes or additions have been made to supporting artefacts, such as the data sets submitted alongside your report. Content that you have decided is NOT part of your submission should be removed completely. It is ESSENTIAL you do this to ensure markers can focus on your most recent work and do not re-mark work that has already achieved the required grades.

Module Learning Outcomes
1. Create a data set using modern database models and technology
2. Manipulate a data set to extract statistics and features
3. Critically evaluate and apply data mining techniques/tools to build a classifier or regression model, and predict values for new examples
4. Analyse and communicate issues with scaling up to large data sets, and use appropriate techniques to scale up the computation
5. Critically discuss the need for privacy, identify privacy risks in releasing information, and design techniques to mediate these risks.
This assessment will contribute to all the learning outcomes for this module.

Assessment Background/Scenario
Data
You will find a dataset called brfss_for_bda_2021.csv, which describes a survey around health and associated behavioural factors carried out on a population in the US. It’s in CSV format.

Assessment Task(s)
Your task is to use that dataset, any information you can find about it elsewhere, and the techniques taught in the module, to pose and answer three research questions of your choosing. You will then need to consider how you might store the (research-question-relevant) data in a database, how you might spread a very large version of that data over multiple computers, and what the privacy concerns are here and how you might address them. Produce a structured analysis report using the given template. The structured report consists of seven sections, each containing specific questions, which you must answer.
In sections 3 and 4 of the report, which require you to use data analysis tools, you may use WEKA or Python tools. In section 3 you may also use Excel for visualisation.
NOTE: Failure to submit all the data files will result in a zero grade.

Deliverables
General submission criteria
· The submission is a report and final data files. The report must be provided in a format which Canvas can display (i.e. PDF or MS-Word native format). The data files must contain the final version of the data that you used for analysis (ARFF or CSV format).
· You are expected to research your answers and to cite appropriate academic and/or other sources in an appropriate format (IEEE) for the type of report you have been asked to write. It is probably not sufficient to use only the module notes.
· Each part has an indicated maximum word count for your answer. Any cover page and reference lists or bibliographies do not count towards these limits.
· Text that exceeds a word count will not be marked.
· Your assessment submission should not include your examination number or any other personal identification information.

Assessment Criteria
The York Computer Science Department
Each entry below gives the question, its criteria, and the available marks.

Section 1 – Your data
Uploading data. Criteria: Pass/fail. Available marks: n/a

Section 2 – Business/research questions
1. Business/research questions. Criteria: The questions are clear (the reader can understand what students are asking), they are plausibly of interest to someone, and they are answerable with the tools studied in the module. Available marks: 10

Section 3 – Processing the data
2. Exploring the data. Criteria: There has been useful exploration, given the data and the business/research questions, it has been justified intelligibly, and reasonable conclusions have been drawn. Available marks: 10
3. Cleaning/fixing the data. Criteria: Changes to the data are appropriate given the data and business/research questions, and the justification makes that clear. Available marks: 10

Section 4 – Data analysis
4. Analysis techniques. Criteria: Analysis techniques are appropriate to the questions, and the justification explains this. Available marks: 10
5. Results. Criteria: Results are clear and are plausible (i.e. not obviously invalid). Available marks: 10
6. Answers to business/research questions. Criteria: The conclusions are reasonable, traceable clearly to the results, and are indeed answers (or justified observations that there is no clear answer) to the research questions. Available marks: 10
7. Residual threats to validity. Criteria: The threats are valid, accurately described, and of genuine concern. Available marks: 10

Section 5 – Dealing with large data sets
8. Relational databases. Criteria: Database schema is appropriate (and appropriately normalised). Interface to WEKA is valid and practical. Available marks: 10
9. Distributed computation. Criteria: A viable approach using multiple technologies that would likely achieve a worthwhile gain in performance (given coordination overhead etc). Available marks: 10

Section 6 – Privacy
10. Privacy issues. Criteria: The privacy issues are of genuine concern and the strategies are plausible solutions to them. Available marks: 10

Section 7 – Report references
11. Reference list. Criteria: None as such, but failure to do this, or failure to do it well, can cost marks elsewhere or lead to a misconduct charge. Available marks: n/a

Total: 100

Grading
The pass mark for postgraduate modules is 50%. For more information about grades and assessment criteria, please review the ‘Assessment and award’ section of the York Online Handbook, which can be downloaded from the Orientation Module.

Assessment submission
You will submit your assessments in the ‘Assignments’ area of the module in Canvas. Please check your Canvas module for the specific submission date for this assignment. For general assessment guidelines, consult your Canvas Module and the Orientation Module; for Academic Regulations, consult the University of York’s Guide to Assessment, Standards, Marking and Feedback. Any queries regarding the details of your assessment should be directed to your module tutor. Any queries regarding assessment procedures should be directed to
[email protected].

Submission deadline
The reassessment submission deadline will be confirmed via email following the summative assessment marks release. You can also find the deadline in the ‘Assignments’ area of the module in Canvas.

Submission of student work to Turnitin®
Turnitin® is a third-party text-matching software programme that compares written assessments with an online database of articles, books, websites, and pieces submitted by other users. To ensure the highest levels of academic integrity and in line with University Regulation 5.7b, any work submitted for this summative assessment will be submitted to Turnitin®. In accepting the University Regulations on admission, students have agreed to the University’s use of this software package. You can also use Turnitin® as part of your writing process, to help you check your use of source information and improve your understanding of academic integrity. To access the Turnitin® submission points, or for more information on this software tool, see the Turnitin Training module in Canvas.

Important Information
If you are unable to complete your open assessment by the submission date indicated above because of Exceptional Circumstances you can apply for an extension. If unforeseeable and exceptional circumstances do occur, you must seek support and provide evidence as soon as possible at the time of the occurrence. Full details of the Exceptional Circumstances Policy and the online application form are available on the Exceptional Circumstances affecting Assessment webpage. If you submit your open assessment on