Faculty of Engineering, Environment and Computing (EEC)
220CT Assignment Brief 2018/19

Module Title: Data and Information Retrieval
Individual
Cohort: Sep
Module Code: 220CT
Coursework Title (e.g. CWK1): CW 1
Hand out date: 29 October 2018
Lecturer: Rachid Anane
Due date: 7 December 2018
Estimated Time (hrs): 20
Coursework type: CW
% of Module Mark: 50
Submission arrangement online via CUMoodle:
File types and method of recording:
Mark and Feedback date:
Mark and Feedback method: feedback file

Module Learning Outcomes Assessed:
1. Explain the difference between data and information and its significance as a business resource.
2. Identify the main advantages and disadvantages of using database and information retrieval systems.
3. Analyse, design, implement and manage a database solution for a specified commercial or scientific objective.
4. Demonstrate understanding of Big Data as a concept and as a business tool through the application of data analysis techniques.

Task and Mark distribution:
1. Normalisation (25%)
2. Database design (25%)
3. MapReduce (25%)
4. Recommendation Systems (25%)

Notes:
1. You are expected to use the CUHarvard referencing format. For support and advice on this, students can contact the Centre for Academic Writing (CAW).
2. Please notify your registry course support team and module leader for disability support.
3. Any student requiring an extension or deferral should follow the university process as outlined here.
4. The University cannot take responsibility for any coursework lost or corrupted on disks, laptops or personal computers. Students should therefore regularly back up any work and are advised to save it on the University system.
5. If there are technical or performance issues that prevent students submitting coursework through the online coursework submission system on the day of a coursework deadline, an appropriate extension to the coursework submission deadline will be agreed.
This extension will normally be 24 hours, or the next working day if the deadline falls on a Friday or over the weekend period. This will be communicated via email and as a CUMoodle announcement.

220CT – Data and Information retrieval

This assignment is made up of four parts:
- Part 1 deals with normalisation and E-R modelling.
- Part 2 covers database design.
- Part 3 involves the application of MapReduce.
- Part 4 concerns recommendation systems.

Part 1: Normalisation (This task is worth 25 marks)

The International Space Station (ISS) is a habitable artificial satellite in low Earth orbit. It is the ninth space station to be inhabited by crews, following previous orbital stations that were launched by the US, the former Soviet Union and later Russia. The ISS is intended to be a laboratory, observatory and factory in space, as well as to provide transportation and maintenance, and to act as a staging base for possible future missions to the Moon, Mars and beyond.

In order to support the crew and overall operation of the ISS, the space agencies in charge of running the station conduct regular missions to launch spacecraft carrying payloads of essential or replacement equipment up to the ISS. A payload inventory (see table below) is recorded for each mission, consisting of the space agency leading the mission and the equipment payload to be sent up to the ISS.

Mission No.  Agency No.  Lead Agency  Country  Mission Date  Equipment                Qty  Equipment Weight
ISS-2237     178         JAXA         Japan    14/12/2016    Potable water dispenser  2    100kg
                                                             Flexible air duct        6    0.5kg
                                                             Small storage rack       4    2kg
ISS-3664     526         ESA          EU       16/01/2017    Bio Filter               6    0.20kg
ISS-2356     167         NASA         USA      12/04/2017    Small storage rack       3    2kg
                                                             Battery pack             2    5kg
                                                             Urine transfer tubing    2    1.5kg
                                                             O2 scrubber              1    50kg
ISS-1234     032         Roskosmos    Russia   16/04/2018    Small storage rack       1    2kg
                                                             Flexible air duct        2    0.5kg

1. Explain why the table is not normalised.
2. Identify and state the functional dependencies in the table.
3. Generate 1NF, 2NF and 3NF normalised relations.
   - Justify clearly every step.
   - Produce the corresponding tables.
4. Produce SQL statements to create the 3NF relations (tables), and include SQL insert statements for each of the tables.
5. Comment critically on the normalisation process.
6. Generate the ER diagram corresponding to the table.

Part 2: Database Design (This task is worth 25 marks)

The NASA exoplanet dataset archive can be found here:
https://exoplanetarchive.ipac.caltech.edu/cgi-bin/TblView/nph-tblView?app=ExoTbls&config=planets

In the context of Big Data, you are asked to design a database solution for the exoplanet data set above. Your solution must include the following:
1. The database solution of your choice.
2. Justification for the choice of the database.
3. A detailed explanation of how the data will be stored and accessed in the database you choose.
4. The benefits and drawbacks of this solution in relation to the type of data above and the size of the data set.
5. The quality of service (QoS), such as scalability, that should be provided to the user should this solution be adopted.

Part 3: Sequential and parallel processing (This task is worth 25 marks)

Consider a flight data store with the following data structure, where all times are in GMT. Each record consists of 13 attributes; the set of allowable values of the attributes and their format are specified in the description (metadata).
     Data Value             Description
1    Year                   1999-2017
2    Month                  1-12
3    Day of Month           1-31
4    Day of the Week        1 (Monday) – 7 (Sunday)
5    Departure Time         Recorded Departure time (hhmm)
6    Actual Departure time  Scheduled Departure time (hhmm)
7    Arrival Time           Recorded Arrival time (hhmm)
8    Carrier                Carrier code (unique)
9    Flight Number          Flight Number
10   Departure Delay        minutes
11   Arrival Delay          minutes
12   Cancellation           Yes or No
13   Weather Delay          minutes

An example record would have the following values:
(2015, 4, 20, 5, 1430, 1400, 1820, 131, JL729, 30, 15, No, 0)

Flight monitors would like to determine the number of flights which were delayed for each carrier.

1. Assuming that the data is stored in a relational database, produce, with justification, the SQL statement to create the table and the SQL statement to determine the number of flights which were delayed for each carrier.
2. Assuming that the data is too large to be processed in a centralised manner, and that it is stored in an ordinary file, produce a distributed solution which applies MapReduce to the data processing.
   a) Justify your decisions and all the steps of your solution, and specify clearly the map and reduce functions.
   b) Identify the advantages and drawbacks of this solution.
   c) Use diagrams if required.
3. Assuming that the monitors wish to determine the number of delayed flights for a specific year or month, for example, comment on the general applicability of your solution.

Part 4: Big Data and recommendation systems (This task is worth 25 marks)

Research and comment critically on the structure and the use of recommendation systems.
a) You should pay particular attention to the rationale, the architecture, the processes, the effectiveness, the implications of recommendation systems and relevant issues within a Big Data context. Your arguments should be supported by specific examples and case studies and should be properly referenced. Use suitable diagrams if required.
b) Produce in your own words a well-structured and adequately referenced report that should be no more than 1000 words.

Mark Scheme

Q1

Achieve 40%
• Evidence of partially correct applicable and correctly identified database.
• Evidence of reasoning behind database choice.
• For each activity a brief explanation of design decisions should be provided.
• Models providing detail about the design decisions and database design provided.

Achieve 70%
• A complete and correct design, including all elements.
• A complete explanation of the reasons behind the choice of Database.
• A complete and fully implemented database.
• For each step an explanation and justification of how and why it was applied.

Q2

Achieve 40%
• Basic definition of what data mining is with a few references.
• Basic understanding of sequential and parallel processing.
• Basic application of a partially correct SQL query.
• Partial understanding of parallel processing.
• Partially correct MapReduce solution.
• Basic rationale for the solution presented.

Achieve 70%
• Excellent definition of what data mining is with a diverse set of