All files left on Moodle in a state "Draft(not submitted)" will not be evaluated. Please refer to the submission link on Moodle for the submission due date and time. This assignment contributes...


All files left on Moodle in a state "Draft(not submitted)" will not be evaluated. Please refer to the submission link on Moodle for the submission due date and time.


This assignment contributes to20%of the total evaluation in the subject. This assignment consists of 5 tasks. Specification of each task starts from a new page. Marks are indicated for each task.


It is a requirement that all Laboratory and Assignment tasks in this subject must be solved individually without any cooperation with the other students. If you have any doubts, questions, etc. please consult your lecturer or tutor during lab classes or office hours. Plagiarism will result in a FAIL grade being recorded for that assessment task.


Deliverables
Submit one PDF file for each task which is up to 5 pages. The environment for this Assignment is the BigDataVM imported from the BigDataVM-30-03-2020.ova file. The software tool UMLetlet14.2 is used.


Task 1. Conceptual modelling of a data warehouse (6 marks)


Consider a sample data warehouse domain presented below.


A large network of hotels would like to create a data warehouse to store information about their hotels located in the different cities of different countries, hotel guests visiting the rooms in hotels, staff working at the hotels. The management of the network would like to store the following information in the data warehouse.


Each hotel is described by its location (country, city, building number), email address and link to a Web page. A hotel offers the rooms to its customers. A room has a unique number within a hotel. A room number consists of a floor number and a unique number at a floor. For example, room 25 at 5thfloor has a number 0525.


A hotel employs staff members. A staff member has a unique employee number, first name, last name, and date of birth. Staff members belong to either administration group or maintenance group. Among the other duties, administration staff members are allowed to perform check-in and check-out of hotel guests. Maintenance staff members perform the maintenance works in the rooms occupied by hotel guests.


Hotel guests stay in hotel rooms of different types. On check-in day a start date of a visit is recorded and on check-out day an end date of a visit is recorded. Besides check-in and check- out dates, the data warehouse must contain data about the room number and the amount of money of each visit for each guest. The data warehouse must also contain data about the total number of facilities used by hotel guests and the total number of maintenances performed in a room during a visit.


A hotel guest is described by a number of identification document, first name, last name, date of birth and nationality. A hotel guest uses a credit card to pay for his/her stay in a hotel. A credit card number and a name of bank that issued a card is recorded.


A management of hotel network would like to get from a data warehouse information about:




  1. a) The total number of visits per hotel and per month.




  2. b) The average number of maintenance per room type, per hotel and per day.




  3. c) The total amount of money paid by guests per hotel and per year.




  4. d) The total number of facilities used per hotel and per day.




  5. e) The average length (in days) of each visit per hotel.




(1) Create a conceptual schema of a data warehouse use a graphical notation with UMLetlet 14. The software tool UMLetlet 14.2 can be downloaded from Moodle. You should use the "Conceptual modelling" notation in the software tool.


(2) Present OLAP operations for queries (a) to (e) above. The OLAP operations should userelational algebraic operatorswhich are introduced in lecture notes. (Hint: You can use a “Rollup*” operator in the above queries and for (e) you can also define a new measure for “length of stay” by using an “Addmeasure” operator.)


Deliverables


A filesolution1.pdfwith solutions to questions (1) and (2) above.


Task 2. Logical modelling and interval Hive tables (3 marks)


Consider the conceptual schema of a data warehouse in an attached file namedtask2.pdf,


which is in the assignment resources folder on Moodle.


(1) Perform a logical design to transform the given conceptual schema into astar schema.Use UMLetlet 14.2 diagram software tool and apply a "Logical modelling" notation to draw the logical (star) schema.


(2) Implement aninternaltable in Hive foreachtable which you defined in the previous step,. Each Hive table must contain at least one records. (Note: you can use any way to populate data into the Hive tables.)


(3) After the Hive tables are created, useand


Note. In a star schema, all level tables are flat (i.e., denormalized).


Deliverables


A filesolution2.pdfwhich contains:




  • A drawing of a logical schema for question (1);




  • The Hadoop commands, HQL statements and outputs for questions (2) and (3) in


    Zeppelin or Terminal.




DESCRIBE


SELECTstatements in HQL to


describe the tables and list the first row from each table, respectively.


Task 3. Implementation of external tables in Hive (3.5 marks)


Consider the following two-dimensional data cube.


Available-for-shipment Involved-in-shipment


The data cube contains information about that parts that can be shipped by the suppliers.


Download the following files froma sub-foldertask3of the assignment resource folder on Moodle:part.tbl,supplier.tbl,partsupp.tbl. In the filepartsupp.tbl, the first (resp., second) fields contains foreign keys toPART(resp.,SUPPLIER).


Implement three external Hive tables forPART,SUPPLIER, andPARTSUPP, and populate the data in the three files into the tables.


UseSELECTstatements to retrieve the first 10 rows from each one of the external tables implemented in the previous step.


Deliverables


A filesolution3.pdfthe clearly presents the Hadoop commands, HQL statements and outputs in Zeppelin or Terminal.












SUPPLIER



S_KEY


S_NAME S_ADDRESS S_PHONE S_ACCTBAL S_COMMENT



PART


PARTSUPP


P_KEY


P_NAME P_MFGR P_BRAND P_TYPE
P_SIZE P_CONTAINER P_RETAILPRICE P_COMMENT


PS_AVAILQTY PS_SUPPLECOST PS_COMMENT


Task 4. Data partition and OLAP operations in Hive (3.5 marks)


Download the following files froma sub-foldertask4of the assignment resource folder on Moodle:lineitem.tblandcreate_table_lineitem.hql.


The filelineitem.tblcontains records about some ordered items.
The scriptcreate_table_lineitem.hqlincludes a HQL statement that creates an


internal Hive tableLINEITEMfor data inlineitem.tbl. Implement the following operations:
(1) Execute the provided HQL statement to createLINTITEMin Hive.


(2) Process the following “window framing and ordering” queries onLINTITEM, both of which retrieve information about the order quantity (L_QUANTITY) and the year and month of shipment (L_SHIPDATE):


(i) Return the year and month of shipment, and theyear-to-monthorder quantity of shipment.


(ii) Return the year and month of shipment, and therankof order quantity of shipment in each month of the year.


(3) Create a table namedPLINEITEM, which is partitioned based on theyear of shipment, and copy the data fromLINEITEMinto it.


Deliverables


A filesolution4.pdfthat clearly presents the inputs and outputs for your operations in Zeppelin or Terminal.


Task 5. Implementation of HBase Tables (4 marks)


Consider the following conceptual schema:


Is Manager of


Works-on


DEPARTMENT


--


Works-at


EMPLOYEE


enumber ID first-name last-name salary


PROJECT


name ID budget


code ID title


The objective of this task is to developtwo alternativeimplementations of the above schema as HBase tables. Both tables are loaded with data that meet the following information:


There must beat least one department with two employees, and one of them is a manager of the department. There aretwo projects, each of which is allocated with two employees. The names of column families and column qualifiers must be indicative. You can determine the row keys and cell values.


After finishing loading the information, use a “scan” command to list all rows from each table.


Also explain the difference between your two implementations.`


Deliverables


A filesolution5.pdfwhich includes:




  • The Hadoop commands, HBase shell commands and execution output in Zeppelin or


    Terminal;




  • An explanation of the difference between the two implementations.



Sep 30, 2021
SOLUTION.PDF

Get Answer To This Question

Related Questions & Answers

More Questions »

Submit New Assignment

Copy and Paste Your Assignment Here