In Part A your task is to answer a question about the data in an unprocessed text file, first using MRJob, and then using Hive.
In Part B your task is to answer a question about the data in a CSV file, first using MRJob, and then using Hive.
Part A - MRJob and Hive with text (8 marks)

In Part A your task is to answer a question about the data in an unprocessed text file, first using MRJob, and then using Hive. By using both to answer the same question about the same file, you can more readily see how the two techniques compare.

When you click the panel on the right, you'll get a connection to a server that has, in your home directory, a text file called "walden.txt", containing some sample text - an extract from Walden, by Henry David Thoreau (feel free to open the file and explore its contents). In this text file, each line is a sentence. Note that there are multiple spaces at the end of each line in this unprocessed file.

Your task is to find the average word length for each first letter of a sentence. For example, given a toy input file as shown below:

Aaa bbb cc. Ab b.

the output should be:

"Letter A:" 2.6

because, for A, there are three words of length 3 and two words of length 2, so the average is (3*3 + 2*2)/5 = 2.6.

You can assume that sentences are separated by full stops, and words are separated by spaces. For simplicity, include all punctuation, such as ',' and '.', when calculating word length, as in the example above (so the length of 'cc.' is 3 instead of 2). Letter case should be ignored.

Given the walden.txt file as input, the output format is "letter: avg_word_length", with each result rounded to two decimal places using round(x, 2), as shown below:

"Letter A:" 4.17
"Letter S:" 4.32
"Letter T:" 4.09
"Letter I:" 4.16
"Letter W:" 4.89
"Letter B:" 4.82
"Letter F:" 4.18

(There is no need to sort the results or remove the quotation marks.)

First (4 marks)

Write an MRJob job to do this. A file called "job.py" has been created for you - you just need to fill in the details. You should be able to modify MRJob jobs that you have already seen in this week's content (see the sketch below). You can test your job by running the following command (it tells Python to execute job.py, using walden.txt as input):

$ python job.py walden.txt

Second (4 marks)

Write a Hive script to do this. A file called "script.hql" has been created for you - you just need to fill in the details. You should be able to modify Hive scripts that you have already seen in this week's content, and you might use some User-Defined Functions (UDFs), which can be found here (see the sketch below). You can test your script by running the following command (it tells Hive to execute the commands contained in the file script.hql):

$ hive -f script.hql
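For the First task, here is a minimal MRJob sketch of one way to approach it. It is not the provided job.py: the class name is made up, and it leans on the statement above that each line of walden.txt is a single sentence.

from mrjob.job import MRJob

class MRAvgWordLength(MRJob):

    def mapper(self, _, line):
        # Each line of walden.txt is one sentence; strip() removes the
        # trailing spaces noted above.
        sentence = line.strip()
        if not sentence:
            return
        # Key every word in the sentence by the sentence's first letter,
        # upper-cased so that case is ignored.
        letter = sentence[0].upper()
        for word in sentence.split():
            # Punctuation counts towards word length, so no stripping here.
            yield letter, len(word)

    def reducer(self, letter, lengths):
        lengths = list(lengths)
        yield "Letter %s:" % letter, round(sum(lengths) / len(lengths), 2)

if __name__ == '__main__':
    MRAvgWordLength.run()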
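For the Second task, here is a matching Hive sketch. The table name "sentences" and the load path are assumptions, not taken from the provided script.hql; it uses only standard Hive built-ins (SPLIT, EXPLODE, SUBSTR, LENGTH, AVG, ROUND, CONCAT). Note that Hive prints the label without the quotation marks that MRJob adds.

-- Stage each line of walden.txt as one sentence.
-- (Table name and path are assumptions for this sketch.)
DROP TABLE IF EXISTS sentences;
CREATE TABLE sentences (sentence STRING);
LOAD DATA LOCAL INPATH 'walden.txt' OVERWRITE INTO TABLE sentences;

-- Explode each sentence into words (splitting on runs of spaces copes
-- with the trailing spaces), then average word lengths per first letter.
SELECT CONCAT('Letter ', UPPER(SUBSTR(TRIM(sentence), 1, 1)), ':') AS letter,
       ROUND(AVG(LENGTH(word)), 2) AS avg_word_length
FROM sentences
LATERAL VIEW EXPLODE(SPLIT(TRIM(sentence), ' +')) w AS word
GROUP BY CONCAT('Letter ', UPPER(SUBSTR(TRIM(sentence), 1, 1)), ':');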
Part B - MRJob and Hive with CSV (8 marks)

In Part B your task is to answer a question about the data in a CSV file, first using MRJob, and then using Hive. By using both to answer the same question about the same file, you can more readily see how the two techniques compare.

When you click the panel on the right, you'll get a connection to a server that has, in your home directory, a CSV file called "orders.csv", containing data about book orders (feel free to open the file and explore its contents). Here are the fields in the file:

OrderDate (date)
ISBN (string)
Title (string)
Category (string)
PriceEach (decimal(5,2))
Quantity (integer)
FirstName (string)
LastName (string)
City (string)

Your task is to find the total dollar amount of orders for each city - that is, the sum of PriceEach * Quantity over each city's rows. Your results should appear as the following:

ATLANTA 211.85
AUSTIN 391.25
BOISE 39.9
CHEYENNE 19.95
CHICAGO 111.9
CODY 55.95
EASTPOINT 182.75
KALMAZOO 170.9
MACON 61.95
MIAMI 17.9
MORRISTOWN 55.95
SEATTLE 61.9
TALLAHASSEE 144.45
TRENTON 199.85

(There is no need to sort the results or remove the quotation marks.)

First (4 marks)

Write an MRJob job to do this. A file called "job.py" has been created for you - you just need to fill in the details. You should be able to modify MRJob jobs that you have already seen in this week's content (see the sketch below). You can test your job by running the following command (it tells Python to execute job.py, using orders.csv as input):

$ python job.py orders.csv

Second (4 marks)

Write a Hive script to do this. A file called "script.hql" has been created for you - you just need to fill in the details. You should be able to modify Hive scripts that you have already seen in this week's content (see the sketch below). You can test your script by running the following command (it tells Hive to execute the commands contained in the file script.hql):

$ hive -f script.hql
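For the First task, a minimal MRJob sketch. The class name is made up, and it assumes orders.csv has no header row and no commas embedded inside quoted fields - check the file before relying on that.

import csv

from mrjob.job import MRJob

class MRCityTotals(MRJob):

    def mapper(self, _, line):
        # Field order: OrderDate, ISBN, Title, Category, PriceEach,
        # Quantity, FirstName, LastName, City
        row = next(csv.reader([line]))
        price, quantity, city = float(row[4]), int(row[5]), row[8]
        yield city, price * quantity

    def reducer(self, city, amounts):
        # Total dollar amount of orders for this city.
        yield city, round(sum(amounts), 2)

if __name__ == '__main__':
    MRCityTotals.run()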
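For the Second task, a Hive sketch under the same assumptions (the table name is mine, and FIELDS TERMINATED BY ',' only works if no field contains an embedded comma):

-- Define a table over the CSV data and load the file into it.
-- OrderDate is kept as STRING here to sidestep date-format parsing.
DROP TABLE IF EXISTS orders;
CREATE TABLE orders (
    OrderDate STRING,
    ISBN      STRING,
    Title     STRING,
    Category  STRING,
    PriceEach DECIMAL(5,2),
    Quantity  INT,
    FirstName STRING,
    LastName  STRING,
    City      STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE;

LOAD DATA LOCAL INPATH 'orders.csv' OVERWRITE INTO TABLE orders;

-- Sum the dollar amount (PriceEach * Quantity) per city.
SELECT City, SUM(PriceEach * Quantity) AS total
FROM orders
GROUP BY City;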