- Store document files in Hadoop Distributed File System (HDFS).
- Write two MapReduce programs in Python to solve two typical applications in text analysis.
- Write a report for a professional data analyst audience and submit a document file containing all answers and screenshots via the submission point.
Assessment Details
Text analytics is the process of deriving high-quality information from documents. Text analysis parses the content of a document and creates structured data from its free-text content. Typical applications in text analytics include determining the frequency of word lengths and the frequency of individual characters within a text document.
In Practical 2, you are required to write two MapReduce programs and run them on the data stored in HDFS.
Preparation
- First, create a directory for this assessment called practical2 within the /home/cloudera directory.
- Our input file (or data) for this assessment will be the dictionary of English words found on the Linux operating system on your Cloudera VM, located at /usr/share/dict/words. Copy the file into your practical2 directory: right-click on the file -> "Open with gedit" -> File -> "Save As..." and save a copy of the file in your directory, then rename it to practical_input.txt. (A programmatic alternative to these gedit steps is sketched after this list, if you prefer.)
- Next, create a /practical2 directory within the Hadoop Distributed File System (HDFS). Review Practical Activity 5 to learn how to create a directory within HDFS.
- Lastly, upload the input file (i.e. practical_input.txt) to the /practical2 HDFS directory. Review Practical Activity 5 to learn how to upload data to an HDFS directory.
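If you prefer to do the local copy-and-rename step without gedit, the following short Python sketch is equivalent; the paths are the ones named above, and this is only an optional alternative, not a requirement.

    # Programmatic alternative to the gedit "Save As..." steps above:
    # copy the system dictionary into the practical2 directory under a new name.
    import shutil

    shutil.copy('/usr/share/dict/words',
                '/home/cloudera/practical2/practical_input.txt')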
Now you are ready to start writing your MapReduce programs to answer the questions below.
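As a reminder, a Hadoop Streaming job usually follows the shape below. This is only a sketch of the general pattern, assuming the word-count style covered in Practical Activity 5; the key emitted by the mapper is a placeholder that you will need to change for each question.

    #!/usr/bin/env python
    # --- mapper sketch (placeholder) ---
    # Read lines from standard input and emit "key<TAB>1" for each item.
    import sys

    for line in sys.stdin:
        for word in line.strip().split():
            # For Q1 and Q2 you will emit a different key derived from the word,
            # not the word itself.
            print('%s\t%s' % (word, 1))

The matching reducer pattern sums the counts for each key:

    #!/usr/bin/env python
    # --- reducer sketch (placeholder) ---
    # Hadoop Streaming delivers the mapper output sorted by key, so all lines
    # with the same key arrive consecutively; sum their counts.
    import sys

    current_key = None
    current_count = 0

    for line in sys.stdin:
        key, count = line.strip().split('\t', 1)
        count = int(count)
        if key == current_key:
            current_count += count
        else:
            if current_key is not None:
                print('%s\t%d' % (current_key, current_count))
            current_key = key
            current_count = count

    # emit the final key
    if current_key is not None:
        print('%s\t%d' % (current_key, current_count))

Note that the reducer relies on the framework sorting the mapper output by key as a string; that same string-based sort is the cause of the ordering issue described in Question 1 below.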
Q1. Write a MapReduce program to determine the frequency of word lengths within an input file.
The program should return how many times each word length appears within the dictionary. For example, given the following list
{‘apple’, ‘banana’, ‘orange’, ‘pear’}
the lengths of the words are
{5, 6, 6, 4}
and so the output ‘part-00000.txt’ file would look something like
4 1
5 1
6 2
indicating that there is one word with four letters, one word with five letters and two words with six letters.
If we follow Practical Activity 5, the word-length keys (e.g. 2, 4, 10, 16, 23) are treated as strings, so they are sorted lexicographically and will print in the order (10, 16, 2, 23, 4). See if you can get the frequencies to print in the correct numerical order. (Hint: it can be achieved in one line in the Linux terminal after the file is output.)
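To see why the string-based ordering comes out that way, here is a tiny stand-alone illustration (ordinary Python, separate from your MapReduce job):

    # Word lengths sorted as strings vs. as integers.
    lengths = [2, 4, 10, 16, 23]
    print(sorted(str(n) for n in lengths))  # ['10', '16', '2', '23', '4']  (lexicographic)
    print(sorted(lengths))                  # [2, 4, 10, 16, 23]            (numeric)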
Q2. Write a MapReduce program to determine the frequency of individual characters within the provided text file.
The program should return how many times each character appears. You do not need to distinguish between letters, numbers, and symbols; every character within the text file should be counted.
For instance, the list
{‘11’, ‘cat’, ‘1hat’}
should produce the output
1 3
a 2
c 1
h 1
t 2
Note: this output should not need sorting. Because every key is a single character, any digits will appear as single characters 0-9, so the lexicographic ordering problem from Question 1 does not arise.
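If you want to sanity-check the expected counts for this small example, ordinary Python (not the MapReduce program you are asked to write) gives the same figures:

    from collections import Counter

    words = ['11', 'cat', '1hat']
    counts = Counter(''.join(words))        # count every character across the list
    for ch in sorted(counts):
        print('%s %d' % (ch, counts[ch]))   # prints: 1 3, a 2, c 1, h 1, t 2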
Marking Scheme
- MapReduce program code (each question): 2 marks
- Program output (each question): 1 mark
- Code presentation (each question): 1 mark
- Report: 2 marks
The marker will execute your MapReduce programs and compare the outputs.
You should submit
- your report as a document (i.e. .doc/.docx) file
- three files in total for each question:
  - mapperq*.py
  - reducerq*.py
  - part-00000_q*
where * is replaced by the question number. The output of each MapReduce program will be ‘part-00000.txt’, so just rename each output file using the above convention once it is produced.