- Store document files in Hadoop Distributed File System (HDFS).
- Write two MapReduce programs in Python to solve two typical applications in text analysis.
- Write a report for a professional data analyst audience and submit a document file containing all answers and screenshots via the submission point.
Assessment Details
Text analytics is the process of deriving high-quality information from documents. Text analysis parses the content of a document and creates structured data from its free-text content. Typical applications in text analytics include determining the frequency of word lengths and the frequency of individual characters within a text document.
In Practical 2, you are required to write two MapReduce programs and run them on the data stored in HDFS.
Preparation
- First, create a directory for this assessment called practical2 within the /home/cloudera directory.
- Our input file (or data) for this assessment will be the dictionary of English words found on the Linux operating system on your Cloudera VM, located at /usr/share/dict/words. Copy the file into your practical2 directory: right-click on the file -> "Open with gedit" -> File -> "Save As..." and save a copy of the file in your directory, then rename it to practical_input.txt. (A programmatic alternative to these gedit steps is sketched after this list, if you prefer.)
- Next, create a /practical2 directory within the Hadoop Distributed File System (HDFS). Review Practical Activity 5 to learn how to create a directory within HDFS.
- Lastly, upload the input file (i.e. practical_input.txt) to the /practical2 HDFS directory. Review Practical Activity 5 to learn how to upload data to an HDFS directory.
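If you prefer to do the local copy-and-rename step without gedit, the following short Python sketch is equivalent; the paths are the ones named above, and this is only an optional alternative, not a requirement.

    # Programmatic alternative to the gedit "Save As..." steps above:
    # copy the system dictionary into the practical2 directory under a new name.
    import shutil

    shutil.copy('/usr/share/dict/words',
                '/home/cloudera/practical2/practical_input.txt')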
Now you are ready to start writing your MapReduce programs to answer the questions below.
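As a reminder, a Hadoop Streaming job usually follows the shape below. This is only a sketch of the general pattern, assuming the word-count style covered in Practical Activity 5; the key emitted by the mapper is a placeholder that you will need to change for each question.

    #!/usr/bin/env python
    # --- mapper sketch (placeholder) ---
    # Read lines from standard input and emit "key<TAB>1" for each item.
    import sys

    for line in sys.stdin:
        for word in line.strip().split():
            # For Q1 and Q2 you will emit a different key derived from the word,
            # not the word itself.
            print('%s\t%s' % (word, 1))

The matching reducer pattern sums the counts for each key:

    #!/usr/bin/env python
    # --- reducer sketch (placeholder) ---
    # Hadoop Streaming delivers the mapper output sorted by key, so all lines
    # with the same key arrive consecutively; sum their counts.
    import sys

    current_key = None
    current_count = 0

    for line in sys.stdin:
        key, count = line.strip().split('\t', 1)
        count = int(count)
        if key == current_key:
            current_count += count
        else:
            if current_key is not None:
                print('%s\t%d' % (current_key, current_count))
            current_key = key
            current_count = count

    # emit the final key
    if current_key is not None:
        print('%s\t%d' % (current_key, current_count))

Note that the reducer relies on the framework sorting the mapper output by key as a string; that same string-based sort is the cause of the ordering issue described in Question 1 below.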
Q1. Write a MapReduce program to determine the frequency of word lengths within an input file.
The program should return how many times each word length appears within the dictionary. For example, given the following list
{‘apple’, ‘banana’, ‘orange’, ‘pear’}
the lengths of the words are
{5, 6, 6, 4}
and so the output ‘part-00000.txt’ file would look something like
4 1
5 1
6 2
indicating that there is one word with four letters, one word with five letters and two words with six letters.
If we follow Practical Activity 5, the word-length keys (e.g. 2, 4, 10, 16, 23) are treated as strings, so they are sorted lexicographically and will print in the order (10, 16, 2, 23, 4). See if you can get the frequencies to print in the correct numerical order. (Hint: it can be achieved in one line in the Linux terminal after the file is output.)
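To see why the string-based ordering comes out that way, here is a tiny stand-alone illustration (ordinary Python, separate from your MapReduce job):

    # Word lengths sorted as strings vs. as integers.
    lengths = [2, 4, 10, 16, 23]
    print(sorted(str(n) for n in lengths))  # ['10', '16', '2', '23', '4']  (lexicographic)
    print(sorted(lengths))                  # [2, 4, 10, 16, 23]            (numeric)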
Q2. Write a MapReduce program to determine the frequency of individual characters within the provided text file.
The program should return how many times each character appears. You do not need to distinguish between letters, numbers, and symbols; every character within the text file should be counted.
For instance, the list
{‘11’, ‘cat’, ‘1hat’}
should produce the output
1 3
a 2
c 1
h 1
t 2
Note: this output should not need sorting. Because every key is a single character, any digits will appear as single characters 0-9, so the lexicographic ordering problem from Question 1 does not arise.
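If you want to sanity-check the expected counts for this small example, ordinary Python (not the MapReduce program you are asked to write) gives the same figures:

    from collections import Counter

    words = ['11', 'cat', '1hat']
    counts = Counter(''.join(words))        # count every character across the list
    for ch in sorted(counts):
        print('%s %d' % (ch, counts[ch]))   # prints: 1 3, a 2, c 1, h 1, t 2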
Marking Scheme
- MapReduce program code (each question): 2 marks
- Program output (each question): 1 mark
- Code presentation (each question): 1 mark
- Report: 2 marks
The marker will execute your MapReduce programs and compare the outputs.
You should submit
- your report as a document (i.e. .doc/.docx) file
- three files in total for each question:
  - mapperq*.py
  - reducerq*.py
  - part-00000_q*
where * is replaced by the question number. The output of each MapReduce program will be ‘part-00000.txt’, so just rename each output file using the above convention once it is produced.