
  • Store document files in Hadoop Distributed File System (HDFS).

  • Write two MapReduce programs in Python to solve two typical applications in text analysis.

  • Write a report for professional data analyst and submit a document file containing all answers and screenshots via the submission point.






Assessment Details


Text analytics is the process of deriving high-quality information from documents. Text analysis parses the contents of a document and creates structured data out of free text contents of the document. Typical applications in text analytics are determining the frequency of word lengths and the frequency of individual characters within a text document.



In Practical 2, you are required to write two MapReduce programs and run them on HDFS.






Preparation




  • First, create a directory for this assessment called practical2 within the /home/cloudera directory.

  • Our input file (or data) for this assessment will be a dictionary of English words found on the Linux operating system on your Cloudera VM. The file is located at /usr/share/dict/words. Copy the file into your directory: right-click on the file -> "Open with gedit" -> File -> "Save As..." to save a copy in your directory, then rename the copy to practical_input.txt.

  • Next, create a /practical2 directory within the Hadoop Distributed File System (HDFS). Review Practical Activity 5 to learn how to create a directory within HDFS.

  • Lastly, upload the input file (i.e. practical_input.txt) to the /practical2 HDFS directory. Review Practical Activity 5 to learn how to upload data to an HDFS directory.
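Assuming the standard Cloudera VM layout described above, the preparation steps can be sketched as terminal commands (paths are taken from the brief; verify the exact syntax against Practical Activity 5 before relying on it):

```shell
# Local preparation: create the assessment directory and copy the dictionary
mkdir -p /home/cloudera/practical2
cp /usr/share/dict/words /home/cloudera/practical2/practical_input.txt

# HDFS preparation: create /practical2 and upload the input file
hdfs dfs -mkdir /practical2
hdfs dfs -put /home/cloudera/practical2/practical_input.txt /practical2/
hdfs dfs -ls /practical2        # verify the upload succeeded
```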



Now you are ready to start writing your MapReduce programs to answer the questions below.
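Once a mapper and reducer are written, a Hadoop Streaming job along the following lines runs them against the HDFS input. The streaming jar path varies between Hadoop distributions and the output directory name is illustrative, so treat this as a sketch rather than the exact invocation:

```shell
hadoop jar /usr/lib/hadoop-mapreduce/hadoop-streaming.jar \
    -files mapperq1.py,reducerq1.py \
    -mapper mapperq1.py \
    -reducer reducerq1.py \
    -input /practical2/practical_input.txt \
    -output /practical2/output_q1
```

The part-00000 result file then appears inside the chosen output directory and can be fetched with `hdfs dfs -get`.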






Q1. Write a MapReduce program to determine the frequency of word lengths within an input file.


The program should return how many times each word length appears within the dictionary. For example, in the following list


{‘apple’ , ’banana’ , ‘orange’ , ‘pear’}


The lengths of these words are


{5 , 6 , 6 , 4}


And so the output ‘part-00000.txt’ file would look something like


4 1


5 1


6 2


indicating that there is one word with four letters, one word with five letters and two words with six letters.
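One way the word-length counting described above could be approached is sketched below. This is a hypothetical combined illustration of what mapperq1.py and reducerq1.py might contain, not the required solution; in a real Hadoop Streaming job the mapper and reducer run as separate scripts over stdin/stdout, with the framework sorting the mapper's output between them.

```python
#!/usr/bin/env python3
# Hypothetical sketch: mapper and reducer for word-length frequencies.
import sys
from itertools import groupby

def mapper(lines):
    """Emit (word_length, 1) for every word on every input line."""
    for line in lines:
        for word in line.split():
            yield len(word), 1

def reducer(pairs):
    """Sum the counts per length key (Streaming delivers keys grouped)."""
    for key, group in groupby(sorted(pairs), key=lambda kv: kv[0]):
        yield key, sum(count for _, count in group)

if __name__ == "__main__":
    # Local pipe-through for testing: mapper output fed into the reducer.
    for length, freq in reducer(mapper(sys.stdin)):
        print("%d\t%d" % (length, freq))
```

Run locally with `cat practical_input.txt | python3 thisfile.py` to preview the output before submitting the Streaming job.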


If we follow Practical Activity 5, the frequency values (e.g. 2, 4, 10, 16, 23) are actually being saved as strings, so they will print in the order (10, 16, 2, 23, 4). See if you can get frequencies to print in correct numerical order. (Hint: It can be achieved in one line in the Linux terminal after the file is output.)
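The hinted one-liner can be `sort -n`, which compares the leading field numerically instead of lexicographically. A small demonstration with hypothetical frequency values (the file name stands in for your Q1 output):

```shell
# Sample output with string-ordered keys: 10, 16, 2, 23, 4
printf '10\t3\n16\t1\n2\t5\n23\t2\n4\t7\n' > part-00000.txt
# One line in the terminal: re-sort the file numerically in place
sort -n part-00000.txt -o part-00000.txt
cat part-00000.txt   # keys now appear as 2, 4, 10, 16, 23
```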






Q2. Write a MapReduce program to determine the frequency of individual characters within the provided text file.


The program should return how many times each character appears. You do not need to distinguish between letters, numbers and symbols; every character within the text file should be counted.


For instance


{‘11’ , ’cat’ , ’1hat’}


Should output


1 3


a 2


c 1


h 1


t 2


Note: this output should not need sorting. Each key is a single character (a letter, a digit 0-9, or a symbol), so the default string ordering is already correct, circumventing the problem in Question 1.
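The character-frequency task described above could be sketched along the same lines as Question 1. Again this is a hypothetical combined illustration of mapperq2.py and reducerq2.py, not the required solution; in a Streaming job the two halves run separately:

```python
#!/usr/bin/env python3
# Hypothetical sketch: mapper and reducer for character frequencies.
import sys
from itertools import groupby

def mapper(lines):
    """Emit (character, 1) for every non-whitespace character."""
    for line in lines:
        for ch in line:
            if not ch.isspace():
                yield ch.lower(), 1

def reducer(pairs):
    """Sum the counts for each character key."""
    for key, group in groupby(sorted(pairs), key=lambda kv: kv[0]):
        yield key, sum(n for _, n in group)

if __name__ == "__main__":
    for ch, freq in reducer(mapper(sys.stdin)):
        print("%s\t%d" % (ch, freq))
```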






Marking Schema


All questions MapReduce program code: 2 marks each


All questions output: 1 mark each


All questions code presentation: 1 mark each


Report: 2 Marks



The marker will execute your MapReduce programs and compare the outputs.


You should submit




  • your report in document (i.e. .doc/.docx) file

  • three files in total for each question:


    • mapperq*.py

    • reducerq*.py

    • part-00000_q*




where * is replaced by the question number. The output of each MapReduce program will be ‘part-00000.txt’, so just rename the outputs using the above convention once they’re done.



Shashi Kant answered on Aug 03 2021
Hadoop/part-00000_q2.txt:
    cat hat rat
Hadoop/part-00000_q1.txt:
    apple banana orange pear
Hadoop/mapperq2.py
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
import re
import sys
from collections import defaultdict

def read_input(file):
    # Strip all whitespace so only printable characters are counted.
    pattern = re.compile(r'\s+')
    for line in file:
        yield re.sub(pattern, '', line)

def main(separator='\t'):
    data = read_input(sys.stdin)
    for line in data:
        letter_dict = defaultdict(int)
        for letter in line:
            # Q2 asks for every character (letters, digits and symbols),
            # so no isalpha() filter; letters are lower-cased before counting.
            letter_dict[letter.lower()] += 1
        for char, count in letter_dict.items():
            print('%s%s%d' % (char, separator, count))

if __name__ == "__main__":
    main()
Hadoop/Report.odt
Professional Data Analyst
To run MapReduce programs, the tools we need are:
1. Java
2. Hadoop
Both must be installed on the machine.
Example:
Here we can see that Hadoop and Java are installed.

The versions of the tools are:
Hadoop -> 3.2.2
Java -> 1.8.0_292
Now we will work on the programs.
The questions are:
1. Write a MapReduce program to determine the frequency of word lengths within an input file.
2. Write a MapReduce program to determine the frequency of individual characters within the provided text file.
                    Solution
First, we need to create an input file for each program; this holds the data that the MapReduce operations will run on.
File names :
            1.part-00000_q1.txt
            2.part-00000_q2.txt
Next, we create the mapper and reducer files, which contain the processing logic.
File names:
            1.    mapperq1.py, reducerq1.py
            2. mapperq2.py, reducerq2.py
We now have the data files plus the mapper and reducer files containing the logic.
Next, we need to create a directory on HDFS.
    Directory name:
            practical2
To create a directory on HDFS, we run the following commands:
hdfs dfs...