Writing Python: regular program, mapper, and reducer
Part A - MapReduce with text (6 marks)

In Part A your task is to answer a question about the data in a text file, first by writing a regular Python program, and then by writing a mapper and reducer. (By doing both you can see more readily how using a mapper and reducer differs from writing a regular program.)

When you click the panel on the right you'll get a connection to a server that has, in your home directory, the text.txt file that you have already seen, containing some sample text (feel free to open the file and explore its contents).

Your task is to find the length of the longest word in the file. (You can assume that sentences are separated by full stops and that the words within a sentence are separated by spaces.) Given the example text, the output is like:

The longest word has 13 characters.

First (2 marks)

Write a regular Python program to do this. A file called "regular.py" has been created for you - you just need to fill in the details. You can test your program by running the following command (it tells Python to execute regular.py, using text.txt as input):

$ python regular.py < text.txt

Second (4 marks)

Write mapper and reducer programs to do this. Files called "mapper.py" and "reducer.py" have been created for you - you just need to fill in the details. You can test your mapper and reducer by running the following command (it tells Python to execute mapper.py, using text.txt as input, and then pipe the result as input to reducer.py):

$ python mapper.py < text.txt | python reducer.py

To write your programs you should be able to modify programs that you have already seen in this week's content. As a starting point, minimal sketches of the three programs follow.
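Here is one minimal sketch of regular.py. It assumes, as stated above, that words are separated by spaces and that full stops are the only punctuation; it is an illustration, not the only acceptable answer.

    #!/usr/bin/env python3
    # regular.py - read all of standard input, split it into words,
    # and report the length of the longest word.
    import sys

    longest = 0
    for line in sys.stdin:
        # Full stops end sentences, so treat them as word boundaries too.
        for word in line.replace(".", " ").split():
            longest = max(longest, len(word))

    print("The longest word has", longest, "characters.")

For the mapper and reducer, one common pattern is to have the mapper emit each word's length under a single shared key, and the reducer keep the maximum value it reads. (The key name "longest" below is an arbitrary choice made for this sketch.)

    #!/usr/bin/env python3
    # mapper.py - emit one tab-separated key/value pair per word,
    # with the word's length as the value.
    import sys

    for line in sys.stdin:
        for word in line.replace(".", " ").split():
            # A single shared key routes every length to one reducer.
            print("longest\t" + str(len(word)))

    #!/usr/bin/env python3
    # reducer.py - keep the maximum length seen on standard input.
    import sys

    longest = 0
    for line in sys.stdin:
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        key, value = line.split("\t")
        longest = max(longest, int(value))

    # Only print if this reducer received data; when Hadoop runs three
    # reducers, the two that receive no pairs should stay silent.
    if longest > 0:
        print("The longest word has", longest, "characters.")

Because every pair shares one key, only one of the three reducers in the Hadoop run below receives any data, so two of the part-* files will be empty.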
To test on Hadoop, we need to copy the file text.txt into HDFS. First, create your default working directory in HDFS:

$ hdfs dfs -mkdir -p /user/user

Next, let's create a directory in which to keep the input files of our MapReduce job. Call it "input":

$ hdfs dfs -mkdir /user/user/input

Upload the file "text.txt" into /user/user/input:

$ hdfs dfs -put text.txt /user/user/input

Run the MapReduce job with 3 reducers:

$ hadoop jar $HADOOP_HOME/share/hadoop/tools/lib/hadoop-streaming-*.jar -file ~/mapper.py -mapper ~/mapper.py -file ~/reducer.py -reducer ~/reducer.py -numReduceTasks 3 -input input -output output

You can view the results:

$ hdfs dfs -cat output/part-*

To delete the output folder:

$ hdfs dfs -rm -r output

When you are happy that your programs are correct, click "Submit".

Part B - MapReduce with CSV (6 marks)

In Part B your task is to answer a question about the data in a CSV file, first by writing a regular Python program, and then by writing a mapper and reducer. (By doing both you can see more readily how using a mapper and reducer differs from writing a regular program.)

When you click the panel on the right you'll get a connection to a server that has, in your home directory, the employees.csv file that you have already seen, containing data about employees (feel free to open the file and explore its contents). Here, again, are the fields in the file:

employee_id (integer)
first_name (string)
last_name (string)
email (string)
phone_number (string)
hire_date (date)
salary (integer)

Your task is to find the average salary of the employees, to the nearest dollar. You can use the function round(avg_salary). Given the example input file, the output is like:

The average salary is $ 6462.

First (2 marks)

Write a regular Python program to do this. A file called "regular.py" has been created for you - you just need to fill in the details. You can test your program by running the following command (it tells Python to execute regular.py, using employees.csv as input):

$ python regular.py < employees.csv

Second (4 marks)

Write mapper and reducer programs to do this. Files called "mapper.py" and "reducer.py" have been created for you - you just need to fill in the details. You can test your mapper and reducer by running the following command (it tells Python to execute mapper.py, using employees.csv as input, and then pipe the result as input to reducer.py):

$ python mapper.py < employees.csv | python reducer.py

To write your programs you should be able to modify programs that you have already seen in this week's content. Again, minimal sketches follow.
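Here is one minimal sketch of regular.py for this part. It assumes the salary is the last of the seven comma-separated fields listed above; the try/except skips a header row, if the file has one, and any malformed line.

    #!/usr/bin/env python3
    # regular.py - read the CSV from standard input and print the
    # average of the salary column, rounded to the nearest dollar.
    import sys

    total = 0
    count = 0
    for line in sys.stdin:
        fields = line.strip().split(",")
        try:
            total += int(fields[6])  # salary is the seventh field
            count += 1
        except (IndexError, ValueError):
            continue  # skip a header row or malformed line

    print("The average salary is $", str(round(total / count)) + ".")

For the mapper and reducer, the mapper can emit each salary under a single shared key (the key name "salary" is again an arbitrary choice for this sketch), and the reducer can accumulate a sum and a count before printing the rounded average.

    #!/usr/bin/env python3
    # mapper.py - emit one tab-separated key/value pair per employee,
    # with the salary as the value.
    import sys

    for line in sys.stdin:
        fields = line.strip().split(",")
        try:
            print("salary\t" + str(int(fields[6])))
        except (IndexError, ValueError):
            continue  # skip a header row or malformed line

    #!/usr/bin/env python3
    # reducer.py - sum the salaries and count them, then print the
    # rounded average.
    import sys

    total = 0
    count = 0
    for line in sys.stdin:
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        key, value = line.split("\t")
        total += int(value)
        count += 1

    # Only print if this reducer received data; with three reducers and
    # a single key, two of them see no input at all.
    if count > 0:
        print("The average salary is $", str(round(total / count)) + ".")

Sending every salary to one key keeps the whole calculation in a single reducer; per-reducer averages could not simply be averaged again without weighting each by its count.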
To test on Hadoop, we need to copy the file employees.csv into HDFS:

$ hdfs dfs -mkdir -p /user/user
$ hdfs dfs -mkdir /user/user/input
$ hdfs dfs -put employees.csv /user/user/input

Run the MapReduce job with 3 reducers:

$ hadoop jar $HADOOP_HOME/share/hadoop/tools/lib/hadoop-streaming-*.jar -file ~/mapper.py -mapper ~/mapper.py -file ~/reducer.py -reducer ~/reducer.py -numReduceTasks 3 -input input -output output

You can view the results:

$ hdfs dfs -cat output/part-*

To delete the output folder:

$ hdfs dfs -rm -r output

When you are happy that your programs are correct, click "Submit".