

Objectives



1. Gain in-depth experience working with big data tools (Hive, Spark RDDs, and Spark SQL).


2. Solve challenging big data processing tasks by finding highly efficient solutions; very inefficient solutions will lose marks.


3. Experience processing four different types of real data:


a. Standard multi-attribute data (video game sales data)


b. Time series data (Twitter feed)


c. Bag of words data


d. A news aggregation corpus



4. Practice using programming APIs to find the best API calls to solve your problem. Here are the API descriptions for Hive, Spark, and Spark SQL (for Spark, look especially under RDD; there are a lot of really useful API calls):



a) [Hive] https://cwiki.apache.org/confluence/display/Hive/LanguageManual

b) [Spark] http://spark.apache.org/docs/latest/api/scala/index.html#package

c) [Spark SQL] https://spark.apache.org/docs/latest/sql-programming-guide.html
   https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.Dataset
   https://spark.apache.org/docs/latest/api/sql/index.html


· Hint: If you are not sure what a Spark API call does, write a small example and try it in the spark-shell.
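
For example, to check what reduceByKey does, you could paste something like the following into the spark-shell (the tiny data set here is made up purely for the experiment; sc is the SparkContext that the shell provides):

val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))
pairs.reduceByKey(_ + _).collect()   // e.g. Array((a,4), (b,2))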




This assignment is due 10:00 a.m. on Thursday 28th of May, 2020.


Penalties are applied to late assignments (accepted up to 5 business days after the due date only). Five percent is deducted per business day late. A mark of zero will be assigned to assignments submitted more than 5 days late.




Submission checklist




  • Ensure that all of your solutions read their input from the full data files (not the small example versions)

  • Check that all of your solutions run without crashing on the CloudxLab interface.

  • Delete all output files

  • Zip the whole assignment folder and submit via LMS



Copying, Plagiarism



This is an individual assignment. You are not permitted to work as a part of a group when writing this assignment.



Plagiarism is the submission of somebody else’s work in a manner that gives the impression that the work is your own. For individual assignments, plagiarism includes the case where two or more students work collaboratively on the assignment. The Department of Computer Science and Computer Engineering treats plagiarism very seriously. When it is detected, penalties are strictly imposed.



Expected quality of solutions



a) In general, writing more efficient code (less reading/writing from/into HDFS and less data shuffles) will be rewarded with more marks.


b) This entire assignment can be done using the CloudxLab servers and the supplied data sets without running out of memory. It is time to show your skills!


c) We are not too fussed about the layout of the output; as long as it looks similar to the example outputs for each task, that will be good enough. The idea is not to spend too much time massaging the output into the right format, but to spend the time solving the problems.


d) For Hive queries, we prefer answers that use fewer tables.



The questions in the assignment will be labelled using the following:


· [Hive]


o Means this question needs to be done using Hive


· [Spark RDD]


o Means this question needs to be done using Spark RDDs; you are not allowed to use any Spark SQL features such as DataFrames or Datasets.


· [Spark SQL]


o Means this question needs to be done using Spark SQL, and you are therefore not allowed to use RDDs. You need to do these questions using the Spark DataFrame or Dataset API; do not use SQL syntax. That is, you must not use the spark.sql() function; instead, you must use the individual functions (e.g. .select() and .join()).


o For example, write code like this:
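
A minimal sketch of this style is shown below; gamesDF and the column names are illustrative placeholders, not names from the assignment templates:

import org.apache.spark.sql.functions._

// DataFrame/Dataset API only: chain the individual functions.
// gamesDF is a placeholder dataframe; in the spark-shell the $"col" syntax is available.
val topGames = gamesDF
  .filter($"Platform" === "X360")
  .select($"Name", $"NA_Sales")
  .orderBy(desc("NA_Sales"))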




o Don’t write code like this:
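
That is, avoid raw SQL through spark.sql(); an illustrative counter-example (again with placeholder names):

// Not allowed for [Spark SQL] questions: raw SQL via spark.sql().
gamesDF.createOrReplaceTempView("games")
val topGames = spark.sql(
  "SELECT Name, NA_Sales FROM games WHERE Platform = 'X360' ORDER BY NA_Sales DESC")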





Do the entire assignment using CloudxLab.


Assignment structure:





· For each Hive question a skeleton .hql file is provided for you to write your solution in. You can run these just like you did in labs:


$ hive -f Task_XX.hql




· For each Spark question, a skeleton project is provided for you. Write your solution in the .scala file. Run your Spark code using:


$ spark-shell -i Task_XX.scala




Tips:



1. Look at the data files before you begin each task. Try to understand what you are dealing with! The data for the Spark questions is already on HDFS, so you don’t need to write the data-loading code.


2. For each subtask we provide small example input and the corresponding output in the assignment specifications below. These small versions of the files are also supplied with the assignment (they have “-small” in the name). It’s a good idea to get your solution working on the small inputs first before moving on to the full files.


3. In addition to testing the correctness of your code using the very small example input, you should also use the large input files that we provide to test the scalability of your solutions. If your solution takes longer than 10 minutes to run on the large input, it definitely has scalability problems.


4. It can take some time to run Spark applications. So for the Spark questions it’s best to experiment using the spark-shell interpreter first to figure out a working solution, and then put your code into the .scala files afterwards.


Task 1: Analysing Video Game Sales [29 marks total]



We will be doing some analytics on data that contains a list of video games with sales greater than 100k copies[1]. The data is stored in a semicolon (“;”) delimited format.




The data is supplied with the assignment in the zip file and on HDFS at the below locations:




Local filesystem:
    Small version: t1/Input_data/vgsales-small.csv
    Full version:  t1/Input_data/vgsales.csv

HDFS:
    Small version: hdfs:///user/ashhall1616/bdc_data/assignment/t1/vgsales-small.csv
    Full version:  hdfs:///user/ashhall1616/bdc_data/assignment/t1/vgsales.csv





The data has the following attributes:

Attribute index | Attribute name | Description
0 | Name | The game name
1 | Platform | Platform of the game's release (categorical: "Wii", "NES", "GB", "DS", "X360", "PS3", "PS2", "SNES", "GBA", "3DS", "PS4", "N64", "PS", "XB", "PC", "2600", "PSP", "XOne", "GC", "WiiU", "GEN", "DC", "PSV", "SAT", "SCD", "WS", "NG", "TG16", "3DO", "GG", "PCFX")
2 | Year | Year of the game's release
3 | Genre | Genre of the game (categorical: "Sports", "Platform", "Racing", "Role-Playing", "Puzzle", "Misc", "Shooter", "Simulation", "Action", "Fighting", "Adventure", "Strategy")
4 | Publisher | Publisher of the game (categorical: "Nintendo", "Microsoft Game Studios", "Take-Two Interactive", "Sony Computer Entertainment", "Activision", "Ubisoft", "Bethesda Softworks", "Electronic Arts", "Sega", "SquareSoft", "Atari", "505 Games", …, and 567 other unique records)
5 | NA_Sales | Sales in North America (in millions)
6 | EU_Sales | Sales in Europe (in millions)
7 | JP_Sales | Sales in Japan (in millions)
8 | Other_Sales | Sales in the rest of the world (in millions)







Here is a small example of the data that we will use to illustrate the subtasks below:

Name | Platform | Year | Genre | Publisher | NA_Sales | EU_Sales | JP_Sales | Other_Sales
Gran Turismo 3: A-Spec | PS2 | 2001 | Racing | Sony Computer Entertainment | 6.85 | 5.09 | 1.87 | 1.16
Call of Duty: Modern Warfare 3 | X360 | 2011 | Shooter | Activision | 9.03 | 4.28 | 0.13 | 1.32
Pokemon Yellow: Special Pikachu Edition | GB | 1998 | Role-Playing | Nintendo | 5.89 | 5.04 | 3.12 | 0.59
Call of Duty: Black Ops | X360 | 2010 | Shooter | Activision | 9.67 | 3.73 | 0.11 | 1.13
Pokemon HeartGold/Pokemon SoulSilver | DS | 2009 | Action | Nintendo | 4.4 | 2.77 | 3.96 | 0.77
High Heat Major League Baseball 2003 | PS2 | 2002 | Sports | 3DO | 0.18 | 0.14 | 0 | 0.05
Panzer Dragoon | SAT | 1995 | Shooter | Sega | 0 | 0 | 0.37 | 0
Corvette | GBA | 2003 | Racing | TDK Mediactive | 0.2 | 0.07 | 0 | 0.01
Resident Evil: Revelations | PS3 | 2013 | Action | Capcom | 0.14 | 0.32 | 0.22 | 0.12
Mission: Impossible - Operation Surma | PS2 | 2003 | Platform | Atari | 0.14 | 0.11 | 0 | 0.04
Suikoden Tierkreis | DS | 2008 | Role-Playing | Konami Digital Entertainment | 0.09 | 0.01 | 0.15 | 0.01
Dragon Ball GT: Final Bout | PS | 1997 | Fighting | Namco Bandai Games | 0.02 | 0.02 | 0.22 | 0.02
Harry Potter and the Sorcerer's Stone | PS2 | 2003 | Action | Electronic Arts | 0.14 | 0.11 | 0 | 0.04
Tropico 4 | PC | 2011 | Strategy | Kalypso Media | 0.1 | 0.13 | 0 | 0.04
Call of Juarez: The Cartel | X360 | 2011 | Shooter | Ubisoft | 0.14 | 0.11 | 0 | 0.03
Prince of Persia: The Two Thrones | GC | 2005 | Action | Ubisoft | 0.11 | 0.03 | 0 | 0
NHL 07 | PSP | 2006 | Sports | Electronic Arts | 0.13 | 0.02 | 0 | 0.02
Disney's Winnie the Pooh's Rumbly Tumbly Adventure | GBA | 2005 | Platform | Ubisoft | 0.09 | 0.03 | 0 | 0
Batman & Robin | PS | 1998 | Action | Acclaim Entertainment | 0.06 | 0.04 | 0 | 0.01
Spider-Man: Battle for New York | DS | 2006 | Platform | Activision | 0.12 | 0 | 0 | 0.01




Please note: we specify whether you should use [Hive] or [Spark RDD] at the beginning of each subtask. There is a skeleton file provided for each of the tasks.



a) [Hive] Report the top 3 games for the 'X360' console in North America based on their sales in descending order. Write the results to “Task_1a-out”. For the above small example data set you would report the following (Name, Genre, NA_Sales):



Call of Duty: Black Ops Shooter 9.67


Call of Duty: Modern Warfare 3 Shooter 9.03


Call of Juarez: The Cartel Shooter 0.14



[6 marks]



b) [Hive] Report the number of games for each pair of platform and genre, in descending order of count. Write the results to “Task_1b-out”. Hint: you can group data by more than one column. For the small example data set you would report the following (Platform, Genre, count):



X360 Shooter 3


PS Action 1


SAT Shooter 1


PSP Sports 1


PS3 Action 1


PS2 Sports 1


PS2 Racing 1


PS2 Platform 1


PS2 Action 1


PS Fighting 1


PC Strategy 1


GC Action 1


GBA Racing 1


GBA Platform 1


GB Role-Playing 1


DS Role-Playing 1


DS Platform 1


DS Action 1


[6 marks]



c) [Spark RDD] Find the highest and lowest selling genre based on global sales, where Global Sales = NA_Sales + EU_Sales + JP_Sales. Print the result to the terminal output using println. For the small example data set you should get the following results:



Highest selling Genre: Shooter Global Sale (in millions): 27.57


Lowest selling Genre: Strategy Global Sale (in millions): 0.23



[7 marks]



d) [Spark RDD] Report the top 50 publishers by market share (percentage of games made by the publisher among all games), along with the total number of games they made. Output should be in descending order of market share. Print the result to the terminal output using println. Hint: the Spark RDD method take might be useful. For the small example data set you would report the following (publisher name, number of games made by the publisher, percentage of games made by the publisher among all games):



(Ubisoft,3,15.0)


(Activision,3,15.0)


(Electronic Arts,2,10.0)


(Nintendo,2,10.0)


(Acclaim Entertainment,1,5.0)


(Sega,1,5.0)


(3DO,1,5.0)


(Namco Bandai Games,1,5.0)


(TDK Mediactive,1,5.0)


(Sony Computer Entertainment,1,5.0)


(Konami Digital Entertainment,1,5.0)


(Kalypso Media,1,5.0)


(Capcom,1,5.0)


(Atari,1,5.0)




[10 marks]




Task 2: Analysing Twitter Time Series Data [24 marks]



In this task we will be doing some analytics on real Twitter data[2]. The data is stored in a tab (“\t”) delimited format.




The data is supplied with the assignment in the zip file and on HDFS at the below locations:




Local filesystem:
    Small version: t2/Input_data/twitter-small.tsv
    Full version:  t2/Input_data/twitter.tsv

HDFS:
    Small version: hdfs:///user/ashhall1616/bdc_data/assignment/t2/twitter-small.tsv
    Full version:  hdfs:///user/ashhall1616/bdc_data/assignment/t2/twitter.tsv






The data has the following attributes:

Attribute index | Attribute name | Description
0 | tokenType | In our data set all rows have a token type of hashtag, so this attribute is not useful for this assignment.
1 | month | The year and month in YYYYMM format: 4 digits for the year followed by 2 digits for the month, e.g. 200905 means May 2009.
2 | count | An integer representing the number of tweets of this hashtag for the given year and month.
3 | hashtagName | The #tag name, e.g. babylove, mydate, etc.








Here is a small example of the Twitter data that we will use to illustrate the subtasks below:

Token type | Month | count | Hash Tag Name
hashtag | 200910 | 2 | Babylove
hashtag | 200911 | 2 | babylove
hashtag | 200912 | 90 | babylove
hashtag | 200812 | 100 | mycoolwife
hashtag | 200901 | 201 | mycoolwife
hashtag | 200910 | 1 | mycoolwife
hashtag | 200912 | 500 | mycoolwife
hashtag | 200905 | 23 | abc
hashtag | 200907 | 1000 | abc







a) [Hive] Find the top 5 months with the highest number of cumulative tweets, sorted by the number of tweets in each month. Write the results to “Task_2a-out”. The month with the highest number of cumulative tweets should come first. So, for the above small example dataset the result would be:




month numtweets


200907 1000


200912 590


200901 201


200812 100


200905 23




[6 marks]




b) [Spark RDD] Find the year with the maximum number of cumulative tweets in the entire dataset across all months. Report the total number of tweets for that year. Print the result to the terminal output using println. For the above small example dataset the output would be:



2009 1819



[6 marks]



c) [Spark RDD] Given two months x and y, where y > x, find the hashtag name that has increased the number of tweets the most from month x to month y. We have already written code in your code template that reads the x and y values from the keyboard. Ignore the tweets in the months between x and y, so just compare the number of tweets at month x and at month y. Report the hashtag name, the number of tweets in months x and y. Ignore any hashtag names that had no tweets in either month x or y. You can assume that the combination of hashtag and month is unique. Print the result to the terminal output using println. For the above small example data set the output should be the following:





Input: x = 200910, y = 200912




Output: hashtagName: mycoolwife, countX: 1, countY: 500



[12 marks]





Task 3: Indexing Bag of Words data [25 marks]



In this task you are asked to create a partitioned index of words to documents that contain the words. Using this index you can search for all the documents that contain a particular word efficiently.




The data is supplied with the assignment in the zip file and on HDFS at the below locations:




Local filesystem:
    Small version: t3/Input_data/docword-small.txt
                   t3/Input_data/vocab-small.txt
    Full version:  t3/Input_data/docword.txt
                   t3/Input_data/vocab.txt

HDFS:
    Small version: hdfs:///user/ashhall1616/bdc_data/assignment/t3/docword-small.txt
                   hdfs:///user/ashhall1616/bdc_data/assignment/t3/vocab-small.txt
    Full version:  hdfs:///user/ashhall1616/bdc_data/assignment/t3/docword.txt
                   hdfs:///user/ashhall1616/bdc_data/assignment/t3/vocab.txt





The first file is called docword.txt, which contains the contents of all the documents stored in the following format:

Attribute index | Attribute name | Description
0 | docId | The ID of the document that contains the word
1 | vocabId | Instead of storing the word itself, we store an ID from the vocabulary file
2 | count | An integer representing the number of times this word occurred in this document




The second file, called vocab.txt, contains each word in the vocabulary, indexed by the vocabId used in the docword.txt file.







Here is a small example of the content of the docword.txt file:

docId | vocabId | Count
3 | 3 | 600
2 | 3 | 702
1 | 2 | 120
2 | 5 | 200
2 | 2 | 500
3 | 1 | 100
3 | 5 | 2000
3 | 4 | 122
1 | 3 | 1200
1 | 1 | 1000





Here is an example of the vocab.txt file:

vocabId | Word
1 | Plane
2 | Car
3 | Motorbike
4 | Truck
5 | Boat




Complete the following subtasks using Spark:


a) [Spark SQL] Calculate the total count of each word across all documents. List the words ordered by count in descending order. Write the results in CSV format at “file:///home/USERNAME/Task_3a-out”, replacing USERNAME with your CloudxLab username. Use show() to print the first 10 rows of the dataframe that you saved. So, for the above small example input the output would be the following:


+---------+----------+


| word|sum(count)|


+---------+----------+


|motorbike| 2502|


| boat| 2200|


| plane| 1100|


| car| 620|


| truck| 122|


+---------+----------+



Note: Spark SQL will write the output as multiple part files. You should ensure that the data is sorted globally across all of the files (parts): that is, all rows in part 0 should come before the rows in part 1 according to the sort order.


[8 marks]



b) [spark SQL]



1. Create a dataframe containing rows with four fields: (word, docId, count, firstLetter). You should add the firstLetter column by using a UDF which extracts the first letter of word as a String. Write the results in parquet format, partitioned by firstLetter, at “file:///home/USERNAME/t3_docword_index_part.parquet”, replacing USERNAME with your CloudxLab username. Use show() to print the first 10 rows of the partitioned dataframe that you saved (a sketch of the UDF and partitioned-write mechanics follows the example output below). So, for the above example input, you should see the following output (the exact ordering is not important):


+---------+-----+-----+-----------+


| word|docId|count|firstLetter|


+---------+-----+-----+-----------+


| plane| 1| 1000| p|


| plane| 3| 100| p|


| car| 2| 500| c|


| car| 1| 120| c|


|motorbike| 1| 1200| m|


|motorbike| 2| 702| m|


|motorbike| 3| 600| m|


| truck| 3| 122| t|


| boat| 3| 2000| b|


| boat| 2| 200| b|


+---------+-----+-----+-----------+
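
As a reminder of the mechanics involved, a minimal sketch of defining a UDF and writing a dataframe partitioned by a column follows; this is not the full solution, and docwordDF is a placeholder for the dataframe you build from the two input files:

import org.apache.spark.sql.functions.udf

// Sketch only: docwordDF is a placeholder dataframe that already has a "word" column.
val firstLetter = udf((word: String) => word.substring(0, 1))
val indexed = docwordDF.withColumn("firstLetter", firstLetter($"word"))
indexed.write.partitionBy("firstLetter").parquet("file:///home/USERNAME/t3_docword_index_part.parquet")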


[7 marks]



c) [Spark SQL] Load the dataframe stored in partitioned parquet format from subtask b). The task template includes code to obtain a list of query words from the user. For each word in the queryWords list, use println to display the word and the docId with the largest count for that word (you can break ties arbitrarily). Skip any query words that aren’t found in the dataset. To iterate over the query words, use a normal Scala for loop, like this:



for (queryWord <- queryWords) {
  // ...put Spark query code here...
}
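
Loading the partitioned parquet file back can be done with a single call, for example (replace USERNAME as before):

val index = spark.read.parquet("file:///home/USERNAME/t3_docword_index_part.parquet")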


For this subtask there is a big optimisation that can be made because of how the data is partitioned. So, think carefully about how you filter the data.



If queryWords contains “car”, “dog”, and “truck”, then the output for the example dataset would be:



[car,2]


[truck,3]




[5 marks]


d) [Spark SQL] Again load the dataframe stored in partitioned parquet format from subtask b). The task template includes code to obtain a list of document IDs from the user. For each document ID in the docIds list, use println to display the document ID and the word with the most occurrences in that document, along with the number of occurrences of that word in the document (you can break ties arbitrarily). Skip any document IDs that aren’t found in the dataset. Note that in this task we are searching via document ID instead of query word, so we cannot take advantage of the same optimisation you used for part c); you need to find some other kind of optimisation. Hint: think about the fact that you are repeatedly reusing the same dataframe in the Scala for loop as you process each docId.



If docIds contains “2” and “3”, then the output for the example dataset would be:



[2, motorbike, 702]
[3, boat, 2000]



[5 marks]






Task 4: Finding associations between news publishers [22 marks]



In this task, you will be asked to analyse a dataset of news articles[3] by finding associations between news publishers.





The data is supplied with the assignment in the zip file and on HDFS at the below locations:




Local filesystem:
    Small version: t4/Input_data/news-small.tsv
    Full version:  t4/Input_data/news.tsv

HDFS:
    Small version: hdfs:///user/ashhall1616/bdc_data/assignment/t4/news-small.csv
    Full version:  hdfs:///user/ashhall1616/bdc_data/assignment/t4/news.csv







Note: Due to the size of the full dataset, you may need to run tasks using the command below (it allocates 1 GB of RAM to the Spark driver rather than the default 512 MB):




spark-shell -i Task_XX.scala --driver-memory 1g



The data has the following attributes:

Attribute index | Attribute name | Description
0 | articleId | Numeric identifier for the article
1 | title | Title of the article
2 | url | URL of the article
3 | publisher | The publisher of the article
4 | category | News category (b = business, t = science and technology, e = entertainment, m = health)
5 | storyId | Alphanumeric id of the “story”. All articles are first grouped via a clustering algorithm, and each cluster is assigned a story id. The story id identifies the topic of the story, not the specific article; therefore, there are multiple articles (rows of data) per story.
6 | hostname | URL hostname
7 | timestamp | Approximate time the news was published, in unix time








Here is a small example of the article data that we will use to illustrate the subtasks below (we only list a subset of the attributes in this example, see the above table for the description of the attributes):

articleId | publisher | Category | storyId | hostname
1 | Los Angeles Times | B | ddUyU0VZz0BRneMioxUPQVP6sIxvM | www.latimes.com
2 | Livemint | B | ddUyU0VZz0BRneMioxUPQVP6sIxvM | www.livemint.com
3 | IFA Magazine | B | ddUyU0VZz0BRneMioxUPQVP6sIxvM | www.ifamagazine.com
4 | IFA Magazine | B | ddUyU0VZz0BRneMioxUPQVP6sIxvM | www.ifamagazine.com
5 | Moneynews | B | ddUyU0VZz0BRneMioxUPQVP6sIxvM | www.moneynews.com
6 | NASDAQ | B | ddUyU0VZz0BRneMioxUPQVP6sIxvM | www.nasdaq.com
16 | NASDAQ | B | dPhGU51DcrolUIMxbRm0InaHGA2XM | www.nasdaq.com
19 | IFA Magazine | B | dPhGU51DcrolUIMxbRm0InaHGA2XM | www.ifamagazine.com




Complete the following subtasks using Spark:



a) [Spark SQL]


1. Create a list of each story paired with each publisher that wrote at least one article for that story. Store the resulting dataframe in Parquet format at “file:///home/USERNAME/t4_story_publishers.parquet”, replacing USERNAME with your CloudxLab username. For the small example input file the expected output is:



[ddUyU0VZz0BRneMioxUPQVP6sIxvM, Livemint]


[ddUyU0VZz0BRneMioxUPQVP6sIxvM, IFA Magazine]


[ddUyU0VZz0BRneMioxUPQVP6sIxvM, Moneynews]


[ddUyU0VZz0BRneMioxUPQVP6sIxvM, NASDAQ]


[dPhGU51DcrolUIMxbRm0InaHGA2XM, IFA Magazine]


[ddUyU0VZz0BRneMioxUPQVP6sIxvM, Los Angeles Times]


[dPhGU51DcrolUIMxbRm0InaHGA2XM, NASDAQ]



2. From this dataframe, find the top 10 most popular stories (measured by number of articles) and the number of publishers that wrote an article about them; report the results using println. A story is considered popular if at least 5 publishers published at least one article on it. For the small example input file the expected output is:



[ddUyU0VZz0BRneMioxUPQVP6sIxvM,5]


[7 marks]



b) [Spark SQL] Load up the Parquet file which you created in the previous subtask. Find all pairs of publishers that publish articles about the same stories. For each publisher pair, report the number of co-published stories, where a co-published story is a story published by both publishers. Report the pairs in decreasing order of frequency. The solution may take a few minutes to run. Note the solution must conform to the following rules:


1. There should not be any replicated entries like:


NASDAQ, NASDAQ, 1000


2. You should not have the same pair occurring twice in opposite order. Only one of the following should occur:


NASDAQ, Reuters, 1000
Reuters, NASDAQ, 1000


(i.e. it is incorrect to have both of the above two lines in your result)




Save the results in CSV format at “file:///home/USERNAME/t4_paired_publishers.csv”, again replacing USERNAME with your CloudxLab username. For the example above, the output should be as follows (it is OK if the csv is split over multiple files, but the combined data contents of the multiple files should be the same, with the order preserved):



[NASDAQ,IFA Magazine,2]


[Moneynews,Livemint,1]


[Moneynews,IFA Magazine,1]


[NASDAQ,Livemint,1]


[NASDAQ,Los Angeles Times,1]


[Moneynews,Los Angeles Times,1]


[Los Angeles Times,IFA Magazine,1]


[Livemint,IFA Magazine,1]


[NASDAQ,Moneynews,1]


[Los Angeles Times,Livemint,1]


[15 marks]






Bonus Marks:






[1] Video games sales data source: https://www.kaggle.com/gregorut/videogamesales

[2] Twitter data source: http://www.infochimps.com/datasets/twitter-census-conversation-metrics-one-year-of-urls-hashtags-sm

[3] News data source: https://archive.ics.uci.edu/ml/datasets/News+Aggregator
