Objectives
1. Gain in-depth, hands-on experience with big data tools (Hive, Spark RDDs, and Spark SQL).
2. Solve challenging big data processing tasks by finding highly efficient solutions; very inefficient solutions will lose marks.
3. Experience processing four different types of real data
a. Standard multi-attribute data (video game sales data)
b. Time series data (Twitter feed)
c. Bag of words data
d. A News aggregation corpus
4. Practice using programming APIs to find the best API calls to solve your problem. The API documentation for Hive and Spark is linked below (for Spark, look especially under RDD; there are a lot of really useful API calls).
a) [Hive] https://cwiki.apache.org/confluence/display/Hive/LanguageManual
b) [Spark]
http://spark.apache.org/docs/latest/api/scala/index.html#package
c) [Spark SQL]
https://spark.apache.org/docs/latest/sql-programming-guide.html
https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.Dataset
https://spark.apache.org/docs/latest/api/sql/index.html
Hint: if you are not sure what a Spark API call does, write a small example and try it in the spark-shell.
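For instance, a throwaway toy example (hypothetical, not part of the assignment data) pasted into spark-shell shows what reduceByKey does:

val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))   // tiny key/value RDD
pairs.reduceByKey(_ + _).collect()                              // Array((a,4), (b,2)), order may vary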
This assignment is due at 10:00 a.m. on Thursday 28 May 2020.
Penalties are applied to late assignments (accepted up to 5 business days after the due date only). Five percent is deducted per business day late. A mark of zero will be assigned to assignments submitted more than 5 days late.
Submission checklist
- Ensure that all of your solutions read their input from the full data files (not the small example versions)
- Check that all of your solutions run without crashing on the CloudxLab interface.
- Delete all output files
- Zip the whole assignment folder and submit via LMS
Copying, Plagiarism
This is an individual assignment. You are not permitted to work as part of a group when writing this assignment.
Plagiarism is the submission of somebody else’s work in a manner that gives the impression that the work is your own. For individual assignments, plagiarism includes the case where two or more students work collaboratively on the assignment. The Department of Computer Science and Computer Engineering treats plagiarism very seriously. When it is detected, penalties are strictly imposed.
Expected quality of solutions
a) In general, writing more efficient code (less reading/writing from/into HDFS and fewer data shuffles) will be rewarded with more marks.
b) This entire assignment can be done using the CloudxLab servers and the supplied data sets without running out of memory. It is time to show your skills!
c) We are not too fussed about the layout of the output: as long as it looks similar to the example output for each task, that will be good enough. The idea is not to spend your time massaging the output into exactly the right format, but to spend it solving the problems.
d) For Hive queries, we prefer answers that use fewer tables.
The questions in the assignment will be labelled using the following:
· [Hive]
o Means this question needs to be done using Hive
· [Spark RDD]
o Means this question needs to be done using Spark RDDs; you are not allowed to use any Spark SQL features such as dataframes or datasets.
· [Spark SQL]
o Means this question needs to be done using Spark SQL, so you are not allowed to use RDDs. You need to do these questions using the Spark dataframe or dataset API; do not use SQL syntax. That is, you must not use the spark.sql() function; instead you must use the individual functions (e.g. .select() and .join()).
o For example, write code in the style of the first snippet below, not the second.
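As an illustration of that distinction (a hedged sketch only; gamesDf is a hypothetical dataframe with Name, Platform, and NA_Sales columns, and spark-shell's default imports are assumed for the $ column syntax):

// Write code like this: call the DataFrame/Dataset API directly.
val result = gamesDf.filter($"Platform" === "X360")
                    .select($"Name", $"NA_Sales")

// Don't write code like this: passing SQL text through spark.sql().
gamesDf.createOrReplaceTempView("games")
val result2 = spark.sql("SELECT Name, NA_Sales FROM games WHERE Platform = 'X360'")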
Do the entire assignment using CloudxLab.
Assignment structure:
· For each Hive question a skeleton .hql file is provided for you to write your solution in. You can run these just like you did in labs:
$ hive -f Task_XX.hql
· For each Spark question, a skeleton project is provided for you. Write your solution in the .scala file. Run your Spark code using:
$ spark-shell -i Task_XX.scala
Tips:
1. Look at the data files before you begin each task. Try to understand what you are dealing with! The data for the Spark questions is already on HDFS, so you don't need to write the data-loading yourself.
2. For each subtask we provide small example input and the corresponding output in the assignment specifications below. These small versions of the files are also supplied with the assignment (they have “-small” in the name). It’s a good idea to get your solution working on the small inputs first before moving on to the full files.
3. In addition to testing the correctness of your code using the very small example input, you should also use the large input files that we provide to test the scalability of your solutions. If your solution takes longer than 10 minutes to run on the large input, it definitely has scalability problems.
4. It can take some time to run Spark applications, so for the Spark questions it's best to experiment using the spark-shell interpreter first to figure out a working solution, and then put your code into the .scala files afterwards.
Task 1: Analysing Video Game Sales [29 marks total]
We will be doing some analytics on data that contains a list of video games with sales greater than 100k copies[1]. The data is stored in a semicolon (“;”) delimited format.
The data is supplied with the assignment in the zip file and on HDFS at the below locations:
Local filesystem:
Small version | t1/Input_data/vgsales-small.csv
Full version | t1/Input_data/vgsales.csv
HDFS:
Small version | hdfs:///user/ashhall1616/bdc_data/assignment/t1/vgsales-small.csv
Full version | hdfs:///user/ashhall1616/bdc_data/assignment/t1/vgsales.csv
The data has the following attributes:

Attribute index | Attribute name | Description
0 | Name | The game name
1 | Platform | Platform of the game's release (categorical: "Wii", "NES", "GB", "DS", "X360", "PS3", "PS2", "SNES", "GBA", "3DS", "PS4", "N64", "PS", "XB", "PC", "2600", "PSP", "XOne", "GC", "WiiU", "GEN", "DC", "PSV", "SAT", "SCD", "WS", "NG", "TG16", "3DO", "GG", "PCFX")
2 | Year | Year of the game's release
3 | Genre | Genre of the game (categorical: "Sports", "Platform", "Racing", "Role-Playing", "Puzzle", "Misc", "Shooter", "Simulation", "Action", "Fighting", "Adventure", "Strategy")
4 | Publisher | Publisher of the game (categorical: "Nintendo", "Microsoft Game Studios", "Take-Two Interactive", "Sony Computer Entertainment", "Activision", "Ubisoft", "Bethesda Softworks", "Electronic Arts", "Sega", "SquareSoft", "Atari", "505 Games", …, and 567 other unique records)
5 | NA_Sales | Sales in North America (in millions)
6 | EU_Sales | Sales in Europe (in millions)
7 | JP_Sales | Sales in Japan (in millions)
8 | Other_Sales | Sales in the rest of the world (in millions)
Here is a small example of the data that we will use to illustrate the subtasks below:
Name | Platform | Year | Genre | Publisher | NA_Sales | EU_Sales | JP_Sales | Other_Sales
Gran Turismo 3: A-Spec | PS2 | 2001 | Racing | Sony Computer Entertainment | 6.85 | 5.09 | 1.87 | 1.16
Call of Duty: Modern Warfare 3 | X360 | 2011 | Shooter | Activision | 9.03 | 4.28 | 0.13 | 1.32
Pokemon Yellow: Special Pikachu Edition | GB | 1998 | Role-Playing | Nintendo | 5.89 | 5.04 | 3.12 | 0.59
Call of Duty: Black Ops | X360 | 2010 | Shooter | Activision | 9.67 | 3.73 | 0.11 | 1.13
Pokemon HeartGold/Pokemon SoulSilver | DS | 2009 | Action | Nintendo | 4.4 | 2.77 | 3.96 | 0.77
High Heat Major League Baseball 2003 | PS2 | 2002 | Sports | 3DO | 0.18 | 0.14 | 0 | 0.05
Panzer Dragoon | SAT | 1995 | Shooter | Sega | 0 | 0 | 0.37 | 0
Corvette | GBA | 2003 | Racing | TDK Mediactive | 0.2 | 0.07 | 0 | 0.01
Resident Evil: Revelations | PS3 | 2013 | Action | Capcom | 0.14 | 0.32 | 0.22 | 0.12
Mission: Impossible - Operation Surma | PS2 | 2003 | Platform | Atari | 0.14 | 0.11 | 0 | 0.04
Suikoden Tierkreis | DS | 2008 | Role-Playing | Konami Digital Entertainment | 0.09 | 0.01 | 0.15 | 0.01
Dragon Ball GT: Final Bout | PS | 1997 | Fighting | Namco Bandai Games | 0.02 | 0.02 | 0.22 | 0.02
Harry Potter and the Sorcerer's Stone | PS2 | 2003 | Action | Electronic Arts | 0.14 | 0.11 | 0 | 0.04
Tropico 4 | PC | 2011 | Strategy | Kalypso Media | 0.1 | 0.13 | 0 | 0.04
Call of Juarez: The Cartel | X360 | 2011 | Shooter | Ubisoft | 0.14 | 0.11 | 0 | 0.03
Prince of Persia: The Two Thrones | GC | 2005 | Action | Ubisoft | 0.11 | 0.03 | 0 | 0
NHL 07 | PSP | 2006 | Sports | Electronic Arts | 0.13 | 0.02 | 0 | 0.02
Disney's Winnie the Pooh's Rumbly Tumbly Adventure | GBA | 2005 | Platform | Ubisoft | 0.09 | 0.03 | 0 | 0
Batman & Robin | PS | 1998 | Action | Acclaim Entertainment | 0.06 | 0.04 | 0 | 0.01
Spider-Man: Battle for New York | DS | 2006 | Platform | Activision | 0.12 | 0 | 0 | 0.01
Please note we specify whether you should use [Hive] or [Spark RDD] for each subtask at the beginning of each subtask. There is a skeleton file provided for each of the tasks.
a) [Hive] Report the top 3 games for the 'X360' console in North America based on their sales in descending order. Write the results to “Task_1a-out”. For the above small example data set you would report the following (Name, Genre, NA_Sales):
Call of Duty: Black Ops Shooter 9.67
Call of Duty: Modern Warfare 3 Shooter 9.03
Call of Juarez: The Cartel Shooter 0.14
[6 marks]
b) [Hive] Report the number of games for each pair of platform and genre, in descending order of count. Write the results to “Task_1b-out”. Hint: you can group data by more than one column (see the HiveQL sketch after the example output). For the small example data set you would report the following (Platform, Genre, count):
X360 Shooter 3
PS Action 1
SAT Shooter 1
PSP Sports 1
PS3 Action 1
PS2 Sports 1
PS2 Racing 1
PS2 Platform 1
PS2 Action 1
PS Fighting 1
PC Strategy 1
GC Action 1
GBA Racing 1
GBA Platform 1
GB Role-Playing 1
DS Role-Playing 1
DS Platform 1
DS Action 1
[6 marks]
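The multi-column grouping mentioned in the hint looks roughly like this in HiveQL (a sketch only; it assumes the skeleton .hql file has already declared a table, here called vgsales, over the ';'-delimited input, and your skeleton's output handling may differ):

-- Hypothetical sketch: count games per (platform, genre) pair, most frequent first.
INSERT OVERWRITE DIRECTORY 'Task_1b-out'
SELECT platform, genre, COUNT(*) AS cnt
FROM vgsales
GROUP BY platform, genre
ORDER BY cnt DESC;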
c) [Spark RDD] Find the highest and lowest selling genres based on global sales, where Global Sales = NA_Sales + EU_Sales + JP_Sales. Print the result to the terminal output using println (a sketch of this kind of per-genre aggregation follows the example output). For the small example data set you should get the following results:
Highest selling Genre: Shooter Global Sale (in millions): 27.57
Lowest selling Genre: Strategy Global Sale (in millions): 0.23
[7 marks]
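A minimal Spark RDD sketch of this kind of per-genre aggregation, assuming the ';'-delimited field layout described above and no header row (the path and variable names are placeholders, not a definitive solution):

// Hypothetical sketch: sum NA + EU + JP sales per genre, then take the extremes.
val lines = sc.textFile("hdfs:///user/ashhall1616/bdc_data/assignment/t1/vgsales-small.csv")
val genreSales = lines.map(_.split(";"))
  .map(f => (f(3), f(5).toDouble + f(6).toDouble + f(7).toDouble))   // (genre, global sales)
  .reduceByKey(_ + _)
val highest = genreSales.sortBy(_._2, ascending = false).first()
val lowest  = genreSales.sortBy(_._2).first()
println(s"Highest selling Genre: ${highest._1} Global Sale (in millions): ${highest._2}")
println(s"Lowest selling Genre: ${lowest._1} Global Sale (in millions): ${lowest._2}")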
d) [Spark RDD] Report the top 50 publishers by market share (percentage of games made by the publisher among all games), along with the total number of games they made, in descending order of market share. Print the result to the terminal output using println. Hint: the Spark RDD method take might be useful (see the sketch after the example output). For the small example data set you would report the following (publisher name, number of games made by the publisher, percentage of games made by the publisher among all games):
(Ubisoft,3,15.0)
(Activision,3,15.0)
(Electronic Arts,2,10.0)
(Nintendo,2,10.0)
(Acclaim Entertainment,1,5.0)
(Sega,1,5.0)
(3DO,1,5.0)
(Namco Bandai Games,1,5.0)
(TDK Mediactive,1,5.0)
(Sony Computer Entertainment,1,5.0)
(Konami Digital Entertainment,1,5.0)
(Kalypso Media,1,5.0)
(Capcom,1,5.0)
(Atari,1,5.0)
[10 marks]
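A sketch of where take fits in, under the same assumptions as the previous sketch (publisher is field index 4 per the attribute table; all names are placeholders):

// Hypothetical sketch: market share per publisher, keep the top 50 with take.
val rows  = sc.textFile("hdfs:///user/ashhall1616/bdc_data/assignment/t1/vgsales-small.csv")
              .map(_.split(";"))
val total = rows.count().toDouble
rows.map(f => (f(4), 1))
  .reduceByKey(_ + _)
  .map { case (pub, n) => (pub, n, n * 100.0 / total) }
  .sortBy(_._3, ascending = false)
  .take(50)                          // top 50 publishers by market share
  .foreach(println)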
Task 2: Analysing Twitter Time Series Data [24 marks]
In this task we will be doing some analytics on real Twitter data[2]. The data is stored in a tab (“\t”) delimited format.
The data is supplied with the assignment in the zip file and on HDFS at the below locations:
Local filesystem:
Small version | t2/Input_data/twitter-small.tsv
Full version | t2/Input_data/twitter.tsv
HDFS:
Small version | hdfs:///user/ashhall1616/bdc_data/assignment/t2/twitter-small.tsv
Full version | hdfs:///user/ashhall1616/bdc_data/assignment/t2/twitter.tsv
The data has the following attributes:

Attribute index | Attribute name | Description
0 | tokenType | In our data set all rows have a token type of hashtag, so this attribute is not useful for this assignment.
1 | month | The year and month in YYYYMM format (4 digits for the year followed by 2 digits for the month); for example, 200905 means May 2009.
2 | count | An integer representing the number of tweets with this hashtag for the given year and month.
3 | hashtagName | The #tag name, e.g. babylove, mydate, etc.
Here is a small example of the Twitter data that we will use to illustrate the subtasks below:
Token type | Month | count | Hash Tag Name
hashtag | 200910 | 2 | Babylove
hashtag | 200911 | 2 | babylove
hashtag | 200912 | 90 | babylove
hashtag | 200812 | 100 | mycoolwife
hashtag | 200901 | 201 | mycoolwife
hashtag | 200910 | 1 | mycoolwife
hashtag | 200912 | 500 | mycoolwife
hashtag | 200905 | 23 | abc
hashtag | 200907 | 1000 | abc
a) [Hive] Find the top 5 months with the highest number of cumulative tweets, sorted by the number of tweets in each month, with the month that has the highest number of cumulative tweets first. Write the results to “Task_2a-out”. So, for the above small example dataset the result would be:
month numtweets
200907 1000
200912 590
200901 201
200812 100
200905 23
[6 marks]
b) [Spark RDD] Find the year with the maximum number of cumulative tweets in the entire dataset across all months. Report the total number of tweets for that year. Print the result to the terminal output using println. For the above small example dataset the output would be:
2009 1819
[6 marks]
c) [Spark RDD] Given two months x and y, where y > x, find the hashtag name whose number of tweets increased the most from month x to month y. We have already written code in your code template that reads the x and y values from the keyboard. Ignore the tweets in the months between x and y; just compare the number of tweets in month x and in month y. Report the hashtag name and its number of tweets in month x and in month y. Ignore any hashtag that has no tweets in month x or no tweets in month y. You can assume that the combination of hashtag and month is unique. Print the result to the terminal output using println (a sketch of one possible join-based approach follows the example output). For the above small example data set the output should be the following:
Input
x = 200910, y = 200912
Output
hashtagName: mycoolwife, countX: 1, countY: 500
[12 marks]
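One possible shape for the month-to-month comparison, assuming x and y hold the month strings read by the template (all other names are placeholders and this is only a sketch):

// Hypothetical sketch: key each month's counts by hashtag, join, pick the biggest increase.
val rows = sc.textFile("hdfs:///user/ashhall1616/bdc_data/assignment/t2/twitter-small.tsv")
             .map(_.split("\t"))
val atX = rows.filter(f => f(1) == x).map(f => (f(3), f(2).toInt))
val atY = rows.filter(f => f(1) == y).map(f => (f(3), f(2).toInt))
val best = atX.join(atY)                         // drops hashtags missing in either month
  .map { case (tag, (cx, cy)) => (tag, cx, cy) }
  .sortBy(t => t._3 - t._2, ascending = false)
  .first()
println(s"hashtagName: ${best._1}, countX: ${best._2}, countY: ${best._3}")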
Task 3: Indexing Bag of Words data [25 marks]
In this task you are asked to create a partitioned index of words to documents that contain the words. Using this index you can search for all the documents that contain a particular word efficiently.
The data is supplied with the assignment in the zip file and on HDFS at the below locations:
Local filesystem:
Small version | t3/Input_data/docword-small.txt | t3/Input_data/vocab-small.txt
Full version | t3/Input_data/docword.txt | t3/Input_data/vocab.txt
HDFS:
Small version | hdfs:///user/ashhall1616/bdc_data/assignment/t3/docword-small.txt | hdfs:///user/ashhall1616/bdc_data/assignment/t3/vocab-small.txt
Full version | hdfs:///user/ashhall1616/bdc_data/assignment/t3/docword.txt | hdfs:///user/ashhall1616/bdc_data/assignment/t3/vocab.txt
The first file is called docword.txt, which contains the contents of all the documents stored in the following format:
Attribute index | Attribute name | Description
0 | docId | The ID of the document that contains the word
1 | vocabId | Instead of storing the word itself, we store an ID from the vocabulary file
2 | count | An integer representing the number of times this word occurred in this document
The second file, called vocab.txt, contains each word in the vocabulary, indexed by the vocabId used in the docword.txt file.
Here is a small example of the contents of the docword.txt file:
docId | vocabId | Count
3 | 3 | 600
2 | 3 | 702
1 | 2 | 120
2 | 5 | 200
2 | 2 | 500
3 | 1 | 100
3 | 5 | 2000
3 | 4 | 122
1 | 3 | 1200
1 | 1 | 1000
Here is an example of the vocab.txt file:
vocabId | Word
1 | Plane
2 | Car
3 | Motorbike
4 | Truck
5 | Boat
Complete the following subtasks using Spark:
a) [Spark SQL] Calculate the total count of each word across all documents. List the words ordered by count in descending order. Write the results in CSV format at “file:///home/USERNAME/Task_3a-out”, replacing USERNAME with your CloudxLab username. Use show() to print the first 10 rows of the dataframe that you saved (a sketch of the general query shape follows the note below). So, for the above small example input the output would be the following:
+---------+----------+
| word|sum(count)|
+---------+----------+
|motorbike| 2502|
| boat| 2200|
| plane| 1100|
| car| 620|
| truck| 122|
+---------+----------+
Note: Spark SQL will write the output as multiple part files. You should ensure that the data is sorted globally across all the files (parts), so that all rows in part 0 come before all rows in part 1 according to the sort order.
[8 marks]
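A sketch of the general query shape, assuming docwordDf and vocabDf are dataframes already loaded with the column names shown in the tables above (those names, and the loading step itself, are assumptions):

// Hypothetical sketch: join IDs to words, total the counts, sort globally, write CSV.
import org.apache.spark.sql.functions.{sum, desc}
val totals = docwordDf.join(vocabDf, "vocabId")
  .groupBy("word")
  .agg(sum("count"))
  .orderBy(desc("sum(count)"))       // a global orderBy keeps the part files globally sorted
totals.write.format("csv").save("file:///home/USERNAME/Task_3a-out")
totals.show(10)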
b) [Spark SQL]
1. Create a dataframe containing rows with four fields: (word, docId, count, firstLetter). You should add the firstLetter column by using a UDF which extracts the first letter of word as a String. Write the results in parquet format, partitioned by firstLetter, at “file:///home/USERNAME/t3_docword_index_part.parquet”, replacing USERNAME with your CloudxLab username. Use show() to print the first 10 rows of the partitioned dataframe that you saved (a sketch of the UDF and partitioned write follows the example output).
So, for the above example input, you should see the following output (the exact ordering is not important):
+---------+-----+-----+-----------+
| word|docId|count|firstLetter|
+---------+-----+-----+-----------+
| plane| 1| 1000| p|
| plane| 3| 100| p|
| car| 2| 500| c|
| car| 1| 120| c|
|motorbike| 1| 1200| m|
|motorbike| 2| 702| m|
|motorbike| 3| 600| m|
| truck| 3| 122| t|
| boat| 3| 2000| b|
| boat| 2| 200| b|
+---------+-----+-----+-----------+
[7 marks]
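A sketch of the UDF and partitioned write, under the same docwordDf/vocabDf assumptions as in subtask a) (spark-shell's default imports are assumed for the $ column syntax):

// Hypothetical sketch: derive firstLetter with a UDF, then write parquet partitioned by it.
import org.apache.spark.sql.functions.udf
val firstLetterUdf = udf((w: String) => w.substring(0, 1))
val index = docwordDf.join(vocabDf, "vocabId")
  .select($"word", $"docId", $"count")
  .withColumn("firstLetter", firstLetterUdf($"word"))
index.write.partitionBy("firstLetter")
  .parquet("file:///home/USERNAME/t3_docword_index_part.parquet")
index.show(10)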
c) [Spark SQL] Load the dataframe stored in partitioned parquet format from subtask b). The task template includes code to obtain a list of query words from the user. For each word in the queryWords list, use println to display the word and the docId with the largest count for that word (you can break ties arbitrarily). Skip any query words that aren't found in the dataset. To iterate over the query words, use a normal Scala for loop, like this:
for (queryWord <- queryWords) {
  // ...put Spark query code here...
}
For this subtask there is a big optimisation that can be made because of how the data is partitioned, so think carefully about how you filter the data (see the sketch after the example output).
If queryWords contains “car”, “dog”, and “truck”, then the output for the example dataset would be:
[car,2]
[truck,3]
[5 marks]
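The big optimisation referred to above is partition pruning: because the parquet output of subtask b) is partitioned by firstLetter, filtering on firstLetter first lets Spark read only one partition directory per query word. A hedged sketch (queryWords comes from the template; the other names are placeholders):

// Hypothetical sketch: filter on the partition column so only one directory is scanned.
import org.apache.spark.sql.functions.desc
val index = spark.read.parquet("file:///home/USERNAME/t3_docword_index_part.parquet")
for (queryWord <- queryWords) {
  val top = index.filter($"firstLetter" === queryWord.substring(0, 1) && $"word" === queryWord)
                 .orderBy(desc("count"))
                 .take(1)
  top.foreach(r => println(s"[$queryWord,${r.getAs[Any]("docId")}]"))
}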
d) [Spark SQL] Again load the dataframe stored in partitioned parquet format from subtask b). The task template includes code to obtain a list of document IDs from the user. For each document ID in the docIds list, use println to display the document ID and the word with the most occurrences in that document, along with the number of occurrences of that word in the document (you can break ties arbitrarily). Skip any document IDs that aren't found in the dataset. Note that in this task we are searching by document ID instead of query word, so we cannot take advantage of the same optimization you used for part c); you need to find some other kind of optimization. Hint: think about the fact that you are repeatedly reusing the same dataframe in the Scala for loop as you process each docId (see the sketch after the example output).
If docIds contains “2” and “3”, then the output for the example dataset would be:
[2, motorbike, 702]
[3, boat, 2000]
[5 marks]
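The other kind of optimisation hinted at here is simply to read and cache the dataframe once, outside the loop, so it is not re-read from parquet for every document ID. A hedged sketch (docIds comes from the template; other names are placeholders):

// Hypothetical sketch: cache once, then reuse the cached dataframe for every docId.
import org.apache.spark.sql.functions.desc
val index = spark.read.parquet("file:///home/USERNAME/t3_docword_index_part.parquet").cache()
for (docId <- docIds) {
  val top = index.filter($"docId" === docId)
                 .orderBy(desc("count"))
                 .take(1)
  top.foreach(r => println(s"[$docId, ${r.getAs[Any]("word")}, ${r.getAs[Any]("count")}]"))
}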
Task 4: Finding associations between news publishers [22 marks]
In this task, you will be asked to analyse a dataset of news articles[3] by finding associations between news publishers.
The data is supplied with the assignment in the zip file and on HDFS at the below locations:
Local filesystem:
Small version | t4/Input_data/news-small.tsv
Full version | t4/Input_data/news.tsv
HDFS:
Small version | hdfs:///user/ashhall1616/bdc_data/assignment/t4/news-small.csv
Full version | hdfs:///user/ashhall1616/bdc_data/assignment/t4/news.csv
Note: due to the size of the full dataset, you may need to run tasks using the command below (it allocates 1 GB of RAM to the Spark driver rather than the default 512 MB):
$ spark-shell -i Task_XX.scala --driver-memory 1g
The data has the following attributes:

Attribute index | Attribute name | Description
0 | articleId | Numeric identifier for the article
1 | title | Title of the article
2 | url | URL of the article
3 | publisher | The publisher of the article
4 | category | News category (b = business, t = science and technology, e = entertainment, m = health)
5 | storyId | Alphanumeric ID of the “story”. All articles are first grouped via a clustering algorithm; each cluster is then assigned a story ID. The story ID identifies the topic of the story, not the specific article, so there are multiple articles (rows of data) per story.
6 | hostname | URL hostname
7 | timestamp | Approximate time the news was published, in Unix time
Here is a small example of the article data that we will use to illustrate the subtasks below (we only list a subset of the attributes in this example, see the above table for the description of the attributes):
articleId | publisher | Category | storyId | hostname
1 | Los Angeles Times | B | ddUyU0VZz0BRneMioxUPQVP6sIxvM | www.latimes.com
2 | Livemint | B | ddUyU0VZz0BRneMioxUPQVP6sIxvM | www.livemint.com
3 | IFA Magazine | B | ddUyU0VZz0BRneMioxUPQVP6sIxvM | www.ifamagazine.com
4 | IFA Magazine | B | ddUyU0VZz0BRneMioxUPQVP6sIxvM | www.ifamagazine.com
5 | Moneynews | B | ddUyU0VZz0BRneMioxUPQVP6sIxvM | www.moneynews.com
6 | NASDAQ | B | ddUyU0VZz0BRneMioxUPQVP6sIxvM | www.nasdaq.com
16 | NASDAQ | B | dPhGU51DcrolUIMxbRm0InaHGA2XM | www.nasdaq.com
19 | IFA Magazine | B | dPhGU51DcrolUIMxbRm0InaHGA2XM | www.ifamagazine.com
Complete the following subtasks using Spark:
a) [Spark SQL]
1. Create a list of each story paired with each publisher that wrote at least one article for that story. Store the resulting dataframe in Parquet format at “file:///home/USERNAME/t4_story_publishers.parquet”, replacing USERNAME with your CloudxLab username. For the small example input file the expected output is:
[ddUyU0VZz0BRneMioxUPQVP6sIxvM, Livemint]
[ddUyU0VZz0BRneMioxUPQVP6sIxvM, IFA Magazine]
[ddUyU0VZz0BRneMioxUPQVP6sIxvM, Moneynews]
[ddUyU0VZz0BRneMioxUPQVP6sIxvM, NASDAQ]
[dPhGU51DcrolUIMxbRm0InaHGA2XM, IFA Magazine]
[ddUyU0VZz0BRneMioxUPQVP6sIxvM, Los Angeles Times]
[dPhGU51DcrolUIMxbRm0InaHGA2XM, NASDAQ]
2. From this dataframe, find the top 10 most popular stories, measured by the number of publishers that wrote an article about them, and report each story together with that number of publishers using println. A story is considered popular only if at least 5 publishers published at least one article on it (a sketch of both steps follows the expected output). For the small example input file the expected output is:
[ddUyU0VZz0BRneMioxUPQVP6sIxvM,5]
[7 marks]
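A sketch of both steps, assuming the news TSV has already been loaded into a dataframe newsDf with the column names from the attribute table (that loading step, and the names, are assumptions):

// Hypothetical sketch: distinct (storyId, publisher) pairs, then publisher counts per story.
import org.apache.spark.sql.functions.desc
val storyPublishers = newsDf.select($"storyId", $"publisher").distinct()
storyPublishers.write.parquet("file:///home/USERNAME/t4_story_publishers.parquet")
storyPublishers.groupBy("storyId").count()
  .filter($"count" >= 5)             // "popular" means at least 5 publishers
  .orderBy(desc("count"))
  .take(10)
  .foreach(r => println(s"[${r.getAs[Any]("storyId")},${r.getAs[Any]("count")}]"))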
b) [Spark SQL] Load up the Parquet file which you created in the previous subtask. Find all pairs of publishers that publish articles about the same stories, where a co-published story is a story published by both publishers. For each publisher pair, report the number of co-published stories, in decreasing order of frequency (a sketch of the pair-building self-join follows the example output). The solution may take a few minutes to run. Note the solution must conform to the following rules:
1. There should not be any replicated entries like:
NASDAQ, NASDAQ, 1000
2. You should not have the same pair occurring twice in opposite order. Only one of the following should occur:
NASDAQ, Reuters, 1000
Reuters, NASDAQ, 1000
(i.e. it is incorrect to have both of the above two lines in your result)
Save the results in CSV format at “file:///home/USERNAME/t4_paired_publishers.csv”, again replacing USERNAME with your CloudxLab username. For the example above, the output should be as follows (it is OK if the CSV is split over multiple files, but the combined data contents of the multiple files should be the same, with the order preserved):
[NASDAQ,IFA Magazine,2]
[Moneynews,Livemint,1]
[Moneynews,IFA Magazine,1]
[NASDAQ,Livemint,1]
[NASDAQ,Los Angeles Times,1]
[Moneynews,Los Angeles Times,1]
[Los Angeles Times,IFA Magazine,1]
[Livemint,IFA Magazine,1]
[NASDAQ,Moneynews,1]
[Los Angeles Times,Livemint,1]
[15 marks]
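A sketch of the pair-building self-join; the inequality on the two publisher columns is one way to satisfy both rules above (column aliases and paths are placeholders, not a definitive solution):

// Hypothetical sketch: self-join on storyId, keep each unordered publisher pair exactly once.
import org.apache.spark.sql.functions.desc
val sp    = spark.read.parquet("file:///home/USERNAME/t4_story_publishers.parquet")
val left  = sp.withColumnRenamed("publisher", "p1")
val right = sp.withColumnRenamed("publisher", "p2")
val pairs = left.join(right, "storyId")
  .filter($"p1" > $"p2")             // removes self-pairs and keeps only one order per pair
  .groupBy($"p1", $"p2").count()
  .orderBy(desc("count"))
pairs.write.format("csv").save("file:///home/USERNAME/t4_paired_publishers.csv")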
Bonus Marks:
[1] Video game sales data source: https://www.kaggle.com/gregorut/videogamesales
[2] Twitter data source: http://www.infochimps.com/datasets/twitter-census-conversation-metrics-one-year-of-urls-hashtags-sm
[3] News data source: https://archive.ics.uci.edu/ml/datasets/News+Aggregator