Objectives
1. Gain in-depth, hands-on experience with big data tools (Hive, Spark RDDs, and Spark SQL).
2. Solve challenging big data processing tasks by finding highly efficient solutions; very inefficient solutions will lose marks.
3. Experience processing four different types of real data
a. Standard multi-attribute data (video game sales data)
b. Time series data (Twitter feed)
c. Bag of words data
d. A News aggregation corpus
4. Practice using programming APIs to find the best API calls to solve your problem. The API documentation for Hive and Spark is linked below (for Spark, look especially under RDD; there are a lot of really useful API calls).
a) [Hive] https://cwiki.apache.org/confluence/display/Hive/LanguageManual
b) [Spark]
http://spark.apache.org/docs/latest/api/scala/index.html#package
c) [Spark SQL]
https://spark.apache.org/docs/latest/sql-programming-guide.html
https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.Dataset
https://spark.apache.org/docs/latest/api/sql/index.html
Hint: if you are not sure what a Spark API call does, write a small example and try it in the spark-shell.
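For instance, a throwaway toy example (hypothetical, not part of the assignment data) pasted into spark-shell shows what reduceByKey does:

val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))   // tiny key/value RDD
pairs.reduceByKey(_ + _).collect()                              // Array((a,4), (b,2)), order may vary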
This assignment is due at 10:00 a.m. on Thursday 28 May 2020.
Penalties are applied to late assignments (accepted up to 5 business days after the due date only). Five percent is deducted per business day late. A mark of zero will be assigned to assignments submitted more than 5 days late.
Submission checklist
- Ensure that all of your solutions read their input from the full data files (not the small example versions)
- Check that all of your solutions run without crashing on the CloudxLab interface.
- Delete all output files
- Zip the whole assignment folder and submit via LMS
Copying, Plagiarism
This is an individual assignment. You are not permitted to work as part of a group when writing this assignment.
Plagiarism is the submission of somebody else’s work in a manner that gives the impression that the work is your own. For individual assignments, plagiarism includes the case where two or more students work collaboratively on the assignment. The Department of Computer Science and Computer Engineering treats plagiarism very seriously. When it is detected, penalties are strictly imposed.
Expected quality of solutions
a) In general, writing more efficient code (less reading/writing from/into HDFS and fewer data shuffles) will be rewarded with more marks.
b) This entire assignment can be done using the CloudxLab servers and the supplied data sets without running out of memory. It is time to show your skills!
c) We are not too fussed about the layout of the output: as long as it looks similar to the example output for each task, that will be good enough. The idea is not to spend your time massaging the output into exactly the right format, but to spend it solving the problems.
d) For Hive queries, we prefer answers that use fewer tables.
The questions in the assignment will be labelled using the following:
· [Hive]
o Means this question needs to be done using Hive
· [Spark RDD]
o Means this question needs to be done using Spark RDDs; you are not allowed to use any Spark SQL features such as dataframes or datasets.
· [Spark SQL]
o Means this question needs to be done using Spark SQL, so you are not allowed to use RDDs. You need to do these questions using the Spark dataframe or dataset API; do not use SQL syntax. That is, you must not use the spark.sql() function; instead you must use the individual functions (e.g. .select() and .join()).
o For example, write code in the style of the first snippet below, not the second.
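As an illustration of that distinction (a hedged sketch only; gamesDf is a hypothetical dataframe with Name, Platform, and NA_Sales columns, and spark-shell's default imports are assumed for the $ column syntax):

// Write code like this: call the DataFrame/Dataset API directly.
val result = gamesDf.filter($"Platform" === "X360")
                    .select($"Name", $"NA_Sales")

// Don't write code like this: passing SQL text through spark.sql().
gamesDf.createOrReplaceTempView("games")
val result2 = spark.sql("SELECT Name, NA_Sales FROM games WHERE Platform = 'X360'")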
Do the entire assignment using CloudxLab.
Assignment structure:
· For each Hive question a skeleton .hql file is provided for you to write your solution in. You can run these just like you did in labs:
$ hive -f Task_XX.hql
· For each Spark question, a skeleton project is provided for you. Write your solution in the .scala file. Run your Spark code using:
$ spark-shell -i Task_XX.scala
Tips:
1. Look at the data files before you begin each task. Try to understand what you are dealing with! The data for the Spark questions is already on HDFS, so you don't need to write the data-loading yourself.
2. For each subtask we provide small example input and the corresponding output in the assignment specifications below. These small versions of the files are also supplied with the assignment (they have “-small” in the name). It’s a good idea to get your solution working on the small inputs first before moving on to the full files.
3. In addition to testing the correctness of your code using the very small example input, you should also use the large input files that we provide to test the scalability of your solutions. If your solution takes longer than 10 minutes to run on the large input, it definitely has scalability problems.
4. It can take some time to run Spark applications, so for the Spark questions it's best to experiment using the spark-shell interpreter first to figure out a working solution, and then put your code into the .scala files afterwards.
Task 1: Analysing Video Game Sales [29 marks total]
We will be doing some analytics on data that contains a list of video games with sales greater than 100k copies[1]. The data is stored in a semicolon (“;”) delimited format.
The data is supplied with the assignment in the zip file and on HDFS at the below locations:
Local filesystem:
Small version | t1/Input_data/vgsales-small.csv
Full version | t1/Input_data/vgsales.csv
HDFS:
Small version | hdfs:///user/ashhall1616/bdc_data/assignment/t1/vgsales-small.csv
Full version | hdfs:///user/ashhall1616/bdc_data/assignment/t1/vgsales.csv
The data has the following attributes:

Attribute index | Attribute name | Description
0 | Name | The game name
1 | Platform | Platform of the game's release (categorical: "Wii", "NES", "GB", "DS", "X360", "PS3", "PS2", "SNES", "GBA", "3DS", "PS4", "N64", "PS", "XB", "PC", "2600", "PSP", "XOne", "GC", "WiiU", "GEN", "DC", "PSV", "SAT", "SCD", "WS", "NG", "TG16", "3DO", "GG", "PCFX")
2 | Year | Year of the game's release
3 | Genre | Genre of the game (categorical: "Sports", "Platform", "Racing", "Role-Playing", "Puzzle", "Misc", "Shooter", "Simulation", "Action", "Fighting", "Adventure", "Strategy")
4 | Publisher | Publisher of the game (categorical: "Nintendo", "Microsoft Game Studios", "Take-Two Interactive", "Sony Computer Entertainment", "Activision", "Ubisoft", "Bethesda Softworks", "Electronic Arts", "Sega", "SquareSoft", "Atari", "505 Games", …, and 567 other unique records)
5 | NA_Sales | Sales in North America (in millions)
6 | EU_Sales | Sales in Europe (in millions)
7 | JP_Sales | Sales in Japan (in millions)
8 | Other_Sales | Sales in the rest of the world (in millions)
Here is a small example of the data that we will use to illustrate the subtasks below:
Name | Platform | Year | Genre | Publisher | NA_Sales | EU_Sales | JP_Sales | Other_Sales
Gran Turismo 3: A-Spec | PS2 | 2001 | Racing | Sony Computer Entertainment | 6.85 | 5.09 | 1.87 | 1.16
Call of Duty: Modern Warfare 3 | X360 | 2011 | Shooter | Activision | 9.03 | 4.28 | 0.13 | 1.32
Pokemon Yellow: Special Pikachu Edition | GB | 1998 | Role-Playing | Nintendo | 5.89 | 5.04 | 3.12 | 0.59
Call of Duty: Black Ops | X360 | 2010 | Shooter | Activision | 9.67 | 3.73 | 0.11 | 1.13
Pokemon HeartGold/Pokemon SoulSilver | DS | 2009 | Action | Nintendo | 4.4 | 2.77 | 3.96 | 0.77
High Heat Major League Baseball 2003 | PS2 | 2002 | Sports | 3DO | 0.18 | 0.14 | 0 | 0.05
Panzer Dragoon | SAT | 1995 | Shooter | Sega | 0 | 0 | 0.37 | 0
Corvette | GBA | 2003 | Racing | TDK Mediactive | 0.2 | 0.07 | 0 | 0.01
Resident Evil: Revelations | PS3 | 2013 | Action | Capcom | 0.14 | 0.32 | 0.22 | 0.12
Mission: Impossible - Operation Surma | PS2 | 2003 | Platform | Atari | 0.14 | 0.11 | 0 | 0.04
Suikoden Tierkreis | DS | 2008 | Role-Playing | Konami Digital Entertainment | 0.09 | 0.01 | 0.15 | 0.01
Dragon Ball GT: Final Bout | PS | 1997 | Fighting | Namco Bandai Games | 0.02 | 0.02 | 0.22 | 0.02
Harry Potter and the Sorcerer's Stone | PS2 | 2003 | Action | Electronic Arts | 0.14 | 0.11 | 0 | 0.04
Tropico 4 | PC | 2011 | Strategy | Kalypso Media | 0.1 | 0.13 | 0 | 0.04
Call of Juarez: The Cartel | X360 | 2011 | Shooter | Ubisoft | 0.14 | 0.11 | 0 | 0.03
Prince of Persia: The Two Thrones | GC | 2005 | Action | Ubisoft | 0.11 | 0.03 | 0 | 0
NHL 07 | PSP | 2006 | Sports | Electronic Arts | 0.13 | 0.02 | 0 | 0.02
Disney's Winnie the Pooh's Rumbly Tumbly Adventure | GBA | 2005 | Platform | Ubisoft | 0.09 | 0.03 | 0 | 0
Batman & Robin | PS | 1998 | Action | Acclaim Entertainment | 0.06 | 0.04 | 0 | 0.01
Spider-Man: Battle for New York | DS | 2006 | Platform | Activision | 0.12 | 0 | 0 | 0.01
Please note we specify whether you should use [Hive] or [Spark RDD] for each subtask at the beginning of each subtask. There is a skeleton file provided for each of the tasks.
a) [Hive] Report the top 3 games for the 'X360' console in North America based on their sales in descending order. Write the results to “Task_1a-out”. For the above small example data set you would report the following (Name, Genre, NA_Sales):
Call of Duty: Black Ops Shooter 9.67
Call of Duty: Modern Warfare 3 Shooter 9.03
Call of Juarez: The Cartel Shooter 0.14
[6 marks]
b) [Hive] Report the number of games for each pair of platform and genre, in descending order of count. Write the results to “Task_1b-out”. Hint: you can group data by more than one column (see the HiveQL sketch after the example output). For the small example data set you would report the following (Platform, Genre, count):
X360 Shooter 3
PS Action 1
SAT Shooter 1
PSP Sports 1
PS3 Action 1
PS2 Sports 1
PS2 Racing 1
PS2 Platform 1
PS2 Action 1
PS Fighting 1
PC Strategy 1
GC Action 1
GBA Racing 1
GBA Platform 1
GB Role-Playing 1
DS Role-Playing 1
DS Platform 1
DS Action 1
[6 marks]
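The multi-column grouping mentioned in the hint looks roughly like this in HiveQL (a sketch only; it assumes the skeleton .hql file has already declared a table, here called vgsales, over the ';'-delimited input, and your skeleton's output handling may differ):

-- Hypothetical sketch: count games per (platform, genre) pair, most frequent first.
INSERT OVERWRITE DIRECTORY 'Task_1b-out'
SELECT platform, genre, COUNT(*) AS cnt
FROM vgsales
GROUP BY platform, genre
ORDER BY cnt DESC;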
c) [Spark RDD] Find the highest and lowest selling genres based on global sales, where Global Sales = NA_Sales + EU_Sales + JP_Sales. Print the result to the terminal output using println (a sketch of this kind of per-genre aggregation follows the example output). For the small example data set you should get the following results:
Highest selling Genre: Shooter Global Sale (in millions): 27.57
Lowest selling Genre: Strategy Global Sale (in millions): 0.23
[7 marks]
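A minimal Spark RDD sketch of this kind of per-genre aggregation, assuming the ';'-delimited field layout described above and no header row (the path and variable names are placeholders, not a definitive solution):

// Hypothetical sketch: sum NA + EU + JP sales per genre, then take the extremes.
val lines = sc.textFile("hdfs:///user/ashhall1616/bdc_data/assignment/t1/vgsales-small.csv")
val genreSales = lines.map(_.split(";"))
  .map(f => (f(3), f(5).toDouble + f(6).toDouble + f(7).toDouble))   // (genre, global sales)
  .reduceByKey(_ + _)
val highest = genreSales.sortBy(_._2, ascending = false).first()
val lowest  = genreSales.sortBy(_._2).first()
println(s"Highest selling Genre: ${highest._1} Global Sale (in millions): ${highest._2}")
println(s"Lowest selling Genre: ${lowest._1} Global Sale (in millions): ${lowest._2}")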
d) [Spark RDD] Report the top 50 publishers by market share (percentage of games made by the publisher among all games), along with the total number of games they made, in descending order of market share. Print the result to the terminal output using println. Hint: the Spark RDD method take might be useful (see the sketch after the example output). For the small example data set you would report the following (publisher name, number of games made by the publisher, percentage of games made by the publisher among all games):
(Ubisoft,3,15.0)
(Activision,3,15.0)
(Electronic Arts,2,10.0)
(Nintendo,2,10.0)
(Acclaim Entertainment,1,5.0)
(Sega,1,5.0)
(3DO,1,5.0)
(Namco Bandai Games,1,5.0)
(TDK Mediactive,1,5.0)
(Sony Computer Entertainment,1,5.0)
(Konami Digital Entertainment,1,5.0)
(Kalypso Media,1,5.0)
(Capcom,1,5.0)
(Atari,1,5.0)
[10 marks]
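A sketch of where take fits in, under the same assumptions as the previous sketch (publisher is field index 4 per the attribute table; all names are placeholders):

// Hypothetical sketch: market share per publisher, keep the top 50 with take.
val rows  = sc.textFile("hdfs:///user/ashhall1616/bdc_data/assignment/t1/vgsales-small.csv")
              .map(_.split(";"))
val total = rows.count().toDouble
rows.map(f => (f(4), 1))
  .reduceByKey(_ + _)
  .map { case (pub, n) => (pub, n, n * 100.0 / total) }
  .sortBy(_._3, ascending = false)
  .take(50)                          // top 50 publishers by market share
  .foreach(println)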
Task 2: Analysing Twitter Time Series Data [24 marks]
In this task we will be doing some analytics on real Twitter data[2]. The data is stored in a tab (“\t”) delimited format.
The data is supplied with the assignment in the zip file and on HDFS at the below locations:
Local filesystem:
Small version | t2/Input_data/twitter-small.tsv
Full version | t2/Input_data/twitter.tsv
HDFS:
Small version | hdfs:///user/ashhall1616/bdc_data/assignment/t2/twitter-small.tsv
Full version | hdfs:///user/ashhall1616/bdc_data/assignment/t2/twitter.tsv
The data has the following attributes:

Attribute index | Attribute name | Description
0 | tokenType | In our data set all rows have a token type of hashtag, so this attribute is not useful for this assignment.
1 | month | The year and month in YYYYMM format (4 digits for the year followed by 2 digits for the month); for example, 200905 means May 2009.
2 | count | An integer representing the number of tweets with this hashtag for the given year and month.
3 | hashtagName | The #tag name, e.g. babylove, mydate, etc.
Here is a small example of the Twitter data that we will use to illustrate the subtasks below:
Token type | Month | count | Hash Tag Name
hashtag | 200910 | 2 | Babylove
hashtag | 200911 | 2 | babylove
hashtag | 200912 | 90 | babylove
hashtag | 200812 | 100 | mycoolwife
hashtag | 200901 | 201 | mycoolwife
hashtag | 200910 | 1 | mycoolwife
hashtag | 200912 | 500 | mycoolwife
hashtag | 200905 | 23 | abc
hashtag | 200907 | 1000 | abc
a) [Hive] Find the top 5 months with the highest number of cumulative tweets, sorted by the number of tweets in each month, with the month that has the highest number of cumulative tweets first. Write the results to “Task_2a-out”. So, for the above small example dataset the result would be:
month numtweets
200907 1000
200912 590
200901 201
200812 100
200905 23
[6 marks]
b) [Spark RDD] Find the year with the maximum number of cumulative tweets in the entire dataset across all months. Report the total number of tweets for that year. Print the result to the terminal output using println. For the above small example dataset the output would be:
2009 1819
[6 marks]
c) [Spark RDD] Given two months x and y, where y > x, find the hashtag name whose number of tweets increased the most from month x to month y. We have already written code in your code template that reads the x and y values from the keyboard. Ignore the tweets in the months between x and y; just compare the number of tweets in month x and in month y. Report the hashtag name and its number of tweets in month x and in month y. Ignore any hashtag that has no tweets in month x or no tweets in month y. You can assume that the combination of hashtag and month is unique. Print the result to the terminal output using println (a sketch of one possible join-based approach follows the example output). For the above small example data set the output should be the following:
Input
x = 200910, y = 200912
Output
hashtagName: mycoolwife, countX: 1, countY: 500
[12 marks]
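One possible shape for the month-to-month comparison, assuming x and y hold the month strings read by the template (all other names are placeholders and this is only a sketch):

// Hypothetical sketch: key each month's counts by hashtag, join, pick the biggest increase.
val rows = sc.textFile("hdfs:///user/ashhall1616/bdc_data/assignment/t2/twitter-small.tsv")
             .map(_.split("\t"))
val atX = rows.filter(f => f(1) == x).map(f => (f(3), f(2).toInt))
val atY = rows.filter(f => f(1) == y).map(f => (f(3), f(2).toInt))
val best = atX.join(atY)                         // drops hashtags missing in either month
  .map { case (tag, (cx, cy)) => (tag, cx, cy) }
  .sortBy(t => t._3 - t._2, ascending = false)
  .first()
println(s"hashtagName: ${best._1}, countX: ${best._2}, countY: ${best._3}")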
Task 3: Indexing Bag of Words data [25 marks]
In this task you are asked to create a partitioned index of words to documents that contain the words. Using this index you can search for all the documents that contain a particular word efficiently.
The data is supplied with the assignment in the zip file and on HDFS at the below locations:
Local filesystem:
Small version | t3/Input_data/docword-small.txt | t3/Input_data/vocab-small.txt
Full version | t3/Input_data/docword.txt | t3/Input_data/vocab.txt
HDFS:
Small version | hdfs:///user/ashhall1616/bdc_data/assignment/t3/docword-small.txt | hdfs:///user/ashhall1616/bdc_data/assignment/t3/vocab-small.txt
Full version | hdfs:///user/ashhall1616/bdc_data/assignment/t3/docword.txt | hdfs:///user/ashhall1616/bdc_data/assignment/t3/vocab.txt
The first file is called docword.txt, which contains the contents of all the documents stored in the following format:
Attribute index | Attribute name | Description
0 | docId | The ID of the document that contains the word
1 | vocabId | Instead of storing the word itself, we store an ID from the vocabulary file
2 | count | An integer representing the number of times this word occurred in this document
The second file, called vocab.txt, contains each word in the vocabulary, indexed by the vocabId used in the docword.txt file.
Here is a small example of the contents of the docword.txt file:
docId | vocabId | Count
3 | 3 | 600
2 | 3 | 702
1 | 2 | 120
2 | 5 | 200
2 | 2 | 500
3 | 1 | 100
3 | 5 | 2000
3 | 4 | 122
1 | 3 | 1200
1 | 1 | 1000
Here is an example of the vocab.txt file:
vocabId | Word
1 | Plane
2 | Car
3 | Motorbike
4 | Truck
5 | Boat
Complete the following subtasks using Spark:
a) [Spark SQL] Calculate the total count of each word across all documents. List the words ordered by count in descending order. Write the results in CSV format at “file:///home/USERNAME/Task_3a-out”, replacing USERNAME with your CloudxLab username. Use show() to print the first 10 rows of the dataframe that you saved (a sketch of the general query shape follows the note below). So, for the above small example input the output would be the following:
+---------+----------+
| word|sum(count)|
+---------+----------+
|motorbike| 2502|
| boat| 2200|
| plane| 1100|
| car| 620|
| truck| 122|
+---------+----------+
Note: Spark SQL will write the output as multiple part files. You should ensure that the data is sorted globally across all the files (parts), so that all rows in part 0 come before all rows in part 1 according to the sort order.
[8 marks]
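A sketch of the general query shape, assuming docwordDf and vocabDf are dataframes already loaded with the column names shown in the tables above (those names, and the loading step itself, are assumptions):

// Hypothetical sketch: join IDs to words, total the counts, sort globally, write CSV.
import org.apache.spark.sql.functions.{sum, desc}
val totals = docwordDf.join(vocabDf, "vocabId")
  .groupBy("word")
  .agg(sum("count"))
  .orderBy(desc("sum(count)"))       // a global orderBy keeps the part files globally sorted
totals.write.format("csv").save("file:///home/USERNAME/Task_3a-out")
totals.show(10)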
b) [Spark SQL]
1. Create a dataframe containing rows with four fields: (word, docId, count, firstLetter). You should add the firstLetter column by using a UDF which extracts the first letter of word as a String. Write the results in parquet format, partitioned by firstLetter, at “file:///home/USERNAME/t3_docword_index_part.parquet”, replacing USERNAME with your CloudxLab username. Use show() to print the first 10 rows of the partitioned dataframe that you saved (a sketch of the UDF and partitioned write follows the example output).
So, for the above example input, you should see the following output (the exact ordering is not important):
+---------+-----+-----+-----------+
| word|docId|count|firstLetter|
+---------+-----+-----+-----------+
| plane| 1| 1000| p|
| plane| 3| 100| p|
| car| 2| 500| c|
| car| 1| 120| c|
|motorbike| 1| 1200| m|
|motorbike| 2| 702| m|
|motorbike| 3| 600| m|
| truck| 3| 122| t|
| boat| 3| 2000| b|
| boat| 2| 200| b|
+---------+-----+-----+-----------+
[7 marks]
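A sketch of the UDF and partitioned write, under the same docwordDf/vocabDf assumptions as in subtask a) (spark-shell's default imports are assumed for the $ column syntax):

// Hypothetical sketch: derive firstLetter with a UDF, then write parquet partitioned by it.
import org.apache.spark.sql.functions.udf
val firstLetterUdf = udf((w: String) => w.substring(0, 1))
val index = docwordDf.join(vocabDf, "vocabId")
  .select($"word", $"docId", $"count")
  .withColumn("firstLetter", firstLetterUdf($"word"))
index.write.partitionBy("firstLetter")
  .parquet("file:///home/USERNAME/t3_docword_index_part.parquet")
index.show(10)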
c) [Spark SQL] Load the dataframe stored in partitioned parquet format from subtask b). The task template includes code to obtain a list of query words from the user. For each word in the queryWords list, use println to display the word and the docId with the largest count for that word (you can break ties arbitrarily). Skip any query words that aren't found in the dataset. To iterate over the query words, use a normal Scala for loop, like this:
for (queryWord <- queryWords) {
  // ...put Spark query code here...
}
For this subtask there is a big optimisation that can be made because of how the data is partitioned, so think carefully about how you filter the data (see the sketch after the example output).
If queryWords contains “car”, “dog”, and “truck”, then the output for the example dataset would be:
[car,2]
[truck,3]
[5 marks]
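The big optimisation referred to above is partition pruning: because the parquet output of subtask b) is partitioned by firstLetter, filtering on firstLetter first lets Spark read only one partition directory per query word. A hedged sketch (queryWords comes from the template; the other names are placeholders):

// Hypothetical sketch: filter on the partition column so only one directory is scanned.
import org.apache.spark.sql.functions.desc
val index = spark.read.parquet("file:///home/USERNAME/t3_docword_index_part.parquet")
for (queryWord <- queryWords) {
  val top = index.filter($"firstLetter" === queryWord.substring(0, 1) && $"word" === queryWord)
                 .orderBy(desc("count"))
                 .take(1)
  top.foreach(r => println(s"[$queryWord,${r.getAs[Any]("docId")}]"))
}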
d) [Spark SQL] Again load the dataframe stored in partitioned parquet format from subtask b). The task template includes code to obtain a list of document IDs from the user. For each document ID in the docIds list, use println to display the document ID and the word with the most occurrences in that document, along with the number of occurrences of that word in the document (you can break ties arbitrarily). Skip any document IDs that aren't found in the dataset. Note that in this task we are searching by document ID instead of query word, so we cannot take advantage of the same optimization you used for part c); you need to find some other kind of optimization. Hint: think about the fact that you are repeatedly reusing the same dataframe in the Scala for loop as you process each docId (see the sketch after the example output).
If docIds contains “2” and “3”, then the output for the example dataset would be:
[2, motorbike, 702]
[3, boat, 2000]
[5 marks]
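The other kind of optimisation hinted at here is simply to read and cache the dataframe once, outside the loop, so it is not re-read from parquet for every document ID. A hedged sketch (docIds comes from the template; other names are placeholders):

// Hypothetical sketch: cache once, then reuse the cached dataframe for every docId.
import org.apache.spark.sql.functions.desc
val index = spark.read.parquet("file:///home/USERNAME/t3_docword_index_part.parquet").cache()
for (docId <- docIds) {
  val top = index.filter($"docId" === docId)
                 .orderBy(desc("count"))
                 .take(1)
  top.foreach(r => println(s"[$docId, ${r.getAs[Any]("word")}, ${r.getAs[Any]("count")}]"))
}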
Task 4: Finding associations between news publishers [22 marks]
In this task, you will be asked to analyse a dataset of news articles[3] by finding associations between news publishers.
The data is supplied with the assignment in the zip file and on HDFS at the below locations:
Local filesystem:
Small version | t4/Input_data/news-small.tsv
Full version | t4/Input_data/news.tsv
HDFS:
Small version | hdfs:///user/ashhall1616/bdc_data/assignment/t4/news-small.csv
Full version | hdfs:///user/ashhall1616/bdc_data/assignment/t4/news.csv
Note: due to the size of the full dataset, you may need to run tasks using the command below (it allocates 1 GB of RAM to the Spark driver rather than the default 512 MB):
$ spark-shell -i Task_XX.scala --driver-memory 1g
The data has the following attributes:

Attribute index | Attribute name | Description
0 | articleId | Numeric identifier for the article
1 | title | Title of the article
2 | url | URL of the article
3 | publisher | The publisher of the article
4 | category | News category (b = business, t = science and technology, e = entertainment, m = health)
5 | storyId | Alphanumeric ID of the “story”. All articles are first grouped via a clustering algorithm; each cluster is then assigned a story ID. The story ID identifies the topic of the story, not the specific article, so there are multiple articles (rows of data) per story.
6 | hostname | URL hostname
7 | timestamp | Approximate time the news was published, in Unix time
Here is a small example of the article data that we will use to illustrate the subtasks below (we only list a subset of the attributes in this example, see the above table for the description of the attributes):
articleId | publisher | Category | storyId | hostname
1 | Los Angeles Times | B | ddUyU0VZz0BRneMioxUPQVP6sIxvM | www.latimes.com
2 | Livemint | B | ddUyU0VZz0BRneMioxUPQVP6sIxvM | www.livemint.com
3 | IFA Magazine | B | ddUyU0VZz0BRneMioxUPQVP6sIxvM | www.ifamagazine.com
4 | IFA Magazine | B | ddUyU0VZz0BRneMioxUPQVP6sIxvM | www.ifamagazine.com
5 | Moneynews | B | ddUyU0VZz0BRneMioxUPQVP6sIxvM | www.moneynews.com
6 | NASDAQ | B | ddUyU0VZz0BRneMioxUPQVP6sIxvM | www.nasdaq.com
16 | NASDAQ | B | dPhGU51DcrolUIMxbRm0InaHGA2XM | www.nasdaq.com
19 | IFA Magazine | B | dPhGU51DcrolUIMxbRm0InaHGA2XM | www.ifamagazine.com
Complete the following subtasks using Spark:
a) [Spark SQL]
1. Create a list of each story paired with each publisher that wrote at least one article for that story. Store the resulting dataframe in Parquet format at “file:///home/USERNAME/t4_story_publishers.parquet”, replacing USERNAME with your CloudxLab username. For the small example input file the expected output is:
[ddUyU0VZz0BRneMioxUPQVP6sIxvM, Livemint]
[ddUyU0VZz0BRneMioxUPQVP6sIxvM, IFA Magazine]
[ddUyU0VZz0BRneMioxUPQVP6sIxvM, Moneynews]
[ddUyU0VZz0BRneMioxUPQVP6sIxvM, NASDAQ]
[dPhGU51DcrolUIMxbRm0InaHGA2XM, IFA Magazine]
[ddUyU0VZz0BRneMioxUPQVP6sIxvM, Los Angeles Times]
[dPhGU51DcrolUIMxbRm0InaHGA2XM, NASDAQ]
2. From this dataframe, find the top 10 most popular stories, measured by the number of publishers that wrote an article about them, and report each story together with that number of publishers using println. A story is considered popular only if at least 5 publishers published at least one article on it (a sketch of both steps follows the expected output). For the small example input file the expected output is:
[ddUyU0VZz0BRneMioxUPQVP6sIxvM,5]
[7 marks]
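A sketch of both steps, assuming the news TSV has already been loaded into a dataframe newsDf with the column names from the attribute table (that loading step, and the names, are assumptions):

// Hypothetical sketch: distinct (storyId, publisher) pairs, then publisher counts per story.
import org.apache.spark.sql.functions.desc
val storyPublishers = newsDf.select($"storyId", $"publisher").distinct()
storyPublishers.write.parquet("file:///home/USERNAME/t4_story_publishers.parquet")
storyPublishers.groupBy("storyId").count()
  .filter($"count" >= 5)             // "popular" means at least 5 publishers
  .orderBy(desc("count"))
  .take(10)
  .foreach(r => println(s"[${r.getAs[Any]("storyId")},${r.getAs[Any]("count")}]"))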
b) [Spark SQL] Load up the Parquet file which you created in the previous subtask. Find all pairs of publishers that publish articles about the same stories, where a co-published story is a story published by both publishers. For each publisher pair, report the number of co-published stories, in decreasing order of frequency (a sketch of the pair-building self-join follows the example output). The solution may take a few minutes to run. Note the solution must conform to the following rules:
1. There should not be any replicated entries like:
NASDAQ, NASDAQ, 1000
2. You should not have the same pair occurring twice in opposite order. Only one of the following should occur:
NASDAQ, Reuters, 1000
Reuters, NASDAQ, 1000
(i.e. it is incorrect to have both of the above two lines in your result)
Save the results in CSV format at “file:///home/USERNAME/t4_paired_publishers.csv”, again replacing USERNAME with your CloudxLab username. For the example above, the output should be as follows (it is OK if the CSV is split over multiple files, but the combined data contents of the multiple files should be the same, with the order preserved):
[NASDAQ,IFA Magazine,2]
[Moneynews,Livemint,1]
[Moneynews,IFA Magazine,1]
[NASDAQ,Livemint,1]
[NASDAQ,Los Angeles Times,1]
[Moneynews,Los Angeles Times,1]
[Los Angeles Times,IFA Magazine,1]
[Livemint,IFA Magazine,1]
[NASDAQ,Moneynews,1]
[Los Angeles Times,Livemint,1]
[15 marks]
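A sketch of the pair-building self-join; the inequality on the two publisher columns is one way to satisfy both rules above (column aliases and paths are placeholders, not a definitive solution):

// Hypothetical sketch: self-join on storyId, keep each unordered publisher pair exactly once.
import org.apache.spark.sql.functions.desc
val sp    = spark.read.parquet("file:///home/USERNAME/t4_story_publishers.parquet")
val left  = sp.withColumnRenamed("publisher", "p1")
val right = sp.withColumnRenamed("publisher", "p2")
val pairs = left.join(right, "storyId")
  .filter($"p1" > $"p2")             // removes self-pairs and keeps only one order per pair
  .groupBy($"p1", $"p2").count()
  .orderBy(desc("count"))
pairs.write.format("csv").save("file:///home/USERNAME/t4_paired_publishers.csv")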
Bonus Marks:
[1] Video game sales data source: https://www.kaggle.com/gregorut/videogamesales
[2] Twitter data source: http://www.infochimps.com/datasets/twitter-census-conversation-metrics-one-year-of-urls-hashtags-sm
[3] News data source: https://archive.ics.uci.edu/ml/datasets/News+Aggregator