Datasets:
Here are the datasets. To better understand the data and do the assignment, it would be best if you create an account, log in to the game, and play a round of it. Info is below in case you have not done that already (this was a request I made in the first week of class).
Users JSON
Game JSON
News JSON
Instructions:
Read and load the JSON files into three or four SQL tables (name the tables with your initials at the end)
Join Users and Games via username (if you use SQL Tables)
Create a user_game_id in the game table and use the same user_game_id in a third table (game_cultural_value)
Join Games and News via news_id (if you use SQL Tables)
Exclude records with usernames like %Patty, %ProfPG, %ProfPattyG, %Claudine, %Mattius, %Lica from both users and game (a sketch of this DELETE step follows below)
Those datasets will be used for assignments 11 through 15.
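For the exclusion step, here is a minimal sketch in Python; the table names users_xx and game_xx are placeholders for whatever you named your tables (xx = your initials), and the connection settings are copied from the example under Tip at the bottom:

import mysql.connector

# Assumed connection settings (same as the example under Tip below)
mydb = mysql.connector.connect(
    host="mysql", database='ALY2100', user="ALY2100", password="homework"
)
mycursor = mydb.cursor()

# users_xx and game_xx are hypothetical table names (xx = your initials)
patterns = ['%Patty', '%ProfPG', '%ProfPattyG', '%Claudine', '%Mattius', '%Lica']
for table in ('users_xx', 'game_xx'):
    for p in patterns:
        # Table names cannot be parametrized, so they are interpolated directly
        mycursor.execute("DELETE FROM {} WHERE username LIKE %s".format(table), (p,))
mydb.commit()
mydb.close()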
A quick explanation of some fields:
cultural_values in the user table/collection: the cultural values the user identifies himself/herself with
cultural_values in the game collection/table: the cultural value the user tags/interprets the game (news) with (i.e., the cultural value the user sees in the article)
"Sports": { the kind of game/new "news_id": "5fd6611a6758331d4432048f", news id "time_without_media": 3.531, TIME SPENT READING THE TEXT OF THE NEWS "time_with_media": 2.564, TIME SPENT VISUALIZING THE NEWS IMAGE "share_without_media": true, iF USER SHARED THE NEWS WITHOUT THE IMAGE "share_with_media": true, IF THE USER SHARED THE NEWS WITH IMAGE "fake_news": true, IF USER THINKS THE NEWS IS FAKE OR NOT "cultural_values": [ CILTURAL VALUE THAT THE USER ATTACHES/TAG/CLASSIFY THE NEWS "Importance of Group Goals" ]
The key:value pairs in the news collection are self-descriptive.
Here are my suggestions:
If you create a table for games played by user, it will be something like this (better to add a primary key for faster access; a DDL sketch for both tables follows the sample rows below):
User_Game_ID  Section  Username  Game           time_w/o_media  time_w/media  share_w/o_media  share_w/media  fake_news  news_id
1             1        ali       Sports         3.53            2.56          1                1              1          5fd6611a6758331d4432048f
2             1        ali       Entertainment  2.43            2.18          0                0              1          5fd6611e6758331d4432049a
3             2        ali       Politics       2.0             3.18          0                0              1          5fd6611e6758331d4432124a
A game/news item can be tagged with more than one cultural value, and a user can log in to play games more than once; therefore you need yet another table to keep things in normal form:
Create a table for game_cultural_value
user_game_id  cultural_value                    news_id
1             "Importance of Group Goals"       5fd6611a6758331d4432048f
1             "Collaborative Decision Making"   ........
2             "Collaborative Decision Making"   ........
I suggest you convert true to 1 and false to 0, so you can run logistic regression; if you are using NoSQL, no transformation is needed other than replacing true/false with 1/0 and parsing the values into lists.
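Since Python's bool is a subclass of int, the conversion is a one-liner while you parse; a quick sketch over a single made-up game record:

# int(True) == 1 and int(False) == 0, so the flags become regression-ready
record = {"share_without_media": True, "share_with_media": True, "fake_news": False}
flags = {k: int(v) for k, v in record.items()}
print(flags)  # {'share_without_media': 1, 'share_with_media': 1, 'fake_news': 0}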
In February 2022, the team added a few more data points to collect info, so all the users created on or after 02.01.2022 might have to be processed differently (more info on that later). Make sure to keep the creation-date data in the users table; in fact, do not exclude any key:value pair (you will only delete the users' records as explained above).
You need to find out what was added to the game table in terms of key:value pairs after February 2022, because by week 14 you will be asked to analyze that info separately (just for the users created after February 2022, and just calculating p-values for them).
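One way to spot what was added is to compare the key sets of records created before and after the cutoff; a sketch assuming the game documents load as a list of dicts and carry a creation date (the field name created_date and its format are assumptions - check the actual JSON):

import json
from datetime import datetime

with open('game.json') as f:
    games = json.load(f)

cutoff = datetime(2022, 2, 1)
before, after = set(), set()
for g in games:
    # 'created_date' is a hypothetical field name; use the real one from the JSON
    created = datetime.fromisoformat(g['created_date'][:10])
    (after if created >= cutoff else before).update(g.keys())

print("Keys added after February 2022:", after - before)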
About the game:
A global interdisciplinary research team has developed an application called Headliner Share/Don't Share. The researchers are studying how sharing news headlines might be influencing social media communities. We would appreciate your participation in this pilot round of testing. It is estimated to take between 5 and 10 minutes, depending on the amount of time you spend in the virtual application. Your participation is voluntary. Data gathering is for research purposes only. There will be one follow-up email seeking your input on Headliner Share/Don't Share.
I suggest you create a user account and play the game before starting the assignment, so you get familiar with it.
Headliner Share/Don't Share: https://headlinerss.herokuapp.com/home
When working with JSON, here are a few resources for joining and querying collections:
https://www.mongodb.com/docs/manual/reference/operator/aggregation/lookup/
https://www.w3schools.com/nodejs/nodejs_mongodb_join.asp
https://hevodata.com/learn/mongodb-join-two-collections/
https://medium.com/fasal-engineering/fetching-data-from-different-collections-via-mongodb-aggregation-operations-with-examples-a273f24bfff0
https://www.analyticsvidhya.com/blog/2020/08/query-a-mongodb-database-using-pymongo/
This library is also installed, if you want to use it:
https://pypi.org/project/mongojoin/
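If you go the MongoDB route, the $lookup stage in the first link above is the direct analogue of a SQL join; a minimal pymongo sketch, assuming collections named users_xx and game_xx joined on username (the connection string is an assumption):

from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017/")  # assumed connection string
db = client["ALY2100"]

# Left-join game documents onto each user by username
pipeline = [
    {"$lookup": {
        "from": "game_xx",
        "localField": "username",
        "foreignField": "username",
        "as": "games"
    }}
]
for doc in db["users_xx"].aggregate(pipeline):
    print(doc["username"], len(doc["games"]))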
Assignment:
Use Jupyter Notebook for this assignment.
Load the three JSON files into SQL tables or MongoDB collections (follow the instructions under the final project datasets section).
The idea is to approach the final project using both MongoDB and SQL, so you can compare your results and practice both.
Using those three datasets, answer these questions:
After deleting the usernames indicated, how many distinct users are there?
How many users are in the user data set?
How many distinct users played the game? Every time a user appears in the game table, the user is playing the Headliners Share/Not Share game, and that would be considered one section.
What is the breakdown by gender (total users)? What is the % of players by gender (pie chart)? Do you see any difference if you use the user dataset versus the game one? If so, why?
What is the breakdown by age? What is the % of players by age (pie chart or bar chart)? Do you see any difference if you use the user dataset versus the game one? If so, why?
What is the breakdown by profession? What is the % of players by profession (pie or bar chart)? Do you see any difference if you use the user dataset versus the game one? If so, why?
What is the breakdown by country? What is the % of players by country (pie chart or bar chart)? Do you see any difference if you use the user dataset versus the game one? If so, why?
Which games were played by each user, and how many of those games per user (this is the same question as below)?
How many times has each user played a game? (One has to iterate through the news field to find how many games a user played; if the user appears more than once in the game table, chances are that the user played a given game more than once. You might want to break the news field into another table, using the username as the joining key.)
What were the top 5 news shared and the top 5 news not shared? (You also need to iterate through the news fields from the games table, or use the news fields to create another table as suggested above.)
What was the mean number of games played per user? (See the suggestion above.)
What is the breakdown of share/not share by age group/gender/profession/country? Any visible patterns?
What is the correlation for each of the groupings above (only for samples (groups) larger than 30)?
Plot graphs of the above (a pandas/matplotlib sketch follows this list).
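For the breakdown questions, pandas plus matplotlib covers both the counts and the charts; a sketch with made-up rows, assuming you have pulled the joined user/game records into a DataFrame with gender and share_with_media columns:

import pandas as pd
import matplotlib.pyplot as plt

# Stand-in for the joined user/game rows
df = pd.DataFrame({
    "gender": ["F", "M", "F", "F", "M"],
    "share_with_media": [1, 0, 1, 1, 0],
})

counts = df["gender"].value_counts()
print(counts / counts.sum() * 100)   # % of players by gender

counts.plot.pie(autopct="%1.1f%%")   # pie chart of the breakdown
plt.ylabel("")
plt.show()

# Share/not-share rate by gender, for the pattern-spotting question
print(df.groupby("gender")["share_with_media"].mean())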
Explain the techniques you used, either as comments in your Python code or in a separate Word document.
Tip:
Another way to do this is to use what you have learned: each JSON key:value pair, which is a dictionary itself, needs to be read through iteration in order to be processed. Play around with this example, and it will give you ideas on how to approach the JSON files:
import json
import mysql.connector
import numpy
from scipy import stats

f = None
mydb = None
try:
    # Connecting to the MySQL database
    mydb = mysql.connector.connect(
        host="mysql", database='ALY2100', user="ALY2100", password="homework"
    )
    mycursor = mydb.cursor()
    # Drop the table if it already exists
    mycursor.execute("DROP TABLE IF EXISTS book_store")
    # Create a table named book_store
    mycursor.execute('''CREATE TABLE book_store (author VARCHAR(50), country VARCHAR(50),
        image_link VARCHAR(200), language VARCHAR(50), link VARCHAR(200),
        pages INT(5), title VARCHAR(100), year INT(5))''')
    # Open the books JSON file
    f = open('books.json')
    # Read the JSON object as a list of dictionaries
    data = json.load(f)
    processed_data = []
    # Process the data and make it ready for the SQL insert query
    for x in data:
        processed_data.append((x['author'], x['country'], x['imageLink'], x['language'],
                               x['link'], x['pages'], x['title'], x['year']))
    # Insert the data into the database
    query = ("INSERT INTO book_store (author, country, image_link, language, link, pages, title, year) "
             "VALUES (%s, %s, %s, %s, %s, %s, %s, %s)")
    mycursor.executemany(query, processed_data)
    mydb.commit()
    print(mycursor.rowcount, " book records inserted successfully.")
    # Print the inserted data
    mycursor.execute("SELECT * FROM book_store;")
    result = mycursor.fetchall()
    print("Book table data after insertion.")
    for i in result:
        print(i)
    # Collect the pages column into a list
    pages_list = []
    mycursor.execute("SELECT pages FROM book_store")
    result = mycursor.fetchall()
    for i in result:
        pages_list.append(i[0])
    # Calculating the mean for the pages list
    mean = numpy.mean(pages_list)
    print("The mean of the given pages list is: ", mean)
    # Calculating the median for the pages list
    median = numpy.median(pages_list)
    print("The median of the given pages list is: ", median)
    # Calculating the mode for the pages list
    mode = stats.mode(pages_list)
    print("The mode of the given pages list is: ", mode)
    # Calculating the standard deviation for the pages list
    standard_deviation = numpy.std(pages_list)
    print("The standard deviation of the given pages list is: ", standard_deviation)
    # Calculating the variance for the pages list
    variance = numpy.var(pages_list)
    print("The variance of the given pages list is: ", variance)
except IOError:
    print("Sorry, the file does not exist or could not be read")
except mysql.connector.Error as error:
    print("Error, the SQL operation has failed: {}".format(error))
finally:
    # Close the file and the database connection only if they were opened
    if f is not None:
        f.close()
    if mydb is not None:
        mydb.close()
        print("Database connection closed")