pdf


Homework 3 Relational Databases & SQL Statistics 141B 2023 Prof. Duncan Temple Lang Due Sunday, 10pm May 21st Submit via Canvas 1 Data We will work with the posts and related data from the StackOverflow site for questions about statistics Cross Validated, the Stats.stackexchange forum. This is a site where people ask questions about statistics, generally, and people often provide answers to these questions, and some comment on the question or the responses. This is a community of users who gain points for providing answers when others vote up (or down) their answers. Often, there is one “accepted” answer for a question. While this is typically the “best” answer, there may be others that are better, perhaps posted later. Some questions come with a financial incentive - a bounty. Users can also gain badges for certain knowledge and skills. In addition to the posts (questions and responses), we have comments associated with individual posts. We also have the history of changes to a post. The SQLite3 database is available from the Box folder https://ucdavis.app.box.com/folder/ 206515800849. It is 2.65 gigabytes. There are 14 tables in this database: 8 containing information about posts, users, tags, etc. and 6 providing meta-data, specifically text descriptions of different types. Documentation for each of the tables is available at Schema for StackOverflow Databases and an Interactive Schema Diagram showing a super-set of the tables and some of the relationships between them. The tables with names ending in TypeMap or TypeId map the numerical values to labels for the Post types(question, answer, . . . ), Vote types (up, down, accepted, offensive, . . . ), history of the post actions (edit title, body, tags, closed, reopened, . . . ) You will need to install the RSQLite and DBI R packages. 2 Report Structure For this assignment, your report will focus on each question separately rather than having a narrative of the overall, high-level process. For each question, • include the full text of the question, https://stats.stackexchange.com/ https://ucdavis.app.box.com/folder/206515800849 https://ucdavis.app.box.com/folder/206515800849 https://meta.stackexchange.com/questions/2677/database-schema-documentation-for-the-public-data-dump-and-sede https://sedeschema.github.io/ • map the (intentionally) casually written question to more precisely identify the relevant variables you need and the rows from the different tables, describing how you are interpreting some of the ambiguous terms, • interpret the results, justifying with numerical or graphical summaries, • show the SQL and R code you use to get the relevant data. 3 Questions Do as much of the computations in SQL and then bring the resulting table back to R to finish the computa- tions. Not all of the questions can be (easily) answered directly in SQL alone. Give the answer and show the SQL and R code used to answer each question. For each question, start by determining which tables are needed for each query. Answer at least 13 of the first 20 questions and all of the questions in the Required Questions section below. 1. How many users are there? 2. How many users joined since 2020? (Hint: Convert the CreationDate to a year.) 3. How many users joined each year? Describe this with a plot, commenting on any anomalies. 4. How many different types of posts are there in the Posts table? Get the description of the types from the PostTypeIdMap table. In other words, create a table with the description of each post type and the number of posts of that type, and arrange it from most to least occurrences. 5. How many posted questions are there? 6. What are the top 50 most common tags on questions? For each of the top 50 tags on questions, how many questions are there for each tag. 7. How many tags are in most questions? 8. How many answers are there? 9. What’s the most recent question (by date-time) in the Posts table? • Find it on the stats.exchange.com Web site and provide the URL. • How would we map a question in the Posts table to the corresponding SO URL? 10. For the 10 users who posted the most questions • How many questions did they post? • What are the users’ names? • When did they join SO? • What is their Reputation? • What country do they have in their profile? 2 11. Following from the previous questions, for the 10 users who posted the most questions, how many gold, silver and bronze badges does each of these 10 individuals have? 12. For each of the following terms, how many questions contain that term: Regression, ANOVA, Data Mining, Machine Learning, Deep Learning, Neural Network. 13. Using the Posts and PostLinks tables, how many questions gave rise to a ”related” or ”duplicate” ques- tion? • And how many responses did these questions get? • How experienced were the users posting these **questions**. 14. What is the date range for the questions and answers in this database? 15. What question has the most comments associated with it? • how many answers are there for this question? 16. How many comments are there across all posts? • How many posts have a comment? • What is the distribution of comments per question? 17. Is there any relationship between the number of tags on a question, the length of the question, and the number of responses (posts and comments)? 18. Do the people who vote tend to have badges? 19. How many questions were edited by the original poster? by other users? 20. . How many posts have multiple different people who edit it? 4 Required Questions 21. Compute the table that contains • the question, • the name of the user who posted it, • when that user joined, • their location • the date the question was first posted, • the accepted answer, • when the accepted answer was posted • the name of the user who provided the accepted answer. 3 22. Determine the users that have only posted questions and never answered a question? (Compute the table containing the number of questions, number of answers and the user’s login name for this group.) How many are there? 23. Compute the table with information for the 75 users with the most accepted answers. This table should include • the user’s display name, • creation date, • location, • the number of badges they have won, – the names of the badges (as a single string) • the dates of the earliest and most recent accepted answer (as two fields) – the (unique) tags for all the questions for which they had the accepted answer (as a single string) 24. How many questions received no answers (accepted or unaccepted)? How many questions had no accepted answer? 25. What is the distribution of answers per posted question? 26. What is the length of time for a question to receive an answer? to obtaining an accepted answer? 27. How many answers are typically received before the accepted answer? 5 Useful References and Web Pages • Cross Validated - Stats.stackexchange site • Schema for StackOverflow Databases • Interactive Schema Diagram • SQLite Documentation with links to many sets of functions, etc. • SQLite Tutorial • SQLite built-in functions • SQLite Date and Time Functions 4 https://stats.stackexchange.com/ https://meta.stackexchange.com/questions/2677/database-schema-documentation-for-the-public-data-dump-and-sede https://sedeschema.github.io/ https://www.sqlite.org/docs.html https://www.sqlitetutorial.net/ https://www.sqlite.org/lang_corefunc.html https://www.sqlite.org/lang_datefunc.html Data Report Structure Questions Required Questions Useful References and Web Pages Homework 3 Relational Databases & SQL Statistics 141B 2023 Prof. Duncan Temple Lang Due Sunday, 10pm May 21st Submit via Canvas 1 Data We will work with the posts and related data from the StackOverflow site for questions about statistics Cross Validated, the Stats.stackexchange forum. This is a site where people ask questions about statistics, generally, and people often provide answers to these questions, and some comment on the question or the responses. This is a community of users who gain points for providing answers when others vote up (or down) their answers. Often, there is one “accepted” answer for a question. While this is typically the “best” answer, there may be others that are better, perhaps posted later. Some questions come with a financial incentive - a bounty. Users can also gain badges for certain knowledge and skills. In addition to the posts (questions and responses), we have comments associated with individual posts. We also have the history of changes to a post. The SQLite3 database is available from the Box folder https://ucdavis.app.box.com/folder/ 206515800849. It is 2.65 gigabytes. There are 14 tables in this database: 8 containing information about posts, users, tags, etc. and 6 providing meta-data, specifically text descriptions of different types. Documentation for each of the tables is available at Schema for StackOverflow Databases and an Interactive Schema Diagram showing a super-set of the tables and some of the relationships between them. The tables with names ending in TypeMap or TypeId map the numerical values to labels for the Post types(question, answer, . . . ), Vote types (up, down, accepted, offensive, . . . ), history of the post actions (edit title, body, tags, closed, reopened, . . . ) You will need to install the RSQLite and DBI R packages. 2 Report Structure For this assignment, your report will focus on each question separately rather than having a narrative of the overall, high-level process. For each question, • include the full text of the question, https://stats.stackexchange.com/ https://ucdavis.app.box.com/folder/206515800849 https://ucdavis.app.box.com/folder/206515800849 https://meta.stackexchange.com/questions/2677/database-schema-documentation-for-the-public-data-dump-and-sede https://sedeschema.github.io/ • map the (intentionally) casually written question to more precisely identify the relevant variables you need and the rows from the different tables, describing how you are interpreting some of the ambiguous terms, • interpret the results, justifying with numerical or graphical summaries, • show the SQL and R code you use to get the relevant data. 3 Questions Do as much of the computations in SQL and then bring the resulting table back to R to finish the computa- tions. Not all of the questions can be (easily) answered directly in SQL alone. Give the answer and show the SQL and R code used to answer each question. For each question, start by determining which tables are needed for each query. Answer at least 13 of the first 20 questions and all of the questions in the Required Questions section below. 1. How many users are there? 2. How many users joined since 2020? (Hint: Convert the CreationDate to a year.) 3. How many users joined each year? Describe this with a plot, commenting on any anomalies. 4. How many different types of posts are there in the Posts table? Get the description of the types from the PostTypeIdMap table. In other words, create a table with the description of each post type and the number of posts of that type, and arrange it from most to least occurrences. 5. How many posted questions are there? 6. What are the top 50 most common tags on questions? For each of the top 50 tags on questions, how many questions are there for each tag. 7. How many tags are in most questions? 8. How many answers are there? 9. What’s the most recent question (by date-time) in the Posts table? • Find it on the stats.exchange.com Web site and provide the URL. • How would we map a question in the Posts table to the corresponding SO URL? 10. For the 10 users who posted the most questions • How many questions did they post? • What are the users’ names? • When did they join SO? • What is their Reputation? • What country do they have in their profile? 2 11. Following from the previous questions, for the 10 users who posted the most questions, how many gold, silver and bronze badges does each of these 10 individuals have? 12. For each of the following terms, how many questions contain that term: Regression, ANOVA, Data Mining, Machine Learning, Deep Learning, Neural Network. 13. Using the Posts and PostLinks tables, how many questions gave rise to a ”related” or ”duplicate” ques- tion? • And how many responses did these questions get? • How experienced were the users posting these **questions**. 14. What is the date range for the questions and answers in this database? 15. What question has the most comments associated with it? • how many answers are there for this question? 16. How many comments are there across all posts? • How many posts have a comment? • What is the distribution of comments per question? 17. Is there any relationship between the number of tags on a question, the length of the question, and the number of responses (posts and comments)? 18. Do the people who vote tend to have badges? 19. How many questions were edited by the original poster? by other users? 20. . How many posts have multiple different people who edit it? 4 Required Questions 21. Compute the table that contains • the question, • the name of the user who posted it, • when that user joined, • their location • the date the question was first posted, • the accepted answer, • when the accepted answer was posted • the name of the user who provided the accepted answer. 3 22. Determine the users that have only posted questions and never answered a question? (Compute the table containing
Answered 8 days AfterMay 11, 2023

Answer To: pdf

Sumit Kumar answered on May 19 2023
42 Votes
Answer at least 13 of the first 20 questions and all of the questions in the
Required Questions section below.
Question 1: - How many users are there?
File: datascience.stackexchange.db & stats.stackexchange.db: -
SQL Query: - select count (*) from users;
R Language: -
result <- dbGetQuery(con1,"select count(*) from users")
totaluser <- result[1,1]
print(totaluser)
Question 2: - How many users joined since 2020? (Hint: Convert the CreationDate to a year.)
File: datascience.stackexchange.db & stats.stackexcha
nge.db: -
SQL: - SELECT COUNT(*) AS TotalUsers
...> FROM users
...> WHERE strftime('%Y', CreationDate) >= '2020';
R: -
result <- dbGetQuery(con1,"SELECT COUNT(*) AS TotalUsers FROM users WHERE strftime('%Y', CreationDate) >= '2020';")
print(result[1,1])
Question 3: - How many users joined each year? Describe this with a plot, commenting on any anomalies.
SQL: - SELECT strftime('%Y', CreationDate) AS Year, COUNT(*) AS TotalUsers
...> FROM users
...> GROUP BY Year;
R: -
library(RSQLite)
library(DBI)
library(ggplot2)
query <- "SELECT strftime('%Y', CreationDate) AS Year, COUNT(*) AS TotalUsers FROM users GROUP BY Year"
result <- dbGetQuery(con1, query)
For Plotting
ggplot(result, aes(x = Year, y = TotalUsers)) +
geom_bar(stat = "identity", fill = "steelblue") +
labs(x = "Year", y = "Total Users", title = "Number of Users Joined Each Year") +
theme_minimal()
Question 4: - How many different types of posts are there in the Posts table? Get the description of the types from the
PostTypeIdMap table. In other words, create a table with the description of each post type and the number
of posts of that type, and arrange it from most to least occurrences.
SQL: - SELECT p.value AS TypeDescription, COUNT(*) AS NumPosts
...> FROM Posts
...> JOIN PostTypeIdMap AS p ON Posts.PostTypeId = p.id
...> GROUP BY p.value
...> ORDER BY NumPosts DESC;
R: -
> query <- "SELECT p.value AS TypeDescription, COUNT(*) AS NumPosts
+ FROM Posts
+ JOIN PostTypeIdMap AS p ON Posts.PostTypeId = p.id
+ GROUP BY p.value
+ ORDER BY NumPosts DESC"
> result <- dbGetQuery(con1, query)
> print(result)
Question 5: - How many posted questions are there?
SQL: -
SELECT COUNT(*) AS NumQuestions
FROM Posts
WHERE PostTypeId = (SELECT id FROM PostTypeIdMap WHERE value = 'Question');
R: -
query <- "SELECT COUNT(*) AS NumQuestions
+ FROM Posts
+ WHERE PostTypeId = (SELECT id FROM PostTypeIdMap WHERE value = 'Question')"
result <- dbGetQuery(con1, query)
print(result$NumQuestions)
Question 6: - What are the top 50 most common tags on questions? For each of the top 50 tags on questions, how many
questions are there for each tag.
SQL: -
SELECT Tags, COUNT(*) AS NumQuestions
FROM Posts
WHERE PostTypeId = (SELECT id FROM PostTypeIdMap WHERE value = 'Question')
GROUP BY Tags
ORDER BY NumQuestions DESC
LIMIT 50 ;
R: -
query <- "SELECT Tags, COUNT(*) AS NumQuestions
FROM Posts
WHERE PostTypeId = (SELECT id FROM PostTypeIdMap WHERE value = 'Question')
GROUP BY Tags
ORDER BY NumQuestions DESC
LIMIT 50"
result <- dbGetQuery(con1, query)
print(result)
Question 7: - How many tags are in most questions?
SQL: -
SELECT COUNT(*) AS NumTags
FROM (
SELECT Tags
FROM Posts
WHERE PostTypeId = (SELECT id FROM PostTypeIdMap WHERE value = 'Question')
GROUP BY Tags
ORDER BY COUNT(*) DESC
LIMIT 1
) AS MostCommonTags;
R: -
query <- "SELECT COUNT(*) AS NumTags
FROM (
SELECT Tags
FROM Posts
WHERE PostTypeId = (SELECT id FROM PostTypeIdMap WHERE value = 'Question')
GROUP BY Tags
ORDER BY COUNT(*) DESC
LIMIT 1
) AS MostCommonTags"
result <- dbGetQuery(con1, query)
print(result$NumTags)
Question 11. Following from the previous questions, for the 10 users who posted the most questions, how many gold,
silver and bronze badges does each of these 10 individuals have?
SQL: -
SELECT u.DisplayName, COUNT(CASE WHEN b.Name = 'gold' THEN 1 END) AS GoldBadges,
COUNT(CASE WHEN b.Name = 'silver' THEN 1 END) AS SilverBadges,
COUNT(CASE WHEN b.Name = 'bronze' THEN 1 END) AS BronzeBadges
FROM Users AS u
JOIN Badges AS b ON u.Id = b.UserId
JOIN ( SELECT OwnerUserId, COUNT(*) AS NumQuestions
FROM Posts
WHERE PostTypeId = (SELECT id FROM PostTypeIdMap WHERE value = 'Question')
GROUP BY OwnerUserId
ORDER BY NumQuestions DESC
LIMIT 10
) AS top10users ON u.Id = top10users.OwnerUserId
GROUP BY u.Id, u.DisplayName ;
R: -
query <- "SELECT u.DisplayName, COUNT(CASE WHEN b.Name = 'gold' THEN 1 END) AS GoldBadges,
+ COUNT(CASE WHEN b.Name = 'silver' THEN 1 END) AS SilverBadges,
+ COUNT(CASE WHEN b.Name = 'bronze' THEN 1 END) AS BronzeBadges
+ FROM Users AS u
+ JOIN Badges AS b ON u.Id = b.UserId
+ JOIN (
+ SELECT OwnerUserId, COUNT(*) AS NumQuestions
+ FROM Posts
+ WHERE PostTypeId = (SELECT id FROM PostTypeIdMap WHERE value = 'Question')
+ GROUP BY OwnerUserId
+ ORDER BY NumQuestions DESC
+ LIMIT 10
+ ) AS top10users ON u.Id = top10users.OwnerUserId
+ GROUP BY u.Id,...
SOLUTION.PDF

Answer To This Question Is Available To Download

Related Questions & Answers

More Questions »

Submit New Assignment

Copy and Paste Your Assignment Here