I attached assignment details below. The following note is from my professor


Dear students,
I'd like to clarify that for Question 1 of Assignment 2, there is NO hard requirement for you to use Spark (i.e., you can write ordinary Java or Python programs). As this was not made clear in the first place, I have decided on the following measures:
1. The deadline of Assignment 2 is extended to this Sunday (10/6).
2. If you use Spark to implement any part of the A-priori algorithm, you will get 2 bonus points.
3. If you further implement the SON algorithm, you will get 3 additional bonus points.





CSCE 4013-002: Assignment 2 (7pt)
Due 11:59pm Monday, October 4, 2019

1 Association Rules (4pt)

The action or practice of selling additional products or services to existing customers is called cross-selling. Giving product recommendations is one example of cross-selling frequently used by online retailers. One simple method of giving product recommendations is to recommend products that are frequently browsed together by customers.

Suppose we want to recommend new products to a customer based on the products they have already browsed on the online website. Write a program using the A-priori algorithm to find products which are frequently browsed together. Fix the support to s = 100 (i.e., product pairs need to occur together at least 100 times to be considered frequent) and find itemsets of size 2 and 3.

Use the online browsing behavior dataset "browsing.txt" as the input file. Each line represents a browsing session of a customer. On each line, each string of 8 characters represents the id of an item browsed during that session. The items are separated by spaces.

Identify pairs of items (X, Y) such that the support of {X, Y} is at least 100. For all such pairs, compute the confidence scores of the corresponding association rules: X ⇒ Y, Y ⇒ X. Sort the rules in decreasing order of confidence scores and list the top 5 rules in the report. Break ties, if any, by lexicographically increasing order on the left-hand side of the rule.

Identify item triples (X, Y, Z) such that the support of {X, Y, Z} is at least 100. For all such triples, compute the confidence scores of the corresponding association rules: (X, Y) ⇒ Z, (X, Z) ⇒ Y, (Y, Z) ⇒ X. Sort the rules in decreasing order of confidence scores and list the top 5 rules in the report. Order the left-hand-side pair lexicographically and break ties, if any, by lexicographical order of the first then the second item in the pair.

Bonus (3pt)

Write a Spark program based on the SON algorithm with A-priori as the in-memory algorithm for each subset of baskets.

What to submit

1. Submit the source code (.java or .py files).
2. Include in the report the top 5 association rules for pairs, and the top 5 association rules for triples.
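For illustration, here is a minimal plain-Python sketch of the two A-priori passes for frequent pairs described in Question 1. It is an outline under the stated file format, not a reference solution; the triple pass would follow the same pattern, counting only triples whose member pairs are all frequent.

from collections import defaultdict
from itertools import combinations

SUPPORT = 100

# Pass 1: count individual items and keep those occurring in >= 100 sessions.
item_counts = defaultdict(int)
with open("browsing.txt") as f:
    for line in f:
        for item in set(line.split()):
            item_counts[item] += 1
frequent_items = {i for i, c in item_counts.items() if c >= SUPPORT}

# Pass 2: count only candidate pairs whose members are both frequent
# (the A-priori pruning step), then keep pairs with support >= 100.
pair_counts = defaultdict(int)
with open("browsing.txt") as f:
    for line in f:
        basket = sorted(set(line.split()) & frequent_items)
        for pair in combinations(basket, 2):
            pair_counts[pair] += 1
frequent_pairs = {p: c for p, c in pair_counts.items() if c >= SUPPORT}

# Confidence of X => Y is support({X, Y}) / support({X}).
rules = []
for (x, y), c in frequent_pairs.items():
    rules.append((c / item_counts[x], x, y))  # X => Y
    rules.append((c / item_counts[y], y, x))  # Y => X

# Decreasing confidence; ties broken lexicographically on the left-hand side.
rules.sort(key=lambda r: (-r[0], r[1]))
for conf, lhs, rhs in rules[:5]:
    print(f"{lhs} => {rhs}  confidence = {conf:.4f}")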
2 Locality-Sensitive Hashing (3pt)

2.1

When simulating a random permutation of rows, we could save a lot of time if we restricted our attention to a randomly chosen k of the n rows, rather than hashing all the row numbers. The downside of doing so is that if none of the k rows contains a 1 in a certain column, then the result of the minhashing is "don't know", i.e., we get no row number as a minhash value (or ∞ if it is initialized so). It would be a mistake to assume that two columns that both minhash to "don't know" are likely to be similar. However, if the probability of getting "don't know" as a minhash value is small, we can tolerate the situation, and simply ignore such minhash values when computing the fraction of minhashes in which two columns agree.

2.1.1

Suppose a column has m 1's and therefore n − m 0's. Prove that the probability we get "don't know" as the minhash value for this column is at most ((n − k)/n)^m.

2.1.2

Suppose we want the probability of "don't know" to be at most e^(−10). Assuming n and m are both very large (but n is much larger than m or k), give a simple approximation to the smallest value of k that will assure this probability is at most e^(−10). Hints: (1) You can use ((n − k)/n)^m as the exact value of the probability of "don't know". (2) Remember that for large x, (1 − 1/x)^x ≈ 1/e.

2.2

When minhashing, one might expect that we could estimate the Jaccard similarity without using all possible permutations of rows. For example, we could only allow cyclic permutations, i.e., start at a randomly chosen row r, which becomes the first in the order, followed by rows r + 1, r + 2, and so on, down to the last row, and then continuing with the first row, second row, and so on, down to row r − 1. There are only n such permutations if there are n rows. However, these permutations are not sufficient to estimate the Jaccard similarity correctly. Give an example of two columns such that the probability (over cyclic permutations only) that their minhash values agree is not the same as their Jaccard similarity. In your answer, please provide (a) an example of a matrix with two columns (let the two columns correspond to sets denoted by S1 and S2); (b) the Jaccard similarity of S1 and S2; (c) the probability that a random cyclic permutation yields the same minhash value for both S1 and S2.

What to submit

• Proof for 2.1.1.
• Derivation and final answer for 2.1.2.
• Example for 2.2.
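As a sanity check on the kind of counterexample 2.2 asks for, the following brute-force snippet compares the Jaccard similarity against the agreement probability over cyclic permutations for one possible small matrix (the sets S1 and S2 below are an assumed example, not necessarily the intended one):

# One candidate matrix over n = 3 rows: S1 = {row 0}, S2 = {row 0, row 1}.
n = 3
S1 = {0}
S2 = {0, 1}

# Jaccard similarity = |S1 ∩ S2| / |S1 ∪ S2| = 1/2.
jaccard = len(S1 & S2) / len(S1 | S2)

# Count the cyclic permutations under which the two minhash values agree.
agree = 0
for r in range(n):
    order = [(r + i) % n for i in range(n)]  # cyclic order starting at row r
    minhash1 = next(row for row in order if row in S1)
    minhash2 = next(row for row in order if row in S2)
    agree += (minhash1 == minhash2)

print(f"Jaccard = {jaccard}, agreement probability = {agree}/{n}")
# -> Jaccard = 0.5, agreement probability = 2/3

Starting rows r = 0 and r = 2 put row 0 first for both sets, while r = 1 puts row 1 ahead for S2, so the agreement probability is 2/3 even though the Jaccard similarity is 1/2.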
Ximi answered on Oct 04 2021
from pyspark.sql import SparkSession
from pyspark.ml.fpm import FPGrowth

spark = SparkSession.builder.getOrCreate()

# Read the browsing data: one session per line, item ids separated by spaces
data = spark.sparkContext.textFile("browsing.txt")
transactions = data.map(lambda line: (1, list(set(line.strip().split(' ')))))

dataframe = ...
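The posted answer is cut off at the dataframe assignment. As a rough sketch, one way the FPGrowth route could continue is shown below. This is a hypothetical continuation, not the original author's code; note that FPGrowth is a different frequent-itemset algorithm from the A-priori method the assignment asks for, and its minSupport parameter is a fraction of all baskets rather than an absolute count.

# Hypothetical continuation: build a DataFrame of baskets and mine it with FPGrowth.
num_baskets = transactions.count()
dataframe = spark.createDataFrame(transactions, ["id", "items"])

# An absolute support of s = 100 corresponds to the fraction 100 / num_baskets.
fp = FPGrowth(itemsCol="items",
              minSupport=100.0 / num_baskets,
              minConfidence=0.0)
model = fp.fit(dataframe)

model.freqItemsets.show()  # frequent itemsets of all sizes, including pairs and triples
model.associationRules.orderBy("confidence", ascending=False).show(5)  # top rules by confidence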