This assignment requires the answer for each questions. I give you one zip file and 3 datasets and one scenario(pdf file) in it. please answer the questions and code them by using datasets and Jupyter Notebook (Python Language).
Scenario is divided into three part, when you finish one part please give me word file(.doc / .docx). because, I want to ask your work is correct or not to my instructor.
COMP8320 Information and Data Security Assignment 1 Total marks: 30 Weighting: 15% Deadline: Sunday (End of Week 8), 4 October 2020 (11:59 pm). Note: Submit the assignment via Turnitin (Include Student Name and ID in assignment). Objectives This assignment has been designed to test your knowledge of the first four lectures in COMP8320. Notes • Assumptions (if any) must be stated clearly in your answers. • There may not be one right answer for some of the questions. So, your explanations need to present your case clearly. The explanations you provide do not have to be long. • The assignment questions can be done using Python, pandas and NumPy. It is highly recommended that you use the Jupyter Notebook. You should include your code in your assignment submission. Your code should not be a screenshot. Instead, the text should be selectable. 1 Submission • On line submission via Turnitin. Assignments will be marked and returned online. There are no hardcopy submissions for written assignments. Ensure you submit the correct file. The submission process shows you a complete preview of your entire assignment after you have uploaded it but before you have submitted it. Carefully check through every single page to ensure everything is there and the correct version has been uploaded. Multiple submissions may be possible via Turnitin prior to the final due date and time of an assessment task and originality reports may be made available to students to view and check their levels of similarity prior to making a final submission. Students are encouraged to use these reports to ensure that they do not breach the Academic Honesty Policy through high levels of similarity (plagiarism). Teaching staff will use the report to judge whether plagiarism has occurred and whether penalties should apply for breaches of the Academic Honesty Policy. Any similar text identified by Turnitin will be considered carefully to see if it is indeed a breach of the Academic Honesty Policy. 2 1 Question 1 (10 marks) Suppose the dataset “adult.data.csv” has been uploaded as open data by a multinational tech company, after removing obvious identifiers, e.g., names and addresses. The dataset contains information about all its employees worldwide. You know one of the employees in the dataset, Alice. You further know the following facts about Alice (as background knowledge): age: 60 workclass: Private marital-status: Divorced race: Black sex: Female You seek to uniquely re-identify Alice in the dataset to learn (infer) more about her. (Hint: When doing comparison in Python it may be handy to convert all values to strings and use the strip() command to remove any white spaces.) (a) Write a program that matches each row in the adult.data.csv dataset against your background knowl- edge, and returns all matching rows. (6 marks) (b) How many rows match your background knowledge? (1 mark) (c) You recall that you know an additional piece of information about Alice, that her “education” is 10th. How many rows match your background knowledge now? (1 mark) (d) What are the retrieved rows from part (c)? (2 marks) 2 Question 2 (10 marks) A large supermarket has released a dataset of all purchases made by its 10,000 customers after removing names, as “purchase-rest.csv”. The purchases are categorised into 40 items (numbered 1 to 40, inclusive). If a customer buys an item, the corresponding cell is 1, otherwise it is 0. You know that Bob is one of the customers. And you know his past buying habits (“bob.csv”) as background knowledge. You would like to identify which row in the released dataset is likely to be Bob’s data, thus re-identifying Bob. Note that there could be some items marked as 1 in bob.csv but marked as 0 in purchase-dense.csv in Bob’s row, and vice versa. Hence you cannot do a direct (subset) match as in Question 1. Instead, you seek to use a similarity metric, i.e., the Jaccard index (see Appendix). (a) Write a program in Python that takes a list of binary values, and converts it into a set of indexes which are labelled 1 in the list. (2 marks) (b) Write a program in Python that takes two sets and computes their Jaccard index. (2 marks) (c) Write a program that computes the Jaccard index of each row in the purchase-rest.csv dataset and your background knowledge about Bob, and outputs the top two rows with the highest Jaccard index. (4 marks) (d) What are the row indexes, the set of items bought and the Jaccard index of the top two rows. (2 marks) 3 3 Question 3 (10 marks) Consider the restricted Adult dataset “adult-rest.csv.” The dataset contains only the Race, Sex and Salary attributes. You have been tasked to release a differentially private variant of this dataset using the Stability-based Histogram algorithm (see Appendix). You choose � = 1 and δ = 10−6. Let D be the initial dataset and let D′ be the differentially private dataset. (a) Implement the algorithm in Python to obtain an output dataset D′. (5 marks) (b) What is the “threshold” in Step 7 of the algorithm (see Appendix). (1 mark) (c) Which points (rows) from D do not appear in D′ and why? (2 marks) (d) For any point query qx, where x is a point in the domain, let |qx(D) − qx(D′)| be the absolute error between the original (true) answer from D and the noisy answer from D′. Which point in the domain has the largest error, and what is the error? (2 marks) References [BNS16] Mark Bun, Kobbi Nissim, and Uri Stemmer. Simultaneous private learning of multiple concepts. In Proceedings of the 2016 ACM Conference on Innovations in Theoretical Computer Science, ITCS ’16, pages 369–380, New York, NY, USA, 2016. ACM. [BV18] Victor Balcer and Salil Vadhan. Differential privacy on finite computers. In 9th Innovations in Theoretical Computer Science Conference (ITCS 2018). Schloss Dagstuhl-Leibniz-Zentrum fuer Informatik, 2018. [DDN03] Danny Dolev, Cynthia Dwork, and Moni Naor. Nonmalleable cryptography. SIAM review, 45(4):727–784, 2003. [Vad17] Salil Vadhan. The complexity of differential privacy. In Tutorials on the Foundations of Cryptog- raphy, pages 347–450. Springer, 2017. 4 4 Appendix 4.1 Jaccard Index Let A and B be two sets. Then their Jaccard index J is defined as: J(A,B) = |A ∩B| |A ∪B| , if A ∪ B is not empty, and J(A,B) = 0, if A ∪ B is empty. For example, if A = {1, 2, 3} and B = {2, 3, 4}, then: J(A,B) = 2 4 = 1 2 . Note that for any two sets A and B, 0 ≤ J(A,B) ≤ 1. 4.2 Stability-based Histogram Algorithm This algorithm is taken from [BNS16, Vad17, BV18]. Algorithm 1: Stability-based Histogram Input: Domain X , dataset D ∈ Xn, parameters � and δ. 1 Randomly shuffle the rows of D. 2 Initialize Dout ← ∅. 3 for each distinct point x ∈ D do 4 if qx(D) = 0 then set ax = 0. 5 if qx(D) > 0 then 6 Set ax ← qx(D) + Lap( 2� ). 7 if ax < 2="" ln="" (="" 2="" δ="" )="" �+="" 1="" then="" 8="" set="" ax="" ←="" 0.="" 9="" else="" 10="" set="" ax="" ←="" round(ax).="" 11="" if="" ax=""> 0 then append ax copies of x to Dout. 12 Output Dout. 5 Question 1 (10 marks) Question 2 (10 marks) Question 3 (10 marks) Appendix Jaccard Index Stability-based Histogram Algorithm