
Please design the code according to the requirements below (split the work into separate functions).


Record Linkage

Due: February 13th at 5pm

You must work alone on this assignment. In this assignment, you will identify, or link, records from two datasets that refer to the same restaurant. The datasets, which come from two restaurant review companies, contain the names and addresses of a group of restaurants. The task of linking this information is non-trivial when the restaurants' names and addresses are not guaranteed to be identical across the two datasets. This assignment will give you experience in probabilistic record linkage, a technique that you will find useful when working with real-world data in the future.

We will train an algorithm to classify pairs of entries as matches, unmatches (that is, not a match), and possible matches. The training algorithm has several steps:

1. Convert each restaurant name and address into three fields: a name, a city, and a street address.
2. Compute the probability of a match using hand-labeled match data.
3. Compute the probability of an unmatch using randomly-chosen unmatch data.
4. Use the computed probabilities to train the classifier.

Once you have trained the algorithm, we will apply it to the full datasets.

Getting started

We have seeded your repository with a directory for this assignment. To pick it up, change to your ucid-win-20-username directory (where the string username should be replaced with your username) and then run the command: git pull upstream master. You should also run git pull to make sure your local copy of your repository is in sync with the server. As in previous assignments, you will need to run this command on your VM and in the lab if you plan to use both.

The directory pa3 includes the following files:

- record_linkage.py: skeleton file for the assignment (you will add code to this file only)
- test_traffic_stops.py: test code
- util.py: utility functions
- get_files.sh: shell script for getting the data

Data

To get the data for this assignment, you need to run the following command from the Linux command line:

    $ ./get_files.sh

Running this script will download two directories: data and output. The first contains the input data for the assignment; the second contains the expected output.

We will be using data provided by Zagat and Fodor's. The companies provide the information for a given restaurant as a single string. We have split these strings into name, city, and address fields using regular expressions and constructed two CSV files (data/zagat.csv and data/fodors.csv). This automated extraction process generates a less-than-perfect split for some restaurants. While the results are not perfect, the process used is realistic, and the data is in sufficiently good shape to use with the probabilistic algorithm that we will employ.

The files data/zagat.csv and data/fodors.csv each contain four columns: index, restaurant name, city, and street address. Here, for example, are the first few rows from data/zagat.csv:

    index,restaurant name,city,street address
    0,Apple Pan The,West LA,10801 W. Pico Blvd.
    1,Arnie Morton's of Chicago,Los Angeles,435 S. La Cienega Blvd.
    2,Art's Deli,Studio City,12224 Ventura Blvd.
    3,Asahi Ramen,West LA,2027 Sawtelle Blvd.
    4,Baja Fresh,Westlake Village,3345 Kimber Dr.

You must read the review data into Pandas dataframes. When you load the data, we recommend (1) using the index column, which contains a unique identifier (an integer) for each row, as the row index, and (2) setting the type of the remaining columns to str.
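For concreteness, here is a minimal sketch of one way to load the review data with pandas, following the recommendations above. The function name load_review_data and the default paths are illustrative assumptions, not part of the provided skeleton:

    import pandas as pd

    def load_review_data(zagat_path="data/zagat.csv", fodors_path="data/fodors.csv"):
        """Load the two review datasets, using the index column as the row
        index and reading the remaining columns as strings."""
        str_cols = {"restaurant name": str, "city": str, "street address": str}
        zagat = pd.read_csv(zagat_path, index_col="index", dtype=str_cols)
        fodors = pd.read_csv(fodors_path, index_col="index", dtype=str_cols)
        return zagat, fodors

    # Example usage:
    # zagat, fodors = load_review_data()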
In addition to the review data, we have also provided a training dataset of known links in a file named data/known_links.csv. This file contains 50 manually-linked pairs of restaurants. In other words, a human chose some of the rows in one dataset and determined the corresponding rows in the other dataset. The first restaurant in each pair is from the Zagat dataset and the second is from the Fodor's dataset. Rows in the known-links file correspond to these pairs, and the two columns give the indexes for a restaurant within the corresponding files. For example, the first non-header line of the known-links file is "269,386", so the row with index 269 in the Zagat dataset is deemed to be a "match" with the row with index 386 in the Fodor's dataset.

We've also provided a file, data/unmatch_pairs.csv, that contains 1000 pairs of randomly-chosen indexes, one from each review dataset. We will deem these pairs to be unmatches. Why are we using 1000 random pairs for unmatches when we only have 50 pairs for matches? The more pairs we have, the better the accuracy of our record linkage. Having more matches would have been helpful as well, but that requires expensive manual work.

Your first task is to load the data.

Computing the probability of a match or an unmatch

We will explain the process of converting the dataframes, along with the matched and unmatched pairs, into something more abstract that can be used to help classify other restaurants as matches or unmatches in two steps. First, we will explain how to compute similarity tuples, and then how to use the resulting tuples to estimate the probability of a match or an unmatch.

Constructing similarity tuples

In this assignment, we will compute the similarity of the name, city, and address fields separately. A natural measure of similarity for strings is the edit distance, also known as the Levenshtein distance. It is defined as the minimum number of single-character insertions, deletions, and substitutions required to convert one string into another. The Levenshtein distance is computationally intensive, so in practice a related measure, called the Jaro-Winkler distance, is used instead.

The exact definition of the Jaro-Winkler distance is somewhat technical (but available at https://en.wikipedia.org/wiki/Jaro%E2%80%93Winkler_distance). Also, although it is called a distance, it actually measures the similarity between two strings: a complete match ("Medici" and "Medici") has a score of 1.0, and a complete mismatch ("Medici" and "Subway") has a score of 0.0.

You will use the Python library jellyfish (https://github.com/jamesturk/jellyfish) to compute the Jaro-Winkler distance between two strings. It is already installed on the CS machines, but you will need to install it on your VM using the following command:

    $ sudo -H pip3 install jellyfish

Specifically, we'll use the function jellyfish.jaro_winkler to compute the Jaro-Winkler distance.
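Returning briefly to the loading task, here is a minimal sketch of reading the two pair files described above. The function name load_pairs is an illustrative assumption; the only facts relied on are that each file has a header row and two columns of integer indexes:

    import pandas as pd

    def load_pairs(path):
        """Read a CSV of index pairs (a Zagat index and a Fodor's index per
        row) and return them as a list of (zagat_index, fodors_index) tuples."""
        pairs = pd.read_csv(path)
        return list(pairs.itertuples(index=False, name=None))

    # Example usage:
    # matches = load_pairs("data/known_links.csv")
    # unmatches = load_pairs("data/unmatch_pairs.csv")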
To make the Jaro-Winkler discussion more concrete, here are the results of computing the Jaro-Winkler score for the name, city, and address fields from the first matched pair (index 269 from the Zagat dataset and index 386 from the Fodor's dataset):

    In [1]: import jellyfish

    In [2]: jellyfish.jaro_winkler('Ritz-Carlton Restaurant',
       ...:                        'Restaurant Ritz-Carlton Atlanta')
    Out[2]: 0.7786561264822135

    In [3]: jellyfish.jaro_winkler('Atlanta', 'Atlanta')
    Out[3]: 1.0

    In [4]: jellyfish.jaro_winkler('181 Peachtree St.',
       ...:                        '181 Peachtree St.')
    Out[4]: 1.0

As we will see in class, the Jaro-Winkler similarity scores do not increase monotonically, and so a "cut-off" on the similarity for a single field, or for many fields, will not serve our purpose. Instead, we break up the range of similarity values into discrete chunks, which we will call "low", "medium", and "high", and determine the probability of a match for each combination of field-wise chunks separately.

We have provided a simple utility function get_jw_category() in util.py that breaks up the range of Jaro-Winkler similarity scores into three blocks. It takes a Jaro-Winkler distance and returns the string "low", "medium", or "high". We will apply this function to all three fields: names, cities, and addresses. Thus, for any pair of rows from the review datasets, we can create a similarity tuple (name_category, city_category, address_category) that represents the similarity of the fields for the corresponding pair of rows. Applying this process to the rows that correspond to the indexes 269 and 386 in the Zagat and Fodor's datasets, respectively, yields the similarity tuple ('low', 'high', 'high').

Estimating the probabilities

Whether, during record linkage, a pair of rows should be classified as a match, unmatch, or something in between will depend on whether its (name_category, city_category, address_category) tuple was most closely associated with matches or unmatches when we trained the algorithm, and on our tolerance for error. Specifically, we will determine whether a tuple should be classified as a match, possible match, or unmatch based on estimates of the probability that a matched pair results in that tuple as well as the probability that an unmatched pair results in that tuple.

Formally, every potential pair of rows yields a similarity tuple, and there is a fixed set of possible tuples. For every possible tuple we need estimates for two quantities: the probability that a matched pair produces that tuple, and the probability that an unmatched pair produces that tuple.

Your task is to compute estimates for the former by iterating through all the pairs of rows corresponding to the known matches, determining their similarity tuples, and counting the frequency of each of the 27 possible similarity tuples (combinations of the three similarity levels in the three different positions) during this process. We'll use the relative frequency of a tuple as our estimate of the probability of that tuple given a matching pair. Similarly, find an estimate for the probability given an unmatch by iterating through the unmatch pairs and computing the frequency of the corresponding similarity tuples.

To make this more concrete: your second task is to construct two dictionaries, one that maps similarity tuples to match probabilities and the other that maps similarity tuples to unmatch probabilities. A sketch of this counting step appears below.

Partition tuples into match, possible match, and unmatch sets

This step is tricky. It is important that you read this section extremely carefully.
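Before turning to the partitioning details, here is the sketch of the tuple-counting step promised above. The helper names similarity_tuple and estimate_probabilities are illustrative assumptions rather than the skeleton's required interface; the sketch assumes the dataframes and pair lists loaded earlier, and uses the provided util.get_jw_category(). (Newer jellyfish releases rename jaro_winkler to jaro_winkler_similarity, but this sketch follows the assignment's naming.)

    import jellyfish
    import util  # provided module containing get_jw_category()

    FIELDS = ("restaurant name", "city", "street address")

    def similarity_tuple(zagat_row, fodors_row):
        """Compute the (name, city, address) similarity categories for one pair of rows."""
        return tuple(
            util.get_jw_category(jellyfish.jaro_winkler(zagat_row[col], fodors_row[col]))
            for col in FIELDS
        )

    def estimate_probabilities(pairs, zagat, fodors):
        """Map each similarity tuple to its relative frequency among the given pairs."""
        counts = {}
        for z_idx, f_idx in pairs:
            t = similarity_tuple(zagat.loc[z_idx], fodors.loc[f_idx])
            counts[t] = counts.get(t, 0) + 1
        return {t: c / len(pairs) for t, c in counts.items()}

    # match_probs = estimate_probabilities(matches, zagat, fodors)      # 50 known links
    # unmatch_probs = estimate_probabilities(unmatches, zagat, fodors)  # 1000 random pairs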
The goal of the partitioning step is to compute a dictionary that maps each similarity tuple to one of "match", "possible match", or "unmatch". How should we decide which tuples belong in which of the three categories? It depends on our tolerance for two kinds of errors: the false positive rate, the probability that we classify an actual unmatch as a match, and the false negative rate, the probability that we classify an actual match as an unmatch. Assume we are given the maximum false positive rate and the maximum false negative rate that we are willing to accept, and that, given these constraints are satisfied, we want to maximize the number of our matches. In order to classify which tuples should be associated