{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Problem Set 1\n",
"\n",
"## Overview\n",
"\n",
"**Entity resolution**, also known as record deduplication, is the process of **identifying rows** in one or more data sets that **refer to the same real world entity**. This is a common, yet difficult problem in data cleaning and integration. \n",
"\n",
"In this assignment, you will use scalable text analysis techniques to perform entity resolution across two data sets of commercial products.\n",
"\n",
"### Instructions\n",
"\n",
"1. You can create both code cells (for the source code) as well as Markdown cells (for the annotations/explanations) as needed in the notebook. Do not try to cram a lot of code in a single cell, but break it logically. \n",
"\n",
"2. Some empty code cells beginning with the **%%time** magic command, have been provided. Please **do not delete** these commands: they are there to help you see how long it takes to run the code! \n",
"\n",
"3. All code cells that begin with the comment `## DO NOT MODIFY` should not be edited. Some of them are there just to make sure that you are following naming conventions in creating the correct objects.\n",
"\n",
"4. Part 3 of the assignment **must be completed by graduate students**; it is optional (for extra credit) for undergraduate students. \n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"\n",
"### Files\n",
"\n",
"Your assignment is to perform entity resolution over two web-scraped data sets of commercial product listings, one from Amazon, and one from Google.\n",
" "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Data files for this assignment can be found under `Resources/Data/set1data.tar.gz`:\n",
"\n",
"The zip file includes the following files:\n",
"\n",
"* **`Google.csv`**, the Google Products data set \n",
"* **`Amazon.csv`**, the Amazon data set\n",
"* **`Google_small.csv`**, a sample from the Google Products data set \n",
"* **`Amazon_small.csv`**, a sample from the Amazon data set\n",
"* **`Amazon_Google_perfectMapping.csv`**, the \"gold standard\" mapping\n",
"\n",
"The sample files will only be used in **Part I**.\n",
"\n",
"The \"gold standard\" file contains all of the true mappings between entities in the two data sets: each row has a pair of record IDs (one Google, one Amazon) from the corresponding datasets that describe the same thing in the real world."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Guidelines\n",
"\n",
"#### Code\n",
"\n",
"You will use several libraries alongside basic Python: **pandas**, **numpy**, **scipy** and **scikit-learn** for data processing and analysis, and **matplotlib** or **bokeh** for visualization. Treat this assignment as a gentle introduction to the APIs for these libraries."
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {
"tags": []
},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"Invalid switch - \"\".\n"
]
}
],
"source": [
"## DO NOT MODIFY\n",
"\n",
"import pandas as pd\n",
"pd.set_option('display.max_colwidth', None)\n",
"from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer\n",
"\n",
"import matplotlib.pyplot as plt\n",
"%matplotlib inline\n",
"\n",
"import scipy.spatial.distance as dist\n",
"import numpy as np\n",
"from bokeh.io import output_notebook, show\n",
"import bokeh.plotting\n",
"\n",
"%ls data/"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Part 0: Preliminaries\n",
"\n",
"### Exercise 0\n",
"\n",
"Download the data and unzip the files into a folder called `data` in the same directory as this notebook. We start by processing these files with `pandas`. \n",
"\n",
"For each of the data files, create DataFrames whose indexes will be the IDs: \n",
"the IDs are the zero-th column of the file (they are URLs for Google, and alphanumeric strings for Amazon). Use the \"iso-8859-1\" encoding while parsing the csv files, and make the price information numeric. Adjust the column names so that they are identical across all the data frames. Eventually, you should have four datafiles with identical indexes named **`id`** and identical column names `name`,`description`, `manufacturer`, `price` \n",
"\n",
"Also create a `gold_standard` frame from `Amazon_Google_perfectMapping.csv` with a plain numeric index.\n"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [],
"source": [
"goog_s = pd.read_csv(\"D:\\\\New\\\\huber\\\\Google_small.csv\", engine = \"python\", index_col=0, encoding = \"ISO-8859-1\")\n",
"amaz_s = pd.read_csv(\"D:\\\\New\\\\huber\\\\Amazon_small.csv\", engine = \"python\", index_col=0, encoding = \"ISO-8859-1\")\n",
"amaz_s = amaz_s.rename(columns={\"title\": \"name\"})\n",
"gold_standard = pd.read_csv(\"D:\\\\New\\\\huber\\\\Amazon_Google_perfectMapping.csv\", engine = \"python\", encoding = \"ISO-8859-1\")\n",
"goog = pd.read_csv(\"D:\\\\New\\\\huber\\\\Google.csv\", engine = \"python\", index_col=0, encoding = \"ISO-8859-1\")\n",
"amaz = pd.read_csv(\"D:\\\\New\\\\huber\\\\Amazon.csv\", engine = \"python\", index_col=0, encoding = \"ISO-8859-1\")\n",
"amaz = amaz.rename(columns={\"title\": \"name\"})"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"
\n",
"\n",
"
\n",
"\n",
"\n",
" | \n",
"name | \n",
"description | \n",
"manufacturer | \n",
"price | \n",
"
\n",
"\n",
"id | \n",
" | \n",
" | \n",
" | \n",
" | \n",
"
\n",
"\n",
"\n",
"\n",
"b000jz4hqo | \n",
"clickart 950 000 - premier image pack (dvd-rom) | \n",
"NaN | \n",
"broderbund | \n",
"0.0 | \n",
"
\n",
"\n",
"
\n",
"
"
],
"text/plain": [
" name description \\\n",
"id \n",
"b000jz4hqo clickart 950 000 - premier image pack (dvd-rom) NaN \n",
"\n",
" manufacturer price \n",
"id \n",
"b000jz4hqo broderbund 0.0 "
]
},
"execution_count": 3,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"## DO NOT MODIFY\n",
"\n",
"goog_s.head(1)\n",
"amaz_s.head(1)\n",
"gold_standard.head(2)\n",
"goog.head(1)\n",
"amaz.head(1)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Exercise 1: Rudimentary Exploration\n",
"\n",
"Explore the data using pandas and its plotting functions to understand some characteristics of the data. For instance (and do add more exploratory code/questions here):\n",
"\n",
"- how much information is missing in the two data sets and where?\n",
"\n",
"- could the price and the manufacturer data be useful in resolving entities across the two datasets? Explain your answer.\n"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"name 0\n",
"description 191\n",
"manufacturer 2994\n",
"price 0\n",
"dtype: int64\n"
]
}
],
"source": [
"## Missing information in Google dataset\n",
"\n",
"goog_mis = goog.isnull().sum(axis = 0)\n",
"print(goog_mis)"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"name 0\n",
"description 191\n",
"manufacturer 2994\n",
"price 0\n",
"dtype: int64\n"
]
}
],
"source": [
"## Missing information in Amazon dataset\n",
"\n",
"amaz_mis = goog.isnull().sum(axis = 0)\n",
"print(amaz_mis)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Comments\n",
"\n",
"> Both the datasets have 191 missing values in the description column and 2994 missing values in the manufacturer column."
]
},
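{
"cell_type": "markdown",
"metadata": {},
"source": [
"For the second question above (whether price and manufacturer could help resolve entities), the cell below is an optional exploratory sketch, not part of the required solution; it assumes the `goog` and `amaz` frames from Exercise 0 with numeric prices."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"## Optional exploratory sketch (assumes `goog` and `amaz` from Exercise 0)\n",
"\n",
"# How many records actually carry a usable (positive) price?\n",
"print('Google records with price > 0:', (goog['price'] > 0).sum(), 'of', len(goog))\n",
"print('Amazon records with price > 0:', (amaz['price'] > 0).sum(), 'of', len(amaz))\n",
"\n",
"# Overlap of (lower-cased) manufacturer names between the two data sets\n",
"goog_mfr = set(goog['manufacturer'].dropna().str.lower())\n",
"amaz_mfr = set(amaz['manufacturer'].dropna().str.lower())\n",
"print('Manufacturers appearing in both data sets:', len(goog_mfr & amaz_mfr))\n",
"\n",
"# Side-by-side price histograms (positive prices only)\n",
"fig, axes = plt.subplots(1, 2, figsize=(10, 3), sharey=True)\n",
"goog.loc[goog['price'] > 0, 'price'].plot.hist(bins=50, ax=axes[0], title='Google prices')\n",
"amaz.loc[amaz['price'] > 0, 'price'].plot.hist(bins=50, ax=axes[1], title='Amazon prices')\n",
"plt.show()"
]
},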
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Part 1: ER with Pairwise Similarity\n",
"\n",
"A simple approach to entity resolution is to treat all records as sets of strings and compute their similarity with a set distance function. This is obviously a **quadratic**-complexity operation so only use the **sample** data here, i.e. \n",
"> you will only use the dataframes `goog_s` and `amaz_s` created above in Exercise 0.\n"
]
},
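{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a toy illustration of the set-similarity idea (not part of the solution): the Jaccard similarity of two token sets $A$ and $B$ is $|A \\cap B| / |A \\cup B|$, and the Jaccard distance is one minus that."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"## Toy illustration only: Jaccard similarity of two small token sets (made-up product names)\n",
"a = set('canon powershot sd1000 digital camera'.split())\n",
"b = set('canon powershot sd1000 7.1mp camera silver'.split())\n",
"print(len(a & b) / len(a | b))   # 4 shared tokens out of 7 distinct tokens -> about 0.57"
]
},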
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Exercise 2\n",
"\n",
"Use the `scikit-learn` feature extraction module to create `CountVectorizer`s and `TfidfVectorizer`s for the **combined** sample data files. In particular:\n",
"\n",
"- create a new dataframe containing first the Google samples and subsequently Amazon samples: make sure that you do not change this order in subsequent processing because it gives you a way to use the first 200 integers as the indices for the Google samples and the next 200 for the Amazon samples.\n",
"\n",
"- create a new column called **`info`** in this dataframe which will contain the `name` and `description` within each record combined into a single string (join them together with a space in between).\n",
"\n",
"- create a `CountVectorizer` valled `cv` with binary counts of **1-grams (single words)** avoiding English stopwords.\n",
"\n",
"- similarly, create a `TfidfVectorizer` called `tfidf` with 1-grams avoiding English stopwords.\n",
"\n",
"Use the `fit_transform` methods to obtain the feature matrices for the vectorizers. At this point, you should have 2 models called `cv_model` and `tfidf_model` corresponding to the Vectorizer objects `cv` and `tfidf` respectively. \n"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [],
"source": [
"combined = goog_s.append(amaz_s)\n",
"\n",
"combined['info'] = combined['name'].str.cat(combined['description'],sep=\" \")"
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [],
"source": [
"cv = CountVectorizer(input='content', encoding='ISO-8859-1', decode_error='ignore', analyzer='word',\n",
" ngram_range=(1,1), stop_words='english', binary=True)\n",
"\n",
"tfidf = TfidfVectorizer(input='content', encoding='ISO-8859-1', decode_error='ignore', analyzer='word',\n",
" ngram_range=(1,1), stop_words='english')\n",
"\n",
"cv_model = cv.fit_transform(combined)\n",
"tfidf_model = tfidf.fit_transform(combined)"
]
},
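{
"cell_type": "markdown",
"metadata": {},
"source": [
"An optional sanity check on the feature matrices (a sketch, assuming `cv_model` and `tfidf_model` from the cell above): both should have one row per record in `combined`."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"## Optional sanity check on the feature matrices\n",
"print('Count matrix shape: ', cv_model.shape)    # (number of records, vocabulary size)\n",
"print('TF-IDF matrix shape:', tfidf_model.shape)\n",
"print('A few vocabulary terms:', sorted(cv.vocabulary_)[:10])"
]
},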
{
"cell_type": "code",
"execution_count": 8,
"metadata": {
"tags": []
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Wall time: 0 ns\n"
]
}
],
"source": [
"%%time\n",
"\n",
"pass"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Exercise 3\n",
"\n",
"**(a)** Compute the pairwise Jaccard distances between all the Google sample entities and all the Amazon sample entities separately for the CountVectorizer model (the `scipy.spatial.distance` module has a function called `cdist` that can be used). Call this array `jacc_dist`.\n",
"\n",
"Next, subtract Jaccard distances from 1 to obtain the Jaccard similarities: these will \n",
"be in an array whose rows are Google samples and whose columns are Amazon samples. Call this array `jacc_sim`."
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {},
"outputs": [],
"source": [
"jacc_dist = dist.cdist(goog_s.isin(gold_standard).iloc[2:2], amaz_s.isin(gold_standard).iloc[2:2])\n",
"jacc_sim = 1 - jacc_dist"
]
},
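{
"cell_type": "markdown",
"metadata": {},
"source": [
"An optional way to eyeball the result (a sketch, assuming `jacc_sim`, `combined` and `n_goog` from the cells above): for one Google sample, look up its most similar Amazon sample."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"## Optional: peek at the best Amazon match for one Google sample (illustration only)\n",
"best_cols = jacc_sim.argmax(axis=1)   # most similar Amazon sample for each Google sample\n",
"g = 0                                 # an arbitrary Google sample to inspect\n",
"print('Google:', combined['info'].iloc[g][:80])\n",
"print('Amazon:', combined['info'].iloc[n_goog + best_cols[g]][:80])\n",
"print('Jaccard similarity:', jacc_sim[g, best_cols[g]])"
]
},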
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"***"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**(b)** Create a **heatmap** to visualize the similarities. Explore the matplotlib or Bokeh APIs/examples to figure this out."
]
},
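{
"cell_type": "markdown",
"metadata": {},
"source": [
"One minimal way to do this with matplotlib is sketched below (it assumes the `jacc_sim` array from Exercise 3(a)); a Bokeh `image` glyph would be an alternative."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"## Minimal matplotlib heatmap sketch (assumes jacc_sim from Exercise 3(a))\n",
"plt.figure(figsize=(6, 6))\n",
"plt.imshow(jacc_sim, cmap='viridis', aspect='auto')\n",
"plt.colorbar(label='Jaccard similarity')\n",
"plt.xlabel('Amazon sample index')\n",
"plt.ylabel('Google sample index')\n",
"plt.show()"
]
},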
{
"cell_type": "code",
"execution_count": 10,
"metadata": {
"tags": []
},
"outputs": [
{
"data": {
"image/png":...