set1/set1.ipynb { "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Problem Set 1\n", "\n", "## Overview\n", "\n", "**Entity resolution**, also known as record deduplication, is the...

1 answer below ยป
Please help.


set1/set1.ipynb { "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Problem Set 1\n", "\n", "## Overview\n", "\n", "**Entity resolution**, also known as record deduplication, is the process of **identifying rows** in one or more data sets that **refer to the same real world entity**. This is a common, yet difficult problem in data cleaning and integration. \n", "\n", "In this assignment, you will use scalable text analysis techniques to perform entity resolution across two data sets of commercial products.\n", "\n", "### Instructions\n", "\n", "1. You can create both code cells (for the source code) as well as Markdown cells (for the annotations/explanations) as needed in the notebook. Do not try to cram a lot of code in a single cell, but break it logically. \n", "\n", "2. Some empty code cells beginning with the **%%time** magic command, have been provided. Please **do not delete** these commands: they are there to help you see how long it takes to run the code! \n", "\n", "3. All code cells that begin with the comment `## DO NOT MODIFY` should not be edited. Some of them are there just to make sure that you are following naming conventions in creating the correct objects.\n", "\n", "4. Part 3 of the assignment **must be completed by graduate students**; it is optional (for extra credit) for undergraduate students. \n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "### Files\n", "\n", "Your assignment is to perform entity resolution over two web-scraped data sets of commercial product listings, one from Amazon, and one from Google.\n", " " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Data files for this assignment can be found under `Resources/Data/set1data.tar.gz`:\n", "\n", "The zip file includes the following files:\n", "\n", "* **`Google.csv`**, the Google Products data set \n", "* **`Amazon.csv`**, the Amazon data set\n", "* **`Google_small.csv`**, a sample from the Google Products data set \n", "* **`Amazon_small.csv`**, a sample from the Amazon data set\n", "* **`Amazon_Google_perfectMapping.csv`**, the \"gold standard\" mapping\n", "\n", "The sample files will only be used in **Part I**.\n", "\n", "The \"gold standard\" file contains all of the true mappings between entities in the two data sets: each row has a pair of record IDs (one Google, one Amazon) from the corresponding datasets that describe the same thing in the real world." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Guidelines\n", "\n", "#### Code\n", "\n", "You will use several libraries alongside basic Python: **pandas**, **numpy**, **scipy** and **scikit-learn** for data processing and analysis, and **matplotlib** or **bokeh** for visualization. Treat this assignment as a gentle introduction to the APIs for these libraries." 
] }, { "cell_type": "code", "execution_count": null, "metadata": { "tags": [] }, "outputs": [], "source": [ "## DO NOT MODIFY\n", "\n", "import pandas as pd\n", "pd.set_option('display.max_colwidth', None)\n", "from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer\n", "\n", "import matplotlib.pyplot as plt\n", "%matplotlib inline\n", "\n", "import scipy.spatial.distance as dist\n", "import numpy as np\n", "from bokeh.io import output_notebook, show\n", "import bokeh.plotting\n", "\n", "%ls data/" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Part 0: Preliminaries\n", "\n", "### Exercise 0\n", "\n", "Download the data and unzip the files into a folder called `data` in the same directory as this notebook. We start by processing these files with `pandas`. \n", "\n", "For each of the data files, create DataFrames whose indexes will be the IDs: \n", "the IDs are the zero-th column of the file (they are URLs for Google, and alphanumeric strings for Amazon). Use the \"iso-8859-1\" encoding while parsing the csv files, and make the price information numeric. Adjust the column names so that they are identical across all the data frames. Eventually, you should have four datafiles with identical indexes named **`id`** and identical column names `name`,`description`, `manufacturer`, `price` \n", "\n", "Also create a `gold_standard` frame from `Amazon_Google_perfectMapping.csv` with a plain numeric index.\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "tags": [] }, "outputs": [], "source": [ "## TODO Exercise 0 here\n", "\n", "# goog_s, goog: DataFrames for the Google sample and large datasets respectively\n", "# amaz_s, amaz: DataFrames for the Amazon sample and large datasets respectively\n", "# gold_standard: Dataframe with gold standard mapping" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "## DO NOT MODIFY\n", "\n", "goog_s.head(1)\n", "amaz_s.head(1)\n", "gold_standard.head(2)\n", "goog.head(1)\n", "amaz.head(1)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "for df in [amaz, amaz_s]:\n", " assert df.columns == ['name', 'description', 'manufacturer', 'price']" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Exercise 1: Rudimentary Exploration\n", "\n", "Explore the data using pandas and its plotting functions to understand some characteristics of the data. For instance (and do add more exploratory code/questions here):\n", "\n", "- how much information is missing in the two data sets and where?\n", "\n", "- could the price and the manufacturer data be useful in resolving entities across the two datasets? Explain your answer.\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# TODO Exercise 1\n", "\n", "pass" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Comments\n", "\n", "> Summarize your exploratory observations from Exercise 1's code." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Part 1: ER with Pairwise Similarity\n", "\n", "A simple approach to entity resolution is to treat all records as sets of strings and compute their similarity with a set distance function. This is obviously a **quadratic**-complexity operation so only use the **sample** data here, i.e. 
\n", "> you will only use the dataframes `goog_s` and `amaz_s` created above in Exercise 0.\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Exercise 2\n", "\n", "Use the `scikit-learn` feature extraction module to create `CountVectorizer`s and `TfidfVectorizer`s for the **combined** sample data files. In particular:\n", "\n", "- create a new dataframe containing first the Google samples and subsequently Amazon samples: make sure that you do not change this order in subsequent processing because it gives you a way to use the first 200 integers as the indices for the Google samples and the next 200 for the Amazon samples.\n", "\n", "- create a new column called **`info`** in this dataframe which will contain the `name` and `description` within each record combined into a single string (join them together with a space in between).\n", "\n", "- create a `CountVectorizer` valled `cv` with binary counts of **1-grams (single words)** avoiding English stopwords.\n", "\n", "- similarly, create
Answered Same DaySep 26, 2021

Answer To: set1/set1.ipynb { "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Problem Set...

Pritam Kumar answered on Sep 27 2021
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Problem Set 1\n",
"\n",
"## Overview\n",
"\n",
"**Entity resolution**, also known as record deduplication, is the process of **identifying rows** in one or more data sets that **refer to the same real world entity**. This is a common, yet difficult problem in data cleaning and integration. \n",
"\n",
"In this assignment, you will use scalable text analysis techniques to perform entity resolution across two data sets of commercial products.\n",
"\n",
"### Instructions\n",
"\n",
"1. You can create both code cells (for the source code) as well as Markdown cells (for the annotations/explanations) as needed in the notebook. Do not try to cram a lot of code in a single cell, but break it logically. \n",
"\n",
"2. Some empty code cells beginning with the **%%time** magic command, have been provided. Please **do not delete** these commands: they are there to help you see how long it takes to run the code! \n",
"\n",
"3. All code cells that begin with the comment `## DO NOT MODIFY` should not be edited. Some of them are there just to make sure that you are following naming conventions in creating the correct objects.\n",
"\n",
"4. Part 3 of the assignment **must be completed by graduate students**; it is optional (for extra credit) for undergraduate students. \
n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"\n",
"### Files\n",
"\n",
"Your assignment is to perform entity resolution over two web-scraped data sets of commercial product listings, one from Amazon, and one from Google.\n",
" "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Data files for this assignment can be found under `Resources/Data/set1data.tar.gz`:\n",
"\n",
"The zip file includes the following files:\n",
"\n",
"* **`Google.csv`**, the Google Products data set \n",
"* **`Amazon.csv`**, the Amazon data set\n",
"* **`Google_small.csv`**, a sample from the Google Products data set \n",
"* **`Amazon_small.csv`**, a sample from the Amazon data set\n",
"* **`Amazon_Google_perfectMapping.csv`**, the \"gold standard\" mapping\n",
"\n",
"The sample files will only be used in **Part I**.\n",
"\n",
"The \"gold standard\" file contains all of the true mappings between entities in the two data sets: each row has a pair of record IDs (one Google, one Amazon) from the corresponding datasets that describe the same thing in the real world."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Guidelines\n",
"\n",
"#### Code\n",
"\n",
"You will use several libraries alongside basic Python: **pandas**, **numpy**, **scipy** and **scikit-learn** for data processing and analysis, and **matplotlib** or **bokeh** for visualization. Treat this assignment as a gentle introduction to the APIs for these libraries."
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {
"tags": []
},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"Invalid switch - \"\".\n"
]
}
],
"source": [
"## DO NOT MODIFY\n",
"\n",
"import pandas as pd\n",
"pd.set_option('display.max_colwidth', None)\n",
"from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer\n",
"\n",
"import matplotlib.pyplot as plt\n",
"%matplotlib inline\n",
"\n",
"import scipy.spatial.distance as dist\n",
"import numpy as np\n",
"from bokeh.io import output_notebook, show\n",
"import bokeh.plotting\n",
"\n",
"%ls data/"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Part 0: Preliminaries\n",
"\n",
"### Exercise 0\n",
"\n",
"Download the data and unzip the files into a folder called `data` in the same directory as this notebook. We start by processing these files with `pandas`. \n",
"\n",
"For each of the data files, create DataFrames whose indexes will be the IDs: \n",
"the IDs are the zero-th column of the file (they are URLs for Google, and alphanumeric strings for Amazon). Use the \"iso-8859-1\" encoding while parsing the csv files, and make the price information numeric. Adjust the column names so that they are identical across all the data frames. Eventually, you should have four datafiles with identical indexes named **`id`** and identical column names `name`,`description`, `manufacturer`, `price` \n",
"\n",
"Also create a `gold_standard` frame from `Amazon_Google_perfectMapping.csv` with a plain numeric index.\n"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [],
"source": [
"goog_s = pd.read_csv(\"D:\\\\New\\\\huber\\\\Google_small.csv\", engine = \"python\", index_col=0, encoding = \"ISO-8859-1\")\n",
"amaz_s = pd.read_csv(\"D:\\\\New\\\\huber\\\\Amazon_small.csv\", engine = \"python\", index_col=0, encoding = \"ISO-8859-1\")\n",
"amaz_s = amaz_s.rename(columns={\"title\": \"name\"})\n",
"gold_standard = pd.read_csv(\"D:\\\\New\\\\huber\\\\Amazon_Google_perfectMapping.csv\", engine = \"python\", encoding = \"ISO-8859-1\")\n",
"goog = pd.read_csv(\"D:\\\\New\\\\huber\\\\Google.csv\", engine = \"python\", index_col=0, encoding = \"ISO-8859-1\")\n",
"amaz = pd.read_csv(\"D:\\\\New\\\\huber\\\\Amazon.csv\", engine = \"python\", index_col=0, encoding = \"ISO-8859-1\")\n",
"amaz = amaz.rename(columns={\"title\": \"name\"})"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"
\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"
namedescriptionmanufacturerprice
id
b000jz4hqoclickart 950 000 - premier image pack (dvd-rom)NaNbroderbund0.0
\n",
"
"
],
"text/plain": [
" name description \\\n",
"id \n",
"b000jz4hqo clickart 950 000 - premier image pack (dvd-rom) NaN \n",
"\n",
" manufacturer price \n",
"id \n",
"b000jz4hqo broderbund 0.0 "
]
},
"execution_count": 3,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"## DO NOT MODIFY\n",
"\n",
"goog_s.head(1)\n",
"amaz_s.head(1)\n",
"gold_standard.head(2)\n",
"goog.head(1)\n",
"amaz.head(1)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Exercise 1: Rudimentary Exploration\n",
"\n",
"Explore the data using pandas and its plotting functions to understand some characteristics of the data. For instance (and do add more exploratory code/questions here):\n",
"\n",
"- how much information is missing in the two data sets and where?\n",
"\n",
"- could the price and the manufacturer data be useful in resolving entities across the two datasets? Explain your answer.\n"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"name 0\n",
"description 191\n",
"manufacturer 2994\n",
"price 0\n",
"dtype: int64\n"
]
}
],
"source": [
"## Missing information in Google dataset\n",
"\n",
"goog_mis = goog.isnull().sum(axis = 0)\n",
"print(goog_mis)"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"name 0\n",
"description 191\n",
"manufacturer 2994\n",
"price 0\n",
"dtype: int64\n"
]
}
],
"source": [
"## Missing information in Amazon dataset\n",
"\n",
"amaz_mis = goog.isnull().sum(axis = 0)\n",
"print(amaz_mis)"
]
},
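{
"cell_type": "markdown",
"metadata": {},
"source": [
"> Optional exploratory sketch for the price/manufacturer question, assuming the `goog` and `amaz` frames from Exercise 0 (with `price` already numeric); the particular counts and plot below are illustrative, not a prescribed answer."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Sketch: how usable are price and manufacturer for matching across the two data sets?\n",
"print(\"Zero or missing prices (Google):\", ((goog['price'].isna()) | (goog['price'] == 0)).sum())\n",
"print(\"Zero or missing prices (Amazon):\", ((amaz['price'].isna()) | (amaz['price'] == 0)).sum())\n",
"print(\"Manufacturers present in both data sets:\",\n",
"      len(set(goog['manufacturer'].dropna().str.lower()) & set(amaz['manufacturer'].dropna().str.lower())))\n",
"\n",
"# Price distributions (zeros and missing values dropped) to see whether the ranges overlap at all\n",
"fig, ax = plt.subplots()\n",
"ax.hist(goog.loc[goog['price'] > 0, 'price'], bins=50, alpha=0.5, label='Google', log=True)\n",
"ax.hist(amaz.loc[amaz['price'] > 0, 'price'], bins=50, alpha=0.5, label='Amazon', log=True)\n",
"ax.set_xlabel('price')\n",
"ax.legend();"
]
},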
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Comments\n",
"\n",
"> Both the datasets have 191 missing values in the description column and 2994 missing values in the manufacturer column."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Part 1: ER with Pairwise Similarity\n",
"\n",
"A simple approach to entity resolution is to treat all records as sets of strings and compute their similarity with a set distance function. This is obviously a **quadratic**-complexity operation so only use the **sample** data here, i.e. \n",
"> you will only use the dataframes `goog_s` and `amaz_s` created above in Exercise 0.\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Exercise 2\n",
"\n",
"Use the `scikit-learn` feature extraction module to create `CountVectorizer`s and `TfidfVectorizer`s for the **combined** sample data files. In particular:\n",
"\n",
"- create a new dataframe containing first the Google samples and subsequently Amazon samples: make sure that you do not change this order in subsequent processing because it gives you a way to use the first 200 integers as the indices for the Google samples and the next 200 for the Amazon samples.\n",
"\n",
"- create a new column called **`info`** in this dataframe which will contain the `name` and `description` within each record combined into a single string (join them together with a space in between).\n",
"\n",
"- create a `CountVectorizer` valled `cv` with binary counts of **1-grams (single words)** avoiding English stopwords.\n",
"\n",
"- similarly, create a `TfidfVectorizer` called `tfidf` with 1-grams avoiding English stopwords.\n",
"\n",
"Use the `fit_transform` methods to obtain the feature matrices for the vectorizers. At this point, you should have 2 models called `cv_model` and `tfidf_model` corresponding to the Vectorizer objects `cv` and `tfidf` respectively. \n"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [],
"source": [
"combined = goog_s.append(amaz_s)\n",
"\n",
"combined['info'] = combined['name'].str.cat(combined['description'],sep=\" \")"
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [],
"source": [
"cv = CountVectorizer(input='content', encoding='ISO-8859-1', decode_error='ignore', analyzer='word',\n",
" ngram_range=(1,1), stop_words='english', binary=True)\n",
"\n",
"tfidf = TfidfVectorizer(input='content', encoding='ISO-8859-1', decode_error='ignore', analyzer='word',\n",
" ngram_range=(1,1), stop_words='english')\n",
"\n",
"cv_model = cv.fit_transform(combined)\n",
"tfidf_model = tfidf.fit_transform(combined)"
]
},
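{
"cell_type": "markdown",
"metadata": {},
"source": [
"> Optional sanity check, assuming the `cv_model` and `tfidf_model` matrices from the previous cell: both should have one row per record in `combined` (200 Google + 200 Amazon samples) and one column per vocabulary term."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Both feature matrices are scipy sparse matrices of shape (n_records, n_vocabulary_terms)\n",
"print(\"count matrix:\", cv_model.shape)\n",
"print(\"tf-idf matrix:\", tfidf_model.shape)\n",
"assert cv_model.shape[0] == len(goog_s) + len(amaz_s)"
]
},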
{
"cell_type": "code",
"execution_count": 8,
"metadata": {
"tags": []
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Wall time: 0 ns\n"
]
}
],
"source": [
"%%time\n",
"\n",
"pass"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Exercise 3\n",
"\n",
"**(a)** Compute the pairwise Jaccard distances between all the Google sample entities and all the Amazon sample entities separately for the CountVectorizer model (the `scipy.spatial.distance` module has a function called `cdist` that can be used). Call this array `jacc_dist`.\n",
"\n",
"Next, subtract Jaccard distances from 1 to obtain the Jaccard similarities: these will \n",
"be in an array whose rows are Google samples and whose columns are Amazon samples. Call this array `jacc_sim`."
]
},
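{
"cell_type": "markdown",
"metadata": {},
"source": [
"> For reference: with binary term vectors, the Jaccard similarity between a Google record $g$ and an Amazon record $a$ is\n",
"$$J(g, a) = \\frac{|T_g \\cap T_a|}{|T_g \\cup T_a|},$$\n",
"where $T_g$ and $T_a$ denote the sets of vocabulary terms occurring in each record's `info` text; `cdist(..., metric='jaccard')` returns the distance $1 - J(g, a)$."
]
},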
{
"cell_type": "code",
"execution_count": 9,
"metadata": {},
"outputs": [],
"source": [
"jacc_dist = dist.cdist(goog_s.isin(gold_standard).iloc[2:2], amaz_s.isin(gold_standard).iloc[2:2])\n",
"jacc_sim = 1 - jacc_dist"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"***"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**(b)** Create a **heatmap** to visualize the similarities. Explore the matplotlib or Bokeh APIs/examples to figure this out."
]
},
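{
"cell_type": "markdown",
"metadata": {},
"source": [
"> A minimal matplotlib heatmap sketch, assuming the `jacc_sim` array from Exercise 3(a); the answer's own heatmap cell follows."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Heatmap of Jaccard similarities: rows are Google samples, columns are Amazon samples\n",
"fig, ax = plt.subplots(figsize=(8, 8))\n",
"im = ax.imshow(jacc_sim, cmap='viridis', aspect='auto')\n",
"ax.set_xlabel('Amazon sample index')\n",
"ax.set_ylabel('Google sample index')\n",
"fig.colorbar(im, ax=ax, label='Jaccard similarity')\n",
"plt.show()"
]
},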
{
"cell_type": "code",
"execution_count": 10,
"metadata": {
"tags": []
},
"outputs": [
{
"data": {
"image/png":...