Objective


Investigate and assess Data Engineering approaches to integrate handwriting recognition datasets


Instructions



  • Complete this assignment individually

  • Use an accepted style (e.g., APA, Chicago, etc.) for citing any materials

  • Submit a PDF file containing your submission for this assignment

  • Read the Notes section below for additional information

  • Find 2 additional datasets to augment the MNIST dataset used in Assignments 2-4

  • For each of the 3 datasets (MNIST and the 2 from the previous step), document the following:

    • The source for the data

    • The frequency (if any) with which the dataset is updated

    • Any metadata associated with the dataset



  • Design a unified model that consolidates the datasets into one form

  • Map the datasets to the aforementioned unified model

  • Identify processes for extracting the datasets from their respective sources

  • Evaluate the need for a data warehouse and/or a data lake to serve as the data repository for the datasets

  • Evaluate one of the following technologies/services to determine if it fits our needs

    • Snowflake

    • Amazon Web Services

    • Microsoft Azure

    • Google Cloud



  • Include any related literature to support your responses
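One way to approach the unified-model and mapping steps listed above is a single record type that every source dataset is converted into. The sketch below is illustrative only: `UnifiedSample`, `to_unified`, and all field names are hypothetical and not part of the assignment, and it assumes each source yields grayscale images as flat pixel bytes plus an integer label.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class UnifiedSample:
    """Hypothetical unified record for one handwritten-character image."""
    source: str       # dataset the sample came from, e.g. "mnist"
    split: str        # "train" or "test"
    label: int        # class index as published by the source
    label_text: str   # human-readable class name, resolved per source
    height: int
    width: int
    pixels: bytes     # row-major grayscale, one byte per pixel


def to_unified(source: str, split: str, label: int, pixels: bytes,
               height: int, width: int,
               label_names: Optional[Dict[int, str]] = None) -> UnifiedSample:
    """Map one raw (label, pixels) pair from a source dataset to the unified record."""
    if len(pixels) != height * width:
        raise ValueError("pixel buffer does not match the declared image shape")
    names = label_names or {}
    # Fall back to the numeric label when no class-name table is supplied.
    text = names.get(label, str(label))
    return UnifiedSample(source, split, label, text, height, width, bytes(pixels))


# Example: a blank 28x28 image mapped into the unified form.
sample = to_unified("mnist", "train", 7, bytes(28 * 28), 28, 28)
```

Keeping `source` and `split` on every record means the datasets can be stored together and still be filtered back apart, which matters when the class indices of different sources do not refer to the same characters.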


Notes


As indicated in previous course materials, the MNIST dataset is a mature and widely used dataset for automated handwriting recognition. More recently, other datasets have been released that aim to improve on MNIST.




Data Engineering Assignment (Assignment 5)
Rubric

### GPU benchmark on MNIST

Code cell [1] (stderr output: "Using TensorFlow backend."):

    import tensorflow as tf
    import numpy as np
    import matplotlib.pyplot as plt
    import keras as k
    from tensorflow.examples.tutorials.mnist import input_data
    from keras.datasets import mnist
    from keras.models import Sequential
    from keras.layers import Dense, Dropout, Flatten
    from keras.layers import Conv2D, MaxPooling2D, BatchNormalization
    from keras.optimizers import SGD, Adam
    from keras.models import load_model
    from keras import backend as K

    # data preprocessing
    (x_train, y_train), (x_test, y_test) = mnist.load_data()
    img_rows, img_cols = 28, 28
    x_train = x_train.reshape(x_train.shape[0], img_rows, img_cols, 1)
    x_test = x_test.reshape(x_test.shape[0], img_rows, img_cols, 1)
    input_shape = (img_rows, img_cols, 1)
    x_test = x_test.astype('float32')
    x_train = x_train.astype('float32')
    mean = np.mean(x_train)
    std = np.std(x_train)
    x_test = (x_test - mean) / std
    x_train = (x_train - mean) / std

    # labels
    num_classes = 10
    y_train = k.utils.to_categorical(y_train, num_classes)
    y_test = k.utils.to_categorical(y_test, num_classes)

Output of code cell [2]. Every run reports "Train on 60000 samples, validate on 10000 samples" and "Epoch 1/1".

    filters  batch size  epoch time  time/step  loss    acc     val_loss  val_acc
    1        8           89s         1ms        0.6453  0.8034  0.1984    0.9360
    1        16          43s         709us      0.3025  0.9064  0.2010    0.9379
    1        32          21s         354us      0.2247  0.9294  0.1230    0.9600
    1        64          11s         178us      0.1830  0.9418  0.1164    0.9614
    1        128         5s          90us       0.1615  0.9488  0.1066    0.9652
    1        256         3s          46us       0.1496  0.9527  0.1051    0.9657
    1        512         2s          25us       0.1416  0.9560  0.1041    0.9656
    1        1024        1s          18us       0.1392  0.9554  0.0998    0.9674
    2        8           86s         1ms        0.4295  0.8730  0.1104    0.9636
    2        16          43s         714us      0.1736  0.9469  0.0883    0.9717
    2        32          21s         356us      0.1213  0.9630  0.0655    0.9798
    2        64          11s         179us      0.0877  0.9722  0.0679    0.9770
    2        128         6s          92us       0.0736  0.9770  0.0573    0.9822
    2        256         3s          48us       0.0668  0.9786  0.0520    0.9844
    2        512         2s          27us       0.0607  0.9806  0.0524    0.9842
    2        1024        1s          22us       0.0591  0.9813  0.0524    0.9834
    4        8           (output truncated)
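The benchmark output above reports the wall-clock time for one epoch over 60,000 training samples at each batch size. As a small worked example, the snippet below converts the times reported for the one-filter runs into samples per second and a speedup relative to batch size 8. The helper name `throughput` is hypothetical, and because the log rounds epoch times to whole seconds, the figures for the largest batch sizes are only approximate.

```python
TRAIN_SAMPLES = 60000

# (batch size, epoch time in seconds) for the runs with 1 filter, copied from the log.
ONE_FILTER_RUNS = [
    (8, 89), (16, 43), (32, 21), (64, 11),
    (128, 5), (256, 3), (512, 2), (1024, 1),
]


def throughput(samples: int, seconds: float) -> float:
    """Training samples processed per second of wall-clock time."""
    return samples / seconds


baseline_seconds = ONE_FILTER_RUNS[0][1]  # the batch-size-8 run
for batch_size, seconds in ONE_FILTER_RUNS:
    rate = throughput(TRAIN_SAMPLES, seconds)
    speedup = baseline_seconds / seconds
    print(f"batch {batch_size:>4}: {rate:8.0f} samples/s, {speedup:5.1f}x vs batch 8")
```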
Answered 2 days after Feb 26, 2021

Answer To: Objective Investigate and assess Data Engineering approaches to integrate handwriting recognition...

Sandeep Kumar answered on Mar 01 2021
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### GPU benchmark on MNIST"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"Using TensorFlow backend.\n"
]
}
],
"source": [
"import tensorflow as tf\n",
"import numpy as np\n",
"import matplotlib.pyplot as plt\n",
"import keras as k\n",
"from tensorflow.examples.tutorials.mnist import input_data\n",
"from keras.datasets import mnist\n",
"from keras.models import Sequential\n",
"from keras.layers import Dense, Dropout, Flatten\n",
"from keras.layers import Conv2D, MaxPooling2D, BatchNormalization\n",
"from keras.optimizers import SGD, Adam\n",
"from keras.models import load_model\n",
"from keras import backend as K\n",
"import requests\n",
"import os\n",
"from random import randint\n",
"\n",
"try:\n",
"    from tqdm import tqdm\n",
"except ImportError:\n",
"    tqdm = lambda x, total, unit: x  # If tqdm doesn't exist, replace it with a function that does nothing\n",
"    print('**** Could not import tqdm. Please install tqdm for download progressbars! (pip install tqdm) ****')\n",
"\n",
"# Python2 compatibility\n",
"try:\n",
"    input = raw_input\n",
"except NameError:\n",
"    pass\n",
"\n",
"download_dict = {\n",
"    '1) Kuzushiji-MNIST (10 classes, 28x28, 70k examples)': {\n",
"        '1) MNIST data format (ubyte.gz)':\n",
"            ['http://codh.rois.ac.jp/kmnist/dataset/kmnist/train-images-idx3-ubyte.gz',\n",
"             'http://codh.rois.ac.jp/kmnist/dataset/kmnist/train-labels-idx1-ubyte.gz',\n",
"             'http://codh.rois.ac.jp/kmnist/dataset/kmnist/t10k-images-idx3-ubyte.gz',\n",
"             'http://codh.rois.ac.jp/kmnist/dataset/kmnist/t10k-labels-idx1-ubyte.gz'],\n",
"        '2) NumPy data format (.npz)':\n",
"            ['http://codh.rois.ac.jp/kmnist/dataset/kmnist/kmnist-train-imgs.npz',\n",
"             'http://codh.rois.ac.jp/kmnist/dataset/kmnist/kmnist-train-labels.npz',\n",
"             'http://codh.rois.ac.jp/kmnist/dataset/kmnist/kmnist-test-imgs.npz',\n",
"             'http://codh.rois.ac.jp/kmnist/dataset/kmnist/kmnist-test-labels.npz'],\n",
"    },\n",
"    '2) Kuzushiji-49 (49 classes, 28x28, 270k examples)': {\n",
"        '1) NumPy data format (.npz)':\n",
"            ['http://codh.rois.ac.jp/kmnist/dataset/k49/k49-train-imgs.npz',\n",
"             'http://codh.rois.ac.jp/kmnist/dataset/k49/k49-train-labels.npz',\n",
"             'http://codh.rois.ac.jp/kmnist/dataset/k49/k49-test-imgs.npz',\n",
"             'http://codh.rois.ac.jp/kmnist/dataset/k49/k49-test-labels.npz'],\n",
"    },\n",
"    '3) Kuzushiji-Kanji (3832 classes, 64x64, 140k examples)': {\n",
"        '1) Folders of images (.tar)':\n",
"            ['http://codh.rois.ac.jp/kmnist/dataset/kkanji/kkanji.tar'],\n",
"    }\n",
"\n",
"}\n",
"\n",
"# Download a list of files\n",
"def download_list(url_list):\n",
"    for url in url_list:\n",
"        path = url.split('/')[-1]\n",
"        r = requests.get(url, stream=True)\n",
"        with open(path, 'wb') as f:\n",
"            total_length = int(r.headers.get('content-length'))\n",
"            print('Downloading {} - {:.1f} MB'.format(path, (total_length / 1024000)))\n",
"\n",
"            for chunk in tqdm(r.iter_content(chunk_size=1024), total=int(total_length / 1024) + 1, unit=\"KB\"):\n",
"                if chunk:\n",
"                    f.write(chunk)\n",
"    print('All dataset files downloaded!')\n",
"\n",
"# Ask the user about which path to take down the dict\n",
"def traverse_dict(d):\n",
"    print('Please select a download option:')\n",
"    keys = sorted(d.keys())  # Print download options\n",
"    for key in keys:\n",
"        print(key)\n",
"\n",
"    userinput = input('> ').strip()\n",
"\n",
"    try:\n",
"        selection = int(userinput) - 1\n",
"    except ValueError:\n",
"        print('Your selection was not valid')\n",
"        traverse_dict(d)  # Try again if input was not valid\n",
"        return\n",
"\n",
"    selected = keys[selection]\n",
"\n",
"    next_level = d[selected]\n",
"    if isinstance(next_level, list):  # If we've hit a list of downloads, download that list\n",
"        download_list(next_level)\n",
"    else:\n",
"        traverse_dict(next_level)  # Otherwise, repeat with the next level\n",
"\n",
"traverse_dict(download_dict)\n",
"\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"APP_NAME = '%s-%d' % ('fashion-mnist', randint(0, 100))\n",
"LOG_FORMAT = '%(asctime)-15s %(filename)s:%(funcName)s:[%(levelname)s] %(message)s'\n",
"JSON_FORMAT = '%(message)s'\n",
"\n",
"RUN_LOCALLY = False\n",
"ROOT_DIR = os.getcwd() + '/'  # __file__ is not defined inside a notebook, so use the working directory\n",
"TEST_DIR = ROOT_DIR + 'test/'\n",
"DATA_DIR = ROOT_DIR + 'data/fashion'\n",
"VIS_DIR = ROOT_DIR + 'visualization/'\n",
"MODEL_SAVE_DIR = ROOT_DIR + 'save/'\n",
"MULTI_TASK_MODEL = '20170814-153653'\n",
"TEST_DATA_DIR = TEST_DIR + 'data/'\n",
"LOG_DIR = ROOT_DIR + 'log/'\n",
"RESULT_DIR = ROOT_DIR + 'result/'\n",
"TEMPLATE_DIR = ROOT_DIR + 'templates/'\n",
"STATIC_DIR = ROOT_DIR + 'static/'\n",
"SCRIPT_DIR = ROOT_DIR + 'script/'\n",
"BASELINE_PATH = ROOT_DIR + 'benchmark/baselines.json'\n",
"\n",
"Q2A_SUFFIX = '-merged-ad1-20170501+36D+20170605.json.gz'\n",
"\n",
"SYNC_SCRIPT_PATH = SCRIPT_DIR + 'sync_s3.sh'\n",
"DOWNLOAD_SCRIPT_PATH = SCRIPT_DIR + 'load_s3_json.sh'\n",
"LOG_PATH = LOG_DIR + APP_NAME + '.log'\n",
"RESULT_PATH = RESULT_DIR + APP_NAME + '.json'\n",
"\n",
"Q2A_PATH = DATA_DIR + \"query2brand-train.tfr\"\n",
"Q2A_INFO = DATA_DIR + \"query2brand.json\"\n",
"MAX_ITEM_PER_ATTRIBUTE = 20\n",
"\n",
"LOSS_JITTER = 1e-4\n",
"SYNC_INTERVAL = 300.0  # sync every 5 minutes\n",
"SYNC_TIMEOUT = 600\n",
"FIRST_SYNC_DELAY = 300.0  # do the first task only after 5 minutes.\n",
"\n",
"RNN_ARGS_JSON = ROOT_DIR + 'nn/queryclf/config.json'\n",
"\n",
"Q2A_JSON_AKEY1 = 'attributes'\n",
"Q2A_JSON_AKEY2 = 'value'\n",
"\n",
"\n",
"# Create the file if it is missing and update its timestamps, like the Unix touch command\n",
"def touch(fname: str, times=None, create_dirs: bool = False):\n",
"    if create_dirs:\n",
"        base_dir = os.path.dirname(fname)\n",
"        if not os.path.exists(base_dir):\n",
"            os.makedirs(base_dir)\n",
"    with open(fname, 'a'):\n",
"        os.utime(fname, times)\n",
"\n",
"\n",
"def touch_dir(base_dir: str) -> None:\n",
"    if not os.path.exists(base_dir):\n",
"        os.makedirs(base_dir)\n",
"\n",
"\n",
"# Logger that writes INFO and above to both LOG_PATH and the console\n",
"def _get_logger(name: str):\n",
"    import logging.handlers\n",
"    touch(LOG_PATH, create_dirs=True)\n",
"    touch_dir(MODEL_SAVE_DIR)\n",
"    l = logging.getLogger(name)\n",
"    l.setLevel(logging.DEBUG)\n",
"    fh = logging.FileHandler(LOG_PATH)\n",
"    fh.setLevel(logging.INFO)\n",
"    ch = logging.StreamHandler()\n",
"    ch.setLevel(logging.INFO)\n",
"    fh.setFormatter(logging.Formatter(LOG_FORMAT))\n",
"    ch.setFormatter(logging.Formatter(LOG_FORMAT))\n",
"    l.addHandler(fh)\n",
"    l.addHandler(ch)\n",
"    return l\n",
"\n",
"\n",
"# Logger that writes bare messages (intended to be JSON) to RESULT_PATH\n",
"def get_json_logger(name: str):\n",
"    import logging.handlers\n",
"    touch(RESULT_PATH, create_dirs=True)\n",
"    l = logging.getLogger(__name__ + name)\n",
"    l.setLevel(logging.INFO)\n",
"    # add rotator to the logger. it's lazy in the sense that it wont rotate unless there are new logs\n",
"    fh = logging.FileHandler(RESULT_PATH)\n",
"    fh.setLevel(logging.INFO)\n",
"    fh.setFormatter(logging.Formatter(JSON_FORMAT))\n",
"    l.addHandler(fh)\n",
"    return l\n",
"\n",
"\n",
"LOGGER = _get_logger(__name__)\n",
"JSON_LOGGER = get_json_logger('json' + __name__)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"\n",
"# data preprocessing\n",
"(x_train, y_train), (x_test, y_test) = mnist.load_data()\n",
"img_rows, img_cols = 28,28\n",
"x_train = x_train.reshape(x_train.shape[0], img_rows, img_cols, 1)\n",
"x_test = x_test.reshape(x_test.shape[0], img_rows, img_cols,...
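The `ubyte.gz` files listed in `download_dict` use the IDX layout that MNIST is distributed in: a big-endian header (magic number, item count and, for images, the row and column counts) followed by raw unsigned bytes. As a minimal sketch of the extraction step, assuming that layout, the helpers below parse such files using only the standard library. The names `parse_idx_images`, `parse_idx_labels`, and `read_gz` are hypothetical and do not appear in the notebook.

```python
import gzip
import struct

IMAGE_MAGIC = 2051  # 0x00000803: unsigned-byte data with 3 dimensions
LABEL_MAGIC = 2049  # 0x00000801: unsigned-byte data with 1 dimension


def parse_idx_images(raw: bytes):
    """Return (rows, cols, images), each image being rows*cols raw bytes."""
    magic, count, rows, cols = struct.unpack(">IIII", raw[:16])
    if magic != IMAGE_MAGIC:
        raise ValueError("not an IDX image file")
    size = rows * cols
    body = raw[16:]
    if len(body) != count * size:
        raise ValueError("image payload length does not match the header")
    images = [body[i * size:(i + 1) * size] for i in range(count)]
    return rows, cols, images


def parse_idx_labels(raw: bytes):
    """Return the labels as a list of ints."""
    magic, count = struct.unpack(">II", raw[:8])
    if magic != LABEL_MAGIC:
        raise ValueError("not an IDX label file")
    body = raw[8:]
    if len(body) != count:
        raise ValueError("label payload length does not match the header")
    return list(body)


def read_gz(path: str) -> bytes:
    """Read and decompress a downloaded .gz file into memory."""
    with gzip.open(path, "rb") as f:
        return f.read()
```

After a download, `parse_idx_images(read_gz('train-images-idx3-ubyte.gz'))` and `parse_idx_labels(read_gz('train-labels-idx1-ubyte.gz'))` would yield image bytes and labels in matching order, ready to be mapped into whatever unified form is chosen.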