
Please see the attachment. This is a machine learning homework assignment that needs to be done in Google Colab. To view the notebook, upload the attached file to Google Drive, then double-click it and the notebook will open in Google Colab.


{ "cells": [ { "cell_type": "markdown", "metadata": { "id": "8p0BMR7HfMH3" }, "source": [ "# Neural Networks for Text Classification\n", "\n", "In this assignment, you will implement the Naive Bayes algorithm and a simple but competitive neural bag-of-words model for text classification, and also experiment with a state-of-the-art pre-trained transformer model. You will train your models on a (provided) dataset of positive and negative movie reviews and report accuracy on a test set." ] }, { "cell_type": "markdown", "metadata": { "id": "jZ03PcaBgA3m" }, "source": [ "# Download the dataset\n", "\n", "First you will need to download the IMDB dataset - to do this, simply run the cell below. We have prepared a small version of the ACL IMDB dataset for you to use to make your experiments faster. The full dataset is available [here](https://ai.stanford.edu/~amaas/data/sentiment/), in case you are interested, but there is no need to use it for the assignment.\n", "\n", "Note that files downloaded in Colab are only saved temporarily - if your session reconnects you will need to re-download the files. In case you need persistent storage, you can mount your Google Drive folder like so:\n", "\n", "```\n", "from google.colab import drive\n", "drive.mount('/content/drive')\n", "```\n", "\n", "You can also open a command line prompt by clicking on the shell icon on the left-hand side of the page, and upload/download files from your local machine by clicking on the file icon." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "oDvw3YUJgQHj", "outputId": "b40deb87-f391-4842-f811-95cc4ec3ade6" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "--2021-10-24 22:22:34-- https://github.com/aritter/5525_sentiment/raw/master/aclImdb_small.tgz\n", "Resolving github.com (github.com)... 192.30.255.113\n", "Connecting to github.com (github.com)|192.30.255.113|:443... 
connected.\n", "HTTP request sent, awaiting response... 302 Found\n", "Location: https://raw.githubusercontent.com/aritter/5525_sentiment/master/aclImdb_small.tgz [following]\n", "--2021-10-24 22:22:35-- https://raw.githubusercontent.com/aritter/5525_sentiment/master/aclImdb_small.tgz\n", "Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.109.133, 185.199.111.133, 185.199.110.133, ...\n", "Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.109.133|:443... connected.\n", "HTTP request sent, awaiting response... 200 OK\n", "Length: 9635749 (9.2M) [application/octet-stream]\n", "Saving to: ‘aclImdb_small.tgz’\n", "\n", "aclImdb_small.tgz 100%[===================>] 9.19M --.-KB/s in 0.07s \n", "\n", "2021-10-24 22:22:35 (138 MB/s) - ‘aclImdb_small.tgz’ saved [9635749/9635749]\n", "\n" ] } ], "source": [ "#Download the data\n", "\n", "!wget https://github.com/aritter/5525_sentiment/raw/master/aclImdb_small.tgz\n", "!tar xvzf aclImdb_small.tgz > /dev/null" ] }, { "cell_type": "markdown", "metadata": { "id": "Jv4obseHgEgb" }, "source": [ "# Converting text to numbers\n", "\n", "Below is some code we are providing you to read in the IMDB dataset, perform tokenization (using `nltk`), and convert words into indices. Please don't modify this code in your submitted version. We will provide example usage below." 
] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "WfdeHCQHfR2n", "outputId": "3354a168-ee16-4ee2-b479-adba7b2aebb0" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[nltk_data] Downloading package punkt to /root/nltk_data...\n", "[nltk_data] Unzipping tokenizers/punkt.zip.\n" ] } ], "source": [ "import os\n", "import sys\n", "\n", "import nltk\n", "from nltk import word_tokenize\n", "nltk.download('punkt')\n", "import torch\n", "\n", "#Sparse matrix implementation\n", "from scipy.sparse import csr_matrix\n", "import numpy as np\n", "from collections import Counter\n", "\n", "np.random.seed(1)\n", "\n", "class Vocab:\n", "    def __init__(self, vocabFile=None):\n", "        self.locked = False\n", "        self.nextId = 0\n", "        self.word2id = {}\n", "        self.id2word = {}\n", "        if vocabFile:\n", "            for line in open(vocabFile):\n", "                line = line.rstrip('\\n')\n", "                (word, wid) = line.split('\\t')\n", "                self.word2id[word] = int(wid)\n", "                self.id2word[int(wid)] = word\n", "                self.nextId = max(self.nextId, int(wid) + 1)\n", "\n", "    def GetID(self, word):\n", "        if word not in self.word2id:\n", "
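The notebook excerpt stops partway through the `Vocab` class, so the rest of its behaviour is not visible here. As a rough illustration of the "convert words into indices" step it describes, the sketch below builds a small word-to-id mapping and turns a review into bag-of-words counts. Everything in it is an assumption made for illustration rather than the assignment's provided code: the names `SimpleVocab`, `get_id`, and `to_counts` are hypothetical, unknown words are mapped to -1 by assumption, and plain whitespace splitting stands in for `nltk`'s `word_tokenize` so that the sketch runs without downloading tokenizer data.

```python
from collections import Counter


class SimpleVocab:
    """Hypothetical stand-in for the notebook's Vocab class (not the assignment's code)."""

    def __init__(self):
        self.locked = False   # when True, unseen words are not added
        self.word2id = {}
        self.id2word = {}

    def get_id(self, word):
        if word not in self.word2id:
            if self.locked:
                return -1     # assumption: -1 marks an unknown word
            new_id = len(self.word2id)
            self.word2id[word] = new_id
            self.id2word[new_id] = word
        return self.word2id[word]


def to_counts(text, vocab):
    """Map a review to {word id: count}, skipping unknown words."""
    # Assumption: whitespace splitting instead of nltk's word_tokenize.
    ids = [vocab.get_id(tok) for tok in text.lower().split()]
    return Counter(i for i in ids if i >= 0)


vocab = SimpleVocab()
train_counts = to_counts("a great great movie", vocab)
vocab.locked = True           # freeze the vocabulary before seeing test data
test_counts = to_counts("a terrible movie", vocab)
```

Locking the vocabulary after the training pass keeps test-only words from receiving new ids, so training and test vectors share the same index space. Count dictionaries like these could then be packed into a sparse matrix for a Naive Bayes or bag-of-words model.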
Answered 9 days after Nov 02, 2021

Amar Kumar answered on Nov 11 2021