{ "cells": [ { "cell_type": "markdown", "metadata": { "nbgrader": { "grade": false, "locked": true, "solution": false }, "editable": false, "deletable": false }, "source": [ "# Linear Regression\n",...

2 answer below »
Need code to complete attached python assignments


{ "cells": [ { "cell_type": "markdown", "metadata": { "nbgrader": { "grade": false, "locked": true, "solution": false }, "editable": false, "deletable": false }, "source": [ "# Linear Regression\n", "\n", "------------- \n", "\n", "_Author: Carleton Smith_" ] }, { "cell_type": "markdown", "metadata": { "nbgrader": { "grade": false, "locked": true, "solution": false }, "editable": false, "deletable": false }, "source": [ "<\ a="">\n", "## Questions\n", "- [Question 1](#q1)\n", "- [Question 2](#q2)\n", "- [Question 3](#q3)\n", "- [Question 4](#q4)\n", "- [Question 5](#q5)\n", "- [Question 6](#q6)\n", "- [Question 7](#q7)\n", "- [Question 8](#q8)\n", "- [Question 9](#q9)\n", "- [Question 10](#q10)\n", "- [Question 11](#q11)\n", "- [Question 12](#q12)\n", "- [Question 13](#q13)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "!ls resource\/asnlib" ] }, { "cell_type": "markdown", "metadata": { "nbgrader": { "grade": false, "locked": true, "solution": false }, "editable": false, "deletable": false }, "source": [ "
<\ a="">\n", "## Assignment Overview\n", "\n", "Linear regression is one of the most ubiquitous models used in practice due to its simplicity (in both ease of deployment & interpretation). The goal of this assignment is to reinforce your understanding of the mechanics of linear regression, to set up a regression analysis, and to interpret the results. You'll achieve these goals by going through and most of a end-to-end data science project using regression.\n", "\n", "This assignment will test your ability to:\n", "\n", "1. Explore and prepare a dataset for Linear Regression\n", "2. Create a linear regression model using `scikit-learn` and `statsmodels` packages\n", "3. Correctly interpret the results from a linear regression analysis\n", "4. Diagnose model shortcomings and address them\n", "5. Perform feature selection to reach a parsimonious and generalized model\n", "\n", "Let's get started.\n", "\n", "#### EXPECTED TIME 1.5 HRS\n", "\n", "## Assignment Contents\n", "\n", "- [Assignment Overview](#overview)\n", "- [Introduction](#intro)\n", "- [Define the Problem](#define)\n", "- [Acquire Data](#acquire)\n", "- [Preprocess Data](#preprocess)\n", "- [Data Exploration](#exploration)\n", "- [Modeling & Feature Selection](#modeling)\n", " - [Predictions with Test Data](#m1_test_pred)\n", " - [Calculating Error](#error)\n", " - [Dummy Variables](#dummy-variables)\n", " - [A Second Model using More Features](#model-2)\n", " - [Remediate Model Shortcomings](#remediate)\n", " - [The `scikit-learn` Implementation](#scikit-learn)\n",
Answered 3 days after Jul 01, 2021

Atal Behari answered on Jul 05 2021
152 Votes
{
"cells": [
{
"cell_type": "markdown",
"metadata": {
"nbgrader": {
"grade": false,
"locked": true,
"solution": false
},
"editable": false,
"deletable": false
},
"source": [
"# Naive Bayes Classifiers"
]
},
{
"cell_type": "markdown",
"metadata": {
"nbgrader": {
"grade": false,
"locked": true,
"solution": false
},
"editable": false,
"deletable": false
},
"source": [
"-----------"
]
},
{
"cell_type": "markdown",
"metadata": {
"nbgrader": {
"grade": false,
"locked": true,
"solution": false
},
"editable": false,
"deletable": false
},
"source": [
"_Author: Dhavide Aruliah_"
]
},
{
"cell_type": "markdown",
"metadata": {
"nbgrader": {
"grade": false,
"locked": true,
"solution": false
},
"editable": false,
"deletable": false
},
"source": [
"### Assignment Contents"
]
},
{
"cell_type": "markdown",
"metadata": {
"nbgrader": {
"grade": false,
"locked": true,
"solution": false
},
"editable": false,
"deletable": false
},
"source": [
"- [Question 1: Computing Discrete Probabilities](#q-club-black)\n",
"- [Question 2: Computing Conditional probabilities](#q-cond-p)\n",
"- [Question 3: Reasoning about Spam Messages](#q-spam)\n",
"- [Question 4: Preparing the SMS Messaging Data](#q-preparing)\n",
"- [Question 5: Getting priors](#q-priors)\n",
"- [Question 6: Getting likelihoods](#q-likelihoods)\n",
"- [Question 7: Computing smoothed likelihoods](#q-smoothed)\n",
"- [Question 8: Predicting spam](#q-predict)"
]
},
{
"cell_type": "markdown",
"metadata": {
"nbgrader": {
"grade": false,
"locked": true,
"solution": false
},
"editable": false,
"deletable": false
},
"source": [
"#### EXPECTED TIME 2.0 HRS "
]
},
{
"cell_type": "markdown",
"metadata": {
"nbgrader": {
"grade": false,
"locked": true,
"solution": false

},
"editable": false,
"deletable": false
},
"source": [
"## Activities in this Assignment\n",
"\n",
"This assignment provides an overview of *Naive Bayes classifiers* as an approach to classification problems in supervised learning. In spite of the \"naive\" assumptions involved, it works very well in practice particularly for text analysis in, for instance, spam filtering or document classification. As such, this assignment is built around a very simple model of spam filtering to get a sense of how naive Bayes classification really works.\n",
"\n",
"The primary goals are:\n",
"+ to review notions of probability as related to Bayes' theorem (notably independent events & conditional probability).\n",
"+ to practice the application of Bayes' theorem for probabilistic reasoning.\n",
"+ to develop a (highly simplified) model of text analysis for spam classification using the naive Bayes classification framework."
]
},
{
"cell_type": "markdown",
"metadata": {
"nbgrader": {
"grade": false,
"locked": true,
"solution": false
},
"editable": false,
"deletable": false
},
"source": [
"---\n",
"\n",
"## Reminders from Discrete Probability\n",
"\n",
"For finite sets, probabilities of distinct events can be modeled using *sets*.\n",
"\n",
"As an example, consider a standard deck of playing cards. There are 52 cards in a deck with four suits (clubs (♣), diamonds (♢), hearts (♡), and spades (♠)) each with one of thirteen ranks (two through ten, jack, queen, king, and ace). We can represent a deck in Python by a collection of tuples."
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {
"collapsed": true
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"['♣', '♢', '♡', '♠']\n"
]
},
{
"data": {
"text/plain": "[('2', '♣'), ('2', '♢'), ('2', '♡'), ('2', '♠'), ('3', '♣')]"
},
"execution_count": 1,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"suits = ['♣', '♢', '♡', '♠']\n",
"print(suits)\n",
"\n",
"ranks = ['2', '3', '4', '5', '6', '7', '8' ,'9', '10', 'J', 'Q', 'K', 'A']\n",
"deck = [ (r,s) for r in ranks for s in suits ]\n",
"deck[:5]"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {
"collapsed": true
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"[('3', '♡'), ('4', '♢'), ('3', '♣'), ('Q', '♡'), ('4', '♣')]\n"
]
}
],
"source": [
"# Put the deck into random order.\n",
"import random\n",
"random.shuffle(deck)\n",
"print(deck[:5])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"An *event* is any subset of a set of possible outcomes. For instance, let $E_{\\text{black}}$ is the event of drawing a black card from the deck and let $E_{\\text{club}}$ be the event of drawing a club from the deck."
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {
"collapsed": true,
"nbgrader": {
"grade": false,
"locked": true,
"solution": false
},
"editable": false,
"deletable": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"0.5 0.25\n"
]
}
],
"source": [
"E_black = {card for card in deck if ((card[1]=='♠') or (card[1]=='♣'))}\n",
"E_club = {card for card in deck if (card[1]=='♣')}\n",
"print(len(E_black)/len(deck), len(E_club)/len(deck))"
]
},
{
"cell_type": "markdown",
"metadata": {
"nbgrader": {
"grade": false,
"locked": true,
"solution": false
},
"editable": false,
"deletable": false
},
"source": [
"[Back to top](#Assignment-Contents)\n",
""
]
},
{
"cell_type": "markdown",
"metadata": {
"nbgrader": {
"grade": false,
"locked": true,
"solution": false
},
"editable": false,
"deletable": false
},
"source": [
"---\n",
"\n",
"### Question 1: Computing Discrete Probabilities\n",
"\n",
"Your first task is to answer a few questions about probability and a standard deck of cards.\n",
"\n",
"+ What is $p(E_{\\text{black}})$, the probability of drawing a single black card (i.e., a card that is either of the club suit or the spades suit) from the deck? Assign your answer to `p_black`.\n",
" + You can simply assign the number or you can compute it empirically using the Python sets `E_black` & `deck`.\n",
"+ What is $p(E_{\\text{club}})$, the probability of drawing a single club (i.e., a card from the club suit) from the deck? Assign your answer to `p_club`.\n",
" + You can simply assign the number or you can compute it empirically using the Python sets `E_club` & `deck`.\n",
"+ **True** or **False**: the events $E_{\\text{black}}$ & $E_{\\text{club}}$ are independent. Assign your answer as a Python boolean literal `True` or `False` to `independent_club_black`."
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {
"collapsed": true
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"p_black = 0.5\n",
"p_club = 0.25\n",
"independent_club_black = True\n"
]
}
],
"source": [
"### GRADED\n",
"### QUESTION 1:\n",
"### Assign values to p_black, p_club, and independent_club_black as described above.\n",
"### YOUR SOLUTION HERE:\n",
"p_black = 26/52\n",
"p_club = 13/52\n",
"independent_club_black = True\n",
"### For verifying answer:\n",
"print('p_black = {}'.format(p_black))\n",
"print('p_club = {}'.format(p_club))\n",
"print('independent_club_black = {}'.format(independent_club_black))"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true,
"nbgrader": {
"grade": true,
"grade_id": "Question 01",
"locked": true,
"points": "6",
"solution": false
},
"editable": false,
"deletable": false
},
"outputs": [],
"source": [
"###\n",
"### AUTOGRADER TEST - DO NOT REMOVE\n",
"###\n"
]
},
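{
"cell_type": "markdown",
"metadata": {},
"source": [
"A minimal sketch for checking the independence answer, assuming every card is equally likely to be drawn: two events $A$ and $B$ are independent exactly when $p(A \\cap B) = p(A)\\,p(B)$. The cell below rebuilds the deck under its own names so that it runs on its own; `full_deck`, `black`, `club`, and the helper `prob` are names introduced here for illustration and are not part of the assignment."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from fractions import Fraction\n",
"\n",
"# Rebuild the deck and both events so this cell does not depend on earlier cells.\n",
"full_deck = {(r, s) for r in ['2', '3', '4', '5', '6', '7', '8', '9', '10', 'J', 'Q', 'K', 'A']\n",
"             for s in ['♣', '♢', '♡', '♠']}\n",
"black = {card for card in full_deck if card[1] in ('♠', '♣')}\n",
"club = {card for card in full_deck if card[1] == '♣'}\n",
"\n",
"def prob(event):\n",
"    # Hypothetical helper: exact probability of an event under equally likely outcomes.\n",
"    return Fraction(len(event), len(full_deck))\n",
"\n",
"p_joint = prob(black & club)          # p(black and club)\n",
"p_product = prob(black) * prob(club)  # p(black) * p(club)\n",
"\n",
"# The events are independent only if the two quantities are equal.\n",
"print('p(black and club) =', p_joint)\n",
"print('p(black) * p(club) =', p_product)\n",
"print('independent:', p_joint == p_product)"
]
},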
{
"cell_type": "markdown",
"metadata": {
"nbgrader": {
"grade": false,
"locked": true,
"solution": false
},
"editable": false,
"deletable": false
},
"source": [
"---"
]
},
{
"cell_type": "markdown",
"metadata": {
"nbgrader": {
"grade": false,
"locked": true,
"solution": false
},
"editable": false,
"deletable": false
},
"source": [
"[Back to top](#Assignment-Contents)\n",
""
]
},
{
"cell_type": "markdown",
"metadata": {
"nbgrader": {
"grade": false,
"locked": true,
"solution": false
},
"editable": false,
"deletable": false
},
"source": [
"---\n",
"\n",
"### Question 2: Computing Conditional probabilities\n",
"\n",
"+ Suppose I draw a card from the deck and I tell you it is a black card. What is the probability that that the card drawn is also a club? That is, what is the *conditional probability* that $p(E_{\\text{club}}\\,|\\,E_{\\text{black}})$?\n",
" + Assign the value of $p(E_{\\text{club}}\\,|\\,E_{\\text{black}})$ to `p_club_black` as a standard Python floating-point numeric value.\n",
"+ Alternatively, suppose I draw a card from the deck and I tell you it is a club, i.e., a card from the club suit. What is the probability that that the card drawn is also black? That is, what is the *conditional probability* that $p(E_{\\text{black}}\\,|\\,E_{\\text{club}})$?\n",
" + Assign the value of $p(E_{\\text{black}}\\,|\\,E_{\\text{club}})$ to `p_black_club` as a standard Python floating-point numeric value."
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {
"collapsed": true
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"p_club_black = 0.5\n",
"p_black_club = 1.0\n"
]
}
],
"source": [
"### GRADED\n",
"### QUESTION 2:\n",
"### Assign numeric values to p_club_black & p_black_club as described above.\n",
"### YOUR SOLUTION HERE:\n",
"p_club_black = .5\n",
"p_black_club = 1.0\n",
"### For verifying answer:\n",
"print('p_club_black = {}'.format(p_club_black))\n",
"print('p_black_club = {}'.format(p_black_club))"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {
"collapsed": true,
"nbgrader": {
"grade": true,
"grade_id": "Question 02",
"locked": true,
"points": "4",
"solution": false
},
"editable": false,
"deletable": false
},
"outputs": [],
"source": [
"###\n",
"### AUTOGRADER TEST - DO NOT REMOVE\n",
"###\n"
]
},
{
"cell_type": "markdown",
"metadata": {
"nbgrader": {
"grade": false,
"locked": true,
"solution": false
},
"editable": false,
"deletable": false
},
"source": [
"[Back to top](#Assignment-Contents)"
]
},
{
"cell_type": "markdown",
"metadata": {
"nbgrader": {
"grade": false,
"locked": true,
"solution": false
},
"editable": false,
"deletable": false
},
"source": [
"Having reviewed a little about independent and conditional probabilities, remember *Bayes' theorem*:\n",
"\n",
"$$\\displaystyle{\\boxed{p(A\\,|\\,B) = \\frac{p(B\\,|\\,A) p(A)}{p(B)}}}$$\n",
"\n",
"+ $p(A\\,|\\,B)$ is the \"*posterior* probability of $A$ given $B$\";\n",
"+ $p(B\\,|\\,A)$ is the \"*likelihood* of $B$ given $A$\";\n",
"+ $p(A)$ is the \"*prior* probability of $A$\"; and\n",
"+ $p(B)$ is the \"*evidence*\" (normalizing factor).\n",
"\n",
"The next questions require you to apply Bayes' theorem to reason about spam messages. Remember, the goal is to identify messages as *spam* (not wanted, undesirable) or *ham* (i.e., the opposite of spam messages)."
]
},
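{
"cell_type": "markdown",
"metadata": {},
"source": [
"Bayes' theorem can be sanity-checked on the card example. The sketch below, which assumes equally likely draws, computes $p(E_{\\text{club}}\\,|\\,E_{\\text{black}})$ once directly from set sizes and once through the formula above. The names `cards`, `black`, `club`, and `p` are introduced here for illustration only."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from fractions import Fraction\n",
"\n",
"# Self-contained copy of the deck and the two events.\n",
"cards = {(r, s) for r in ['2', '3', '4', '5', '6', '7', '8', '9', '10', 'J', 'Q', 'K', 'A']\n",
"         for s in ['♣', '♢', '♡', '♠']}\n",
"black = {card for card in cards if card[1] in ('♠', '♣')}\n",
"club = {card for card in cards if card[1] == '♣'}\n",
"\n",
"def p(event):\n",
"    # Hypothetical helper: probability of an event under equally likely outcomes.\n",
"    return Fraction(len(event), len(cards))\n",
"\n",
"# Direct definition: restrict attention to the black cards and count the clubs.\n",
"direct = Fraction(len(club & black), len(black))\n",
"\n",
"# Bayes' theorem: posterior = likelihood * prior / evidence.\n",
"likelihood = Fraction(len(black & club), len(club))  # p(black | club)\n",
"via_bayes = likelihood * p(club) / p(black)\n",
"\n",
"print(direct, via_bayes, direct == via_bayes)"
]
},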
{
"cell_type": "markdown",
"metadata": {
"nbgrader": {
"grade": false,
"locked": true,
"solution": false
},
"editable": false,
"deletable": false
},
"source": [
"[Back to top](#Assignment-Contents)\n",
""
]
},
{
"cell_type": "markdown",
"metadata": {
"nbgrader": {
"grade": false,
"locked": true,
"solution": false
},
"editable": false,
"deletable": false
},
"source": [
"---\n",
"\n",
"### Question 3: Reasoning about Spam Messages\n",
"\n",
"Assume in the following that you have a training set of 2,000 messages known to be spam and 1,000 messages known to be ham (i.e., not spam). Suppose further that the word \"bargain\" occurs in 250 of the spam messages and 5 of the ham messages.\n",
"\n",
"+ Assume the empirical prior probability of an incoming message being ham or spam is provided by the respective fractions of ham or spam messages in the training set.\n",
" + Assign the (estimated) prior probability of spam to `prior_spam` (i.e., $p(\\text{spam})$).\n",
" + Assign the (estimated) prior probability of ham to `prior_ham` (i.e., $p(\\text{ham})$).\n",
"+ Assume the empirical likelihood of the word \"bargain\" occurring in a message known to be spam (respectively, ham) is given by the counts above.\n",
" + Assign the (estimated) likelihood of \"bargain\" occurring in an incoming spam message to `likelihood_bargain_spam` (i.e., $p(\\text{bargain}\\,|\\,\\text{spam})$).\n",
" + Assign the (estimated) likelihood of \"bargain\" occurring in an incoming ham message to `likelihood_bargain_ham` (i.e., $p(\\text{bargain}\\,|\\,\\text{ham})$).\n",
"+ Finally, combine the preceding computations to estimate the *posterior* probability of an incoming message being spam given that it contains the word \"bargain\" (that is, $p(\\text{spam}\\,|\\,\\text{bargain})$).\n",
" + Assign the posterior probability $p(\\text{spam}\\,|\\,\\text{bargain})$ to `posterior_spam_bargain`.\n",
"+ Assign all the values computed here to Python floating-point values up to three decimal places."
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {
"collapsed": true
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"prior_spam: 0.667\n",
"prior_ham: 0.333\n",
"likelihood_bargain_spam: 0.125\n",
"likelihood_bargain_ham : 0.005\n",
"posterior_spam_bargain: 0.980\n"
]
}
],
"source": [
"### GRADED\n",
"### QUESTION 3:\n",
"### Assign floating-point values to prior_spam, prior_ham, likelihood_bargain_spam,\n",
"### likelihood_bargain_ham, and posterior_spam_bargain as described above.\n",
"### Provide results accurate to at least 3 decimal places (i.e., an absolute tolerance of 1.0e-3).\n",
"### YOUR SOLUTION HERE:\n",
"prior_spam = 2000/3000\n",
"prior_ham = 1000/3000\n",
"likelihood_bargain_spam = 250/2000\n",
"likelihood_bargain_ham = 5/1000\n",
"posterior_spam_bargain = (250/2000)*(2000/3000)/(255/3000)\n",
"\n",
"### For verifying answer:\n",
"print('prior_spam: {:5.3f}'.format(prior_spam))\n",
"print('prior_ham: {:5.3f}'.format(prior_ham))\n",
"print('likelihood_bargain_spam: {:5.3f}'.format(likelihood_bargain_spam))\n",
"print('likelihood_bargain_ham : {:5.3f}'.format(likelihood_bargain_ham))\n",
"print('posterior_spam_bargain: {:5.3f}'.format(posterior_spam_bargain))"
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {
"collapsed": true,
"nbgrader": {
"grade": true,
"grade_id": "Question 03",
"locked": true,
"points": "10",
"solution": false
},
"editable": false,
"deletable": false
},
"outputs": [],
"source": [
"###\n",
"### AUTOGRADER TEST - DO NOT REMOVE\n",
"###\n"
]
},
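{
"cell_type": "markdown",
"metadata": {},
"source": [
"The posterior in Question 3 can also be computed with the evidence $p(\\text{bargain})$ expanded by the law of total probability, $p(\\text{bargain}) = p(\\text{bargain}\\,|\\,\\text{spam})\\,p(\\text{spam}) + p(\\text{bargain}\\,|\\,\\text{ham})\\,p(\\text{ham})$. The cell below is a rough sketch of that calculation using the counts stated in the question; the function name `posterior` and the `n_*` variables are assumptions made for this illustration."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"def posterior(prior_a, like_b_given_a, prior_not_a, like_b_given_not_a):\n",
"    # Hypothetical helper: p(A | B) for a two-class problem via Bayes' theorem.\n",
"    evidence = like_b_given_a * prior_a + like_b_given_not_a * prior_not_a\n",
"    return like_b_given_a * prior_a / evidence\n",
"\n",
"# Counts taken from the statement of Question 3.\n",
"n_spam, n_ham = 2000, 1000\n",
"n_bargain_spam, n_bargain_ham = 250, 5\n",
"n_total = n_spam + n_ham\n",
"\n",
"p_spam_given_bargain = posterior(\n",
"    prior_a=n_spam / n_total,\n",
"    like_b_given_a=n_bargain_spam / n_spam,\n",
"    prior_not_a=n_ham / n_total,\n",
"    like_b_given_not_a=n_bargain_ham / n_ham,\n",
")\n",
"\n",
"# Algebraically this reduces to 250 / (250 + 5).\n",
"print('{:5.3f}'.format(p_spam_given_bargain))"
]
},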
{
"cell_type": "markdown",
"metadata": {
"nbgrader": {
"grade": false,
"locked": true,
"solution": false
},
"editable": false,
"deletable": false
},
"source": [
"[Back to top](#Assignment-Contents)"
]
},
{
"cell_type": "markdown",
"metadata": {
"nbgrader": {
"grade": false,
"locked": true,
"solution": false
},
"editable": false,
"deletable": false
},
"source": [
"## Filtering Spam from SMS Messages\n",
"\n",
"For the next questions, you'll work with a dataset from the [UCI Machine Learning Repository](https://archive.ics.uci.edu/ml/index.php), namely the [*SMS Spam Collection*]( https://archive.ics.uci.edu/ml/datasets/sms+spam+collection), a public set of labeled SMS messages that have been collected for mobile phone spam research."
]
},
{
"cell_type": "markdown",
"metadata": {
"nbgrader": {
"grade": false,
"locked": true,
"solution": false
},
"editable": false,
"deletable": false
},
"source": [
"[Back to top](#Assignment-Contents)\n",
""
]
},
{
"cell_type": "markdown",
"metadata": {
"nbgrader": {
"grade": false,
"locked": true,
"solution": false
},
"editable": false,
"deletable": false
},
"source": [
"---\n",
"\n",
"### Question 4: Preparing the SMS Messaging Data\n",
"\n",
"Your task now is to load the SMS messaging data into a Pandas DataFrame.\n",
"\n",
"+ The data is stored in a file whose location is provided for you as `FILE_PATH`. Use the function `pd.read_csv` with the options `sep=\"\\t\"` and `header=None`.\n",
"+ Assign the resulting `DataFrame` object to the identifier `messages`.\n",
"+ Give the DataFrame meaningful column headers by assigning the list `['target', 'msg']` to `messages.columns`."
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {
"collapsed": true
},
"outputs": [
{
"data": {
"text/plain": " target msg\n0 ham Go until jurong point, crazy.. Available only ...\n1 ham Ok lar... Joking wif u oni...\n2 spam Free entry in 2 a wkly comp to win FA Cup fina...\n3 ham U dun say so early hor... U c already then say...\n4 ham Nah I don't think he goes to usf, he lives aro...",
"text/html": "
\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n
targetmsg
0hamGo until jurong point, crazy.. Available only ...
1hamOk lar... Joking wif u oni...
2spamFree entry in 2 a wkly comp to win FA Cup fina...
3hamU dun say so early hor... U c already then say...
4hamNah I don't think he goes to usf, he lives aro...
\n
"
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"### GRADED\n",
"### QUESTION 4:\n",
"### Prepare the dataframe messages as specified above. \n",
"###\n",
"# Necessary imports\n",
"import numpy as np, pandas as pd\n",
"FILE_PATH = 'C:/Users/Atal/PycharmProjects/GreyNodes/Dataset/smsspamcollection.txt'\n",
"### YOUR SOLUTION HERE:\n",
"messages = pd.read_csv(FILE_PATH,sep=\"\\t\",header=None)\n",
"messages.columns = ['target', 'msg']\n",
"### For verifying answer:\n",
"display(messages.head())\n"
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {
"collapsed": true,
"nbgrader": {
"grade": true,
"grade_id": "Question 04",
"locked": true,
"points": "5",
"solution": false
},
"editable": false,
"deletable": false
},
"outputs": [],
"source": [
"###\n",
"### AUTOGRADER TEST - DO NOT REMOVE\n",
"###\n"
]
},
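{
"cell_type": "markdown",
"metadata": {},
"source": [
"To see what `sep=\"\\t\"` and `header=None` do without needing the data file, the sketch below reads a tiny in-memory sample. The two rows are copied from the truncated preview above, so they are not complete messages, and the names `sample` and `demo` are introduced here for illustration only."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import io\n",
"import pandas as pd\n",
"\n",
"# Two tab-separated rows (label, message) with no header line, as in the real file.\n",
"sample = 'ham\\tOk lar... Joking wif u oni...\\nspam\\tFree entry in 2 a wkly comp to win FA Cup fina...\\n'\n",
"\n",
"# header=None stops pandas from treating the first message as column names.\n",
"demo = pd.read_csv(io.StringIO(sample), sep='\\t', header=None)\n",
"demo.columns = ['target', 'msg']\n",
"\n",
"print(demo.shape)\n",
"# Relative class frequencies in the sample, i.e. empirical class proportions.\n",
"print(demo['target'].value_counts(normalize=True))"
]
},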
{
"cell_type": "markdown",
"metadata": {
"nbgrader": {
"grade": false,
"locked": true,
"solution": false
},
"editable": false,
"deletable": false
},
"source": [
"[Back to top](#Assignment-Contents)"
]
},
{
"cell_type": "markdown",
"metadata": {
"nbgrader": {
"grade": false,
"locked": true,
"solution": false
},
"editable":...