Complete the Hypothesis Case Study Part 1 tutorial. It is not a complete case study; it is just the steps you might take to do Graph Analysis. I have provided sample code for you to use as you go...


Complete the Hypothesis Case Study Part 1 tutorial. It is not a complete case study; it is just the steps you might take to do Graph Analysis. I have provided sample code for you to use as you go through the tutorial. I need you to provide comments on each step and run the steps separately so I can fully understand what you are doing at each step of the analysis.



(I am using the first part of it to practice Graphic Analytics, but the updates to Anaconda messed up some of the packages and I can’t run Python.)


I got some of the code done. The data set is large (205 MB), so I will have to share it through a Dropbox link.
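Since the file is large, one option is to stream it instead of loading it all at once. The following is a minimal sketch under stated assumptions, not part of the tutorial: it uses only the standard library, the helper name `count_by_publication` is hypothetical, and the in-memory sample rows are toy data standing in for the real CSV. Only the `publication` column name is taken from the data set.

```python
import csv
import io
from collections import Counter


def count_by_publication(handle):
    """Stream a CSV row by row and tally the articles per publication."""
    # Article bodies can be long, so raise the per-field size limit
    # above the csv module's default.
    csv.field_size_limit(10**9)
    counts = Counter()
    for row in csv.DictReader(handle):
        counts[row["publication"]] += 1
    return counts


# Tiny in-memory stand-in for the real file (toy rows, not real data).
sample = io.StringIO(
    "id,publication,content\n"
    "1,New York Times,text a\n"
    "2,Breitbart,text b\n"
    "3,Breitbart,text c\n"
)
publication_counts = count_by_publication(sample)
print(publication_counts.most_common())
```

With the real data, the same function could be given a handle such as `open("articles1.csv", newline="", encoding="utf-8")`; the encoding here is an assumption about the file.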




Testing Hypothesis Exercise

#Hypothesis: Articles about Climate Change are more likely to be published by "Liberal" sources

NOTE: This case study is not complete!

Here is some additional sample code to use:

import pandas as pd
import numpy as np
import json
import sys
import warnings
from sklearn.datasets import make_regression
from sklearn.feature_selection import RFECV
from sklearn import datasets, linear_model
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.decomposition import NMF
from sklearn.model_selection import train_test_split

#9.1 Reducing Features Using Principal Components
digits = datasets.load_digits()
features = StandardScaler().fit_transform(digits.data)
pca = PCA(n_components=0.99, whiten=True)
features_pca = pca.fit_transform(features)
print("original number of features:", features.shape[1])
print("reduced number of features:", features_pca.shape[1])
print("output from 9.1 done!")

#9.4 Reducing Features Using Matrix Factorization
features = digits.data
nmf = NMF(n_components=10, random_state=1)
features_nmf = nmf.fit_transform(features)
print("Original number of features:", features.shape[1])
print("reduced number of features:", features_nmf.shape[1])
print("output from 9.4 done!")

#10.1 - Thresholding Numerical Feature Variance
from sklearn.feature_selection import VarianceThreshold
#import data
iris = datasets.load_iris()
#create features and target
features = iris.data
target = iris.target
#create thresholder
thresholder = VarianceThreshold(threshold=.5)
#create high variance feature matrix and print
features_high_variance = thresholder.fit_transform(features)
print(features_high_variance[0:3])

#10.2 - Thresholding Binary Feature Variance
features = [[0,1,0], [0,1,1], [0,1,0], [0,1,1], [1,0,0]]
thresholder = VarianceThreshold(threshold=(.75*(1-.75)))
print(thresholder.fit_transform(features))
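To make the hypothesis concrete, here is one way it could be checked: compare the share of articles mentioning climate change between "Liberal" sources and the rest. This is a rough sketch under stated assumptions, not the tutorial's method. The outlet names, the leaning labels, the helper names, and the toy rows are all invented for illustration; the question does not say which publications count as "Liberal". Only the `publication` and `content` field names mirror the data set.

```python
import math
import re

# ASSUMPTION: placeholder outlets and leaning labels, invented for illustration.
LEANING = {"Outlet A": "liberal", "Outlet B": "other"}
CLIMATE = re.compile(r"climate change", re.IGNORECASE)


def climate_counts(articles, leaning):
    """Return {group: (articles mentioning climate change, total articles)}."""
    counts = {}
    for article in articles:
        group = leaning.get(article["publication"])
        if group is None:
            continue  # skip outlets that have no leaning label
        hits, total = counts.get(group, (0, 0))
        hits += bool(CLIMATE.search(article["content"]))
        counts[group] = (hits, total + 1)
    return counts


def two_proportion_z(hits_a, n_a, hits_b, n_b):
    """Pooled z statistic for the difference between two proportions."""
    pooled = (hits_a + hits_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        raise ValueError("pooled proportion is 0 or 1; z is undefined")
    return (hits_a / n_a - hits_b / n_b) / se


# Toy rows (not real data) using the data set's column names.
toy_articles = [
    {"publication": "Outlet A", "content": "Climate change dominated the summit."},
    {"publication": "Outlet A", "content": "New report on climate change released."},
    {"publication": "Outlet A", "content": "Local team wins the final."},
    {"publication": "Outlet A", "content": "Markets closed higher today."},
    {"publication": "Outlet B", "content": "Debate over climate change policy."},
    {"publication": "Outlet B", "content": "Election results are in."},
    {"publication": "Outlet B", "content": "A new phone was announced."},
    {"publication": "Outlet B", "content": "Traffic delays expected downtown."},
]

counts = climate_counts(toy_articles, LEANING)
z = two_proportion_z(*counts["liberal"], *counts["other"])
print(counts, round(z, 3))
```

A positive z means the "liberal" group has the higher share of climate change articles. On a sample this small the statistic carries no weight; it only shows the mechanics.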
Answered Same Day: Sep 29, 2021


Kshitij answered on Sep 29 2021
45265.ipynb
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"CINDY HERRERA DSC550 WEEK 5"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Applied Text Analysis With Python Exercises"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd\n",
"import numpy as np\n",
"import string\n",
"import re\n",
"import matplotlib.pyplot as plt\n",
"from collections import Counter"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [],
"source": [
"#Step 1: Load data into a dataframe"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [],
"source": [
"addr1 = \"articles1.csv\""
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Step 2: check the dimension of the table/look at the data"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": []
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"The dimension of the table is: (50000, 10)\n"
]
},
{
"data": {
"text/html": [
"
<table>\n",
"<tr><th></th><th>Unnamed: 0</th><th>id</th><th>title</th><th>publication</th><th>author</th><th>date</th><th>year</th><th>month</th><th>url</th><th>content</th></tr>\n",
"<tr><th>0</th><td>0</td><td>17283</td><td>House Republicans Fret About Winning Their Hea...</td><td>New York Times</td><td>Carl Hulse</td><td>2016-12-31</td><td>2016.0</td><td>12.0</td><td>NaN</td><td>WASHINGTON — Congressional Republicans have...</td></tr>\n",
"<tr><th>1</th><td>1</td><td>17284</td><td>Rift Between Officers and Residents as Killing...</td><td>New York Times</td><td>Benjamin Mueller and Al Baker</td><td>2017-06-19</td><td>2017.0</td><td>6.0</td><td>NaN</td><td>After the bullet shells get counted, the blood...</td></tr>\n",
"<tr><th>2</th><td>2</td><td>17285</td><td>Tyrus Wong, ‘Bambi’ Artist Thwarted by Racial ...</td><td>New York Times</td><td>Margalit Fox</td><td>2017-01-06</td><td>2017.0</td><td>1.0</td><td>NaN</td><td>When Walt Disney’s “Bambi” opened in 1942, cri...</td></tr>\n",
"<tr><th>3</th><td>3</td><td>17286</td><td>Among Deaths in 2016, a Heavy Toll in Pop Musi...</td><td>New York Times</td><td>William McDonald</td><td>2017-04-10</td><td>2017.0</td><td>4.0</td><td>NaN</td><td>Death may be the great equalizer, but it isn’t...</td></tr>\n",
"<tr><th>4</th><td>4</td><td>17287</td><td>Kim Jong-un Says North Korea Is Preparing to T...</td><td>New York Times</td><td>Choe Sang-Hun</td><td>2017-01-02</td><td>2017.0</td><td>1.0</td><td>NaN</td><td>SEOUL, South Korea — North Korea’s leader, ...</td></tr>\n",
"</table>"
],
"text/plain": [
" Unnamed: 0 id title \\\n",
"0 0 17283 House Republicans Fret About Winning Their Hea... \n",
"1 1 17284 Rift Between Officers and Residents as Killing... \n",
"2 2 17285 Tyrus Wong, ‘Bambi’ Artist Thwarted by Racial ... \n",
"3 3 17286 Among Deaths in 2016, a Heavy Toll in Pop Musi... \n",
"4 4 17287 Kim Jong-un Says North Korea Is Preparing to T... \n",
"\n",
" publication author date year month \\\n",
"0 New York Times Carl Hulse 2016-12-31 2016.0 12.0 \n",
"1 New York Times Benjamin Mueller and Al Baker 2017-06-19 2017.0 6.0 \n",
"2 New York Times Margalit Fox 2017-01-06 2017.0 1.0 \n",
"3 New York Times William McDonald 2017-04-10 2017.0 4.0 \n",
"4 New York Times Choe Sang-Hun 2017-01-02 2017.0 1.0 \n",
"\n",
" url content \n",
"0 NaN WASHINGTON — Congressional Republicans have... \n",
"1 NaN After the bullet shells get counted, the blood... \n",
"2 NaN When Walt Disney’s “Bambi” opened in 1942, cri... \n",
"3 NaN Death may be the great equalizer, but it isn’t... \n",
"4 NaN SEOUL, South Korea — North Korea’s leader, ... "
]
},
"execution_count": 4,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"articles = pd.read_csv(addr1)\n",
"\n",
"\n",
"print(\"The dimension of the table is: \", articles.shape)\n",
"\n",
"# Display the first 5 rows of the dataframe\n",
"# to get an idea of what kind of data it contains\n",
"articles.head(5)\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Step 3: what type of variables are in the table "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": []
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Describe Data\n",
" Unnamed: 0 id year month url\n",
"count 50000.000000 50000.000000 50000.000000 50000.000000 0.0\n",
"mean 25694.378380 44432.454800 2016.273700 5.508940 NaN\n",
"std 15350.143677 15773.615179 0.634694 3.333062 NaN\n",
"min 0.000000 17283.000000 2011.000000 1.000000 NaN\n",
"25% 12500.750000 31236.750000 2016.000000 3.000000 NaN\n",
"50% 25004.500000 43757.500000 2016.000000 5.000000 NaN\n",
"75% 38630.250000 57479.250000 2017.000000 8.000000 NaN\n",
"max 53291.000000 73469.000000 2017.000000 12.000000 NaN\n",
"Summarized Data\n",
" title publication \\\n",
"count 50000 50000 \n",
"unique 49920 5 \n",
"top The 10 most important things in the world righ... Breitbart \n",
"freq 7 23781 \n",
"\n",
" author date content \n",
"count 43694 50000 50000 \n",
"unique 3603 983 49888 \n",
"top Breitbart News 2016-08-22 advertisement \n",
"freq 1559 221 42 \n"
]
},
{
"data": {
"text/plain": [
"Unnamed: 0 int64\n",
"id int64\n",
"title object\n",
"publication object\n",
"author object\n",
"date object\n",
"year float64\n",
"month float64\n",
"url float64\n",
"content object\n",
"dtype: object"
]
},
"execution_count": 5,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Inspect the variables in the table: summary statistics first, then the column dtypes\n",
"print(\"Describe Data\")\n",
"print(articles.describe())\n",
"print(\"Summarized Data\")\n",
"print(articles.describe(include=['O']))\n",
"\n",
"# this will return the datatypes of the columns\n",
"articles.dtypes"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"50000\n"
]
}
],
"source": [
"#display length of data\n",
"print(len(articles))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": []
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [
{
"data": {
"image/png":...