45265.ipynb
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"CINDY HERRERA DSC550 WEEK 5"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Applied Text Analysis With Python Exercises"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd\n",
"import numpy as np\n",
"import string\n",
"import re\n",
"import matplotlib.pyplot as plt\n",
"from collections import Counter"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [],
"source": [
"# Step 1: Load the data into a DataFrame"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [],
"source": [
"addr1 = \"articles1.csv\""
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Step 2: Check the dimensions of the table and look at the data"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"The dimension of the table is: (50000, 10)\n"
]
},
{
"data": {
"text/html": [
"<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
"<thead>\n",
"<tr>\n",
"<th></th>\n",
"<th>Unnamed: 0</th>\n",
"<th>id</th>\n",
"<th>title</th>\n",
"<th>publication</th>\n",
"<th>author</th>\n",
"<th>date</th>\n",
"<th>year</th>\n",
"<th>month</th>\n",
"<th>url</th>\n",
"<th>content</th>\n",
"</tr>\n",
"</thead>\n",
"<tbody>\n",
"<tr>\n",
"<th>0</th>\n",
"<td>0</td>\n",
"<td>17283</td>\n",
"<td>House Republicans Fret About Winning Their Hea...</td>\n",
"<td>New York Times</td>\n",
"<td>Carl Hulse</td>\n",
"<td>2016-12-31</td>\n",
"<td>2016.0</td>\n",
"<td>12.0</td>\n",
"<td>NaN</td>\n",
"<td>WASHINGTON — Congressional Republicans have...</td>\n",
"</tr>\n",
"<tr>\n",
"<th>1</th>\n",
"<td>1</td>\n",
"<td>17284</td>\n",
"<td>Rift Between Officers and Residents as Killing...</td>\n",
"<td>New York Times</td>\n",
"<td>Benjamin Mueller and Al Baker</td>\n",
"<td>2017-06-19</td>\n",
"<td>2017.0</td>\n",
"<td>6.0</td>\n",
"<td>NaN</td>\n",
"<td>After the bullet shells get counted, the blood...</td>\n",
"</tr>\n",
"<tr>\n",
"<th>2</th>\n",
"<td>2</td>\n",
"<td>17285</td>\n",
"<td>Tyrus Wong, ‘Bambi’ Artist Thwarted by Racial ...</td>\n",
"<td>New York Times</td>\n",
"<td>Margalit Fox</td>\n",
"<td>2017-01-06</td>\n",
"<td>2017.0</td>\n",
"<td>1.0</td>\n",
"<td>NaN</td>\n",
"<td>When Walt Disney’s “Bambi” opened in 1942, cri...</td>\n",
"</tr>\n",
"<tr>\n",
"<th>3</th>\n",
"<td>3</td>\n",
"<td>17286</td>\n",
"<td>Among Deaths in 2016, a Heavy Toll in Pop Musi...</td>\n",
"<td>New York Times</td>\n",
"<td>William McDonald</td>\n",
"<td>2017-04-10</td>\n",
"<td>2017.0</td>\n",
"<td>4.0</td>\n",
"<td>NaN</td>\n",
"<td>Death may be the great equalizer, but it isn’t...</td>\n",
"</tr>\n",
"<tr>\n",
"<th>4</th>\n",
"<td>4</td>\n",
"<td>17287</td>\n",
"<td>Kim Jong-un Says North Korea Is Preparing to T...</td>\n",
"<td>New York Times</td>\n",
"<td>Choe Sang-Hun</td>\n",
"<td>2017-01-02</td>\n",
"<td>2017.0</td>\n",
"<td>1.0</td>\n",
"<td>NaN</td>\n",
"<td>SEOUL, South Korea — North Korea’s leader, ...</td>\n",
"</tr>\n",
"</tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" Unnamed: 0 id title \\\n",
"0 0 17283 House Republicans Fret About Winning Their Hea... \n",
"1 1 17284 Rift Between Officers and Residents as Killing... \n",
"2 2 17285 Tyrus Wong, ‘Bambi’ Artist Thwarted by Racial ... \n",
"3 3 17286 Among Deaths in 2016, a Heavy Toll in Pop Musi... \n",
"4 4 17287 Kim Jong-un Says North Korea Is Preparing to T... \n",
"\n",
" publication author date year month \\\n",
"0 New York Times Carl Hulse 2016-12-31 2016.0 12.0 \n",
"1 New York Times Benjamin Mueller and Al Baker 2017-06-19 2017.0 6.0 \n",
"2 New York Times Margalit Fox 2017-01-06 2017.0 1.0 \n",
"3 New York Times William McDonald 2017-04-10 2017.0 4.0 \n",
"4 New York Times Choe Sang-Hun 2017-01-02 2017.0 1.0 \n",
"\n",
" url content \n",
"0 NaN WASHINGTON — Congressional Republicans have... \n",
"1 NaN After the bullet shells get counted, the blood... \n",
"2 NaN When Walt Disney’s “Bambi” opened in 1942, cri... \n",
"3 NaN Death may be the great equalizer, but it isn’t... \n",
"4 NaN SEOUL, South Korea — North Korea’s leader, ... "
]
},
"execution_count": 4,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"articles = pd.read_csv(addr1)\n",
"\n",
"print(\"The dimension of the table is: \", articles.shape)\n",
"\n",
"# Display the first 5 rows of the DataFrame\n",
"# to get an idea of what kind of data it contains\n",
"articles.head(5)\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Step 3: What types of variables are in the table?"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Describe Data\n",
" Unnamed: 0 id year month url\n",
"count 50000.000000 50000.000000 50000.000000 50000.000000 0.0\n",
"mean 25694.378380 44432.454800 2016.273700 5.508940 NaN\n",
"std 15350.143677 15773.615179 0.634694 3.333062 NaN\n",
"min 0.000000 17283.000000 2011.000000 1.000000 NaN\n",
"25% 12500.750000 31236.750000 2016.000000 3.000000 NaN\n",
"50% 25004.500000 43757.500000 2016.000000 5.000000 NaN\n",
"75% 38630.250000 57479.250000 2017.000000 8.000000 NaN\n",
"max 53291.000000 73469.000000 2017.000000 12.000000 NaN\n",
"Summarized Data\n",
" title publication \\\n",
"count 50000 50000 \n",
"unique 49920 5 \n",
"top The 10 most important things in the world righ... Breitbart \n",
"freq 7 23781 \n",
"\n",
" author date content \n",
"count 43694 50000 50000 \n",
"unique 3603 983 49888 \n",
"top Breitbart News 2016-08-22 advertisement \n",
"freq 1559 221 42 \n"
]
},
{
"data": {
"text/plain": [
"Unnamed: 0 int64\n",
"id int64\n",
"title object\n",
"publication object\n",
"author object\n",
"date object\n",
"year float64\n",
"month float64\n",
"url float64\n",
"content object\n",
"dtype: object"
]
},
"execution_count": 5,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Inspect the variables in the table: summarize the numeric columns, then the object (string) columns\n",
"print(\"Describe Data\")\n",
"print(articles.describe())\n",
"print(\"Summarized Data\")\n",
"print(articles.describe(include=['O']))\n",
"\n",
"# Return the data type of each column\n",
"articles.dtypes"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"50000\n"
]
}
],
"source": [
"# Display the number of rows in the data\n",
"print(len(articles))"
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [
{
"data": {
"image/png":...