Scalable Data Analytics Homework 1 Spring 2021 Deadline Feb.15 Noon, 2021 Deadlines Homework 1 is due on Feb.15th 12:30pm. 50% late submission penalty. How to submit: Please submit a zip file to the...

Please follow the instructions in the PDF and use the attached CSV file to answer the 3 simple questions. Please submit the Jupyter (Python) code as an .ipynb file.


Scalable Data Analytics
Homework 1, Spring 2021
Deadline: Feb. 15, 2021, Noon

Deadlines: Homework 1 is due on Feb. 15th at 12:30pm. 50% late submission penalty.

How to submit: Please submit a zip file to the Assignment/Homework 1 folder in iCollege. The zip file name should be ’Yourname-Pantherid.zip’. The zipped file should contain three separate IPython notebook files, ’1-generator.ipynb’, ’2-HOF.ipynb’, and ’3-generator-HOF.ipynb’, for the first, second, and third problems respectively.

Data Set: Citibike dataset posted in iCollege.

1. (2 points) Python’s Generators and Streaming.
Compute the median age of Citibike’s subscribed customers. You are required to read the data line by line and are not allowed to store the entire data set in memory. In particular, you should not have any container (e.g. list, dictionary, DataFrame, etc.) with more than 100 elements in memory. Use yield when you want to iterate over a sequence but don’t want to store the entire sequence in memory, as shown in Codes/Lab3.
What to submit: Turn in an IPython notebook with a histogram plot of the customers’ ages and print out a single number showing the median age of the subscribed customers.

2. (4 points) Python’s Higher Order Functions.
This is how you can read the file and transform it into a list of lists:

    import pandas as pd
    df = pd.read_csv("citibike.csv")
    rows = df.values.tolist()

(a) Determine the number of trips that gender 1 made and that gender 2 made. We can do this by counting the number of occurrences of "1" and "2" in the gender column (2pt):

    # After this, you should get something like
    # (37805, 7848)

(b) Count the number of trips per birth year using higher order functions (2pt):

    # After this, you should get something like
    # {"1900.0": 22, "1901.0": 1, "1910.0": 2, "1922.0": 4, ... "1995.0": 256, "1996.0": 124, "1997.0": 94, "1998.0": 59, "1999.0": 17}

Hint: math.isnan() can be used to detect the nan values so that they can be filtered out.
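For problem 2, one possible shape for a solution built on map, filter, and reduce is sketched below. The sample rows and the column positions BIRTH_YEAR and GENDER are made up for illustration; the real positions depend on the column order of citibike.csv, which the assignment text does not give.

```python
import math
from functools import reduce

# Hypothetical layout: each row is [birth_year, gender]. These indexes are
# assumptions and must be adjusted to the real column order of citibike.csv.
BIRTH_YEAR, GENDER = 0, 1

# Made-up sample standing in for rows = df.values.tolist()
rows = [
    [1985.0, 1],
    [1990.0, 2],
    [float("nan"), 0],
    [1985.0, 1],
    [1990.0, 1],
]

def count_genders(rows):
    """Return (trips by gender 1, trips by gender 2) with map + reduce."""
    genders = map(lambda r: r[GENDER], rows)
    return reduce(
        lambda acc, g: (acc[0] + (g == 1), acc[1] + (g == 2)),
        genders,
        (0, 0),
    )

def count_birth_years(rows):
    """Return {birth year as string: trips}, skipping nan birth years."""
    years = filter(
        lambda y: not math.isnan(y),
        map(lambda r: r[BIRTH_YEAR], rows),
    )

    def add(acc, year):
        key = str(year)
        acc[key] = acc.get(key, 0) + 1
        return acc

    return reduce(add, years, {})

print(count_genders(rows))      # (3, 1)
print(count_birth_years(rows))  # {'1985.0': 2, '1990.0': 2}
```

The reduce accumulator for part (a) is a two-element tuple, so both counts come out of a single pass over the gender column.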
What to submit: Turn in an IPython notebook that prints out the results for problems (a) and (b).

3. (4 points) Extract the first ride of the day from a Citibike data stream.
The first ride of the day is interpreted as the ride with the earliest starting time of a day. For the sample data, which is a week's worth of Citibike records, your program should generate only 7 items (one for each day).
Streaming Computation: you are asked to complete the task using streaming computation methods. You can only iterate over the data set once, using yield as shown in Codes/Lab3. You can store a container (e.g. list, dictionary, DataFrame, etc.) with at most 7 elements in memory. The data set has been sorted by the starting time.
What to submit: Turn in an IPython notebook that prints out the birth years of the first riders of each day.

{
"cells": [
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [],
"source": [
"import csv"
]
},
{
"cell_type": "code",
"execution_count": 17,
"metadata": {},
"outputs": [],
"source": [
"#read file\n",
"with open(\"citibike.csv\",\"r\") as fi:\n",
"    reader = csv.DictReader(fi)\n",
"    for row in reader:\n",
"        birthyear = row[\"birth_year\"]\n",
"        if birthyear != \"\":\n",
"            age = 2015-int(birthyear)"
]
},
{
"cell_type": "code",
"execution_count": 18,
"metadata": {},
"outputs": [],
"source": [
"#def generator to iterate the data set only once\n",
"def citibike2gen(filename):\n",
"    with open(filename,\"r\") as fi:\n",
"        reader = csv.DictReader(fi)\n",
"        for row in reader:\n",
"            birthyear = row[\"birth_year\"]\n",
"            if birthyear != \"\":\n",
"                age = 2015-int(birthyear)\n",
"                yield age\n",
"count = {}\n",
"for age in citibike2gen(\"citibike.csv\"):\n",
"    count[age] = count.get(age,0)+1"
]
},
{
"cell_type": "code",
"execution_count": 19,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"{37: 1377,\n", " 22: 470,\n", " 46: 1133,\n", " 30: 1673,\n", " 58: 449,\n",
" 36: 1279,\n", " 32: 1793,\n", " 60: 413,\n", " 33: 1455,\n", " 27: 1358,\n",
" 24: 922,\n", " 25: 1361,\n", " 38: 1122,\n", " 47: 1010,\n", " 28: 1730,\n",
" 35: 1509,\n", " 55: 771,\n", " 29: 1568,\n", " 34: 1499,\n", " 40: 1071,\n",
" 42: 1022,\n", " 44: 1162,\n", " 31: 1714,\n", " 20: 256,\n", " 21: 392,\n",
" 49: 863,\n", " 43: 1081,\n", " 51: 891,\n", " 61: 417,\n", " 23: 493,\n",
" 26: 1322,\n", " 45: 1347,\n", " 54: 618,\n", " 41: 1158,\n", " 39: 1168,\n",
" 56: 687,\n", " 50: 947,\n", " 57: 783,\n", " 48: 999,\n", " 52: 970,\n",
" 66: 134,\n", " 63: 247,\n", " 70: 28,\n", " 67: 149,\n", " 18: 94,\n",
" 19: 124,\n", " 53: 899,\n", " 65: 150,\n", " 71: 59,\n", " 62: 346,\n",
" 59: 488,\n", " 64: 229,\n", " 74: 39,\n", " 77: 24,\n", " 81: 8,\n",
" 68: 74,\n", " 73: 61,\n", " 75: 21,\n", " 72: 18,\n", " 69: 93,\n",
" 17: 59,\n", " 115: 22,\n", " 16: 17,\n", " 80: 9,\n", " 76: 4,\n",
" 105: 2,\n", " 89: 1,\n", " 86: 1,\n", " 114: 1,\n", " 93: 4}"
]
},
"execution_count": 19,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"count"
]
},
{
"cell_type": "code",
"execution_count": 22,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"2015-02-01 00:00:00+00:00\n",
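The notebook above already builds a per-age histogram from a generator, which keeps memory bounded by the number of distinct ages (70 keys in the output shown). A minimal sketch of how the median could then be read off that histogram is given below. The in-memory SAMPLE text, the "usertype" column, and its "Subscriber" value are assumptions made for illustration; only "birth_year" and the reference year 2015 come from the notebook.

```python
import csv
import io

# Tiny stand-in for citibike.csv. The "usertype" column and its "Subscriber"
# value are assumptions; the post itself only shows "birth_year".
SAMPLE = (
    "starttime,usertype,birth_year\n"
    "2015-02-01 00:00:00,Subscriber,1985\n"
    "2015-02-01 00:01:00,Customer,\n"
    "2015-02-01 00:02:00,Subscriber,1990\n"
    "2015-02-01 00:03:00,Subscriber,1975.0\n"
    "2015-02-01 00:04:00,Subscriber,1980\n"
)

def subscriber_ages(lines, ref_year=2015):
    """Yield one age per subscriber row, reading the input line by line."""
    for row in csv.DictReader(lines):
        if row["usertype"] != "Subscriber" or row["birth_year"] == "":
            continue
        # float() first so that values such as "1975.0" are accepted too
        yield ref_year - int(float(row["birth_year"]))

def median_age(age_stream):
    """Median from a small histogram: one dict key per distinct age."""
    counts = {}
    for age in age_stream:
        counts[age] = counts.get(age, 0) + 1
    total = sum(counts.values())
    if total == 0:
        return None
    # 0-based positions of the two middle elements (equal when total is odd)
    lo, hi = (total - 1) // 2, total // 2
    seen, lo_value = 0, None
    for age in sorted(counts):
        seen += counts[age]
        if lo_value is None and seen > lo:
            lo_value = age
        if seen > hi:
            return (lo_value + age) / 2

print(median_age(subscriber_ages(io.StringIO(SAMPLE))))  # 32.5
```

For the real file, the same generator could be fed an open file handle, for example `with open("citibike.csv", newline="") as fi: median_age(subscriber_ages(fi))`. The histogram dict can also be passed to a plotting call for the required age histogram.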
Answered 1 day after Feb 16, 2021


Sanchi answered on Feb 17 2021
149 Votes
{
"cells": [
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd\n"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [],
"source": [
" \n",
"data = pd.read_csv('C:/Users/sanchi.kalra/Desktop/Greynodes/AS18/citibike-ltwimtfd.csv')\n"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [],
"source": [
"data['starttime'] = pd.to_datetime(data['starttime'], format='%Y-%m-%d %H:%M:%S')\n"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [],
"source": [
"x = data.resample('D', on= 'starttime').min()\n"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [],
"source": [
"y =[]\n"
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [],
"source": [
"for m in x['starttime']:\n",
"\ty.append(m)\n"
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {},
"outputs": [],
"source": [
"df = data[data['starttime'].isin(y)]\n"
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {},
"outputs": [],
"source": [
"data1 = df[['starttime','birth_year']]\n"
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {},
"outputs": [],
"source": [
"data1 = data1.groupby(data1['starttime'].unique())\n"
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {},
"outputs": [],
"source": [
"finaldata = data1.apply(lambda x: x)\n"
...
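The notebook above loads the whole file with pandas, whereas problem 3 asks for a single streaming pass using yield. A minimal sketch of a generator-based alternative follows, assuming the "starttime" values begin with a YYYY-MM-DD date (consistent with the format string used above) and that the rows are already sorted by start time, as the assignment states. The SAMPLE data is made up for illustration.

```python
import csv
import io

# Made-up stand-in for citibike.csv, already sorted by starttime.
SAMPLE = (
    "starttime,birth_year\n"
    "2015-02-01 00:00:03,1978\n"
    "2015-02-01 00:05:10,1992\n"
    "2015-02-02 00:02:00,1969\n"
    "2015-02-02 07:30:00,1985\n"
    "2015-02-03 00:00:45,1990\n"
)

def first_rides(lines):
    """Yield the first row of each calendar day from a time-sorted stream."""
    current_day = None
    for row in csv.DictReader(lines):
        day = row["starttime"][:10]  # assumes a leading 'YYYY-MM-DD'
        if day != current_day:
            # Input is sorted, so the first row of a new day is its earliest ride
            current_day = day
            yield row

birth_years = [row["birth_year"] for row in first_rides(io.StringIO(SAMPLE))]
print(birth_years)  # ['1978', '1969', '1990']
```

Only the current day string is held between rows, and the collected result has one entry per day, which for a week of data stays within the 7-element limit.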