{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Here, we'll be working with some data from the Indego bikeshare company:\n",
"\n",
"- `./data/indego-trips-2017-q3.csv`\n",
"\n",
"Our goal is to look at a particular numeric aspect:\n",
"\n",
"- how often bikes get used (and worn out).\n",
"\n",
"The data set covers a single quarter of 2017, so every bike is observed over the same length of time, right? Well, if so, and if each bike gets rented at random at the same fixed rate, $\\lambda$, then the bike usage probabilities:\n",
"\n",
"$$P(\\text{a bike gets rented }\\:x\\:\\text{ times in a quarter})$$\n",
"\n",
"will follow a Poisson distribution! Let's investigate to see whether the data support this possibility."
]
},
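{
"cell_type": "markdown",
"metadata": {},
"source": [
"For reference, here is a sketch of what that claim means, assuming the standard Poisson form with a single rate parameter $\\lambda$ (the expected number of rentals per bike per quarter). The probabilities would follow\n",
"\n",
"$$P(x) = \\frac{\\lambda^{x} e^{-\\lambda}}{x!}, \\qquad x = 0, 1, 2, \\ldots$$\n",
"\n",
"A Poisson distribution has mean and variance both equal to $\\lambda$, so comparing the sample mean and sample variance of the per-bike trip counts is one quick consistency check."
]
},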
{
"cell_type": "markdown",
"metadata": {},
"source": [
"__C1.__ _(2 pts)_ To get started, import pandas and load the data as usual. Print the spreadsheet's head so that the data's structure is close at hand."
]
},
{
"cell_type": "code",
"execution_count": 14,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
" trip_id duration start_time end_time \\\n",
"0 144361832 12 2017-07-01 00:04:00 2017-07-01 00:16:00 \n",
"1 144361829 31 2017-07-01 00:06:00 2017-07-01 00:37:00 \n",
"2 144361830 15 2017-07-01 00:06:00 2017-07-01 00:21:00 \n",
"3 144361831 15 2017-07-01 00:06:00 2017-07-01 00:21:00 \n",
"4 144361828 30 2017-07-01 00:07:00 2017-07-01 00:37:00 \n",
"\n",
" start_station start_lat start_lon end_station end_lat end_lon \\\n",
"0 3160 39.956619 -75.198624 3163 39.949741 -75.180969 \n",
"1 3046 39.950119 -75.144722 3101 39.942951 -75.159554 \n",
"2 3006 39.952202 -75.203110 3101 39.942951 -75.159554 \n",
"3 3006 39.952202 -75.203110 3101 39.942951 -75.159554 \n",
"4 3046 39.950119 -75.144722 3101 39.942951 -75.159554 \n",
"\n",
" bike_id plan_duration trip_route_category passholder_type \n",
"0 11883 30 One Way Indego30 \n",
"1 5394 0 One Way Walk-up \n",
"2 3331 30 One Way Indego30 \n",
"3 3515 30 One Way Indego30 \n",
"4 11913 0 One Way Walk-up "
]
},
"execution_count": 14,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# code here\n",
"import pandas as pd\n",
"data = pd.read_csv('data/indego-trips-2017-q3.csv')\n",
"data.head()"
]
},
{
"cell_type": "code",
"execution_count": 15,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"(276785, 14)"
]
},
"execution_count": 15,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"data.shape"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"__C2.__ _(5 pts)_ Now, let's start by counting the total number of trips each bike made, using pandas `df.groupby()` to group the trips by bike. Then use a `Counter`, `NumBikes`, to store $n(x)$: the number of bikes, $n$, that were rented exactly $x$ times in the quarter."
]
},
{
"cell_type": "code",
"execution_count": 19,
"metadata": {
"scrolled": true
},
"outputs": [],
"source": [
"# code here\n",
"from collections import Counter\n",
"grouped = data.groupby('bike_id').agg({\"trip_id\": \"count\"})\n",
"NumBikes = Counter()"
]
},
{
"cell_type": "code",
"execution_count": 38,
"metadata": {},
"outputs": [],
"source": [
"# Tally n(x): the number of bikes that were rented exactly x times\n",
"for trips in grouped[\"trip_id\"]:\n",
"    NumBikes[trips] += 1"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"__C3.__ _(5 pts)_ Now that we've got our bikes counted up, let's compute the empirical probabilities:\n",
"\n",
"$$P(x) = P(\\text{a bike is rented }\\:x\\:\\text{ times in a quarter}) = \n",
"\\frac{n(x)}{\\sum n(x)}.$$\n",
"\n",
"We already have $n(x)$ in our `Counter()` from __C2__, so let's start by turning its keys and values into numpy arrays (vectors), `x` and `n`, respectively. After this is done, we can make the probabilities, `p`, by rescaling `n` by a scalar: divide it by its sum."
]
},
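{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a small worked example with made-up numbers (not taken from the Indego data), suppose four bikes were rented 1, 1, 2, and 3 times. Then $n(1) = 2$, $n(2) = 1$, $n(3) = 1$, and $\\sum_x n(x) = 4$, so\n",
"\n",
"$$P(1) = \\frac{2}{4} = 0.5, \\qquad P(2) = \\frac{1}{4} = 0.25, \\qquad P(3) = \\frac{1}{4} = 0.25.$$\n",
"\n",
"The probabilities sum to 1, which is a handy check on `p` for the real data as well."
]
},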
{
"cell_type": "code",
"execution_count": 60,
"metadata": {},
"outputs": [],
"source": [
"# code here\n",
"import numpy as np\n",
"# Keys of the Counter are the rental counts x; values are the numbers of bikes n(x)\n",
"x = np.array(list(NumBikes.keys()))\n",
"n = np.array(list(NumBikes.values()))"
]
},
{
"cell_type": "code",
"execution_count": 66,
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"# Empirical probabilities: divide each count n(x) by the total number of bikes\n",
"p = n / n.sum()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"__C4.__ _(2 pts)_ Now it's time to find the average number of times a bike gets rented in a quarter. We'll call this quantity $\\lambda$. So far, we've talked about averages of data, e.g., the arithmetic...