{
"cells": [
{
"cell_type": "markdown",
"metadata": {
"colab_type": "text",
"id": "Hm-lEcSS1qVm"
},
"source": [
"# HW2 - Transforms!\n",
" \n",
"In this homework we'll be practicing some data aggregations, joins, and flattening JSON files in order to make inference from two datasets!\n",
" \n",
"The first is a dataset with 25,000,000 movie reviews. These are already in flat files and are clean. But, we obviously want to do some filtering to remove movies that only have a review or two. And we'll want to do a couple of aggregations and joins to distill down those 25 million reviews into something understandable and useful. \n",
" \n",
"The other dataset will have you going to Yelp and making a dev account. You'll then access various bits of information about local businesses and their reviews in order to figure out which ones are doing the best in Covid times. \n"
]
},
{
"cell_type": "markdown",
"metadata": {
"colab_type": "text",
"id": "4dK6_6V0_VY0"
},
"source": [
"## Wranging and aggregating movie review data\n",
" \n",
"The website MovieLens.com has a research group which provides open access to millions of reviews, for free! We're going to work with those data for this homework. [Feel free to check out the website here.](https://grouplens.org/datasets/movielens/) You can go and download the raw data, but in this case I was nice and tossed it all on Amazon S3 for fast direct downloads. \n",
" \n",
"There are two datasets we'll be working with.\n",
"* movies - this is a file of 60,000+ movies\n",
"* reviews - this is a file of 25 *million* individual reviews for the 60k movies\n",
" \n",
"The goal for this section of the homework is to do two types of data aggregations that will allow for someone to make inference on which movies were the most popular, reviewed, polarizing and were cult classics. \n",
" \n",
"To do this we'll first start by making a simple data set that brings just overall review properties together with movies. We'll then do some deeper groping to create a dataset that looks at the same properties but over time. "
]
},
{
"cell_type": "markdown",
"metadata": {
"colab_type": "text",
"id": "EGSFQInVD6ws"
},
"source": [
"### Data first\n",
" \n",
"Let's bring in our two files and libraries. The ratings file is understandably large. So it's a good idea to download it and then save a copy as something else and work with that. This way if you mess up you don't have to download it all over again\n"
]
},
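{
"cell_type": "markdown",
"metadata": {},
"source": [
"A minimal sketch of one way to do this once the ratings file has been loaded below (the local filename `ratings_local.csv` is just an example):\n",
" \n",
"```\n",
"# Save the downloaded ratings locally so a mistake doesn't force a re-download\n",
"ratings.to_csv('ratings_local.csv', index=False)\n",
"# Later you can reload from disk instead of from S3\n",
"# ratings = pd.read_csv('ratings_local.csv')\n",
"# And keep an untouched in-memory copy to fall back on\n",
"ratings_backup = ratings.copy()\n",
"```"
]
},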
{
"cell_type": "code",
"execution_count": 135,
"metadata": {
"colab": {},
"colab_type": "code",
"id": "39GB_QJjD57f"
},
"outputs": [],
"source": [
"# Libraries\n",
"import pandas as pd\n",
"from datetime import datetime\n",
"import seaborn as sns\n",
"import matplotlib.pyplot as plt\n",
"import numpy as np"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {
"colab": {},
"colab_type": "code",
"id": "6MZ0784dEfia"
},
"outputs": [],
"source": [
"# Movies data\n",
"movies = pd.read_csv('https://ista322data.s3-us-west-1.amazonaws.com/movies.csv')"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {
"colab": {},
"colab_type": "code",
"id": "p0yc4ooJdcUC"
},
"outputs": [],
"source": [
"# Ratings data\n",
"ratings = pd.read_csv('https://ista322data.s3-us-west-1.amazonaws.com/ratings.csv')"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {
"colab": {},
"colab_type": "code",
"id": "qI52SkIGPmAW"
},
"outputs": [],
"source": [
"# Make a copy of ratings\n",
"ratings_backup = ratings.copy"
]
},
{
"cell_type": "markdown",
"metadata": {
"colab_type": "text",
"id": "iPR1QogNFAgm"
},
"source": [
"## Explore your data - 1 point\n",
" \n",
"Below take some time to explore your date. Check out the following items.\n",
" \n",
"* Head and tail of both datasets\n",
"* Shape\n",
"* Datatypes\n",
"* The number of NaN values in the rating column of the ratings dataset\n",
" \n",
"**Task** Do the head,tail, and shape operations all in one cell. Count the number of NaNs in another. "
]
},
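{
"cell_type": "markdown",
"metadata": {},
"source": [
"If it helps, here is a small sketch of checking missing values across every column at once (using the `movies` and `ratings` dataframes loaded above):\n",
" \n",
"```\n",
"# Per-column NaN counts for both dataframes\n",
"print(movies.isna().sum())\n",
"print(ratings.isna().sum())\n",
"# Or just the rating column\n",
"print(ratings['rating'].isna().sum())\n",
"```"
]
},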
{
"cell_type": "code",
"execution_count": 12,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"((62423, 3), (25000095, 4))"
]
},
"execution_count": 12,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"movies.shape, ratings.shape"
]
},
{
"cell_type": "code",
"execution_count": 13,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
"\n",
"\n",
" | \n",
"movieId | \n",
"title | \n",
"genres | \n",
"
\n",
"\n",
"\n",
"\n",
"0 | \n",
"1 | \n",
"Toy Story (1995) | \n",
"Adventure|Animation|Children|Comedy|Fantasy | \n",
"
\n",
"\n",
"1 | \n",
"2 | \n",
"Jumanji (1995) | \n",
"Adventure|Children|Fantasy | \n",
"
\n",
"\n",
"2 | \n",
"3 | \n",
"Grumpier Old Men (1995) | \n",
"Comedy|Romance | \n",
"
\n",
"\n",
"3 | \n",
"4 | \n",
"Waiting to Exhale (1995) | \n",
"Comedy|Drama|Romance | \n",
"
\n",
"\n",
"4 | \n",
"5 | \n",
"Father of the Bride Part II (1995) | \n",
"Comedy | \n",
"
\n",
"\n",
"
\n",
"
"
],
"text/plain": [
" movieId title \\\n",
"0 1 Toy Story (1995) \n",
"1 2 Jumanji (1995) \n",
"2 3 Grumpier Old Men (1995) \n",
"3 4 Waiting to Exhale (1995) \n",
"4 5 Father of the Bride Part II (1995) \n",
"\n",
" genres \n",
"0 Adventure|Animation|Children|Comedy|Fantasy \n",
"1 Adventure|Children|Fantasy \n",
"2 Comedy|Romance \n",
"3 Comedy|Drama|Romance \n",
"4 Comedy "
]
},
"execution_count": 13,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"movies.head()"
]
},
{
"cell_type": "code",
"execution_count": 15,
"metadata": {
"collapsed": true
},
"outputs": [
{
"data": {
"text/html": [
"
\n",
"\n",
"
\n",
"\n",
"\n",
" | \n",
"userId | \n",
"movieId | \n",
"rating | \n",
"timestamp | \n",
"
\n",
"\n",
"\n",
"\n",
"0 | \n",
"1 | \n",
"296 | \n",
"5.0 | \n",
"1147880044 | \n",
"
\n",
"\n",
"1 | \n",
"1 | \n",
"306 | \n",
"3.5 | \n",
"1147868817 | \n",
"
\n",
"\n",
"2 | \n",
"1 | \n",
"307 | \n",
"5.0 | \n",
"1147868828 | \n",
"
\n",
"\n",
"3 | \n",
"1 | \n",
"665 | \n",
"5.0 | \n",
"1147878820 | \n",
"
\n",
"\n",
"4 | \n",
"1 | \n",
"899 | \n",
"3.5 | \n",
"1147868510 | \n",
"
\n",
"\n",
"
\n",
"
"
],
"text/plain": [
" userId movieId rating timestamp\n",
"0 1 296 5.0 1147880044\n",
"1 1 306 3.5 1147868817\n",
"2 1 307 5.0 1147868828\n",
"3 1 665 5.0 1147878820\n",
"4 1 899 3.5 1147868510"
]
},
"execution_count": 15,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"ratings.head()"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {
"colab": {},
"colab_type": "code",
"collapsed": true,
"id": "E6hE7ikTYeGk"
},
"outputs": [
{
"data": {
"text/plain": [
"( movieId title \\\n",
" 0 1 Toy Story (1995) \n",
" 1 2 Jumanji (1995) \n",
" 2 3 Grumpier Old Men (1995) \n",
" 3 4 Waiting to Exhale (1995) \n",
" 4 5 Father of the Bride Part II (1995) \n",
" \n",
" genres \n",
" 0 Adventure|Animation|Children|Comedy|Fantasy \n",
" 1 Adventure|Children|Fantasy \n",
" 2 Comedy|Romance \n",
" 3 Comedy|Drama|Romance \n",
" 4 Comedy ,\n",
" movieId title genres\n",
" 62418 209157 We (2018) Drama\n",
" 62419 209159 Window of the Soul (2001) Documentary\n",
" 62420 209163 Bad Poems (2018) Comedy|Drama\n",
" 62421 209169 A Girl Thing (2001) (no genres listed)\n",
" 62422 209171 Women of Devil's Island (1962) Action|Adventure|Drama,\n",
" userId movieId rating timestamp\n",
" 0 1 296 5.0 1147880044\n",
" 1 1 306 3.5 1147868817\n",
" 2 1 307 5.0 1147868828\n",
" 3 1 665 5.0 1147878820\n",
" 4 1 899 3.5 1147868510,\n",
" userId movieId rating timestamp\n",
" 25000090 162541 50872 4.5 1240953372\n",
" 25000091 162541 55768 2.5 1240951998\n",
" 25000092 162541 56176 2.0 1240950697\n",
" 25000093 162541 58559 4.0 1240953434\n",
" 25000094 162541 63876 5.0 1240952515,\n",
" (62423, 3),\n",
" (25000095, 4))"
]
},
"execution_count": 5,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# head, tail, shape of both datastes\n",
"movies.head(),movies.tail(),ratings.head(),ratings.tail(), movies.shape, ratings.shape"
]
},
{
"cell_type": "code",
"execution_count": 27,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Movies DF \n",
" movieId int64\n",
"title object\n",
"genres object\n",
"dtype: object \n",
"\n",
"Ratings DF \n",
" userId int64\n",
"movieId int64\n",
"rating float64\n",
"timestamp int64\n",
"dtype: object\n"
]
},
{
"data": {
"text/plain": [
"(None, None)"
]
},
"execution_count": 27,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"print(\"Movies DF \\n\",movies.dtypes,\"\\n\"), print(\"Ratings DF \\n\",ratings.dtypes)"
]
},
{
"cell_type": "code",
"execution_count": 28,
"metadata": {
"colab": {},
"colab_type": "code",
"id": "NIxsJVC1F7jK"
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Number of NaNs in 'rating' column 0\n"
]
}
],
"source": [
"# Get count of NaNs\n",
"print(\"Number of NaNs in 'rating' column {0}\".format(ratings['rating'].isna().sum()))"
]
},
{
"cell_type": "markdown",
"metadata": {
"colab_type": "text",
"id": "IbY5XgE33Kmc"
},
"source": [
"## Convert timestamp in ratings to a datetime. - 0.5 points\n",
" \n",
"One issue that you can see from your exploration is that the ratings only have a timestamp. This timestamp is measured in the number of seconds since 00:00:00 on January 1st, 1970. You'll need to convert this to a datetime in order to actually do our later data aggregations. \n",
" \n",
"You use `pd.to_datetime` on timestamps like this. [For full details on the various ways to use this function please look at the Pandas documentation.](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.to_datetime.html) Briefly, this is how it's used. \n",
" \n",
"```\n",
">>> pd.to_datetime(1490195805, unit='s')\n",
"Timestamp('2017-03-22 15:16:45')\n",
"```\n",
" \n",
"**Task** Go and make a new column in the movies dataframe called `review_dt` that contains this timestamp but converted to a datetime. \n",
" \n",
"Note, it's worth doing this by assigning to a test vector first to make sure it works. Once it works, then assign to the `reviews` dataframe. \n",
" \n",
"For example\n",
"```\n",
"test_vec = pd.to_datetime(arguments)\n",
"test_vec # to check what it contains\n",
"```"
]
},
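{
"cell_type": "markdown",
"metadata": {},
"source": [
"Note that `pd.to_datetime` also works directly on a whole column (a Series), which is much faster than converting row by row. A minimal sketch, using the `ratings` dataframe from above:\n",
" \n",
"```\n",
"# Vectorized conversion of the entire timestamp column at once\n",
"test_vec = pd.to_datetime(ratings['timestamp'], unit='s')\n",
"test_vec.head()  # check it looks right\n",
"# Once it does, assign it back:\n",
"# ratings['review_dt'] = pd.to_datetime(ratings['timestamp'], unit='s')\n",
"```"
]
},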
{
"cell_type": "code",
"execution_count": 35,
"metadata": {
"colab": {},
"colab_type": "code",
"id": "ecldZK15dqQC"
},
"outputs": [],
"source": [
"# Make test vector\n",
"test_vec = pd.to_datetime(ratings['timestamp'][0],unit='s')\n"
]
},
{
"cell_type": "code",
"execution_count": 36,
"metadata": {
"colab": {},
"colab_type": "code",
"id": "0rWM0NKhowev"
},
"outputs": [
{
"data": {
"text/plain": [
"Timestamp('2006-05-17 15:34:04')"
]
},
"execution_count": 36,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Check vector and datatype to make sure it make sense. \n",
"test_vec"
]
},
{
"cell_type": "code",
"execution_count": 37,
"metadata": {
"colab": {},
"colab_type": "code",
"id": "RBXoYE2Mo1pR"
},
"outputs": [
{
"data": {
"text/html": [
"
\n",
"\n",
"
\n",
"\n",
"\n",
" | \n",
"userId | \n",
"movieId | \n",
"rating | \n",
"timestamp | \n",
"review_dt | \n",
"
\n",
"\n",
"\n",
"\n",
"0 | \n",
"1 | \n",
"296 | \n",
"5.0 | \n",
"1147880044 | \n",
"2006-05-17 15:34:04 | \n",
"
\n",
"\n",
"1 | \n",
"1 | \n",
"306 | \n",
"3.5 | \n",
"1147868817 | \n",
"2006-05-17 12:26:57 | \n",
"
\n",
"\n",
"2 | \n",
"1 | \n",
"307 | \n",
"5.0 | \n",
"1147868828 | \n",
"2006-05-17 12:27:08 | \n",
"
\n",
"\n",
"3 | \n",
"1 | \n",
"665 | \n",
"5.0 | \n",
"1147878820 | \n",
"2006-05-17 15:13:40 | \n",
"
\n",
"\n",
"4 | \n",
"1 | \n",
"899 | \n",
"3.5 | \n",
"1147868510 | \n",
"2006-05-17 12:21:50 | \n",
"
\n",
"\n",
"
\n",
"
"
],
"text/plain": [
" userId movieId rating timestamp review_dt\n",
"0 1 296 5.0 1147880044 2006-05-17 15:34:04\n",
"1 1 306 3.5 1147868817 2006-05-17 12:26:57\n",
"2 1 307 5.0 1147868828 2006-05-17 12:27:08\n",
"3 1 665 5.0 1147878820 2006-05-17 15:13:40\n",
"4 1 899 3.5 1147868510 2006-05-17 12:21:50"
]
},
"execution_count": 37,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Now ad this to the ratings dataframe\n",
"ratings['review_dt'] = ratings['timestamp'].apply(lambda x : pd.to_datetime(x,unit='s'))\n",
"ratings.head()"
]
},
{
"cell_type": "code",
"execution_count": 39,
"metadata": {},
"outputs": [],
"source": [
"import pickle\n",
"with open('ratings_with_review_dt.pickle','wb') as handle:\n",
" pickle.dump(ratings,handle,protocol=pickle.HIGHEST_PROTOCOL)"
]
},
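{
"cell_type": "markdown",
"metadata": {},
"source": [
"If the kernel restarts, the pickled copy saved above can be read back in rather than re-downloading and re-converting. A small sketch, mirroring the filename used above:\n",
" \n",
"```\n",
"import pickle\n",
"with open('ratings_with_review_dt.pickle', 'rb') as handle:\n",
"    ratings = pickle.load(handle)\n",
"```"
]
},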
{
"cell_type": "code",
"execution_count": 40,
"metadata": {
"colab": {},
"colab_type": "code",
"id": "5RQV_35CdqdS"
},
"outputs": [
{
"data": {
"text/html": [
"
\n",
"\n",
"
\n",
"\n",
"\n",
" | \n",
"userId | \n",
"movieId | \n",
"rating | \n",
"timestamp | \n",
"review_dt | \n",
"
\n",
"\n",
"\n",
"\n",
"0 | \n",
"1 | \n",
"296 | \n",
"5.0 | \n",
"1147880044 | \n",
"2006-05-17 15:34:04 | \n",
"
\n",
"\n",
"1 | \n",
"1 | \n",
"306 | \n",
"3.5 | \n",
"1147868817 | \n",
"2006-05-17 12:26:57 | \n",
"
\n",
"\n",
"2 | \n",
"1 | \n",
"307 | \n",
"5.0 | \n",
"1147868828 | \n",
"2006-05-17 12:27:08 | \n",
"
\n",
"\n",
"3 | \n",
"1 | \n",
"665 | \n",
"5.0 | \n",
"1147878820 | \n",
"2006-05-17 15:13:40 | \n",
"
\n",
"\n",
"4 | \n",
"1 | \n",
"899 | \n",
"3.5 | \n",
"1147868510 | \n",
"2006-05-17 12:21:50 | \n",
"
\n",
"\n",
"
\n",
"
"
],
"text/plain": [
" userId movieId rating timestamp review_dt\n",
"0 1 296 5.0 1147880044 2006-05-17 15:34:04\n",
"1 1 306 3.5 1147868817 2006-05-17 12:26:57\n",
"2 1 307 5.0 1147868828 2006-05-17 12:27:08\n",
"3 1 665 5.0 1147878820 2006-05-17 15:13:40\n",
"4 1 899 3.5 1147868510 2006-05-17 12:21:50"
]
},
"execution_count": 40,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Check the head of your dataframe again\n",
"ratings.head()"
]
},
{
"cell_type": "code",
"execution_count": 41,
"metadata": {
"colab": {},
"colab_type": "code",
"id": "jt8syWKWdqji"
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Oldest Review: 1995-01-09 11:46:49\n",
"Newest Review: 2019-11-21 09:15:03\n"
]
}
],
"source": [
"# What's the oldest and newest review? \n",
"print(\"Oldest Review: {0}\".format(ratings['review_dt'].min()))\n",
"print(\"Newest Review: {0}\".format(ratings['review_dt'].max()))"
]
},
{
"cell_type": "markdown",
"metadata": {
"colab_type": "text",
"id": "3lQ8tFNgpPjF"
},
"source": [
"## Your first aggregation and join - 1.5 points\n",
"\n",
"The first aggregation and join I want you to do is just at the whole movie level. Your movies dataframe just has a single row for each movie, but obviously there are thousands of individual reviews for each of those movies. We want to come up with some summary statistics on the reviews for each movie and then join them to the movies dataframe. \n",
"\n",
"**Task:** I want you to do the following:\n",
"* Create an aggregated dataframe called `ratings_by_movie`. This dataframe should be grouped by each movie. I want you to use `.agg()` to calculate the mean, standard deviation ('std') and the number of reviews for each movie. \n",
"* rename the columns of that dataframe `'movieId', 'rating_mean', 'rating_std', and 'rating_count'`\n",
"* Join this new `ratings_by_movie` dataframe such that it attaches all those summary statistics to their corresponding movies. Called the joined dataframe `movies_with_ratings`"
]
},
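{
"cell_type": "markdown",
"metadata": {},
"source": [
"If the pattern is unclear, here is a toy sketch of the aggregate-rename-join steps on a tiny made-up dataframe (the values are invented purely for illustration):\n",
" \n",
"```\n",
"# Toy data standing in for ratings and movies\n",
"toy_ratings = pd.DataFrame({'movieId': [1, 1, 2, 2], 'rating': [4.0, 5.0, 3.0, 2.0]})\n",
"toy_movies = pd.DataFrame({'movieId': [1, 2], 'title': ['A', 'B']})\n",
" \n",
"# Aggregate per movie, then rename the columns\n",
"toy_agg = toy_ratings.groupby('movieId')['rating'].agg(['mean', 'std', 'count']).reset_index()\n",
"toy_agg.columns = ['movieId', 'rating_mean', 'rating_std', 'rating_count']\n",
" \n",
"# Left join the summaries onto the movie table\n",
"toy_joined = pd.merge(toy_movies, toy_agg, how='left', on='movieId')\n",
"```"
]
},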
{
"cell_type": "code",
"execution_count": 48,
"metadata": {
"colab": {},
"colab_type": "code",
"id": "1diokGd0fghb"
},
"outputs": [],
"source": [
"# Make ratings_by_movie\n",
"ratings_by_movie = ratings.groupby('movieId', as_index=False).rating.agg(['mean','std','count']).reset_index()"
]
},
{
"cell_type": "code",
"execution_count": 49,
"metadata": {
"colab": {},
"colab_type": "code",
"id": "vaoGm6endqVu"
},
"outputs": [
{
"data": {
"text/html": [
"
\n",
"\n",
"
\n",
"\n",
"\n",
" | \n",
"movieId | \n",
"mean | \n",
"std | \n",
"count | \n",
"
\n",
"\n",
"\n",
"\n",
"0 | \n",
"1 | \n",
"3.893708 | \n",
"0.921552 | \n",
"57309 | \n",
"
\n",
"\n",
"1 | \n",
"2 | \n",
"3.251527 | \n",
"0.959851 | \n",
"24228 | \n",
"
\n",
"\n",
"2 | \n",
"3 | \n",
"3.142028 | \n",
"1.008443 | \n",
"11804 | \n",
"
\n",
"\n",
"3 | \n",
"4 | \n",
"2.853547 | \n",
"1.108531 | \n",
"2523 | \n",
"
\n",
"\n",
"4 | \n",
"5 | \n",
"3.058434 | \n",
"0.996611 | \n",
"11714 | \n",
"
\n",
"\n",
"
\n",
"
"
],
"text/plain": [
" movieId mean std count\n",
"0 1 3.893708 0.921552 57309\n",
"1 2 3.251527 0.959851 24228\n",
"2 3 3.142028 1.008443 11804\n",
"3 4 2.853547 1.108531 2523\n",
"4 5 3.058434 0.996611 11714"
]
},
"execution_count": 49,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Check it\n",
"ratings_by_movie.head()"
]
},
{
"cell_type": "code",
"execution_count": 51,
"metadata": {
"colab": {},
"colab_type": "code",
"id": "eiNm1Qo4sUIr"
},
"outputs": [],
"source": [
"# Rename columns\n",
"ratings_by_movie.columns = ['movieId', 'rating_mean', 'rating_std','rating_count']"
]
},
{
"cell_type": "code",
"execution_count": 55,
"metadata": {
"colab": {},
"colab_type": "code",
"id": "gfwRPPcxrY71"
},
"outputs": [],
"source": [
"# Join it and call movies with ratings\n",
"movies_with_ratings = pd.merge(movies,\n",
" ratings_by_movie,\n",
" how='left',\n",
" on='movieId',\n",
" validate='1:1')"
]
},
{
"cell_type": "code",
"execution_count": 56,
"metadata": {
"colab": {},
"colab_type": "code",
"id": "_Ee4dKTTsIqd"
},
"outputs": [
{
"data": {
"text/html": [
"
\n",
"\n",
"
\n",
"\n",
"\n",
" | \n",
"movieId | \n",
"title | \n",
"genres | \n",
"rating_mean | \n",
"rating_std | \n",
"rating_count | \n",
"
\n",
"\n",
"\n",
"\n",
"0 | \n",
"1 | \n",
"Toy Story (1995) | \n",
"Adventure|Animation|Children|Comedy|Fantasy | \n",
"3.893708 | \n",
"0.921552 | \n",
"57309.0 | \n",
"
\n",
"\n",
"1 | \n",
"2 | \n",
"Jumanji (1995) | \n",
"Adventure|Children|Fantasy | \n",
"3.251527 | \n",
"0.959851 | \n",
"24228.0 | \n",
"
\n",
"\n",
"2 | \n",
"3 | \n",
"Grumpier Old Men (1995) | \n",
"Comedy|Romance | \n",
"3.142028 | \n",
"1.008443 | \n",
"11804.0 | \n",
"
\n",
"\n",
"3 | \n",
"4 | \n",
"Waiting to Exhale (1995) | \n",
"Comedy|Drama|Romance | \n",
"2.853547 | \n",
"1.108531 | \n",
"2523.0 | \n",
"
\n",
"\n",
"4 | \n",
"5 | \n",
"Father of the Bride Part II (1995) | \n",
"Comedy | \n",
"3.058434 | \n",
"0.996611 | \n",
"11714.0 | \n",
"
\n",
"\n",
"
\n",
"
"
],
"text/plain": [
" movieId title \\\n",
"0 1 Toy Story (1995) \n",
"1 2 Jumanji (1995) \n",
"2 3 Grumpier Old Men (1995) \n",
"3 4 Waiting to Exhale (1995) \n",
"4 5 Father of the Bride Part II (1995) \n",
"\n",
" genres rating_mean rating_std \\\n",
"0 Adventure|Animation|Children|Comedy|Fantasy 3.893708 0.921552 \n",
"1 Adventure|Children|Fantasy 3.251527 0.959851 \n",
"2 Comedy|Romance 3.142028 1.008443 \n",
"3 Comedy|Drama|Romance 2.853547 1.108531 \n",
"4 Comedy 3.058434 0.996611 \n",
"\n",
" rating_count \n",
"0 57309.0 \n",
"1 24228.0 \n",
"2 11804.0 \n",
"3 2523.0 \n",
"4 11714.0 "
]
},
"execution_count": 56,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Check movies_with_ratings. \n",
"movies_with_ratings.head()"
]
},
{
"cell_type": "markdown",
"metadata": {
"colab_type": "text",
"id": "V7asUN4XtlA6"
},
"source": [
"## Filtering and more transformations - 1 point\n",
" \n",
"Now we want to clean up this dataset a bit and then do a couple more transforms. One issue you can see from your check above is that many movies only have one rating. We want to set a minimum number of reviews needed to be included. We also want to do some binning where movies with certain ratings levels \n",
" \n",
"**Task:** Please do the following operations\n",
"* Filter `movies_with_ratings` so it only contains movies that have at least 10 ratings\n",
"* Use the function `cut()` to automatically bin our `ratings_mean` column into three groups of 'bad', 'fine', or 'good' movies. Call this `rating_group`.\n",
"* Use the same function to take the standard deviation in rating and make three groups of 'agreement', 'average', 'controversial'. Thus, movies with low standard deviation have agreement in the rating, while movies with high standard deviation have controversy in the ratings. Call this column `ratings_agreement`."
]
},
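{
"cell_type": "markdown",
"metadata": {},
"source": [
"A quick sanity-check sketch before overwriting the dataframe (the 10-rating threshold comes from the task above):\n",
" \n",
"```\n",
"# Count how many movies would survive the filter before applying it\n",
"keep = movies_with_ratings['rating_count'] >= 10\n",
"print(keep.sum(), 'of', len(keep), 'movies have at least 10 ratings')\n",
"```"
]
},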
{
"cell_type": "code",
"execution_count": 57,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"((62423, 6), (62423, 3))"
]
},
"execution_count": 57,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"movies_with_ratings.shape, movies.shape"
]
},
{
"cell_type": "code",
"execution_count": 61,
"metadata": {
"colab": {},
"colab_type": "code",
"id": "A-q2L_UDtoVU"
},
"outputs": [],
"source": [
"# Filter first being sure to overwrite dataframe\n",
"movies_with_ratings = movies_with_ratings.loc[movies_with_ratings['rating_count'] >= 10]"
]
},
{
"cell_type": "code",
"execution_count": 62,
"metadata": {
"colab": {},
"colab_type": "code",
"id": "51jG155vvC64"
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Number of rows left after filtering 'movies_with_ratings' : 24330\n"
]
}
],
"source": [
"# Check how many rows you're left with\n",
"print(\"Number of rows left after filtering 'movies_with_ratings' : {0}\".format(movies_with_ratings.shape[0]))"
]
},
{
"cell_type": "markdown",
"metadata": {
"colab_type": "text",
"id": "Gdax-0B9x2dN"
},
"source": [
"I didn't show you how to use `cut()` in the lesson, but it's a transform just like anything else. You could use a `np.where()` statement like we did, but `cut()` is a bit easier. All it does is take a column as the first argument, the number of bins you want to group it in as the second argument, and then the labels you want to give those bins as the third. It automatically divides them up into equal sized bins. \n",
" \n",
"For example, if I make the following list:\n",
"```\n",
"rating = [2, 4, 9, 8, 5, 3, 6, 10, 2, 1, 6, 7]\n",
"```\n",
" \n",
"And run `cut()` on it with three bins and levels 'bad', 'fine', and 'good':\n",
"```\n",
"pd.cut(rating, 3, labels=['bad', 'fine', 'good'])\n",
"```\n",
" \n",
"I get a return of:\n",
"```\n",
"[bad, bad, good, good, fine, ..., good, bad, bad, fine, fine]\n",
"Length: 12\n",
"Categories (3, object): [bad < fine < good]\n",
"```\n",
" \n",
"Note how it orders them for you based on the order of the labels. "
]
},
{
"cell_type": "code",
"execution_count": 63,
"metadata": {
"colab": {},
"colab_type": "code",
"id": "SYVhXGaGx1-7"
},
"outputs": [
{
"data": {
"text/plain": [
"['bad', 'bad', 'good', 'good', 'fine', ..., 'good', 'bad', 'bad', 'fine', 'fine']\n",
"Length: 12\n",
"Categories (3, object): ['bad' < 'fine' < 'good']"
]
},
"execution_count": 63,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# If you want to test it!\n",
"rating = [2, 4, 9, 8, 5, 3, 6, 10, 2, 1, 6, 7]\n",
"pd.cut(rating, 3, labels=['bad', 'fine', 'good'])"
]
},
{
"cell_type": "code",
"execution_count": 64,
"metadata": {
"colab": {},
"colab_type": "code",
"id": "3DVYIrbDvG4i"
},
"outputs": [],
"source": [
"# Now make 'bad', 'fine', 'good' levels for ratings\n",
"# Assign to new column called 'rating_group' \n",
"movies_with_ratings['rating_group'] = pd.cut(movies_with_ratings['rating_mean'], 3, labels=['bad', 'fine', 'good'])"
]
},
{
"cell_type": "code",
"execution_count": 65,
"metadata": {
"colab": {},
"colab_type": "code",
"id": "VmC2I-kFwWgo"
},
"outputs": [
{
"data": {
"text/html": [
"
\n",
"\n",
"
\n",
"\n",
"\n",
" | \n",
"movieId | \n",
"title | \n",
"genres | \n",
"rating_mean | \n",
"rating_std | \n",
"rating_count | \n",
"rating_group | \n",
"
\n",
"\n",
"\n",
"\n",
"0 | \n",
"1 | \n",
"Toy Story (1995) | \n",
"Adventure|Animation|Children|Comedy|Fantasy | \n",
"3.893708 | \n",
"0.921552 | \n",
"57309.0 | \n",
"good | \n",
"
\n",
"\n",
"1 | \n",
"2 | \n",
"Jumanji (1995) | \n",
"Adventure|Children|Fantasy | \n",
"3.251527 | \n",
"0.959851 | \n",
"24228.0 | \n",
"good | \n",
"
\n",
"\n",
"2 | \n",
"3 | \n",
"Grumpier Old Men (1995) | \n",
"Comedy|Romance | \n",
"3.142028 | \n",
"1.008443 | \n",
"11804.0 | \n",
"fine | \n",
"
\n",
"\n",
"3 | \n",
"4 | \n",
"Waiting to Exhale (1995) | \n",
"Comedy|Drama|Romance | \n",
"2.853547 | \n",
"1.108531 | \n",
"2523.0 | \n",
"fine | \n",
"
\n",
"\n",
"4 | \n",
"5 | \n",
"Father of the Bride Part II (1995) | \n",
"Comedy | \n",
"3.058434 | \n",
"0.996611 | \n",
"11714.0 | \n",
"fine | \n",
"
\n",
"\n",
"... | \n",
"... | \n",
"... | \n",
"... | \n",
"... | \n",
"... | \n",
"... | \n",
"... | \n",
"
\n",
"\n",
"61967 | \n",
"207405 | \n",
"Doctor Sleep (2019) | \n",
"Horror | \n",
"3.730769 | \n",
"0.785109 | \n",
"39.0 | \n",
"good | \n",
"
\n",
"\n",
"62027 | \n",
"207612 | \n",
"Tell Me Who I Am (2019) | \n",
"Documentary | \n",
"3.357143 | \n",
"1.099450 | \n",
"14.0 | \n",
"good | \n",
"
\n",
"\n",
"62088 | \n",
"207830 | \n",
"Terminator: Dark Fate (2019) | \n",
"Action|Sci-Fi | \n",
"3.372727 | \n",
"0.794764 | \n",
"55.0 | \n",
"good | \n",
"
\n",
"\n",
"62313 | \n",
"208715 | \n",
"Let It Snow (2019) | \n",
"Comedy|Romance | \n",
"2.933333 | \n",
"1.222799 | \n",
"15.0 | \n",
"fine | \n",
"
\n",
"\n",
"62319 | \n",
"208737 | \n",
"Midway (2019) | \n",
"Action|Drama|War | \n",
"3.333333 | \n",
"0.887625 | \n",
"12.0 | \n",
"good | \n",
"
\n",
"\n",
"
\n",
"
24330 rows รย 7 columns
\n",
"
"
],
"text/plain": [
" movieId title \\\n",
"0 1 Toy Story (1995) \n",
"1 2 Jumanji (1995) \n",
"2 3 Grumpier Old Men (1995) \n",
"3 4 Waiting to Exhale (1995) \n",
"4 5 Father of the Bride Part II (1995) \n",
"... ... ... \n",
"61967 207405 Doctor Sleep (2019) \n",
"62027 207612 Tell Me Who I Am (2019) \n",
"62088 207830 Terminator: Dark Fate (2019) \n",
"62313 208715 Let It Snow (2019) \n",
"62319 208737 Midway (2019) \n",
"\n",
" genres rating_mean rating_std \\\n",
"0 Adventure|Animation|Children|Comedy|Fantasy 3.893708 0.921552 \n",
"1 Adventure|Children|Fantasy 3.251527 0.959851 \n",
"2 Comedy|Romance 3.142028 1.008443 \n",
"3 Comedy|Drama|Romance 2.853547 1.108531 \n",
"4 Comedy 3.058434 0.996611 \n",
"... ... ... ... \n",
"61967 Horror 3.730769 0.785109 \n",
"62027 Documentary 3.357143 1.099450 \n",
"62088 Action|Sci-Fi 3.372727 0.794764 \n",
"62313 Comedy|Romance 2.933333 1.222799 \n",
"62319 Action|Drama|War 3.333333 0.887625 \n",
"\n",
" rating_count rating_group \n",
"0 57309.0 good \n",
"1 24228.0 good \n",
"2 11804.0 fine \n",
"3 2523.0 fine \n",
"4 11714.0 fine \n",
"... ... ... \n",
"61967 39.0 good \n",
"62027 14.0 good \n",
"62088 55.0 good \n",
"62313 15.0 fine \n",
"62319 12.0 good \n",
"\n",
"[24330 rows x 7 columns]"
]
},
"execution_count": 65,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Check it\n",
"movies_with_ratings"
]
},
{
"cell_type": "code",
"execution_count": 66,
"metadata": {
"colab": {},
"colab_type": "code",
"id": "xZpmz44l0AZB"
},
"outputs": [],
"source": [
"# Now use cut() again to create your ratings_agreement column. \n",
"# Use three bins and order of 'agreement', 'average', and 'controversial'\n",
"movies_with_ratings['ratings_agreement'] = pd.cut(movies_with_ratings['rating_std'], 3, labels=['agreement', 'average','controversial'])\n"
]
},
{
"cell_type": "code",
"execution_count": 67,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"
\n",
"\n",
"
\n",
"\n",
"\n",
" | \n",
"movieId | \n",
"title | \n",
"genres | \n",
"rating_mean | \n",
"rating_std | \n",
"rating_count | \n",
"rating_group | \n",
"ratings_agreement | \n",
"
\n",
"\n",
"\n",
"\n",
"0 | \n",
"1 | \n",
"Toy Story (1995) | \n",
"Adventure|Animation|Children|Comedy|Fantasy | \n",
"3.893708 | \n",
"0.921552 | \n",
"57309.0 | \n",
"good | \n",
"average | \n",
"
\n",
"\n",
"1 | \n",
"2 | \n",
"Jumanji (1995) | \n",
"Adventure|Children|Fantasy | \n",
"3.251527 | \n",
"0.959851 | \n",
"24228.0 | \n",
"good | \n",
"average | \n",
"
\n",
"\n",
"2 | \n",
"3 | \n",
"Grumpier Old Men (1995) | \n",
"Comedy|Romance | \n",
"3.142028 | \n",
"1.008443 | \n",
"11804.0 | \n",
"fine | \n",
"average | \n",
"
\n",
"\n",
"3 | \n",
"4 | \n",
"Waiting to Exhale (1995) | \n",
"Comedy|Drama|Romance | \n",
"2.853547 | \n",
"1.108531 | \n",
"2523.0 | \n",
"fine | \n",
"average | \n",
"
\n",
"\n",
"4 | \n",
"5 | \n",
"Father of the Bride Part II (1995) | \n",
"Comedy | \n",
"3.058434 | \n",
"0.996611 | \n",
"11714.0 | \n",
"fine | \n",
"average | \n",
"
\n",
"\n",
"... | \n",
"... | \n",
"... | \n",
"... | \n",
"... | \n",
"... | \n",
"... | \n",
"... | \n",
"... | \n",
"
\n",
"\n",
"61967 | \n",
"207405 | \n",
"Doctor Sleep (2019) | \n",
"Horror | \n",
"3.730769 | \n",
"0.785109 | \n",
"39.0 | \n",
"good | \n",
"average | \n",
"
\n",
"\n",
"62027 | \n",
"207612 | \n",
"Tell Me Who I Am (2019) | \n",
"Documentary | \n",
"3.357143 | \n",
"1.099450 | \n",
"14.0 | \n",
"good | \n",
"average | \n",
"
\n",
"\n",
"62088 | \n",
"207830 | \n",
"Terminator: Dark Fate (2019) | \n",
"Action|Sci-Fi | \n",
"3.372727 | \n",
"0.794764 | \n",
"55.0 | \n",
"good | \n",
"average | \n",
"
\n",
"\n",
"62313 | \n",
"208715 | \n",
"Let It Snow (2019) | \n",
"Comedy|Romance | \n",
"2.933333 | \n",
"1.222799 | \n",
"15.0 | \n",
"fine | \n",
"average | \n",
"
\n",
"\n",
"62319 | \n",
"208737 | \n",
"Midway (2019) | \n",
"Action|Drama|War | \n",
"3.333333 | \n",
"0.887625 | \n",
"12.0 | \n",
"good | \n",
"average | \n",
"
\n",
"\n",
"
\n",
"
24330 rows รย 8 columns
\n",
"
"
],
"text/plain": [
" movieId title \\\n",
"0 1 Toy Story (1995) \n",
"1 2 Jumanji (1995) \n",
"2 3 Grumpier Old Men (1995) \n",
"3 4 Waiting to Exhale (1995) \n",
"4 5 Father of the Bride Part II (1995) \n",
"... ... ... \n",
"61967 207405 Doctor Sleep (2019) \n",
"62027 207612 Tell Me Who I Am (2019) \n",
"62088 207830 Terminator: Dark Fate (2019) \n",
"62313 208715 Let It Snow (2019) \n",
"62319 208737 Midway (2019) \n",
"\n",
" genres rating_mean rating_std \\\n",
"0 Adventure|Animation|Children|Comedy|Fantasy 3.893708 0.921552 \n",
"1 Adventure|Children|Fantasy 3.251527 0.959851 \n",
"2 Comedy|Romance 3.142028 1.008443 \n",
"3 Comedy|Drama|Romance 2.853547 1.108531 \n",
"4 Comedy 3.058434 0.996611 \n",
"... ... ... ... \n",
"61967 Horror 3.730769 0.785109 \n",
"62027 Documentary 3.357143 1.099450 \n",
"62088 Action|Sci-Fi 3.372727 0.794764 \n",
"62313 Comedy|Romance 2.933333 1.222799 \n",
"62319 Action|Drama|War 3.333333 0.887625 \n",
"\n",
" rating_count rating_group ratings_agreement \n",
"0 57309.0 good average \n",
"1 24228.0 good average \n",
"2 11804.0 fine average \n",
"3 2523.0 fine average \n",
"4 11714.0 fine average \n",
"... ... ... ... \n",
"61967 39.0 good average \n",
"62027 14.0 good average \n",
"62088 55.0 good average \n",
"62313 15.0 fine average \n",
"62319 12.0 good average \n",
"\n",
"[24330 rows x 8 columns]"
]
},
"execution_count": 67,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"movies_with_ratings"
]
},
{
"cell_type": "markdown",
"metadata": {
"colab_type": "text",
"id": "SwMX5K530m-j"
},
"source": [
"### Exploring our data \n",
" \n",
"Making bins like this allows us to figure out things like which movies are both bad, but have differing opinions on. For example, we can filter out movies that are in the bad category but have a lot of controversy about those ratings. This could mean they're 'cult classic' movies where despite the low rating some people actually really love the movies. \n",
" \n",
"A dataset like this could be used in a recommendation engine where if you see people liking these 'bad but good' movies you could suggest others that meet the came criteria. \n",
"\n",
"**Task:** There is no points here - only code to let you make figures to see what the dataset could be used for. Also, if this code works it means you probably did your answers above are right :)"
]
},
{
"cell_type": "code",
"execution_count": 69,
"metadata": {
"colab": {},
"colab_type": "code",
"id": "8k2C-Lzs0Z33"
},
"outputs": [
{
"data": {
"text/html": [
"
\n",
"\n",
"
\n",
"\n",
"\n",
" | \n",
"movieId | \n",
"title | \n",
"genres | \n",
"rating_mean | \n",
"rating_std | \n",
"rating_count | \n",
"rating_group | \n",
"ratings_agreement | \n",
"
\n",
"\n",
"\n",
"\n",
"8618 | \n",
"26157 | \n",
"Manos: The Hands of Fate (1966) | \n",
"Horror | \n",
"1.798165 | \n",
"1.550268 | \n",
"109.0 | \n",
"bad | \n",
"controversial | \n",
"
\n",
"\n",
"16964 | \n",
"89281 | \n",
"Birdemic: Shock and Terror (2010) | \n",
"Romance|Thriller | \n",
"1.521739 | \n",
"1.503492 | \n",
"115.0 | \n",
"bad | \n",
"controversial | \n",
"
\n",
"\n",
"
\n",
"
"
],
"text/plain": [
" movieId title genres \\\n",
"8618 26157 Manos: The Hands of Fate (1966) Horror \n",
"16964 89281 Birdemic: Shock and Terror (2010) Romance|Thriller \n",
"\n",
" rating_mean rating_std rating_count rating_group ratings_agreement \n",
"8618 1.798165 1.550268 109.0 bad controversial \n",
"16964 1.521739 1.503492 115.0 bad controversial "
]
},
"execution_count": 69,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"movies_with_ratings[(movies_with_ratings['rating_group'] == 'bad') & \n",
" (movies_with_ratings['ratings_agreement'] == 'controversial') & \n",
" (movies_with_ratings['rating_count'] >= 100)]"
]
},
{
"cell_type": "markdown",
"metadata": {
"colab_type": "text",
"id": "zPBvEkD42NDF"
},
"source": [
"## Grouping within years - 1 point\n",
" \n",
"Now that we've done our overall grouping by movie, let's get a bit more detail about these ratings. Specifically, let's engineer a dataset that breaks down the average rating not only by movie, but also by the year the person provided the review. This would allow for someone to see which movies continue to do well over time, which ones become more popular, and which ones don't age well!\n",
" \n",
"**Task:** You're going to do the following steps:\n",
"* Create a new `ratings_by_movie` that groups both by `movieId` but also by your `review_dt` column. I want you to group into year intervals. \n",
"* Join `movies` to `ratings_by_movie` so that you have the summary review statistics for each year the movie has been out\n",
"* Clean up and filter your dataframe"
]
},
{
"cell_type": "markdown",
"metadata": {
"colab_type": "text",
"id": "tioZCC2YB7Qa"
},
"source": [
"First, create `ratings_by_movie`. You can group by two levels by just adding a list of what levels you want to group by in the `groupby()` statement. I'll give you some help there, but you have to complete the rest in order to group by movieId first and review_dt second. "
]
},
{
"cell_type": "code",
"execution_count": 106,
"metadata": {
"colab": {},
"colab_type": "code",
"id": "nXvUmhqxyiAp"
},
"outputs": [],
"source": [
"# Note the groupby syntax. I first am grouping by movieId\n",
"# But then also am calling dt.year on our datetime column\n",
"# This will then tell Python to do the aggregations within year as well\n",
"ratings_by_movie = ratings.groupby(['movieId',ratings['review_dt'].dt.year], as_index=False).rating.agg(['mean','std','count']).reset_index()"
]
},
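{
"cell_type": "markdown",
"metadata": {},
"source": [
"As an aside, an equivalent way to get the same grouping is to pull the year out into an explicit column first. A sketch (not required, just sometimes easier to read):\n",
" \n",
"```\n",
"# Make the review year its own column, then group on it\n",
"ratings['year'] = ratings['review_dt'].dt.year\n",
"ratings_by_movie_alt = (ratings\n",
"                        .groupby(['movieId', 'year'])['rating']\n",
"                        .agg(['mean', 'std', 'count'])\n",
"                        .reset_index())\n",
"```"
]
},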
{
"cell_type": "code",
"execution_count": 107,
"metadata": {
"colab": {},
"colab_type": "code",
"id": "jZ8EJYtD0QXM"
},
"outputs": [
{
"data": {
"text/html": [
"
\n",
"\n",
"
\n",
"\n",
"\n",
" | \n",
"movieId | \n",
"review_dt | \n",
"mean | \n",
"std | \n",
"count | \n",
"
\n",
"\n",
"\n",
"\n",
"0 | \n",
"1 | \n",
"1996 | \n",
"4.132756 | \n",
"0.884990 | \n",
"6237 | \n",
"
\n",
"\n",
"1 | \n",
"1 | \n",
"1997 | \n",
"3.872500 | \n",
"0.894079 | \n",
"6000 | \n",
"
\n",
"\n",
"2 | \n",
"1 | \n",
"1998 | \n",
"3.889515 | \n",
"0.944972 | \n",
"887 | \n",
"
\n",
"\n",
"3 | \n",
"1 | \n",
"1999 | \n",
"3.975140 | \n",
"0.913960 | \n",
"2494 | \n",
"
\n",
"\n",
"4 | \n",
"1 | \n",
"2000 | \n",
"4.136634 | \n",
"0.865127 | \n",
"3535 | \n",
"
\n",
"\n",
"... | \n",
"... | \n",
"... | \n",
"... | \n",
"... | \n",
"... | \n",
"
\n",
"\n",
"323732 | \n",
"209157 | \n",
"2019 | \n",
"1.500000 | \n",
"NaN | \n",
"1 | \n",
"
\n",
"\n",
"323733 | \n",
"209159 | \n",
"2019 | \n",
"3.000000 | \n",
"NaN | \n",
"1 | \n",
"
\n",
"\n",
"323734 | \n",
"209163 | \n",
"2019 | \n",
"4.500000 | \n",
"NaN | \n",
"1 | \n",
"
\n",
"\n",
"323735 | \n",
"209169 | \n",
"2019 | \n",
"3.000000 | \n",
"NaN | \n",
"1 | \n",
"
\n",
"\n",
"323736 | \n",
"209171 | \n",
"2019 | \n",
"3.000000 | \n",
"NaN | \n",
"1 | \n",
"
\n",
"\n",
"
\n",
"
323737 rows รย 5 columns
\n",
"
"
],
"text/plain": [
" movieId review_dt mean std count\n",
"0 1 1996 4.132756 0.884990 6237\n",
"1 1 1997 3.872500 0.894079 6000\n",
"2 1 1998 3.889515 0.944972 887\n",
"3 1 1999 3.975140 0.913960 2494\n",
"4 1 2000 4.136634 0.865127 3535\n",
"... ... ... ... ... ...\n",
"323732 209157 2019 1.500000 NaN 1\n",
"323733 209159 2019 3.000000 NaN 1\n",
"323734 209163 2019 4.500000 NaN 1\n",
"323735 209169 2019 3.000000 NaN 1\n",
"323736 209171 2019 3.000000 NaN 1\n",
"\n",
"[323737 rows x 5 columns]"
]
},
"execution_count": 107,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Check\n",
"ratings_by_movie"
]
},
{
"cell_type": "markdown",
"metadata": {
"colab_type": "text",
"id": "owv03D5lCh6s"
},
"source": [
"You need to rename columns, but this is a bit trickier as you have two levels of your dataframe index. I'm going to give you the code below. But, what it's doing is resetting that one level of the index `review_dt` and putting it back as a regular column. I'm then renaming the resulting columns. "
]
},
{
"cell_type": "code",
"execution_count": 108,
"metadata": {
"colab": {},
"colab_type": "code",
"id": "J-Xp1BY_dp3r"
},
"outputs": [
{
"data": {
"text/html": [
"
\n",
"\n",
"
\n",
"\n",
"\n",
" | \n",
"movieId | \n",
"year | \n",
"rating_mean | \n",
"rating_std | \n",
"rating_count | \n",
"
\n",
"\n",
"\n",
"\n",
"0 | \n",
"1 | \n",
"1996 | \n",
"4.132756 | \n",
"0.884990 | \n",
"6237 | \n",
"
\n",
"\n",
"1 | \n",
"1 | \n",
"1997 | \n",
"3.872500 | \n",
"0.894079 | \n",
"6000 | \n",
"
\n",
"\n",
"2 | \n",
"1 | \n",
"1998 | \n",
"3.889515 | \n",
"0.944972 | \n",
"887 | \n",
"
\n",
"\n",
"3 | \n",
"1 | \n",
"1999 | \n",
"3.975140 | \n",
"0.913960 | \n",
"2494 | \n",
"
\n",
"\n",
"4 | \n",
"1 | \n",
"2000 | \n",
"4.136634 | \n",
"0.865127 | \n",
"3535 | \n",
"
\n",
"\n",
"... | \n",
"... | \n",
"... | \n",
"... | \n",
"... | \n",
"... | \n",
"
\n",
"\n",
"323732 | \n",
"209157 | \n",
"2019 | \n",
"1.500000 | \n",
"NaN | \n",
"1 | \n",
"
\n",
"\n",
"323733 | \n",
"209159 | \n",
"2019 | \n",
"3.000000 | \n",
"NaN | \n",
"1 | \n",
"
\n",
"\n",
"323734 | \n",
"209163 | \n",
"2019 | \n",
"4.500000 | \n",
"NaN | \n",
"1 | \n",
"
\n",
"\n",
"323735 | \n",
"209169 | \n",
"2019 | \n",
"3.000000 | \n",
"NaN | \n",
"1 | \n",
"
\n",
"\n",
"323736 | \n",
"209171 | \n",
"2019 | \n",
"3.000000 | \n",
"NaN | \n",
"1 | \n",
"
\n",
"\n",
"
\n",
"
323737 rows รย 5 columns
\n",
"
"
],
"text/plain": [
" movieId year rating_mean rating_std rating_count\n",
"0 1 1996 4.132756 0.884990 6237\n",
"1 1 1997 3.872500 0.894079 6000\n",
"2 1 1998 3.889515 0.944972 887\n",
"3 1 1999 3.975140 0.913960 2494\n",
"4 1 2000 4.136634 0.865127 3535\n",
"... ... ... ... ... ...\n",
"323732 209157 2019 1.500000 NaN 1\n",
"323733 209159 2019 3.000000 NaN 1\n",
"323734 209163 2019 4.500000 NaN 1\n",
"323735 209169 2019 3.000000 NaN 1\n",
"323736 209171 2019 3.000000 NaN 1\n",
"\n",
"[323737 rows x 5 columns]"
]
},
"execution_count": 108,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Reset index\n",
"#ratings_by_movie = ratings_by_movie.reset_index(level = 'review_dt')\n",
"# Here we are making 'movieId' as a column and not making it as an index. This helps us in easier processing of next steps.\n",
"ratings_by_movie.columns = ['movieId','year', 'rating_mean', 'rating_std', 'rating_count']\n",
"ratings_by_movie"
]
},
{
"cell_type": "code",
"execution_count": 109,
"metadata": {
"colab": {},
"colab_type": "code",
"id": "q9EFQCOshC4U"
},
"outputs": [],
"source": [
"# Now join the movie dataframe onto rating_by_movie\n",
"movies_with_ratings = pd.merge(ratings_by_movie,\n",
" movies,\n",
" how='left',\n",
" on='movieId',\n",
" validate='m:1'\n",
" )"
]
},
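{
"cell_type": "markdown",
"metadata": {},
"source": [
"The task also asks for a final clean-up and filter. A minimal sketch of what that could look like, mirroring the 10-rating cutoff used earlier (the exact threshold is a judgment call):\n",
" \n",
"```\n",
"# Drop movie-years with very few ratings so the yearly averages are meaningful\n",
"movies_with_ratings = movies_with_ratings.loc[movies_with_ratings['rating_count'] >= 10]\n",
"```"
]
},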
{
"cell_type": "code",
"execution_count": 110,
"metadata": {
"colab": {},
"colab_type": "code",
"id": "yBLjjJ880vvh"
},
"outputs": [
{
"data": {
"text/html": [
"
\n",
"\n",
"
\n",
"\n",
"\n",
" | \n",
"movieId | \n",
"year | \n",
"rating_mean | \n",
"rating_std | \n",
"rating_count | \n",
"title | \n",
"genres | \n",
"
\n",
"\n",
"\n",
"\n",
"0 | \n",
"1 | \n",
"1996 | \n",
"4.132756 | \n",
"0.884990 | \n",
"6237 | \n",
"Toy Story (1995) | \n",
"Adventure|Animation|Children|Comedy|Fantasy | \n",
"
\n",
"\n",
"1 | \n",
"1 | \n",
"1997 | \n",
"3.872500 | \n",
"0.894079 | \n",
"6000 | \n",
"Toy Story (1995) | \n",
"Adventure|Animation|Children|Comedy|Fantasy | \n",
"
\n",
"\n",
"2 | \n",
"1 | \n",
"1998 | \n",
"3.889515 | \n",
"0.944972 | \n",
"887 | \n",
"Toy Story (1995) | \n",
"Adventure|Animation|Children|Comedy|Fantasy | \n",
"
\n",
"\n",
"3 | \n",
"1 | \n",
"1999 | \n",
"3.975140 | \n",
"0.913960 | \n",
"2494 | \n",
"Toy Story (1995) | \n",
"Adventure|Animation|Children|Comedy|Fantasy | \n",
"
\n",
"\n",
"4 | \n",
"1 | \n",
"2000 | \n",
"4.136634 | \n",
"0.865127 | \n",
"3535 | \n",
"Toy Story (1995) | \n",
"Adventure|Animation|Children|Comedy|Fantasy | \n",
"
\n",
"\n",
"... | \n",
"... | \n",
"... | \n",
"... | \n",
"... | \n",
"... | \n",
"... | \n",
"... | \n",
"
\n",
"\n",
"323732 | \n",
"209157 | \n",
"2019 | \n",
"1.500000 | \n",
"NaN | \n",
"1 | \n",
"We (2018) | \n",
"Drama | \n",
"
\n",
"\n",
"323733 | \n",
"209159 | \n",
" ...SOLUTION.PDFAnswer To This Question Is Available To Download
Submit New Assignment
Please select references for your assignment
Please select no of pages for your assignment
Please select level for your assignment
x
I am Online - Talk to me!
Please fill out the form below to start chatting with the next available agent.
Mehmet Mert
3