{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Aim 1:**\n",
"To read the provided CSV file and get an overview of the data."
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
" id imdb_id popularity budget revenue original_title \\\n",
"0 135397 tt0369610 32.985763 150000000 1513528810 Jurassic World \n",
"\n",
" cast \\\n",
"0 Chris Pratt|Bryce Dallas Howard|Irrfan Khan|Vi... \n",
"\n",
" homepage director tagline ... \\\n",
"0 http://www.jurassicworld.com/ Colin Trevorrow The park is open. ... \n",
"\n",
" overview runtime \\\n",
"0 Twenty-two years after the events of Jurassic ... 124 \n",
"\n",
" genres \\\n",
"0 Action|Adventure|Science Fiction|Thriller \n",
"\n",
" production_companies release_date vote_count \\\n",
"0 Universal Studios|Amblin Entertainment|Legenda... 6/9/15 5562 \n",
"\n",
" vote_average release_year budget_adj revenue_adj \n",
"0 6.5 2015 1.379999e+08 1.392446e+09 \n",
"\n",
"[1 rows x 21 columns]"
]
},
"execution_count": 4,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"import pandas as pd\n",
"import numpy as np\n",
"import seaborn as sns\n",
"import matplotlib.pyplot as plt\n",
"%matplotlib inline\n",
"\n",
"# Load the TMDb movies dataset and preview the first row.\n",
"df = pd.read_csv('tmdb-movies-fn2tqcxx.csv')\n",
"df.head(1)"
]
},
{
"cell_type": "code",
"execution_count": 16,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"<class 'pandas.core.frame.DataFrame'>\n",
"RangeIndex: 10866 entries, 0 to 10865\n",
"Data columns (total 21 columns):\n",
" # Column Non-Null Count Dtype \n",
"--- ------ -------------- ----- \n",
" 0 id 10866 non-null int64 \n",
" 1 imdb_id 10856 non-null object \n",
" 2 popularity 10866 non-null float64\n",
" 3 budget 10866 non-null int64 \n",
" 4 revenue 10866 non-null int64 \n",
" 5 original_title 10866 non-null object \n",
" 6 cast 10790 non-null object \n",
" 7 homepage 2936 non-null object \n",
" 8 director 10822 non-null object \n",
" 9 tagline 8042 non-null object \n",
" 10 keywords 9373 non-null object \n",
" 11 overview 10862 non-null object \n",
" 12 runtime 10866 non-null int64 \n",
" 13 genres 10843 non-null object \n",
" 14 production_companies 9836 non-null object \n",
" 15 release_date 10866 non-null object \n",
" 16 vote_count 10866 non-null int64 \n",
" 17 vote_average 10866 non-null float64\n",
" 18 release_year 10866 non-null int64 \n",
" 19 budget_adj 10866 non-null float64\n",
" 20 revenue_adj 10866 non-null float64\n",
"dtypes: float64(4), int64(6), object(11)\n",
"memory usage: 1.7+ MB\n"
]
}
],
"source": [
"df.info()"
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"(10866, 21)"
]
},
"execution_count": 7,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df.shape  # (number of rows, number of columns)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Aim 2:**\n",
"To find the most popular genre."
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
" original_title genres\n",
"0 Jurassic World Action|Adventure|Science Fiction|Thriller\n",
"1 Mad Max: Fury Road Action|Adventure|Science Fiction|Thriller\n",
"2 Insurgent Adventure|Science Fiction|Thriller\n",
"3 Star Wars: The Force Awakens Action|Adventure|Science Fiction|Fantasy\n",
"4 Furious 7 Action|Crime|Thriller\n",
"... ... ...\n",
"10861 The Endless Summer Documentary\n",
"10862 Grand Prix Action|Adventure|Drama\n",
"10863 Beregis Avtomobilya Mystery|Comedy\n",
"10864 What's Up, Tiger Lily? Action|Comedy\n",
"10865 Manos: The Hands of Fate Horror\n",
"\n",
"[10866 rows x 2 columns]\n"
]
}
],
"source": [
"print(df[['original_title', 'genres']])"
]
},
{
"cell_type": "code",
"execution_count": 31,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"original_title \n",
"Jurassic World 0 Action\n",
" 1 Adventure\n",
" 2 Science Fiction\n",
" 3 Thriller\n",
"Mad Max: Fury Road 0 Action\n",
" ... \n",
"Beregis Avtomobilya 0 Mystery\n",
" 1 Comedy\n",
"What's Up, Tiger Lily? 0 Action\n",
" 1 Comedy\n",
"Manos: The Hands of Fate 0 Horror\n",
"Length: 26960, dtype: object"
]
},
"execution_count": 31,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Split the pipe-separated genres so each (title, genre) pair gets its own row.\n",
"cleaned = df.set_index('original_title').genres.str.split('|', expand=True).stack()\n",
"cleaned"
]
},
{
"cell_type": "code",
"execution_count": 78,
"metadata": {
"scrolled": true
},
"outputs": [
{
"data": {
"text/plain": [
" g_Action g_Adventure g_Animation \\\n",
"original_title \n",
"$5 a Day 0 0 0 \n",
"$9.99 0 0 1 \n",
"'71 1 0 0 \n",
"(500) Days of Summer 0 0 0 \n",
"(T)Raumschiff Surprise - Periode 1 0 0 0 \n",
"... ... ... ... \n",
"ì˜í˜•ì œ 0 0 0 \n",
"ì‹ ì˜ í•œ 수 1 0 0 \n",
"í¬í™” ì†ìœ¼ë¡œ 0 0 0 \n",
"형사 Duelist 1 0 0 \n",
"í•˜ìš¸ë§ 0 0 0 \n",
"\n",
" g_Comedy g_Crime g_Documentary g_Drama \\\n",
"original_title \n",
"$5 a Day 1 0 0 1 \n",
"$9.99 0 0 0 1 \n",
"'71 0 0 0 1 \n",
"(500) Days of Summer 1 0 0 1 \n",
"(T)Raumschiff Surprise - Periode 1 1 0 0 0 \n",
"... ... ... ... ... \n",
"ì˜í˜•ì œ 0 0 0 1 \n",
"ì‹ ì˜ í•œ 수 0 1 0 1 \n",
"í¬í™” ì†ìœ¼ë¡œ 0 0 0 0 \n",
"형사 Duelist 0 0 0 0 \n",
"í•˜ìš¸ë§ 0 0 0 0 \n",
"\n",
" g_Family g_Fantasy g_Foreign g_History \\\n",
"original_title \n",
"$5 a Day 0 0 0 0 \n",
"$9.99 0 0 0 0 \n",
"'71 0 0 0 0 \n",
"(500) Days of Summer 0 0 0 0 \n",
"(T)Raumschiff Surprise - Periode 1 0 0 0 0 \n",
"... ... ... ... ... \n",
"ì˜í˜•ì œ 0 0 1 0 \n",
"ì‹ ì˜ í•œ 수 0 0 0 0 \n",
"í¬í™” ì†ìœ¼ë¡œ 0 0 0 0 \n",
"형사 Duelist 0 0 0 0 \n",
"í•˜ìš¸ë§ 0 0 1 0 \n",
"\n",
" g_Horror g_Music g_Mystery g_Romance \\\n",
"original_title \n",
"$5 a Day 0 0 0 0 \n",
"$9.99 0 0 0 0 \n",
"'71 0 0 0 0 \n",
"(500) Days of Summer 0 0 0 1 \n",
"(T)Raumschiff Surprise - Periode 1 0 0 0 0 \n",
"... ... ... ... ... \n",
"ì˜í˜•ì œ 0 0 0 0 \n",
"ì‹ ì˜ í•œ 수 0 0 0 0 \n",
"í¬í™” ì†ìœ¼ë¡œ 0 0 0 0 \n",
"형사 Duelist 0 0 0 0 \n",
"í•˜ìš¸ë§ 0 0 1 0 \n",
"\n",
" g_Science Fiction g_TV Movie g_Thriller \\\n",
"original_title \n",
"$5 a Day 0 0 0 \n",
"$9.99 0 0 0 \n",
"'71 0 0 1 \n",
"(500) Days of Summer 0 0 0 \n",
"(T)Raumschiff Surprise - Periode 1 1 0 0 \n",
"... ... ... ... \n",
"ì˜í˜•ì œ 0 0 1 \n",
"ì‹ ì˜ í•œ 수 0 0 0 \n",
"í¬í™” ì†ìœ¼ë¡œ 0 0 0 \n",
"형사 Duelist 0 0 0 \n",
"í•˜ìš¸ë§ 0 0 1 \n",
"\n",
" g_War g_Western \n",
"original_title \n",
"$5 a Day 0 0 \n",
"$9.99 0 0 \n",
"'71 1 0 \n",
"(500) Days of Summer 0 0 \n",
"(T)Raumschiff Surprise - Periode 1 0 0 \n",
"... ... ... \n",
"ì˜í˜•ì œ 0 0 \n",
"ì‹ ì˜ í•œ 수 0 0 \n",
"í¬í™” ì†ìœ¼ë¡œ 1 0 \n",
"형사 Duelist 0 0 \n",
"í•˜ìš¸ë§ 0 0 \n",
"\n",
"[10548 rows x 20 columns]"
]
},
"execution_count": 78,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
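"# One-hot encode each genre, then sum the indicators per title.\n",
"# Films that share a title are merged into a single row by the groupby.\n",
"df1 = pd.get_dummies(cleaned, prefix='g').groupby(level=0).sum()\n",
"df1"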
]
},
{
"cell_type": "code",
"execution_count": 79,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"g_Action 2385\n",
"g_Adventure 1471\n",
"g_Animation 699\n",
"g_Comedy 3793\n",
"g_Crime 1355\n",
"g_Documentary 520\n",
"g_Drama 4761\n",
"g_Family 1231\n",
"g_Fantasy 916\n",
"g_Foreign 188\n",
"g_History 334\n",
"g_Horror 1637\n",
"g_Music 408\n",
"g_Mystery 810\n",
"g_Romance 1712\n",
"g_Science Fiction 1230\n",
"g_TV Movie 167\n",
"g_Thriller 2908\n",
"g_War 270\n",
"g_Western 165\n",
"dtype: int64"
]
},
"execution_count": 79,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Total number of films tagged with each genre.\n",
"df2 = df1.sum()\n",
"df2"
]
},
{
"cell_type": "code",
"execution_count": 86,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"4761"
]
},
"execution_count": 86,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df2.max()  # highest film count among all genres"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Thus, the most popular genre, measured by the number of films tagged with it, is \"Drama\", with a count of 4761.**"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Aim 3:**\n",
"To find the correlation between release year and revenue, and to plot it using matplotlib."
]
},
{
"cell_type": "code",
"execution_count": 90,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
" revenue release_year\n",
"0 1513528810 2015\n",
"1 378436354 2015\n",
"2 295238201 2015\n",
"3 2068178225 2015\n",
"4 1506249360 2015\n",
"... ... ...\n",
"10861 0 1966\n",
"10862 0 1966\n",
"10863 0 1966\n",
"10864 0 1966\n",
"10865 0 1966\n",
"\n",
"[10866 rows x 2 columns]"
]
},
"execution_count": 90,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Keep only the two columns needed for the correlation analysis.\n",
"df4 = df[['revenue', 'release_year']]\n",
"df4"
]
},
{
"cell_type": "code",
"execution_count": 101,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
""
]
},
"execution_count": 101,
"metadata": {},
"output_type": "execute_result"
},
{
"data": {
"image/png":...