
Python 101 for Data Analysis. The homework is a Jupyter notebook using pandas. The Jupyter notebook and 3 data files are attached. There are 5 questions in total.
1. Find the movies that have more than one genre.
2. Create univariate plots of the columns 'rating', 'Age', 'release year', 'Gender' and 'Occupation'.
3. Create a plot showing how the popularity of genres has changed over the years.
4. Find the top 25 movies by average rating among movies that have been rated more than 100 times.
5. Check the validity of the statements below with respect to the data provided:
• Men watch more drama than women
• Women watch more Sci-Fi than men
• Men watch more Romance than women
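For question 3, one possible reading of "popularity" is the share of each year's releases that fall into a genre. The sketch below uses a small made-up `movies` frame (the titles, dates, and the choice of only two genre columns are hypothetical) and extracts the release year with `str.split`. It is an illustration under those assumptions, not the assignment's reference solution.

```python
import pandas as pd

# Hypothetical toy data; column names mirror the attached movie file.
movies = pd.DataFrame({
    "movie title": ["A", "B", "C", "D"],
    "release date": ["01-Jan-1995", "15-Mar-1995", "01-Jan-1996", "22-Nov-1996"],
    "Action": [1, 0, 1, 1],
    "Drama": [0, 1, 1, 0],
})
genres = ["Action", "Drama"]  # assumption: only two genre flags for brevity

# The year is the last piece of a 'dd-Mon-yyyy' string.
movies["release year"] = movies["release date"].str.split("-").str[-1].astype(int)

# Number of movies per genre per year, and total releases per year.
genre_counts = movies.groupby("release year")[genres].sum()
releases_per_year = movies.groupby("release year").size()

# Percentage of each year's releases that carry a given genre flag.
genre_share = genre_counts.div(releases_per_year, axis=0) * 100
print(genre_share)
```

With the real movie file, `genres` would list every genre column, and the resulting year-by-genre table could then be drawn as a line plot or heatmap.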

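For question 4, a common pattern is to aggregate the ratings per movie, filter on the rating count, and then merge in the titles. The following is a minimal sketch on invented data; the helper name `top_movies` and its parameters are hypothetical, and the thresholds are lowered so the toy example returns something.

```python
import pandas as pd

# Hypothetical toy data; column names mirror the attached rating and movie files.
ratings = pd.DataFrame({
    "user id": [1, 2, 3, 1, 2, 3, 1],
    "movie id": [10, 10, 10, 20, 20, 30, 30],
    "rating": [5, 4, 3, 2, 3, 5, 5],
})
movies = pd.DataFrame({
    "movie id": [10, 20, 30],
    "movie title": ["Movie A", "Movie B", "Movie C"],
})


def top_movies(ratings, movies, min_ratings=100, n=25):
    """Top n movies by mean rating among movies rated more than min_ratings times."""
    stats = (
        ratings.groupby("movie id")["rating"]
        .agg(avg_rating="mean", num_ratings="count")
        .reset_index()
    )
    # Strictly more than min_ratings, matching "rated more than 100 times".
    popular = stats[stats["num_ratings"] > min_ratings]
    ranked = popular.sort_values("avg_rating", ascending=False).head(n)
    # A left merge keeps the ranking order while attaching titles.
    merged = ranked.merge(movies, on="movie id", how="left")
    return merged[["movie title", "avg_rating", "num_ratings"]]


top = top_movies(ratings, movies, min_ratings=1, n=2)
print(top)
```

On the real files the call would presumably use the defaults, `top_movies(ratings, movies)`, which correspond to the "more than 100 ratings" and "top 25" conditions in the question.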

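For question 5, the three statements can be checked by merging ratings, users, and movies, then comparing genre shares by gender. The sketch below assumes "watch more" means "a larger fraction of that gender's ratings go to the genre"; that interpretation, and all of the toy data, are assumptions made for illustration.

```python
import pandas as pd

# Hypothetical toy data; column names mirror the attached files.
users = pd.DataFrame({"user id": [1, 2, 3, 4], "gender": ["M", "F", "M", "F"]})
movies = pd.DataFrame({
    "movie id": [10, 20, 30],
    "Drama": [1, 0, 1],
    "Sci-Fi": [0, 1, 0],
    "Romance": [0, 0, 1],
})
ratings = pd.DataFrame({
    "user id": [1, 1, 3, 3, 2, 2, 4, 4],
    "movie id": [10, 20, 20, 30, 10, 30, 30, 20],
    "rating": [4, 3, 5, 2, 4, 5, 3, 4],
})
genres = ["Drama", "Sci-Fi", "Romance"]

# One row per rating, carrying the rater's gender and the movie's genre flags.
merged = ratings.merge(users, on="user id").merge(movies, on="movie id")

# The mean of a 0/1 flag is the fraction of that gender's ratings in the genre.
share = merged.groupby("gender")[genres].mean()

claims = {
    "Men watch more drama than women": bool(share.loc["M", "Drama"] > share.loc["F", "Drama"]),
    "Women watch more Sci-Fi than men": bool(share.loc["F", "Sci-Fi"] > share.loc["M", "Sci-Fi"]),
    "Men watch more Romance than women": bool(share.loc["M", "Romance"] > share.loc["F", "Romance"]),
}
for statement, holds in claims.items():
    print(f"{statement}: {holds}")
```

Comparing fractions rather than raw counts avoids a bias toward whichever gender has more ratings overall. The results printed for the toy data say nothing about the real dataset.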
user id,movie id,rating,timestamp 196,242,3,881250949 186,302,3,891717742 22,377,1,878887116 244,51,2,880606923 166,346,1,886397596 298,474,4,884182806 115,265,2,881171488 253,465,5,891628467 305,451,3,886324817 6,86,3,883603013 62,257,2,879372434 286,1014,5,879781125 200,222,5,876042340 210,40,3,891035994 224,29,3,888104457 303,785,3,879485318 122,387,5,879270459 194,274,2,879539794 291,1042,4,874834944 234,1184,2,892079237 119,392,4,886176814 167,486,4,892738452 299,144,4,877881320 291,118,2,874833878 308,1,4,887736532 95,546,2,879196566 38,95,5,892430094 102,768,2,883748450 63,277,4,875747401 160,234,5,876861185 50,246,3,877052329 301,98,4,882075827 225,193,4,879539727 290,88,4,880731963 97,194,3,884238860 157,274,4,886890835 181,1081,1,878962623 278,603,5,891295330 276,796,1,874791932 7,32,4,891350932 10,16,4,877888877 284,304,4,885329322 201,979,2,884114233 276,564,3,874791805 287,327,5,875333916 246,201,5,884921594 242,1137,5,879741196 249,241,5,879641194 99,4,5,886519097 178,332,3,882823437 251,100,4,886271884 81,432,2,876535131 260,322,4,890618898 25,181,5,885853415 59,196,5,888205088 72,679,2,880037164 87,384,4,879877127 290,143,5,880474293 42,423,5,881107687 292,515,4,881103977 115,20,3,881171009 20,288,1,879667584 201,219,4,884112673 13,526,3,882141053 246,919,4,884920949 138,26,5,879024232 167,232,1,892738341 60,427,5,883326620 57,304,5,883698581 223,274,4,891550094 189,512,4,893277702 243,15,3,879987440 92,1049,1,890251826 246,416,3,884923047 194,165,4,879546723 241,690,2,887249482 178,248,4,882823954 254,1444,3,886475558 293,5,3,888906576 127,229,5,884364867 225,237,5,879539643 299,229,3,878192429 225,480,5,879540748 276,54,3,874791025 291,144,5,874835091 222,366,4,878183381 267,518,5,878971773 42,403,3,881108684 11,111,4,891903862 95,625,4,888954412 8,338,4,879361873 162,25,4,877635573 87,1016,4,879876194 279,154,5,875296291 145,275,2,885557505 119,1153,5,874781198 62,498,4,879373848 62,382,3,879375537 28,209,4,881961214 135,23,4,879857765 
32,294,3,883709863 90,382,5,891383835 286,208,4,877531942 293,685,3,888905170 216,144,4,880234639 166,328,5,886397722 250,496,4,878090499 271,132,5,885848672 160,174,5,876860807 265,118,4,875320714 198,498,3,884207492 42,96,5,881107178 168,151,5,884288058 110,307,4,886987260 58,144,4,884304936 90,648,4,891384754 271,346,4,885844430 62,21,3,879373460 279,832,3,881375854 237,514,4,879376641 94,789,4,891720887 128,485,3,879966895 298,317,4,884182806 44,195,5,878347874 264,200,5,886122352 194,385,2,879524643 72,195,5,880037702 222,750,5,883815120 250,264,3,878089182 41,265,3,890687042 224,245,3,888082216 82,135,3,878769629 262,1147,4,879791710 293,471,3,888904884 216,658,3,880245029 250,140,3,878092059 59,23,5,888205300 286,379,5,877533771 244,815,4,880605185 7,479,4,891352010 174,368,1,886434402 87,274,4,879876734 194,1211,2,879551380 82,1134,2,884714402 13,836,2,882139746 13,272,4,884538403 244,756,2,880605157 305,427,5,886323090 95,787,2,888954930 43,14,2,883955745 299,955,4,889502823 57,419,3,883698454 84,405,3,883452363 269,504,4,891449922 299,111,3,877878184 194,466,4,879525876 160,135,4,876860807 99,268,3,885678247 10,486,4,877886846 259,117,4,874724988 85,427,3,879456350 303,919,4,879467295 213,273,5,878870987 121,514,3,891387947 90,98,5,891383204 49,559,2,888067405 42,794,3,881108425 155,323,2,879371261 68,117,4,876973939 172,177,4,875537965 19,4,4,885412840 268,231,4,875744136 5,2,3,875636053 305,117,2,886324028 44,294,4,883612356 43,137,4,875975656 279,1336,1,875298353 80,466,5,887401701 254,164,4,886472768 298,281,3,884183336 279,1240,1,892174404 66,298,4,883601324 18,443,3,880130193 268,1035,2,875542174 99,79,4,885680138 13,98,4,881515011 26,258,3,891347949 7,455,4,891353086 222,755,4,878183481 200,673,5,884128554 119,328,4,876923913 213,172,5,878955442 276,322,3,874786392 94,1217,3,891723086 130,379,4,875801662 38,328,4,892428688 160,719,3,876857977 293,1267,3,888906966 26,930,2,891385985 130,216,4,875216545 92,1079,3,886443455 256,452,4,882164999 
1,61,4,878542420 72,48,4,880036718 56,755,3,892910207 13,360,4,882140926 15,405,2,879455957 92,77,3,875654637 207,476,2,884386343 292,174,5,881105481 232,483,5,888549622 251,748,2,886272175 224,26,3,888104153 181,220,4,878962392 259,255,4,874724710 305,471,4,886323648 52,280,3,882922806 161,202,5,891170769 148,408,5,877399018 125,235,2,892838559 97,228,5,884238860 58,1098,4,884304936 83,234,4,887665548 90,347,4,891383319 272,178,5,879455113 194,181,3,879521396 125,478,4,879454628 110,688,1,886987605 299,14,4,877877775 151,10,5,879524921 269,127,4,891446165 6,14,5,883599249 54,106,3,880937882 303,69,5,879467542 16,944,1,877727122 301,790,4,882078621 276,1091,3,874793035 305,214,2,886323068 194,1028,2,879541148 91,323,2,891438397 87,554,4,879875940 294,109,4,877819599 286,171,4,877531791 200,318,5,884128458 229,328,1,891632142 178,568,4,882826555 303,842,2,879484804 62,65,4,879374686 207,591,3,876018608 92,172,4,875653271 301,401,4,882078040 36,339,5,882157581 70,746,3,884150257 63,242,3,875747190 28,201,3,881961671 279,68,4,875307407 250,7,4,878089716 14,98,3,890881335 299,1018,3,889502324 194,54,3,879525876 303,815,3,879485532 119,237,5,874775038 295,218,5,879966498 268,930,2,875742942 268,2,2,875744173 66,258,4,883601089 233,202,5,879394264 83,623,4,880308578 214,334,3,891542540 192,476,2,881368243 100,344,4,891374868 268,145,1,875744501 301,56,4,882076587 307,89,5,879283786 234,141,3,892334609 83,576,4,880308755 181,264,2,878961624 297,133,4,875240090 38,153,5,892430369 7,382,4,891352093 264,813,4,886122952 181,872,1,878961814 201,146,1,884140579 85,507,4,879456199 269,367,3,891450023 59,468,3,888205855 286,143,4,889651549 193,96,1,889124507 113,595,5,875936424 292,11,5,881104093 130,1014,3,876250718 275,98,4,875155140 189,520,5,893265380 219,82,1,889452455 218,209,5,877488546 123,427,3,879873020 119,222,5,874775311 158,177,4,880134407 222,118,4,877563802 302,322,2,879436875 279,501,3,875308843 301,79,5,882076403 181,3,2,878963441 201,695,1,884140115 
13,198,3,881515193 1,189,3,888732928 145,237,5,875270570 23,385,4,874786462 201,767,4,884114505 296,705,5,884197193 42,546,3,881105817 33,872,3,891964230 301,554,3,882078830 16,64,5,877720297 95,135,3,879197562 154,357,4,879138713 77,484,5,884733766 296,508,5,884196584 302,303,2,879436785 244,673,3,880606667 222,77,4,878183616 13,215,5,882140588 16,705,5,877722736 270,452,4,876956264 145,15,2,875270655 187,64,5,879465631 200,304,5,876041644 170,749,5,887646170 101,829,3,877136138 184,218,3,889909840 128,204,4,879967478 181,1295,1,878961781 184,153,3,889911285 1,33,4,878542699 1,160,4,875072547 184,321,5,889906967 54,595,3,880937813 94,343,4,891725009 128,508,4,879967767 23,323,2,874784266 301,227,3,882077222 301,191,3,882075672 112,903,1,892440172 82,183,3,878769848 222,724,3,878181976 218,430,3,877488316 308,1197,4,887739521 303,134,5,879467959 133,751,3,890588547 215,212,2,891435680 69,256,5,882126156 254,662,4,887347350 276,2,4,874792436 104,984,1,888442575 63,1067,3,875747514 267,410,4,878970785 13,56,5,881515011 240,879,3,885775745 286,237,2,875806800 294,271,5,889241426 90,1086,4,891384424 18,26,4,880129731 92,229,3,875656201 308,649,4,887739292 144,89,3,888105691 191,302,4,891560253 59,951,3,888206409 200,96,5,884129409 16,197,5,877726146 61,678,3,892302309 271,199,4,885848448 271,709,3,885849325 142,169,5,888640356 275,597,3,876197678 222,151,3,878182109 87,40,3,879876917 207,258,4,877879172 272,1393,2,879454663 177,333,4,880130397 207,1115,2,879664906 299,577,3,889503806 271,378,4,885849447 305,425,4,886324486 49,959,2,888068912 94,1224,3,891722802 130,1017,3,874953895 10,175,3,877888677 203,321,3,880433418 191,286,4,891560842 43,323,3,875975110 21,558,5,874951695 197,96,5,891409839 13,344,2,888073635 194,66,3,879527264 234,206,4,892334543 308,402,4,887740700 308,640,4,887737036 269,522,5,891447773 94,265,4,891721889 268,62,3,875310824 272,12,5,879455254 121,291,3,891390477 296,20,5,884196921 134,286,3,891732334 180,462,5,877544218 234,612,3,892079140 
104,117,2,888465972 38,758,1,892434626 269,845,1,891456255 7,163,4,891353444 234,1451,3,892078343 275,405,2,876197645 52,250,3,882922661 102,823,3,888801465 13,186,4,890704999 178,731,4,882827532 236,71,3,890116671 256,781,5,882165296 263,176,5,891299752 244,186,3,880605697 279,1181,4,875314001 43,815,4,883956189 83,78,2,880309089 151,197,5,879528710 254,436,2,886474216 109,631,3,880579371 297,716,3,875239422 249,188,4,879641067 144,699,4,888106106 301,604,4,882075994 64,392,3,889737542 92,501,2,875653665 222,97,4,878181739 268,436,3,875310745 293,135,5,888905550 213,173,5,878955442 160,460,2,876861185 13,498,4,882139901 59,715,5,888205921 5,17,4,875636198 125,163,5,879454956 174,315,5,886432749 114,505,3,881260203 213,515,4,878870518 23,196,2,874786926 128,15,4,879968827 239,56,4,889179478 181,279,1,878962955 291,80,4,875086354 250,238,4,878089963 201,649,3,884114275 60,60,5,883327734 181,325,2,878961814 119,407,3,887038665 287,1,5,875334088 216,228,3,880245642 216,531,4,880233810 203,471,4,880434463 92,587,3,875660408 13,892,3,882774224 213,176,4,878956338 286,288,5,875806672 117,1047,2,881009697 99,111,1,885678886 11,558,3,891904214 65,47,2,879216672 295,194,4,879517412 269,217,2,891451610 85,259,2,881705026 250,596,5,878089921 137,144,5,881433689 201,960,2,884112077 257,137,4,882049932 111,328,4,891679939 91,480,4,891438875 215,211,4,891436202 181,938,1,878961586 189,1060,5,893264301 1,20,4,887431883 303,404,4,879468375 299,305,3,879737314 187,210,4,879465242 222,278,2,877563913 214,568,4,892668197 293,770,3,888906655 285,191,4,890595859 303,252,3,879544791 96,156,4,884402860 72,1110,3,880037334 115,1067,4,881171009 7,430,3,891352178 116,350,3,886977926 73,480,4,888625753 269,246,5,891457067 263,419,5,891299514 70,431,3,884150257 221,475,4,875244204 72,182,5,880036515 25,357
Answered Same Day, Dec 04, 2021

Pooja answered on Dec 05 2021
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### `Project - MovieLens Data Analysis`\n",
"\n",
"The GroupLens Research Project is a research group in the Department of Computer Science and Engineering at the University of Minnesota. The data is widely used for collaborative filtering and other filtering solutions. However, we will be using this data to act as a means to demonstrate our skill in using Python to “play” with data.\n",
"\n",
"### `Objective:`\n",
"- To implement the techniques learnt as a part of the course.\n",
"\n",
"### `Datasets Information:`\n",
"\n",
"*rating.csv:* It contains information on ratings given by the users to a particular movie.\n",
"- user id: id assigned to every user\n",
"- movie id: id assigned to every movie\n",
"- rating: rating given by the user\n",
"- timestamp: Time recorded when the user gave a rating\n",
"\n",
"*movie.csv:* File contains information related to the movies and their genre.\n",
"- movie id: id assigned to every movie\n",
"- movie title: Title of the movie\n",
"- release date: Date of release of the movie\n",
"- Action: Genre containing binary values (1 - for action 0 - not action)\n",
"- Adventure: Genre containing binary values (1 - for adventure 0 - not adventure)\n",
"- Animation: Genre containing binary values (1 - for animation 0 - not animation)\n",
"- Children’s: Genre containing binary values (1 - for children's 0 - not children's)\n",
"- Comedy: Genre containing binary values (1 - for comedy 0 - not comedy)\n",
"- Crime: Genre containing binary values (1 - for crime 0 - not crime)\n",
"- Documentary: Genre containing binary values (1 - for documentary 0 - not documentary)\n",
"- Drama: Genre containing binary values (1 - for drama 0 - not drama)\n",
"- Fantasy: Genre containing binary values (1 - for fantasy 0 - not fantasy)\n",
"- Film-Noir: Genre containing binary values (1 - for film-noir 0 - not film-noir)\n",
"- Horror: Genre containing binary values (1 - for horror 0 - not horror)\n",
"- Musical: Genre containing binary values (1 - for musical 0 - not musical)\n",
"- Mystery: Genre containing binary values (1 - for mystery 0 - not mystery)\n",
"- Romance: Genre containing binary values (1 - for romance 0 - not romance)\n",
"- Sci-Fi: Genre containing binary values (1 - for sci-fi 0 - not sci-fi)\n",
"- Thriller: Genre containing binary values (1 - for thriller 0 - not thriller)\n",
"- War: Genre containing binary values (1 - for war 0 - not war)\n",
"- Western: Genre containing binary values (1 - for western 0 - not western)\n",
"\n",
"\n",
"*user.csv:* It contains information of the users who have rated the movies.\n",
"- user id: id assigned to every user\n",
"- age: Age of the user\n",
"- gender: Gender of the user\n",
"- occupation: Occupation of the user\n",
"- zip code: Zip code of the user\n",
"\n",
"**`Please provide your insights wherever necessary.`**\n",
"\n",
"### `Learning Outcomes:`\n",
"- Exploratory Data Analysis\n",
"\n",
"- Visualization using Python\n",
"\n",
"- Pandas – groupby, merging \n",
"\n",
"\n",
"### `Domain` \n",
"- Internet and Entertainment\n",
"\n",
"**Note that the project will need you to apply the concepts of groupby and merging extensively.**"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Observations:**\n",
"\n"
]
},
{
"cell_type": "code",
"execution_count": 59,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"data--> user id movie id rating timestamp\n",
"0 196 242 3 881250949\n",
"1 186 302 3 891717742\n",
"2 22 377 1 878887116\n",
"3 244 51 2 880606923\n",
"4 166 346 1 886397596\n",
"item--> movie id movie title release date unknown Action Adventure Animation \\\n",
"0 1 Toy Story 01-Jan-1995 0 0 0 1 \n",
"1 2 GoldenEye 01-Jan-1995 0 1 1 0 \n",
"2 3 Four Rooms 01-Jan-1995 0 0 0 0 \n",
"3 4 Get Shorty 01-Jan-1995 0 1 0 0 \n",
"4 5 Copycat 01-Jan-1995 0 0 0 0 \n",
"\n",
" Childrens Comedy Crime ... Fantasy Film-Noir Horror Musical \\\n",
"0 1 1 0 ... 0 0 0 0 \n",
"1 0 0 0 ... 0 0 0 0 \n",
"2 0 0 0 ... 0 0 0 0 \n",
"3 0 1 0 ... 0 0 0 0 \n",
"4 0 0 1 ... 0 0 0 0 \n",
"\n",
" Mystery Romance Sci-Fi Thriller War Western \n",
"0 0 0 0 0 0 0 \n",
"1 0 0 0 1 0 0 \n",
"2 0 0 0 1 0 0 \n",
"3 0 0 0 0 0 0 \n",
"4 0 0 0 1 0 0 \n",
"\n",
"[5 rows x 22 columns]\n",
"head--> user id age gender occupation zip code\n",
"0 1 24 M technician 85711\n",
"1 2 53 F other 94043\n",
"2 3 23 M writer 32067\n",
"3 4 24 M technician 43537\n",
"4 5 33 F other 15213\n"
]
}
],
"source": [
"import numpy as np\n",
"import pandas as pd\n",
"import os\n",
"\n",
"data=pd.read_csv(r\"C:\\Users\\HP\\Desktop\\data.csv\")\n",
"item=pd.read_csv(r\"C:\\Users\\HP\\Desktop\\item.csv\")\n",
"user=pd.read_csv(r\"C:\\Users\\HP\\Desktop\\user.csv\")\n",
"\n",
"print(\"data-->\",data.head())\n",
"print(\"item-->\",item.head())\n",
"print(\"head-->\", user.head())\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Insights:**\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 6. Find the movies that have more than one genre - 5 marks\n",
"\n",
"hint: use sum on the axis = 1\n",
"\n",
"Display movie name, number of genres for the movie in dataframe\n",
"\n",
"and also print(total number of movies which have more than one genre)"
]
},
{
"cell_type": "code",
"execution_count": 62,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"required dataframe= movie title genre\n",
"0 Toy Story 3\n",
"1 GoldenEye 3\n",
"2 Four Rooms 1\n",
"3 Get Shorty 3\n",
"4 Copycat 3\n",
"5 Shanghai Triad (Yao a yao yao dao waipo qiao) 1\n",
"6 Twelve Monkeys 2\n",
"7 Babe 3\n",
"8 Dead Man Walking 1\n",
"9 Richard III 2\n",
"10 Seven (Se7en) 2\n",
"11 Usual Suspects, The 2\n",
"12 Mighty Aphrodite 1\n",
"13 Postino, Il 2\n",
"14 Mr. Holland's Opus 1\n",
"15 French Twist (Gazon maudit) 2\n",
"16 From Dusk Till Dawn 5\n",
"17 White Balloon, The 1\n",
"18 Antonia's Line 1\n",
"19 Angels and Insects 2\n",
"20 Muppet Treasure Island 5\n",
"21 Braveheart 3\n",
"22 Taxi Driver 2\n",
"23 Rumble in the Bronx 3\n",
"24 Birdcage, The 1\n",
"25 Brothers McMullen, The 1\n",
"26 Bad Boys 1\n",
"27 Apollo 13 3\n",
"28 Batman Forever 4\n",
"29 Belle de jour 1\n",
"... ... ...\n",
"1651 Entertaining Angels: The Dorothy Day Story 1\n",
"1652 Chairman of the Board 1\n",
"1653 Favor, The 2\n",
"1654 Little City 2\n",
"1655 Target 2\n",
"1656 Substance of Fire, The 1\n",
"1657 Getting Away With Murder 1\n",
"1658 Small Faces 1\n",
"1659 New Age, The 1\n",
"1660 Rough Magic 2\n",
"1661 Nothing Personal 2\n",
"1662 8 Heads in a Duffel Bag 1\n",
"1663 Brother's Kiss, A 1\n",
"1664 Ripe 1\n",
"1665 Next Step, The 1\n",
"1666 Wedding Bell Blues 1\n",
"1667 MURDER and murder 3\n",
"1668 Tainted 2\n",
"1669 Further Gesture, A 1\n",
"1670 Kika 1\n",
"1671 Mirage 2\n",
"1672 Mamma Roma 1\n",
"1673 Sunchaser, The 1\n",
"1674 War at Home, The 1\n",
"1675 Sweet Nothing 1\n",
"1676 Mat' i syn 1\n",
"1677 B. Monkey 2\n",
"1678 Sliding Doors 2\n",
"1679 You So Crazy 1\n",
"1680 Scream of Stone (Schrei aus Stein) 1\n",
"\n",
"[1681 rows x 2 columns]\n"
]
},
{
"data": {
"text/html": [
"<table>\n",
"<thead>\n",
"<tr><th>col_0</th><th>count</th></tr>\n",
"<tr><th>genre</th><th></th></tr>\n",
"</thead>\n",
"<tbody>\n",
"<tr><th>False</th><td>836</td></tr>\n",
"<tr><th>True</th><td>845</td></tr>\n",
"</tbody>\n",
"</table>"
],
"text/plain": [
"col_0 count\n",
"genre \n",
"False 836\n",
"True 845"
]
},
"execution_count": 62,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
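"# Row-wise sum of the genre flag columns gives the number of genres per movie.\n",
"# iloc's end index is exclusive: 3:21 selects columns 3-20 ('unknown' through 'War'),\n",
"# so 'Western' (column 21) is not included in this count.\n",
"df1 = item.iloc[:,3:21]\n",
"df1=df1.sum(axis = 1) \n",
"item['genre'] = df1\n",
"\n",
"df2 = pd.DataFrame(item,columns=['movie title','genre'])\n",
"print(\"required dataframe=\",df2)\n",
"\n",
"# Boolean flag: True where a movie has more than one genre; crosstab counts each value.\n",
"df3=df2[\"genre\"]>1\n",
"my_tab = pd.crosstab(index=df3, columns=\"count\")\n",
"my_tab"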
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Insights:**\n",
"\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 7. Univariate plots of columns: 'rating', 'Age', 'release year', 'Gender' and 'Occupation' - 10 marks\n",
"\n",
"*HINT: Use distplot for age and countplot for release year and ratings.*\n",
"\n",
"*HINT: Please refer to the below snippet to understand how to get to release year from release date. You can use str.split() as depicted below or you could convert it to pandas datetime format and extract year (.dt.year)*"
]
},
{
"cell_type": "code",
"execution_count": 63,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"brown\n",
"brown\n",
"brown\n"
]
}
],
"source": [
"a = 'My*cat*is*brown'\n",
"print(a.split('*')[3])\n",
"\n",
"#similarly, the release year needs to be taken out from release date\n",
"\n",
"#also you can simply slice existing string to get the desired data, if we want to take out the colour of the cat\n",
"\n",
"print(a[10:])\n",
"print(a[-5:])"
]
},
{
"cell_type": "code",
"execution_count": 83,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
""
]
},
"execution_count": 83,
"metadata": {},
"output_type": "execute_result"
},
{
"data": {
"image/png":...