Need help with a Python assignment dealing with data visualization.
Version 1

Grading and Feedback

The maximum possible score for this homework is 100 points. We will auto-grade all questions (except Q1.2d) using the Gradescope platform. Based on our experience, students (you all!) benefit from using Gradescope to obtain feedback as they work on this assignment. Keep the following important points in mind:

1. Every student will receive an email within 48 hours of the HW release inviting them to use Gradescope for all HW1 questions. If you have not received the email, note that it can take up to 48 hours from when we sync the roster. You can still get to Gradescope directly through Canvas.
2. You may upload your code periodically to Gradescope to obtain feedback for your code. This is accomplished by having Gradescope auto-grade your submission using the same test cases that we will use to grade your work. The test cases’ results may help inform you of potential errors and ways to improve your code.
3. Gradescope should not be the primary way to test your code’s correctness, since it provides only a few test cases, and error messages may not be as informative as local debuggers. You should test your code locally, which is more efficient and effective, and only use Gradescope as a "final" check.
4. Gradescope cannot run code that contains syntax errors. If Gradescope is not running your code, before seeking help, verify that:
   a. Your code is free of syntax errors (by running it locally)
   b. All methods have been implemented
   c. You have submitted the correct file with the correct name
5. When many students use Gradescope simultaneously, it may slow down or fail to communicate with the tester. It can become even slower as the submission deadline approaches. You are responsible for submitting your work in time.

Download the HW1 Skeleton before you begin.

Homework Overview

Vast amounts of digital data are generated each day, but raw data are often not immediately “usable”.
Instead, we are interested in the information content of the data: what patterns are captured? This assignment covers a few useful tools for acquiring, cleaning, storing, and visualizing datasets.

In Question 1 (Q1), you will collect data using an API for The Movie Database (TMDb). You will construct a graph representation of this data that will show which actors have acted together in various movies, and use Argo Lite to visualize this graph and highlight patterns that you find. This exercise demonstrates how visualizing and interacting with data can help with discovery.

In Q2, you will construct a TMDb database in SQLite, with tables capturing information such as how well each movie did, which actors acted in each movie, and what the movie was about. You will also partition and combine information in these tables in order to more easily answer questions such as "which actors acted in the highest number of movies?".

In Q3, you will visualize temporal trends in movie releases, using a JavaScript-based library called D3. This part will show how creating interactive rather than static plots can make data more visually appealing, engaging and easier to parse.

Data analysis and visualization are only as good as the quality of the input data. Real-world data often contain missing values, invalid fields, or entries that are not relevant or of interest. In Q4, you will use OpenRefine to clean data from Mercari, and construct GREL queries to filter the entries in this dataset.

Finally, in Q5, you will build a simple web application that displays a table of TMDb data on a single-page website. To do this, you will use Flask, a Python framework for building web applications that allows you to connect Python data processing on the back end with serving a site that displays these results.

Links: https://www.gradescope.com/ (Gradescope), https://poloclub.github.io/cse6242-2021spring-online/hw1/Y7b5hemF5P_hw1.zip (HW1 Skeleton)
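To make the Q2 example question concrete, here is a minimal sketch of how "which actors acted in the highest number of movies?" could be answered with Python's built-in sqlite3 module. The table name movie_cast, its columns, and the sample rows are all hypothetical; the actual schema is defined by the assignment, not by this post.

```python
import sqlite3


def build_demo_db():
    """Create an in-memory database with a hypothetical movie_cast table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE movie_cast (movie_id INTEGER, cast_id INTEGER, cast_name TEXT)"
    )
    # Made-up rows: Actor A appears in two movies, Actor B in one.
    conn.executemany(
        "INSERT INTO movie_cast VALUES (?, ?, ?)",
        [(10, 1, "Actor A"), (11, 1, "Actor A"), (10, 2, "Actor B")],
    )
    return conn


def actors_by_movie_count(conn):
    """Return (cast_name, movie_count) rows, actors with the most movies first."""
    query = """
        SELECT cast_name, COUNT(DISTINCT movie_id) AS movie_count
        FROM movie_cast
        GROUP BY cast_id, cast_name
        ORDER BY movie_count DESC, cast_name ASC
    """
    return conn.execute(query).fetchall()


if __name__ == "__main__":
    for name, count in actors_by_movie_count(build_demo_db()):
        print(name, count)
```

Grouping by the actor id rather than only the name avoids merging two different actors who happen to share a name.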
Q1 [40 points] Collect data from TMDb and visualize co-actor network

Q1.1 [30 points] Collect data from TMDb and build a graph

For Q1.1, you will be using and submitting a Python file. Complete all tasks according to the instructions found in submission.py to complete the Graph class, the TMDbAPIUtils class, and the two global functions. The Graph class will serve as a reusable way to represent and write out your collected graph data. The TMDbAPIUtils class will be used to work with the TMDb API for data retrieval.

NOTE: You must only use a version of Python ≥ 3.7.0 and < 3.8 for this question. This question has been developed and tested for these versions. You must not use any other versions (e.g., Python 3.8). While we want to be able to extend to more Python versions, the specified versions are what we can definitively support at this time.

NOTE: You must only use the modules and libraries provided at the top of submission.py and modules from the Python standard library. Pandas and NumPy cannot be used. While we understand that they are useful libraries to learn, completing this question is not critically dependent on their functionality. In addition, to enable our TAs to provide better, more consistent support to our students, we have decided to focus on this subset of libraries.

NOTE: We will call each function once in submission.py during grading. The total runtime of submission.py must not exceed 10 minutes. Submissions exceeding this limit will receive zero credit. The average runtime of the code during grading is expected to be approximately 4 seconds. When we grade, we will take into account what your code does, and aspects that may be out of your control. For example, sometimes the server may be under heavy load, which may significantly increase the response time (e.g., the closer it is to the HW1 deadline, the longer the response time is likely to be!).
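As a rough illustration of a reusable graph container that writes out nodes.csv and edges.csv using only the standard library, here is a minimal sketch. It is not the skeleton's interface: the method names (add_node, add_edge, write_nodes_file, write_edges_file) and the duplicate-handling rules are assumptions, and the real requirements are in submission.py. The column headers id, name, source, and target are taken from the Argo Lite import settings described in Q1.2.

```python
import csv
import os
import tempfile


class Graph:
    """Hypothetical stand-in for the skeleton's Graph class (undirected, no duplicates)."""

    def __init__(self):
        self.nodes = []  # (id, name) tuples, one per actor
        self.edges = []  # (source, target) tuples, one per co-acting pair

    def add_node(self, node_id, name):
        # Keep the first name seen for an id; later duplicates are ignored.
        if all(existing_id != node_id for existing_id, _ in self.nodes):
            self.nodes.append((node_id, name))

    def add_edge(self, source, target):
        # Skip self-loops, and treat (a, b) and (b, a) as the same edge.
        if source == target:
            return
        if (source, target) in self.edges or (target, source) in self.edges:
            return
        self.edges.append((source, target))

    def _write_csv(self, path, header, rows):
        # newline="" lets the csv module control line endings on every platform.
        with open(path, "w", newline="", encoding="utf-8") as handle:
            writer = csv.writer(handle)
            writer.writerow(header)
            writer.writerows(rows)

    def write_nodes_file(self, path):
        self._write_csv(path, ["id", "name"], self.nodes)

    def write_edges_file(self, path):
        self._write_csv(path, ["source", "target"], self.edges)


if __name__ == "__main__":
    graph = Graph()
    graph.add_node("1", "Actor A")  # made-up ids and names
    graph.add_node("2", "Actor B")
    graph.add_edge("1", "2")
    graph.add_edge("2", "1")  # ignored: same undirected edge
    with tempfile.TemporaryDirectory() as folder:
        graph.write_nodes_file(os.path.join(folder, "nodes.csv"))
        graph.write_edges_file(os.path.join(folder, "edges.csv"))
        print(sorted(os.listdir(folder)))
```

Using csv.writer rather than manual string joins means names containing commas are quoted automatically; whether the grader expects that or expects commas to be stripped is something to check against the instructions in submission.py.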
a) [10 pts] Implementation of the Graph class according to the instructions in submission.py

b) [10 pts] Implementation of the TMDbAPIUtils class according to the instructions in submission.py. You will use version 3 of the TMDb API to download data about actors and their co-actors. To use the TMDb API:
o Create a TMDb account and obtain your client id / client secret, which are required to obtain an authentication token. Refer to this document for detailed instructions (log in using your GT account): https://poloclub.github.io/cse6242-2021spring-online/hw1/7urmemxdf8_tmdb_registration_instructions.pdf
o Refer to the TMDb API documentation as you work on this question. The documentation contains a helpful ‘try-it-out’ feature for interacting with the API calls.

c) [10 pts] Producing correct nodes.csv and edges.csv. You must upload your nodes.csv and edges.csv files to Argo-Lite as directed in Q1.2.

Links: https://docs.python.org/3/library/ https://pandas.pydata.org/ https://numpy.org/

NOTE: Q1.2 builds on the results of Q1.1.

Q1.2 [10 points] Visualizing a graph of co-actors using Argo-Lite

Using Argo Lite, visualize a network of actors and their co-actors. You will produce an Argo Lite graph snapshot from your edges.csv and nodes.csv from Q1.1.c.

a. To get started, review Argo Lite’s readme on GitHub. Argo Lite has been open-sourced.

b. Importing your graph
● Launch Argo Lite
● From the menu bar, click ‘graph’ → ‘import csv’. In the dialogue that appears:
  o Select ‘i have both nodes and edges file’
● Under nodes, use ‘choose file’ to select nodes.csv from your computer
  o Leave ‘has headers’ selected
  o Verify ‘column for node id’ is ‘id’
● Under edges, use ‘choose file’ to select edges.csv from your computer
  o Verify ‘column for source id’ is ‘source’
  o Set ‘column for target id’ to ‘target’
  o Verify ‘selected delimiter’ is ','
● At the bottom of the dialogue, verify that ‘after import, show’ is set to ‘all nodes’
● The graph will load in the window.
  Note that the layout is paused by default; you can select ‘resume’ or ‘pause’ layout as needed.
● Dragging a node will ‘pin’ it, freezing its position. Selecting a pinned node, right-clicking it, then choosing ‘unpin selected’ will unpin that node, so its position will once again be computed by the graph layout algorithm. Experiment with pinning and unpinning nodes.

NOTE: If a malformed .csv is uploaded, Argo-Lite could become unresponsive. If you suspect this is the case, open the developer tools for your browser and review any console error messages.

Links: https://developers.themoviedb.org/3/getting-started/introduction https://github.com/poloclub/argo-graph-lite https://poloclub.github.io/argo-graph-lite/

c. [7 points] Setting graph display options
● On the “graph options” panel, under ‘nodes’ → ‘modifying all nodes’, expand the ‘color’ menu
  o Select color by ‘degree’, with scale: ‘linear scale’
  o Select a color gradient of your choice that will assign lighter colors to nodes with higher node degrees, and darker colors to nodes with lower degrees
● Collapse the ‘color’ options, expand the ‘size’ options.
  o Set ‘scale by’ to ‘degree’, with scale: ‘linear scale’
  o Select meaningful size range values of your choice or use the default range.
● Collapse the ‘size’ options
● On the menu, click ‘tools’ → ‘data sheet’
● Within the ‘data sheet’ dialogue:
  o Click ‘hide all’
  o Set ‘10 more nodes with highest degree’
  o Click ‘show’ and then close the ‘data sheet’ dialogue
● Click and drag a rectangle selection around the visible nodes
● With the nodes selected, configure their node visibility by setting the following:
  o Go to ‘graph options’ → ‘labels’
  o Click ‘show labels of selected nodes’
  o At the bottom of the menu, set ‘label by’ to ‘name’
  o Adjust the ‘label length’ so that the full text of the actor name is displayed
● Show only non-leaf vertices.
  On the menu, click ‘tools’