Please do all 4 sections.
11/15/21, 8:28 PM mids-w200-assignments-upstream-fall2021/HW_units_11_12_13.ipynb at main · UC-Berkeley-I-School/mids-w200-assignments-upstream-fall…
https://github.com/UC-Berkeley-I-School/mids-w200-assignments-upstream-fall2021/blob/main/week_12/HW_units_11_12_13.ipynb
Week 12 Assignment - W200 Introduction to Data Science Programming, UC Berkeley MIDS

Write code in this Jupyter Notebook to solve the following problems. This assignment addresses material covered in Unit 11. Please upload this Notebook with your solutions to your GitHub repository in your SUBMISSIONS/week_12 folder by 11:59PM PST the night before class. Do NOT push/upload the data file.

If you turn in anything on ISVC please do so under the Week 12 Assignment category.

Objectives
- Explore and glean insights from a real dataset using pandas
- Practice using pandas for exploratory analysis, information gathering, and discovery
- Practice using matplotlib for data visualization

Dataset
You are to analyze campaign contributions to the 2016 U.S. presidential primary races made in California. Use the csv file located here: https://drive.google.com/file/d/1Lgg-PwXQ6TQLDowd6XyBxZw5g1NGWPjB/view?usp=sharing. You should download and save this file in the same folder as this notebook is stored. This file originally came from the U.S. Federal Election Commission (https://www.fec.gov/).

DO NOT PUSH THIS FILE TO YOUR GITHUB REPO!
- Best practice is to not have DATA files in your code repo. As shown below, the default load is outside of the folder this notebook is in. If you change the folder where the file is stored please update the first cell!
- If you do accidentally push the file to your github repo, follow the directions here to fix it: https://docs.google.com/document/d/15Irgb5V5G7pKPWgAerH7FPMpKeQRunbNflaW-hR2hTA/edit?usp=sharing

Documentation for this data can be found here: https://drive.google.com/file/d/11o_SByceenv0NgNMstM-dxC1jL7I9fHL/view?usp=sharing

General Guidelines:
- This is a real dataset and so it may contain errors and other peculiarities to work through
- This dataset is ~218 MB, which will take some time to load (and probably won't load in Google Sheets or Excel)
- If you make assumptions, annotate them in your responses
- While there is one code/markdown cell positioned after each question as a placeholder, some of your code/responses may require multiple cells
- Double-click the markdown cells that say YOUR ANSWER HERE to enter your written answers. If you need more cells for your written answers, make them markdown cells (rather than code cells)

Setup
Run the two cells below.
The first cell will load the data into a pandas dataframe named contrib. Note that a custom date parser is defined to speed up loading. If Python were to guess the date format, it would take even longer to load.
The second cell subsets the dataframe to focus on just the primary period through May 2016. Otherwise, we would see general election donations which would make it harder to draw conclusions about the primaries.

1. Data Exploration (20 points)

1a. First, take a preliminary look at the data.
Print the shape of the data. What does this tell you about the number of variables and rows you have?
Print a list of column names.
Review the documentation for this data (link above). Do you have all of the columns you expect to have?
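As a rough illustration of the first look that 1a asks for, the sketch below prints the shape and the column names of a tiny made-up dataframe. The frame `toy` and its values are invented for illustration, and `cand_nm` is an assumed name for the candidate column; check the documentation for the real column list.

```python
import pandas as pd

# Hypothetical three-row stand-in for the contributions data. The values are
# invented, and `cand_nm` is an assumed column name (not taken from the text).
toy = pd.DataFrame({
    "cand_nm": ["Candidate A", "Candidate B", "Candidate A"],
    "contb_receipt_amt": [50.0, 2700.0, 27.0],
    "contbr_occupation": ["TEACHER", "ATTORNEY", "RETIRED"],
})

n_rows, n_cols = toy.shape            # shape is a (rows, columns) tuple
column_names = toy.columns.tolist()   # plain Python list of column labels

print(f"{n_rows} rows, {n_cols} columns")
print(column_names)
```

The same two attributes apply unchanged to the real `contrib` dataframe once the setup cells have run.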
In [ ]:
import pandas as pd
import matplotlib.pyplot as plt
import datetime

# These commands below set some options for pandas and to have matplotlib
pd.set_option('display.max_rows', 1000)
pd.options.display.float_format = '{:,.2f}'.format
%matplotlib inline

# Define a date parser to pass to read_csv
d = lambda x: datetime.datetime.strptime(x, '%d-%b-%y')

# Load the data
# We have this defaulted to the folder OUTSIDE of your repo - please chang
contrib = pd.read_csv('../../P00000001-CA.csv', index_col=False, parse_dat
print(contrib.shape)

# Note - for now, it is okay to ignore the warning about mixed types.

In [ ]:
# Subset data to primary period
contrib = contrib.copy()[contrib['contb_receipt_dt'] <= datetime.datetime(
print(contrib.shape)

Sometimes variable names are not clear unless we read the documentation. In your own words, based on the documentation, what information does the election_tp variable contain?

1a YOUR RESPONSE HERE

1b. Print the first 5 rows from the dataset to manually check some of the data.
This is a good idea to ensure the data loaded and the columns parsed correctly!

1c. Pick three variables from the dataset above and run some quick sanity checks.
When working with a new dataset, it is important to explore and sanity check your variables. For example, you may want to examine the maximum and minimum values, a frequency count, or something else. Use the three markdown cells below to explain if your three chosen variables "pass" your sanity checks or if you have concerns about the integrity of your data and why.

1c YOUR RESPONSE HERE

1d. Plotting a histogram
Make a histogram of one of the variables you picked above.
What are some insights that you can see from this histogram? Remember to include on your histogram:
- Include a title
- Include axis labels
- The correct number of bins to see the breakout of values

Hint: For some variables the range of values is very large. To do a better exploration, make the initial histogram the full range and then you can make a smaller histogram 'zoomed' in on a discrete range.

1d YOUR RESPONSE HERE

In [2]: # 1a YOUR CODE HERE
In [3]: # 1b YOUR CODE HERE
In [4]: # 1c YOUR CODE HERE for variable #1
In [ ]: # 1c YOUR CODE HERE for variable #2
In [ ]: # 1c YOUR CODE HERE for variable #3
In [2]: # 1d YOUR CODE HERE

2. Exploring Campaign Contributions (30 points)

Let's investigate the donations to the candidates.

2a. Present a table that shows the number of donations to each candidate sorted by number of donations.
When presenting data as a table, it is often best to sort the data in a meaningful way. This makes it easier for your reader to examine what you've done and to glean insights. From now on, all tables that you present in this assignment (and course) should be sorted.
Hint: Use the groupby method. Groupby is explained in Unit 13: async 13.3 & 13.5
Hint: Use the sort_values method to sort the data so that candidates with the largest number of donations appear on top.
Which candidate received the largest number of contributions (variable 'contb_receipt_amt')?

2a YOUR RESPONSE HERE

2b. Now, present a table that shows the total value of donations to each candidate, sorted by total value of the donations.
Which candidate raised the most money in California?

2b YOUR RESPONSE HERE

2c. Combine the tables (sorted by either a or b above).
Looking at the two tables you presented above: if those tables are Series, convert them to DataFrames.
Rename the variable (column) names to accurately describe what is presented.
Merge together your tables to show the count and the value of donations to each candidate in one table.
Hint: Use the merge method.

2d. Calculate and add a new variable to the table from 2c that shows the average $ per donation. Print this table sorted by the average donation.

In [3]: # 2a YOUR CODE HERE
In [ ]: # 2b YOUR CODE HERE
In [ ]: # 2c YOUR CODE HERE
In [ ]: # 2d YOUR CODE HERE

2e. Plotting a bar chart
Make a single bar chart that shows two different bars per candidate with one bar as the total value of the donations and the other as average $ per donation.
- Show the candidate's name on the x-axis
- Show the amount on the y-axis
- Include a title
- Include axis labels
Hint: Make the y-axis a log-scale to show both numbers! (matplotlib docs: https://matplotlib.org/3.1.1/api/_as_gen/matplotlib.pyplot.yscale.html )

2f. Comment on the results of your data analysis in a short paragraph.
There are several interesting conclusions you can draw from the table you have created. What have you learned about campaign contributions in California? We are looking for data insights here rather than comments on the code!

2f YOUR RESPONSE HERE

3. Exploring Donor Occupations (30 points)

Above in part 2, we saw that some simple data analysis can give us insights into the campaigns of our candidates. Now let's quickly look to see what kind of person is donating to each campaign using the contbr_occupation variable.

3a. Show the top 5 occupations of individuals that contributed to Hillary Clinton.
Subset your data to create a dataframe with only donations for Hillary Clinton. Then use the value_counts and head methods to present the top 5 occupations (contbr_occupation) for her donors.

In [ ]: # 2e YOUR CODE HERE
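The subset-then-count pattern described in 3a might look roughly like the sketch below. It runs on a small invented dataframe, not on the FEC file: `toy`, the candidate labels and the occupation values are hypothetical, and `cand_nm` is an assumed name for the candidate column.

```python
import pandas as pd

# Hypothetical stand-in data; `cand_nm` is an assumed column name and the
# rows are invented for illustration only.
toy = pd.DataFrame({
    "cand_nm": ["Candidate A"] * 6 + ["Candidate B"] * 2,
    "contbr_occupation": [
        "RETIRED", "TEACHER", "RETIRED", "ATTORNEY",
        "RETIRED", "TEACHER", "ENGINEER", "ENGINEER",
    ],
})

# A boolean mask keeps only the rows for one candidate
one_candidate = toy[toy["cand_nm"] == "Candidate A"]

# value_counts() tallies each occupation and sorts by count, largest first;
# head(5) keeps at most the first five rows of that tally
top_occupations = one_candidate["contbr_occupation"].value_counts().head(5)
print(top_occupations)
```

Note that value_counts leaves out missing values by default; passing dropna=False would count them as well, which can matter when a free-text field such as occupation is often blank.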
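Returning to Part 2, the hints about groupby, sort_values, merge and a log-scale y-axis can be strung together as in the sketch below. This is one possible arrangement under stated assumptions, not the expected solution: the dataframe `toy` and its amounts are invented, `cand_nm` is an assumed name for the candidate column, and the output column names (`donation_count`, `total_amount`, `avg_amount`) are made up for this example.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical stand-in data; `cand_nm` is an assumed column name and the
# amounts are invented for illustration only.
toy = pd.DataFrame({
    "cand_nm": ["Candidate A", "Candidate B", "Candidate A",
                "Candidate C", "Candidate B", "Candidate A"],
    "contb_receipt_amt": [25.0, 2700.0, 50.0, 100.0, 300.0, 15.0],
})

by_candidate = toy.groupby("cand_nm")["contb_receipt_amt"]

# 2a/2b style: one table of counts and one of totals, each sorted largest first
counts = (by_candidate.count()
          .sort_values(ascending=False)
          .to_frame(name="donation_count"))
totals = (by_candidate.sum()
          .sort_values(ascending=False)
          .to_frame(name="total_amount"))

# 2c style: both frames are indexed by candidate, so merge on the index
summary = counts.merge(totals, left_index=True, right_index=True)

# 2d style: average dollars per donation, then sort by that new column
summary["avg_amount"] = summary["total_amount"] / summary["donation_count"]
summary = summary.sort_values("avg_amount", ascending=False)
print(summary)

# 2e style: two bars per candidate; a log y-axis keeps the small averages
# visible next to the much larger totals
ax = summary[["total_amount", "avg_amount"]].plot(kind="bar")
ax.set_yscale("log")
ax.set_title("Toy data: total vs. average donation by candidate")
ax.set_xlabel("Candidate")
ax.set_ylabel("Amount in USD (log scale)")
plt.tight_layout()
```

In a notebook with %matplotlib inline the figure renders when the cell finishes; in a plain script a final plt.show() would be needed. Merging on the index works here because both tables come from the same groupby key; if the candidate name were an ordinary column instead, merge(on=...) would be the equivalent call.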