Dataset
You are to analyze campaign contributions made in California to the 2016 U.S. presidential primary races. Use the attached CSV file (I could not attach it since it is about 220 MB; how can I transfer it?). Download and save this file in the same folder as this notebook.
General Guidelines:
· This is a real dataset and so it may contain errors and other peculiarities to work through
· This dataset is ~218 MB, which will take some time to load (and probably won't load in Google Sheets or Excel)
· If you make assumptions, annotate them in your responses
· While there is one code/markdown cell positioned after each question as a placeholder, some of your code/responses may require multiple cells
· Double-click the markdown cells that say YOUR ANSWER HERE to enter your written answers. If you need more cells for your written answers, make them markdown cells (rather than code cells)
Setup
Run the two cells below.
The first cell will load the data into a pandas dataframe named contrib. Note that a custom date parser is defined to speed up loading. If Python were to guess the date format, it would take even longer to load.
The second cell subsets the dataframe to focus on just the primary period through May 2016. Otherwise, we would see general election donations which would make it harder to draw conclusions about the primaries.
import pandas as pd
import matplotlib.pyplot as plt
import datetime
# These commands below set some options for pandas and to have matplotlib show the charts in the notebook
pd.set_option('display.max_rows', 1000)
pd.options.display.float_format = '{:,.2f}'.format
%matplotlib inline
# Define a date parser to pass to read_csv
# (pd.datetime was removed in pandas 1.0, so the standard-library datetime module is used)
d = lambda x: datetime.datetime.strptime(x, '%d-%b-%y')
# Load the data
# We have this defaulted to the folder OUTSIDE of your repo - please change it as needed
contrib = pd.read_csv('../../P00000001-CA.csv', index_col=False, parse_dates=['contb_receipt_dt'], date_parser=d)
print(contrib.shape)
# Note - for now, it is okay to ignore the warning about mixed types.
# Subset data to primary period
contrib = contrib.copy()[contrib['contb_receipt_dt'] <= datetime.datetime(2016, 5, 31)]
print(contrib.shape)
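The date parsing and subsetting steps above can be tried on a tiny stand-in frame before loading the full file. This is only a sketch: the three rows are invented for illustration, and `pd.to_datetime` with an explicit `format` is used here as one alternative to passing a parser to `read_csv`.

```python
import datetime

import pandas as pd

# Toy stand-in for the real file; these values are invented for illustration.
toy = pd.DataFrame({
    "contb_receipt_dt": ["15-Jan-16", "31-May-16", "01-Jun-16"],
    "contb_receipt_amt": [25.0, 100.0, 50.0],
})

# Giving the format explicitly avoids per-value format guessing.
toy["contb_receipt_dt"] = pd.to_datetime(toy["contb_receipt_dt"], format="%d-%b-%y")

# Keep only contributions dated on or before May 31, 2016.
cutoff = datetime.datetime(2016, 5, 31)
primary = toy[toy["contb_receipt_dt"] <= cutoff].copy()
print(primary.shape)
```

Printing the shape before and after the filter, as the setup cells do, is a quick way to confirm that rows were actually dropped.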
1. Data Exploration
1a. First, take a preliminary look at the data.
· Print the shape of the data. What does this tell you about the number of variables and rows you have?
· Print a list of column names.
· Review the documentation for this data (link above). Do you have all of the columns you expect to have?
· Sometimes variable names are not clear unless we read the documentation. In your own words, based on the documentation, what information does the election_tp variable contain?
# 1a YOUR CODE HERE
1a YOUR RESPONSE HERE
1b. Print the first 5 rows from the dataset to manually check some of the data.
This is a good idea to ensure the data loaded and the columns parsed correctly!
# 1b YOUR CODE HERE
1c. Pick three variables from the dataset above and run some quick sanity checks.
When working with a new dataset, it is important to explore and sanity check your variables. For example, you may want to examine the maximum and minimum values, a frequency count, or something else. Use the three markdown cells below to explain if your three chosen variables "pass" your sanity checks or if you have concerns about the integrity of your data and why.
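As a rough illustration of what such checks can look like, the sketch below runs a min/max check, a count of negative amounts, and a frequency count that includes missing values. The toy frame and its values are invented; only the column names contb_receipt_amt and contbr_occupation come from this notebook.

```python
import pandas as pd

# Invented toy data; the real checks would run on the contrib dataframe.
toy = pd.DataFrame({
    "contb_receipt_amt": [25.0, -50.0, 2700.0, 10.0],
    "contbr_occupation": ["TEACHER", None, "RETIRED", "RETIRED"],
})

# Range check: negative amounts (e.g. refunds) are worth noticing.
amt_min = toy["contb_receipt_amt"].min()
amt_max = toy["contb_receipt_amt"].max()
n_negative = (toy["contb_receipt_amt"] < 0).sum()

# Missing values and a frequency count (dropna=False keeps NaN visible).
n_missing = toy["contbr_occupation"].isna().sum()
occupation_counts = toy["contbr_occupation"].value_counts(dropna=False)

print(amt_min, amt_max, n_negative, n_missing)
print(occupation_counts)
```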
# 1c YOUR CODE HERE for variable #1
# 1c YOUR CODE HERE for variable #2
# 1c YOUR CODE HERE for variable #3
1c YOUR RESPONSE HERE
1d. Plotting a histogram
Make a polished, professional histogram of one of the variables you picked above. What insights can you draw from this histogram? Remember to include on your histogram:
· Include a title
· Include axis labels
· The correct number of bins to see the breakout of values
· Hint: For some variables the range of values is very large. To explore them better, first make a histogram of the full range, then make a second histogram 'zoomed' in on a narrower, discrete range.
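The full-range-then-zoom idea from the hint might look something like the following. The amounts, bin counts, and the 0 to 300 zoom window are all assumptions chosen for the toy data, not recommendations for the real dataset.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Invented amounts with one large outlier, to mimic a wide range of values.
amounts = pd.Series([5, 10, 25, 25, 50, 100, 250, 2700], name="contb_receipt_amt")

fig, (ax_full, ax_zoom) = plt.subplots(1, 2, figsize=(10, 4))

# Left: the full range, where the outlier squeezes everything else together.
ax_full.hist(amounts, bins=20)
ax_full.set_title("Contribution amounts (full range)")
ax_full.set_xlabel("Amount (USD)")
ax_full.set_ylabel("Number of contributions")

# Right: zoom in on an assumed window of 0 to 300 dollars.
zoomed = amounts[amounts.between(0, 300)]
ax_zoom.hist(zoomed, bins=15)
ax_zoom.set_title("Contribution amounts (0 to 300 USD)")
ax_zoom.set_xlabel("Amount (USD)")
ax_zoom.set_ylabel("Number of contributions")

fig.tight_layout()
```

In a notebook with `%matplotlib inline` the figure renders when the cell finishes; in a plain script, `plt.show()` would be needed.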
# 1d YOUR CODE HERE
1d YOUR RESPONSE HERE
2. Exploring Campaign Contributions
Let's investigate the donations to the candidates.
2a. Present a table that shows the number of donations to each candidate sorted by number of donations.
· When presenting data as a table, it is often best to sort the data in a meaningful way. This makes it easier for your reader to examine what you've done and to glean insights. From now on, all tables that you present in this assignment (and course) should be sorted.
· Hint: Use the groupby method. Groupby is explained in Unit 13: async 13.3 & 13.5
· Hint: Use the sort_values method to sort the data so that candidates with the largest number of donations appear on top.
Which candidate received the largest number of contributions (variable 'contb_receipt_amt')?
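The groupby and sort_values hints can be sketched on toy data as follows. The candidate column name `cand_nm` is an assumption (check the column list from 1a), and the rows are invented.

```python
import pandas as pd

# Invented rows; "cand_nm" is an assumed name for the candidate column.
toy = pd.DataFrame({
    "cand_nm": ["A", "B", "A", "C", "A", "B"],
    "contb_receipt_amt": [10.0, 20.0, 5.0, 100.0, 15.0, 30.0],
})

# Count the donations per candidate, then put the largest count on top.
counts = (
    toy.groupby("cand_nm")["contb_receipt_amt"]
    .count()
    .sort_values(ascending=False)
)
print(counts)
```

Swapping `.count()` for `.sum()` gives the total value per candidate instead of the number of donations.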
# 2a YOUR CODE HERE
2a YOUR RESPONSE HERE
2b. Now, present a table that shows the total value of donations to each candidate, sorted by the total value of the donations.
Which candidate raised the most money in California?
# 2b YOUR CODE HERE
2b YOUR RESPONSE HERE
2c. Combine the tables (sorted by either a or b above).
· Looking at the two tables you presented above - if those tables are Series convert them to DataFrames.
· Rename the variable (column) names to accurately describe what is presented.
· Merge together your tables to show the count and the value of donations to each candidate in one table.
· Hint: Use the merge method.
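One possible shape for the Series-to-DataFrame conversion and the merge is sketched below. The two input Series and the column names `donation_count` and `donation_total` are hypothetical placeholders for whatever 2a and 2b produced.

```python
import pandas as pd

# Hypothetical results of two earlier groupby steps, indexed by candidate.
counts = pd.Series({"A": 3, "B": 2})
totals = pd.Series({"A": 30.0, "B": 50.0})

# Convert each Series to a DataFrame with a descriptive column name.
count_df = counts.to_frame(name="donation_count")
total_df = totals.to_frame(name="donation_total")

# Merge on the shared candidate index, then sort the combined table.
combined = count_df.merge(total_df, left_index=True, right_index=True)
combined = combined.sort_values("donation_total", ascending=False)
print(combined)
```

Merging on the index works here because both tables are indexed by candidate; if the candidate were a regular column, `on=` would be the natural choice instead.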
# 2c YOUR CODE HERE
2d. Calculate and add a new variable to the table from 2c that shows the average \$ per donation. Print this table sorted by the average donation.
# 2d YOUR CODE HERE
2e. Plotting a Bar Chart
Make a single bar chart that shows two different bars per candidate with one bar as the total value of the donations and the other as average $ per donation.
· Show the candidate's name on the x-axis
· Show the amount on the y-axis
· Include a title
· Include axis labels
· Hint: Make the y-axis a log-scale to show both numbers! (matplotlib docs: https://matplotlib.org/3.1.1/api/_as_gen/matplotlib.pyplot.yscale.html)
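A grouped bar chart on a log-scaled y-axis could be sketched like this. The summary table, its column names, and the candidate labels are invented; `logy=True` in the pandas plotting call is one way to get the log scale, and `plt.yscale('log')` is another.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Invented summary table; column names are hypothetical.
summary = pd.DataFrame(
    {
        "donation_total": [50000.0, 12000.0, 900.0],
        "avg_donation": [45.0, 120.0, 300.0],
    },
    index=["Candidate A", "Candidate B", "Candidate C"],
)

# One bar per column for each candidate; logy=True makes both magnitudes visible.
ax = summary.plot(kind="bar", logy=True, figsize=(8, 4))
ax.set_title("Total and average donation by candidate")
ax.set_xlabel("Candidate")
ax.set_ylabel("Amount in USD (log scale)")
plt.tight_layout()
```

Without the log scale, the average-donation bars would be nearly invisible next to totals that are orders of magnitude larger.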
# 2e YOUR CODE HERE
2f. Comment on the results of your data analysis in a short paragraph.
· There are several interesting conclusions you can draw from the table you have created.
· What have you learned about campaign contributions in California?
· We are looking for data insights here rather than comments on the code!
2f YOUR RESPONSE HERE
3. Exploring Donor Occupations
Above in part 2, we saw that some simple data analysis can give us insights into the campaigns of our candidates. Now let's quickly look to see what kind of person is donating to each campaign using the contbr_occupation variable.
3a. Show the top 5 occupations of individuals that contributed to Hillary Clinton.
· Subset your data to create a dataframe with only donations for Hillary Clinton.
· Then use the value_counts and head methods to present the top 5 occupations (contbr_occupation) for her donors.
· Note: we are just interested in the count of donations, not the value of those donations.
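The subset-then-count pattern described in these steps can be sketched on toy data. The rows and candidate labels are invented, `cand_nm` is an assumed column name, and `head(2)` is used only because the toy frame is tiny.

```python
import pandas as pd

# Invented rows; "cand_nm" is an assumed name for the candidate column.
toy = pd.DataFrame({
    "cand_nm": ["Candidate A"] * 6 + ["Candidate B"] * 2,
    "contbr_occupation": [
        "RETIRED", "RETIRED", "TEACHER", "RETIRED", "TEACHER", "ENGINEER",
        "STUDENT", "STUDENT",
    ],
})

# Keep only one candidate's donations.
subset = toy[toy["cand_nm"] == "Candidate A"]

# value_counts sorts by frequency, so head() returns the most common values.
top = subset["contbr_occupation"].value_counts().head(2)
print(top)
```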
# 3a YOUR CODE HERE
3b. Write a function called get_donors.
Imagine that you want to do the previous operation on several candidates. To keep your work neat, you want to take the work you did on the Clinton-subset and wrap it in a function that you can apply to other subsets of the data.
· The function should take a DataFrame as a parameter, and return a Series containing the counts for the top 5 occupations contained in that DataFrame.
def get_donors(df):
    """This function takes a dataframe that contains a variable named contbr_occupation.
    It outputs a Series containing the counts for the 5 most common values of that
    variable."""
    # 3b YOUR CODE HERE
3c. Now run the get_donors function on subsets of the dataframe corresponding to three candidates. Show each of the three candidates below.
· Hillary Clinton
· Bernie Sanders
· Donald Trump
# 3c YOUR CODE HERE
3d. Finally, use groupby to separate the entire dataset by candidate.
· Call .apply(get_donors) on your groupby object, which will apply the function you wrote to each subset of your data.
· Look at your output and marvel at what pandas can do in just one line!
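To see how applying a function to each group behaves, here is a small sketch with a stand-in function. `top_occupations` is a hypothetical helper written for this example (it is not the assignment's get_donors), `cand_nm` is an assumed column name, and the rows are invented.

```python
import pandas as pd

# Invented rows; "cand_nm" is an assumed name for the candidate column.
toy = pd.DataFrame({
    "cand_nm": ["Candidate A"] * 3 + ["Candidate B"] * 4,
    "contbr_occupation": [
        "RETIRED", "RETIRED", "TEACHER",
        "ENGINEER", "ENGINEER", "ENGINEER", "STUDENT",
    ],
})


def top_occupations(df, n=2):
    """Hypothetical helper: counts of the n most common occupations in df."""
    return df["contbr_occupation"].value_counts().head(n)


# groupby splits the frame by candidate; apply calls the helper on each piece.
result = toy.groupby("cand_nm").apply(top_occupations)
print(result)
```

Each group is passed to the function as its own DataFrame, and pandas stitches the per-group results together with the candidate as the outer label. Recent pandas versions may emit a deprecation warning about grouping columns being passed to the function; that does not affect this sketch, which only reads contbr_occupation.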
# 3d YOUR CODE HERE
3e. Comment on your data insights & findings in a short paragraph.
3e YOUR RESPONSE HERE
3f. Think about your findings in section 3 vs. your findings in section 2 of this assignment.
Do you have any new data insights into the results you saw in section 2 now that you see the top occupations for each candidate?
3f YOUR RESPONSE HERE
4. Plotting Data
There is an important element that we have not yet explored in this dataset - time.
4a. Present a single line chart with the following elements.
· Show the date on the x-axis
· Show the contribution amount on the y-axis
· Include a title
· Include axis labels
# 4a YOUR CODE HERE
4b. Make a better time-series line chart
This chart is messy and it is hard to gain insights from it. Improve the chart from 4a so that your new chart shows a specific insight. In the spot provided, write the insight(s) that can be gained from this new time-series line chart.
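One common way to tidy a noisy time series is to aggregate it to a coarser interval before plotting. The sketch below sums invented contributions by week using `resample`; the weekly interval is an assumption, and daily or monthly totals (or one line per candidate) may suit the real data better.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Invented contributions spread over three calendar weeks.
toy = pd.DataFrame({
    "contb_receipt_dt": pd.to_datetime(
        ["2016-01-04", "2016-01-05", "2016-01-12", "2016-01-13", "2016-01-20"]
    ),
    "contb_receipt_amt": [10.0, 20.0, 5.0, 15.0, 100.0],
})

# Index by date, then sum the amounts within each week (weeks end on Sunday).
weekly = toy.set_index("contb_receipt_dt")["contb_receipt_amt"].resample("W").sum()

ax = weekly.plot(kind="line", figsize=(8, 4))
ax.set_title("Total contributions per week")
ax.set_xlabel("Week ending")
ax.set_ylabel("Total amount (USD)")
plt.tight_layout()
```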
# 4b YOUR CODE HERE
4b YOUR RESPONSE HERE