Instructions:
You will be looking at data from a survey in the US state of Colorado on opinions of the oil and gas industry, and evaluating whether Facebook ads changed opinions of the oil & gas industry.

For context, in this study, some individuals in Colorado were randomly selected to receive video advertisements on Facebook, which highlighted the risks of the oil & gas industry. This is the ‘treatment’ group. Another set of individuals on Facebook were the ‘control’ group and did not receive ads. All individuals in both the treatment and control groups were asked to complete a survey. Not all individuals started the survey, and not all individuals who started the survey completed it.

The survey asked respondents a number of demographic questions, then asked “Do you believe your community is better or worse off because of the oil and gas industry?” Respondents selected one of the following choices:
● 1 - Definitely better off
● 2 - Somewhat better off
● 3 - Neither better nor worse off
● 4 - Somewhat worse off
● 5 - Definitely worse off

We can compare answers between the treatment and control groups to evaluate the effectiveness of the advertisements.

Data:
You will use two datasets:
1. Survey Data: includes a row for every individual who started the survey. Includes fields for survey responses and attributes of individuals.
· Description of fields:
Field                   Description
person_id               ID for survey respondent
county                  county of respondent
FIPS                    5-digit FIPS code for county of respondent
treatment               1 indicates respondent was in treatment group, 0 indicates respondent was in control group
total_duration_in_sec   time the respondent took to respond to the survey, in seconds
Q1_answer_code          The respondent's numerical response to survey question 1
Q1_answer_text          The respondent's text response to survey question 1

2. County Shapefiles: Standard zip file of county boundary shapefiles from the US Census

Objectives:
With this data, your goal is to:
● Clean up and QA survey data
● Understand scope of cleaned data: what is the geographic coverage of our survey respondents?
● Compare the survey responses of the treatment group (those who saw video advertisements) and control group (those who did not see video advertisements).

ASSIGNMENT

Part 1: Data Intro and QA
In Part 1, we will load the survey data and clean it.

1.1: Set Up
Run the code below to import modules. Then read the survey data into a dataframe called df_survey. The survey data is available on GitHub at the link below:
'https://raw.githubusercontent.com/smsidekick/project-sidekick/main/blihkjhdrsers.csv'

# Install Geopandas
! pip install geopandas --q
# Import pandas and numpy
import pandas as pd
import numpy as np
# Import geopandas
import geopandas as gpd
# Import plotnine
from plotnine import *
import plotnine

1.2: Explore Data
Orient yourself to the survey data.

1.3: Duplicate IDs
Is the person_id field unique? Are there any duplicate values in that field?
If there are duplicates, remove the duplicates. Save this back to df_survey.

1.4: Complete Survey Responses
Using code, check if any individuals did not answer survey question 1. If so, filter df_survey to include responses only from those who completed survey question 1: filter out any rows where Q1_answer_code is null. Save this filtered data to a new dataframe called df_complete.

1.5: Survey Speeders
Did any respondents in df_complete speed through the survey? Filter out any responses that were impossibly fast outliers based on your judgement. Save this filtered data back to df_complete. Make the rationale for your decision clear. A histogram may be helpful.

1.6: Survey Responses
Show the distribution of the survey responses in Q1_answer_text (i.e. how many people responded with each answer?). In a sentence, brainstorm why some may say the oil & gas industry makes their community better off vs. worse off.

Part 2: Survey Coverage in Colorado
In Part 2, we will explore the survey results by Colorado county and then create a map to understand the geographic coverage of our responses. We'll explore all the results (for both treatment and control). We're looking to inform two questions:
1. Do we think we have a good, representative sample of the entire state?
2. Do we think we have enough data to evaluate the experiment by county?

2.1: Read in County Shapefiles
Use command line code to read in the county shapefiles for the entire US from the link below. Read the data into a geodataframe, df_counties.
https://www2.census.gov/geo/tiger/TIGER2019/COUNTY/tl_2019_us_county.zip

2.2: Filter Geodataframe
Filter df_counties to include only Colorado counties by filtering for when STATEFP is 08 (the State FIPS code for Colorado). Save this to a new geodataframe, df_counties_co.

2.3: Summarize Survey by County
Turning back to the survey results: create a dataframe summarizing the total number of survey responses by county and FIPS.
Save this summary to a new dataframe, df_county_survey. (In the next step, we'll join this onto df_counties_co.) Then, dig into the county results and answer:
· How many unique counties do we have in total in df_county_survey?
· What is the minimum number of responses in a county?
Describe the new dataset, and the distribution of the number of survey responses by county.

2.4 Bucket Number of Responses
In df_county_survey, create a new column N_resp_bucket that buckets the number of survey responses in steps of 25: <25, 25-50, 50-100, etc.

2.5: Join Survey and Geo Data
Join df_counties_co and df_county_survey, matching the FIPS column to the GEOID column. Save the joined dataframe to a new geodataframe, df_map.

2.6: Map
Plot a choropleth map of df_map, coloring each county by the bucketed number of survey responses, N_resp_bucket.

2.7 Takeaways on Survey Scope
Take a few sentences to answer our two questions. Looking at this data, in your opinion:
1. Do we have a good, representative sample of the entire state?
2. Do we have enough data to evaluate the experiment by county?
What other information might you want to more robustly inform these questions? (Don't worry if you don't know much about Colorado. Just discuss what you see and what you might want to know more about.)

Part 3: Evaluate Experiment
In Part 3, we'll evaluate if the survey responses from the treatment group (who saw the ads on Facebook about the negative impacts of oil & gas) were significantly different from those in the control group. In the survey, question 1 asked respondents "Do you believe your community is better or worse off because of the oil and gas industry?" Respondents answered on a scale of 1 to 5, where 1 meant "Definitely better off" and 5 meant "Definitely worse off".

3.1: Treatment vs. Control Size
How many survey respondents were in the treatment group vs. the control group?
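For the group-size question in 3.1, one possible approach is a value count on the treatment column. The sketch below uses a small invented dataframe in place of df_complete (the rows are an assumption for illustration only, not survey data); the column name treatment comes from the field description.

```python
import pandas as pd

# Toy stand-in for df_complete; these rows are invented for illustration only.
df_complete = pd.DataFrame({
    "person_id": [1, 2, 3, 4, 5],
    "treatment": [1, 0, 1, 1, 0],
})

# Count rows per group: 1 = treatment, 0 = control.
group_sizes = df_complete["treatment"].value_counts().sort_index()
print(group_sizes)

# Share of respondents in each group, useful for spotting imbalance.
group_shares = df_complete["treatment"].value_counts(normalize=True).sort_index()
print(group_shares)
```

A large imbalance between the two groups would be worth noting before comparing their answers.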
3.2: Differences Between Treatment and Control
Calculate the average Q1_answer_code value for the treatment and the control groups.

3.3 Interpret Results
In a few sentences, discuss what you calculated above in 3.2. What is one follow up question you have, or what might be a next step to understand what is going on in greater detail?
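The comparison in 3.2 can be sketched with a groupby. The example below uses invented rows in place of df_complete, so the numbers mean nothing about the real survey. It also reports a Welch-style standard error for the difference in means; that extra step is not requested by the assignment and is shown only as one possible next step for 3.3. Treating the 1 to 5 answer codes as numeric is itself an assumption.

```python
import numpy as np
import pandas as pd

# Toy stand-in for df_complete; values are invented for illustration only.
df_complete = pd.DataFrame({
    "treatment":      [1, 1, 1, 1, 0, 0, 0, 0],
    "Q1_answer_code": [4, 5, 3, 4, 2, 3, 3, 2],
})

# Mean, sample variance, and size of each group (0 = control, 1 = treatment).
summary = (
    df_complete.groupby("treatment")["Q1_answer_code"]
    .agg(["mean", "var", "count"])
)

# Difference in means: treatment minus control.
diff = summary.loc[1, "mean"] - summary.loc[0, "mean"]

# Welch-style standard error: sqrt(var_t / n_t + var_c / n_c).
se = np.sqrt((summary["var"] / summary["count"]).sum())

print(summary)
print(f"difference = {diff:.2f}, standard error = {se:.2f}")
```

Because higher codes mean "worse off", a positive difference here would indicate that the treatment group rated the industry's effect more negatively than the control group.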
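For the bucketing step in 2.4, pd.cut is one option. The sketch below assumes a response-count column named N_resp (a hypothetical name; the assignment only names N_resp_bucket), uses invented counts, and follows the "steps of 25" wording with left-closed bins, so a county with exactly 25 responses lands in 25-50. The assignment's example labels (<25, 25-50, 50-100) leave the exact edges open, so the bin edges here are an assumption to adjust as needed.

```python
import numpy as np
import pandas as pd

# Toy stand-in for df_county_survey; county names and counts are invented.
df_county_survey = pd.DataFrame({
    "county": ["A", "B", "C", "D"],
    "N_resp": [3, 25, 49, 110],   # N_resp is a hypothetical column name
})

step = 25
# Upper edge: the first multiple of `step` strictly above the largest count.
upper = (df_county_survey["N_resp"].max() // step + 1) * step
edges = np.arange(0, upper + step, step)   # 0, 25, 50, ..., upper

# Labels: "<25" for the first bin, then "lo-hi" for each later bin.
labels = [f"<{step}"] + [f"{lo}-{hi}" for lo, hi in zip(edges[1:-1], edges[2:])]

# right=False makes bins left-closed: [0, 25), [25, 50), ...
df_county_survey["N_resp_bucket"] = pd.cut(
    df_county_survey["N_resp"], bins=edges, right=False, labels=labels
)
print(df_county_survey)
```

pd.cut returns an ordered categorical, which tends to keep the buckets in numeric order when used as a map legend.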