11/17/2019 1/14 Analyzing police traffic stop data The goal of this assignment is to give you experience using the pandas data analysis library. It will also give you experience using a...

Use pandas and numpy for this assignment. Please add comments so that I can review the code.


Analyzing police traffic stop data

Due: Nov 20 at 23:59

The goal of this assignment is to give you experience using the pandas data analysis library (http://pandas.pydata.org/). It will also give you experience using a well-documented third-party library, and navigating its documentation in search of specific features that can help you complete your implementation.

You may work alone or in a pair on this assignment.

North Carolina traffic stop data

Much has been written about the impact of race on police traffic stops (https://www.citylab.com/life/2018/06/is-it-time-to-reconsider-traffic-stops/561557/). In this assignment, we will examine and analyze trends in traffic stops that occurred in the state of North Carolina from 2000 until 2015. We will not be able to look at every single traffic stop, and will instead look at different subsets of data.

Feel free to continue exploring this data on your own after the assignment. You should have the necessary programming skills to do so! You can find additional clean data from the Stanford Open Policing Project (https://openpolicing.stanford.edu/data/).

Getting started

The pa6 directory includes a file named traffic_stops.py, which you will modify, a file named pa6_helpers.py that contains a few useful functions, and a file named test_traffic_stops.py with tests.

Please put all of your code for this assignment in traffic_stops.py. Do not add extra files and do not modify any files other than traffic_stops.py.

The data/ directory in pa6 contains a file called get_files.sh that will download the data files necessary for this assignment, along with some other files needed by the tests. To download these files, change into the data/ directory and run this command from the Linux command-line:

    $ sh get_files.sh

(Recall that we use $ to indicate the Linux command-line prompt. You should not include it when you run this command.) Please note that you must be connected to the network to use this script.

Do not add the data files to your repository! If you wish to use both CSIL & the Linux servers, and your VM, you will need to run the get_files.sh script twice: once for CSIL & the Linux servers and once for your VM.

Some of our utility code uses seaborn, a plotting library. This library is installed on the machines in CSIL. You will need to install it on your VM using the following command:

    sudo -H pip3 install seaborn

The sudo command will ask for a password. Use uccs as the password.

We suggest that, as you work through the assignment, you keep an ipython3 session open to test the functions you implement in traffic_stops.py. Run the following commands in ipython3 to get started:

    In [1]: %load_ext autoreload
    In [2]: %autoreload 2
    In [3]: import pandas as pd
    In [4]: import numpy as np
    In [5]: import traffic_stops as ts

We will use ts to refer to the traffic_stops module in our examples below.

Data

The Stanford Open Policing Project maintains a database of records from traffic stops (i.e., when a police officer pulls a driver over) around the country. We will be working with two different datasets extracted from this database.

The first dataset contains data on traffic stops that occurred in the state of North Carolina. For each stop, the dataset includes information related to the driver (gender, race, age, etc.), the stopping officer (a unique identifier for the officer), and the stop itself (a unique identifier for the stop, the date of the stop, the violation that triggered the stop, if any, etc.).
More specifically, the records from this dataset include the following fields:

    stop_id: a unique identifier of the stop
    stop_date: the date of the stop
    officer_id: a unique identifier for officers
    driver_gender: the driver’s gender
    driver_age: the driver’s age
    driver_race: a column that combines information about the driver’s race and ethnicity
    violation: the violation for which the driver was stopped
    is_arrested: a boolean that indicates whether the driver was arrested
    stop_outcome: the outcome of a stop (arrest, citation, written warning)

The gender column presumably contains information copied from the binary classification listed on the driver’s license, which may or may not match the driver’s actual personal gender identity. The race column presumably contains information about what the officer perceived the driver’s race to be, which may or may not match the driver’s actual personal racial and ethnic identity.

We have constructed three files from this dataset for this assignment:

    The first, all_stops_basic.csv, contains a small hand-picked sample of the data and is used in our test code.
    The second, all_stops_assignment.csv, contains a random sample of records from 500K stops (out of 10M).
    The third, all_stops_mini.csv, contains a random sample of 20 records and will be useful for debugging.

Here, for example, is the data from all_stops_basic.csv:

    stop_id,stop_date,officer_id,driver_gender,driver_age,driver_race,ethnicity,violatio
    2168033,2004-05-29,10020,M,53.0,White,N,Registration/plates,False,Written Warning
    4922383,2009-09-04,21417,M,22.0,Hispanic,H,Other,False,Citation
    924766,2001-08-13,10231,M,38.0,White,N,Other,False,Citation
    8559541,2014-05-25,11672,F,19.0,White,N,Other,False,Citation
    8639335,2014-07-05,21371,F,76.0,White,N,Other,False,Citation
    6198324,2011-04-30,11552,M,35.0,White,N,DUI,True,Arrest
    58220,2000-02-09,,F,42.0,Black,N,Other,False,Citation
    5109631,2009-12-23,11941,M,65.0,Black,N,Seat belt,False,Citation

Keep in mind that even “clean” data often contains irregularities. You’ll notice when you look at these files that some values are missing. For example, the officer_id is missing in the eighth line of the file. When you load the data into a dataframe, missing values like these will be represented with NaN values.

The second dataset contains information specific to those stops from the first dataset that resulted in a search. Each record in this dataset includes fields for:

    stop_id: the stop’s unique identifier
    search_type: the type of search (e.g., incident to arrest or protective frisk)
    contraband_found: indicates whether contraband was found during the search
    search_basis: the reason for the search (e.g., erratic behavior or official information)
    drugs_related_stop: indicates whether the stop was related to drugs

Here are the first ten lines from search_conducted_mini.csv:

    stop_id,search_type,contraband_found,search_basis,drugs_related_stop
    4173323,Probable Cause,False,Observation Suspected Contraband,
    996719,Incident to Arrest,True,Observation Suspected Contraband,
    5428741,Incident to Arrest,False,Other Official Info,
    824895,Incident to Arrest,False,Erratic Suspicious Behaviour,
    816393,Protective Frisk,False,Erratic Suspicious Behaviour,
    5657242,Incident to Arrest,False,Other Official Info,
    4534875,Incident to Arrest,False,Suspicious Movement,
    4733445,Incident to Arrest,False,Other Official Info,
    1537273,Incident to Arrest,False,Other Official Info,

As with the first dataset, some values are missing and will be represented with NaN values when you load the data into a dataframe. Please note that a stop from the first dataset will be represented in this second dataset only if it resulted in a search.

Pandas

You could write the code for this assignment using the csv library, lists, dictionaries, and loops. The purpose of this assignment, however, is to help you become more comfortable using pandas. As a result, you are required to use pandas data frames to store the data and pandas methods to do the necessary computations. If you use pandas methods efficiently and effectively, functions should be short and will likely use multiple pandas methods.

Some of the tasks we will ask you to do require using pandas features that have not been covered in class. This is by design: one of the goals of this assignment is for you to learn to read and use API documentation. So, when figuring out these tasks, you are allowed (and, in fact, encouraged) to look at the Pandas documentation (http://pandas.pydata.org/pandas-docs/stable/). In addition, any code you find in the Pandas documentation can be incorporated into your code without attribution. (For your own convenience, though, we encourage you to include citations for any code you get from the documentation that is more than one or two lines.) If, however, you find Pandas examples elsewhere on the Internet, and use that code either directly or as inspiration, you must include a code comment specifying its origin.

When solving the tasks in this assignment, you should assume that you can use a series of Pandas operations to perform the required computations. Before trying to implement anything, you should spend some time trying to find the right methods for the task. We also encourage you to experiment with them in ipython3 before you incorporate them into your code.

Our implementation used filtering and vector operations, as well as methods like agg, apply, cut, to_datetime, fillna, groupby, isin, loc, merge, read_csv, rename, size, transform, unstack, np.mean, np.where, along with a small number of lists and loops. Do not worry if you are not using all of these methods!

Your tasks

Task 1: Reading in CSV files

Before we analyze our data, we must read it in. Often, we also need to process the data to make it analysis-ready. It is usually good practice to define a function to read and process your data.
In this task, you will complete two such functions, one for each type of data. You may find pd.read_csv, pd.to_datetime, pd.cut, and np.where along with dataframe methods, such as fillna and isin, useful for Tasks 1a and 1b.

Task 1a: Building a dataframe from the stops CSV files

Your first task is to complete the function read_and_process_allstops in traffic_stops.py. This function takes the name of a CSV file that pertains to the all_stops dataset and should return a pandas dataframe, if the file exists. If the file does not exist, your function should return None. (You can use the library function os.path.exists to determine whether a file exists, or a try block (see R&S Exceptions) that returns None when the file cannot be opened.)

Note about reading the data

The pandas read_csv function allows you to read a CSV file into a dataframe. When you use this function, it is good practice to specify data types for the columns. You can do so by specifying a dictionary that maps column names to types using the dtype parameter. The set of types available for this purpose is a little primitive. In particular, you can specify str, int, float, and bool (or their np equivalents) as initial column types. In some cases, you will need to adjust the types after you read in the data.

For this assignment (and in general), you should be very thoughtful about how you specify column data types. Here are a few guidelines to consider: A number that can begin
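As a rough illustration of the reading guidance in the question, the sketch below reads a small stops file with explicit column types and returns None when the file is missing. The function name read_stops_sketch, the file name stops_demo.csv, and the four-column subset are assumptions made for this example. The real files have more columns, and the types chosen here are only one plausible reading of the guidelines, not the assignment's required answer.

```python
import os
import tempfile

import pandas as pd


def read_stops_sketch(csv_file):
    """Return a dataframe for csv_file, or None if the file does not exist."""
    if not os.path.exists(csv_file):
        return None
    # Assumed type choices: identifiers that may be missing are read as
    # strings, because an int column cannot hold NaN.
    col_types = {"stop_id": int, "officer_id": str, "driver_age": float}
    return pd.read_csv(csv_file, dtype=col_types, parse_dates=["stop_date"])


# Two rows taken from the sample data, cut down to four columns.
demo_csv = (
    "stop_id,stop_date,officer_id,driver_age\n"
    "2168033,2004-05-29,10020,53.0\n"
    "58220,2000-02-09,,42.0\n"
)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "stops_demo.csv")
    with open(path, "w") as f:
        f.write(demo_csv)
    df = read_stops_sketch(path)
    missing = read_stops_sketch(os.path.join(tmp, "no_such_file.csv"))

print(df.dtypes)
print(df["officer_id"].isna())  # the second row has no officer_id
print(missing)
```

The empty officer_id field in the second row loads as a missing value, which matches the NaN behavior the assignment describes.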
Answered Same Day Nov 18, 2021


Kshitij answered on Nov 21 2021
import numpy as np
import pandas as pd
# Defined constants for column names
ARREST_CITATION = 'arrest_or_citation'
IS_ARRESTED = 'is_arrested'
YEAR_COL = 'stop_year'
MONTH_COL = 'stop_month'
DATE_COL = 'stop_date'
STOP_SEASON = 'stop_season'
STOP_OUTCOME = 'stop_outcome'
SEARCH_TYPE = 'search_type'
SEARCH_CONDUCTED = 'search_conducted'
AGE_CAT = 'age_category'
OFFICER_ID = 'officer_id'
STOP_ID = 'stop_id'
DRIVER_AGE = 'driver_age'
DRIVER_RACE = 'driver_race'
DRIVER_GENDER = 'driver_gender'
VIOLATION = "violation"
SEASONS_MONTHS = {
    "winter": [12, 1, 2],
    "spring": [3, 4, 5],
    "summer": [6, 7, 8],
    "fall": [9, 10, 11]}
NA_DICT = {
    'drugs_related_stop': False,
    'search_basis': "UNKNOWN"
}
AGE_BINS = [0, 21, 36, 50, 65, 100]
AGE_LABELS = ['juvenile', 'young_adult', 'adult', 'middle_aged', 'senior']
SUCCESS_STOPS = ['Arrest', 'Citation']
CATEGORICAL_COLS = [AGE_CAT, DRIVER_GENDER, DRIVER_RACE,
                    STOP_SEASON, STOP_OUTCOME, VIOLATION]
# Task 1a
def read_and_process_allstops(csv_file):
    # Read identifiers with explicit types and parse the stop date on load
    type_dict = {STOP_ID: int, OFFICER_ID: str}
    try:
        df = pd.read_csv(csv_file, dtype=type_dict, parse_dates=[DATE_COL])
    except OSError:
        # The file does not exist or cannot be opened
        return None
    df[YEAR_COL] = df[DATE_COL].dt.year  # Create year column
    df[MONTH_COL] = df[DATE_COL].dt.month  # Create month column
    df[STOP_SEASON] = df[MONTH_COL].map({v_: k for k, v in...
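The last line of the answer is cut off in the source, so the following is not the answerer's code. It is a minimal sketch of one common way to turn a season-to-months dictionary like SEASONS_MONTHS into a month-to-season lookup and apply it with Series.map. The names month_to_season, dates, and seasons are hypothetical, and the dates are taken from the sample rows in the question.

```python
import pandas as pd

SEASONS_MONTHS = {
    "winter": [12, 1, 2],
    "spring": [3, 4, 5],
    "summer": [6, 7, 8],
    "fall": [9, 10, 11],
}

# Invert season -> [months] into month -> season so that it can be
# used as a lookup table by Series.map.
month_to_season = {
    month: season
    for season, months in SEASONS_MONTHS.items()
    for month in months
}

# Stop dates from the sample data, parsed into datetime values.
dates = pd.to_datetime(
    pd.Series(["2004-05-29", "2009-09-04", "2001-08-13", "2009-12-23"])
)
seasons = dates.dt.month.map(month_to_season)
print(seasons.tolist())  # ['spring', 'fall', 'summer', 'winter']
```

Building the lookup once and mapping the whole month column keeps the operation vectorized, in line with the assignment's request to prefer pandas methods over explicit loops.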