# Python Data Engineering Homework
The goal of this homework is to take a semi-structured, non-normalized CSV file and turn it into a set of normalized tables that you then push to your Postgres database on AWS. The original dataset contains 100k district court decisions, although I've downsampled it to only 1000 rows to make the uploads faster. Each row contains info about a judge: their demographics, party affiliation, etc. Rows also contain information about the case they were deciding on. Was it a criminal or civil case? What year was it? Was the direction of the decision liberal or conservative? While the current denormalized format is fine for analysis, it's not fine for a database, as it violates many normalization rules. Your goal is to normalize it by designing a simple schema, then wrangling it into the proper dataframes, then pushing it all to AWS.

For the first part of this assignment you should wind up with three tables: one with case information, one with judge information, and one with casetype information. Each table should be reduced so that there are no repeating rows, and primary keys should be assigned within each. These tables should be called 'cases', 'judges', and 'casetype'.

For the last part you should make a rollup table that calculates the percent of liberal decisions for each party level and each case category. This will let one take a quick look at how the political party affiliation of judges impacts the direction of a decision for different case categories (e.g. criminal, civil, labor).

## HW 4 - Building a normalized RDB

### Bring in data, explore, make schema

Start by bringing your data in to `cases`. Call a `.head()` on it to see what columns are there and what they contain.

```python
import pandas as pd

cases = pd.read_csv('https://ista322data.s3-us-west-1.amazonaws.com/cases_denorm_small.csv')

# head of cases
cases.head()
```

### Make schema

OK, given that head, you need to make three related tables that will make up a normalized database. Those tables are 'cases', 'judges', and 'casetype'. If it's not clear what info should go into each, explore the data more, ask on Slack, etc. Remember, you might not have keys, and you'll need to reduce the rows, select certain columns, etc. There isn't a defined path here.

Here's a textbox to define what should be in your schema... feel free to add to this.

### Make cases table - 3 points

Start by making a table that contains just each case's info. I would call this table that you're going to upload `cases_df` so you don't overwrite your raw data. This table should have six columns and 1000 rows. Note, one of these columns should be a `judge_id` that links to the judges table; you'll need to make this foreign key. Also, you can leave 'category_name' in this table as well as its id. Normally you'd split that off into its own table too, but you're already doing that for casetype, which is enough for now.

```python
# Make judge_id in cases
...

# Select necessary columns to make cases_df
...

# What's the shape of cases_df?
...
```
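Here's a minimal sketch of how this step might look. The raw column names used below (`judge_name`, `case_id`, and friends) are assumptions for illustration only; check your `cases.head()` output and substitute the real ones.

```python
# Sketch only -- the raw column names here are assumptions; check cases.head().

# Make judge_id: factorize() assigns each unique judge an integer code,
# which will serve as the foreign key into the judges table.
cases['judge_id'] = pd.factorize(cases['judge_name'])[0]

# Select just the case-level columns (six total, per the instructions).
cases_df = cases[['case_id', 'judge_id', 'casetype_id',
                  'category_id', 'category_name', 'libcon_id']]

# Should be (1000, 6)
cases_df.shape
```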
### Make cases table in your database

Go bring over your connection function from the last workbook. This must be in here for me to grade the homework: if I can't access your database you'll get a zero. It's a good idea to bring your other exploratory functions over as well. Once you do that you'll need to do the following:

- Connect, then make a table called 'cases' with the correct column names and data types. Be sure to execute and commit the table.
- Make tuples of your data.
- Write a SQL string that allows you to insert each tuple of data into the correct columns.
- Execute the string many times to fill out 'cases'.
- Commit changes and check the table.

I'm not going to leave a full roadmap beyond this. Feel free to add cells as needed to do the above. (If you get stuck, there are some hedged sketches of these steps at the end of this notebook.)

```python
# Use sql_head to check cases
sql_head(table_name='cases')
```

### Make judges - 2 points

Now make your judges table from the original cases dataframe (not the SQL table you just made). Judges should have five columns, including the judge_id column you made. There should be 553 rows after you drop duplicates (remember that judges may have had more than one case). After you make the dataset, go push it to a SQL table called 'judges'.

### Make casetype - 2 points

Go make the casetype table. This should have only two columns that allow you to link the casetype name back to the ID in the 'cases' table. There should be 27 rows as well.

### A quick test of your tables

Below is a query to get the number of unique judges that have ruled on criminal court motion cases. You should get a value of 119 as your return if your database is set up correctly!

```python
run_query("""SELECT COUNT(DISTINCT(judges.judge_id))
             FROM cases
             JOIN judges ON cases.judge_id = judges.judge_id
             WHERE casetype_id = (SELECT casetype_id
                                  FROM casetype
                                  WHERE casetype_name = 'criminal court motions');
          """)
```

### Make rollup table - 3 points

Now let's make that rollup table! The goal here is to make a summary table that can be easily accessed. We're going to roll the whole thing up by the judge's party and the category, but you could imagine doing this for each judge to track how they make decisions over time, which would then be useful for an analytics database. The one we're making could also be used as a dimension table where we needed overall party averages.

We want to get a percentage of liberal decisions at each grouping level (party_name, category_name). To do this we need, first, the number of cases seen at each level, and second, the number of liberal decisions made at each level. cases contains the column libcon_id, which is a 0 if the decision was conservative in its ruling and a 1 if it was liberal in its ruling. Thus, you can get a percentage of liberal decisions if you divide the sum of that column by the total observations. Your agg() can get both the sum and the count. After you groupby you'll need to reset the index, rename the columns, then make the percentage. Once you do that you can push to a SQL table called 'rollup'.

Let's get started:

```python
# Make a groupby called cases_rollup. This should group by party_name and category_name. I...
...

# reset your index
...

# rename your columns now. Keep the first two the same but call the last two 'total_cases' an...
...
```

Now make a new column called 'percent_liberal'. This should calculate the percentage of decisions that were liberal in nature. Multiply it by 100 so that it's a full percent. Also use the round() function on the whole thing to keep it in whole percentages.

```python
# make your metric called 'percent_liberal'
...
```

Now go push the whole thing to a table called 'rollup'. There should be five columns and nine rows.

```python
# Check
sql_head('rollup')
```
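### If you get stuck: some hedged sketches

To close out, here are sketches of how a few of the steps above might look. First, the 'cases' upload. The `connect()` helper below is a stand-in for your own connection function from the last workbook, and the column names and types follow the assumed `cases_df` from earlier; none of this is the one required approach.

```python
import numpy as np
import psycopg2

# Placeholder credentials -- swap in your own connection function.
def connect():
    return psycopg2.connect(host='your-db.us-west-1.rds.amazonaws.com',
                            dbname='your_db',
                            user='your_user',
                            password='your_password')

conn = connect()
cur = conn.cursor()

# Make the 'cases' table with matching column names and data types.
cur.execute("""CREATE TABLE cases (
                   case_id INT PRIMARY KEY,
                   judge_id INT,
                   casetype_id INT,
                   category_id INT,
                   category_name TEXT,
                   libcon_id INT);""")

# Make tuples of the data. The .item() call turns numpy scalars into
# plain Python types, which psycopg2 knows how to adapt.
data_tuples = [tuple(v.item() if isinstance(v, np.generic) else v for v in row)
               for row in cases_df.to_numpy()]

# A SQL string that inserts one tuple into the correct columns.
insert_sql = """INSERT INTO cases (case_id, judge_id, casetype_id,
                                   category_id, category_name, libcon_id)
                VALUES (%s, %s, %s, %s, %s, %s);"""

# Execute the string many times to fill out 'cases', then commit.
for row in data_tuples:
    cur.execute(insert_sql, row)
conn.commit()
```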
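Next, the judges table. The assignment only pins down `judge_id`, five columns total, and 553 rows; the four demographic columns below are guesses at what the CSV contains, so adjust them to the real names.

```python
# Columns other than judge_id are assumptions -- check cases.head().
judges_df = cases[['judge_id', 'judge_name', 'party_name',
                   'gender', 'birth_year']].drop_duplicates()

# Should be (553, 5)
judges_df.shape
```

Push it to 'judges' the same way you pushed 'cases'.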
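The casetype table is pinned down more tightly: the test query above references both `casetype_id` and `casetype_name`, so those two columns, deduplicated, are all it needs.

```python
# Two columns linking each casetype_id back to its name.
casetype_df = cases[['casetype_id', 'casetype_name']].drop_duplicates()

# Should be (27, 2)
casetype_df.shape
```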
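Finally, the rollup. This follows the recipe in the instructions directly; the one invented name is `liberal_cases` for the summed column, since the original comment is cut off.

```python
# Group by party and category; count gives total cases, sum gives the
# number of liberal (libcon_id == 1) decisions.
cases_rollup = (cases
                .groupby(['party_name', 'category_name'])['libcon_id']
                .agg(['count', 'sum']))

# Reset the index so the grouping levels become ordinary columns.
cases_rollup = cases_rollup.reset_index()

# Rename: keep the first two names, relabel the two aggregates.
# ('liberal_cases' is an assumed name.)
cases_rollup.columns = ['party_name', 'category_name',
                        'total_cases', 'liberal_cases']

# percent_liberal: share of liberal decisions as a whole-number percent.
cases_rollup['percent_liberal'] = round(
    cases_rollup['liberal_cases'] / cases_rollup['total_cases'] * 100)

# Should be (9, 5) -- then push to a 'rollup' table like the others.
cases_rollup.shape
```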