Homework 3: Election Prediction

Files to turn in: election.py, answers.txt.

In this assignment, you will practice use of data structures such as lists, dictionaries, and sets.

Contents:
• Introduction and background
  o Election prediction: pundits vs. statisticians
  o Election polls: Which ones to trust?
  o The US Electoral College
  o Why polls are approximate
• Assignment Overview
  o Problem 0: Obtain the files, add your name
  o Problem 1: State edges
  o Testing your implementation
  o Problem 2: Find the most recent poll row
  o Problem 3: Pollster predictions
  o Problem 4: Pollster errors
  o Problem 5: Pivot a nested dictionary
  o Problem 6: Average the edges in a single state
  o Problem 7: Predict the 2012 election
  o Reflection and submitting your work
• Appendix: Data type reference

Introduction and background

Election prediction: pundits vs. statisticians

In the past, the outcome of political campaigns was predicted by political analysts and pundits, using a combination of their experience, intuition, and personal biases and preferences. In recent decades there has been a shift to a more scientific approach, in which election results are predicted statistically using a poll. A small random sample of voters is asked how they will vote, and from that the result of the entire election is extrapolated.

The 2012 presidential election was a watershed in the fight between pundits and statisticians. The rivalry became front-page news, with many pundits loudly proclaiming that the statisticians would be humiliated on November 6. In fact, the opposite happened: statistician Nate Silver (of the website FiveThirtyEight) correctly predicted the outcome in every state, whereas pundits' predictions varied significantly. Literally dozens of prominent political analysts had predicted a Romney win. Other pundits said the election was “too close to call”, though Silver and other statisticians had been predicting an Obama win for months.
These results changed the way many Americans view political commentators, revealing them as entertainers but not as reliable sources of information.

How did Nate Silver do it? In this assignment, you will find out, and you will replicate his results by using polling data to predict the outcome of the 2012 US presidential election.

Election polls: Which ones to trust?

An election poll is a survey that asks a small sample of voters how they plan to vote. If the sample of voters is representative of the voting population at large, then the poll predicts the result of the entire election.

In practice, a poll's prediction must be taken with a grain of salt, because the sample is only approximately representative of the voting population. (See below for an explanation of why.) For example, in late October 2012, the Gallup poll consistently gave Romney a 6-percentage-point lead in the popular vote, but in fact Obama won the popular vote by 2.6 percentage points. On the other hand, RAND Corporation was biased toward the Democrats and tended to overstate Obama's lead by 1.5 percentage points.

How can you decide which polls to rely upon? Depending on which poll you trust, you might make a very different prediction.

One approach is to average together the different polls. This is better than trusting any one of them, but it is still rather crude. What if most of them are biased? That was the case for the 23 organizations that conducted at least 5 polls in the last 21 days of the 2012 Presidential campaign: 19 of the 23 organizations gave a result that favored Republicans more than the actual results did.

Nonetheless, Nate Silver's prediction was very close to correct, and showed no bias toward either party. Silver's approach is very sophisticated, but its key idea is to combine different polls using a weighted average. In a normal average, each data point contributes equally to the result. In a weighted average, some data points contribute more than others.
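To make the difference between a normal average and a weighted average concrete, here is a minimal Python sketch. The function names, the poll leads, and the weights are hypothetical and chosen only for illustration; they are not taken from election.py or from any real poll.

```python
def plain_average(values):
    """Normal average: every value contributes equally."""
    return sum(values) / len(values)


def weighted_average(values, weights):
    """Weighted average: each value contributes in proportion to its weight."""
    if len(values) != len(weights):
        raise ValueError("values and weights must have the same length")
    total_weight = sum(weights)
    if total_weight == 0:
        raise ValueError("weights must not sum to zero")
    return sum(v * w for v, w in zip(values, weights)) / total_weight


# Hypothetical leads (in percentage points) reported by three polls.
leads = [4.0, -2.0, 1.0]
# Hypothetical weights: the first poll is trusted three times as much.
weights = [3.0, 1.0, 1.0]

print(plain_average(leads))              # 1.0
print(weighted_average(leads, weights))  # 2.2
```

With equal weights the two functions agree; the more weight a poll receives, the further the result moves toward that poll's reported lead.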
Silver examined how well each polling organization had predicted previous elections, and then weighted their polls according to their accuracy: more biased pollsters had less effect on the weighted average.

The general structure of FiveThirtyEight's algorithm (http://www.fivethirtyeight.com/) is:

1. Calculate the average error of each pollster's predictions for previous elections. This is known as the pollster's rank. A smaller rank indicates a more accurate pollster.
2. Transform each rank into a weight (for use in a weighted average). A larger weight indicates a more accurate pollster. FiveThirtyEight considers a number of factors when computing a weight, including rank, sample size, and when a poll was conducted. For this assignment, we simply set weight to equal the inverse square of rank (weight = rank**(-2)).
3. In each state, perform a weighted average of predictions made by pollsters. This predicts the winner in that state.
4. Calculate the outcome of the Electoral College, using the per-state predictions. The candidate with the most electoral votes wins the election.

The algorithm is described in more detail at the FiveThirtyEight blog. You do not have to read or understand this information to complete this assignment, but you may find it interesting nonetheless.

The US Electoral College

We have given you an implementation of the electoral_college_outcome function, so this section is for your information but you do not need it while writing code for your assignment.

Here is information about US Presidential elections and the US Electoral College, paraphrased from Wikipedia:

The President of the United States is not elected directly by the voters. Instead, the President is elected indirectly by “electors” who are selected by popular vote on a state-by-state basis. Each state has as many electors as members of Congress. There are 538 electors, based on Congress having 435 representatives and 100 senators, plus three electors from the District of Columbia.
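The elector arithmetic above can be checked with a short Python sketch. This is an illustration only: the names are hypothetical, it is unrelated to the provided electoral_college_outcome function, and the House seat counts for the three sample states (2012 apportionment) are an assumption brought in from outside this handout.

```python
# Every state has two senators, so electors = House seats + 2.
SENATORS_PER_STATE = 2

# House seats for three sample states (2012 apportionment); these figures are
# not part of the assignment's data files and appear only as an example.
HOUSE_SEATS = {"CA": 53, "TX": 36, "WY": 1}


def electors_for_state(house_seats):
    """Return a state's number of electors given its number of House seats."""
    return house_seats + SENATORS_PER_STATE


electors = {state: electors_for_state(seats)
            for state, seats in HOUSE_SEATS.items()}

# National total from the figures in the text:
# 435 representatives + 100 senators + 3 electors for the District of Columbia.
TOTAL_ELECTORS = 435 + 100 + 3

print(electors)        # {'CA': 55, 'TX': 38, 'WY': 3}
print(TOTAL_ELECTORS)  # 538
```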
Electors are selected on a “winner-take-all” basis. That is, all electoral votes go to the presidential candidate who wins the most votes in the state. (Actually, Maine and Nebraska use a slightly different method, but for simplicity in this assignment, we will assume they use the “winner-take-all” approach.)

Our analysis only considers the Democratic and Republican political parties. This is a reasonable simplification, since a third-party candidate has received an electoral vote only once in the past 60 years (in 1968, George Wallace won 8% of the electoral vote).

Links for the material above:
Weighted average: http://en.wikipedia.org/wiki/Weighted_mean
FiveThirtyEight methodology: http://fivethirtyeight.blogs.nytimes.com/methodology/
US Electoral College: http://en.wikipedia.org/wiki/Electoral_College_(United_States)

Why polls are approximate

This section of the handout explains why poll results are only approximate, and how poll aggregation helps.

Recall that a poll sample is only approximately representative of the voting population. There are two reasons for this: sampling error and pollster bias.

1. Sampling error: If you randomly choose a sample from a population, then random chance may cause the sample to differ from the population. The US population is 50.7% female and 49.3% male, but a random sample of 1000 individuals might include 514 females and 486 males or 496 females and 504 males. An extrapolation from the sample to the entire population would be slightly incorrect. The larger the sample, the more likely it is to be representative of the population. Sampling error is unavoidable, but it can be reduced by increasing the sample size. This is one reason that poll aggregation can be successful: it effectively uses a larger sample than any one individual poll.
2. Pollster bias or “house effects”: These are systematic inaccuracies caused by faulty methodology; essentially, the pollster has not chosen a random sample of US voters.
Suppose that a pollster sampled only Mormons or only African-Americans; it would be meaningless to predict the overall vote from these biased samples. Actual pollster bias comes in subtler forms, and can be a positive or a negative factor. Here are some examples:
  o Not all Americans vote, so each polling firm should adjust its sampling to select not among all Americans, but among likely voters. Poor people and young people are less likely to vote, so a polling firm might adjust its statistics to account for that, but the firm might over- or undercompensate.
  o Survey response rates are typically lowest in urban areas, so unweighted samples routinely under-represent black and Hispanic Americans, who frequently live in urban areas.
  o Some telephone polls call only landline numbers, but 1/3 of Americans rely on cellphones, and they are younger, more urban, poorer, and more likely to be black and Hispanic, all of which correlate with Democratic voting.
  o Question wording and order have a significant effect on responses.

Pollster bias is avoidable by improving methodology. Alternatively, if you can determine a pollster's bias, you can adjust their scores accordingly and use the adjusted scores rather than what the pollster reports. That is what Nate Silver and other “poll aggregators” did, even without knowing the specific sources of bias.

Assignment Overview

In this assignment, you will write a Python program that predicts the outcome of the 2012 US Presidential election, based on polling data and results from the 2012 and 2008 elections.

The Professor designed the overall program, including deciding the names and specifications of all the functions and implementing some of the functions. Your job is to implement the rest of the functions. You will verify your implementation using the testing code that we provide. Along the way, you will learn about Python collections.

Don't panic! This assignment might look long, but we have already done most of the work for you.
You only have to implement 10 functions, and the Professor has already written the documentation and tests for those functions, so you know exactly what to do and you know whether your solution is correct. The implementation of those 10 function bodies consists of only 63 lines of code in total, and 8 of the 10 functions have a body consisting of 6 or fewer lines of code. Your solution might be smaller or larger, and that is fine; we mention the size only to give you a feel for the approximate amount of code you have to write. While solving this assignment, you should expect to spend more time thinking than programming.

Hint: Before you implement any function, try describing the algorithm and hand-simulating it on some sample data.

Problem