Assignment 4
For this assignment you will use Spyder and Python to perform all of the analyses (do not use Excel). There are two data files, which you will use to perform the analyses and answer the questions given below. When you are done, upload this completed worksheet as a Word document (copy and paste your code, visualizations, and results where indicated). Also save and upload the .py files that contain your Python code (create a separate .py file for each case), along with the modified data file from Case 1. Comment your Python code as appropriate to explain what you are doing at each step (or group of steps).
Scoring for this assignment will be based on correctly constructing the needed Python code and providing the correct results, along with the quality of your analysis. Be sure to follow the instructions exactly and provide all information requested to receive full credit.
Refer to the Python examples that we covered in class for guidance. This assignment will require use of the Pandas and Matplotlib libraries. Some of the code can be copied and used as a starting point for this assignment, but be careful to make changes where needed. There may be some questions that require a command or parameter that we did not use in class. In this case, you can search the Web for examples or go to specific Web sites (e.g. matplotlib.org). You will also need to find external information for use in 1D.
Case 1: Passengers who sailed on the Titanic.
The CSV file “Titanic1912” contains data about each passenger aboard the RMS Titanic when it sank in 1912. Each record includes whether or not the passenger survived (1 = yes); the pclass (ticket class) in which they were traveling (1st, 2nd, 3rd); their name, sex, and age; and the fare they paid (in US dollars).
1A. List the names of all passengers over the age of 58 who survived. What percentage of the total number of passengers over the age of 58 is this? Paste your Python code and a copy of the results below.
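As a starting point, filtering on two conditions in Pandas might look like the sketch below. It uses a few made-up rows rather than the real file, and the column names ("name", "age", "survived") are assumptions that may not match the headers in “Titanic1912”.

```python
import pandas as pd

# Made-up rows for illustration only; column names are assumptions
# and may differ from the headers in the real data file.
df = pd.DataFrame({
    "name": ["A. Example", "B. Sample", "C. Placeholder", "D. Demo"],
    "age": [62, 45, 71, 60],
    "survived": [1, 1, 0, 0],
})

# Build one boolean mask per condition
over_58 = df["age"] > 58
survived = df["survived"] == 1

# Combine the masks with & to keep rows where both conditions hold
older_survivors = df[over_58 & survived]
print(older_survivors["name"].tolist())

# Percentage of the over-58 group that survived
pct = 100 * len(older_survivors) / over_58.sum()
print(f"{pct:.1f}% of passengers over 58 survived")
```

Note that a comparison such as `df["age"] > 58` evaluates to False for rows with a missing age, so those rows drop out of both counts.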
1B. Create two histograms of age on the same chart. One histogram should be for those passengers that survived, and one for those who perished. The first should have blue bars, and the second yellow bars. Both should use 80 bins, have black edges, and have alpha set to 0.5. Add an appropriate title, axis labels, and a legend. Paste your Python code and a copy of the histogram below. Analyze the two distributions and comment on any similarities and differences.
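The general pattern for overlaying two histograms on one chart can be sketched as follows. The data here are randomly generated, and the colors, bin count, and labels are placeholders rather than the values this question asks for.

```python
import numpy as np
import matplotlib.pyplot as plt

# Randomly generated stand-in data (not the Titanic ages)
rng = np.random.default_rng(0)
group_a = rng.normal(30, 10, 200)
group_b = rng.normal(35, 12, 300)

fig, ax = plt.subplots()

# Calling hist twice on the same Axes draws both histograms on one chart.
# A shared range gives both groups identical bin edges so the bars line up;
# alpha < 1 makes the overlapping bars see-through.
counts_a, edges_a, _ = ax.hist(group_a, bins=20, range=(0, 80), color="green",
                               edgecolor="black", alpha=0.5, label="Group A")
counts_b, edges_b, _ = ax.hist(group_b, bins=20, range=(0, 80), color="purple",
                               edgecolor="black", alpha=0.5, label="Group B")

ax.set_title("Two overlaid histograms (placeholder data)")
ax.set_xlabel("Value")
ax.set_ylabel("Count")
ax.legend()  # uses the label= text passed to each hist call

# When running as a plain script, call plt.show() here to display the figure.
```

If a column contains missing values, it is usually safest to drop them (for example with `dropna()`) before passing the data to `hist`.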
1C. What is the average fare paid by all passengers? Of only those who survived? Of only those who did not? Paste your Python code and a copy of the results below. Comment on any perceived correlation of fare with survival.
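One way to compare an overall mean with per-group means is `groupby`, sketched here on a handful of invented values. The column names "survived" and "fare" are assumptions about the file's headers.

```python
import pandas as pd

# Invented values for illustration; column names are assumptions
df = pd.DataFrame({
    "survived": [1, 0, 1, 0, 0],
    "fare": [80.0, 10.0, 40.0, 20.0, 30.0],
})

# Mean over every row
overall_mean = df["fare"].mean()

# Mean within each value of "survived" (0 and 1)
mean_by_outcome = df.groupby("survived")["fare"].mean()

print(f"All rows: {overall_mean:.2f}")
print(mean_by_outcome)
```

The same per-group figures could also be obtained by filtering with a boolean mask and calling `mean()` on each subset; `groupby` simply does both groups in one step.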
1D. Adjust each fare for inflation. That is, convert the values from 1912 $US to 2021 $US. Save the modified data (all fields) to a new file called “mod.csv” and upload it with your assignment. Provide the source of information you used to find the value needed to adjust the fares for inflation below.
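A common way to adjust for inflation is to multiply each amount by the ratio of a price index in the target year to the same index in the base year. The sketch below shows the mechanics only: the two index values are placeholders, not real figures, and the column names and output file name are hypothetical.

```python
import os
import tempfile

import pandas as pd

# Invented rows; column names are assumptions
df = pd.DataFrame({
    "name": ["A. Example", "B. Sample"],
    "fare": [10.0, 25.5],
})

# PLACEHOLDER index values: look up real figures from a cited source
INDEX_BASE_YEAR = 10.0
INDEX_TARGET_YEAR = 25.0
factor = INDEX_TARGET_YEAR / INDEX_BASE_YEAR

# Scale the whole column at once and round to cents
df["fare"] = (df["fare"] * factor).round(2)

# Write every column to a new CSV; index=False omits the row-number column.
# A temporary folder and hypothetical file name are used here.
out_path = os.path.join(tempfile.gettempdir(), "example_adjusted.csv")
df.to_csv(out_path, index=False)
print(df)
```

Overwriting the column in place keeps the file layout unchanged; writing to a new column instead (for example a hypothetical "fare_adjusted") would preserve the original values alongside the adjusted ones.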
Case 2: Is a set of data values normally distributed?
The CSV file “WeightofMales” contains the weight (in pounds) of 5000 randomly selected adult males in the United States. Analyze how well this data set meets the conditions of normality.
2A. Create a histogram of the data using 60 bins. The bars should be red with blue edges. Add an appropriate title and axis labels. Paste your Python code and a copy of the histogram below. Describe how normal the distribution looks (based on the histogram).
2B. Generate a set of descriptive statistics for the data. Paste your Python code and a copy of the output below. What do these statistics tell you about how well the data meets the definition of normally distributed data?
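Descriptive statistics in Pandas can be produced along the lines of the sketch below, which uses randomly generated values in place of the weight data. For normally distributed data the mean and median should be close, and skewness and excess kurtosis should both be near zero.

```python
import numpy as np
import pandas as pd

# Randomly generated stand-in data (not the weight file)
rng = np.random.default_rng(1)
values = pd.Series(rng.normal(loc=100, scale=15, size=5000))

# count, mean, std, min, quartiles, and max in one call
summary = values.describe()
print(summary)

# Shape statistics that describe() does not report
median = values.median()
skewness = values.skew()
excess_kurtosis = values.kurt()  # pandas reports excess kurtosis (normal = 0)

print(f"median: {median:.2f}")
print(f"skewness: {skewness:.3f}")
print(f"excess kurtosis: {excess_kurtosis:.3f}")
```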
2C. Test the distribution of the data set against the empirical rule by calculating the percentage of data points that are within 1, 2, and 3 standard deviations of the mean. Paste your Python code and a copy of the results below. How well does the data seem to meet the empirical rule?
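The empirical rule says that for normal data roughly 68%, 95%, and 99.7% of values fall within 1, 2, and 3 standard deviations of the mean. A minimal sketch of that check, run on randomly generated values rather than the weight data, could look like this:

```python
import numpy as np
import pandas as pd

# Randomly generated stand-in data (not the weight file)
rng = np.random.default_rng(42)
values = pd.Series(rng.normal(loc=100, scale=15, size=10_000))

mean = values.mean()
std = values.std()  # pandas uses the sample standard deviation by default

pct_within = {}
for k in (1, 2, 3):
    # True where the value lies no more than k standard deviations from the mean
    within = (values - mean).abs() <= k * std
    # The mean of a boolean Series is the fraction of True values
    pct_within[k] = 100 * within.mean()
    print(f"Within {k} SD: {pct_within[k]:.2f}%")
```

Comparing each printed percentage with 68, 95, and 99.7 gives a quick numeric read on how closely the data follow the rule.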
2D. Potential outliers in a data set can be defined as data points that are more than 3 standard deviations away from the mean. Calculate how many data points are potential outliers, then create a list of these potential outliers (the data points themselves). Paste your Python code and a copy of the list below.
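The 3-standard-deviation definition above translates directly into a boolean mask. The sketch below applies it to a small invented series with one obviously extreme value; it is an illustration of the technique, not the weight data.

```python
import pandas as pd

# Invented data: twenty identical values plus one extreme value
values = pd.Series([10.0] * 20 + [100.0])

mean = values.mean()
std = values.std()

# Keep only values more than 3 standard deviations from the mean
is_outlier = (values - mean).abs() > 3 * std
outliers = values[is_outlier]

print(f"Number of potential outliers: {len(outliers)}")
print(outliers.tolist())
```

Because the mean and standard deviation are computed with the extreme values included, a very large outlier inflates the standard deviation and can hide milder ones; that is a known limitation of this definition.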