Attached in files
Programming with Python, Fall 2020 © Dr. Kamesam

Final Project

1. Conduct this analysis in a Jupyter notebook and submit your notebook with all the code and output. Your notebook filename must include your last name.
2. Displaying the answer is NOT enough to get full points. You must use markdown to write down your conclusions, understanding, and observations for each question. Your answers can be brief.
3. You may conduct the data analysis using the Pandas library, by writing plain Python code, or with a combination of both.
4. Your notebook should start with a markdown cell that includes the assignment number, the class, your name, and any other comments you want to add.
5. Make your work look professional by following all the good practices.

Problem 1. (70 pts) Your task is to analyze and draw insights and conclusions from the dataset births.csv provided to you. The data was collected and published by New York City and is limited to New York City. The dataset may not be complete, as we do not know whether all births are registered.

births.csv description: Baby Names by Sex and Ethnic Group. Data was collected through civil birth registration. Each row (record) shows the summary for a specific name given to newborn babies. The same name may be given to several babies in a year, across years, and across ethnic groups.

• Example:

birth_year  gender  ethnicity  first_name  frequency
2011        female  hispanic   geraldine   13

• The above record shows that the name "geraldine" was given to 13 newborn female babies of Hispanic ethnicity in 2011. Ethnicity is derived from the mother's ethnicity.
• There may be more records in the dataset showing that the same name was given to babies of other ethnicities in 2011, and possibly in other years as well.

Answer each question in a separate cell, along with your answer in markdown.

1.1 Read the data from the given file and create a data frame with an appropriate name.
• Conduct due diligence inspection, data cleaning, and data reformatting as needed.
• Are there any missing values or blank values?
• Summarize your findings in markdown.

1.2 In total, how many babies were reported born in the dataset?
1.3 Of the babies reported born in the previous question, how many are male and how many are female? Display an appropriate chart/graph as well.
1.4 What is the number of babies reported born in each year (ignoring gender and ethnicity)? Show the result as a table as well as a visual (chart).
1.5 How many babies born in the year 2013 were given the name emma?
1.6 Create a table as well as a chart to show the number of babies born of each ethnicity, ignoring year and gender.
1.7 What are the ten most popular names in the dataset (ignoring gender, year, and ethnicity)?
1.8 What are the ten most popular female baby names in the dataset (ignoring year and ethnicity)?
1.9 In total, how many distinct first names are there in the dataset? Each name must be counted once. For example, the name David may be given to many babies. It should count as one.
1.10 How many distinct male names and how many distinct female names are there? Each name should be counted only once. For example, the name David may be given to many babies. It should count as one.
1.11 Is the answer to 1.9 consistent with the answer to 1.10? If not, what is your (plausible) explanation?
1.12 Create a table (dataframe) to show the total number of babies of each ethnicity born in each year. Show subtotals by ethnicity as well as by year. There may be some cells in the table with no values. Study the data and results carefully to come up with an explanation. Write down your answer in a markdown cell.
1.13 Some sociologists have expressed concern that the number of female babies born in NYC is less than the number of male babies born, year after year. Create a table (report) which can be used to support or refute the above claim. Write down your conclusions in markdown.

Problem 2. (30 points) Analyze the data in house_prices.xlsx.
The purpose of the analysis: understand the price of houses in a community, and understand why the price varies. Use the Seaborn library wherever it makes sense.

address: address of the house
price: market price of the house (in thousands of dollars)
acres: property size in acres
size_sqft: square feet of living space
age: number of years since the house was built
num_rooms: number of rooms in the house
bath rooms: number of bathrooms in the house
garages: number of car garages in the house

2.1 Read and inspect the dataset before you do any analysis. Make any necessary changes.
2.2 Summarize house prices in the community. Add an appropriate chart.
2.3 How many houses are there with 1 bath, 2 baths, 3 baths, etc.? Show a visual as well.
2.4 Create a heatmap to explore the relation between price and the other attributes of the house. Write down your observations.
2.5 Using the Seaborn library, analyze the relationship between price and property size, price and living space, and price and age. Write down your observations.
2.6 Based on the analysis, which variable has the strongest relationship (influence) with price?
2.7 Explore the relationship between the number of bathrooms and living space (size_sqft). Write down your observations.
2.8 Include the following code in the last cell of your notebook, and make sure it is executed.
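
Note on the births.csv record format from Problem 1 (this is separate from the code referred to in 2.8, which is not reproduced here): because each row is a summary for one name, the number of rows is not the number of babies. The sketch below is a minimal illustration using only the Python standard library. It assumes the file is comma-delimited with the five column names shown in the example record. Only the geraldine/2011/hispanic row comes from the assignment text; the other rows, and the helper name summarize, are invented for illustration.

```python
import csv
import io
from collections import Counter

# Tiny in-memory stand-in for births.csv. Only the first data row appears in
# the assignment text; the remaining rows are made up for illustration.
SAMPLE_CSV = """birth_year,gender,ethnicity,first_name,frequency
2011,female,hispanic,geraldine,13
2011,female,asian,geraldine,10
2012,female,hispanic,geraldine,11
2012,male,hispanic,david,20
"""


def summarize(rows):
    """Return (total babies, babies per name, number of distinct names)."""
    babies_per_name = Counter()
    for row in rows:
        # Each record is a summary, so weight it by its frequency column.
        name = row["first_name"].strip().lower()
        babies_per_name[name] += int(row["frequency"])
    total_babies = sum(babies_per_name.values())
    return total_babies, babies_per_name, len(babies_per_name)


rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))
total, per_name, distinct = summarize(rows)

print("records:", len(rows))
print("babies:", total)
print("distinct names:", distinct)
```

On this sample there are 4 records but 54 babies and 2 distinct names, which is why counts of babies should sum the frequency column rather than count rows. The same idea carries over to Pandas if you prefer that route.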