Homework – Pandas
In this homework, you will demonstrate proficiency in data analysis with Pandas, in reading and using Python library documentation, and in combining data visualization with analysis of Series and DataFrames. You may import only the following libraries for this homework: pandas, matplotlib, seaborn, and random. You may not import any other libraries.
Video Game Analysis
The data file for this assignment contains historical video game sales. Each record includes rank (in sales), title, platform (system), year of release, genre, publisher, and sales amounts (in millions) for four regions: North America, Europe, Japan, and Other Countries.
Keep in mind that as data sets get larger, the efficiency of your code becomes even more important.
You have been hired as a data analyst to work on the data provided. To do this, you will create a Jupyter Notebook that performs the following:
1. Load the data from the file into a data frame with the rank denoted as the index.
a. Print the total number of video games.
b. Print a data frame containing the top 15 video games with the highest rank, including all data known about them.
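One possible way to approach step 1 is sketched below. The file name and every column name (Rank, Name, NA_Sales, and so on) are assumptions made for illustration, and the rows are made-up placeholders; substitute the names used in the actual data file.

```python
import pandas as pd

# Made-up placeholder rows written to a small file so the sketch runs on its
# own. The file name and column names are assumptions, not the real data.
sample_path = "sample_games.csv"
with open(sample_path, "w") as f:
    f.write(
        "Rank,Name,Platform,Year,Genre,Publisher,"
        "NA_Sales,EU_Sales,JP_Sales,Other_Sales\n"
        "2,Game B,NES,1990,Puzzle,Pub Y,4.0,2.0,1.0,0.5\n"
        "1,Game A,Wii,2006,Sports,Pub X,10.0,5.0,2.0,1.0\n"
        "3,Game C,Wii,2008,Puzzle,Pub X,1.5,0.5,0.25,0.25\n"
    )

# index_col turns the Rank column into the index instead of a data column.
games = pd.read_csv(sample_path, index_col="Rank")

# One row per game, so the row count is the number of games.
total_games = len(games)
print(f"Total number of video games: {total_games}")

# Sort by the index so rank 1 comes first, then keep the first 15 rows.
top_15 = games.sort_index().head(15)
print(top_15)
```

In a notebook, ending a cell with `top_15` instead of `print(top_15)` renders the data frame as a formatted table.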
2. The total for global sales is missing, but you realize it would provide helpful insights. Global sales is the sum of sales across all four regions (North America, Europe, Japan, and Other Countries). Create and populate a new column in your data frame for global sales. Display a revised, formatted table of the top 15 video games with the highest rank, including global sales and all other data known about them.
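A minimal sketch of the global sales column, assuming the four regional columns are named NA_Sales, EU_Sales, JP_Sales, and Other_Sales (hypothetical names) and using made-up placeholder rows. The row-wise sum is vectorized, which matters as the data set grows.

```python
import pandas as pd

# Made-up placeholder rows; the column names are assumptions.
games = pd.DataFrame(
    {
        "Name": ["Game A", "Game B", "Game C"],
        "NA_Sales": [10.0, 4.0, 1.5],
        "EU_Sales": [5.0, 2.0, 0.5],
        "JP_Sales": [2.0, 1.0, 0.25],
        "Other_Sales": [1.0, 0.5, 0.25],
    },
    index=pd.Index([1, 2, 3], name="Rank"),
)

region_cols = ["NA_Sales", "EU_Sales", "JP_Sales", "Other_Sales"]

# Sum across the regional columns for each row (axis=1), with no Python loop.
games["Global_Sales"] = games[region_cols].sum(axis=1)

# Show the 15 best-ranked games with two decimal places for the sales figures.
print(games.sort_index().head(15).to_string(float_format="{:.2f}".format))
```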
3. Display descriptive statistics that include:
a. Total number of video games.
b. Global Sales: min, max, mean, median, and standard deviation.
c. Percentage of video games that earned at least $5,000,000 in global sales starting in the year 2000 or later.
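The statistics in step 3 could be computed along these lines. This is a sketch on made-up rows with assumed column names (Year, Global_Sales). Because sales are recorded in millions, $5,000,000 corresponds to the value 5.0. The percentage here uses all games as the denominator; dividing only by games released in 2000 or later is another reasonable reading of the task.

```python
import pandas as pd

# Made-up placeholder rows; the column names are assumptions.
games = pd.DataFrame(
    {
        "Year": [1995, 2001, 2005, 2010],
        "Global_Sales": [8.0, 6.0, 2.0, 4.0],
    }
)

total_games = len(games)

# agg with a list of names returns a Series indexed by statistic name.
stats = games["Global_Sales"].agg(["min", "max", "mean", "median", "std"])

# Boolean mask: released in 2000 or later AND at least 5.0 (million) in sales.
# Rows with a missing year compare as False and are therefore excluded.
hit_mask = (games["Year"] >= 2000) & (games["Global_Sales"] >= 5.0)

# The mean of a boolean Series is the fraction of True values.
pct_hits = hit_mask.mean() * 100

print(f"Total number of video games: {total_games}")
print(stats.round(2))
print(f"Games from 2000 onward with at least $5M in global sales: {pct_hits:.1f}%")
```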
4. Write a Python function with four parameters: a data frame, a category, a value, and a number n that specifies how many rows to display in the result. The category and value should have default values of None, and n should have a default value of 10. If both category and value are provided, the function should return a data frame containing the top n video games by global sales among the rows where the category equals the value. If either category or value is not provided, the function should return the top n video games overall.
5. Based on the previously created function, create well-formatted tables for each of the following:
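One way such a function might look is sketched below. The function name top_sellers, the column names, and the sample rows are all hypothetical. The filter is applied only when both category and value are supplied; otherwise the overall ranking is returned.

```python
import pandas as pd


def top_sellers(df, category=None, value=None, n=10):
    """Return the top n rows by global sales, optionally filtered."""
    # Filter only when BOTH category and value are supplied.
    if category is not None and value is not None:
        df = df[df[category] == value]
    # nlargest sorts by the given column in descending order and keeps n rows.
    return df.nlargest(n, "Global_Sales")


# Made-up placeholder rows; the column names are assumptions.
games = pd.DataFrame(
    {
        "Name": ["Game A", "Game B", "Game C", "Game D"],
        "Platform": ["Wii", "NES", "Wii", "NES"],
        "Genre": ["Sports", "Puzzle", "Puzzle", "Platform"],
        "Year": [2006, 1990, 2008, 1990],
        "Global_Sales": [18.0, 7.5, 2.5, 9.0],
    }
)

overall_top_2 = top_sellers(games, n=2)
wii_top = top_sellers(games, "Platform", "Wii")
puzzle_top = top_sellers(games, "Genre", "Puzzle", 5)
year_1990_top = top_sellers(games, "Year", 1990)

# Only the category is given here, so no filter is applied.
no_filter = top_sellers(games, category="Platform")

print(overall_top_2)
print(wii_top)
```

Using nlargest avoids sorting the whole data frame when only a few rows are needed.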
a. The top five overall sellers
b. The top ten Wii sellers
c. The top five selling puzzle games
d. The top ten selling games in the year 1990
6. (10 points) Create a linear correlation matrix between sales in each market (e.g. NA, EU, etc.) and global sales. List two observations after analyzing the data. Then, provide two inferences based on your observations.
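A linear (Pearson) correlation matrix can be produced with DataFrame.corr, as in the sketch below. The column names and rows are made-up assumptions; the observations and inferences still have to come from the real data.

```python
import pandas as pd

# Made-up placeholder rows; the column names are assumptions.
games = pd.DataFrame(
    {
        "NA_Sales": [10.0, 4.0, 1.5, 3.0],
        "EU_Sales": [5.0, 2.0, 0.5, 4.0],
        "JP_Sales": [2.0, 1.0, 0.25, 0.5],
        "Other_Sales": [1.0, 0.5, 0.25, 0.75],
    }
)
region_cols = ["NA_Sales", "EU_Sales", "JP_Sales", "Other_Sales"]
games["Global_Sales"] = games[region_cols].sum(axis=1)

sales_cols = region_cols + ["Global_Sales"]

# corr() computes pairwise Pearson (linear) correlation by default.
corr = games[sales_cols].corr()
print(corr.round(2))
```

If a visual summary helps, seaborn's heatmap function accepts a matrix like this one directly.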
7. Create two more linear correlation matrices between sales in each market (e.g. NA, EU, etc.) and global sales: the first for the year 2000 and the second for the year 2015. List two observations about how the relationships between sales changed over the 15-year period, and make one inference for each observation.
8. You have done such a good job with the previous analyses that your boss has said to you: "Thank you for your analysis. The client is very happy, but wants to know what other insights we can gather from the data." You only have a short time between now and the client meeting. Load more data of your choice, create a visualization of your choice (must be something other than a bar chart), and make at least one inference from your analysis.
Format the notebook with an appropriate number of markdown cells so that each step of your analysis is fully explained.
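The per-year matrices could be built by filtering on the year before calling corr, as in this sketch. The helper name sales_corr_for_year, the column names, and the rows are hypothetical. Subtracting one matrix from the other is one simple way to see which relationships shifted.

```python
import pandas as pd

# Made-up placeholder rows; the column names are assumptions.
games = pd.DataFrame(
    {
        "Year": [2000, 2000, 2000, 2015, 2015, 2015],
        "NA_Sales": [3.0, 1.0, 2.0, 1.0, 2.0, 4.0],
        "EU_Sales": [1.0, 0.5, 1.5, 2.0, 1.0, 3.0],
        "JP_Sales": [2.0, 0.2, 0.4, 0.1, 0.3, 0.2],
        "Other_Sales": [0.5, 0.1, 0.3, 0.2, 0.4, 0.6],
    }
)
region_cols = ["NA_Sales", "EU_Sales", "JP_Sales", "Other_Sales"]
games["Global_Sales"] = games[region_cols].sum(axis=1)
sales_cols = region_cols + ["Global_Sales"]


def sales_corr_for_year(df, year):
    """Pearson correlation matrix of the sales columns for one release year."""
    return df.loc[df["Year"] == year, sales_cols].corr()


corr_2000 = sales_corr_for_year(games, 2000)
corr_2015 = sales_corr_for_year(games, 2015)

# Positive entries mean the correlation was stronger in 2015 than in 2000.
change = corr_2015 - corr_2000

print(corr_2000.round(2))
print(corr_2015.round(2))
print(change.round(2))
```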