CP5805 Assignment 2

Main task
DataFrame manipulation and visualisation

Task
Design and implement a data analysis program in Python using pandas as detailed in the instructions below. 85% of your mark will be based on the correctness and quality of the basic program, and 15% on the functionality in the challenge section. You will need to use the skills covered across weeks one to five for this main task. Some portions may require further investigation of the pandas docs.

Important note about libraries
For this assessment, you are free to use any standard Python libraries, as well as the libraries we have covered in the subject content. In fact, you must use pandas appropriately to fulfil the requirements of this assessment. You may, if it allows you to write more efficient or effective code, use additional libraries, provided these libraries are included in the standard Anaconda installation. You may not use any libraries that need to be installed separately (e.g., via pip).

Detailed instructions
Your program will allow users to load a DataFrame from a CSV file, clean the data in various ways, display statistics, and create visualisations.

When the program runs, the user will see an introductory message (you are welcome to word this as you see fit, but make sure to include your name). For example:

    Welcome to The DataFrame Statistician!
    Programmed by Ada Lovelace

After the welcome message, the user will be presented with the following menu:

    Please choose from the following options:
    1 - Load data from a file
    2 - View data
    3 - Clean data
    4 - Analyse data
    5 - Visualise data
    6 - Save data to a file
    7 - Quit

Option 7 will exit the program. Every other option will perform its task and then display the menu again, until the user chooses 7 from this menu. If the user enters anything other than a value between 1 and 7, display an appropriate error message (e.g., Invalid selection!), then ask the user to enter another choice.
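The menu behaviour described above can be sketched roughly as follows. This is only one possible structure, not a required design: the names get_choice, run_menu and handlers, and the read/write parameters (which default to input and print so the loop can be exercised without a keyboard), are illustrative assumptions rather than part of the assignment.

```python
MENU = """Please choose from the following options:
1 - Load data from a file
2 - View data
3 - Clean data
4 - Analyse data
5 - Visualise data
6 - Save data to a file
7 - Quit"""


def get_choice(low=1, high=7, read=input, write=print):
    """Keep asking until the user enters a whole number in [low, high]."""
    while True:
        raw = read(">>> ").strip()
        try:
            choice = int(raw)
        except ValueError:
            choice = None  # not a whole number at all
        if choice is not None and low <= choice <= high:
            return choice
        write("Invalid selection!")


def run_menu(handlers, read=input, write=print):
    """Show the menu repeatedly; option 7 quits, others call a handler.

    handlers is a hypothetical dict mapping options 1-6 to
    zero-argument functions that carry out each task.
    """
    while True:
        write(MENU)
        choice = get_choice(1, 7, read, write)
        if choice == 7:
            break
        handlers[choice]()
```

In a real program, run_menu would be called once after the welcome message, with handlers pointing at your own functions for options 1 to 6.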
Menu option 1 - Load data from a file
When the user chooses option 1, they will be asked for a filename to load, which is expected to be in the same directory as the program (no need for path information). Your program should use the exact filename as stated. Do not append .csv or any other extension. Although the contents of the file are expected to be CSV, a CSV file could be stored under any extension, or with no extension.

Your program should be able to handle any file in a format like the following:

    day,min_temp,max_temp,rainfall,humidity
    1,11,23,3,55
    1,11,23,3,55
    2,13,25,0,60
    3,9,19,17,80
    4,9,18,36,85
    5,,,,50
    6,12,22,,60
    7,13,23,0,65

So, the first row should contain the names of the columns, and the following rows should contain the data. Your program should not be hard-coded to deal with the example weather format above. It should work with any CSV file where all the column values are numeric and that can be loaded as a DataFrame. Your program should work for any number of rows or columns.

There are two problems your program may encounter here:
• the file does not exist or cannot be opened
• pandas cannot interpret the data as a DataFrame

In both of these cases, your program should display an appropriate error message (e.g., "File not found", "Unable to load data"), then return control to the main menu.

Your program only needs to handle one DataFrame in the system at a time. If a DataFrame was previously loaded, it should be replaced.

After the file loads successfully, the program should display the names of the columns and ask the user whether they want to set one of the columns as the index. Valid input in this case is either one of the column names or the blank string (the user just presses `Enter`). If the input is not valid, loop until the user enters a valid column name or a blank. The program should then set the DataFrame's index to the selected column, or skip this step if the user entered the blank string.
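As a rough illustration of the loading step, the two failure cases and the index prompt might be handled along these lines. The function names load_dataframe and choose_index, the prompt wording, and the read/write parameters are assumptions made for this sketch. Which exceptions pandas raises for a given bad file can vary, so treat the except clauses as a starting point rather than a complete list.

```python
import pandas as pd


def load_dataframe(filename, write=print):
    """Return a DataFrame read from filename, or None if loading fails."""
    try:
        # Use the filename exactly as given; no extension is appended.
        return pd.read_csv(filename)
    except OSError:
        # Covers a missing file, a directory, or a permissions problem.
        write("File not found")
    except ValueError:
        # pandas' ParserError and EmptyDataError, and UnicodeDecodeError,
        # are all subclasses of ValueError.
        write("Unable to load data")
    return None


def choose_index(df, read=input, write=print):
    """Ask for an index column; blank input leaves the index unchanged."""
    write("Columns: " + ", ".join(str(name) for name in df.columns))
    while True:
        name = read("Index column (blank for none): ")
        if name == "":
            return df
        if name in df.columns:
            return df.set_index(name)
        write("Invalid column name!")
```

Because load_dataframe returns None on failure, the calling code can simply go back to the main menu when it sees None, and store the new DataFrame otherwise.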
Menu option 2 - View data
This option simply prints the DataFrame to the screen. In the following example, day was set as the index when the DataFrame was loaded.

         min_temp  max_temp  rainfall  humidity
    day
    1        11.0      23.0       3.0        55
    1        11.0      23.0       3.0        55
    2        13.0      25.0       0.0        60
    3         9.0      19.0      17.0        80
    3         9.0      19.0      17.0        80
    4         9.0      18.0      36.0        85
    5         NaN       NaN       NaN        50
    6        12.0      22.0       NaN        60
    7        13.0      23.0       0.0        65

Menu option 3 - Clean data
This option will enter a submenu offering various cleaning operations.

    Cleaning data:
    1 - Drop rows with missing values
    2 - Fill missing values
    3 - Drop duplicate rows
    4 - Drop column
    5 - Rename column
    6 - Finish cleaning

Cleaning option 1 - Drop rows with missing values
This option will ask the user for a threshold value, which must be a non-negative integer. Drop all rows that have fewer non-null entries than the threshold value. E.g., if the user enters 3, drop all rows that have fewer than 3 non-null values.

Cleaning option 2 - Fill missing values
This option will ask the user to enter a value to fill in all the missing cells of the DataFrame. Accept any number for this value, and display an error message if the user enters a non-number.

Cleaning option 3 - Drop duplicate rows
This option will remove any (fully) duplicate rows from the DataFrame.

Cleaning option 4 - Drop column
Present the user with the list of columns in the data and ask them to enter a name. If the entered column name exists in the DataFrame, drop this column from the DataFrame. If the entered column name does not exist, ask again.

Cleaning option 5 - Rename column
The user will choose a column to rename, then enter a new name. Make sure the new name is not the name of an existing column, and that it is not blank.

Cleaning option 6 - Finish cleaning
Return to the main menu.

Menu option 4 - Analyse data
For each of the columns in the DataFrame, produce a report like the one below. Make sure to use pandas functions as appropriate.

    humidity
    --------
    number of values (n): 7
    minimum: 50.00
    maximum: 85.00
    mean: 65.00
    median: 60.00
    standard deviation: 12.91
    std. err. of mean: 4.88

Display each statistic to two decimal places (except for the number of values, which is always a whole number).

After displaying the statistics reports, finish by displaying a table of correlations like the one below (hint: you don't have to write your own code to compute correlations; search the pandas docs).

              min_temp  max_temp  rainfall  humidity
    min_temp  1.000000  0.916131 -0.795016 -0.845247
    max_temp  0.916131  1.000000 -0.882108 -0.920701
    rainfall -0.795016 -0.882108  1.000000  0.882754
    humidity -0.845247 -0.920701  0.882754  1.000000

Menu option 5 - Visualise data
In this case, ask the user:
• Whether they want a bar graph, line graph, or boxplot (repeat until they give a valid selection)
• Whether they want to use subplots
• For a title (skip if they leave it blank)
• For an x-axis label (skip if they leave it blank)
• For a y-axis label (skip if they leave it blank)
Then display the plot.

Menu option 6 - Save data to a file
Ask the user for a filename, including the file extension (e.g., data.csv). Use the exact filename given, including the extension. If the user wants to save with no extension or a non-standard one, let them do so. If the