Jupyter (Python 3) assignment. All files are attached, including the questions file, which is the corresponding .ipynb notebook; its contents are reproduced below.
{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Practical Exam (CSC 4780/6780) -- April 25th, 2020 (10:00 AM- 5:00 PM)\n", "## Submit your answers to iCollege. Do not email. Multiple submissions are allowed.\n", "\n", "_This is an open book exam. During the exam, you are allowed to use course materials (notes, slides, sample codes, homework assignments) or external resources (such as library documentation or example code pieces). However, you must not get help from any individual, including your peers in this class._\n", "\n", "\n", "By submitting your answers, you certify that the answers is your own work, based on your personal study and research, and that you have not copied in part or whole or otherwise plagiarised the work of other students and/or persons. You also certify that you have read and understood the class policies and consequences of academic dishonesty as explained in the class website (https://grid.cs.gsu.edu/~baydin2/courses/csc4780/index.html and references/links therein)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Dataset and Background\n", "Food spectrographs are used in chemometrics to classify food types, a task that has obvious applications in food safety and quality assurance. You are given a time series dataset obtained after spectral analysis of fresh fruit purees [(Holland et al. 1999)](https://doi.org/10.1002/(SICI)1097-0010(199802)76:2%3C263::AID-JSFA943%3E3.0.CO;2-F). The classes are strawberry (authentic samples) and non-strawberry (adulterated strawberries and other fruits) [encoded as `S1` and `S2`]. The dataset contains 983 time series instances. Each timseries has a length of 235. Note here that time series are stored in the rows. The data can be read from `Strawberry_TS.csv` (therefore, place it in the same directory as this starter code). In this practical exam, you will explore this data and build predictive models. Note here that while the specifics and details of data is given for completeness, they are not relevant for the exam.\n", "\n", "The values of time series are represented in the cells corresponding to columns $\\{t0, t1, ... t234\\}$.\n", "The `index` is the identifier of the instances. The `Class` column shows the class of time series and is your target variable. \n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# run the below code to load time series dataset\n", "import pandas as pd\n", "import numpy as np\n", "import seaborn as sns\n", "\n", "ts_df = pd.read_csv('Strawberry_TS.csv', index_col=0)\n", "ts_df_c = ts_df.copy()\n", "\n", "# ts_df\n", "ts_df.info()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Question 1 (15 points)\n", "Your first task is to create an analytics base table from the given time series dataset.\n", "The dataset will contain simple statistical features of time series. Those features are the following: \n", "- mean (column: mean)\n", "- standard deviation (column: std)\n", "- minimum (column: min)\n", "- 1st quartile (column: Q1)\n", "- median (column: median)\n", "- 3rd quartile (column: Q3)\n", "- maximum (column: max)\n", "- interquartile range (column: IQR)\n", "\n", "You will also need to fetch the target variable to `class` column.\n", "\n", "Below, a pandas DataFrame object (`abt`) is created for you (dimensions: __983 rows × 9 columns__) with proper index and column names. Feel free to use it. 
\n", "In the end, you will have 983 instances with eight descriptive features and a target variable." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "abt = pd.DataFrame(index = ts_df.index, columns = ['mean', 'std', 'min', 'Q1', 'median', 'Q3', 'max', 'IQR', 'class'])\n", "\n", "# your code for extracting features goes here!\n", "\n", "abt" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**The next three questions use the analytical base table (ABT) created in Question 1. If you want to skip ahead to Questions 2-4, you can load a sample ABT instead (`sampleABT.csv`); a loading snippet is sketched below.**"
   ]
  },
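  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Example sketch: load the provided sample ABT instead of building your own.\n",
    "# This assumes sampleABT.csv stores the instance identifier in its first column\n",
    "# and uses the same column names as `abt` above -- verify with abt.head().\n",
    "# abt = pd.read_csv('sampleABT.csv', index_col=0)"
   ]
  },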
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Question 2 (30 points)\n", "\n", "Using the dataset (i.e., analytical base table) you created in **Question 1**, analyze and interpret the relationships between the descriptive features and target feature. In this question you are expected to create \n", "1. a bar plot demonstrating the distribution of the target variable (see `class` column in `abt`) (**5 points**)\n", "2. the correlation matrix of descriptive features (**5 points**)\n", "3. a scatter plot matrix for descriptive features and target feature (Hint: `sns.pairplot`, also use the `hue` for answering the question) (**5 points**)\n", "\n", "**Q2.1** After creating (1), answer the following questions:\n", "- Is this dataset balanced? What is the class imbalance ratio (i.e., the ratio between the number of instances in majority class and the number of instances in minority class?) (**3 points**)\n", "- If imbalanced, what can we do to balance the dataset? (**3 points**)\n", "\n", "**Q2.2** After creating (2), answer the following questions:\n", "- Among the pairs of descriptive features in correlation matrix, which two has the highest correlation? (**2 points**)\n", "- Among the pairs of descriptive features in correlation matrix, which two has the lowest (negative) correlation? (**2 points**)\n", "\n", "**Q2.3** After creating (3), answer the following questions:\n", "- Based on your scatter plot matrix, which features are less likely to be important for predicting the target variable and why? (**5 points**)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Feel free to use as many cells as needed. Do not leave empty cells." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "#Q2.1 " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "#Q2.2" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "#Q2.3" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Question 3 (20+5 points)\n", "\n", "Using the dataset (`abt`) you created in **Question 1**, create a logistic regression model. Your target feature is `class`, while your descriptive features are `['mean', 'std', 'min', 'Q1', 'median', 'Q3', 'max', 'IQR']`. (Hint: the model's class is imported in the preamble as \n", "\n", "`from sklearn.linear_model import LogisticRegression` \n", "\n", "and you can use the default model [without any input parameters]. Use 50% holdout sampling for evaluating this model. (**8 pts**)\n", "\n", "After you train your logistic regression model, test the performance of your model using your test set. \n", "When evaluating the model, you are expected to report the confusion matrix (**3 pts**), overall classification error [misclassification rate] (**3 pts**), the precision of the class `S1` (**3 pts**) and the recall of the class `S2` (**3 pts**). \n", "\n", "**Bonus Question** Output the classification accuracy and F1-score. Compare the similarities and differences between these two measures. Explain the reasons behind these similarities or differences. 
(**5 pts**)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from sklearn.model_selection import train_test_split\n", "from sklearn.linear_model import LogisticRegression # See: https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html\n", "from sklearn.metrics import confusion_matrix\n", "from sklearn.metrics import precision_score, recall_score\n", "\n", "# your answer goes here" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source":