work needs to be clean and tidy and understandable.
{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "## Question 1\n", "\n", "\n", "This question is inspired from Exercise 3 in Chapter 5 in the textbook. \n", "\n", "On Canvas, you will see a CSV file named \"THA_diamonds.csv\". This file is a small subset of a real dataset on diamond prices in a [Kaggle competition](https://www.kaggle.com/shivam2503/diamonds). You will use this dataset for this question and the next question. \n", "\n", "**Some Background Information:** In our version of the dataset, the `price` feature has been discretized as `low`, `medium`, and `high`, and `premium`. If you are interested, these levels correspond to the following price ranges in the actual diamonds dataset:\n", "- `low` price: price between \\\\$1000 and \\\\$2000\n", "- `medium` price: price between \\\\$2000 and \\\\$3000\n", "- `high` price: price between \\\\$3000 and \\\\$3500\n", "- `premium` price: price between \\\\$3500 and \\\\$4000\n", "\n", "**Question Overview:** For this question, you will use the (unweighted) KNN algorithm for predicting the `carat` (numerical) target feature for the following single observation using the **Euclidean distance** metric with different number of neighbors:\n", "- `cut` = good\n", "- `color` = D\n", "- `depth` = 60\n", "- `price` = premium\n", "- (`carat` = 0.71 but you will pretend that you do not have this information)\n", "\n", "In practice, you would use cross-validation or train-test split for determining optimal values of KNN hyperparameters. **However, as far as this assessment is concerned, you are to use entire data for training.**\n", "\n", "\n", "### Part A (15 points)\n", "Prepare your dataset for KNN modeling. Specifically, \n", "1. Perform one-hot encoding of the categorical descriptive features in the input dataset.\n", "2. Scale your descriptive features to be between 0 and 1.\n", "3. Display the **last** 10 rows after one-hot encoding and scaling.\n", "\n", "
**IMPORTANT NOTE: If your data preparation steps are incorrect, you will not get full credit even if your follow-through is correct.**" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**NOTE:** For Parts (B), (C), and (D) below, you are **not** allowed to use the `KNeighborsRegressor()` in the Scikit-Learn module; instead, you must use manual calculations (via either Python or Excel). That is, you will need to show and explain all your solution steps **without** using Scikit-Learn. The reason for this restriction is so that you get to learn how some things work behind the scenes. \n", "\n", "### Part B (5 points)\n", "What is the prediction of the 1-KNN algorithm (i.e., k=1 in KNN) for the `carat` target feature using your manual calculations (using the Euclidean distance metric) for the single observation given above?\n", "\n", "### Part C (5 points)\n", "What is the prediction of the 5-KNN algorithm?\n", "\n", "### Part D (5 points)\n", "What is the prediction of the 10-KNN algorithm?\n", "\n", "\n", "### Part E (15 points)\n", "\n", "This part (E) is an exception to the solution mode instructions for this question. In particular, you will need to use the `KNeighborsRegressor()` in Scikit-Learn to perform the same predictions as in Parts (B) to (D). That is, \n", "- What is the prediction of the 1-KNN algorithm using `KNeighborsRegressor()`?\n", "- What is the prediction of the 5-KNN algorithm using `KNeighborsRegressor()`?\n", "- What is the prediction of the 10-KNN algorithm using `KNeighborsRegressor()`?\n", "\n", "Are you able to get the same results as in your manual calculations? Please explain.\n", "\n", "\n", "### Part F: Wrap-up (5 points)\n", "\n", "
**IMPORTANT NOTE: This Wrap-up section is mandatory. That is, for Parts (B) to (E) (inclusive), you will not get any points for solutions not presented in the table format explained below.**\n", "\n", "Add and display two tables called **\"df_summary_manual\"** and **\"df_summary_sklearn\"** respectively:\n", "- For the table **\"df_summary_manual\"**, you will report your results for Parts (B) to (D) using your manual calculations.\n", "- For the table **\"df_summary_sklearn\"**, you will report your results for the 3 predictions in Part (E) using `KNeighborsRegressor()`.\n", "\n", "\n", "Each of these tables needs to have the following 3 columns:\n", "- method\n", "- prediction for the observation given (rounded to 3 decimal places)\n", "- is_best (True or False - only the best prediction's is_best flag needs to be True and all the others need to be False)\n", "\n", "Each table needs to have 3 rows (one for each method) summarizing your results. These tables should look like the one below:\n", "\n", "|method | prediction | is_best |\n", "|---|---|---|\n", "|1-KNN | ? | ? |\n", "|5-KNN | ? | ? |\n", "|10-KNN | ? | ? |\n", "\n", "If you use a Pandas data frame, you can populate it row by row by referring to Cell #6 in our [Pandas tutorial](https://www.featureranking.com/tutorials/python-tutorials/pandas/).\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Question 2" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This question is inspired by Exercise 3 in Chapter 4 of the textbook. \n", "\n", "You will use the same CSV file as in Question 1, named \"THA_diamonds.csv\". You will build a simple decision tree with **depth 1** using this dataset for predicting the `price` (categorical) target feature using the **Entropy** split criterion. \n", "\n", "To clarify, for Question 1, your target feature will be `carat`, whereas for this Question 2, your target feature will be `price`."
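Since this question uses the **Entropy** split criterion, the following minimal sketch shows how entropy (impurity) of a categorical target is computed. It is illustrative only, not part of the required solution, and the `labels` values below are made up rather than taken from the diamonds data:

```python
import numpy as np
import pandas as pd

def entropy(labels):
    """Shannon entropy (base 2) of a sequence of categorical labels."""
    probs = pd.Series(labels).value_counts(normalize=True)
    return float(-(probs * np.log2(probs)).sum())

# Made-up example: two classes with equal frequency -> 1 bit of impurity
print(entropy(["low", "low", "high", "high"]))  # 1.0
```

Applying the same function to the `price` column of the actual dataset gives the impurity asked for in Part B.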
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Part A (10 points)\n", "\n", "The dataset for this question has 2 numerical descriptive features, `carat` and `depth`. \n", "1. Discretize these 2 features separately as \"category_1\", \"category_2\", and \"category_3\" respectively using the *equal-frequency binning* technique. \n", "2. Display the first 10 rows after discretization of these two features.\n", "\n", "After this discretization, all features in your dataset will be categorical (which we will assume to be **\"nominal categorical\"**). \n", "\n", "For this question, please do **NOT** perform any one-hot-encoding of the categorical descriptive features nor any scaling. Also, please do **NOT** perform any train-test splits.\n", "\n", "
**IMPORTANT NOTE: If your discretizations are incorrect, you will not get full credit even if your follow-through is correct.**" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Part B (5 points)\n", "\n", "Compute the impurity of the `price` target feature." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Part C (20 points)\n", "\n", "
**IMPORTANT NOTE: For Parts C and D below, you will not get any points for solutions not presented in the required table format.**\n", "\n", "In this part, you will determine the root node for your decision tree.\n", "\n", "Your answer to this part needs to be a table called **\"df_splits\"** with the following 4 columns:\n", "- split\n", "- remainder\n", "- info_gain\n", "- is_optimal (True or False - only the optimal split's is_optimal flag needs to be True and the others need to be False)\n", "\n", "In your **\"df_splits\"** table, you should have **one row for each descriptive feature in the dataset**. As an example for your **\"df_splits\"** table, consider the `spam prediction` example in Table 4.2 in the textbook (**FIRST** Edition) on page 121, which was also covered in lectorials. The `df_splits` table would look something like the table below.\n", "\n", "|split| remainder | info_gain| is_optimal |\n", "|---|---|---|---|\n", "|suspicious words | ? | ? | True |\n", "|unknown sender | ? | ? | False |\n",
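As a hedged illustration of how the `remainder` and `info_gain` columns relate to entropy (the tiny data frame below is invented for demonstration; it is not the textbook's spam data and not part of the required solution):

```python
import numpy as np
import pandas as pd

def entropy(s):
    """Shannon entropy (base 2) of a pandas Series of labels."""
    p = s.value_counts(normalize=True)
    return float(-(p * np.log2(p)).sum())

def remainder_and_gain(df, split_col, target_col):
    """Weighted average entropy after splitting on split_col, and the info gain."""
    total_entropy = entropy(df[target_col])
    rem = sum(
        (len(part) / len(df)) * entropy(part[target_col])
        for _, part in df.groupby(split_col)
    )
    return rem, total_entropy - rem

# Toy data, invented for illustration: a perfectly separating split
toy = pd.DataFrame({
    "suspicious_words": ["yes", "yes", "no", "no"],
    "spam":             ["spam", "spam", "ham", "ham"],
})
rem, gain = remainder_and_gain(toy, "suspicious_words", "spam")
print(round(rem, 3), round(gain, 3))  # 0.0 1.0
```

Looping `remainder_and_gain` over each descriptive feature produces one row per feature, which can then be assembled into the required `df_splits` table.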