It's an online test.


Question 1 (10 marks): Short-answer questions. Answer each of the following questions in a short paragraph.

1) Given the following datasets: (A) (B) (C)

I. If we want to apply a clustering technique to each dataset, would it be better to apply k-means or DBSCAN? Explain why. (4 marks)

Answer:
- (A) DBSCAN is better. (1 mark)
- (B) DBSCAN is better. (1 mark)
- (C) K-means or DBSCAN. (1 mark)
- Because DBSCAN doesn't assume a cluster shape, while k-means is suitable for spherical clusters. (1 mark)

II. In figure (A), you can observe some noise in the dataset. (3 marks)
● Which step(s) in the typical Data Science process will help to identify and fix this noise?
● Briefly explain each step.
● Clearly indicate the order of the step(s) as part of your answer.

Answer:
- Data preparation / data exploration (1 mark)
- Order: Data preparation / data exploration (1 mark)
- There should be a detailed explanation for each step (1 mark)

2) Suppose you have a data set that includes two categorical and three numerical columns. (If you don't know the name, you can sketch an example picture.) (3 marks)

i) Name two kinds of graphs that can be used to visualise categorical data.
Answer: Barplot, pie chart (0.5 marks each)

ii) Propose a simple analysis to explore the relationship between a categorical and a numerical column.
Answer: Boxplot by category (1 mark)

iii) Propose a simple analysis to explore the relationship between two numerical columns.
Answer: Scatterplot (1 mark)

Question 2 (4 marks): Consider the following iris dataset, which is used to train a classifier. The attributes are sepal_length, sepal_width, petal_length and petal_width. The class labels are in the 'target' column. The dataset contains 150 observations: the first 50 observations are of type 'Iris-setosa', the middle 50 observations are of type 'Iris-virginica', and the last 50 observations are of type 'Iris-versicolor'. It is required to train a classifier with 3-fold cross validation.
Please answer the following questions in plain English, and explain (you may draw diagrams to explain).

1. What are the necessary step(s) to preprocess the data? Explain why preprocessing is important. (2 marks)

Answer:
- Check for data-entry errors (outliers or typos), which result in a few observations having different logic; extra white spaces, which will affect string comparison; and impossible values, e.g. negative age. (0.5 mark, should mention at least two steps)
- Prepare the data by managing the order of the rows, as they are listed by class. If the rows are not randomized, the training/test data will not be reliable. (1 mark)
- Data preprocessing is important because it helps models perform better, keeping in mind that "garbage in equals garbage out". (0.5 mark)

2. Apply 3-fold cross validation to the dataset, explaining the process step by step. You may wish to include a diagram as part of your answer. (2 marks)

Answer:
- Split the data into 3 equal folds. Use 2 for training and 1 for testing, iteratively, so that each fold is used for testing once. (1 mark)
- K-fold cross validation is useful with small datasets. (1 mark)

Question 3 (3 marks): Suppose we have a sample of 30 loan applicants with two variables: Income range (Low / High) and Years of employment (1-5 / >5). 15 of these 30 were granted the loan. Now, we want to build a Decision Tree on this data. In the figure below, we split the population using the two input variables Income and Years of employment.

Which split produces more homogeneous sub-nodes using the Gini index (the equation is given at the end of the exam paper)? Explain why. (3 marks)

Answer:
- For split on Income: (1 mark)
Gini for sub-node Low income =
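The comparison asked for in Question 3 can be sketched as follows. Two things here are assumptions: the exam's own Gini equation is not reproduced in this text, so the sketch uses the common Gini impurity form, 1 minus the sum of squared class proportions, where a lower weighted value means more homogeneous sub-nodes (some course materials instead use the sum of squared proportions, where higher is better). The sub-node counts are also hypothetical, because the figure with the real counts is not available, so the output says nothing about the exam's actual answer.

```python
def gini_impurity(granted, rejected):
    """Gini impurity of one sub-node: 1 - (p_granted^2 + p_rejected^2).

    Assumed formula; the exam's own equation is not shown in the text.
    """
    total = granted + rejected
    if total == 0:
        return 0.0
    p_granted = granted / total
    p_rejected = rejected / total
    return 1.0 - (p_granted ** 2 + p_rejected ** 2)


def weighted_gini(sub_nodes):
    """Size-weighted average impurity over the sub-nodes of one split.

    sub_nodes is a list of (granted, rejected) count pairs.
    """
    total = sum(g + r for g, r in sub_nodes)
    return sum((g + r) / total * gini_impurity(g, r) for g, r in sub_nodes)


# Hypothetical (granted, rejected) counts, NOT taken from the exam figure.
# Each split covers 30 applicants, 15 of whom were granted the loan.
hypothetical_split_a = [(2, 8), (13, 7)]
hypothetical_split_b = [(8, 6), (7, 9)]

if __name__ == "__main__":
    gini_a = weighted_gini(hypothetical_split_a)
    gini_b = weighted_gini(hypothetical_split_b)
    print(f"Weighted Gini impurity, split A: {gini_a:.4f}")
    print(f"Weighted Gini impurity, split B: {gini_b:.4f}")
    better = "A" if gini_a < gini_b else "B"
    print(f"More homogeneous sub-nodes (lower impurity): split {better}")
```

To apply this to the actual question, replace the hypothetical pairs with the counts from the figure for the Income split and the Years of employment split, and use the exam's stated equation if it differs from the one assumed here.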
Jun 17, 2021