Please find the instructions for this assignment in the Word document. You need to complete all questions; for short-answer questions, please provide explanations in words.
For the following assignments, please provide as much evidence of the results as possible, including the code, screenshots (plots only, not text or code), and documentation. Submit only one PDF file, together with the .ipynb / .py files containing the documented code.

1.a. [20 points] Choose one of the cleaned datasets at https://www.kaggle.com/annavictoria/ml-friendly-public-datasets. Split it into training and test data. Write from scratch, and apply to this dataset, any ML algorithm that you learned in class. You can use Python to implement it. For the implementation, you may use any classes, modules, and functions in Python libraries such as NumPy for math / linear algebra operations, but you may not use ML classes or functions directly. Then apply another algorithm that you learned to the same dataset. For this one, you are free to implement it from scratch or to use the classes and functions from ML packages directly. Which of the two algorithms fares better? Use as many evaluation metrics as possible to discuss the performance of the algorithms. Write down your comments in your script.

1.b. [5 points] Derive an equation for accuracy in terms of specificity and sensitivity. The equation may include other quantities, such as the number of true positives or the number of false positives, in addition to accuracy, specificity, and sensitivity. Give an interpretation of the equation.

2.(a) [15 points] Assume we have only two features in our dataset. The transposes of the feature vectors are the first 10 consecutive pairs of primes: [2 3], [5 7], …, [67 71]. For k=2, show step by step (either manually or programmatically) the iterations of k-means clustering when the centroids are initialized to (i) [2 3] and [5 7] and (ii) [2 3] and [67 71]. Compare and comment on the results in both cases.
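For the programmatic route in 2.(a), the following is a minimal sketch of how the iterations could be traced with NumPy only. It assumes the ten points are the first 20 primes taken two at a time (which matches the stated endpoints [2 3] and [67 71]), Euclidean distance, and the standard assign-then-update loop; the function name `kmeans_trace` and its parameters are illustrative, not part of the assignment.

```python
import numpy as np

# Assumption: the feature vectors are the first 20 primes, paired consecutively.
primes = [2, 3, 5, 7, 11, 13, 17, 19, 23, 29,
          31, 37, 41, 43, 47, 53, 59, 61, 67, 71]
X = np.array(primes, dtype=float).reshape(-1, 2)  # shape (10, 2)


def kmeans_trace(X, init_centroids, max_iter=100):
    """Run k-means and return a list of (labels, centroids), one per iteration."""
    centroids = np.array(init_centroids, dtype=float)
    history = []
    for _ in range(max_iter):
        # Assignment step: squared Euclidean distance from every point
        # to every centroid, shape (n_points, k).
        d2 = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)

        # Update step: each centroid becomes the mean of its members.
        # An empty cluster keeps its previous centroid.
        new_centroids = centroids.copy()
        for j in range(len(centroids)):
            members = X[labels == j]
            if len(members) > 0:
                new_centroids[j] = members.mean(axis=0)

        history.append((labels.copy(), new_centroids.copy()))
        if np.allclose(new_centroids, centroids):
            break  # centroids stopped moving
        centroids = new_centroids
    return history


# Print every iteration for both initializations from the question.
for name, init in [("(i)", [[2, 3], [5, 7]]), ("(ii)", [[2, 3], [67, 71]])]:
    print("Initialization", name)
    for step, (labels, cents) in enumerate(kmeans_trace(X, init), start=1):
        print("  iteration", step, "labels", labels.tolist(),
              "centroids", np.round(cents, 3).tolist())
```

Printing the full history, rather than only the final state, is what makes the two runs comparable step by step.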
2.(b) [10 points] The K-means algorithm is applied to the wines dataset in the tutorial available at https://www.kaggle.com/xvivancos/clustering-wines-with-k-means. However, the K-means algorithm assumes that the mean is representative of the cluster. In the real world, though, it is most often the “most” vociferous, “most” influential, “most” wealthy, or some other “most” ____ person who gets elected to represent the people of a constituency. Explain, using visualization tools (as in the tutorial) and words, what difference it makes if the clustering algorithm is based on the “mode” instead of the “mean”, either on the above wines dataset or on any categorical dataset such as https://www.kaggle.com/sl6149/data-scientist-job-market-in-the-us. Explain, based on the cluster analysis, whether the mode, which is also a measure of central tendency, can represent a real-world cluster of similar data points. You can use the k-modes algorithm described here: https://www.kaggle.com/ashydv/bank-customer-clustering-k-modes-clustering. If you are using the wines dataset, you may have to convert it into a categorical dataset by applying binning for the results to make sense.
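The building blocks mentioned in 2.(b) can be sketched as follows: binning a numeric column into categories, taking the per-column mode as a cluster representative, and measuring dissimilarity by counting mismatched categories (the simple matching measure that k-modes uses). This is only an illustration under assumptions: the helper names are hypothetical, equal-width binning is one choice among several, and the values in the usage lines are made-up toy numbers, not taken from the wines dataset.

```python
import numpy as np


def bin_equal_width(values, n_bins=3):
    """Map a numeric column to integer bin labels 0..n_bins-1 (equal-width bins)."""
    values = np.asarray(values, dtype=float)
    edges = np.linspace(values.min(), values.max(), n_bins + 1)
    # Use only the interior edges so the maximum value lands in the last bin.
    return np.digitize(values, edges[1:-1])


def column_modes(rows):
    """Most frequent category in each column (ties go to the smallest label)."""
    rows = np.asarray(rows)
    modes = []
    for col in rows.T:
        cats, counts = np.unique(col, return_counts=True)
        modes.append(cats[counts.argmax()])
    return np.array(modes)


def matching_dissimilarity(a, b):
    """Number of positions where two categorical vectors differ."""
    return int(np.sum(np.asarray(a) != np.asarray(b)))


# Toy usage with made-up values (not real data).
binned = bin_equal_width([0, 1, 2, 3, 4, 5, 6, 7, 8, 9], n_bins=3)
cluster = [[0, 1], [0, 2], [1, 2]]      # three categorical rows in one cluster
representative = column_modes(cluster)  # mode-based "centroid"
print(binned.tolist(), representative.tolist(),
      matching_dissimilarity(representative, cluster[2]))
```

In a full k-modes loop, these pieces would take the place of the mean and the Euclidean distance used by k-means.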