We will use a Kaggle dataset that lists socio-economic and health factors for different countries. The dataset can be downloaded at this link:



Download the file, open it and check what the different columns represent.


Lab Instructions



KMEANS


Write a Python script that clusters the countries using KMeans. Your code should:



  1. Load the data file into a data frame.

  2. Separate the first column (which holds the country name) from the rest of the columns.

  3. Run KMeans on the other columns. Set K = 2.

  4. Extract the resulting cluster IDs as a list and append them side by side to the country names. The output of this step should be a dataframe with two columns that looks something like this:

    Country        ClusterID
    Afghanistan    0
    Albania        1
    ...            0


  5. Sort this dataframe by ClusterID and save it as a CSV file.

  6. Open this CSV file and see which countries are grouped together in the same cluster.
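
Steps 1 to 5 can be sketched as follows. This is a minimal sketch, not a reference solution: the small in-memory frame and its column names (country, income, life_expec) are made-up stand-ins for the downloaded file, the output file name is hypothetical, and random_state=0 and n_init=10 are added assumptions so that repeated runs give the same labels.

```python
import tempfile
from pathlib import Path

import pandas as pd
from sklearn.cluster import KMeans

# Step 1: made-up stand-in data (assumption). With the real file this would
# be a pd.read_csv call on the path of the downloaded csv file.
df = pd.DataFrame({
    "country": ["A", "B", "C", "D", "E", "F"],
    "income": [1200, 1500, 1100, 41000, 45000, 39000],
    "life_expec": [56.0, 58.5, 55.1, 80.2, 81.0, 79.4],
})

# Step 2: first column holds the country names, the rest are the features.
countries = df.iloc[:, 0]
features = df.iloc[:, 1:]

# Step 3: run KMeans with K = 2.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
cluster_ids = kmeans.fit_predict(features).tolist()

# Step 4: two-column dataframe of country names and cluster IDs.
result = pd.DataFrame({"Country": countries, "ClusterID": cluster_ids})

# Step 5: sort by cluster ID and save as csv. The temp directory and the
# file name are placeholders; any writable path works.
result = result.sort_values("ClusterID").reset_index(drop=True)
out_path = Path(tempfile.gettempdir()) / "kmeans_k2.csv"
result.to_csv(out_path, index=False)
print(result)
```

The cluster numbers themselves carry no meaning: which group is called 0 and which is called 1 can change between runs unless random_state is fixed.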


Do you notice anything interesting about how the different countries are grouped into clusters?


Repeat the clustering process for K = 3, 4 and 6.


For each value of K, show the resulting country clusters, sorted by cluster ID.
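
One way to repeat the run for several values of K is to wrap the steps in a helper and loop over it. The sketch below assumes a hypothetical helper name (cluster_countries) and made-up stand-in data with eight rows; random_state=0 and n_init=10 are again assumptions added for repeatability.

```python
import pandas as pd
from sklearn.cluster import KMeans


def cluster_countries(df, k):
    """Return a Country/ClusterID table for KMeans with k clusters, sorted by ClusterID."""
    countries, features = df.iloc[:, 0], df.iloc[:, 1:]
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(features)
    table = pd.DataFrame({"Country": countries, "ClusterID": labels})
    return table.sort_values("ClusterID").reset_index(drop=True)


# Made-up stand-in data (assumption); the real frame comes from the downloaded file.
df = pd.DataFrame({
    "country": list("ABCDEFGH"),
    "income": [1000, 1300, 1600, 12000, 12500, 40000, 40500, 90000],
    "life_expec": [55.0, 56.0, 57.0, 70.0, 71.0, 79.0, 80.0, 82.0],
})

# One sorted table per value of K.
results = {k: cluster_countries(df, k) for k in (3, 4, 6)}
for k, table in results.items():
    print(f"K = {k}")
    print(table.to_string(index=False))
```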



Agglomerative



  • Repeat the activity above using Agglomerative Clustering. You still need to set the number of clusters through the n_clusters parameter.
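
A minimal sketch of the same activity with scikit-learn's AgglomerativeClustering is shown here. The data is a made-up stand-in for the downloaded file, and the linkage is left at the library default (Ward), which is a choice of this sketch and not something the lab prescribes.

```python
import pandas as pd
from sklearn.cluster import AgglomerativeClustering

# Made-up stand-in data (assumption); the real frame comes from the downloaded file.
df = pd.DataFrame({
    "country": list("ABCDEFGH"),
    "income": [1000, 1300, 1600, 12000, 12500, 40000, 40500, 90000],
    "life_expec": [55.0, 56.0, 57.0, 70.0, 71.0, 79.0, 80.0, 82.0],
})
countries, features = df.iloc[:, 0], df.iloc[:, 1:]

agglo_results = {}
for k in (2, 3, 4, 6):
    # n_clusters fixes where the merge tree is cut, so exactly k groups come out.
    labels = AgglomerativeClustering(n_clusters=k).fit_predict(features)
    table = pd.DataFrame({"Country": countries, "ClusterID": labels})
    agglo_results[k] = table.sort_values("ClusterID").reset_index(drop=True)

for k, table in agglo_results.items():
    print(f"n_clusters = {k}")
    print(table.to_string(index=False))
```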




DBSCAN



  • Run the DBSCAN clustering method with epsilon = 800 and min_samples = 3.

  • Check the output file. A cluster ID of -1 marks a noise point, that is, a point that is not assigned to any cluster.

  • What is the number of obtained clusters?

  • How many data points are considered as noise?

  • How do the obtained clustering results compare with the output generated by KMeans and Agglomerative Clustering?

  • Re-run DBSCAN for epsilon values [700, 800, 900] and min_samples values [2, 3]. You can do that by writing two nested for loops of this form:


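from sklearn.cluster import DBSCAN
for eps in range(700, 1000, 100):
    for min_samples in range(2, 4, 1):
        # run DBSCAN for eps and min_samples; features (assumed name) holds the non-country columns
        labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(features)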


  • Are the results obtained for these values similar to each other or different?
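
The DBSCAN questions above (number of clusters, number of noise points, effect of the parameters) can be answered in one pass by collecting a small summary table. This is a sketch under assumptions: the data is a made-up stand-in for the downloaded file, and the summary column names (n_clusters, n_noise) are illustrative.

```python
import pandas as pd
from sklearn.cluster import DBSCAN

# Made-up stand-in data (assumption); the real features come from the downloaded file.
df = pd.DataFrame({
    "country": list("ABCDEFGH"),
    "income": [1000, 1300, 1600, 40000, 40500, 41000, 41850, 90000],
    "life_expec": [55.0, 56.0, 57.0, 79.0, 80.0, 80.5, 81.0, 82.0],
})
features = df.iloc[:, 1:]

rows = []
for eps in range(700, 1000, 100):       # 700, 800, 900
    for min_samples in range(2, 4):     # 2, 3
        labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(features)
        rows.append({
            "eps": eps,
            "min_samples": min_samples,
            # The label -1 marks noise, so it is not counted as a cluster.
            "n_clusters": len(set(labels.tolist()) - {-1}),
            "n_noise": int((labels == -1).sum()),
        })

summary = pd.DataFrame(rows)
print(summary.to_string(index=False))
```

Because eps is a distance measured in the raw feature units, the columns with the largest values dominate the result. That is worth keeping in mind when comparing DBSCAN with the other two methods.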


Resources



scikit-learn Documentation Pages



Apr 12, 2022