Enhanced genetic Algorithm for centroid initialization in k-means algorithm



Chapter Three
MATERIALS AND METHODS

3.1 Data Source
We used three benchmark datasets in this study: Wine, Iris, and Pima Indians Diabetes. Each dataset is outlined in the following sections.

3.2 Data Preprocessing
Data preprocessing is a data mining technique that converts raw data into a usable format. Raw data is frequently incomplete, inconsistent, and prone to errors. Preprocessing resolves such issues by cleaning, transforming, reducing features, and standardizing the data. We import the necessary libraries and the dataset, look for missing values, and either replace or delete them. Then, if necessary, we convert categorical values to numeric values. The dataset is then split into training and test sets, and finally the data is standardized.

3.3 Genetic Algorithm
A GA employs a population of size N, with each member represented by a binary string of length l. Each member is called a chromosome, and the set of chromosomes is known as the population. Each chromosome has a fitness value associated with it. The chromosomes with the highest fitness values are chosen as the parents of the next generation. Crossover operators are applied to the parents to produce stronger offspring for the next population, which improves the algorithm's overall efficiency. The offspring may also be subjected to a mutation operation so that the search can escape from local optima. The fundamental GA algorithm is seen below.

In this analysis, we used two separate heuristics proposed by Mustafi et al. (2017) to boost the k-means algorithm's clustering efficiency:
1. For initialization, an improved GA-based fitness function that covers the whole solution space is used. This ensures that outliers are properly handled within the data space.
2. To ensure that no empty clusters are generated, a differential evolution-based heuristic is used, so that each run of the k-means algorithm always produces the required number of clusters.
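The generic GA loop described in Section 3.3 (fitness evaluation, parent selection, crossover, mutation) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the algorithm of Mustafi et al. (2017): the function name `run_ga`, the truncation-style parent selection, the per-bit mutation rate, and the toy fitness (number of ones in the string) are all hypothetical choices made only to keep the example self-contained.

```python
import numpy as np

def run_ga(fitness, n_bits, pop_size=50, n_generations=10,
           crossover_rate=0.8, mutation_rate=0.1, elite_count=2, seed=0):
    """Maximize `fitness` over binary strings of length `n_bits`."""
    rng = np.random.default_rng(seed)
    # Random initial population: pop_size chromosomes of n_bits each
    pop = rng.integers(0, 2, size=(pop_size, n_bits))
    for _ in range(n_generations):
        scores = np.array([fitness(ch) for ch in pop])
        pop = pop[np.argsort(scores)[::-1]]          # sort best first
        # Elitism: carry the best chromosomes over unchanged
        next_pop = [pop[i].copy() for i in range(elite_count)]
        # ASSUMPTION: the fitter half of the population acts as parents
        parents = pop[: pop_size // 2]
        while len(next_pop) < pop_size:
            a, b = parents[rng.integers(len(parents), size=2)]
            child = a.copy()
            if n_bits > 1 and rng.random() < crossover_rate:
                cut = rng.integers(1, n_bits)        # single-point crossover
                child[cut:] = b[cut:]
            flip = rng.random(n_bits) < mutation_rate  # bit-flip mutation
            child[flip] = 1 - child[flip]
            next_pop.append(child)
        pop = np.array(next_pop)
    scores = np.array([fitness(ch) for ch in pop])
    return pop[np.argmax(scores)], scores.max()

# Toy usage: maximize the number of ones in a 20-bit string
best, best_score = run_ga(fitness=lambda ch: ch.sum(), n_bits=20)
```

Because the elite chromosomes are copied unchanged, the best fitness in the population never decreases from one generation to the next. A floating-point GA for centroid search would replace the bit strings with real-valued arrays and the bit-flip mutation with a real-valued perturbation.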
For the GA-based initialization of centers, the following fitness function has been used:

[equation not preserved in the source]

where m is the grand mean of the entire data space; the other two symbols (not preserved in the source) denote the center of a cluster and the chosen distance metric.

The floating-point GA is used in this analysis to find the set of k initial centroids. In this case, the chromosome is an array of size k × (number of features), where the number of features is the dimensionality of the data space.

To ensure the formation of the required number of clusters while maintaining a good choice of centers, we employ the following heuristics.
1. If the desired number of clusters is 2 but only one has been generated, select any random point as the second cluster center and proceed.
2. If the desired number of clusters is 3 but only two have been generated, choose the third center as the vector sum of the two centers obtained, i.e., $c_3 = c_1 + c_2$.
3. If the desired number of clusters is greater than 3 and only three clusters have been generated, choose a new cluster center as [expression not preserved in the source], where the three operands are the three clusters with the smallest cluster densities, in decreasing order, and F is a mutation factor represented by a floating-point constant.

Parameter                 Existing algorithm    Proposed algorithm
Population size           50                    50
No. of generations        10                    10
Stall generation limit    10                    10
Crossover                 Single point          Two point
Crossover fraction        0.8                   0.8
Mutation                  Uniform               Adaptive
Mutation fraction         0.1                   0.1
Elite count               2                     2

Flow chart showing the existing and proposed algorithms:
Proposed: Start -> Enhanced GA-based initialization -> Perform k-means -> Required number of clusters? -> DE-based algorithm -> Stop
Existing: Start -> GA-based initialization -> Perform k-means -> Required number of clusters? -> DE-based algorithm -> Stop
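The chromosome layout and the empty-cluster heuristics described above can be sketched as follows. This is an illustration under explicit assumptions rather than the method of the cited paper: the names `decode` and `extra_center` are hypothetical, cluster size is used as a stand-in for "cluster density", and the rule for more than three clusters uses the standard differential-evolution mutation form c1 + F * (c2 - c3), which is an assumption because the original expression is not preserved in the text.

```python
import numpy as np

def decode(chromosome, k):
    """Reshape a flat chromosome of length k * n_features into k centers."""
    return np.asarray(chromosome, dtype=float).reshape(k, -1)

def extra_center(centers, sizes, k, X, F=0.5, rng=None):
    """Propose one additional center when too few clusters were formed.

    centers: centers of the non-empty clusters found so far
    sizes:   number of points in each of those clusters (density proxy)
    k:       desired number of clusters
    X:       data matrix, one row per point
    """
    rng = np.random.default_rng() if rng is None else rng
    centers = np.asarray(centers, dtype=float)
    X = np.asarray(X, dtype=float)
    n_found = len(centers)
    if k == 2 and n_found == 1:
        # Any random data point becomes the second center
        return X[rng.integers(len(X))]
    if k == 3 and n_found == 2:
        # Third center is the vector sum of the two existing centers
        return centers[0] + centers[1]
    if k > 3 and n_found == 3:
        # ASSUMPTION: DE-style mutation on centers sorted by decreasing size
        order = np.argsort(sizes)[::-1]
        c1, c2, c3 = centers[order]
        return c1 + F * (c2 - c3)
    raise ValueError("case not covered by the heuristics in the text")

# Usage: a 6-element chromosome encodes k=3 centers in 2 dimensions
centers_demo = decode(np.arange(6), k=3)
```

In practice the returned center would be appended to the existing ones and k-means rerun, repeating until the required number of clusters is obtained.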
Answered 23 days after Apr 26, 2021

Sandeep Kumar answered on May 14 2021
iris.csv
sepal_length,sepal_width,petal_length,petal_width,classification
5.4,3.7,1.5,0.2,Iris-setosa
4.8,3.4,1.6,0.2,Iris-setosa
4.8,3,1.4,0.1,Iris-setosa
4.3,3,1.1,0.1,Iris-setosa
5.8,4,1.2,0.2,Iris-setosa
5.7,4.4,1.5,0.4,Iris-setosa
5.4,3.9,1.3,0.4,Iris-setosa
5.1,3.5,1.4,0.3,Iris-setosa
5.7,3.8,1.7,0.3,Iris-setosa
5.1,3.8,1.5,0.3,Iris-setosa
5.1,3.5,1.4,0.2,Iris-setosa
4.9,3,1.4,0.2,Iris-setosa
4.7,3.2,1.3,0.2,Iris-setosa
4.6,3.1,1.5,0.2,Iris-setosa
5,3.6,1.4,0.2,Iris-setosa
5.4,3.9,1.7,0.4,Iris-setosa
4.6,3.4,1.4,0.3,Iris-setosa
5,3.4,1.5,0.2,Iris-setosa
4.4,2.9,1.4,0.2,Iris-setosa
4.9,3.1,1.5,0.1,Iris-setosa
5.4,3.4,1.7,0.2,Iris-setosa
5.1,3.7,1.5,0.4,Iris-setosa
4.6,3.6,1,0.2,Iris-setosa
5.1,3.3,1.7,0.5,Iris-setosa
4.8,3.4,1.9,0.2,Iris-setosa
5,3,1.6,0.2,Iris-setosa
5,3.4,1.6,0.4,Iris-setosa
5.2,3.5,1.5,0.2,Iris-setosa
5.2,3.4,1.4,0.2,Iris-setosa
4.7,3.2,1.6,0.2,Iris-setosa
4.8,3.1,1.6,0.2,Iris-setosa
5.4,3.4,1.5,0.4,Iris-setosa
5.2,4.1,1.5,0.1,Iris-setosa
5.5,4.2,1.4,0.2,Iris-setosa
4.9,3.1,1.5,0.1,Iris-setosa
5,3.2,1.2,0.2,Iris-setosa
5.5,3.5,1.3,0.2,Iris-setosa
4.9,3.1,1.5,0.1,Iris-setosa
4.4,3,1.3,0.2,Iris-setosa
5.1,3.4,1.5,0.2,Iris-setosa
5,3.5,1.3,0.3,Iris-setosa
4.5,2.3,1.3,0.3,Iris-setosa
4.4,3.2,1.3,0.2,Iris-setosa
5,3.5,1.6,0.6,Iris-setosa
5.1,3.8,1.9,0.4,Iris-setosa
4.8,3,1.4,0.3,Iris-setosa
5.1,3.8,1.6,0.2,Iris-setosa
4.6,3.2,1.4,0.2,Iris-setosa
5.3,3.7,1.5,0.2,Iris-setosa
5,3.3,1.4,0.2,Iris-setosa
7,3.2,4.7,1.4,Iris-versicolor
6.4,3.2,4.5,1.5,Iris-versicolor
6.9,3.1,4.9,1.5,Iris-versicolor
5.5,2.3,4,1.3,Iris-versicolor
6.5,2.8,4.6,1.5,Iris-versicolor
5.7,2.8,4.5,1.3,Iris-versicolor
6.3,3.3,4.7,1.6,Iris-versicolor
4.9,2.4,3.3,1,Iris-versicolor
6.6,2.9,4.6,1.3,Iris-versicolor
5.2,2.7,3.9,1.4,Iris-versicolor
5,2,3.5,1,Iris-versicolor
5.9,3,4.2,1.5,Iris-versicolor
6,2.2,4,1,Iris-versicolor
6.1,2.9,4.7,1.4,Iris-versicolor
5.6,2.9,3.6,1.3,Iris-versicolor
6.7,3.1,4.4,1.4,Iris-versicolor
5.6,3,4.5,1.5,Iris-versicolor
5.8,2.7,4.1,1,Iris-versicolor
6.2,2.2,4.5,1.5,Iris-versicolor
5.6,2.5,3.9,1.1,Iris-versicolor
5.9,3.2,4.8,1.8,Iris-versicolor
6.1,2.8,4,1.3,Iris-versicolor
6.3,2.5,4.9,1.5,Iris-versicolor
6.1,2.8,4.7,1.2,Iris-versicolor
6.4,2.9,4.3,1.3,Iris-versicolor
6.6,3,4.4,1.4,Iris-versicolor
6.8,2.8,4.8,1.4,Iris-versicolor
6.7,3,5,1.7,Iris-versicolor
6,2.9,4.5,1.5,Iris-versicolor
5.7,2.6,3.5,1,Iris-versicolor
5.5,2.4,3.8,1.1,Iris-versicolor
5.5,2.4,3.7,1,Iris-versicolor
5.8,2.7,3.9,1.2,Iris-versicolor
6,2.7,5.1,1.6,Iris-versicolor
5.4,3,4.5,1.5,Iris-versicolor
6,3.4,4.5,1.6,Iris-versicolor
6.7,3.1,4.7,1.5,Iris-versicolor
6.3,2.3,4.4,1.3,Iris-versicolor
5.6,3,4.1,1.3,Iris-versicolor
5.5,2.5,4,1.3,Iris-versicolor
5.5,2.6,4.4,1.2,Iris-versicolor
6.1,3,4.6,1.4,Iris-versicolor
5.8,2.6,4,1.2,Iris-versicolor
5,2.3,3.3,1,Iris-versicolor
5.6,2.7,4.2,1.3,Iris-versicolor
5.7,3,4.2,1.2,Iris-versicolor
5.7,2.9,4.2,1.3,Iris-versicolor
6.2,2.9,4.3,1.3,Iris-versicolor
5.1,2.5,3,1.1,Iris-versicolor
5.7,2.8,4.1,1.3,Iris-versicolor
6.3,3.3,6,2.5,Iris-virginica
5.8,2.7,5.1,1.9,Iris-virginica
7.1,3,5.9,2.1,Iris-virginica
6.3,2.9,5.6,1.8,Iris-virginica
6.5,3,5.8,2.2,Iris-virginica
7.6,3,6.6,2.1,Iris-virginica
4.9,2.5,4.5,1.7,Iris-virginica
7.3,2.9,6.3,1.8,Iris-virginica
6.7,2.5,5.8,1.8,Iris-virginica
7.2,3.6,6.1,2.5,Iris-virginica
6.5,3.2,5.1,2,Iris-virginica
6.4,2.7,5.3,1.9,Iris-virginica
6.8,3,5.5,2.1,Iris-virginica
5.7,2.5,5,2,Iris-virginica
5.8,2.8,5.1,2.4,Iris-virginica
6.4,3.2,5.3,2.3,Iris-virginica
6.5,3,5.5,1.8,Iris-virginica
7.7,3.8,6.7,2.2,Iris-virginica
7.7,2.6,6.9,2.3,Iris-virginica
6,2.2,5,1.5,Iris-virginica
6.9,3.2,5.7,2.3,Iris-virginica
5.6,2.8,4.9,2,Iris-virginica
7.7,2.8,6.7,2,Iris-virginica
6.3,2.7,4.9,1.8,Iris-virginica
6.7,3.3,5.7,2.1,Iris-virginica
7.2,3.2,6,1.8,Iris-virginica
6.2,2.8,4.8,1.8,Iris-virginica
6.1,3,4.9,1.8,Iris-virginica
6.4,2.8,5.6,2.1,Iris-virginica
7.2,3,5.8,1.6,Iris-virginica
7.4,2.8,6.1,1.9,Iris-virginica
7.9,3.8,6.4,2,Iris-virginica
6.4,2.8,5.6,2.2,Iris-virginica
6.3,2.8,5.1,1.5,Iris-virginica
6.1,2.6,5.6,1.4,Iris-virginica
7.7,3,6.1,2.3,Iris-virginica
6.3,3.4,5.6,2.4,Iris-virginica
6.4,3.1,5.5,1.8,Iris-virginica
6,3,4.8,1.8,Iris-virginica
6.9,3.1,5.4,2.1,Iris-virginica
6.7,3.1,5.6,2.4,Iris-virginica
6.9,3.1,5.1,2.3,Iris-virginica
5.8,2.7,5.1,1.9,Iris-virginica
6.8,3.2,5.9,2.3,Iris-virginica
6.7,3.3,5.7,2.5,Iris-virginica
6.7,3,5.2,2.3,Iris-virginica
6.3,2.5,5,1.9,Iris-virginica
6.5,3,5.2,2,Iris-virginica
6.2,3.4,5.4,2.3,Iris-virginica
5.9,3,5.1,1.8,Iris-virginica
GA-K-Means.ipynb
{
"cells": [
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"import pandas as pd\n",
"from sklearn.metrics import silhouette_score\n",
"from sklearn.cluster import KMeans"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [],
"source": [
"data=pd.read_csv('iris.csv')\n",
"stddata=np.array(data.drop('classification',axis=1))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
" # Code initialization & defining all functions"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [],
"source": [
"cp=0.8\n",
"mp=0.5"
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {},
"outputs": [],
"source": [
"def isValid(sample, this):\n",
"    # A chromosome is valid when its set of cluster labels equals the expected set\n",
"    return set(sample) == this"
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {},
"outputs": [],
"source": [
"## Calculating cluster centers for a given population\n",
"\n",
"def SCenters(samp):\n",
"    # Map each cluster label in the chromosome to the mean of its points\n",
"    samp = np.asarray(samp)\n",
"    centers = {}\n",
"    for i in set(samp.tolist()):\n",
"        # axis=0 gives a per-feature centroid instead of a single scalar\n",
"        centers[i] = stddata[samp == i].mean(axis=0)\n",
"    return centers\n",
"\n",
"\n",
"def getCenters(Gen):\n",
"    # One dictionary of centers per chromosome in the generation\n",
"    return [SCenters(samp) for samp in Gen]"
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {},
"outputs": [],
"source": [
"# Calculating Fitness\n",
"from scipy.spatial import distance\n",
" ##Calculating the Euclidean distance for within clusters and summing them\n",
"def...