- Research Questions (if research questions are not specifically mentioned, what is the theoretical background or overarching theme)
- Procedures:
- Data Analysis:
- Findings or Results
- Conclusions/Implications
- Student’s Reflections (changes to your understanding):
Cascading K-means Clustering and K-Nearest Neighbor Classifier for Categorization of Diabetic Patients

International Journal of Engineering and Advanced Technology (IJEAT), ISSN: 2249-8958, Volume-1, Issue-3, February 2012

Abstract— Medical data mining is the process of extracting hidden patterns from medical data. This paper presents the development of a hybrid model for classifying the Pima Indian Diabetes Database (PIDD). The model consists of three stages. In the first stage, K-means clustering is used to identify and eliminate incorrectly classified instances. In the second stage, a Genetic Algorithm (GA) and Correlation-based Feature Selection (CFS) are used in a cascaded fashion for relevant feature extraction, where the GA performs a global search of attributes with fitness evaluation effected by CFS. Finally, in the third stage, a fine-tuned classification is done using the K-nearest neighbor (KNN) classifier, taking as inputs the correctly clustered instances of the first stage and the feature subset identified in the second stage. Experimental results signify that cascading K-means clustering and KNN, along with the feature subset identified by GA_CFS, enhanced the classification accuracy of KNN. The proposed model obtained a classification accuracy of 96.68% for the diabetic dataset.

Index Terms— Genetic algorithm, Correlation-based feature selection, K-nearest neighbor, K-means clustering, Pima Indian Diabetes.

I. INTRODUCTION

The data mining functionalities are used to specify the kind of patterns to be found in the data-mining task. They mainly include association rule mining, classification, prediction and clustering. Association analysis discovers interesting relations between variables in large databases, which are given to the user in the form of rules. Classification predicts class labels. Prediction is used to assess the value of an attribute that a given sample is likely to have.
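The first stage of the abstract's cascade, using K-means to identify and drop incorrectly clustered instances, can be sketched roughly as follows. This is an illustrative reconstruction, not the authors' code: the majority-vote filtering rule, the function names, and the use of synthetic data in place of PIDD are all our assumptions.

```python
import numpy as np

def kmeans(X, k, iters=100, seed=0):
    """Plain Lloyd's k-means; returns a cluster index for each row of X."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(iters):
        # assign every point to its nearest center (Euclidean distance)
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        new_centers = np.array([X[labels == j].mean(axis=0) if (labels == j).any()
                                else centers[j] for j in range(k)])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels

def filter_by_cluster_majority(X, y, k=2):
    """Stage 1 of the cascade (as we read it): cluster the data, label each
    cluster by its majority class, and keep only the instances whose true
    class agrees with their cluster's majority."""
    clusters = kmeans(X, k)
    keep = np.zeros(len(X), dtype=bool)
    for j in range(k):
        members = clusters == j
        if not members.any():
            continue
        majority = np.bincount(y[members]).argmax()
        keep |= members & (y == majority)
    return X[keep], y[keep]
```

The filtered instances would then feed the later feature-selection and KNN stages; the paper itself evaluates this on PIDD rather than toy data.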
Clustering is the process of grouping data into classes or clusters so that objects within a cluster have high similarity to one another but are very dissimilar to objects in other clusters. Classification is a supervised learning algorithm, in contrast with clustering, which is an unsupervised learning algorithm [1]. Classification is a supervised model that maps or classifies a data item into one of several predefined classes. Data classification is a two-step process. In the first step, a model is built describing a predetermined set of data classes or concepts. Typically the learned model is represented in the form of classification rules, decision trees, or mathematical formulae. In the second step, the model is used for classification. The most common classification data mining techniques are case-based reasoning, decision trees, backpropagation neural networks, radial basis neural networks, Bayesian classification, the rough set approach, fuzzy set approaches, and K-nearest neighbor classifiers. The classifiers are of two types: (a) instance-based or lazy learners, which store all training samples and do not build a classifier until a new sample with no class label needs to be classified; K-nearest neighbor (KNN) and case-based reasoning (CBR) are instance-based classifiers. (b) Eager learning methods, which construct a classification model using training data that is then tested using test data; decision trees, backpropagation neural networks and radial basis neural networks use eager learning.

[Manuscript received Feb. 2012. Asha Gowda Karegowda, Dept. of Master of Computer Applications, Siddaganga Institute of Technology, Tumkur, India, 9844327268, [email protected]; M.A. Jayaram, Dept. of Master of Computer Applications, Siddaganga Institute of Technology, Tumkur, India, 8095992902, [email protected]; A.S. Manjunath, Dept. of Computer Science and Engg., Siddaganga Institute of Technology, Tumkur, India, 9845141040, [email protected]]

In this paper, a cascaded K-means clustering and K-nearest neighbor classification algorithm has been used to categorize diabetic patients. A literature survey of classification of the diabetic dataset is given in section II. For the sake of completeness, the KNN classifier and K-means clustering are briefly explained in sections III and IV. Feature extraction using GA_CFS and the working of the cascaded K-means clustering and KNN classifier are explained in section V, followed by results and conclusions in sections VI and VII respectively.

II. RELATED WORK ON CLASSIFICATION OF DIABETIC DATA SET

A. Diabetes

A 199 World Health Organization (WHO) report showed a marked increase in the number of diabetics, and this trend is expected to continue over the next couple of decades. At the International Diabetes Federation Conference 2003 held in Paris, India was labeled the "Diabetes Capital of the World," as of about 190 million diabetics worldwide, more than 33 million are Indians.
The worldwide figure is expected to rise to 330 million, 52 million of them Indians, by 2025, largely due to population growth, ageing, urbanization, unhealthy eating habits and a sedentary lifestyle. Diabetes mellitus is a disease in which the body is unable to produce, or unable to properly use and store, glucose (a form of sugar). Glucose backs up in the bloodstream, causing one's blood glucose or "sugar" to rise too high. There are two major types of diabetes. In type 1 (also called juvenile-onset or insulin-dependent) diabetes, the body completely stops producing insulin, a hormone that enables the body to use the glucose found in foods for energy. People with type 1 diabetes must take daily insulin injections to survive. This form of diabetes usually develops in children or young adults, but can occur at any age. Type 2 (also called adult-onset or non-insulin-dependent) diabetes results when the body does not produce enough insulin and/or is unable to use insulin properly (insulin resistance). This form of diabetes usually occurs in people who are over 40, overweight, and have a family history of diabetes, although today it is increasingly occurring in younger people, particularly adolescents. Type 2 diabetes (not depending on insulin) is the most common form of diabetes (90 to 95 percent) and occurs primarily in adults, but is now also affecting children and young adults. Type 1 diabetes (insulin-dependent) affects predominantly children and youth, and is the less common form of diabetes (5 to 10 percent). The major risk factors for diabetes include obesity, high cholesterol, high blood pressure and physical inactivity. The risk of developing diabetes also increases as people grow older.
People who develop diabetes while pregnant (a condition called gestational diabetes) are more likely to develop full-blown diabetes later in life. Poorly managed diabetes can lead to a host of long-term complications; among these are heart attacks, strokes, blindness, kidney failure and blood vessel disease [2], [3].

B. Literature review of classification of the diabetic dataset

A lot of research work has been done on various medical datasets, including the Pima Indian diabetes dataset. Classification accuracy achieved for the Pima Indian diabetes dataset using 22 different classifiers is given in [4], and using 43 different classifiers in [5]. The performance of the proposed cascaded model (K-means + KNN) is compared with [4] and [5]. The results of [5] and [4] are shown in Table 1 and Table 2 respectively. The accuracy of most of these classifiers is in the range of 66.6% to 77.7%. A hybrid of K-means clustering and decision trees [6] achieved a classification accuracy of 92.38% using 10-fold cross validation, and a cascaded learning system based on Generalized Discriminant Analysis (GDA) and Least Squares Support Vector Machine (LS_SVM) showed an accuracy of 82.05% for diagnosis of the Pima dataset [7]. Further, the authors have achieved classification accuracies of 72.88% using an ANN; 78.21% using DT_ANN, where decision tree C4.5 is used to identify relevant features given as input to the ANN [8]; 79.50% using cascaded GA_CFS_ANN, where relevant features identified by a genetic algorithm with correlation-based feature selection are given as input to the ANN [9]; 77.71% using a GA-optimized ANN; 84.10% using a GA-optimized ANN with relevant features identified by a decision tree; and 84.71% with a GA-optimized ANN with relevant features identified by GA_CFS [10].

III. K-NEAREST NEIGHBOR ALGORITHM

KNN is an instance-based or lazy learner [1]. It delays the process of modeling the training data until it is needed to classify test samples. It can be used both for classification and prediction.
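Several of the surveyed results, and the proposed model's second stage, rely on GA_CFS, where the GA's fitness function is the CFS merit of a candidate feature subset. A rough sketch of that merit (Hall's standard CFS formulation, with Pearson correlations and illustrative names of our choosing; the exhaustive search stands in for the GA and is not the authors' implementation):

```python
import numpy as np
from itertools import combinations

def cfs_merit(X, y, subset):
    """CFS merit of a feature subset S:
    merit = k * r_cf / sqrt(k + k*(k-1) * r_ff), where r_cf is the mean
    absolute feature-class correlation and r_ff the mean absolute
    feature-feature correlation within S."""
    k = len(subset)
    r_cf = np.mean([abs(np.corrcoef(X[:, j], y)[0, 1]) for j in subset])
    if k == 1:
        r_ff = 0.0
    else:
        r_ff = np.mean([abs(np.corrcoef(X[:, a], X[:, b])[0, 1])
                        for a, b in combinations(subset, 2)])
    return k * r_cf / np.sqrt(k + k * (k - 1) * r_ff)

def best_subset(X, y, max_size=3):
    """Exhaustive search over small subsets, standing in for the GA;
    the GA makes this global search feasible for larger attribute sets."""
    features = range(X.shape[1])
    candidates = (s for r in range(1, max_size + 1)
                  for s in combinations(features, r))
    return max(candidates, key=lambda s: cfs_merit(X, y, list(s)))
```

The merit rewards subsets whose features correlate with the class but not with each other, which is why redundant attributes tend to be pruned.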
The training samples are described by n-dimensional numeric attributes and are stored in an n-dimensional space. When a test sample (of unknown class label) is given, the k-nearest neighbor classifier searches for the k training samples which are closest to the unknown sample. Closeness is usually defined in terms of Euclidean distance. The Euclidean distance between two points P(p1, p2, ..., pn) and Q(q1, q2, ..., qn) is given by equation (1):

d(P, Q) = \sqrt{\sum_{i=1}^{n} (p_i - q_i)^2}    (1)

K-nearest neighbor classification algorithm:
1. Let k be the number of nearest neighbors and D be the set of training samples (yj).
2. For each test sample xi, compute d(xi, yj) using Euclidean distance for every sample yj of D.
3. Select the k closest training samples yj (neighbors) to the test sample xi.
4. Classify the sample xi based on the majority class among its k nearest neighbors.
5. End for.

Some
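The steps above can be sketched in plain Python as follows. This is an illustrative sketch rather than the authors' code; the function names are ours, but the distance is the Euclidean distance of equation (1) and the steps map onto the numbered algorithm.

```python
import math
from collections import Counter

def euclidean(p, q):
    """Equation (1): d(P, Q) = sqrt(sum_i (p_i - q_i)^2)."""
    return math.sqrt(sum((pi - qi) ** 2 for pi, qi in zip(p, q)))

def knn_classify(train, test_x, k=3):
    """Steps 2-4 of the KNN algorithm. `train` is a list of
    (features, class_label) pairs; returns the majority class among
    the k training samples closest to test_x."""
    # step 2: order training samples by distance to the test sample
    ordered = sorted(train, key=lambda s: euclidean(s[0], test_x))
    # step 3: keep the k closest neighbors
    neighbors = ordered[:k]
    # step 4: majority vote over the neighbors' class labels
    return Counter(label for _, label in neighbors).most_common(1)[0][0]
```

Note that nothing is precomputed from `train`: all work happens at query time, which is exactly the lazy-learner behavior described in section III.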