- Research Questions (if research questions are not specifically mentioned, what is the theoretical background or overarching theme)
- Procedures:
- Data Analysis:
- Findings or Results
- Conclusions/Implications
- Student’s Reflections (changes to your understanding):
Cascading K-means Clustering and K-Nearest Neighbor Classifier for Categorization of Diabetic Patients

International Journal of Engineering and Advanced Technology (IJEAT), ISSN: 2249-8958, Volume-1, Issue-3, February 2012

Abstract— Medical data mining is the process of extracting hidden patterns from medical data. This paper presents the development of a hybrid model for classifying the Pima Indian Diabetes Database (PIDD). The model consists of three stages. In the first stage, K-means clustering is used to identify and eliminate incorrectly classified instances. In the second stage, a Genetic Algorithm (GA) and Correlation-based Feature Selection (CFS) are used in a cascaded fashion for relevant feature extraction, where the GA performs a global search of attributes with fitness evaluation effected by CFS. Finally, in the third stage, a fine-tuned classification is done using the K-nearest neighbor (KNN) classifier, taking as inputs the correctly clustered instances of the first stage and the feature subset identified in the second stage. Experimental results signify that cascading K-means clustering and KNN, along with the feature subset identified by GA_CFS, enhanced the classification accuracy of KNN. The proposed model obtained a classification accuracy of 96.68% for the diabetic dataset.

Index Terms— Genetic algorithm, Correlation-based feature selection, K-nearest neighbor, K-means clustering, Pima Indian Diabetes.

I. INTRODUCTION

The data mining functionalities are used to specify the kind of patterns to be found in the data-mining task. They mainly include association rule mining, classification, prediction and clustering. Association analysis discovers interesting relations between variables in large databases, which are given to the user in the form of rules. Classification predicts class labels. Prediction is used to assess the value of an attribute that a given sample is likely to have.
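The first stage of the abstract's cascade, using K-means to identify and drop incorrectly clustered instances, can be sketched roughly as follows. This is an illustrative reconstruction, not the authors' code: the majority-vote filtering rule, the function names, and the use of synthetic data in place of PIDD are all our assumptions.

```python
import numpy as np

def kmeans(X, k, iters=100, seed=0):
    """Plain Lloyd's k-means; returns a cluster index for each row of X."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(iters):
        # assign every point to its nearest center (Euclidean distance)
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        new_centers = np.array([X[labels == j].mean(axis=0) if (labels == j).any()
                                else centers[j] for j in range(k)])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels

def filter_by_cluster_majority(X, y, k=2):
    """Stage 1 of the cascade (as we read it): cluster the data, label each
    cluster by its majority class, and keep only the instances whose true
    class agrees with their cluster's majority."""
    clusters = kmeans(X, k)
    keep = np.zeros(len(X), dtype=bool)
    for j in range(k):
        members = clusters == j
        if not members.any():
            continue
        majority = np.bincount(y[members]).argmax()
        keep |= members & (y == majority)
    return X[keep], y[keep]
```

The filtered instances would then feed the later feature-selection and KNN stages; the paper itself evaluates this on PIDD rather than toy data.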
Clustering is the process of grouping data into classes or clusters so that objects within a cluster have high similarity to one another but are very dissimilar to objects in other clusters. Classification is a supervised learning algorithm, in contrast with clustering, which is an unsupervised learning algorithm [1]. Classification is a supervised model that maps or classifies a data item into one of several predefined classes. Data classification is a two-step process. In the first step, a model is built describing a predetermined set of data classes or concepts. Typically the learned model is represented in the form of classification rules, decision trees, or mathematical formulae. In the second step, the model is used for classification. The most common classification data mining techniques are case-based reasoning, decision trees, backpropagation neural networks, radial basis neural networks, Bayesian classification, the rough set approach, fuzzy set approaches, and K-nearest neighbor classifiers. The classifiers are of two types: (a) instance-based or lazy learners, which store all training samples and do not build a classifier until a new sample with no class label needs to be classified; K-nearest neighbor (KNN) and case-based reasoning (CBR) are instance-based classifiers. (b) Eager learning methods, which construct a classification model using training data that is then tested using test data; decision trees, backpropagation neural networks and radial basis neural networks use eager learning.

[Manuscript received Feb. 2012. Asha Gowda Karegowda, Dept. of Master of Computer Applications, Siddaganga Institute of Technology, Tumkur, India, 9844327268, [email protected]; M.A. Jayaram, Dept. of Master of Computer Applications, Siddaganga Institute of Technology, Tumkur, India, 8095992902, [email protected]; A.S. Manjunath, Dept. of Computer Science and Engg., Siddaganga Institute of Technology, Tumkur, India, 9845141040, [email protected]]

In this paper, a cascaded K-means clustering and K-nearest neighbor classification algorithm has been used to categorize diabetic patients. A literature survey of classification of the diabetic dataset is given in section II. For the sake of completeness, the KNN classifier and K-means clustering are briefly explained in sections III and IV. Feature extraction using GA_CFS and the working of the cascaded K-means clustering and KNN classifier are explained in section V, followed by results and conclusions in sections VI and VII respectively.

II. RELATED WORK ON CLASSIFICATION OF DIABETIC DATA SET

A. Diabetes

A 199 World Health Organization (WHO) report showed a marked increase in the number of diabetics, and this trend is expected to continue over the next couple of decades. At the International Diabetes Federation Conference 2003 held in Paris, India was labeled the "Diabetes Capital of the World," as of about 190 million diabetics worldwide, more than 33 million are Indians.
The worldwide figure is expected to rise to 330 million, 52 million of them Indians, by 2025, largely due to population growth, ageing, urbanization, unhealthy eating habits and a sedentary lifestyle. Diabetes mellitus is a disease in which the body is unable to produce, or unable to properly use and store, glucose (a form of sugar). Glucose backs up in the bloodstream, causing one's blood glucose or "sugar" to rise too high. There are two major types of diabetes. In type 1 (also called juvenile-onset or insulin-dependent) diabetes, the body completely stops producing insulin, a hormone that enables the body to use the glucose found in foods for energy. People with type 1 diabetes must take daily insulin injections to survive. This form of diabetes usually develops in children or young adults, but can occur at any age. Type 2 (also called adult-onset or non-insulin-dependent) diabetes results when the body does not produce enough insulin and/or is unable to use insulin properly (insulin resistance). This form of diabetes usually occurs in people who are over 40, overweight, and have a family history of diabetes, although today it is increasingly occurring in younger people, particularly adolescents. Type 2 diabetes (not depending on insulin) is the most common form of diabetes (90 to 95 percent) and occurs primarily in adults, but is now also affecting children and young adults. Type 1 diabetes (insulin-dependent) affects predominantly children and youth, and is the less common form of diabetes (5 to 10 percent). The major risk factors for diabetes include obesity, high cholesterol, high blood pressure and physical inactivity. The risk of developing diabetes also increases as people grow older.
People who develop diabetes while pregnant (a condition called gestational diabetes) are more likely to develop full-blown diabetes later in life. Poorly managed diabetes can lead to a host of long-term complications; among these are heart attacks, strokes, blindness, kidney failure and blood vessel disease [2], [3].

B. Literature review of classification of the diabetic dataset

A lot of research work has been done on various medical datasets, including the Pima Indian diabetes dataset. Classification accuracy achieved for the Pima Indian diabetes dataset using 22 different classifiers is given in [4], and using 43 different classifiers in [5]. The performance of the proposed cascaded model (K-means + KNN) is compared with [4] and [5]. The results of [5] and [4] are shown in Table 1 and Table 2 respectively. The accuracy of most of these classifiers is in the range of 66.6% to 77.7%. A hybrid of K-means clustering and decision trees [6] achieved a classification accuracy of 92.38% using 10-fold cross validation, and a cascaded learning system based on Generalized Discriminant Analysis (GDA) and Least Squares Support Vector Machine (LS_SVM) showed an accuracy of 82.05% for diagnosis of the Pima dataset [7]. Further, the authors have achieved classification accuracies of 72.88% using an ANN; 78.21% using DT_ANN, where decision tree C4.5 is used to identify relevant features given as input to the ANN [8]; 79.50% using cascaded GA_CFS_ANN, where relevant features identified by a genetic algorithm with correlation-based feature selection are given as input to the ANN [9]; 77.71% using a GA-optimized ANN; 84.10% using a GA-optimized ANN with relevant features identified by a decision tree; and 84.71% with a GA-optimized ANN with relevant features identified by GA_CFS [10].

III. K-NEAREST NEIGHBOR ALGORITHM

KNN is an instance-based or lazy learner [1]. It delays the process of modeling the training data until it is needed to classify test samples. It can be used both for classification and prediction.
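Several of the surveyed results, and the proposed model's second stage, rely on GA_CFS, where the GA's fitness function is the CFS merit of a candidate feature subset. A rough sketch of that merit (Hall's standard CFS formulation, with Pearson correlations and illustrative names of our choosing; the exhaustive search stands in for the GA and is not the authors' implementation):

```python
import numpy as np
from itertools import combinations

def cfs_merit(X, y, subset):
    """CFS merit of a feature subset S:
    merit = k * r_cf / sqrt(k + k*(k-1) * r_ff), where r_cf is the mean
    absolute feature-class correlation and r_ff the mean absolute
    feature-feature correlation within S."""
    k = len(subset)
    r_cf = np.mean([abs(np.corrcoef(X[:, j], y)[0, 1]) for j in subset])
    if k == 1:
        r_ff = 0.0
    else:
        r_ff = np.mean([abs(np.corrcoef(X[:, a], X[:, b])[0, 1])
                        for a, b in combinations(subset, 2)])
    return k * r_cf / np.sqrt(k + k * (k - 1) * r_ff)

def best_subset(X, y, max_size=3):
    """Exhaustive search over small subsets, standing in for the GA;
    the GA makes this global search feasible for larger attribute sets."""
    features = range(X.shape[1])
    candidates = (s for r in range(1, max_size + 1)
                  for s in combinations(features, r))
    return max(candidates, key=lambda s: cfs_merit(X, y, list(s)))
```

The merit rewards subsets whose features correlate with the class but not with each other, which is why redundant attributes tend to be pruned.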
The training samples are described by n-dimensional numeric attributes and are stored in an n-dimensional space. When a test sample (of unknown class label) is given, the k-nearest neighbor classifier searches for the k training samples which are closest to the unknown sample. Closeness is usually defined in terms of Euclidean distance. The Euclidean distance between two points P(p1, p2, ..., pn) and Q(q1, q2, ..., qn) is given by equation (1):

d(P, Q) = \sqrt{\sum_{i=1}^{n} (p_i - q_i)^2}    (1)

K-nearest neighbor classification algorithm:
1. Let k be the number of nearest neighbors and D be the set of training samples (yj).
2. For each test sample xi, compute d(xi, yj) using Euclidean distance for every sample yj of D.
3. Select the k closest training samples yj (neighbors) to the test sample xi.
4. Classify the sample xi based on the majority class among its k nearest neighbors.
5. End for.

Some
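The steps above can be sketched in plain Python as follows. This is an illustrative sketch rather than the authors' code; the function names are ours, but the distance is the Euclidean distance of equation (1) and the steps map onto the numbered algorithm.

```python
import math
from collections import Counter

def euclidean(p, q):
    """Equation (1): d(P, Q) = sqrt(sum_i (p_i - q_i)^2)."""
    return math.sqrt(sum((pi - qi) ** 2 for pi, qi in zip(p, q)))

def knn_classify(train, test_x, k=3):
    """Steps 2-4 of the KNN algorithm. `train` is a list of
    (features, class_label) pairs; returns the majority class among
    the k training samples closest to test_x."""
    # step 2: order training samples by distance to the test sample
    ordered = sorted(train, key=lambda s: euclidean(s[0], test_x))
    # step 3: keep the k closest neighbors
    neighbors = ordered[:k]
    # step 4: majority vote over the neighbors' class labels
    return Counter(label for _, label in neighbors).most_common(1)[0][0]
```

Note that nothing is precomputed from `train`: all work happens at query time, which is exactly the lazy-learner behavior described in section III.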