A Comprehensive Approach Towards Data Preprocessing Techniques & Association Rules

Jasdeep Singh Malik, Prachi Goyal, Mr. Akhilesh K Sharma (Assistant Professor)
IES-IPS Academy, Rajendra Nagar, Indore – 452012, India
[email protected], [email protected], [email protected]

ABSTRACT

Data pre-processing is an important and critical step in the data mining process, and it has a huge impact on the success of a data mining project. [1] Data pre-processing is a step of the Knowledge Discovery in Databases (KDD) process that reduces the complexity of the data and offers better conditions for subsequent analysis. Through it, the nature of the data is better understood and the data analysis can be performed more accurately and efficiently. Data pre-processing is challenging, as it involves extensive manual effort and time spent developing data-operation scripts. A number of different tools and methods are used for pre-processing, including: sampling, which selects a representative subset from a large population of data; transformation, which manipulates raw data to produce a single input; denoising, which removes noise from data; normalization, which organizes data for more efficient access; and feature extraction, which pulls out specified data that is significant in some particular context. Pre-processing is also useful for association rule algorithms such as Apriori, Partition, Pincer-Search, and many others.

KEYWORDS: KDD, data mining, association rules, preprocessing algorithms, data warehouse, two sine waves.

INTRODUCTION

Data analysis is now integral to our working lives. It is the basis for investigations in many fields of knowledge, from science to engineering and from management to process control. Data on a particular topic are acquired in the form of symbolic and numeric attributes. Analysis of these data gives a better understanding of the phenomenon of interest. When development of a knowledge-based system is planned, the data analysis involves discovery and generation of new knowledge for building a reliable and comprehensive knowledge base.

Data preprocessing is an important issue for both data warehousing and data mining, as real-world data tend to be incomplete, noisy, and inconsistent. Data preprocessing includes data cleaning, data integration, data transformation, and data reduction. Data cleaning can be applied to remove noise and correct inconsistencies in the data. Data integration merges data from multiple sources into a coherent data store, such as a data warehouse. Data transformation, such as normalization, may be applied. [2] Data reduction can reduce the data size by aggregation, elimination of redundant features, or clustering, for instance. With the help of all these preprocessing techniques we can improve the quality of the data, and consequently of the mining results, as well as the efficiency of the mining process. Data preprocessing techniques are helpful in OLTP (online transaction processing) and OLAP (online analytical processing), and preprocessing is also useful for association rule algorithms such as Apriori, Partition, and Pincer-Search. Data preprocessing is an important stage for data warehousing and data mining.
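The paragraph above names normalization as a typical data transformation. As a minimal illustration only (not code from the paper), the following Python sketch implements min-max normalization, which rescales an attribute linearly into a target range; the income figures and the target range [0, 1] are invented for the example:

    def min_max_normalize(values, new_min=0.0, new_max=1.0):
        # Rescale numeric values linearly into [new_min, new_max].
        old_min, old_max = min(values), max(values)
        if old_max == old_min:  # constant attribute: nothing to spread out
            return [new_min for _ in values]
        scale = (new_max - new_min) / (old_max - old_min)
        return [new_min + (v - old_min) * scale for v in values]

    # Example: rescale customer incomes into [0, 1].
    incomes = [12000, 28000, 54000, 73600]
    print(min_max_normalize(incomes))  # [0.0, 0.2597..., 0.6818..., 1.0]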
[2] Many efforts are being made to analyze data using commercially available tools, or to develop analysis tools that meet the requirements of a particular application. Almost all of these efforts have ignored the fact that some form of data pre-processing is usually required to analyze the data intelligently. Through data pre-processing one can learn more about the nature of the data, solve problems that may exist in the raw data (e.g., irrelevant or missing attributes in the data sets), change the structure of the data (e.g., create levels of granularity) to prepare it for more efficient and intelligent analysis, and address problems such as very large data sets. There are several different types of problems, related to data collected from the real world, that may have to be solved through data pre-processing. Examples are: (i) data with missing, out-of-range, or corrupt elements; (ii) noisy data; (iii) data from several levels of granularity; (iv) large data sets, data dependency, and irrelevant data; and (v) multiple sources of data.

NEEDS

Problems with huge real-world databases: the data are commonly incomplete (missing values), noisy, and inconsistent. [2] Noise refers to modification of the original values. Examples: distortion of a person's voice when talking on a poor phone line, and "snow" on a television screen.

[Figure: two sine waves, and the same two sine waves with added noise]
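The figure contrasts a clean signal with its noisy version. As an illustrative sketch only (not code from the paper), the following Python fragment, assuming NumPy is installed, generates the sum of two sine waves and adds Gaussian noise in the spirit of the figure; the frequencies and noise level are invented for the example:

    import numpy as np

    # Sample points along the time axis.
    t = np.linspace(0.0, 4.0 * np.pi, 500)

    # "Two sine waves": the sum of two sinusoids of different frequency.
    clean = np.sin(t) + np.sin(2.0 * t)

    # "Two sine waves + noise": the same signal corrupted by Gaussian
    # noise, the kind of random error data cleaning tries to smooth out.
    rng = np.random.default_rng(seed=0)
    noisy = clean + rng.normal(loc=0.0, scale=0.3, size=t.shape)

A denoising or smoothing step in pre-processing (for example, the binning described later) attempts to recover something close to clean from noisy.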
WHY DATA PREPROCESSING?

Data in the real world is dirty:
- incomplete: missing attribute values, lack of certain attributes of interest, or containing only aggregate data; e.g., occupation = ""
- noisy: containing errors or outliers; e.g., Salary = "-10"
- inconsistent: containing discrepancies in codes or names; e.g., Age = "42" but Birthday = "03/07/1997"; a rating that was "1, 2, 3" is now "A, B, C"; discrepancies between duplicate records

A well-accepted multi-dimensional view of data quality covers: accuracy, completeness, consistency, timeliness, believability, value added, interpretability, and accessibility.

Major Tasks in Data Pre-processing

1) Data cleaning
   a) Missing values:
      i. Ignore the tuple
      ii. Fill in the missing value manually
      iii. Use a global constant to fill in the missing value
      iv. Use the attribute mean to fill in the missing value
      v. Use the attribute mean for all samples belonging to the same class
      vi. Use the most probable value to fill in the missing value
   b) Noisy data:
      i. Binning
      ii. Clustering
      iii. Regression
   c) Inconsistent data
2) Data integration and data transformation
   a) Data integration
   b) Data transformation:
      i. Smoothing
      ii. Aggregation
      iii. Generalization
      iv. Normalization
      v. Attribute construction
3) Data reduction
   a) Data cube aggregation
   b) Attribute subset selection
   c) Dimensionality reduction
   d) Data sampling
   e) Numerosity reduction
   f) Discretization and concept hierarchy generation

1) DATA CLEANING

Real-world data tend to be incomplete, noisy, and inconsistent. Data cleaning routines attempt to fill in missing values, smooth out noise while identifying outliers, and correct inconsistencies in the data. [2]

a) Ways of handling missing values (a minimal code sketch follows this list):

a. Ignore the tuple: this is usually done when the class label is missing. This method is not very effective unless the tuple contains several attributes with missing values. It is especially poor when the percentage of missing values per attribute varies considerably.

b. Fill in the missing value manually: this approach is time consuming and may not be feasible for a large data set with many missing values.

c. Use a global constant to fill in the missing value: replace all missing attribute values by the same constant, such as a label like "unknown". If missing values are replaced by, say, "unknown", the mining program may mistakenly think that they form an interesting concept, since they all have a value in common, that of "unknown". Hence, although this method is simple, it is not foolproof.

d. Use the attribute mean to fill in the missing value: for example, suppose that the average income of AllElectronics customers is $28,000; use this value to replace any missing value for income.

e. Use the attribute mean for all samples belonging to the same class as the given tuple: for example, if classifying customers according to credit risk, replace the missing value with the average income of customers in the same credit-risk category as that of the given tuple.

f. Use the most probable value to fill in the missing value: this may be determined with regression, inference-based tools using a Bayesian formalism, or decision tree induction. For example, using the other customer attributes in your data set, you may construct a decision tree to predict the missing values for income.
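To make strategies (d) and (e) concrete, here is a minimal sketch (not code from the paper) using pandas; the toy table and the column names income and credit_risk are invented for the example:

    import pandas as pd

    # Toy customer table; None marks missing income values.
    df = pd.DataFrame({
        "credit_risk": ["low", "low", "high", "high", "low"],
        "income": [28000.0, None, 45000.0, None, 32000.0],
    })

    # (d) Fill missing incomes with the overall attribute mean (35000.0).
    df["income_d"] = df["income"].fillna(df["income"].mean())

    # (e) Fill missing incomes with the mean of the tuple's own
    #     credit-risk class ("low" -> 30000.0, "high" -> 45000.0).
    class_mean = df.groupby("credit_risk")["income"].transform("mean")
    df["income_e"] = df["income"].fillna(class_mean)

Strategy (f) would instead fit a predictor, such as a decision tree, on the tuples whose income is known and use it to predict the missing entries.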
b) Noisy data

"What is noise?" Noise is a random error or variance in a measured variable. Given a numeric attribute such as, say, price, how can we "smooth" out the data to remove the noise? Let's look at the following data-smoothing techniques.

a. Binning methods: [1] Binning methods smooth a sorted data value by consulting its "neighborhood", that is, the values around it. The sorted values are distributed into a number of "buckets", or bins. Because binning methods consult the neighborhood of values, they perform local smoothing. The figure below illustrates some binning techniques.

Sorted data for price (in dollars): 4, 8, 15, 21, 21, 24, 25, 28, 34
Partition into (equi-depth) bins:
  Bin 1: 4, 8, 15
  Bin 2: 21, 21, 24
  Bin 3: 25, 28, 34
Smoothing by bin means:
  Bin 1: 9, 9, 9
  Bin 2: 22, 22, 22
  Bin 3: 29, 29, 29
Smoothing by bin boundaries:
  Bin 1: 4, 4, 15
  Bin 2: 21, 21, 24
  Bin 3: 25, 25, 34

In this example, the data for price are first sorted and then partitioned into equal-frequency bins of size 3 (i.e., each bin contains 3 values). In smoothing by bin means, each value in a bin is replaced by the mean value of the bin; in smoothing by bin boundaries, each value is replaced by the closer of the bin's minimum and maximum values.
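As a sketch only (not code from the paper), the following Python fragment reproduces the worked example above: it partitions the sorted prices into equi-depth bins of size 3, then applies smoothing by bin means and by bin boundaries:

    # Sorted price data from the example above.
    prices = [4, 8, 15, 21, 21, 24, 25, 28, 34]
    depth = 3  # equi-depth: each bin holds exactly 3 values

    bins = [prices[i:i + depth] for i in range(0, len(prices), depth)]

    # Smoothing by bin means: every value becomes its bin's mean.
    by_means = [[round(sum(b) / len(b))] * len(b) for b in bins]

    # Smoothing by bin boundaries: every value becomes the closer of
    # the bin's minimum and maximum.
    by_bounds = [[min(b) if v - min(b) <= max(b) - v else max(b) for v in b]
                 for b in bins]

    print(by_means)   # [[9, 9, 9], [22, 22, 22], [29, 29, 29]]
    print(by_bounds)  # [[4, 4, 15], [21, 21, 24], [25, 25, 34]]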