A Comprehensive Approach Towards Data Preprocessing Techniques & Association Rules

Jasdeep Singh Malik, Prachi Goyal, Mr. Akhilesh K Sharma (Assistant Professor)
IES-IPS Academy, Rajendra Nagar, Indore – 452012, India
[email protected], [email protected], [email protected]

ABSTRACT

Data pre-processing is an important and critical step in the data mining process, and it has a huge impact on the success of a data mining project. [1] Data pre-processing is a step of the Knowledge Discovery in Databases (KDD) process that reduces the complexity of the data and offers better conditions for subsequent analysis. Through it, the nature of the data is better understood and the data analysis can be performed more accurately and efficiently. Data pre-processing is challenging, as it involves extensive manual effort and time spent developing data-operation scripts. A number of different tools and methods are used for pre-processing, including: sampling, which selects a representative subset from a large population of data; transformation, which manipulates raw data to produce a single input; denoising, which removes noise from data; normalization, which organizes data for more efficient access; and feature extraction, which pulls out specified data that is significant in some particular context. Pre-processing is also useful for association rule algorithms such as Apriori, Partition, Pincer-Search, and many others.

KEYWORDS: KDD, data mining, association rules, preprocessing algorithms, data warehouse, two sine waves.

INTRODUCTION

Data analysis is now integral to our working lives. It is the basis for investigations in many fields of knowledge, from science to engineering and from management to process control. Data on a particular topic are acquired in the form of symbolic and numeric attributes. Analysis of these data gives a better understanding of the phenomenon of interest. When development of a knowledge-based system is planned, the data analysis involves discovery and generation of new knowledge for building a reliable and comprehensive knowledge base.

Data preprocessing is an important issue for both data warehousing and data mining, as real-world data tend to be incomplete, noisy, and inconsistent. Data preprocessing includes data cleaning, data integration, data transformation, and data reduction. Data cleaning can be applied to remove noise and correct inconsistencies in the data. Data integration merges data from multiple sources into a coherent data store, such as a data warehouse. Data transformation, such as normalization, may be applied. [2] Data reduction can reduce the data size by aggregation, elimination of redundant features, or clustering, for instance. With the help of all these preprocessing techniques we can improve the quality of the data, and consequently of the mining results, as well as the efficiency of the mining process. Data preprocessing techniques are helpful in OLTP (online transaction processing) and OLAP (online analytical processing), and preprocessing is also useful for association rule algorithms such as Apriori, Partition, and Pincer-Search. Data preprocessing is an important stage for data warehousing and data mining.
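The paragraph above names normalization as a typical data transformation. As a minimal illustration only (not code from the paper), the following Python sketch implements min-max normalization, which rescales an attribute linearly into a target range; the income figures and the target range [0, 1] are invented for the example:

    def min_max_normalize(values, new_min=0.0, new_max=1.0):
        # Rescale numeric values linearly into [new_min, new_max].
        old_min, old_max = min(values), max(values)
        if old_max == old_min:  # constant attribute: nothing to spread out
            return [new_min for _ in values]
        scale = (new_max - new_min) / (old_max - old_min)
        return [new_min + (v - old_min) * scale for v in values]

    # Example: rescale customer incomes into [0, 1].
    incomes = [12000, 28000, 54000, 73600]
    print(min_max_normalize(incomes))  # [0.0, 0.2597..., 0.6818..., 1.0]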
[2] Many efforts are being made to analyze data using commercially available tools, or to develop analysis tools that meet the requirements of a particular application. Almost all of these efforts have ignored the fact that some form of data pre-processing is usually required to analyze the data intelligently. Through data pre-processing one can learn more about the nature of the data, solve problems that may exist in the raw data (e.g., irrelevant or missing attributes in the data sets), change the structure of the data (e.g., create levels of granularity) to prepare it for more efficient and intelligent analysis, and address problems such as very large data sets. There are several different types of problems, related to data collected from the real world, that may have to be solved through data pre-processing. Examples are: (i) data with missing, out-of-range, or corrupt elements; (ii) noisy data; (iii) data from several levels of granularity; (iv) large data sets, data dependency, and irrelevant data; and (v) multiple sources of data.

NEEDS

Problems with huge real-world databases: the data are commonly incomplete (missing values), noisy, and inconsistent. [2] Noise refers to modification of the original values. Examples: distortion of a person's voice when talking on a poor phone line, and "snow" on a television screen.

[Figure: two sine waves, and the same two sine waves with added noise]
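The figure contrasts a clean signal with its noisy version. As an illustrative sketch only (not code from the paper), the following Python fragment, assuming NumPy is installed, generates the sum of two sine waves and adds Gaussian noise in the spirit of the figure; the frequencies and noise level are invented for the example:

    import numpy as np

    # Sample points along the time axis.
    t = np.linspace(0.0, 4.0 * np.pi, 500)

    # "Two sine waves": the sum of two sinusoids of different frequency.
    clean = np.sin(t) + np.sin(2.0 * t)

    # "Two sine waves + noise": the same signal corrupted by Gaussian
    # noise, the kind of random error data cleaning tries to smooth out.
    rng = np.random.default_rng(seed=0)
    noisy = clean + rng.normal(loc=0.0, scale=0.3, size=t.shape)

A denoising or smoothing step in pre-processing (for example, the binning described later) attempts to recover something close to clean from noisy.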
WHY DATA PREPROCESSING?

Data in the real world is dirty:
- incomplete: missing attribute values, lack of certain attributes of interest, or containing only aggregate data; e.g., occupation = ""
- noisy: containing errors or outliers; e.g., Salary = "-10"
- inconsistent: containing discrepancies in codes or names; e.g., Age = "42" but Birthday = "03/07/1997"; a rating that was "1, 2, 3" is now "A, B, C"; discrepancies between duplicate records

A well-accepted multi-dimensional view of data quality covers: accuracy, completeness, consistency, timeliness, believability, value added, interpretability, and accessibility.

Major Tasks in Data Pre-processing

1) Data cleaning
   a) Missing values:
      i. Ignore the tuple
      ii. Fill in the missing value manually
      iii. Use a global constant to fill in the missing value
      iv. Use the attribute mean to fill in the missing value
      v. Use the attribute mean for all samples belonging to the same class
      vi. Use the most probable value to fill in the missing value
   b) Noisy data:
      i. Binning
      ii. Clustering
      iii. Regression
   c) Inconsistent data
2) Data integration and data transformation
   a) Data integration
   b) Data transformation:
      i. Smoothing
      ii. Aggregation
      iii. Generalization
      iv. Normalization
      v. Attribute construction
3) Data reduction
   a) Data cube aggregation
   b) Attribute subset selection
   c) Dimensionality reduction
   d) Data sampling
   e) Numerosity reduction
   f) Discretization and concept hierarchy generation

1) DATA CLEANING

Real-world data tend to be incomplete, noisy, and inconsistent. Data cleaning routines attempt to fill in missing values, smooth out noise while identifying outliers, and correct inconsistencies in the data. [2]

a) Ways of handling missing values (a minimal code sketch follows this list):

a. Ignore the tuple: this is usually done when the class label is missing. This method is not very effective unless the tuple contains several attributes with missing values. It is especially poor when the percentage of missing values per attribute varies considerably.

b. Fill in the missing value manually: this approach is time consuming and may not be feasible for a large data set with many missing values.

c. Use a global constant to fill in the missing value: replace all missing attribute values by the same constant, such as a label like "unknown". If missing values are replaced by, say, "unknown", the mining program may mistakenly think that they form an interesting concept, since they all have a value in common, that of "unknown". Hence, although this method is simple, it is not foolproof.

d. Use the attribute mean to fill in the missing value: for example, suppose that the average income of AllElectronics customers is $28,000; use this value to replace any missing value for income.

e. Use the attribute mean for all samples belonging to the same class as the given tuple: for example, if classifying customers according to credit risk, replace the missing value with the average income of customers in the same credit-risk category as that of the given tuple.

f. Use the most probable value to fill in the missing value: this may be determined with regression, inference-based tools using a Bayesian formalism, or decision tree induction. For example, using the other customer attributes in your data set, you may construct a decision tree to predict the missing values for income.
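To make strategies (d) and (e) concrete, here is a minimal sketch (not code from the paper) using pandas; the toy table and the column names income and credit_risk are invented for the example:

    import pandas as pd

    # Toy customer table; None marks missing income values.
    df = pd.DataFrame({
        "credit_risk": ["low", "low", "high", "high", "low"],
        "income": [28000.0, None, 45000.0, None, 32000.0],
    })

    # (d) Fill missing incomes with the overall attribute mean (35000.0).
    df["income_d"] = df["income"].fillna(df["income"].mean())

    # (e) Fill missing incomes with the mean of the tuple's own
    #     credit-risk class ("low" -> 30000.0, "high" -> 45000.0).
    class_mean = df.groupby("credit_risk")["income"].transform("mean")
    df["income_e"] = df["income"].fillna(class_mean)

Strategy (f) would instead fit a predictor, such as a decision tree, on the tuples whose income is known and use it to predict the missing entries.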
b) Noisy data

"What is noise?" Noise is a random error or variance in a measured variable. Given a numeric attribute such as, say, price, how can we "smooth" out the data to remove the noise? Let's look at the following data-smoothing techniques.

a. Binning methods: [1] Binning methods smooth a sorted data value by consulting its "neighborhood", that is, the values around it. The sorted values are distributed into a number of "buckets", or bins. Because binning methods consult the neighborhood of values, they perform local smoothing. The figure below illustrates some binning techniques.

Sorted data for price (in dollars): 4, 8, 15, 21, 21, 24, 25, 28, 34
Partition into (equi-depth) bins:
  Bin 1: 4, 8, 15
  Bin 2: 21, 21, 24
  Bin 3: 25, 28, 34
Smoothing by bin means:
  Bin 1: 9, 9, 9
  Bin 2: 22, 22, 22
  Bin 3: 29, 29, 29
Smoothing by bin boundaries:
  Bin 1: 4, 4, 15
  Bin 2: 21, 21, 24
  Bin 3: 25, 25, 34

In this example, the data for price are first sorted and then partitioned into equal-frequency bins of size 3 (i.e., each bin contains 3 values). In smoothing by bin means, each value in a bin is replaced by the mean value of the bin; in smoothing by bin boundaries, each value is replaced by the closer of the bin's minimum and maximum values.
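As a sketch only (not code from the paper), the following Python fragment reproduces the worked example above: it partitions the sorted prices into equi-depth bins of size 3, then applies smoothing by bin means and by bin boundaries:

    # Sorted price data from the example above.
    prices = [4, 8, 15, 21, 21, 24, 25, 28, 34]
    depth = 3  # equi-depth: each bin holds exactly 3 values

    bins = [prices[i:i + depth] for i in range(0, len(prices), depth)]

    # Smoothing by bin means: every value becomes its bin's mean.
    by_means = [[round(sum(b) / len(b))] * len(b) for b in bins]

    # Smoothing by bin boundaries: every value becomes the closer of
    # the bin's minimum and maximum.
    by_bounds = [[min(b) if v - min(b) <= max(b) - v else max(b) for v in b]
                 for b in bins]

    print(by_means)   # [[9, 9, 9], [22, 22, 22], [29, 29, 29]]
    print(by_bounds)  # [[4, 4, 15], [21, 21, 24], [25, 25, 34]]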