A reading with 5 questions
1. How could you avoid the issue reported in Figure B.4 if you were using Python instead of Talend for the transformation operation?
2. Semantic similarity is often a reality when performing ETL. Suggest one possible solution to mitigate or completely resolve the issue; the solution may include Talend components or entirely different processes, but make sure the solution is realistic. Use the examples in the article to support your solution. (One possible starting point is sketched in the code after this list.)
3. Describe the difference between process-oriented and data-oriented approaches to data quality, give an example of each, and explain how you could incorporate the steps into your ETL workflow.
4. Pick one problem from each category (E, T, L) in Table 1, and give an example and a possible solution for improving the data quality, assuming that you have little to no control over the incoming data format.
5. Explain the issues reported in Figure B.2 and discuss a possible solution that supports data quality.
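For question 2, the article itself contains no code, so the following is only a minimal sketch of one realistic approach: flag source/target attribute pairs that may refer to the same concept. String similarity catches spelling variants, while a small hand-maintained synonym table (an assumption here, not something from the paper) catches semantically similar names such as gender/sex from Table 1.

```python
# Sketch: flag attribute-name pairs that likely denote the same concept.
# The synonym table and the 0.8 threshold are illustrative assumptions.
from difflib import SequenceMatcher

SYNONYMS = {
    frozenset({"gender", "sex"}),
    frozenset({"ssn", "social_security_number"}),
}

def similar(a: str, b: str, threshold: float = 0.8) -> bool:
    a, b = a.lower(), b.lower()
    if frozenset({a, b}) in SYNONYMS:
        return True  # curated synonyms match regardless of spelling
    return SequenceMatcher(None, a, b).ratio() >= threshold

source_cols = ["Sex", "adress"]
target_cols = ["gender", "address"]
for s in source_cols:
    for t in target_cols:
        if similar(s, t):
            print(f"possible match: {s} <-> {t}")
```

A dictionary-plus-similarity check like this is cheap to run at schema-mapping time; heavier options (word embeddings, a metadata catalog) trade setup cost for broader coverage.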
[email protected] Conservatoire National des Arts et Me´tiers CEDRIC-CNAM, Paris, France {faten.atigui,samira.cherfi}@cnam.fr Abstract The accuracy and relevance of Business Intelligence & Analytics (BI&A) rely on the ability to bring high data quality to the data warehouse from both internal and external sources using the ETL process. The latter is complex and time-consuming as it manages data with heterogeneous content and diverse quality problems. Ensuring data quality requires tracking quality defects along the ETL process. In this paper, we present the main ETL quality characteristics. We provide an overview of the existing ETL process data quality approaches. We also present a comparative study of some commercial ETL tools to show how much these tools consider data quality dimensions. To illustrate our study, we carry out experiments using an ETL dedicated solution (Talend Data Integration) and a data quality dedicated solution (Talend Data Quality). Based on our study, we identify and discuss quality challenges to be addressed in our future research. © 2019 The Authors. Published by Elsevier B.V. This is an open access article under the CC BY-NC-ND license (https://creativecommons.org/licenses/by-nc-nd/4.0/) Peer-review under responsibility of KES International. Keywords: Business Intelligence & Analytics; ETL quality; Data and process quality; Talend Data Integration; Talend Data Quality 1. Introduction Business Intelligence & Analytics (BI&A) is defined as ”a broad category of applications, technologies, and processes for gathering, storing, accessing, and analyzing data to help business users make better decisions” [37]. Data warehouses (DW) stand as the cornerstone of BI&A systems. Inmon [21] defines the DW as ”a subject-oriented, integrated, time-variant, non-volatile collection of data in support of managements decision-making process”. Figure A.1 shows BI&A architecture where data is gathered from company operational databases and external data. Gathered data is heterogeneous, and has different types and formats. Before being loaded into the DW, this data is transformed and integrated using the ETL process [34]. The latter performs three basic functions: (i) extraction from data source; (ii) data transformation where the data is converted to be stored in the proper format or structure for the purposes of ∗ Corresponding author. Tel.: +216-26-767-794 E-mail address:
[email protected] 1877-0509 © 2019 The Authors. Published by Elsevier B.V. This is an open access article under the CC BY-NC-ND license (https://creativecommons.org/licenses/by-nc-nd/4.0/) Peer-review under responsibility of KES International. 10.1016/j.procs.2019.09.223 querying and analysis (e.g., data cleansing, reformatting, matching, aggregation, etc.); (iii) the resulting data set is loaded into the target system, typically the DW. A data mart (data cube) is the access layer of the DW environment that is used to get data out to the decision-makers. Lavalle et al. [25] conducted a study based on 3 000 business executives and managers. This survey showed that 50% of the respondents consider improvement of data management and BI&A as their priority. It also revealed that 20% of them cited concerns with data quality as a primary obstacle to BI&A systems. Data analytics has been used for many years to provide support for business decision making. Several authors stressed out that poor data quality has direct and indirect impacts on the underlying business decisions [4, 36]. According to Redman [28], at least three proprietary studies have provided estimates of poor data quality costs between 8 and 12% of revenue range. In order to properly identify quality-related issues, in the literature, Data Quality (DQ) is recognized as multi- dimensional to better reflect its facets and influences. Each dimension is associated to a set of metrics allowing its evaluation and measurement. The quality dimensions are organized into four categories according to Wang et al. [35] namely: Intrinsic, Contextual, Representational and Accessibility. The Intrinsic quality dimensions are accuracy, reputation, believability and provenance. They rely on internal characteristics of the data during evaluation. The Contextual quality is more information than data oriented, since it refers to attributes that are dependent to the context in which data is produced or used. It comprises amount of data, relevance, completeness and timeliness quality dimensions. Representational quality however, is more related to the way data is perceived by its users and relies on understandability, consistency and conciseness quality dimensions. Finally, Accessibility allows measuring the ease with which data could be accessed and covers accessibility and security dimensions. Ensuring data quality in the data warehouse and in data cube relies on the quality of the ETL phase which is considered as the sine qua non condition to a successful BI&A system. In this paper, we explore the different facets of quality within an ETL process. We carry out a literature review to gather the different approaches that deal with quality problem in the ETL process. Through our study, we have demonstrated that authors tackle ETL related DQ problems from two main perspectives: (i) process centred and (ii) data centered. Also, we have shown that both ETL and DQ tools still have DQ limits. These limits are highlighted through Talend Data Integration (TDI) [14] and Talend Data Quality (TDQ) [15]. We refer to the following study as preliminary because we make no claim of completeness. The remainder of this paper is structured as follows: Section 2 shows a classification of ETL process related DQ problems. In Section 3, we present existing ETL DQ approaches. In Section 4, we present a comparative study based on four ETL tools to show how much these tools consider DQ dimensions. 
Ensuring data quality in the data warehouse and in data cubes relies on the quality of the ETL phase, which is considered the sine qua non condition for a successful BI&A system. In this paper, we explore the different facets of quality within an ETL process. We carry out a literature review to gather the different approaches that deal with the quality problem in the ETL process. Through our study, we demonstrate that authors tackle ETL-related DQ problems from two main perspectives: (i) process-centered and (ii) data-centered. We also show that both ETL and DQ tools still have DQ limits; these limits are highlighted through Talend Data Integration (TDI) [14] and Talend Data Quality (TDQ) [15]. We refer to the following study as preliminary because we make no claim of completeness.

The remainder of this paper is structured as follows: Section 2 shows a classification of ETL-process-related DQ problems. In Section 3, we present existing ETL DQ approaches. In Section 4, we present a comparative study based on four ETL tools to show how much these tools consider DQ dimensions. In Section 5, we carry out experiments using an ETL solution, i.e., TDI, and a data-quality-dedicated solution, i.e., TDQ, in order to highlight the DQ limits of these tools. Finally, Section 6 outlines the limits of the surveyed approaches and tools dealing with DQ problems in the ETL process, and presents a set of open issues for research in ETL quality management.

2. Data quality defects within the ETL process

Many reasons stand behind the need for a data integration phase within the decision system: (i) heterogeneous formats; (ii) data formats that can be ambiguous or difficult to interpret; (iii) legacy systems using obsolete databases; and (iv) data source structures that change over time. All these characteristics of data sources make DQ uncertain. A variety of studies have been conducted with the aim of identifying the different quality issues within the data integration process, and the majority of them agree that DQ faces different challenges. Indeed, ETL is a crucial part of the data warehousing process, where most of the data cleansing and curation are carried out. Hence, we propose a classification of typical DQ issues according to the ETL stages, i.e., extract, transform, and load. As depicted in Table 1, each of these stages is prone to different quality problems at both the schema and the instance level.

DQ issues found in the ETL process are the focus of the DQ improvement phase. In practice, the improvement phase is often a prerequisite for DQ assessment. The process of integrating DQ into the ETL process is an indicator of the gap between the quality obtained and that expected. Furthermore, overcoming all the DQ problems is still challenging. In the remainder, we classify the pioneering approaches that integrate and improve DQ in the ETL process.

Table 1. Examples of ETL data quality problems

E (Extract), schema level
- Lack of integrity constraints [27]: a rule that defines the consistency of a given data item or dataset in the database (e.g., primary key, uniqueness). Example of a uniqueness violation: two customers having the same SSN number, customer 1 = (name="John", SSN="12663"), customer 2 = (name="Jane", SSN="12663").
- Poor schema design [27, 16]: imperfect schema-level definition. Example 1: attribute names are not significant: "FN" stands for First Name and "Add" stands for Address. Example 2: source without schema: "John;Doe;[email protected];USA".
- Embedded values [16]: multiple values entered in one attribute. Example: name="John D. Tunisia Freedom 32".

E (Extract), instance level
- Duplicate records [9]: data is repeated. Misspellings, different ways of writing names, and even address changes over time can all lead to duplicate entries [18]. Another form of duplication is the conflict of entities when inserting a record having the same id as an existing record [39].
- Missing values: Yang et al. [39] classify missing values into two types: data in one field appears to be null or empty (i.e., direct incompleteness) [39, 19], and missing values caused by data operations such as update (i.e., indirect incompleteness) [39].

T (Transform), schema level
- Variety of data types: different data types between the source and the target schema.
- Naming conflicts [19, 27, 18]: if two data sources have two synonymous attributes (e.g., gender/sex), then the union of the aforementioned sources requires schema recognition.

T (Transform), instance level
- Syntax inconsistency (structural conflicts): the date retrieved from the source does not have the same format as the DW's date [39]; there are different syntactic representations of attributes whose type is the same [9]. Example 1: the French date format (i.e., dd/mm/yyyy) is different from the US format (i.e., mm/dd/yyyy). Example 2: the gender attribute is represented differently in the two data sources, e.g., 0/1, F/M.

L (Load)
- Wrong mapping of data: Linking