Assignment 2: Data Quality & Integration Challenge
Due: Monday, 15th May 2017
Assessment Value: 30%

Brief

Due to the debt crisis, two national banks, the Dodgy Irish Bank and the National Risky Bank, have been forced to merge to reduce overall costs. As a result of this merger, databases from both banks are to be integrated wherever possible. As part of an initial investigation of data integration issues, you have been asked to perform an integration of sample excerpts from the account details databases of both organisations. The account database excerpts provide only a small sample of the total datasets, and are also limited to a small number of attributes for each customer record.

You should:
1. Perform an initial quality assessment of the datasets provided by both banks.
2. Design and perform an integration of the datasets.
3. Evaluate the integration process through an investigation of the resultant dataset quality.

The database excerpts (datasets) will be provided to you through a number of files that will be made available through Webcourses. Specifically, Dodgy Irish Bank has provided information on approximately 400 customer records from the Dublin north central region. This information, available in the ‘Dodgy Irish Bank.csv’ file, includes details on customer name, address, date of birth and gender.

National Risky Bank, meanwhile, did not think it was important to maintain a single integrated database of customer information. Instead, older bank details were maintained on a legacy system (‘National Risky Bank Old.csv’), with newer account information maintained on a separate database (‘National Risky Bank New.csv’). Both new customers and existing customers who purchased new financial products from the bank had entries created in the new database. Unfortunately, not all outdated records in the older database were purged when existing customers were added to the new database.
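As a conceptual illustration only (the assignment itself must be completed with Kettle steps, not scripts), the kind of overlap described above can be sketched as a comparison on a normalised key. The field names `name` and `dob`, the sample records and both helper functions are hypothetical and are not taken from the supplied files.

```python
import re

def normalise_key(record):
    """Build a comparison key from name and date of birth (hypothetical fields)."""
    # Collapse repeated whitespace, trim the ends and ignore letter case.
    name = re.sub(r"\s+", " ", record["name"]).strip().lower()
    dob = record["dob"].strip()
    return (name, dob)

def find_unpurged(old_records, new_records):
    """Return records from the old dataset whose key also appears in the new one."""
    new_keys = {normalise_key(r) for r in new_records}
    return [r for r in old_records if normalise_key(r) in new_keys]

# Invented sample rows standing in for the old and new database excerpts.
old = [
    {"name": "Mary  O'Brien", "dob": "1970-03-12"},
    {"name": "John Murphy", "dob": "1982-11-02"},
]
new = [
    {"name": "mary o'brien ", "dob": "1970-03-12"},
]

stale = find_unpurged(old, new)
print(stale)
```

An exact match on a normalised key will only catch differences in case and spacing. Misspellings or differently formatted dates would need the standardisation and approximate matching that the integration design is expected to address.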
The datasets provided by National Risky Bank and Dodgy Irish Bank all relate to the same geographic area. Therefore, there is potential for customer overlap between the two banks. Your integration project should:
1. Identify such overlapping customers.
2. In addition:
   a. eliminate replication of information within individual bank datasets, and
   b. provide a master dataset of contact details for the new merged bank.

These generated datasets should be submitted alongside your report.

More specifically, in order to complete this project you will need to:
• Investigate the properties of the three datasets with a Data Quality investigation performed in Pentaho Kettle. Your report must show evidence of using Kettle functionality. Excel, SQL and other third-party software are not permitted.
• On the basis of this project specification and your initial investigation, design and implement a Data Integration project to:
   o Eliminate replicated records found in either National Risky Bank’s or Dodgy Irish Bank’s datasets.
   o Identify any customers who are potentially shared by both Dodgy Irish Bank and National Risky Bank.
   o Create a new master customer list with standardised information on all customers. Standardised here refers to the use of a consistent representation of customer information such as address, date of birth, and so forth.

As a means of evaluating your Data Integration project, you should perform a follow-up Data Quality investigation to determine the properties of the newly created master data file.

To aid you in your integration effort, the bank has obtained a license to a GeoDirectory dataset (‘geodirectory.csv’) that provides information on valid addresses in the north Dublin central districts.

For this assignment you are asked to deliver the following:
• One Kettle file called integration_studentNumber.ktr, where studentNumber is your own student number.
• A resultant integrated dataset called master.csv.
• A dataset called shared_customers.csv, which identifies potential shared customers between the banks.
• A report on the integration project (detailed below) called iReport_studentNumber.doc, where studentNumber is your own student number.

For the integration part you are required to make use of Pentaho Data Integration (Kettle). Excel, Google Refine, or similar black box solutions cannot be used for the matching or duplicate elimination tasks.

A report template is provided that is structured for your convenience. You may add further sections and sub-sections, but you must not omit any sections in the report template.

Assignment Evaluation

The assignment will be marked out of 100. These 100 marks correspond to 30% of the overall module marks. The assignment will be evaluated on the following basis:

#    Heading (Weight)
     To do well I have to …

(1)  Initial Quality Investigation (20)
     More than 80% of the quality issues within the bank files provided have been identified and clearly described. Evidence of having used Kettle Data Cleaner.

(2)  High-Level Design (methodology and motivation) (20)
     The integration methodology has been clearly described, motivated and justified. A chart illustrates and supports the high-level design. Hint: check the week 5 lecture slides for a high-level description.

(3)  Detailed Design and Implementation (20)
     All details on the choice of algorithms and the implementations of algorithms have been provided. Choices made are clearly justified and, where appropriate, alternative approaches are critiqued. Evidence in the form of Kettle screenshots has been provided for each algorithm/step. Kettle steps are appropriately named. Commenting is used in the Kettle transformation as appropriate.

(4)  Goal Achievement (20)
     Data was integrated into a master file, shared customers were identified, and duplicates were removed through an automated process. ETL process idea (no scripting) fully adhered to. All requested files have been submitted (see checklist).

(5)  Evaluation and Reflection (10)
     Detailed and concise analysis of the final dataset that addresses any remaining quality issues, and a reflection statement which clearly demonstrates an understanding of the issues involved.

(6)  Quality of Documentation (10)
     High-quality, detailed and concise report, which outlines the issues above appropriately. Template followed fully.

Submission Details

You are required to submit all your files as one zip file called integrationCA_studentNumber.zip|rar|tar through Webcourses, where studentNumber is your own student number. The due date for this submission is stated on the first page of this assignment specification. There is a maximum of 4,000 words (excluding appendices) for this assignment. The regular late submission policy applies.




