LEARNING OUTCOMES 1. Evaluate the various data types, data storage systems and associated techniques...

Question

LEARNING OUTCOMES 1. Evaluate the various data types, data storage systems and associated techniques for indexing and retrieving data. 2. Design feature engineering techniques to transform...

LEARNING OUTCOMES

1. Evaluate the various data types, data storage systems and associated techniques for indexing and retrieving data.

2. Design feature engineering techniques to transform transactional data into meaningful inputs in order to create a predictive model.

3. Propose a suitable approach to designing a data warehouse to store and process large datasets.

DATA MANAGEMENT

The machine learning pipeline involves several tasks before the development of a predictive/descriptive models. The inevitable and vital process includes preparing and understanding the data. Moreover, the performance of the predictive/descriptive model depends on the choice of pre-processing techniques.

For the assignment, you are required to prepare and explore the given dataset. It is imperative to explain and justify the pre-processing, transformation, and feature engineering techniques that have been chosen. Your analysis should be deep and in detail, also it must go further than what has already been covered in this course.

The assignment should involve a number of experiments, and a detailed exploration and analysis of the results using SAS Studio, Apache Hadoop distribution, and Visual Analytics Tools (Tableau).

You need to do the following tasks:

1.

Related Works

In this section, you are supposed to research and present the other works related to the application domain.

2.

Initial Data Exploration

This section should contain the following task.

· Indicate the type of each attribute (nominal, ordinal, interval or ratio).

· Identify the values of the summarising properties for each attribute including frequency and spread e.g. value ranges of the attributes, frequency of values, distributions, medians, means, variances, and percentiles. Wherever necessary, use proper visualisations for the corresponding statistics.

· Using SAS explore your dataset and identify any outliers, missing values, "interesting" attributes and specific values of those attributes.

3.

Data Pre-processing

Investigate the required method(s) to handle the incomplete, noisy and inconsistent data.

Report each of the applied techniques with detailed explanations. Show your results and justify your approach.

NOTE: Easiest way to handle dirty data is through removing the feature(s) / instance(s). Choosing this method will be award ZERO for pre-processing.

4.

Feature Engineering

Several Data Mining/Machine Learning algorithms are designed to work with qualitative or quantitative data and very few algorithms support mixed data. Hence, this task requires you to develop two datasets. The first dataset should represent all variables in the qualitative and second dataset in quantitative.

Individual attributes, need to be discretized/transformed with an appropriate method(s) and proper justification to be provided. In addendum, the
metadata
should be created for each dataset.

5.

Exploratory Data Analysis (EDA)

This task requires you to perform an analysis on the two datasets generated during your feature engineering. You are evaluated based on the approaches undertaken to get familiar with the dataset.

6.

Apache Hadoop

Load the dataset (cleaned dataset or transformed dataset) into Hive configured with optimized read performance on the tables.

You are free to choose your own choice of Apache Hadoop distribution (Hortonworks, Cloudera, MapR etc.).

7.

Hypothesis

Formulate a minimum of
FIVE (5) hypotheses
based on the dataset (cleaned dataset or transformed dataset) with required analytical variable(s). Interpret the hypotheses with the query resulted from HIVEQL and/or visualization.

Deliverables

The deliveries include:

A report, which structure should follow the tasks of the assignment.

SAS program (Initial Data Exploration, Data Pre-processing, and Dataset Transformation) and Hive queries with an individual file for each task.

Your report should include the following:

Abstract
– A self-contained, short, and powerful statement/brief that describes your work. It may contain the scope, purpose, results, and contents of the work.
[180 to 250 words]

Introduction
- The purpose of your report. Background information about the topic. You also have to place some brief details of your methods applied for the study. Include an outline of the structure of the report.
[800 to 1000 words]

Related Work
- Carefully structure your findings. It may be useful to do a chronological format where you discuss from the earliest to the latest research, placing your research appropriately in the chronology. Alternately, you could write in a thematic way, outlining the various themes that you discovered in the research regarding the topic.
[1000 to 1500 words]

Method
- This section should contain detail exploration of the dataset, pre-processing, feature engineering, EDA, Hive and Hypothesis.
[No limit]

Discussion -
For each of the task include a section title in your report. Finally, you need to summarize your findings, and this summary section should
NOT
be a narrative of your tasks, but a summarized informative section of what is your findings of the data. This section should provide detail interpretation of the work along with the supporting related works.
[500 to 1000 words]

For example, it should include details like specific characteristics (or values) of some attributes, important information about the distributions, relationship or association that exist between variables found that should be investigated more rigorously, etc.

Conclusion
– In this section, you need to state your position about what you gained in this assignment that can contribute to other readers.

Documentation Format:

· Typeface: Times New Roman. Boldface, italic & lines can be used for emphasizing and to enhance readability.

· Font size: 12 (except titles and headings).

· Margins: 1” from the left, right, top & bottom of the edges of the A4 paper.

· Spacing: 1.5 lines between texts of a paragraph.

· Alignment: Justify.

· Headers and footers can be used all pages must be numbered accordingly.

· Standard cover page as available in the learning management system

Answered Same DayJan 18, 2021

Sakshi · Accepted Answer

]Page 1
 predictive analytics of automobile insurance.
 Predictive Analytics of Automobile Insurance
Abstract
	For the past and coming decades automobile insurance would be the competitive business. The un predictable natural disasters made the humans life most complicated. The smart persons who have sensible to situations make their life easy by taking smarter decisions like insuring their possessions.  This study tends to analyse the purchasing behaviour of customers on buying insurance policy with different attributes given. Also this analytics intended to predict the given customer, will buy car insurance or not. The descriptive statistical analytical method with data mining techniques is used to analyse the data. The data is analysed in 3 phases with 6 stages. Data observation and data cleaning is performed in SAS studio. Exploratory analysis is done using tableau and SAS. Data transformation and feature engineering is done using data mining and statistical studies. The K-nearest neighbouring algorithm is used to predict the given test data. Using hive the hypothesis is formulated and results are generated.
Table of Contents
Abstract	2
Introduction	3
Related work	4
Phase 1: [using SAS studio]	7
Data observation	7
Data cleaning	8
Phase2: [using SAS and tableau]	8
EDA	8
Data transformation	8
Feature engineering	8
Phase 3: [SAS and Cloudera hadoop & hive]	8
Prediction &classification	8
Hypothesis	8
Discussion	8
i.	Initial data exploration	8
Data pre-processing/cleaning	11
Prediction/classification:	15
Loading on hdfs/hive	16
Hypothesis with Hivesql	16
Conclusion:	18
References	19
Introduction
	The automobile insurance business is one of the leading trends in insurance sectors. Identifying the best customers and retaining the good customers are challenging task to marketers. Predictive analysis is one of the best solutions to understand the customer’s behaviour using historical data and to predict the future customers.
	The goal of this study is, a comprehensive analysis of customer’s purchasing behaviour of automobile insurance and to predict the unknown customer’s buying behaviour. The car insurance data set is given to analyse the customer, will purchase car insurance or not. several interesting attributes are given with the data set, and it has been sliced as train and test data with 4000 and 1000 observations.
	First, the data is observed completely by loading and exploring the data with SAS studio, the number of attributes and their data types are addressed. Then the data set is examined for its inconsistency. The messy data is cleaned for further process, because the noisy data cannot produce proper results. The distribution of data is inspected and data transformation is dome to tame the data to avoid outliers.
	Feature engineering, feature extraction also done with the data set to address the imp actable variables on the outcome. The binary outcome attribute ‘car insurance’ is the target variable in this study. Exploratory analysis also prepared to know the relation between attributes with the target attribute. EDA is done using SAS and tableau. It revels several interesting correlation between the attributes. The hidden patterns are uncovered and the important features are addressed thru this data visualization
	The first part of sliced data is trained knn classification algorithm and the rest is tested and the outcomes are predicted. The cleaned and transposed data is used to train and test the model. and the model is evaluated with confusion matrix for  accuracy. The data set is loaded into hadoop distribution file system of Cloudera distribution.  the data is converted as table using queries. The simple and complex hypothesis is formed and results are scrutinized.
	Using statistical, data mining and exploratory analysis techniques the data is analysed and the test data is predicted and the prediction is calculated. The findings of hypothesis are very much useful to design the marketing strategies, to address the best customers and to sustain them for their revenue.
Related work
	The study of analysing “Predictive Analysis of Auto Insurance Purchasing Behavior” by “Roosevelt C. Mosley” proposes two major types of study. One is likelihood of customers; purchasing the quote another is likelihood of customers purchasing the policy. The outcomes helped to identify the potential of customers who buy maximum policy coverage, and designing marketing strategies. The proposed work is for analysing internet automobile insurance purchasing uses a dataset from ComScore.inc which provides  1 million internet users of US provide permission to track their online behaviour. Decision tree, Neural Network and linear regression are the predictive models used to predict the binary target variable. An ensemble method is created to calculate the accuracy. Univariate and multivariate exploratory analysis also did using SAS. The findings of study is ,The age of first driver from 25 to 45 have maximum likelihood of purchasing policy. Below 25 and above 45 has minimum likelihood. The status of education of customers, who had higher education also have maximum likelihood of policy purchase(Roosevelt C & Mosley J 2011)
	Kittipong Trongsawad and Jongsawas Chongwatpol of bankok analysed “Revaluating Policy and Claims Analytics”. The difference between behaviour of purchasing and claiming policies of corporate fleet and non-fleet customers is studied through this analytics. this study analyses how corporate companies sustain their best customers and policy claim and loss of claims. The SEMMA methodology used to analyse the two types of categorized data fleet and non-fleet with 25 variables. The target variable is claim_occured and sum_insured. The three main predicting methodologies of data mining , linear regression, decision tree,

Sun	Mon	Tue	Wed	Thu	Fri	Sat
30	31	1	2	3	4	5
6	7	8	9	10	11	12
13	14	15	16	17	18	19
20	21	22	23	24	25	26
27	28	29	30	1	2	3

LEARNING OUTCOMES 1. Evaluate the various data types, data storage systems and associated techniques for indexing and retrieving data. 2. Design feature engineering techniques to transform...

Answer To: LEARNING OUTCOMES 1. Evaluate the various data types, data storage systems and associated...

Answer To This Question Is Available To Download

Related Questions & Answers

Submit New Assignment