Assignment - Data Mining Practice and Analysis

I have uploaded the file.


Assignment - Data Mining Practice and Analysis
Due date: 5pm Wednesday 16 January 2019

This assignment can be worked on either as a group (two students at maximum) or as an individual. If you work as a group, group members must contribute equally to the group work, and all group members must participate in the presentation.

Aims
· Familiarise yourself with some well-known data mining techniques in order to understand their working principles;
· Apply data mining techniques to domain-specific datasets;
· Review cutting-edge data mining techniques to gain a good overview of current data mining technology.

Requirements (Tasks)
The whole task of this assignment consists of the following procedural steps.

Step 1
Set up (from your imagination of a realistic business situation, or from an actual analysis problem case) a scenario in which you are given a domain-specific dataset and asked to analyse it. The purpose of the analysis might be to understand (overview or learn about) the given data or to solve a specific analytical problem, depending on the scenario you made up.

Step 2
Find and obtain your own domain-specific dataset to fit the scenario you made up. The dataset can be unique or publicly available. Public datasets are available from:
· http://archive.ics.uci.edu/ml/
· http://service.re3data.org/search/
· https://dataverse.harvard.edu/
· http://catalog.data.gov/dataset
· http://dataportals.org/
· http://mldata.org/
· http://oad.simmons.edu/oadwiki/Data_repositories
· https://www.quandl.com/
· http://www.google.com/publicdata
· http://www.kdnuggets.com/datasets/index.html
· http://lib.stat.cmu.edu/datasets/
· http://webscope.sandbox.yahoo.com/catalog.php
· http://www.nlm.nih.gov/NIHbmic/nih_data_sharing_repositories.html
· https://github.com/caesar0301/awesome-public-datasets
· https://www.reddit.com/r/datasets/

Step 3
Choose appropriate data mining techniques (algorithms) – see more details for each option in Step 4 below.
** Note: The procedural order of the above three steps can be alternated. For example, you may find an interesting dataset first and then set up a specific data-mining scenario which fits the analysis of the chosen dataset. **

Step 4
You can select either of two options for this assignment.

· Option (1) – Programming-intensive Assignment
· Once you have your own domain-specific dataset and chosen data mining algorithm, you need to design and implement the chosen algorithm in your preferred programming language.
· A series of preprocessing steps will be required. The preprocessing procedure should be designed carefully (considering what kind of processing will be required, how, and why) to make your data ready to be fed to your program. Some parts of this preprocessing procedure can be included in your program as part of the "pre-data-mining module".
· Your final program must become a stand-alone data-mining tool designed for your own purpose of data analysis. It is expected that your program includes the following modules (and may include more sub-modules if needed):
1) pre-data-mining module – designed for the necessary preprocessing and for getting the data ready to be fed to the next module (data-mining module). You do not need to include all required pre-processing in this module; it is assumed that some initial preprocessing (e.g. cleaning noisy data) can be done externally using other software tools (e.g. Excel or Weka).
2) data-mining module – the chosen data mining algorithm is implemented here. You can directly borrow the algorithm from a popular existing data mining method, or you can design your own algorithm (by amending an existing one).
3) post-mining module – this module presents/reports the output produced by the previous modules. The result can be a simple text report or, additionally, a non-text visualisation (e.g. graph, chart or diagram).
· This programming-intensive assignment still requires an analysis. Try to find all the patterns you can detect with your implemented algorithm. Try to compare and contrast the result obtained with your chosen preprocessing scheme and algorithm against the result obtained with another existing algorithm or with other preprocessing methods. Note: in particular, for comparing the result of your program against an existing algorithm, you can use existing data mining tools (e.g. Weka) to obtain the result of the other algorithm.

· Option (2) – Analysis-intensive Assignment
· Once you have chosen your own domain-specific dataset, you need to design your own data-mining analysis scheme. This analysis scheme can consist of multiple procedural steps:
1) Set up a strategy for preprocessing your data. A series of preprocessing steps will be required and needs to be designed carefully (considering what kind of processing will be required, how, and why). You may include multiple different preprocessing schemes for the comparison analysis.
2) Set up a strategy for data mining. You need to select one data mining area (clustering, classification, association rules mining) of your choice and select AT LEAST TWO existing data mining algorithms in your chosen area. For example, if you chose clustering as your data mining area, you could apply two algorithms, DBSCAN and K-means, and compare the two results. Alternatively, you can design a combined algorithm which applies multiple algorithms from the same or different data mining areas in series. Your strategy can also be designed to apply different parameters to one algorithm, or to apply multiple preprocessing (attribute selection) schemes to one algorithm.
· You can choose one data mining tool (e.g. Weka) to analyse your chosen dataset. Apply the data-mining strategy you set up to your chosen (preprocessed) data using the data mining tool and try to find all the patterns you can detect.
· Do various comparison experiments, either by applying different data mining algorithms (or strategies) to the same chosen dataset or by applying the same algorithm to differently pre-processed datasets.
· Critically analyse the experimental results and discuss/demonstrate why a chosen algorithm (strategy) is superior or inferior to another algorithm (strategy).

Step 5
· You need to give an in-class presentation (15 minutes presentation + 5 minutes questions) based on your chosen algorithm (strategy) and experimental tests, and you also need to write a scientific paper as an experimental report.
· The presentation must generally include a good overview of your project, aims/objectives, reasons for your choices, a brief overview of the strategy/algorithm you chose, findings, a comparison including experimental results, and a conclusion.
· You need to write a research report paper of 10~15 pages minimum (for CP3300 students) or 15~20 pages (for CP5605/CP5634 students) on your project to summarise your algorithm and experimental results.
The report should contain all topics listed above for the presentation but in more detail. For CP5605/CP5634 students, you need to add one additional section to your report: a brief (mini) literature review of the data mining methods (strategy, algorithm and/or preprocessing methods) you chose for your project. Please refer to the following link if you need further ideas on a "literature review": http://www-public.jcu.edu.au/libcomp/assist/training/JCUPRD_026326
· The research paper must follow the generally accepted format of a research article, consisting of an introduction, related work (a brief review of the methodologies used), a summarised description of your experimental settings and procedures (description of data, justification of the chosen data mining area, justification of the chosen algorithm, preprocessing details, etc.), comparison, discussion, issues, conclusion, possible future work and a list of references. (You may add more sections if needed.)
· In addition to the general components listed above, the report for the "Programming-intensive option" should include a summary of your program (including the program structure, implementation details, a summarised algorithm for the main modules, etc., including code if necessary).
· For the "Analysis-intensive option", you are required to include a more in-depth analysis of the investigation and experimental comparison made through the project.

Submission
· Due date for the report submission: Wednesday 16 January 2019 (in week 9)
· Presentation: before the report is due, usually during week 8.
· You need to submit your final report as a single document file (MS Word or PDF format) to LearnJCU.
· For the "Programming-intensive option", you need to submit the source code and executable file of your program along with your report. Please make a zip file including all necessary files (report document and program files).

Useful Links
· http://www.kdnuggets.com/
· http://www.cs.waikato.ac.nz/ml/weka/
· http://mlearn.ics.uci.edu/MLRepository.html
· http://kdd.ics.uci.edu/
· http://www.sigkdd.org/
Writing Skills: http://www-public.jcu.edu.au/learningskills/resources/wsonline/
Scientific Report Writing: http://unilearning.uow.edu.au/report/2b.html and http://writing.wisc.edu/Handbook/ScienceReport.html, and more on the Web.
Answered same day: Jan 23, 2021 (CP5634)

Answer To: Assignment - Data Mining Practice and Analysis

Sundeep answered on Jan 25 2021
Student Name:
Course ID:
Assessor Name:
Submission Date:
PROBLEM STATEMENT
1) To understand the major factors that shape the behaviour of young consumers towards shampoo purchases
2) To identify the different customer segments for the shampoo category
RESEARCH METHODOLOGY
i) Secondary research:
Secondary research was performed to understand the factors and variables on which the sales and revenue of shampoo products depend.
Twenty-six variables were identified, and these were taken as the basis for the rest of the research. Some of the important variables and their descriptions are given below:
· Hair cleansing: Most customers agreed that hair cleansing is the main purpose of using a shampoo, and they choose their shampoo mainly on this basis.
· Hair moisturizing: Many feel that moisturization of the scalp and hair is very important, and if a certain shampoo provides that, they prefer it.
· Silky hair: Some customers like to be able to run their hands through their hair, so if a shampoo can provide that feel, they prefer it.
· Removes dandruff: Customers see dandruff as a major hair problem, and a good shampoo is one which can remove dandruff properly.
· Reduces hair breakage: With the changing water in Ghaziabad, many customers are facing hair breakage issues, so they choose a shampoo or hair-care product which can help them with that.
· Nourishes hair: Some customers mentioned that they consider hair nourishment while choosing a shampoo and look for protein supplements, creatine, etc. in the shampoo.
· Doesn't contain chemicals: Customers understand that their shampoos contain chemicals, but they are looking for a chemical-free yet effective shampoo for future usage.
· Easily available: Availability is a very important factor, and its absence leads to brand switching.
· Good fragrance: Some customers mention that how their hair smells after a wash is important to them, but some pointed out that too strong a smell of natural ingredients is not good.
ii) Primary research:
Primary research was performed to understand actual market sentiment towards shampoo products as well as consumer behaviour around the purchase decision for the shampoo product category. The challenge of bias in primary research was handled by adopting a stratified sampling technique to avoid any demographic or behavioural bias (a sampling sketch follows the list below).
Sample size: 112
Questions asked: 36
· 10 questions were asked to capture the customers' personal and mandatory demographic details.
· 26 questions were framed to understand consumer behaviour in terms of the variables identified at the beginning. The responses to these 26 questions allow us to apply marketing analytics concepts and tools and to make recommendations on the business problems.
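The report does not show how the stratified sample was actually drawn. Purely as an illustrative sketch, the following Python snippet shows one way a proportional stratified sample of 112 respondents could be drawn with pandas; the file name survey_frame.csv and the stratum column age_group are hypothetical placeholders, not names taken from the actual study.

```python
import pandas as pd

# Hypothetical sampling frame: one row per potential respondent.
# The file name and the "age_group" stratum column are illustrative assumptions.
frame = pd.read_csv("survey_frame.csv")

target_n = 112                      # sample size used in this study
fraction = target_n / len(frame)    # proportional allocation across strata

# Draw the same fraction from every stratum so that demographic shares in the
# sample mirror the shares in the frame, limiting demographic bias.
sample = frame.groupby("age_group", group_keys=False).sample(
    frac=fraction, random_state=42
)

# Sanity check: stratum proportions in the sample vs. the frame.
print(sample["age_group"].value_counts(normalize=True))
print(frame["age_group"].value_counts(normalize=True))
```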
RESPONDENTS PROFILE
TOOLS USED FOR MARKETING ANALYTICS
· Factor Analysis: 26 independent variables were identified through the secondary research, but to apply marketing analytics concepts and run the further tests needed to reach a decision point, it is important to group together variables that are closely correlated with each other. Factor analysis captures these correlations and combines different variables that indicate the same underlying trend.
· Cluster Analysis: Cluster analysis is used to identify cases that are similar to one another, i.e. similar kinds of respondents, which are eventually called clusters. Clusters are formed on the basis of similarity across factors between two or more respondents indicating a similar trend. (A rough Python equivalent of both steps is sketched after this list.)
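The analysis in this report was run in SPSS; as an illustrative sketch only (not the author's actual procedure), the two tools above could be reproduced in Python roughly as follows. The file name shampoo_survey.csv, the column names v1…v26, and the choices of six factors and four clusters are assumptions made for the example.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import FactorAnalysis
from sklearn.cluster import KMeans

# Assumed layout: 26 Likert-scale response columns v1..v26, one row per respondent.
responses = pd.read_csv("shampoo_survey.csv")
items = [f"v{i}" for i in range(1, 27)]
X = StandardScaler().fit_transform(responses[items])

# Factor analysis: compress the 26 correlated variables into a few latent factors.
fa = FactorAnalysis(n_components=6, rotation="varimax", random_state=42)
factor_scores = fa.fit_transform(X)                   # respondents x factors
loadings = pd.DataFrame(fa.components_.T, index=items)
print(loadings.round(2))                              # inspect loadings to label factors

# Cluster analysis: group respondents with similar factor scores into segments.
km = KMeans(n_clusters=4, n_init=10, random_state=42)
responses["segment"] = km.fit_predict(factor_scores)
print(responses["segment"].value_counts())            # segment sizes
```

Varimax rotation is a common choice because it makes loadings easier to label; in practice the number of factors and clusters would be checked with scree plots or silhouette scores rather than fixed up front.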
ANALYSIS
The following steps were followed for the research:
Step 1: Missing value analysis
Step 2: Outlier detection
Step 3: Factor analysis
Step 4: Factor labelling
Step 5: Cluster analysis
Step 6: Understanding clusters and defining segments
Step 7: Recommendations
MISSING VALUE ANALYSIS
** Note: refer to the 'Missing value' column of the output and check both the absolute number and the percentage of missing values. **
No missing values were identified after running this test, which suggests that every respondent filled in the form thoroughly. Had any missing values been identified, we would have treated them either by imputing values or by deleting the cases with missing values, in order to prepare a clean dataset for further analysis.
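For readers who want to reproduce this check outside SPSS, a minimal pandas sketch is shown below; the file name shampoo_survey.csv is an assumed placeholder.

```python
import pandas as pd

responses = pd.read_csv("shampoo_survey.csv")   # assumed file name

# Absolute count and percentage of missing values per variable.
missing = pd.DataFrame({
    "missing_count": responses.isna().sum(),
    "missing_pct": (responses.isna().mean() * 100).round(2),
})

# Only variables with at least one missing value are printed;
# an empty result reproduces the "no missing values" finding above.
print(missing[missing["missing_count"] > 0])
```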
Outlier detection
Outlier detection is used to identify disturbing or abnormal values or cases collected during the survey. Outliers may occur because of genuinely atypical respondents, but in most cases they are due to misprints or errors while collecting data.
Outlier detection is important because outliers can distort the subsequent test techniques, so they need to be dealt with, either by removing them completely from the dataset considered for the analysis or by replacing them with plausible values, to keep the tests well behaved.
The Mahalanobis test computes the Mahalanobis distance of each case, which indicates whether a particular response lies beyond the normal range of the data and may affect the overall analysis.
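The screening described here was done in SPSS. As a rough sketch only, using assumed file and column names and the p < 0.001 cut-off commonly applied to Mahalanobis distances, the same flagging could be reproduced in Python as follows.

```python
import numpy as np
import pandas as pd
from scipy.stats import chi2

# Assumed layout: 26 numeric response columns v1..v26, one row per respondent.
responses = pd.read_csv("shampoo_survey.csv")
X = responses[[f"v{i}" for i in range(1, 27)]].to_numpy(dtype=float)

# Squared Mahalanobis distance of each case from the multivariate mean.
diff = X - X.mean(axis=0)
cov_inv = np.linalg.pinv(np.cov(X, rowvar=False))
d2 = np.einsum("ij,jk,ik->i", diff, cov_inv, diff)

# Under multivariate normality, d2 follows a chi-square distribution with
# p = 26 degrees of freedom; a right-tail probability below 0.001 flags a case.
p_sig = chi2.sf(d2, df=X.shape[1])
outliers = responses.index[p_sig < 0.001]
print(f"{len(outliers)} potential outlier cases: {list(outliers)}")
```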
The SPSS output of the Mahalanobis test (screenshot not reproduced here) lists each case with its distance and significance; we need to find the cases where psig < 0.001.
We identified 4 customers with a psig value below 0.001. Such cases can create problems for the usual analytical tools, but with only 4 customers lying in the outlier range we chose not to delete them from the dataset, because these four customers might form a fresh segment of...
