Assignment 1 The development of a data mining or rule visualisation routine There are two parts to this assignment. You are required to answer EITHER part. Part A You will be provided with various...

Hi, this is an advanced data mining topic. There are two parts to this assignment and you can do either one. Refer to the lecture slides first; Topic 4 in the lecture slides will help you with what to do in this assignment.


Assignment 1
The development of a data mining or rule visualisation routine

There are two parts to this assignment. You are required to answer EITHER part.

Part A
You will be provided with various sets of data for mining (and you can create your own). The assignment is to develop and implement a data mining algorithm (of any kind) such that:
• it does not already exist in any commercially available system (although a significant extension to one that does is acceptable),
• it is backed by appropriate research.
Documentation: a 3-5 page description relating the research behind the algorithm and a discussion of anything that is novel/useful about your algorithm. Note that I do not require "formal" documentation.

Part B
You will be provided with various rulesets that require appropriate visualisation tools. The assignment is to develop and implement a visualisation algorithm (of any kind) such that:
• it does not already exist in any commercially available system,
• it is backed by appropriate research.
Documentation: a 3-5 page description relating the research behind the algorithm and a discussion of anything that is novel/useful about your algorithm. Note that I do not require "formal" documentation.

Extensions
Other extensions (or undertaking both parts of the assignment!) would be looked on favourably and marks will be awarded up to the maximum mark available for the assignment, i.e. a nice extension can make up for lost marks.

Marking Criteria for Both Parts
Basic algorithm coded in any language: 16 marks.
Bonus for extensions: 4 marks.
Documentation: 10 marks.

As far as the algorithm is concerned, you will be marked on the quality of your solution as follows:
 a. computational complexity of your algorithm.
 b. elegance of your programming.
 c. accuracy and configurability (i.e. setting thresholds).

As far as the documentation is concerned, you will be marked on:
 a. your research into methods available and the novelty of your solution.
 b. your explanation of your algorithm.

Submission of Assignment
All Assignment 1 submissions should be zipped into an archive (using your favourite zip package) and uploaded to FLO. The archive should include everything: the documentation, the source, the executable and any test data you developed for yourself. Name the archive surname.zip, where surname is your surname.

Data Mining and Knowledge Discovery
COMP7707 Advanced Data Mining (and Knowledge Discovery)
Prof. John Roddick [email protected]
With contributions from Aaron Ceglar, Carl Mooney and Mark Lethbridge.
Naturally occurring Cubic Pyrite
COMP7707 Advanced Data Mining, Semester 1, 2018
John F. Roddick, Flinders University
© 2018, Flinders University

Overview of Topic

Topics
• Introduction
• The Role of Common Sense
• Trends in Information Management
• Fundamental Ideas
• Developing Data Mining Algorithms
• Applications of Knowledge Discovery
• Future Directions in DMKD
• Data Mining Techniques
• Association Rule Mining
• Clustering Algorithms
• Classification and Prediction
• Sequential Pattern Mining
• Text Mining
• Higher Order Data Mining
• Visualisation Techniques
• Including Higher Semantics
• Spatial Data Mining
• Temporal and Longitudinal Data Mining
• Interestingness
• Web Mining
• Knowledge Discovery
• Ethics in Data Mining
• Knowledge Discovery Frameworks

DMKD - the discipline
A merger of (at least) four disciplines.
• Database Systems: VLDB, data warehousing, data modelling, data semantics, …
• Artificial Intelligence: Decision Tree Induction, Clustering, Inductive Logic, …
• Statistics: Validity, Confidence, Autocorrelation, …
• Visualisation: Data Visualisation, Dimension Reduction, …

Where it fits in ICT
• Database queries can be considered to confirm answers to fairly well formed questions or provide simple answers to (relatively) simple questions.
• Data Analysis is used to give answers to questions which might require some discussion or where the answer is at first vague.
• Data Mining allows the question itself to be ill-formed: “Tell me something interesting about …”

Terminology
• Data Mining is the term used to describe the algorithms/routines used to discover interesting aspects about a dataset.
• Knowledge Discovery is the term used to describe the overarching discovery process.
• The difference is similar to the difference between programming and software engineering.
• The terminology is misused (and misappropriated) quite a bit.
• DMKD is one of the hottest research topics to emerge in the database research area in some years.

Research Sources
Major Conferences
• ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, SIGKDD
• IEEE International Conference on Data Mining, ICDM
• European Conference on Principles of Data Mining and Knowledge Discovery, PKDD
• Pacific-Asia Conference on Knowledge Discovery and Data Mining, PAKDD
• SIAM International Conference on Data Mining
• International Conference on Data Warehousing and Knowledge Discovery, DaWaK
• … plus local conferences such as AusDM
Conferences that have many DMKD papers
• ACM SIGMOD International Conference on the Management of Data, SIGMOD
• International Conference on Information and Knowledge Management, CIKM
• International Conference on Very Large Data Bases, VLDB
• IEEE International Conference on Data Engineering, ICDE
Journals
• Data Mining and Knowledge Discovery, DMKD
• ACM Transactions on Knowledge Discovery from Data, TOKDD
• ACM Transactions on Database Systems, TODS
• IEEE Transactions on Knowledge and Data Engineering, TKDE
• Knowledge and Information Systems, KAIS
• Data and Knowledge Engineering, DKE

About ADM - the topic
• Knowledge of Database Systems, Artificial Intelligence, Statistics and Visualisation is not required for this topic.
• HOWEVER, if you find something a little difficult as a result of not having studied it, do read up on it. I will try and provide references.
• Being such a new area, some of the subject matter will come direct from research material, i.e. do not expect to find all of the things we talk about implemented in commercial systems yet.
• Enormous scope to join the team at Flinders in doing postdoctoral, postgraduate or adjunct research.

Topic Organisation
• SAM has important details - please read
• Assignments: I’ve kept it simple. You can do all of them and get best of them - but be strategic.
• Tutorial/Discussion Sessions: will start in week 3

Topic Organisation 2
Timetable: Thursdays for 13 weeks
• Lectures: 3pm – 5pm, 1 hr 50 mins, Tonsley 1.03
• Tutorial (starting wk 3): noon – 1pm, 50 mins, Tonsley 1.14
Text Book
• Tan, Steinbach and Kumar - worth the investment but not critical to buy
• Other resources available in various University libraries

Assessment
Any two of…
• Assignment 1 - The development of a data mining or rule visualisation routine
• Assignment 2 - A research based paper
• Assignment 3 - A critique of a seminal DMKD paper

Topic 1
The Role of Common Sense

Benford’s Law
In 1938 Benford noticed that pages of logarithms corresponding to numbers starting with the numeral 1 were much dirtier than other pages.
The Theory …
Ask anyone to choose numbers randomly and, over a largish number of numbers, there will be 1/9th starting with 1, 1/9th starting with 2, etc.
However, naturally occurring numbers do not follow this pattern. They generally have: 30% starting with 1, 18% starting with 2, etc.

Benford’s Law, cont.
We can therefore tell if something that was supposed to be naturally occurring has been faked.
For example, the numbers in an audited set of accounts … random samples from a day's stock quotations, a tournament's tennis scores, the numbers on the front page of The New York Times, the populations of towns, the molecular weights of compounds, the half-lives of radioactive atoms…
Has been applied to:
• fraud cases in Brooklyn
• income tax fraud in California
(From "The First-Digit Phenomenon" by T. P. Hill, American Scientist, July-August 1998)

Topic 2
Trends in Information Management
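
Returning to Benford's Law from Topic 1: the "30% starting with 1, 18% starting with 2" figures match the usual statement of the law, P(d) = log10(1 + 1/d) for a leading digit d. The sketch below is illustrative only and is not taken from the slides. The function names and the sample data (the first 200 powers of 2) are assumptions chosen for the demonstration; a real audit would substitute the figures under suspicion.

```python
import math
from collections import Counter

def benford_expected(d):
    """Expected share of values whose leading digit is d (1..9) under Benford's Law."""
    return math.log10(1 + 1 / d)

def leading_digit(x):
    """Return the first significant digit of a non-zero number."""
    if isinstance(x, int):
        return int(str(abs(x))[0])
    # Scientific notation puts the first significant digit first, e.g. '4.2...e-03'
    return int(f"{abs(x):.15e}"[0])

def leading_digit_shares(values):
    """Observed share of each leading digit 1..9 in values (zeros are skipped)."""
    digits = [leading_digit(v) for v in values if v != 0]
    counts = Counter(digits)
    return {d: counts[d] / len(digits) for d in range(1, 10)}

# Hypothetical sample (an assumption, not data from the lecture): powers of 2
sample = [2 ** n for n in range(200)]
observed = leading_digit_shares(sample)

# Print expected versus observed share for each leading digit
for d in range(1, 10):
    print(d, round(benford_expected(d), 3), round(observed[d], 3))
```

A large gap between the expected and observed columns is only a prompt for closer inspection; on its own it does not show that the numbers were faked.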
Answered Same Day | May 09, 2020 | COMP7707 | Flinders University

Answer To: Assignment 1 The development of a data mining or rule visualisation routine There are two parts to...

Abr Writing answered on May 20 2020
147 Votes
DataMining
May 25, 2018
Decision trees are very often used for prediction tasks and are extremely useful for the following
reasons:
1. Decision trees perform feature selection implicitly. Feature selection is one of the most
important tasks in data analysis. When we fit a decision tree classifier to a dataset, it becomes
very easy to identify the most important features in the data from the top few nodes. The higher
a node is in the hierarchy, the more important its feature and the better its power to split
the data and perform the classification task. We described here why feature selection
is important in analytics.
2. Decision tree classifiers need less data preparation from users than other classifiers. Data
normalization and transformation are not necessary for a decision tree, because the structure of
the tree stays the same under any order-preserving rescaling of a feature. For example, if we
have to predict the passenger fare from the different features available in the Titanic dataset,
we can fit a regression model and then interpret the slopes/coefficients of the resulting model,
but such a fit often calls for some form of normalization or scaling of the data. In addition,
in many implementations missing data points do not stop the decision tree from being built or
from splitting the training data, and outliers make little difference to a decision tree,
unlike other models such as regression.
3. The performance of a decision tree classifier does not depend on the relationships in the data
being linear. In some simple models, such as linear regression, nonlinear relationships can make
the model invalid. However, decision trees do not require any assumption of linearity in the data.
4. Decision trees are easy to interpret and explain. Decision trees are very intuitive and easy
to explain.
Despite these benefits, it is important to recognise the limitations of decision trees:
without pruning or limiting tree growth, they tend to overfit the training
data, which can be very harmful. The algorithm just grows the tree top-down. It looks
at all the variables in the input...
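
The top-down growth step and the claim that rescaling does not change the tree can be illustrated with a small sketch. It is not the answer's own implementation: the answer does not say which split criterion it uses, so Gini impurity is assumed here, and the six (age, fare) rows with their labels are made-up values for illustration, not real Titanic records.

```python
def gini(labels):
    """Gini impurity of a list of class labels (0.0 means the node is pure)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))

def best_split(rows, labels):
    """Try every feature and every midpoint threshold; return the split
    (feature_index, threshold, weighted_gini) with the lowest impurity."""
    best = None
    n = len(rows)
    for f in range(len(rows[0])):
        values = sorted(set(r[f] for r in rows))
        for lo, hi in zip(values, values[1:]):
            t = (lo + hi) / 2
            left = [y for r, y in zip(rows, labels) if r[f] <= t]
            right = [y for r, y in zip(rows, labels) if r[f] > t]
            score = (len(left) * gini(left) + len(right) * gini(right)) / n
            if best is None or score < best[2]:
                best = (f, t, score)
    return best

def left_indices(rows, split):
    """Indices of the rows that the split sends to the left child."""
    f, t, _ = split
    return [i for i, r in enumerate(rows) if r[f] <= t]

# Hypothetical toy data: (age, fare) per passenger, label 1 = survived
rows = [(22, 7.25), (38, 71.28), (26, 7.92), (35, 53.10), (35, 8.05), (54, 51.86)]
labels = [0, 1, 0, 1, 0, 1]

# Rescale both features; the ordering of values within each feature is unchanged
scaled_rows = [(age * 0.01, fare * 1000) for age, fare in rows]

split_raw = best_split(rows, labels)
split_scaled = best_split(scaled_rows, labels)
print(split_raw[0], left_indices(rows, split_raw))
print(split_scaled[0], left_indices(scaled_rows, split_scaled))
```

Both calls choose the same feature and send the same rows to each side, which is the sense in which normalization leaves the tree structure unchanged. A full tree would apply best_split recursively to each child and stop at a configurable depth or minimum node size, which is where the overfitting control mentioned above comes in.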