Hi, this is an Advanced Data Mining topic. There are two parts to this assignment and you can do either one. Refer to the lecture slides first; Topic 4 in the lecture slides will help you understand what to do in this assignment.
Assignment 1
The development of a data mining or rule visualisation routine

There are two parts to this assignment. You are required to answer EITHER part.

Part A
You will be provided with various sets of data for mining (and you can create your own). The assignment is to develop and implement a data mining algorithm (of any kind) such that:
• it does not already exist in any commercially available system (although a significant extension to one that does is acceptable),
• it is backed by appropriate research.
Documentation: a 3-5 page description relating the research behind the algorithm and a discussion of anything that is novel/useful about your algorithm. Note that I do not require "formal" documentation.

Part B
You will be provided with various rulesets that require appropriate visualisation tools. The assignment is to develop and implement a visualisation algorithm (of any kind) such that:
• it does not already exist in any commercially available system,
• it is backed by appropriate research.
Documentation: a 3-5 page description relating the research behind the algorithm and a discussion of anything that is novel/useful about your algorithm. Note that I do not require "formal" documentation.

Extensions
Other extensions (or undertaking both parts of the assignment!) will be looked on favourably, and marks will be awarded up to the maximum mark available for the assignment, i.e. a nice extension can make up for lost marks.

Marking Criteria for Both Parts
• Basic algorithm coded in any language: 16 marks
• Bonus for extensions: 4 marks
• Documentation: 10 marks

As far as the algorithm is concerned, you will be marked on the quality of your solution as follows:
a. computational complexity of your algorithm.
b. elegance of your programming.
c. accuracy and configurability (i.e. setting thresholds).

As far as the documentation is concerned, you will be marked on:
a. your research into methods available and the novelty of your solution.
b. your explanation of your algorithm.

Submission of Assignment
All Assignment 1 submissions should be zipped into an archive (using your favourite zip package) and uploaded to FLO. The archive should include everything: the documentation, the source, the executable and any test data you developed for yourself. Name the archive surname.zip, where surname is your surname.

Data Mining and Knowledge Discovery
COMP7707 Advanced Data Mining (and Knowledge Discovery)
Prof. John Roddick
[email protected]
With contributions from Aaron Ceglar, Carl Mooney and Mark Lethbridge.
[Image: naturally occurring cubic pyrite]
COMP7707 Advanced Data Mining, Semester 1, 2018
John F. Roddick, Flinders University
© 2018, Flinders University

Overview of Topic

Topics
Introduction
The Role of Common Sense
Trends in Information Management
Fundamental Ideas
Developing Data Mining Algorithms
Applications of Knowledge Discovery
Future Directions in DMKD
Data Mining Techniques
Association Rule Mining
Clustering Algorithms
Classification and Prediction
Sequential Pattern Mining
Text Mining
Higher Order Data Mining
Visualisation Techniques
Including Higher Semantics
Spatial Data Mining
Temporal and Longitudinal Data Mining
Interestingness
Web Mining
Knowledge Discovery
Ethics in Data Mining
Knowledge Discovery Frameworks

DMKD - the discipline
Data Mining and Knowledge Discovery is a merger of (at least) four disciplines:
• Database Systems: VLDB, data warehousing, data modelling, data semantics, …
• Artificial Intelligence: Decision Tree Induction, Clustering, Inductive Logic, …
• Statistics: Validity, Confidence, Autocorrelation, …
• Visualisation: Data Visualisation, Dimension Reduction, …

Where it fits in ICT
Database queries can be considered to confirm answers to fairly well-formed questions or provide simple answers to (relatively) simple questions.
Data Analysis is used to give answers to questions which might require some discussion or where the answer is at first vague.
Data Mining allows the question itself to be ill-formed: “Tell me something interesting about …”

Terminology
Data Mining is the term used to describe the algorithms/routines used to discover interesting aspects of a dataset.
Knowledge Discovery is the term used to describe the overarching discovery process.
The difference is similar to the difference between programming and software engineering.
The terminology is misused (and misappropriated) quite a bit.
DMKD is one of the hottest research topics to emerge in the database research area in some years.

Research Sources

Major Conferences
• ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, SIGKDD
• IEEE International Conference on Data Mining, ICDM
• European Conference on Principles of Data Mining and Knowledge Discovery, PKDD
• Pacific-Asia Conference on Knowledge Discovery and Data Mining, PAKDD
• SIAM International Conference on Data Mining
• International Conference on Data Warehousing and Knowledge Discovery, DaWaK
• … plus local conferences such as AusDM

Conferences that have many DMKD papers
• ACM SIGMOD International Conference on the Management of Data, SIGMOD
• International Conference on Information and Knowledge Management, CIKM
• International Conference on Very Large Data Bases, VLDB
• IEEE International Conference on Data Engineering, ICDE

Journals
• Data Mining and Knowledge Discovery, DMKD
• ACM Transactions on Knowledge Discovery from Data, TKDD
• ACM Transactions on Database Systems, TODS
• IEEE Transactions on Knowledge and Data Engineering, TKDE
• Knowledge and Information Systems, KAIS
• Data and Knowledge Engineering, DKE

About ADM - the topic
Knowledge of Database Systems, Artificial Intelligence, Statistics and Visualisation is not required for this topic. HOWEVER, if you find something a little difficult as a result of not having studied it, do read up on it. I will try to provide references.
Being such a new area, some of the subject matter will come direct from research material, i.e. do not expect to find all of the things we talk about implemented in commercial systems yet.
There is enormous scope to join the team at Flinders in doing postdoctoral, postgraduate or adjunct research.

Topic Organisation
SAM has important details - please read.
Assignments: I’ve kept it simple. You can do all of them and get the best of them - but be strategic.
Tutorial/Discussion Sessions: will start in week 3.

Topic Organisation 2
Timetable: Thursdays for 13 weeks
• Lectures: 3pm – 5pm, 1 hr 50 mins, Tonsley 1.03
• Tutorial (starting wk 3): noon – 1pm, 50 mins, Tonsley 1.14
Text Book: Tan, Steinbach and Kumar - worth the investment but not critical to buy
Other resources are available in various University libraries.

Assessment
Any two of…
• Assignment 1 - The development of a data mining or rule visualisation routine
• Assignment 2 - A research based paper
• Assignment 3 - A critique of a seminal DMKD paper

Topic 1
The Role of Common Sense

Benford’s Law
In 1938 Benford noticed that pages of logarithms corresponding to numbers starting with the numeral 1 were much dirtier than other pages.
The Theory … Ask anyone to choose numbers randomly and, over a largish number of numbers, there will be 1/9th starting with 1, 1/9th starting with 2, etc.
However, naturally occurring numbers do not follow this pattern. They generally have: 30% starting with 1, 18% starting with 2, etc.

Benford’s Law, cont.
We can therefore tell if something that was supposed to be naturally occurring has been faked. For example:
• the numbers in an audited set of accounts …
• random samples from a day's stock quotations,
• a tournament's tennis scores,
• the numbers on the front page of The New York Times,
• the populations of towns,
• the molecular weights of compounds,
• the half-lives of radioactive atoms…
Has been applied to:
• fraud cases in Brooklyn
• income tax fraud in California
(From "The First-Digit Phenomenon" by T. P. Hill, American Scientist, July-August 1998)

Topic 2
Trends in Information Management
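
Looking back at the Benford's Law slides in Topic 1: the quoted figures (about 30% of numbers starting with 1, about 18% starting with 2) agree with the usual statement of the law, in which the expected share of leading digit d is log10(1 + 1/d). The Python below is a minimal sketch of that check and is not part of the lecture material. The function names (`benford_expected`, `first_digit`, `first_digit_frequencies`, `max_deviation`) and the two test sequences are illustrative assumptions, and the code assumes the input contains at least one non-zero number.

```python
import math
from collections import Counter


def benford_expected():
    """Expected share of each leading digit d = 1..9: log10(1 + 1/d)."""
    return {d: math.log10(1 + 1 / d) for d in range(1, 10)}


def first_digit(x):
    """Leading non-zero decimal digit of a non-zero number."""
    # Scientific notation puts the leading digit first, e.g. '3.14...e+02'.
    return int(f"{abs(x):.16e}"[0])


def first_digit_frequencies(values):
    """Observed share of each leading digit, ignoring zeros."""
    digits = [first_digit(v) for v in values if v != 0]
    counts = Counter(digits)
    return {d: counts[d] / len(digits) for d in range(1, 10)}


def max_deviation(values):
    """Largest absolute gap between observed and expected digit shares."""
    expected = benford_expected()
    observed = first_digit_frequencies(values)
    return max(abs(observed[d] - expected[d]) for d in range(1, 10))


if __name__ == "__main__":
    # Hypothetical inputs chosen only for illustration.
    powers_of_two = [2 ** n for n in range(100)]  # tends to follow Benford's Law
    uniform_digits = list(range(100, 1000))       # each leading digit equally often
    print("powers of two:", max_deviation(powers_of_two))
    print("100..999:     ", max_deviation(uniform_digits))
```

The powers-of-two sequence should sit close to the expected shares, while the range 100 to 999 gives every leading digit a share of 1/9 and so deviates strongly. Any cut-off applied to `max_deviation` would be an arbitrary, configurable threshold; a formal goodness-of-fit test such as chi-square is the more usual choice when auditing real data.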