Lecture Recording URL: https://ucidce.zoom.us/rec/share/-tJOIKHfykJOSIXf6h7DQYQ7Rp73eaa81HUb8_MOxI7uyzsycMubf064CsL-1TM

Day 6: Assignment, Ensembles, Decision Trees, and Trading Systems

Instructions
This assignment looks at using k-nearest neighbors to create a simple recommendation engine. Homework steps:
· Open the homework notebook link: LINK TO NOTEBOOK
· Save a copy to your Google Drive
· Answer the questions in the notebook copy with your code and written answers
· Set sharing to "Anyone with a link can view"
· Save the notebook and submit the link

Day 6: Content Overview
In information-based modeling, we again utilize the structure of past data to build models for regression and classification problems. In this module, we cover decision trees, a modeling method based on information gain. The resulting model is a tree structure built from the actual values of attributes in the data, and it is a big favorite in machine learning because of its readability. Ensemble methods are also touched upon in this module as a way to combine the modeling power of multiple models.

Readings and Media
· Class slides: Information-based Modeling
· Modeling Methods, Deploying, and Refining Predictive Models

Modeling Methods, Deploying, and Refining Predictive Models
UCI Spring 2020, Class 6: Information-based Modeling

Schedule
· Introduction and Overview
· Data and Modeling + Simulation Modeling
· Error-based Modeling
· Probability-based Modeling
· Similarity-based Modeling
· Information-based Modeling
· Time-series Modeling
· Deployment

At the end of this module, you will learn how to build decision trees and ensembles for regression and classification.

Supervised Methods
· Error-based
· Similarity-based
· Information-based
· Probability-based
· Neural networks and deep learning-based methods
· Ensembles

Today's Objectives
· Information-based Modeling
· Decision Trees
· Ensembles

Information-based Algorithms
Models based on information gain in data sets, such as decision trees. Decision tree methods construct a model of decisions made based on the actual values of attributes in the data. Decisions fork in tree structures until a prediction is made for a given record. Decision trees are trained on data for classification and regression problems, are often fast and accurate, and are a big favorite in machine learning.
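As a quick preview (not part of the original slides), here is a minimal sketch of training a decision tree with scikit-learn, whose DecisionTreeClassifier follows a CART-style algorithm; the iris dataset and the depth cap are my own illustrative choices.

```python
# A minimal sketch, assuming scikit-learn and its bundled iris dataset.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.3, random_state=42)

# Depth is capped only so the printed tree stays readable
tree = DecisionTreeClassifier(max_depth=3, random_state=42)
tree.fit(X_train, y_train)

print("Test accuracy:", tree.score(X_test, y_test))
# Printing the learned decision rules is one reason trees are popular
# when communicability is a priority
print(export_text(tree, feature_names=iris.feature_names))
```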
The most popular decision tree algorithms are:
· Classification and Regression Tree (CART)
· Iterative Dichotomiser 3 (ID3)
· C4.5 and C5.0 (different versions of a powerful approach)
· Chi-squared Automatic Interaction Detection (CHAID)
· Decision Stump
· M5
· Conditional Decision Trees

Today's Objectives
· Information-based Modeling
· Decision Trees
· Ensembles

Decision Trees
Robust and intuitive predictive models when the target attribute is categorical in nature and when the data set is of mixed data types. Unlike more numerical methods, decision trees are better at handling attributes that have missing or inconsistent values. Decision trees tell the user what is predicted, how confident that prediction can be, and how we arrived at that prediction. They are a popular method when communicability is a priority, and they are computationally efficient.

Applications
· Medicine: used for diagnosis in numerous specialties
· Financial analysis: credit risk modeling
· Internet routing: used in routing tables to find the next router to handle a packet based on the prefix sequence of bits
· Computer vision: tree-based classification for recognizing 3D objects
· Many more…

An example of a Decision Tree developed in RapidMiner
Decision trees are made of nodes and leaves that represent the best predictor attributes in a data set.

Elements of a decision tree
[Figure: a spam/legit decision tree. The root node tests "Contains images", internal nodes test "Suspicious words" and "Unknown sender", and true/false branches form decision paths ending in leaf nodes labeled spam or legit. The figure labels the root node, internal nodes, leaf nodes, depth, and a decision path.]

The ABT for decision trees
[Table: the analytics base table has m descriptive features (Descriptive Feature 1 … Descriptive Feature m) and a target feature; each row is an observation (Obs 1 … Obs n) with its target value.]
The feature space can be categorical, numeric, or mixed; the ABT just represents sets, and the heterogeneity of those sets is their entropy. The target can be numeric or categorical.

Shannon's entropy model and cards
[Figure: card-drawing examples with entropies of 0.0, 0.81, 1.0, 1.50, 1.58, and 3.58 bits. Entropy increases as uncertainty about the drawn card increases.]

Shannon's Model of Entropy
The cornerstone of modern information theory, it measures the heterogeneity of a set and is defined as

H(d, S) = -\sum_{l=1}^{L} P(d = l) \log_s P(d = l)

where P(d = l) is the probability of randomly selecting an element d of type l, L is the number of different types of d in the set S, and s is an arbitrary base; for information modeling, base 2 is used so that entropy is measured in bits.

The ABT for decision trees
Our dataset is D. We can partition D by a descriptive feature d, producing one partition D_{d=l} for each of the levels l that d can take. Each partition reduces the entropy in the set; the difference is the information gain.
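To make Shannon's model concrete, here is a short sketch (not from the slides) of an entropy helper in base 2; the function name `entropy` and the card-style example sets are illustrative assumptions, not course code.

```python
import math
from collections import Counter

def entropy(labels, base=2):
    """Shannon entropy of a collection of labels (bits when base=2)."""
    counts = Counter(labels)
    n = len(labels)
    return -sum((c / n) * math.log(c / n, base) for c in counts.values())

# Illustrative sets, echoing the card slides:
print(entropy(["spade"] * 12))               # a pure set        -> 0.0 bits
print(entropy(["red"] * 6 + ["black"] * 6))  # a 50/50 split     -> 1.0 bit
print(entropy(list(range(12))))              # 12 distinct cards -> ~3.58 bits
```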
Entropy for our dataset
Levels(Y) is the set of levels in the domain of the target feature Y, l is a value in Levels(Y), and there are L levels in total. The entropy of the dataset D with respect to the target feature is

H(Y, D) = -\sum_{l \in Levels(Y)} P(Y = l) \log_2 P(Y = l)

Remaining entropy in a partitioned dataset
The entropy remaining when we partition the dataset using a descriptive feature d is the weighted sum of the entropies of the partitions:

rem(d, D) = \sum_{l \in Levels(d)} \frac{|D_{d=l}|}{|D|} \, H(Y, D_{d=l})

Information gain
The information gained by splitting the dataset using the feature d is:

IG(d, D) = H(Y, D) - rem(d, D)

Decision tree process
1. Compute the entropy of the original dataset with respect to the target feature. This gives us a measure of how much information is required to organize the dataset into pure sets, which relates to the heterogeneity, or entropy, of the set.
2. For each descriptive feature, create the sets that result from partitioning the instances in the dataset by their feature values, and then sum the entropy scores of each of these sets. This is the entropy remaining in the partitioned sets: the information still required to organize the instances into pure sets after we have split them using the descriptive feature.
3. Subtract the remaining entropy value from the original entropy value to compute the information gain.

Implementation
The Iterative Dichotomiser 3 (ID3) algorithm is one of the most popular approaches. It performs top-down, recursive, depth-first partitioning, beginning at the root node and finishing at the leaf nodes. It assumes categorical features and clean data, but it can be extended to handle numeric features and targets as well as noisy data via thresholding and pruning.
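The following sketch (my illustration, not the course's reference implementation) applies the three-step process above to a tiny, made-up spam ABT: it computes IG(d, D) for each categorical feature and picks the split that an ID3-style algorithm would make at the root. The entropy helper is repeated so the block is self-contained; all feature names and rows are hypothetical.

```python
import math
from collections import Counter, defaultdict

def entropy(labels, base=2):
    counts = Counter(labels)
    n = len(labels)
    return -sum((c / n) * math.log(c / n, base) for c in counts.values())

def information_gain(rows, feature, target):
    """IG(d, D) = H(Y, D) - rem(d, D) for a categorical feature."""
    total_entropy = entropy([r[target] for r in rows])
    partitions = defaultdict(list)
    for r in rows:
        partitions[r[feature]].append(r[target])
    remaining = sum(len(part) / len(rows) * entropy(part)
                    for part in partitions.values())
    return total_entropy - remaining

# Hypothetical spam-detection ABT, echoing the earlier tree example
rows = [
    {"contains_images": True,  "unknown_sender": True,  "label": "spam"},
    {"contains_images": True,  "unknown_sender": False, "label": "spam"},
    {"contains_images": False, "unknown_sender": True,  "label": "spam"},
    {"contains_images": False, "unknown_sender": False, "label": "legit"},
    {"contains_images": False, "unknown_sender": False, "label": "legit"},
    {"contains_images": True,  "unknown_sender": False, "label": "legit"},
]

features = ["contains_images", "unknown_sender"]
for f in features:
    print(f, round(information_gain(rows, f, "label"), 3))
best = max(features, key=lambda f: information_gain(rows, f, "label"))
print("The root would split on:", best)
```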
Today's Objectives
· Information-based Modeling
· Decision Trees
· Ensembles

Ensembles
Instead of focusing on a single model for prediction, what if we generate a set of independent models, aggregate them, and compose their outputs?

Ensemble properties
Build multiple independent models from the same dataset, where each model uses a modified subset of the dataset. Make a prediction by aggregating the predictions of the different models in the ensemble: for categorical targets, this can be done using voting mechanisms; for numeric targets, this can be done using a measure of central tendency such as the mean or median.

Boosting
Boosting uses increasing repetitions to target weak performance.
Step 1: Use a weighted dataset where each instance has an associated weight. Initially, distribute the weights uniformly to all instances. Sample over this weighted set to create a replicated training set, create a model using the replicated training set, and find the total error in the set of predictions made by the model.
Step 2: Increase the weight of the misclassified instances and decrease the weight of the correctly classified instances; the number of times an instance is replicated is proportional to its weight. Calculate a confidence measure for the model based on its error. This is used to weight the predictions from the models.
Step 3: Make a prediction using the weighted models: for categorical targets, this can be done using voting mechanisms; for numeric targets, this can be done using a measure of central tendency such as the mean or median.
[Figure: each boosting round produces a model with its own predictions, error rate, and confidence measure; misclassified instances are replicated more heavily in the next training set.]

Bagging (bootstrap aggregating)
Bagging and subspace sampling: bagging is another method of generating ensembles. Random samples the same size as the dataset are drawn with replacement from the dataset; these are the bootstrap samples. For each of the bootstrap samples, we create a model. Because the models are trained on datasets sampled with replacement, there will be duplicate and missing instances in each training set, which produces many different models. This is called subspace sampling.
[Figure: bootstrap samples drawn from the dataset, each used to train a separate decision tree.]

Random Forest
The ensemble of decision trees resulting from subspace sampling is referred to as a Random Forest. The ensemble makes predictions by returning the majority vote, or the median for continuous targets.

Boosting vs. Bagging
Which method is preferred is a matter of experimentation. Typically, boosting exhibits a tendency toward overfitting when there is a large number of features.

Review of topics
Information-based Modeling
· Decision Trees: entropy, information gain, categorical and numeric prediction
· Ensembles: boosting, bagging

Comparison of ABT/Feature matrix concepts
Error-based, probability-based, similarity-based, or information-based: we need to have an Analytics Base Table (ABT) before we can model anything.

The ABT and the Model
[Table: the ABT has m descriptive features plus a categorical target feature; each row is an observation (Obs 1 … Obs n) with its target value.]
The existence of a target feature automatically makes the modeling problem supervised. The data types of the features restrict which models can be used. The dataset characteristics may restrict the resolution of the model, force you to make assumptions, or require modeling for imputation, de-noising, data generation, etc.

Understanding and manipulating feature spaces is the key to data analytics
The n-dimensional vector space representation of language produces an incredible ability to perform word-vector arithmetic. [Image source: Deep Learning Illustrated by Krohn]

The ABT/Feature space
The ABT/feature space representation is nothing more than an n-dimensional matrix. Modeling methods are just different ways to perform statistical, mathematical, or even heuristic operations on that matrix.
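As a closing illustration of that point (a sketch under my own assumptions, not material from the slides), the same feature matrix can be handed to a single information-based model, a bagging-style ensemble, and a boosting-style ensemble from scikit-learn; the dataset and parameters below are illustrative, not tuned.

```python
# One ABT/feature matrix (X, y), three modeling methods operating on it.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier

X, y = load_breast_cancer(return_X_y=True)   # one feature matrix, one target

models = {
    "single decision tree (information-based)":
        DecisionTreeClassifier(random_state=0),
    "random forest (bagging + subspace sampling)":
        RandomForestClassifier(n_estimators=100, random_state=0),
    "AdaBoost (boosting)":
        AdaBoostClassifier(n_estimators=100, random_state=0),
}

for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=5)
    print(f"{name}: mean CV accuracy = {scores.mean():.3f}")
```

Comparing cross-validated scores this way mirrors the "boosting vs. bagging is a matter of experimentation" point: both ensembles are built from the same base learner, and only the way the training sets and predictions are combined differs.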