Modeling Methods, Deploying, and Refining Predictive Models
UCI Spring 2020, I&C X425.34
Module: Error-based Modeling

Schedule
- Introduction and Overview
- Data and Modeling + Simulation Modeling
- Error-based Modeling
- Probability-based Modeling
- Similarity-based Modeling
- Information-based Modeling
- Time-series Modeling
- Deployment

At the end of this module, you will learn how regression models work in order to predict continuous values and classifications.

Today's Objectives
- Error-based models
- Multiple regression
- Simple linear regression
- Logistic regression

Regression is a hammer
Error-based Modeling: Regression, Linear Regression, Logistic Regression

Supervised Methods
- Error-based
- Instance-based
- Information-based
- Probability-based
- Neural networks and deep learning-based methods
- Ensembles

Practical Approach to Learning Machine Learning
- What does each method do? Basic mathematical foundation or pseudocode.
- What type of analytics can I apply it to? {descriptive, diagnostic, predictive, prescriptive}
- What type of data can it input and output? {categorical, continuous, discrete, probabilistic, etc.} Inputs/outputs.
- How does the method work? Assumptions.
- How do I use the model in practice? Code, off-the-shelf solutions, etc. Applications, model evaluation, deployment and integration.
- Advantages/disadvantages: conceptual strengths and weaknesses; ease of interpretability, monitoring, and deployment.

Error-based Methods
Error-based methods like regression model the relationship between variables, both continuous and categorical, using a measure of error in the predictions made by the model to determine the optimal relationship. Regression methods are a workhorse of statistics and are regularly used as a baseline in machine learning. This may be confusing because "regression" can refer both to a class of problem and to a class of algorithm. Really, regression is a process. The most popular regression algorithms are:
- Ordinary Least Squares Regression (OLSR)
- Linear Regression
- Logistic Regression
- Stepwise Regression
- Multivariate Adaptive Regression Splines (MARS)
- Locally Estimated Scatterplot Smoothing (LOESS)

Simulation versus Machine Learning
Our simulation model examples did not make use of all the information they had when devising a method for forecasting. We had to supply all the parameters and logic for the system, but nothing in the algorithm produced its own parameters or logic. The system never learned from the data, because we never gave the model a way to discover increasingly better parameters. What we need is a way to optimize. All of machine learning revolves around optimization (a minimal sketch of this idea follows the comparison below).

If we compare simulation to regression:

Types of analytics
- Simulation: predictive modeling for discrete, continuous, and probabilistic inputs and outputs.
- Regression: predictive modeling for discrete, continuous, categorical, and probabilistic outputs.

How it works
- Simulation: simulation.
- Regression: error minimization.

Applications
- Simulation: complex systems, typically stochastic processes, Monte Carlo simulations, and probabilistic forecasting.
- Regression: linear and non-linear modeling of discrete and continuous values for forecasting and classification.

Advantages/Disadvantages
- Simulation: relatively robust, easy to understand, and able to model complex systems; however, it can become overly complex, with declining performance relative to simpler methods.
- Regression: easy to understand and able to incorporate many features; the flexibility of regression means that it can handle most prediction and classification cases. However, the model is sensitive to the data.
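To make the "error minimization" idea in the comparison above concrete, here is a minimal sketch of fitting a line by iteratively reducing the squared error with gradient descent. The data, learning rate, and iteration count are illustrative assumptions, not values from the course.

```python
import numpy as np

# Illustrative data: a noisy linear relationship (assumed for demonstration).
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, 100)
y = 2.5 * x + 1.0 + rng.normal(0, 1.0, 100)

# Gradient descent on the mean squared error for y_hat = m*x + b.
m, b = 0.0, 0.0
lr = 0.005  # learning rate (illustrative)
for _ in range(5000):
    error = (m * x + b) - y
    # Partial derivatives of the mean squared error with respect to m and b.
    grad_m = 2 * np.mean(error * x)
    grad_b = 2 * np.mean(error)
    m -= lr * grad_m
    b -= lr * grad_b

print(f"learned m = {m:.3f}, b = {b:.3f}")  # should land near m = 2.5, b = 1.0
```

For simple linear regression the closed-form least-squares solution (derived in the next section) makes this iterative search unnecessary, but the same optimization loop is the template for models that have no closed-form solution.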
Machine Learning: Regressions
How does it work? Regressions minimize the error between the predicted target value and the actual target value. These types of methods can handle categorical and continuous-valued data in both the input and the output.

The ABT and the Model

| Descriptive Feature 1 | Descriptive Feature 2 | … | Descriptive Feature m (categorical) | Target Feature |
| Obs 1 | Obs 1 | … | Obs 1 | Target value 1 |
| Obs 2 | Obs 2 | … | Obs 2 | Target value 2 |
| …     | …     | … | …     | …              |
| Obs n | Obs n | … | Obs n | Target value n |

- The existence of a target feature automatically makes the modeling problem supervised.
- The data types of the features restrict which models can be used.
- The dataset characteristics may restrict the resolution of the model, force you to make assumptions, or require modeling for imputation, de-noising, data generation, etc.

The ABT for a regression model has this same structure: descriptive features (continuous or categorical) as inputs and a target feature as the output.

Today's Objectives
- Error-based models
- Multiple regression
- Simple linear regression
- Logistic regression

We are all familiar with linear regression on an intuitive level: simply by looking at a graph of the predictor X and the target value Y, we can usually guess whether there is a linear relation.

Mathematically, we can start with the equation of a line. Recall that the equation of a line can be written as

Y = mX + b

where m is the slope of the line and b is the y-intercept (where the line meets the vertical axis when X = 0).

Defined by the slope and intercept
Linear regression finds values for m and b such that we have an estimate \hat{Y} = mX + b of the target Y for any value of X.

And minimize the prediction error
We would like m and b to give us the smallest expected difference between the predicted value \hat{y}_i and the actual value y_i.

Defined by the sum of squared errors
For simple linear regression, the most commonly used error measure is the sum of squared errors:

SSE = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2

To find the parameters for a simple linear regression
Given a set of points (x_i, y_i) on a scatterplot, find the optimal line \hat{y} = mx + b such that the sum of squared errors is minimized.

The statistical solution uses correlation
Recall that the correlation between two variables is

r = \frac{\mathrm{cov}(X, Y)}{s_X s_Y} = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{(n-1)\, s_X s_Y}

It can be shown that the errors are minimized when

m = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sum_i (x_i - \bar{x})^2}, \qquad b = \bar{y} - m\bar{x}

This is related to the correlation by

m = r \frac{s_Y}{s_X}

Correlation is at the heart of many models
Correlation implies some sort of statistical dependence between the variables X and Y, even though it does not imply causation; even a correlation of exactly 1 does not, by itself, establish causality.

CAPM, a benchmark model in finance
The Capital Asset Pricing Model (CAPM) is a famous linear model, created by William Sharpe, winner of the Nobel Prize in Economics. It estimates the return of an asset based on the return of the market and the asset's linear relationship to the return of the market. The linear relationship of an asset to the market is the "beta" coefficient (see the code sketch below).
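The closed-form formulas above translate directly into code. Below is a minimal sketch; the function name and the simulated return series are illustrative assumptions (a real beta estimate would use actual asset and market returns).

```python
import numpy as np

def fit_simple_linear_regression(x, y):
    """Closed-form least-squares fit: returns slope m and intercept b."""
    x_bar, y_bar = x.mean(), y.mean()
    m = np.sum((x - x_bar) * (y - y_bar)) / np.sum((x - x_bar) ** 2)
    b = y_bar - m * x_bar
    return m, b

# Equivalently, m = r * (s_Y / s_X): the correlation scaled by the ratio
# of standard deviations.

# Illustrative CAPM-style use with simulated (not real) returns:
rng = np.random.default_rng(42)
market_returns = rng.normal(0.005, 0.04, 250)                     # "the market"
asset_returns = 1.2 * market_returns + rng.normal(0, 0.01, 250)   # true beta = 1.2

beta, alpha = fit_simple_linear_regression(market_returns, asset_returns)
print(f"estimated beta = {beta:.2f}, intercept = {alpha:.4f}")
```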
CAPM "is the centerpiece of MBA investment courses. Indeed, it is often the only asset pricing model taught in these courses… unfortunately, the empirical record of the model is poor." - Fama and French

CAPM is a linear model
More formally:

E[R_i] - R_f = \beta_i \,(E[R_m] - R_f)

where

\beta_i = \frac{\mathrm{cov}(R_i, R_m)}{\mathrm{var}(R_m)}

But \beta_i is just the slope m of a regression of asset returns on market returns, which is proportional to the correlation between the asset/portfolio and the market.

CAPM optimal parameters
Given a set of asset returns and market returns on a scatterplot, find the best-fit line such that the sum of squared errors is minimized. So beta measures how much the asset will change when the market changes.

A more general way to think about regression
Think about averages. Regression is essentially finding the line that gives the average value of y for any input x.

In terms of statistics
Regression can be thought of as a conditional mean, E[Y \mid X = x].

Which relates back to our linear regression
We can relate the conditional mean back to our linear equation:

E[Y \mid X = x] = mx + b

But isn't there some variance?
We have a conditional mean, and the data around it would have some variance. Putting it back into the equation:

Y = mX + b + \epsilon, \quad \epsilon \sim N(0, \sigma^2)

with normally distributed errors with mean 0 and standard deviation \sigma.

Our first model check
Check whether our assumptions are correct by testing the error distribution via t-tests, p-values, plotting, etc. These errors are often called the residuals, and we assume they are independently and identically distributed (i.i.d.) with a normal distribution.

And we can score the model by the error differences:

MAE = \frac{1}{n}\sum_i |y_i - \hat{y}_i|, \quad MSE = \frac{1}{n}\sum_i (y_i - \hat{y}_i)^2, \quad RMSE = \sqrt{MSE}

and by the amount of variance explained by the model.

Variance explain-ability
The explained variance is the amount of variance in the data that the model accounts for:

R^2 = \frac{\text{explained variance}}{\text{variance of the data}} = 1 - \frac{SS_{res}}{SS_{tot}}

The better the fit, the more variance the model accounts for. The closer R^2 is to 1, the better the model.
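As a minimal sketch of the scores just defined, the helper below (the name score_regression is an illustrative assumption) computes MAE, MSE, RMSE, and R^2 from arrays of actual and predicted values.

```python
import numpy as np

def score_regression(y_true, y_pred):
    """Error-based scores for a regression model (a minimal sketch)."""
    resid = y_true - y_pred
    mae = np.mean(np.abs(resid))         # Mean Absolute Error
    mse = np.mean(resid ** 2)            # Mean Squared Error
    rmse = np.sqrt(mse)                  # Root Mean Squared Error
    ss_res = np.sum(resid ** 2)                       # residual sum of squares
    ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)  # total sum of squares
    r2 = 1 - ss_res / ss_tot             # proportion of variance explained
    return {"MAE": mae, "MSE": mse, "RMSE": rmse, "R2": r2}

# Example with made-up values:
y_true = np.array([3.0, 5.0, 7.5, 9.0])
y_pred = np.array([2.8, 5.2, 7.0, 9.5])
print(score_regression(y_true, y_pred))
```

In practice these scores should be computed on a held-out test set, as the general modeling structure below emphasizes.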
General structure of modeling
Data → Training Set + Test Set → Model Development → Model Evaluation
Performance measures: accuracy, precision, recall.

Simple Linear Regression Demo
Very easy to fall into traps… (from XKCD)

All the assumptions
The assumptions that must be met for linear regression to be valid depend on the purposes for which it will be used. Any application of linear regression makes two assumptions:
(A) The data used in fitting the model are representative of the population.
(B) The true underlying relationship between X and Y is linear.
All you need to assume to predict Y from X are (A) and (B). To estimate the standard error of the prediction, you must also assume that:
(C) The variance of the residuals is constant (homoscedastic, not heteroscedastic).
For linear regression to provide the best linear unbiased estimator of the true Y, (A) through (C) must be true, and you must also assume that:
(D) The residuals are independent.
To make probabilistic statements, such as hypothesis tests involving b or r, or to construct confidence intervals, (A) through (D) must be true, and you must also assume that:
(E) The residuals are normally distributed.

Regression is a hammer
Linear regression does not assume anything about the distributions of either X or Y; it only makes assumptions about the distribution of the residuals. As with many other statistical techniques, it is not necessary for the data themselves to be normally distributed, only for the errors (residuals) to be normally distributed, and this is only required for the statistical significance tests (and other probabilistic statements) to be valid; regression can be applied for many other purposes even if the errors are non-normally distributed.

Steps for Simple Regression Models
1. Plot and examine the data.
2. Transform X and Y if needed.
3. Calculate the linear regression statistics. By hand, that would be:
   m = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sum_i (x_i - \bar{x})^2}, \qquad b = \bar{y} - m\bar{x}
4. Examine the regression slope and intercept.
5. Examine the residuals plot; that is, plot the residuals versus X (a sketch follows this list):
   - If the residuals increase or decrease with X, they are heteroscedastic; transform Y to cure this.
   - If the residuals are curved with X, the relationship between X and Y is nonlinear; either transform X, or fit a nonlinear curve to the data.
   - If there are outliers, check their validity, and/or use robust regression techniques.
6. Plot the residuals versus Y and run the same checks as for X above.
7. Plot the residuals against every other possible explanatory variable in the data set.
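Following the steps above, here is a sketch of the residual checks using matplotlib. The simulated data and the quick np.polyfit fit are illustrative assumptions; in practice you would plot the residuals of your own fitted model.

```python
import numpy as np
import matplotlib.pyplot as plt

# Illustrative data and fit (assumed): y is roughly linear in x.
rng = np.random.default_rng(1)
x = rng.uniform(0, 10, 200)
y = 3.0 * x + 2.0 + rng.normal(0, 1.5, 200)
m, b = np.polyfit(x, y, 1)  # quick least-squares fit

residuals = y - (m * x + b)

fig, axes = plt.subplots(1, 2, figsize=(10, 4))

# Residuals vs. X: a trend suggests nonlinearity; a fan shape suggests
# heteroscedasticity (step 5 above).
axes[0].scatter(x, residuals, s=10)
axes[0].axhline(0, color="red")
axes[0].set(xlabel="X", ylabel="residual", title="Residuals vs. X")

# Histogram: residuals should look roughly normal and centered at 0
# (assumption E above).
axes[1].hist(residuals, bins=20)
axes[1].set(xlabel="residual", title="Residual distribution")

plt.tight_layout()
plt.show()
```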