Modeling Methods, Deploying, and Refining Predictive Models
UCI Spring 2020, I&C X425.34
Module: Data Preparation and the Modeling Process

Schedule
- Introduction and Overview
- Data and Modeling + Simulation Modeling
- Error-based Modeling
- Probability-based Modeling
- Similarity-based Modeling
- Information-based Modeling
- Time-series Modeling
- Deployment

At the end of this module, you will review how to prepare and analyze data to understand its properties and relationships in relation to the modeling process, and you will learn how to build simulation models for scenario analysis.

Today's Objectives
- Data and the modeling process
- Data pitfalls
- Model risk
- Simulation modeling

The modeling process (Business Understanding, Data Understanding, Data Preparation):
- First, translate the business question to a data problem.
- Next, understand the data available.
- Then, set up the data for modeling.
- The process is highly iterative.
Typical tasks across these phases include: understand the business problem, propose analytics solutions, explore the data, assess and choose among analytics solutions, agree on analytics goals, brainstorm/design/review domain concepts, design and review features, build the analytics base table (ABT), clean and prepare the data, and deploy the data.

Real-world data must be represented in some digital form, such as:
- Numeric: true numeric values that allow arithmetic operations (price, measurement, etc.)
- Interval: values that allow ordering and subtraction but not other arithmetic operations (date, time, etc.)
- Ordinal: values that allow ordering but do not permit arithmetic operations (e.g., size as S, M, L)
- Categorical: a finite set of values that cannot be ordered and allow no arithmetic operations
- Binary: a set of two values (e.g., T/F)
- Textual: free-form text
- Tuples: n-tuple identifiers such as lat/lon coordinates
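As an illustration (not part of the original slides), here is one way these value types might be encoded when loading data with pandas; the column names and values are hypothetical.

```python
import pandas as pd

# Hypothetical raw records illustrating the value types listed above.
raw = pd.DataFrame({
    "price": [19.99, 5.49, 12.00],                                      # Numeric
    "signup_date": ["2020-01-05", "2020-02-11", "2020-03-20"],          # Interval
    "size": ["S", "L", "M"],                                            # Ordinal
    "region": ["west", "east", "west"],                                 # Categorical
    "is_member": [True, False, True],                                   # Binary
    "comment": ["great fit", "", "runs small"],                         # Textual
    "lat_lon": [(33.64, -117.84), (40.71, -74.01), (34.05, -118.24)],   # Tuple
})

# Encode each column with a dtype that preserves the operations it supports:
# dates allow ordering and subtraction, ordered categoricals allow ordering only,
# plain categoricals allow neither.
raw["signup_date"] = pd.to_datetime(raw["signup_date"])
raw["size"] = pd.Categorical(raw["size"], categories=["S", "M", "L"], ordered=True)
raw["region"] = raw["region"].astype("category")

print(raw.dtypes)
```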
Translating the business question to a target concept is hard. The mapping runs from the analytics solution down through domain concepts and domain subconcepts to the measurable features, with one target concept tied to the target feature the model will predict.

The more specific you are about the question, the better. For example, instead of asking "how can we increase sales for the next quarter?", we could ask much more narrowly focused questions:
- What impact did the latest 5% pricing discount have on sales for item x?
- Which demographic spent more online within 1 week of the big push email campaign?
- What items are typically purchased together by customers who also bought item x?

Defining which domain concepts might impact the target feature is even harder. The range of domain concepts is limited only by what you can quantify or label:
- Prediction subject details: descriptive details of any aspect of the prediction subject
- Demographics or cohorts: features of users, customers, issuance, or origination, such as age, gender, occupation, origination date, category, or race
- Usage: frequency and recency, cumulative value, mix or diversity
- Changes in usage: the same usage measures, but expressed in terms of change
- Special usage: extraordinary events, unusual activity, increased or decreased usage, drop-out
- Lifecycle phase: early, middle, late
- Network links: relationships between other measures from a structural, social, geo-spatial, or temporal view

Domain subconcept breakdowns help make the connection to the features in the dataset. Features are the measurable attributes we use to build the model that predicts the target feature.

Conceptually mapping Gross Domestic Product (GDP) is hard. From Econ 101, gross domestic product is calculated as consumption plus investment plus government spending plus net exports: GDP = C + I + G + (X - M).

Let's say the question is: what will be the % change in next quarter's GDP?
The target feature: % change in next-quarter GDP.
- Specifics? Adjusted for inflation? Detrended?
- Motivation? Are we trying to gauge economic output or production? Are we trying to look at economic growth relative to other countries? Would other measures address the question more directly?
- What features do we have? Are they up to date, or reported with a lag?
- Can we build a predictor for each of the inputs into GDP?
- How do the constant historical revisions of the data impact the analysis?
- What conceptually goes into measuring each of the features?
- Is there enough data to make a prediction?
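A minimal sketch (not from the slides) of how the target feature might be derived from a quarterly series; the GDP figures here are made up for illustration.

```python
import pandas as pd

# Hypothetical quarterly GDP levels (in trillions); a real analysis would pull a
# published series and decide on inflation adjustment, seasonal adjustment, etc.
gdp = pd.Series(
    [21.48, 21.75, 21.14, 21.70, 22.04],
    index=pd.period_range("2019Q1", periods=5, freq="Q"),
)

# Target feature: % change in next quarter's GDP.
# pct_change() gives the change from the prior quarter; shifting it back by one
# aligns each row with the change that occurs in the *following* quarter.
target_pct_change_next_q = gdp.pct_change().shift(-1) * 100

abt = pd.DataFrame({
    "gdp_level": gdp,
    "target_pct_change_next_q": target_pct_change_next_q,
})
print(abt)
```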
Next, we need to transform the raw data into features for the model.
- Raw features: sensor-based, digitally tracked, polled, user-submitted, statistics, system-defined, etc.
- Derived features:
  - Aggregates: defined over a group or period, usually as a count, sum, average, min, or max
  - Flags: binary features indicating the presence or absence of a characteristic in the data
  - Ratios: continuous features that capture the relationship between two or more data values
  - Mappings: convert continuous features into categorical features to reduce the number of unique values or to provide a higher-level conceptual mapping

We need to have an Analytics Base Table (ABT) before we can model anything. The ABT is a flat table with one row per observation: descriptive features 1 through m in the columns, plus the target feature.

  Obs   | Descriptive Feature 1 | Descriptive Feature 2 | ... | Descriptive Feature m | Target Feature
  1     | obs 1                 | obs 1                 | ... | categorical           | target value 1
  2     | obs 2                 | obs 2                 | ... | categorical           | target value 2
  ...   | ...                   | ...                   | ... | ...                   | ...
  n     | obs n                 | obs n                 | ... | categorical           | target value n

The existence of a target feature automatically makes the modeling problem supervised. The data type of the target feature restricts which models can be used. The dataset's characteristics may restrict the resolution of the model, force you to make assumptions, or require additional modeling for imputation, de-noising, data generation, etc.

Understanding and manipulating feature spaces is the key to data analytics. An n-dimensional vector-space representation of language, for example, produces an incredible ability to perform word-vector arithmetic (image source: Deep Learning Illustrated by Krohn). The ABT/feature-space representation is nothing more than an n-dimensional matrix, and modeling methods are just different ways to perform statistical, mathematical, or even heuristic transformations of that matrix to capture patterns and relationships.

Data Pitfalls
- Epistemic errors (how we think about data): assuming data is a perfect reflection of reality; forming conclusions about the future based on historical data alone; seeking to use data to verify a previously held belief rather than to test its veracity
- Technical traps (how we process data): dirty data with mismatched category levels and typos; units of measurement and date/time misalignments; merging of disparate data sources and duplication
- Mathematical miscues (how we calculate with data): summing at various levels of aggregation; calculating rates or ratios; working with proportions and percentages; dealing with different units
- Statistical slip-ups: incorrect distributional assumptions; sampling issues; comparative issues
- Analytical aberrations: overfitting; missing signals; extrapolation or interpolation issues; using unnecessary metrics
- Graphical issues: unsuitable visualization type; clarity and reasonableness of the visualization; charting errors; embellishment; communication and interactivity

A catalog of data fallacies to avoid is available at data-literacy.geckoboard.com.

Simpson's Paradox (source: Rafael Irizarry). Aggregating the data reverses the trend seen in the individual groups, so the resulting correlation is inverted relative to the individual groups and to the average of those groups.

Example: the effectiveness of kidney stone treatments A and B. Treatment A works better for both small and large kidney stones, yet when the data are aggregated, treatment B appears to work better overall.
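A small sketch of this reversal using the widely cited counts from the classic kidney stone study (Charig et al., 1986); these numbers are assumed here for illustration and may differ from the exact table on the slide.

```python
import pandas as pd

# Recovery counts by treatment and stone size (assumed, for illustration).
records = pd.DataFrame([
    {"treatment": "A", "stone_size": "small", "recovered": 81,  "total": 87},
    {"treatment": "A", "stone_size": "large", "recovered": 192, "total": 263},
    {"treatment": "B", "stone_size": "small", "recovered": 234, "total": 270},
    {"treatment": "B", "stone_size": "large", "recovered": 55,  "total": 80},
])

# Within each stone size, treatment A has the higher recovery rate...
by_group = records.assign(rate=records.recovered / records.total)
print(by_group[["treatment", "stone_size", "rate"]])

# ...but aggregating over stone size reverses the ranking in favor of B,
# because A is applied mostly to the harder, large-stone cases.
overall = records.groupby("treatment")[["recovered", "total"]].sum()
overall["rate"] = overall.recovered / overall.total
print(overall)
```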
Simpson's Paradox: how can this be? It turns out that small stones are considered less serious cases, and treatment A is more invasive than treatment B. Therefore, doctors are more likely to recommend the inferior treatment, B, for small kidney stones, where the patient is more likely to recover successfully in the first place because the case is less severe. For large, serious stones, doctors more often go with the better but more invasive treatment A. Even though treatment A performs better on these cases, because it is applied to the more serious cases, the overall recovery rate for treatment A is lower than for treatment B.

Domain mapping to data oftentimes requires meta-level thinking. One way to resolve Simpson's paradox is to employ causal analysis, which requires domain knowledge akin to what we just described. The size of the kidney stone, which corresponds to the severity of the case, is a confounding variable because it affects both the independent variable (the treatment selected) and the dependent variable (successful recovery). To resolve the paradox, we need to control for the confounding variable by segmenting the two groups rather than aggregating over them.

Data is 99% of the work. Avoiding data pitfalls and fallacies is an unending task, and it should most certainly include having or working with domain expertise, along with a solid understanding of the statistical issues that can arise from leaving data cleansing, transformations, aggregations, and derivations unchecked.

Typical Sources of Model Risk
- Assumptions: linearity, stationarity, normality
- Statistical bias: sampling bias, over- and under-fitting, survivorship bias, confirmation bias, omitted variables and confounders

Linearity Assumption
The assumption that the relationship between any two variables can be expressed with a straight-line graph. Linearity is a common assumption buried in many models because most correlation metrics reflect linear association between two variables or assume something about their distributions. Pearson correlation (the most common correlation measure), scaled between -1 and 1, measures the propensity of a random phenomenon to have a linear association. As we'll see in regression, the slope of the line of fit is fully determined by Pearson's correlation coefficient. If the true relationship is non-linear, a linearity assumption will either fail to detect the relationship or over- or underestimate it (see the second sketch at the end of this section).

Stationarity Assumption
The assumption that a variable, or the distribution from which a random variable is sampled, is constant over time. For many stochastic models, particularly those dealing with volatility and correlation, this is a strong assumption that can lead to completely unrealistic results.

Normality Assumption
Normal (Gaussian) distributions are often used as a matter of convenience. Sums and other linear combinations of independent normal random variables are also normally distributed.
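A minimal simulation sketch (not from the slides) of that last claim; the sample size and distribution parameters are arbitrary.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Two independent normal variables with different means and variances.
x = rng.normal(loc=2.0, scale=1.5, size=100_000)
y = rng.normal(loc=-1.0, scale=0.5, size=100_000)
s = x + y  # their sum

# Theory: the sum is normal with mean 2 + (-1) = 1 and variance 1.5**2 + 0.5**2 = 2.5.
print("sample mean:", s.mean(), "sample var:", s.var())

# A normality test should typically fail to reject normality for the sum.
print("normaltest p-value:", stats.normaltest(s).pvalue)
```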
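The second sketch, referenced from the linearity discussion above, illustrates (again, not from the slides) how Pearson correlation can completely miss a strong non-linear relationship.

```python
import numpy as np

rng = np.random.default_rng(1)

# A strong but symmetric non-linear relationship: y = x**2 plus a little noise.
x = rng.uniform(-3, 3, size=10_000)
y = x**2 + rng.normal(scale=0.1, size=x.size)

# Pearson r is near zero even though y is almost fully determined by x,
# so a linearity assumption would simply not detect this relationship.
print("Pearson r of x vs y:", round(np.corrcoef(x, y)[0, 1], 3))

# Correlating y with a non-linear transform of x recovers the link.
print("Pearson r of x**2 vs y:", round(np.corrcoef(x**2, y)[0, 1], 3))
```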