(A very simple summary assignment) The first paragraph should summarize the reading. The second paragraph should highlight a specific point and/or briefly explore something that interested you (e.g., you may wish to focus on one aspect of the paper in more depth, or discuss something in the reading that you disagree with). Each paragraph is worth one point of the assignment.
You should submit a summary paragraph and an idea-highlight paragraph for 2 separate articles: 4 paragraphs in total. There is no minimum word count; write as many words as you need.
Geostatistical Learning: Challenges and Opportunities

Júlio Hoffimann (Instituto de Matemática Pura e Aplicada); Maciel Zortea, Breno de Carvalho, Bianca Zadrozny (IBM Research Brazil)

Abstract: Statistical learning theory provides the foundation for applied machine learning and its various successful applications in computer vision, natural language processing, and other scientific domains. The theory, however, does not take into account the unique challenges of performing statistical learning in geospatial settings. For instance, it is well known that model errors cannot be assumed to be independent and identically distributed in geospatial (a.k.a. regionalized) variables due to spatial correlation; and trends caused by geophysical processes lead to covariate shifts between the domain where the model was trained and the domain where it will be applied, which in turn harm the use of classical learning methodologies that rely on random samples of the data. In this work, we introduce the geostatistical (transfer) learning problem, and illustrate the challenges of learning from geospatial data by assessing widely used methods for estimating the generalization error of learning models under covariate shift and spatial correlation. Experiments with synthetic Gaussian process data as well as with real data from geophysical surveys in New Zealand indicate that none of the methods are adequate for model selection in a geospatial context. We provide general guidelines regarding the choice of these methods in practice while new methods are being actively researched.

Keywords: geostatistical learning, transfer learning, covariate shift, geospatial, density ratio estimation, importance weighted cross-validation
1. Introduction

[Software is available at https://github.com/IBM/geostats-gen-error. ∗Corresponding author. Email addresses: [email protected] (Júlio Hoffimann), [email protected] (Maciel Zortea), [email protected] (Breno de Carvalho), [email protected] (Bianca Zadrozny). Author contributions: Júlio Hoffimann: Conceptualization, Methodology, Software, Formal Analysis, Investigation, Visualization, Writing - Original Draft; Maciel Zortea: Methodology, Validation; Breno de Carvalho: Data Curation, Validation; Bianca Zadrozny: Methodology, Validation, Supervision. Preprint submitted to arXiv.org February 18, 2021 (arXiv:2102.08791v1 [stat.ML] 17 Feb 2021).]

Classical learning theory [1, 2, 3] and its applied machine learning methods have been popularized in the geosciences after various technological advances, leading initiatives in open-source software [4, 5, 6, 7], and intense marketing from a diverse portfolio of industries. In spite of its popularity, learning theory cannot be applied straightforwardly to solve problems in the geosciences, as the characteristics of these problems violate fundamental assumptions used to derive the theory and related methods (e.g. i.i.d. samples).

Among these methods derived under classical assumptions (more on this later), those for estimating the generalization (or prediction) error of learned models on unseen samples are crucial in practice [2]. In fact, estimates of generalization error are widely used for selecting the best performing model for a problem out of a collection of available models [8]. If estimates of error are inaccurate because of violated assumptions, then there is a great chance that models will be selected inappropriately [9]. The issue is aggravated when models of great expressiveness (i.e. many learning parameters) are considered in the collection, since they are quite capable of overfitting the available data [10, 11]. In the following paragraphs, we consider statistical learning broadly as minimization of generalization error.

The literature on generalization error estimation methods is vast [8, 12], and we do not intend to review it extensively here. Nevertheless, some methods have gained popularity since their introduction in the mid 70s because of their generality, ease of use, and availability in open-source software:

Leave-one-out (1974).
The leave-one-out method for assessing and selecting learning models was based on the idea that to estimate the prediction error on an unseen sample one only needs to hide a seen sample from a dataset and learn the model. Because the hidden sample has a known label, the method can compare the model prediction with the true label for the sample. By repeating the process over the entire dataset, one gets an estimate of the expected generalization error [13]. Leave-one-out has been investigated in parallel by many statisticians, including Nicholson (1960) and Stone (1974), and is also known as ordinary cross-validation.

k-fold cross-validation (1975). The term k-fold cross-validation refers to a family of error estimation methods that split a dataset into non-overlapping "folds" for model evaluation. Similar to leave-one-out, each fold is hidden while the model is learned using the remaining folds. It can be thought of as a generalization of leave-one-out where folds may have more than a single sample [14, 15]. Cross-validation is less computationally expensive than leave-one-out depending on the size and number of folds, but can introduce bias in the error estimates if the number of samples in the folds used for learning is much smaller than the original number of samples in the dataset.

Major assumptions are involved in the derivation of the estimation methods listed above. The first of them is the assumption that samples come from independent and identically distributed (i.i.d.) random variables. It is well known that spatial samples are not i.i.d., and that spatial correlation needs to be modeled explicitly with geostatistical theory. Even though the sample mean of the empirical error used in those methods is an unbiased estimator of the prediction error regardless of the i.i.d. assumption, the precision of the estimator can be degraded considerably with non-i.i.d. samples.

Motivated by the necessity to leverage non-i.i.d.
samples in practical applications, and evidence that model performance is affected by spatial correlation [16, 17], the statistical community devised new error estimation methods using the spatial coordinates of the samples:

h-block leave-one-out (1995). Developed for time-series data (i.e. data showing temporal dependency), the h-block leave-one-out method is based on the principle that stationary processes achieve a correlation length (the "h") after which the samples are not correlated. The time-series data is then split such that samples used for error evaluation are at least h steps distant from the samples used to learn the model [18]. Burman (1994) showed how the method outperformed traditional leave-one-out in time-series prediction by selecting the hyperparameter h as a fraction of the data and correcting the error estimates accordingly to avoid bias.

Spatial leave-one-out (2014). Spatial leave-one-out is a generalization of h-block leave-one-out from time-series to spatial data [19]. The principle is the same, except that the blocks have multiple dimensions (e.g. norm-balls).

Block cross-validation (2016). Similarly to k-fold cross-validation for non-spatial data, block cross-validation was proposed as a faster alternative to spatial leave-one-out. The method creates folds using blocks of size equal to the spatial correlation length, and separates samples for error evaluation from samples used to learn the model. The method introduces the concept of "dead zones": regions near the evaluation block that are discarded to avoid over-optimistic error estimates [20, 21].

Unlike the estimation methods proposed in the 70s, which use random splits of the data, these methods split the data based on spatial coordinates and what the authors called "dead zones".
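The block cross-validation scheme with dead zones can be sketched in a few lines of Python. This is a minimal illustration under our own simplifying assumptions (axis-aligned square blocks, Euclidean distances, 2-D coordinates); the function name and toy data are ours, not part of the paper's software:

```python
import math

def block_cv_folds(coords, block_size, dead_zone):
    """Block cross-validation with dead zones: samples are grouped into
    spatial blocks; for each evaluation block, training samples closer
    than `dead_zone` to any evaluation sample are discarded."""
    # Assign each sample to a square block based on its coordinates.
    blocks = {}
    for i, (x, y) in enumerate(coords):
        key = (int(x // block_size), int(y // block_size))
        blocks.setdefault(key, []).append(i)
    # One fold per block; the dead zone filters the training set.
    folds = []
    for test_idx in blocks.values():
        train_idx = [
            i for i, (x, y) in enumerate(coords)
            if i not in test_idx and all(
                math.hypot(x - coords[j][0], y - coords[j][1]) > dead_zone
                for j in test_idx)
        ]
        folds.append((train_idx, test_idx))
    return folds

# Toy configuration: the dead zone equals an assumed spatial correlation
# length of 0.5 distance units; (0.9, 0.9) sits near a block boundary.
coords = [(0.1, 0.1), (0.9, 0.9), (1.1, 1.1), (5.0, 5.0)]
folds = block_cv_folds(coords, block_size=1.0, dead_zone=0.5)
```

Samples inside the dead zone of an evaluation block appear in neither the training nor the evaluation set of that fold, which is what makes the resulting error estimates less over-optimistic, at the cost of the systematic-split bias noted above.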
This set of heuristics for creating data splits avoids configurations in which the model is evaluated on samples that are too near (i.e. closer than the spatial correlation length to) other samples used for learning the model. Consequently, these estimation methods tend to produce error estimates that are higher on average than their non-spatial counterparts, which are known to be over-optimistic in the presence of spatial correlation. However, systematic splits of the data introduce bias, which has not been emphasized enough in the literature.

All methods for estimating generalization error in classical learning theory, including the methods listed above, rely on a second major assumption: that the distribution of unseen samples to which the model will be applied is equal to the distribution of samples over which the model was trained. This assumption is very unrealistic for various applications in geosciences, which involve quite heterogeneous (i.e. variable) and heteroscedastic (i.e. with different variability) processes [22].

Very recently, an alternative to classical learning theory has been proposed, known as transfer learning theory, to deal with the more difficult problem of learning under shifts in distribution and learning tasks [23, 24, 25]. The theory introduces methods that are more amenable for geoscientific work [26, 27, 28], yet these same methods were not derived for geospatial data (e.g. climate data, earth observation data, field measurements). Of particular interest in this work, the covariate shift problem is a type of transfer learning problem where the samples on which the model is applied have a distribution of covariates that differs from the distribution of covariates over which the model was trained [29]. It is relevant in geoscientific applications in which a list of explanatory features is known to predict a response via a set of physical laws that hold everywhere.
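The covariate shift problem can be made concrete with a toy Python sketch (our own construction, not an example from the paper): a misspecified linear model is fit where covariates were sampled, then applied where the covariate distribution is different, even though the underlying "physical law" holds everywhere.

```python
import random

random.seed(0)

def law(x):
    # hypothetical physical law that holds everywhere
    return x * x

# Source (training) and target (deployment) regions have different
# covariate distributions -- a covariate shift.
source = [random.uniform(0.0, 1.0) for _ in range(200)]
target = [random.uniform(2.0, 3.0) for _ in range(200)]

# Least-squares fit of a (misspecified) linear model y = a*x + b on the source.
n = len(source)
sx, sy = sum(source), sum(law(x) for x in source)
sxx = sum(x * x for x in source)
sxy = sum(x * law(x) for x in source)
a = (n * sxy - sx * sy) / (n * sxx - sx * sx)
b = (sy - a * sx) / n

def mse(xs):
    return sum((a * x + b - law(x)) ** 2 for x in xs) / len(xs)

# The error measured where the model was trained is far more optimistic
# than the error actually incurred where the model is applied.
print(mse(source), mse(target))
```

The point of the sketch is that no amount of random splitting of the source samples would reveal the large deployment error, since every split still draws from the source distribution.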
Under covariate shift, a generalization error estimation method has been proposed:

Importance-weighted cross-validation (2007). Under covariate shift, and assuming that learning models may be misspecified, classical cross-validation is not unbiased. Importance weights can be considered for each sample to recover the unbiasedness property of the method, and this is the core idea of importance-weighted cross-validation [30, 31]. The method is unbiased under covariate shift for the two most common supervised learning tasks: regression and classification.

The importance weights used in importance-weighted cross-validation are ratios between the target (or test) probability density and the source (or train) probability density of covariates. Density ratios are useful in a much broader set of applications, including two-sample tests, outlier detection, and distribution comparison. For that reason, the problem of density ratio estimation has become a general statistical problem [32]. Various density ratio estimators have been proposed with increasing performance [33, 34, 35, 36], yet an investigation is missing that contemplates importance-weighted cross-validation and other existing error estimation methods in geospatial settings.

In this work, we introduce geostatistical (transfer) learning, and discuss how most prior work in spatial statistics fits in a specific type of learning from geospatial data that we term pointwise learning. In order to illustrate the challenges of learning from geospatial data, we assess existing estimators of generalization error from the literature using synthetic Gaussian process data and real data from geophysical well logs in New Zealand that we made publicly available [37].

The paper is organized as follows. In Section 2, we introduce geostatistical (transfer) learning, which contains all the elements involved in learning from geospatial data.
We define covariate shift in the geospatial setting and briefly review the concept of spatial correlation. In Section 3, we define generalization error in geostatistical learning, discuss how it generalizes the classical definition of error in non-spatial settings, and review estimators of generalization error from the literature devised for pointwise learning. In Section 4,
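To make the importance-weighted cross-validation idea above concrete, here is a minimal Python sketch. It is our own toy construction, not the paper's software: the source and target densities are assumed known (uniform on [0, 2] and [1, 2], respectively), whereas in practice the density ratio would come from a density ratio estimator, and the constant model is deliberately misspecified.

```python
def density_ratio(x):
    # Assumed known: p_target (uniform on [1, 2]) over p_source (uniform on [0, 2]).
    return 2.0 if 1.0 <= x <= 2.0 else 0.0

def loo_error(xs, ys, weighted):
    """Leave-one-out error estimate; with `weighted=True` each held-out
    loss is multiplied by the importance weight w(x) = p_target / p_source."""
    n = len(xs)
    total = 0.0
    for i in range(n):
        rest = [ys[j] for j in range(n) if j != i]
        pred = sum(rest) / len(rest)        # misspecified constant model
        w = density_ratio(xs[i]) if weighted else 1.0
        total += w * (pred - ys[i]) ** 2
    return total / n

# Deterministic grid standing in for source samples; the response is y = x^2.
xs = [i / 100.0 for i in range(200)]        # covers the source range [0, 2)
ys = [x * x for x in xs]

naive = loo_error(xs, ys, weighted=False)
iwcv = loo_error(xs, ys, weighted=True)
print(naive, iwcv)
```

The naive estimate averages the loss where the model was trained, while the weighted estimate emphasizes the target region where the misspecified model is worse, recovering in expectation the error the model will incur after deployment.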