Access the full text.
Sign up today, get DeepDyve free for 14 days.
Earth observation from satellite sensory data poses challenging problems, where machine learning is currently a key player. In recent years, Gaussian Process (GP) regression have excelled in biophysical parameter estimation tasks from airborne and satellite observations. GP regression is based on solid Bayesian statistics, and generally yields ecient and accurate parame- ter estimates. However, GPs are typically used for inverse modeling based on concurrent observations and in situ measurements only. Very often a forward model encoding the well-understood physical relations between the state vec- tor and the radiance observations is available though and could be useful to improve predictions and understanding. In this work, we review three GP models that respect and learn the physics of the underlying processes in the context of both forward and inverse modeling. After reviewing the traditional application of GPs for parameter retrieval, we introduce a Joint GP (JGP) model that combines in situ measurements and simulated data in a single GP model. Then, we present a latent force model (LFM) for GP modeling that encodes ordinary dierential equations to blend data-driven modeling and physical constraints of the system governing equations. The LFM performs Preprint. Paper published in Applied Soft Computing Volume 68, July 2018, Pages 69-82. https://doi.org/10.1016/j.asoc.2018.03.021 Preprint submitted to Elsevier December 16, 2020 arXiv:2012.07986v1 [eess.SP] 7 Dec 2020 multi-output regression, adapts to the signal characteristics, is able to cope with missing data in the time series, and provides explicit latent functions that allow system analysis and evaluation. Finally, we present an Automatic Gaussian Process Emulator (AGAPE) that approximates the forward phys- ical model using concepts from Bayesian optimization and at the same time builds an optimally compact look-up-table for inversion. We give empirical evidence of the performance of these models through illustrative examples of vegetation monitoring and atmospheric modeling. Keywords: Earth observation, remote sensing, vegetation, kernel methods, Gaussian Processes (GPs), Inverse modeling, Geosciences, Radiative transfer models (RTMs) 2 Contents 1 Introduction 4 2 Gaussian Process Models for Inverse Modeling 8 2.1 A brief overview of GPs . . . . . . . . . . . . . . . . . . . . . 8 2.2 GPs for model inversion in precision agriculture . . . . . . . . 10 3 Forward and Inverse Joint GP Models 13 3.1 Notation and formulation . . . . . . . . . . . . . . . . . . . . 13 3.2 Experimental Results . . . . . . . . . . . . . . . . . . . . . . . 15 3.2.1 Same-site experiments . . . . . . . . . . . . . . . . . . 16 3.2.2 Cross-site experiments . . . . . . . . . . . . . . . . . . 17 4 Inverse Modelling with Latent Force Models 19 4.1 Formulation . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20 4.2 Experimental Results . . . . . . . . . . . . . . . . . . . . . . . 23 4.2.1 Learning the LAI-fAPAR relationships . . . . . . . . . 23 4.2.2 Dealing with missing data . . . . . . . . . . . . . . . . 25 5 Automatic Emulation 26 5.1 Theoretical Formulation . . . . . . . . . . . . . . . . . . . . . 29 5.2 Speci c Implementations . . . . . . . . . . . . . . . . . . . . . 30 5.3 Automatic Gaussian Process Emulator (AGAPE) . . . . . . . 31 5.4 Multi-output Automatic Emulation . . . . . . . . . . . . . . . 32 5.5 Emulation of Costly Radiative Transfer Codes . . . . . . . . . 34 5.6 Example of Multi-output Emulation . . . . . . . . . . . . . . . 35 6 Conclusions 36 3 forward problem RTM g(y; !) Variables y Observations x Retrieval f (x; ) inverse problem Figure 1: Forward (solid lines) and inverse (dashed lines) problems in remote sensing. 1. Introduction Solving inverse problems is a recurrent topic of research in Engineering and Physics in general, and in Remote Sensing and Earth Observation (EO) in particular. A very relevant inverse problem is that of estimating vegeta- tion properties from remotely sensed images. Accurate inverse models help to determine the phenological stage and health status (e.g., development, productivity, stress) of crops and forests [1], which has important societal, environmental and economical implications. Leaf area index (LAI) de ned as half the total intercepting leaf area per unit ground surface area [2], leaf chlorophyll content (Chl), fraction of absorbed photosynthetically active ra- diation (fAPAR), and fractional vegetation cover (FVC) are among the most important vegetation parameters to retrieve from space observations [3, 4]. In general, physical models implement the laws of Physics and allow us to compute the observation values given a state and a model [5]. Sometimes, and depending on the body of literature, they are known as process-based models or mechanistic models. In remote sensing, we refer to them as radia- tive transfer models as they implement the equations of energy (radiation) 4 transfer. This is known as the forward modeling problem. In the inverse modeling problem, the aim is to reconstruct the system state from a set of measurements (observations), see Fig. 1. Notationally, a forward model describing the system is generally expressed as x = g(y;!), where x is a measurement obtained by the satellite (e.g. radiance); the vector y repre- sents the state of the biophysical variables on the Earth (which we desire to infer or predict and are often referred to as outputs in the inverse model- ing approach); ! contains a set of controllable conditions (e.g. wavelengths, viewing direction, time, Sun position, and polarization); and g() is a func- tion which relates y with x. Such a function g is typically considered to be nonlinear, smooth and continuous. Our goal is to obtain an inverse model, f () g (), parameterized by , which approximates the bio-geo-physical variables y given the data x received by the satellite, i.e. y ^ = f (x;). Ra- diative transfer models (RTMs) are typically used to implement the forward direction [6, 7]. However, inverting RTMs directly is very complex because the number of unknowns is generally larger than the number of indepen- dent radiometric information [8]. Also, estimating physical parameters from RTMs is hampered by the presence of high levels of uncertainty and noise, primarily associated to atmospheric conditions, sensor calibration, sun an- gle, viewing geometry, as well as the poor sampling of the parameter space in most of the applications. All these issues translate into inverse problems where deemed similar spectra may correspond to very diverse solutions. This gives raise to undetermination and ill-posed problems. Methods for model inversion and parameter retrieval can be roughly sep- arated in three main families: statistical, physical and hybrid methods [9]. Statistical inversion predicts a biogeophysical parameter of interest using a training dataset of input-output data pairs coming from concurrent mea- surements of the parameter of interest (e.g. leaf area index -LAI-) and the corresponding satellite observations (e.g. radiances or re ectances). Sta- tistical methods typically outperform other approaches, but ground truth 5 measurements involving a terrestrial campaign are necessary. On the con- trary, physical inversion reverses RTMs by searching for similar spectra in look-up-tables (LUTs), and assigning the parameter state corresponding to the most similar observed spectrum. This requires selecting an appropri- ate cost function, and generating a rich, representative LUT from the RTM. The use of RTMs to generate data sets is a common practice, and especially convenient because acquisition campaigns are very costly in terms of time, money, and human resources, and usually limited in terms of parameter com- binations. Finally, hybrid inversion exploits the input-output data generated by RTM simulations and train statistical regression models to invert the RTM model. Hybrid models combine the exibility and scalability of ma- chine learning while respecting the physics encoded in the RTMs. Currently, kernel machines in general [10], and Bayesian non-parametric approaches such as Gaussian Process (GP) regression [11] in particular, are among the preferred regression models [12, 13]. These GP models have been implemented in Earth observation opera- tional chains for the derivation of biophysical variables at global scale, such as LAI through an hybrid approach. In addition, multitemporal LAI retrievals were derived with similar methodology at local scale taking the advantage of remote sensing data at decametric spatial resolutions and short revisit time, such as Sentinel-2 data [14, 15]. These features allow spatial and tempo- ral interpretation of GP estimates and their associated uncertainties at eld level which can be related with remote sensing artifacts (e.g., clouds) and crop heterogeneity (e.g., crop damages). While hybrid inversion is practical when no in situ data is available, in- tuitively it makes sense to let predictions be guided by actual measurements whenever they are present. Likewise, when only very few real in situ mea- surements are available, it is sensible to incorporate simulated data from RTMs to properly ground the models. This is a novel approach considered in this work, which extends the hybrid inversion by proposing a statistical 6 method that performs nonlinear and nonparametric inversion blending both real and simulated data. The so-called joint GP (JGP) essentially learns how to trade o noise variance in the real and simulated data [16]. A second topic covered in this work follows an alternative pathway to learn latent functions that generated the observations using GP models. We introduce a latent force model (LFM) for GP modelling [17, 18]. The proposed LFM-GP combines the ordinary dierential equations of the forward model (through smoothing kernels) and empirical data (from in situ campaigns). The LFM presented here performs multi-output structured regression, adapts to the signal characteristics, is able to cope with missing data in the time series, and provides explicit latent functions that allow system analysis and evaluation. Finally, we deal with the important issue of emulation of RTMs, that is learning surrogate GP models to approximate costly RTMs. The proposed Automatic Gaussian Process Emulator (AGAPE) methodology combines the interpolation capabilities of Gaussian processes (GPs) with the accurate de- sign of an acquisition function that favours sampling in low density regions and atness of the interpolation function. AGAPE allows building compact sets to perform ecient inverse modelling while respecting the complex phys- ical rules encoded in RTMs. All in all, in this paper we will illustrate the use of GPs in standard retrieval applications. In particular, we will introduce GPs to tackle prob- lems of hybrid modeling, extending the naive application of previous works. We formalize a full framework for Earth observation with GPs. The frame- work incorporates dierent GPs models, and extend our previous works on including temporal information in GP modeling [19, 20], incorporating both simulated and real data [16], advancing in the incorporation of physical rules in the modeling through the generation of kernel functions out of dierential equations [17, 21], multiple output GPs to assess consistency of the pre- dictions [21], and to learn compact look-up-tables (LUT) and emulators of 7 RTMs using GPs in a Bayesian optimization procedure [22, 23, 24]. This work improves the previous survey in [25] with an improved literature re- view and contextualization, as well as new experimental results on the use of GPs in precision agriculture (see §2), new results and application to transfer learning of the joint GP (see §3), new results for the latent force model in gap- lling problems originally introduced in [21] (see §4), as well as more de- tails, new formulations and experiments for the automatic emulator model, which is now fully automatic and works for multioutput problems, see §5. The remainder of the paper is organized as follows. We rst brie y in- troduce the standard GP for regression in §2. Then a Joint Gaussian Pro- cess (JGP) is proposed in §3 that exploits the regularities between real and simulated data, and provides a simple framework for incorporating physi- cal knowledge into a GP model. We introduce LFMs in §4 for vegetation monitoring across time, and then an automatic emulator based on GPs is presented in Section §5. We conclude in §6 with some remarks and an out- line of future work. 2. Gaussian Process Models for Inverse Modeling GPs are state-of-the-art tools for regression and function approximation, and have been recently shown to excel in biophysical variable retrieval by following both statistical [12, 13] and hybrid approaches [26, 19]. 2.1. A brief overview of GPs Let us consider a set of n pairs of observations or measurements, D := n d fx ; y g , where x 2 R and y 2 R, which are perturbed by an additive i i i=1 nd n1 independent noise. The input data pairs (X 2 R ; y 2 R ) used to t the inverse machine learning model f () come from either in situ eld campaign data (statistical approach) or simulations by means of an RTM (hybrid approach). We assume the following model, y = f (x ) + e ; e N (0; ); (1) i i i i 8 d 2 where f (x) is an unknown latent function, x 2 R , and stands for the | | noise variance. De ning y = [y ; : : : ; y ] and f = [f (x ); : : : ; f (x )] , the 1 n 1 n conditional distribution of y given f becomes p(yjf ) = N (f; I), where I is the n n identity matrix. Now, in the GP approach, we assume that f follows a n-dimensional Gaussian distribution f N (0; K) [27]. Retrieval f (x; ) Variables y Observations x Figure 2: Statistical inverse modelling. Given a set of observations x and set of parameters , the statistical model f (x; ) provides estimations of the variables y. In this case the model performs the inverse function of a physical model, which starting from the variables y provides the observations x. The covariance matrix K of this distribution is determined by a kernel 2 2 function with entries K = k(x ; x ) = exp(kx x k =(2 )), encoding ij i j i j similarity between the input points [11]. The intuition here is the following: the more similar input i and j are, according to some metric, the more correlated output i and j ought to be. Thus, the marginal distribution of y can be written as p(y) = p(yjf )p(f )df = N (0;C ); where C = K + I. Now, what we are really interested in is predicting a new output y , given an input x . The GP framework handles this by constructing a joint distribution over the training and test points, " # " #! y C k N 0; ; y k c where k = [k(x ; x ); : : : ; k(x ; x )] is an n 1 vector and c = k(x ; x ) + 1 n . Then, using standard GP manipulations, we can nd the distribution 9 over y conditioned on the training data, which is a normal distribution with predictive mean and variance given by | 2 1 (x ) = k (K + I ) y; GP n (2) 2 | 2 1 (x ) = c k (K + I ) k : GP e Thus, GPs yield not only predictions for test data, but also the so- GP called \error-bars", , assessing the uncertainty of the mean prediction. GP The hyperparameters = [; ] to be tuned in the GP determine the width of the squared exponential kernel function and the noise on the observations. This can be done by marginal likelihood maximization or simple grid search, attempting to minimize the squared prediction errors. In the next section we describe some practical cases regarding the use of GPs both and GP GP in EO. 2.2. GPs for model inversion in precision agriculture In this section, we show examples of the GP regression (GPR) model util- ity in real world applications related to precision farming/agriculture from remote sensing data. In this case, GPR was used for inverting the PROSAIL radiative transfer model (thus following a hybrid approach). PROSAIL sim- ulates leaf re ectance for the optical spectrum, from 400 to 2500 nm with a 1 nm spectral resolution, as a function of biochemistry and structure of the canopy, its leaves, the background soil re ectance and the system geometry. The leaf and canopy variables as well as the soil brightness parameter, were generated following a PROSAIL site-speci c parameterization to constrain the model to Mediterranean rice areas [19]. Firstly, PROSAIL was run in forward mode in order to build a database composed of pairs of simulated Sentinel-2 spectra and associated LAI values. A total number of 2000 sim- ulations were computed in such a way the obtained spectra and LAI values covered the expected season of rice crops as well as their management (agri- cultural practice). Then, the database (often called look-up table) was used for training the GPR model, which was then used for estimating LAI using 10 real Sentinel-2 imagery. Hence, every time a Sentinel-2 image was available, the corresponding LAI map was derived. This procedure was conducted be- tween mid-May until early-October thus completely covering the rice season. As result we derived 11 Sentinel-2 LAI maps. Figure 3 shows the LAI evolution over a rice eld which is in accordance with rice plant evolution. It is worth mentioning that an unexpected drop was detected on August 29th. LAI decreased too much on this date: a LAI decrease about 3 in a 10-day period does not correctly characterize the typ- ical rice LAI behaviour. Moreover, if we observe the temporal evolution of the GP predictive variance, , values remain virtually constant at a value GP about 0.8. However, dramatically increases (up to 3) on the date that GP the unexpected drop was observed. Figure 4 provides a Sentinel-2 a map for the predictive variance on August 29th where undetected clouds presented very high values as compared with cloud free rice elds. The lower con dence (higher prediction uncertainty) is associated with spectra non-represented in the PROSAIL training database. Therefore, non-vegetated surfaces, such as clouds, present higher prediction uncertainty (lower con dence). This assess- ment of can be useful to properly weight estimates with low con dence GP when used by crop modelers, and also to to improve cloud masks. The spatio-temporal detail of the derived maps due to the use of Sentinel- 2 data (10 m spatial and 10-day temporal resolution), allows intra- eld and multi-temporal analysis useful for crop assessment [14]. These features al- low the identi cation of signi cant dierent values within the same rice eld. Intra- eld LAI dierences are mainly due to the heterogeneity of the eld related with non-homogeneous agro-practices. The retrieved high-resolution LAI estimates can be used to continuously monitor the cropping season and to detect crop growth anomaly linked with potential crop damage. In partic- ular, Fig. 5 exhibits the temporal evolution of two pixels within the same rice eld. The blue line corresponds to the LAI evolution of a healthy pixel and the red one describes the temporal behaviour of a pixel located in the same 11 2 2 (m /m ) 21-May 10-Jun 20-Jun 10-Jul 30-Jul 9-Aug 19-Aug 29-Aug 8-Sep 18-Sep 8-Oct Figure 3: Temporal evolution of GPR LAI estimates and associated uncertainty (shaded space) over a rice pixel. >2.5 1.5 0.5 Figure 4: Sentinel-2 RGB composite over rice elds near Valencia (Spain) on August, 29th 2016 and the corresponding LAI uncertainty over an undetected cloud. rice eld but aected by a rice disease. According to the temporal pro les, the anomalous LAI behaviour started on the beginning of the season impact- 2 2 LAI (m /m ) Figure 5: Temporal evolution of GPR LAI estimates and Sentinel-2 LAI map on July, 29 2016. Red line corresponds to a damaged pixel while the blue one corresponds to a healthy pixel within the same eld. ing in the LAI values mainly in the rice development stage. This information was corroborated by in situ observations. Overall, this kind of analysis and assessment can be used to early derive anomalies maps related with crop damages which could be used by farmers in order to apply agro-practices for mitigating yield loss. 3. Forward and Inverse Joint GP Models 3.1. Notation and formulation Let us now assume that the previous dataset D is formed by two disjoint sets: one set of r real data pairs, D = f(x ; y )g , and one set of s RTM- r i i i=1 simulated pairs D = f(y ; x )g , so that n = r + s and D = D [D . s j j n r s j=r+1 rd r1 sd s1 In matrix form, we have X 2 R , y 2 R , X 2 R and y 2 R , r r s s containing all the inputs and outputs of D and D , respectively. Finally, r s 13 the n 1 vector y contains all the n outputs, sorted with the real data rst, followed by the simulated data. Now, we de ne a dierent model where the observation noise depends on the origin of the data: for real observations (x 2 D ) or = for RTM simulations (x 2 D ), where the parameter i r i s > 0 accounts for the importance of the two sources of information relative to each other. The resulting distribution of y given f is only slightly dierent from that of the regular GP, namely p(yjf ) = N (f; V) where V is an n n diagonal matrix in which the rst r diagonal elements are equal to 1 and the remaining 1 1 1 s are equal to : V = diag(1; : : : ; 1; ; : : : ; ). The predictive mean and variance of a test output y , conditioned on the training data, then becomes | 2 1 (x ) = k (K + V) y; JGP (3) 2 | 2 1 (x ) = c k (K + V) k : JGP e Note that when = 1 the standard GP formulation is obtained. Otherwise acts as an extra regularization term accounting for the relative importance of the real and the simulated data points. Selection of the hyperparameters of the JGP, = [; ; ]; is central to the eectiveness of the model, since what we are really interested is in performing predictions on the real data. We therefore maximize the pseudo-likelihood [11] of the real data only: L(X; y;) = log p(y jX ; y ;); (4) i ni ni i=1 where we sum over the log-likelihood of each real data point given the re- maining training data. The sub-index ni represents the remaining training data. The log-likelihood of a single point i given the remaining data is 1 (y ) i i log p(y jX ; y ;) = log 2 ; i ni ni 14 2 where and are computed using (3) with all r + s datapoints except the i'th. By optimizing hyperparameters in this way, becomes a measure of how useful the simulated data is in predicting the real data. 3.2. Experimental Results We are concerned about the prediction of leaf area index (LAI) parameter from space, a parameter that characterizes plant canopies and is roughly de- ned as the total needle surface area per unit ground area. Non-destructive real LAI data were acquired over Elementary Sampling Units (ESUs) within rice elds in Spain, Italy and Greece during eld campaigns in 2015 and 2016, i.e. 6 datasets. The temporal frequency of the campaigns was approximately 10 days starting from the very beginning of rice emergence (early-June) up to the maximum rice green LAI development (mid-August). LAI measure- ments were acquired using a dedicated smartphone app (PocketLAI), which uses both the smartphone's accelerometer and camera to acquire images at 57.5 below the canopy and computes LAI through an internal segmentation algorithm [26]. The center of the ESU was geo-located for later matching and association of the mean LAI estimate with the corresponding satellite spec- tra. We used Landsat 8 surface re ectance data over each area corresponding to the dates of measurements' acquisition. The resulting datasets contain a number of in situ measurements in the range of 70-300 depending on the country and year. On the other hand, three simulated datasets of s = 2000 pairs of Landsat 8 spectra and LAI, with characteristics corresponding to the relevant rice area, were obtained running the PROSAIL RTM in forward mode following the similar procedure described in Section 2.2, but in this case we simulated Landsat-8 spectra instead of Sentinel-2. Two types of experiments were conducted. In the rst one, we investi- gate, for each of the 6 datasets, how including simulated data might improve predictions in a regular 10-fold cross-validation scheme. In the second exper- iment, we explore how simulated data might help prediction of LAI in one site, given that one only has access to data from a dierent site. This is a 15 Spain Greece Italy 1.05 0.50 GP baseline 0.70 GP 0.45 1.00 r+s JGP 0.65 0.40 0.95 0.60 0.35 0.90 0.55 0.30 0.85 0.50 0.25 0.80 0 2 4 6 8 0 2 4 6 8 0 2 4 6 8 0.50 0.55 1.15 0.45 0.50 1.10 0.40 0.45 1.05 0.35 0.40 1.00 0.30 0.35 0.95 0.25 0 2 4 6 8 0 2 4 6 8 0 2 4 6 8 p p p Figure 6: Performance comparison (RMSE) for dierent ways of including simulated data. p = s=r is the ratio between simulated (s) and real (r) data. The JGP and the regular GP, trained on a dataset of real and simulated data pooled together (i.e. the GP ), are r+s compared to the base line of the GP trained exclusively on real data. RMSE is shown for the dierent sites, campaign dates and simulated-to-real data ratios. As the scale is constant over the plots for better comparison, it was omitted how the GP RMSE r+s monotonically increases and reaches 0.85 for p = 8 from the plot in Italy 2016. quite habitual situation, often referred to as domain adaptation or transfer learning in machine learning [9, 28]; or simply as model transferability in re- mote sensing applications. In this case, we use datasets from 2016. We shall refer to these experiment types as same-site and cross-site respectively. 3.2.1. Same-site experiments We assessed the performance of the JGP for dierent amounts of sim- ulated data. We compare to a regular GP model (see Sec. 2) which only has access to real data, which we will refer to as GP , and a dierent regu- lar GP model which has access to a training set of both simulated and real data. This, more naive approach of including simulated data, which does not distinguish between data sources, is referred to as GP . r+s 2016 2015 Figure 6 shows the eect of the ratio between simulated and real data points, p = s=r, on the RMSE evaluated using 10-fold crossvalidation. The JGP behaves in dierent ways on dierent datasets. There are cases where a value close to 0 is tted, the simulated data is largely ignored, and it follows the GP baseline. For the datasets where this does not happen, we see that a p 1 is enough to produce an eect. This is worth noting as the inversion of the kernel matrix, needed to train the JGP, scales in time 3 3 complexity with the number of samples cubed, O(n ) = O((r + s) ). In the case of Greece 2015, an average increase in RMSE is observed which, percentage-wise is around 1 %. In Spain 2015 and Greece 2016, a decrease in RMSE of around 5 % can be observed. Interestingly, we see that the naive inclusion of simulated data (the GP scheme) generally r+s leads to an increase in error, except for the case of Greece 2016. This hints towards the fact that the GP can perform better than the JGP approach r+s when simulated data is of high quality, as it is less conservative. Overall, the JGP appears to be a safe way to include simulated data, at worst increasing RMSE by 1 % in one dataset, and at best decreasing it by 5 %. This is made possible by the hyperparameter tting procedure which attempts to assess whether the simulated data is useful or confusing for prediction. 3.2.2. Cross-site experiments We turn now to the question of whether a regression model, trained on data from one particular site, can be useful for prediction at a dierent one. The experiment is not incidental, but of practical concern and implications, as it is expensive to collect data, limiting the amount of real data available. It comes down to a question of how similar the distributions over the output variable LAI, given the input variables, are across sites. In order to visualize the 6 datasets we plot them in the NDVI-LAI rep- resentation space in Fig. 7, along with their simulated counterparts. The PROSAIL simulated data exhibits the well known exponential relation be- tween NDVI and LAI, and a similar trend is visible in the real datasets. The 17 PROSAIL NDVI NDVI NDVI Figure 7: Scatterplots in the NDVI-LAI representation space of the real and RTM- simulated data for all sites and acquisition campaigns (2015, 2016). data distribution that stands the most out, both with respect to simulated and other real datasets, is that pertaining to the Greek site. This is, as we shall see, what determines how well a model trained on that data will perform on other datasets. We consider the following strategies for predicting on a target site, having only data from a source site. One might train a GP with the available real source data, denoted GP in this section. Otherwise, we might have some knowledge about the target site that we can use to create an RTM-simulated database. This strategy of training only on simulated data will be referred to as GP here. Finally, one can try to combine the real data from the source site, with simulated target site data. This could be attempted through training a normal GP on the union of these two datasets, i.e. a GP model, r+s or through the JGP. Figure 8 shows how these methods compare, where row names indicate the train/source site and the column denotes the test/target site. Thus, the RMSE of the GP is constant across columns. We see how the fact that the simulated data distribution poorly matches that of the real data in Greece and Italy (see Fig. 7) is re ected in the RMSE for the GP approach in the two rightmost columns of Fig. 8. Conversely, we see from the second row how inclusion of simulated data in some form is very useful for predicting in 18 Spain Greece Italy 1.5 1.5 GP 1.2 1.2 GP r+ s 0.9 0.9 GP 0.6 0.6 JGP 0.3 0.3 0.0 0.0 1.5 1.5 1.2 1.2 0.9 0.9 0.6 0.6 0.3 0.3 0.0 0.0 1.5 1.5 1.2 1.2 0.9 0.9 0.6 0.6 0.3 0.3 0.0 0.0 Figure 8: Performance (RMSE) of the dierent approaches to cross-site learning, where rows and columns indicate the source and target datasets respectively. other sites when having access only to the real dataset from Greece. We note here, about the JGP, that it is the only method that consistently performs better than the simple GP strategy. In conclusion, the JGP can be said to be a safe approach to include simulated data for non-linear regression. 4. Inverse Modelling with Latent Force Models In this second case study, we are interested in inverse modelling from real in situ data, learning not only an accurate retrieval model but also the physical mechanism that generated the input-output observed relations without even accessing any RTM, see Fig. 9. Here, we assume that our observations correspond simply to the temporal variable, x t, so the latent functions are de ned in the time domain, f (t). Nevertheless, extension Italy Greece Spain LF f (x) Variables y Observations x LF f (x) Retrieval f (x; ) 2 q LF f (x) Figure 9: Inverse modeling with latent forces. In this case, the statistical inversion model does not depend directly on the inputs, but on a set of a priori unknown independent latent functions that describe the underlying physical model. to multidimensional objects such as radiances is straightforward by using dierent kernels. Notationally, let us consider a multi-output scenario with Q correlated observed time series, y (t) for 1 q Q, and let us assume that we have n samples available for each of these signals, taken at sampling points t , s.t. y [i] = y (t ) for 1 i n. This is the training set, which i q q i is composed of an input vector, t = [t ; : : : ; t ] , and an output matrix, 1 n Y = [y ; : : : ; y ] with y = [y [1]; : : : ; y [n]] . We aim to build a GP 1 Q q q q model for the Q outputs that can be used to perform inference on the test | | e e e set : t = [t ; : : : ; t ] and Y = [ye ; : : : ; ye ] with ye = [ye [1]; : : : ; ye [m]] 1 m 1 Q q q q and ye [m ] = y (t 0 ) for test inputs at t 0 . q q m m 4.1. Formulation Let us assume that a set of R independent latent functions (LFs), f (t) with 1 r R, are responsible for the observed correlation between the outputs. Then, the cross-correlation between the outputs arises naturally as a result of the coupling between the set of independent LFs, instead of being imposed directly on the set of outputs. Let us de ne the form of 20 these latent functions and the coupling mechanism between them. In this work, we model the LFs as zero-mean Gaussian processes (GPs), and the coupling system emerges through a linear convolution operator described by an impulse response, h (t), as follows: y (t) = L [t]ff (t)g = f (t) h (t) = f ( )h (t )d; (5) r;q q r r q r q where L [t]ff (t)g indicates the linear operator associated to the linear con- q r volution of the latent force f (t) with the smoothing kernel h (t). As shown r q in Fig. 10, the outputs are nally obtained as a linear weighted combina- tion of these pseudo-outputs plus an additive white Gaussian noise (AWGN) term: y (t) = S y (t) + w (t); (6) q r;q r;q q r=1 where S represents the coupling strength between the r-th LF and the r;q q-th output, and w (t) N (0; ) is the AWGN term. In practice, we consider only the squared exponential auto-covariance function for the LFs, 0 2 (t t) k (t t) / exp( ), where the hyperparameter ` controls the length- f f 2 r r r 2` scale of the process. The smoothing kernel encodes our knowledge about the linear system (that relates the unobserved LFs and the outputs), and can be based on basic physical principles of the system at hand (as in [17, 18]) or selected arbitrarily (as in [29, 30]). In this paper, we consider the Gaussian smoothing kernel, h (t) / exp( ). Since the LFs are zero-mean GPs, the noise is zero-mean q 2 and Gaussian, and all the operators involved are linear, the joint LFs-output process is also a GP. Therefore, the mean function of the q-th output is (t) = 0, whereas the cross-covariance function between two outputs is 0 0 0 2 0 k (t; t ) = S S L [t]fL [t ]fk (t; t )gg + [p q][t t]; (7) y y r;p r;q p q f f p q r r q r=1 0 0 where the term L [t]fL [t ]fk (t; t )gg denotes the application of the con- p q f f r r volutional operator twice to the autocorrelation function of the LFs, which 21 w (t) y (t) h (t) 1,1 S w (t) R,1 2 f (t) 1,2 y (t) h (t) R,2 w (t) f (t) 1,Q R,Q y (t) h (t) Figure 10: GP-LFM relating inputs (latent forces) and outputs (observations). results in the following double integral: Z Z t t 0 0 0 0 0 0 L [t]fL [t ]fk (t; t )gg = h (t )h (t )k (; )d d: p q f f p q f f r r r r 0 0 Finally, the cross-correlation between the LFs and the outputs readily gives 0 0 0 k (t; t ) = S L [t ]fk (t; t )g, which involves a single one-dimensional f y r;q q f f r q r r integral. All integrals can be solved analytically when both the LFs and the smoothing kernel have a Gaussian shape. Learning hyperparameters through marginal log-likelihood maximization is very challenging because of its complicated dependence on the hyperpa- rameters = [ ; l ; ; ; ]. We propose to solve the problem through q r n q a stochastic gradient descent technique, the scaled conjugate gradient [31]. Once the hyperparameters of the model have been learned, inference pro- ceeds by applying standard GP regression formulas [11] (cf. §2). Now, since the conditional PDF is Gaussian, the minimum mean squared error (MMSE) 22 prediction is simply given by the conditional mean: y ^ = = K K y; (8) y~jy y~y yy | | where y ^ = [y ^ ; : : : ; y ^ ] is the vectorized version of the inferred outputs, which can be expressed in matrix form as Y = [y ^ ; : : : ; y ^ ] with y ^ = 1 Q q | 0 0 [y ^ [1]; : : : ; y ^ [m]] and y ^ [m ] = y ^ (t ). q q q q 4.2. Experimental Results We are concerned about multiple time series of two (related) biophysical parameters, LAI and fraction of Absorbed Photosynthetically Active Radia- tion (fAPAR), in the locations of the experiments described in Section §3.2. We focus on a set of representative rice pixels of each area, thus allowing us to observe the inter-annual variability of rice from 2003 to 2013 at a coarse spatial resolution (2 Km), which is useful for regional vegetation modelling. 4.2.1. Learning the LAI-fAPAR relationships In this section, we explore the LAI vs. fAPAR relationship, which is usually modeled using the following exponential model, as largely observed in the literature [32]: fAPAR = 1 exp( LAI): (9) In order to determine whether the GP-LFM is able to capture this well- known relationship, we train the model using the multi-output time series composed of all the available LAI and fAPAR data from the MODIS sensor for Spain and Italy from the beginning of 2003 until the end of 2013 (i.e., N = 506 and the number of outputs is Q = 4). After removing truly missing data (marked with negative values in the original time series) this results in 2006 training samples (506 samples for each of the time series from Spain and 497 samples for the Italian ones). We have experimented with a variable 23 number of latent forces, R 2 f1; 2; 3g. Table 1 shows the quantitative results in terms of mean squared error (MSE), N 1 MSE = (y [n] y ^ [n]) ; (10) q q q n=0 and normalised MSE, MSE NMSE (%) = 100; (11) q P N 1 1 q y [n] n=0 q where y [n] denotes the true value of the n-th sample from the q-th time series, y ^ [n] is the value predicted by the model, N N is the number of q q samples available for the q-th time series (N = N = 506 and N = N = 1 2 3 4 497, as mentioned above) and q = 1; : : : ; Q = 4. MSE (NMSE) R LAI (ES) LAI (IT) fAPAR (ES) fAPAR (IT) 1 0.1139 (2.08 %) 0.2422 (5.97 %) 0.0080 (4.02 %) 0.0046 (2.49 %) 2 0.0548 (1.00 %) 0.1636 (4.03 %) 0.0013 (0.67 %) 0.0039 (2.11 %) 3 0.0012 (0.02 %) 0.1657 (4.09 %) 0.0002 (0.10 %) 0.0025 (1.38 %) Table 1: Absolute and normalised MSE using the full dataset for R 2 f1; 2; 3g LFs. Noting the substantial decrease in MSE, in the rest of this section we set R = 3. Figure 11 shows the three LFs inferred and Figure 12 displays the four output time series. In Fig. 12 we can see the good modeling accuracy for all the time series (as evidenced by the low NMSE values displayed in Table 1), whereas in Fig. 11 we see that LF3 captures the smooth and periodic component of the output and the other two LFs focus on the noisier part (albeit with an important residual periodical component). Finally, Figure 13 displays the LAI vs. fAPAR scatter plot, obtained from the modeled time series both for Spain and Italy. The shaded area corresponds to the uncertainty (that appears now in both axis, as a consequence of the modeling 24 4 5 3.5 2.5 1.5 0 1 -1 0.5 -2 -1 -3 -2 -4 -0.5 -3 -5 -1 2004 2006 2008 2010 2012 2014 2004 2006 2008 2010 2012 2014 2004 2006 2008 2010 2012 2014 Year Year Year Figure 11: Inferred LFs (black line) and uncertainty measured by 2 standard deviations about the mean predicted value (grey shaded area), obtained from the full LAI and fAPAR dataset from Spain and Italy with R = 3. uncertainty in both LAI and fAPAR), whereas the continuous red line shows the exponential model in (9), that has been tted to the data by performing a simple least squares regression in the log-domain. Note the good t in both cases of the expected LAI vs. fAPAR relationship, given by (9) with = 0:4047 and = 0:4593 for Spain and Italy respectively, and the scatter plot obtained from the GP-LFM. 4.2.2. Dealing with missing data In this second example, we show the ability of the model to recover the missing samples (i.e., to perform gap lling) that typically appear in this kind of datasets. We use again all the LAI data from the MODIS sensor for Spain from the beginning of 2003 until the end of 2013 (i.e., N = 506), but only the rst half (years 2003{2009) of the LAI data from Italy and the two fAPAR time series (i.e., N = 275, N = 276 and N = 275). In order to 2 3 4 show that the GP-LFM is able to capture the underlying dynamics of the multi-output time series with a single LF, we set R = 1. The three time series with missing data are displayed in Fig. 14, whereas the single LF (not shown) is very similar to the smooth LF (LF3) in Fig. 11. Note the good t of the three time series, even though the last four years of data are removed from all of them. LF 1 LF 2 LF 3 6 0 -1 2003 2004 2005 2006 2007 2008 2009 2010 2011 2012 2013 2014 2003 2004 2005 2006 2007 2008 2009 2010 2011 2012 2013 2014 Year Year 1 1 0.9 0.9 0.8 0.8 0.7 0.7 0.6 0.6 0.5 0.5 0.4 0.4 0.3 0.3 0.2 0.2 0.1 0.1 0 0 2003 2004 2005 2006 2007 2008 2009 2010 2011 2012 2013 2014 2003 2004 2005 2006 2007 2008 2009 2010 2011 2012 2013 2014 Year Year Figure 12: Training data (red circles), predicted time series (black line) and uncertainty measured by 2 standard deviations about the mean predicted value (grey shaded area), obtained from the full LAI and fAPAR dataset from Spain and Italy when R = 3. 5. Automatic Emulation Emulation deals with the challenging problem of building statistical mod- els for complex physical RTMs. The emulators are also called surrogate or proxy models, and try to learn from data the equations encoded in the RTM. Namely, an emulator is a statistical model that tries to reproduce the behav- ior of a deterministic and often very costly physical model. Emulators built with GPs are gaining popularity in remote sensing and geosciences, since they allow ecient data processing and sensitivity analysis [33, 34, 13]. Emula- tors also allow model tractability, as model-data integration, fast inference, analytical Jacobian calculation, and derivation of con dence intervals for the fAPAR (Spain) LAI [m^2/m^2] (Spain) fAPAR (Italy) LAI [m^2/m^2] (Italy) 1 1 0.8 0.8 0.6 0.6 0.4 0.4 0.2 0.2 0 0 0 1 2 3 4 5 6 0 1 2 3 4 5 6 2 2 2 2 LAI [m /m ] (Spain) LAI [m /m ] (Italy) Figure 13: LAI vs. fAPAR for the data learned using R = 3 LFs and all the available data for years 2003{2013. 1.2 1.2 1 1 0.8 0.8 3 0.6 0.6 0.4 0.4 0.2 0.2 -1 0 0 2004 2006 2008 2010 2012 2014 2004 2006 2008 2010 2012 2014 2004 2006 2008 2010 2012 2014 Year Year Year Figure 14: Gap lling example using a single LF (i.e., R = 1). Training used all the LAI data from Spain (years 2003{2013) and the rst half (years 2003{2009) of the other three time series: LAI (IT), fAPAR (ES) and fAPAR (IT). The second half constitutes the test set of such time series. Training data (red circles), test data (red dashed line), predicted time series (black line) and uncertainty measured by 2 standard deviations about the mean predicted value (gray shaded area). estimates and parameters becomes easier and analytical when GPs are used. Here, we are interested in optimizing emulators such that a minimal num- ber of simulations is run (for a given approximation error). We describe a general framework, called Automatic Emulation (AE) technique which is re- lated to Bayesian optimization and active learning techniques. We rst de ne the generic elements of the AE methodology and then describe a speci c im- plementation based on GPs. This yields the Automatic Gaussian Process 2 2 LAI [m /m ] (Italy) fAPAR (Spain) fAPAR (Spain) fAPAR (Italy) fAPAR (Italy) x t gb (yjY ; x ) t t t RTM g(y; !) Interpolator Acquisition A (y) y y t t t + 1 t+1 Figure 15: Scheme of an automatic emulator. The goal of the model is to emulate the RTM as best as possible using a minimum number of runs. This is achieved by an iterative process which starts with a reduced input set of variables Y . Then the RTM provides t=0 the corresponding observations x . An interpolator is then tted, forming part of the t=0 acquisition function that is optimized to select new variables to add to the initial input set. The process is iterated until the stop condition is ful lled Emulator (AGAPE) model for automatic emulation and creation of a com- pact and informative look-up-table. The goal is to emulate (i.e., interpolate/mimic) a costly function g(y) choosing adequately the nodes, in order to reduce the error in the inter- polation with the smallest possible number of evaluation of g(y). Given an input matrix of nodes (used for the interpolation) at the t-th iterations, Y = [y y ], of dimension dm (where d is the dimension of each y and t 1 m t i m is the number of points), we have a vector of outputs, x = [x ; : : : ; x ] , t t 1 m where x = g(y ) is the estimation of the observations (e.g., radiances) at t t iteration t 2 N of the algorithm. Figures 15-16 show two graphical rep- resentations of a generic automatic emulator. At each iteration t one per- forms an interpolation/regression, obtaining gb (yjY ; x ), followed by an op- t t t timization step that updates the acquisition function, A (y), updates the set Y = [Y ; y ] adding a new node, set m m + 1 and t t + 1. t+1 t m +1 t t Note that the acquisition function A (y) is the core of the automatic em- ulation method. It plays the role of an oracle, suggesting in which regions it is more convenient to introduce new nodes, used in the next interpola- tion step. Clearly, the design of A (y) is a key point for the success of the 28 automatic emulation. The procedure is repeated until a suitable stopping condition is met, such as a certain maximum number of points is included or a desired precision error is achieved. Since g is unknown, we could compute kgb (y) gb (y)k . t t1 gb (y) g(y) 0 5 10 15 20 A (y) 0 5 10 15 20 Figure 16: General sketch of an Automatic Emulation (AE) procedure. The RTM model g(y) (top - solid line), its approximation gb (y) (top - dashed line) and an acquisition function A (y). Its maximum suggests where adding a new node/point to the LUT. 5.1. Theoretical Formulation The acquisition function, A (y), encodes useful information for proposing new nodes to build the emulator. Namely, the acquisition function A (y) suggests where introducing new support points for the next interpolation step. For this reason, the best new possible node, according to the designed acquisition function A (y), is exactly the arg max A (y). Therefore, at each t t 29 iteration, a new node is added maximizing A (y), i.e., y = arg max A (y); m +1 t and set Y = [Y ; y ], m = m + 1. Observe that the acquisition t+1 t m +1 t+1 t function A (y) must take into account the locations of the current nodes and the geometry of the underlying function g. Hence, we propose to factor- ize the acquisition function as product of a geometry H (y) and a diversity D (y) terms. The distribution of the previous nodes is encoded into the func- tion D (y), whereas the geometric information is included in H (y). More t t speci cally, we de ne the acquisition function as A (y) = [H (y)] D (y); 2 [0; 1]; (12) t t t t where A (y) : Y 7! R, and is an increasing function with respect to t, with t t lim = 1 (or = 1 for t > t ). Function H (y) captures the geometrical t!1 t t t information in g, while function D (y) depends on the distribution of the points in the current vector Y . More speci cally, D (y) presents a greater t t value around empty areas within Y , whereas D (y) will be approximately zero close to the support points and exactly zero at the support points, i.e., D (y ) = 0, for i = 1; : : : ; m and 8t 2 N. Since g is unknown, the t i t function H (y) can be only derived from information acquired in advance or by considering the approximation gb. The tempering value, , helps to downweight the likely less informative estimates in the very rst iterations. If = 0, we disregard H (y) and A (y) = D (y), whereas, if = 1, we have t t t t t A (y) = H (y)D (y). t t t 5.2. Speci c Implementations An automatic emulation (AE) procedure above described is completely de ned by the following elements: 1. the choice of the interpolator providing the approximation gb (yjY ; x ), t t t 2. the choice of the function D (y), 30 3. the choice of the function H (y), 4. and the choice of the tempering function . Furthermore, the stopping condition can be considered as an additional ele- ment. For the interpolation, we have to take into account the ability of the interpolator of building the approximation in high dimensional spaces and the dierentiability of gb (in the support domain with the exception of the a set of null measure). The conceptual set of elements fgb ; D ; H ; g de nes t t t t an AE method. Dierent combinations of these elements produces dierent AE techniques, each one yielding dierent performance. 5.3. Automatic Gaussian Process Emulator (AGAPE) We consider a GP technique as interpolator with y as inputs and x as t t outputs (i.e., reversed with respect to the previous section). In addition, note that interpolation forces to zero the noise standard deviation, i.e., = 0. Therefore, the AGAPE predictive mean and variance at iteration t for a new point y becomes simply | 1 | gb (y ) = k K x = k ; 2 | 1 (y ) = k(y ; y ) k K k ; where now k = [k(y ; y ); : : : ; k(y ; y )] contains the similarities between 1 m the input point x and the observed ones at iteration t, K is an m m t t kernel matrix with entries K := k(y ; y ), and = K x is the coecient i;j i j t nn vector for interpolation. The interpolation for y can be simply expressed as | t a linear combination of gb (y ) = k = k(y ; y ). We consider the t i i i=1 standard squared exponential kernel function now working with state vectors 0 0 2 2 y, i.e. k(y; y ) = exp(kyy k =(2 )). The training of the hyperparameter of kernel function k can be performed with standard procedure, as cross- validation or marginal likelihood optimization [35]. 2 2 Note that (y ) = 0 for all i = 1; : : : ; m and (y) depends on the i t distance among the support points y , and the chosen kernel function k and 31 2 associated hyper-parameter . For this reason, the function (y) is a good candidate to represent the distribution of the y 's since it is zero at each y , t i and higher far from the points y 's. Moreover, (y) takes into account the information of the GP interpolator. Therefore, we consider as the diversity term D(y) := (y); i.e., D(y) is induced by the GP interpolator. As geometric information, we consider enforcing atness on the interpolation function, and thus aim to minimize the norm of the the gradient of the interpolating function gb w.r.t. the input data y, i.e., H (y) = kr gb (yjY ; x )k = r k(y; y ) : y t t t i y i i=1 This intuitively makes wavy regions of gb require more support points than at regions. The gradient vector for the squared exponential kernel k(y; y ) = 0 2 2 | exp(kyy k =(2 )) with y = [y ; : : : ; y ] , can be computed in closed-form, 1 d k(y;y ) 0 0 0 | r k(y; y ) = [(y y ); : : : ; (y y )] . Therefore, the acquisition y 2 1 d function can be readily obtained by de ning = 1 exp( t), where 0, and A (y) = [H (y)] D (y). We optimized A (x) using interacting parallel t t t t simulated annealing methods,for the sake of simplicity [35, 36]. Indeed, these techniques do not require the knowledge of the gradient of the acquisition function, and thus more sophisticated schemes can be employed. In our ex- periments, simulated annealing schemes provide good performance, reaching a solution close to the global maximum in few iterations [37, 38]. However, as the dimension of the input space grows, the performance becomes worse. 5.4. Multi-output Automatic Emulation So far we have considered an RTM model of type x = g(y) + e, where e N (0; ) (with = 0 in AGAPE). Let us now denote the following RTM model represented by the equation x = g(y) + e; (13) 32 (1) (K) K 2 with x = [x ; : : : ; x ] 2 R and e N (0; I ) where I is a K K K K identity matrix. Then, we have K outputs for each y, i.e., (1) (K) K g(y) = [g (y); : : : ; g (y)] : Y ! R : (14) In order to design an automatic emulator of g(y), we have to extend the strategy described above. We consider that at the t-th iteration the current matrix of nodes Y = [y ; : : : ; y ] is shared by all outputs. Therefore, we t 1 m will design a multi-output emulator with a unique acquisition function A (y). The multi-output emulator is completely de ned by choosing the following elements: 1. A multi-output interpolation/regression scheme, considering the same input matrix Y = [y ; : : : ; y ] (for all the outputs) and the output t 1 m matrix X = [x ; : : : ; x ], providing an approximation t 1 m (1) (K) gb (yjY ; X ) = [gb (yjY ; X ); : : : ; gb (yjY ; X )]: t t t t t t t In [17, 39, 13, 40], dierent multi-output schemes are described. The simplest procedure consists in applying one independent interpolator for each output. 2. Given the matrix of current nodes Y = [y ; : : : ; y ] (shared by all t 1 m outputs), we obtain the diversity function D (y). Generally, we can (k) have a diversity term D (y) for each output (at least they can dier for hyper-parameters learned in dierent interpolator/regressor), hence we can de ne (k) D (y) = D (y): k=1 (k) (k) 3. Given each gb (yjY ; X ), we obtain the functions G (y), for k = t t 1; : : : ; K . Therefore, we can de ne the complete acquisition function as " # K K Y Y (k) (k) A (y) = G (y) D (y): (15) t t k=1 k=1 33 Optimizing A (y), we nd the next node y to incorporate in the next t m t+1 iteration, Y = [Y ; y ]. Other automatic emulator can be designed t+1 t m t+1 (k) considering multiple acquisition functions A (y), one for each output. 5.5. Emulation of Costly Radiative Transfer Codes We show empirical evidence of performance on the optimization of se- lected points for a complex and computationally expensive RTM: the MODTRAN5- based LUT. MODTRAN5 is considered as the de facto standard atmospheric RTM for atmospheric correction applications [41]. In our test application, and for the sake of simplicity, we have considered d = 2 with the Aerosol Optical Thickness at 550 nm ( ) and ground elevation (h) as key input pa- rameters. The underlying function g(y) consists therefore on the execution of MODTRAN5 at given values of and h and wavelength of 760 nm. The input parameter space is bounded to 0.05 { 0.4 for and 0 { 3 km for h. In order to test the accuracy of the dierent schemes, we have evaluated g(y) at all the possible 1750 combinations of 35 values of and 50 values of h. Namely, this thin grid represents the ground-truth in this example. Table 2: Averaged number of nodes m . Random Latin Hypercube AGAPE 28.43 16.69 9.16 We tested (a) a standard, yet suboptimal, random approach choosing points uniformly within Y = [0:05; 0:4] [0; 3], (b) the Latin Hypercube sampling [33], and (c) the proposed AGAPE. We start with m = 5 points > > > > y = [0:05; 0] , y = [0:05; 3] , y = [0:4; 0] , y = [0:4; 3] and y = 1 2 3 4 5 [0:2; 1:5] for all the techniques. We compute the nal number of nodes m required to obtain an ` distance between g and gb smaller than = 0:03, with the dierent methods. The results, averaged over 10 runs, are shown in Table 2. AGAPE requires the addition of 4 new points to obtain a distance smaller than 0:03. 34 5.6. Example of Multi-output Emulation We consider a multi-output toy example with scalar inputs where we can easily compare the achieved approximation gb (y) with the underlying function g(y), which is unknown in the real-world applications. In this way, we can exactly check the true accuracy of the obtained approximation using dierent schemes. For the sake of simplicity, we consider the following multi- output mapping g(y) = [log(y); 0:5 log(3y)]; y 2 (0; 10]; (16) then d = 1 and K = 2. Even in this simple scenario, the procedure used for selecting new points is relevant as con rmed by the results provided below. We start with m = 4 support points, Y = [0:1; 3:4; 6:7; 10]. We apply one 0 0 independent GP for each output. We add to Y sequentially 20 additional points, using dierent sampling strategies: (a) the multi-output version of AGAPE (denoted as AMOGAPE), (b) uniform points randomly generated in (0; 10], (c) a sequential Sobol sequence, (d) and a sequential version of the Latin Hypercube procedure (Seq-LHC). In this last case, i.e., for Seq-LHC, 20 points are generated following the LHC procedure and then one of them is added to Y at each iteration (without replacement). Note that, at each run, the results can vary even for the deterministic procedure due to the optimization of the hyperparameters (we use a parallel simulated annealing approach that is a stochastic optimization technique [35, 42, 43]). We average all the results over 500 independent runs. We compute the L distance,i.e., the Mean Square Error (MSE), between gb (y) and g(y) at each iteration, obtained by the dierent method. We show the evolution of the averaged MSE versus the number of support points m 35 (that is m = t + m ) in Figure 17. We can observe that the AMOGAPE t 0 scheme outperforms the other methods, providing the smallest MSEs between g(y) and gb (y). AMOGAPE Random Sobol -1 Seq-LHC -2 5 10 15 20 Number of nodes (m ) Figure 17: MSE (in log-scale) between g(y) and gb (y) versus the number of the number of support points m , that is m = t + 4 in this example. t t 6. Conclusions This paper treated the problems of forward and inverse modeling in re- mote sensing using advanced machine learning methods. We presented the eld formally, revised the concept of inverting or emulating radiative transfer models, and identi ed the main current shortcomings and trends when using machine learning in these settings. The rst section of the paper introduced the theory and use of GP mod- els in real world applications, including retrieval of biophysical parameters for assessment and decision making in crop management and lling gaps in time series of regional EO products. GP models provide high prediction ac- curacy and error bars for the predictions that can be useful for masking poor predictions or detecting anomalies. The models have now been adopted or are being considered for implementation in operation processing chains. MSE In the following sections we payed attention to advanced GP models. All of the presented methods were motivated by observing that two (apparently contradictory) philosophies are typically adopted: either trusting the phys- ical rules encoded in the physical model (for both simulation or inversion), or directly relying on data and thus following data science approaches (for both emulation and retrieval). We posit here that a richer, more appropriate approach to these problems emerges by developing machine learning algo- rithms that respect the Physics, that can incorporate prior knowledge, and that provide not only accurate but also credible predictions. Three types of physics-aware GP models were introduced: a simple approach to combine in situ measurements and simulated data in a single GP model, a latent force model that incorporates ordinary dierential equations, and an automatic compact emulator of physical models through GPs. The developed models demonstrated good performance, adaptation to the signal characteristics and transportability to unseen situations. We analyze the main features, pros and cons of the proposed models in what follows. The joint GP model is a very useful approach whenever one has access to real and simulated data by a physical model. By de nition, simulations can cover a large parameter space. This is why the JGP model can help to extrapo- late into the regions where real data is scarce. The JGP model is actually quite conservative and tends to perform as well or better than using real data alone. Two main shortcomings were identi ed though: 1) the more simulated data added, the better performance is obtained in general but the training process will be slower; and 2) There needs to be simulated data in the region of the real data, otherwise the model is not going to trust that the simulated data can be eectively used to improve predictions. In our experiments we illustrated the performance of the JGP in extrapolation across sites, thus demonstrating that the model learns a sort of domain adaptation by ground- 37 ing on the simulated data that should be equally relevant in dierent sites. A second approach for inverse modeling (retrieval) presented here relies on latent force models, that is, in models that encode dierential equations that model the generating system well into GPs. The GP-LFM allows us to model the correlation among multiple related outputs automatically by mix- ing the black-box, data-driven approach characteristic of machine learning approaches with the physical information used to derive purely mechanistic models. The GP-LFM should be used when: (1) data for multiple correlated outputs is available; (2) input-output relationships in the form of dieren- tial equations are known (either exactly or approximately). The GP-LFM has the following advantages: (1) it combines the rigour in the incorpora- tion of the available information of data-driven Bayesian approaches with the problem description accuracy of purely mechanistic models; (2) it can be cast as a generative model where the input-output relationships are not enforced directly (as in other multi-output/task approaches), but through physically interpretable latent forces (GPs) that are automatically learned from data; (3) some physical interpretability can be gained from the learned hyperparameters of the latent GPs and the smoothing kernels used to model input-output relationships; (4) the LFM-GP can be directly applied to out- puts with very dierent characteristics, dimensions and sampling rates; and (5) the model showed very good extrapolation capabilities, thus being able to deal with missing data and to make predictions in regions not covered by the available data. Some disadvantages can be identi ed though: (1) the model needs to work out the expressions of the output kernels for each combination of input kernels and smoothing kernels; (2) high computational cost of GPs, which is increased by the fact that output kernels are more complex than standard kernels used in GP regression; (3) cannot incorporate non-linear dierential equations, but only their linearized versions, as the whole input- output process would not be a GP any more. 38 The last novel methodology presented here has to do with the emulation of costly RTMs. For this we presented AGAPE, an automatic sequential in- terpolator/regressor that iteratively selects the parameter region to sample from, and build an optimal (compact) emulator. Additional advantages of AGAPE are that provides the information where possible new nodes should be incorporated to better approximate the unknown, underlying function en- coded in the RTM. The performance of AGAPE depends on a good choice of the starting points (and, as consequence, of an enough number of them), and a well-designed tempering function. These two considerations are connected: indeed, the key point is how much we believe/trust the estimation of the gradient from the previous function approximation. For the same reasons, as the dimensionality of the problem grows, AGAPE requires a greater number of starting nodes, in order to ensure the reliability of the initial approxima- tion of the unknown function. The framework that we presented here for attaining Physics-aware machine learning was illustrated using Gaussian Processes because of the solid grounds on Bayesian inference, properties and mathematical tractability. However, it has not escaped our notice that other machine learning algorithms could be equally applied. These issues will be subject of future research. Acknowledgments The research was funded by the European Research Council (ERC) un- der the ERC-CoG-2014 SEDAL project (grant agreement 647423), and the Spanish Ministry of Economy and Competitiveness (MINECO) and the Euro- pean Fund for Regional Development (ERDF) through the project TIN2015- 64210-R. The authors would like to thank the Institute for Electromagnetic Sensing of the Environment, the Cereal Institute of DEMETER, and the Aristotle University of Thessaloniki for providing the Italian and Greek eld data acquired under the ERMES FP7 project. 39 References [1] T. Hilker, N. C. Coops, M. A. Wulder, T. A. Black, R. D. Guy, The use of remote sensing in light use eciency based models of gross primary production: A review of current status and future requirements, Science of the Total Environment 404 (2-3) (2008) 411{423. [2] J. Chen, T. Black, De ning leaf area index for non- at leaves, Plant, Cell & Environment 15 (1992) 421{429. [3] R. H. Whittaker, P. L. Marks, Methods of assessing terrestrial produc- tivity, Primary Productivity of the Biosphere (1975) 55{118. [4] H. K. Lichtenthaler, Chlorophylls and carotenoids: Pigments of photo- synthetic biomembranes, Methods Enzymol. 148 (1987) 350{382. [5] R. Snieder, J. Trampert, Inverse Problems in Geophysics, Springer Vi- enna, Vienna, 1999, pp. 119{190. [6] S. Jacquemoud, C. Bacour, H. Poilv e, J.-P. Frangi, Comparison of four radiative transfer models to simulate plant canopies re ectance: Direct and inverse mode, Rem. Sens. Environ. 74 (3) (2000) 471{481. [7] W. Verhoef, H. Bach, Simulation of hyperspectral and directional radi- ance images using coupled biophysical and atmospheric radiative trans- fer models, Rem. Sens. Environ. 87 (2003) 23{41. [8] S. Liang, Advances in Land Remote Sensing: System, Modeling, Inver- sion and Applications, Springer Verlag, Germany, 2008. [9] G. Camps-Valls, D. Tuia, L. G omez-Chova, S. Jim enez, J. Malo (Eds.), Remote Sensing Image Processing, Morgan & Claypool Publishers, La- Porte, CO, USA, 2011. 40 [10] G. Camps-Valls, L. Bruzzone, Kernel Methods for Remote Sensing Data Analysis, John Wiley and Sons, 2009. [11] C. E. Rasmussen, C. K. I. Williams, Gaussian Processes for Machine Learning, The MIT Press, New York, 2006. [12] J. Verrelst, L. Alonso, G. Camps-Valls, J. Delegido, J. Moreno, Retrieval of vegetation biophysical parameters using gaussian process techniques, IEEE Transactions on Geoscience and Remote Sensing 50 (5/P2) (2012) 1832{1843, cited By 26. [13] G. Camps-Valls, J. Verrelst, J. MuA±oz-Mari, V. Laparra, F. Mateo- Jim enez, J. Gomez-Dans, A survey on gaussian processes for earth ob- servation data analysis, IEEE Geoscience and Remote Sensing Maga- zine (6). doi:http://dx.doi.org/10.1109/MGRS.2015.2510084. URL http://ieeexplore.ieee.org/document/7487896/ [14] M. Campos-Taberner, F. Garc a-Haro, G. Camps-Valls, G. Grau- Muedra, F. Nutini, L. Busetto, D. Katsantonis, D. Stavrakoudis, C. Mi- nakou, L. Gatti, M. Barbieri, F. Holecz, D. Stroppiana, M. Boschetti, Exploitation of sar and optical sentinel data to detect rice crop and esti- mate seasonal dynamics of leaf area index, Remote Sensing 9 (3) (2017) 248. doi:https://doi.org/10.3390/rs9030248. [15] L. Busetto, S. Casteleyn, C. Granell, M. Pepe, M. Barbieri, M. Campos- Taberner, R. Casa, F. Collivignarelli, R. Confalonieri, A. Crema, et al., Downstream services for rice crop monitoring in europe: From regional to local scale, IEEE Journal of Selected Topics in Applied Earth Obser- vations and Remote Sensing 10 (12) (2017) 5423{5441. [16] D. Heestermans Svendsen, L. Martino, M. Campos-Taberner, J. Garcia- Haro, G. Camps-Valls, Joint gaussian processes for biophysical param- 41 eter retrieval, IEEE Transactions on Geoscience and Remote Sensing 1 (1) (2017) 1{1. [17] M. A. Alvarez, D. Luengo, N. D. Lawrence, Latent force models, in: International Conference on Arti cial Intelligence and Statistics, 2009, pp. 9{16. [18] M. A. Alvarez, D. Luengo, N. D. Lawrence, Linear latent force models using gaussian processes, Pattern Analysis and Machine Intelligence, IEEE Transactions on 35 (11) (2013) 2693{2705. [19] M. Campos-Taberner, F. Garc a-Haro, G. Camps-Valls, G. Grau- Muedra, F. Nutini, A. Crema, M. Boschetti, Multitemporal and mul- tiresolution leaf area index retrieval for operational local rice crop monitoring, Remote Sensing of Environment 187 (2016) 102 { 118. doi:10.1016/j.rse.2016.10.009. [20] M. Campos-Taberner, F. Garc a-Haro, R. Confalonieri, B. Martnez, A. Moreno, S. S anchez-Ruiz, M. Gilabert, F. Camacho, M. Boschetti, L. Busetto, Multitemporal monitoring of plant area index in the va- lencia rice district with pocketlai, Remote Sensing 8 (3) (2016) 202. doi:10.3390/rs8030202. [21] D. Luengo-Garcia, M. Campos-Taberner, G. Camps-Valls, Latent force models for earth observation time series prediction, in: 2016 IEEE Inter- national Workshop on Machine Learning for Signal Processing (MLSP 2016), Salerno, Italy, 2016. [22] G. Camps-Valls, J. Verrelst, L. Martino, J. Vicent, Advanced machine learning emulators of radiative transfer models, in: American Geophysi- cal Union (AGU) Fall meeting 2017, New Orleans, USA, 11-15 December 2017, 2017. 42 [23] L. Martino, J. Vicent, G. Camps-Valls, Automatic emulator and opti- mized look-up table generation for radiative transfer models, in: 2017 IEEE International Geoscience and Remote Sensing Symposium, Fort Worth, Texas, USA, 2017. [24] L. Martino, J. Vicent, G. Camps-Valls, Automatic emulation by adap- tive relevance vector machines, in: Scandinavian Conference on Image Analysis (SCIA), Troms, Norway, 2017. [25] G. Camps-Valls, D. Svendsen, L. Martino, J. M. n. oz Mar , V. Laparra, M. Campos-Taberner, D. Luengo, Physics-aware G aussian processes for E arth observation, in: Scandinavian Conference on Image Analysis (SCIA), Troms, Norway, 2017. [26] M. Campos-Taberner, F. Garcia-Haro, A. Moreno, M. Gilabert, S. Sanchez-Ruiz, B. Martinez, G. Camps-Valls, Mapping leaf area in- dex with a smartphone and gaussian processes, Geoscience and Remote Sensing Letters, IEEE 12 (12) (2015) 2501{2505. [27] C. M. Bishop, Pattern recognition, Machine Learning 128 (2006) 1{58. [28] D. Tuia, G. Camps-Valls, Kernel manifold alignment for domain adap- tation, PLoS ONE (6). doi:http://journals.plos.org/plosone/ article?id=10.1371/journal.pone.0148655. [29] D. Higdon, et al., Space and space-time modeling using process convo- lutions, Quantitative methods for current environmental issues 3754. [30] P. Boyle, M. Frean, Dependent gaussian processes, in: Advances in neu- ral information processing systems, 2004, pp. 217{224. [31] I. Nabney, NETLAB: algorithms for pattern recognition, Springer Sci- ence & Business Media, 2002. 43 [32] R. Myneni, S. Homan, Y. Knyazikhin, J. Privette, J. Glassy, Y. Tian, Y. Wang, X. Song, Y. Zhang, G. Smith, A. Lotsch, M. Friedl, J. Morisette, P. Votava, R. Nemani, S. Running, Global products of vegetation leaf area and fraction absorbed par from year one of modis data, Remote Sensing of Environment 83 (1) (2002) 214 { 231, the Moderate Resolution Imaging Spectrora- diometer (MODIS): a new generation of Land Surface Monitoring. doi:https://doi.org/10.1016/S0034-4257(02)00074-3. URL http://www.sciencedirect.com/science/article/pii/ S0034425702000743 [33] D. Busby, Hierarchical adaptive experimental design for Gaussian pro- cess emulators, Reliability Engineering and System Safety 94 (2009) 1183{1193. [34] J. Rivera, J. Verrelst, J. G omez-Dans, J. Munoz-Mar ~ , J. Moreno, G. Camps-Valls, An emulator toolbox to approximate radiative transfer models with statistical learning, Remote Sensing 7 (7) (2015) 9347{9370. [35] L. Martino, V. Elvira, D. Luengo, J. Corander, F. Louzada, Orthogonal parallel MCMC methods for sampling and optimization, Digital Signal Processing 58 (2016) 64{84. [36] J. Read, L. Martino, D. Luengo, Ecient Monte Carlo optimization for multi-label classi er chains, IEEE International Conference on Acous- tics, Speech, and Signal Processing (ICASSP) (2013) 1{5. [37] L. Martino, V. Elvira, D. Luengo, A. Artes, J. Corander, Smelly parallel MCMC chains, IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP) (2015) 1{5. [38] L. Martino, V. Elvira, D. Luengo, J. Corander, Interacting paral- 44 lel Markov adaptive importance sampling, European Signal Processing Conference (EUSIPCO) (2015) 1{5. [39] M. A. Alvarez, D. Luengo, M. K. Titsias, N. D. Lawrence, Ecient multioutput gaussian processes through variational inducing kernels, in: International Conference on Arti cial Intelligence and Statistics, 2010, pp. 25{32. [40] D. Tuia, J. Verrelst, L. Alonso, F. P erez-Cruz, G. Camps-Valls, Multi- output support vector regression for remote sensing biophysical param- eter estimation, IEEE Geosc. Rem. Sens. Lett. 8 (4) (2011) 804{808. [41] A. Berk, G. Anderson, P. Acharya, L. Bernstein, L. Muratov, J. Lee, M. Fox, S. Adler-Golden, J. Chetwynd, M. Hoke, R. Lockwood, J. Gard- ner, T. Cooley, C. Borel, P. Lewis, E. Shettle, MODTRAN5: 2006 up- date, The International Society for Optical Engineering. [42] L. Martino, V. Elvira, D. Luengo, F. Louzada, Parallel Metropolis chains with cooperative adaptation, International Conference on Acoustics, Speech, and Signal Processing (ICASSP) (2016) 1{5. [43] S. K. Kirkpatrick, C. D. G. Jr., M. P. Vecchi, Optimization by simulated annealing, Science 220 (4598) (1983) 671{680.
Electrical Engineering and Systems Science – arXiv (Cornell University)
Published: Dec 7, 2020
You can share this free article with as many people as you like with the url below! We hope you enjoy this feature!
Read and print from thousands of top scholarly journals.
Already have an account? Log in
Bookmark this article. You can see your Bookmarks on your DeepDyve Library.
To save an article, log in first, or sign up for a DeepDyve account if you don’t already have one.
Copy and paste the desired citation format or use the link below to download a file formatted for EndNote
Access the full text.
Sign up today, get DeepDyve free for 14 days.
All DeepDyve websites use cookies to improve your online experience. They were placed on your computer when you launched this website. You can change your cookie settings through your browser.