Twin neural network regression

INTRODUCTION

Regression aims to solve one of the two main classes of problems in supervised machine learning: it is the process of estimating the function that maps feature variables to an outcome variable. Regression can be applied to a wide range of problems. In everyday life, one may wish to predict the sales price of a house1 or the number of calories in a meal.2 In business and industry, it can be desirable to estimate stock market changes3 or the sales numbers of a certain product.4 Within the natural sciences, regression has been applied to a rich variety of problems, including molecular formation energies,5 electronic charge distributions,6 inter-atomic forces,7 electronic band-gaps,8 and plasmonic response functions.9 Indeed, the wide range of applications makes regression part of the standard curriculum across all quantitative domains (applied math, computer science, engineering, economics, the physical sciences, etc.). Many existing algorithms solve the regression problem, including linear regression, Gaussian process regression, random forest regression, xgboost, and artificial neural networks, among others.

Regression problems require accurate and reliable solutions. Hence, in addition to the prediction itself, it is desirable to estimate the associated error or uncertainty.10,11 Such uncertainty signals can be used to decide when it is safe to let a model make decisions in the absence of expert supervision. For example, when a self-driving car experiences unfamiliar road conditions, model uncertainty can be used as a signal that it must alter its behavior.12 This could mean taking a different path, slowing down, or in the extreme, stopping until a human driver can take over. Similarly, in medical diagnostics, automated classification and analysis of diagnostic imaging can improve reliability and reduce costs.13 However, such tools can only be trusted if they have the ability to gauge their own accuracy and only make recommendations when the expected prediction accuracy is above a safe threshold. Recent successes with surrogate models14,15 require an accurate estimate of model uncertainty. Similarly, active learning algorithms rely on an agent's self-assessment ability; low confidence in a model can be used as a trigger for consulting the oracle.16,17

In this article, we present a regression algorithm called twin neural network regression (TNNR). It is inspired by Siamese neural networks in the sense that TNNR also leverages pairwise comparisons, here while making predictions in the domain of regression. TNNR naturally produces an ensemble of predictions because it compares a new unseen data point with all training data points. It combines the strengths of real and pseudo ensembles: on one hand, it creates a large number of predictions (twice the size of the training data set) at little additional training cost compared with a traditional neural network; on the other hand, as a real ensemble it significantly increases the prediction accuracy.
TNNR also provides signals of model uncertainty by construction: its network topology gives rise to self-consistency conditions, and a violation of these conditions can be interpreted as a sign of decreased prediction accuracy. We first describe the approach, then demonstrate its performance on well-known data sets, and finally examine its self-consistency conditions. The main contributions of this article are the development of a new regression algorithm which produces a large ensemble of predictions at low cost, the demonstration that this algorithm is competitive with and on many data sets outperforms the state of the art, and the use of its self-consistency conditions to estimate the prediction error.

PRIOR WORK

Twin neural network regression is inspired by Siamese networks. These networks are trained to identify whether an input pair is similar or different. Siamese networks consist of two identical neural networks which each act on one member of a pair to project it into a latent space; elements close to each other in the latent space are similar, while elements far apart are different. Siamese networks were originally developed for the identification of fingerprints18 and handwritten signature verification.19 More recently, coupled with deep convolutional architectures,20 they have been used for facial recognition,21 few-shot learning,22 and object tracking.23 Pairwise similarity has also been used as an approach to classification from unlabeled data.24 Siamese networks have previously been used in the regressive task of extracting a camera pose from images,25 and other uses combining images and regression have focused on the medical domain, as estimators of disease severity.26 The ability of Siamese networks to extract similarities from data can also be used to determine conservation laws in physics.27 Pairwise training is further used to rank and order data points.28,29 Pairwise similarity is also intrinsic to kernel methods, which require a user-specified kernel that acts as a similarity function over pairs of data points. TNNR is thus among a class of neural networks that can be related to kernel methods.30,31

Uncertainty assessments for linear regression are well established. Similarly, for stochastic processes there are standard techniques to estimate the uncertainty of a model. Gaussian processes (GPs)32-35 naturally provide an estimate of uncertainty, but the cost of training grows quickly with the number of training samples; GPs are impractical for large data sets, although there has been recent progress in this direction.36 GPs are also unable to easily incorporate new data: the model must be fully retrained for each new data point or observation, which makes online learning very costly. Conversely, there is not yet a single, established protocol for quantifying the error of (deep) neural networks, particularly in the case of regression. An early and straightforward approach to uncertainty estimation is the use of ensembles.

Ensemble methods are commonly used in regression (and classification) tasks as a means of improving the prediction accuracy37 and mitigating the bias-variance tradeoff38,39 by combining the predictions of different models. One can implement ensembling by simply averaging uncorrelated predictions from different models to reduce the variance error. More advanced methods combine predictions through bagging, which also reduces the variance error, or through boosting, which reduces the bias error.
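As a toy illustration of this averaging idea (a sketch, not code from the paper; the data set and choice of base model are arbitrary):

```python
# Toy sketch: a real ensemble built by bagging, whose averaged predictions
# reduce the variance component of the error.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(200, 5))
y = X[:, 0] ** 2 + 0.1 * rng.normal(size=200)      # noisy toy target

members = []
for seed in range(10):
    idx = rng.integers(0, len(X), size=len(X))      # bootstrap resample of the data
    members.append(DecisionTreeRegressor(random_state=seed).fit(X[idx], y[idx]))

X_new = rng.uniform(-1, 1, size=(50, 5))
y_ensemble = np.mean([m.predict(X_new) for m in members], axis=0)
```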
Ensembles of different predictions40,41 can be produced in several ways, such as repeating the training while changing the training-validation splits, or even sampling intermediate models along the training trajectory.42-44 Sampling along a training trajectory is efficient in that it generates approximately five ensemble members in the time it takes to train one traditional neural network. There are two kinds of ensemble methods: real ensembles, obtained by combining the predictions of multiple models,40,41 and pseudo ensembles,45,46 obtained by perturbing certain parameters of the data or the model. While pseudo ensembles have the advantage of requiring no overhead in training time, real ensembles47 yield a much better prediction accuracy.

The disagreement between models in an ensemble can be used as a confidence signal for the set. Recent work has highlighted the problems associated with ensembles as a method for uncertainty estimation.48-50 There is less theoretical justification for the reliability of errors from such approaches compared with GPs. Intuitively, a mismatch between models suggests that the output cannot be trusted, but it is less clear that the magnitude of this mismatch can be assigned to a particular value of uncertainty. If data far away from the training regime is used, regression becomes unreliable; conversely, a close distance of a new data point to one of the training points is correlated with lower error.51 Other methods for estimating model uncertainty and error include discriminant analysis,52 resampling,53 scoring rules,54,55 domain-specific metrics,56-58 and data set shifts.59 Finally, it was recently shown that the projection of data into the latent space of a model can be used as a proxy for uncertainty: a close distance to one of the training points in latent space is correlated with lower error.51

TWIN NEURAL NETWORK REGRESSION

Reformulation of regression

In a regression problem we are given a training set of n data points X_train = (x_1^train, ..., x_n^train) and target values Y_train = (y_1^train, ..., y_n^train). The task is to find a function f that approximates f(x_i) = y_i with the smallest possible error. Further, we require that the function f generalizes to unseen data X_test with labels Y_test. In the following, we reformulate this regression problem. Given a pair of data points (x_i^train, x_j^train), we train a neural network (Figure 1) to find a function F that predicts the difference

$$ F(x_i, x_j) = y_i - y_j \qquad (1) $$

Figure 1: Reformulation of a regression problem. In the traditional case a neural network is trained to map an input x to its target value f(x) = y. We reformulate the task to take two inputs x_1 and x_2 and train a twin neural network to predict the difference between the target values, F(x_2, x_1) = y_2 - y_1. Hence, this difference can be employed as an estimator y_2 = F(x_2, x_1) + y_1 given an anchor point (x_1, y_1).

This neural network can then be used as a solution of the original regression problem via y_i = F(x_i, x_j) + y_j. In this setting, we call (x_j, y_j) the anchor for the prediction of y_i. This relation can be evaluated on every training data point x_j^train, such that the best estimate for the solution of the regression problem is obtained by averaging

$$ y_i^{\mathrm{pred}} = \frac{1}{n}\sum_{j=1}^{n}\left( F(x_i, x_j^{\mathrm{train}}) + y_j^{\mathrm{train}} \right) \qquad (2) $$

$$ \phantom{y_i^{\mathrm{pred}}} = \frac{1}{n}\sum_{j=1}^{n}\left( \tfrac{1}{2}F(x_i, x_j^{\mathrm{train}}) - \tfrac{1}{2}F(x_j^{\mathrm{train}}, x_i) + y_j^{\mathrm{train}} \right) \qquad (3) $$
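The averaged, symmetrized prediction of Equations (2) and (3) can be sketched as follows. The helper assumes a trained twin network `twin_model` that maps the concatenated features of a pair (x_i, x_j) to the predicted difference y_i - y_j; the name and the concatenation order are illustrative assumptions, not the authors' code.

```python
import numpy as np

def tnnr_predict(twin_model, x_new, X_train, y_train):
    """Estimate y for x_new by averaging over all training anchors, Eqs. (2)-(3)."""
    n = len(X_train)
    x_rep = np.repeat(x_new[None, :], n, axis=0)
    # F(x_new, x_j): x_new in the first slot, anchor x_j in the second
    f_fwd = twin_model.predict(np.hstack([x_rep, X_train])).ravel()
    # F(x_j, x_new): reversed ordering, which enters Eq. (3) with a minus sign
    f_bwd = twin_model.predict(np.hstack([X_train, x_rep])).ravel()
    return np.mean(0.5 * f_fwd - 0.5 * f_bwd + y_train)
```

Every anchor contributes one estimate of y, so the same quantities also expose the full ensemble of difference predictions that is used later for uncertainty estimation.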
In this formulation, the TNNR output F(x_i, x_j) can be interpreted as a kernel. In kernel methods, the kernel k(x_i, x_j) together with trainable weights α_j determines the magnitude of the contribution of a training data point to the prediction at a new unseen data point. However, F(x_i, x_j) already includes all trainable weights, and its magnitude enters uniformly in the ensembling process of TNNR. Moreover, F(x_i, x_j) is not symmetric under the exchange of x_i and x_j, while in kernel methods the kernel (usually) is.

The first advantage of the reformulation is that, for a single prediction of y_i, it creates an ensemble of predicted differences y_i - y_j of twice the size of the training set. While ensembles are in general costly to produce, TNNR intrinsically yields a very large ensemble at little additional training cost. Since ensembles suppress the variance contribution within the bias-variance decomposition10 (see Appendix C), they tend to be more accurate than single models as long as the ensemble members are sufficiently diverse.

In general, we do not expect the ensemble diversity of TNNR to be similar to that of traditional ensembles. The reason is that the prediction of a regression target y_i is based on different predictions of the differences y_i - y_j due to the multiple anchor points. This allows us to combine TNNR with any traditional ensemble method to achieve an even more accurate prediction. Each prediction of y_i is made from a finite range of differences y_i - y_j, and the anchor data points differ by more than an infinitesimal perturbation. Hence, the TNNR ensemble is not just a pseudo ensemble obtained by small perturbations of the model weights.

The intrinsic ensembles of TNNs are not conventional ensembles. Like k-nearest neighbor regression or support vector regression, the prediction is formed by comparing a new unseen data point with several support vectors or nearest neighbors belonging to the training set. However, in contrast to these algorithms, TNNR can be seen from a single-neural-network perspective with weight diversity. Consider the prediction y_i = F(x_i, x_j) + y_j: we can regard x_i as the input to a traditional model that predicts y_i, while x_j can be understood as auxiliary parameters which influence the weights, and the offset y_j can be seen as changing the bias of the output layer.

In principle, estimating a regression target through Equation (3) might be prone to a larger error than a traditional estimate, because the TNN must make predictions on two data points and thus also accumulates the noise of two target values. However, we average over the whole training set, and the noise is uncorrelated between different anchor data points. According to the central limit theorem, when independent random variables are summed, their properly normalized sum tends toward a normal distribution even if the original variables themselves are not normally distributed. Thus, the impact of noisy anchor labels on the prediction is suppressed by a factor of 1/(2n), where n is the training set size.

Self-consistency conditions

The twin neural network predicts the difference between two regression targets. An accurate prediction requires the satisfaction of many self-consistency conditions, see Figure 1: any closed loop of difference predictions F(x_i, x_j) = y_i - y_j sums to zero, and any violation of such a condition is an indication of an inaccurate prediction. In principle, there are several ways to harness this self-consistency for practical use. First, it can be used to estimate the magnitude of the prediction error. Second, it could be utilized to force the neural network to satisfy these conditions during training. Finally, it can enable one to use the predictions on previous test data as anchor points for new predictions.
The smallest loop contains only two data points x_i, x_j, for which an accurate TNN needs to satisfy

$$ 0 = (y_i - y_j) + (y_j - y_i) = F(x_i, x_j) + F(x_j, x_i) \qquad (4) $$

During training, each batch includes both the pair (x_i, x_j) and its mirror copy (x_j, x_i) to enforce the satisfaction of this condition. The predictions on any three data points x_i, x_j, x_k should satisfy

$$ 0 = F(x_i, x_j) + F(x_j, x_k) + F(x_k, x_i) \qquad (5) $$

For x_i, x_j ∈ X_train the target values y_i, y_j are known. Thus, this condition becomes equivalent to the statement that the prediction of y_k must be the same for any two different anchor points (x_i, y_i) and (x_j, y_j):

$$ y_k = F(x_k, x_i) + y_i = F(x_k, x_j) + y_j \qquad (6) $$

This condition is trivially enforced during training. We examine the relation between the magnitude of the violations of these conditions and the prediction error in Section 4.2: we employ the ensemble of predictions, calculate the SDs corresponding to Equations (4) and (6), and find a distinct correlation with the out-of-domain prediction error. In Appendix D we discuss the interaction between different loop types.

Twin neural network architecture

The reformulation of the regression problem does not require a solution by artificial neural networks; however, neural networks scale favorably with the number of input features in the data set. We employ the same neural network for all data sets. Our TNN takes a pair of inputs (x_i, x_j), which is connected to a fully connected neural network with two hidden layers and a single output neuron. Each hidden layer consists of 64 neurons with ReLU activation functions. On data sets containing hierarchical structures, such as image data sets or audio recordings, it is helpful to include shared layers that act on only one element of the input pair; this is commonly used in few-shot learning in image recognition.19 The architecture in this article does not use any kind of weight sharing, and there is no latent space in which high-level representations are compared. Thus the network is equivalent to a fully connected feed-forward neural network that takes the concatenation of the features of the two samples x_1, x_2 as input. We optimize a common architecture that works well for all data sets considered in this work. We examined different regularization methods, such as dropout and L2 regularization, and found that in some cases a small L2 penalty improves the results; more details can be found in Appendix B. The improvement was not statistically significant or uniform among different splits of the data, which is why our main results omit any regularization. The training objective is to minimize the mean squared error on the training set. For this purpose we employ the standard gradient descent methods adadelta (and rmsprop) to minimize the loss on a batch of 16 pairs at each iteration. We stop the training when the validation loss stops decreasing.

The single feed-forward neural network (ANN) that we employ for our comparisons has a similar architecture to the TNN: it has the same hidden layers, and we examined the same hyperparameters. The convolutional neural networks are built on this architecture, with the first two dense layers exchanged for convolutional layers with 64 neurons and filter size 3. The neural networks are robust with respect to changes of the specific architecture, in the sense that the results do not change significantly. The neural networks were implemented using keras60 and tensorflow.61
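A minimal sketch of this architecture in Keras (the layer sizes, loss, optimizer, and batch size follow the text above; the function name and the early-stopping settings are illustrative assumptions):

```python
from tensorflow import keras

def build_tnn(n_features):
    # The twin network acts on the concatenated features of a pair (x_i, x_j)
    pair_input = keras.Input(shape=(2 * n_features,))
    h = keras.layers.Dense(64, activation="relu")(pair_input)
    h = keras.layers.Dense(64, activation="relu")(h)
    diff_output = keras.layers.Dense(1)(h)          # predicts the difference y_i - y_j
    model = keras.Model(pair_input, diff_output)
    model.compile(optimizer=keras.optimizers.Adadelta(), loss="mse")
    return model

# Training would then use batches of 16 pairs and early stopping on the validation loss:
# model.fit(train_pairs, validation_data=val_pairs,
#           callbacks=[keras.callbacks.EarlyStopping(monitor="val_loss", patience=20)])
```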
Data preparation

We examine the performance of TNNR on different data sets: Boston housing (BH), concrete strength (CS), energy efficiency (EE), yacht hydrodynamics (YH), red wine quality (WN), bio concentration (BC), random polynomial (RP), RCL circuit (RCL), Wheatstone bridge (WSB), and the Ising model (ISING). The common data sets can be found online in Ref. [62]. The science data sets are simulations of mathematical equations and physical systems: RP is a polynomial of degree two in five input features with random coefficients, RCL is a simulation of the electric current in an RCL circuit, and WSB is a simulation of the Wheatstone bridge voltage. ISING, a spin system on a lattice of size 20 × 20 with the corresponding energies, is used to demonstrate an image regression problem. More details can be found in Appendix A.

All data is split into 90% training, 5% validation, and 5% test data. Each run is performed on a different, randomly chosen split of the data. We normalize and center the input features to a range of [-1, 1] based on the training data. In the context of uncertainty estimation we further divide all data based on their regression targets y; in this manner we can exclude a certain range from the training data and examine the performance of the neural networks outside of the training domain.

While the data can be fed into a standard ANN in a straightforward manner, one must be careful in the preparation of the TNN data. Starting with a training data set of n data points, we can create n^2 different pairs of training data for the TNN. Hence, the TNN has the advantage of having access to a much larger training set than the ANN. However, for a large number of training examples it does not make sense to store all pairs, due to memory constraints. That is why we train on a generator which generates all possible pairs batchwise before reshuffling.
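A batchwise pair generator of the kind described above might look like the following sketch (names are illustrative and the exact sampling scheme of the authors' implementation may differ). Each batch also carries the mirror copy of every pair, as required for the consistency condition of Equation (4).

```python
import numpy as np

def pair_generator(X, y, batch_size=16, rng=None):
    """Stream all training pairs batchwise; reshuffle the pair order after every full pass."""
    rng = rng or np.random.default_rng(0)
    n = len(X)
    idx = np.array([(i, j) for i in range(n) for j in range(i, n)])
    half = batch_size // 2
    while True:
        rng.shuffle(idx)
        for start in range(0, len(idx), half):
            i, j = idx[start:start + half].T
            # include each pair together with its mirror copy (cf. Equation (4))
            ii, jj = np.concatenate([i, j]), np.concatenate([j, i])
            yield np.hstack([X[ii], X[jj]]), y[ii] - y[jj]
```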
EXPERIMENTS

Prediction accuracy

We train kernel ridge regression (KRR), random forests (RF), xgboost (XGB), traditional single neural networks (ANNs), TNNs, and ensembles to solve the regression tasks outlined in Section 3.4. The hyperparameters of KRR, RF, and XGB are optimized for each data set via cross-validation; the ranges of the parameters are given in Appendix B. The performance on these data sets is measured by the root mean square error (RMSE), shown in Table 1. Each result is averaged over 20 different random splits of the data; in this manner we obtain the best estimate of the RMSE and the corresponding SE. We also produce explicit ensembles (E) by training the corresponding ANN and TNN 20 times. This means each RMSE in the table requires 20 training runs, or 400 for the ensembles (E).

Table 1: Best estimates for root mean square errors (RMSEs) of different algorithms on the test sets belonging to the different data sets.

Common data   BH            CS            EE            YH            WN            BC
KRR           3.01 ± 0.24   5.76 ± 0.21   1.67 ± 0.06   1.00 ± 0.17   0.63 ± 0.01   0.73 ± 0.02
RF            4.24 ± 0.29   8.23 ± 0.24   2.22 ± 0.08   2.95 ± 0.46   0.64 ± 0.02   0.71 ± 0.03
XGB           2.93 ± 0.18   4.37 ± 0.19   1.17 ± 0.04   0.42 ± 0.06   0.61 ± 0.01*  0.70 ± 0.03*
ANN           3.09 ± 0.14   5.37 ± 0.17   0.98 ± 0.03   0.52 ± 0.07   0.64 ± 0.01   0.76 ± 0.02
ANN (E)       3.43 ± 0.32   5.14 ± 0.21   0.89 ± 0.04   0.43 ± 0.05   0.62 ± 0.01   0.72 ± 0.03
MCD           2.95 ± 0.15   6.07 ± 0.21   2.96 ± 0.12   1.42 ± 0.18   0.68 ± 0.01   0.72 ± 0.03
TNN           2.55 ± 0.10*  4.19 ± 0.25   0.52 ± 0.02   0.49 ± 0.07   0.62 ± 0.01   0.83 ± 0.03
TNN (E)       2.61 ± 0.20   3.88 ± 0.22*  0.46 ± 0.02*  0.37 ± 0.06*  0.63 ± 0.01   0.72 ± 0.02

Science data   RP              RCL             WSB             Image data   ISING
KRR            0.022 ± 0.001   0.020 ± 0.001   0.028 ± 0.001   KRR          0.382 ± 0.006
RF             0.604 ± 0.013   0.288 ± 0.004   0.141 ± 0.011   RF           0.601 ± 0.003
XGB            0.229 ± 0.005   0.124 ± 0.002   0.071 ± 0.006   XGB          0.144 ± 0.003
ANN            0.050 ± 0.002   0.019 ± 0.000   0.047 ± 0.004   CNN          0.050 ± 0.001
ANN (E)        0.032 ± 0.002   0.016 ± 0.001   0.031 ± 0.002   CNN (E)      0.044 ± 0.001
MCD            0.086 ± 0.002   0.033 ± 0.001   0.042 ± 0.003   CMCD         0.052 ± 0.001
TNN            0.022 ± 0.001   0.017 ± 0.000   0.020 ± 0.001*  CTNN         0.035 ± 0.001
TNN (E)        0.016 ± 0.001*  0.014 ± 0.001*  0.022 ± 0.002   CTNN (E)     0.030 ± 0.001*

Note: The lowest RMSE in each column is marked with an asterisk for clarity. Our confidence in the RMSEs is given by their SE. Data sets: Boston housing (BH), concrete strength (CS), energy efficiency (EE), yacht hydrodynamics (YH), red wine quality (WN), bio concentration (BC), random polynomial (RP), RCL circuit (RCL), Wheatstone bridge (WSB), and the Ising model (ISING). Algorithms: kernel ridge regression (KRR), random forests (RF), xgboost (XGB), neural networks (ANN), Monte-Carlo dropout networks (MCD), twin neural networks (TNN), and ensembles (E) or convolutional variants (C).

In Table 1 we see that TNNs outperform single ANNs and even ensembles of ANNs on all data sets except BC, which we assume to be a statistical outlier. Creating an ensemble of TNNs increases the performance even further. The significance of the difference in performance is clearly supported by the comparably small SEs. On discrete data, especially YH, WN, and BC, XGB slightly outperforms ANNs at a much shorter training time; however, TNNs are able to compete. While on the common data sets, especially the discrete ones, XGB beats KRR, on the science data sets KRR beats all decision-tree-based methods, due to the smooth nature of the learned function. On these science data sets, neural-network-based methods beat tree-based methods by orders of magnitude and KRR to a smaller extent. On image data, convolutional neural networks outperform KRR, RF, and XGB. The general trend is that TNNs outperform ANNs.

The outperformance of TNNR comes at the cost of training time: as shown in Table 3, the TNN takes approximately 7-20 times longer to converge than a single ANN, while all other algorithms are much faster than ANNs. While TNNR universally performs best, we would only recommend using TNNR if training time is not a bottleneck.

The outperformance of TNNR over other regression algorithms is based on exploiting the pairing of data points to create a huge training data set and a large ensemble at inference time. If the training set is very large, the number of pairs increases quadratically to a point where the TNN will in practice converge to a minimum before observing all possible pairs. At that point, the TNN begins to lose its advantages in terms of prediction accuracy.
A visualization of this effect can be seen in Figure 2, in which different ANN and TNN architectures are applied to the RP data set. Note that since the RP data set is created from a mathematical function, we can create an arbitrary number of data points. One can observe the performance improvement, in terms of lower RMSE, of all neural networks when increasing the number of training data points. The TNN achieves a lower RMSE than the ANN on small and medium-sized data sets. The TNN reaches an accuracy plateau sooner than the ANN, such that both algorithms perform similarly well in the regime between 60 000 and 100 000 data points. When training on large data sets, early stopping based on the validation loss interrupts the training before all training pairs are seen by the algorithm.

Figure 2: Traditional ANN and TNN regression applied to the random polynomial function (RP). Standard architectures (2 × 64 hidden neurons) and optimized architectures (2 × 128 hidden neurons) are trained on training data sets containing n = 100, ..., 100 000 data points. Larger training sets reduce the RMSE; TNNR beats ANN regression for n below roughly 60 000-100 000.

Prediction uncertainty estimation

Equipped with a regression method that yields a huge ensemble of predictions constrained by self-consistency conditions, we examine reasonable proxies for the prediction error. In this context we must distinguish between in-domain and out-of-domain prediction uncertainty. In-domain refers to data from the same manifold as the training data. If the test manifold differs from the training manifold, one is concerned with out-of-domain data; this is the case in interpolation or extrapolation, or if certain features in the test data exhibit a different correlation than in the training data.

Consider, for example, the polynomial function of one variable (Figure 3) as an example of an out-of-domain interpolation problem. The TNN is very accurate on the test set as long as one stays on the training manifold. As soon as the test data leaves the training manifold, the prediction becomes significantly worse: while the prediction resembles a line connecting the endpoints of the predictions on the training manifold, it fails to capture the true function. Since the difference between the true function and the regression result is consistently larger than the SD, we further conclude that the SD of the TNNR result cannot be used as a direct estimate of the prediction uncertainty.

Figure 3: TNNR applied to a simple function to demonstrate its behavior outside of the training domain (interpolation in this case). For intervals within the training domain, the TNN reproduces the original function (black dotted lines) well. Over this interval the model has low uncertainty, measured by the SD of the TNN ensemble prediction; this is equivalent to the satisfaction of the self-consistency condition, Equation (6). Conversely, within the interval which was obscured during training, the performance of the model is poor. The corresponding higher uncertainty, or violation of the self-consistency conditions, is a signal that the model is performing poorly.

In the following, we examine the possibility of a meaningful estimation of the prediction uncertainty on the BH, CS, EE, and RP data sets. Before training, we separate 25% of the data as an out-of-domain test set test_out, based on a threshold on the regression target y, to simulate an extrapolation problem. We use 50% as training data, 15% as in-domain test data test_in, and 10% as validation data.
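A sketch of such a split (the fractions follow the text above; treating the largest 25% of targets as the out-of-domain set is an illustrative assumption, since the text only specifies a threshold on y):

```python
import numpy as np

def out_of_domain_split(X, y, rng=None):
    rng = rng or np.random.default_rng(0)
    n = len(y)
    order = np.argsort(y)
    n_out = int(0.25 * n)
    out_idx = order[-n_out:]                       # largest targets -> test_out
    rest = rng.permutation(order[:-n_out])         # shuffle the remaining 75%
    n_train, n_test_in = int(0.50 * n), int(0.15 * n)
    return (rest[:n_train],                        # 50% training data
            rest[n_train + n_test_in:],            # ~10% validation data
            rest[n_train:n_train + n_test_in],     # 15% in-domain test set test_in
            out_idx)                               # 25% out-of-domain test set test_out
```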
We examine whether it is possible to estimate the prediction uncertainty using established methods. We perform Monte-Carlo dropout on an ANN,45 a method based on applying dropout during the prediction phase to estimate the uncertainty of the prediction. In the case of TNNs we examine the standard deviations of the violations of the self-consistency conditions (Equations (4) and (6)), which include the standard deviation of each single prediction. Finally, we compare the latent space distance58 of each prediction to the training data with its prediction error; here, the projection into latent space is the output of the second-to-last layer of the neural network.

The results of these examinations can be found in Figure 4. We differentiate between predictions on three different data sets: the training set, the in-domain test set test_in, and the out-of-domain test set test_out. A general observation is that the prediction error of each single sample tends to be smallest on the training set, higher on test_in, and even higher on test_out (Table 2). We can also see that on the training set none of the methods for prediction uncertainty estimation is accurate. However, we find a strong correlation between some of the methods and the prediction error on the test sets test_in and test_out. There is no clear evidence that any of the prediction uncertainty estimation methods is uniformly better than any other; however, dropout seems to be worse than the other methods. It might make sense to combine several uncertainty estimates to determine whether a prediction can be trusted.

Figure 4: Comparison of different estimators of the prediction uncertainty; the axes are logarithmic. Dropout at inference time is applied to single ANNs. For TNNs we examine the SD of the prediction and the SD of the self-consistency condition, Equation (4). Further, we include the latent space distance to the training set.

Table 2: Comparison of ANN dropout ensembles and TNN ensembles, with the test data divided into test_in, which lies on the training manifold, and test_out, which lies outside the training manifold in the context of extrapolation.

       ANN MC dropout                                  TNN
       RMSE_train   RMSE_test_in   RMSE_test_out       RMSE_train   RMSE_test_in   RMSE_test_out
BH     2.643        4.429          11.540              1.137        4.468          11.729
CS     5.505        6.639          10.199              3.096        4.408          5.000
EE     3.373        3.737          3.689               0.741        1.329          2.187
RP     0.046        0.075          0.520               0.015        0.022          0.207

Further, we make a practical observation: as long as any of the estimators, applied to an unseen data point, is of the same magnitude as on the in-domain test set test_in, we expect the prediction uncertainty to be best estimated by RMSE_test_in. If any of the estimators is larger than that (as is often the case on test_out), we expect the prediction error to be larger than the test error. This observation explains why the TNN self-consistency check can be employed as a proxy for decreasing prediction accuracy.
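A sketch of how these proxies can be computed from a trained twin network (the same assumed `twin_model` interface as in the prediction sketch above): the spread of the anchored predictions corresponds to Equation (6), and the 2-loop sums should vanish according to Equation (4). Values much larger than those observed on test_in would, following the observation above, flag a prediction that should not be trusted.

```python
import numpy as np

def consistency_signals(twin_model, x_new, X_train, y_train):
    n = len(X_train)
    x_rep = np.repeat(x_new[None, :], n, axis=0)
    f_fwd = twin_model.predict(np.hstack([x_rep, X_train])).ravel()   # F(x, x_j)
    f_bwd = twin_model.predict(np.hstack([X_train, x_rep])).ravel()   # F(x_j, x)
    per_anchor = 0.5 * f_fwd - 0.5 * f_bwd + y_train   # one estimate of y per anchor, Eq. (6)
    two_loop = f_fwd + f_bwd                            # should be close to zero by Eq. (4)
    return per_anchor.mean(), per_anchor.std(), two_loop.std()
```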
CONCLUSIONS

We have introduced TNNR. It is based on the reformulation of a traditional regression problem into the prediction of the difference between two regression targets, after which an ensemble is created by averaging the differences anchored at all training data points. This has two advantages: (a) the number of training data points increases quadratically over the original data set, and (b) by anchoring each prediction at all training data points one obtains an ensemble of predictions of twice the training data set size. Although a straightforward comparison is difficult, trajectory ensemble methods44 typically produce approximately 5 snapshots per real training-time equivalent, whereas TNNR produces an ensemble of twice the size of the training data set at 7 to 20 times the training time (see Table 3).

We have demonstrated that TNNR can compete with and outperform traditional regression methods including ANN regression, kernel ridge regression, random forests, and xgboost; building an ensemble of TNNs improves the predictive accuracy even further (Table 1). TNNs are competitive with tree-based methods on discrete data sets; however, xgboost is significantly faster to train. On continuous data sets and on image-based data sets TNNs are the clear winner. However, if training time is important, kernel ridge regression produces good predictions at only a fraction of the training time of neural networks. When there is a large amount of training data, TNNR might not see all possible pairs before convergence, so it cannot leverage its full advantage over traditional ANNs; in this case, ANNs are able to compete with TNNR, as shown in Figure 2. Since the number of anchor points during inference is only linear in the number of training data points, sampling is normally not necessary.

A successfully trained TNN must satisfy many self-consistency conditions. The violation of these conditions gives rise to an uncertainty estimate for the final prediction (Figure 4). TNNR can naturally serve as the basis for a semi-supervised regression algorithm:63 it can be trained on loops containing three data points to satisfy the loop consistency condition, even if the data points within the loops are unlabeled. Future directions include intelligently weighting the ensemble members through improved ensembling techniques; some problems might benefit from exchanging the ensemble mean for the median. It would also be interesting to examine the ensemble diversity of TNNR compared with other ensembles, and to see whether it makes sense to perform twinned regression with other algorithms.

Table 3: Comparison of the training times of different algorithms.

       Minimal time, s (WSB)   Maximal time, s (RCL)
KRR    0.3                     45.1
RF     3.7                     12.5
XGB    1.3                     15.6
ANN    4.3                     46.2
TNN    26.4                    886.5

Note: The minimal training time is measured on the smallest data set (WSB), while the maximal training time occurs on the largest data set (RCL).

The code and data supporting this publication are available in Ref. [64]. All comparisons were made using the scikit-learn library.65

ACKNOWLEDGMENTS

Research at Perimeter Institute is supported in part by the Government of Canada through the Department of Innovation, Science and Economic Development Canada and by the Province of Ontario through the Ministry of Economic Development, Job Creation and Trade. We thank the National Research Council of Canada for their partnership with Perimeter on the PIQuIL. R.G.M. and I.T. acknowledge NSERC. R.G.M. is supported by the Canada Research Chair Program. We also acknowledge Compute Canada for computational resources.

DATA AVAILABILITY STATEMENT

The data that support the findings of this study are openly available in the UCI Machine Learning Repository: https://archive.ics.uci.edu/ml/datasets.php (accessed: 1 May 2020).

REFERENCES

1. Park B, Bae JK. Using machine learning algorithms for housing price prediction: the case of Fairfax county, Virginia housing data. Expert Syst Appl. 2015;42(6):2928-2934.
2. Chokr M, Elbassuoni S. Calories prediction from food images. In Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence, AAAI'17; 2017:4664-4669.
3. Patel J, Shah S, Thakkar P, Kotecha K. Predicting stock and stock price index movement using trend deterministic data preparation and machine learning techniques. Expert Syst Appl. 2015;42(1):259-268.
4. Sun Z-L, Choi T-M, Kin-Fan A, Yong Y. Sales forecasting using extreme learning machine with applications in fashion retailing. Decis Support Syst. 2008;46(1):411-419.
5. Rupp M, Tkatchenko A, Müller K-R, von Lilienfeld OA. Fast and accurate modeling of molecular atomization energies with machine learning. Phys Rev Lett. 2012;108:058301.
6. Ryczko K, Strubbe DA, Tamblyn I. Deep learning and density-functional theory. Phys Rev A. 2019;100:022512.
7. Schütt K, Kindermans P-J, Felix HES, Chmiela S, Tkatchenko A, Müller K-R. SchNet: a continuous-filter convolutional neural network for modeling quantum interactions. In: Guyon I, Luxburg UV, Bengio S, et al., eds. Advances in Neural Information Processing Systems 30. Red Hook, NY: Curran Associates Inc; 2017:991-1001.
8. Chandrasekaran A, Kamal D, Batra R, Kim C, Chen L, Ramprasad R. Solving the electronic structure problem with machine learning. npj Comput Mater. 2019;5(1):22.
9. Malkiel I, Mrejen M, Nagler A, Arieli U, Wolf L, Suchowski H. Plasmonic nanostructure design and characterization via deep learning. Light Sci Appl. 2018;7(1):60.
10. Krzywinski M, Altman N. Importance of being uncertain. Nat Methods. 2013;10(9):809-810.
11. Ghahramani Z. Probabilistic machine learning and artificial intelligence. Nature. 2015;521(7553):452-459.
12. SAE. Taxonomy and Definitions for Terms Related to Driving Automation Systems for On-Road Motor Vehicles. Warrendale, PA: SAE; 2018.
13. McKinney SM, Sieniek M, Godbole V, et al. International evaluation of an AI system for breast cancer screening. Nature. 2020;577(7788):89-94.
14. Ulissi ZW, Medford AJ, Bligaard T, Nørskov JK. To address surface reaction network complexity using scaling relations machine learning and DFT calculations. Nat Commun. 2017;8(1):1-7.
15. Kasim MF, Watson-Parris D, Deaconu L, et al. Building high accuracy emulators for scientific simulations with deep neural architecture search. Mach Learn: Sci Technol. 2020;3(1).
16. Zhang L, Lin DY, Wang H, Car R, Weinan E. Active learning of uniformly accurate interatomic potentials for materials simulation. Phys Rev Mater. 2019;3(2):023804.
17. Zhong M, Tran K, Min Y, et al. Accelerated discovery of CO2 electrocatalysts using active machine learning. Nature. 2020;581(7807):178-183.
18. Baldi P, Chauvin Y. Neural networks for fingerprint recognition. Neural Comput. 1993;5(3):402-418.
19. Bromley J, Guyon I, LeCun Y, Säckinger E, Shah R. Signature verification using a "siamese" time delay neural network. In: Cowan JD, Tesauro G, Alspector J, eds. Advances in Neural Information Processing Systems 6. Burlington, MA: Morgan-Kaufmann; 1994:737-744.
20. LeCun Y, Boser B, Denker JS, et al. Backpropagation applied to handwritten zip code recognition. Neural Comput. 1989;1(4):541-551.
21. Taigman Y, Yang M, Ranzato M, Wolf L. DeepFace: closing the gap to human-level performance in face verification. In 2014 IEEE Conference on Computer Vision and Pattern Recognition; 2014:1701-1708.
22. Koch G, Zemel R, Salakhutdinov R. Siamese neural networks for one-shot image recognition. In ICML Deep Learning Workshop, volume 2, Lille; 2015.
23. Bertinetto L, Valmadre J, Henriques JF, Vedaldi A, Torr PHS. Fully-convolutional siamese networks for object tracking. In European Conference on Computer Vision, Springer; 2016:850-865.
24. Bao H, Niu G, Sugiyama M. Classification from pairwise similarity and unlabeled data. In International Conference on Machine Learning; 2018:452-461.
25. Doumanoglou A, Balntas V, Kouskouridas R, Kim T-K. Siamese regression networks with efficient mid-level feature extraction for 3D object pose estimation. arXiv. 2016. arXiv:1607.02257.
26. Li MD, Chang K, Bearce B, et al. Siamese neural networks for continuous disease severity evaluation and change detection in medical imaging. npj Digit Med. 2020;3(1):48.
27. Wetzel SJ, Melko RG, Scott J, Panju M, Ganesh V. Discovering symmetry invariants and conserved quantities by interpreting siamese neural networks. Phys Rev Res. 2020;2(3):033499.
28. Cohen WW, Schapire RE, Singer Y. Learning to order things. J Artif Intell Res. 1999;10:243-270.
29. Burges C, Shaked T, Renshaw E, et al. Learning to rank using gradient descent. In Proceedings of the 22nd International Conference on Machine Learning; 2005:89-96.
30. Seeger M. Gaussian processes for machine learning. Int J Neural Syst. 2004;14(2):69-106.
31. Wilson AG, Hu Z, Salakhutdinov R, Xing EP. Deep kernel learning. In Artificial Intelligence and Statistics, PMLR; 2016:370-378.
32. Bartók AP, Payne MC, Kondor R, Csányi G. Gaussian approximation potentials: the accuracy of quantum mechanics, without the electrons. Phys Rev Lett. 2010;104:136403.
33. Koistinen OP, Maras E, Vehtari A, Jónsson H. Minimum energy path calculations with Gaussian process regression. Nanosyst: Phys, Chem, Math. 2016;7:925-935.
34. Simm GN, Reiher M. Error-controlled exploration of chemical reaction networks with Gaussian processes. J Chem Theory Comput. 2018;14(10):5238-5248.
35. Proppe J, Gugler S, Reiher M. Gaussian process-based refinement of dispersion corrections. J Chem Theory Comput. 2019;15(11):6046-6060.
36. Liu H, Ong Y-S, Shen X, Cai J. When Gaussian process meets big data: a review of scalable GPs. IEEE Trans Neural Netw Learn Syst. 2020;31(11):4405-4423.
37. Naftaly U, Intrator N, Horn D. Optimal ensemble averaging of neural networks. Comput Neural Syst. 1997;8:283-296.
38. Bishop CM. Pattern Recognition and Machine Learning. Singapore: Springer; 2006.
39. Von Luxburg U, Schölkopf B. Statistical learning theory: models, concepts, and results. In: Gabbay DM, Hartmann S, Woods J, eds. Handbook of the History of Logic. Vol 10. Amsterdam, Netherlands: Elsevier; 2011:651-706.
40. Hansen LK, Salamon P. Neural network ensembles. IEEE Trans Pattern Anal Mach Intell. 1990;12(10):993-1001.
41. Krogh A, Vedelsby J. Neural network ensembles, cross validation and active learning. In Proceedings of the 7th International Conference on Neural Information Processing Systems, NIPS'94; 1994:231-238.
42. Swann A, Allinson N. Fast committee learning: preliminary results. Electron Lett. 1998;34(14):1408-1410.
43. Xie J, Xu B, Chuang Z. Horizontal and vertical ensemble with deep representation for classification. arXiv. 2013. arXiv:1306.2759.
44. Huang G, Li Y, Pleiss G, Liu Z, Hopcroft JE, Weinberger KQ. Snapshot ensembles: train 1, get M for free. arXiv. 2017. arXiv:1704.00109.
45. Gal Y, Ghahramani Z. Dropout as a Bayesian approximation: representing model uncertainty in deep learning. In International Conference on Machine Learning; 2016:1050-1059.
46. Bachman P, Alsharif O, Precup D. Learning with pseudo-ensembles. In Proceedings of the 27th International Conference on Neural Information Processing Systems - Volume 2, NIPS'14; 2014:3365-3373.
47. Tran L, Veeling BS, Roth K, et al. Hydra: preserving ensemble diversity for model distillation. arXiv. 2020. arXiv:2001.04694.
48. Ashukha A, Lyzhov A, Molchanov D, Vetrov D. Pitfalls of in-domain uncertainty estimation and ensembling in deep learning. arXiv. 2020. arXiv:2002.06470.
49. Lakshminarayanan B, Pritzel A, Blundell C. Simple and scalable predictive uncertainty estimation using deep ensembles. In Advances in Neural Information Processing Systems; 2017:6402-6413.
50. Cortés-Ciriano I, Bender A. Deep confidence: a computationally efficient framework for calculating reliable prediction errors for deep neural networks. J Chem Inf Model. 2019;59(3):1269-1281.
51. Janet JP, Duan C, Yang T, Nandy A, Kulik HJ. A quantitative uncertainty metric controls error in neural network-driven chemical discovery. Chem Sci. 2019;10(34):7913-7922.
52. Morais CLM, Lima KMG, Martin FL. Uncertainty estimation and misclassification probability for classification models based on discriminant analysis and support vector machines. Anal Chim Acta. 2019;1063:40-46.
53. Musil F, Willatt MJ, Langovoy MA, Ceriotti M. Fast and accurate uncertainty estimation in chemical machine learning. J Chem Theory Comput. 2019;15(2):906-915.
54. Gneiting T, Raftery AE. Strictly proper scoring rules, prediction, and estimation. J Am Stat Assoc. 2007;102(477):359-378.
55. Dawid AP, Musio M. Theory and applications of proper scoring rules. Metron. 2014;72(2):169-183.
56. Peterson AA, Christensen R, Khorshidi A. Addressing uncertainty in atomistic machine learning. Phys Chem Chem Phys. 2017;19(18):10978-10985.
57. Liu R, Glover KP, Feasel MG, Wallqvist A. General approach to estimate error bars for quantitative structure-activity relationship predictions of molecular activity. J Chem Inf Model. 2018;58(8):1561-1575.
58. Tran K, Neiswanger W, Yoon J, Zhang Q, Xing E, Ulissi ZW. Methods for comparing uncertainty quantifications for material property predictions. Mach Learn: Sci Technol. 2020;1(2):025006.
59. Ovadia Y, Fertig E, Ren J, et al. Can you trust your model's uncertainty? Evaluating predictive uncertainty under dataset shift. Adv Neural Inf Process Syst. 2019;32:14003-14014.
60. Chollet F. Keras; 2015. https://keras.io
61. Abadi M, Agarwal A, Barham P, et al. TensorFlow: large-scale machine learning on heterogeneous systems. arXiv. 2015. arXiv:1603.04467.
62. UCI Machine Learning Repository; 2020. https://archive.ics.uci.edu/ml/datasets.php. Accessed May 1, 2020.
63. Wetzel SJ, Melko RG, Tamblyn I. Twin neural network regression is a semi-supervised regression algorithm. arXiv. 2021. arXiv:2106.06124.
64. Wetzel S. Public tnnr GitHub repository; 2022.
65. Pedregosa F, Varoquaux G, Gramfort A, et al. Scikit-learn: machine learning in Python. J Mach Learn Res. 2011;12:2825-2830.

APPENDIX A

A.1 Data sets

We describe the properties of the data sets in Table A1. In addition to the common reference data sets, we include scientific data and image-like data for regression. The random polynomial function (RP) is a data set created from the equation

$$ F(x_1, \ldots, x_5) = \sum_{i,j=0}^{5} a_{ij} x_i x_j + \sum_{i=0}^{5} b_i x_i + c \qquad (A1) $$

with random but fixed coefficients and zero noise.
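A sketch of how a data set of this form could be generated (the coefficient distribution and feature range are illustrative assumptions; the indices here run over the five features):

```python
import numpy as np

rng = np.random.default_rng(0)
a = rng.normal(size=(5, 5))     # fixed random quadratic coefficients a_ij
b = rng.normal(size=5)          # fixed random linear coefficients b_i
c = rng.normal()                # fixed random constant offset

X = rng.uniform(-1, 1, size=(1000, 5))                 # 1000 points, 5 features
y = np.einsum("ni,ij,nj->n", X, a, X) + X @ b + c      # Equation (A1), zero noise
```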
Table A1: Data sets.

Name                         Key     Size   Features   Type
Boston housing               BH      506    13         Discrete, continuous
Concrete strength            CS      1030   8          Continuous
Energy efficiency            EE      768    8          Discrete, continuous
Yacht hydrodynamics          YH      308    6          Discrete
Red wine quality             WN      1599   11         Discrete, continuous
Bio concentration            BC      779    14         Discrete, continuous
Random polynomial function   RP      1000   5          Continuous
RCL circuit current          RCL     4000   6          Continuous
Wheatstone bridge voltage    WSB     200    4          Continuous
Ising model                  ISING   2000   400        Image, discrete

The output in the RCL circuit current data set (RCL) is the current through an RCL circuit, modeled by the equation

$$ I_0 = V_0 \cos(\omega t) \Big/ \sqrt{R^2 + \left(\omega L - 1/(\omega C)\right)^2} \qquad (A2) $$

with added Gaussian noise of mean 0 and SD 0.1.

The output of the Wheatstone bridge voltage data set (WSB) is the measured voltage, given by the equation

$$ V = U \left( \frac{R_2}{R_1 + R_2} - \frac{R_3}{R_2 + R_3} \right) \qquad (A3) $$

with added Gaussian noise of mean 0 and SD 0.1.

APPENDIX B

B.1 Hyperparameter optimization

B.1.1 KRR, RF, XGB

We outline the parameters that are optimized during cross-validation for the scikit-learn implementations of kernel ridge regression (KRR), random forests (RF), and XGBoost (XGB). For KRR we tested different kernels, including the radial basis function (RBF), polynomial, and sigmoid kernels, of which RBF worked best. The results are produced using grid search hyperparameter optimization with the RBF kernel and α ∈ [1, 0.1, 0.01, 0.001], γ ∈ [0.01, 0.1, 1, 10, 100]. In the case of RF and XGB we use the same hyperparameter grids, since both methods build upon decision trees: max depth ∈ [2, 3, 4] and min samples ∈ [2, 3, 4].

B.1.2 TNNR

In this section we include additional results of experiments on the Boston housing (BH), concrete strength (CS), energy efficiency (EE), and random polynomial function (RP) data sets (see Tables B1 to B4). We varied the L2 regularization, exchanged our main optimizer adadelta for rmsprop, and examined the checkpoint that saves the weights based on the best performance on the validation set. We find that adadelta seems to give more consistent results. As long as the L2 penalty is not too large, there is no clear evidence to favor one L2 penalty over another. We further examined modifications to the architecture, dropout regularization, and different learning rate schedules in preliminary experiments, which are not listed here, since none of these changes led to significant and uniform improvement.
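As an aside, the cross-validated grid search of Section B.1.1 could be set up as in the following sketch (scikit-learn, with the grids quoted above; the scoring choice and fold count are illustrative assumptions):

```python
from sklearn.kernel_ridge import KernelRidge
from sklearn.model_selection import GridSearchCV

param_grid = {"alpha": [1, 0.1, 0.01, 0.001],
              "gamma": [0.01, 0.1, 1, 10, 100]}
krr_search = GridSearchCV(KernelRidge(kernel="rbf"), param_grid,
                          scoring="neg_root_mean_squared_error", cv=5)
# krr_search.fit(X_train, y_train); the tuned model is krr_search.best_estimator_
```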
Table B1: Root mean squared errors (RMSEs) of the validation sets for different data sets, machine learning approaches, and regularizations (optimizer: adadelta, with model checkpoint saving).

Data   ANN             ANN (E)         MC dropout      TNN             TNN (E)
L2 = 0
BH     2.544 ± 0.168   2.780 ± 0.252   3.375 ± 0.321   2.656 ± 0.243   2.286 ± 0.155
CS     4.964 ± 0.161   4.595 ± 0.205   5.941 ± 0.187   3.758 ± 0.214   3.247 ± 0.166
EE     0.819 ± 0.035   0.791 ± 0.030   2.893 ± 0.133   0.494 ± 0.018   0.452 ± 0.020
RP     0.045 ± 0.003   0.031 ± 0.002   0.087 ± 0.003   0.022 ± 0.001   0.012 ± 0.001
L2 = 1e-6
BH     2.477 ± 0.163   2.743 ± 0.282   3.388 ± 0.193   2.476 ± 0.159   2.347 ± 0.135
CS     4.669 ± 0.185   4.714 ± 0.161   5.995 ± 0.186   3.751 ± 0.263   3.398 ± 0.228
EE     0.968 ± 0.044   0.777 ± 0.034   2.973 ± 0.092   0.502 ± 0.033   0.452 ± 0.022
RP     0.049 ± 0.002   0.028 ± 0.002   0.079 ± 0.004   0.020 ± 0.001   0.011 ± 0.001
L2 = 1e-5
BH     2.845 ± 0.177   2.202 ± 0.082   3.126 ± 0.236   2.931 ± 0.244   2.239 ± 0.214
CS     4.599 ± 0.157   4.403 ± 0.198   5.675 ± 0.194   3.715 ± 0.318   3.480 ± 0.275
EE     1.029 ± 0.035   0.799 ± 0.034   2.905 ± 0.123   0.500 ± 0.019   0.433 ± 0.019
RP     0.064 ± 0.002   0.032 ± 0.003   0.086 ± 0.004   0.022 ± 0.001   0.012 ± 0.001
L2 = 1e-4
BH     2.875 ± 0.160   2.750 ± 0.295   3.330 ± 0.231   2.933 ± 0.187   2.359 ± 0.187
CS     5.178 ± 0.155   4.249 ± 0.128   5.730 ± 0.175   3.607 ± 0.193   3.047 ± 0.176
EE     1.763 ± 0.030   0.853 ± 0.034   2.747 ± 0.106   0.502 ± 0.027   0.448 ± 0.024
RP     0.107 ± 0.002   0.024 ± 0.001   0.097 ± 0.004   0.032 ± 0.002   0.019 ± 0.001
L2 = 1e-3
BH     4.902 ± 0.103   2.368 ± 0.175   2.722 ± 0.110   2.293 ± 0.119   2.471 ± 0.189
CS     8.001 ± 0.133   4.034 ± 0.136   6.102 ± 0.145   3.547 ± 0.198   3.537 ± 0.168
EE     4.651 ± 0.059   0.778 ± 0.032   2.877 ± 0.098   0.481 ± 0.022   0.467 ± 0.021
RP     0.268 ± 0.004   0.055 ± 0.005   0.110 ± 0.004   0.056 ± 0.003   0.052 ± 0.003

Table B2: Root mean squared errors (RMSEs) of the test sets for different data sets, machine learning approaches, and regularizations (optimizer: adadelta, with model checkpoint saving).

Data   ANN             ANN (E)         MC dropout      TNN             TNN (E)
L2 = 0
BH     3.093 ± 0.140   3.432 ± 0.317   2.953 ± 0.154   2.552 ± 0.104   2.614 ± 0.204
CS     5.373 ± 0.168   5.136 ± 0.207   6.067 ± 0.209   4.186 ± 0.249   3.882 ± 0.218
EE     0.984 ± 0.033   0.894 ± 0.040   2.957 ± 0.117   0.524 ± 0.022   0.456 ± 0.020
RP     0.050 ± 0.002   0.032 ± 0.002   0.086 ± 0.002   0.022 ± 0.001   0.016 ± 0.001
L2 = 1e-6
BH     3.170 ± 0.222   2.749 ± 0.200   3.226 ± 0.206   2.869 ± 0.233   2.554 ± 0.134
CS     5.133 ± 0.172   4.963 ± 0.192   6.293 ± 0.179   4.030 ± 0.225   4.025 ± 0.258
EE     1.125 ± 0.059   0.992 ± 0.042   3.057 ± 0.120   0.590 ± 0.027   0.468 ± 0.020
RP     0.054 ± 0.003   0.034 ± 0.002   0.085 ± 0.003   0.020 ± 0.001   0.011 ± 0.001
L2 = 1e-5
BH     3.184 ± 0.173   2.996 ± 0.286   3.780 ± 0.377   3.315 ± 0.229   3.201 ± 0.384
CS     4.991 ± 0.221   4.712 ± 0.176   5.934 ± 0.139   3.880 ± 0.276   4.044 ± 0.228
EE     1.255 ± 0.043   0.906 ± 0.037   2.943 ± 0.122   0.570 ± 0.026   0.477 ± 0.026
RP     0.067 ± 0.002   0.029 ± 0.002   0.092 ± 0.007   0.026 ± 0.002   0.014 ± 0.001
L2 = 1e-4
BH     3.559 ± 0.180   2.890 ± 0.180   3.352 ± 0.292   3.173 ± 0.254   2.740 ± 0.147
CS     5.799 ± 0.171   4.924 ± 0.183   5.777 ± 0.177   3.878 ± 0.205   3.743 ± 0.251
EE     1.891 ± 0.045   0.923 ± 0.036   2.973 ± 0.103   0.581 ± 0.029   0.477 ± 0.021
RP     0.110 ± 0.001   0.035 ± 0.005   0.088 ± 0.004   0.032 ± 0.002   0.023 ± 0.002
L2 = 1e-3
BH     5.382 ± 0.163   2.861 ± 0.198   3.566 ± 0.279   3.039 ± 0.294   3.048 ± 0.270
CS     8.325 ± 0.140   4.964 ± 0.220   6.132 ± 0.180   3.953 ± 0.220   4.186 ± 0.233
EE     4.714 ± 0.054   1.083 ± 0.071   2.891 ± 0.122   0.613 ± 0.024   0.533 ± 0.026
RP     0.270 ± 0.004   0.066 ± 0.004   0.113 ± 0.006   0.061 ± 0.003   0.054 ± 0.003
Table B3: Root mean squared errors (RMSEs) of the validation sets for different data sets, machine learning approaches, and regularizations (optimizer: rmsprop, no model checkpoint saving).

Data   ANN             ANN (E)         MC dropout      TNN             TNN (E)
L2 = 0
BH     2.865 ± 0.270   3.151 ± 0.302   3.328 ± 0.287   2.724 ± 0.225   2.548 ± 0.155
CS     5.209 ± 0.247   5.021 ± 0.237   6.221 ± 0.178   4.582 ± 0.180   4.148 ± 0.147
EE     1.199 ± 0.083   0.888 ± 0.042   3.057 ± 0.116   0.828 ± 0.062   0.711 ± 0.034
RP     0.049 ± 0.004   0.031 ± 0.002   0.087 ± 0.005   0.025 ± 0.001   0.016 ± 0.001
L2 = 1e-6
BH     3.040 ± 0.234   3.027 ± 0.292   3.384 ± 0.208   2.601 ± 0.196   2.582 ± 0.147
CS     5.600 ± 0.203   5.442 ± 0.203   6.135 ± 0.185   4.559 ± 0.212   4.327 ± 0.198
EE     1.259 ± 0.098   0.944 ± 0.055   2.961 ± 0.101   0.792 ± 0.057   0.788 ± 0.023
RP     0.050 ± 0.003   0.026 ± 0.001   0.077 ± 0.004   0.024 ± 0.001   0.015 ± 0.001
L2 = 1e-5
BH     2.903 ± 0.157   2.418 ± 0.096   3.109 ± 0.279   3.228 ± 0.283   2.427 ± 0.233
CS     5.523 ± 0.202   5.212 ± 0.183   5.973 ± 0.176   4.607 ± 0.236   4.344 ± 0.235
EE     1.210 ± 0.044   0.920 ± 0.042   2.846 ± 0.146   0.795 ± 0.055   0.644 ± 0.030
RP     0.061 ± 0.002   0.030 ± 0.003   0.079 ± 0.004   0.030 ± 0.001   0.016 ± 0.001
L2 = 1e-4
BH     3.019 ± 0.081   3.068 ± 0.320   3.319 ± 0.210   3.062 ± 0.183   2.543 ± 0.215
CS     6.027 ± 0.194   4.837 ± 0.156   5.956 ± 0.162   4.318 ± 0.161   3.888 ± 0.152
EE     2.003 ± 0.068   0.970 ± 0.041   2.785 ± 0.098   0.784 ± 0.048   0.660 ± 0.034
RP     0.102 ± 0.003   0.026 ± 0.001   0.094 ± 0.007   0.040 ± 0.002   0.023 ± 0.001
L2 = 1e-3
BH     4.887 ± 0.107   2.654 ± 0.192   2.656 ± 0.091   2.447 ± 0.124   2.721 ± 0.205
CS     8.354 ± 0.144   4.520 ± 0.129   6.204 ± 0.146   4.260 ± 0.160   4.360 ± 0.185
EE     4.592 ± 0.075   0.874 ± 0.035   2.380 ± 0.109   0.705 ± 0.038   0.634 ± 0.024
RP     0.266 ± 0.004   0.068 ± 0.007   0.112 ± 0.004   0.066 ± 0.004   0.060 ± 0.004

Table B4: Root mean squared errors (RMSEs) of the test sets for different data sets, machine learning approaches, and regularizations (optimizer: rmsprop, no model checkpoint saving).

Data   ANN             ANN (E)         MC dropout      TNN             TNN (E)
L2 = 0
BH     3.291 ± 0.229   3.326 ± 0.344   2.820 ± 0.169   2.462 ± 0.085   2.593 ± 0.193
CS     5.185 ± 0.229   5.225 ± 0.227   6.116 ± 0.157   4.748 ± 0.189   4.382 ± 0.172
EE     1.145 ± 0.090   0.851 ± 0.034   3.060 ± 0.104   0.844 ± 0.052   0.680 ± 0.032
RP     0.051 ± 0.002   0.029 ± 0.002   0.084 ± 0.004   0.024 ± 0.001   0.019 ± 0.002
L2 = 1e-6
BH     2.930 ± 0.182   2.748 ± 0.179   3.146 ± 0.180   2.882 ± 0.254   2.502 ± 0.117
CS     5.765 ± 0.213   5.423 ± 0.178   6.275 ± 0.194   4.790 ± 0.163   4.469 ± 0.208
EE     1.288 ± 0.121   1.028 ± 0.057   2.984 ± 0.106   0.826 ± 0.041   0.713 ± 0.026
RP     0.049 ± 0.003   0.030 ± 0.002   0.083 ± 0.002   0.023 ± 0.001   0.015 ± 0.001
L2 = 1e-5
BH     3.344 ± 0.283   2.911 ± 0.257   3.567 ± 0.396   2.983 ± 0.173   3.155 ± 0.383
CS     5.867 ± 0.178   5.105 ± 0.131   6.195 ± 0.156   4.449 ± 0.173   4.563 ± 0.179
EE     1.210 ± 0.042   0.940 ± 0.040   2.845 ± 0.133   0.815 ± 0.057   0.679 ± 0.029
RP     0.060 ± 0.002   0.024 ± 0.002   0.081 ± 0.004   0.028 ± 0.001   0.017 ± 0.001
L2 = 1e-4
BH     3.313 ± 0.191   2.625 ± 0.145   3.362 ± 0.305   3.199 ± 0.227   2.803 ± 0.134
CS     5.905 ± 0.136   5.141 ± 0.186   5.956 ± 0.179   4.518 ± 0.190   4.123 ± 0.206
EE     2.010 ± 0.076   0.891 ± 0.032   2.932 ± 0.095   0.762 ± 0.030   0.611 ± 0.021
RP     0.101 ± 0.002   0.033 ± 0.004   0.082 ± 0.004   0.039 ± 0.002   0.026 ± 0.002
L2 = 1e-3
BH     5.197 ± 0.193   2.835 ± 0.185   3.296 ± 0.251   2.975 ± 0.278   3.028 ± 0.267
CS     8.381 ± 0.153   5.128 ± 0.213   6.153 ± 0.155   4.611 ± 0.205   4.678 ± 0.216
EE     4.659 ± 0.065   0.954 ± 0.039   2.360 ± 0.108   0.747 ± 0.036   0.723 ± 0.032
RP     0.267 ± 0.003   0.068 ± 0.005   0.114 ± 0.006   0.066 ± 0.004   0.059 ± 0.004

APPENDIX C

C.1 Performance improvement through ensembles

C.1.1 Bias-variance tradeoff and ensembles
In a regression problem, one is tasked with finding the true labels of yet unlabeled data points through the estimation of a function f(x) = y. Given a finite training data set D, we denote this approximation by f̂(x; D). The expected mean squared error can be decomposed into three sources of error: the bias error Bias_D[f̂(x; D)], the variance error Var_D[f̂(x; D)], and the intrinsic error σ of the data set:

$$ \mathrm{MSE} = E_x\Big[ \mathrm{Bias}_D\big[\hat{f}(x;D)\big]^2 + \mathrm{Var}_D\big[\hat{f}(x;D)\big] \Big] + \sigma^2 \qquad (C1) $$

If we replace the estimator by an ensemble of two functions, f̂(x; D) = ½ f̂_A(x; D) + ½ f̂_B(x; D), each exhibiting the same bias and variance as the original estimator, then we can decompose the MSE as

$$ \mathrm{MSE} = E_x\Big[ \mathrm{Bias}_D\big[\tfrac{1}{2}\hat{f}_A(x;D) + \tfrac{1}{2}\hat{f}_B(x;D)\big]^2 + \mathrm{Var}_D\big[\tfrac{1}{2}\hat{f}_A(x;D) + \tfrac{1}{2}\hat{f}_B(x;D)\big] \Big] + \sigma^2 \qquad (C2) $$

$$ = E_x\Big[ \mathrm{Bias}_D\big[\hat{f}(x;D)\big]^2 + \mathrm{Var}_D\big[\tfrac{1}{2}\hat{f}_A(x;D)\big] + \mathrm{Var}_D\big[\tfrac{1}{2}\hat{f}_B(x;D)\big] \qquad (C3) $$

$$ \qquad\qquad + 2\,\mathrm{Cov}_D\big[\tfrac{1}{2}\hat{f}_A(x;D), \tfrac{1}{2}\hat{f}_B(x;D)\big] \Big] + \sigma^2 \qquad (C4) $$

$$ = E_x\Big[ \mathrm{Bias}_D\big[\hat{f}(x;D)\big]^2 + \tfrac{1}{2}\mathrm{Var}_D\big[\hat{f}_A(x;D)\big] + \tfrac{1}{2}\mathrm{Cov}_D\big[\hat{f}_A(x;D), \hat{f}_B(x;D)\big] \Big] + \sigma^2 \qquad (C5) $$

The more uncorrelated f̂_A(x; D) and f̂_B(x; D) are, the smaller the covariance term is relative to the variance. Thus an ensemble consisting of weakly correlated members reduces the MSE by circumventing the bias-variance tradeoff. By induction, this argument extends to larger ensemble sizes.

APPENDIX D

D.1 Loop consistency order

Let us discuss how loops of different orders interact. Figure D1 shows graphically how to decompose higher-order loops into smaller loops. In the following, we discuss how this affects loop consistency at different orders. We define short forms of loops containing the predictions on the connections between nodes:

$$ \text{2-loop:}\quad L(x_i, x_j) = F(x_i, x_j) + F(x_j, x_i) \qquad (D1) $$

$$ \text{3-loop:}\quad L(x_i, x_j, x_k) = F(x_i, x_j) + F(x_j, x_k) + F(x_k, x_i) \qquad (D2) $$

$$ \text{4-loop:}\quad L(x_i, x_j, x_k, x_l) = F(x_i, x_j) + F(x_j, x_k) + F(x_k, x_l) + F(x_l, x_i) \qquad (D3) $$

Figure D1: Loops can be decomposed into smaller loops.

We define the magnitude of the violation of the loop consistency as the upper bound ε on the values of all loops of a given order:

$$ \forall\, x_i, x_j \in X: \; |L(x_i, x_j)| < \epsilon \qquad (D4) $$

Theorem: 2-loop and 3-loop consistency implies 4-loop consistency.

Proof: Assume

$$ \forall\, x_i, x_j \in X: \; |L(x_i, x_j)| < \epsilon_2 \qquad (D5) $$

$$ \forall\, x_i, x_j, x_k \in X: \; |L(x_i, x_j, x_k)| < \epsilon_3 \qquad (D6) $$

Then, for all x_i, x_j, x_k, x_l ∈ X (D7),

$$ |L(x_i, x_j, x_k, x_l)| = |F(x_i, x_j) + F(x_j, x_k) + F(x_k, x_l) + F(x_l, x_i)| \qquad (D8) $$

$$ = |F(x_i, x_j) + F(x_j, x_k) + F(x_k, x_l) + F(x_l, x_i) + L(x_i, x_k) - L(x_i, x_k)| \qquad (D9) $$

$$ = |\underbrace{F(x_i, x_j) + F(x_j, x_k) + F(x_k, x_i)}_{L(x_i, x_j, x_k)} + \underbrace{F(x_k, x_l) + F(x_l, x_i) + F(x_i, x_k)}_{L(x_k, x_l, x_i)} - L(x_i, x_k)| \qquad (D10) $$

$$ < 2\epsilon_3 + \epsilon_2 \qquad (D11) $$

This argument extends to larger loops by induction. It is not possible, however, to use only 2-loops as a starting point to build larger loops.

Loading next page...
 
/lp/wiley/twin-neural-network-regression-vP0zo3vcP8

References (55)

Publisher
Wiley
Copyright
© 2022 The Authors. Applied AI Letters published by John Wiley & Sons Ltd
eISSN
2689-5595
DOI
10.1002/ail2.78
Publisher site
See Article on Publisher Site

Ensembles of different predictions40,41 can be produced in different ways, such as repeating the training while changing the training-validation splits, or even sampling intermediate models along the training trajectory.42-44 Sampling along a training trajectory is efficient in that it generates approximately five ensemble members in the time it takes to train one traditional neural network. There are two kinds of ensemble methods: real ensembles, obtained by combining the predictions of multiple models,40,41 and pseudo ensembles,45,46 which are obtained by perturbing certain parameters of the data or the model. While pseudo ensembles have the advantage of requiring no overhead in training time, real ensembles47 yield a much better prediction accuracy.

The disagreement between models in an ensemble can be used as a signal for confidence among the set. Recent work has highlighted the problems associated with ensembles as a method for uncertainty estimation.48-50 There is less theoretical justification for the reliability of errors from such approaches compared with GP. Intuitively it makes sense that a mismatch of models suggests the output cannot be trusted, but it is less clear that the magnitude of this mismatch can be assigned to a particular value of uncertainty. If data far away from the training regime is used, regression becomes unreliable; conversely, a close distance of a new data point to one of the training points is correlated with lower error.51 Other methods for estimating model uncertainty and error include discriminant analysis,52 resampling,53 scoring rules,54,55 domain-specific metrics,56-58 and data set shifts.59 Finally, it was recently shown that the projection of data into the latent space of a model can be used as a proxy for uncertainty.

TWIN NEURAL NETWORK REGRESSION

Reformulation of regression

In a regression problem we are given a training set of n data points $X_{\mathrm{train}} = \{x_1^{\mathrm{train}}, \dots, x_n^{\mathrm{train}}\}$ and target values $Y_{\mathrm{train}} = \{y_1^{\mathrm{train}}, \dots, y_n^{\mathrm{train}}\}$. The task is to find a function f that approximates f(xi) = yi with the smallest possible error. Further, we require that the function f generalizes to unseen data Xtest with labels Ytest. In the following, we reformulate this regression problem. Given a pair of data points (xitrain, xjtrain), we train a neural network (Figure 1) to find a function F that predicts the difference

(1) $F(x_i, x_j) = y_i - y_j .$

FIGURE 1: Reformulation of a regression problem: In the traditional case a neural network is trained to map an input x to its target value f(x) = y. We reformulate the task to take two inputs x1 and x2 and train a twin neural network to predict the difference between the target values, F(x2, x1) = y2 − y1. Hence, this difference can be employed as an estimator for y2 = F(x2, x1) + y1 given an anchor point (x1, y1).

This neural network can then be used as a solution of the original regression problem, yi = F(xi, xj) + yj. In this setting, we call (xj, yj) the anchor for the prediction of yi. This relation can be evaluated at every training data point xjtrain, such that the best estimate for the solution of the regression problem is obtained by averaging

(2) $y_i^{\mathrm{pred}} = \frac{1}{n}\sum_{j=1}^{n}\left( F(x_i, x_j^{\mathrm{train}}) + y_j^{\mathrm{train}} \right)$

(3) $= \frac{1}{n}\sum_{j=1}^{n}\left( \tfrac{1}{2}F(x_i, x_j^{\mathrm{train}}) - \tfrac{1}{2}F(x_j^{\mathrm{train}}, x_i) + y_j^{\mathrm{train}} \right) .$

In this formulation the TNNR output F(xi, xj) can be interpreted as a kernel. In kernel methods, the kernel k(xi, xj) together with trainable weights αj determines the magnitude of the contributions from a training data point to a new unseen data point. However, F(xi, xj) already includes all trainable weights, and the magnitude of F(xi, xj) is uniform in the ensembling process of TNNR. Moreover, F(xi, xj) is not symmetric under the exchange of xi and xj, while in kernel methods the kernel (usually) is symmetric.
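As a concrete illustration of the averaging in Equations (2) and (3), the sketch below turns a trained difference estimator into an ensemble prediction over all training anchors. It is a minimal sketch, not the authors' published code; the names predict_with_anchors and diff_model are assumptions, and diff_model stands for any callable that returns estimates of yi − yj for row-wise pairs of inputs.

```python
import numpy as np

def predict_with_anchors(diff_model, x_new, X_train, y_train):
    """Ensemble prediction for one query point x_new, following Eqs. (2) and (3)."""
    X_train = np.asarray(X_train)
    y_train = np.asarray(y_train, dtype=float)
    n = len(X_train)
    # Pair the query point with every training anchor.
    x_rep = np.repeat(np.asarray(x_new)[None, :], n, axis=0)
    d_fwd = np.ravel(diff_model(x_rep, X_train))   # F(x_new, x_j) ~ y_new - y_j
    d_bwd = np.ravel(diff_model(X_train, x_rep))   # F(x_j, x_new) ~ y_j - y_new
    # Symmetrized ensemble of 2n difference predictions, Eq. (3).
    estimates = 0.5 * d_fwd - 0.5 * d_bwd + y_train
    return estimates.mean(), estimates.std()
```

The returned standard deviation is simply the spread of the anchored predictions; it reappears below as one of the self-consistency signals.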
The first advantage of the reformulation is that, for a single prediction of yi, it creates an ensemble of predictions of differences yi − yj that is twice the size of the training set. While ensembles are in general costly to produce, TNNR intrinsically yields a very large ensemble at little additional training cost. Since ensembles suppress the variance contribution within the bias-variance decomposition,10 ensembles tend to be more accurate than single models as long as the ensemble members are sufficiently diverse.

In general, we do not expect the ensemble diversity of TNNR to be similar to that of traditional ensembles. The reason is that the prediction of a regression target yi is based on different predictions for the differences yi − yj due to multiple anchor points. This allows us to combine TNNR with any traditional ensemble method to achieve an even more accurate prediction. Each prediction of yi is made from a finite range of differences yi − yj, and the anchor data points differ by more than an infinitesimal perturbation. Hence, the TNNR ensemble is not just a pseudo ensemble obtained by small perturbations of the model weights.

The intrinsic ensembles of TNNs are not conventional ensembles. Like k-nearest neighbor regression or support vector regression, the prediction is formed by leveraging comparisons between a new unseen data point and several support vectors or nearest neighbors belonging to the training set. However, in contrast to these algorithms, TNNR can be seen from a single neural network perspective with weight diversity. Consider the prediction yi = F(xi, xj) + yj: xi can be viewed as the input to a traditional model that predicts yi, while xj can be understood as a set of auxiliary parameters which influence the weights. The offset yj can be seen as changing the bias of the output layer.

In principle, estimating a regression target through Equation (3) might be prone to a larger error than a traditional estimate. The reason for this is that TNNR must make predictions on two data points and thus also accumulates the noise of two target values. However, we average over the whole training set, and the noise is uncorrelated at different anchor data points. According to the central limit theorem, when independent random variables are summed up, their properly normalized sum tends toward a normal distribution even if the original variables themselves are not normally distributed. Thus, the impact of noisy anchor labels on the prediction is suppressed by a factor of 1/2n, where n is the training set size.

Self-consistency conditions

The twin neural network predicts the difference between two regression targets. For an accurate prediction this requires the satisfaction of many self-consistency conditions, see Figure 1. Any closed loop of difference predictions F(xi, xj) = yi − yj sums up to zero, and any violation of such a condition is an indication of an inaccurate prediction. In principle there are several ways to harness this self-consistency condition for practical use. First, it can be used to estimate the magnitude of the prediction error. Second, it could be utilized to force the neural network to satisfy these conditions during training.
Finally, it can enable one to use the predictions on previous test data as anchor points for new predictions.

The smallest loop contains only two data points xi, xj, for which an accurate TNN needs to satisfy

(4) $0 = (y_i - y_j) + (y_j - y_i) = F(x_i, x_j) + F(x_j, x_i) .$

During training, each batch includes both the pair (xi, xj) and its mirror copy (xj, xi) to enforce the satisfaction of this condition. The predictions on any three data points xi, xj, xk should satisfy

(5) $0 = F(x_i, x_j) + F(x_j, x_k) + F(x_k, x_i) .$

For xi, xj ∈ Xtrain the target values yi, yj are known. Thus, this condition becomes equivalent to the statement that the prediction of yk must be the same for any two different anchor points (xi, yi) and (xj, yj):

(6) $y_k = F(x_k, x_i) + y_i = F(x_k, x_j) + y_j .$

This condition is trivially enforced during training. We examine the relation between the magnitudes of the violations of these conditions and the prediction error in Section 4.2. To this end we employ the ensemble of predictions, calculate the SD corresponding to Equations (4) and (6), and find a distinct correlation with the out-of-domain prediction error. In Appendix D we discuss the interaction between different loop types.

Twin neural network architecture

The reformulation of the regression problem does not require a solution by artificial neural networks. However, neural networks scale favorably with the number of input features in the data set. We employ the same neural network for all data sets. Our TNN takes a pair of inputs (xi, xj), which is fed into a fully connected neural network with two hidden layers and a single output neuron. Each hidden layer consists of 64 neurons with a relu activation function. On data sets containing hierarchical structures, such as image data sets or audio recordings, it is helpful to include shared layers that only act on one element of the input pair. This is commonly used in few-shot learning in image recognition.19 The architecture in this article does not use any kind of weight sharing, and there is no latent space in which high-level representations are compared. Thus the network is equivalent to a fully connected feed-forward neural network that uses the concatenation of the features of two samples x1, x2 as input. We optimize a common architecture that works well for all data sets considered in this work. We examined different regularization methods like dropout and L2 regularization and found that in some cases a small L2 penalty improves the results; more details can be found in Appendix B. The improvement was not statistically significant or uniform among different splits of the data, which is why our main results omit any regularization. The training objective is to minimize the mean squared error on the training set. For this purpose we employ the standard gradient descent methods adadelta (and rmsprop) to minimize the loss on a batch of 16 pairs at each iteration. We stop the training if the validation loss stops decreasing.

The single feed-forward neural network (ANN) that we employ for our comparisons has a similar architecture to the TNN. This means it has the same hidden layers, and we examined the same hyperparameters. The convolutional neural networks are built on this architecture, with the first two dense layers exchanged for convolutional layers of 64 neurons and filter size 3. The neural networks are robust with respect to changes of the specific architectures in the sense that the results do not change significantly. The neural networks were implemented using keras60 and tensorflow.61
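A minimal keras sketch of the architecture described above (two concatenated inputs, two hidden layers of 64 relu units, a single linear output, adadelta optimizer and a mean squared error loss) could look as follows; build_tnn is an assumed name, and this is an illustration consistent with the text rather than the authors' released implementation.

```python
from tensorflow import keras

def build_tnn(n_features):
    """Twin network without weight sharing: the two inputs are concatenated
    and mapped to an estimate of the target difference y_i - y_j."""
    x_i = keras.Input(shape=(n_features,), name="x_i")
    x_j = keras.Input(shape=(n_features,), name="x_j")
    h = keras.layers.Concatenate()([x_i, x_j])
    h = keras.layers.Dense(64, activation="relu")(h)
    h = keras.layers.Dense(64, activation="relu")(h)
    diff = keras.layers.Dense(1, name="y_diff")(h)   # linear output for y_i - y_j
    model = keras.Model(inputs=[x_i, x_j], outputs=diff)
    model.compile(optimizer="adadelta", loss="mse")
    return model
```

A model built this way can also serve as the diff_model in the earlier prediction sketch, for example via lambda a, b: model.predict([a, b], verbose=0).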
Data preparation

We examine the performance of TNNR on different data sets: Boston housing (BH), concrete strength (CS), energy efficiency (EE), yacht hydrodynamics (YH), red wine quality (WN), Bio Conservation (BC), random polynomial (RP), RCL circuit (RCL), Wheatstone bridge (WSB), and the Ising Model (ISING). The common data sets can be found online in Ref. [62]. The science data sets are simulations of mathematical equations and physical systems. RP is a polynomial of degree two in five input features with random coefficients. RCL is a simulation of the electric current in an RCL circuit, and WSB is a simulation of the Wheatstone bridge voltage. ISING, a spin system on a 20 × 20 lattice with its corresponding energies, is used to demonstrate an image regression problem. More details can be found in Appendix A.

All data is split into 90% training, 5% validation, and 5% test data. Each run is performed on a randomly chosen different split of the data. We normalize and center the input features to a range of [−1, 1] based on the training data. In the context of uncertainty estimation we further divide all data based on their regression targets y. In this manner we choose to exclude a certain range from the training data. Hence, we can examine the performance of the neural networks outside of the training domain.

While the data can be fed into a standard ANN in a straightforward manner, one must be careful in the preparation of the TNN data. Starting with a training data set of n data points, we can create n² different training pairs for the TNN. Hence, the TNN has the advantage of having access to a much larger training set than the ANN. However, in the case of a large number of training examples, it does not make sense to store all pairs due to memory constraints. That is why we train on a generator which generates all possible pairs batchwise before reshuffling.
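One possible way to realize such a generator is sketched below; pair_batches is an assumed name, and, unlike the training procedure described earlier, it does not explicitly place a pair and its mirror copy in the same batch but simply cycles through all n² ordered pairs in shuffled order.

```python
import numpy as np

def pair_batches(X, y, batch_size=16, rng=None):
    """Yield ([x_i batch, x_j batch], y_i - y_j batch) over all ordered pairs,
    reshuffling after every full pass, so the full pair matrix is never stored."""
    rng = rng or np.random.default_rng()
    X, y = np.asarray(X), np.asarray(y, dtype=float)
    n = len(X)
    while True:
        order = rng.permutation(n * n)               # indices of all ordered pairs
        for start in range(0, n * n, batch_size):
            idx = order[start:start + batch_size]
            i, j = idx // n, idx % n                 # decode pair indices
            yield [X[i], X[j]], y[i] - y[j]
```

Such a generator can be passed directly to model.fit together with a steps_per_epoch argument, since keras accepts Python generators of (inputs, targets) batches.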
EXPERIMENTS

Prediction accuracy

We train kernel ridge regression (KRR), random forests (RF), xgboost (XGB), traditional single neural networks (ANNs), TNNs, and ensembles to solve the regression tasks outlined in Section 3.4. The hyperparameters of KRR, RF, and XGB are optimized for each data set via cross-validation; the ranges of the parameters are explained in Appendix B. The performance on these data sets is measured by the root mean square error (RMSE), which is shown in Table 1. Each result is averaged over 20 different random splits of the data; in this manner we find the best estimate of the RMSE and the corresponding SE. We also produce explicit ensembles (E) by training the corresponding ANN and TNN 20 times. This means each RMSE in the table requires training 20 times, or 400 times for the ensembles (E).

TABLE 1: Best estimates for root mean square errors (RMSEs) of different algorithms on the test sets belonging to different data sets

Common data | BH | CS | EE | YH | WN | BC
KRR | 3.01 ± 0.24 | 5.76 ± 0.21 | 1.67 ± 0.06 | 1.00 ± 0.17 | 0.63 ± 0.01 | 0.73 ± 0.02
RF | 4.24 ± 0.29 | 8.23 ± 0.24 | 2.22 ± 0.08 | 2.95 ± 0.46 | 0.64 ± 0.02 | 0.71 ± 0.03
XGB | 2.93 ± 0.18 | 4.37 ± 0.19 | 1.17 ± 0.04 | 0.42 ± 0.06 | 0.61 ± 0.01 | 0.70 ± 0.03
ANN | 3.09 ± 0.14 | 5.37 ± 0.17 | 0.98 ± 0.03 | 0.52 ± 0.07 | 0.64 ± 0.01 | 0.76 ± 0.02
ANN (E) | 3.43 ± 0.32 | 5.14 ± 0.21 | 0.89 ± 0.04 | 0.43 ± 0.05 | 0.62 ± 0.01 | 0.72 ± 0.03
MCD | 2.95 ± 0.15 | 6.07 ± 0.21 | 2.96 ± 0.12 | 1.42 ± 0.18 | 0.68 ± 0.01 | 0.72 ± 0.03
TNN | 2.55 ± 0.10 | 4.19 ± 0.25 | 0.52 ± 0.02 | 0.49 ± 0.07 | 0.62 ± 0.01 | 0.83 ± 0.03
TNN (E) | 2.61 ± 0.20 | 3.88 ± 0.22 | 0.46 ± 0.02 | 0.37 ± 0.06 | 0.63 ± 0.01 | 0.72 ± 0.02

Science data | RP | RCL | WSB
KRR | 0.022 ± 0.001 | 0.020 ± 0.001 | 0.028 ± 0.001
RF | 0.604 ± 0.013 | 0.288 ± 0.004 | 0.141 ± 0.011
XGB | 0.229 ± 0.005 | 0.124 ± 0.002 | 0.071 ± 0.006
ANN | 0.050 ± 0.002 | 0.019 ± 0.000 | 0.047 ± 0.004
ANN (E) | 0.032 ± 0.002 | 0.016 ± 0.001 | 0.031 ± 0.002
MCD | 0.086 ± 0.002 | 0.033 ± 0.001 | 0.042 ± 0.003
TNN | 0.022 ± 0.001 | 0.017 ± 0.000 | 0.020 ± 0.001
TNN (E) | 0.016 ± 0.001 | 0.014 ± 0.001 | 0.022 ± 0.002

Image data | ISING
KRR | 0.382 ± 0.006
RF | 0.601 ± 0.003
XGB | 0.144 ± 0.003
CNN | 0.050 ± 0.001
CNN (E) | 0.044 ± 0.001
CMCD | 0.052 ± 0.001
CTNN | 0.035 ± 0.001
CTNN (E) | 0.030 ± 0.001

Note: The lowest RMSE in each column marks the best-performing algorithm. Our confidence in the RMSEs is determined by their SE. Data sets: Boston housing (BH), concrete strength (CS), energy efficiency (EE), yacht hydrodynamics (YH), red wine quality (WN), Bio Conservation (BC), random polynomial (RP), RCL circuit (RCL), Wheatstone bridge (WSB) and the Ising Model (ISING). Algorithms: random forests (RF), xgboost (XGB), neural networks (ANN), Monte-Carlo dropout networks (MCD), twin neural networks (TNN) and ensembles (E) or convolutional variants (C).

In Table 1 we see that TNNs outperform single ANNs and even ensembles of ANNs on all data sets except BC; we assume this is a statistical outlier. Creating an ensemble of TNNs increases the performance even further. The significance of the difference in performance is supported by the comparably small SE. On discrete data, especially YH, WN, and BC, XGB slightly outperforms ANNs at a much shorter training time; however, TNNs are able to compete. While XGB beats KRR on the common data sets, especially the discrete ones, KRR beats all decision-tree-based methods on the science data sets, due to the smooth nature of the learned function. On these science data sets, neural-network-based methods beat tree-based methods by orders of magnitude, and KRR to a smaller extent. On image data, convolutional neural networks outperform KRR, RF, and XGB. The general trend is that TNNs outperform ANNs.

The outperformance of TNNR comes at the cost of training time: as we can see in Table 3, the TNN takes approximately 7-20 times longer to converge than a single ANN, while all other algorithms are much faster than ANNs. While TNNR performs best across the board, we would only recommend using TNNR if training time is not a bottleneck.

The outperformance of TNNR over other regression algorithms is based on exploiting the pairing of data points to create a huge training data set and a large ensemble at inference time. If the training set is very large, the number of pairs increases quadratically to a point where the TNN will in practice converge to a minimum before observing all possible pairs. At that point, the TNN begins to lose its advantages in terms of prediction accuracy.
A visualization of this fact can be seen in Figure 2, in which different ANN and TNN architectures are applied to the RP data set. We note that, since the RP data set is created from a mathematical function, we can create an arbitrary number of data points. One may observe the performance improvement, in terms of lower RMSE, of all neural networks when increasing the number of training data points. The TNN achieves a lower RMSE than the ANN on small and medium-sized data sets. The TNN reaches an accuracy plateau sooner than the ANN, such that both algorithms perform similarly well in the regime between 60 000 and 100 000 data points. When training on large data sets, early stopping based on the validation loss interrupts the training before all training pairs are seen by the algorithm.

FIGURE 2: Traditional ANN and TNN regression applied to the random polynomial function (RP). Standard architectures (2 × 64 hidden neurons) and optimized architectures (2 × 128 hidden neurons) are trained on training data sets containing n = 100, …, 100 000 data points. Larger training sets reduce the RMSE; TNNR beats ANN regression for n below roughly 60 000-100 000.

Prediction uncertainty estimation

Equipped with a regression method that yields a huge ensemble of predictions constrained by self-consistency conditions, we examine reasonable proxies for the prediction error. In that sense we must distinguish between in-domain prediction uncertainty and out-of-domain prediction uncertainty. In-domain refers to data from the same manifold as the training data. However, if the test manifold differs from the training manifold, one is concerned with out-of-domain data. This is the case in interpolation or extrapolation, or if certain features in the test data exhibit a different correlation than in the training data.

Consider, for example, the polynomial function of one variable (Figure 3) as an example of an out-of-domain interpolation problem. The TNN is very accurate on the test set as long as one stays on the training manifold. As soon as the test data leaves the training manifold, the prediction becomes significantly worse. While the prediction resembles a line connecting the endpoints of the predictions on the training manifold, it fails to capture the true function. Since the difference between the true function and the regression result is consistently larger than the SD, we further conclude that the SD of the TNNR result cannot be used as a direct estimate of the prediction uncertainty.

FIGURE 3: TNNR applied to a simple function to demonstrate its behavior outside of the training domain (interpolation in this case). For intervals within the training domain, the TNN is able to reproduce the original function (black dotted lines) accurately. Over this interval the model has low uncertainty, measured by the SD of the TNN ensemble prediction; this is equivalent to the satisfaction of the self-consistency condition, Equation (6). Conversely, within the interval which was obscured during training, the performance of the model is poor. The corresponding higher uncertainty, or violation of the self-consistency conditions, is a signal that the model is performing poorly.

In the following, we examine the possibility of a meaningful estimation of the prediction uncertainty on the BH, CS, EE, and RP data sets. Before training, we separate 25% of the data as an out-of-domain test set testout, depending on a threshold on the regression target y, to simulate an extrapolation problem. We use 50% as training data, 15% as test data testin, and 10% as validation data.

We examine whether it is possible to estimate the prediction uncertainty using established methods. We perform Monte-Carlo dropout on an ANN.45 This method is based on applying dropout during the prediction phase to estimate the uncertainty of the prediction. In the case of TNNs we examine the standard deviations of the violations of the self-consistency conditions (Equations (4) and (6)), which includes the standard deviations of each single prediction. Finally, we compare the latent space distance58 of each prediction to the training data with its prediction error. In this case the projection into latent space is the output of the second-to-last layer of the neural network.
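For the TNN-based signals, a small sketch of how such proxies could be computed from the same pairwise interface as before is given below; consistency_signals is an assumed name, and the two returned quantities correspond to the spread of anchored predictions (Equation (6)) and the spread of 2-loop violations (Equation (4)).

```python
import numpy as np

def consistency_signals(diff_model, x_new, X_train, y_train):
    """Return two uncertainty proxies for the query point x_new."""
    X_train = np.asarray(X_train)
    y_train = np.asarray(y_train, dtype=float)
    n = len(X_train)
    x_rep = np.repeat(np.asarray(x_new)[None, :], n, axis=0)
    d_fwd = np.ravel(diff_model(x_rep, X_train))   # F(x_new, x_j)
    d_bwd = np.ravel(diff_model(X_train, x_rep))   # F(x_j, x_new)
    anchored = d_fwd + y_train                     # anchored predictions of y_new, Eq. (6)
    two_loop = d_fwd + d_bwd                       # should vanish for a consistent model, Eq. (4)
    return np.std(anchored), np.std(two_loop)
```

Large values of either signal, relative to their typical in-domain magnitude, would then flag predictions that should not be trusted, in line with the practical observation below.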
The results of these examinations can be found in Figure 4. We differentiate between predictions on three different data sets: the training set, the in-domain test set testin, and the out-of-domain test set testout. A general observation is that the prediction error of each single sample tends to be smallest on the training set, higher on testin, and even higher on testout (Table 2). We can also see that on the training set not a single method for prediction uncertainty estimation is accurate. However, we find a strong correlation between some of the methods and the prediction error on the test sets testin and testout. There is no clear evidence that any of the prediction uncertainty estimation methods is uniformly better than any other; however, dropout seems to be worse than the other methods. It might make sense to combine several uncertainty estimates to determine whether a prediction can be trusted.

FIGURE 4: Comparison of different estimators for the prediction uncertainty; the axes are logarithmic. Dropout at inference time is applied to single ANNs. For TNNs we examine the SD of the prediction and the SD of the self-consistency condition, Equation (4). Further, we include the latent space distance to the training set.

TABLE 2: Comparison of ANN dropout ensembles and TNN ensembles, with the test data divided into testin, which is on the training manifold, and testout, which is outside the training manifold, in the context of extrapolation

Data | ANN MC dropout (RMSEtrain / RMSEtestin / RMSEtestout) | TNN (RMSEtrain / RMSEtestin / RMSEtestout)
BH | 2.643 / 4.429 / 11.540 | 1.137 / 4.468 / 11.729
CS | 5.505 / 6.639 / 10.199 | 3.096 / 4.408 / 5.000
EE | 3.373 / 3.737 / 3.689 | 0.741 / 1.329 / 2.187
RP | 0.046 / 0.075 / 0.520 | 0.015 / 0.022 / 0.207

Further, we make a practical observation. As long as any of the estimators, applied to an unseen data point, is of the same magnitude as on the in-domain test set testin, we expect the prediction error to be best estimated by RMSEtestin. If any of the estimators is larger than that (as is often the case on testout), we expect the prediction error to be larger than the test error. This observation explains why the TNN self-consistency check can be employed as a proxy for decreasing prediction accuracy.

CONCLUSIONS

We have introduced TNNR. It is based on the reformulation of a traditional regression problem into a prediction of the difference between two regression targets, after which an ensemble is created by averaging the differences anchored at all training data points. This bears two advantages: (a) the number of training data points increases quadratically over the original data set; (b) by anchoring each prediction at all training data points, one obtains an ensemble of predictions of twice the training data set size.
Although a straightforward comparison is difficult, compared with trajectory ensemble methods,44 which typically produce approximately five snapshots in the time it takes to train one traditional network, TNNR produces an ensemble of twice the size of the training data set at 7 to 20 times the training time (see Table 3). We have demonstrated that TNNR can compete with and outperform traditional regression methods including ANN regression, kernel-ridge regression, random forests, and xgboost. Building an ensemble of TNNs can improve the predictive accuracy even further (Table 1). TNNs are competitive with tree-based methods on discrete data sets; however, xgboost is significantly faster to train. On continuous data sets and on image-based data sets, TNNs are the clear winner. However, if training time is of importance, kernel-ridge regression produces good predictions at only a fraction of the training time of neural networks. In the case of a large number of training data, TNNR might not see all possible pairs before convergence, such that it cannot leverage its full advantage over traditional ANNs; in this case, ANNs are able to compete with TNNR, as shown in Figure 2. Since the number of anchor points during inference is only linear in the number of training data points, sampling is normally not necessary. A successfully trained TNN must satisfy many self-consistency conditions. The violation of these conditions gives rise to an uncertainty estimate for the final prediction (Figure 4). TNNR can naturally serve as the basis for a semi-supervised regression algorithm.63 It can be trained on loops containing three data points to satisfy the loop consistency condition, even if the data points within the loops are unlabeled. Future directions include intelligently weighting the ensemble members through improved ensembling techniques. Some problems might benefit from exchanging the ensemble mean for the median. It would also be interesting to examine the ensemble diversity of TNNR compared with other ensembles, and to see whether it makes sense to perform twinned regression with other algorithms.

TABLE 3: Comparison of training times of different algorithms

Algorithm | Minimal time, s (WSB) | Maximal time, s (RCL)
KRR | 0.3 | 45.1
RF | 3.7 | 12.5
XGB | 1.3 | 15.6
ANN | 4.3 | 46.2
TNN | 26.4 | 886.5

Note: The minimal training time is measured on the smallest data set, WSB, while the maximal training time occurs on the largest data set, RCL.

The code and data supporting this publication are available in Ref. [64]. All comparisons were made using the scikit-learn library.65

ACKNOWLEDGMENTS

Research at Perimeter Institute is supported in part by the Government of Canada through the Department of Innovation, Science and Economic Development Canada and by the Province of Ontario through the Ministry of Economic Development, Job Creation and Trade. We thank the National Research Council of Canada for their partnership with Perimeter on the PIQuIL. R.G.M. and I.T. acknowledge NSERC. R.G.M. is supported by the Canada Research Chair Program. We also acknowledge Compute Canada for computational resources.

DATA AVAILABILITY STATEMENT

The data that support the findings of this study are openly available in the UCI Machine Learning Repository; https://archive.ics.uci.edu/ml/datasets.php (accessed: 1 May 2020).

REFERENCES

Park B, Bae JK. Using machine learning algorithms for housing price prediction: the case of Fairfax county, Virginia housing data. Expert Syst Appl. 2015;42(6):2928-2934.Chokr M, Elbassuoni S. Calories prediction from food images.
In Proceedings of the Thirty‐First AAAI Conference on Artificial Intelligence, AAAI'17; 2017:4664‐4669.Patel J, Shah S, Thakkar P, Kotecha K. Predicting stock and stock price index movement using trend deterministic data preparation and machine learning techniques. Expert Syst Appl. 2015;42(1):259‐268.Sun Z‐L, Choi T‐M, Kin‐Fan A, Yong Y. Sales forecasting using extreme learning machine with applications in fashion retailing. Decis Support Syst. 2008;46(1):411‐419.Rupp M, Tkatchenko A, Müller K‐R, von Lilienfeld OA. Fast and accurate modeling of molecular atomization energies with machine learning. Phys Rev Lett. 2012;108:058301.Ryczko K, Strubbe DA, Tamblyn I. Deep learning and density‐functional theory. Phys Rev A. 2019;100:022512.Schütt K, Kindermans P‐J, Felix HES, Chmiela S, Tkatchenko A, Müller K‐R. Schnet: a continuous‐filter convolutional neural network for modeling quantum interactions. In: Guyon I, Luxburg UV, Bengio S, et al., eds. Advances in Neural Information Processing Systems 30. Red Hook, NY: Curran Associates Inc; 2017:991‐1001.Chandrasekaran A, Kamal D, Batra R, Kim C, Chen L, Ramprasad R. Solving the electronic structure problem with machine learning. npj Comput Mater. 2019;5(1):22.Malkiel I, Mrejen M, Nagler A, Arieli U, Wolf L, Suchowski H. Plasmonic nanostructure design and characterization via deep learning. Light Sci Appl. 2018;7(1):60.Krzywinski M, Altman N. Importance of being uncertain. Nat Methods. 2013;10(9):809‐810.Ghahramani Z. Probabilistic machine learning and artificial intelligence. Nature. 2015;521(7553):452‐459.SAE. Taxonomy and Definitions for Terms Related to Driving Automation Systems for On‐Road Motor Vehicles. Warrendale, PA: SAE; 2018.McKinney SM, Sieniek M, Godbole V, et al. International evaluation of an AI system for breast cancer screening. Nature. 2020;577(7788):89‐94.Ulissi ZW, Medford AJ, Bligaard T, Nørskov JK. To address surface reaction network complexity using scaling relations machine learning and DFT calculations. Nat Commun. 2017;8(1):1‐7.Kasim MF, Watson‐Parris D, Deaconu L, et al. Building high accuracy emulators for scientific simulations with deep neural architecture search. Machine Learning: Science and Technology. 2020;3(1).Zhang L, Lin DY, Wang H, Car R, Weinan E. Active learning of uniformly accurate interatomic potentials for materials simulation. Phys Rev Mater. 2019;3(2):023804.Zhong M, Tran K, Min Y, et al. Accelerated discovery of CO2 electrocatalysts using active machine learning. Nature. 2020;581(7807):178‐183.Baldi P, Chauvin Y. Neural networks for fingerprint recognition. Neural Comput. 1993;5(3):402‐418.Bromley J, Guyon I, LeCun Y, Säckinger E, Shah R. Signature verification using a "siamese" time delay neural network. In: Cowan JD, Tesauro G, Alspector J, eds. Advances in Neural Information Processing Systems 6. Burlington, MA: Morgan‐Kaufmann; 1994:737‐744.LeCun Y, Boser B, Denker JS, et al. Backpropagation applied to handwritten zip code recognition. Neural Comput. 1989;1(4):541‐551.Taigman Y, Yang M, Ranzato M, Wolf L. Deepface: closing the gap to human‐level performance in face verification. In 2014 IEEE Conference on Computer Vision and Pattern Recognition; 2014:1701‐1708.Koch G, Zemel R, Salakhutdinov R. Siamese neural networks for one‐shot image recognition. In ICML Deep Learning Workshop, volume 2, Lille; 2015.Bertinetto L, Valmadre J, Henriques JF, Vedaldi A, Torr PHS. Fully‐convolutional siamese networks for object tracking. 
In European Conference on Computer Vision, Springer; 2016:850‐865.Bao H, Niu G, Sugiyama M. Classification from pairwise similarity and unlabeled data. In International Conference on Machine Learning; 2018:452‐461.Doumanoglou A, Balntas V, Kouskouridas R, Kim T‐K. Siamese regression networks with efficient mid‐level feature extraction for 3d object pose estimation. arXiv. 2016. arXiv:1607.02257.Li MD, Chang K, Bearce B, et al. Siamese neural networks for continuous disease severity evaluation and change detection in medical imaging. npj Digit Med. 2020;3(1):48.Wetzel SJ, Melko RG, Scott J, Panju M, Ganesh V. Discovering symmetry invariants and conserved quantities by interpreting siamese neural networks. Phys Rev Res. 2020;2(3):033499.Cohen WW, Schapire RE, Singer Y. Learning to order things. J Artif Intell Res. 1999;10:243‐270.Burges C, Shaked T, Renshaw E, et al. Learning to rank using gradient descent. In Proceedings of the 22nd International Conference on Machine Learning; 2005:89‐96.Seeger M. Gaussian processes for machine learning. Int J Neural Syst. 2004;14(2):69‐106.Wilson AG, Hu Z, Salakhutdinov R, Xing EP. Deep kernel learning. In Artificial Intelligence and Statistics, PMLR; 2016:370‐378.Bartók AP, Payne MC, Kondor R, Csányi G. Gaussian approximation potentials: the accuracy of quantum mechanics, without the electrons. Phys Rev Lett. 2010;104:136403.Koistinen OP, Maras E, Vehtari A, Jónsson H. Minimum energy path calculations with Gaussian process regression. Nanosyst: Phys, Chem, Math. 2016;7:925‐935.Simm GN, Reiher M. Error‐controlled exploration of chemical reaction networks with Gaussian processes. J Chem Theory Comput. 2018;14(10):5238‐5248.Proppe J, Gugler S, Reiher M. Gaussian process‐based refinement of dispersion corrections. J Chem Theory Comput. 2019;15(11):6046‐6060.Liu H, Ong Y‐S, Shen X, Cai J. When gaussian process meets big data: a review of scalable gps. IEEE Trans Neural Netw Learn Syst. 2020;31(11):4405‐4423.Naftaly U, Intrator N, Horn D. Theory of ensemble averaging: optimal ensemble averaging of neural networks. Comput Neural Syst. 1997;8:283‐296.Bishop CM. Pattern Recognition and Machine Learning. Singapore: Springer; 2006.Von Luxburg U, Schölkopf B. Statistical learning theory: models, concepts, and results. In: Gabbay DM, Hartmann S, Woods J, eds. Handbook of the History of Logic. Vol 10. Amsterdam, Netherlands: Elsevier; 2011:651‐706.Hansen LK, Salamon P. Neural network ensembles. IEEE Trans Pattern Anal Mach Intell. 1990;12(10):993‐1001.Krogh A, Vedelsby J. Neural network ensembles, cross validation and active learning. In Proceedings of the 7th International Conference on Neural Information Processing Systems, NIPS'94; 1994:231‐238.Swann A, Allinson N. Fast committee learning: preliminary results. Electron Lett. 1998;34(14):1408‐1410.Xie J, Xu B, Chuang Z. Horizontal and vertical ensemble with deep representation for classification. arXiv. 2013. arXiv:1306.2759.Huang G, Li Y, Pleiss G, Liu Z, Hopcroft JE, Weinberger KQ. Snapshot ensembles: train 1, get m for free. arXiv. 2017. arXiv:1704.00109.Gal Y, Ghahramani Z. Dropout as a bayesian approximation: representing model uncertainty in deep learning. In International Conference on Machine Learning; 2016:1050‐1059.Bachman P, Alsharif O, Precup D. Learning with pseudo‐ensembles. In Proceedings of the 27th International Conference on Neural Information Processing Systems ‐ Volume 2, NIPS'14; 2014:3365‐3373.Tran L, Veeling BS, Roth K, et al. Hydra: preserving ensemble diversity for model distillation. arXiv. 
2020. arXiv:2001.04694.Ashukha A, Lyzhov A, Molchanov D, Vetrov D. Pitfalls of in‐domain uncertainty estimation and ensembling in deep learning. arXiv. 2020. arXiv:2002.06470.Lakshminarayanan B, Pritzel A, Blundell C. Simple and scalable predictive uncertainty estimation using deep ensembles. In Advances in Neural Information Processing Systems; 2017:6402‐6413.Cortés‐Ciriano I, Bender A. Deep confidence: a computationally efficient framework for calculating reliable prediction errors for deep neural networks. J Chem Inf Model. 2019;59(3):1269‐1281.Janet JP, Duan C, Yang T, Nandy A, Kulik HJ. A quantitative uncertainty metric controls error in neural network‐driven chemical discovery. Chem Sci. 2019;10(34):7913‐7922.Morais CLM, Lima KMG, Martin FL. Uncertainty estimation and misclassification probability for classification models based on discriminant analysis and support vector machines. Anal Chim Acta. 2019;1063:40‐46.Musil F, Willatt MJ, Langovoy MA, Ceriotti M. Fast and accurate uncertainty estimation in chemical machine learning. J Chem Theory Comput. 2019;15(2):906‐915.Gneiting T, Raftery AE. Strictly proper scoring rules, prediction, and estimation. J Am Stat Assoc. 2007;102(477):359‐378.Dawid AP, Musio M. Theory and applications of proper scoring rules. Metron. 2014;72(2):169‐183.Peterson AA, Christensen R, Khorshidi A. Addressing uncertainty in atomistic machine learning. Phys Chem Chem Phys. 2017;19(18):10978‐10985.Liu R, Glover KP, Feasel MG, Wallqvist A. General approach to estimate error bars for quantitative structure‐activity relationship predictions of molecular activity. J Chem Inf Model. 2018;58(8):1561‐1575.Tran K, Neiswanger W, Yoon J, Zhang Q, Xing E, Ulissi ZW. Methods for comparing uncertainty quantifications for material property predictions. Mach Learn: Sci Technol. 2020;1(2):025006.Ovadia Y, Fertig E, Ren J, et al. Can you trust your model's uncertainty? Evaluating predictive uncertainty under dataset shift. Adv Neural Inf Process Syst. 2019;32:14003‐14014.Chollet F. Keras; 2015 https://keras.ioAbadi M, Agarwal A, Barham P, et al. TensorFlow: Large‐scale machine learning on heterogeneous systems. arXiv. 2015. arXiv:1603.04467.UCI. Machine learning repository; 2020. https://archive.ics.uci.edu/ml/datasets.php. Accessed May 1, 2020.Wetzel SJ, Melko RG, Tamblyn I. Twin neural network regression is a semi‐supervised regression algorithm. arXiv. 2021. arXiv:2106.06124.Wetzel S. Public tnnr Github repository; 2022.Pedregosa F, Varoquaux G, Gramfort A, et al. Scikit‐learn: machine learning in python. J Mach Learn Res. 2011;12:2825‐2830.AAPPENDIXA.1Data setsWe describe the properties of the data sets in Table A1. In addition to the common reference data sets we included scientific data and image‐like data for regression. 
The random polynomial function (RP) is a data set created from the equation

(A1) $F(x_1, \dots, x_5) = \sum_{i,j=0}^{5} a_{ij} x_i x_j + \sum_{i}^{5} b_i x_i + c$

with random but fixed coefficients and zero noise.

TABLE A1: Data sets

Name | Key | Size | Features | Type
Boston Housing | BH | 506 | 13 | Discrete, continuous
Concrete strength | CS | 1030 | 8 | Continuous
Energy efficiency | EF | 768 | 8 | Discrete, continuous
Yacht hydrodynamics | YH | 308 | 6 | Discrete
Red wine quality | WN | 1599 | 11 | Discrete, continuous
Bio concentration | BC | 779 | 14 | Discrete, continuous
Random polynomial function | RP | 1000 | 5 | Continuous
RCL circuit current | RCL | 4000 | 6 | Continuous
Wheatstone bridge voltage | WSB | 200 | 4 | Continuous
Ising model | ISING | 2000 | 400 | Image, discrete

The output in the RCL circuit current data set (RCL) is the current through an RCL circuit, modeled by the equation

(A2) $I_0 = \frac{V_0 \cos(\omega t)}{\sqrt{R^2 + \left(\omega L - \frac{1}{\omega C}\right)^2}}$

with added Gaussian noise of mean 0 and SD 0.1. The output of the Wheatstone Bridge voltage (WSB) is the measured voltage, given by the equation

(A3) $V = U \left( \frac{R_2}{R_1 + R_2} - \frac{R_3}{R_2 + R_3} \right)$

with added Gaussian noise of mean 0 and SD 0.1.

APPENDIX B

B.1 Hyperparameter optimization

B.1.1. KRR, RF, XGB

We outline the parameters that are optimized during cross-validation for the scikit-learn implementations of kernel ridge regression (KRR), random forests (RF), and XGBoost (XGB). For KRR we tested different kernels, including the radial basis function (RBF), polynomial, and sigmoid kernels, of which RBF worked best. The results are produced using grid-search hyperparameter optimization with the RBF kernel and α ∈ [1, 0.1, 0.01, 0.001], γ ∈ [0.01, 0.1, 1, 10, 100]. In the case of RF and XGB we use the same hyperparameter grids, since both methods build upon decision trees: max depth ∈ [2, 3, 4] and min samples ∈ [2, 3, 4].

B.1.2. TNNR

In this section we include additional results of experiments on the data sets Boston Housing (BH), concrete strength (CS), energy efficiency (EE), and random polynomial function (RP) (see Tables B1 to B4). We varied the L2 regularization. We further exchanged our main optimizer adadelta for rmsprop. Finally, we examined the checkpoint that saves the weights based on the best performance on the validation set. We find that adadelta seems to give more consistent results. As long as the L2 penalty is not too large, there is no clear evidence to favor a certain L2 penalty over another.
We further examined modifications to the architecture, dropout regularization, different learning rate schedules in preliminary experiments, which are not listed here, since none of these changes led to significant and uniform improvement.B1TABLERoot mean squared errors (RMSEs) of the validation sets for different datasets, machine learning approaches, and regularizations (Optimizer: adadelta, with model checkpoint saving)DataANNANN (E)MC DropoutTNNTNN (E)L2 = 0BH2.544 ± 0.1682.780 ± 0.2523.375 ± 0.3212.656 ± 0.2432.286 ± 0.155CS4.964 ± 0.1614.595 ± 0.2055.941 ± 0.1873.758 ± 0.2143.247 ± 0.166EE0.819 ± 0.0350.791 ± 0.0302.893 ± 0.1330.494 ± 0.0180.452 ± 0.020RP0.045 ± 0.0030.031 ± 0.0020.087 ± 0.0030.022 ± 0.0010.012 ± 0.001L2 = 1e − 6BH2.477 ± 0.1632.743 ± 0.2823.388 ± 0.1932.476 ± 0.1592.347 ± 0.135CS4.669 ± 0.1854.714 ± 0.1615.995 ± 0.1863.751 ± 0.2633.398 ± 0.228EE0.968 ± 0.0440.777 ± 0.0342.973 ± 0.0920.502 ± 0.0330.452 ± 0.022RP0.049 ± 0.0020.028 ± 0.0020.079 ± 0.0040.020 ± 0.0010.011 ± 0.001L2 = 1e − 5BH2.845 ± 0.1772.202 ± 0.0823.126 ± 0.2362.931 ± 0.2442.239 ± 0.214CS4.599 ± 0.1574.403 ± 0.1985.675 ± 0.1943.715 ± 0.3183.480 ± 0.275EE1.029 ± 0.0350.799 ± 0.0342.905 ± 0.1230.500 ± 0.0190.433 ± 0.019RP0.064 ± 0.0020.032 ± 0.0030.086 ± 0.0040.022 ± 0.0010.012 ± 0.001L2 = 1e − 4BH2.875 ± 0.1602.750 ± 0.2953.330 ± 0.2312.933 ± 0.1872.359 ± 0.187CS5.178 ± 0.1554.249 ± 0.1285.730 ± 0.1753.607 ± 0.1933.047 ± 0.176EE1.763 ± 0.0300.853 ± 0.0342.747 ± 0.1060.502 ± 0.0270.448 ± 0.024RP0.107 ± 0.0020.024 ± 0.0010.097 ± 0.0040.032 ± 0.0020.019 ± 0.001L2 = 1e − 3BH4.902 ± 0.1032.368 ± 0.1752.722 ± 0.1102.293 ± 0.1192.471 ± 0.189CS8.001 ± 0.1334.034 ± 0.1366.102 ± 0.1453.547 ± 0.1983.537 ± 0.168EE4.651 ± 0.0590.778 ± 0.0322.877 ± 0.0980.481 ± 0.0220.467 ± 0.021RP0.268 ± 0.0040.055 ± 0.0050.110 ± 0.0040.056 ± 0.0030.052 ± 0.003B2TABLERoot mean squared errors (RMSEs) of the test sets for different datasets, machine learning approaches, and regularizations (Optimizer: adadelta, with model checkpoint saving)DataANNANN (E)MC dropoutTNNTNN (E)L2 = 0BH3.093 ± 0.1403.432 ± 0.3172.953 ± 0.1542.552 ± 0.1042.614 ± 0.204CS5.373 ± 0.1685.136 ± 0.2076.067 ± 0.2094.186 ± 0.2493.882 ± 0.218EE0.984 ± 0.0330.894 ± 0.0402.957 ± 0.1170.524 ± 0.0220.456 ± 0.020RP0.050 ± 0.0020.032 ± 0.0020.086 ± 0.0020.022 ± 0.0010.016 ± 0.001L2 = 1e − 6BH3.170 ± 0.2222.749 ± 0.2003.226 ± 0.2062.869 ± 0.2332.554 ± 0.134CS5.133 ± 0.1724.963 ± 0.1926.293 ± 0.1794.030 ± 0.2254.025 ± 0.258EE1.125 ± 0.0590.992 ± 0.0423.057 ± 0.1200.590 ± 0.0270.468 ± 0.020RP0.054 ± 0.0030.034 ± 0.0020.085 ± 0.0030.020 ± 0.0010.011 ± 0.001L2 = 1e − 5BH3.184 ± 0.1732.996 ± 0.2863.780 ± 0.3773.315 ± 0.2293.201 ± 0.384CS4.991 ± 0.2214.712 ± 0.1765.934 ± 0.1393.880 ± 0.2764.044 ± 0.228EE1.255 ± 0.0430.906 ± 0.0372.943 ± 0.1220.570 ± 0.0260.477 ± 0.026RP0.067 ± 0.0020.029 ± 0.0020.092 ± 0.0070.026 ± 0.0020.014 ± 0.001L2 = 1e − 4BH3.559 ± 0.1802.890 ± 0.1803.352 ± 0.2923.173 ± 0.2542.740 ± 0.147CS5.799 ± 0.1714.924 ± 0.1835.777 ± 0.1773.878 ± 0.2053.743 ± 0.251EE1.891 ± 0.0450.923 ± 0.0362.973 ± 0.1030.581 ± 0.0290.477 ± 0.021RP0.110 ± 0.0010.035 ± 0.0050.088 ± 0.0040.032 ± 0.0020.023 ± 0.002L2 = 1e − 3BH5.382 ± 0.1632.861 ± 0.1983.566 ± 0.2793.039 ± 0.2943.048 ± 0.270CS8.325 ± 0.1404.964 ± 0.2206.132 ± 0.1803.953 ± 0.2204.186 ± 0.233EE4.714 ± 0.0541.083 ± 0.0712.891 ± 0.1220.613 ± 0.0240.533 ± 0.026RP0.270 ± 0.0040.066 ± 0.0040.113 ± 0.0060.061 ± 0.0030.054 ± 0.003B3TABLERoot mean squared errors (RMSEs) of the validation sets for different datasets, machine 
learning approaches, and regularizations (Optimizer: rmsprop, no model checkpoint saving)DataANNANN (E)MC dropoutTNNTNN (E)L2 = 0BH2.865 ± 0.2703.151 ± 0.3023.328 ± 0.2872.724 ± 0.2252.548 ± 0.155CS5.209 ± 0.2475.021 ± 0.2376.221 ± 0.1784.582 ± 0.1804.148 ± 0.147EE1.199 ± 0.0830.888 ± 0.0423.057 ± 0.1160.828 ± 0.0620.711 ± 0.034RP0.049 ± 0.0040.031 ± 0.0020.087 ± 0.0050.025 ± 0.0010.016 ± 0.001L2 = 1e − 6BH3.040 ± 0.2343.027 ± 0.2923.384 ± 0.2082.601 ± 0.1962.582 ± 0.147CS5.600 ± 0.2035.442 ± 0.2036.135 ± 0.1854.559 ± 0.2124.327 ± 0.198EE1.259 ± 0.0980.944 ± 0.0552.961 ± 0.1010.792 ± 0.0570.788 ± 0.023RP0.050 ± 0.0030.026 ± 0.0010.077 ± 0.0040.024 ± 0.0010.015 ± 0.001L2 = 1e − 5BH2.903 ± 0.1572.418 ± 0.0963.109 ± 0.2793.228 ± 0.2832.427 ± 0.233CS5.523 ± 0.2025.212 ± 0.1835.973 ± 0.1764.607 ± 0.2364.344 ± 0.235EE1.210 ± 0.0440.920 ± 0.0422.846 ± 0.1460.795 ± 0.0550.644 ± 0.030RP0.061 ± 0.0020.030 ± 0.0030.079 ± 0.0040.030 ± 0.0010.016 ± 0.001L2 = 1e − 4BH3.019 ± 0.0813.068 ± 0.3203.319 ± 0.2103.062 ± 0.1832.543 ± 0.215CS6.027 ± 0.1944.837 ± 0.1565.956 ± 0.1624.318 ± 0.1613.888 ± 0.152EE2.003 ± 0.0680.970 ± 0.0412.785 ± 0.0980.784 ± 0.0480.660 ± 0.034RP0.102 ± 0.0030.026 ± 0.0010.094 ± 0.0070.040 ± 0.0020.023 ± 0.001L2 = 1e − 3BH4.887 ± 0.1072.654 ± 0.1922.656 ± 0.0912.447 ± 0.1242.721 ± 0.205CS8.354 ± 0.1444.520 ± 0.1296.204 ± 0.1464.260 ± 0.1604.360 ± 0.185EE4.592 ± 0.0750.874 ± 0.0352.380 ± 0.1090.705 ± 0.0380.634 ± 0.024RP0.266 ± 0.0040.068 ± 0.0070.112 ± 0.0040.066 ± 0.0040.060 ± 0.004B4TABLERoot mean squared errors (RMSEs) of the test sets for different datasets, machine learning approaches, and regularization (Optimizer: rmsprop, no model checkpoint saving)DataANNANN (E)MC dropoutTNNTNN (E)L2 = 0BH3.291 ± 0.2293.326 ± 0.3442.820 ± 0.1692.462 ± 0.0852.593 ± 0.193CS5.185 ± 0.2295.225 ± 0.2276.116 ± 0.1574.748 ± 0.1894.382 ± 0.172EE1.145 ± 0.0900.851 ± 0.0343.060 ± 0.1040.844 ± 0.0520.680 ± 0.032RP0.051 ± 0.0020.029 ± 0.0020.084 ± 0.0040.024 ± 0.0010.019 ± 0.002L2 = 1e − 6BH2.930 ± 0.1822.748 ± 0.1793.146 ± 0.1802.882 ± 0.2542.502 ± 0.117CS5.765 ± 0.2135.423 ± 0.1786.275 ± 0.1944.790 ± 0.1634.469 ± 0.208EE1.288 ± 0.1211.028 ± 0.0572.984 ± 0.1060.826 ± 0.0410.713 ± 0.026RP0.049 ± 0.0030.030 ± 0.0020.083 ± 0.0020.023 ± 0.0010.015 ± 0.001L2 = 1e − 5BH3.344 ± 0.2832.911 ± 0.2573.567 ± 0.3962.983 ± 0.1733.155 ± 0.383CS5.867 ± 0.1785.105 ± 0.1316.195 ± 0.1564.449 ± 0.1734.563 ± 0.179EE1.210 ± 0.0420.940 ± 0.0402.845 ± 0.1330.815 ± 0.0570.679 ± 0.029RP0.060 ± 0.0020.024 ± 0.0020.081 ± 0.0040.028 ± 0.0010.017 ± 0.001L2 = 1e − 4BH3.313 ± 0.1912.625 ± 0.1453.362 ± 0.3053.199 ± 0.2272.803 ± 0.134CS5.905 ± 0.1365.141 ± 0.1865.956 ± 0.1794.518 ± 0.1904.123 ± 0.206EE2.010 ± 0.0760.891 ± 0.0322.932 ± 0.0950.762 ± 0.0300.611 ± 0.021RP0.101 ± 0.0020.033 ± 0.0040.082 ± 0.0040.039 ± 0.0020.026 ± 0.002L2 = 1e − 3BH5.197 ± 0.1932.835 ± 0.1853.296 ± 0.2512.975 ± 0.2783.028 ± 0.267CS8.381 ± 0.1535.128 ± 0.2136.153 ± 0.1554.611 ± 0.2054.678 ± 0.216EE4.659 ± 0.0650.954 ± 0.0392.360 ± 0.1080.747 ± 0.0360.723 ± 0.032RP0.267 ± 0.0030.068 ± 0.0050.114 ± 0.0060.066 ± 0.0040.059 ± 0.004CAPPENDIXC.1Performance improvement through ensemblesC.1.1.Bias‐variance tradeoff and ensemblesIn a regression problem, one is tasked with finding the true labels on yet unlabeled data points through the estimation of a function f(x) = y. Given a finite training data set D we denote this approximation f̂x;D. 
The expected mean squared error can be decomposed into three sources of error: the bias error $\mathrm{Bias}_D[\hat{f}(x;D)]$, the variance error $\mathrm{Var}_D[\hat{f}(x;D)]$, and the intrinsic error of the data set σ:

(C1) $\mathrm{MSE} = E_x\left[ \mathrm{Bias}_D[\hat{f}(x;D)]^2 + \mathrm{Var}_D[\hat{f}(x;D)] \right] + \sigma^2 .$

If we replace the estimator by an ensemble of two functions $\hat{f}(x;D) = \tfrac{1}{2}\hat{f}_A(x;D) + \tfrac{1}{2}\hat{f}_B(x;D)$, each exhibiting the same bias and variance as the original estimator, then we can decompose the MSE

(C2) $\mathrm{MSE} = E_x\left[ \mathrm{Bias}_D\big[\tfrac{1}{2}\hat{f}_A(x;D) + \tfrac{1}{2}\hat{f}_B(x;D)\big]^2 + \mathrm{Var}_D\big[\tfrac{1}{2}\hat{f}_A(x;D) + \tfrac{1}{2}\hat{f}_B(x;D)\big] \right] + \sigma^2$

(C3) $= E_x\Big[ \mathrm{Bias}_D[\hat{f}(x;D)]^2 + \mathrm{Var}_D\big[\tfrac{1}{2}\hat{f}_A(x;D)\big] + \mathrm{Var}_D\big[\tfrac{1}{2}\hat{f}_B(x;D)\big]$

(C4) $\quad + 2\,\mathrm{Cov}_D\big(\tfrac{1}{2}\hat{f}_A(x;D), \tfrac{1}{2}\hat{f}_B(x;D)\big) \Big] + \sigma^2$

(C5) $= E_x\left[ \mathrm{Bias}_D[\hat{f}(x;D)]^2 + \tfrac{1}{2}\mathrm{Var}_D[\hat{f}_A(x;D)] + \tfrac{1}{2}\mathrm{Cov}_D\big(\hat{f}_A(x;D), \hat{f}_B(x;D)\big) \right] + \sigma^2$

The more uncorrelated $\hat{f}_A(x;D)$ and $\hat{f}_B(x;D)$ are, the smaller is the ratio of the covariance to the variance. Thus an ensemble consisting of weakly correlated ensemble members reduces the MSE by circumventing the bias-variance tradeoff. By induction this argument extends to larger ensemble sizes.

APPENDIX D

D.1 Loop consistency order

Let us discuss how loops of different orders interact. In Figure D1 we can see graphically how to decompose higher-order loops into smaller loops. In the following, we discuss how this affects loop consistency at different orders. We define short forms of the loops containing the predictions on the connections between nodes as:

(D1) 2-loop: $L(x_i, x_j) = F(x_i, x_j) + F(x_j, x_i)$

(D2) 3-loop: $L(x_i, x_j, x_k) = F(x_i, x_j) + F(x_j, x_k) + F(x_k, x_i)$

(D3) 4-loop: $L(x_i, x_j, x_k, x_l) = F(x_i, x_j) + F(x_j, x_k) + F(x_k, x_l) + F(x_l, x_i)$

FIGURE D1: Loops can be decomposed into smaller loops.

We define the magnitude of the violation of the loop consistency as the upper bound ε on the values of all loops of a given order:

(D4) $\forall x_i, x_j \in X: \ |L(x_i, x_j)| < \epsilon$

Theorem: 2-loop and 3-loop consistency implies 4-loop consistency.

Proof: Assume

(D5) $\forall x_i, x_j \in X: \ |L(x_i, x_j)| < \epsilon_2$

(D6) $\forall x_i, x_j, x_k \in X: \ |L(x_i, x_j, x_k)| < \epsilon_3$

Then

(D7) $\forall x_i, x_j, x_k, x_l \in X:$

(D8) $|L(x_i, x_j, x_k, x_l)| = |F(x_i, x_j) + F(x_j, x_k) + F(x_k, x_l) + F(x_l, x_i)|$

(D9) $= |F(x_i, x_j) + F(x_j, x_k) + F(x_k, x_l) + F(x_l, x_i) + L(x_i, x_k) - L(x_i, x_k)|$

(D10) $= |\underbrace{F(x_i, x_j) + F(x_j, x_k) + F(x_k, x_i)}_{L(x_i, x_j, x_k)} + \underbrace{F(x_k, x_l) + F(x_l, x_i) + F(x_i, x_k)}_{L(x_k, x_l, x_i)} - L(x_i, x_k)|$

(D11) $< 2\epsilon_3 + \epsilon_2$

This argument holds true for larger loops by induction. It is not possible, however, to use only 2-loops as a starting point to build larger loops.
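As a small numerical companion to this appendix, the sketch below takes a matrix D with D[i, j] ≈ F(xi, xj) and reports the largest 2-, 3-, and 4-loop violations, so the bound in (D11) can be checked on concrete predictions; loop_violations is an assumed name, and the brute-force enumeration is only meant for small n.

```python
import numpy as np
from itertools import product

def loop_violations(D):
    """Largest absolute 2-, 3- and 4-loop sums of the difference matrix D,
    where D[i, j] plays the role of F(x_i, x_j) (Eqs. D1-D3)."""
    D = np.asarray(D, dtype=float)
    n = D.shape[0]
    eps2 = np.max(np.abs(D + D.T))                      # 2-loop violations
    eps3 = max(abs(D[i, j] + D[j, k] + D[k, i])         # 3-loop violations
               for i, j, k in product(range(n), repeat=3))
    eps4 = max(abs(D[i, j] + D[j, k] + D[k, l] + D[l, i])  # 4-loop violations
               for i, j, k, l in product(range(n), repeat=4))
    return eps2, eps3, eps4
```

For any such D, the returned eps4 should not exceed 2 * eps3 + eps2, mirroring the bound derived above.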

Journal: Applied AI Letters (Wiley)

Published: Dec 1, 2022

Keywords: artificial neural networks; ensemble methods; regression; uncertainty estimation
