Training recurrent neural networks robust to incomplete data: application to Alzheimer's disease progression modeling

Disease progression modeling (DPM) using longitudinal data is a challenging machine learning task. Existing DPM algorithms neglect temporal dependencies among measurements, make parametric assumptions about biomarker trajectories, do not model multiple biomarkers jointly, and need an alignment of subjects' trajectories. In this paper, recurrent neural networks (RNNs) are utilized to address these issues. However, in many cases, longitudinal cohorts contain incomplete data, which hinders the application of standard RNNs and requires a pre-processing step such as imputation of the missing values. Instead, we propose a generalized training rule for the most widely used RNN architecture, long short-term memory (LSTM) networks, that can handle both missing predictor and target values. The proposed LSTM algorithm is applied to model the progression of Alzheimer's disease (AD) using six volumetric magnetic resonance imaging (MRI) biomarkers, i.e., volumes of ventricles, hippocampus, whole brain, fusiform, middle temporal gyrus, and entorhinal cortex, and it is compared to standard LSTM networks with data imputation and a parametric, regression-based DPM method. The results show that the proposed algorithm achieves a significantly lower mean absolute error (MAE) than the alternatives with p < 0.05 using Wilcoxon signed rank test in predicting values of almost all of the MRI biomarkers. Moreover, a linear discriminant analysis (LDA) classifier applied to the predicted biomarker values produces a significantly larger area under the receiver operating characteristic curve (AUC) of 0.90 vs. at most 0.84 with p < 0.001 using McNemar's test for clinical diagnosis of AD. Inspection of MAE curves as a function of the amount of missing data reveals that the proposed LSTM algorithm achieves the best performance up until more than 74% missing values.
Finally, it is illustrated how the method can successfully be applied to data with varying time intervals. This paper shows that built-in handling of missing values in training an LSTM network benefits the application of RNNs in neurodegenerative disease progression modeling in longitudinal cohorts.

Keywords: Alzheimer's disease, disease progression modeling, linear discriminant analysis, long short-term memory, magnetic resonance imaging, recurrent neural networks.

Corresponding author. Email address: mehdipour@biomediq.com (Mostafa Mehdipour Ghazi)

Data used in preparation of this article were obtained from the Alzheimer's Disease Neuroimaging Initiative (ADNI) database (adni.loni.usc.edu). As such, the investigators within the ADNI contributed to the design and implementation of ADNI and/or provided data but did not participate in analysis or writing of this report. A complete listing of ADNI investigators can be found at http://adni.loni.usc.edu/wp-content/uploads/how_to_apply/ADNI_Acknowledgement_List.pdf

This document is the accepted version of the manuscript published in Medical Image Analysis in Volume 53, Pages 39-46, with DOI: https://doi.org/10.1016/j.media.2019.01.004. © 2019. This manuscript version is made available under the CC-BY-NC-ND 4.0 license http://creativecommons.org/licenses/by-nc-nd/4.0/.

arXiv:1903.07173v1 [cs.CV] 17 Mar 2019

1. Introduction

Alzheimer's disease (AD) is a chronic neurodegenerative disorder that begins with memory loss and develops over time, causing issues in conversation, orientation, and control of bodily functions (McKhann et al., 1984). Early diagnosis of the disease is challenging and is usually made once cognitive impairment has already compromised daily living. Hence, developing robust, data-driven methods for disease progression modeling (DPM) utilizing longitudinal data is necessary to yield a complete perspective on the disease for better diagnosis, monitoring, and prognosis (Oxtoby and Alexander, 2017).

Existing longitudinal DPM methods model biomarkers as a function of disease progression using continuous curve fitting. In the AD progression modeling literature, a variety of regression-based methods have been proposed to fit logistic or polynomial functions to the longitudinal dynamic of each biomarker (Jedynak et al., 2012; Fjell et al., 2013; Oxtoby et al., 2014; Donohue et al., 2014; Yau et al., 2015; Guerrero et al., 2016). However, parametric assumptions on the biomarker trajectories not only limit the flexibility of such methods but also lead to the necessity of aligning subjects' trajectories. In addition, the existing approaches mostly rely on independent biomarker modeling, and none of them consider the temporal dependencies among measurements.

Recurrent neural networks (RNNs) are non-parametric, sequence-based learning methods that, by design, do not require alignment of subject trajectories. They offer continuous, joint modeling of longitudinal data while taking temporal dependencies among measurements into account (Pearlmutter, 1989). Long short-term memory (LSTM) networks, the most widely used type of RNNs, were developed to effectively capture long-term temporal dependencies by dealing with the exploding and vanishing gradient problem during backpropagation through time (Hochreiter and Schmidhuber, 1997; Gers et al., 1999; Gers and Schmidhuber, 2001). They employ a memory cell with nonlinear reset units, so-called constant error carousels (CECs), and learn to store history for either long or short time periods. Since their introduction, a variety of LSTM networks have been developed for different time-series applications (Greff et al., 2017). The vanilla LSTM that utilizes three reset gates with full gate recurrence is the most commonly used LSTM architecture. It applies the backpropagation through time algorithm using full gradients to train the network and can include biases and cell-to-gates (peephole) connections.

However, longitudinal cohorts often contain missing biomarker values due to, for instance, dropped out patients, unsuccessful measurements, or different assessment patterns used for different subject groups, as seen in the Alzheimer's Disease Neuroimaging Initiative (ADNI) (Petersen et al., 2010), so standard RNNs, including LSTMs, cannot be directly applied. Pre-processing methods such as data imputation and interpolation are the most common approaches to handling missing data in RNNs. These two-step procedures decouple missing data handling and network training, resulting in a sub-optimal performance that is heavily influenced by the choice of data pre-processing method (Lipton et al., 2016). Although RNNs themselves have been used for estimating missing data (Parveen and Green, 2002; Yoon et al., 2018), the lack of methods to inherently handle incomplete data in RNNs is evident (Che et al., 2018). Other approaches update the architecture to learn or encode the missing data patterns (Che et al., 2018; Lipton et al., 2016). These methods are typically biased towards specific cohort or demographic circumstances correlated with the learned missing data patterns, and they introduce additional parameters that increase the complexity of the network.

In this paper, we propose a generalized method for training LSTM networks that can handle missing values in both input and target. This is achieved by applying the batch gradient descent algorithm in combination with a loss function and gradients that are normalized to account for the number of missing values in input and target. Our goal differs from that of the approaches that encode the missing values' patterns (Che et al., 2018; Lipton et al., 2016); we want to train RNNs robust to missing values to more faithfully capture the true underlying signal and to make the learned model generalizable across cohorts. The proposed LSTM algorithm is applied to AD progression modeling in the ADNI cohort (Petersen et al., 2010) based on volumetric magnetic resonance imaging (MRI) biomarkers, and the estimated biomarker values are used to predict the clinical status of subjects. MRI is known to be the best noninvasive way to examine changes in the brain in vivo during the course of AD (Biagioni and Galvin, 2011; Wu et al., 2011), and volumetric analysis is a widely used ROI-based method to estimate brain atrophy.

The main contribution is three-fold and can be summarized as follows:

• First, a generalized formulation of backpropagation through time for LSTM networks is proposed to handle incomplete data, and it is shown that such built-in handling of missing values provides a better modeling and prediction performance compared to using data imputation with standard LSTM networks.
• Second, temporal dependencies among measurements in the ADNI data are modeled using the proposed LSTM network via sequence-to-sequence learning. To the best of our knowledge, this is the first time such multi-dimensional sequence learning methods are applied to neurodegenerative DPM.
• Third, an end-to-end approach, without need for trajectory alignment, is proposed for modeling the longitudinal dynamics of imaging biomarkers and for clinical status prediction. This is a practical way of implementing a robust DPM for both research and clinical applications.

A preliminary version of this work appeared in proceedings of the International Conference on Medical Imaging with Deep Learning (Mehdipour Ghazi et al., 2018). The present study contains a more detailed presentation and additional experiments to investigate statistical significance, robustness as a function of amount of missing data, and situations with varying time steps.

2. Proposed LSTM algorithm

The main goal of this study is to minimize the influence of missing values on the learned LSTM network parameters. This is achieved by using the batch gradient descent method in combination with the backpropagation through time algorithm modified to take into account missing values in the input and target vectors. More specifically, the algorithm sets input missing values to zero, backpropagates zero errors corresponding to the target missing points, and uses an L2-norm loss function with residuals weighted according to the number of available time points per target biomarker node ($\beta_j^m$) and according to the total number of available input values for all visits of all biomarkers ($\beta_j^x$). In addition, it normalizes input weight gradients of the loss function according to the number of available time points per input biomarker node ($\beta_j^n$). Figure 1 provides an illustration of how the normalization factors are related to the input and output of an unfolded RNN. Note that the use of batch gradient descent ensures the availability of at least one data point per biomarker that can proportionally contribute in the weight update rule.

Figure 1: Illustration of how the normalization factors are related to the input and output of an unfolded RNN. Assume an RNN with three consecutive time points {t − 1, t, t + 1}, three input nodes, four hidden nodes, and two output nodes. Missing data for an instance observation j is illustrated as black nodes. We wish to weight the loss function and its gradients according to the number of available points in the input and output nodes. In this specific example, subject j has only one measurement available for its n-th input node and the same number for its m-th output node. Hence, the loss function and its gradients are weighted by 1/3. Moreover, since there is a total of five measurements available in the input layer, the loss function is weighted by 5/9. The latter weighting factor is to ensure that the loss function takes the number of available points in the input layer into account.

2.1. The basic LSTM architecture

Figure 2 shows a typical schematic of a vanilla LSTM architecture. As can be seen, the topology includes a memory cell, an input modulation gate, and three nonlinear reset gates, namely input gate, forget gate, and output gate, each of which accepts current and recurrent inputs. The memory cell learns to maintain its state over time while the multiplicative gates learn to open and close access to the constant error/information flow, to prevent exploding or vanishing gradients. The input gate protects the memory contents from perturbation by irrelevant inputs, and the output gate protects other units from perturbation by currently irrelevant memory contents. The forget gate deals with continual or very long input sequences, and finally, peephole connections allow the gates to access the CEC of the same cell state.

Figure 2: An illustration of a vanilla LSTM unit with peephole connections in red. The solid and dashed lines show weighted and unweighted connections, respectively.

2.2. Feedforward in LSTM networks

Assume $x_j^t \in \mathbb{R}^{N \times 1}$ is the j-th observation of an N-dimensional input vector at current time t. If M is the number of output units, feedforward calculations of the LSTM network under study can be summarized as

\[
\begin{aligned}
f_j^t &= W_f x_j^t + U_f h_j^{t-1} + V_f \odot c_j^{t-1} + b_f, \\
\tilde{f}_j^t &= \sigma_g(f_j^t), \\
i_j^t &= W_i x_j^t + U_i h_j^{t-1} + V_i \odot c_j^{t-1} + b_i, \\
\tilde{i}_j^t &= \sigma_g(i_j^t), \\
z_j^t &= W_c x_j^t + U_c h_j^{t-1} + b_c, \\
\tilde{z}_j^t &= \sigma_c(z_j^t), \\
c_j^t &= \tilde{f}_j^t \odot c_j^{t-1} + \tilde{i}_j^t \odot \tilde{z}_j^t, \\
\tilde{c}_j^t &= \sigma_h(c_j^t), \\
o_j^t &= W_o x_j^t + U_o h_j^{t-1} + V_o \odot c_j^t + b_o, \\
\tilde{o}_j^t &= \sigma_g(o_j^t), \\
h_j^t &= \tilde{o}_j^t \odot \tilde{c}_j^t,
\end{aligned}
\]

where $\{f_j^t, i_j^t, z_j^t, c_j^t, o_j^t, h_j^t\} \in \mathbb{R}^{M \times 1}$ and $\{\tilde{f}_j^t, \tilde{i}_j^t, \tilde{z}_j^t, \tilde{c}_j^t, \tilde{o}_j^t\} \in \mathbb{R}^{M \times 1}$ are the j-th observation of forget gate, input gate, modulation gate, cell state, output gate, and hidden output at time t before and after activation, respectively. Moreover, $\{W_f, W_i, W_o, W_c\} \in \mathbb{R}^{M \times N}$ and $\{U_f, U_i, U_o, U_c\} \in \mathbb{R}^{M \times M}$ are sets of connecting weights from current and recurrent inputs to the gates and cell, respectively, $\{V_f, V_i, V_o\} \in \mathbb{R}^{M \times 1}$ is the set of peephole connections from the cell to the gates, $\{b_f, b_i, b_o, b_c\} \in \mathbb{R}^{M \times 1}$ represents corresponding biases of neurons, and $\odot$ denotes element-wise multiplication. Finally, $\sigma_g$, $\sigma_c$, and $\sigma_h$ are nonlinear activation functions assigned for the gates, input modulation, and hidden cell output, respectively. Logistic sigmoid functions are applied to the gates with range [0, 1] while hyperbolic tangent functions are applied to modulate both cell input and hidden output with range [−1, 1]. Hence, the input measurements need to be in the same range [−1, 1].

2.3. Robust backpropagation through time

Let $L \in \mathbb{R}^{M \times 1}$ be the loss function defined based on the actual target s and network output y. Here, we consider one layer of LSTM units for sequence learning, which means that the network output is the hidden output. The main idea is to calculate the partial derivatives of the normalized loss function (denoted by $\delta$) with respect to the weights using the chain rule:

\[
\begin{aligned}
L(m) &= \frac{1}{2JT} \sum_{j,t} \frac{1}{\beta_j^x \beta_j^m} \left( y_j^t(m) - s_j^t(m) \right)^2, \\
\delta y_j^t(m) &= \frac{1}{JT} \, \frac{1}{\beta_j^x \beta_j^m} \left( y_j^t(m) - s_j^t(m) \right),
\end{aligned}
\]

where $\beta_j^x = |x_j| / (TN)$ and $\beta_j^m = |y_j(m)| / T$ are normalization factors to handle missing values of the j-th observation with batch size J and sequence length T. Also, $|x_j|$ and $|y_j(m)|$ denote the total number of available input values and the number of available target time points in the m-th node, respectively. The backpropagation calculations through time using full gradients can be obtained as

\[
\begin{aligned}
\delta h_j^t &= U_f^T \delta f_j^{t+1} + U_i^T \delta i_j^{t+1} + U_c^T \delta z_j^{t+1} + U_o^T \delta o_j^{t+1} + \delta y_j^t, \\
\delta \tilde{o}_j^t &= \delta h_j^t \odot \tilde{c}_j^t, \\
\delta o_j^t &= \delta \tilde{o}_j^t \odot \sigma_g'(o_j^t), \\
\delta \tilde{c}_j^t &= \delta h_j^t \odot \tilde{o}_j^t, \\
\delta c_j^t &= V_f \odot \delta f_j^{t+1} + V_i \odot \delta i_j^{t+1} + V_o \odot \delta o_j^t + \delta \tilde{c}_j^t \odot \sigma_h'(c_j^t) + \delta c_j^{t+1} \odot \tilde{f}_j^{t+1}, \\
\delta \tilde{z}_j^t &= \delta c_j^t \odot \tilde{i}_j^t, \\
\delta z_j^t &= \delta \tilde{z}_j^t \odot \sigma_c'(z_j^t), \\
\delta \tilde{i}_j^t &= \delta c_j^t \odot \tilde{z}_j^t, \\
\delta i_j^t &= \delta \tilde{i}_j^t \odot \sigma_g'(i_j^t), \\
\delta \tilde{f}_j^t &= \delta c_j^t \odot c_j^{t-1}, \\
\delta f_j^t &= \delta \tilde{f}_j^t \odot \sigma_g'(f_j^t), \\
\delta x_j^t &= W_f^T \delta f_j^t + W_i^T \delta i_j^t + W_c^T \delta z_j^t + W_o^T \delta o_j^t.
\end{aligned}
\]

Finally, if $\star \in \{f, i, z, o\}$ and $\bullet \in \{f, i\}$, the gradients of the loss function with respect to the weights are calculated as

\[
\begin{aligned}
\delta W_\star(n) &= \sum_{j=1}^{J} \frac{1}{\beta_j^n} \, \delta \star_j^{\{0 \to T\}} \, x_j^{\{0 \to T\}}(n), \\
\delta U_\star &= \sum_{j=1}^{J} \delta \star_j^{\{1 \to T\}} \, h_j^{\{0 \to T-1\}}, \\
\delta V_\bullet &= \sum_{j=1}^{J} \sum_{t=0}^{T-1} \delta \bullet_j^{t+1} \odot c_j^t, \\
\delta V_o &= \sum_{j=1}^{J} \sum_{t=0}^{T} \delta o_j^t \odot c_j^t, \\
\delta b_\star &= \sum_{j=1}^{J} \sum_{t=0}^{T} \delta \star_j^t,
\end{aligned}
\]

where $\beta_j^n = |x_j(n)| / T$ is the normalization factor handling missing input values and $|x_j(n)|$ is the number of available time points in the input's n-th node. Here, we use a fixed sequence length of T to proportionally consider subjects based on their available visits. However, the robust backpropagation algorithm can easily be generalized for a dynamic sequence length.

2.4. Momentum batch gradient descent

As an efficient iterative algorithm, momentum batch gradient descent is applied to find the local minimum of the loss function calculated over a batch while speeding up the convergence. The update rule using L2 regularization can be written as

\[
\begin{aligned}
\vartheta^{new} &= \mu \vartheta^{old} - \alpha \left( \delta\omega + \gamma \omega^{old} \right), \\
\omega^{new} &= \omega^{old} + \vartheta^{new},
\end{aligned}
\]

where $\vartheta$ is the weight update initialized to zero, $\omega$ is the to-be-updated weight array, $\delta\omega$ is the gradient of the loss function with respect to $\omega$, and $\alpha$, $\gamma$, and $\mu$ are the learning rate, weight decay or regularization factor, and momentum weight, respectively.

3. Experiments

3.1. Data

Data used in the preparation of this article is obtained from the ADNI database. The ADNI was launched in 2003 as a public-private partnership, led by principal investigator Michael W. Weiner, MD. The primary goal of ADNI has been to test whether serial magnetic resonance imaging, positron emission tomography, other biological markers, and clinical and neuropsychological assessment can be combined to measure the progression of mild cognitive impairment and early Alzheimer's disease. To be more specific, we use The Alzheimer's Disease Prediction Of Longitudinal Evolution (TADPOLE) challenge dataset (Marinescu et al., 2018), which is composed of data from the three ADNI phases ADNI 1, ADNI GO, and ADNI 2. This includes roughly 1,500 biomarkers acquired from 1,737 subjects (957 males and 780 females) during 12,741 visits at 22 distinct time points between 2003 and 2017. Table 1 summarizes statistics of the demographics in the TADPOLE dataset. Note that subjects have missing measurements and missing clinical status at some of their visits.

Table 1: Demographics of the TADPOLE dataset.

| Group | Visits, male | Visits, female | Age in years (mean±SD), male | Age in years (mean±SD), female | Education in years (mean±SD), male | Education in years (mean±SD), female |
|---|---|---|---|---|---|---|
| CN | 1,356 | 1,389 | 76.67±6.44 | 75.85±6.28 | 17.06±2.51 | 15.74±2.71 |
| MCI | 2,454 | 1,604 | 75.59±7.47 | 73.87±8.09 | 16.22±2.85 | 15.45±2.76 |
| AD | 1,208 | 900 | 77.22±7.11 | 75.45±7.92 | 15.85±3.03 | 14.35±2.73 |
| All (labeled & unlabeled), both sexes | 12,741 | | 76.00±7.38 | | 15.91±2.86 | |

In this work, we have merged existing groups labeled as cognitively normal (CN), significant memory concern (SMC), and normal (NL) under CN, mild cognitive impairment (MCI), early MCI (EMCI), and late MCI (LMCI) under MCI, and Alzheimer's disease (AD) and dementia under AD. Moreover, groups with labels converting from one status to another, e.g. MCI-to-AD, belong to the next status (AD in this example).

MRI biomarkers are used for AD progression modeling. This includes T1-weighted brain MRI volumes of ventricles, hippocampus, whole brain, fusiform, middle temporal gyrus, and entorhinal cortex. We normalize the MRI measurements by the corresponding intracranial volume (ICV). Next, we filter within-class outliers of each biomarker, across all subjects and their visits, by treating them as missing values, and normalize the measurements by scaling them linearly to [−1, 1]. Out of 22 visits, we initially select 11 regular visits with a fixed interval of one year including baseline. Finally, subjects with fewer than three distinct visits for any biomarker are removed to obtain 742 subjects. This is to ensure that at least two visits are available per biomarker for performing sequence learning through the feedforward step and an additional visit for backpropagation.

For evaluation purposes, we partition the entire dataset into three non-overlapping subsets for training, validation, and testing. To achieve this, we randomly select 10% of the within-class subjects for validation and the same for testing. More specifically, we randomly pick subjects based on their baseline labels while ensuring that subjects with few and large numbers of visits are included in each subset. This process results in 592, 76, and 74 subjects for training, validation, and testing, respectively. Details on the amount of available visits in the obtained evaluation subsets are shown in Table 2. As can be deduced from the table, 63% of the obtained data is missing.

3.2. Evaluation metrics and statistical tests

Mean absolute error (MAE) and multi-class area under the receiver operating characteristic (ROC) curve (AUC) are used to assess the performance of modeling and classification, respectively. MAE measures accuracy of continuous prediction per biomarker by computing the absolute difference between actual and estimated values as follows

\[
\mathrm{MAE} = \frac{1}{I} \sum_{j,t} \left| y_j^t - s_j^t \right|,
\]

where $s_j^t$ and $y_j^t$ are the ground-truth and estimated values of the specific biomarker for the j-th subject at the t-th visit, respectively, and I is the number of available points in the target array s.

Multi-class AUC (Hand and Till, 2001) is a measure to examine the diagnostic performance in a multi-class test set using ROC analysis. It is calculated using the posterior probabilities as follows

\[
\mathrm{AUC} = \frac{1}{n_c (n_c - 1)} \sum_{i=1}^{n_c - 1} \sum_{k=i+1}^{n_c} \frac{1}{n_i n_k} \left[ SR_i - \frac{n_i (n_i + 1)}{2} + SR_k - \frac{n_k (n_k + 1)}{2} \right],
\]

where $n_c$ is the number of distinct classes, $n_i$ denotes the number of available points belonging to the i-th class, and $SR_i$ is the sum of the ranks of posteriors $p(c_i | s_i)$ after sorting all concatenated posteriors $\{p(c_i | s_i), p(c_i | s_k)\}$ in an ascending order, where $s_i$ and $s_k$ are vectors of scores belonging to the true classes $c_i$ and $c_k$, respectively.

The modeling performance is statistically assessed for different methods using the paired, two-sided Wilcoxon signed rank test (Wilcoxon, 1945) applied to the obtained absolute errors. Also, classification performance is analyzed using McNemar's test (McNemar, 1947) applied to the hard classification results (clinical status) obtained from a linear discriminant analysis (LDA) classifier with predicted MRI measurements as input.

3.3. Experimental setup

The following methods are evaluated in our conducted experiments:

• LSTM-Robust: an LSTM network trained based on the proposed robust backpropagation through time algorithm by setting input missing values to zero and backpropagating zero errors corresponding to the target missing points while training.
• LSTM-Mean: an LSTM network trained using the standard backpropagation through time algorithm with missing values imputed based on mean imputation method prior to training (Che et al., 2018).
• LSTM-Forward: an LSTM network trained using the standard backpropagation through time algorithm with missing values imputed based on forward imputation method prior to training (Lipton et al., 2016).
• Regression-Based: a parametric, regression-based method (Jedynak et al., 2012) that automatically handles missing values. The parameters of the algorithm are initially estimated using linear regression in 15 iterations and are optimized using sigmoidal functions in 35 additional iterations where all parameters converge.

All the methods are developed in MATLAB R2017b and run on a 2.80 GHz CPU with 16 GB RAM. We initialize the LSTM networks' weights by generating uniformly distributed random values in range [−0.05, 0.05] and set the weights' updates and weights' gradients to zero. The batch size is set to the number of available training subjects, and the first ten visits are used to estimate the second to eleventh visits per subject for evaluation purposes. It should be noted that when data imputation is applied, the robust backpropagation formulas simply reduce to the ones for the standard LSTM network.
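The three ways of treating missing inputs that distinguish the LSTM variants listed in this section can be illustrated with a small sketch. It is written in Python with the standard library only, not in the MATLAB used for the reported experiments, and it is not the authors' code. The `MISSING` marker, the function names, and the example masks are hypothetical. Using a training-set mean for mean imputation is an assumption; the median fallback for leading gaps in forward imputation follows the description given in the robustness discussion of this paper. The mask in the last function is chosen so that the availability fractions match the 1/3 and 5/9 values of the Figure 1 example.

```python
from fractions import Fraction

MISSING = None  # hypothetical marker for an unavailable measurement


def mean_impute(seq, train_mean):
    """LSTM-Mean style: replace each missing value with a mean value
    (assumed here to be computed on the training set)."""
    return [train_mean if v is MISSING else v for v in seq]


def forward_impute(seq, train_median):
    """LSTM-Forward style: carry the last observed value forward.

    Leading gaps have no earlier value, so they fall back to the
    training-set median.
    """
    filled, last = [], train_median
    for v in seq:
        if v is not MISSING:
            last = v
        filled.append(last)
    return filled


def zero_fill(seq):
    """LSTM-Robust style input handling: zero the missing inputs and
    keep a 0/1 availability mask for later normalization."""
    values = [0.0 if v is MISSING else v for v in seq]
    mask = [0 if v is MISSING else 1 for v in seq]
    return values, mask


def availability(mask_rows):
    """Fraction of available time points per node (one row per node)
    and fraction of available values over all nodes."""
    per_node = [Fraction(sum(row), len(row)) for row in mask_rows]
    total = Fraction(sum(map(sum, mask_rows)), sum(map(len, mask_rows)))
    return per_node, total


# One biomarker of one hypothetical subject over three visits.
visits = [MISSING, 0.2, MISSING]
print(mean_impute(visits, 0.1))     # [0.1, 0.2, 0.1]
print(forward_impute(visits, 0.0))  # [0.0, 0.2, 0.2]
print(zero_fill(visits))            # ([0.0, 0.2, 0.0], [0, 1, 0])

# Hypothetical masks for three input nodes at three time points,
# with five available values in total.
print(availability([[0, 1, 0], [1, 1, 0], [1, 0, 1]]))
```

The imputation baselines change the data before training, whereas the zero-filled inputs keep the mask, so the availability fractions remain known when the loss and gradients are normalized.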
We utilize the validation set to tune all the networks' optimization parameters, each time by adjusting one of the parameters while keeping the rest at fixed values to achieve the lowest average MAE. Peephole connections are used in the networks since they tend to improve the performance (Greff et al., 2017). Based on these strategies, the optimal parameters are obtained as $\alpha = 0.1$, $\mu = 0.9$, and $\gamma = 0.0001$ with 1,000 epochs. The corresponding MAEs for the validation set are also calculated as 0.00296, 0.00025, 0.01494, 0.00024, 0.00076, and 0.00097, for ventricles, hippocampus, whole brain, entorhinal cortex, fusiform, and middle temporal gyrus, respectively. It takes about 340 seconds to train the network and 0.025 seconds to estimate all the validation measurements. It is worthwhile mentioning that all the estimated measurements are linearly scaled from [−1, 1] to the original range of biomarkers using the original minimum and maximum values while calculating MAEs.

Table 2: Number of visits in the evaluation subsets across all subjects. Note that the complete dataset should have contained 742 × 11 = 8,162 visits per biomarker where the maximum number of visits per subject is 11. The number of visits per subject per diagnostic group is left blank as subjects can convert from one group to another in the course of AD.

| | Number of visits across subjects (train / validation / test) | Number of visits per subject, mean±SD [min, max] (train / validation / test) |
|---|---|---|
| Clinical labels | | |
| CN | 1,192 / 136 / 149 | |
| MCI | 1,389 / 198 / 180 | |
| AD | 606 / 84 / 92 | |
| All (labeled & unlabeled) | 3,270 / 428 / 434 | 5.52±2.32 [3, 11] / 5.63±2.39 [3, 11] / 5.86±2.51 [3, 11] |
| MRI biomarkers | | |
| Ventricles | 2,481 / 328 / 318 | 4.19±1.47 [3, 10] / 4.32±1.46 [3, 8] / 4.30±1.58 [3, 9] |
| Hippocampus | 2,381 / 311 / 312 | 4.02±1.31 [3, 10] / 4.09±1.29 [3, 8] / 4.22±1.49 [3, 7] |
| Whole brain | 2,513 / 328 / 322 | 4.24±1.49 [3, 10] / 4.32±1.46 [3, 8] / 4.35±1.57 [3, 9] |
| Entorhinal cortex | 2,351 / 310 / 309 | 3.97±1.29 [3, 10] / 4.08±1.34 [3, 8] / 4.18±1.46 [3, 7] |
| Fusiform | 2,351 / 310 / 309 | 3.97±1.29 [3, 10] / 4.08±1.34 [3, 8] / 4.18±1.46 [3, 7] |
| Middle temporal gyrus | 2,351 / 309 / 309 | 3.97±1.29 [3, 10] / 4.07±1.35 [3, 8] / 4.18±1.46 [3, 7] |

4. Results and discussion

After successfully training the LSTM networks and the regression-based method for DPM, they are all evaluated using the test set.

4.1. Biomarker modeling

Table 3 compares the test MRI biomarker modeling performance (MAE) using the aforementioned methods. Even though the performance is reported per biomarker, the models are jointly fitted to all biomarkers. As can be deduced from Table 3, LSTM-Robust significantly outperforms the other methods in all MRI biomarkers except for whole brain, where the regression-based approach performs significantly better, and for middle temporal gyrus, where there is no difference between the proposed method and LSTM-Forward.

Table 3: Test MRI biomarker modeling performance (MAE) for yearly predictions. The proposed method is compared with the alternatives using a paired, two-sided Wilcoxon signed rank test, and this is reported as LSTM-Robust vs. LSTM-Mean / LSTM-Robust vs. LSTM-Forward / LSTM-Robust vs. Regression-Based. †: not significantly different, *: p < 0.05, **: p < 0.01, ***: p < 0.001.

| Biomarker | LSTM-Robust | Significance | LSTM-Mean (Che et al., 2018) | LSTM-Forward (Lipton et al., 2016) | Regression-Based (Jedynak et al., 2012) |
|---|---|---|---|---|---|
| Ventricles | 0.00307 | *** / *** / *** | 0.00620 | 0.00472 | 0.00807 |
| Hippocampus | 0.00023 | *** / ** / *** | 0.00051 | 0.00034 | 0.00051 |
| Whole brain | 0.01330 | *** / ** / *** | 0.02375 | 0.01639 | 0.00551 |
| Entorhinal cortex | 0.00021 | *** / * / *** | 0.00030 | 0.00025 | 0.00035 |
| Fusiform | 0.00068 | *** / *** / *** | 0.00130 | 0.00100 | 0.00090 |
| Middle temporal gyrus | 0.00087 | *** / † / * | 0.00126 | 0.00118 | 0.00111 |

4.2. Predicting clinical status

To assess the ability of the estimated measurements in predicting the clinical status, we train an LDA classifier using the estimated training measurements and apply it to the estimated test data to compute the posterior probabilities. The obtained scores are then used to calculate diagnostic AUCs. The diagnostic prediction results for the test set are shown in Table 4. As can be seen, LSTM-Robust outperforms all other methods in predicting clinical status of subjects per visit with a multi-class AUC of 0.76, which reveals the effect of modeling on classification performance. One could of course use other classifiers or train the LSTM network directly for classification based on sequence-to-label learning to potentially improve the diagnostic AUCs. However, the focus of this work is on DPM based on sequence-to-sequence learning. In addition, sequence-to-label learning would only be able to utilize the part of the training data which has available clinical status.

The multi-class AUC of 0.76 obtained using predicted measurements from the proposed approach is within the top-five AUCs of the state-of-the-art, cross-sectional MRI-based classification results of the recent challenge on Computer-Aided Diagnosis of Dementia (CADDementia) (Bron et al., 2015) that ranged from 0.75 to 0.79. It should, however, be noted that there are important differences between this study and the CADDementia challenge. Firstly, this work has the advantage of training and testing data from the same cohort whereas CADDementia algorithms were applied to classify data from independent cohorts. Secondly, the top performing CADDementia algorithms incorporated different types of MRI biomarkers besides volumetry. Thirdly, this work predicts the input features to the classifier based on historical longitudinal data.

Table 4: Test diagnostic performance (AUC) of the estimated MRI biomarker values using an LDA classifier. The proposed method is compared with the alternatives using McNemar's test, and this is reported as LSTM-Robust vs. LSTM-Mean / LSTM-Robust vs. LSTM-Forward / LSTM-Robust vs. Regression-Based. †: not significantly different, *: p < 0.05, **: p < 0.01, ***: p < 0.001.

| Task | LSTM-Robust | Significance | LSTM-Mean (Che et al., 2018) | LSTM-Forward (Lipton et al., 2016) | Regression-Based (Jedynak et al., 2012) |
|---|---|---|---|---|---|
| CN vs. MCI | 0.5914 | † / † / † | 0.5838 | 0.5800 | 0.5468 |
| CN vs. AD | 0.9029 | *** / *** / *** | 0.8404 | 0.8150 | 0.7826 |
| MCI vs. AD | 0.7844 | † / † / † | 0.6936 | 0.6890 | 0.7330 |
| CN vs. MCI vs. AD | 0.7596 | † / * / * | 0.7059 | 0.6947 | 0.6875 |

4.3. Robustness as a function of amount of missing data

To evaluate the modeling robustness of the proposed method compared to the alternatives for different amounts of missing data, we construct subsamples of the training dataset by randomly removing up to 50% of the actual data per biomarker and train the methods on the smaller datasets. Figure 3 illustrates the modeling performance of the different methods on various amounts of missing measurements, from 0% to 50%. It is important to note that the training data already includes a large number of missing values at a missing rate of 0%, i.e. 63% of actual data as seen in Table 2. For better comparison, we take the average of MAEs normalized by the range of corresponding biomarkers to obtain a single curve per method. As can be seen, the result of the proposed method is superior to those of the benchmarks up until around 74% of the data is missing. For higher rates of missing data, basic LSTM with forward imputation outperforms all other methods. One reason why LSTM with forward imputation is robust to the higher rates of missing data could be that it replaces the missing values placed at the beginning of a sequence with the whole training data median.

Figure 3: Modeling performance of MRI biomarkers for various amounts of missing values. (Plot of average test NMAE across MRI biomarkers, from 0.035 to 0.08, against amount of missing data, from 63% to 81%, for LSTM-Robust, Regression-Based, LSTM-Mean, and LSTM-Forward.)

4.4. Irregular time intervals

As a final experiment, we assess generalizability of the proposed method for predicting measurements of irregular visits. In general, standard LSTM networks are designed to handle evenly spaced sequences. We used the same approach in our baseline experiments for the AD progression modeling application by disregarding visiting months 3, 6 and 18, and confined the experiments to yearly follow-up in the ADNI data. Now, we employ the available measurements of the 6-th and 18-th visiting months from the TADPOLE dataset and predict biomarker values of half-yearly follow-ups by assuming unavailable visits as missing data. In this experiment, 78% of the actual data is missing. We apply the same methods to the extended data. Table 5 details the test modeling performance of the MRI biomarkers for half-yearly predictions using the different DPM methods. As can be seen, our proposed DPM method outperforms all other methods for every biomarker except whole brain, where the regression-based method achieves a lower MAE. More interestingly, considering the corresponding results from Table 3 for yearly predictions, one can deduce that the modeling performance of the proposed method improves by utilizing the irregular visits. However, the additional time points in the LSTM increase the required time for training and validation to 1,090 seconds and 0.061 seconds, respectively.

Table 5: Test MRI biomarker modeling performance (MAE) for half-yearly predictions.

| Biomarker | LSTM-Robust | LSTM-Mean (Che et al., 2018) | LSTM-Forward (Lipton et al., 2016) | Regression-Based (Jedynak et al., 2012) |
|---|---|---|---|---|
| Ventricles | 0.00272 | 0.00973 | 0.01030 | 0.00659 |
| Hippocampus | 0.00023 | 0.00068 | 0.00065 | 0.00043 |
| Whole brain | 0.01181 | 0.03332 | 0.02552 | 0.00601 |
| Entorhinal cortex | 0.00021 | 0.00037 | 0.00032 | 0.00038 |
| Fusiform | 0.00061 | 0.00164 | 0.00196 | 0.00091 |
| Middle temporal gyrus | 0.00085 | 0.00220 | 0.00263 | 0.00097 |

As an alternative, one could utilize modified LSTM architectures where the networks learn a number of parameters to encode visiting patterns among longitudinal patient records (Baytas et al., 2017; Neil et al., 2016). However, using such methods not only increases the complexity of the network but also risks learning any time spacing patterns in the data.

5. Conclusions

In this paper, a training algorithm was proposed for LSTM networks aiming to improve robustness against missing data, and the robustly trained LSTM network was applied to AD progression modeling using longitudinal measurements of MRI biomarkers. To the best of our knowledge, this is the first time RNNs have been studied and applied to DPM within neurodegenerative disease. Moreover, since RNNs are non-parametric learning methods, the proposed approach can be applied to different time-series data and characteristics than the monotonic behavior that one typically encounters in MRI-based neurodegenerative disease progression modeling. The proposed training method demonstrated better performance than using imputation prior to standard LSTM network training and outperformed an established parametric, regression-based DPM method in terms of both biomarker prediction and subsequent diagnostic classification. This method is also applicable for other types of RNNs such as gated recurrent units (GRUs) (Cho et al., 2014). This study highlights the potential of RNNs for modeling the progression of AD using longitudinal measurements, provided that proper care is taken to handle missing values and time intervals.

Disclosures

M. Nielsen is shareholder in Biomediq A/S and Cerebriu A/S. A. Pai is shareholder in Cerebriu A/S. The remaining authors report no disclosures.

Acknowledgments

This project has received funding from the European Union's Horizon 2020 research and innovation programme under the Marie Skłodowska-Curie grant agreement No 721820. This work uses the TADPOLE data sets (https://tadpole.grand-challenge.org) constructed by the EuroPOND consortium (http://europond.eu) funded by the European Union's Horizon 2020 research and innovation programme under grant agreement No 666992.

Data collection and sharing for this project was funded by the Alzheimer's Disease Neuroimaging Initiative (ADNI) (National Institutes of Health Grant U01 AG024904) and DOD ADNI (Department of Defense award number W81XWH-12-2-0012). ADNI is funded by the National Institute on Aging, the National Institute of Biomedical Imaging and Bioengineering, and through generous contributions from the following: AbbVie, Alzheimer's Association; Alzheimer's Drug Discovery Foundation; Araclon Biotech; BioClinica, Inc.; Biogen; Bristol-Myers Squibb Company; CereSpir, Inc.; Cogstate; Eisai Inc.; Elan Pharmaceuticals, Inc.; Eli Lilly and Company; EuroImmun; F.

Donohue, M.C., Jacqmin-Gadda, H., Le Goff, M., Thomas, R.G., Raman, R., Gamst, A.C., Beckett, L.A., Jack, C.R., Weiner, M.W., Dartigues, J.F., Aisen, P.S., 2014. Estimating long-term multivariate progression from short-term data. Alzheimer's & Dement.: the J. of the Alzheimer's Assoc. 10, S400–S410.
Fjell, A.M., Westlye, L.T., Grydeland, H., Amlien, I., Espeseth, T., Reinvang, I., Raz, N., Holland, D., Dale, A.M., Walhovd, K.B., 2013. Critical ages in the life course of the adult brain: nonlinear subcortical aging. Neurobiol. of Aging 34, 2239–2247.
Gers, F.A., Schmidhuber, J., 2001. LSTM recurrent networks learn simple context-free and context-sensitive languages. IEEE Trans. on Neural Netw. 12, 1333–1340.
Gers, F.A., Schmidhuber, J., Cummins, F., 1999. Learning to forget: Continual prediction with LSTM, in: Proceedings of the 9th International Conference on Artificial Neural Networks (ICANN 99), pp.
Hoffmann-La Roche Ltd. 850–855. Greff, K., Srivastava, R.K., Koutn´ ık, J., Steunebrink, B.R., Schmid- and its affiliated company Genentech, Inc.; Fujirebio; GE huber, J., 2017. LSTM: A search space odyssey. IEEE Trans. on Healthcare; IXICO Ltd.; Janssen Alzheimer Immunother- Neural Netw. and Learn. Syst. 28, 2222–2232. apy Research & Development, LLC.; Johnson & Johnson Guerrero, R., Schmidt-Richberg, A., Ledig, C., Tong, T., Wolz, Pharmaceutical Research & Development LLC.; Lumos- R., Rueckert, D., 2016. Instantiated mixed effects modeling of ity; Lundbeck; Merck & Co., Inc.; Meso Scale Diagnos- Alzheimer’s disease markers. NeuroImage 142, 113–125. Hand, D.J., Till, R.J., 2001. A simple generalisation of the area under tics, LLC.; NeuroRx Research; Neurotrack Technologies; the ROC curve for multiple class classification problems. Mach. Novartis Pharmaceuticals Corporation; Pfizer Inc.; Pira- Learn. 45, 171–186. mal Imaging; Servier; Takeda Pharmaceutical Company; Hochreiter, S., Schmidhuber, J., 1997. Long short-term memory. Neural and Transition Therapeutics. The Canadian Institutes of Comput. 9, 1735–1780. Jedynak, B.M., Lang, A., Liu, B., Katz, E., Zhang, Y., Wyman, B.T., Health Research is providing funds to support ADNI clini- Raunig, D., Jedynak, C.P., Caffo, B., Prince, J.L., 2012. A compu- cal sites in Canada. Private sector contributions are facili- tational neurodegenerative disease progression score: method and tated by the Foundation for the National Institutes of Health results with the Alzheimer’s Disease Neuroimaging Initiative cohort. (www.fnih.org). The grantee organization is the Northern NeuroImage 63, 1478–1486. Lipton, Z.C., Kale, D.C., Wetzel, R., 2016. Modeling missing data in California Institute for Research and Education, and the clinical time series with RNNs, in: Proceedings of Machine Learning study is coordinated by the Alzheimer’s Therapeutic Re- for Healthcare. search Institute at the University of Southern California. 
Marinescu, R.V., Oxtoby, N.P., Young, A.L., Bron, E.E., Toga, A.W., ADNI data are disseminated by the Laboratory for Neuro Weiner, M.W., Barkhof, F., Fox, N.C., Klein, S., Alexander, D.C., 2018. TADPOLE challenge: Prediction of longitudinal evolution in Imaging at the University of Southern California. Alzheimer’s disease. CoRR abs/1805.03909. McKhann, G., Drachman, D., Folstein, M., Katzman, R., Price, D., Stadlan, E.M., 1984. Clinical diagnosis of Alzheimer’s disease. References Neurol. 34, 939–939. McNemar, Q., 1947. Note on the sampling error of the difference Baytas, I.M., Xiao, C., Zhang, X., Wang, F., Jain, A.K., Zhou, J., 2017. between correlated proportions or percentages. Psychom. 12, 153– Patient subtyping via time-aware LSTM networks, in: Proceedings 157. of the 23rd ACM SIGKDD International Conference on Knowledge Mehdipour Ghazi, M., Nielsen, M., Pai, A., Cardoso, M.J., Modat, M., Discovery and Data Mining, pp. 65–74. Ourselin, S., Sørensen, L., 2018. Robust training of recurrent neural Biagioni, M.C., Galvin, J.E., 2011. Using biomarkers to improve networks to handle missing data for disease progression modeling. detection of Alzheimer’s disease. Neurodegener. Dis. Manag. 1, CoRR abs/1808.05500. 127–139. Neil, D., Pfeiffer, M., Liu, S.C., 2016. Phased LSTM: Accelerating Bron, E.E., Smits, M., Van Der Flier, W.M., Vrenken, H., Barkhof, F., recurrent network training for long or event-based sequences, in: Scheltens, P., Papma, J.M., Steketee, R.M., Orellana, C.M., Meij- Advances in Neural Information Processing Systems, pp. 3882– boom, R., et al., 2015. Standardized evaluation of algorithms for 3890. computer-aided diagnosis of dementia based on structural MRI: the Oxtoby, N.P., Alexander, D.C., 2017. Imaging plus X: multimodal CADDementia challenge. NeuroImage 111, 562–579. models of neurodegenerative disease. Curr. Opin. in Neurol. 30, Che, Z., Purushotham, S., Cho, K., Sontag, D., Liu, Y., 2018. Recurrent 371. 
neural networks for multivariate time series with missing values. Sci. Oxtoby, N.P., Young, A.L., Fox, N.C., Daga, P., Cash, D.M., Ourselin, Rep. 8, 6085. S., Schott, J.M., Alexander, D.C., 2014. Learning imaging biomarker Cho, K., Van Merrienboer ¨ , B., Gulcehre, C., Bahdanau, D., Bougares, trajectories from noisy Alzheimer’s disease data using a bayesian F., Schwenk, H., Bengio, Y., 2014. Learning phrase representa- multilevel model, in: Bayesian and Graphical Models for Biomedical tions using RNN encoder-decoder for statistical machine translation. Imaging, pp. 85–94. CoRR abs/1406.1078. Parveen, S., Green, P., 2002. Speech recognition with missing data 10 using recurrent neural nets, in: Advances in Neural Information Processing Systems, pp. 1189–1195. Pearlmutter, B.A., 1989. Learning state space trajectories in recurrent neural networks. Neural Comput. 1, 263–269. Petersen, R.C., Aisen, P., Beckett, L., Donohue, M., Gamst, A., Harvey, D., Jack, C., Jagust, W., Shaw, L., Toga, A., Trojanowski, J., Weiner, M., 2010. Alzheimer’s Disease Neuroimaging Initiative (ADNI): clinical characterization. Neurol. 74, 201–209. Wilcoxon, F., 1945. Individual comparisons by ranking methods. Biom. Bull. 1, 80–83. Wu, L., Rosa-Neto, P., Gauthier, S., 2011. Use of biomarkers in clinical trials of Alzheimer disease. Mol. Diagn. & Ther. 15, 313–325. Yau, W.Y.W., Tudorascu, D.L., McDade, E.M., Ikonomovic, S., James, J.A., Minhas, D., Mowrey, W., Sheu, L.K., Snitz, B.E., Weissfeld, L., et al., 2015. Longitudinal assessment of neuroimaging and clinical markers in autosomal dominant Alzheimer’s disease: a prospective cohort study. The Lancet Neurol. 14, 804–813. Yoon, J., Zame, W.R., van der Schaar, M., 2018. Estimating missing data in temporal data streams using multi-directional recurrent neural networks. IEEE Trans. on Biomed. Eng. . http://www.deepdyve.com/assets/images/DeepDyve-Logo-lg.png Statistics arXiv (Cornell University)

Training recurrent neural networks robust to incomplete data: application to Alzheimer's disease progression modeling


References (33)

ISSN
1361-8415
eISSN
ARCH-3347
DOI
10.1016/j.media.2019.01.004

Abstract

Disease progression modeling (DPM) using longitudinal data is a challenging machine learning task. Existing DPM algorithms neglect temporal dependencies among measurements, make parametric assumptions about biomarker trajectories, do not model multiple biomarkers jointly, and need an alignment of subjects’ trajectories. In this paper, recurrent neural networks (RNNs) are utilized to address these issues. However, in many cases, longitudinal cohorts contain incomplete data, which hinders the application of standard RNNs and requires a pre-processing step such as imputation of the missing values. Instead, we propose a generalized training rule for the most widely used RNN architecture, long short-term memory (LSTM) networks, that can handle both missing predictor and target values. The proposed LSTM algorithm is applied to model the progression of Alzheimer’s disease (AD) using six volumetric magnetic resonance imaging (MRI) biomarkers, i.e., volumes of ventricles, hippocampus, whole brain, fusiform, middle temporal gyrus, and entorhinal cortex, and it is compared to standard LSTM networks with data imputation and a parametric, regression-based DPM method. The results show that the proposed algorithm achieves a significantly lower mean absolute error (MAE) than the alternatives with p < 0.05 using Wilcoxon signed rank test in predicting values of almost all of the MRI biomarkers. Moreover, a linear discriminant analysis (LDA) classifier applied to the predicted biomarker values produces a significantly larger area under the receiver operating characteristic curve (AUC) of 0.90 vs. at most 0.84 with p < 0.001 using McNemar’s test for clinical diagnosis of AD. Inspection of MAE curves as a function of the amount of missing data reveals that the proposed LSTM algorithm achieves the best performance up until more than 74% missing values. Finally, it is illustrated how the method can successfully be applied to data with varying time intervals.
This paper shows that built-in handling of missing values in training an LSTM network benefits the application of RNNs in neurodegenerative disease progression modeling in longitudinal cohorts.

Keywords: Alzheimer’s disease, disease progression modeling, linear discriminant analysis, long short-term memory, magnetic resonance imaging, recurrent neural networks.

Corresponding Author. Email address: mehdipour@biomediq.com (Mostafa Mehdipour Ghazi)

Data used in preparation of this article were obtained from the Alzheimer’s Disease Neuroimaging Initiative (ADNI) database (adni.loni.usc.edu). As such, the investigators within the ADNI contributed to the design and implementation of ADNI and/or provided data but did not participate in analysis or writing of this report. A complete listing of ADNI investigators can be found at http://adni.loni.usc.edu/wp-content/uploads/how_to_apply/ADNI_Acknowledgement_List.pdf

This document is the accepted version of the manuscript published in Medical Image Analysis in Volume 53, Pages 39-46, with DOI: https://doi.org/10.1016/j.media.2019.01.004. © 2019. This manuscript version is made available under the CC-BY-NC-ND 4.0 license http://creativecommons.org/licenses/by-nc-nd/4.0/.

arXiv:1903.07173v1 [cs.CV] 17 Mar 2019

1. Introduction

Alzheimer’s disease (AD) is a chronic neurodegenerative disorder that begins with memory loss and develops over time, causing issues in conversation, orientation, and control of bodily functions (McKhann et al., 1984). Early diagnosis of the disease is challenging and is usually made once cognitive impairment has already compromised daily living. Hence, developing robust, data-driven methods for disease progression modeling (DPM) utilizing longitudinal data is necessary to yield a complete perspective on the disease for better diagnosis, monitoring, and prognosis (Oxtoby and Alexander, 2017).

Existing longitudinal DPM methods model biomarkers as a function of disease progression using continuous curve fitting. In the AD progression modeling literature, a variety of regression-based methods have been proposed to fit logistic or polynomial functions to the longitudinal dynamic of each biomarker (Jedynak et al., 2012; Fjell et al., 2013; Oxtoby et al., 2014; Donohue et al., 2014; Yau et al., 2015; Guerrero et al., 2016). However, parametric assumptions on the biomarker trajectories not only limit the flexibility of such methods but also lead to the necessity of aligning subjects’ trajectories. In addition, the existing approaches mostly rely on independent biomarker modeling, and none of them consider the temporal dependencies among measurements.

Recurrent neural networks (RNNs) are non-parametric sequence based learning methods that, by design, do not require alignment of subject trajectories. They offer continuous, joint modeling of longitudinal data while taking temporal dependencies among measurements into account (Pearlmutter, 1989). Long short-term memory (LSTM) networks, the most widely used type of RNNs, were developed to effectively capture long-term temporal dependencies by dealing with the exploding and vanishing gradient problem during backpropagation through time (Hochreiter and Schmidhuber, 1997; Gers et al., 1999; Gers and Schmidhuber, 2001). They employ a memory cell with nonlinear reset units (so-called constant error carousels, CECs) and learn to store history for either long or short time periods. Since their introduction, a variety of LSTM networks have been developed for different time-series applications (Greff et al., 2017). The vanilla LSTM that utilizes three reset gates with full gate recurrence is the most commonly used LSTM architecture. It applies the backpropagation through time algorithm using full gradients to train the network and can include biases and cell-to-gates (peephole) connections.

However, since longitudinal cohorts often contain missing biomarker values due to, for instance, dropped-out patients, unsuccessful measurements, or different assessment patterns used for different subject groups, as seen in the Alzheimer’s Disease Neuroimaging Initiative (ADNI) (Petersen et al., 2010), standard RNNs including LSTMs cannot be directly applied. Pre-processing methods such as data imputation and interpolation are the most common approaches to handling missing data in RNNs. These two-step procedures decouple missing data handling and network training, resulting in a sub-optimal performance that is heavily influenced by the choice of data pre-processing method (Lipton et al., 2016). Although RNNs themselves have been used for estimating missing data (Parveen and Green, 2002; Yoon et al., 2018), the lack of methods to inherently handle incomplete data in RNNs is evident (Che et al., 2018). Other approaches update the architecture to learn or encode the missing data patterns (Che et al., 2018; Lipton et al., 2016). These methods are typically biased towards specific cohort or demographic circumstances correlated with the learned missing data patterns and introduce additional parameters in the network, which increases the complexity of the network.

In this paper, we propose a generalized method for training LSTM networks that can handle missing values in both input and target. This is achieved by applying the batch gradient descent algorithm in combination with the loss function and its gradients normalized to account for the number of missing values in input and target. Our goal is different from the approaches that encode the missing values’ patterns (Che et al., 2018; Lipton et al., 2016); we want to train RNNs robust to missing values to more faithfully capture the true underlying signal and to make the learned model generalizable across cohorts. The proposed LSTM algorithm is applied to AD progression modeling in the ADNI cohort (Petersen et al., 2010) based on volumetric magnetic resonance imaging (MRI) biomarkers, and the estimated biomarker values are used to predict the clinical status of subjects. MRI is known to be the best noninvasive way to examine changes in the brain in vivo during the course of AD (Biagioni and Galvin, 2011; Wu et al., 2011), and volumetric analysis is a widely used ROI-based method to estimate brain atrophy.

The main contribution is three-fold and can be summarized as follows:

• First, a generalized formulation of backpropagation through time for LSTM networks is proposed to handle incomplete data, and it is shown that such built-in handling of missing values provides a better modeling and prediction performance compared to using data imputation with standard LSTM networks.

• Second, temporal dependencies among measurements in the ADNI data are modeled using the proposed LSTM network via sequence-to-sequence learning.

Figure 1: Illustration of how the normalization factors are related to the input and output of an unfolded RNN (layers shown: output, hidden, input). Assume an RNN with three consecutive time points {t−1, t, t+1}, three input nodes, four hidden nodes, and two output nodes. Missing data for an instance observation j is illustrated as black nodes. We wish to weight the loss function and its gradients according to the number of available points in the input and output nodes. In this specific example, subject j has only one measurement available for its n-th input node and the same number for its m-th output node. Hence, the loss function and its gradients are weighted by 1/3.
Moreover, since there is a total of five measurements available in the input layer, the loss function is weighted by 5/9. The latter weighting factor is to ensure that the loss function takes the number of available points in the input layer into account.

To the best of our knowledge, this is the first time such multi-dimensional sequence learning methods are applied to neurodegenerative DPM.

• Third, an end-to-end approach, without need for trajectory alignment, is proposed for modeling the longitudinal dynamics of imaging biomarkers and for clinical status prediction. This is a practical way of implementing a robust DPM for both research and clinical applications.

A preliminary version of this work appeared in proceedings of the International Conference on Medical Imaging with Deep Learning (Mehdipour Ghazi et al., 2018). The present study contains a more detailed presentation and additional experiments to investigate statistical significance, robustness as a function of amount of missing data, and situations with varying time steps.

2. Proposed LSTM algorithm

The main goal of this study is to minimize the influence of missing values on the learned LSTM network parameters. This is achieved by using the batch gradient descent method in combination with the backpropagation through time algorithm modified to take into account missing values in the input and target vectors. More specifically, the algorithm sets input missing values to zero, backpropagates zero errors corresponding to the target missing points, and uses an L2-norm loss function with residuals weighted according to the number of available time points per target biomarker node ($\beta_j^m$) and according to the total number of available input values for all visits of all biomarkers ($\beta_j^x$). In addition, it normalizes input weight gradients of the loss function according to the number of available time points per input biomarker node ($\beta_j^n$). Figure 1 provides an illustration of how the normalization factors are related to the input and output of an unfolded RNN. Note that the use of batch gradient descent ensures the availability of at least one data point per biomarker that can proportionally contribute in the weight update rule.

2.1. The basic LSTM architecture

Figure 2 shows a typical schematic of a vanilla LSTM architecture. As can be seen, the topology includes a memory cell, an input modulation gate, and three nonlinear reset gates, namely input gate, forget gate, and output gate, each of which accepts current and recurrent inputs. The memory cell learns to maintain its state over time while the multiplicative gates learn to open and close access to the constant error/information flow, to prevent exploding or vanishing gradients. The input gate protects the memory contents from perturbation by irrelevant inputs, and the output gate protects other units from perturbation by currently irrelevant memory contents. The forget gate deals with continual or very long input sequences, and finally, peephole connections allow the gates to access the CEC of the same cell state.

Figure 2: An illustration of a vanilla LSTM unit with peephole connections in red. The solid and dashed lines show weighted and unweighted connections, respectively. (Labeled elements: input modulation, input gate, forget gate, output gate, hidden activation.)

2.2. Feedforward in LSTM networks

Assume $x_j^t \in \mathbb{R}^{N \times 1}$ is the j-th observation of an N-dimensional input vector at current time t. If M is the number of output units, feedforward calculations of the LSTM network under study can be summarized as

$$
\begin{aligned}
f_j^t &= W_f x_j^t + U_f h_j^{t-1} + V_f \odot c_j^{t-1} + b_f, &\quad \tilde{f}_j^t &= \sigma_g(f_j^t), \\
i_j^t &= W_i x_j^t + U_i h_j^{t-1} + V_i \odot c_j^{t-1} + b_i, &\quad \tilde{i}_j^t &= \sigma_g(i_j^t), \\
z_j^t &= W_c x_j^t + U_c h_j^{t-1} + b_c, &\quad \tilde{z}_j^t &= \sigma_c(z_j^t), \\
c_j^t &= \tilde{f}_j^t \odot c_j^{t-1} + \tilde{i}_j^t \odot \tilde{z}_j^t, &\quad \tilde{c}_j^t &= \sigma_h(c_j^t), \\
o_j^t &= W_o x_j^t + U_o h_j^{t-1} + V_o \odot c_j^t + b_o, &\quad \tilde{o}_j^t &= \sigma_g(o_j^t), \\
h_j^t &= \tilde{o}_j^t \odot \tilde{c}_j^t,
\end{aligned}
$$

where $\{f_j^t, i_j^t, z_j^t, c_j^t, o_j^t, h_j^t\} \in \mathbb{R}^{M \times 1}$ and $\{\tilde{f}_j^t, \tilde{i}_j^t, \tilde{z}_j^t, \tilde{c}_j^t, \tilde{o}_j^t\} \in \mathbb{R}^{M \times 1}$ are j-th observation of forget gate, input gate, modulation gate, cell state, output gate, and hidden output at time t before and after activation, respectively. Moreover, $\{W_f, W_i, W_o, W_c\} \in \mathbb{R}^{M \times N}$ and $\{U_f, U_i, U_o, U_c\} \in \mathbb{R}^{M \times M}$ are sets of connecting weights from current and recurrent inputs to the gates and cell, respectively, $\{V_f, V_i, V_o\} \in \mathbb{R}^{M \times 1}$ is the set of peephole connections from the cell to the gates, $\{b_f, b_i, b_o, b_c\} \in \mathbb{R}^{M \times 1}$ represents corresponding biases of neurons, and $\odot$ denotes element-wise multiplication. Finally, $\sigma_g$, $\sigma_c$, and $\sigma_h$ are nonlinear activation functions assigned for the gates, input modulation, and hidden cell output, respectively. Logistic sigmoid functions are applied to the gates with range [0, 1] while hyperbolic tangent functions are applied to modulate both cell input and hidden output with range [−1, 1]. Hence, the input measurements need to be in the same range [−1, 1].

2.3. Robust backpropagation through time

Let $L \in \mathbb{R}^{M \times 1}$ be the loss function defined based on the actual target s and network output y. Here, we consider one layer of LSTM units for sequence learning, which means that the network output is the hidden output. The main idea is to calculate the partial derivatives of the normalized loss function with respect to the weights using the chain rule.

$$
\begin{aligned}
L(m) &= \frac{1}{2JT} \sum_{j,t} \frac{1}{\beta_j^x \beta_j^m} \left( y_j^t(m) - s_j^t(m) \right)^2, \\
\delta y_j^t(m) &= \frac{1}{JT} \, \frac{1}{\beta_j^x \beta_j^m} \left[ y_j^t(m) - s_j^t(m) \right],
\end{aligned}
$$

where $\beta_j^x = |x_j| / (TN)$ and $\beta_j^m = |y_j(m)| / T$ are normalization factors to handle missing values of the j-th observation with batch size J and sequence length T. Also, $|x_j|$ and $|y_j(m)|$ denote the total number of available input values and the number of available target time points in the m-th node, respectively. The backpropagation calculations through time using full gradients can be obtained as

$$
\begin{aligned}
\delta h_j^t &= U_f^T \delta f_j^{t+1} + U_i^T \delta i_j^{t+1} + U_c^T \delta z_j^{t+1} + U_o^T \delta o_j^{t+1} + \delta y_j^t, \\
\delta \tilde{o}_j^t &= \delta h_j^t \odot \tilde{c}_j^t, \\
\delta o_j^t &= \delta \tilde{o}_j^t \odot \sigma_g'(o_j^t), \\
\delta \tilde{c}_j^t &= \delta h_j^t \odot \tilde{o}_j^t, \\
\delta c_j^t &= V_f \odot \delta f_j^{t+1} + V_i \odot \delta i_j^{t+1} + V_o \odot \delta o_j^t + \delta \tilde{c}_j^t \odot \sigma_h'(c_j^t) + \delta c_j^{t+1} \odot \tilde{f}_j^{t+1}, \\
\delta \tilde{z}_j^t &= \delta c_j^t \odot \tilde{i}_j^t, \\
\delta z_j^t &= \delta \tilde{z}_j^t \odot \sigma_c'(z_j^t), \\
\delta \tilde{i}_j^t &= \delta c_j^t \odot \tilde{z}_j^t, \\
\delta i_j^t &= \delta \tilde{i}_j^t \odot \sigma_g'(i_j^t), \\
\delta \tilde{f}_j^t &= \delta c_j^t \odot c_j^{t-1}, \\
\delta f_j^t &= \delta \tilde{f}_j^t \odot \sigma_g'(f_j^t), \\
\delta x_j^t &= W_f^T \delta f_j^t + W_i^T \delta i_j^t + W_c^T \delta z_j^t + W_o^T \delta o_j^t.
\end{aligned}
$$

Finally, if $\theta \in \{f, i, z, o\}$ and $\phi \in \{f, i\}$, the gradients of the loss function with respect to the weights are calculated as

$$
\begin{aligned}
\delta W_\theta(n) &= \sum_{j=1}^{J} \frac{1}{\beta_j^n} \, \delta\theta_j^{\{0 \to T\}} \, x_j^{\{0 \to T\}}(n), \\
\delta U_\theta &= \sum_{j=1}^{J} \delta\theta_j^{\{1 \to T\}} \, h_j^{\{0 \to T-1\}}, \\
\delta V_\phi &= \sum_{j=1}^{J} \sum_{t=0}^{T-1} \delta\phi_j^{t+1} \odot c_j^t, \\
\delta V_o &= \sum_{j=1}^{J} \sum_{t=0}^{T} \delta o_j^t \odot c_j^t, \\
\delta b_\theta &= \sum_{j=1}^{J} \sum_{t=0}^{T} \delta\theta_j^t,
\end{aligned}
$$

where $\beta_j^n = |x_j(n)| / T$ is the normalization factor handling missing input values and $|x_j(n)|$ is the number of available time points in the input’s n-th node. Here, we use a fixed sequence length of T to proportionally consider subjects based on their available visits. However, the robust backpropagation algorithm can easily be generalized for a dynamic sequence length.

2.4. Momentum batch gradient descent

As an efficient iterative algorithm, momentum batch gradient descent is applied to find the local minimum of the loss function calculated over a batch while speeding up the convergence. The update rule using L2 regularization can be written as

$$
\begin{aligned}
\vartheta^{new} &= \mu \vartheta^{old} - \alpha \left( \delta\omega + \gamma \omega^{old} \right), \\
\omega^{new} &= \omega^{old} + \vartheta^{new},
\end{aligned}
$$

where $\vartheta$ is the weight update initialized to zero, $\omega$ is the to-be-updated weight array, $\delta\omega$ is the gradient of the loss function with respect to $\omega$, and $\alpha$, $\gamma$, and $\mu$ are the learning rate, weight decay or regularization factor, and momentum weight, respectively.

3. Experiments

3.1. Data

Data used in the preparation of this article is obtained from the ADNI database. The ADNI was launched in 2003 as a public-private partnership, led by principal investigator Michael W. Weiner, MD. The primary goal of ADNI has been to test whether serial magnetic resonance imaging, positron emission tomography, other biological markers, and clinical and neuropsychological assessment can be combined to measure the progression of mild cognitive impairment and early Alzheimer’s disease. To be more specific, we use The Alzheimer’s Disease Prediction Of Longitudinal Evolution (TADPOLE) challenge dataset (Marinescu et al., 2018) which is composed of data from the three ADNI phases ADNI 1, ADNI GO, and ADNI 2. This includes roughly 1,500 biomarkers acquired from 1,737 subjects (957 males and 780 females) during 12,741 visits at 22 distinct time points between 2003 and 2017. Table 1 summarizes statistics of the demographics in the TADPOLE dataset. Note that the subjects include missing values and clinical status during their visits.

In this work, we have merged existing groups labeled as cognitively normal (CN), significant memory concern (SMC), and normal (NL) under CN, mild cognitive impairment (MCI), early MCI (EMCI), and late MCI (LMCI) under MCI, and Alzheimer’s disease (AD) and dementia under AD. Moreover, groups with labels converting from one status to another, e.g. MCI-to-AD, belong to the next status (AD in this example).

MRI biomarkers are used for AD progression modeling. This includes T1-weighted brain MRI volumes of ventricles, hippocampus, whole brain, fusiform, middle temporal gyrus, and entorhinal cortex. We normalize the MRI measurements by the corresponding intracranial volume (ICV). Next, we filter within-class outliers of each biomarker, across all subjects and their visits, by assuming them as missing values and normalize the measurements by scaling them linearly to [−1, 1]. Out of 22 visits, we initially select 11 regular visits with a fixed interval of one year including baseline. Finally, subjects with less than three distinct visits for any biomarker are removed to obtain 742 subjects. This is to ensure that at least two visits are available per biomarker for performing sequence learning through the feedforward step and an additional visit for backpropagation.

For evaluation purpose, we partition the entire dataset to three non-overlapping subsets for training, validation, and testing. To achieve this, we randomly select 10% of the within-class subjects for validation and the same for testing. More specifically, we randomly pick subjects based on their baseline labels while ensuring that subjects with few and large number of visits are included in each subset. This process results in 592, 76, and 74 subjects for training, validation, and testing, respectively. Details on the amount of available visits in the obtained evaluation subsets are shown in Table 2.
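The [−1, 1] scaling described above, together with the missing-data weighting of Section 2.3 and the update rule of Section 2.4, can be illustrated with a minimal Python sketch. It is not the authors' MATLAB implementation. It assumes that missing measurements are encoded as NaN, that data are plain nested lists indexed [subject][visit][node], and that residuals are weighted by 1/(β^x β^m) as in the loss of Section 2.3; the per-input-node factor β^n and the LSTM itself are left out. All function names are hypothetical.

```python
import math

NAN = float("nan")  # assumption: a missing measurement is encoded as NaN


def available(v):
    """True when a measurement is present (not NaN)."""
    return not math.isnan(v)


def scale_to_unit_range(values):
    """Linearly map the available values of one biomarker to [-1, 1]."""
    seen = [v for v in values if available(v)]
    if not seen:
        return list(values)
    lo, hi = min(seen), max(seen)
    if hi == lo:  # constant biomarker: map every available value to 0
        return [0.0 if available(v) else NAN for v in values]
    return [2.0 * (v - lo) / (hi - lo) - 1.0 if available(v) else NAN
            for v in values]


def zero_fill(x):
    """Set missing inputs to zero, as done before the forward pass."""
    return [[[v if available(v) else 0.0 for v in visit] for visit in subj]
            for subj in x]


def normalized_loss(y, s, x):
    """Loss per output node and residual gradients with missing-data weights.

    y, s: outputs and targets indexed [subject j][time t][output node m].
    x:    raw inputs (NaN where missing) indexed [j][t][input node n].
    """
    J, T, M, N = len(y), len(y[0]), len(y[0][0]), len(x[0][0])
    loss = [0.0] * M
    dy = [[[0.0] * M for _ in range(T)] for _ in range(J)]
    for j in range(J):
        # beta_x: fraction of available input values over all visits and nodes
        beta_x = sum(available(v) for visit in x[j] for v in visit) / (T * N)
        for m in range(M):
            # beta_m: fraction of available target time points in node m
            beta_m = sum(available(s[j][t][m]) for t in range(T)) / T
            if beta_x == 0.0 or beta_m == 0.0:
                continue  # nothing observed, so nothing is backpropagated
            scale = 1.0 / (J * T * beta_x * beta_m)
            for t in range(T):
                if available(s[j][t][m]):  # missing targets keep zero error
                    r = y[j][t][m] - s[j][t][m]
                    loss[m] += 0.5 * scale * r * r
                    dy[j][t][m] = scale * r
    return loss, dy


def momentum_step(w, v, grad, alpha=0.1, gamma=0.0001, mu=0.9):
    """One update: v <- mu*v - alpha*(grad + gamma*w), then w <- w + v."""
    v_new = [mu * vi - alpha * (gi + gamma * wi)
             for wi, vi, gi in zip(w, v, grad)]
    w_new = [wi + vi for wi, vi in zip(w, v_new)]
    return w_new, v_new
```

With one subject, three visits, five of nine inputs available, and one of three targets available in an output node (the setting of Figure 1), `normalized_loss` uses β^x = 5/9 and β^m = 1/3 for that node, so its single residual is averaged over the available points only rather than over all T visits.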
As can be deduced from Table 2, 63% of the obtained data is missing.

Table 1: Demographics of the TADPOLE dataset.

 | Number of visits (male) | Number of visits (female) | Age, year, mean±SD (male) | Age, year, mean±SD (female) | Education, year, mean±SD (male) | Education, year, mean±SD (female)
CN | 1,356 | 1,389 | 76.67±6.44 | 75.85±6.28 | 17.06±2.51 | 15.74±2.71
MCI | 2,454 | 1,604 | 75.59±7.47 | 73.87±8.09 | 16.22±2.85 | 15.45±2.76
AD | 1,208 | 900 | 77.22±7.11 | 75.45±7.92 | 15.85±3.03 | 14.35±2.73
All (labeled & unlabeled) | 12,741 (male and female) | | 76.00±7.38 (male and female) | | 15.91±2.86 (male and female) |

3.2. Evaluation metrics and statistical tests

Mean absolute error (MAE) and multi-class area under the receiver operating characteristic (ROC) curve (AUC) are used to assess the performance of modeling and classification, respectively. MAE measures accuracy of continuous prediction per biomarker by computing the absolute difference between actual and estimated values as follows

$$
\mathrm{MAE} = \frac{1}{I} \sum_{j,t} \left| y_j^t - s_j^t \right|,
$$

where $s_j^t$ and $y_j^t$ are the ground-truth and estimated values of the specific biomarker for the j-th subject at the t-th visit, respectively, and I is the number of available points in the target array s.

Multi-class AUC (Hand and Till, 2001) is a measure to examine the diagnostic performance in a multi-class test set using ROC analysis. It is calculated using the posterior probabilities as follows

$$
\mathrm{AUC} = \frac{1}{n_c (n_c - 1)} \sum_{i=1}^{n_c - 1} \sum_{k=i+1}^{n_c} \frac{1}{n_i n_k} \left[ SR_i - \frac{n_i (n_i + 1)}{2} + SR_k - \frac{n_k (n_k + 1)}{2} \right],
$$

where $n_c$ is the number of distinct classes, $n_i$ denotes the number of available points belonging to the i-th class, and $SR_i$ is the sum of the ranks of posteriors $p(c_i | s_i)$ after sorting all concatenated posteriors $\{p(c_i | s_i), p(c_i | s_k)\}$ in an ascending order, where $s_i$ and $s_k$ are vectors of scores belonging to the true classes $c_i$ and $c_k$, respectively.

The modeling performance is statistically assessed for different methods using the paired, two-sided Wilcoxon signed rank test (Wilcoxon, 1945) applied to the obtained absolute errors. Also, classification performance is analyzed using McNemar’s test (McNemar, 1947) applied to the hard classification results (clinical status) obtained from a linear discriminant analysis (LDA) classifier with predicted MRI measurements as input.

3.3. Experimental setup

The following methods are evaluated in our conducted experiments:

• LSTM-Robust: an LSTM network trained based on the proposed robust backpropagation through time algorithm by setting input missing values to zero and backpropagating zero errors corresponding to the target missing points while training.

• LSTM-Mean: an LSTM network trained using the standard backpropagation through time algorithm with missing values imputed based on mean imputation method prior to training (Che et al., 2018).

• LSTM-Forward: an LSTM network trained using the standard backpropagation through time algorithm with missing values imputed based on forward imputation method prior to training (Lipton et al., 2016).

• Regression-Based: a parametric, regression-based method (Jedynak et al., 2012) that automatically handles missing values. The parameters of the algorithm are initially estimated using linear regression in 15 iterations and are optimized using sigmoidal functions in 35 additional iterations where all parameters converge.

All the methods are developed in MATLAB R2017b and run on a 2.80 GHz CPU with 16 GB RAM. We initialize the LSTM networks’ weights by generating uniformly distributed random values in range [−0.05, 0.05] and set the weights’ updates and weights’ gradients to zero. The batch size is set to the number of available training subjects, and the first ten visits are used to estimate the second to eleventh visits per subject for evaluation purpose. It should be noted that when data imputation is applied, the robust backpropagation formulas simply generalize to the ones for the standard LSTM network.

We utilize the validation set to tune all the networks’
optimization parameters, each time by adjusting one of the parameters while keeping the rest at fixed values to achieve the lowest average MAE. Peephole connections are used in the networks since they tend to improve the performance (Greff et al., 2017). Based on these strategies, the optimal values of the three tuned parameters are obtained as 0.1, 0.9, and 0.0001 with 1,000 epochs. The corresponding MAEs for the validation set are also calculated as 0.00296, 0.00025, 0.01494, 0.00024, 0.00076, and 0.00097, for ventricles, hippocampus, whole brain, entorhinal cortex, fusiform, and middle temporal gyrus, respectively. It takes about 340 seconds to train the network and 0.025 seconds to estimate all the validation measurements. It is worthwhile mentioning that all the estimated measurements are linearly scaled from [−1, 1] to the original range of biomarkers using the original minimum and maximum values while calculating MAEs.

Table 2: Number of visits in the evaluation subsets across all subjects. Note that the complete dataset should have contained 742 × 11 = 8,162 visits per biomarker where the maximum number of visits per subject is 11. The number of visits per subject per diagnostic group is left blank as subjects can convert from one group to another in the course of AD.

| Category | Item | Number of visits across subjects (train / validation / test) | Number of visits per subject, mean±SD [min, max] (train / validation / test) |
|---|---|---|---|
| Clinical labels | CN | 1,192 / 136 / 149 | |
| Clinical labels | MCI | 1,389 / 198 / 180 | |
| Clinical labels | AD | 606 / 84 / 92 | |
| Clinical labels | All (labeled & unlabeled) | 3,270 / 428 / 434 | 5.52±2.32 [3, 11] / 5.63±2.39 [3, 11] / 5.86±2.51 [3, 11] |
| MRI biomarkers | Ventricles | 2,481 / 328 / 318 | 4.19±1.47 [3, 10] / 4.32±1.46 [3, 8] / 4.30±1.58 [3, 9] |
| MRI biomarkers | Hippocampus | 2,381 / 311 / 312 | 4.02±1.31 [3, 10] / 4.09±1.29 [3, 8] / 4.22±1.49 [3, 7] |
| MRI biomarkers | Whole brain | 2,513 / 328 / 322 | 4.24±1.49 [3, 10] / 4.32±1.46 [3, 8] / 4.35±1.57 [3, 9] |
| MRI biomarkers | Entorhinal cortex | 2,351 / 310 / 309 | 3.97±1.29 [3, 10] / 4.08±1.34 [3, 8] / 4.18±1.46 [3, 7] |
| MRI biomarkers | Fusiform | 2,351 / 310 / 309 | 3.97±1.29 [3, 10] / 4.08±1.34 [3, 8] / 4.18±1.46 [3, 7] |
| MRI biomarkers | Middle temporal gyrus | 2,351 / 309 / 309 | 3.97±1.29 [3, 10] / 4.07±1.35 [3, 8] / 4.18±1.46 [3, 7] |

4. Results and discussion

After successfully training the LSTM networks and the regression-based method for DPM, they are all evaluated using the test set.

4.1. Biomarker modeling

Table 3 compares the test MRI biomarker modeling performance (MAE) using the aforementioned methods. Even though the performance is reported per biomarker, the models are jointly fitted to all biomarkers. As can be deduced from Table 3, LSTM-Robust significantly outperforms the other methods in all MRI biomarkers except for whole brain, where the regression-based approach performs significantly better, and for middle temporal gyrus, where there is no difference between the proposed method and LSTM-Forward.

Table 3: Test MRI biomarker modeling performance (MAE) for yearly predictions. The proposed method is compared with the alternatives using a paired, two-sided Wilcoxon signed rank test, and this is reported in the significance column as LSTM-Robust vs. LSTM-Mean / LSTM-Robust vs. LSTM-Forward / LSTM-Robust vs. Regression-Based. †: not significantly different, *: p < 0.05, **: p < 0.01, ***: p < 0.001.

| Biomarker | LSTM-Robust | Significance | LSTM-Mean (Che et al., 2018) | LSTM-Forward (Lipton et al., 2016) | Regression-Based (Jedynak et al., 2012) |
|---|---|---|---|---|---|
| Ventricles | 0.00307 | ***/***/*** | 0.00620 | 0.00472 | 0.00807 |
| Hippocampus | 0.00023 | ***/**/*** | 0.00051 | 0.00034 | 0.00051 |
| Whole brain | 0.01330 | ***/**/*** | 0.02375 | 0.01639 | 0.00551 |
| Entorhinal cortex | 0.00021 | ***/*/*** | 0.00030 | 0.00025 | 0.00035 |
| Fusiform | 0.00068 | ***/***/*** | 0.00130 | 0.00100 | 0.00090 |
| Middle temporal gyrus | 0.00087 | ***/†/* | 0.00126 | 0.00118 | 0.00111 |

4.2. Predicting clinical status

To assess the ability of the estimated measurements in predicting the clinical status, we train an LDA classifier using the estimated training measurements and apply it to the estimated test data to compute the posterior probabilities. The obtained scores are then used to calculate diagnostic AUCs. The diagnostic prediction results for the test set are shown in Table 4. As can be seen, LSTM-Robust outperforms all other methods in predicting clinical status of subjects per visit with a multi-class AUC of 0.76, which reveals the effect of modeling on classification performance. One could of course use other classifiers or train the LSTM network directly for classification based on sequence-to-label learning to potentially improve the diagnostic AUCs. However, the focus of this work is on DPM based on sequence-to-sequence learning. In addition, sequence-to-label learning would only be able to utilize the part of the training data which has available clinical status.

Table 4: Test diagnostic performance (AUC) of the estimated MRI biomarker values using an LDA classifier. The proposed method is compared with the alternatives using McNemar's test, and this is reported in the significance column as LSTM-Robust vs. LSTM-Mean / LSTM-Robust vs. LSTM-Forward / LSTM-Robust vs. Regression-Based. †: not significantly different, *: p < 0.05, **: p < 0.01, ***: p < 0.001.

| Task | LSTM-Robust | Significance | LSTM-Mean (Che et al., 2018) | LSTM-Forward (Lipton et al., 2016) | Regression-Based (Jedynak et al., 2012) |
|---|---|---|---|---|---|
| CN vs. MCI | 0.5914 | †/†/† | 0.5838 | 0.5800 | 0.5468 |
| CN vs. AD | 0.9029 | ***/***/*** | 0.8404 | 0.8150 | 0.7826 |
| MCI vs. AD | 0.7844 | †/†/† | 0.6936 | 0.6890 | 0.7330 |
| CN vs. MCI vs. AD | 0.7596 | †/*/* | 0.7059 | 0.6947 | 0.6875 |

The multi-class AUC of 0.76 obtained using predicted measurements from the proposed approach is within the top-five AUCs of the state-of-the-art, cross-sectional MRI-based classification results of the recent challenge on Computer-Aided Diagnosis of Dementia (CADDementia) (Bron et al., 2015) that ranged from 0.75 to 0.79. It should, however, be noted that there are important differences between this study and the CADDementia challenge. Firstly, this work has the advantage of training and testing data from the same cohort whereas CADDementia algorithms were applied to classify data from independent cohorts. Secondly, the top performing CADDementia algorithms incorporated different types of MRI biomarkers besides volumetry. Thirdly, this work predicts the input features to the classifier based on historical longitudinal data.

4.3. Robustness as a function of amount of missing data

To evaluate the modeling robustness of the proposed method compared to the alternatives for different amounts of missing data, we construct subsamples of the training dataset by randomly removing up to 50% of the actual data per biomarker and train the methods on the smaller datasets. Figure 3 illustrates the modeling performance of the different methods on various amounts of missing measurements, from 0% to 50%. It is important to note that the training data already includes a large number of missing values at a removal rate of 0% (i.e., 63% of the actual data, as seen in Table 2). For better comparison, we take the average of MAEs normalized by the range of the corresponding biomarkers to obtain a single curve per method. As can be seen, the result of the proposed method is superior to those of the benchmarks until around 74% of the data is missing. For higher rates of missing data, basic LSTM with forward imputation outperforms all other methods. One reason why LSTM with forward imputation is robust to the higher rates of missing data could be that it replaces the missing values placed at the beginning of a sequence with the whole training data median.

Figure 3: Modeling performance of MRI biomarkers for various amounts of missing values. (x-axis: Amount of Missing Data (%), from 63 to 81; y-axis: Average Test NMAE Across MRI Biomarkers; curves: LSTM-Robust, Regression-Based, LSTM-Mean, LSTM-Forward.)

4.4. Irregular time intervals

As a final experiment, we assess the generalizability of the proposed method for predicting measurements of irregular visits. In general, standard LSTM networks are designed to handle evenly spaced sequences. We used the same approach in our baseline experiments for the AD progression modeling application by disregarding visiting months 3, 6 and 18, and confined the experiments to yearly follow-up in the ADNI data. Now, we employ the available measurements of the 6-th and 18-th visiting months from the TADPOLE dataset and predict biomarker values of half-yearly follow-ups by treating unavailable visits as missing data. In this experiment, 78% of the actual data is missing. We apply the same methods to the extended data.

Table 5 details the test modeling performance of the MRI biomarkers for half-yearly predictions using the different DPM methods. As can be seen, our proposed DPM method outperforms all other methods for every biomarker except whole brain, where the regression-based method again achieves a lower MAE (0.00601 vs. 0.01181). More interestingly, considering the corresponding results from Table 3 for yearly predictions, one can deduce that the modeling performance of the proposed method improves by utilizing the irregular visits. However, the additional time points in the LSTM increase the required time for training and validation to 1,090 seconds and 0.061 seconds, respectively.

Table 5: Test MRI biomarker modeling performance (MAE) for half-yearly predictions.

| Biomarker | LSTM-Robust | LSTM-Mean (Che et al., 2018) | LSTM-Forward (Lipton et al., 2016) | Regression-Based (Jedynak et al., 2012) |
|---|---|---|---|---|
| Ventricles | 0.00272 | 0.00973 | 0.01030 | 0.00659 |
| Hippocampus | 0.00023 | 0.00068 | 0.00065 | 0.00043 |
| Whole brain | 0.01181 | 0.03332 | 0.02552 | 0.00601 |
| Entorhinal cortex | 0.00021 | 0.00037 | 0.00032 | 0.00038 |
| Fusiform | 0.00061 | 0.00164 | 0.00196 | 0.00091 |
| Middle temporal gyrus | 0.00085 | 0.00220 | 0.00263 | 0.00097 |

As an alternative, one could utilize modified LSTM architectures where the networks learn a number of parameters to encode visiting patterns among longitudinal patient records (Baytas et al., 2017; Neil et al., 2016). However, using such methods not only increases the complexity of the network but also risks learning any time spacing patterns in the data.

5. Conclusions

In this paper, a training algorithm was proposed for LSTM networks aiming to improve robustness against missing data, and the robustly trained LSTM network was applied to AD progression modeling using longitudinal measurements of MRI biomarkers. To the best of our knowledge, this is the first time RNNs have been studied and applied to DPM within neurodegenerative disease. Moreover, since RNNs are non-parametric learning methods, the proposed approach can be applied to different time-series data and characteristics than the monotonic behavior that one typically encounters in MRI-based neurodegenerative disease progression modeling. The proposed training method demonstrated better performance than using imputation prior to standard LSTM network training and outperformed an established parametric, regression-based DPM method in terms of both biomarker prediction and subsequent diagnostic classification. This method is also applicable for other types of RNNs such as gated recurrent units (GRUs) (Cho et al., 2014). This study highlights the potential of RNNs for modeling the progression of AD using longitudinal measurements, provided that proper care is taken to handle missing values and time intervals.

Disclosures

M. Nielsen is shareholder in Biomediq A/S and Cerebriu A/S. A. Pai is shareholder in Cerebriu A/S. The remaining authors report no disclosures.

Acknowledgments

This project has received funding from the European Union's Horizon 2020 research and innovation programme under the Marie Skłodowska-Curie grant agreement No 721820. This work uses the TADPOLE data sets (https://tadpole.grand-challenge.org) constructed by the EuroPOND consortium (http://europond.eu) funded by the European Union's Horizon 2020 research and innovation programme under grant agreement No 666992.

Data collection and sharing for this project was funded by the Alzheimer's Disease Neuroimaging Initiative (ADNI) (National Institutes of Health Grant U01 AG024904) and DOD ADNI (Department of Defense award number W81XWH-12-2-0012). ADNI is funded by the National Institute on Aging, the National Institute of Biomedical Imaging and Bioengineering, and through generous contributions from the following: AbbVie, Alzheimer's Association; Alzheimer's Drug Discovery Foundation; Araclon Biotech; BioClinica, Inc.; Biogen; Bristol-Myers Squibb Company; CereSpir, Inc.; Cogstate; Eisai Inc.; Elan Pharmaceuticals, Inc.; Eli Lilly and Company; EuroImmun; F. Hoffmann-La Roche Ltd. and its affiliated company Genentech, Inc.; Fujirebio; GE Healthcare; IXICO Ltd.; Janssen Alzheimer Immunotherapy Research & Development, LLC.; Johnson & Johnson Pharmaceutical Research & Development LLC.; Lumosity; Lundbeck; Merck & Co., Inc.; Meso Scale Diagnostics, LLC.; NeuroRx Research; Neurotrack Technologies; Novartis Pharmaceuticals Corporation; Pfizer Inc.; Piramal Imaging; Servier; Takeda Pharmaceutical Company; and Transition Therapeutics. The Canadian Institutes of Health Research is providing funds to support ADNI clinical sites in Canada. Private sector contributions are facilitated by the Foundation for the National Institutes of Health (www.fnih.org). The grantee organization is the Northern California Institute for Research and Education, and the study is coordinated by the Alzheimer's Therapeutic Research Institute at the University of Southern California. ADNI data are disseminated by the Laboratory for Neuro Imaging at the University of Southern California.

References

Baytas, I.M., Xiao, C., Zhang, X., Wang, F., Jain, A.K., Zhou, J., 2017. Patient subtyping via time-aware LSTM networks, in: Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 65–74.

Biagioni, M.C., Galvin, J.E., 2011. Using biomarkers to improve detection of Alzheimer's disease. Neurodegener. Dis. Manag. 1, 127–139.

Bron, E.E., Smits, M., Van Der Flier, W.M., Vrenken, H., Barkhof, F., Scheltens, P., Papma, J.M., Steketee, R.M., Orellana, C.M., Meijboom, R., et al., 2015. Standardized evaluation of algorithms for computer-aided diagnosis of dementia based on structural MRI: the CADDementia challenge. NeuroImage 111, 562–579.

Che, Z., Purushotham, S., Cho, K., Sontag, D., Liu, Y., 2018. Recurrent neural networks for multivariate time series with missing values. Sci. Rep. 8, 6085.

Cho, K., Van Merriënboer, B., Gulcehre, C., Bahdanau, D., Bougares, F., Schwenk, H., Bengio, Y., 2014. Learning phrase representations using RNN encoder-decoder for statistical machine translation. CoRR abs/1406.1078.

Donohue, M.C., Jacqmin-Gadda, H., Le Goff, M., Thomas, R.G., Raman, R., Gamst, A.C., Beckett, L.A., Jack, C.R., Weiner, M.W., Dartigues, J.F., Aisen, P.S., 2014. Estimating long-term multivariate progression from short-term data. Alzheimer's & Dement.: the J. of the Alzheimer's Assoc. 10, S400–S410.

Fjell, A.M., Westlye, L.T., Grydeland, H., Amlien, I., Espeseth, T., Reinvang, I., Raz, N., Holland, D., Dale, A.M., Walhovd, K.B., 2013. Critical ages in the life course of the adult brain: nonlinear subcortical aging. Neurobiol. of Aging 34, 2239–2247.

Gers, F.A., Schmidhuber, J., 2001. LSTM recurrent networks learn simple context-free and context-sensitive languages. IEEE Trans. on Neural Netw. 12, 1333–1340.

Gers, F.A., Schmidhuber, J., Cummins, F., 1999. Learning to forget: Continual prediction with LSTM, in: Proceedings of the 9th International Conference on Artificial Neural Networks (ICANN 99), pp. 850–855.

Greff, K., Srivastava, R.K., Koutník, J., Steunebrink, B.R., Schmidhuber, J., 2017. LSTM: A search space odyssey. IEEE Trans. on Neural Netw. and Learn. Syst. 28, 2222–2232.

Guerrero, R., Schmidt-Richberg, A., Ledig, C., Tong, T., Wolz, R., Rueckert, D., 2016. Instantiated mixed effects modeling of Alzheimer's disease markers. NeuroImage 142, 113–125.

Hand, D.J., Till, R.J., 2001. A simple generalisation of the area under the ROC curve for multiple class classification problems. Mach. Learn. 45, 171–186.

Hochreiter, S., Schmidhuber, J., 1997. Long short-term memory. Neural Comput. 9, 1735–1780.

Jedynak, B.M., Lang, A., Liu, B., Katz, E., Zhang, Y., Wyman, B.T., Raunig, D., Jedynak, C.P., Caffo, B., Prince, J.L., 2012. A computational neurodegenerative disease progression score: method and results with the Alzheimer's Disease Neuroimaging Initiative cohort. NeuroImage 63, 1478–1486.

Lipton, Z.C., Kale, D.C., Wetzel, R., 2016. Modeling missing data in clinical time series with RNNs, in: Proceedings of Machine Learning for Healthcare.

Marinescu, R.V., Oxtoby, N.P., Young, A.L., Bron, E.E., Toga, A.W., Weiner, M.W., Barkhof, F., Fox, N.C., Klein, S., Alexander, D.C., 2018. TADPOLE challenge: Prediction of longitudinal evolution in Alzheimer's disease. CoRR abs/1805.03909.

McKhann, G., Drachman, D., Folstein, M., Katzman, R., Price, D., Stadlan, E.M., 1984. Clinical diagnosis of Alzheimer's disease. Neurol. 34, 939–939.

McNemar, Q., 1947. Note on the sampling error of the difference between correlated proportions or percentages. Psychom. 12, 153–157.

Mehdipour Ghazi, M., Nielsen, M., Pai, A., Cardoso, M.J., Modat, M., Ourselin, S., Sørensen, L., 2018. Robust training of recurrent neural networks to handle missing data for disease progression modeling. CoRR abs/1808.05500.

Neil, D., Pfeiffer, M., Liu, S.C., 2016. Phased LSTM: Accelerating recurrent network training for long or event-based sequences, in: Advances in Neural Information Processing Systems, pp. 3882–3890.

Oxtoby, N.P., Alexander, D.C., 2017. Imaging plus X: multimodal models of neurodegenerative disease. Curr. Opin. in Neurol. 30, 371.

Oxtoby, N.P., Young, A.L., Fox, N.C., Daga, P., Cash, D.M., Ourselin, S., Schott, J.M., Alexander, D.C., 2014. Learning imaging biomarker trajectories from noisy Alzheimer's disease data using a bayesian multilevel model, in: Bayesian and Graphical Models for Biomedical Imaging, pp. 85–94.

Parveen, S., Green, P., 2002. Speech recognition with missing data using recurrent neural nets, in: Advances in Neural Information Processing Systems, pp. 1189–1195.

Pearlmutter, B.A., 1989. Learning state space trajectories in recurrent neural networks. Neural Comput. 1, 263–269.

Petersen, R.C., Aisen, P., Beckett, L., Donohue, M., Gamst, A., Harvey, D., Jack, C., Jagust, W., Shaw, L., Toga, A., Trojanowski, J., Weiner, M., 2010. Alzheimer's Disease Neuroimaging Initiative (ADNI): clinical characterization. Neurol. 74, 201–209.

Wilcoxon, F., 1945. Individual comparisons by ranking methods. Biom. Bull. 1, 80–83.

Wu, L., Rosa-Neto, P., Gauthier, S., 2011. Use of biomarkers in clinical trials of Alzheimer disease. Mol. Diagn. & Ther. 15, 313–325.

Yau, W.Y.W., Tudorascu, D.L., McDade, E.M., Ikonomovic, S., James, J.A., Minhas, D., Mowrey, W., Sheu, L.K., Snitz, B.E., Weissfeld, L., et al., 2015. Longitudinal assessment of neuroimaging and clinical markers in autosomal dominant Alzheimer's disease: a prospective cohort study. The Lancet Neurol. 14, 804–813.

Yoon, J., Zame, W.R., van der Schaar, M., 2018. Estimating missing data in temporal data streams using multi-directional recurrent neural networks. IEEE Trans. on Biomed. Eng.
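Supplementary sketch. The handling of missing values described for LSTM-Robust in Section 3.3 (missing inputs set to zero, zero error backpropagated at missing targets) and the MAE of Section 3.2 (averaged over the available target points only) can be illustrated roughly as follows. This is a minimal sketch under stated assumptions, not the authors' MATLAB implementation: missing values are assumed to be encoded as NaN, and the function names `fill_missing_inputs`, `masked_errors`, and `masked_mae` are hypothetical.

```python
import math


def fill_missing_inputs(sequence):
    """Replace missing inputs (assumed encoded as NaN) with zero."""
    return [0.0 if math.isnan(x) else x for x in sequence]


def masked_errors(estimates, targets):
    """Per-visit error y - s, set to zero wherever the target s is missing,
    so that no error signal is propagated from unobserved points."""
    return [0.0 if math.isnan(s) else y - s for y, s in zip(estimates, targets)]


def masked_mae(estimates, targets):
    """MAE = (1/I) * sum |y - s| over the I available target points only."""
    available = [(y, s) for y, s in zip(estimates, targets) if not math.isnan(s)]
    if not available:
        raise ValueError("no available target points")
    return sum(abs(y - s) for y, s in available) / len(available)


# Toy single-subject, single-biomarker sequence with two missing visits.
nan = float("nan")
targets = [0.5, nan, 0.75, nan]
estimates = [0.25, 0.6, 1.0, 0.8]

print(fill_missing_inputs(targets))       # [0.5, 0.0, 0.75, 0.0]
print(masked_errors(estimates, targets))  # [-0.25, 0.0, 0.25, 0.0]
print(masked_mae(estimates, targets))     # 0.25
```

In this toy sequence only two of the four visits have targets, so I = 2 and the estimates at the unobserved visits do not affect the MAE.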

Journal

Statistics, arXiv (Cornell University)

Published: Mar 17, 2019
