A Bayesian binomial regression model with latent Gaussian processes for modelling DNA methylation
A Bayesian binomial regression model with latent Gaussian processes for modelling DNA methylation
Hubin, Aliaksandr;Storvik, Geir O;Grini, Paul E;Butenko, Melinka A
2020-04-28 00:00:00
Epigenetic observations are represented by the total number of reads from a given pool of cells and the number of methylated reads, making it reasonable to model this data by a binomial distribution. There are numerous factors that can in
uence the probability of success in a particular region. Moreover, there is a strong spatial (alongside the genome) dependence of these probabilities. We incorporate dependence on the covariates and the spatial dependence of the methylation probability for observations from a pool of cells by means of a binomial regression model with a latent Gaussian eld and a logit link function. We apply a Bayesian approach including prior speci cations on model con gurations. We run a mode jumping Markov chain Monte Carlo algorithm (MJMCMC) across dierent choices of covariates in order to obtain the joint posterior distribution of parameters and models. This also allows nding the best set of covariates to model methylation probability within the genomic region of interest and individual marginal inclusion probabilities of the covariates. Keywords : Bayesian binomial regression, latent Gaussian eld, mode jumping Markov chain Monte Carlo, integrated nested Laplace approximations, Bayesian variable selection, Bayesian model averaging, epigenetic data, genetic patterns. 1. Introduction Epigenetic modi cations contribute to the generation of phenotypic plasticity, but the under- standing of its contribution to phenotypic alterations and how the genome in
uences epige- netic variants requires further investigation (Schmitz, Schultz, Urich, Nery, Pelizzola, Libiger, Alix, McCosh, Chen, Schork et al. 2013). Epigenetic changes are crucial for the development and dierentiation of various cell types in an organism, as well as for normal cellular processes. Epigenetic modi cations modulate gene expression and modi cations found in the promoter or regulatory elements play a prominent role in activating or suppressing transcript levels. This creates interesting research possibilities, which are often challenging from the statistical point of view. For example, Li, Cassese, Guindani, and Vannucci (2019) suggested a Bayesian negative binomial regression model to study the in
uence of methylation (used as covariates) arXiv:2004.13689v1 [stat.AP] 28 Apr 2020 2 A Bayesian binomial regression model with latent GP for modelling DNA methylation on RNA-Seq gene expression counts (used as observations). Also, Tang, Zhou, Wang, Huang, and Jin (2017) developed a Gaussian Bayesian regression model to link the dierential gene expression (measured as log fold change) to various exogenous variables including tumour suppressor genes categories, mean methylation values and genomic segment distributions. In turn, Ma, Liu, Zhang, Huang, and Tang (2017) suggested a multiple network for epige- netic studies and implemented the Cox proportional hazard model to analyze the association of methylation pro le of each epigenetic module with the patient survival. Recently, high- throughput epigenetics experiments have enabled researchers to measure genome-wide epi- genetic pro les. This allows performing Epigenome-wide association studies (EWAS), which also hold promise for the detection of new regulatory mechanisms that may be susceptible to modi cation by environmental and lifestyle factors (Michels, Binder, Dedeurwaerder, Epstein, Greally, Gut, Houseman, Izzi, Kelsey, Meissner et al. 2013). A major task today is the development of models and statistical methods for linking epigenetic patterns to genomic and/or environmental variables and interpreting them. Unlike the papers mentioned above (Ma et al. 2017; Tang et al. 2017; Li et al. 2019), we use methylation data as responses and link them to genomic and phenotypic variables (used as covariates). Moreover, by means of performing careful statistical modelling, our model takes into account that epigenetic data are spatially correlated (along locations in the genome) with high noise levels. Due to the availability of the data, our focus will be on the model plant Arabidopsis thaliana. For instance, Becker, Hagmann, Mu ller, Koenig, Stegle, Borgwardt, and Weigel (2011) previously analysed Arabidopsis data consisting of epigenetic observations on a set of 10 lines, which were separately propagated in a common environment for 30 generations. These were compared with two independent lines propagated for only three generations. Their analysis aimed at global summaries of structures but was based on individual and (site- wise) hypothesis testing methods combined with false discovery rate control methodology. In this paper, however, we limit ourselves to nding a pattern of signals appearing along the single genome that signi cantly in
uences the methylation probability of the corresponding organism. This is done by means of applying a binomial regression model with latent Gaussian variables, which take into account both spatial dependence and variability that can not be explained by the exogenous variables alone. The chosen latent Gaussian variable is a sum of a random walk RW (1) component and an independent IG component. Model selection and parameter estimation within models are performed simultaneously in a Bayesian framework, applying the mode jumping Markov chain Monte Carlo (MJMCMC) algorithm developed by Hubin and Storvik (2018) to perform the computations involved. MJMCMC outputs posterior model probabilities allowing to nd the best combination of explanatory genomic variables and compute marginal inclusion probabilities for the importance of individual variables. Our approach also allows to generate a model-averaged classi cation of the methylation status at dierent locations and make imputations for those locations that do not have enough observations, whilst the currently used approach is to simply ignore these locations. 2. Mathematical model We model the number of methylated reads Y 2 f1; :::; n g per position in the genome (nu- t t cleotide base position) to be binomially distributed with the number of trials equal to the number of reads, n 2 N, for position t 2 ft ; :::; t g (where T is the total number of genomic t 1 T positions in the addressed genomic region) and corresponding probability of success p 2 R . t [0;1] The probability p is modeled via the logit link to the covariates X = fX ; :::; X g. These t t t1 td covariates might be a position within a gene (e.g promoter or coding region), an indicator of the underlying genetic structure, or other types of features (our choice of the covariates is given in Section 3). A latent Gaussian RW (1) process f g 2 R and a latent independent Gaussian (IG) process f g are included into the model in order to take into account spa- tial dependence of methylation probabilities along the genome and the variance which is not explained by the covariates. Other explored latent Gaussian variables were also tested prior Austrian Journal of Statistics 3 to variable selection on a full model before the selection of this structure was done (see the detail in the Appendix A of the paper). This gives the following model formulation: y n y Pr(Y = yjn ; p ) = p (1 p ) ; (1) t t t t logit(p ) = +
X + + ; (2) t 0 j j tj t t j=1 iid = + ; N (0; ); (3) t t 1 t t iid N (0; ); (4) where 2 R; j 2 f0; :::; dg are regression coecients of the model showing whether and in which way the corresponding covariate in
uences the probability of methylation on average, 2 f0; 1g; i 2 f1; :::; dg are latent indicators, de ning if covariate j is included into the model (
= 1) or not (
= 0), f g are the error terms of RW (1) process f g, which are normally j j t t distributed with zero mean and precision . Finally, is the precision term of the IG process f g. We then put the following priors for the parameters of the model: Bernoulli(q); j
I(
= 1)N (0; ); ; and Gamma(1; 5 10 ): (5) i i i i Here, q = 0:5 is the prior Bernoulli probability of including a covariate into the model, and I() is the indicator function. We perform analysis for the model de ned by Equations (1)-(5) by means of the MJMCMC algorithm (Hubin and Storvik 2018). The algorithm is capable of eciently moving in the de ned model space by means of both accurately exploring the modes of the probability mass and switching between these modes using large jumps combined with local optimization and randomization. 3. Bayesian inference Let
= (
; :::
), which uniquely de nes a speci c model. Assuming the constant term is 1 p 0 always included, there are L = 2 dierent models to consider. De ne = f ; ; ; ; g 2 , which describes the parameters of the models. Also let Y = (y ; :::; y ) and X = 1 n (X ; :::;X ). We want to make inference jointly on models and their parameters p(
;jY ;X ). 1 n We also want to nd a set of the best models with respect to posterior marginal model probabilities p(
jY ;X ). Finally, we want to obtain the marginal inclusion probabilities p(
= 1jY ;X ); j 2 f1; :::; dg for individual covariates. By Bayes formula, p(
;jY ;X ) = p(j
;Y ;X )p(
jY ;X ). In Section 3.1 we describe how to compute p(j
;Y ;X ) and p(Y jX;
). For the time being, we assume that marginal likelihoods p(Y jX;
) are available for a given
. Then by Bayes formula: p(Y jX;
)p(
) p(
jY ;X ) = P : (6) 0 0 p(Y jX;
)p(
) In order to calculate p(
jY ;X ) we have to iterate through the whole model space , which becomes computationally infeasible for large d. The ordinary MCMC estimate is based on a (i) number of MCMC samples
; i = 1; :::; W : 1 (i) pe(
jY ;X ) = W I(
=
) ! p(
jY ;X ): (7) W!1 i=1 An alternative, named the renormalized model (RM) estimates by Clyde, Ghosh, and Littman (2011), is p(Y jX;
)p(
) pb(
jY ;X ) = I(
2 V) ! p(
jY ;X ); (8) 0 0 p(Y jX;
)p(
) V! 2V 4 A Bayesian binomial regression model with latent GP for modelling DNA methylation where now V is the set of visited models during the MCMC (or any other model space exploration algorithm) run. Although both (8) and (7) are asymptotically consistent, (8) will often be a preferable estimator since convergence of the MCMC based approximation (7) is much slower, see Clyde et al. (2011); Hubin and Storvik (2018). We aim at approximating p(
jY ;X ) by means of searching for some subspace V of making the approximation (8) as precise as possible. Models with high values of p(Y jX;
)p(
) and regions of relatively high posterior mass are important to be included into V. Missing them in V can introduce signi cant biases in our estimates. Note that these aspects are just as important for the standard MCMC estimate (7). The dierence is that while the number of times a speci c model is visited is important when using (7), for (8) it is enough that the model is visited at least once. In this context, the denominator of (8), which we would like to be as high as possible, becomes an extremely relevant measure for the quality of the search in terms of being able to capture whether the algorithm visits all of the modes, whilst the size of V should be low in order to save computational time. The marginal inclusion probability p(
= 1jY ;X ) is de ned as: 0 0 p(
= 1jY ;X ) = I(
= 1)p(
jY ;X ): (9) It can be approximated either by the MCMC estimator: (i) d pe(
= 1jY ;X ) = W I(
= 1) ! p(
= 1jY ;X ); (10) j j W!1 i=1 or using the renormalized approach: 0 0 pb(
= 1jY ;X ) = I(
= 1)p(
jY ;X ) ! p(
= 1jY ;X ); (11) j j V! 2V giving a measure of importance for the covariates of the model. 3.1. Integrated nested Laplace approximations Within hierarchical models with latent Gaussian structures, integrated nested Laplace ap- proximations (INLA) for ecient inference on the posterior distribution (Rue, Martino, and Chopin 2009) can be used. Following the INLA terminology, we de ne z to be the set of latent Gaussian variables and the regression parameters while contains the remaining parameters (a low-dimensional vector). The INLA approach is based on two steps. First the marginal posterior of the hyperparameters is approximated by p(z;;Y ;X;
) p(z;;Y ;X;
) 3=2 p(jY ;X;
) / = +O(T ) : (12) p(zj;Y ;X;
) p ~ (zj;Y ;X;
) z=z () Here, p ~ (zj;Y ;X;
) is the Gaussian approximation of p(zj;Y ;X;
), and z () is the mode of the distribution p(zj;Y ;X;
). The posterior mode of the hyperparameters is found by maximizing the corresponding Laplace approximation using some gradient descent method (like for example the Newton-Raphson routine). Then an area with a relatively high posterior density of the hyperparameters is explored with either some grid-based procedure or variational Bayes. The second step involves the approximation of the latent variables for every set of the explored hyperparameters. Here, the computational complexity of the approximation depends on the likelihood type for the data Y jX . If it is Gaussian, the posterior of the latent variables is Gaussian, and the approximation is exact and fully tractable. In the case the likelihood is skewed or heavy tails are present, a Gaussian approximation of the latent variables tends to become inaccurate and another Laplace approximation should be used: p(z;;Y ;X;
) p ~ (z j;Y ;X;
) / : (13) LA i p ~ (z jz ;;Y ;X;
) z =z (z ;) i i GG i i i Austrian Journal of Statistics 5 Here, p ~ is the Gaussian approximation to p(z jz ;;Y ;X;
) and z (z ;) is the corre- GG i i i 3=2 sponding posterior mode. The error rate of (13) is O(T ). The full Laplace approximation of the latent elds de ned in equation (13) is rather time-consuming, hence more crude lower- order Laplace approximations are often used instead (typically increasing the error rate to O(T ), Tierney and Kadane 1986). Once the posterior distribution of the latent variables given the hyperparameters is approximated, the uncertainty in the hyperparameters can be marginalized out using the law of total probability (Rue et al. 2009): p ~(z jY ;X;
) = p ~ (z j ;Y ;X;
)p ~( jY ;X;
) ; (14) i LA i k k k where is the area weight corresponding to the grid exploration of the posterior distribution of the hyperparameters. Computing the marginal likelihood The marginal likelihood is de ned as follows: For data fY ;Xg and model
, which includes some unknown parameters , the marginal likelihood is given by p(Y jX;
) = p(Y jX;;
)p(j
)d (15) where p(j
) is the prior for under model
while p(Y jX;
;) is the likelihood function conditional on . Again, consider = (;z). INLA approximates marginal likelihoods by p(Y ;z;jX;
) pe(Y jX;
) = dz; (16) ~ (jY ;X;z;
) Z G = (zj
) where (zj
) is some chosen value of , typically the posterior mode, while ~ (jY ;X;z;
) is a Gaussian approximation to (jY ;X;z;
). The integration of z over the support Z can be performed by an empirical Bayes (EB) approximation or using numerical integration based on a central composite design (CCD) or a grid (see Rue et al. 2009, for details). A toy example on computing the marginal likelihood. Consider an example from Neal (2008), in which we assume the following model
: 1 1 Yjz;
N (z; ); zj
N (0; ): (17) 1 0 Then obviously the marginal likelihood is available analytically as 1 1 Yj
N (0; + ); (18) 0 1 and we have a benchmark to compare approximations to. The harmonic mean estima- tor (Raftery, Newton, Satagopan, and Krivitsky 2006) is given by p ~(Yj
) = i=1 p(Yjz ;
) where z p(zjY;
). This estimator is consistent, however, often requires unreasonably many iterations to converge. We performed the experiments with = 1 and being either 0.001, 1 0 0.1 or 10. The harmonic mean is obtained based on W = 10 simulations. 5 runs of the harmonic mean procedure are performed for each scenario. For INLA we used the default tuning parameters from the package (Rue et al. 2009). As one can see from Table 1, INLA gives extremely precise results even for a huge variance of the latent variable, whilst the harmonic mean can often become extremely crude even for 10 iterations. More examples showing the 6 A Bayesian binomial regression model with latent GP for modelling DNA methylation T Exact INLA H.mean 0 1 0.001 1 2 -7.8267 -7.8267 -2.4442 -2.4302 -2.5365 -2.4154 -2.4365 0.1 1 2 -3.2463 -3.2463 -2.3130 -2.3248 -2.5177 -2.4193 -2.3960 10 1 2 -2.9041 -2.9041 -2.9041 -2.9041 -2.9042 -2.9041 -2.9042 Table 1: Comparison of INLA, harmonic mean and exact marginal likelihood accuracy of INLA are summarized in Hubin and Storvik (2016) and Friel and Wyse (2012), which perform a comparison of a number of approaches to computing the marginal likelihood, including Laplace approximations, harmonic mean approximations, Chib's method, Chib and Jelizkov's method and INLA. The studies show that INLA is a fast method that enjoys giving precise approximations to the marginal likelihood even if the number of samples T is limited. 3.2. Mode jumping Markov chain Monte Carlo The main problem with the standard Metropolis-Hastings algorithms is the trade-o between possibilities of large jumps (by which we understand proposals with a large neighbourhood) and high acceptance probabilities. Large jumps will typically result in proposals with low probabilities. In a continuous setting, Tjelmeland and Hegstad (1999) solved this by intro- ducing local optimization after large jumps, which results in proposals with higher acceptance probabilities. Hubin and Storvik (2018) adopted this approach to the discrete model selection setting and suggested the following algorithm: Algorithm 1 Mode jumping MCMC 1: Generate a large jump according to a proposal distribution q ( j
). 0 0 2: Perform a local optimization, de ned through q ( j ). k k 0 3: Perform small randomization to generate the proposal
q (
j ). 4: Generate backwards auxiliary variables q ( j
), q ( j ). 0 l 0 k o k 0 5: Put with probability r (
;
; ; ); mh k 0 k otherwise, where (
)q (
j ) r k r (
;
; ; ) = min 1; : (19) mh k (
)q (
j ) Here, a large jump corresponds to changing a large number of
's while the local optimization will be some iterative procedure based on, at each iteration, changing a small number of components until a local mode is reached. For this algorithm, three proposals need to be speci ed; q (j) specifying the rst large jump, q (j) specifying the local optimizer, and l o q (j) specifying the last randomization. All of them are described in detail in Hubin and Storvik (2018). The convergence of the MJMCMC procedures is shown in Theorem 1 in Hubin and Storvik (2018). 4. Data description The addressed data set consists of 1502 observations from chromosome one of the Arabidop- sis plant belonging to ve prede ned groups of genes. This data set was divided into 950 observations (with more than 2 reads, see Figure 1) for inference and 552 observations (with less than 3 reads) for model-based identi cation of methylation probabilities for the positions with the lack of data. Apart from the observations represented by the methylated versus the total number of reads Austrian Journal of Statistics 7 Figure 1: Epigenetic observations, where blue dots are total number of reads, red dots - number of methylated reads, the green line corresponds to 2 total reads distinguishing the inference and the identi cation data, light blue line gives nave probabilities as rates, brown line - probabilities as the posterior mean of the probability of success parameter from the posterior mode model. we have data on various exogenous variables (covariates). Among these covariates, we address a factor with 3 levels corresponding to whether the location belongs to a CGH, CHH or CHG genetic region, where H is either A, C or T and thus generating two covariates X CGH and X . The second group of factors indicates whether the distance to the previous CHH cytosine nucleobase (C) in DNA is 1, 2, 3, 4, 5, from 6 to 20 or greater than 20 inducing six binary covariates X ; X ; X ; X ; X , and X . We also include such 1D DT 1 DT 2 DT 3 DT 4 DT 5 DT 6:20 distance as a continuous covariate X . The third addressed group of factors corresponds to DIST whether the location belongs to a gene from a particular group of genes of biological interest. These groups are indicated as M , M and M , yielding two additional covariates X ; X . a g M M d a g Additionally, we have a covariate X indicating if the corresponding nucleobase is in the CODE coding region of a gene and a covariate X indicating if the nucleobase is on a " + " or a STRD "