Clustering Matrix Variate Longitudinal Count Data
Subedi, Sanjeena
2023-05-05
Article

Sanjeena Subedi
School of Mathematics and Statistics, Carleton University, Ottawa, ON K1S 5B6, Canada; sanjeena.dang@carleton.ca

Citation: Subedi, S. Clustering Matrix Variate Longitudinal Count Data. Analytics 2023, 2, 426–437. https://doi.org/10.3390/analytics2020024

Received: 24 November 2022; Revised: 28 January 2023; Accepted: 3 April 2023; Published: 5 May 2023

Abstract: Matrix variate longitudinal discrete data can arise in transcriptomics studies when the data are collected for N genes at r conditions over t time points, and thus, each observation Y_n for n = 1, ..., N can be written as an r × t matrix. When dealing with such data, the number of parameters in the model can be greatly reduced by considering the matrix variate structure. The components of the covariance matrix then also provide a meaningful interpretation. In this work, a mixture of matrix variate Poisson-log normal distributions is introduced for clustering longitudinal read counts from RNA-seq studies. To account for the longitudinal nature of the data, a modified Cholesky decomposition is utilized for a component of the covariance structure. Furthermore, a parsimonious family of models is developed by imposing constraints on elements of these decompositions. The models are applied to both real and simulated data, and it is demonstrated that the proposed approach can recover the underlying cluster structure.

Keywords: cluster analysis; RNA-seq data; matrix variate discrete data; longitudinal data; mixture models

1. Introduction

Biological studies are a major source of longitudinal data. When modelling such longitudinal datasets, it is important to take into account the correlations among the measurements at different time points. Gene expression time course studies present important clustering and classification problems. Understanding how different genes are modulated over a period of time during an event of interest can provide key insight into their involvement in various biological pathways [1–4]. Cluster analysis allows us to group genes into clusters with similar patterns, or 'expression profiles', over time.

Model-based clustering approaches have been shown to be effective for clustering a wide variety of biological datasets, such as microarray datasets [5–7], RNA-seq data [8–10], microbiome data [11,12], and flow cytometry data [13,14]. Several studies have utilized cluster analysis to gain novel insights into various biological phenomena, such as the identification of novel tumour subtypes [15,16], understanding disease progression [17,18], and understanding crops' response to abiotic stresses [19,20].

Model-based clustering utilizes finite mixture models [21] for cluster analysis, which assume that a population is a mixture of G subpopulations or components, each of which can be modelled using a probability distribution. A random vector Y is said to arise from a parametric finite mixture distribution if we can write its density in the form

$$f(\mathbf{y} \mid \boldsymbol{\Theta}) = \sum_{g=1}^{G} \pi_g \, f(\mathbf{y} \mid \boldsymbol{\theta}_g), \qquad (1)$$

where π_g ∈ [0, 1], with Σ_{g=1}^G π_g = 1, is the mixing proportion of the gth component; f(y | θ_g) is the density of the gth component; and Θ = (π_1, ..., π_G, θ_1, ..., θ_G) denotes the model parameters.
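To make Equation (1) concrete, the short sketch below evaluates a two-component mixture density on the log scale; it is a minimal illustration with Gaussian components and made-up parameter values, not estimates from this paper.

```r
# Minimal sketch: evaluating the finite mixture density in Equation (1)
# for a two-component univariate Gaussian mixture. All parameter values
# are illustrative.
log_mixture_density <- function(y, pi_g, mu_g, sigma_g) {
  # log f(y | Theta) = log sum_g pi_g f(y | theta_g), computed stably
  # via the log-sum-exp trick.
  log_terms <- sapply(seq_along(pi_g), function(g) {
    log(pi_g[g]) + dnorm(y, mean = mu_g[g], sd = sigma_g[g], log = TRUE)
  })
  m <- max(log_terms)
  m + log(sum(exp(log_terms - m)))
}

log_mixture_density(1.3, pi_g = c(0.4, 0.6), mu_g = c(0, 3), sigma_g = c(1, 1.5))
```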
The choice of an appropriate probability density/mass function depends on the data type.

Various approaches have been developed for the clustering of time course gene expression data (e.g., [6,7,22–25]). However, most statistical approaches for clustering gene expression data are tailored for microarray studies. While some of this could be attributed to RNA-seq technology being more recent than microarrays, the computational cost of fitting the multivariate discrete models needed for RNA-seq data has also posed challenges. The transcriptomics data arising from RNA-seq studies are discrete, high-dimensional, and over-dispersed count data, and analyzing such data efficiently remains a challenge. Because of the restrictive mean–variance relationship of the Poisson distribution, the negative binomial distribution emerged as the univariate distribution of choice [26,27]. However, the multivariate negative binomial distribution [28] is seldom used in practice due to the computational cost of fitting such a model [29].

Recently, [10] proposed mixtures of multivariate Poisson log-normal (MPLN) models for clustering over-dispersed, multivariate count data. In the MPLN model, the counts of the nth gene (with n = 1, 2, ..., N) are modelled using a hierarchical Poisson log-normal distribution such that

$$Y_{nj} \sim \text{Poisson}\big(e^{X_{nj} + \log c_j}\big), \qquad \mathbf{X}_n \sim N_p(\boldsymbol{\mu}, \boldsymbol{\Sigma}),$$

where X_n = (X_{n1}, X_{n2}, ..., X_{np}), N_p(μ, Σ) denotes a p-dimensional Gaussian distribution with mean μ and covariance Σ, and c_j is a known constant representing the normalized library size of the jth sample. This hierarchical structure allows for over-dispersion similar to the negative binomial distribution, but it also allows for correlation between the variables. An efficient framework for parameter estimation for mixtures of MPLN distributions that utilizes a variational Gaussian approximation was developed by [30].

In RNA-seq studies, it is common to obtain expression levels of n genes at r conditions over t occasions in one study. A natural framework for modelling such data is a matrix variate approach, such that each observation Y_n is an r × t matrix. Alternately, a multivariate framework can also be utilized, where each observation Y_n can be written as a vector vec(Y_n) of dimensionality rt = r × t. In [31], the authors developed mixtures of matrix variate Poisson log-normal (MVPLN) distributions. In the MVPLN model,

$$Y_{n,ij} \sim \text{Poisson}\big(e^{X_{n,ij} + \log C_{ij}}\big), \qquad \mathbf{X}_n \sim N_{r \times t}(\mathbf{M}, \boldsymbol{\Phi}, \boldsymbol{\Omega}),$$

where X_n is an r × t matrix, N_{r×t}(M, Φ, Ω) denotes a matrix normal distribution with location matrix M and scale matrices Φ and Ω, and C is an r × t matrix whose entry C_{ij} denotes the normalized library size of the ith condition at the jth time point. Mathematically, the MVPLN model is equivalent to

$$Y_{n,ij} \sim \text{Poisson}\big(e^{X_{n,ij} + \log C_{ij}}\big), \qquad \text{vec}(\mathbf{X}_n') \sim N_{rt}\big(\text{vec}(\mathbf{M}'), \boldsymbol{\Sigma} = \boldsymbol{\Phi} \otimes \boldsymbol{\Omega}\big),$$

where N_{rt} is an rt-dimensional multivariate normal distribution and ⊗ is the Kronecker product. By adopting a matrix variate form, the large rt × rt covariance matrix of the latent variable X_n can be written as the Kronecker product of two smaller r × r and t × t scale matrices Φ and Ω, i.e., Σ_{rt×rt} = Φ_{r×r} ⊗ Ω_{t×t}. This can greatly reduce the number of parameters in Σ.
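The hierarchy above is straightforward to simulate from, which also makes the parameter savings of the Kronecker structure concrete. A minimal sketch (illustrative parameter values; mvrnorm() is from the MASS package):

```r
# Minimal sketch: simulating matrix variate Poisson log-normal counts
# via the equivalent vectorized form vec(X') ~ N(vec(M'), Phi (x) Omega).
# All parameter values are illustrative.
library(MASS)  # for mvrnorm()

set.seed(1)
r <- 3; t <- 4
M     <- matrix(2, r, t)                  # location matrix
Phi   <- diag(r)                          # r x r condition scale matrix
Omega <- 0.5^abs(outer(1:t, 1:t, "-"))    # t x t AR(1)-like time scale matrix
C     <- matrix(1, r, t)                  # normalized library sizes

Sigma <- kronecker(Phi, Omega)            # rt x rt covariance of vec(X')
x_vec <- mvrnorm(1, mu = as.vector(t(M)), Sigma = Sigma)
X     <- matrix(x_vec, r, t, byrow = TRUE)  # back to an r x t matrix
Y     <- matrix(rpois(r * t, lambda = exp(X + log(C))), r, t)

# Parameter savings: an unstructured Sigma has rt(rt+1)/2 = 78 free
# entries here, while Phi and Omega together have only
# r(r+1)/2 + t(t+1)/2 = 16.
```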
In Section 2, we extend the mixtures of the matrix variate Poisson log-normal model for clustering matrix variate longitudinal RNA-seq data by incorporating a modified Cholesky decomposition of the scale matrix Ω that captures the covariances of the t occasions of the latent variable X. Furthermore, by imposing constraints on the components of the scale matrices to be equal or different across groups, a family of eight models is obtained. Parameter estimation is performed using a variational variant of the expectation-maximization algorithm. In Section 3, the proposed models are applied to simulated and real datasets, and Section 4 concludes the paper.

2. Methods

A matrix variate Poisson log-normal distribution was proposed by [31] for modelling RNA-seq data. This arises from a hierarchical mixture of independent Poisson distributions with a matrix variate Gaussian distribution. Suppose Y_1, ..., Y_N are N observations from a matrix variate Poisson log-normal distribution, where the nth observation Y_n is an r × t dimensional matrix representing r conditions and t time points. A matrix variate Poisson log-normal distribution for modelling RNA-seq data can be written as

$$Y_{n,ij} \mid X_{n,ij} \sim \text{Poisson}\big(e^{X_{n,ij} + \log C_{ij}}\big), \qquad \mathbf{X}_n \sim N_{r\times t}(\mathbf{M}, \boldsymbol{\Phi}, \boldsymbol{\Omega}), \qquad (2)$$

where M is an r × t matrix of means, C is an r × t matrix of fixed, known constants accounting for the differences in library sizes across samples, and Φ and Ω are r × r and t × t scale matrices, respectively.

Suppose the least-squares predictor of the unobserved latent variable X_{n,ij} of the nth observation from the ith condition at the jth time point can be written as

$$X_{n,ij} = M_{ij} + \sum_{k=1}^{j-1} \rho_{jk}\,\big(X_{n,ik} - M_{ik}\big) + \sqrt{d_j}\,\epsilon_j, \qquad (3)$$

where X_{n,i1}, X_{n,i2}, ..., X_{n,i(j−1)} are the unobserved latent variables from the preceding j − 1 time points, ε_j ~ N(0, 1), and the ρ_{jk} and d_j are the autoregressive parameters and innovation variances, respectively [32]. Thus, when the responses are recorded over a period of time (i.e., t time points), the parameter Ω that relates to the covariance of the t time points can be parameterized to account for the relationship between measurements at different time points. Now, Ω can be decomposed using the modified Cholesky decomposition [7,23,32] such that TΩT′ = D. This can alternately be written as Ω^{−1} = T′D^{−1}T, where D is a unique diagonal matrix with innovation variances d_1, ..., d_t, and T is a unique lower triangular matrix with 1s on the diagonal and the (negated) autoregressive parameters (−ρ_{jk} for j > k) as the off-diagonal elements:

$$\mathbf{T} = \begin{bmatrix} 1 & 0 & \cdots & 0 \\ -\rho_{21} & 1 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ -\rho_{t1} & -\rho_{t2} & \cdots & 1 \end{bmatrix}.$$

In the context of a G-component mixture of matrix variate Poisson log-normal distributions [31], Equation (1) can be written as

$$f(\mathbf{Y} \mid \boldsymbol{\Theta}) = \sum_{g=1}^{G} \pi_g\, f(\mathbf{Y} \mid \mathbf{M}_g, \boldsymbol{\Phi}_g, \boldsymbol{\Omega}_g),$$

where π_g > 0 is the mixing proportion of the gth component such that Σ_{g=1}^G π_g = 1, f(Y | M_g, Φ_g, Ω_g) is the marginal distribution function of the gth component with parameters M_g, Φ_g, and Ω_g, and Θ denotes all the model parameters.

2.1. Longitudinal Data and Family of Models

Due to the longitudinal nature of the t time points, one can utilize the modified Cholesky decomposition for Ω_g such that Ω_g^{−1} = T_g′ D_g^{−1} T_g. The number of parameters in Ω_g (i.e., t(t + 1)/2) increases quadratically with the number of time points, and this is further compounded in mixture models, as G different Ω_g's need to be estimated.
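The decomposition just described is easy to compute numerically. The sketch below (with an illustrative Ω) recovers T and D from a given Ω via row-wise regressions, mirroring Equation (3), and verifies TΩT′ = D:

```r
# Minimal sketch: modified Cholesky decomposition T Omega T' = D of a
# t x t covariance matrix Omega (illustrative AR(1)-type example).
modified_cholesky <- function(Omega) {
  t_dim <- nrow(Omega)
  T_mat <- diag(t_dim)
  d     <- numeric(t_dim)
  d[1]  <- Omega[1, 1]
  for (j in 2:t_dim) {
    idx <- 1:(j - 1)
    # Autoregressive coefficients: regression of occasion j on 1..(j-1)
    rho <- solve(Omega[idx, idx, drop = FALSE], Omega[idx, j])
    T_mat[j, idx] <- -rho
    # Innovation variance for occasion j
    d[j] <- Omega[j, j] - sum(rho * Omega[idx, j])
  }
  list(T = T_mat, D = diag(d))
}

Omega <- 0.7^abs(outer(1:4, 1:4, "-"))   # illustrative t = 4 scale matrix
res   <- modified_cholesky(Omega)
max(abs(res$T %*% Omega %*% t(res$T) - res$D))  # ~ 0 up to rounding
```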
Thus, similar to [7], constraints can be imposed on T_g and D_g to be equal or different across groups, and an isotropic constraint can also be imposed such that D_g = d_g I, where d_g is a scalar. Various combinations of these constraints result in a family of eight models (see Table 1).

Table 1. The family of eight models obtained by imposing various constraints on the components of Ω_g.

Model   T_g across groups   D_g across groups   D_g diagonal   Total parameters in Ω_1, ..., Ω_G
"VVA"   Variable            Variable            Anisotropic    Gt(t − 1)/2 + Gt
"EVA"   Equal               Variable            Anisotropic    t(t − 1)/2 + Gt
"VEA"   Variable            Equal               Anisotropic    Gt(t − 1)/2 + t
"EEA"   Equal               Equal               Anisotropic    t(t − 1)/2 + t
"VVI"   Variable            Variable            Isotropic      Gt(t − 1)/2 + G
"EVI"   Equal               Variable            Isotropic      t(t − 1)/2 + G
"VEI"   Variable            Equal               Isotropic      Gt(t − 1)/2 + 1
"EEI"   Equal               Equal               Isotropic      t(t − 1)/2 + 1

Parameter estimation is outlined in Section 2.2. In cluster analysis, the best constraint for a specific dataset is also unknown. Selection of the best fitting model among the eight models in the family for a given dataset is performed using a model selection criterion, which is discussed in detail in Section 2.3.

2.2. Parameter Estimation

Parameter estimation for mixture models is typically conducted using the traditional expectation-maximization (EM; [33]) algorithm. In the case of an MVPLN model, this requires computing the posterior expectations E(X | Y) and E(XX′ | Y). However, the posterior distribution of the latent variable (i.e., X | Y) does not have a known form; thus, a Markov chain Monte Carlo expectation-maximization (MCMC-EM) algorithm is typically employed to estimate E(X | Y) and E(XX′ | Y) empirically. Such an approach can be computationally intensive [10,31].

Subedi and Browne [30] developed an efficient parameter estimation algorithm for the multivariate Poisson log-normal distribution using variational approximations [34]. This utilizes a computationally convenient approximating density to approximate a more complex but 'true' posterior density through minimization of the Kullback–Leibler (KL) divergence between the true and the approximating densities. Suppose we have an approximating density q(X); then, the marginal log of the probability mass function can be written as log f(Y) = F(q, Y) + D_KL(q ∥ f), where

$$D_{KL}(q \,\|\, f) = \int q(\mathbf{X}) \log \frac{q(\mathbf{X})}{f(\mathbf{X} \mid \mathbf{Y})}\, d\mathbf{X}$$

is the KL divergence between f(X | Y) and the approximating distribution q(X), and

$$F(q, \mathbf{Y}) = \int \big[\log f(\mathbf{Y}, \mathbf{X}) - \log q(\mathbf{X})\big]\, q(\mathbf{X})\, d\mathbf{X}$$

is our evidence lower bound (ELBO). A Gaussian density is utilized as the approximating density for variational Gaussian approximations.
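To make the bound concrete in the simplest setting, the sketch below computes the ELBO for a single univariate Poisson-log normal observation with a Gaussian approximating density q(x) = N(ξ, s²). This is a scalar analogue of the matrix variate bound used below, written out from the definitions above; all numeric values are illustrative.

```r
# Minimal sketch: ELBO F(q, y) for one univariate Poisson-log normal
# observation, Y ~ Poisson(e^{X + log c}) with X ~ N(mu, sigma2), and
# variational Gaussian q(x) = N(xi, s2). Values are illustrative.
elbo_pln <- function(y, c, mu, sigma2, xi, s2) {
  # E_q[log f(y | x)]: Poisson part, using E_q[e^x] = exp(xi + s2/2)
  e_loglik <- -c * exp(xi + s2 / 2) + (xi + log(c)) * y - lfactorial(y)
  # E_q[log f(x)]: Gaussian prior part
  e_prior  <- -0.5 * log(2 * pi * sigma2) - ((xi - mu)^2 + s2) / (2 * sigma2)
  # -E_q[log q(x)]: entropy of the approximating Gaussian
  entropy  <- 0.5 * log(2 * pi * s2) + 0.5
  e_loglik + e_prior + entropy
}

# The ELBO never exceeds the true log mass; maximizing it over (xi, s2)
# tightens the bound for this observation.
elbo_pln(y = 4, c = 1, mu = 1, sigma2 = 0.5, xi = 1.2, s2 = 0.2)
```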
Similar to [31], assuming q(X_ng) = N_{r×t}(ξ_ng, Δ_ng, Ψ_ng), the ELBO for each observation Y_n from the gth component can be written as

$$
\begin{aligned}
F(q_{ng}, \mathbf{Y}_n) ={}& -\sum_{i=1}^{r}\sum_{j=1}^{t} \exp\Big\{ (\xi_{ng})_{ij} + \tfrac{1}{2}(\Delta_{ng})_{ii}(\Psi_{ng})_{jj} + \log C_{ij} \Big\} \\
& + \big[\text{vec}(\xi_{ng}') + \log \text{vec}(\mathbf{C}')\big]' \text{vec}(\mathbf{Y}_n') - \sum_{k=1}^{rt} \log\big(\text{vec}(\mathbf{Y}_n')_k!\big) - \tfrac{t}{2}\log|\boldsymbol{\Phi}_g| + \tfrac{r}{2}\log\big|\mathbf{T}_g' \mathbf{D}_g^{-1} \mathbf{T}_g\big| \\
& - \tfrac{1}{2}\big[\text{vec}(\xi_{ng}') - \text{vec}(\mathbf{M}_g')\big]' \big[\boldsymbol{\Phi}_g^{-1} \otimes \big(\mathbf{T}_g' \mathbf{D}_g^{-1} \mathbf{T}_g\big)\big] \big[\text{vec}(\xi_{ng}') - \text{vec}(\mathbf{M}_g')\big] \\
& - \tfrac{1}{2}\,\text{tr}\big\{\boldsymbol{\Phi}_g^{-1}\boldsymbol{\Delta}_{ng}\big\}\, \text{tr}\big\{\mathbf{T}_g' \mathbf{D}_g^{-1} \mathbf{T}_g \boldsymbol{\Psi}_{ng}\big\} + \tfrac{t}{2}\log|\boldsymbol{\Delta}_{ng}| + \tfrac{r}{2}\log|\boldsymbol{\Psi}_{ng}| + \tfrac{rt}{2}.
\end{aligned}
$$

Thus, the complete-data log-likelihood for the mixtures of MVPLN distributions can be written as

$$
l_c(\boldsymbol{\Theta}) = \sum_{g=1}^{G}\sum_{n=1}^{N} z_{ng}\log \pi_g + \sum_{g=1}^{G}\sum_{n=1}^{N} z_{ng}\log f(\mathbf{Y}_n \mid \mathbf{M}_g, \boldsymbol{\Phi}_g, \mathbf{T}_g, \mathbf{D}_g) = \sum_{g=1}^{G}\sum_{n=1}^{N} z_{ng}\log \pi_g + \sum_{g=1}^{G}\sum_{n=1}^{N} z_{ng}\big[F(\mathbf{Y}_n, q_{ng}) + D_{KL}(q_{ng}\,\|\,f_{ng})\big],
$$

where D_KL(q_ng ∥ f_ng) = ∫ q(X_ng) log [q(X_ng)/f(X_n | Y_n, Z_ng = 1)] dX_ng is the KL divergence between f(X_n | Y_n, Z_ng = 1) and the approximating distribution q(X_ng). As the variational parameters that maximize the ELBO also minimize the KL divergence, the estimates of the model parameters are obtained by maximizing the variational approximation of the complete-data log-likelihood using the ELBO, i.e.,

$$l_c(\boldsymbol{\Theta}) \approx \sum_{g=1}^{G}\sum_{n=1}^{N} z_{ng}\log \pi_g + \sum_{g=1}^{G}\sum_{n=1}^{N} z_{ng}\,F(\mathbf{Y}_n, q_{ng}).$$

Similar to [31], an iterative EM-type algorithm is utilized for parameter estimation. At the (k + 1)th step:

1. Conditional on the variational parameters ξ_ng, Δ_ng, and Ψ_ng and on M_g, Φ_g, T_g, and D_g, the expectation E(Z_ng) is given by

$$E(Z_{ng} \mid \mathbf{Y}_n) = \frac{\pi_g\, f(\mathbf{Y}_n \mid \mathbf{M}_g, \boldsymbol{\Phi}_g, \mathbf{T}_g, \mathbf{D}_g)}{\sum_{h=1}^{G} \pi_h\, f(\mathbf{Y}_n \mid \mathbf{M}_h, \boldsymbol{\Phi}_h, \mathbf{T}_h, \mathbf{D}_h)}.$$

As the marginal distribution of Y_n is difficult to compute, we approximate E(Z_ng) using the ELBO such that

$$Z_{ng}^{(k+1)} \overset{\text{def}}{=} \frac{\pi_g \exp\big[F(q_{ng}, \mathbf{Y}_n)\big]}{\sum_{h=1}^{G} \pi_h \exp\big[F(q_{nh}, \mathbf{Y}_n)\big]}. \qquad (4)$$

2. Given Z_ng^{(k+1)}, the variational parameters ξ_ng, Δ_ng, and Ψ_ng are updated conditional on M_g^{(k)}, Φ_g^{(k)}, T_g^{(k)}, and D_g^{(k)}.

(a) A fixed-point method is used for updating Δ_ng:

$$\boldsymbol{\Delta}_{ng}^{(k+1)} = t\Big[\mathbf{I}_{r\times r} \circ \Big(\exp\Big\{\xi_{ng}^{(k)} + \log \mathbf{C} + \tfrac{1}{2}\,\text{diag}\big(\boldsymbol{\Delta}_{ng}^{(k)}\big)\,\text{diag}\big(\boldsymbol{\Psi}_{ng}^{(k)}\big)'\Big\}\,\text{diag}\big(\boldsymbol{\Psi}_{ng}^{(k)}\big)\,\mathbf{1}_r'\Big) + \boldsymbol{\Phi}_g^{-1(k)}\,\text{tr}\Big\{\mathbf{T}_g^{(k)\prime}\mathbf{D}_g^{-1(k)}\mathbf{T}_g^{(k)}\boldsymbol{\Psi}_{ng}^{(k)}\Big\}\Big]^{-1},$$

where exp{·} denotes the elementwise exponential, diag(Ψ) = (Ψ_11, ..., Ψ_tt)′ collects the diagonal elements of the t × t matrix Ψ into a t-dimensional vector (and analogously for the r × r matrix Δ), 1_r is an r-dimensional vector of ones, and ∘ is the Hadamard product.

(b) A fixed-point method is used for updating Ψ_ng:

$$\boldsymbol{\Psi}_{ng}^{(k+1)} = r\Big[\mathbf{I}_{t\times t} \circ \Big(\exp\Big\{\xi_{ng}^{(k)} + \log \mathbf{C} + \tfrac{1}{2}\,\text{diag}\big(\boldsymbol{\Delta}_{ng}^{(k+1)}\big)\,\text{diag}\big(\boldsymbol{\Psi}_{ng}^{(k)}\big)'\Big\}'\,\text{diag}\big(\boldsymbol{\Delta}_{ng}^{(k+1)}\big)\,\mathbf{1}_t'\Big) + \mathbf{T}_g^{(k)\prime}\mathbf{D}_g^{-1(k)}\mathbf{T}_g^{(k)}\,\text{tr}\Big\{\boldsymbol{\Phi}_g^{-1(k)}\boldsymbol{\Delta}_{ng}^{(k+1)}\Big\}\Big]^{-1}.$$

(c) Newton's method is used to update ξ_ng:

$$\text{vec}\big(\xi_{ng}^{\prime(k+1)}\big) = \text{vec}\big(\xi_{ng}^{\prime(k)}\big) - \boldsymbol{\Upsilon}_{ng}^{(k+1)}\Big[\exp\Big\{\log \text{vec}(\mathbf{C}') + \text{vec}\big(\xi_{ng}^{\prime(k)}\big) + \tfrac{1}{2}\,\text{diag}\big(\boldsymbol{\Upsilon}_{ng}^{(k+1)}\big)\Big\} + \boldsymbol{\Upsilon}_{ng}^{-1(k+1)}\Big(\text{vec}\big(\xi_{ng}^{\prime(k)}\big) - \text{vec}\big(\mathbf{M}_g^{\prime(k)}\big)\Big) - \text{vec}(\mathbf{Y}_n')\Big],$$

where Υ_ng^{(k+1)} = Δ_ng^{(k+1)} ⊗ Ψ_ng^{(k+1)}.

3. Given Z_ng^{(k+1)} and the variational parameters ξ_ng^{(k+1)}, Δ_ng^{(k+1)}, and Ψ_ng^{(k+1)}, the updates of the model parameters π_g, M_g, and Φ_g are obtained as

$$\pi_g^{(k+1)} = \frac{\sum_{n=1}^{N} Z_{ng}^{(k+1)}}{N}, \qquad \mathbf{M}_g^{(k+1)} = \frac{\sum_{n=1}^{N} Z_{ng}^{(k+1)}\, \xi_{ng}^{(k+1)}}{\sum_{n=1}^{N} Z_{ng}^{(k+1)}},$$

$$\boldsymbol{\Phi}_g^{(k+1)} = \frac{\sum_{n=1}^{N} Z_{ng}^{(k+1)}\Big[\big(\xi_{ng}^{(k+1)} - \mathbf{M}_g^{(k+1)}\big)\,\mathbf{T}_g^{(k)\prime}\mathbf{D}_g^{-1(k)}\mathbf{T}_g^{(k)}\big(\xi_{ng}^{(k+1)} - \mathbf{M}_g^{(k+1)}\big)' + \boldsymbol{\Delta}_{ng}^{(k+1)}\,\text{tr}\big\{\mathbf{T}_g^{(k)\prime}\mathbf{D}_g^{-1(k)}\mathbf{T}_g^{(k)}\boldsymbol{\Psi}_{ng}^{(k+1)}\big\}\Big]}{t\sum_{n=1}^{N} Z_{ng}^{(k+1)}}.$$
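The responsibility update of Equation (4) in step 1 is a softmax over components; in practice it is computed from log π_g + F(q_ng, Y_n) with the log-sum-exp trick to avoid underflow. A minimal sketch (the ELBO values below are placeholders for those computed in step 1):

```r
# Minimal sketch of the E-step in Equation (4): responsibilities from
# mixing proportions and per-component ELBO values. `elbo` is an N x G
# matrix with elbo[n, g] = F(q_ng, Y_n); values here are placeholders.
update_responsibilities <- function(log_pi, elbo) {
  log_num <- sweep(elbo, 2, log_pi, "+")        # log pi_g + F(q_ng, Y_n)
  log_den <- apply(log_num, 1, function(v) {    # log-sum-exp over g
    m <- max(v); m + log(sum(exp(v - m)))
  })
  exp(log_num - log_den)                        # each row sums to 1
}

elbo <- matrix(c(-210.2, -214.8,
                 -198.7, -197.9), nrow = 2, byrow = TRUE)
update_responsibilities(log_pi = log(c(0.5, 0.5)), elbo = elbo)
```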
Estimates of the model parameters T_g and D_g can be obtained by maximizing the approximation of the log-likelihood with respect to T_g and D_g. If we define S_g^{(k+1)} as

$$\mathbf{S}_g^{(k+1)} = \frac{\sum_{n=1}^{N} Z_{ng}^{(k+1)}\big(\xi_{ng}^{(k+1)} - \mathbf{M}_g^{(k+1)}\big)'\,\boldsymbol{\Phi}_g^{-1(k+1)}\big(\xi_{ng}^{(k+1)} - \mathbf{M}_g^{(k+1)}\big)}{r\sum_{n=1}^{N} Z_{ng}^{(k+1)}} + \frac{\sum_{n=1}^{N} Z_{ng}^{(k+1)}\,\boldsymbol{\Psi}_{ng}^{(k+1)}\,\text{tr}\big\{\boldsymbol{\Phi}_g^{-1(k+1)}\boldsymbol{\Delta}_{ng}^{(k+1)}\big\}}{r\sum_{n=1}^{N} Z_{ng}^{(k+1)}}, \qquad (5)$$

the estimates of T_g and D_g are analogous to [7]. The elements of T_g can be estimated, for j = 2, ..., t, by solving

$$\begin{bmatrix} \hat{\rho}_{j1}^{(g)} \\ \hat{\rho}_{j2}^{(g)} \\ \vdots \\ \hat{\rho}_{j,j-1}^{(g)} \end{bmatrix} = \begin{bmatrix} S_{11}^{(g)} & S_{21}^{(g)} & \cdots & S_{j-1,1}^{(g)} \\ S_{12}^{(g)} & S_{22}^{(g)} & \cdots & S_{j-1,2}^{(g)} \\ \vdots & \vdots & \ddots & \vdots \\ S_{1,j-1}^{(g)} & S_{2,j-1}^{(g)} & \cdots & S_{j-1,j-1}^{(g)} \end{bmatrix}^{-1} \begin{bmatrix} S_{j1}^{(g)} \\ S_{j2}^{(g)} \\ \vdots \\ S_{j,j-1}^{(g)} \end{bmatrix},$$

and the update for D_g can be obtained as D̂_g = T̂_g S_g T̂_g′. Similarly, with S_g as defined in Equation (5), the updates of T_g and D_g are analogous to [7] for all eight models with the various constraints, and thus the R package longclust [35] is utilized for the covariance decomposition.

Convergence of the iterative EM-type approach was determined using the Aitken acceleration-based [36] criterion, which computes an asymptotic estimate of the log-likelihood at each iteration and assumes that the algorithm has converged when the successive difference in the asymptotic estimates of the log-likelihood is less than ε [37]. Here, we used ε = 0.05.

2.3. Model Selection and Performance Assessment

In clustering, the true number of components is typically unknown. Additionally, the best constraint for the covariance structure is also unknown. It is common practice to fit the model for a range of G for all possible models and select the best fitting model a posteriori using a model selection criterion. The Bayesian information criterion (BIC; [38]) remains the most widely used criterion. In our case, similar to [30], we use an approximation of the BIC defined as

$$\text{BIC} = 2 \log \tilde{L} - p \log N,$$

where p is the number of parameters, N is the sample size, and log L̃ is the approximation of the maximized log-likelihood using the ELBO. This approximation is computationally efficient, as the marginal probability mass function does not have a closed form with which to compute the maximized log-likelihood.

When the true number of clusters is known, assessment of the clustering performance can be conducted using the adjusted Rand index (ARI; [39]). The ARI measures the agreement between the true and predicted labels and adjusts for agreement by chance: the ARI for perfect agreement is 1, and the expected value of the ARI under random classification is 0.
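A minimal sketch of this selection step: fit every (model, G) combination, compute BIC = 2 log L̃ − p log N from the ELBO-approximated log-likelihood, and keep the maximizer; with known labels, agreement can be checked with adjustedRandIndex() from the mclust package. The `fits` object and its fields below are hypothetical stand-ins for whatever a fitting routine returns.

```r
# Minimal sketch: BIC-based model selection and ARI assessment.
# `fits` is a hypothetical list of fitted models, each carrying an
# ELBO-approximated maximized log-likelihood `loglik_elbo`, a parameter
# count `npar`, and cluster labels `labels`. Numbers are fabricated.
library(mclust)  # for adjustedRandIndex()

select_by_bic <- function(fits, N) {
  bic <- sapply(fits, function(fit) 2 * fit$loglik_elbo - fit$npar * log(N))
  fits[[which.max(bic)]]  # larger BIC is better under this convention
}

fits <- list(
  list(model = "EEA", G = 2, loglik_elbo = -5230, npar = 40, labels = rep(1:2, 50)),
  list(model = "VVA", G = 2, loglik_elbo = -5198, npar = 58, labels = rep(1:2, 50))
)
best <- select_by_bic(fits, N = 100)
true_labels <- rep(1:2, 50)
adjustedRandIndex(true_labels, best$labels)   # 1 indicates perfect agreement
```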
3. Results

3.1. Simulation Studies

Simulation studies were conducted to demonstrate the clustering performance of the proposed family of models and to show parameter recovery.

3.1.1. Scenario 1

In the first scenario, 25 datasets, each of size n = 1000, were generated from a two-component fully unconstrained "VVA" model. Here, each observation in the dataset is a 3 × 4 (i.e., r = 3 and t = 4) matrix. The parameters used to generate the datasets are summarized in Table 2.

Table 2. True parameters along with the estimated means and standard deviations of the model parameters for Scenario 1 from all 25 datasets. Matrices are written row by row, with rows separated by semicolons.

M_1:
  true [6.20 6.20 6.20 6.20; 6.20 6.20 6.20 6.20; 6.20 6.20 6.20 6.20]
  mean [6.19 6.21 6.20 6.19; 6.21 6.20 6.19 6.20; 6.19 6.21 6.20 6.21]
  SD   [0.05 0.05 0.07 0.07; 0.06 0.07 0.05 0.08; 0.05 0.07 0.06 0.05]

M_2:
  true [1.50 1.50 1.50 1.50; 1.50 1.50 1.50 1.50; 1.50 1.50 1.50 1.50]
  mean [1.82 1.68 1.63 1.63; 1.62 1.78 1.66 1.64; 1.65 1.62 1.78 1.67]
  SD   [0.13 0.08 0.08 0.07; 0.08 0.13 0.07 0.08; 0.09 0.05 0.12 0.09]

Phi_1:
  true [1.00 0.00 0.00; 0.00 1.46 0.00; 0.00 0.00 1.44]
  mean [1.00 0.01 0.01; 0.01 1.46 0.00; 0.01 0.00 1.46]
  SD   [0.00 0.03 0.02; 0.03 0.05 0.03; 0.02 0.03 0.05]

Phi_2:
  true [1.00 0.00 0.00; 0.00 0.82 0.00; 0.00 0.00 0.90]
  mean [1.00 0.00 0.00; 0.00 0.82 0.01; 0.00 0.01 0.89]
  SD   [0.00 0.03 0.04; 0.03 0.06 0.03; 0.04 0.03 0.06]

T_1 (lower triangle):
  true [1.00; 0.75 1.00; 0.25 0.75 1.00; 0.05 0.25 0.75 1.00]
  mean [1.00; 0.75 1.00; 0.25 0.75 1.00; 0.05 0.25 0.76 1.00]
  SD   [0.00; 0.03 0.00; 0.02 0.03 0.00; 0.02 0.02 0.02 0.00]

T_2 (lower triangle):
  true [1.00; 0.20 1.00; 0.05 0.30 1.00; 0.01 0.02 0.35 1.00]
  mean [1.00; 0.24 1.00; 0.08 0.39 1.00; 0.03 0.08 0.44 1.00]
  SD   [0.00; 0.02 0.00; 0.02 0.05 0.00; 0.02 0.04 0.06 0.00]

D_1 (diagonal entries):
  true [0.53 0.74 1.15 1.82]; mean [0.53 0.75 1.15 1.85]; SD [0.02 0.04 0.06 0.08]

D_2 (diagonal entries):
  true [0.10 0.45 0.47 0.33]; mean [0.14 0.62 0.67 0.48]; SD [0.01 0.06 0.06 0.03]

All eight models with G = 1 to G = 5 were fitted, and the BIC was used for model selection. In all 25 datasets, the approach selected a two-component "VVA" model with an ARI of 1.00 ± 0.00. The means of the estimated parameters along with their standard errors are also summarized in Table 2. The estimated parameters are close to the true parameters.
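A sketch of how one such dataset could be generated under the "VVA" data-generating process is given below; the parameter values echo the printed entries of Table 2, with Ω_g assembled from T_g and D_g via Ω_g = T_g^{-1} D_g (T_g^{-1})′ and the library sizes C taken as all ones for illustration.

```r
# Minimal sketch: generating one two-component matrix variate PLN
# dataset in the spirit of Scenario 1 (r = 3, t = 4, n = 1000).
library(MASS)  # for mvrnorm()

set.seed(42)
r <- 3; t <- 4; n <- 1000
pi_g <- c(0.5, 0.5)
M    <- list(matrix(6.2, r, t), matrix(1.5, r, t))
Phi  <- list(diag(c(1.00, 1.46, 1.44)), diag(c(1.00, 0.82, 0.90)))
# Unit lower triangular T_g; lower.tri() fills column by column:
# entries (2,1),(3,1),(4,1),(3,2),(4,2),(4,3).
T1 <- diag(t); T1[lower.tri(T1)] <- c(0.75, 0.25, 0.05, 0.75, 0.25, 0.75)
T2 <- diag(t); T2[lower.tri(T2)] <- c(0.20, 0.05, 0.01, 0.30, 0.02, 0.35)
Dg <- list(diag(c(0.53, 0.74, 1.15, 1.82)), diag(c(0.10, 0.45, 0.47, 0.33)))
Omega <- lapply(1:2, function(g) {
  Tg <- list(T1, T2)[[g]]
  solve(Tg) %*% Dg[[g]] %*% t(solve(Tg))   # Omega = T^{-1} D T'^{-1}
})

z <- sample(1:2, n, replace = TRUE, prob = pi_g)
Y <- lapply(seq_len(n), function(i) {
  g <- z[i]
  x <- mvrnorm(1, as.vector(t(M[[g]])), kronecker(Phi[[g]], Omega[[g]]))
  X <- matrix(x, r, t, byrow = TRUE)
  matrix(rpois(r * t, exp(X)), r, t)       # C taken as all ones here
})
```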
3.1.2. Scenario 2

In the second scenario, 25 datasets, each of size n = 1000, were generated from a two-component model with the same set of parameters as Scenario 1 but with a constraint on D_g such that D_1 = ... = D_G = dI (i.e., a "VEI" model). Again, each observation in the dataset is a 3 × 4 (i.e., r = 3 and t = 4) matrix. All eight models with G = 1 to G = 5 were fitted, and the BIC was used for model selection. In all 25 datasets, the approach selected a two-component model with an average ARI of 1.00 ± 0.00. In 11 of the 25 datasets, a "VEI" model was selected, and in 14 of the 25 datasets, a "VVI" model was selected. Note that a "VVI" model also assumes an isotropic constraint on D_g such that D_g = d_g I, but in a "VVI" model, d_g varies across groups. The means of the estimated parameters along with their standard errors from the datasets in which a two-component "VEI" model was selected are summarized in Table 3. The estimated parameters are close to the true parameters.

Table 3. True parameters along with the estimated means and standard deviations of the model parameters for Scenario 2 from the datasets in which the "VEI" model was selected. Matrices are written row by row, with rows separated by semicolons.

M_1:
  true [6.20 6.20 6.20 6.20; 6.20 6.20 6.20 6.20; 6.20 6.20 6.20 6.20]
  mean [6.20 6.22 6.23 6.17; 6.19 6.23 6.15 6.18; 6.22 6.20 6.21 6.23]
  SD   [0.05 0.07 0.11 0.09; 0.05 0.07 0.11 0.14; 0.04 0.08 0.07 0.11]

M_2:
  true [1.50 1.50 1.50 1.50; 1.50 1.50 1.50 1.50; 1.50 1.50 1.50 1.50]
  mean [1.60 1.63 1.62 1.59; 1.62 1.59 1.60 1.61; 1.61 1.61 1.60 1.62]
  SD   [0.08 0.07 0.08 0.07; 0.09 0.07 0.06 0.08; 0.09 0.05 0.04 0.10]

Phi_1:
  true [1.00 0.00 0.00; 0.00 1.46 0.00; 0.00 0.00 1.44]
  mean [1.00 0.01 0.00; 0.01 1.47 0.01; 0.00 0.01 1.49]
  SD   [0.00 0.02 0.02; 0.02 0.06 0.03; 0.02 0.03 0.07]

Phi_2:
  true [1.00 0.00 0.00; 0.00 0.82 0.00; 0.00 0.00 0.90]
  mean [1.00 0.00 0.00; 0.00 0.77 0.00; 0.00 0.00 0.84]
  SD   [0.00 0.02 0.05; 0.02 0.03 0.02; 0.05 0.02 0.03]

T_1 (lower triangle):
  true [1.00; 0.75 1.00; 0.25 0.75 1.00; 0.05 0.25 0.75 1.00]
  mean [1.00; 0.76 1.00; 0.25 0.74 1.00; 0.05 0.25 0.76 1.00]
  SD   [0.00; 0.02 0.00; 0.03 0.03 0.00; 0.04 0.03 0.03 0.00]

T_2 (lower triangle):
  true [1.00; 0.20 1.00; 0.05 0.30 1.00; 0.01 0.02 0.35 1.00]
  mean [1.00; 0.24 1.00; 0.07 0.37 1.00; 0.04 0.07 0.41 1.00]
  SD   [0.00; 0.04 0.00; 0.05 0.04 0.00; 0.03 0.05 0.05 0.00]

D_1 = D_2 (diagonal entries):
  true [0.45 0.45 0.45 0.45]; mean [0.50 0.50 0.50 0.50]; SD [0.02 0.02 0.02 0.02]

3.1.3. Comparison with Other Approaches

The performance of the proposed models was compared to other mixtures of discrete distributions. Since other approaches for matrix variate discrete data were not available, the matrix variate data were first converted to multivariate data before comparison. Two model-based clustering techniques for RNA-seq data were used: HTSCluster [9,40,41], which fits a mixture of Poisson distributions, and MBCluster.Seq [8,42], which fits a mixture of negative binomial distributions. The performance of the proposed approach and of the two competing approaches on the simulated datasets from both scenarios is summarized in Table 4. Both HTSCluster and MBCluster.Seq failed to recover the underlying cluster structure in either scenario. This could be partly because both approaches are mixtures of independent univariate distributions, and in the presence of covariance, their performance suffers. This is in line with the findings of [30]. Through simulation studies, ref. [30] previously showed that when a dataset is generated from mixtures of independent Poisson distributions, HTSCluster can recover the underlying cluster structure; however, in the presence of over-dispersion (i.e., when the data are generated from a model such as a multivariate Poisson log-normal distribution or a negative binomial distribution), the performance of HTSCluster suffers.

Table 4. Comparison of the performance of the proposed approach (longitudinal MVPLN) with HTSCluster and MBCluster.Seq on simulated datasets from both Scenario 1 and Scenario 2.

Simulation Scenario 1
Approach        G selected (# of times)   Average ARI (SD)   Time in minutes, average (SD)
Long. MVPLN     2 (25)                    1.000 (0.000)      76.590 (20.970)
HTSCluster      5 (25)                    0.002 (0.005)      0.057 (0.010)
MBCluster.Seq   4 (1), 5 (24)             0.000 (0.002)      0.237 (0.004)

Simulation Scenario 2
Approach        G selected (# of times)   Average ARI (SD)   Time in minutes, average (SD)
Long. MVPLN     2 (25)                    1.000 (0.000)      78.928 (7.886)
HTSCluster      5 (25)                    0.011 (0.012)      0.047 (0.007)
MBCluster.Seq   5 (25)                    0.000 (0.004)      0.237 (0.003)
3.2. Transcriptomics Data Analysis

The proposed approach was used to cluster the transcriptomics dataset fission from the R package fission, available through Bioconductor. The dataset was originally published by [43]. The study consists of a time course RNA-seq experiment of fission yeast in response to oxidative stress (1 M sorbitol treatment) at 0, 15, 30, 60, 120, and 180 min for two types of yeast: wild type and mutant type (atf21Δ strain). Thus, the measurements for each observation can be written using matrix notation such that the two types of yeast are treated as rows (i.e., r = 2) and the time points are treated as columns (i.e., t = 6). We treat the developmental stages as longitudinal.

For cluster analysis, we focused on the subset of differentially expressed genes provided in the Supplementary Material of [43]. Genes were considered differentially expressed if their mean expression level differed in at least one time point relative to the unstressed reference, and multiple testing correction was performed to keep the overall FDR below 5%. A total of 3169 genes (out of 5957) were differentially expressed in the wild type yeast, and 3044 genes were differentially expressed in the atf21Δ strain. For our analysis, a gene was included if it was differentially expressed in both the wild type and the atf21Δ strain; thus, a total of 2476 genes were included.

All eight models were fitted to the dataset for G = 1 to G = 20, and the best fitting model was selected by the BIC. A G = 17 component "EVA" model with a constrained T (i.e., T_1 = T_2 = ... = T_17) and unconstrained anisotropic D_g was selected. The constrained T suggests that the correlation structure (i.e., the autoregressive relationship) among the developmental stages is the same for all groups. However, the unconstrained anisotropic D_g suggests that the variances at the developmental stages differ, and that they vary from cluster to cluster. A visualization of the log-transformed expression values of the genes in each group along with the mean-expression trends is shown in Figure 1. As seen in Figure 1, the clusters have distinctive mean-expression trends.

[Figure 1 near here: 17 panels (Cluster 1 through Cluster 17) of log-expression against time (0–180 min), with separate trend lines for the mutant and wild type strains.]

Figure 1. Visualization of the log-transformed expression values along with the mean-expression values for the two yeast types (solid black line for the mutant and dashed black line for the wild type) for all 17 clusters of the transcriptomics dataset.
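For readers who wish to reproduce the preprocessing, the sketch below loads the fission data and reshapes one gene's counts into the 2 × 6 matrix form used above. It assumes the Bioconductor package's documented structure (a SummarizedExperiment with colData columns strain and minute); treat the column names, and the averaging over replicates, as illustrative assumptions to verify against the package documentation.

```r
# Minimal sketch: reshaping the fission counts into r x t matrices
# (r = 2 strains, t = 6 time points). Assumes the Bioconductor package
# `fission` exposes a SummarizedExperiment with colData columns
# `strain` and `minute`; verify these names before use.
library(SummarizedExperiment)
library(fission)

data(fission)
counts <- assay(fission)            # genes x samples count matrix
strain <- colData(fission)$strain
minute <- colData(fission)$minute

to_matrix <- function(gene_row) {
  # One 2 x 6 matrix: rows = strains, columns = ordered time points.
  # Replicates are averaged here purely for illustration.
  tab <- tapply(gene_row, list(strain, minute), mean)
  tab[, order(as.numeric(colnames(tab)))]
}
Y1 <- to_matrix(counts[1, ])
```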
4. Conclusions

A novel family of matrix variate Poisson log-normal mixture models was developed for clustering longitudinal transcriptomics data. The approach utilizes a modified Cholesky decomposition of a component of the covariance matrices of the latent variable, and constraints imposed on various components of this decomposition result in a family of eight models. The performance of the proposed approach was illustrated using both simulated and real datasets, where the proposed approach showed good clustering performance.

One limitation of the proposed approach is that it assumes that measurements are taken at the same fixed intervals for all observations, which can be restrictive. Future work will focus on extending these models to allow for varying interval lengths between observations. Furthermore, time is continuous; hence, discretizing it can result in a loss of information. Work will also focus on developing a modelling framework that treats time as a continuous variable.

Author Contributions: S.S. developed the method, conducted the analysis, and wrote the manuscript. The author has read and agreed to the published version of the manuscript.

Funding: This work was supported by a Discovery Grant from the Natural Sciences and Engineering Research Council of Canada (NSERC) and the Canada Research Chairs program.

Institutional Review Board Statement: Not applicable.

Informed Consent Statement: Not applicable.

Data Availability Statement: The dataset used in the manuscript is publicly available through the R package fission.

Acknowledgments: This research was enabled in part by support provided by Research Computing Services (https://carleton.ca/rcs) at Carleton University.

Conflicts of Interest: The author declares no conflict of interest.

Abbreviations

The following abbreviations are used in this manuscript:

ARI       adjusted Rand index
BIC       Bayesian information criterion
ELBO      evidence lower bound
EM        expectation-maximization
KL        Kullback–Leibler
MCMC-EM   Markov chain Monte Carlo expectation-maximization
MPLN      multivariate Poisson lognormal
MVPLN     matrix variate Poisson lognormal

References

1. Spellman, P.T.; Sherlock, G.; Zhang, M.Q.; Iyer, V.R.; Anders, K.; Eisen, M.B.; Brown, P.O.; Botstein, D.; Futcher, B. Comprehensive identification of cell cycle-regulated genes of the yeast Saccharomyces cerevisiae by microarray hybridization. Mol. Biol. Cell 1998, 9, 3273–3297.
2. Lee, C.W.; Stabile, E.; Kinnaird, T.; Shou, M.; Devaney, J.M.; Epstein, S.E.; Burnett, M.S. Temporal patterns of gene expression after acute hindlimb ischemia in mice: Insights into the genomic program for collateral vessel development. J. Am. Coll. Cardiol. 2004, 43, 474–482.
3. Louis, E.; Raue, U.; Yang, Y.; Jemiolo, B.; Trappe, S. Time course of proteolytic, cytokine, and myostatin gene expression after acute exercise in human skeletal muscle. J. Appl. Physiol. 2007, 103, 1744–1751.
4. Li, Y.; Li, M.; Tan, L.; Huang, S.; Zhao, L.; Tang, T.; Liu, J.; Zhao, Z. Analysis of time-course gene expression profiles of a periodontal ligament tissue model under compression. Arch. Oral Biol. 2013, 58, 511–522.
5. McLachlan, G.J.; Bean, R.W.; Peel, D. A mixture model-based approach to the clustering of microarray expression data. Bioinformatics 2002, 18, 413–422.
6. Inoue, L.Y.; Neira, M.; Nelson, C.; Gleave, M.; Etzioni, R. Cluster-based network model for time-course gene expression data. Biostatistics 2007, 8, 507–525.
7. McNicholas, P.D.; Murphy, T.B. Model-based clustering of longitudinal data. Can. J. Stat. 2010, 38, 153–168.
8. Si, Y.; Liu, P.; Li, P.; Brutnell, T.P. Model-based clustering for RNA-seq data. Bioinformatics 2014, 30, 197–205.
9. Rau, A.; Maugis-Rabusseau, C.; Martin-Magniette, M.L.; Celeux, G. Co-expression analysis of high-throughput transcriptome sequencing data with Poisson mixture models. Bioinformatics 2015, 31, 1420–1427.
10. Silva, A.; Rothstein, S.J.; McNicholas, P.D.; Subedi, S. A multivariate Poisson-log normal mixture model for clustering transcriptome sequencing data. BMC Bioinform. 2019, 20, 394.
11. Holmes, I.; Harris, K.; Quince, C. Dirichlet multinomial mixtures: Generative models for microbial metagenomics. PLoS ONE 2012, 7, e30126.
12. Subedi, S.; Neish, D.; Bak, S.; Feng, Z. Cluster analysis of microbiome data by using mixtures of Dirichlet–multinomial regression models. J. R. Stat. Soc. Ser. C (Appl. Stat.) 2020, 69, 1163–1187.
13. Lo, K.; Brinkman, R.R.; Gottardo, R. Automated gating of flow cytometry data via robust model-based clustering. Cytom. Part A 2008, 73, 321–332.
14. Chan, C.; Feng, F.; Ottinger, J.; Foster, D.; West, M.; Kepler, T.B. Statistical mixture modeling for cell subtype identification in flow cytometry. Cytom. Part A 2008, 73, 693–701.
15. Shen, R.; Mo, Q.; Schultz, N.; Seshan, V.E.; Olshen, A.B.; Huse, J.; Ladanyi, M.; Sander, C. Integrative subtype discovery in glioblastoma using iCluster. PLoS ONE 2012, 7, e35236.
16. Higgins, J.P.; Shinghal, R.; Gill, H.; Reese, J.H.; Terris, M.; Cohen, R.J.; Fero, M.; Pollack, J.R.; Van de Rijn, M.; Brooks, J.D. Gene expression patterns in renal cell carcinoma assessed by complementary DNA microarray. Am. J. Pathol. 2003, 162, 925–932.
17. Ma, X.J.; Salunga, R.; Tuggle, J.T.; Gaudet, J.; Enright, E.; McQuary, P.; Payette, T.; Pistone, M.; Stecker, K.; Zhang, B.M.; et al. Gene expression profiles of human breast cancer progression. Proc. Natl. Acad. Sci. USA 2003, 100, 5974–5979.
18. Haqq, C.; Nosrati, M.; Sudilovsky, D.; Crothers, J.; Khodabakhsh, D.; Pulliam, B.L.; Federman, S.; Miller, J.R., III; Allen, R.E.; Singer, M.I.; et al. The gene expression signatures of melanoma progression. Proc. Natl. Acad. Sci. USA 2005, 102, 6092–6097.
19. Humbert, S.; Subedi, S.; Cohn, J.; Zeng, B.; Bi, Y.M.; Chen, X.; Zhu, T.; McNicholas, P.D.; Rothstein, S.J. Genome-wide expression profiling of maize in response to individual and combined water and nitrogen stresses. BMC Genom. 2013, 14, 3.
20. Misyura, M.; Guevara, D.; Subedi, S.; Hudson, D.; McNicholas, P.D.; Colasanti, J.; Rothstein, S.J. Nitrogen limitation and high density responses in rice suggest a role for ethylene under high density stress. BMC Genom. 2014, 15, 681.
21. Wolfe, J.H. Pattern clustering by multivariate mixture analysis. Multivar. Behav. Res. 1970, 5, 329–350.
22. Luan, Y.; Li, H. Clustering of time-course gene expression data using a mixed-effects model with B-splines. Bioinformatics 2003, 19, 474–482.
23. McNicholas, P.D.; Subedi, S. Clustering gene expression time course data using mixtures of multivariate t-distributions. J. Stat. Plan. Inference 2012, 142, 1114–1127.
24. Coffey, N.; Hinde, J.; Holian, E. Clustering longitudinal profiles using P-splines and mixed effects models applied to time-course gene expression data. Comput. Stat. Data Anal. 2014, 71, 14–29.
25. Koestler, D.C.; Marsit, C.J.; Christensen, B.C.; Kelsey, K.T.; Houseman, E.A.
A recursively partitioned mixture model for clustering time-course gene expression data. Transl. Cancer Res. 2014, 3, 217.
26. Love, M.I.; Huber, W.; Anders, S. Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2. Genome Biol. 2014, 15, 550.
27. Dong, K.; Zhao, H.; Tong, T.; Wan, X. NBLDA: Negative binomial linear discriminant analysis for RNA-Seq data. BMC Bioinform. 2016, 17, 369.
28. Doss, D. Definition and characterization of multivariate negative binomial distribution. J. Multivar. Anal. 1979, 9, 460–464.
29. Brijs, T.; Karlis, D.; Swinnen, G.; Vanhoof, K.; Wets, G.; Manchanda, P. A multivariate Poisson mixture model for marketing applications. Stat. Neerl. 2004, 58, 322–348.
30. Subedi, S.; Browne, R.P. A family of parsimonious mixtures of multivariate Poisson-lognormal distributions for clustering multivariate count data. Stat 2020, 9, e310.
31. Silva, A.; Rothstein, S.J.; McNicholas, P.D.; Subedi, S. Finite mixtures of matrix-variate Poisson-log normal distributions for three-way count data. arXiv 2018, arXiv:1807.08380.
32. Pourahmadi, M. Joint mean-covariance models with applications to longitudinal data: Unconstrained parameterisation. Biometrika 1999, 86, 677–690.
33. Dempster, A.P.; Laird, N.M.; Rubin, D.B. Maximum likelihood from incomplete data via the EM algorithm. J. R. Stat. Soc. Ser. B 1977, 39, 1–22.
34. Wainwright, M.J.; Jordan, M.I. Graphical models, exponential families, and variational inference. Found. Trends Mach. Learn. 2008, 1, 1–305.
35. McNicholas, P.D.; Jampani, K.R.; Subedi, S. longclust: Model-Based Clustering and Classification for Longitudinal Data; R package version 1.2.3; 2019.
36. Aitken, A.C. A series formula for the roots of algebraic and transcendental equations. Proc. R. Soc. Edinb. 1926, 45, 14–22.
37. Böhning, D.; Dietz, E.; Schaub, R.; Schlattmann, P.; Lindsay, B. The distribution of the likelihood ratio for mixtures of densities from the one-parameter exponential family. Ann. Inst. Stat. Math. 1994, 46, 373–388.
38. Schwarz, G. Estimating the dimension of a model. Ann. Stat. 1978, 6, 461–464.
39. Hubert, L.; Arabie, P. Comparing partitions. J. Classif. 1985, 2, 193–218.
40. Rau, A.; Celeux, G.; Martin-Magniette, M.; Maugis-Rabusseau, C. Clustering High-Throughput Sequencing Data with Poisson Mixture Models; Technical Report RR-7786; INRIA: Saclay, France, 2011.
41. Rau, A.; Celeux, G.; Martin-Magniette, M.L.; Maugis-Rabusseau, C. HTSCluster: Clustering High-Throughput Transcriptome Sequencing (HTS) Data; R package version 2.0.4; 2016.
42. Si, Y. MBCluster.Seq: Model-Based Clustering for RNA-Seq Data; R package version 1.0; 2012.
43. Leong, H.S.; Dawson, K.; Wirth, C.; Li, Y.; Connolly, Y.; Smith, D.L.; Wilkinson, C.R.; Miller, C.J. A global non-coding RNA system modulates fission yeast protein levels in response to stress. Nat. Commun. 2014, 5, 3947.

Disclaimer/Publisher's Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.