Access the full text.

Sign up today, get DeepDyve free for 14 days.

Statistics
, Volume 2019 (1902) – Feb 27, 2019

/lp/arxiv-cornell-university/bayesian-effect-selection-in-structured-additive-distributional-5xzA3c0X0e

References for this paper are not available at this time. We will be adding them shortly, thank you for your patience.

- ISSN
- 1936-0975
- eISSN
- ARCH-3347
- DOI
- 10.1214/20-BA1214
- Publisher site
- See Article on Publisher Site

We propose a novel spike and slab prior speciﬁcation with scaled beta prime marginals for the importance parameters of regression coeﬃcients to allow for general eﬀect selection within the class of structured addi- tive distributional regression. This enables us to model eﬀects on all distributional parameters for arbitrary parametric distributions, and to consider various eﬀect types such as non-linear or spatial eﬀects as well as hierarchical regression structures. Our spike and slab prior relies on a parameter expansion that separates blocks of regression coeﬃcients into overall scalar importance parameters and vectors of standardised coeﬃ- cients. Hence, we can work with a scalar quantity for eﬀect selection instead of a possibly high-dimensional eﬀect vector, which yields improved shrinkage and sampling performance compared to the classical normal- inverse-gamma prior. We investigate the propriety of the posterior, show that the prior yields desirable shrinkage properties, propose a way of eliciting prior parameters and provide eﬃcient Markov Chain Monte Carlo sampling. Using both simulated and three large-scale data sets, we show that our approach is applica- ble for data with a potentially large number of covariates, multilevel predictors accounting for hierarchically nested data and non-standard response distributions, such as bivariate normal or zero-inﬂated Poisson. Keywords: penalised splines; prior elicitation; redundant parameterisation; scaled beta prime distribution; shrinkage properties. Correspondence should be directed to Prof. Dr. Nadja Klein at Humboldt University of Berlin, Span- dauer Str. 1, 10178 Berlin. Email: nadja.klein@hu-berlin.de. The work of Manuel Carlan was supported by the German Research Foundation (DFG) via the research training group 1644 “Scaling Problems in Statistics”. Thomas Kneib received ﬁnancial support from the German Research Foundation (DFG) within the research project KN 922/9-1. Nadja Klein gratefully acknowledges funding by the Alexander von Humboldt Foundation. arXiv:1902.10446v1 [stat.ME] 27 Feb 2019 1 Introduction The ﬂexibility of modern regression methodology is both a blessing and a curse for applied researchers and statisticians alike since, on the one hand, added ﬂexibility enables potentially more realistic models approximating the true data generating process but, on the other hand, poses additional challenges in the model building and model checking process. In this paper, we consider structured additive distributional regression models (Rigby and Stasinopoulos, 2005; Klein, Kneib, Lang and Sohn, 2015) that combine additive predictors consisting of various types of regression eﬀects, e.g. non-linear eﬀects of continuous covariates, spatial eﬀects or random eﬀects (Kammann and Wand, 2003; Ruppert et al., 2003; Wood, 2017) with the possibility to model all parameters of the response distribution (e.g. location, scale or shape parameters) in terms of covariates in a distributional regression approach. As a consequence, an analyst is faced with the challenge of not only choosing an appropriate response distribution, (a task that we will not consider in this paper since both graphical tools for model checking as well as selection criteria are well developed, see for example Klein, Kneib, Lang and Sohn, 2015) but also with determining the most appropriate subset of covariates along with their exact modelling alternative for multiple regression predictors. As an example, in one of our empirical illustrations on childhood undernutrition in Nigeria with more than 20,000 observations, we analyse a bivariate response variable (y , y ) consisting of two 1 2 scores for chronic and acute undernutrition. A previous study (Klein, Kneib, Klasen and Lang, 2015) suggests a bivariate normal model in which not only the marginal expectations but also the marginal scale parameters and the correlation parameter depend on covariates. This leads to a distributional regression model with ﬁve parameters μ , μ , σ , σ , ρ. In a full model, all of 1 2 1 2 these parameters could be related to a predictor η of the form ik η =x β + f (cage) + f (mage) + f (mbmi) + f (region), k = 1, . . . , 5, i = 1, . . ., n, ik k 1,k 2,k 3,k spat,k where i = 1, . . . , n denotes the observation index, k refers to the ﬁve distributional parameters, x contains 13 binary covariates (and an intercept term) with regression coeﬃcients β , f (·), i j,k j = 1, 2, 3, are non-linear smooth functions of age of child (cage), mother’s age (mage) and mother’s body mass index (mbmi), and f are spatial eﬀects based on regional information in spat,k the data. While eﬀect selection (deciding which of the diﬀerent eﬀects should be included in the model) via a full search in the model space would already be challenging in a mean regression framework with only one single predictor, full eﬀect selection in a distributional regression setting with multiple predictors is typically computationally prohibitive. This is even more the case when one is interested in deciding whether the eﬀect of a continuous covariate shall be included in a linear or non-linear form or whether it could be excluded completely from the model. In this paper, we address these challenges and develop a novel spike and slab prior structure that enables Bayesian eﬀect selection within structured additive distributional regression models. While there has been extensive interest in spike and slab priors for Bayesian variable selec- tion (i.e. the selection of eﬀects in models with purely linear predictors) or function selec- tion (selection of non-linear eﬀects of continuous covariates) in previous years (see for exam- ple Clyde and George, 2004; O’Hara and Sillanpa¨¨a, 2009, for reviews), most research has been 2 restricted to additive mean regression with Gaussian errors, distributions from the exponential family or survival models but also in the context of group variable selection (Zhang et al., 2014; Xu and Ghosh, 2015). Furthermore, most approaches restrict the predictor speciﬁcation to in- clude either only linear eﬀects or only non-linear eﬀects of continuous covariates but do not enable the consideration of more complex eﬀect types such as spatial eﬀects or the decomposition of non-linear eﬀects in linear and non-linear components. Classical Bayesian variable selection approaches for linear models based on spike and slab priors include for example Mitchell and Beauchamp (1988), or George and McCulloch (1997). Smith and Kohn (1996) utilise these approaches for function selection in nonparametric regres- sion with Gaussian responses by assigning the variable selection priors to individual basis func- tions. Approaches that move beyond the framework of Gaussian models but pertain the purely linear predictor structure comprise the approaches of Rossell and Rubio (2017) who propose a Bayesian variable selection approach that allows for skewness and thicker tails compared to the Gaussian distribution, Wang et al. (2017) who consider variable selection after transform- ing the response, and Chung and Dunson (2009); Kundu and Dunson (2014) who propose non- parametric models where in the former proposal the mean and shape learn the eﬀect of covariates, while the latter assumes symmetric residuals. In all these approaches however, the spike and slab prior is directly imposed on the scalar regression coeﬃcients. In contrast, Ishwaran and Rao (2005) consider a hierarchical speciﬁcation where the spike and slab structure is not imposed directly on the regression coeﬃcients but, on a higher level of the hierarchy, on their prior vari- ances. This approach also allows to consider situations where selection should take place on blocks of regression coeﬃcients representing for example the coeﬃcients of a basis expansion in nonparametric regression. This leads to function selection approaches for additive models, also considered in Yau et al. (2003); Cottet et al. (2008); Reich et al. (2009), who combine a spike with point mass at zero with a slab that has support only on the positive real numbers. In contrast, Zhu et al. (2010) specify both spike and slab as normal distributions (with very dif- ferent variance components) and Panagiotelis and Smith (2008) assign a multivariate prior with spike at the origin and normal slab directly to the whole vector of basis coeﬃcients. In either case, one typically observes poor mixing unless sampling from marginalized full conditionals which are only available in closed form for Gaussian models (Yau et al., 2003; Reich et al., 2009; Panagiotelis and Smith, 2008) or models that have a latent Gaussian representation such as the probit model (Zhu et al., 2010). Cottet et al. (2008) address function selection in double expo- nential regression models, where both the mean and the dispersion parameter are linked to an additive predictor which comprises linear and non-linear eﬀects. The model space is restricted, since functional eﬀects may enter the model only if the corresponding linear eﬀect is included in the model. Our proposal is inspired by the approach of Scheipl et al. (2012) that introduces eﬀect selection in generalized additive models for simple exponential family regression and with only one mean- related additive predictor. As Scheipl et al. (2012), we rely on a redundant parameter expansion of the vector of the basis coeﬃcients as originally proposed in Gelman et al. (2008), and which allows us to expand the vector of basis coeﬃcients in an importance parameter shared by all basis 3 coeﬃcients on the one hand and standardised basis coeﬃcients on the other hand. Eﬀect selection is then performed by assigning a spike and slab prior to the squared importance parameter. More precisely, our paper makes the following important contributions: We integrate eﬀect selection based on spike and slab priors in the structured additive distri- butional regression framework such that selection of general eﬀect types is no longer restricted to mean regression models with responses from simple exponential families. The parameter vectors representing the additive eﬀect components in a structured additive predictor are typically assigned partially improper multivariate normal priors. Instead of ex- plicitly reparameterising the vector of basis coeﬃcients to enable the speciﬁcation of proper priors as in Scheipl et al. (2012), we implicitly remove the partial impropriety by adding a corresponding constraint to the prior distribution. As a consequence, we can retain sparse matrix structures for speeding up computations and show empirically that this has beneﬁcial impact on the mixing behaviour of the MCMC simulations. In particular, when the vector of regression coeﬃcients is large, we do not observe the strong dependence on the dimensionality of the basis coeﬃcient vector identiﬁed in Scheipl et al. (2012). This enables us to also include eﬀects of considerable dimension such as spatial eﬀects to truly exploit the beneﬁts of eﬀect selection over function selection and even allows us to further extend the model to hierarchical speciﬁcations of the predictors (Lang et al., 2014). Formulating the spike and slab prior for the squared importance parameter in the redun- dant parameterisation yields scaled beta prime marginals which have favourable shrinkage properties (P´erez et al., 2017). We study these properties in detail and provide corresponding theoretical results for our prior structure including conditions for the propriety of the posterior. We develop rules for eliciting the hyperparameters of the spike and slab prior based on simple scaling criteria that are easily accessible to applied researchers. Based on the elicited parame- ters, we ﬁnd that our new prior structure has similarly favourable shrinkage properties as the approach by Scheipl et al. (2012), while it avoids to arbitrarily ﬁx the hyperparameters. The rest of this paper is structured as follows: Section 2 summarises the speciﬁcation of our novel spike and slab prior for eﬀect selection in distributional regression. Properties of the prior, including prior elicitation, shrinkage properties and propriety of the posterior are discussed in Section 3. Section 4 contains details on posterior estimation via Markov chain Monte Carlo simu- lations and points to software and implementation. Sections 5.1 and 5.2 evaluate the performance of our approach in simulations and three diverse applications. In Section 6 we conclude. 2 Bayesian Eﬀect Selection in Distributional Regression 2.1 Observation Model 2.1.1 Distributional Regression Our approach to Bayesian eﬀect selection based on spike and slab priors is developed for the general class of (multivariate) Bayesian structured additive distributional regression (Klein, Kneib, Lang and Sohn, 2015). Let (y ,ν ), i = 1, . . . , n denote n independent obser- 4 vations on the (not necessarily scalar) response variable y and covariates ν. We then assume that the conditional distribution of y given ν is speciﬁed in terms of a K-parametric distribution with density p(y |ϑ , . . . , ϑ ), (M1) i1 iK where ϑ = (ϑ , . . . , ϑ ) is a collection of K scalar distributional parameters ϑ , k = 1, . . . , K, i i1 iK ik which depend on ν . Compared to mean regression models where p(·) is usually assumed to belong to the exponential family and where K−1 parameters are treated as ﬁxed or nuisance parameters, in distributional regression each of the distributional parameters is linked to a structured additive −1 predictor η via a suitable one-to-one transformation h , i.e. h (η ) = ϑ and η = h (ϑ ). ik k k ik ik ik ik 2.1.2 Structured Additive Predictors The predictors themselves are speciﬁed as L J k k X X in sel in sel η = η + η = f (ν ) + f (ν ), (M2) ik i i ik ik l,k j,k l=1 j=1 sel where the eﬀects f (ν ) represent various types of ﬂexible functions depending on (diﬀerent j,k in subsets of) the covariate vector ν that are to be selected via spike and slab priors, while η ik in represents a second additive predictor consisting of all eﬀects f (ν ) that are not under selec- l,k tion. The separation into two subsets of eﬀects allows us to include speciﬁc covariate eﬀects mandatorily in the model (e.g. based on prior knowledge or since these represent confounding eﬀects that have to be included in the model in any case). In the following, we will only discuss in the speciﬁcation of priors for the eﬀects under selection in detail since the eﬀects η can be ik handled exactly as in distributional regression models without eﬀect selection, but we will use the diﬀerentiation later in Section 3.4 for deriving suﬃcient conditions for the propriety of the posterior. Dropping the parameter index k, the function index j and the superscript sel in the rest of this section for notational simplicity, we assume that each eﬀect f(ν ) can be approximated by a linear combination of basis functions such that f(ν ) = τ β B (ν ), (M3) i d d i d=1 ˜ ˜ ˜ where B (ν ), d = 1, . . . , D are the basis functions, β = (β , . . ., β ) is the vector of (standard- d i 1 D ised) basis coeﬃcients and τ is an importance parameter. Due to the linear basis representation, the vector of function evaluations f = (f(ν ), . . . , f(ν )) can be written as f = τBβ where B is 1 n the (n×D) design matrix arising from the evaluation of the basis functions B (ν ), d = 1, . . ., D d i at the observed covariate values ν , . . . ,ν . 1 n Note that the parameterisation in (M3) is equivalent to the standard speciﬁcation in structured additive regression f(ν ) = β B (ν ), (M3 ) i d d i d=1 but redundant as only the product β = τβ is identiﬁed. However, the importance parameter τ allows us to remove eﬀects from the predictor for τ = 0 while eﬀects are considered to be of high 5 importance if τ is large in absolute terms. We will place a spike and slab prior on the squared importance parameter τ to achieve eﬀect selection. 2.2 The Normal Beta Prime Spike and Slab Prior 2.2.1 Constraint Prior for Regression Coeﬃcients Since for many speciﬁc types of eﬀects the vector of basis coeﬃcients β is of relatively high dimension, it is often useful to enforce speciﬁc properties such as smoothness or shrinkage. In a Bayesian formulation, this can be facilitated by assuming (partially improper) multivariate Gaussian priors 2 ′ ∗ p(β|τ ) ∝ exp − β Kβ 1 [Aβ = 0] , (M4 ) 2τ where K denotes the prior precision matrix implementing the desired properties, τ is a prior variance parameter and the indicator function 1[Aβ = 0] is included to enforce linear constraints on the regression coeﬃcients via the constraint matrix A. The latter is typically used to remove identiﬁability problems from the additive predictor (e.g. by centering the additive components of the predictor) but can also be used to remove the partial impropriety from the prior that comes from a potential rank deﬁciency of the precision matrix K with rk(K) = κ ≤ D. We specify a prior of exactly the same structure on the vector of scaled basis coeﬃcients β, h i 1 ′ ˜ ˜ ˜ ˜ p(β) ∝ exp − β Kβ 1 Aβ = 0 (M4) and assume that the constraint matrix A is chosen such that all rank-deﬁciencies in K are eﬀectively removed from the prior distribution. This can, for example, be achieved by setting A = span (ker(K)) , (M5) where ker(K) denotes the null space of K and span (ker(K)) is a representation of the corre- sponding basis. This speciﬁcation eﬀectively restricts the parameter vector β to a lower dimen- sional space of dimension rk(K) and allows us to establish a decomposition of the eﬀect f(ν) into a penalized and an unpenalized part, i.e. f (ν) + f (ν) where f (ν) represents unpen pen unpen parts of the function corresponding to the null space of K which are therefore not aﬀected by the “penalisation” induced by K while f (ν) represents the part of the total eﬀect that is pen associated with the proper, informative prior part. Importantly, we can now put separate spike and slab priors on both parts of f. For instance, in case of penalized splines with second order random walk prior, the space of unpenalized functions contains the linear functions, while the penalized part contains nonlinear deviations from the former. Such a parameterization hence en- ables the decision whether a continuous covariate should be included purely nonlinearly, whether it is suﬃcient to assume a pure linear eﬀect or whether the sum of a linear and a non-linear eﬀect is needed. The resulting models are therefore both potentially more parsimonious and easier to interpret. ∗ ∗ The speciﬁcations (M3), (M4) and (M3 ), (M4 ) seem to be equivalent to each other correspond- ing to rescaling the regression coeﬃcients and the prior distribution as β = τβ. However, this is only true if the prior distribution (M4) is indeed proper. To see this, assume that K is rank 6 deﬁcient and a constant eﬀect is not penalised by the prior precision matrix. In this case, the traditional formulation of structured additive regression models (M3 ) implies a constant eﬀect if τ approaches zero while the rescaled version (M3) implies an eﬀect equal to zero since the complete function is multiplied by τ. Note, that both (M4 ) and (M4) rely on the same precision matrix K and hence the constraint matrix A can be constructed independently of the parametrisation. The traditional way is an explicit mixed model decomposition (Fahrmeir et al., 2004; Wood, 2011) which is used by Scheipl et al. (2012) to perform eﬀect selection for mean regression models. As the mixed model representation yields a penalised component which is β ∼ N(0,I), this is eﬀectively equivalent to considering our constraint prior by choosing the constraint matrix according to (M5) and by rescaling the individual entries in β with the eigenvalues of K (see Rue and Held, 2005, Sec. 3.2 for details). However, the explicit mixed model representation used by Scheipl et al. (2012) destroys the sparsity properties of the design matrices (such as band structures for B-splines) and causes full design matrices which in turn increases computation times. In order to keep the sparsity of the design matrices of functional eﬀects (and hence to minimize computation time) we instead implicitly remove the improper part of p(β|τ ) by sampling β directly from the constrained posterior using (M4). 2.2.2 Normal Beta Prime Spike and Slab Prior on Squared Importance Parameter To achieve function selection in our model, we place a spike and slab prior speciﬁcation on the squared importance parameter τ . This hierarchical prior relies on a mixture of one prior concentrated close to zero such that it can eﬀectively be thought of as representing zero (the spike component) and a more dispersed, mostly noninformative prior (the slab) and is speciﬁed via the hierarchy 1 1 2 2 τ |δ, ψ ∼ Ga , 2 2r(δ)ψ δ|ω ∼ Bi(1, ω) ψ ∼ IG(a, b) (M6) ω ∼ Beta(a , b ) 0 0 r δ = 0 r(δ) = 1 δ = 1 2 2 2 2 The scale parameter ψ determines the prior expectation of τ , which is ψ for δ = 1 and rψ for δ = 0 with r ≪ 1 being a ﬁxed small value and hence the indicator δ determines whether a speciﬁc eﬀect β = τβ is included in the model (δ = 1) or excluded from the model (δ = 0). The parameter ω is the prior probability for an eﬀect being included in the model and the remaining parameters a, b, a , b and r are hyperparameters of the spike and slab prior. We will discuss 0 0 prior elicitation for these parameters in detail in Section 3.2. 2 2 Marginalising over ψ , both the spike and the slab component p(τ |δ) are scaled beta prime distributions with shape parameters 1/2 and a and scale parameter 2r(δ)b (P´erez et al., 2017). Therefore we call the hierarchical prior on β = τβ speciﬁed by (M4) – (M6) the Normal Beta 7 Prime Spike and Slab (NBPSS) prior, see Section 3 for a detailed discussion of the properties of the NBPSS prior. Equations (M1) to (M6) deﬁne our complete model speciﬁcation for eﬀect selection in structured additive distributional regression. 2.3 Special Cases We brieﬂy discuss some of the components of structured additive predictors used later in our empirical evaluations. These include linear eﬀects with either ﬂat, improper priors if these are not under selection or conditionally i.i.d. Gaussian priors for linear eﬀects under selection. The columns of the design matrix B are then equal to the diﬀerent covariates. non-linear eﬀects based on Bayesian P-splines (Lang and Brezger, 2004), where random walk priors are used for the regression coeﬃcients corresponding to D diﬀerent B-spline basis func- tions. The i-th row of B then contains the basis functions B (x ), . . . , B (x ) evaluated at x . 1 i D i i If not stated otherwise, we will use second order random walk priors and cubic B-splines with 20 inner knots resulting in D = 22. spatial eﬀects for a discrete set of geographical regions modelled via Gaussian Markov random ﬁelds (GMRFs) with precision matrix given by an adjacency matrix encoding the neighbour- hood relation between the regions (Rue and Held, 2005) and a design matrix with entries (i, s) equal to one if observation i is located in region s and zero otherwise. We consider the simplest form of GMRFs and deﬁne two regions as neighbours if they share common borders. multilevel structured additive regression models as proposed by Lang et al. (2014) that allow for hierarchical prior speciﬁcations for regression eﬀects where each parameter vector may again be assigned an additive predictor, i.e. the vector β is decomposed as β = η + ε and the predictor η can itself be of structured additive form. 3 Properties of the NBPSS prior In the following, we discuss properties of the NBPSS prior hierarchy, including elicitation of hyperparameters, shrinkage properties and propriety of the posterior. For prior elicitation and shrinkage properties, the marginal distribution of β = τβ plays a crucial role. We will therefore start with deriving this marginal distribution. 3.1 Marginal Distribution The marginal prior for the squared importance parameter τ is given by the mixture 2 2 2 p(τ ) = p(τ |δ = 1)P(δ = 1|a , b ) + p(τ |δ = 0)P(δ = 0|a , b ) (1) 0 0 0 0 of two scaled beta prime distributions BP(1/2, a, 2b) and BP(1/2, a, 2rb) with mixture weight of the slab given by P(δ = 1|a , b ) = a /(a + b ). A modiﬁed version of the NBPSS prior can 0 0 0 0 0 alternatively be derived by assuming a mixture of two scaled t distributions for the importance pa- rameter τ = ± τ . Specifying this prior hierarchically, the ﬁrst equation in (M6) is replaced by 2 2 τ|δ, ψ ∼ N (0, r(δ)ψ ) and as a consequence posterior sampling would no longer be possible with 8 Gibbs steps as the corresponding conditional posterior would depend on the likelihood function. Marginalising over ψ , δ and ω, the prior p(τ) is a mixture of two scaled t-distributions with 2a degrees of freedom, location parameter 0, scale parameters b/a and rb/a and mixture weights a /(a +b ) and b /(a +b ), respectively. Thus, the prior on the (signed) importance parameter 0 0 0 0 0 0 τ is closely linked to the NMIG prior used in Ishwaran and Rao (2005) when considering scalar regression coeﬃcients β that are conditionally normal given the inverse gamma distributed vari- ance parameter τ (but with one level of hierarchy less) on the one hand, and, on the other hand to the peNMIG speciﬁcation of Scheipl et al. (2012). The implied marginal distribution for β = τβ can now be derived as p(β) = p(τ)p (β/τ) dτ, (2) |τ| −∞ where p is given in equation (M4). However no analytical solution exists for this integral such that it has to be approximated numerically. 3.2 Prior Elicitation In the following, we discuss prior elicitation for the NBPSS prior hyperparameters a, b, a , b and 0 0 r. More precisely, we argue that suitable default values can be suggested for a, a , and b based 0 0 on theoretical arguments while providing intuitive and user-friendly criteria for the elicitation of b and r. In the literature, default values have often been suggested from simulation-based evidence (e.g. in Scheipl et al., 2012) but we prefer to determine b and r in a more transparent way. Theoretical properties of the scaled beta prime distribution have been discussed in P´erez et al. (2017). From this, it follows that for both spike and slab moments of order less than a exist and the variance decreases with a. Furthermore, for small values of a, the spike and the slab component will overlap such that moves from δ = 0 to δ = 1 are possible. However to guarantee the existence of moments, a should not be too small either. Fixing a = 5 yielded overall a convincing mixing performance and we therefore use this value also in our real data examples. For the prior inclusion parameter ω a sensible default is to use a = b = 1 which corresponds 0 0 to a ﬂat prior on the unit interval. Of course, one can also choose ﬁx values for ω in case strong prior knowledge on the prior inclusion probability of the size of the expected model is available. As the marginal prior inclusion probability is given by P(δ = 1|a , b ) = a /(a + b ), a and b 0 0 0 0 0 0 0 can be chosen to reﬂect prior assumptions on the inclusion probability of eﬀects. For the elicitation of b and r, we propose an approach inspired by the principled approaches of Simpson et al. (2017) and Klein and Kneib (2016). More precisely, we consider marginal probability statements on the supremum norm sup |f(ν)| over a certain set of covariate values ν∈D D conditional on the status of the inclusion/exclusion parameter δ. Given δ = 1 (inclusion of the eﬀect), the marginal distribution of f(ν) does no longer depend on r, such that the parameter b can be determined from P sup |f(ν)| ≤ c δ = 1 = α, (3) ν∈D This is the probability that the supremum norm of an eﬀect is smaller than a pre-speciﬁed level 9 c for all design points ν ∈ D, such that α and c should be small. Basically we formulate the prior such that it is unlikely that the supremum norm stays below a pre-speciﬁed level if it is indeed an informative eﬀect that should be included. Both the level c and the prior probability α have to be speciﬁed by the analyst according to her/his prior beliefs. To derive r, we proceed similarly but consider the probability P sup |f(ν)| ≤ c δ = 0 = 1 − α (4) ν∈D now conditioning on non-inclusion. Since in this case we would rather be interested in making the probability of not exceeding the threshold c large, the probability is reversed to 1 − α. Note that the absolute value of the eﬀects can be taken without loss of generality due to the centring constraint of each function to ensure identiﬁability. The basic idea of these two equations is that such prior statements can be much more easily elicited in applications, in particular in distributional regression where the application of response functions such as the exponential function or the logit transform induce default ranges of plausible eﬀect sizes. Of course, the levels c as well the probability levels α can be chosen to be distinct for the inclusion/exclusion criteria in (3) and (4) but we suppress this possibility notationally both for simplicity and since in most cases it seems plausible to choose the same parameter settings anyway. To access the probabilities in (3) and (4), we have to derive the marginal distribution of sup |f(ν)| which is not analytically accessible. For a single covariate value ν, the function evaluation is ′ ′ ˜ ˜ given by f(ν) = τ(B (ν), . . . , B (ν))β = τb β = b β and the marginal density is 1 D ν ν ′ ′ 2 2 2 p(b β | δ) = p(b β|τ )p(τ | δ)dτ ν ν ′ 2 2 ′ − − 2 where b β|τ ∼ N(0, τ b K b ) (with K denoting the generalized inverse of K) and p(τ ) is ν ν given in Equation (1). Note that using the generalized inverse eﬀectively removes the portion of f(ν) that corresponds to the null space of K such that we take the constraint in (M4) into account. The integrals above are scalar integrals for each covariate ν which can be solved numerically. However, obtaining the supremum over a large set D, numerical integration easily becomes computationally intractable. We hence determine the distribution of the supremum based on simulations from the hierarchical NBPSS prior. In the Online Appendix B, we show how to determine r and b independently of each other. For ′ ′ ′ given design matrix B = (b , . . .,b ) , precision matrix K, probability level α and threshold c, ν ν 1 n these can be computed for general functional eﬀects using the R package sdPrior (Klein, 2018). 3.3 Shrinkage Properties Regularisation and shrinkage properties of certain prior settings in regression speciﬁcations can be studied by considering the marginal distribution of the regression coeﬃcients and/or functional eﬀects. According to Section 3.1 the marginal densities have to be determined by numerical integration. 10 3.3.1 Constraint Regions We compare the prior speciﬁed in (M4)–(M6) with a standard NMIG prior applied directly to the coeﬃcients in β and the parameter expanded prior (peNMIG) of Scheipl et al. (2012). Figure 1 shows the univariate marginal log-densities where the most distinct diﬀerence is between the standard NMIG prior compared to peNMIG and NBPSS priors. While the standard NMIG prior resembles the shape of a normal distribution with a ﬁnite asymptote at zero, both parameter expanded priors feature a spike in zero. As we will show in the next section, this spike is indeed inﬁnite such that advantageous selection behaviour is to be expected for the NBPSS prior. Figure 2 supplements the univariate considerations by bivariate marginal log-densities. We diﬀerentiate between two situations: First, we consider two parameters that depend on the same value τ , i.e. parameters belonging to the same function f(ν), while in the second case we consider parameters depending on diﬀerent importance parameters. This distinction is important since the standard NMIG prior always assumes independent components with separate hyperparameters. As a consequence, the peNMIG and NBPSS priors deviate from the standard situation in two ways: First by the parameter expansion itself and second by making the parameters depend on the same hyperparameter. To disentangle the eﬀect of these two deviations, we rely on the separate presentations. We make the following important observations: The NBPSS and peNMIG priors share the same qualitative behaviour while deviating consid- erably from the standard NMIG prior regardless of whether the case of shared or distinct τ is considered. The univariate marginal densities qualitatively resemble the ones of the original spike and slab prior of Mitchell and Beauchamp (1988) with tails that are heavy enough to induce a re-descending score function which ensures robustness of the Bayesian estimators (see also the next subsection). For the case of distinct parameters, we observe contours similar to the convex shape of L priors with q < 1 for the peNMIG and NBPSS priors which implies weak shrinkage of large eﬀects while small coeﬃcients are strongly shrunken to zero. For the case of shared τ , the shapes of the contours imply simultaneous shrinkage of both parameters instead of the strong shrinkage towards the coordinate axes observed for distinct importance parameters. This is exactly the desired type of shrinkage for parameters belonging to one eﬀect f(ν) to completely remove the eﬀect from the model speciﬁcation. As already noted in Section 2.2, the speciﬁcation of the prior in Scheipl et al. (2012) diﬀers from ours insofar as they consider the mixed model decomposition of eﬀects. Additionally, Scheipl et al. (2012) use a bimodal prior for the standardized regression eﬀects with modes at +1 and −1. This eﬀectively bounds the coeﬃcients away from zero and thus encour- ages sampling from one mode of the posterior, while we instead explore the full posterior. Consequently, the conditional posterior of β of NBPSS is a standard normal distribution p (x) = N(x; 0, 1), while the one of peNMIG is a mixture of two normals with modes, NBPSS p (x) = 0.5 N(x; 1, 1) + 0.5 N(x;−1, 1). Taking the ratio yields peNMIG p (x) peNMIG −1 > 1 ⇔ |x| > cosh (exp(0.5)) ≈ 1.08, p (x) NBPSS 11 which explains the slightly heavier tails of peNMIG in Figures 1 and 2. We also study the implied constraint regions for the marginal prior of function evaluations f(ν) = ′ ′ ′ − b β, which can be derived in complete analogy by utilising that b β ∼ N(0,b K b ) with a ν ν ν generalised inverse K . In contrast, the marginal prior for function evaluations for the parameter expanded prior of Scheipl et al. (2012) is not numerically accessible since it involves a complex mixture of 2 components (where D is the dimension of β) due to the bimodal prior for the elements of β. Figure 3 depicts marginal densities for the eﬀect f(ν) evaluated at one (left panel) and two (right hand panel) randomly chosen covariate values of a sequence of n = 100 equidistant values in [−π, π]. The resulting design matrix B is based on cubic Bayesian P-splines with D = dim(β) = 22. Hence, the bivariate plot corresponds to the situation of one shared importance parameter since we are interested in shrinkage of the eﬀect evaluations for the same eﬀect at diﬀerent covariate values. Qualitatively, the behaviour from the marginal densities of the regression coeﬃcient is translated to the function evaluations, i.e. we observe a peak in zero and simultaneous shrinkage. 3.3.2 Tail Behaviour and Behaviour in the Origin Visually, the marginal prior for β features a distinct peak as shown in the previous section. We now investigate more closely, whether this spike is ﬁnite or inﬁnite by considering the behaviour of p (β)| . Using Equation (2) we obtain β=0 Z Z 1 ∞ 1 1 p(β)| = 2p (0) p (τ) dτ + p (τ) dτ ˜ τ τ β=0 β | {z } τ τ 0 1 | {z } ≥p (1) ≥0 ≥ 2p (1)p (0) dτ = 2p (1)p (0) [log(τ)] = ∞, ˜ ˜ τ τ β β 0 and therefore the marginal prior for β indeed has an inﬁnite spike in zero. Note that we have shown that the multivariate parameter expanded prior has a spike in zero, while Scheipl et al. (2012) have only shown the result for the univariate marginal prior. An inﬁnite spike in zero is considered to induce particularly beneﬁcial shrinkage properties since we obtain heavy penalisa- tion of small eﬀects. The tail behaviour of the marginal prior for β can be studied by looking at the score function of p(β) which consists of the elements ∂ β 1 p (β) = − p (τ)p (β/τ) dτ. β τ ˜ ∂β τ |τ| Figure 4 visualizes the resulting score function and compares it to the score function of the NMIG and peNMIG priors. From the graphical representation we ﬁnd that all three prior structures have heavy tails such that the score functions are re-descending (i.e. they approach zero as their argument tends to inﬁnity) which induces Bayesian robustness of the resulting estimates. The score functions of the peNMIG and NBPSS priors resemble the shape of L priors with q close to zero, while the shape of the score function for the NMIG prior shows a more complex non-monotonously shape around zero. 12 3.4 Propriety of the Posterior Distribution While in Section 2 we do not explicitly change the design matrices to remove the nullspace of the precision matrices K (both eﬀects with NBPSS prior and the ones not under selection), j,k we do derive an explicit mixed model representation of the predictors η in (M2) in this section as this greatly simpliﬁes the derivation of suﬃcient conditions for the propriety of the posterior. As the exact conditions are also dependent on the prior structures employed, we need to be more in precise here about η and will therefore introduce a slightly diﬀerent notation compared to that in Section 2. 3.4.1 Mixed Model Representation in in sel Assume we have L eﬀects in η and J eﬀects under selection and let furthermore η = η +η k k k k k k be the complete predictors for k = 1, . . ., K as deﬁned in Section 2.1.2. in We then assume a mixed model type representation (Fahrmeir et al., 2004) for η L L k k X X in in in in in in in in in in in in in in ˜ ˜ η = Z (U β +V β ) = U β + V β = U β +V β , k l,k l,k unpen,l,k l,k pen,l,k l,k unpen,l,k l,k pen,l,k k unpen,k k pen,k l=1 l=1 in in in in in in in where U = (U , . . . ,U ), V = (V , . . . ,V ), and β = k 1,k L ,k k 1,k L ,k unpen,k k k in ′ in ′ ′ in in ′ in ′ ′ ((β ) , . . . , (β ) ) , β = ((β ) , . . . , (β ) ) . The columns of unpen,1,k unpen,L ,k pen,k pen,1,k pen,L ,k k k in in in in ˜ ˜ U are a basis of ker(K ), V forms a basis of the images of K , such that l,k l,k l,k l,k in in in in 2 in 2 in in dim(β ) = rk(K ) = κ and β |(τ ) ∼ N(0, (τ ) I), while β has di- pen,l,k l,k l,k pen,l,k l,k l,k unpen,l,k in in 2 in mension D − κ and a ﬂat prior. As a consequence, we obtain L variance parameters (τ ) l,k l,k l,k in in for the L penalized vectors of coeﬃcients β in η . sel sel For eﬀects in η we proceed similarly but with proper NBPSS priors on both parts of K , j,k sel sel rk(K ) = κ representing a basis of the nullspace and the image each. Hence, by construction j,k j,k all eﬀects under selection (after centring) can be assumed to have proper prior distributions. For non-linear eﬀects of continuous covariates with random walk priors of order > 2 for instance, this is achieved by separating the polynomial parts up to order-1 and to include separate NBPSS prior on these, see Section 2 for details. We hence assume that the sub-predictors under selection are of the form sel sel sel sel sel η = V β = V β , k j,k j,k k k j=1 sel sel sel sel sel ′ sel ′ ′ where V = (V , . . . ,V ), and β = ((β ) , . . . , (β ) ) . This yields J impor- k 1,k J ,k k 1,k J ,k k k 2 sel 2 tance parameters (τ ) with hyperparameters ψ , δ , ω in addition to the J regression j,k j,k k j,k j,k coeﬃcients with NBPSS priors after re-parameterisation. We furthermore introduce κ = P P L J k in k sel κ + κ . l=1 l,k j=1 j,k Finally, the complete predictor can be written as in in in in sel sel η = U β + V β + V β = U β + V β , (5) k k k k unpen,k k pen,k k k unpen,k pen,k in in sel ′ ′ ′ in in sel where we denote β ≡ β , β = ((β ) , (β ) ) , U ≡ U , V = (V ,V ). k k unpen,k pen,k unpen,k pen,k k k k k Let us in the sequel assume that the matrices U have full column rank r , k = 1, . . . , K and k k 13 deﬁne for X = (U ,V ) and t = rk(X ) − rk(U ) ≤ dim(β ), k k k k k k pen,k rk(X ) = r + t . (6) k k k Remark 1. In order to obtain a full column rank matrix of unpenalised eﬀects in the mixed model representation (5), all superﬂuous columns have to be deleted. In particular, duplicated constant columns representing the levels of the functions are deleted which is a simple way to include the centring restrictions and is equivalent to the centring of functions that we in- clude in our MCMC algorithm. Furthermore, using the one-to-one relationship between original parameterisation and the reparameterised model the restrictions for one presentation can be de- duced from the other one. Hence, suﬃcient rank conditions can be formulated directly for the reparameterised model (5) and do not have to be traced back to the original parameterisation, see Klein and Kneib (2016, Remark 1) for a detailed derivation of this result. 3.4.2 Conditional Independence Assumptions To derive the posterior distribution of model (M1) to (M6), we make the usual conditional independence assumptions (see the Online Appendix A.1, conditions (a.1)–(a.3b)) by labelling in 2 in for k = 1, . . . K the coeﬃcients β with variances (τ ) , l = 1, . . . , L for eﬀects not under l,k l,k sel 2 sel 2 selection; and β , (τ ) , ψ , δ , ω , ,J = 1, . . . , J , for the eﬀects with NBPSS prior. In j,k j,k k j,k j,k j,k general, they mean that priors for diﬀerent eﬀects are assumed to be independent, while within an eﬀect they are dependent by construction. In general, prior independence assumptions should be a reasonable working assumption which also does not rule out posterior dependence. Note that we always assume proper NBPSS priors and in particular a > 0, b > 0 in the priors for j,k j,k ψ . This is justiﬁed by our considerations on prior elicitation as discussed in Section 3.2 of the j,k main paper. In the following we assume that conditions (a.1)–(a.3b) of the Online Appendix A.1 hold. 3.4.3 Gaussian Mean Regression Assume in this section a Gaussian mean regression model for y = (y , . . . , y ) with predictor η 1 n from (5) in mixed model representation, i.e. y = η + ε, ε ∼ N(0, τ I ), (7) where we assume 1 b p(τ ) ∝ exp − 2 a +1 2 (τ ) τ ε ε for the error variance. Note that k = 1 in this subsection and that J , L , κ are replaced by k k k J, L, κ. Applying the mixed model representation (5) allows us writing (7) as y = Uβ + V β + ε, unpen pen and with the corresponding rank assumptions from above. 14 b. Conditions for Gaussian Mean Regression in in in (b.1) a < b = 0 or b > 0, l = 1, . . . , L. l l l in in (b.2) κ + 2a > 0, l = 1, . . ., L. l l in in (b.3) κ + 2a > κ − t, l = 1, . . ., L. l l sel sel (b.4) κ + 2a − 1 > κ − t, j = 1, . . . , J. j j in (b.5) n + 2a + 2 a > r + J. l=1 in (b.6) n + 2a + 2 min(0, a ) > r + J. l=1 (b.7) SSE + 2b > 0. in in Condition (b.1) excludes Jeﬀrey’s prior (corresponding to a = b = 0) for eﬀects not under l l 2 in selection but allows for ﬂat priors on variances and standard deviations (τ ) . Conditions (b.2) in sel to (b.4) relate the ranks κ and κ of the prior precision matrices of each of the eﬀects to the rank κ of all prior precision matrices. For eﬀects not under selection, the conditions can in be ensured by increasing a . Condition (b.5) restricts the number of all eﬀects to be smaller or equal to the number of observations but can be relaxed by increasing the hyperparameters in values a and a . Condition (b.7) is always fulﬁlled for b > 0. In case of an improper prior for ε ε τ , SSE > 0 has to be assured, while b > 0 becomes necessary when the number of unknown parameters is greater than n. Theorem 1. Consider the Gaussian mean regression model (7) with mixed model representa- tion (5) and rank conditions from (6). 1. κ = t: Then, conditions (b.1),(b.3),(b.5) and (b.7) are necessary for the propriety of the joint posterior while conditions (b.1),(b.3),(b.4),(b.6) and (b.7) are suﬃcient for the propriety of the joint posterior. 2. κ < t: Then, conditions (b.1),(b.2),(b.5) and (b.7) are necessary for the propriety of the joint posterior while conditions (b.1),(b.3),(b.4),(b.6) and (b.7) are suﬃcient for the propriety of the joint posterior. The proof of Theorem 1 is given in the Online Appendix A.3. Remark 2. For eﬀects not under selection, additional conditions on the ranks κ and the number in of eﬀects compared to the shape parameters (a ) of the priors are required, as the latter can in be improper and hence (a ) < 0 becomes possible. Consequently, one has to consider the cases t = κ or L = 1 as well as t < κ and L > 1 separately. This is not necessary for eﬀects with NBPSS prior. 3.4.4 Distributional Regression In order to achieve suﬃcient conditions for the propriety of the posterior in distributional re- gression, we deﬁne a normalized submodel with Gaussian errors to be able to apply results of 15 Theorem 1. More precisely, we ﬁrst separate the random eﬀect with largest dimension in each predictor of (5), such that we obtain in in η = U β + V b + V b , k k ε,k ε,k k unpen,k where V b corresponds to the eﬀect with proper prior and with the largest dimension, ε,k ε,k dim(b ) = rk(K ) = κ , and V b contains all remaining eﬀects with proper prior, both ε,k ε,k ε,k k k the ones with NBPSS prior and the ones not under selection with usual inverse gamma priors. Note that b is based on J = L +J −1 eﬀects in the notation in (5), with κ denoting the sum k k k k of ranks of the J precision matrices of predictor k, and where, w.l.o.g. we assume that the eﬀects in the predictors are ordered such that the (J +L )-th eﬀect corresponds to the random eﬀect in k k the mixed model representation with largest dimension. Similarly, the design matrix (V ,V ) k ε,k in sel corresponds to the design matrix (V ,V ). Note also, that b can originate from an eﬀect ε,k k k not under selection or one with NBPSS prior and we distinguish the two cases in Theorems 2 and 3. Assume that the set of observations can (after re-ordering) be partitioned such that for n ≥ 1 Z Z (c.1) . . . p(y |η , . . ., η )dη . . . dη < ∞ for i = 1, . . ., n . i i1 iK i1 iK (c.2) p(y |η , . . . , η ) ≤ M for i = n + 1, . . ., n, i i1 iK −1 where η = h (ϑ ), i = 1, . . . , n, k = 1, . . ., K. This implies that for at least one observa- ik ik tion the density is integrable (with respect to the predictors) and that all remaining densities are bounded. For discrete distributions, all densities are automatically bounded by 1 so that only Condition (c.1) can be an issue in practice. Condition (c.1) is usually fulﬁlled if certain restrictions apply on speciﬁc parameters that exclude extreme values on the boundary of the parameter space, see Klein, Kneib and Lang (2015) for a more detailed discussion on count data and binary distributions. For continuous distributions, the densities are sometimes not bounded (e.g. for the gamma distribution). Note that this is not a problem when all observations fulﬁl Condition (c.1) since n = n is allowed. Similar as for the discrete distributions, integrability of the densities can be assured by the assumption that none of the distributional parameters is on the boundary of the parameter space (an assumption that would also have to be made to apply standard maximum likelihood asymptotics). Let n˜ = min{κ , . . . , κ } and assume that we can choose n˜ observations including at least ε ε,1 ε,K ε one observation fulﬁlling (c.1) to deﬁne the submodel η = U β + V b + V b (8) k,s k,s k ε,k,s ε,k k,s unpen,k 2 ′ with these observations, such that V b ∼ N(0, τ V V ). Then the following rank ε,k,s ε,k ε,k,s ε,k,s ε,k conditions have to be fulﬁlled: (c.3) The design matrix U has full rank r . k,s k (c.4) rk(U ,V ) = rk(U ,V ) = r + t . k k k,s k,s k k (c.5) rk(V ) = n ˜ i.e. V is of full rank for k = 1, . . . , K. ε,k,s ε ε,k,s To ensure (c.3), superﬂuous columns arising from the reparameterisation have to be deleted. In particular, duplicated constant columns representing the levels of the functions are deleted, 16 see Klein and Kneib (2016, Remark 2 (iii) for details). Condition (c.4) indicates that the rank of the design matrices in the submodel is the same as in the complete model whereas (c.5) deﬁnes a similar restriction for the design matrix of the largest random eﬀect arising from the mixed model representation. Finally, the normalised submodel ˜ ˜ η˜ = U β + V b + ε , ε ∼ N(0, τ I ) (9) k,s k,s k k,s k,s n ˜ k,s unpen,k ε,k ε ′ −1/2 is obtained by multiplying (8) with M = (V V ) such that η˜ = M η , U = k ε,k,s ε,k,s k k,s k,s k,s M U , V = M V , and ε represents an i.i.d. random eﬀect. k k,s k,s k k,s k,s The corresponding residual sum of squares for the normalised submodel is ˜ ˜ ˜ ˜ SSE := η˜ − U β − V b η˜ − U β − V b . (10) k,s k,s k,s k k,s k,s k k,s unpen,k k,s unpen,k To derive suﬃcient conditions for the propriety of the posterior we have to distinguish two cases: the largest random eﬀect ε corresponds to an eﬀect with a) NBPSS prior and b) not under k,s selection and with the usual inverse gamma priors for the variance τ . ε,k in in in (c.6a) a < b = 0 or b > 0, l = 1, . . ., L . l,k l,k l,k in in in (c.6b) a < b = 0 or b > 0, l = 1, . . ., L − 1. l,k l,k l,k in in (c.7a) κ + 2a > κ − t , l = 1, . . ., L . k k k l,k l,k in in (c.7b) κ + 2a > κ − t , l = 1, . . ., L − 1. k k k l,k l,k sel (c.8a) κ + 2a − 1 > κ − t , j = 1, . . . , J − 1. j,k k k k j,k sel (c.8b) κ + 2a − 1 > κ − t , j = 1, . . . , J . j,k k k k j,k in (c.9a) n ˜ + 2a + 2 min(0, a ) > r + (J − 1). ε ε,k k k l,k l=1 L −1 in (c.9b) n ˜ + 2a + 2 min(0, a ) > r + J . ε ε,k k k l,k l=1 (c.10a) SSE > 0. (c.10b) SSE + 2b > 0. k ε Above, Conditions (c.·a) each correspond to the case that the largest random eﬀect has a variance with inverse gamma prior, while Conditions (c.·b) each are active when the variance of the largest random eﬀect has an NBPSS prior. Conditions (c.6a),(c.6b) require that if for eﬀects in in not under selection b is set to zero, the parameter a has to be negative. This includes l,k l,k in situations corresponding to ﬂat priors for the random eﬀects variance (a = −1) or standard l,k in in deviation (a = −0.5) but excludes Jeﬀreys prior (a =0). Conditions (c.7a), (c.7b) and (c.8a), l,k l,k (c.8b) relate the rank of the random eﬀects part of one individual eﬀect to the sum of all rank deﬁciencies in the corresponding predictor, are similar for eﬀects not under selection and the ones with NBPSS prior and require that the dimensionality is not too small. The condition can be in ensured by increasing the shape parameters a and a , respectively. Conditions (c.9a), (c.9b) j,k l,k restrict the number of eﬀects not under selection and with ﬂat prior to be at most equal to the dimension of the largest random eﬀects part in the model but can again be relaxed by increasing in the shape parameters a . Finally, Conditions (c.10a), (c.10b) require that there is variation in l,k the residual sum of squares in the normalized submodel (implying that not all eﬀects are zero) 17 in situations where the largest random eﬀect has an NBPSS prior and either variation in the residual sum of squares or b > 0 when the largest random eﬀect has the usual inverse gamma prior on the variances. The latter requirement can always be ensured in practice but excludes ﬂat priors for the random eﬀects variances or standard deviations. Theorem 2. Consider the distributional regression model with densities (M1) and predic- tors (M2). Let ε be an i.i.d. random eﬀect with variance τ ∼ IG(a , b ). Then, Con- k,s ε,k ε,k ε,k ditions (c.1), (c.2) on the densities, (c.3) to (c.5) on the ranks as well as (c.6b), (c.7b), (c.8b), (c.9b), (c.10b) on the hyperpriors are suﬃcient conditions for a proper posterior. The proof of Theorem 2 follows from the proof of Klein, Kneib and Lang (2015) using Theorem 1 above as we assume that all NBPSS priors are proper. Theorem 3. Consider the distributional regression model with densities (M1) and predic- tors (M2). Let ε P be an i.i.d. random eﬀect with NBPSS prior with parameters τ ∼ k,s ε,k 2 2 Ga(1/2, 1/(2r(δ )ψ )), ψ ∼ IG(a , b ), δ ∼ Be(ω ), ω ∼ Beta(a , b ). Then, ε,k ε,k ε,k ε,k ε,k 0,ε,k 0,ε,k ε,k ε,k Conditions (c.1), (c.2) on the densities, (c.3) to (c.5) on the ranks as well as (c.6a), (c.7a), (c.8a), (c.9a), (c.10a) on the hyperpriors are suﬃcient conditions for a proper posterior. The proof of Theorem 3 is given in the Online Appendix A.4. 4 Posterior Estimation Update of the Basis Coeﬃcients. Due to the modular structure of Markov chain Monte Carlo (MCMC) simulation algorithms, no changes in the MCMC scheme developed by Klein, Kneib, Lang and Sohn (2015) are required for updating the basis coeﬃcients β when sup- plementing them with a NBPSS prior instead of the standard inverse gamma prior. We therefore apply iteratively weighted least squares based approximations to the log full conditional and −1 generate proposals from the multivariate normal distribution N(μ,P ) with expectation and precision matrix given by −1 ′ ′ μ = P B W (y˜ − η ) P = B WB + K (11) where η = η − Bβ is the predictor without the eﬀect currently updated and the working observations y˜ and weights W are determined based on ﬁrst and second derivatives of the log- likelihood with respect to the predictor. Update of the Smoothing Variance for Eﬀects not Subject to Selection. For eﬀects not subject to selection, we consider an inverse gamma prior τ ∼ IG(a, b) for the smoothing variances such that the update of τ can be done via a simple Gibbs sampling step drawing from rk(K) ′ 2 ′ ′ ′ ′ 1 τ |· ∼ IG(a , b ), with updated parameters a = + a, b = β Kβ + b. 2 2 Update of the Squared Importance Parameter for Eﬀects Subject to Selection. The 2 2 full conditional p(τ |β, δ, ψ ) is a generalised inverse Gaussian distribution GIG(p, q, c), with p = −0.5 rk(K) + 0.5, q = 1/(r(δ)ψ ), c = β Kβ and can be generated eﬃciently in a Gibbs- step. This has the advantage that τ can be generated independently of the likelihood in an 18 eﬃcient Gibbs step. This is no longer possible when the prior is formulated for the importance parameter τ as in (Scheipl et al., 2012) where a Metropolis-Hastings update is required, see the Online Appendix C. Updates for the Hyperparameters of the NBPSS prior. For the hyperparameters of the NBPSS prior, we obtain Gibbs sampling steps via the following full conditionals: Inclusion indicator δ: 1 1 p(δ = 1|·) = = , 1−ω ϕ(τ;0;rψ )(1−ω) 1 + L 1 + 2 ω ϕ(τ;0;ψ )ω 2 2 where ϕ(·; μ, σ ) denotes the density of the normal distribution with mean μ and variance σ and 2 2 ϕ(τ; 0, rψ ) 1 − (1/r−1) 2ψ L = = e . ϕ(τ; 0, ψ ) r Hyper-variance ψ : ψ |· ∼ IG a + 0.5, b + 2r(δ) Inclusion probability ω: ω|· ∼ Beta(a + δ, b + 1 − δ) 0 0 Note that it is also possible to use the same ω for multiple eﬀects simultaneously. If ω relates to a total of L eﬀects, the full conditional is then given by L L X X ω|· ∼ Beta a + δ , b + L − δ 0 l 0 l l=1 l=1 Implementation. Spike and slab based eﬀect selection in distributional regression has been implemented in a developer version of BayesX (Belitz et al., 2015) which is available from the authors on request. The software makes use of methods for eﬃcient storing of large data sets and sparse matrix algorithms for sampling from multivariate Gaussian distributions (George and Liu, 1981; Rue, 2001) and also allows us to access existing procedures for example for computing si- multaneous conﬁdence bands for nonparametric eﬀects as developed in Krivobokova et al. (2010). Hyperparameter elicitation is integrated in the R-package sdPrior (Klein, 2018). 5 Empirical Evaluations 5.1 Simulations To evaluate the performance of the NBPSS prior for eﬀect selection in distributional regression, we conducted extensive simulations under various settings. We distinguish diﬀerent scenarios for the predictor complexity, models including and excluding spatial eﬀects, four selected response distributions, varying sample sizes, correlated and uncorrelated covariates and a set of user- deﬁned parameters for hyperprior elicitation. Speciﬁcally, we consider Gaussian responses with eﬀects only on the expectation, a Gaussian location-scale model, Poisson regression and zero-inﬂated Poisson models. 19 • we specify four test functions – f (x) = x – f (x) = −x + πsin(πx) 1 3 (2x−2) – f (x) = x + – f (x) = 0.5x + 15φ(2(x− 0.2)) − φ(x + 0.4). 2 4 5.5 we distinguish two scenarios in terms of the predictor complexity: – low sparsity in which out of 16 included covariates 12 have non-zero inﬂuence. The true lin- ear predictor is η = f (x )+f (x )+f (x )+f (x )+1.5 (f (x ) + f (x ) + f (x ) + f (x ))+ 1 1 2 2 3 3 4 4 1 5 2 6 3 7 4 8 2(f (x )+f (x )+f (x )+f (x ) and we simulate the two cases with additional and with- 1 9 2 10 3 11 4 12 out additional spatial eﬀect f (s), labeled as ‘spatial/non-spatial’. These settings are used spat for η in the homoscedastic Gaussian and the Gaussian location-scale model, as well as for η in the Poisson and the zero-inﬂated Poisson model. – high sparsity in which out of eight included covariates four have non-zero inﬂuence. The true linear predictor is η = f (x ) + f (x ) + f (x ) + f (x ) and we again simulate the two 1 1 2 2 3 3 4 4 cases with additional and without additional spatial eﬀect f (s). These settings are used spat for η 2 in the Gaussian location-scale model and for η in the zero-inﬂated Poisson model. σ π we generate covariates either – as i.i.d. realizations from U[−2, 2] or – from an AR(1) process with correlation ρ = 0.7 and standarize x in order to facilitate prior elicitation. we simulate 150 replications for each combination of the settings. we use six combinations of α and c for the elicitation of the prior hyperparameters b and r arising from the pairwise combination of – α = 0.05, 0.1, 0.2, – c = 0.1, 0.2. we consider the sample sizes n = 200; 1, 000 for Gaussian, n = 500; 2, 000 for Poisson, n = 1, 000; 2, 000 for Gaussian location-scale and zero-inﬂated Poisson responses. The sample sizes have been chosen to reﬂect a challenging (small sample size) and a relatively informative (large sample size) setting, taking the diﬀerent complexity of the model structures into account. As a competitor for the single parameter distributions Gaussian and Poisson, we consider the peNMIG prior of Scheipl et al. (2012) implemented in the R-package spikeSlabGAM (Scheipl, 2016). We refrain from comparison with further variable selection priors mentioned in the intro- duction as these usually lack applicability beyond the framework of generalized linear models. Hyperparameter elicitation for the NBPSS prior was performed with the package sdPrior (Klein, 2018) and estimation was done with the current developer version of BayesX (Belitz et al., 2015). For both the NBPSS and the peNMIG prior, non-linear eﬀects are based on 20 cubic B-spline basis functions constructed from an equidistant set of knots combined with second-order random walk prior unless stated otherwise. In the following, we restrict ourselves to the main conclusions, a detailed description about simulation settings and evaluation including complete graphical evidence is provided in the Online Appendix D. As a general outcome, the NBPSS prior results in very good performance for the selection of relevant eﬀects even in challenging distributional regression settings with eﬀect selection on multiple distributional parameters, where no competing Bayesian variable selection 20 approach is available so far. Evidence for that is given in Figures 5 and 6 showing posterior inclusion probabilities and the ratio between predictive NBPSS log-scores and oracle log-scores (i.e. log-scores arising from a model with given, true predictor speciﬁcation), respectively, in the zero-inﬂated Poisson model. The log-scores have been computed from independently generated test data sets with 5,000 observations. In the simple exponential family framework with only one single regression predictor, the NBPSS prior turns out to be a strong competitor to the peNMIG prior (see Figure 7 for overall accuracy results of the Poisson model). Selection of large coeﬃcient blocks such as spatial eﬀects works well for all types of response distributions, while these are particularly problematic with peNMIG due to severe mixing problems. On the other hand, the explicit reparameterisation of non-linear eﬀects used with the peNMIG prior (as compared to the constrained sampling approach that NBPSS is based on) seems to have some advantages in separating the linear and non-linear part of non-linear eﬀects in cases where the true eﬀect is close to linear and at the same time covariates are strongly correlated. Coinciding with previous evidence on Bayesian eﬀect selection, we ﬁnd a strong impact of hy- perprior parameter choice on the resulting eﬀect selection performance. Our interpretable yet ﬂexible way of eliciting hyperprior parameters equips data analysts with an intuitive approach for choosing these hyperparameters. More precisely, changing the probability α and the threshold c can help to balance between the true positive and false negative rates of eﬀect selection. Choos- ing α and c smaller, results in more conservative, i.e. sparser models. Based on our simulations, we suggest α = c = 0.1 as default values in our applications. In summary, our simulations demonstrate that the NPBSS prior provides a promising approach for Bayesian eﬀect selection that extends existing methods to a framework that is applicable in any distributional regression model comprising both multiple hierarchical predictor speciﬁcations and high-dimensional coeﬃcient vectors. In addition, our eﬀect decomposition allows to select the linear part and its non-linear deviation for an eﬀect of a continuous covariate separately. 5.2 Applications In this section, we demonstrate the eﬃcacy of a simultaneous selection approach via the NBPSS prior speciﬁcation and its applicability for non-Gaussian, discrete or multivariate data. Core information about the diﬀerent data sets Patents, Nigeria and House prices including the type of response distribution, number of observations and eﬀects can be found in Table 1. Estimates shown in the subsequent subsections are all the model-averaged estimates obtained from the MCMC iterates with the NBPSS prior and the covariates have been standardized for prior elicitation reasons. 5.2.1 Number of Patent Citations The Patents data set contains the number of citations of patents granted by the European Patent Oﬃce (EPO). An inventor who applies for a patent has to cite all related, already existing patents his patent is based on. Klein, Kneib and Lang (2015) use this data set to illustrate their developed methodology on Bayesian zero-inﬂated and overdispersed count data and conducted 21 variable selection in a stepwise forward approach based on the deviance information criterion (DIC). In the following, we focus on zero-inﬂated Poisson (ZIP) models for analysing the number of patent citations. The ZIP model has two distributional parameters, λ, the rate of the count process, and π the probability of observing an excess of zeros. Including all available variables in one of the predictors η , k = 1, 2 reads as η = β + x β + f (year) + f (ncountry) + f (nclaims), k 0,k 1,k 2,k 3,k where x contains the continuous variables year (year when patent was granted), ncountry (num- ber of designated states for patent), nclaims (number of patent claims), as well as the binary indicators ustwin (twin patent in the US), opp (oppositions against the patent), biopharm (patent from the biotech/pharma sector), patus (patent holder from the US) and patgsgr (patent holder from Germany, Switzerland or Great Britain), see Table E.1 in the Online Appendix for sum- mary statistics of the variables. Possible non-linear eﬀects of the three continuous variables are captured by the functions f to f . The predictor speciﬁcations of the model identiﬁed in 1 3 Klein, Kneib and Lang (2015) via stepwise DIC-selection are η = β + β opp + β biopharm + β patus + β patgsgr + f (year) + f (ncountry) λ 0,λ 1,λ 2,λ 3,λ 4,λ 1,λ 2,λ + f (nclaims) 3,λ η = β + β opp + β biopharm + β patus + β patgsgr + f (year) + f (ncountry). π 0,π 1,π 2,π 3,λ 4,λ 1,π 2,π This model is denoted as ZIP DIC in the following. We compare this model to the model ZIP NPBSS with predictors selected by the NBPSS prior where r and b were determined from α ∈ {0.05, 0.1}, c = 0.1. Table 2 reports predictive log-scores (obtained from ten-fold cross validation) as well as values for the DIC and the widely applicable information criterion (WAIC). From the table, we can conclude, that the ZIP NPBSS model is clearly favoured in terms of the chosen criteria. For the NBPSS model, we report posterior probabilities P(δ|y) in Table 3. Based on the decision to include an eﬀect if P(δ|y) ≥ 0.5 holds, the NBPSS prior coincides with the stepwise approach of ZIP DIC for the eﬀects of the continuous covariates but yields a sparser prediction speciﬁcation for the eﬀects of binary covariates. 5.2.2 Bivariate Analysis of Undernutrition The Nigeria data have been extracted from Demographic and Health Surveys (DHS, https://dhsprogram.com/) containing nationally representative information about the pop- ulation’s health and nutrition status in numerous developing and transition countries. Here we use data from Nigeria collected in 2013. Overall there are 23,042 observations after removing out- liers and inconsistent observations from the data. We use stunting and wasting as the bivariate response vector, where stunting refers to stunted growth measured as insuﬃcient height of the child with respect to its age, while wasting refers to insuﬃcient weight for height. Hence stunt- ing is an indicator for chronic undernutrition while wasting reﬂects acute undernutrition. We assume that the two indicators are jointly normally distributed with marginal means, marginal scales and correlation parameter depending on covariates. Speciﬁcally, the model equations for all predictors of the distributions are speciﬁed as η =β + x β + f (cage) + f (mage) + f (mbmi) + f (region), k 0,k k 1,k 2,k 3,k spat,k 22 where x contains 13 binary covariates characterising the household the child is living in as well as the child itself, see Table C.3 of the Online Appendix for a full description of variables. The three non-linear eﬀects f to f of cage (age of the child in months), mage (age of the mother in 1 3 years), mbmi (body mass index of the mother) are decomposed into their linear and non-linear part as described in Section 2.2. For the scale parameters, we used an exponential response function and for ρ the response function g(x) = x/ (1 + x ). The DIC/WAIC of the full model and model with NBPSS prior are 159,101/159,190 and 159,101/159,173, respectively and hence slightly better for the NBPSS prior model. Figures 8 and 9 show the posterior means together with their 95% posterior credible intervals of linear and non-linear eﬀects for the full model (blue) and the model with NBPSS prior (red). For the the function estimates f = f + f , Figure 9 shows the corresponding non- j,k j,k,lin j,k,nonlin linear part f separate from the linear part f in Figure 8, while the sum of the j,k,nonlin j,k,lin two components can be found in the Online Appendix F. We see that both models yield very similar point estimates, however the NBPSS prior results in slightly smoother estimates and more narrow credible intervals and hence more precise predictions – as desired with an eﬀective variable selection approach. Spatial eﬀects of the ﬁve distribution parameters with the NBPSS prior are visualized in Figure 10. While we omit the ones of the full model, tendencies are similar as for the remaining eﬀects. Inclusion probabilities are reported in Table 4. We ﬁnd that the regional eﬀect is relevant in all distribution parameters, i.e. not only the marginal means but also the scales and the correlation between stunting and wasting. Interestingly, chronic undernutrition measured by stunting seems to be mostly driven by variables describing the life situation of the children. In contrast, besides the region of residence, the mother’s nutritional status measured by mbmi has a relevant eﬀect only for acute undernutrition (wasting). 5.2.3 Hedonic House Prices We apply our methodology to the house prices dataset of n = 98, 354 single family homes in Germany. The data were provided by F+B Research & Consulting for Habitation, Real Estate and Environment Ltd, a business consultancy in Hamburg, Germany. We consider the price per square metre in Euro as the response variable and explain the variation in prices in terms of four continuous covariates representing year of construction (yoc), expert rating (rating), plot area (areapl), living area (arealiv) and spatial location (dist). We use district-speciﬁc averages yoc and rating as further covariates. We assume a Gaussian hierarchical location-scale dist dist model, where both expectation μ and log-variance log(σ ) are related to the following hierarchical predictor. Level 1 (houses): (1) (1) (1) (1) (1) (1) η = f (yoc) + f (rating) + f (areapl) + f (arealiv) + f (dist) k 1,k 2,k 3,k 4,k 5,k Level 2 (districts): (2) (2) (2) (2) η = f (yoc ) + f (rating ) + f (dist), dist dist dist,k 1,k 2,k 3,k (2) where f (dist) follow Gaussian Markov random ﬁelds for k = 1, 2 and, as before, we decompose 3,k 23 the eﬀects of the continuous covariates in both levels into their linear and non-linear part such that we end up with 26 eﬀects in total. The NBPSS prior is put on all eﬀects and inclusion probabilities are given in Table 5, while Figures 11 to 13 show the estimated linear and non- (l) linear parts of each function f , l = 1, 2 with the NBPSS prior compared to the ones of the full j,k (l) (l) (l) model. The recomposed function estimates f = f + f and the estimated spatial j,k j,k,lin j,k,nonlin eﬀects can be found in the Online Appendix G. In summary, we ﬁnd that the NBPSS prior demonstrates its eﬀect selection and shrinkage abilities also in hierarchical settings. While on level 1 the full model and the model with NBPSS prior mostly coincide, we see considerable regularisation of some non-linear eﬀects for level 2. The NBPSS prior is clearly able to select the spatial eﬀect and non-linear part of rating in both distribution parameters, while the linear dist part and the eﬀect of yoc would be excluded according to the inclusion probabilities. dist 6 Summary and Discussion In this paper, we have developed a novel prior structure for Bayesian eﬀect selection in struc- tured additive distributional regression models thus extending existing approaches in terms of both ﬂexibility of available response distributions and predictor ﬂexibility. We derived shrinkage properties of the NBPSS prior and show its favourable properties. In simulations we demonstrate empirically that the NBPSS prior is applicable even to the selection of high dimensional coeﬃ- cient blocks in more than one distribution parameter. The method promises wide applicability which we illustrate along three diﬀerent examples including zero-inﬂated count data, a bivariate Gaussian model and a hierarchical location-scale speciﬁcation for hedonic housing priors. Instead of arbitrarily ﬁxing hyperparameters of the inverse gamma priors we provide an intuitive and interpretable way for hyperprior elicitation which is easily accessible by applied users. This is an important feature since results react sensitively with respect to the actual choices of hyper- parameters. Yet, the NBPSS prior controls the ﬂexibility of each eﬀect separately since priors are assumed to be independent and does not allow to control the overall complexity of the predictor. However, the NBPSS prior could be extended to achieve also global shrinkage properties, e.g. by specifying the scale parameter in the prior on τ as a product of a global and a local param- eter (Polson and Scott, 2010). As in distributional regression the propriety of the posterior is not trivial, however, care has to be taken with respect to the speciﬁc prior choices (Ghosh et al., 2018). Alternatively, if interest is rather in smoothing and shrinkage than in explicit eﬀect se- lection shrinkage priors like the double gamma prior Bitto and Fru¨hwirth-Schnatter (2018) or penalised complexity priors Simpson et al. (2017) might be used. Also, it is conceptually straightforward to include Bayesian quantile or expectile regression models into the NBPSS prior framework and we aim to do so in a future work. References Belitz, C., Brezger, A., Klein, N., Kneib, T., Lang, S. and Umlauf, N. (2015). BayesX - Software for Bayesian inference in structured additive regression models. Version 3.0.2. Available from http://www.bayesx.org. Bitto, A. and Fru¨hwirth-Schnatter, S. (2018). Achieving shrinkage in a time-varying parameter model framework, arXiv: 1611.01310v2. 24 Chung, Y. and Dunson, D. B. (2009). Nonparametric Bayes conditional distribution modeling with variable selection, Journal of the American Statistical Association 104(488): 1646–1660. Clyde, M. and George, E. I. (2004). Model uncertainty, Statistical Science 19(1): 81–94. Cottet, R., Kohn, R. J. and Nott, D. J. (2008). Variable selection and model averaging in semiparametric overdispersed generalized linear models, Journal of the American Statistical Association 103: 661–671. Fahrmeir, L., Kneib, T. and Lang, S. (2004). Penalized structured additive regression for space-time data: A Bayesian perspective, Statistica Sinica 14: 731–761. Gelman, A., Van Dyk, D., Huang, Z. and Boscardin, W. J. (2008). Using redundant parameterizations to ﬁt hierarchical models, Journal of Computational and Graphical Statistics 17: 95–122. George, A. and Liu, J. W. (1981). Computer Solution of Large Sparse Positive Deﬁnite Systems, Prentice-Hall, Englewood Cliﬀs. George, E. and McCulloch, R. (1997). Approaches to Bayesian variable selection, Statistica Sinica 7: 339–374. Ghosh, J., Li, Y. and Mitra, R. (2018). On the use of Cauchy prior distributions for Bayesian logistic regression, Bayesian Analysis 13(3): 359–383. Ishwaran, H. and Rao, S. (2005). Spike and slab variable selection: frequentist and Bayesian strategies, The Annals of Statistics 33: 730–773. Kammann, E. E. and Wand, M. P. (2003). Geoadditive models, Journal of the Royal Statistical Society. Series C (Applied Statistics) 52: 1–18. Klein, N. (2018). sdPrior: Scale-Dependent Hyperpriors in Structured Additive Distributional Regression. R package version 0.6. Klein, N. and Kneib, T. (2016). Scale-dependent priors for variance parameters in structured additive distribu- tional regression, Bayesian Analysis 11: 1107–1106. doi:10.1214/15-BA983. Klein, N., Kneib, T., Klasen, S. and Lang, S. (2015). Bayesian structured additive distributional regression for multivariate responses, Journal of the Royal Statistical Society. Series C (Applied Statistics) 64: 569–591. Klein, N., Kneib, T. and Lang, S. (2015). Bayesian generalized additive models for location, scale and shape for zero-inﬂated and overdispersed count data, Journal of the American Statistical Association 110: 405–419. Klein, N., Kneib, T., Lang, S. and Sohn, A. (2015). Bayesian structured additive distributional regression with an application to regional income inequality in Germany, The Annals of Applied Statistics 9: 1024–1052. Krivobokova, T., Kneib, T. and Claeskens, G. (2010). Simultaneous conﬁdence bands for penalized spline esti- mators, Journal of the American Statistical Association 105: 852–863. Kundu, S. and Dunson, D. B. (2014). Bayes variable selection in semiparametric linear models, Journal of the American Statistical Association 109(505): 437–447. Lang, S. and Brezger, A. (2004). Bayesian P-splines, Journal of Computational and Graphical Statistics 13: 183– Lang, S., Umlauf, N., Wechselberger, P., Harttgen, K. and Kneib, T. (2014). Multilevel structured additive regression, Statistics and Computing 24: 223–238. Mitchell, T. and Beauchamp, J. J. (1988). Bayesian variable selection in linear regression, Journal of the American Statistical Association 83: 1023–1032. O’Hara, R. and Sillanp¨aa¨, M. (2009). A review of Bayesian variable selection methods: What, How, and Which, Bayesian Analysis 4: 85–118. Panagiotelis, A. and Smith, M. S. (2008). Bayesian identiﬁcation, selection and estimation of functions in high- dimensional additive models, Journal of Econometrics 143: 291–316. P´erez, M.-E., Pericchi, L. R. and Ram´ez, I. C. (2017). The scaled beta2 distribution as a robust prior for scales, Bayesian Analysis 12(3): 615–637. Polson, N. G. and Scott, J. G. (2010). Shrink globally, act locally: Sparse Bayesian regularization and prediction, in J. M. Bernardo, M. J. Bayarri, J. O. Berger, A. P. Dawid, D. Heckerman, A. F. M. Smith and M. West (eds), Bayesian Statistics 9, Oxford. 25 Reich, B. J., Storlie, C. B. and Bondell, H. (2009). Variable selection in bayesian smoothing spline anova models: Application to deterministic computer codes, Technometrics 51: 110–120. Rigby, R. A. and Stasinopoulos, D. M. (2005). Generalized additive models for location, scale and shape (with discussion), Journal of the Royal Statistical Society. Series C (Applied Statistics) 54: 507–554. Rossell, D. and Rubio, F. J. (2017). Tractable Bayesian variable selection: beyond normality, To appear in Journal of the American Statistical Association . Rue, H. (2001). Fast sampling of Gaussian Markov random ﬁelds with applications, Journal of the Royal Statistical Society. Series B (Statistical Methodology 63: 325–338. Rue, H. and Held, L. (2005). Gaussian Markov Random Fields, Chapman & Hall/CRC, New York/Boca Raton. Ruppert, D., Wand, M. P. and Carroll, R. J. (2003). Semiparametric Regression, Cambridge University Press. Scheipl, F. (2016). spikeSlabGAM: Bayesian Variable Selection and Model Choice for Generalized Additive Mixed Models. R package version 1.1.11. Scheipl, F., Fahrmeir, L. and Kneib, T. (2012). Spike-and-slab priors for function selection in structured additive regression models, Journal of the American Statistical Association 107: 1518–1532. Simpson, D., Rue, H. Martins, T. G., Riebler, A. and Sørbye, S. H. (2017). Penalising model component com- plexity: A principled, practical approach to constructing priors, Statistical Science 32(1): 1–28. Smith, M. S. and Kohn, R. (1996). Nonparametric regression using Bayesian variable selection, Journal of Econometrics 75: 317–343. Wang, L., Yuanyuan Tang, Y., Debajyoti, S., Pati, D. and Stuart Lipsitz, S. (2017). Bayesian variable selection for skewed heteroscedastic response, Technical report. arXiv:1602.09100v2. Wood, S. N. (2011). Fast stable restricted maximum likelihood and marginal likelihood estimation of semipara- metric generalized linear models, Journal of the Royal Statistical Society. Series B (Statistical Methodology) 73: 3–36. Wood, S. N. (2017). Generalized Additive Models : An Introduction with R, 2nd edn, Chapman & Hall/CRC, New York/Boca Raton. Xu, X. and Ghosh, M. (2015). Bayesian variable selection and estimation for group lasso, Bayesian Analysis 10(4): 909–936. Yau, P., Kohn, R. and Wood, S. (2003). Bayesian variable selection and model averaging in high-dimensional multinomial nonparametric regression, Journal of Computational and Graphical Statistics 12: 23–54. Zhang, L., Baladandayuthapani, V., Mallick, B. K., Manyam, G. C., Thompson, P. A., Bondy, M. L. and Do, K.-A. (2014). Bayesian hierarchical structured variable selection methods with application to molecular inversion probe studies in breast cancer, Journal of the Royal Statistical Society: Series C (Applied Statistics) 63(4): 595–620. Zhu, H., Vannunci, M. and Cox, D. D. (2010). A Bayesian hierarchical model for classiﬁcation with selection of functional predictors, Biometrics 66: 463–473. 26 Data set sample size no. of eﬀects distribution computing time Patents 4,805 22 zero-inﬂated Poisson 0.25 min Nigeria 23,042 108 bivariate normal 5.92 min House prices 98,354 26 Gaussian location-scale 3.75 min Table 1: Summaries for the data sets Patents, Nigeria, and House prices. Columns 2 to 4 show the number of observations, number of potential eﬀects in the full model and the distribution for the response. The last column reports the computing time required for estimating 1,000 subsequent MCMC sweeps with the NBPSS prior. Model Quadratic score Log score Spherical score DIC WAIC ZIP DIC -3,465.6 -8,866.8 2,500.4 17,136.3 17,214.4 ZIP NPBSS(α = 0.1) -3460.1 -8817.6 2511.9 17,124 17,206 ZIP NBPSS(α = 0.05) -3467.2 -8803.8 2507.1 17,118.2 17,205.1 Table 2: Patent citations: Summarised scores in the models under consideration. Values for the predictive scores were obtained from ten-fold cross validation while DIC/WAIC are based on estimates obtained with the complete data set. The best model according to each of the criteria is highlighted in bold. Covariate Scale NBPSS ZIP DIC λ π λ π year continuous 0.129 1 ∅ ∅ lin year continuous 0.965 0.999 X X nonlin ncountry continuous 0.321 0.861 ∅ ∅ lin ncountry continuous 1 0.936 X X nonlin nclaims continuous 0.954 0.286 ∅ ∅ lin nclaims continuous 0.996 0.144 X – nonlin ustwin binary 0.061 0.168 – – opp binary 0.399 0.401 X X biopharm binary 0.293 0.381 X X patus binary 0.176 0.789 X X patgsgr binary 0.153 0.571 X X Table 3: Patent citations: Eﬀect selection. The second column indicates the scale of the variable (continu- ous/binary). The third and fourth column show posterior inclusion probabilities P(δ|y) of λ and π for α = 0.1 and c = 0.1 with the NBPSS prior. Checkmarks (X‘’) in the last two columns indicate that an eﬀect was selected in the stepwise approach of Klein, Kneib and Lang (2015), while ‘–’ denotes the non-selected eﬀects. Since Klein, Kneib and Lang (2015) did not decompose nonlinear eﬀects into linear eﬀects and the nonlinear deviation from this linear eﬀect, ‘∅’ is used for the corresponding linear parts in ZIP DIC. 27 Covariate NBPSS μ μ σ σ ρ wasting stunting wasting stunting bicycle binary 0.006 0.091 0.005 0 0.01 car binary 0.012 0.498 0.008 0.004 0.002 cbirthborder7 binary 0.005 0.108 0.005 0.005 0.004 cbirthborder6 binary 0.011 0.171 0.006 0.003 0.002 cbirthborder5 binary 0.018 0.426 0.005 0.002 0.008 cbirthborder4 binary 0.013 0.418 0.004 0.005 0.004 cbirthborder3 binary 0.009 0.569 0.004 0.005 0.003 cbirthborder2 binary 0.024 0.846 0.003 0.003 0.002 cbirthborder1 binary 0.007 0.858 0.004 0.007 0.005 csex binary 0.011 0.529 0.003 0.006 0.001 ctwin binary 0.135 0.952 0.008 0.007 0.002 electricity binary 0.006 0.194 0.004 0.002 0.004 motorcycle binary 0.013 0.08 0.005 0.002 0.002 mresidence binary 0.027 0.099 0.006 0.003 0.002 munemployed binary 0.003 0.069 0.005 0.002 0.002 radio binary 0.005 0.103 0.004 0.004 0.002 refrigerator binary 0.001 0.458 0.003 0.002 0.013 television binary 0.004 0.261 0.008 0.006 0.007 cage continuous 0.007 1 0.051 0.012 0.067 lin edupartner binary 0.013 0.921 0.004 0.011 0.002 lin mage continuous 0.008 0.9 0.006 0.007 0.005 lin mbmi continuous 0.951 0.937 0.019 0.007 0.004 lin cage continuous 1 1 0.131 0.209 0.393 nonlin edupartner continuous 0.073 0.204 0.213 0.088 0.069 nonlin mage continuous 0.078 0.301 0.323 0.055 0.086 nonlin mbmi continuous 0.304 0.12 0.09 0.06 0.095 nonlin region spatial 1 1 1 0.999 0.999 Table 4: Nigeria: Posterior inclusion probabilities P(δ|y) are shown in columns 3 to 7 for μ , μ , wasting stunting σ , σ and ρ with α = 0.1 and c = 0.1 for the NBPSS prior. The second column gives the scale of the wasting stunting variable (continuous/binary/spatial). Eﬀects selected according to a cut oﬀ of 0.5 are highlighted in bold. Covariate Level 1 yoc yoc rating rating areapl areapl arealiv arealiv lin nonlin lin nonlin lin nonlin lin nonlin μ 1.00 1.00 1.00 1.00 1.00 0.67 1.00 0.94 σ 1.00 1.00 1.00 1.00 1.00 0.38 1.00 1.00 Level 2 yoc yoc rating rating dist lin nonlin lin nonlin η 0.29 0.16 0.93 1.00 1.00 dist,µ η 2 0.18 0.19 0.41 0.63 1.00 dist,σ Table 5: House prices: Posterior inclusion probabilities P(δ|y) of μ and σ for α = 0.1 and c = 0.1 (ﬁrst row) and of η and η (second row) with the NBPSS prior. dist,µ dist,σ 28 −3 −3.5 −7 −7 −5 −2.5 −6 −3.5 −4.5 −6 −6 −4 −4.5 −5 −3 −5 −4 −5.5 −4 −3.5 −3 −4 −5 −4.5 −4 −1 −3 −1 −2 −2 −5 NMIG peNMIG NBPSS 0.0 0.5 1.0 1.5 2.0 2.5 3.0 Figure 1: Univariate marginal log-densities for a standard NMIG prior (solid line), the peNMIG prior of Scheipl et al. (2012, dashed line) and the NBPSS prior (dotted line). Hyperparameters are set to a = b = 1, 0 0 a = 5, b = 50, r = 0.005. NBPSS peNMIG NMIG 0.0 1.0 2.0 3.0 0.0 1.0 2.0 3.0 NBPSS peNMIG 0.0 1.0 2.0 3.0 0.0 1.0 2.0 3.0 0.0 1.0 2.0 3.0 β β 1 1 Figure 2: Contour lines of bivariate marginal log-densities for a standard NMIG prior (middle panel), the peNMIG (right column) and the NBPSS prior (left column). The ﬁrst row panels show results for parameters with distinct hyperparameters and the second row panels show results for parameters sharing the same τ. For the standard NMIG, the hyperparameters are by construction assumed to be distinct and no changes in the row are possible. β β 2 2 0.0 1.0 2.0 3.0 0.0 1.0 2.0 3.0 log(p(β)) −4 −3 −2 −1 0 1 2 0.0 1.0 2.0 3.0 0.0 1.0 2.0 3.0 0.0 1.0 2.0 3.0 −6 −3.5 −4 −4.5 −5 −5.5 0.0 0.5 1.0 1.5 2.0 2.5 3.0 0.0 0.5 1.0 1.5 2.0 2.5 3.0 f(ν) f(ν ) Figure 3: Univariate (left) and bivariate (right) marginal log-densities of f(ν). The hyperparameters have been ﬁxed at a = 5, b = 50, r = 0.005 and a = b = 1. 0 0 NMIG peNMIG NBPSS 0.0 0.5 1.0 1.5 2.0 2.5 3.0 Figure 4: Score function of the marginal prior p(β) for the standard NMIG (solid line), the parameter expanded prior by Scheipl et al. (2012, dashed line) and the parameter expanded prior proposed in this paper (dotted line). log(p(f(ν)) −3 −2 −1 0 1 δ/δβ p(β) −10 −8 −6 −4 −2 0 f(ν ) 0.0 0.5 1.0 1.5 2.0 2.5 3.0 π 1.00 0.75 c = 0.1 α = 0.05 0.50 0.25 0.00 1.00 0.75 0.50 c = 0.2 α = 0.05 0.25 0.00 1.00 0.75 c = 0.1 α = 0.1 0.50 0.25 1.00 0.75 0.50 c = 0.2 α = 0.1 0.25 0.00 1.00 0.75 c = 0.1 α = 0.2 0.50 0.25 1.00 0.75 c = 0.2 α = 0.2 0.50 0.25 Figure 5: Posterior inclusion probabilities of eﬀects in the zero-inﬂated Poisson model with n = 2, 000 observa- tions, uncorrelated covariates and no true spatial eﬀect in the predictor (i.e. the data generating model does not comprise a spatial eﬀect but we estimate a model including a spatial eﬀect) . Blue boxplots correspond to eﬀects that are included in the true model while the red boxes correspond to the noise variables that do not have an eﬀect in the data generating mechanism. lin(x1) lin(x2) lin(x3) lin(x4) lin(x5) lin(x6) lin(x7) lin(x8) lin(x9) lin(x10) lin(x11) lin(x12) lin(x13) lin(x14) lin(x15) lin(x16) sm(x1) sm(x2) sm(x3) sm(x4) sm(x5) sm(x6) sm(x7) sm(x8) sm(x9) sm(x10) sm(x11) sm(x12) sm(x13) sm(x14) sm(x15) sm(x16) mrf(region) lin(x1) lin(x2) lin(x3) lin(x4) lin(x5) lin(x6) lin(x7) lin(x8) sm(x1) sm(x2) sm(x3) sm(x4) sm(x5) sm(x6) sm(x7) sm(x8) mrf(region) n = 1000 n = 2000 1.15 1.10 non - spatial ρ = 0 1.05 1.00 1.0 ρ = 0 spatial 0.8 0.6 1.4 1.2 non - spatial ρ = 0.7 1.0 0.8 1.2 1.1 1.0 spatial ρ = 0.7 0.9 0.8 Figure 6: Violin plots of relative mean log-scores (i.e. mean log scores obtained with the NBPSS prior divided by mean log scores of the oracle model) in the zero-inﬂated Poisson model. The log-scores are averaged over 5,000 new test data observations for each simulation replicate. The columns represent the diﬀerent sample sizes n = 1, 000; 2, 000, rows 1 and 3 belong to the non-spatial scenarios (no spatial eﬀect in the data generating model) and rows 2 and 4 to the spatial ones (the data generating model comprises a spatial eﬀect). Covariates are uncorrelated in rows 1 and 2 and correlated in rows 3 and 4. The diﬀerent boxplots within a column/row correspond to diﬀerent combinations of α, c denoted as (α, c) in the labels. (0.05,0.1) (0.1,0.1) (0.2,0.1) (0.05,0.2) (0.1,0.2) (0.2,0.2) (0.05,0.1) (0.1,0.1) (0.2,0.1) (0.05,0.2) (0.1,0.2) (0.2,0.2) Relative mean logarithmic scores n = 500 n = 2000 1.0 0.9 non - spatial ρ = 0 0.8 0.7 0.6 1.0 0.9 ρ = 0 spatial 0.8 0.7 1.0 0.9 0.8 non - spatial ρ = 0.7 0.7 0.6 1.0 0.9 spatial ρ = 0.7 0.8 0.7 0.6 Figure 7: Overall accuracy (measured by the sum of true positives and true negatives divided by the total number of eﬀects) for the Poisson model. The columns represent the diﬀerent sample sizes n = 500; 2, 000, rows 1 and 3 belong to the non-spatial scenarios (no spatial eﬀect in the data generating model) and rows 2 and 4 to the spatial ones (the data generating model comprises a spatial eﬀect). Covariates are uncorrelated in rows 1 and 2 and correlated in rows 3 and 4. Last, the boxplot on the right of each subplot shows the peNMIG prior results, the remaining ones correspond to diﬀerent choices of α and c of the NBPSS prior, denoted as (α, c) in the labels. (0.05,0.1) (0.1,0.1) (0.2,0.1) (0.05,0.2) (0.1,0.2) (0.2,0.2) penNMIG (0.05,0.1) (0.1,0.1) (0.2,0.1) (0.05,0.2) (0.1,0.2) (0.2,0.2) penNMIG Accuracy ρ μ wasting stunting bicycle car cbirthorder7 cbirthorder6 cbirthorder5 cbirthorder4 cbirthorder3 cbirthorder2 cbirthorder1 csex ctwin electricity motorcycle mresidence munemployed radio refrigerator television cage edupartner mage mbmi bicycle car cbirthorder7 cbirthorder6 cbirthorder5 cbirthorder4 cbirthorder3 cbirthorder2 Model cbirthorder1 csex ctwin NBSS prior electricity motorcycle Full model mresidence munemployed radio refrigerator television cage edupartner mage mbmi bicycle car cbirthorder7 cbirthorder6 cbirthorder5 cbirthorder4 cbirthorder3 cbirthorder2 cbirthorder1 csex ctwin electricity motorcycle mresidence munemployed radio refrigerator television cage edupartner mage mbmi −0.6 −0.3 0.0 0.3 0.6−0.6 −0.3 0.0 0.3 0.6 Posterior Mean Figure 8: Nigeria: Posterior means and 95% credible intervals for the linear eﬀects of all model parameters (left column for stunting, right column for wasting, top row for ρ, middle row for σ, bottom row for μ). Since ρ acts on both responses, the eﬀects are only shown in the ﬁrst column. Red corresponds to results for the NBPSS prior and blue to the full model. Variable ρ σ wasting σ stunting μ wasting μ stunting cage mage edupartner mbmi 0.2 0.1 0.0 −0.1 −0.2 0.2 0.1 0.0 −0.1 −0.2 0.15 0.10 0.05 0.00 −0.05 −0.10 0.6 0.3 0.0 −0.3 −0.6 0.5 0.0 −0.5 −1.0 0 20 40 60 0 5 10 15 20 20 30 40 50 20 30 40 Figure 9: Nigeria: Posterior means and pointwise 95% credible intervals for the non-linear eﬀects f of j,k,nonlin cage, mage, mbmi and edupartner (column-wise) for all model parameters (ρ, σ , σ , μ , μ , stunting wasting stunting wasting row-wise). Red corresponds to results for the NBPSS prior and blue to the full model. 35 level1 level2 Figure 10: Nigeria: Posterior means for the spatial eﬀects of all model parameters μ , μ , σ , wasting stunting wasting σ estimated with the NBPSS prior. stunting yoc rating areapl arealiv −0.5 0.0 0.5 −0.5 0.0 0.5 Posterior Mean yoc Model NBPSS Full Model rating −0.2 −0.1 0.0 0.1 −0.2 −0.1 0.0 0.1 Posterior Mean Figure 11: House prices: Estimated posterior mean linear eﬀects with 95% credible intervals of the model parameters μ, σ (level 1, ﬁrst row), η and η 2 (level 2, second row) estimated with the NBPSS prior. dist,µ dist,σ 36 37 level1 : μ level1 : μ level1 : μ level1 : μ yoc rating areapl arealiv 0.6 0.10 1.0 0.1 0.3 0.5 0.05 0.0 0.0 0.0 0.00 −0.1 −0.3 −0.5 −0.05 −0.2 −0.6 1925 1950 1975 2000 2.5 5.0 7.5 500 1000 1500 2000 100 150 200 250 2 2 2 2 level1 : σ level1 : σ level1 : σ level1 : σ yoc rating areapl arealiv 0.2 0.50 1.0 0.1 0.25 0.4 0.5 0.00 0.0 0.0 0.0 −0.25 −0.1 −0.5 −0.4 −0.50 1925 1950 1975 2000 2.5 5.0 7.5 500 1000 1500 2000 100 150 200 250 Figure 12: House prices: Estimated posterior mean non-linear eﬀects with 95% credible intervals of the model parameters μ, σ (level 1) estimated with the NBPSS prior (red) and the full model (blue). level2 : μ level2 : μ yoc rating 0.8 0.1 0.4 0.0 0.0 −0.4 −0.1 1970 1980 1990 2000 2010 4 6 8 10 Model 2 2 NBPSS level2 : σ level2 : σ Full Model yoc rating 0.3 0.2 0.2 0.1 0.0 0.0 −0.1 −0.2 −0.2 −0.3 −0.4 1970 1980 1990 2000 2010 4 6 8 10 Figure 13: House prices: Estimated posterior mean non-linear eﬀects with 95% credible intervals of the model parameters η and η 2 (level 2) estimated with the NBPSS prior (red) and the full model (blue). dist,µ dist,σ

Statistics – arXiv (Cornell University)

**Published: ** Feb 27, 2019

Loading...

You can share this free article with as many people as you like with the url below! We hope you enjoy this feature!

Read and print from thousands of top scholarly journals.

System error. Please try again!

Already have an account? Log in

Bookmark this article. You can see your Bookmarks on your DeepDyve Library.

To save an article, **log in** first, or **sign up** for a DeepDyve account if you don’t already have one.

Copy and paste the desired citation format or use the link below to download a file formatted for EndNote

Access the full text.

Sign up today, get DeepDyve free for 14 days.

All DeepDyve websites use cookies to improve your online experience. They were placed on your computer when you launched this website. You can change your cookie settings through your browser.