Identifying influential observations in Bayesian models by using Markov chain Monte Carlo

In statistical modelling, it is often important to know how much parameter estimates are influenced by particular observations. An attractive approach is to re-estimate the parameters with each observation deleted in turn, but this is computationally demanding when fitting models by using Markov chain Monte Carlo (MCMC), as obtaining complete sample estimates is often in itself a very time-consuming task. Here we propose two efficient ways to approximate the case-deleted estimates by using output from MCMC estimation. Our first proposal, which directly approximates the usual influence statistics in maximum likelihood analyses of generalised linear models (GLMs), is easy to implement and avoids any further evaluation of the likelihood. Hence, unlike the existing alternatives, it does not become more computationally intensive as the model complexity increases. Our second proposal, which utilises model perturbations, also has this advantage and does not require the form of the GLM to be specified. We show how our two proposed methods are related and evaluate them against the existing method of importance sampling and case deletion in a logistic regression analysis with missing covariates. We also provide practical advice for those implementing our procedures, so that they may be used in many situations where MCMC is used to fit statistical models. Copyright © 2011 John Wiley & Sons, Ltd.


Introduction
Measures of leverage, influence and residuals are well-established tools in data analysis. Influence refers to how sensitive inferences are to the presence of particular observations and has proved its usefulness in maximum likelihood and least squares analyses.
It is therefore of interest to explore efficient algorithms for measuring influence in Bayesian analyses. Our interest in this was motivated by examples such as the sudden unexpected death in infancy (SUDI) analysis described in detail in Section 2. Here we wish to fit a standard model and examine the usual diagnostics such as influence and residuals, but some observations have missing covariates. A natural way to incorporate these observations into the analysis is to posit a model for the missing data and use maximum likelihood, but this is computationally intensive in all but the simplest of situations. In particular, this type of approach becomes increasingly difficult as the dimension of the integrals with respect to the missing data and the parameter space over which the maximisation is performed become large.
In order to overcome these difficulties, Markov chain Monte Carlo (MCMC) [1] can be used. MCMC is a powerful but computationally intensive method for fitting Bayesian models. It has the nice feature of yielding a random sample from the posterior distribution of the parameters of interest, from which means and credible intervals can be readily calculated. Although relatively simple models present few difficulties, the iteration times for complex and hence more realistic models may be long. If the mixing is poor, many iterations can be required to obtain estimates to within an appropriate degree of Monte Carlo error. Numerical estimation of influence by directly removing observations is therefore extremely tedious at best. 33 are cases. This was analysed using logistic regression with key matching variables as covariates. WinBUGS [8], via the R package BRugs, was used to perform all MCMCs.
We found four variables to be especially good predictors of death in this sample by using a logistic model: Income Deprivation Affecting Children Index (DEP; a low score indicates high deprivation), mother's age in years (MAGE), birthweight in grams (BWT) and the number of previous terminations and miscarriages (TERMIN). All covariates were centred, and continuous variables were standardised, so we fit the model logit.p.death// Dˇ1 Cˇ2.DEP 9500/=9000 Cˇ3.MAGE 29/=6C (1) 4 .BWT 3000/=700 Cˇ5.TERMIN 0:9/: From a standard complete case maximum likelihood analysis, we obtain the following (standard errors in parentheses):  We place uniform priors on all regression parameters and a Gamma (0.001,0.001) prior on the precision of the variance in the linear model for MAGE. We used a burn in of 10 3 iterations and a further 10 5 iterations to make inferences. This MCMC took around 10 min on a Windows Terminals Server and provided Ǒ 1 D 2:62 .0:54/, Ǒ 2 D 1:10 .0:42/, Ǒ 3 D 0:54 .0:28/, Ǒ 4 D 1:09 .0:33/ and Ǒ 5 D 1:91 .0:60/, where the estimates are given by the simulated sample means. This analysis adds further weight to the conclusion that the four variables considered are good predictors of death. We also obtained the correlation matrix of these estimates and therefore Var. Ǒ /. We use the notation Ǒ for both Bayesian and maximum likelihood estimates because we assume that they are in good numerical agreement. In particular, we use the hat notation to denote the posterior mean in all Bayesian analyses that follow, where we obtain these quantities as the sample mean of the simulated values from the MCMC. This analysis has successfully incorporated all data but makes it difficult to assess issues such as influence. With the additional assumptions made about the missing data, coupled with the fact that there is a large amount of variation in the observed covariates, there is the natural concern that a few unusual observations might be driving the inferences. We now turn our attention to methods for deriving influences using MCMC output and begin by considering the simpler case where there is just a single GLM and no missing data.

Case-deleted estimates for generalised linear models
We condition inference on X , an n by p matrix of explanatory variables. The observations, denoted by y, are an n by 1 vector of conditionally independent responses. We assume for now that the data are complete but return to the possibility of missing data in Sections 6 and 7. Denote i D EOEY i jX i , where i depends on the ith row of X , X T i , through the linear predictor Á i D g. i / D X T iˇ, where g. / denotes the link function andˇis the p by 1 vector of regression parameters. We assume that Y i is a member of the exponential family with variance v i and dispersion parameter . Fitting the model in a Bayesian framework, and assuming that the likelihood dominates the prior, implies that the posterior means, medians and modes will all be close to the maximum likelihood point estimates, obtained using iterated weighted least squares [10,11], with agreement as the sample size tends to infinity. Williams [7] provides an approximation to the ith case-deleted maximum likelihood estimate, Ǒ .i / , obtained by using the complete sample estimate Ǒ as an initial value and one step of the weighted least squares algorithm: where w i D v 1 i .@ i =@Á i / 2 , W D diag.w i /, r i denotes the residual .y i i / and h i is the ith diagonal element of the 'hat' matrix H D W 1=2 X.X T W X/ 1 X T W 1=2 ; all terms on the right-hand side are evaluated at the complete sample estimates. Let Â i denote the canonical parameter for the regression. The density of y i is exp.
We suggest omitting the term involving h i in (2). The sum of the positive h i equals the number of regression parameters p, and we assume that the sample size n is much greater than p. Hence, although the relative magnitudes of h i play an important role in determining the observations' leverages, they are much less important when obtaining influence statistics using (2); .1 h i / 1 for all i in large samples with no grossly outlying values. With this simplification and substitution of This further approximation is useful as we routinely obtain Ǒ and Var. Ǒ / when fitting the model to the full dataset. In particular, for the SUDI data, we have already obtained Ǒ and Var. Ǒ / as described and reported in Section 2. Any procedure for obtaining the rest of the right-hand side of (3) may be used to provide approximate case-deleted estimates. In particular, we can obtain influence statistics, Ǒ Ǒ .i / , directly from (3). If the canonical link is used, @Â i =@Á i D 1 and (3) simplifies further.

Our first proposal
Assuming a particular GLM, and that the impact of the prior specification is negligible, we propose additionally defining and storing the simulated values of when running the MCMC. The derivatives @Â i =@Á i can be evaluated in terms of Á i and are unity if a canonical link has been adopted. Hence, the ı i are easily computed in terms of the covariates,ˇand . Replacing ı i with its estimate (the mean of the simulated values) from the MCMC in (3), and Ǒ and Var. Ǒ / with theirs, immediately yields an approximate influence for the ith observation. All of the unobserved quantities that we define in this article are evaluated at every iterate of the MCMC, including derivatives, and so all of these vary between MCMC draws, but only the estimates O ı i , Ǒ and Var. Ǒ / are used to compute influence statistics in (3) when using our first proposal. An assessment of the MC error for O ı i is equally important, but just as straightforward, as for Ǒ when obtaining influence statistics in this way.

Obtaining influence statistics using perturbations
Our first proposal avoids any further evaluation of the likelihood and so can be expected to be more computationally efficient than importance sampling but requires the form of the GLM to be known. In this section, we propose a second method based on perturbing the regression parameters. Although we use the theory of GLMs to justify the procedure, ultimately we use just the variance/covariance matrix of the complete sample point estimates and the posterior means of the perturbations.
We will define an n by p random matrix of 'perturbations' and replace the j th regression coefficient in the model, for the ith observation, byˇi j Dˇj C ij . The entries ij are to be given independent (of each other,ˇand ) normal priors with mean 0. Otherwise, the regression is to be fitted using MCMC in the usual way. This essentially introduces a random effect component (over observations) on the regression parameters. By making the prior precisions of the random perturbations extremely large, we obtain approximately the same estimates Ǒ and Var. Ǒ / as in the usual perturbation-free regression model. We use the quantities ij to denote the prior standard deviations of ij .
We next show in the case of a GLM that the posterior distributions of the ij relate to influential outcomes in the model. The intuition is that, if an observation is directly influential for a particular parameter, then the corresponding posterior ij distribution will move further from zero than its less influential counterparts, reflecting the 'pull' or influence that this observation exerts on the fitted model.

The mathematical consequences of introducing the perturbations
Let T i denote the ith row of the perturbation matrix. With the assumption that each of the n observations, conditional on all ij ,ˇand , are independent (equivalent to assuming that they are exchangeable in the original model), Note that we include the prior P .ˇ; / for completeness, but we assume that its role is negligible in practice. We show in the Appendix that, provided the prior perturbation variances (denoted by 2 ij ) are made sufficiently small, the posterior mean of ij is given by X ij denotes the j th entry of X T i , and all quantities are on the right-hand side of (5) are evaluated at their posterior modes.

An approximation
We assume that the magnitude of the posterior mode of s i is small in relation to that of r i . We can check this, provided that the form of the GLM is known, as shown for our example in Section 7. As 2 ij ! 0, s i ! 0, and we can ignore s i . Hence, where ı ij is the j th entry in ı i and is evaluated at full sample estimates in exactly the same way as in Williams' approximation for case-deleted estimates. For models where the variance v i is unbounded, extra care must be taken to ensure that s i is small in relation to the residuals because s i is linear in v i . For example, for datasets with large Poisson counts, very small 2 ij may be required to ensure that we can ignore s i in this way.
We can therefore obtain the term O ı i f@Â i =@Á i g.X i r i = / using MCMC by adding the perturbations and monitoring the simulated i or equivalently by defining the random variables ı ij D ij = 2 ij and monitoring the ı i . With the combination of (6) and (3), the case-deleted estimates are from which we can obtain the influences, Ǒ Ǒ .i / .

Implementation and Monte Carlo error
A practical issue is that, although it is possible to implement this method in a single stage as presented above, this is computationally demanding and frequently results in 'trap errors' in WinBUGS. We suggest a two-stage procedure that avoids these difficulties in practice. In the first stage, the original model is fitted without perturbations in the usual way in order to obtain Ǒ , O and Var. Ǒ /.
In the second stage, to simplify the MCMC algorithms, we suggest constrainingˇand to their estimates from the first stage and then introducing the perturbations. Withˇand fixed in this way, the posterior for the vectors becomes (4) with P .ˇ; / D 1 and P .y i j i ;ˇ; / replaced by P .y i j i ; Ǒ ; O /, and a similar O ı i is obtained. This simplification is made purely for computational convenience because suitable O ı i are evaluated regardless of whether a one-stage or a two-stage procedure is adopted. Now thatˇand are to be held fixed in this second stage, the only observation which is used to update the prior of i , and therefore ı i , is y i . Hence, we can add the i to a single observation (but to all regression coefficients) at a time, providing y i as the only observation, and run n additional analyses. In practice, we found it convenient and computationally efficient to add the perturbations in this manner, so that 2 k;j D 0 for all k ¤ i when obtaining the ith influence statistic. It is then a very simple task to update the vector i en bloc using a Gibbs' sampler, and, because there are no other random variables, we simulate directly from the posterior i jy i and the simulated i , and hence ı i are independent. The resulting MCMC mixes and converges extremely rapidly.
Monte Carlo error is inherent in the influences obtained using equation (6), which is a little more difficult to assess because of the indirect manner in which influence statistics are obtained when using the perturbations. The error in O ı i is approximately normally distributed, centred at the origin with Var. O ı i / .mdiag. 2 i // 1 , where 2 i is the vector of 2 ij associated with observation i, and m is the number of iterations. This is because ı i has a prior variance of .diag. 2 i // 1 , and in the two-stage approach, the simulated ı i are independent as explained above; the posterior variance of ı i is therefore very similar to its prior. Hence, the Monte Carlo error of the influence of the ith observation, obtained as Var. can easily be obtained. Small values of 2 ij , which improve the accuracy of the approximations, also increase Monte Carlo error of the influence statistics, and some kind of tradeoff is needed in practice. If the perturbations are made quite large in this Monte Carlo error/approximation tradeoff, then they do not affect the primary analysis because this is performed in the first 'perturbation free' stage. Large entries of Var. Ǒ / also have an unfortunate implication for the Monte Carlo error, so we suggest centring data before fitting models to reduce the variance of intercept terms.
Although an advantage of this approach is that we merely require the model to be some (unspecified) GLM, the form of the GLM does have implications for how small the perturbations' variances have to be, which in turn has implications for m. In practice, it may be desirable to use a variety of small ij and large m in order to determine if reliable influence statistics have been obtained. Alternatively, and if known, the properties of the GLM in question can be examined and suitable criterion chosen. Despite this, the difficulties associated with choosing appropriate values of ij and m, so that both the approximations are accurate and the MC error is small, are a disadvantage of this method. Our first proposal may therefore be considered preferable in situations where the form of the GLM is known.

A simulation study
In order to assess the accuracy of the two proposed methods, we performed a simulation study. Here we simulated 100 datasets, each with n D 100 observations, by using the linear model Y N.ˇ0 Cˇ1x 1 Cˇ2x 2 ; 1/, whereˇ0 D 0,ˇ1 D 1 andˇ3 D 3. Burn ins of 10 3 iterations were used. A further 10 5 iterations were used when implementing our first proposal, and m D 10 6 iterations were used when obtaining influences using the perturbations to ensure that these had stabilised.
By using normally distributed data, exact frequentist influence statistics [5] can be obtained which provide a 'gold standard' to compare our methods with. Because we used a linear model with a canonical link, our simulation study does not assess the accuracy of the approximations described in the Appendix; rather, it assesses how well our proposals work when these approximations are valid. Flat priors over very wide ranges were used for theˇparameters, and a Gamma (0.001,0.001) prior was used for the precision of the variance.
Perturbations with ij D min.0:05=jX ij j; 1/ were used. These values have been found to ensure as large perturbations as possible are used, without compromising the accuracy of any approximation, when a canonical link is used in either a linear (with variance of around 1 or greater) or logistic regression involving a moderate number of parameters. For a GLM with a canonical link, the approximations described in the Appendix only require that values of X T i i are small and the posterior modes of the s i are small in relation to the residuals r i ; s i D X T i i for a standard linear regression. For example, for p D 6 as in our main example, using jX ij j ij D 0:05 ensures that X T our methods with. Both our proposed methods provided influence statistics that were in good agreement with these standard influence statistics. To give an indication of this, the coefficients of the least squares regression lines of the proposed influence statistics on the standard frequentist influence statistics are given for each parameter in Table I, where the correlations between the proposed and standard influence statistics are also shown. The correlations are given to five decimal places so that they are distinguishable. The gradients of the regression lines are all between 0.95 and 0.99 suggesting that, compared with the exact frequentist influences, the proposed methods have a slight tendency to understate the influence of observations in this simulation study, but the overall picture is that all three methods are in good agreement. This tendency can be explained by the omittance of the .1 h i / terms in (2), so this will lessen in larger samples.

Obtaining influences for models comprising a collection of connected generalised linear models and in situations where there are missing data
The above analysis assumes a single GLM (with complete data), but as explained in the introduction, Bayesian models are commonly defined as a collection of connected GLMs. For example, the model for the SUDI data is such a collection of three GLMs, where the logistic regression of death on all the covariates is of real interest. There are also some missing data, which motivated our methods as we explained in the introduction.
Our methods are, however, immediately applicable to any GLM component within the full model. When implementing our first proposal (Section 3.1), we simply define ı i D f@Â i =@Á i g.X i r i = / for the GLM for which influences are desired and combine these with the resulting regression estimates, Ǒ and Var. Ǒ /, for this same GLM. Neglecting any prior dependence between regression parameters that might have been specified, and if there is no missing data, the parameter estimates for each GLM are independent, and hence, the standard theory described in Section 3 ensures that influence statistics are obtained from our methods.
There is, however, the issue of missing covariates, which is ignored in the above argument. This is in fact of little concern provided the fraction of missing data is not too severe because our procedures average the influences over the posterior distributions of any missing data, and hence, suitable influences are obtained. This way of handling missing data may be considered an advantage of our methods. We may therefore handle missing data by entering them as 'NA' in the usual way when using WinBUGS, for example.
The perturbations are justified as they obtain ı i in a different way and hence are also appropriate. Here we constrain all parameters to their estimated values after fitting the model comprising a collection of connected GLMs in the first stage and then add the perturbations to the GLM in question and monitor the simulated ı i in the same manner as before.

Influence statistics in the sudden unexpected death in infancy analysis
We used the proposed methods, and the alternatives, to estimate the influences in the SUDI analysis. Our first proposal is suitable because the form of the model of interest (the logistic regression for the probability of death) is known, but our second proposal will also be used so that the results can be compared.
There are some missing covariates, but otherwise model (1) is a standard logistic regression. We therefore also monitored s i p i .death/.1 p i .death//X T i i when adding the corresponding perturbation to Table I. Some results from the simulation study: c 1 and m 1 denote the intercepts and gradients of the least squares regression lines of influence statistics from our first proposal (Section 3.1) on the frequentist influence statistics; 1 denotes the correlation between these influence statistics. c 2 , m 2 and 2 denote these same quantities for influences obtained using perturbations. ensure that the role of s i is negligible in (5). The average absolute posterior mean of O s i was just 0.0006, and its maximum was 0.0024. These are small values in relation to the residuals resulting from logistic regression, and we are reassured that approximation (6) is accurate. A burn in of 10 3 iterations was used. A further 10 5 iterations were used to obtain influences using our first proposal, and the influences from the perturbations-based method appeared to stabilise (small MC error) using m D 10 6 . Perturbations with jX ij j ij D 0:05 were initially used, but some numerical difficulties were encountered for the three data points where the mother's age was 29 years; this corresponds to a covariate of zero in the regression as written above and, hence, with the decision to use jX ij j ij D 0:05, an infinite ij . In fact, covariates were centred at exact means rather than the approximate ones given above, but the three very large ij arising from this strategy presented problems. Following the procedure used in the simulation study to avoid these problems, ij was therefore reduced to unity when used in conjunction with the mother's age covariate for these three observations.

Comparison of methods
A gold standard influence analysis was performed by removing each of the 137 observations in turn. Only 20 000 iterations were used to obtain estimates for each case-deleted sample to reduce the time taken (to around 5 h). The resulting case-deleted influences were standardised (divided by the standard error of the corresponding complete sample parameter estimate) and are shown as solid points in Figure 1 for the first 20 observations, where standardised influences forˇ2 andˇ3 are shown in the top two panels and those forˇ4 andˇ5 are shown in the bottom two panels. The 95% intervals describing Monte Carlo uncertainty, from normal approximations, are only slightly wider than the solid points which can thus be taken to be accurate. Hollow circles show the corresponding estimates obtained by importance sampling [3,4]; triangles show these from our first proposal (Section 3.1), and diamonds show influences obtained using perturbations (our second proposal). All three approximations perform least well for the more influential observations but are generally effective in obtaining influences. It is to be expected that large influences are obtained with the least precision, as these provide the biggest challenge to Williams' approximation (which two of the methods are based on) and the most volatile weights for importance sampling. Very similar results were obtained for the remaining 117 observations. Note in particular that all methods successfully detected the influential observation 12 forˇ2. It is not surprising that observation 12 is influential; this has the highest DEP score of the cases and one of the oldest MAGE values. Cook [5] suggests, in the context of linear regression, that if removing an observation results in an estimate at the edge of 50% of the whole sample confidence region, then this 'may be cause for concern'. Observation 12 can therefore be interpreted as being worryingly influential forˇ2 but not excessively so.
Because all the procedures were successful, it is perhaps the additional computational burden required to obtain influences that determines which is to be recommended in practice. Importance sampling doubled the overall simulation time (from around 10 to 20 min). Using our first proposal only required an extra 5 min. Even for this relatively simple example, our first proposal was much faster than importance sampling. This relative efficiency will increase as the model complexity, and hence the difficulties associated with evaluating the observations' contributions to the likelihood, also increases.
The perturbations-based method required around 30 s for each observation, so that around an hour of computing time was needed to obtain influence statistics. Nevertheless, this method has the advantage that computation time does not increase with model complexity in the way it does with importance sampling and may prove useful for models where the precise form of the GLM cannot be specified. Indicative WinBUGS code for performing the analysis is available from the first author on request.

Discussion
This paper has developed novel ways of approximating established influence statistics, as used in classical methods, so that they can be used in the context of Bayesian analyses using MCMC. The two new approaches involve storing additional quantities and then use standard MCMC output to recover Williams' influence statistic. The type of analysis performed provided the motivation for this work, which can be applied much more generally. We have not, however, attempted to answer the much more difficult question of whether or not removing particular observations changes the conclusions qualitatively. Furthermore, we have not asked the even more difficult question of what we should do if we discover that an observation is very influential.
Direct case deletion, although easily implemented, is not a feasible option unless models can be fitted extremely quickly; even for the relatively simple model considered here, with just over a hundred observations, this took several hours. Importance sampling is a generally viable but fairly computationally intensive method for the type of models we have considered but becomes computationally prohibitive as the model becomes more complex and the repeated computation of the likelihood becomes less feasible. Even for our relatively simple example, the computational advantages of our first proposal are apparent. If any of the procedures that avoid direct case deletion highlight extremely influential observations, and an accurate indication of their influence is desired, it is necessary to remove the offending observations and refit the model. This is because all the approximate influences are less accurate for more influential observations. All that is typically needed in practice, however, is the identification of influential observations, and an indication of the extent of this influence and our procedures can be used for these purposes.
We have assumed that the model is a collection of connected single-level GLMs. Our methods might also be considered when the part of the model of interest is either known to be such a GLM or, at the very least, thought to behave similarly to one when using the perturbations. Bayesian Hierarchical models, where unobserved random effects present further issues [12], do not fit directly into our framework, however. Despite this, Hodges' [13] proposals for the assessment of case influence for hierarchical models, in his Section 4.4, bear some similarities to ours. Extensions or variations of our methods might be especially useful in this context and may form the subject of future work.
In particular, because the perturbations have the intuition that influential observations will exert more 'pull', it would be especially interesting to see if these are also useful for these and other types of statistical model. A difficulty in using the perturbations in practice is that we must ensure that all the various approximations used are appropriate. For models where the variance is unbounded, extra care must be taken to ensure that s i is small in relation to the residuals because this is linear in v i . Furthermore, if a canonical link has not been used, we have a further criterion that must be satisfied in terms of the derivatives of the canonical parameter and the linear predictor.
In conclusion, we have proposed two new methods for estimating the influence of observations when fitting models by using MCMC. We have shown how these are related and given practical guidance. Both proposals have been shown, in an example, to give accurate measures of influence. Both proposals avoid any further computation of the likelihood and so can be used in some situations where the complexity of the model renders case deletion and importance sampling unfeasible. and Using independent normal priors for the ij , and noting that all prior distributions are independent, gives The distributional assumptions of the GLM, with the perturbations introduced, gives P .y i j i ;ˇ; / / exp..b.f .X T iˇC X T i i // C y i f .X T iˇC X T i i //= /: The product of (9) and (10) gives the posterior density of i to within a constant of proportionality. Taking the logarithm of the resulting density and differentiating with respect to ij gives where X ij denotes the j th entry of X T i ; f 0 . / D @Â i =@Á i and b 0 . / are the derivatives of functions f . / and b. /. In order to obtain the posterior distribution of ij , a series of approximations are required.

Approximation 1
We use the linear approximation f 0 .X T iˇC X T i i / f 0 .X T iˇ/ C X T i i f 00 .X T iˇ/ and note that this is exact if a canonical link is used and valid as 2 ij ! 0 for all j D 1; ; p. Hence, f 0 .X T iˇC X T i i / @Â i =@Á i C X T i i @ 2 Â i =@Á 2 i where the partial derivatives are evaluated at Á i D X T iˇ. We further assume @Â i =@Á i >> abs.X T i i @ 2 Â i =@Á 2 i /. Again as 2 ij ! 0, this criterion is satisfied, and this is exact if a canonical link is used. Assuming sufficiently small perturbations, we approximate f 0 .X T iˇC X T i i / @Â i =@Á i in (11