Estimation of the predictive power of the model in mixed-effects meta-regression: A simulation study



Several methods are available to estimate the total and residual amount of heterogeneity in meta-analysis, leading to different alternatives when estimating the predictive power in mixed-effects meta-regression models using the formula proposed by Raudenbush (1994, 2009). In this paper, a simulation study was conducted to compare the performance of seven estimators of these parameters under various realistic scenarios in psychology and related fields. Our results suggest that the number of studies (k) exerts the most important influence on the accuracy of the results, and that precise estimates of the heterogeneity variances and the model predictive power can only be expected with at least 20 and 40 studies, respectively. Increases in the average within-study sample size ($\bar{N}$) also improved the results for all estimators. Some differences in accuracy among the estimators were observed, especially under adverse conditions (small k and $\bar{N}$), while the results for the different methods tended to converge under more favourable scenarios.

1. Introduction

Meta-analysis is a form of research synthesis that allows researchers to quantitatively integrate the results from a set of studies on the same topic (Borenstein, Hedges, Higgins & Rothstein, 2009; Cooper, Hedges & Valentine, 2009). Since the outcomes from the individual studies are often expressed in different measurement units, their results are typically converted into a common metric through a standardized effect size index (such as the standardized mean difference). The main objectives in a meta-analysis are to obtain an overall effect size estimate, to assess the heterogeneity among the individual effect size estimates, and to search for moderators that can account for (at least) part of that heterogeneity (Hedges & Olkin, 1985; Sánchez-Meca & Marín-Martínez, 2010).

The results or effect sizes of the individual studies in a meta-analysis usually exhibit some heterogeneity (e.g., Sidik & Jonkman, 2005b; Thompson & Higgins, 2002). This means that, although a set of studies analysing the same phenomenon (e.g., effectiveness of psychological treatments and interventions on a given disorder) are selected, their results are likely to differ to some extent. For that reason, moderator analyses typically constitute a crucial element of a meta-analysis (Lipsey, 2009). In a moderator analysis, the goal is to test the influence of one or more study characteristics (e.g., type and duration of the intervention, severity of the disorder in the sample patients) on the outcome variable (e.g., efficacy of the intervention, assessed through the comparison between a treatment and a control group). Such analyses can be conducted by fitting linear models to the data where the moderators constitute the predictor variables and the effect sizes are employed as the criterion variable (Borenstein et al., 2009). This leads to so-called meta-regression models (Thompson & Higgins, 2002). In a meta-regression model, both continuous and categorical moderators can be included.

When carrying out a meta-analysis, some statistical model must be assumed for the effect size distribution, and the model choice will have an influence on the validity and generalizability of the results from the meta-analysis. Two kinds of statistical models have been employed for the majority of meta-analytic reviews conducted so far, namely the fixed-effects and random-effects models (Hedges & Vevea, 1998; Schmidt, Oh & Hayes, 2009). Nowadays, most researchers agree that the model choice should be made based on the generalizability intended for the results (National Research Council, 1992). Only random-effects models, which include an additional variance component to model the between-studies heterogeneity, allow for generalization to studies different to the ones included in the meta-analysis, which is usually the goal when carrying out such a review. Thus, random-effects models are a suitable option for most meta-analyses (Hedges & Vevea, 1998; Raudenbush, 1994, 2009).

Under a random-effects model, it is assumed that the study outcomes (e.g., treatment efficacy) will fluctuate as a consequence of two sources of variation: the sampling of the participants for each study; and the differential characteristics of the studies (e.g., different conditions of the sample, treatment application, methodology, or context in each individual study). The magnitude of the latter can be analysed through the estimation of the heterogeneity (or between-studies) variance, τ2, which represents the excess variation among the effects over that expected from sampling error alone (Thompson & Sharp, 1999). In contrast to the sampling variances from each effect size, which quantify the random sampling error, τ2 denotes systematic differences due to the influence of characteristics from the individual studies. The identification of some of these characteristics (or moderators) is the main objective of the moderator analyses. Since the moderators are usually included as fixed effects in the model, the addition of a random effect (the effect sizes in the studies) to model the heterogeneity among the studies leads to mixed-effects meta-regression models.

There are several parameters of interest in a meta-regression model. One of these is the model predictive power, denoted by Ρ2 (Ρ denotes the capital Greek letter ‘rho’), which can be defined as the proportion of variance among the effect sizes that can be accounted for by the predictors included in the model. Note that only the variance due to differences among the studies, quantified by τ2, can be explained by the predictors usually included in a mixed-effects meta-regression model. An estimate of the Ρ2 parameter is usually denoted as an R2 value. The interpretation of R2 is identical in ordinary regression and in meta-regression models, in terms of a percentage or proportion of the variability in the outcomes associated with the predictor(s).

When regression models are fitted using ordinary least squares techniques, the $R^2$ index is computed as the quotient between the sum of squares due to the regression and the total sum of squares, that is, $R^2 = SS_{\text{regression}}/SS_{\text{total}}$ (e.g., Pedhazur & Schmelkin, 1991). However, this strategy is not suitable for meta-regression models because part of the total variability, more specifically the sampling error of an observed effect size given the population effect size in that study, cannot by definition be explained by the moderators included in the model (Aloe, Becker & Pigott, 2010; Konstantopoulos & Hedges, 2009; Rodriguez & Maeda, 2006).1 Thus, a different method is typically proposed for obtaining an $R^2$ index in meta-regression models (Raudenbush, 1994), where the total variability is an estimate of the between-studies variance, $\tau^2$, and the variability explained by the predictors in the model is estimated as a part of $\tau^2$ (see equation (3)). This method will be presented, explained, and illustrated in this paper.

In a meta-regression model, an adequate estimate of the magnitude of its predictive power via the $R^2$ index is an essential complement to the statistical significance of the model. The $R^2$ index informs us about the practical significance or the degree of influence of a set of moderators on the heterogeneity of the effect sizes in a meta-analysis (e.g., explaining around 20% or 30% of the heterogeneity). However, as far as we know, no studies have yet systematically evaluated the performance of the $R^2$ index under the conditions of a meta-regression model. Therefore, the purpose of the present study was to assess the performance of the method proposed by Raudenbush to compute an $R^2$ index in meta-analysis, by conducting a Monte Carlo simulation under conditions usually found in real meta-analyses.

The outline of the present paper is as follows. First, mixed-effects meta-regression models are briefly sketched. Second, various alternatives for computing an R2 index according to the proposal of Raudenbush (1994) for meta-analysis are considered. After presenting the methods, results from previous simulation studies that pursued part of the objectives of our study are summarized. The performance of the alternative methods here considered is then illustrated by applying them to an example. Next, a simulation study comparing the various estimators is presented and the results obtained are detailed. Finally, the results are discussed and some conclusions provided, where the degree of accuracy of the different methods for the computation of an R2 index as a measure of the explanatory power of a predictor is assessed as a function of the specific conditions in a meta-analysis (e.g., number of studies, sample size distribution of the studies, effect size distribution, and the true percentage of variance accounted for by the predictor).

2. Mixed-effects meta-regression models

In a meta-analysis with k studies, let y denote a k × 1 vector of independent effect sizes {yi} that represents the results of the studies, and X a k × (p + 1) design matrix of full column rank with p predictor variables, representing some differential characteristics in the studies. Since the predictors are included as fixed effects in the model, assuming a random-effects model for the effect sizes leads to a mixed-effects meta-regression model, which can be expressed by the formula (Raudenbush, 1994)

$$\mathbf{y} = \mathbf{X}\boldsymbol{\beta} + \mathbf{u} + \mathbf{e} \tag{1}$$

where β is a (p + 1) × 1 vector containing the regression coefficients $\{\beta_0, \beta_1, \ldots, \beta_p\}$, u is a k × 1 vector of independent between-studies errors {ui} with distribution $N(0, \tau^2_{res})$, and e is a k × 1 vector of independent within-study errors {ei}, each with distribution $N(0, v_i)$. While vi is the within-study variance (or sampling error) for the ith study, $\tau^2_{res}$ represents the residual heterogeneity (or between-studies) variance, that is, the remaining variability in the true effect sizes not accounted for after adding one or more predictors to the model (Viechtbauer, 2007a).

Note that the mixed-effects model presented in equation (1) is actually an extension of the random-effects model and that the latter can be formulated if X is defined as a k × 1 vector of ones. In this case we would have a model without predictors, where β is a scalar containing the hypermean (mean of the population effects) and u is normally distributed with mean 0 and variance τ2, the latter denoting the total heterogeneity in the true effects. If, moreover, the error term u were suppressed from equation (1), then the model would become a fixed-effect model (which is equivalent to setting τ2 = 0 or assuming that the sampling error is the only source of variability).

The regression coefficients in $\boldsymbol{\beta}$ can be estimated using the weighted least squares formula

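$$\hat{\boldsymbol{\beta}} = (\mathbf{X}'\hat{\mathbf{W}}\mathbf{X})^{-1}\mathbf{X}'\hat{\mathbf{W}}\mathbf{y} \tag{2}$$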

where $\hat{\mathbf{W}}$ is a k × k diagonal matrix with the inverse variances of the effect sizes as elements, that is, $\hat{w}_i = 1/(v_i + \hat{\tau}^2_{res})$ for mixed-effects models. Note that an adequate estimate of both the within-study variance for each study, vi, and the residual between-studies variance, $\tau^2_{res}$, is needed for the estimation of the regression coefficients. For commonly used effect size metrics (e.g., standardized mean differences, correlation coefficients, odds ratios, risk ratios), approximately unbiased estimators are available for vi and the usual practice in meta-analysis is to substitute those estimates and treat them as known values (e.g., Aloe et al., 2010; Hedges & Pigott, 2004; Knapp, Biggerstaff & Hartung, 2006; Konstantopoulos & Hedges, 2009; Viechtbauer, 2007b; for a different approach, see Malzahn, Böhning & Holling, 2000). A more crucial issue is the choice of estimator for $\tau^2_{res}$, and at least seven different estimators have been described in the literature, as detailed in the next section.
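As an illustration of equation (2), the following minimal sketch computes the weighted least squares intercept and slope for the special case of a single moderator, where the matrix expression reduces to closed-form weighted sums. The function name and the toy data are hypothetical, and the residual heterogeneity estimate is simply passed in as a known value.

```python
def wls_meta_regression(y, v, x, tau2_res):
    """Weighted least squares intercept and slope for one moderator.

    Weights are w_i = 1 / (v_i + tau2_res), as in a mixed-effects model.
    This is the one-predictor special case of (X'WX)^{-1} X'Wy.
    """
    w = [1.0 / (vi + tau2_res) for vi in v]
    sw = sum(w)
    swx = sum(wi * xi for wi, xi in zip(w, x))
    swy = sum(wi * yi for wi, yi in zip(w, y))
    swxx = sum(wi * xi * xi for wi, xi in zip(w, x))
    swxy = sum(wi * xi * yi for wi, xi, yi in zip(w, x, y))
    denom = sw * swxx - swx ** 2
    if denom == 0:
        raise ValueError("moderator values must not all be identical")
    b1 = (sw * swxy - swx * swy) / denom
    b0 = (swy - b1 * swx) / sw
    return b0, b1


# Hypothetical toy data: effect sizes lying exactly on y = 0.5 - 0.2 x.
x_toy = [0.0, 0.1, 0.3, 0.6]
y_toy = [0.5 - 0.2 * xi for xi in x_toy]
v_toy = [0.010, 0.020, 0.015, 0.030]
b0, b1 = wls_meta_regression(y_toy, v_toy, x_toy, tau2_res=0.05)
```

Because the toy effect sizes lie exactly on a line, the recovered coefficients do not depend on the weights; with real data, different estimates of the residual heterogeneity would change the weights and hence the coefficients.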

3. Estimating the model predictive power in meta-analysis

A proposal to compute an $R^2$ index in meta-analysis was presented by Raudenbush (1994, 2009). It is based on the re-estimation of the amount of heterogeneity (i.e., between-studies variance) after adding one or more predictors to the model, resulting in the residual heterogeneity or the heterogeneity that cannot be explained by the predictors. The rationale for this index is that the extent to which the moderators can account for the heterogeneity in the true effects will be reflected in the degree by which the residual heterogeneity, $\tau^2_{res}$, will be smaller than the total amount of heterogeneity, $\tau^2$, as a result of including explanatory variables in the model. In practice, the parameter values are replaced by their estimates, $\hat{\tau}^2$ and $\hat{\tau}^2_{res}$, allowing for the computation of the $R^2$ index as (Borenstein et al., 2009)

$$R^2 = 1 - \frac{\hat{\tau}^2_{res}}{\hat{\tau}^2} \tag{3}$$

denoting the proportion of total heterogeneity accounted for by the moderator(s) included in the model.

Several alternatives have been proposed in the literature to estimate the total heterogeneity variance, $\tau^2$, in random-effects models (DerSimonian & Laird, 1986; Morris, 1983; Sánchez-Meca & Marín-Martínez, 2008; Sidik & Jonkman, 2005b, 2007; Viechtbauer, 2005). Most of these estimators have also been extended to mixed-effects models, allowing for estimating the residual heterogeneity variance, $\tau^2_{res}$ (Raudenbush, 1994, 2009; Sidik & Jonkman, 2005a,b). It is important to remark here that, for both parameters, no estimator is expected to provide accurate results unless the number of studies is large enough (e.g., Borenstein et al., 2009; Schulze, 2004).

Seven different estimators of $\tau^2$ and $\tau^2_{res}$ can be computed with the formulae gathered in Table 1. The metafor package for R (Viechtbauer, 2010) directly computes these seven estimators from the values of the effect sizes and their corresponding within-study variances in the studies of the meta-analysis. The Hedges (HE), Hunter–Schmidt (HS), DerSimonian–Laird (DL), and Sidik–Jonkman (SJ) methods are non-iterative estimators, while the maximum likelihood (ML), restricted maximum likelihood (REML), and empirical Bayes (EB) methods require iterative computations. All estimators presented in Table 1 can be succinctly expressed after defining the matrix

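$$\mathbf{P} = \mathbf{W} - \mathbf{W}\mathbf{X}(\mathbf{X}'\mathbf{W}\mathbf{X})^{-1}\mathbf{X}'\mathbf{W} \tag{4}$$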

where W is a diagonal weighting matrix whose elements, wi, can change from one estimator to another. For the iterative estimators, one starts with an initial estimate of $\tau^2_{res}$ (e.g., as obtained with one of the non-iterative estimators) and then iterates through the equation

$$\hat{\tau}^2_{res(new)} = \hat{\tau}^2_{res(current)} + \Delta \tag{5}$$

until convergence, where Δ is given in Table 1 for the ML, REML, and EB estimators. Although all the equations gathered in Table 1 include predictors, they also apply to the random-effects model without predictors by setting p = 0 and with X being a k × 1 vector of ones. In a model without predictors, the equations in Table 1 estimate the total heterogeneity variance, $\tau^2$, while the inclusion of predictors in the same equations leads to the estimation of the residual heterogeneity variance, $\tau^2_{res}$.
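To make the non-iterative case concrete, the sketch below implements the DerSimonian–Laird estimator for the model without predictors (p = 0). It uses the widely known scalar method-of-moments form of that estimator rather than the matrix notation of Table 1, so it should be read as an assumption-laden illustration and not as a transcription of the table. The function name is hypothetical, and the estimate is returned untruncated so that the handling of negative values remains a separate decision.

```python
def dl_tau2(y, v):
    """DerSimonian-Laird estimate of the total heterogeneity variance.

    Model without predictors; weights are w_i = 1 / v_i.
    Returns the raw (possibly negative) method-of-moments estimate.
    """
    k = len(y)
    if k < 2:
        raise ValueError("at least two studies are required")
    w = [1.0 / vi for vi in v]
    sw = sum(w)
    y_bar = sum(wi * yi for wi, yi in zip(w, y)) / sw
    # Weighted sum of squared deviations (the Q statistic).
    q = sum(wi * (yi - y_bar) ** 2 for wi, yi in zip(w, y))
    c = sw - sum(wi ** 2 for wi in w) / sw
    return (q - (k - 1)) / c


# Hypothetical toy data with equal sampling variances.
tau2_dl = dl_tau2([0.0, 2.0, 4.0], [1.0, 1.0, 1.0])
```

With equal sampling variances the estimate reduces to the sample variance of the effect sizes minus the common sampling variance, which gives a quick sanity check for the toy data.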

Table 1. Heterogeneity variance estimators


Hedges (HE) | math formula | wi = 1 | Raudenbush (1994)
Hunter–Schmidt (HS) | math formula | wi = 1/vi | Viechtbauer, López-López, Sánchez-Meca and Marín-Martínez (2012)
DerSimonian–Laird (DL) | math formula | wi = 1/vi | Sidik and Jonkman (2005a)
Sidik–Jonkman (SJ) | math formula | math formula | Sidik and Jonkman (2005b)
Maximum likelihood (ML) | math formula | math formula | Raudenbush (2009)
Restricted maximum likelihood (REML) | math formula | math formula math formula | Raudenbush (2009)
Empirical Bayes (EB) | math formula | Sidik and Jonkman (2005a)

Note. For the HE estimator, math formula; for the SJ estimator, math formula.

A value of zero for $\hat{\tau}^2_{res}$ suggests that all the heterogeneity among the effect sizes is accounted for by the predictors included in the model (Viechtbauer, 2007a). Also, due to random sampling error, the estimators in Table 1 (with the exception of the SJ estimator) can provide a negative estimate, which is a value outside of the parameter space for a variance component. The usual practice is to truncate negative values to zero. When an iterative estimator is employed, a simple strategy to avoid negative estimates is the use of step-halving (Jennrich & Sampson, 1976), which implies multiplying the adjustment value, Δ, by a shrinking factor (first by 1/2, then by 1/4, then by 1/8, and so on) until it becomes small enough for the resulting estimate to stay non-negative.
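The iteration of equation (5) with step-halving can be sketched as follows for the ML estimator in a model without predictors. The adjustment value used here is the standard Fisher scoring step for that model, written in scalar form; it is an assumption of this sketch and not a transcription of Table 1. The function name, the tolerance, and the iteration cap are likewise hypothetical choices.

```python
def ml_tau2(y, v, tau2_start=0.0, tol=1e-10, max_iter=1000):
    """ML estimate of the heterogeneity variance with step-halving.

    Model without predictors; all sampling variances must be positive.
    The estimate is kept non-negative at every iteration.
    """
    tau2 = max(0.0, tau2_start)
    for _ in range(max_iter):
        w = [1.0 / (vi + tau2) for vi in v]
        mu = sum(wi * yi for wi, yi in zip(w, y)) / sum(w)
        num = sum(wi ** 2 * ((yi - mu) ** 2 - vi - tau2)
                  for wi, yi, vi in zip(w, y, v))
        delta = num / sum(wi ** 2 for wi in w)  # adjustment value
        # Step-halving: shrink the adjustment (1/2, 1/4, 1/8, ...)
        # until the updated estimate is no longer negative.
        step = 1.0
        while tau2 + step * delta < 0.0 and step > 1e-12:
            step /= 2.0
        new_tau2 = max(0.0, tau2 + step * delta)
        if abs(new_tau2 - tau2) < tol:
            return new_tau2
        tau2 = new_tau2
    return tau2


# Hypothetical toy data with equal sampling variances.
tau2_ml = ml_tau2([0.0, 2.0, 4.0], [1.0, 1.0, 1.0])
```

With equal sampling variances the ML solution has a closed form (the mean squared deviation of the effect sizes minus the common sampling variance), so the toy call can be checked by hand. For perfectly homogeneous data the adjustment is always negative, and step-halving drives the estimate towards zero without ever leaving the parameter space.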

Both (total and residual) heterogeneity variance estimates employed in equation (3) can be obtained using any of the methods presented in Table 1. As a consequence, there are at least seven different methods for computing the R2 index using this proposal. Aloe et al. (2010) recommended using the same method for both estimates. Indeed, it does not seem sensible to mix two estimates obtained using methods with different theoretical assumptions and, furthermore, only the estimates obtained with the same method are readily comparable.

It is important to note that, due to sampling error, the formula proposed by Raudenbush may require or lead to truncation in several situations. First, $\hat{\tau}^2_{res}$ can be larger than $\hat{\tau}^2$ for a given meta-analytic data set, especially with small samples (small number of studies, small sample sizes, or both), leading to a negative $R^2$ value that is typically truncated to zero in practice (indicating that all of the heterogeneity among the effect sizes remains unaccounted for after including the moderator(s) in the model). Second, a negative value of $\hat{\tau}^2$ truncated to zero leads to division by zero in equation (3), in which case $R^2$ is undefined. It is then common practice to set (or truncate) the value of $R^2$ to 0 (indicating that none of the heterogeneity is accounted for by the moderators, given that there appeared to be none to begin with). Finally, with a positive value of $\hat{\tau}^2$, a negative value of $\hat{\tau}^2_{res}$ truncated to zero will lead to an $R^2$ value of 1 (indicating that all of the heterogeneity is accounted for).
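The three truncation rules just described can be collected in a few lines. The following is a minimal sketch with a hypothetical function name; it takes raw (possibly negative) variance estimates, both assumed to come from the same estimator, and returns an index between 0 and 1.

```python
def r2_index(tau2_total, tau2_res):
    """R-squared of equation (3) with the usual truncation rules."""
    total = max(0.0, tau2_total)   # negative variance estimates -> 0
    resid = max(0.0, tau2_res)
    if total == 0.0:
        # Division by zero in equation (3): R2 is set to 0.
        return 0.0
    r2 = 1.0 - resid / total
    # A residual estimate above the total estimate gives a negative
    # value, truncated to 0; a zero residual estimate gives 1.
    return min(1.0, max(0.0, r2))


r2_negative = r2_index(0.010, 0.020)     # residual exceeds total
r2_undefined = r2_index(-0.001, 0.002)   # total truncated to zero
r2_complete = r2_index(0.010, -0.003)    # residual truncated to zero
r2_example = r2_index(0.0052, 0.0046)    # rounded HS values, Section 6
```

The last call uses the rounded HS estimates reported in the illustrative example of Section 6 and yields roughly 0.115; because the inputs are rounded, this need not match the value in Table 3 exactly.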

Since an estimate of the heterogeneity variance is included in both the random- and mixed-effects model weights (cf. equation (2)), the accuracy of these estimates might affect the result of other statistical analyses, such as the computation of an overall effect size estimate and its confidence interval in a random-effects model or the estimation and testing of the model coefficients in a mixed-effects meta-regression model. However, getting accurate estimates of $\tau^2$ and $\tau^2_{res}$ seems even more crucial for the assessment of the predictive power in meta-regression models since the $R^2$ index in equation (3) requires estimates of both the total and residual amount of heterogeneity.

4. Previous simulation studies

Several simulation studies have already been conducted with the aim of comparing the accuracy of various estimators of the heterogeneity variance in meta-analysis. Some of these studies employed effect size indices for dichotomous measures (e.g., Malzahn et al., 2000; Sidik & Jonkman, 2005b, 2007), while others considered indices for continuous variables (e.g., Van den Noortgate & Onghena, 2003; Viechtbauer, 2005).

In general, a positive bias has been found in the SJ estimator for small to medium parameter values (Sidik & Jonkman, 2005b, 2007), while a negative bias was reported for the HS and ML estimators, as well as for the DL method when estimating large parameter values (Malzahn et al., 2000; Viechtbauer, 2005). The HE method was found to perform appropriately in terms of bias, although it was less efficient than the HS, DL, ML, and REML estimators (Viechtbauer, 2005). Finally, good performance was observed for both the REML and EB estimators when considering bias and efficiency criteria jointly (Sidik & Jonkman, 2007; Van den Noortgate & Onghena, 2003; Viechtbauer, 2005).

All of these simulation studies focused on random-effects models. Therefore, it is not certain to what extent these results would also carry over to mixed-effects meta-regression models. Moreover, these studies do not indicate whether one of the various estimators for $\tau^2$ and $\tau^2_{res}$ would be preferable when computing the $R^2$ index given by equation (3).

5. Objectives and hypotheses of this study

In the present study, all seven heterogeneity variance estimators presented (i.e., the HE, HS, DL, SJ, ML, REML, and EB estimators) were considered and applied to simulated meta-analyses where the standardized mean difference was the effect size index. This simulation compared the accuracy of the methods under different scenarios for the estimation of the total and residual heterogeneity variances as well as of the model predictive power, as defined by Raudenbush (1994).

A first objective was to check whether the patterns reported in previous studies for the heterogeneity variance estimators under random-effects models also apply to mixed-effects models with one predictor. The second objective was to assess the performance of Raudenbush's proposal for estimating the model predictive power in meta-analysis when computing $R^2$ with the various estimators for $\tau^2$ and $\tau^2_{res}$ described earlier.

Regarding our hypotheses, we expected to find results similar to those reported in previous simulation studies for the different estimators of the total heterogeneity variance under random-effects models. In particular, we expected the HS and ML estimators to show a negative bias and the DL method to provide negatively biased estimates for large parameter values. The SJ estimator was expected to show a large positive bias for small to medium parameter values, while the HE method was expected to provide essentially unbiased estimates, although less efficiently than the remaining methods under comparison. According to our hypotheses, the REML and EB estimators were expected to provide the best performance, as found in previous simulation studies. The same trends observed for the different estimators under random-effects models were also expected to be found when estimating the residual heterogeneity variance under mixed-effects meta-regression models with one moderator. Finally, it was expected that the REML and EB estimators would also provide the best performance for the estimation of the predictive power in mixed-effects meta-regression models, computed with equation (3). We also expected that an increase in the average sample size and (especially) the number of studies would lead to more precise results for all estimators.

6. An illustrative example

Else-Quest, Hyde and Linn (2010) published a meta-analysis integrating results from the Programme for International Student Assessment (PISA) in different countries in 2003. This report evaluated 15-year-old students' performance in several subjects. The authors focused on mathematics and, since they were interested in gender differences, effect sizes were defined as standardized mean differences between the marks achieved by boys and girls (with positive values indicating better performance for boys).

One of the coded characteristics for each country was the share of parliamentary seats held by women (given as a proportion), used as a moderator in this example. Twenty countries from different parts of the world were selected to illustrate the methods described earlier. Table 2 shows the effect size, yi, sampling variance, vi, and the moderator value, Parli, for each of the 20 countries.

Table 2. Data from the meta-analysis published by Else-Quest et al. (2010)
Country | yi | vi | Parli
Canada | 0.13 | 0.0002 | 0.24
Italy | 0.19 | 0.0003 | 0.10
South Korea | 0.25 | 0.0008 | 0.06
Turkey | 0.14 | 0.0008 | 0.04

All seven variance estimators compared in this study were employed to estimate the total heterogeneity variance in a random-effects model, as well as the slope, the residual heterogeneity variance, and the proportion of variance accounted for by the moderator in a mixed-effects meta-regression model. Results are presented in Table 3.

Table 3. Estimates in random- and mixed-effects models using data from Else-Quest et al. (2010)
Method math formula math formula math formula R 2

As the slope estimates show, a negative relationship was found with all methods, indicating that a higher percentage of women in parliament was associated with a smaller advantage for boys in the mathematics test. Regarding the total heterogeneity variance, the lowest estimates were obtained using the HS and DL methods (0.0052 and 0.0058, respectively), while the highest estimates were provided by the HE, SJ, and EB methods (0.0077, 0.0076, and 0.0075, respectively). Residual heterogeneity variance estimates also showed some variability, with values ranging between 0.0046 (HS estimator) and 0.0061 (HE and SJ estimators). These differences led to notable variation among the estimates of the model predictive power depending on the estimator used. The $R^2$ values ranged from 6.9% of heterogeneity accounted for by the moderator (DL estimator) to 25.4% (ML estimator).

7. Simulation study

A simulation study was programmed in R using the metafor (Viechtbauer, 2010) package. Meta-analyses of k studies were generated, obtaining the individual scores for each study from two normal populations (see Marín-Martínez & Sánchez-Meca, 2010) and using the standardized mean difference as the effect size index (Marín-Martínez & Sánchez-Meca, 2010; equation (2)).

For each meta-analysis, θ and x were defined as k × 1 vectors containing parameter effects and moderator values, respectively. The predictor x was generated from a standard normal distribution. On the other hand, the θ values were obtained from the expression θ = β0 + β1x + u, where β0 was set to 0.5, which can be regarded as an effect of medium size in some psychological areas (Cohen, 1988); the slope β1 was set as described below, and u is an error term with distribution $N(0, \tau^2_{res})$. Note that if the predictor is dropped from the model, the error term u will have distribution $N(0, \tau^2)$.
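The data-generating step can be sketched as follows. This is only an illustration under stated assumptions: the function names are hypothetical, both populations are given unit variance, the uncorrected standardized mean difference is used (the source may apply a small-sample correction), and Python's standard library stands in for the R code used in the actual simulation.

```python
import math
import random
import statistics


def simulate_true_effects(k, beta0, beta1, tau2_res, rng):
    """Draw moderator values x and true effects theta for k studies."""
    x = [rng.gauss(0.0, 1.0) for _ in range(k)]
    sd_u = math.sqrt(tau2_res)
    theta = [beta0 + beta1 * xi + rng.gauss(0.0, sd_u) for xi in x]
    return x, theta


def simulate_smd(theta_i, n_per_group, rng):
    """Generate raw scores for two groups and return the observed
    standardized mean difference (pooled standard deviation)."""
    exp_scores = [rng.gauss(theta_i, 1.0) for _ in range(n_per_group)]
    ctl_scores = [rng.gauss(0.0, 1.0) for _ in range(n_per_group)]
    var_e = statistics.variance(exp_scores)
    var_c = statistics.variance(ctl_scores)
    # Equal group sizes, so the pooled variance is the simple average.
    s_pooled = math.sqrt((var_e + var_c) / 2.0)
    return (statistics.mean(exp_scores) - statistics.mean(ctl_scores)) / s_pooled


rng = random.Random(2024)  # fixed seed so the sketch is reproducible
x, theta = simulate_true_effects(k=20, beta0=0.5, beta1=0.2,
                                 tau2_res=0.12, rng=rng)
d = [simulate_smd(t, n_per_group=25, rng=rng) for t in theta]
```

Setting both the slope and the residual heterogeneity to zero makes every true effect equal to the intercept, which corresponds to the homogeneous scenario.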

The total heterogeneity variance, $\tau^2$, and the model predictive power, $\mathrm{P}^2$, were manipulated in the simulations. The former was set to values representative of no, low, medium, or large amounts of heterogeneity in psychology and related fields (0, 0.08, 0.16, and 0.32, respectively), similar to the values employed in previous simulation studies (e.g., Knapp & Hartung, 2003; Marín-Martínez & Sánchez-Meca, 2010; Schulze, 2004). For $\mathrm{P}^2$, we used values of 0%, 25%, 50%, or 75% of heterogeneity accounted for, with the aim of reflecting realistic conditions (Thompson & Higgins, 2002). After setting both parameter values, we then assigned a value to β1 by means of the expression $\beta_1 = \sqrt{\tau^2 \mathrm{P}^2}$. Table 4 gathers the different values considered for these parameters, as well as the resulting values for $\beta_1$ and the residual heterogeneity variance parameter, $\tau^2_{res}$, which we computed as $\tau^2_{res} = \tau^2(1 - \mathrm{P}^2)$.2
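The combinations of heterogeneity and predictive power can be enumerated with a short script. The sketch below assumes that the slope follows from the fact that a standard normal predictor has unit variance, so that the explained heterogeneity equals the squared slope, and that the residual variance is the unexplained share of the total. Both relations are reconstructions made for this illustration, and the names are hypothetical.

```python
import math

TAU2_VALUES = [0.08, 0.16, 0.32]       # low, medium, large heterogeneity
RHO2_VALUES = [0.0, 0.25, 0.50, 0.75]  # share of heterogeneity explained


def parameter_grid():
    """Return one dict per combination of tau2 and predictive power."""
    # Homogeneous scenario: no heterogeneity, so nothing to explain.
    grid = [{"tau2": 0.0, "rho2": None, "beta1": 0.0, "tau2_res": 0.0}]
    for tau2 in TAU2_VALUES:
        for rho2 in RHO2_VALUES:
            grid.append({
                "tau2": tau2,
                "rho2": rho2,
                # Var(beta1 * x) = beta1**2 when x has unit variance.
                "beta1": math.sqrt(tau2 * rho2),
                "tau2_res": tau2 * (1.0 - rho2),
            })
    return grid


grid = parameter_grid()
```

The grid has 1 + 3 × 4 = 13 rows, which matches the 13 parameter combinations entering the 13 × 5 × 5 design.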

Table 4. Parameter values considered in this simulation for $\tau^2$ and $\mathrm{P}^2$ (and the resulting values for $\beta_1$ and $\tau^2_{res}$)
math formula
math formula

Other factors manipulated in this simulation were the number of studies in each meta-analysis (k = 5, 10, 20, 40, and 80) and the average sample size of the k studies (math formula, 50, 100, 150, and 200). Note that, for the ith study, Ni = niE + niC, with niE = niC. Vectors of individual sample sizes were generated with a skewness of +1.546, as reported by Sánchez-Meca and Marín-Martínez (1998, p. 317) in a review of meta-analytic syntheses in psychology. A total of 13 × 5 × 5 = 325 conditions were examined. For each condition, 10,000 meta-analyses were simulated, and $\hat{\tau}^2$, $\hat{\tau}^2_{res}$, and $R^2$ were computed for each simulated data set with the seven alternatives presented above.

The performance of the estimators for $\tau^2$, $\tau^2_{res}$, and $\mathrm{P}^2$ was compared using several criteria. Let $\hat{\theta}_j$ be an estimate of one of the parameters of interest obtained with any of the proposed methods in the jth replication of a particular condition. The bias for that estimate and condition was estimated as (Marín-Martínez & Sánchez-Meca, 2010)

$$\text{Bias}(\hat{\theta}) = \frac{1}{10{,}000}\sum_{j=1}^{10{,}000}\hat{\theta}_j - \theta \tag{6}$$

where θ is the value of the parameter of interest (see Table 4). The percentage of bias, or relative bias, was then obtained as

$$\%\text{Bias}(\hat{\theta}) = 100 \times \frac{\text{Bias}(\hat{\theta})}{\theta} \tag{7}$$

Moreover, the mean squared error (MSE) was estimated as

$$\text{MSE}(\hat{\theta}) = \frac{1}{10{,}000}\sum_{j=1}^{10{,}000}(\hat{\theta}_j - \theta)^2 \tag{8}$$

Finally, as described earlier, the computation of the $R^2$ value may require truncation in various cases. When $\tau^2$ and $\tau^2_{res}$ are both actually positive (in which case 0 < $\mathrm{P}^2$ < 1), a large rate of truncated $R^2$ values would reflect undesirable performance of equation (3). Therefore, the proportion of $R^2$ values truncated to 0 or 1 was also examined for the different estimators along the simulated scenarios.
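The four performance criteria can be computed from the replicated estimates of a single condition as in the sketch below. The function names and the toy values are hypothetical, and the MSE is assumed to average over the number of replications.

```python
def bias(estimates, theta):
    """Mean of the estimates minus the true parameter value."""
    return sum(estimates) / len(estimates) - theta


def relative_bias(estimates, theta):
    """Bias expressed as a percentage of the true parameter value."""
    if theta == 0:
        raise ValueError("relative bias is undefined when theta is 0")
    return 100.0 * bias(estimates, theta) / theta


def mse(estimates, theta):
    """Mean squared deviation of the estimates from the true value."""
    return sum((e - theta) ** 2 for e in estimates) / len(estimates)


def truncation_rate(r2_values):
    """Proportion of R2 values that sit exactly at 0 or 1."""
    hits = sum(1 for r in r2_values if r == 0.0 or r == 1.0)
    return hits / len(r2_values)


# Hypothetical toy replications for a condition with tau2 = 0.16.
toy_estimates = [0.10, 0.20, 0.30]
toy_bias = bias(toy_estimates, 0.16)
toy_rel_bias = relative_bias(toy_estimates, 0.16)
toy_mse = mse(toy_estimates, 0.16)
toy_trunc = truncation_rate([0.0, 0.3, 1.0, 0.5])
```

Relative bias is left undefined for the homogeneous scenario, where the true heterogeneity variance is zero and only the absolute bias can be reported.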

8. Results

Due to limitations of space, only part of the results will be presented in this section. The full set of results is available from the corresponding author upon request.

8.1 Total heterogeneity variance

Because any negative estimates of τ2 were truncated to zero, all estimators showed the expected positive bias under the homogeneous scenario (τ2 = 0). On the other hand, for the conditions with τ2 > 0, Table 5 shows the percentage of bias for the total heterogeneity variance estimates provided by each method when setting the number of studies and the average within-study sample size to values that can often be found in meta-analytic reviews in several psychological fields (i.e., k = 20 and math formula).

Table 5. Percentage of bias for the total heterogeneity variance estimators with k = 20 and math formula

The HS and ML estimators provided the most negatively biased estimates, with a deviation of around 16% from the parameter value. The SJ estimator showed the most (positively) biased results, although its performance improved as τ2 increased. The DL and REML estimators performed similarly for small to medium amounts of heterogeneity, with a negative bias slightly over 5%, while the DL estimator yielded more biased results for large values of τ2. The HE estimator showed the best results in terms of bias, with a positive deviation smaller than 3% and better results as the parameter value increased. Finally, the EB estimator performed reasonably well in terms of bias, with a negative deviation from the parameter value around 2%. With smaller values of k, all estimators showed a larger bias. Conversely, the estimates obtained with 40 and 80 studies were more accurate than with k = 20 for the different methods. Finally, higher average sample sizes also led to more accurate results for all estimators.

When comparing the estimators in terms of their relative efficiency, the SJ and HE methods provided the largest MSE values, while the HS and ML estimators showed the most efficient performance. The remaining estimators (DL, REML, and EB) performed similarly as k increased. All methods yielded more accurate estimates with a larger k, with MSE values clearly decreasing with 20 or more studies, and an increase in the average sample size per study also led to better results.

8.2 Residual heterogeneity variance

Trends for the different methods when estimating the residual heterogeneity variance were very similar to those detailed for τ2. Regarding bias, the SJ estimator again showed the most biased results – the positive bias was now larger than for τ2 – unless the parameter value was large enough (math formula and math formula). Moreover, HS and ML methods provided again negatively biased estimates, with a deviation from the parameter value around 25% with 20 studies, larger than that observed for τ2. Finally, the HE, DL, REML, and EB estimators performed similarly as for τ2.

Figure 1 shows the MSE results for the estimators as a function of k and $\bar{N}$. The HS and ML methods performed very similarly, so their results are presented jointly, as are those for the REML and EB estimators. As found in the results for $\tau^2$, the number of studies showed the largest influence on the efficiency of all estimators of $\tau^2_{res}$, and the MSE values decreased most markedly when going from 5 to 10 and from 10 to 20 studies. The average sample size also showed some influence on the efficiency of the estimates, with smaller MSE values obtained as $\bar{N}$ increased. The SJ and HE estimators showed the largest MSE values, while the HS and ML methods provided the most efficient estimates. All estimators except the SJ method performed similarly with k = 80.

Figure 1.

MSE for the residual heterogeneity variance estimators.

8.3 Model predictive power

The R2 values obtained with all estimators were quite variable, but the estimates tended to fall closer to the parameter value as k, the average sample size, τ2, and P2 increased. As an illustration, Table 6 presents the correlations between the estimates obtained with the different methods under two opposite scenarios. Figures below the main diagonal are correlations under adverse conditions (math formula, and P2 = 0.25), while those above the main diagonal are correlations obtained under an optimal scenario (math formula, and P2 = 0.50).

Table 6. Correlations between the R2 values obtained with the different methods, for adverse conditions (lower triangle) and the optimal scenario (upper triangle)

        HE      HS      DL      SJ      ML      REML    EB
HE      -       .9727   .9731   .9934   .9958   .9960   .9991
HS      .7070   -       .9999   .9692   .9869   .9865   .9803
DL      .9368   .7201   -       .9701   .9871   .9868   .9807
SJ      .8227   .5720   .8196   -       .9907   .9915   .9935
ML      .7627   .8395   .7677   .5943   -       .9999   .9988
REML    .9322   .6796   .9516   .8221   .7591   -       .9989

Under adverse conditions, the highest correlations were found between the DL, REML, and EB estimators, with values over .95, while most of the remaining combinations yielded values below .90 and even below .60 (e.g., the correlation between the HS and SJ estimators). Conversely, all estimators performed very similarly under the optimal scenario, with all correlations falling above .96. Table 6 shows, therefore, that the differences between estimators are especially important under the most adverse conditions, while the performance of all methods tends to converge under the optimal scenarios.
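The layout of Table 6, with one scenario below the diagonal and another above it, can be assembled from two ordinary correlation matrices. The sketch below is only an illustration of that layout; the helper names are hypothetical and the inputs are assumed to be lists of R2 estimates, one list per estimator.

```python
def pearson(x, y):
    """Pearson correlation between two equally long sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def corr_matrix(columns):
    """Full correlation matrix for a list of columns (one per estimator)."""
    return [[pearson(a, b) for b in columns] for a in columns]


def combine_triangles(lower, upper):
    """Take entries below the diagonal from `lower`, above it from `upper`.

    Diagonal cells are set to None because they carry no information.
    """
    n = len(lower)
    table = []
    for i in range(n):
        row = []
        for j in range(n):
            if i == j:
                row.append(None)
            elif i > j:
                row.append(lower[i][j])
            else:
                row.append(upper[i][j])
        table.append(row)
    return table
```

With `adverse` and `optimal` holding the per-estimator R2 estimates from each scenario, `combine_triangles(corr_matrix(adverse), corr_matrix(optimal))` yields a table arranged like Table 6.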

Among the different factors manipulated in this simulation, the accuracy of the P2 estimates was mostly influenced by k. This finding is illustrated in Figure 2 using the EB estimator, which provided slightly more accurate results than the other methods, and considering scenarios with P2 = 0.25.

Figure 2.

R2 values using the EB estimator with P2 = 0.25.

The boxplots in Figure 2 reveal substantial variability in the P2 estimates, especially for small values of k (e.g., fewer than 20 studies), represented on the X-axis of each chart. The picture is worrying for a typical meta-regression: virtually any value between 0 and 1 (including the truncated estimates at both ends) can be obtained unless k is large enough (40 or more studies), especially with small to medium sample sizes (math formula) for the individual studies. Results with 5 studies, which are not shown in this figure, were very unstable, showing even more variability than with k = 10. Moreover, an increase in the average sample size per study led to more precise estimates (as can be seen when looking at Figure 1(b)), while increasing the heterogeneity variance parameter, represented with different bar shades, led to a smaller rate of truncation of the R2 values to zero and one.

Several descriptives were computed for the R2 values obtained with the different estimators, considering conditions with k = 40 and setting the other factors to realistic values for a meta-regression with one covariate (math formula, and P2 = 0.25). Table 7 gathers the mean, the median, the 2.5 and 97.5 percentiles, and the rates of values truncated to zero and one for each estimator.
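The descriptive statistics listed above can be sketched as follows. The R2 index is computed from the total and residual heterogeneity variance estimates as in the Raudenbush formula and truncated to the interval [0, 1]. Setting R2 to zero when the total heterogeneity estimate is zero is an assumption made here for illustration, as are the function names and the linear-interpolation rule for percentiles.

```python
import statistics


def r2_truncated(tau2_total, tau2_res):
    """R2 = 1 - tau2_res / tau2_total, truncated to [0, 1]."""
    if tau2_total <= 0.0:
        return 0.0  # assumption: no heterogeneity to explain
    return min(1.0, max(0.0, 1.0 - tau2_res / tau2_total))


def percentile(values, p):
    """Percentile by linear interpolation between order statistics."""
    s = sorted(values)
    pos = (len(s) - 1) * p / 100.0
    lo = int(pos)
    hi = min(lo + 1, len(s) - 1)
    return s[lo] + (s[hi] - s[lo]) * (pos - lo)


def describe(r2_values):
    """Mean, median, 2.5 and 97.5 percentiles, and rates of values at 0 and 1."""
    n = len(r2_values)
    return {
        "mean": statistics.fmean(r2_values),
        "median": statistics.median(r2_values),
        "p2.5": percentile(r2_values, 2.5),
        "p97.5": percentile(r2_values, 97.5),
        "rate_zero": sum(r == 0.0 for r in r2_values) / n,
        "rate_one": sum(r == 1.0 for r in r2_values) / n,
    }
```

Applying `describe` to the R2 values from all replications of one condition gives one column of summary statistics per estimator.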

Table 7. R2 values with k = 40, math formula, τ2 = 0.16, and P2 = 0.25

              HE      HS      DL      SJ      ML      REML    EB
P2.5          0       .0157   0       0       .0166   0       0
P97.5         .6512   .6974   .6458   .3734   .7379   .6781   .6547
Pr(R2 = 0)    .0585   .0003   .0689   .0570   .0062   .0630   .0565
Pr(R2 = 1)    .0021   .0029   .0017   0       .0011   .0010   .0015

Note. P2.5 and P97.5 are the 2.5 and 97.5 percentiles, respectively. Pr(R2 = i) is the rate of values truncated to i, for i = 0, 1.

Regarding the comparison of the different estimators in terms of bias, the HE, DL, REML, and EB estimators performed appropriately, with their mean estimates deviating by less than 0.01 from the parameter value (P2 = 0.25). In contrast, the HS and ML estimators showed a positive bias, while the mean estimate for the SJ estimator showed a large negative bias.³

In addition to the bias that was found for the HS, ML, and SJ estimators, the remaining methods showed some problems as well. The percentiles presented in Table 7 show wide variation among the individual estimates, with the central 95% of the values ranging from 0 to roughly 0.65. Moreover, a non-negligible proportion of the estimates (over 5%) were truncated to zero, especially for the DL and REML estimators. While the rates of truncation to zero were clearly lower for the HS and ML estimators, the bias shown by these two methods advises against their use. Finally, despite the parameter value of P2 = 0.25, the HE, DL, REML, and EB methods still provided some estimates that were truncated to one. On the other hand, since the SJ estimator always yields a positive value, R2 can never reach 1 with this estimator and hence never required truncation at the upper end of the scale, although in turn it showed the largest bias.

Table 8 presents the MSE results with k = 40 and P2 = 0.25 for the different estimators. Only conditions with some heterogeneity among the parameter effects (τ2 > 0) were considered here.

Table 8. MSE values for the P2 estimators with k = 40 and P2 = 0.25

Average sample size   τ2      HE      HS      DL      SJ      ML      REML    EB
math formula          0.08    .0932   .1227   .0997   .0395   .1495   .1187   .1043
                      0.16    .0678   .0807   .0667   .0306   .0984   .0776   .0662
                      0.32    .0377   .0417   .0390   .0234   .0482   .0415   .0363
math formula          0.08    .0641   .0758   .0625   .0292   .0872   .0696   .0634
                      0.16    .0322   .0345   .0323   .0218   .0373   .0335   .0317
                      0.32    .0218   .0225   .0231   .0174   .0231   .0228   .0220
math formula          0.08    .0285   .0300   .0285   .0206   .0312   .0292   .0285
                      0.16    .0202   .0197   .0202   .0165   .0202   .0204   .0203
                      0.32    .0172   .0161   .0171   .0149   .0168   .0173   .0172
math formula          0.08    .0230   .0228   .0229   .0182   .0234   .0232   .0231
                      0.16    .0179   .0168   .0176   .0153   .0175   .0180   .0179
                      0.32    .0164   .0148   .0159   .0146   .0159   .0165   .0164
math formula          0.08    .0199   .0193   .0197   .0165   .0197   .0200   .0199
                      0.16    .0175   .0164   .0171   .0151   .0171   .0176   .0175
                      0.32    .0159   .0142   .0153   .0145   .0153   .0159   .0159

All methods performed more efficiently as the average sample size and τ2 increased. When comparing the different methods, the ML and HS estimators provided the largest MSE values unless the average sample size per study was 150 or 200 participants, while the SJ estimator was the most efficient method, especially under the most adverse conditions. Regarding the influence of k, the weak performance of the method proposed by Raudenbush (1994) with a small number of studies was already reported above (see Figure 2 and Table 7). With k = 20, trends were already similar to those shown in Table 8, although the MSE values were twice as large as for k = 40. With k = 80, MSEs were on average smaller than 0.04 under all of the conditions examined here, although trends for the different estimators remained the same.

9. Discussion

In this study, the performance of seven methods for the estimation of the total and residual heterogeneity variances, as well as the model predictive power, was assessed under a variety of realistic scenarios in applied research. The estimators compared here differed in performance, especially under adverse and intermediate conditions, while all methods provided similar and accurate estimates of the parameters of interest under the most favourable conditions (e.g., large number of studies and large number of participants per study).

Regarding the results for the total heterogeneity variance, the patterns found in this simulation are comparable to the ones reported by Viechtbauer (2005). The DL, REML, and EB estimators performed reasonably well in terms of bias and efficiency, although the DL method yielded negatively biased estimates for large parameter values, as was found in previous simulations (Malzahn et al., 2000; Sidik & Jonkman, 2005b, 2007; Viechtbauer, 2005). The HE estimator showed essentially unbiased results (the slight positive bias observed in Table 5 can be regarded as a consequence of truncating the negative estimates to zero) but large MSE values, while the HS and ML methods performed very efficiently but with a negative bias. Finally, the SJ method showed a large positive bias for small parameter values, as has been previously described (Sidik & Jonkman, 2005b), and the largest MSE values. The performance of the various estimators remained very similar after the inclusion of a moderator.

Regarding the estimation of the predictive power in meta-regression models with one predictor, no estimator performed accurately with fewer than 40 studies. Again, the HS, ML, and SJ estimators yielded the most biased estimates. The remaining estimators performed more precisely, although their estimates still showed wide variation even with a moderate to large k, including values truncated to zero and one, as shown in Table 7. Despite its large MSE for τ2 and the residual heterogeneity variance, the SJ estimator was surprisingly efficient when estimating P2, while the HS and ML methods now provided the largest MSE values.

Out of the different factors manipulated in this simulation, our results suggest that the number of studies exerts an important influence on the accuracy of the results, and that precise estimates of the heterogeneity variances and the model predictive power can only be expected with at least 20 and 40 studies, respectively. An increase in the average sample size also improved the results for all estimators. The critical influence of k on the accuracy of the heterogeneity variance estimators has already been discussed by several authors both in the context of random-effects models (e.g., Borenstein et al., 2009; Schulze, 2004) and mixed-effects models (Thompson & Higgins, 2002). The fact that results were more accurate as k and the average sample size increased is in agreement with large-sample theory, which underlies the statistical models and methods in meta-analysis (Hedges, 2009). Moreover, as shown in Figure 2 and Table 8, the P2 estimators performed more efficiently as the total heterogeneity variance increased. One explanation is that, when estimating τ2, a small parameter value will more often lead to negative estimates requiring truncation, and this will in turn lead to truncated R2 values.
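The link between a small parameter value and negative variance estimates can be checked with a short sketch. It uses the untruncated HE moment estimator under an assumed toy design (no moderators, a common known sampling variance `v`, k = 20), which is not the design of this study; the function names and default values are hypothetical.

```python
import random


def he_tau2_raw(y, v):
    """Hedges (HE) moment estimate of tau^2 before truncation at zero."""
    k = len(y)
    ybar = sum(y) / k
    s2 = sum((yi - ybar) ** 2 for yi in y) / (k - 1)  # unweighted variance of effects
    return s2 - sum(v) / k                            # minus mean sampling variance


def negative_rate(tau2, k=20, v=0.1, reps=4000, seed=7):
    """Share of replications in which the raw estimate falls below zero."""
    rng = random.Random(seed)
    sd = (tau2 + v) ** 0.5
    negatives = 0
    for _ in range(reps):
        y = [rng.gauss(0.0, sd) for _ in range(k)]
        if he_tau2_raw(y, [v] * k) < 0.0:
            negatives += 1
    return negatives / reps
```

Comparing `negative_rate(0.08)` with `negative_rate(0.32)` shows how much more often truncation is needed when the true heterogeneity variance is small. Each estimate truncated to zero then pushes R2 to a boundary value: one if it is the residual variance estimate, and an undefined ratio if it is the total variance estimate.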

10. Conclusion

When a meta-analysis is carried out, some variability is usually found among the effect sizes from the individual studies. The part of that variability due to systematic differences among studies can be quantified by estimating the heterogeneity (or between-studies) variance, τ2. Moreover, if the results from the studies are not homogeneous, the meta-analyst may be interested in the identification of one or more study characteristics that can explain part of the variability among the results. This goal can be addressed through meta-regression analyses, which are typically conducted under a mixed-effects model. Two parameters of interest in a mixed-effects meta-regression model are the residual heterogeneity variance after including one or more moderators, math formula, and the predictive power of the moderator(s) included in the model, P2.

In the present simulation study, the seven estimation methods available for τ2 and the residual heterogeneity variance performed differently across the simulated conditions. For a small number of studies (k < 20), no estimator performed accurately. When the number of studies was moderate (20–40 studies), the REML and EB methods yielded the most accurate results when considering bias and efficiency criteria jointly. Finally, with 80 studies, all methods converged and showed similar (and accurate) results. Increasing the average sample size per study also led to more accurate results. These results are of interest not only for the accurate estimation of the heterogeneity variances, but also for the computation of an R2 index in meta-analysis, which can be obtained by comparing the estimates of the total and residual heterogeneity variances (Raudenbush, 1994).

The results obtained in this simulation study suggest that about 40 studies are required to get accurate estimates of P2 in mixed-effects meta-regression models, so R2 values from meta-regression models fitted with a smaller number of studies should be interpreted cautiously (Thompson, 1994). Among the estimators compared here, the REML, DL, and EB methods showed the most accurate results across the different scenarios and criteria considered. Although the present study focused on standardized mean differences, it is likely that our findings can be generalized to meta-analyses with other effect size measures that are (at least approximately) normally distributed. However, conclusions from this simulation are restricted to the scenarios considered here, so further simulation studies are needed to cover conditions different from the ones included in the present study.


  1. An exception to this is when meta-analysing the raw data from a set of individual studies, in which case within-study variability can be accounted for. For more details on so-called individual participant data meta-analyses, see, for example, Cooper and Patall (2009).

  2. From θi = β1Xi + ui, the total amount of heterogeneity in the true effect sizes, τ2, can easily be computed with math formula, as Xi and ui are independent and normally distributed with mean zero and variances 1 and math formula, respectively. This leads to the expression math formula.

  3. Since the negative bias for the HS and ML estimators and the positive bias for the SJ estimator tended to be larger for the residual heterogeneity variance than for τ2, the bias for these three methods was reversed when estimating P2.
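
The variance decomposition described in footnote 2 can be written out as follows. This is a sketch under assumed notation: the residual heterogeneity variance is written as \tau^2_{\mathrm{res}}, and P^2 is taken to be the population counterpart of the R2 index, so the last line is an inference from the text rather than a quoted expression.

```latex
\begin{align*}
\theta_i &= \beta_1 X_i + u_i, \qquad X_i \sim N(0, 1), \quad u_i \sim N(0, \tau^2_{\mathrm{res}}), \\
\tau^2 &= \operatorname{Var}(\theta_i)
        = \beta_1^2 \operatorname{Var}(X_i) + \operatorname{Var}(u_i)
        = \beta_1^2 + \tau^2_{\mathrm{res}}, \\
P^2 &= 1 - \frac{\tau^2_{\mathrm{res}}}{\tau^2}
     = \frac{\beta_1^2}{\beta_1^2 + \tau^2_{\mathrm{res}}}.
\end{align*}
```

The second line uses the independence of X_i and u_i, so no covariance term appears.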


This research was supported by a grant from the Fundación Séneca, Region of Murcia, Spain, and by the Ministerio de Economía y Competitividad and FEDER funds from the Spanish Government, Project No. PSI2012-31399.