## Introduction to Meta Analysis (II)

You can read my first post about the book here. Some parts of the book are fairly technical, so I decided in the post below to skip some chapters in my coverage, simply because I could see no good way to cover the stuff on a wordpress blog (which as already mentioned many times is not ideal for math coverage) without spending a lot more time on that stuff than I wanted to. If you’re a new reader and/or you don’t know what a meta-analysis is, I highly recommend you read my first post about the book before moving on to the coverage below (and/or you can watch this brief video on the topic).

Below I have added some more quotes and observations from the book.

…

“In primary studies we use regression, or multiple regression, to assess the relationship between one or more covariates (moderators) and a dependent variable. Essentially the same approach can be used with meta-analysis, except that the covariates are at the level of the study rather than the level of the subject, and the dependent variable is the effect size in the studies rather than subject scores. We use the term *meta-regression* to refer to these procedures when they are used in a meta-analysis.

The differences that we need to address as we move from primary studies to meta-analysis for regression are similar to those we needed to address as we moved from primary studies to meta-analysis for subgroup analyses. These include the need to assign a weight to each study and the need to select the appropriate model (fixed versus random effects). Also, as was true for subgroup analyses, the *R ^{2} *index, which is used to quantify the proportion of variance explained by the covariates, must be modified for use in meta-analysis.

With these modifications, however, the full arsenal of procedures that fall under the heading of multiple regression becomes available to the meta-analyst. […] As is true in primary studies, where we need an appropriately large ratio of

*subjects*to covariates in order for the analysis be to meaningful, in meta-analysis we need an appropriately large ratio of

*studies*to covariates. Therefore, the use of meta-regression, especially with multiple covariates, is not a recommended option when the number of studies is small.”

“Power depends on the size of the effect and the precision with which we measure the effect. For subgroup analysis this means that power will increase as the difference between (or among) subgroup means increases, and/or the standard error within subgroups decreases. For meta-regression this means that power will increase as the magnitude of the relationship between the covariate and effect size increases, and/or the precision of the estimate increases. In both cases, a key factor driving the precision of the estimate will be the total number of individual subjects across all studies and (for random effects) the total number of studies. […] While there is a general perception that power for testing the main effect is consistently high in meta-analysis, this perception is not correct […] and certainly does not extend to tests of subgroup differences or to meta-regression. […] Statistical power for detecting a difference among subgroups, or for detecting the relationship between a covariate and effect size, is often low [and] failure to obtain a statistically significant difference among subgroups should never be interpreted as evidence that the effect is the same across subgroups. Similarly, failure to obtain a statistically significant effect for a covariate should never be interpreted as evidence that there is no relationship between the covariate and the effect size.”

“When we have effect sizes for more than one outcome (or time-point) within a study, based on the same participants, the information for the different effects is not independent and we need to take account of this in the analysis. […] When we are working with different outcomes at a single point in time, the plausible range of correlations [between outcomes] will depend on the similarity of the outcomes. When we are working with the same outcome at multiple time-points, the plausible range of correlations will depend on such factors as the time elapsed between assessments and the stability of the relative scores over this time period. […] Researchers who do not know the correlation between outcomes sometimes fall back on either of two ‘default’ positions. Some will include both [outcome variables] in the analysis and treat them as independent. Others would use the average of the [variances of the two outcomes]. It is instructive, therefore, to consider the practical impact of these choices. […] In effect, […] researchers who adopt either of these positions as a way of bypassing the need to specify a correlation, are actually adopting a correlation, albeit implicitly. And, the correlation that they adopt falls at either extreme of the possible range (either zero or 1.0). The first approach is almost certain to underestimate the variance and overestimate the precision. The second approach is almost certain to overestimate the variance and underestimate the precision.” [*A good example of a more general point in the context of statistical/mathematical modelling: Sometimes it’s really hard not to make assumptions, and trying to get around such problems by ‘ignoring them’ may sometimes lead to the implicit adoption of assumptions which are highly questionable as well.*]

“Vote counting is the name used to describe the idea of seeing how many studies yielded a significant result, and how many did not. […] narrative reviewers often resort to [vote counting] […] In some cases this process has been formalized, such that one actually counts the number of significant and non-significant p-values and picks the winner. In some variants, the reviewer would look for a clear majority rather than a simple majority. […] One might think that summarizing *p*-values through a vote-counting procedure would yield more accurate decision than any one of the single significance tests being summarized. This is not generally the case, however. In fact, Hedges and Olkin (1980) showed that the power of vote-counting considered as a statistical decision procedure can not only be lower than that of the studies on which it is based, the power of vote counting can tend toward zero as the number of studies increases. […] the idea of vote counting is fundamentally flawed and the variants on this process are equally flawed (and perhaps even more dangerous, since the basic flaw is less obvious when hidden behind a more complicated algorithm or is one step removed from the *p*-value). […] The logic of vote counting says that a significant finding is evidence that an effect exists, while a non-significant finding is evidence that an effect is absent. While the first statement is true, the second is not. While a nonsignificant finding *could* be due to the fact that the true effect is nil, it can also be due simply to low statistical power. Put simply, the *p*-value reported for any study is a function of the observed effect size and the sample size. Even if the observed effect is substantial, the *p*-value will not be significant unless the sample size is adequate. In other words, as most of us learned in our first statistics course, *the absence of a statistically significant effect is not evidence that an effect is absent*.”

“While the term vote counting is associated with narrative reviews it can also be applied to the single study, where a significant p-value is taken as evidence that an effect exists, and a nonsignificant p-value is taken as evidence that an effect does not exist. Numerous surveys in a wide variety of substantive fields have repeatedly documented the ubiquitous nature of this mistake. […] When we are working with a single study and we have a nonsignificant result we don’t have any way of knowing whether or not the effect is real. The nonsignificant *p*-value could reflect either the fact that the true effect is nil *or* the fact that our study had low power. While we caution against accepting the former (that the true effect is nil) we cannot rule it out. By contrast, when we use meta-analysis to synthesize the data from a series of studies we can often identify the true effect. And in many cases (for example if the true effect is substantial and is consistent across studies) we can assert that the nonsignificant *p*-value in the separate studies was due to low power rather than the absence of an effect. […] vote

counting is never a valid approach.”

“The fact that a meta-analysis will often [but not always] have high power is important because […] primary studies often suffer from low power. While researchers are encouraged to design studies with power of at least 80%, this goal is often elusive. Many studies in medicine, psychology, education and an array of other fields have power substantially lower than 80% to detect large effects, and substantially lower than 50% to detect smaller effects that are still important enough to be of theoretical or practical importance. By contrast, a meta-analysis based on multiple studies will have a higher total sample size than any of the separate studies and the increase in power can be substantial. The problem of low power in the primary studies is especially acute when looking for adverse events. The problem here is that studies to test new drugs are *powered* to find a treatment effect for the drug, and do not have adequate power to detect side effects (which have a much lower event rate, and therefore lower power).”

“Assuming a nontrivial effect size, power is primarily a function of the precision […] When we are working with a fixed-effect analysis, precision for the summary effect is always higher than it is for any of the included studies. Under the fixed-effect analysis precision is largely determined by the total sample size […], and it follows the total sample size will be higher across studies than within studies. […] in a random-effects meta-analysis, power depends on within-study error and between-studies variation [*…if you don’t recall the difference between fixed-effects models and random effects models, see the previous post*]. If the effect sizes are reasonably consistent from study to study, and/or if the analysis includes a substantial number of studies, then the second of these will tend to be small, and power will be driven by the cumulative sample size. In this case the meta-analysis will tend to have higher power than any of the included studies. […] However, if the effect size varies substantially from study to study, and the analysis includes only a few studies, then this second aspect will limit the potential power of the meta-analysis. In this case, power could be limited to some low value even if the analysis includes tens of thousands of persons. […] The Cochrane Database of Systematic Reviews is a database of systematic reviews, primarily of randomized trials, for medical interventions in all areas of healthcare, and currently includes over 3000 reviews. In this database, the median number of trials included in a review is six. When a review includes only six studies, power to detect even a moderately large effect, let alone a small one, can be well under 80%. While the median number of studies in a review differs by the field of research, in almost any field we do find some reviews based on a small number of studies, and so we cannot simply assume that power is high. […] Even when power to test the main effect is high, many meta-analyses are not concerned with the main effect at all, but are performed solely to assess the impact of covariates (or moderator variables). […] The question to be addressed is not whether the treatment works, but whether one variant of the treatment is more effective than another variant. The test of a moderator variable in a meta-analysis is akin to the test of an interaction in a primary study, and both suffer from the same factors that tend to decrease power. First, the effect size is actually the difference between the two effect sizes and so is almost invariably smaller than the main effect size. Second, the sample size within groups is (by definition) smaller than the total sample size. Therefore, power for testing the moderator will often be very low (Hedges and Pigott, 2004).”

“It is important to understand that the fixed-effect model and random-effects model address different hypotheses, and that they use different estimates of the variance because they make different assumptions about the nature of the distribution of effects across studies […]. Researchers sometimes remark that power is lower under the random-effects model than for the fixed-effect model. While this statement may be true, it misses the larger point: it is not meaningful to compare power for fixed- and random-effects analyses since the two values of power are not addressing the same question. […] Many meta-analyses include a test of homogeneity, which asks whether or not the between-studies dispersion is more than would be expected by chance. The test of significance is […] based on *Q*, the sum of the squared deviations of each study’s effect size estimate (*Yi*) from the summary effect (*M*), with each deviation weighted by the inverse of that study’s variance. […] Power for this test depends on three factors. The larger the ratio of between-studies to within-studies variance, the larger the number of studies, and the more liberal the criterion for significance, the higher the power.”

“While a meta-analysis will yield a mathematically accurate synthesis of the studies included in the analysis, if these studies are a biased sample of all relevant studies, then the mean effect computed by the meta-analysis will reflect this bias. Several lines of evidence show that studies that report relatively high effect sizes are more likely to be published than studies that report lower effect sizes. Since published studies are more likely to find their way into a meta-analysis, any bias in the literature is likely to be reflected in the meta-analysis as well. This issue is generally known as publication bias. The problem of publication bias is not unique to systematic reviews. It affects the researcher who writes a narrative review and even the clinician who is searching a database for primary papers. […] Other factors that can lead to an upward bias in effect size and are included under the umbrella of publication bias are the following. Language bias (English-language databases and journals are more likely to be searched, which leads to an oversampling of statistically significant studies) […]; availability bias (selective inclusion of studies that are easily accessible to the researcher); cost bias (selective inclusion of studies that are available free or at low cost); familiarity bias (selective inclusion of studies only from one’s own discipline); duplication bias (studies with statistically significant results are more likely to be published more than once […]) and citation bias (whereby studies with statistically significant results are more likely to be cited by others and therefore easier to identify […]). […] If persons performing a systematic review were able to locate studies that had been published in the grey literature (any literature produced in electronic or print format that is not controlled by commercial publishers, such as technical reports and similar sources), then the fact that the studies with higher effects are more likely to be published in the more mainstream publications would not be a problem for meta-analysis. In fact, though, this is not usually the case.

While a systematic review *should* include a thorough search for all relevant studies, the actual amount of grey/unpublished literature included, and the types, varies considerably across meta-analyses.”

“In sum, it is possible that the studies in a meta-analysis may overestimate the true effect size because they are based on a biased sample of the target population of studies. But how do we deal with this concern? The only true test for publication bias is to compare effects in the published studies formally with effects in the unpublished studies. This requires access to the unpublished studies, and if we had that we would no longer be concerned. Nevertheless, the best approach would be for the reviewer to perform a truly comprehensive search of the literature, in hopes of minimizing the bias. In fact, there is evidence that this approach is somewhat effective. Cochrane reviews tend to include more studies and to report a smaller effect size than similar reviews published in medical journals. Serious efforts to find unpublished, and difficult to find studies, typical of Cochrane reviews, may therefore reduce some of the effects of publication bias. Despite the increased resources that are needed to locate and retrieve data from sources such as dissertations, theses, conference papers, government and technical reports and the like, it is generally indefensible to conduct a synthesis that categorically excludes these types of research reports. Potential benefits and costs of grey literature searches must be balanced against each other.”

“Since we cannot be certain that we have avoided bias, researchers have developed methods intended to assess its potential impact on any given meta-analysis. These methods address the following questions:

*Is there evidence of any bias?

*Is it possible that the entire effect is an artifact of bias?

*How much of an impact might the bias have? […]

Methods developed to address publication bias require us to make many assumptions, including the assumption that the pattern of results is due to bias, and that this bias follows a certain model. […] In order to gauge the impact of publication bias we need a model that tells us which studies are likely to be missing. The model that is generally used […] makes the following assumptions: (a) Large studies are likely to be published regardless of statistical significance because these involve large commitments of time and resources. (b) Moderately sized studies are at risk for being lost, but with a moderate sample size even modest effects will be significant, and so only some studies are lost here. (c) Small studies are at greatest risk for being lost. Because of the small sample size, only the largest effects are likely to be significant, with the small and moderate effects likely to be unpublished.

The combined result of these three items is that we expect the bias to increase as the sample size goes down, and the methods described […] are all based on this model. […] [One problem is however that] when there is clear evidence of asymmetry, we cannot assume that this reflects publication bias. The effect size may be larger in small studies because we retrieved a biased sample of the smaller studies, but it is also possible that the effect size really is larger in smaller studies for entirely unrelated reasons. For example, the small studies may have been performed using patients who were quite ill, and therefore more likely to benefit from the drug (as is sometimes the case in early trials of a new compound). Or, the small studies may have been performed with better (or worse) quality control than the larger ones. Sterne et al. (2001) use the term *small-study effect* to describe a pattern where the effect is larger in small studies, and to highlight the fact that the mechanism for this effect is not known.”

“It is almost always important to include an assessment of publication bias in relation to a meta-analysis. It will either assure the reviewer that the results are robust, or alert them that the results are suspect.”

No comments yet.

## Leave a Reply