## Beyond Significance Testing (III)

There are many ways to misinterpret significance tests, and this book spends quite a bit of time and effort on these kinds of issues. I decided to include in this post quite a few quotes from chapter 4 of the book, which deals with these topics in some detail. I also included some notes on effect sizes.

…

“[*P*] < .05 means that the likelihood of the data or results even more extreme given random sampling under the null hypothesis is < .05, assuming that all distributional requirements of the test statistic are satisfied and there are no other sources of error variance. […] the **odds-against-chance fallacy** […] [is] the false belief that *p* indicates the probability that a result happened by sampling error; thus, p < .05 says that there is less than a 5% likelihood that a particular finding is due to chance. There is a related misconception I call the **filter myth**, which says that *p* values sort results into two categories, those that are a result of “chance” (H_{0} not rejected) and others that are due to “real” effects (H_{0} rejected). These beliefs are wrong […] When *p* is calculated, it is already assumed that H_{0} is true, so the probability that sampling error is the only explanation is already taken to be 1.00. It is thus illogical to view *p* as measuring the likelihood of sampling error. […] There is no such thing as a statistical technique that determines the probability that various causal factors, including sampling error, acted on a particular result.”
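
To make the point about *p* being computed under H_{0} concrete, here is a minimal sketch of my own (not from the book). It assumes a made-up example of 15 heads in 20 coin flips and computes an exact two-sided binomial *p* value. The function name is hypothetical. Note that the fair-coin assumption is built into every term of the sum, which is why the result cannot also be the probability that "chance" produced the data.

```python
from math import comb


def two_sided_binomial_p(heads: int, flips: int) -> float:
    """P(a count at least as far from flips/2 as observed | coin is fair)."""
    expected = flips / 2
    observed_distance = abs(heads - expected)
    favourable = 0
    for k in range(flips + 1):
        # Under H0 (p = 0.5) every sequence of flips is equally likely,
        # so each count k has weight comb(flips, k) out of 2**flips.
        if abs(k - expected) >= observed_distance:
            favourable += comb(flips, k)
    return favourable / 2 ** flips


# Hypothetical data: 15 heads in 20 flips.
print(round(two_sided_binomial_p(15, 20), 4))  # 0.0414
```

The value is a statement about data given H_{0}, nothing more; nothing in the calculation refers to the probability that H_{0} itself is true.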

“Most psychology students and professors may endorse the **local Type I error fallacy** [which is] the mistaken belief that p < .05 given α = .05 means that the likelihood that the decision just taken to reject H_{0} is a Type I error is less than 5%. […] *p* values from statistical tests are conditional probabilities of data, so they do not apply to any specific decision to reject H_{0}. This is because any particular decision to do so is either right or wrong, so no probability is associated with it (other than 0 or 1.0). Only with sufficient replication could one determine whether a decision to reject H_{0} in a particular study was correct. […] the **valid research hypothesis fallacy** […] refers to the false belief that the probability that H_{1} is true is > .95, given p < .05. The complement of *p* is a probability, but 1 – p is just the probability of getting a result even less extreme under H_{0} than the one actually found. This fallacy is endorsed by most psychology students and professors”.

“[S]everal different false conclusions may be reached after deciding to reject or fail to reject H_{0}. […] the **magnitude fallacy** is the false belief that low *p* values indicate large effects. […] *p* values are confounded measures of effect size and sample size […]. Thus, effects of trivial magnitude need only a large enough sample to be statistically significant. […] the **zero fallacy** […] is the mistaken belief that the failure to reject a nil hypothesis means that the population effect size is zero. Maybe it is, but you cannot tell based on a result in one sample, especially if power is low. […] The **equivalence fallacy** occurs when the failure to reject H_{0}: µ1 = µ2 is interpreted as saying that the populations are equivalent. This is wrong because even if µ1 = µ2, distributions can differ in other ways, such as variability or distribution shape.”
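
The confounding of effect size and sample size is easy to see numerically. The sketch below is my own illustration, not the book's. It assumes a two-group z test with equal group sizes and a known common standard deviation, and it holds the observed standardized mean difference fixed at a hypothetical, trivially small 0.05 while only the sample size changes.

```python
from math import sqrt
from statistics import NormalDist


def two_sided_p(d: float, n_per_group: int) -> float:
    """Two-sided p for a two-group z test with observed standardized difference d.

    Assumes equal group sizes and a known common standard deviation,
    so the test statistic is z = d * sqrt(n / 2).
    """
    z = d * sqrt(n_per_group / 2)
    return 2 * (1 - NormalDist().cdf(abs(z)))


# Same tiny effect (d = 0.05, a made-up value), increasingly large samples.
for n in (50, 500, 5000, 50000):
    print(f"n per group = {n:>6}: p = {two_sided_p(0.05, n):.4g}")
```

With 50 cases per group the result is nowhere near significant, while with 5,000 per group it is significant at the .05 level, although the effect is exactly the same size.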

“[T]he **reification fallacy** is the faulty belief that failure to replicate a result is the failure to make the same decision about H_{0} across studies […]. In this view, a result is not considered replicated if H_{0} is rejected in the first study but not in the second study. This sophism ignores sample size, effect size, and power across different studies. […] The **sanctification fallacy** refers to dichotomous thinking about continuous p values. […] Differences between results that are “significant” versus “not significant” by close margins, such as p = .03 versus p = .07 when α = .05, are themselves often not statistically significant. That is, relatively large changes in *p* can correspond to small, nonsignificant changes in the underlying variable (Gelman & Stern, 2006). […] Classical parametric statistical tests are not robust against outliers or violations of distributional assumptions, especially in small, unrepresentative samples. But many researchers believe just the opposite, which is the **robustness fallacy**. […] most researchers do not provide evidence about whether distributional or other assumptions are met”.
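
The p = .03 versus p = .07 remark can be checked with a little arithmetic. This is a rough sketch of my own under strong simplifying assumptions: two independent estimates with the same standard error, effects in the same direction, and normal test statistics. The function names are hypothetical.

```python
from math import sqrt
from statistics import NormalDist

STD_NORMAL = NormalDist()


def z_from_two_sided_p(p: float) -> float:
    """Absolute z statistic that corresponds to a two-sided p value."""
    return STD_NORMAL.inv_cdf(1 - p / 2)


def p_for_difference(p1: float, p2: float) -> float:
    """Two-sided p for the difference between two independent estimates.

    Assumes both estimates share one standard error, so the standard error
    of their difference is sqrt(2) times larger.
    """
    z_diff = (z_from_two_sided_p(p1) - z_from_two_sided_p(p2)) / sqrt(2)
    return 2 * (1 - STD_NORMAL.cdf(abs(z_diff)))


print(round(p_for_difference(0.03, 0.07), 2))
```

Under these assumptions the *p* value for the difference between the two results is far above .05, so the "significant" and "not significant" studies do not differ significantly from each other.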

“Many [of the above] fallacies involve wishful thinking about things that researchers really want to know. These include the probability that H_{0} or H_{1} is true, the likelihood of replication, and the chance that a particular decision to reject H_{0} is wrong. Alas, statistical tests tell us only the conditional probability of the data. […] But there is [however] a method that can tell us what we want to know. It is not a statistical technique; rather, it is good, old-fashioned replication, which is also the best way to deal with the problem of sampling error. […] Statistical significance provides even in the best case nothing more than low-level support for the existence of an effect, relation, or difference. That best case occurs when researchers estimate a priori power, specify the correct construct definitions and operationalizations, work with random or at least representative samples, analyze highly reliable scores in distributions that respect test assumptions, control other major sources of imprecision besides sampling error, and test plausible null hypotheses. In this idyllic scenario, *p* values from statistical tests may be reasonably accurate and potentially meaningful, if they are not misinterpreted. […] The capability of significance tests to address the dichotomous question of whether effects, relations, or differences are greater than expected levels of sampling error may be useful in some new research areas. Due to the many limitations of statistical tests, this period of usefulness should be brief. Given evidence that an effect exists, the next steps should involve estimation of its magnitude and evaluation of its substantive significance, both of which are beyond what significance testing can tell us. […] It should be a hallmark of a maturing research area that significance testing is not the primary inference method.”

“[An] **effect size** [is] a quantitative reflection of the magnitude of some phenomenon used for the sake of addressing a specific research question. In this sense, an effect size is a statistic (in samples) or parameter (in populations) with a purpose, that of quantifying a phenomenon of interest. More specific definitions may depend on study design. […] **cause size** refers to the independent variable and specifically to the amount of change in it that produces a given effect on the dependent variable. A related idea is that of **causal efficacy**, or the ratio of effect size to the size of its cause. The greater the causal efficacy, the more that a given change on an independent variable results in proportionally bigger changes on the dependent variable. The idea of cause size is most relevant when the factor is experimental and its levels are quantitative. […] An **effect size measure** […] is a named expression that maps data, statistics, or parameters onto a quantity that represents the magnitude of the phenomenon of interest. This expression connects dimensions or generalized units that are abstractions of variables of interest with a specific operationalization of those units.”

“A good effect size measure has the [following properties:] […] 1. Its scale (metric) should be appropriate for the research question. […] 2. It should be independent of sample size. […] 3. As a point estimate, an effect size should have good statistical properties; that is, it should be unbiased, consistent […], and efficient […]. 4. The effect size [should be] reported with a confidence interval. […] Not all effect size measures […] have all the properties just listed. But it is possible to report multiple effect sizes that address the same question in order to improve the communication of the results.”

“Examples of outcomes with meaningful metrics include salaries in dollars and post-treatment survival time in years. Means or contrasts for variables with meaningful units are **unstandardized effect sizes** that can be directly interpreted. […] In medical research, physical measurements with meaningful metrics are often available. […] But in psychological research there are typically no “natural” units for abstract, nonphysical constructs such as intelligence, scholastic achievement, or self-concept. […] Therefore, metrics in psychological research are often arbitrary instead of meaningful. An example is the total score for a set of true-false items. Because responses can be coded with any two different numbers, the total is arbitrary. Standard scores such as percentiles and normal deviates are arbitrary, too […] **Standardized effect sizes** can be computed for results expressed in arbitrary metrics. Such effect sizes can also be directly compared across studies where outcomes have different scales. This is because standardized effect sizes are based on units that have a common meaning regardless of the original metric.”
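
As a small illustration of why standardization helps with arbitrary metrics, the sketch below (mine, with made-up scores) uses one common standardized effect size, the mean difference divided by the pooled standard deviation. The passage above does not name a specific measure, so that choice is my assumption. Recoding the scores onto a different arbitrary scale changes the raw mean difference but leaves the standardized one untouched.

```python
from math import sqrt
from statistics import mean, variance


def standardized_mean_difference(a: list[float], b: list[float]) -> float:
    """Mean difference divided by the pooled within-group standard deviation."""
    n_a, n_b = len(a), len(b)
    pooled_var = ((n_a - 1) * variance(a) + (n_b - 1) * variance(b)) / (n_a + n_b - 2)
    return (mean(a) - mean(b)) / sqrt(pooled_var)


def recode(scores: list[float]) -> list[float]:
    """Move scores to a different arbitrary metric (linear transformation)."""
    return [10 * x + 50 for x in scores]


# Hypothetical test scores for two groups.
treated = [12, 15, 14, 16, 13]
control = [10, 12, 11, 13, 9]

print("raw difference:       ", mean(treated) - mean(control))
print("raw after recoding:   ", mean(recode(treated)) - mean(recode(control)))
print("standardized:         ", standardized_mean_difference(treated, control))
print("standardized, recoded:", standardized_mean_difference(recode(treated), recode(control)))
```

The raw difference grows tenfold after recoding, while the standardized difference is the same in both metrics.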

“1. It is better to report unstandardized effect sizes for outcomes with meaningful metrics. This is because the original scale is lost when results are standardized. 2. Unstandardized effect sizes are best for comparing results across different samples measured on the same outcomes. […] 3. Standardized effect sizes are better for comparing conceptually similar results based on different units of measure. […] 4. Standardized effect sizes are affected by the corresponding unstandardized effect sizes plus characteristics of the study, including its design […], whether factors are fixed or random, the extent of error variance, and sample base rates. This means that standardized effect sizes are less directly comparable over studies that differ in their designs or samples. […] 5. There is no such thing as **T-shirt effect sizes** (Lenth, 2006–2009) that classify standardized effect sizes as “small,” “medium,” or “large” and apply over all research areas. This is because what is considered a large effect in one area may be seen as small or trivial in another. […] 6. There is usually no way to directly translate standardized effect sizes into implications for substantive significance. […] It is standardized effect sizes from sets of related studies that are analyzed in most meta-analyses.”