# Econstudentlog

## Beyond Significance Testing (II)

I have added some more quotes and observations from the book below.

“The least squares estimators M and s2 are not robust against the effects of extreme scores. […] Conventional methods to construct confidence intervals rely on sample standard deviations to estimate standard errors. These methods also rely on critical values in central test distributions, such as t and z, that assume normality or homoscedasticity […] Such distributional assumptions are not always plausible. […] One option to deal with outliers is to apply transformations, which convert original scores with a mathematical operation to new ones that may be more normally distributed. The effect of applying a monotonic transformation is to compress one part of the distribution more than another, thereby changing its shape but not the rank order of the scores. […] It can be difficult to find a transformation that works in a particular data set. Some distributions can be so severely nonnormal that basically no transformation will work. […] An alternative that also deals with departures from distributional assumptions is robust estimation. Robust (resistant) estimators are generally less affected than least squares estimators by outliers or nonnormality.”

“An estimator’s quantitative robustness can be described by its finite-sample breakdown point (BP), or the smallest proportion of scores that when made arbitrarily very large or small renders the statistic meaningless. The lower the value of BP, the less robust the estimator. For both M and s2, BP = 0, the lowest possible value. This is because the value of either statistic can be distorted by a single outlier, and the ratio 1/N approaches zero as sample size increases. In contrast, BP = .50 for the median because its value is not distorted by arbitrarily extreme scores unless they make up at least half the sample. But the median is not an optimal estimator because its value is determined by a single score, the one at the 50th percentile. In this sense, all the other scores are discarded by the median. A compromise between the sample mean and the median is the trimmed mean. A trimmed mean Mtr is calculated by (a) ordering the scores from lowest to highest, (b) deleting the same proportion of the most extreme scores from each tail of the distribution, and then (c) calculating the average of the scores that remain. […] A common practice is to trim 20% of the scores from each tail of the distribution when calculating trimmed estimators. This proportion tends to maintain the robustness of trimmed means while minimizing their standard errors when sampling from symmetrical distributions […] For 20% trimmed means, BP = .20, which says they are robust against arbitrarily extreme scores unless such outliers make up at least 20% of the sample.”

The standard H0 is both a point hypothesis and a nil hypothesis. A point hypothesis specifies the numerical value of a parameter or the difference between two or more parameters, and a nil hypothesis states that this value is zero. The latter is usually a prediction that an effect, difference, or association is zero. […] Nil hypotheses as default explanations may be fine in new research areas when it is unknown whether effects exist at all. But they are less suitable in established areas when it is known that some effect is probably not zero. […] Nil hypotheses are tested much more often than non-nil hypotheses even when the former are implausible. […] If a nil hypothesis is implausible, estimated probabilities of data will be too low. This means that risk for Type I error is basically zero and a Type II error is the only possible kind when H0 is known in advance to be false.”

“Too many researchers treat the conventional levels of α, either .05 or .01, as golden rules. If other levels of α are specifed, they tend to be even lower […]. Sanctification of .05 as the highest “acceptable” level is problematic. […] Instead of blindly accepting either .05 or .01, one does better to […] [s]pecify a level of α that reﬂects the desired relative seriousness (DRS) of Type I error versus Type II error. […] researchers should not rely on a mechanical ritual (i.e., automatically specify .05 or .01) to control risk for Type I error that ignores the consequences of Type II error.”

“Although p and α are derived in the same theoretical sampling distribution, p does not estimate the conditional probability of a Type I error […]. This is because p is based on a range of results under H0, but α has nothing to do with actual results and is supposed to be specified before any data are collected. Confusion between p and α is widespread […] To differentiate the two, Gigerenzer (1993) referred to p as the exact level of significance. If p = .032 and α = .05, H0 is rejected at the .05 level, but .032 is not the long-run probability of Type I error, which is .05 for this example. The exact level of significance is the conditional probability of the data (or any result even more extreme) assuming H0 is true, given all other assumptions about sampling, distributions, and scores. […] Because p values are estimated assuming that H0 is true, they do not somehow measure the likelihood that H0 is correct. […] The false belief that p is the probability that H0 is true, or the inverse probability error […] is widespread.”

“Probabilities from significance tests say little about effect size. This is because essentially any test statistic (TS) can be expressed as the product TS = ES × f(N) […] where ES is an effect size and f(N) is a function of sample size. This equation explains how it is possible that (a) trivial effects can be statistically significant in large samples or (b) large effects may not be statistically significant in small samples. So p is a confounded measure of effect size and sample size.”

“Power is the probability of getting statistical significance over many random replications when H1 is true. it varies directly with sample size and the magnitude of the population effect size. […] This combination leads to the greatest power: a large population effect size, a large sample, a higher level of α […], a within-subjects design, a parametric test rather than a nonparametric test (e.g., t instead of Mann–Whitney), and very reliable scores. […] Power .80 is generally desirable, but an even higher standard may be need if consequences of Type II error are severe. […] Reviews from the 1970s and 1980s indicated that the typical power of behavioral science research is only about .50 […] and there is little evidence that power is any higher in more recent studies […] Ellis (2010) estimated that < 10% of studies have samples sufficiently large to detect smaller population effect sizes. Increasing sample size would address low power, but the number of additional cases necessary to reach even nominal power when studying smaller effects may be so great as to be practically impossible […] Too few researchers, generally < 20% (Osborne, 2008), bother to report prospective power despite admonitions to do so […] The concept of power does not stand without significance testing. as statistical tests play a smaller role in the analysis, the relevance of power also declines. If significance tests are not used, power is irrelevant. Cumming (2012) described an alternative called precision for research planning, where the researcher specifies a target margin of error for estimating the parameter of interest. […] The advantage over power analysis is that researchers must consider both effect size and precision in study planning.”

“Classical nonparametric tests are alternatives to the parametric t and F tests for means (e.g., the Mann–Whitney test is the nonparametric analogue to the t test). Nonparametric tests generally work by converting the original scores to ranks. They also make fewer assumptions about the distributions of those ranks than do parametric tests applied to the original scores. Nonparametric tests date to the 1950s–1960s, and they share some limitations. One is that they are not generally robust against heteroscedasticity, and another is that their application is typically limited to single-factor designs […] Modern robust tests are an alternative. They are generally more ﬂexible than nonparametric tests and can be applied in designs with multiple factors. […] At the end of the day, robust statistical tests are subject to many of the same limitations as other statistical tests. For example, they assume random sampling albeit from population distributions that may be nonnormal or heteroscedastic; they also assume that sampling error is the only source of error variance. Alternative tests, such as the Welch–James and Yuen–Welch versions of a robust t test, do not always yield the same p value for the same data, and it is not always clear which alternative is best (Wilcox, 2003).”