## Beyond Significance Testing (IV)

Below I have added some quotes from chapters 5, 6, and 7 of the book.

…

“There are two broad classes of standardized effect sizes for analysis at the group or variable level, the **d family**, also known as **group difference indexes**, and the ***r* family**, or **relationship indexes** […] Both families are **metric- (unit-) free effect sizes** that can compare results across studies or variables measured in different original metrics. Effect sizes in the *d* family are standardized mean differences that describe mean contrasts in standard deviation units, which can exceed 1.0 in absolute value. Standardized mean differences are signed effect sizes, where the sign of the statistic indicates the direction of the corresponding contrast. Effect sizes in the *r* family are scaled in correlation units that generally range from –1.0 to +1.0, where the sign indicates the direction of the relation […] Measures of association are **unsigned effect sizes** and thus do not indicate directionality.”
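The two families are closely related: in a two-group design a *d* can be converted into a point-biserial *r*. A quick Python sketch of both (my own illustration, not the book's; the means, standard deviations, and group sizes are invented):

```python
import math

def cohens_d(m1, m2, s1, s2, n1, n2):
    """Standardized mean difference using the pooled standard deviation."""
    sp = math.sqrt(((n1 - 1) * s1**2 + (n2 - 1) * s2**2) / (n1 + n2 - 2))
    return (m1 - m2) / sp   # signed: positive means group 1 scored higher

def d_to_r(d, n1, n2):
    """Convert d to the point-biserial r for a two-group design."""
    a = (n1 + n2) ** 2 / (n1 * n2)   # correction term for the group split
    return d / math.sqrt(d**2 + a)

d = cohens_d(104.0, 100.0, 9.0, 9.0, 50, 50)   # a 4-point gap, SD = 9
print(round(d, 3))
print(round(d_to_r(d, 50, 50), 3))
```

With equal group sizes the correction term reduces to 4, giving the familiar r = d/√(d² + 4).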

“The correlation *r*_{pb} is for designs with two unrelated samples. […] *r*_{pb} […] is affected by base rate, or the proportion of cases in one group versus the other, *p* and *q*. It tends to be highest in balanced designs. As the design becomes more unbalanced holding all else constant, *r*_{pb} approaches zero. […] *r*_{pb} is not directly comparable across studies with dissimilar relative group sizes […]. The correlation *r*_{pb} is also affected by the total variability (i.e., *S*_{T}). If this variation is not constant over samples, values of *r*_{pb} may not be directly comparable.”
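The base-rate dependence is easy to demonstrate: hold the standardized mean difference fixed and vary only the group split (my sketch, with made-up numbers):

```python
import math

def d_to_r_pb(d, n1, n2):
    """Point-biserial r implied by a given d and the group split n1:n2."""
    a = (n1 + n2) ** 2 / (n1 * n2)
    return d / math.sqrt(d**2 + a)

d = 0.50  # the same standardized mean difference in every design
for n1, n2 in [(50, 50), (80, 20), (95, 5)]:
    print(n1, n2, round(d_to_r_pb(d, n1, n2), 3))
```

Same *d* = 0.50 throughout, yet *r*_{pb} falls from about .24 in the balanced design to about .11 at a 95:5 split.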

“Too many researchers neglect to report reliability coefficients for scores analyzed. This is regrettable because effect sizes cannot be properly interpreted without knowing whether the scores are precise. The general effect of measurement error in comparative studies is to attenuate absolute standardized effect sizes and reduce the power of statistical tests. Measurement error also contributes to variation in observed results over studies. Of special concern is when both score reliabilities and sample sizes vary from study to study. If so, effects of sampling error are confounded with those due to measurement error. […] There are ways to correct some effect sizes for measurement error (e.g., Baguley, 2009), but corrected effect sizes are rarely reported. It is more surprising that measurement error is ignored in most meta-analyses, too. F. L. Schmidt (2010) found that corrected effect sizes were analyzed in only about 10% of the 199 meta-analytic articles published in *Psychological Bulletin* from 1978 to 2006. This implies that (a) estimates of mean effect sizes may be too low and (b) the wrong statistical model may be selected when attempting to explain between-studies variation in results. If a fixed effects model is mistakenly chosen over a random effects model, confidence intervals based on average effect sizes tend to be too narrow, which can make those results look more precise than they really are. Underestimating mean effect sizes while simultaneously overstating their precision is a potentially serious error.”
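The classical correction here is Spearman's correction for attenuation, a simple special case of the corrections Baguley (2009) discusses: divide the observed correlation by the square root of the product of the two reliabilities. One line of code (my illustration; the reliabilities are invented):

```python
import math

def disattenuate(r_xy, rxx, ryy):
    """Spearman's correction: estimated correlation between true scores."""
    return r_xy / math.sqrt(rxx * ryy)

# An observed r of .30 with score reliabilities of .70 and .80
print(round(disattenuate(0.30, 0.70, 0.80), 3))
```

Since the divisor is always ≤ 1, the corrected estimate is always at least as large in absolute value as the observed one, which is exactly why ignoring the correction biases meta-analytic means downward.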

“[D]emonstration of an effect’s significance — whether theoretical, practical, or clinical — calls for more discipline-specific expertise than the estimation of its magnitude”.

“Some outcomes are categorical instead of continuous. The levels of a categorical outcome are mutually exclusive, and each case is classified into just one level. […] The **risk difference** (RD) is defined as *p*_{C} – *p*_{T}, and it estimates the parameter π_{C} – π_{T}. [*Those ‘n-resembling letters’ are how wordpress displays pi; this is one of an almost infinite number of reasons why I detest blogging equations on this blog and usually do not do this – US*] […] The **risk ratio** (RR) is the ratio of the risk rates […] which rate appears in the numerator versus the denominator is arbitrary, so one should always explain how RR is computed. […] The odds ratio (OR) is the ratio of the within-groups odds for the undesirable event. […] A convenient property of OR is that it can be converted to a kind of standardized mean difference known as **logit** *d* (Chinn, 2000). […] Reporting logit *d* may be of interest when the hypothetical variable that underlies the observed dichotomy is continuous.”
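All four effect sizes can be computed from a single 2×2 frequency table; the logit *d* conversion divides ln(OR) by π/√3 ≈ 1.81 (Chinn, 2000). My sketch, with an invented table:

```python
import math

def risk_effect_sizes(a, b, c, d):
    """2 x 2 table of counts.
    a, b = control events / non-events; c, d = treatment events / non-events."""
    p_c = a / (a + b)                  # control risk
    p_t = c / (c + d)                  # treatment risk
    rd = p_c - p_t                     # risk difference
    rr = p_c / p_t                     # risk ratio (control risk in numerator)
    odds_ratio = (a * d) / (b * c)     # cross-product odds ratio
    logit_d = math.log(odds_ratio) * math.sqrt(3) / math.pi  # Chinn (2000)
    return rd, rr, odds_ratio, logit_d

rd, rr, or_, ld = risk_effect_sizes(40, 60, 20, 80)
print(round(rd, 3), round(rr, 3), round(or_, 3), round(ld, 3))
```

Note the sketch follows the book's advice by stating which group's risk is in the numerator.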

“The risk difference RD is easy to interpret but has a drawback: Its range depends on the values of the population proportions π_{C} and π_{T}. That is, the range of RD is greater when both π_{C} and π_{T} are closer to .50 than when they are closer to either 0 or 1.00. The implication is that RD values may not be comparable across different studies when the corresponding parameters π_{C} and π_{T} are quite different. The risk ratio RR is also easy to interpret. It has the shortcoming that only the finite interval from 0 to < 1.0 indicates lower risk in the group represented in the numerator, but the interval from > 1.00 to infinity is theoretically available for describing higher risk in the same group. The range of RR varies according to its denominator. This property limits the value of RR for comparing results across different studies. […] The odds ratio OR shares the limitation that the finite interval from 0 to < 1.0 indicates lower risk in the group represented in the numerator, but the interval from > 1.0 to infinity describes higher risk for the same group. Analyzing natural log transformations of OR and then taking antilogs of the results deals with this problem, just as for RR. The odds ratio may be the least intuitive of the comparative risk effect sizes, but it probably has the best overall statistical properties. This is because OR can be estimated in prospective studies, in studies that randomly sample from exposed and unexposed populations, and in retrospective studies where groups are first formed based on the presence or absence of a disease before their exposure to a putative risk factor is determined […]. Other effect sizes may not be valid in retrospective studies (RR) or in studies without random sampling ([*Pearson correlations between dichotomous variables, US*]).”
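The log transformation restores the missing symmetry: a halving and a doubling of risk sit asymmetrically on the ratio scale (0.5 vs. 2.0) but equidistant from zero on the log scale, which is why ratios are pooled as logs and then back-transformed. A two-line illustration (mine, not the book's):

```python
import math

# Asymmetric on the ratio scale, symmetric on the log scale:
for ratio in (0.5, 2.0):
    print(ratio, round(math.log(ratio), 3))

# Averaging is done on the log scale, then antilogged:
mean_log = (math.log(0.5) + math.log(2.0)) / 2
print(round(math.exp(mean_log), 3))   # geometric mean = 1.0 (no net effect)
```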

“Sensitivity and specificity are determined by the threshold on a screening test. This means that different thresholds on the same test will generate different sets of sensitivity and specificity values in the same sample. But both sensitivity and specificity are independent of population base rate and sample size. […] Sensitivity and specificity affect predictive value, the proportion of test results that are correct […] In general, predictive values increase as sensitivity and specificity increase. […] Predictive value is also influenced by the base rate (BR), the proportion of all cases with the disorder […] In general, PPV [positive predictive value] decreases and NPV [*negative…*] increases as BR approaches zero. This means that screening tests tend to be more useful for ruling out rare disorders than correctly predicting their presence. It also means that most positive results may be false positives under low base rate conditions. This is why it is difficult for researchers or social policy makers to screen large populations for rare conditions without many false positives. […] The effect of BR on predictive values is striking but often overlooked, even by professionals […]. One misunderstanding involves confusing sensitivity and specificity, which are invariant to BR, with PPV and NPV, which are not. This means that diagnosticians fail to adjust their estimates of test accuracy for changes in base rates, which exemplifies the base rate fallacy. […] In general, test results have greater impact on changing the pretest odds when the base rate is moderate, neither extremely low (close to 0) nor extremely high (close to 1.0). But if the target disorder is either very rare or very common, only a result from a highly accurate screening test will change things much.”
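The base-rate effect on predictive values follows directly from Bayes' theorem, and the numbers are worth seeing (my sketch; the sensitivity and specificity values are invented):

```python
def predictive_values(sens, spec, base_rate):
    """PPV and NPV from sensitivity, specificity, and base rate (Bayes)."""
    tp = sens * base_rate              # true positives (as proportions)
    fp = (1 - spec) * (1 - base_rate)  # false positives
    fn = (1 - sens) * base_rate        # false negatives
    tn = spec * (1 - base_rate)        # true negatives
    return tp / (tp + fp), tn / (tn + fn)

# A fairly accurate test (sens = spec = .90) at two base rates
for br in (0.50, 0.01):
    ppv, npv = predictive_values(0.90, 0.90, br)
    print(br, round(ppv, 3), round(npv, 3))
```

At a 1% base rate even this test yields a PPV near .08, i.e., more than 9 of every 10 positives are false, while NPV stays above .99, which is the book's point about ruling out rare disorders.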

“The technique of ANCOVA [*ANalysis of COVAriance, US*] has two more assumptions than ANOVA does. One is **homogeneity of regression**, which requires equal within-populations unstandardized regression coefficients for predicting outcome from the covariate. In nonexperimental designs where groups differ systematically on the covariate […] the homogeneity of regression assumption is rather likely to be violated. The second assumption is that the covariate is measured without error […] Violation of either assumption may lead to inaccurate results. For example, an unreliable covariate in experimental designs causes loss of statistical power and in nonexperimental designs may also cause inaccurate adjustment of the means […]. In nonexperimental designs where groups differ systematically, these two extra assumptions are especially likely to be violated. An alternative to ANCOVA is **propensity score analysis** (PSA). It involves the use of logistic regression to estimate the probability for each case of belonging to different groups, such as treatment versus control, in designs without randomization, given the covariate(s). These probabilities are the propensities, and they can be used to match cases from nonequivalent groups.”
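A bare-bones version of the propensity step can be sketched without any statistics library: fit a one-covariate logistic regression of group membership by gradient descent, then match treated cases to controls on the fitted probabilities. This is my toy illustration only, not a recommended implementation; real PSA uses many covariates and far more careful matching:

```python
import math
import random

def fit_propensity(xs, groups, lr=0.1, epochs=2000):
    """Logistic regression of group membership (1 = treatment) on one
    covariate, fit by plain gradient descent; returns (intercept, slope)."""
    b0 = b1 = 0.0
    n = len(xs)
    for _ in range(epochs):
        g0 = g1 = 0.0
        for x, y in zip(xs, groups):
            p = 1 / (1 + math.exp(-(b0 + b1 * x)))
            g0 += p - y
            g1 += (p - y) * x
        b0 -= lr * g0 / n
        b1 -= lr * g1 / n
    return b0, b1

def propensity(x, b0, b1):
    return 1 / (1 + math.exp(-(b0 + b1 * x)))

# Toy nonequivalent groups: treated cases tend to score higher on the covariate
random.seed(1)
control = [random.gauss(0.0, 1.0) for _ in range(30)]
treated = [random.gauss(1.0, 1.0) for _ in range(30)]
b0, b1 = fit_propensity(control + treated, [0] * 30 + [1] * 30)

# Match each treated case to the control case with the nearest propensity
for x_t in treated[:3]:
    p_t = propensity(x_t, b0, b1)
    best = min(control, key=lambda x_c: abs(propensity(x_c, b0, b1) - p_t))
    print(round(x_t, 2), "->", round(best, 2))
```

The point of the matching step is that within a matched pair the groups no longer differ systematically on the covariate, which is precisely the situation where ANCOVA's extra assumptions were at risk.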