## The Mathematical Challenge of Large Networks

This is another one of the aforementioned lectures I watched a while ago, but had never got around to blogging:

…

If I had to watch this one again, I’d probably skip most of the second half; it contains highly technical coverage of topics in graph theory, and it was very difficult for me to follow (but I did watch it to the end, just out of curiosity).

The lecturer has put up a ~500 page publication on these and related topics, which is available here, so if you want to know more that’s an obvious place to go have a look. A few other relevant links to stuff mentioned/covered in the lecture:

Szemerédi regularity lemma.

Graphon.

Turán’s theorem.

Quantum graph.

## A few diabetes papers of interest

i. Association Between Blood Pressure and Adverse Renal Events in Type 1 Diabetes.

“The Joint National Committee and American Diabetes Association guidelines currently recommend a blood pressure (BP) target of <140/90 mmHg for all adults with diabetes, regardless of type (1–3). However, evidence used to support this recommendation is primarily based on data from trials of type 2 diabetes (4–6). The relationship between BP and adverse outcomes in type 1 and type 2 diabetes may differ, given that the type 1 diabetes population is typically much younger at disease onset, hypertension is less frequently present at diagnosis (3), and the basis for the pathophysiology and disease complications may differ between the two populations.

Prior prospective cohort studies (7,8) of patients with type 1 diabetes suggested that lower BP levels (<110–120/70–80 mmHg) at baseline entry were associated with a lower risk of adverse renal outcomes, including incident microalbuminuria. In one trial of antihypertensive treatment in type 1 diabetes (9), assignment to a lower mean arterial pressure (MAP) target of <92 mmHg (corresponding to ∼125/75 mmHg) led to a significant reduction in proteinuria compared with a MAP target of 100–107 mmHg (corresponding to ∼130–140/85–90 mmHg). Thus, it is possible that lower BP (<120/80 mmHg) reduces the risk of important renal outcomes, such as proteinuria, in patients with type 1 diabetes and may provide a synergistic benefit with intensive glycemic control on renal outcomes (10–12). However, fewer studies have examined the association between BP levels over time and the risk of more advanced renal outcomes, such as stage III chronic kidney disease (CKD) or end-stage renal disease (ESRD)”.

“The primary objective of this study was to determine whether there is an association between lower BP levels and the risk of more advanced diabetic nephropathy, defined as macroalbuminuria or stage III CKD, within a background of different glycemic control strategies […] We included 1,441 participants with type 1 diabetes between the ages of 13 and 39 years who had previously been randomized to receive intensive versus conventional glycemic control in the Diabetes Control and Complications Trial (DCCT). The exposures of interest were time-updated systolic BP (SBP) and diastolic BP (DBP) categories. Outcomes included macroalbuminuria (>300 mg/24 h) or stage III chronic kidney disease (CKD) […] During a median follow-up time of 24 years, there were 84 cases of stage III CKD and 169 cases of macroalbuminuria. In adjusted models, SBP in the […] (95% CI 1.05–1.21), and a 1.04 times higher risk of ESRD (95% CI 0.77–1.41) in adjusted Cox models. Every 10 mmHg increase in DBP was associated with a 1.17 times higher risk of microalbuminuria (95% CI 1.03–1.32), a 1.15 times higher risk of eGFR decline to […] (95% CI 1.04–1.29), and a 0.80 times higher risk of ESRD (95% CI 0.47–1.38) in adjusted models. […] Because these data are observational, they cannot prove causation. It remains possible that subtle kidney disease may lead to early elevations in BP, and we cannot rule out the potential for reverse causation in our findings. However, we note similar trends in our data even when imposing a 7-year lag between BP and CKD ascertainment.”

“**CONCLUSIONS** A lower BP (<120/70 mmHg) was associated with a substantially lower risk of adverse renal outcomes, regardless of the prior assigned glycemic control strategy. Interventional trials may be useful to help determine whether the currently recommended BP target of 140/90 mmHg may be too high for optimal renal protection in type 1 diabetes.”

It’s important to keep in mind when interpreting these results that endpoints like ESRD and stage III CKD are not the only relevant outcomes in this setting; even mild-stage kidney disease in diabetics significantly increases the risk of death from cardiovascular disease, and a substantial proportion of patients may die from cardiovascular disease before reaching a late-stage kidney disease endpoint (here’s a relevant link).

…

ii. Identifying Causes for Excess Mortality in Patients With Diabetes: Closer but Not There Yet.

“A number of epidemiological studies have quantified the risk of death among patients with diabetes and assessed the causes of death (2–6), with highly varying results […] Overall, the studies to date have confirmed that diabetes is associated with an increased risk of all-cause mortality, but the magnitude of this excess risk is highly variable, with the relative risk ranging from 1.15 to 3.15. Nevertheless, all studies agree that mortality is mainly attributable to cardiovascular causes (2–6). On the other hand, studies of cancer-related death have generally been lacking despite the diabetes–cancer association and a number of plausible biological mechanisms identified to explain this link (8,9). In fact, studies assessing the specific causes of noncardiovascular death in diabetes have been sparse. […] In this issue of *Diabetes Care*, Baena-Díez et al. (10) report on an observational study of the association between diabetes and cause-specific death. This study involved 55,292 individuals from 12 Spanish population cohorts with no prior history of cardiovascular disease, aged 35 to 79 years, with a 10-year follow-up. […] This study found that individuals with diabetes compared with those without diabetes had a higher risk of cardiovascular death, cancer death, and noncardiovascular noncancer death with similar estimates obtained using the two statistical approaches. […] Baena-Díez et al. (10) showed that individuals with diabetes have an approximately threefold increased risk of cardiovascular mortality, which is much higher than what has been reported by recent studies (5,6). While this may be due to the lack of adjustment for important confounders in this study, there remains uncertainty regarding the magnitude of this increase.”

“[A]ll studies of excess mortality associated with diabetes, including the current one, have produced highly variable results. The reasons may be methodological. For instance, it may be that because of the wide range of age in these studies, comparing the rates of death between the patients with diabetes and those without diabetes using a measure based on the ratio of the rates may be misleading because the ratio can vary by age [*it almost certainly does vary by age, US*]. Instead, a measure based on the difference in rates may be more appropriate (16). Another issue relates to the fact that the studies include patients with longstanding diabetes of variable duration, resulting in so-called prevalent cohorts that can result in muddled mortality estimates since these are necessarily based on a mix of patients at different stages of disease (17). Thus, a paradigm change may be in order for future observational studies of diabetes and mortality, in the way they are both designed and analyzed. With respect to cancer, such studies will also need to tease out the independent contribution of antidiabetes treatments on cancer incidence and mortality (18–20). It is thus clear that the quantification of the excess mortality associated with diabetes per se will need more accurate tools.”
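To make the ratio-versus-difference point concrete, here is a minimal sketch. The death rates are hypothetical numbers I made up for illustration (they are not taken from any of the studies above); the point is only that the same data can show a falling rate ratio and a rising rate difference across age bands.

```python
# Hypothetical death rates per 1,000 person-years; illustrative numbers only,
# not taken from any of the studies discussed above.
rates = {
    # age band: (rate without diabetes, rate with diabetes)
    "40-49": (2.0, 6.0),
    "60-69": (15.0, 25.0),
    "80-89": (90.0, 110.0),
}


def rate_ratio(unexposed, exposed):
    """Relative measure: how many times higher the rate is with diabetes."""
    return exposed / unexposed


def rate_difference(unexposed, exposed):
    """Absolute measure: excess deaths per 1,000 person-years with diabetes."""
    return exposed - unexposed


summary = {
    band: (rate_ratio(r0, r1), rate_difference(r0, r1))
    for band, (r0, r1) in rates.items()
}

for band, (rr, rd) in summary.items():
    print(f"{band}: rate ratio {rr:.2f}, rate difference {rd:.1f} per 1,000 person-years")
```

With these made-up inputs the ratio shrinks with age while the absolute excess grows, so a single pooled ratio from a cohort with a wide age range would depend heavily on the age mix.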

…

iii. Risk of Cause-Specific Death in Individuals With Diabetes: A Competing Risks Analysis. This is the paper some of the results of which were discussed above. I’ll just include the highlights here:

“**RESULTS** We included 55,292 individuals (15.6% with diabetes and overall mortality of 9.1%). The adjusted hazard ratios showed that diabetes increased mortality risk: *1*) cardiovascular death, CSH = 2.03 (95% CI 1.63–2.52) and PSH = 1.99 (1.60–2.49) in men; and CSH = 2.28 (1.75–2.97) and PSH = 2.23 (1.70–2.91) in women; *2*) cancer death, CSH = 1.37 (1.13–1.67) and PSH = 1.35 (1.10–1.65) in men; and CSH = 1.68 (1.29–2.20) and PSH = 1.66 (1.25–2.19) in women; and *3*) noncardiovascular noncancer death, CSH = 1.53 (1.23–1.91) and PSH = 1.50 (1.20–1.89) in men; and CSH = 1.89 (1.43–2.48) and PSH = 1.84 (1.39–2.45) in women. In all instances, the cumulative mortality function was significantly higher in individuals with diabetes.

**CONCLUSIONS** Diabetes is associated with premature death from cardiovascular disease, cancer, and noncardiovascular noncancer causes.”

“**Summary**

Diabetes is associated with premature death from cardiovascular diseases (coronary heart disease, stroke, and heart failure), several cancers (liver, colorectal, and lung), and other diseases (chronic obstructive pulmonary disease and liver and kidney disease). In addition, the cause-specific cumulative mortality for cardiovascular, cancer, and noncardiovascular noncancer causes was significantly higher in individuals with diabetes, compared with the general population. The dual analysis with CSH and PSH methods provides a comprehensive view of mortality dynamics in the population with diabetes. This approach identifies the individuals with diabetes as a vulnerable population for several causes of death aside from the traditionally reported cardiovascular death.”
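As I read the acronyms, CSH refers to cause-specific hazards and PSH to (proportional) subdistribution hazards. A small sketch may help show why competing risks need special handling at all. It assumes constant cause-specific hazards, which is a simplification of mine and not the paper's model, and all numbers are hypothetical. Under that assumption the cumulative incidence of one cause depends on the competing hazard as well, and treating competing deaths as ordinary censoring overstates it.

```python
import math


def cumulative_incidence(h_cause, h_competing, t):
    """P(death from the cause of interest by time t) with constant cause-specific hazards."""
    total = h_cause + h_competing
    return h_cause / total * (1.0 - math.exp(-total * t))


def naive_risk(h_cause, t):
    """Risk obtained by treating competing deaths as censoring: 1 - exp(-h * t)."""
    return 1.0 - math.exp(-h_cause * t)


# Hypothetical constant hazards per person-year (illustrative only).
h_cvd = 0.010
low_competing, high_competing = 0.005, 0.050
t = 10.0

cif_low = cumulative_incidence(h_cvd, low_competing, t)
cif_high = cumulative_incidence(h_cvd, high_competing, t)
naive = naive_risk(h_cvd, t)

print(f"10-year CVD death risk, low competing hazard:  {cif_low:.4f}")
print(f"10-year CVD death risk, high competing hazard: {cif_high:.4f}")
print(f"Naive estimate ignoring competing deaths:      {naive:.4f}")
```

The cause-specific hazard for cardiovascular death is identical in both scenarios, yet the cumulative incidence differs, which is one reason reporting both kinds of hazard ratio plus the cumulative mortality function gives a fuller picture.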

…

iv. Disability-Free Life-Years Lost Among Adults Aged ≥50 Years With and Without Diabetes.

“**RESEARCH DESIGN AND METHODS** Adults (*n* = 20,008) aged 50 years and older were followed from 1998 to 2012 in the Health and Retirement Study, a prospective biannual survey of a nationally representative sample of adults. Diabetes and disability status (defined by mobility loss, difficulty with instrumental activities of daily living [IADL], and/or difficulty with activities of daily living [ADL]) were self-reported. We estimated incidence of disability, remission to nondisability, and mortality. We developed a discrete-time Markov simulation model with a 1-year transition cycle to predict and compare lifetime disability-related outcomes between people with and without diabetes. Data represent the U.S. population in 1998.

**RESULTS** From age 50 years, adults with diabetes died 4.6 years earlier, developed disability 6–7 years earlier, and spent about 1–2 more years in a disabled state than adults without diabetes. With increasing baseline age, diabetes was associated with significant (*P* < 0.05) reductions in the number of total and disability-free life-years, but the absolute difference in years between those with and without diabetes was less than at younger baseline age. Men with diabetes spent about twice as many of their remaining years disabled (20–24% of remaining life across the three disability definitions) as men without diabetes (12–16% of remaining life across the three disability definitions). Similar associations between diabetes status and disability-free and disabled years were observed among women.

**CONCLUSIONS** Diabetes is associated with a substantial reduction in nondisabled years, to a greater extent than the reduction of longevity. […] Using a large, nationally representative cohort of Americans aged 50 years and older, we found that diabetes is associated with a substantial deterioration of nondisabled years and that this is a greater number of years than the loss of longevity associated with diabetes. On average, a middle-aged adult with diabetes has an onset of disability 6–7 years earlier than one without diabetes, spends 1–2 more years with disability, and loses 7 years of disability-free life to the condition. Although other nationally representative studies have reported large reductions in complications (9) and mortality among the population with diabetes in recent decades (1), these studies, akin to our results, suggest that diabetes continues to have a substantial impact on morbidity and quality of remaining years of life.”
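The methods paragraph describes a discrete-time Markov simulation with a 1-year cycle. Below is a rough sketch of that kind of model with three states (nondisabled, disabled, dead). The transition probabilities are hypothetical and held constant over age, which the actual study presumably did not do; the function name and parameters are my own.

```python
def expected_years(p_disable, p_recover, p_die_well, p_die_disabled, cycles=50):
    """Propagate a cohort through a 3-state Markov chain with 1-year cycles.

    Returns (expected nondisabled years, expected disabled years) over the horizon.
    Each cycle credits a full year to the state occupied at the start of the cycle.
    """
    well, disabled = 1.0, 0.0  # everyone starts alive and nondisabled
    years_well = years_disabled = 0.0
    for _ in range(cycles):
        years_well += well
        years_disabled += disabled
        next_well = well * (1.0 - p_disable - p_die_well) + disabled * p_recover
        next_disabled = well * p_disable + disabled * (1.0 - p_recover - p_die_disabled)
        well, disabled = next_well, next_disabled
    return years_well, years_disabled


# Hypothetical annual transition probabilities (illustrative only).
without_dm = expected_years(p_disable=0.03, p_recover=0.10, p_die_well=0.02, p_die_disabled=0.06)
with_dm = expected_years(p_disable=0.045, p_recover=0.09, p_die_well=0.025, p_die_disabled=0.07)

for label, (yw, yd) in (("without diabetes", without_dm), ("with diabetes", with_dm)):
    share = yd / (yw + yd)
    print(f"{label}: {yw:.1f} nondisabled years, {yd:.1f} disabled years ({share:.0%} disabled)")
```

Even this crude version reproduces the qualitative pattern in the results: fewer total years, fewer disability-free years, and a larger share of remaining life spent disabled when disability incidence and mortality are both higher.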

…

“People with type 1 diabetes have a documented shorter life expectancy than the general population without diabetes (1). Cardiovascular disease (CVD) is the main cause of the excess morbidity and mortality, and despite advances in management and therapy, individuals with type 1 diabetes have a markedly elevated risk of cardiovascular events and death compared with the general population (2).

Lipid-lowering treatment with hydroxymethylglutaryl-CoA reductase inhibitors (statins) prevents major cardiovascular events and death in a broad spectrum of patients (3,4). […] We hypothesized that primary prevention with lipid-lowering therapy (LLT) can reduce the incidence of cardiovascular morbidity and mortality in individuals with type 1 diabetes. The aim of the study was to examine this in a nationwide longitudinal cohort study of patients with no history of CVD. […] A total of 24,230 individuals included in 2006–2008 NDR with type 1 diabetes without a history of CVD were followed until 31 December 2012; 18,843 were untreated and 5,387 treated with LLT [Lipid-Lowering Therapy] (97% statins). The mean follow-up was 6.0 years. […] Hazard ratios (HRs) for treated versus untreated were as follows: cardiovascular death 0.60 (95% CI 0.50–0.72), all-cause death 0.56 (0.48–0.64), fatal/nonfatal stroke 0.56 (0.46–0.70), fatal/nonfatal acute myocardial infarction 0.78 (0.66–0.92), fatal/nonfatal coronary heart disease 0.85 (0.74–0.97), and fatal/nonfatal CVD 0.77 (0.69–0.87).

**CONCLUSIONS** This observational study shows that LLT is associated with 22–44% reduction in the risk of CVD and cardiovascular death among individuals with type 1 diabetes without history of CVD and underlines the importance of primary prevention with LLT to reduce cardiovascular risk in type 1 diabetes.”

…

“In many prognostic factor studies, multivariate analyses using the Cox proportional hazards model are applied to identify independent prognostic factors. However, the coefficient estimates derived from the Cox proportional hazards model may be biased as a result of violating assumptions of independence. […] RPA [Recursive Partitioning Analysis] classification is a useful tool that could prioritize the prognostic factors and divide the subjects into distinctive groups. RPA has an advantage over the proportional hazards model in identifying prognostic factors because it does not require risk factor independence and, as a nonparametric technique, makes no requirement on the underlying distributions of the variables considered. Hence, it relies on fewer modeling assumptions. Also, because the method is designed to divide subjects into groups based on the length of survival, it defines groupings for risk classification, whereas Cox regression models do not. Moreover, there is no need to explicitly include covariate interactions because of the recursive splitting structure of tree model construction.”

“This is the first study that characterizes the risk factors associated with the transition from one preclinical stage to the next following a recommended staging classification system (9). The tree-structured prediction model reveals that the risk parameters are not the same across each transition. […] Based on the RPA classification, the subjects at younger age and with higher GAD65Ab [*an important biomarker in the context of autoimmune forms of diabetes, US – here’s a relevant link*] titer are at higher risk for progression to multiple positive autoantibodies from a single autoantibody (seroconversion). Approximately 70% of subjects with a single autoantibody were positive for GAD65Ab, much higher than for insulin autoantibody (24%) and IA-2A [*here’s a relevant link – US*] (5%). Our study results are consistent with those of others (22–24) in that seroconversion is age related. Previous studies in infants and children at an early age have shown that progression from single to two or more autoantibodies occurs more commonly in children […] (25). The subjects ≤16 years of age had almost triple the 5-year risk compared with subjects >16 years of age at the same GAD65Ab titer level. Hence, not all individuals with a single islet autoantibody can be thought of as being at low risk for disease progression.”

“This is the first study that identifies the risk factors associated with the timing of transitions from one preclinical stage to the next in the development of T1D. Based on RPA risk parameters, we identify the characteristics of groups with similar 5-year risks for advancing to the next preclinical stage. It is clear that individuals with one or more autoantibodies or with dysglycemia are not homogeneous with regard to the risk of disease progression. Also, there are differences in risk factors at each stage that are associated with increased risk of progression. The potential benefit of identifying these groups allows for a more informed discussion of diabetes risk and the selective enrollment of individuals into clinical trials whose risk more appropriately matches the potential benefit of an experimental intervention. Since the risk levels in these groups are substantial, their definition makes possible the design of more efficient trials with target sample sizes that are feasible, opening up the field of prevention to additional at-risk cohorts. […] Our results support the evidence that autoantibody titers are strong predictors at each transition leading to T1D development. The risk of the development of multiple autoantibodies was significantly increased when the GAD65Ab titer level was elevated, and the risk of the development of dysglycemia was increased when the IA-2A titer level increased. These indicate that better risk prediction on the timing of transitions can be obtained by evaluating autoantibody titers. The results also suggest that an autoantibody titer should be carefully considered in planning prevention trials for T1D in addition to the number of positive autoantibodies and the type of autoantibody.”

## Biodemography of aging (IV)

My working assumption as I was reading part two of the book was that I would not be covering that part of the book in much detail here because it would simply be too much work to make such posts legible to the readership of this blog. However, while writing this post I later had the thought that given that almost nobody reads along here anyway (I’m not complaining, mind you – this is how I like it these days), the main beneficiary of my blog posts will always be myself, which led to the related observation/notion that I should not be limiting my coverage of interesting stuff here simply because some hypothetical and probably nonexistent readership out there might not be able to follow the coverage. So when I started out writing this post I was working under the assumption that it would be my last post about the book, but I now feel sure that if I find the time I’ll add at least one more post about the book’s statistics coverage. On a related note I am explicitly making the observation here that this post was written for *my* benefit, not yours. You can read it if you like, or not, but it was not really written for you.

I have added bold in a few places to emphasize key concepts and observations from the quoted paragraphs and in order to make the post easier for me to navigate later (all the italics below are, on the other hand, those of the authors of the book).

…

“**Biodemography** is a multidisciplinary branch of science that unites under its umbrella various analytic approaches aimed at integrating biological knowledge and methods and traditional demographic analyses to shed more light on variability in mortality and health across populations and between individuals.

*Biodemography of aging* **is a special subfield of biodemography that focuses on understanding the impact of processes related to aging on health and longevity.**”

“Mortality rates as a function of age are a cornerstone of many demographic analyses. The longitudinal **age trajectories of biomarkers** add a new dimension to the traditional demographic analyses: the mortality rate becomes a function of not only age but also of these biomarkers (with additional dependence on a set of sociodemographic variables). Such analyses should incorporate dynamic characteristics of trajectories of biomarkers to evaluate their impact on mortality or other outcomes of interest. Traditional analyses using baseline values of biomarkers (e.g., Cox proportional hazards or logistic regression models) do not take into account these dynamics. One approach to the evaluation of the impact of biomarkers on mortality rates is to use the Cox proportional hazards model with time-dependent covariates; this approach is used extensively in various applications and is available in all popular statistical packages. In such a model, the biomarker is considered a time-dependent covariate of the hazard rate and the corresponding regression parameter is estimated along with standard errors to make statistical inference on the direction and the significance of the effect of the biomarker on the outcome of interest (e.g., mortality). However, **the choice of the analytic approach should not be governed exclusively by its simplicity or convenience of application. It is essential to consider whether the method gives meaningful and interpretable results relevant to the research agenda. In the particular case of biodemographic analyses, the Cox proportional hazards model with time-dependent covariates is not the best choice.**”

“Longitudinal studies of aging present special methodological challenges due to inherent characteristics of the data that need to be addressed in order to avoid biased inference. The challenges are related to the fact that the populations under study (aging individuals) experience substantial **dropout rates** related to death or poor health and often have co-morbid conditions related to the disease of interest. The standard assumption made in longitudinal analyses (although usually not explicitly mentioned in publications) is that dropout (e.g., death) is not associated with the outcome of interest. While this can be safely assumed in many general longitudinal studies (where, e.g., the main causes of dropout might be the administrative end of the study or moving out of the study area, which are presumably not related to the studied outcomes), the very nature of the longitudinal outcomes (e.g., measurements of some physiological biomarkers) analyzed in a longitudinal study of aging assumes that they are (at least hypothetically) related to the process of aging. Because the process of aging leads to the development of diseases and, eventually, death, in longitudinal studies of aging an assumption of non-association of the reason for dropout and the outcome of interest is, at best, risky, and usually is wrong. As an illustration, we found that the average trajectories of different physiological indices of individuals dying at earlier ages markedly deviate from those of long-lived individuals, both in the entire Framingham original cohort […] and also among carriers of specific alleles […] In such a situation, **panel compositional changes due to attrition** affect the averaging procedure and modify the averages in the total sample. Furthermore, biomarkers are subject to **measurement error** and random biological variability. They are usually collected intermittently at examination times which may be sparse and typically biomarkers are not observed at event times. 
It is well known in the statistical literature that ignoring measurement errors and biological variation in such variables and using their observed “raw” values as time-dependent covariates in a Cox regression model may lead to biased estimates and incorrect inferences […] **Standard methods of survival analysis such as the Cox proportional hazards model** (Cox 1972) **with time-dependent covariates should be avoided in analyses of biomarkers measured with errors** because they can lead to biased estimates.”
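The bias from using error-prone “raw” covariate values can be illustrated with a small simulation. To keep it short and dependency-free I have swapped in ordinary least squares for the Cox model, so this is an analogy for the mechanism (attenuation of the estimated effect toward zero), not a reproduction of the survival setting the authors discuss. All parameter values are arbitrary choices of mine.

```python
import random


def ols_slope(xs, ys):
    """Slope of the least-squares line of ys on xs."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx


rng = random.Random(0)  # fixed seed so the run is repeatable
n = 20000

# "True" biomarker values and an outcome that depends on them with slope 1.
true_x = [rng.gauss(0.0, 1.0) for _ in range(n)]
y = [1.0 * x + rng.gauss(0.0, 1.0) for x in true_x]

# Observed biomarker = true value + measurement error of the same variance.
observed_x = [x + rng.gauss(0.0, 1.0) for x in true_x]

slope_true = ols_slope(true_x, y)
slope_observed = ols_slope(observed_x, y)

print(f"slope using true values:     {slope_true:.3f}")
print(f"slope using observed values: {slope_observed:.3f}")
```

When the error variance equals the variance of the true values, the expected slope is roughly halved. Joint models address this by modeling the error-free trajectory explicitly instead of plugging in the noisy measurements.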

“Statistical methods aimed at analyses of time-to-event data jointly with longitudinal measurements have become known in the mainstream biostatistical literature as “**joint models for longitudinal and time-to-event data**” (“survival” or “failure time” are often used interchangeably with “time-to-event”) or simply “**joint models**.” This is an active and fruitful area of biostatistics with an explosive growth in recent years. […] The standard joint model consists of two parts, the first representing the dynamics of longitudinal data (which is referred to as the “longitudinal sub-model”) and the second one modeling survival or, generally, time-to-event data (which is referred to as the “survival sub-model”). […] Numerous extensions of this basic model have appeared in the joint modeling literature in recent decades, providing great flexibility in applications to a wide range of practical problems. […] The standard parameterization of the joint model (11.2) assumes that the risk of the event at age t depends on the current “true” value of the longitudinal biomarker at this age. While this is a reasonable assumption in general, it may be argued that additional dynamic characteristics of the longitudinal trajectory can also play a role in the risk of death or onset of a disease. For example, if two individuals at the same age have exactly the same level of some biomarker at this age, but the trajectory for the first individual increases faster with age than that of the second one, then the first individual can have worse survival chances for subsequent years. […] Therefore, extensions of the basic parameterization of joint models allowing for dependence of the risk of an event on such dynamic characteristics of the longitudinal trajectory can provide additional opportunities for comprehensive analyses of relationships between the risks and longitudinal trajectories. Several authors have considered such extended models. 
[…] **joint models are computationally intensive** and are sometimes prone to convergence problems [however such] models provide more efficient estimates of the effect of a covariate […] on the time-to-event outcome in the case in which there is […] an effect of the covariate on the longitudinal trajectory of a biomarker. This means that **analyses of longitudinal and time-to-event data in joint models may require smaller sample sizes to achieve comparable statistical power with analyses based on time-to-event data alone** (Chen et al. 2011).”
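The book's equation (11.2) is not reproduced in the quotes above, so the following is only a sketch of the form the standard joint model commonly takes in the biostatistical literature. The notation is mine and may well differ from the book's: $y_i(t)$ is the observed biomarker, $m_i(t)$ its error-free “true” value, $b_i$ the individual random effects, $w_i$ the baseline covariates and $h_0(t)$ the baseline hazard.

```latex
% Longitudinal sub-model: observed value = true trajectory + measurement error
y_i(t) = m_i(t) + \varepsilon_i(t), \qquad
m_i(t) = x_i(t)^{\top}\beta + z_i(t)^{\top} b_i, \qquad
b_i \sim N(0, D), \quad \varepsilon_i(t) \sim N(0, \sigma^2)

% Survival sub-model: hazard depends on the current true value of the biomarker
h_i(t) = h_0(t)\, \exp\!\left\{ \gamma^{\top} w_i + \alpha\, m_i(t) \right\}

% One possible extension: the hazard also depends on the slope of the trajectory
h_i(t) = h_0(t)\, \exp\!\left\{ \gamma^{\top} w_i + \alpha_1\, m_i(t) + \alpha_2\, \frac{d m_i(t)}{dt} \right\}
```

In the third line $\alpha_2$ captures the situation described in the quote, where two individuals with the same current biomarker level differ in risk because one trajectory is rising faster than the other.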

“To be useful as a tool for biodemographers and gerontologists who seek biological explanations for observed processes, models of longitudinal data should be based on realistic assumptions and reflect relevant knowledge accumulated in the field. An example is the shape of the risk functions. Epidemiological studies show that **the conditional hazards of health and survival events considered as functions of risk factors often have U- or J-shapes** […], so a model of aging-related changes should incorporate this information. In addition, risk variables, and, what is very important, their effects on the risks of corresponding health and survival events, experience aging-related changes and these can differ among individuals. […] An important class of models for joint analyses of longitudinal and time-to-event data incorporating a stochastic process for description of longitudinal measurements uses an epidemiologically-justified assumption of a quadratic hazard (i.e., U-shaped in general and J-shaped for variables that can take values only on one side of the U-curve) considered as a function of physiological variables. **Quadratic hazard models** have been developed and intensively applied in studies of human longitudinal data”.
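A quadratic hazard is easy to picture with a toy function. The sketch below uses a single physiological variable and fixed, hypothetical parameters (`baseline`, `optimum`, `curvature` are names I chose); as I understand it, in the models the authors describe these quantities can themselves change with age, which is not shown here.

```python
def quadratic_hazard(y, baseline=0.01, optimum=120.0, curvature=2e-5):
    """Hazard that is lowest at `optimum` and rises quadratically on either side."""
    return baseline + curvature * (y - optimum) ** 2


# U-shape: risk is symmetric around the optimum for a variable observed on both sides.
for value in (90.0, 105.0, 120.0, 135.0, 150.0):
    print(f"biomarker {value:5.1f}: hazard {quadratic_hazard(value):.4f}")

# J-shape: if only values at or above the optimum occur, only one arm of the U is seen.
one_sided = [quadratic_hazard(v) for v in (120.0, 140.0, 160.0, 180.0)]
print("one-sided hazards:", [round(h, 4) for h in one_sided])
```

The same functional form thus covers both shapes mentioned in the quote; which one shows up depends on the range of values the variable actually takes in the population.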

“Various approaches to statistical model building and data analysis that incorporate unobserved heterogeneity are ubiquitous in different scientific disciplines. **Unobserved heterogeneity** in models of health and survival outcomes can arise because there may be relevant risk factors affecting an outcome of interest that are either unknown or not measured in the data. Frailty models introduce the concept of unobserved heterogeneity in survival analysis for time-to-event data. […] Individual age trajectories of biomarkers can differ due to various observed as well as unobserved (and unknown) factors and such individual differences propagate to differences in risks of related time-to-event outcomes such as the onset of a disease or death. […] The joint analysis of longitudinal and time-to-event data is the realm of a special area of biostatistics named “joint models for longitudinal and time-to-event data” or simply “joint models” […] Approaches that incorporate heterogeneity in populations through random variables with continuous distributions (as in the standard joint models and their extensions […]) assume that the risks of events and longitudinal trajectories follow similar patterns for all individuals in a population (e.g., that biomarkers change linearly with age for all individuals). Although such homogeneity in patterns can be justifiable for some applications, generally this is a rather strict assumption […] A population under study may consist of subpopulations with distinct patterns of longitudinal trajectories of biomarkers that can also have different effects on the time-to-event outcome in each subpopulation. When such subpopulations can be defined on the base of observed covariate(s), one can perform stratified analyses applying different models for each subpopulation. 
However, observed covariates may not capture the entire heterogeneity in the population in which case it may be useful to conceive of the population as consisting of *latent* subpopulations defined by unobserved characteristics. Special methodological approaches are necessary to accommodate such hidden heterogeneity. Within the joint modeling framework, a special class of models, **joint latent class models**, was developed to account for such heterogeneity […] The joint latent class model has three components. First, it is assumed that a population consists of a fixed number of (latent) subpopulations. The latent class indicator represents the latent class membership and the probability of belonging to the latent class is specified by a multinomial logistic regression function of observed covariates. It is assumed that individuals from different latent classes have different patterns of longitudinal trajectories of biomarkers and different risks of event. The key assumption of the model is conditional independence of the biomarker and the time-to-events given the latent classes. Then the class-specific models for the longitudinal and time-to-event outcomes constitute the second and third component of the model thus completing its specification. […] **the latent class stochastic process model** […] provides a useful tool for dealing with unobserved heterogeneity in joint analyses of longitudinal and time-to-event outcomes and taking into account hidden components of aging in their joint influence on health and longevity. This approach is also helpful for sensitivity analyses in applications of the original stochastic process model. We recommend starting the analyses with the original stochastic process model and estimating the model ignoring possible hidden heterogeneity in the population. 
Then the latent class stochastic process model can be applied to test hypotheses about the presence of hidden heterogeneity in the data in order to appropriately adjust the conclusions if a latent structure is revealed.”

“**The longitudinal genetic-demographic model** (or the genetic-demographic model for longitudinal data) […] combines three sources of information in the likelihood function: (1) follow-up data on survival (or, generally, on some time-to-event) for genotyped individuals; (2) (cross-sectional) information on ages at biospecimen collection for genotyped individuals; and (3) follow-up data on survival for non-genotyped individuals. […] Such joint analyses of genotyped and non-genotyped individuals can result in substantial improvements in statistical power and accuracy of estimates compared to analyses of the genotyped subsample alone if the proportion of non-genotyped participants is large. Situations in which genetic information cannot be collected for all participants of longitudinal studies are not uncommon. They can arise for several reasons: (1) the longitudinal study may have started some time before genotyping was added to the study design so that some initially participating individuals dropped out of the study (i.e., died or were lost to follow-up) by the time of genetic data collection; (2) budget constraints prohibit obtaining genetic information for the entire sample; (3) some participants refuse to provide samples for genetic analyses. Nevertheless, even when genotyped individuals constitute a majority of the sample or the entire sample, application of such an approach is still beneficial […] **The genetic stochastic process model** […] adds a new dimension to genetic biodemographic analyses, combining information on longitudinal measurements of biomarkers available for participants of a longitudinal study with follow-up data and genetic information. 
Such **joint analyses of different sources of information** collected in both genotyped and non-genotyped individuals allow for more efficient use of the research potential of longitudinal data which otherwise remains underused when only genotyped individuals or only subsets of available information (e.g., only follow-up data on genotyped individuals) are involved in analyses. Similar to the longitudinal genetic-demographic model […], **the benefits of combining data** on genotyped and non-genotyped individuals in the genetic SPM come from the presence of common parameters describing characteristics of the model for genotyped and non-genotyped subsamples of the data. This takes into account the knowledge that the non-genotyped subsample is a mixture of carriers and non-carriers of the same alleles or genotypes represented in the genotyped subsample and applies the ideas of heterogeneity analyses […] When the non-genotyped subsample is substantially larger than the genotyped subsample, these joint analyses can lead to a noticeable increase in the power of statistical estimates of genetic parameters compared to estimates based only on information from the genotyped subsample. **This approach is applicable not only to genetic data but to any discrete time-independent variable that is observed only for a subsample of individuals in a longitudinal study.**”
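The idea that the non-genotyped subsample is a mixture of carriers and non-carriers can be illustrated with a small sketch. I assume constant (exponential) hazards purely to keep it short; the book's models are of course richer than that, and the function names and numbers below are hypothetical.

```python
import math

def exp_survival(t, rate):
    """Survival function under a constant hazard (a simplifying assumption)."""
    return math.exp(-rate * t)

def nongenotyped_survival(t, carrier_share, rate_carrier, rate_noncarrier):
    """Survival in the non-genotyped subsample, modeled as a mixture of
    carriers and non-carriers who share their hazard parameters with the
    genotyped subsample."""
    return (carrier_share * exp_survival(t, rate_carrier)
            + (1.0 - carrier_share) * exp_survival(t, rate_noncarrier))

# Hypothetical values: 25% carriers, carriers with the higher hazard.
s10 = nongenotyped_survival(10.0, 0.25, 0.05, 0.03)
```

Because the two hazard rates appear in the likelihood contributions of both subsamples, follow-up data on non-genotyped individuals still carry information about the genetic parameters.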

“Despite an existing tradition of interpreting differences in the shapes or parameters of the mortality rates (survival functions) resulting from the effects of exposure to different conditions or other interventions in terms of characteristics of individual aging, this practice has to be used with care. This is because such characteristics are difficult to interpret in terms of properties of external and internal processes affecting the chances of death. An important question then is: What kind of mortality model has to be developed to obtain parameters that are biologically interpretable? The purpose of this chapter is to describe an approach to mortality modeling that represents mortality rates in terms of parameters of physiological changes and declining health status accompanying the process of aging in humans. […] **A traditional (demographic) description of changes in individual health/survival status is performed using a continuous-time random Markov process** with a finite number of states, and age-dependent transition intensity functions (transitions rates). Transitions to the absorbing state are associated with death, and the corresponding transition intensity is a mortality rate. Although such a description characterizes connections between health and mortality, it does not allow for studying factors and mechanisms involved in the aging-related health decline. Numerous epidemiological studies provide compelling evidence that health transition rates are influenced by a number of factors. Some of them are fixed at the time of birth […]. 
Others experience stochastic changes over the life course […] **The presence of** such **randomly changing influential factors violates the Markov assumption, and makes the description of aging-related changes in health status more complicated.** […] The age dynamics of influential factors (e.g., physiological variables) in connection with mortality risks has been described using a stochastic process model of human mortality and aging […]. Recent extensions of this model have been used in analyses of longitudinal data on aging, health, and longevity, collected in the Framingham Heart Study […] This model and its extensions are described in terms of **a Markov stochastic process satisfying a diffusion-type stochastic differential equation.** The stochastic process is stopped at random times associated with individuals’ deaths. […] When an individual’s health status is taken into account, the coefficients of the stochastic differential equations become dependent on values of the **jumping process.** This dependence violates the Markov assumption and renders the conditional Gaussian property invalid. So the description of this (continuously changing) component of aging-related changes in the body also becomes more complicated. Since studying age trajectories of physiological states in connection with changes in health status and mortality would provide more realistic scenarios for analyses of available longitudinal data, it would be a good idea to find an appropriate mathematical description of the joint evolution of these interdependent processes in aging organisms. For this purpose, **we propose a comprehensive model of human aging, health, and mortality in which the Markov assumption is fulfilled by a two-component stochastic process consisting of jumping and continuously changing processes. 
The jumping component is used to describe relatively fast changes in health status occurring at random times, and the continuous component describes relatively slow stochastic age-related changes of individual physiological states.** […] The use of stochastic differential equations for random continuously changing covariates has been studied intensively in the analysis of longitudinal data […] Such a description is convenient since it captures the feedback mechanism typical of biological systems reflecting regular aging-related changes and takes into account the presence of random noise affecting individual trajectories. It also captures the dynamic connections between aging-related changes in health and physiological states, which are important in many applications.”
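A toy simulation may help in picturing such a two-component process. The sketch below is my own construction, not the authors' model: a physiological index follows a mean-reverting diffusion (simulated with an Euler-Maruyama step), while a two-state health indicator jumps at random times with a hazard that depends on the index. All coefficients are made up.

```python
import random

def simulate(years=40, dt=0.1, seed=0):
    """Simulate (time, index, health_state) triples for one individual."""
    rng = random.Random(seed)
    x, state = 0.0, 0  # x: deviation of the index; state: 0 healthy, 1 ill
    path = [(0.0, x, state)]
    steps = int(round(years / dt))
    for i in range(1, steps + 1):
        # The drift target depends on the current value of the jumping component.
        target = 0.0 if state == 0 else 1.0
        # Euler-Maruyama step for dX = a*(target - X) dt + b dW.
        x += 0.5 * (target - x) * dt + 0.3 * rng.gauss(0.0, dt ** 0.5)
        # The jump hazard depends on the continuous component (feedback).
        if state == 0 and rng.random() < (0.01 + 0.05 * x * x) * dt:
            state = 1
        path.append((i * dt, x, state))
    return path

path = simulate()
```

The pair (index, health state) is Markov here even though neither component is Markov on its own, which is the point made in the quote.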

## Biodemography of aging (III)

Latent class representation of the Grade of Membership model.

Singular value decomposition.

Affine space.

Lebesgue measure.

General linear position.

The links above are links to topics I looked up while reading the second half of the book. The first link is quite relevant to the book’s coverage as a comprehensive longitudinal Grade of Membership (-GoM) model is covered in chapter 17. Relatedly, chapter 18 covers linear latent structure (-LLS) models, and as observed in the book LLS is a generalization of GoM. As should be obvious from the nature of the links some of the stuff included in the second half of the text is highly technical, and I’ll readily admit I was not fully able to understand all the details included in the coverage of chapters 17 and 18 in particular. On account of the technical nature of the coverage in Part 2 I’m not sure I’ll cover the second half of the book in much detail, though I probably shall devote at least one more post to some of those topics, as they were quite interesting even if some of the details were difficult to follow.

I have almost finished the book at this point, and I have already decided to both give the book five stars and include it on my list of favorite books on goodreads; it’s really well written, and it provides consistently highly detailed coverage of very high quality. As I also noted in the first post about the book the authors have given readability aspects some thought, and I am sure most readers would learn quite a bit from this text even if they were to skip some of the more technical chapters. The main body of Part 2 of the book, the subtitle of which is ‘Statistical Modeling of Aging, Health, and Longevity’, is however probably in general not worth the effort of reading unless you have a solid background in statistics.

This post includes some observations and quotes from the last chapters of the book’s Part 1.

…

“The proportion of older adults in the U.S. population is growing. This raises important questions about the increasing prevalence of aging-related diseases, multimorbidity issues, and disability among the elderly population. […] In 2009, 46.3 million people were covered by Medicare: 38.7 million of them were aged 65 years and older, and 7.6 million were disabled […]. By 2031, when the baby-boomer generation will be completely enrolled, Medicare is expected to reach 77 million individuals […]. Because the Medicare program covers 95 % of the nation’s aged population […], the prediction of future Medicare costs based on these data can be an important source of health care planning.”

“Three essential components (which could be also referred as sub-models) need to be developed to construct a modern model of forecasting of population health and associated medical costs: (i) a model of medical cost projections conditional on each health state in the model, (ii) health state projections, and (iii) a description of the distribution of initial health states of a cohort to be projected […] In making medical cost projections, two major effects should be taken into account: the dynamics of the medical costs during the time periods comprising the date of onset of chronic diseases and the increase of medical costs during the last years of life. In this chapter, we investigate and model the first of these two effects. […] the approach developed in this chapter generalizes the approach known as “life tables with covariates” […], resulting in a new family of forecasting models with covariates such as comorbidity indexes or medical costs. In sum, this chapter develops a model of the relationships between individual cost trajectories following the onset of aging-related chronic diseases. […] The underlying methodological idea is to aggregate the health state information into a single (or several) covariate(s) that can be determinative in predicting the risk of a health event (e.g., disease incidence) and whose dynamics could be represented by the model assumptions. An advantage of such an approach is its substantial reduction of the degrees of freedom compared with existing forecasting models (e.g., the FEM model, Goldman and RAND Corporation 2004). 
[…] We found that the time patterns of medical cost trajectories were similar for all diseases considered and can be described in terms of four components having the meanings of (i) the pre-diagnosis cost associated with initial comorbidity represented by medical expenditures, (ii) the cost peak associated with the onset of each disease, (iii) the decline/reduction in medical expenditures after the disease onset, and (iv) the difference between post- and pre-diagnosis cost levels associated with an acquired comorbidity. The description of the trajectories was formalized by a model which explicitly involves four parameters reflecting these four components.”
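To illustrate what a four-parameter description of such a cost trajectory could look like, here is a minimal sketch. The functional form (a constant pre-diagnosis level, a peak at onset, and an exponential decline towards a shifted post-diagnosis level) is my own assumption for illustration; the book's actual parameterization may well differ, and the numbers are hypothetical.

```python
import math

def monthly_cost(t, pre_cost, peak, decline_rate, shift):
    """Hypothetical cost at time t (months relative to disease onset).

    pre_cost:     (i) pre-diagnosis cost level
    peak:         (ii) cost at the time of onset
    decline_rate: (iii) speed of the post-onset decline
    shift:        (iv) post- minus pre-diagnosis cost level
    """
    if t < 0:
        return pre_cost
    post_level = pre_cost + shift
    return post_level + (peak - post_level) * math.exp(-decline_rate * t)

# Made-up numbers: costs at onset and two years after onset.
at_onset = monthly_cost(0, 400.0, 2500.0, 0.4, 150.0)
two_years_later = monthly_cost(24, 400.0, 2500.0, 0.4, 150.0)
```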

As I noted earlier in my coverage of the book, I don’t think the model above fully captures all relevant cost contributions of the diseases included, as the follow-up period was too short to capture all relevant costs to be included in the part iv model component. This is definitely a problem in the context of diabetes. But then again nothing in theory stops people from combining the model above with other models which are better at dealing with the excess costs associated with long-term complications of chronic diseases, and the model results were intriguing even if the model likely underperforms in a few specific disease contexts.

Moving on…

“Models of medical cost projections usually are based on regression models estimated with the majority of independent predictors describing demographic status of the individual, patient’s health state, and level of functional limitations, as well as their interactions […]. If the health states needs to be described by a number of simultaneously manifested diseases, then detailed stratification over the categorized variables or use of multivariate regression models allows for a better description of the health states. However, it can result in an abundance of model parameters to be estimated. One way to overcome these difficulties is to use an approach in which the model components are demographically-based aggregated characteristics that mimic the effects of specific states. The model developed in this chapter is an example of such an approach: the use of a comorbidity index rather than of a set of correlated categorical regressor variables to represent the health state allows for an essential reduction in the degrees of freedom of the problem.”

“Unlike mortality, the onset time of chronic disease is difficult to define with high precision due to the large variety of disease-specific criteria for onset/incident case identification […] there is always some arbitrariness in defining the date of chronic disease onset, and a unified definition of date of onset is necessary for population studies with a long-term follow-up.”

“Individual age trajectories of physiological indices are the product of a complicated interplay among genetic and non-genetic (environmental, behavioral, stochastic) factors that influence the human body during the course of aging. Accordingly, they may differ substantially among individuals in a cohort. Despite this fact, the average age trajectories for the same index follow remarkable regularities. […] some indices tend to change monotonically with age: the level of blood glucose (BG) increases almost monotonically; pulse pressure (PP) increases from age 40 until age 85, then levels off and shows a tendency to decline only at later ages. The age trajectories of other indices are non-monotonic: they tend to increase first and then decline. Body mass index (BMI) increases up to about age 70 and then declines, diastolic blood pressure (DBP) increases until age 55–60 and then declines, systolic blood pressure (SBP) increases until age 75 and then declines, serum cholesterol (SCH) increases until age 50 in males and age 70 in females and then declines, ventricular rate (VR) increases until age 55 in males and age 45 in females and then declines. With small variations, these general patterns are similar in males and females. The shapes of the age-trajectories of the physiological variables also appear to be similar for different genotypes. […] The effects of these physiological indices on mortality risk were studied in Yashin et al. (2006), who found that the effects are gender and age specific. They also found that the dynamic properties of the individual age trajectories of physiological indices may differ dramatically from one individual to the next.”

“An increase in the mortality rate with age is traditionally associated with the process of aging. This influence is mediated by aging-associated changes in thousands of biological and physiological variables, some of which have been measured in aging studies. The fact that the age trajectories of some of these variables differ among individuals with short and long life spans and healthy life spans indicates that dynamic properties of the indices affect life history traits. Our analyses of the FHS data clearly demonstrate that the values of physiological indices at age 40 are significant contributors both to life span and healthy life span […] suggesting that normalizing these variables around age 40 is important for preventing age-associated morbidity and mortality later in life. […] results [also] suggest that keeping physiological indices stable over the years of life could be as important as their normalizing around age 40.”

“The results […] indicate that, in the quest of identifying longevity genes, it may be important to look for candidate genes with pleiotropic effects on more than one dynamic characteristic of the age-trajectory of a physiological variable, such as genes that may influence both the initial value of a trait (intercept) and the rates of its changes over age (slopes). […] Our results indicate that the dynamic characteristics of age-related changes in physiological variables are important predictors of morbidity and mortality risks in aging individuals. […] We showed that the initial value (*intercept*), the rate of changes (*slope*), and the *variability* of a physiological index, in the age interval 40–60 years, significantly influenced both mortality risk and onset of unhealthy life at ages 60+ in our analyses of the Framingham Heart Study data. That is, these dynamic characteristics may serve as good predictors of late life morbidity and mortality risks. The results also suggest that physiological changes taking place in the organism in middle life may affect longevity through promoting or preventing diseases of old age. For non-monotonically changing indices, we found that having a later age at the peak value of the index […], a lower peak value […], a slower rate of decline in the index at older ages […], and less variability in the index over time, can be beneficial for longevity. Also, the dynamic characteristics of the physiological indices were, overall, associated with mortality risk more significantly than with onset of unhealthy life.”
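For readers wondering how the three dynamic characteristics might be extracted from repeated measurements, here is a minimal sketch using an ordinary least-squares line over the age interval 40–60. This is only one plausible way of doing it and I am not claiming it is the authors' procedure; the function name and the example measurements are hypothetical.

```python
def trajectory_summary(ages, values, ref_age=40.0):
    """Return (intercept, slope, variability) of an index over age.

    intercept:   fitted value at ref_age
    slope:       fitted change per year of age
    variability: residual standard deviation around the fitted line
    """
    n = len(ages)
    xs = [a - ref_age for a in ages]
    mean_x = sum(xs) / n
    mean_y = sum(values) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, values))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    residuals = [y - (intercept + slope * x) for x, y in zip(xs, values)]
    variability = (sum(r * r for r in residuals) / (n - 2)) ** 0.5
    return intercept, slope, variability

# Hypothetical systolic blood pressure measurements at five exams.
intercept, slope, variability = trajectory_summary(
    [40, 45, 50, 55, 60], [120.0, 122.5, 125.0, 127.5, 130.0])
```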

“Decades of studies of candidate genes show that they are not linked to aging-related traits in a straightforward manner […]. Recent genome-wide association studies (GWAS) have reached fundamentally the same conclusion by showing that the traits in late life likely are controlled by a relatively large number of common genetic variants […]. Further, GWAS often show that the detected associations are of tiny effect […] the weak effect of genes on traits in late life can be not only because they confer small risks having small penetrance but because they confer large risks but in a complex fashion […] In this chapter, we consider several examples of complex modes of gene actions, including genetic tradeoffs, antagonistic genetic effects on the same traits at different ages, and variable genetic effects on lifespan. The analyses focus on the *APOE* common polymorphism. […] The analyses reported in this chapter suggest that the e4 allele can be protective against cancer with a more pronounced role in men. This protective effect is more characteristic of cancers at older ages and it holds in both the parental and offspring generations of the FHS participants. Unlike cancer, the effect of the e4 allele on risks of CVD is more pronounced in women. […] [The] results […] explicitly show that the same allele can change its role on risks of CVD in an antagonistic fashion from detrimental in women with onsets at younger ages to protective in women with onsets at older ages. […] e4 allele carriers have worse survival compared to non-e4 carriers in each cohort. […] Sex stratification shows sexual dimorphism in the effect of the e4 allele on survival […] with the e4 female carriers, particularly, being more exposed to worse survival. […] The results of these analyses provide two important insights into the role of genes in lifespan. First, they provide evidence on the key role of aging-related processes in genetic susceptibility to lifespan. 
For example, taking into account the specifics of aging-related processes gains 18 % in estimates of the RRs and five orders of magnitude in significance in the same sample of women […] without additional investments in increasing sample sizes and new genotyping. The second is that a detailed study of the role of aging-related processes in estimates of the effects of genes on lifespan (and healthspan) helps in detecting more homogeneous [high risk] sub-samples”.

“The aging of populations in developed countries requires effective strategies to extend healthspan. A promising solution could be to yield insights into the genetic predispositions for endophenotypes, diseases, well-being, and survival. It was thought that genome-wide association studies (GWAS) would be a major breakthrough in this endeavor. Various genetic association studies including GWAS assume that there should be a deterministic (unconditional) genetic component in such complex phenotypes. However, the idea of unconditional contributions of genes to these phenotypes faces serious difficulties which stem from the lack of direct evolutionary selection against or in favor of such phenotypes. In fact, evolutionary constraints imply that genes should be linked to age-related phenotypes in a complex manner through different mechanisms specific for given periods of life. Accordingly, the linkage between genes and these traits should be strongly modulated by age-related processes in a changing environment, i.e., by the individuals’ life course. The inherent sensitivity of genetic mechanisms of complex health traits to the life course will be a key concern as long as genetic discoveries continue to be aimed at improving human health.”

“Despite the common understanding that age is a risk factor of not just one but a large portion of human diseases in late life, each specific disease is typically considered as a stand-alone trait. Independence of diseases was a plausible hypothesis in the era of infectious diseases caused by different strains of microbes. Unlike those diseases, the exact etiology and precursors of diseases in late life are still elusive. It is clear, however, that the origin of these diseases differs from that of infectious diseases and that age-related diseases reflect a complicated interplay among ontogenetic changes, senescence processes, and damages from exposures to environmental hazards. Studies of the determinants of diseases in late life provide insights into a number of risk factors, apart from age, that are common for the development of many health pathologies. The presence of such common risk factors makes chronic diseases and hence risks of their occurrence interdependent. This means that the results of many calculations using the assumption of disease independence should be used with care. Chapter 4 argued that disregarding potential dependence among diseases may seriously bias estimates of potential gains in life expectancy attributable to the control or elimination of a specific disease and that the results of the process of coping with a specific disease will depend on the disease elimination strategy, which may affect mortality risks from other diseases.”

## Diabetes and the brain (IV)

Here’s one of my previous posts in the series about the book. In this post I’ll cover material dealing with two acute hyperglycemia-related diabetic complications (DKA and HHS – see below…) as well as multiple topics related to diabetes and stroke. I’ll start out with a few quotes from the book about DKA and HHS:

…

“DKA [diabetic ketoacidosis] is defined by a triad of hyperglycemia, ketosis, and acidemia and occurs in the absolute or near-absolute absence of insulin. […] DKA accounts for the bulk of morbidity and mortality in children with T1DM. National population-based studies estimate DKA mortality at 0.15% in the United States (4), 0.18–0.25% in Canada (4, 5), and 0.31% in the United Kingdom (6). […] Rates reach 25–67% in those who are newly diagnosed (4, 8, 9). The rates are higher in younger children […] The risk of DKA among patients with pre-existing diabetes is 1–10% annual per person […] DKA can present with mild-to-severe symptoms. […] polyuria and polydipsia […] patients may present with signs of dehydration, such as tachycardia and dry mucus membranes. […] Vomiting, abdominal pain, malaise, and weight loss are common presenting symptoms […] Signs related to the ketoacidotic state include hyperventilation with deep breathing (Kussmaul’s respiration) which is a compensatory respiratory response to an underlying metabolic acidosis. Acetonemia may cause a fruity odor to the breath. […] Elevated glucose levels are almost always present; however, euglycemic DKA has been described (19). Anion-gap metabolic acidosis is the hallmark of this condition and is caused by elevated ketone bodies.”

“Clinically significant cerebral edema occurs in approximately 1% of patients with diabetic ketoacidosis […] DKA-related cerebral edema may represent a continuum. Mild forms resulting in subtle edema may result in modest mental status abnormalities whereas the most severe manifestations result in overt cerebral injury. […] Cerebral edema typically presents 4–12 h after the treatment for DKA is started (28, 29), but can occur at any time. […] Increased intracranial pressure with cerebral edema has been recognized as the leading cause of morbidity and mortality in pediatric patients with DKA (59). Mortality from DKA-related cerebral edema in children is high, up to 90% […] and accounts for 60–90% of the mortality seen in DKA […] many patients are left with major neurological deficits (28, 31, 35).”

“The hyperosmolar hyperglycemic state (HHS) is also an acute complication that may occur in patients with diabetes mellitus. It is seen primarily in patients with T2DM and has previously been referred to as “hyperglycemic hyperosmolar non-ketotic coma” or “hyperglycemic hyperosmolar non-ketotic state” (13). HHS is marked by profound dehydration and hyperglycemia and often by some degree of neurological impairment. The term hyperglycemic hyperosmolar state is used because (1) ketosis may be present and (2) there may be varying degrees of altered sensorium besides coma (13). Like DKA, the basic underlying disorder is inadequate circulating insulin, but there is often enough insulin to inhibit free fatty acid mobilization and ketoacidosis. […] Up to 20% of patients diagnosed with HHS do not have a previous history of diabetes mellitus (14). […] Kitabchi et al. estimated the rate of hospital admissions due to HHS to be lower than DKA, accounting for less than 1% of all primary diabetic admissions (13). […] Glucose levels rise in the setting of relative insulin deficiency. The low levels of circulating insulin prevent lipolysis, ketogenesis, and ketoacidosis (62) but are unable to suppress hyperglycemia, glucosuria, and water losses. […] HHS typically presents with one or more precipitating factors, similar to DKA. […] Acute infections […] account for approximately 32–50% of precipitating causes (13). […] The mortality rates for HHS vary between 10 and 20% (14, 93).”

It should perhaps be noted explicitly that the mortality rates for these complications are particularly high in very young individuals (DKA) and in elderly individuals (HHS) who might have multiple comorbidities. Relatedly, HHS often develops acutely specifically in settings where the precipitating factor is something really unpleasant like pneumonia or a cardiovascular event, so a high-ish mortality rate is perhaps not that surprising. Nor is it surprising that very young brains are particularly vulnerable in the context of DKA (I already discussed some of the research on these matters in some detail in an earlier post about this book).

This post to some extent covered the topic of ‘stroke in general’; however, I also wanted to include here some more data specifically on the diabetes-related aspects of this topic. Here’s a quote to start off with:

“DM [Diabetes Mellitus] has been consistently shown to represent a strong independent risk factor of ischemic stroke. […] The contribution of hyperglycemia to increased stroke risk is not proven. […] the relationship between hyperglycemia and stroke remains subject of debate. In this respect, the association between hyperglycemia and cerebrovascular disease is established less strongly than the association between hyperglycemia and coronary heart disease. […] The course of stroke in patients with DM is characterized by higher mortality, more severe disability, and higher recurrence rate […] It is now well accepted that the risk of stroke in individuals with DM is equal to that of individuals with a history of myocardial infarction or stroke, but no DM (24–26). This was confirmed in a recently published large retrospective study which enrolled all inhabitants of Denmark (more than 3 million people out of whom 71,802 patients with DM) and were followed-up for 5 years. In men without DM the incidence of stroke was 2.5 in those without and 7.8% in those with prior myocardial infarction, whereas in patients with DM it was 9.6 in those without and 27.4% in those with history of myocardial infarction. In women the numbers were 2.5, 9.0, 10.0, and 14.2%, respectively (22).”
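The 'risk equivalence' claim can be checked directly against the 5-year incidence figures quoted above. The little calculation below only uses the numbers from the quote; the variable names are mine.

```python
# 5-year stroke incidence (%) from the Danish study quoted above.
men = {("no DM", "no MI"): 2.5, ("no DM", "MI"): 7.8,
       ("DM", "no MI"): 9.6, ("DM", "MI"): 27.4}
women = {("no DM", "no MI"): 2.5, ("no DM", "MI"): 9.0,
         ("DM", "no MI"): 10.0, ("DM", "MI"): 14.2}

def ratios(incidence):
    """Compare DM without prior MI against the two non-diabetic groups."""
    dm_only = incidence[("DM", "no MI")]
    return (dm_only / incidence[("no DM", "no MI")],  # vs. neither condition
            dm_only / incidence[("no DM", "MI")])     # vs. prior MI, no DM

men_vs_none, men_vs_mi = ratios(men)
women_vs_none, women_vs_mi = ratios(women)
```

In both sexes the incidence in diabetics without prior myocardial infarction is roughly four times that of people with neither condition, and slightly higher than that of non-diabetics with a prior infarction. These are crude ratios of the quoted incidences, not adjusted estimates.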

That study incidentally is very nice for me in particular to know about, given that I am a Danish diabetic. I do not here face any of the usual tiresome questions about ‘external validity’ and issues pertaining to ‘extrapolating out of sample’ – not only is it quite likely I’ve actually looked at some of the data used in that analysis myself, I also *know* that I am almost certainly one of the people included in the analysis. Of course you need other data as well to assess risk (e.g. age, see the previously linked post), but this is pretty clean as far as it goes. Moving on…

“The number of deaths from stroke attributable to DM is highest in low-and-middle-income countries […] the relative risk conveyed by DM is greater in younger subjects […] It is not well known whether type 1 or type 2 DM affects stroke risk differently. […] In the large cohort of women enrolled in the Nurses’ Health Study (116,316 women followed for up to 26 years) it was shown that the incidence of total stroke was fourfold higher in women with type 1 DM and twofold higher among women with type 2 DM than for non-diabetic women (33). […] The impact of DM duration as a stroke risk factor has not been clearly defined. […] In this context it is important to note that the actual duration of type 2 DM is difficult to determine precisely […*and more generally: “the date of onset of a certain chronic disease is a quantity which is not defined as precisely as mortality“, as Yashin et al. put it – I also talked about this topic in my previous post, but it’s important when you’re looking at these sorts of things and is worth reiterating – US*]. […] Traditional risk factors for stroke such as arterial hypertension, dyslipidemia, atrial fibrillation, heart failure, and previous myocardial infarction are more common in people with DM […]. However, the impact of DM on stroke is not just due to the higher prevalence of these risk factors, as the risk of mortality and morbidity remains over twofold increased after correcting for these factors (4, 37). […] It is informative to distinguish between factors that are non-specific and specific to DM. DM-specific factors, including chronic hyperglycemia, DM duration, DM type and complications, and insulin resistance, may contribute to an elevated stroke risk either by amplification of the harmful effect of other “classical” non-specific risk factors, such as hypertension, or by acting independently.”

More than a few variables are known to impact stroke risk, but the fact that many of the risk factors are related to each other (‘fat people often also have high blood pressure’) makes it hard to figure out which variables are most important, how they interact with each other, etc. One might in that context perhaps conceptualize the metabolic syndrome (MS) as a sort of indicator variable indicating whether a relatively common *set* of such related potential risk factors of interest is present or not – it is worth noting in that context that the authors include in the text the observation that: “it is yet uncertain if the whole concept of the MS entails more than its individual components. The clustering of risk factors complicates the assessment of the contribution of individual components to the risk of vascular events, as well as assessment of synergistic or interacting effects.” MS confers a two- to threefold increased stroke risk, depending on the definition and the population analyzed, so there’s definitely some relevant stuff included in that box, but in the context of developing new treatment options and better assessing risk it might be helpful to – to put it simplistically – know if variable X is significantly more important than variable Y (and how the variables interact, etc.). But this sort of information is hard to get.

There’s more than one type of stroke, and the way diabetes modifies the risk of various stroke types is not completely clear:

“Most studies have consistently shown that DM is an important risk factor for ischemic stroke, while the incidence of hemorrhagic stroke in subjects with DM does not seem to be increased. Consequently, the ratio of ischemic to hemorrhagic stroke is higher in patients with DM than in those stroke patients without DM [*recall the base rates I’ve mentioned before in the coverage of this book: 80% of strokes are ischemic strokes in Western countries, and 15 % hemorrhagic*] […] The data regarding an association between DM and the risk of hemorrhagic stroke are quite conflicting. In the most series no increased risk of cerebral hemorrhage was found (10, 101), and in the Copenhagen Stroke Registry, hemorrhagic stroke was even six times less frequent in diabetic patients than in non-diabetic subjects (102). […] However, in another prospective population-based study DM was associated with an increased risk of primary intracerebral hemorrhage (103). […] The significance of DM as a risk factor of hemorrhagic stroke could differ depending on ethnicity of subjects or type of DM. In the large Nurses’ Health Study type 1 DM increased the risk of hemorrhagic stroke by 3.8 times while type 2 DM did not increase such a risk (96). […] It is yet unclear if DM predominantly predisposes to either large or small vessel ischemic stroke. Nevertheless, lacunar stroke (small, less than 15mm in diameter infarction, cyst-like, frequently multiple) is considered to be the typical type of stroke in diabetic subjects (105–107), and DM may be present in up to 28–43% of patients with cerebral lacunar infarction (108–110).”

The Danish results mentioned above might not be as useful to me as they first seemed if the type is important, because the majority of the diabetics included were type 2 diabetics. I know from personal experience that it is difficult to type-identify diabetics using the Danish registry data available if you want to work with population-level data, and any scheme attempting this will be subject to potentially large misidentification problems. *Some* subgroups can presumably be correctly identified using diagnostic codes, but a very large number of individuals will be left out of the analyses if you only rely on identification strategies where you’re (at least reasonably?) certain about the type. I’ve worked on these identification problems during my graduate work so perhaps a few more things are worth mentioning here. In the context of diabetic subgroup analyses, misidentification is in general a much larger problem for type 1 results than for type 2 results; unless the study design takes the large prevalence difference of the two conditions into account, the type 1 sample will be much smaller than the type 2 sample in pretty much all analytical contexts, so a small number of misidentified type 2 individuals can have large impacts on the results of the type 1 sample. Type 1s misidentified as type 2s can in general be expected to be a much smaller problem in terms of the validity of the type 2 analysis; misidentification of that type will cause a loss of power in the type 1 subgroup analysis, where power is already low to start with (and it’ll also make the type 1 subgroup analysis even more vulnerable to misidentified type 2s), but it won’t change the results of the type 2 subgroup analysis in any significant way.
Relatedly, even if enough type 2 patients are misidentified to cause problems with the interpretation of the type 1 subgroup analysis, this would not on its own be a good reason to doubt the results of the type 2 subgroup analysis. Another thing to note is that misidentification will tend to lead to ‘mixing’, i.e. it’ll make the subgroup results look similar; so when outcomes are *not* similar in the type 1 and the type 2 individuals, this might be taken as an indicator that something potentially interesting is going on, because most analyses will struggle with some level of misidentification, which will tend to reduce the power of tests of group differences.
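
The asymmetry described above can be made concrete with a small back-of-the-envelope calculation. This is only a sketch under invented assumptions: the group sizes (500 type 1s, 9,500 type 2s), the outcome risks (20% and 10%) and the 5% misidentification rates are hypothetical numbers chosen for illustration, not figures from the registry data or the book.

```python
def observed_risks(n_t1, n_t2, risk_t1, risk_t2, t1_as_t2, t2_as_t1):
    """Expected outcome risk in each *recorded* group under misidentification.

    t1_as_t2: share of true type 1s recorded as type 2 (hypothetical).
    t2_as_t1: share of true type 2s recorded as type 1 (hypothetical).
    """
    # Recorded as type 1: correctly typed type 1s plus mislabelled type 2s.
    rec1_true1 = n_t1 * (1 - t1_as_t2)
    rec1_true2 = n_t2 * t2_as_t1
    # Recorded as type 2: correctly typed type 2s plus mislabelled type 1s.
    rec2_true2 = n_t2 * (1 - t2_as_t1)
    rec2_true1 = n_t1 * t1_as_t2
    risk_rec1 = (rec1_true1 * risk_t1 + rec1_true2 * risk_t2) / (rec1_true1 + rec1_true2)
    risk_rec2 = (rec2_true2 * risk_t2 + rec2_true1 * risk_t1) / (rec2_true2 + rec2_true1)
    return risk_rec1, risk_rec2


# Hypothetical inputs: same 5% error rate in both directions.
r1, r2 = observed_risks(500, 9500, 0.20, 0.10, 0.05, 0.05)
print(f"recorded type 1 risk: {r1:.3f} (true value 0.200)")
print(f"recorded type 2 risk: {r2:.3f} (true value 0.100)")
```

With these made-up numbers, half of the recorded type 1 group actually consists of type 2s, so the apparent type 1 risk is pulled halfway towards the type 2 risk, while the recorded type 2 risk hardly moves. This is the ‘mixing’ effect in miniature.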

What about stroke outcomes? A few observations were included on that topic above, but the book has a lot more stuff on that – some observations on this topic:

“DM is an independent risk factor of death from stroke […]. Tuomilehto et al. (35) calculated that 16% of all stroke mortality in men and 33% in women could be directly attributed to DM. Patients with DM have higher hospital and long-term stroke mortality, more pronounced residual neurological deficits, and more severe disability after acute cerebrovascular accidents […]. The 1-year mortality rate, for example, was twofold higher in diabetic patients compared to non-diabetic subjects (50% vs. 25%) […]. Only 20% of people with DM survive over 5 years after the first stroke and half of these patients die within the first year (36, 128). […] The mechanisms underlying the worse outcome of stroke in diabetic subjects are not fully understood. […] Regarding prevention of stroke in patients with DM, it may be less relevant than in non-DM subjects to distinguish between primary and secondary prevention as all patients with DM are considered to be high-risk subjects regardless of the history of cerebrovascular accidents or the presence of clinical and subclinical vascular lesions. […] The influence of the mode of antihyperglycemic treatment on the risk of stroke is uncertain.”

Control of blood pressure is very important in the diabetic setting:

“There are no doubts that there is a linear relation between elevated systolic blood pressure and the risk of stroke, both in people with or without DM. […] Although DM and arterial hypertension represent significant independent risk factors for stroke if they co-occur in the same patient the risk increases dramatically. A prospective study of almost 50 thousand subjects in Finland followed up for 19 years revealed that the hazard ratio for stroke incidence was 1.4, 2.0, 2.5, 3.5, and 4.5 and for stroke mortality was 1.5, 2.6, 3.1, 5.6, and 9.3, respectively, in subjects with an isolated modestly elevated blood pressure (systolic 140–159/diastolic 90–94 mmHg), isolated more severe hypertension (systolic >159 mmHg, diastolic >94 mmHg, or use of antihypertensive drugs), with isolated DM only, with both DM and modestly elevated blood pressure, and with both DM and more severe hypertension, relative to subjects without either of the risk factors (168). […] it remains unclear whether some classes of antihypertensive agents provide a stronger protection against stroke in diabetic patients than others. […] effective antihypertensive treatment is highly beneficial for reduction of stroke risk in diabetic patients, but the advantages of any particular class of antihypertensive medications are not substantially proven.”
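
One way to read the Finnish hazard ratios quoted above is to compare the observed joint effect of DM and hypertension with what one would expect if the two effects simply added or multiplied. The sketch below does that arithmetic, using only the numbers from the quote. The relative excess risk due to interaction (RERI) is a standard epidemiological summary, but applying it to hazard ratios is an approximation, and without confidence intervals nothing here says anything about statistical significance.

```python
# Hazard ratios quoted above, relative to subjects with neither risk factor.
hr = {
    "incidence": {"bp_modest": 1.4, "bp_severe": 2.0, "dm": 2.5,
                  "dm_bp_modest": 3.5, "dm_bp_severe": 4.5},
    "mortality": {"bp_modest": 1.5, "bp_severe": 2.6, "dm": 3.1,
                  "dm_bp_modest": 5.6, "dm_bp_severe": 9.3},
}


def joint_effect_summary(hr_bp, hr_dm, hr_both):
    """Compare an observed joint hazard ratio with two simple benchmarks."""
    multiplicative = hr_bp * hr_dm   # expected joint HR if the effects multiply
    additive = hr_bp + hr_dm - 1     # expected joint HR if the excess risks add
    reri = hr_both - additive        # relative excess risk due to interaction
    return multiplicative, additive, reri


for outcome, h in hr.items():
    for level in ("modest", "severe"):
        observed = h[f"dm_bp_{level}"]
        mult, add, reri = joint_effect_summary(h[f"bp_{level}"], h["dm"], observed)
        print(f"{outcome:9s} DM + {level} BP: observed {observed:.1f}, "
              f"multiplicative {mult:.2f}, additive {add:.2f}, RERI {reri:+.2f}")
```

In all four combinations the observed joint hazard ratio exceeds the additive benchmark, which is consistent with the quote's remark that the risk increases dramatically when both factors co-occur.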

Treatment of dyslipidemia is also very important, but here it does seem to matter how you treat it:

“It seems that the beneficial effect of statins is dose-dependent. The lower the LDL level that is achieved the stronger the cardiovascular protection. […] Recently, the results of the meta-analysis of 14 randomized trials of statins in 18,686 patients with DM had been published. It was calculated that statins use in diabetic patients can result in a 21% reduction of the risk of any stroke per 1 mmol/l reduction of LDL achieved […] There is no evidence from trials that supports efficacy of fibrates for stroke prevention in diabetic patients. […] No reduction of stroke risk by fibrates was shown also in a meta-analysis of eight trials enrolled 12,249 patients with type 2 DM (204).”
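
To get a feel for what ‘21% per 1 mmol/l’ might mean for other LDL reductions, here is a minimal sketch. It assumes that the proportional reduction compounds per mmol/l (a log-linear dose-response); that assumption is made here for illustration and is not stated in the quote, and the function name is hypothetical.

```python
def stroke_risk_ratio(ldl_reduction_mmol, rr_per_mmol=0.79):
    """Risk ratio for a given LDL reduction, ASSUMING the 21% reduction
    per 1 mmol/l compounds multiplicatively (log-linear dose-response)."""
    return rr_per_mmol ** ldl_reduction_mmol


for delta in (0.5, 1.0, 2.0):
    rr = stroke_risk_ratio(delta)
    print(f"LDL lowered by {delta:.1f} mmol/l: risk ratio {rr:.2f} "
          f"({(1 - rr) * 100:.0f}% lower stroke risk)")
```

Under this assumption a 2 mmol/l reduction corresponds to somewhat less than twice the benefit of a 1 mmol/l reduction, because each further reduction applies to an already reduced risk.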

Antiplatelets?

“Significant reductions in stroke risk in diabetic patients receiving antiplatelet therapy were found in large-scale controlled trials (205). It appears that based on the high incidence of stroke and prevalence of stroke risk factors in the diabetic population the benefits of routine aspirin use for primary and secondary stroke prevention outweigh its potential risk of hemorrhagic stroke especially in patients older than 30 years having at least one additional risk factor (206). […] both guidelines issued by the AHA/ADA or the ESC/EASD on the prevention of cardiovascular disease in patients with DM support the use of aspirin in a dose of 50–325 mg daily for the primary prevention of stroke in subjects older than 40 years of age and additional risk factors, such as DM […] The newer antiplatelet agent, clopidogrel, was more efficacious in prevention of ischemic stroke than aspirin with greater risk reduction in the diabetic cohort especially in those treated with insulin compared to non-diabetics in CAPRIE trial (209). However, the combination of aspirin and clopidogrel does not appear to be more efficacious and safe compared to clopidogrel or aspirin alone”.

When you treat all risk factors aggressively, it turns out that the elevated stroke risk can be substantially reduced. Again the data on this stuff is from Denmark:

“Gaede et al. (216) have shown in the Steno 2 study that intensive multifactorial intervention aimed at correction of hyperglycemia, hypertension, dyslipidemia, and microalbuminuria along with aspirin use resulted in a reduction of cardiovascular morbidity including non-fatal stroke […] recently the results of the extended 13.3 years follow-up of this study were presented and the reduction of cardiovascular mortality by 57% and morbidity by 59% along with the reduction of the number of non-fatal stroke (6 vs. 30 events) in intensively treated group was convincingly demonstrated (217). Antihypertensive, hypolipidemic treatment, use of aspirin should thus be recommended as either primary or secondary prevention of stroke for patients with DM.”

## Quotes

i. “The combination of some data and an aching desire for an answer does not ensure that a reasonable answer can be extracted from a given body of data.” (John Tukey)

ii. “Far better an approximate answer to the *right* question, which is often vague, than an *exact* answer to the wrong question, which can always be made precise.” (-ll-)

iii. “They who can no longer unlearn have lost the power to learn.” (John Lancaster Spalding)

iv. “If there are but few who interest thee, why shouldst thou be disappointed if but few find thee interesting?” (-ll-)

v. “Since the mass of mankind are too ignorant or too indolent to think seriously, if majorities are right it is by accident.” (-ll-)

vi. “As they are the bravest who require no witnesses to their deeds of daring, so they are the best who do right without thinking whether or not it shall be known.” (-ll-)

vii. “Perfection is beyond our reach, but they who earnestly strive to become perfect, acquire excellences and virtues of which the multitude have no conception.” (-ll-)

viii. “We are made ridiculous less by our defects than by the affectation of qualities which are not ours.” (-ll-)

ix. “If thy words are wise, they will not seem so to the foolish: if they are deep the shallow will not appreciate them. Think not highly of thyself, then, when thou art praised by many.” (-ll-)

x. “Since all models are wrong the scientist cannot obtain a “correct” one by excessive elaboration. On the contrary following William of Occam he should seek an economical description of natural phenomena. Just as the ability to devise simple but evocative models is the signature of the great scientist so overelaboration and overparameterization is often the mark of mediocrity.” (George E. P. Box)

xi. “Intense ultraviolet (UV) radiation from the young Sun acted on the atmosphere to form small amounts of very many gases. Most of these dissolved easily in water, and fell out in rain, making Earth’s surface water rich in carbon compounds. […] the most important chemical of all may have been cyanide (HCN). It would have formed easily in the upper atmosphere from solar radiation and meteorite impact, then dissolved in raindrops. Today it is broken down almost at once by oxygen, but early in Earth’s history it built up at low concentrations in lakes and oceans. Cyanide is a basic building block for more complex organic molecules such as amino acids and nucleic acid bases. Life probably evolved in chemical conditions that would kill us instantly!” (Richard Cowen, History of Life, p.8)

xii. “Dinosaurs dominated land communities for 100 million years, and it was only after dinosaurs disappeared that mammals became dominant. It’s difficult to avoid the suspicion that dinosaurs were in some way competitively superior to mammals and confined them to small body size and ecological insignificance. […] Dinosaurs dominated many guilds in the Cretaceous, including that of large browsers. […] in terms of their reconstructed behavior […] dinosaurs should be compared not with living reptiles, but with living mammals and birds. […] By the end of the Cretaceous there were mammals with varied sets of genes but muted variation in morphology. […] All Mesozoic mammals were small. Mammals with small bodies can play only a limited number of ecological roles, mainly insectivores and omnivores. But when dinosaurs disappeared at the end of the Cretaceous, some of the Paleocene mammals quickly evolved to take over many of their ecological roles” (ibid., pp. 145, 154, 222, 227-228)

xiii. “To consult the statistician after an experiment is finished is often merely to ask him to conduct a post mortem examination. He can perhaps say what the experiment died of.” (Ronald Fisher)

xiv. “Ideas are incestuous.” (Howard Raiffa)

xv. “Game theory […] deals only with the way in which ultrasmart, all knowing people should behave in competitive situations, and has little to say to Mr. X as he confronts the morass of his problem.” (-ll-)

xvi. “One of the principal objects of theoretical research is to find the point of view from which the subject appears in the greatest simplicity.” (Josiah Willard Gibbs)

xvii. “Nothing is as dangerous as an ignorant friend; a wise enemy is to be preferred.” (Jean de La Fontaine)

xviii. “Humility is a virtue all preach, none practice; and yet everybody is content to hear.” (John Selden)

xix. “Few men make themselves masters of the things they write or speak.” (-ll-)

xx. “Wise men say nothing in dangerous times.” (-ll-)

## Principles of Applied Statistics

“Statistical considerations arise in virtually all areas of science and technology and, beyond these, in issues of public and private policy and in everyday life. While the detailed methods used vary greatly in the level of elaboration involved and often in the way they are described, there is a unity of ideas which gives statistics as a subject both its intellectual challenge and its importance […] In this book we have aimed to discuss the ideas involved in applying statistical methods to advance knowledge and understanding. It is a book not on statistical methods as such but, rather, on how these methods are to be deployed […] We are writing partly for those working as applied statisticians, partly for subject-matter specialists using statistical ideas extensively in their work and partly for masters and doctoral students of statistics concerned with the relationship between the detailed methods and theory they are studying and the effective application of these ideas. Our aim is to emphasize how statistical ideas may be deployed fruitfully rather than to describe the details of statistical techniques.”

…

I gave the book five stars, but as noted in my review on Goodreads I’m not sure the word ‘amazing’ is really fitting – however the book had a lot of good stuff and it had very little stuff for me to quibble about, so I figured it deserved a high rating. The book deals to a very large extent with topics which are in some sense common to pretty much all statistical analyses, regardless of the research context: formulation of research questions/hypotheses, data search, study designs, data analysis, and interpretation. The authors spend quite a few pages talking about hypothesis testing but on the other hand no pages talking about statistical information criteria, a topic with which I’m at this point at least reasonably familiar, and I figure if I had been slightly more critical I’d have subtracted a star for this omission – however I have the impression that I’m at times perhaps too hard on non-fiction books on Goodreads so I decided not to punish the book for this omission. Part of the reason why I gave the book five stars is also that I’ve sort of wanted to read a book like this one for a while; I think in some sense it’s the first one of its kind I’ve read. I liked the way the book was structured.

Below I have added some observations from the book, as well as a few comments (I should note that I have had to leave out a lot of good stuff).

…

“When the data are very extensive, precision estimates calculated from simple standard statistical methods are likely to underestimate error substantially owing to the neglect of hidden correlations. A large amount of data is in no way synonymous with a large amount of information. In some settings at least, if a modest amount of poor quality data is likely to be modestly misleading, an extremely large amount of poor quality data may be extremely misleading.”
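
The point about hidden correlations can be illustrated with a small simulation. This is a sketch under assumptions that are not from the book: observations come in equal-sized clusters that share a common random effect, and the cluster count, cluster size and intraclass correlation are invented. For this setup the variance of the overall mean is inflated by the standard design effect 1 + (m − 1)ρ, where m is the cluster size and ρ the intraclass correlation.

```python
import numpy as np


def naive_vs_actual_se(n_clusters=200, cluster_size=50, icc=0.1, n_reps=500, seed=1):
    """Return (average naive SE of the mean, actual SD of the mean across replications)."""
    rng = np.random.default_rng(seed)
    means, naive_ses = [], []
    for _ in range(n_reps):
        # Shared cluster effect plus individual noise; total variance is 1.
        cluster_effect = rng.normal(0.0, np.sqrt(icc), size=(n_clusters, 1))
        noise = rng.normal(0.0, np.sqrt(1 - icc), size=(n_clusters, cluster_size))
        y = (cluster_effect + noise).ravel()
        means.append(y.mean())
        # Naive standard error: treats all observations as independent.
        naive_ses.append(y.std(ddof=1) / np.sqrt(y.size))
    return float(np.mean(naive_ses)), float(np.std(means, ddof=1))


naive, actual = naive_vs_actual_se()
design_effect = 1 + (50 - 1) * 0.1
print(f"naive SE:  {naive:.4f}")
print(f"actual SE: {actual:.4f}")
print(f"ratio: {actual / naive:.2f}, sqrt(design effect): {np.sqrt(design_effect):.2f}")
```

Even a modest within-cluster correlation makes the naive standard error much too small here, and collecting more observations per cluster does little to fix it, which is one way of seeing that a large amount of data is not the same as a large amount of information.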

“For studies of a new phenomenon it will usually be best to examine situations in which the phenomenon is likely to appear in the most striking form, even if this is in some sense artificial or not representative. This is in line with the well-known precept in mathematical research: study the issue in the simplest possible context that is not entirely trivial, and later generalize.”

“It often […] aids the interpretation of an observational study to consider the question: what would have been done in a comparable experiment?”

“An important and perhaps sometimes underemphasized issue in empirical prediction is that of stability. Especially when repeated application of the same method is envisaged, it is unlikely that the situations to be encountered will exactly mirror those involved in setting up the method. It may well be wise to use a procedure that works well over a range of conditions even if it is sub-optimal in the data used to set up the method.”

“Many investigations have the broad form of collecting similar data repeatedly, for example on different individuals. In this connection the notion of a *unit of analysis* is often helpful in clarifying an approach to the detailed analysis. Although this notion is more generally applicable, it is clearest in the context of randomized experiments. Here the unit of analysis is that smallest subdivision of the experimental material such that two distinct units *might* be randomized (randomly allocated) to different treatments. […] In general the unit of analysis may not be the same as the unit of interpretation, that is to say, the unit about which conclusions are to be drawn. The most difficult situation is when the unit of analysis is an aggregate of several units of interpretation, leading to the possibility of *ecological bias*, that is, a systematic difference between, say, the impact of explanatory variables at different levels of aggregation. […] it is important to identify the unit of analysis, which may be different in different parts of the analysis […] on the whole, limited detail is needed in examining the variation within the unit of analysis in question.”

The book briefly discusses issues pertaining to the scale of effort involved when thinking about appropriate study designs and how much/which data to gather for analysis, and notes that often associated costs are not quantified – rather a judgment call is made. An important related point is that e.g. in survey contexts response patterns will tend to depend upon the quantity of information requested; if you ask for too much, few people might reply (…and perhaps it’s also the case that it’s ‘the wrong people’ that reply? The authors don’t touch upon the potential selection bias issue, but it seems relevant). A few key observations from the book on this topic:

“the intrinsic quality of data, for example the response rates of surveys, may be degraded if too much is collected. […] sampling may give higher [data] quality than the study of a complete population of individuals. […] When researchers studied the effect of the expected length (10, 20 or 30 minutes) of a web-based questionnaire, they found that fewer potential respondents started and completed questionnaires expected to take longer (Galesic and Bosnjak, 2009). Furthermore, questions that appeared later in the questionnaire were given shorter and more uniform answers than questions that appeared near the start of the questionnaire.”

Not surprising, but certainly worth keeping in mind. Moving on…

“In general, while principal component analysis may be helpful in suggesting a base for interpretation and the formation of derived variables there is usually considerable arbitrariness involved in its use. This stems from the need to standardize the variables to comparable scales, typically by the use of correlation coefficients. This means that a variable that happens to have atypically small variability in the data will have a misleadingly depressed weight in the principal components.”
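
The scale dependence that creates the need to standardize in the first place is easy to demonstrate. The sketch below uses simulated data (three variables driven by one hypothetical common factor, one of them recorded on a much larger scale) and compares the first principal component of the covariance matrix with that of the correlation matrix. It illustrates the arbitrariness introduced by the choice of scale; it does not reproduce the specific ‘depressed weight’ effect mentioned in the quote.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 2000
common = rng.normal(size=n)                   # shared latent factor (hypothetical)
x = np.column_stack([
    100.0 * (common + rng.normal(size=n)),    # same signal, recorded on a large scale
    common + rng.normal(size=n),
    common + rng.normal(size=n),
])


def first_component(matrix):
    """Loadings of the leading eigenvector of a symmetric matrix."""
    eigvals, eigvecs = np.linalg.eigh(matrix)      # eigenvalues in ascending order
    v = eigvecs[:, -1]
    return v * np.sign(v[np.argmax(np.abs(v))])    # fix the arbitrary sign


pc1_cov = first_component(np.cov(x, rowvar=False))        # unstandardized variables
pc1_corr = first_component(np.corrcoef(x, rowvar=False))  # standardized variables

print("PC1 loadings, covariance matrix: ", np.round(pc1_cov, 3))
print("PC1 loadings, correlation matrix:", np.round(pc1_corr, 3))
```

On the covariance matrix the first component is essentially just the large-scale variable, whereas on the correlation matrix the three variables get roughly equal weight, even though all three carry the same amount of information about the common factor.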

The book includes a few pages about the Berkson error model, which I’d never heard of. Wikipedia doesn’t have much about it and I was debating how much to include about this one here – I probably wouldn’t have done more than including the link here if the Wikipedia article actually covered this topic in any detail, but it doesn’t. However it seemed important enough to write a few words about it. The basic difference between the ‘classical error model’, i.e. the one everybody knows about, and the Berkson error model is that in the former case the measurement error is statistically independent of the *true value* of X, whereas in the latter case the measurement error is independent of the *measured value*; the authors note that this implies that the true values are more variable than the measured values in a Berkson error context. Berkson errors can e.g. happen in experimental contexts where levels of a variable are pre-set by some target, for example in a medical context where a drug is supposed to be administered every X hours; the pre-set levels might then be the measured values, and the true values might be different e.g. if the nurse was late. I thought it important to mention this error model not only because it’s a completely new idea to me that you might encounter this sort of error-generating process, but also because there is no statistical test that you can use to figure out if the standard error model is the appropriate one, or if a Berkson error model is better; which means that you need to be aware of the difference and think about which model works best, based on the nature of the measuring process.
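
The difference between the two error models can be sketched with simulated data. Everything in the block below is an illustrative assumption (normal errors, unit variances, a linear outcome model with slope 2). Besides the variance ordering the authors mention, it also shows a standard textbook consequence that is not taken from the book: in a simple linear regression, classical error attenuates the estimated slope, whereas Berkson error leaves it unbiased.

```python
import numpy as np

rng = np.random.default_rng(0)
n, beta = 100_000, 2.0   # sample size and true slope (both hypothetical)

# Classical error: measured = true + error, error independent of the TRUE value.
x_true_c = rng.normal(0.0, 1.0, n)
w_meas_c = x_true_c + rng.normal(0.0, 1.0, n)
y_c = beta * x_true_c + rng.normal(0.0, 1.0, n)

# Berkson error: true = measured + error, error independent of the MEASURED value
# (e.g. the measured value is a pre-set target that is not hit exactly).
w_meas_b = rng.normal(0.0, 1.0, n)
x_true_b = w_meas_b + rng.normal(0.0, 1.0, n)
y_b = beta * x_true_b + rng.normal(0.0, 1.0, n)


def ols_slope(x, y):
    """Slope from a simple least-squares regression of y on x."""
    return np.cov(x, y)[0, 1] / np.var(x, ddof=1)


slope_classical = ols_slope(w_meas_c, y_c)
slope_berkson = ols_slope(w_meas_b, y_b)

print(f"classical: var(true) {x_true_c.var():.2f}, var(measured) {w_meas_c.var():.2f}, "
      f"slope on measured {slope_classical:.2f}")
print(f"Berkson:   var(true) {x_true_b.var():.2f}, var(measured) {w_meas_b.var():.2f}, "
      f"slope on measured {slope_berkson:.2f}")
```

The data sets generated by the two mechanisms look much alike, which fits the remark that the choice between the models has to come from knowledge of the measuring process.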

Let’s move on to some quotes dealing with modeling:

“while it is appealing to use methods that are in a reasonable sense fully efficient, that is, extract all relevant information in the data, nevertheless any such notion is within the framework of an assumed model. Ideally, methods should have this efficiency property while preserving good behaviour (especially stability of interpretation) when the model is perturbed. Essentially a model translates a subject-matter question into a mathematical or statistical one and, if that translation is seriously defective, the analysis will address a wrong or inappropriate question […] The greatest difficulty with quasi-realistic models [as opposed to ‘toy models’] is likely to be that they require numerical specification of features for some of which there is very little or no empirical information. Sensitivity analysis is then particularly important.”

“Parametric models typically represent some notion of smoothness; their danger is that particular representations of that smoothness may have strong and unfortunate implications. This difficulty is covered for the most part by informal checking that the primary conclusions do not depend critically on the precise form of parametric representation. To some extent such considerations can be formalized but in the last analysis some element of judgement cannot be avoided. One general consideration that is sometimes helpful is the following. If an issue can be addressed nonparametrically then it will often be better to tackle it parametrically; however, if it cannot be resolved nonparametrically then it is usually dangerous to resolve it parametrically.”

“Once a model is formulated two types of question arise. How can the unknown parameters in the model best be estimated? Is there evidence that the model needs modification or indeed should be abandoned in favour of some different representation? The second question is to be interpreted not as asking whether the model is true [*this is the wrong question to ask, as also emphasized by Burnham & Anderson*] but whether there is clear evidence of a specific kind of departure implying a need to change the model so as to avoid distortion of the final conclusions. […] it is important in applications to understand the circumstances under which different methods give similar or different conclusions. In particular, if a more elaborate method gives an apparent improvement in precision, what are the assumptions on which that improvement is based? Are they reasonable? […] the hierarchical principle implies, […] with very rare exceptions, that models with interaction terms should include also the corresponding main effects. […] When considering two families of models, it is important to consider the possibilities that both families are adequate, that one is adequate and not the other and that neither family fits the data.” [Do incidentally recall that in the context of interactions, “the term interaction […] is in some ways a misnomer. There is no necessary implication of interaction in the physical sense or synergy in a biological context. Rather, interaction means a departure from additivity […] This is expressed most explicitly by the requirement that, apart from random fluctuations, the difference in outcome between any two levels of one factor is the same at all levels of the other factor. […] The most directly interpretable form of interaction, certainly not removable by [variable] transformation, is effect reversal.”]
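
The idea of interaction as a departure from additivity, and of effect reversal as the kind of interaction no transformation can remove, can be shown with two small tables of cell means. The numbers are invented for illustration.

```python
import math


def interaction_contrast(cells, transform=lambda v: v):
    """Difference-in-differences for a 2x2 table of cell means.

    cells[a][b] is the mean outcome at level a of factor A and level b of factor B.
    A value of zero means the effect of A is the same at both levels of B (additivity).
    """
    t = [[transform(v) for v in row] for row in cells]
    effect_a_at_b0 = t[1][0] - t[0][0]
    effect_a_at_b1 = t[1][1] - t[0][1]
    return effect_a_at_b1 - effect_a_at_b0


# Hypothetical means where A doubles and B triples the outcome (purely multiplicative).
multiplicative = [[10, 30], [20, 60]]
print("raw scale:", interaction_contrast(multiplicative))
print("log scale:", round(interaction_contrast(multiplicative, math.log), 12))

# Hypothetical effect reversal: A raises the outcome at B0 but lowers it at B1.
reversal = [[10, 30], [20, 15]]
print("raw scale:", interaction_contrast(reversal))
print("log scale:", round(interaction_contrast(reversal, math.log), 3))
```

In the first table the interaction on the raw scale disappears on the log scale, so it is an artefact of the chosen scale. In the second table the effect of A changes sign between the levels of B, and since an increasing transformation preserves the direction of each difference, no such transformation can make the two effects equal.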

“The *p*-value assesses the data […] via a comparison with that anticipated if *H*₀ were true. If in two different situations the test of a relevant null hypothesis gives approximately the same *p*-value, it does not follow that the overall strengths of the evidence in favour of the relevant *H*₀ are the same in the two cases.”

“There are […] two sources of uncertainty in observational studies that are not present in randomized experiments. The first is that the ordering of the variables may be inappropriate, a particular hazard in cross-sectional studies. […] if the data are tied to one time point then any presumption of causality relies on a working hypothesis as to whether the components are explanatory or responses. Any check on this can only be from sources external to the current data. […] The second source of uncertainty is that important explanatory variables affecting both the potential cause and the outcome may not be available. […] Retrospective explanations may be convincing if based on firmly established theory but otherwise need to be treated with special caution. It is well known in many fields that ingenious explanations can be constructed retrospectively for almost any finding.”

“The general issue of applying conclusions from aggregate data to specific individuals is essentially that of showing that the individual does not belong to a subaggregate for which a substantially different conclusion applies. In actuality this can at most be indirectly checked for specific subaggregates. […] It is not unknown in the literature to see conclusions such as that there are no treatment differences except for males aged over 80 years, living more than 50 km south of Birmingham and life-long supporters of Aston Villa football club, who show a dramatic improvement under some treatment *T*. Despite the undoubted importance of this particular subgroup, virtually always such conclusions would seem to be unjustified.” [*I loved this example!*]
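
Part of the reason such subgroup findings deserve suspicion is simple arithmetic. As a sketch, assume a treatment with no effect in any subgroup and independent tests at the 5% level in each of k subgroups (independence is a simplifying assumption, and the function name is hypothetical); the chance that at least one subgroup looks ‘significant’ then grows quickly with k.

```python
def prob_any_false_positive(n_subgroups, alpha=0.05):
    """Chance of at least one 'significant' subgroup when no true effect exists,
    ASSUMING independent tests at level alpha."""
    return 1 - (1 - alpha) ** n_subgroups


for k in (1, 5, 20, 50):
    print(f"{k:2d} subgroups: P(at least one false positive) = "
          f"{prob_any_false_positive(k):.2f}")
```

Crossing just a handful of characteristics (sex, age band, place of residence, football allegiance) easily produces dozens of subgroups, at which point finding one with a dramatic apparent treatment effect is close to the expected outcome rather than a surprise.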

The authors included a few interesting results from an undated Cochrane publication which I thought I should mention. The file-drawer effect is well known, but there are a few other interesting biases at play in a publication bias context. One is time-lag bias, which means that statistically significant results take less time to get published. Another is language bias; statistically significant results are more likely to be published in English publications. A third bias is multiple publication bias; it turns out that papers with statistically significant results are more likely to be published more than once. The last one mentioned is citation bias; papers with statistically significant results are more likely to be cited in the literature.

The authors include these observations in their concluding remarks: “The overriding general principle [in the context of applied statistics], difficult to achieve, is that there should be a seamless flow between statistical and subject-matter considerations. […] in principle seamlessness requires an individual statistician to have views on subject-matter interpretation and subject-matter specialists to be interested in issues of statistical analysis.”

As already mentioned this is a good book. It’s not long, and/but it’s worth reading if you’re in the target group.

## Quotes

i. “By all means think yourself big but don’t think everyone else small” (‘Notes on Flyleaf of Fresh ms. Book’, *Scott’s Last Expedition*. See also this).

ii. “The man who knows everyone’s job isn’t much good at his own.” (-ll-)

iii. “It is amazing what little harm doctors do when one considers all the opportunities they have” (Mark Twain, as quoted in the Oxford Handbook of Clinical Medicine, p.595).

iv. “A first-rate theory predicts; a second-rate theory forbids and a third-rate theory explains after the event.” (Aleksander Isaakovich Kitaigorodski)

v. “[S]ome of the most terrible things in the world are done by people who think, genuinely think, that they’re doing it for the best” (Terry Pratchett, Snuff).

vi. “That was excellently observ’d, say I, when I read a Passage in an Author, where his Opinion agrees with mine. When we differ, there I pronounce him to be mistaken.” (Jonathan Swift)

vii. “Death is nature’s master stroke, albeit a cruel one, because it allows genotypes space to try on new phenotypes.” (Quote from the Oxford Handbook of Clinical Medicine, p.6)

viii. “The purpose of models is not to fit the data but to sharpen the questions.” (Samuel Karlin)

ix. “We may […] view set theory, and mathematics generally, in much the way in which we view theoretical portions of the natural sciences themselves; as comprising truths or hypotheses which are to be vindicated less by the pure light of reason than by the indirect systematic contribution which they make to the organizing of empirical data in the natural sciences.” (Quine)

x. “At root what is needed for scientific inquiry is just receptivity to data, skill in reasoning, and yearning for truth. Admittedly, ingenuity can help too.” (-ll-)

xi. “A statistician carefully assembles facts and figures for others who carefully misinterpret them.” (Quote from *Mathematically Speaking – A Dictionary of Quotations*, p.329. Only source given in the book is: “Quoted in Evan Esar, *20,000 Quips and Quotes*”)

xii. “A knowledge of statistics is like a knowledge of foreign languages or of algebra; it may prove of use at any time under any circumstances.” (Quote from *Mathematically Speaking – A Dictionary of Quotations*, p. 328. The source provided is: “Elements of Statistics, Part I, Chapter I (p.4)”).

xiii. “We own to small faults to persuade others that we have not great ones.” (Rochefoucauld)

xiv. “There is more self-love than love in jealousy.” (-ll-)

xv. “We should not judge of a man’s merit by his great abilities, but by the use he makes of them.” (-ll-)

xvi. “We should gain more by letting the world see what we are than by trying to seem what we are not.” (-ll-)

xvii. “Put succinctly, a prospective study looks for the effects of causes whereas a retrospective study examines the causes of effects.” (Quote from p.49 of *Principles of Applied Statistics*, by Cox & Donnelly)

xviii. “… he who seeks for methods without having a definite problem in mind seeks for the most part in vain.” (David Hilbert)

xix. “Give every man thy ear, but few thy voice” (Shakespeare).

xx. “Often the fear of one evil leads us into a worse.” (Nicolas Boileau-Despréaux)

## The Nature of Statistical Evidence

Here’s my goodreads review of the book.

As I’ve observed many times before, a wordpress blog like mine is not a particularly nice place to cover mathematical topics involving equations and lots of Greek letters, so the coverage below will be more or less purely conceptual; don’t take this to mean that the book doesn’t contain formulas. Some parts of the book look like this:

That of course makes the book hard to blog, also for other reasons than just the fact that it’s typographically hard to deal with the equations. In general it’s hard to talk about the content of a book like this one without going into *a lot* of details outlining how you get from A to B to C – usually you’re only really interested in C, but you need A and B to make sense of C. At this point I’ve sort of concluded that when covering books like this one I’ll only cover some of the main themes which are easy to discuss in a blog post, and I’ve concluded that I should skip coverage of (potentially important) points which might also be of interest if they’re difficult to discuss in a small amount of space, which is unfortunately often the case. I should perhaps observe that although I noted in my goodreads review that in a way there was a bit too much philosophy and a bit too little statistics in the coverage for my taste, you should definitely not take that objection to mean that this book is full of fluff; a lot of that philosophical stuff is ‘formal logic’ type stuff and related comments, and the book in general is quite dense. As I also noted in the goodreads review I didn’t read this book as carefully as I might have done – for example I skipped a couple of the technical proofs because they didn’t seem to be worth the effort – and I’d probably need to read it again to fully understand some of the minor points made throughout the more technical parts of the coverage; so that’s of course a related reason why I don’t cover the book in a great amount of detail here – it’s hard work just to read the damn thing, to talk about the technical stuff in detail here as well would definitely be overkill even if it would surely make me understand the material better.

I have added some observations from the coverage below. I’ve tried to clarify beforehand which question/topic the quote in question deals with, to ease reading/understanding of the topics covered.

…

On how statistical methods are related to experimental science:

“statistical methods have aims similar to the process of experimental science. But statistics is not itself an experimental science, it consists of models of how to do experimental science. Statistical theory is a logical — mostly mathematical — discipline; its findings are not subject to experimental test. […] The primary sense in which statistical theory is a science is that it guides and explains statistical methods. A sharpened statement of the purpose of this book is to provide explanations of the senses in which some statistical methods provide scientific evidence.”

On mathematics and axiomatic systems (the book goes into much more detail than this):

“It is not sufficiently appreciated that a link is needed between mathematics and methods. Mathematics is not about the world until it is interpreted and then it is only about models of the world […]. No contradiction is introduced by either interpreting the same theory in different ways or by modeling the same concept by different theories. […] In general, a primitive undefined term is said to be **interpreted** when a meaning is assigned to it and when all such terms are interpreted we have an **interpretation** of the axiomatic system. It makes no sense to ask which is the correct interpretation of an axiom system. This is a primary strength of the axiomatic method; we can use it to organize and structure our thoughts and knowledge by simultaneously and economically treating all interpretations of an axiom system. It is also a weakness in that failure to define or interpret terms leads to much confusion about the implications of theory for application.”

It’s all about models:

“The scientific method of theory checking is to compare predictions deduced from a theoretical model with observations on nature. Thus science must predict what happens in nature but it need not explain why. […] whether experiment is consistent with theory is relative to accuracy and purpose. All theories are simplifications of reality and hence no theory will be expected to be a perfect predictor. Theories of statistical inference become relevant to scientific process at precisely this point. […] Scientific method is a practice developed to deal with experiments on **nature**. Probability theory is a deductive study of the properties of **models** of such experiments. All of the theorems of probability are results about models of experiments.”

But given a frequentist interpretation you can test your statistical theories with the real world, right? Right? Well…

“How might we check the long run stability of relative frequency? If we are to compare mathematical theory with experiment then only finite sequences can be observed. But for the Bernoulli case, the event that frequency approaches probability is stochastically independent of any sequence of finite length. […] Long-run stability of relative frequency cannot be checked experimentally. There are neither theoretical nor empirical guarantees that, a priori, one can recognize experiments performed under uniform conditions and that under these circumstances one *will* obtain stable frequencies.” [related link]

What should we expect to get out of mathematical and statistical theories of inference?

“What can we expect of a theory of statistical inference? We can expect an internally consistent explanation of why certain conclusions follow from certain data. The theory will not be about inductive rationality but about a *model* of inductive rationality. Statisticians are used to thinking that they apply their logic to models of the physical world; less common is the realization that their logic itself is only a model. Explanation will be in terms of introduced concepts which do not exist in nature. Properties of the concepts will be derived from assumptions which merely seem reasonable. This is the only sense in which the axioms of any mathematical theory are true […] We can expect these concepts, assumptions, and properties to be intuitive but, unlike natural science, they cannot be checked by experiment. Different people have different ideas about what “seems reasonable,” so we can expect different explanations and different properties. We should not be surprised if the theorems of two different theories of statistical evidence differ. If two models had no different properties then they would be different versions of the same model […] We should not expect to achieve, by mathematics alone, a single coherent theory of inference, for mathematical truth is conditional and the assumptions are not “self-evident.” Faith in a set of assumptions would be needed to achieve a single coherent theory.”

On disagreements about the nature of statistical evidence:

“The context of this section is that there is disagreement among experts about the nature of statistical evidence and consequently much use of one formulation to criticize another. Neyman (1950) maintains that, from his behavioral hypothesis testing point of view, Fisherian significance tests do not express evidence. Royall (1997) employs the “law” of likelihood to criticize hypothesis as well as significance testing. Pratt (1965), Berger and Sellke (1987), Berger and Berry (1988), and Casella and Berger (1987) employ Bayesian theory to criticize sampling theory. […] Critics assume that their findings are about evidence, but they are at most about models of evidence. Many theoretical statistical criticisms, when stated in terms of evidence, have the following outline: According to model A, evidence satisfies proposition P. But according to model B, which is correct since it is derived from “self-evident truths,” P is not true. Now evidence can’t be two different ways so, since B is right, A must be wrong. Note that the argument is symmetric: since A appears “self-evident” (to adherents of A) B must be wrong. But both conclusions are invalid since evidence can be modeled in different ways, perhaps useful in different contexts and for different purposes. From the observation that P is a theorem of A but not of B, all we can properly conclude is that A and B are different models of evidence. […] The common practice of using one theory of inference to critique another is a misleading activity.”

Is mathematics a science?

“Is mathematics a science? It is certainly systematized knowledge much concerned with structure, but then so is history. Does it employ the scientific method? Well, partly; hypothesis and deduction are the essence of mathematics and the search for counter examples is a mathematical counterpart of experimentation; but the question is not put to nature. Is mathematics about nature? In part. The hypotheses of most mathematics are suggested by some natural primitive concept, for it is difficult to think of interesting hypotheses concerning nonsense syllables and to check their consistency. However, it often happens that as a mathematical subject matures it tends to evolve away from the original concept which motivated it. Mathematics in its purest form is probably not natural science since it lacks the experimental aspect. Art is sometimes defined to be creative work displaying form, beauty and unusual perception. By this definition pure mathematics is clearly an art. On the other hand, applied mathematics, taking its hypotheses from real world concepts, is an attempt to describe nature. Applied mathematics, without regard to experimental verification, is in fact largely the “conditional truth” portion of science. If a body of applied mathematics has survived experimental test to become trustworthy belief then it is the essence of natural science.”

Then what about statistics – is statistics a science?

“Statisticians can and do make contributions to subject matter fields such as physics, and demography but statistical theory and methods proper, distinguished from their findings, are not like physics in that they are not about nature. […] Applied statistics is natural science but the findings are about the subject matter field not statistical theory or method. […] Statistical theory helps with how to do natural science but it is not itself a natural science.”

…

I should note that I am, and have for a long time been, in broad agreement with the author’s remarks on the nature of science and mathematics above. Popper, among many others, discussed this topic a long time ago e.g. in The Logic of Scientific Discovery and I’ve basically been of the opinion that (‘pure’) mathematics is not science (‘but rather ‘something else’ … and that doesn’t mean it’s not useful’) for probably a decade. I’ve had a harder time coming to terms with how precisely to deal with statistics in terms of these things, and in that context the book has been conceptually helpful.

Below I’ve added a few links to other stuff also covered in the book:

Propositional calculus.

Kolmogorov’s axioms.

Neyman-Pearson lemma.

Radon-Nikodym theorem. (not covered in the book, but the necessity of using ‘a Radon-Nikodym derivative’ to obtain an answer to a question being asked was remarked upon at one point, and I had no clue what he was talking about – it seems that the stuff in the link was what he was talking about).

A very specific and relevant link: Berger and Wolpert (1984). The stuff about Birnbaum’s argument covered from p.24 (p.40) and forward is covered in some detail in the book. The author is critical of the model and explains in the book in some detail why that is. See also: *On the foundations of statistical inference* (Birnbaum, 1962).

## Cost-effectiveness analysis in health care (III)

This will be my last post about the book. Yesterday I finished reading Darwin’s Origin of Species, which was my 100th book this year (here’s the list), but I can’t face blogging that book at the moment so coverage of that one will have to wait a bit.

In my second post about this book I had originally planned to cover chapter 7 – ‘Analysing costs’ – but as I didn’t like to spend too much time on the post I ended up cutting it short. This omission of coverage in the last post means that some themes to be discussed below are closely related to stuff covered in the second post, whereas on the other hand most of the remaining material, more specifically the material from chapters 8, 9 and 10, deals with decision analytic modelling, a quite different topic; in other words the coverage will be slightly more fragmented and less structured than I’d have liked it to be, but there’s not really much to do about that (it doesn’t help in this respect that I decided to not cover chapter 8, but doing that as well was out of the question).

I’ll start with coverage of some of the things they talk about in chapter 7, which as mentioned deals with how to analyze costs in a cost-effectiveness analysis context. They observe in the chapter that health cost data are often skewed to the right, for several reasons (costs incurred by an individual cannot be negative; for many patients the costs may be zero; some study participants may require much more care than the rest, creating a long tail). One way to address skewness is to use the median instead of the mean as the variable of interest, but a problem with this approach is that the median will not be as useful to policy-makers as the mean: the mean times the population of interest will give a good estimate of the total costs of an intervention, whereas the median is not a very useful variable in the context of arriving at an estimate of the total costs. Doing data transformations and analyzing transformed data is another way to deal with skewness, but their use in cost-effectiveness analysis has been questioned for a variety of reasons discussed in the chapter (to give a couple of examples, data transformation methods perform badly if inappropriate transformations are used, and many transformations cannot be used if there are data points with zero costs in the data, which is very common). Of the non-parametric methods aimed at dealing with skewness they discuss a variety of tests which are rarely used, as well as the bootstrap, the latter being one approach which has gained widespread use.
They observe in the context of the bootstrap that “it has increasingly been recognized that the conditions the bootstrap requires to produce reliable parameter estimates are not fundamentally different from the conditions required by parametric methods” and note in a later chapter (chapter 11) that: “it is not clear that bootstrap results in the presence of severe skewness are likely to be any more or less valid than parametric results […] bootstrap and parametric methods both rely on sufficient sample sizes and are likely to be valid or invalid in similar circumstances. Instead, interest in the bootstrap has increasingly focused on its usefulness in dealing simultaneously with issues such as censoring, missing data, multiple statistics of interest such as costs and effects, and non-normality.” Going back to the coverage in chapter 7, in the context of skewness they also briefly touch upon the potential use of a GLM framework to address this problem.
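To make the mean-versus-median point and the bootstrap idea a bit more concrete, here is a minimal sketch of my own (it is not from the book). The cost data are simulated and entirely hypothetical (roughly 30% zero-cost patients plus a log-normal tail), and `bootstrap_mean_ci` is just a name I made up for a plain percentile bootstrap of the mean:

```python
import random
import statistics


def bootstrap_mean_ci(costs, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the mean of `costs`."""
    rng = random.Random(seed)
    n = len(costs)
    means = []
    for _ in range(n_resamples):
        # Resample n observations with replacement and record the mean.
        resample = [costs[rng.randrange(n)] for _ in range(n)]
        means.append(statistics.fmean(resample))
    means.sort()
    lower = means[int(alpha / 2 * n_resamples)]
    upper = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper


# Hypothetical right-skewed cost data: ~30% zero-cost patients, long right tail.
data_rng = random.Random(42)
costs = [
    0.0 if data_rng.random() < 0.3 else data_rng.lognormvariate(7.0, 1.2)
    for _ in range(200)
]

mean_cost = statistics.fmean(costs)
median_cost = statistics.median(costs)
ci_low, ci_high = bootstrap_mean_ci(costs)

print(f"mean   = {mean_cost:9.1f}")
print(f"median = {median_cost:9.1f}")
print(f"95% bootstrap CI for the mean: ({ci_low:.1f}, {ci_high:.1f})")
```

With data like these the median sits well below the mean, which is why multiplying the median by the population size would understate total costs. Note that the sketch says nothing about whether the interval has good coverage under severe skewness, which is exactly the caveat raised in the quotes above.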

Data is often missing in cost datasets. Some parts of their coverage of these topics were to me but a review of stuff already covered in Bartholomew. Data can be missing for different reasons and through different mechanisms; one distinction is among data missing completely at random (MCAR), missing at random (MAR) (“missing data are correlated in an observable way with the mechanism that generates the cost, i.e. after adjusting the data for observable differences between complete and missing cases, the cost for those with missing data is the same, except for random variation, as for those with complete data”), and not missing at random (NMAR); the last type is also called non-ignorably missing data, and if you have that sort of data the implication is that the costs of those in the observed and unobserved groups differ in unpredictable ways, and if you ignore the process that drives these differences you’ll probably end up with a biased estimator. Another way to distinguish between different types of missing data is to look at patterns within the dataset, where you have:

* “**univariate missingness** – a single variable in a dataset is causing a problem through missing values, while the remaining variables contain complete information

* **unit non-response** – no data are recorded for any of the variables for some patients

* **monotone missing** – caused, for example, by drop-out in panel or longitudinal studies, resulting in variables observed up to a certain time point or wave but not beyond that

* **multivariate missing** – also called item non-response or general missingness, where some but not all of the variables are missing for some of the subjects.”

The authors note that the most common types of missingness in cost information analyses are the latter two. They discuss some techniques for dealing with missing data, such as complete-case analysis, available-case analysis, and imputation, but I won’t go into the details here. In the last parts of the chapter they talk a little bit about censoring, which can be viewed as a specific type of missing data, and ways to deal with it. Censoring happens when follow-up information on some subjects is not available for the full duration of interest, which may be caused e.g. by attrition (people dropping out of the trial), or insufficient follow-up (the final date of follow-up might be set before all patients reach the endpoint of interest, e.g. death). The two most common methods for dealing with censored cost data are the Kaplan-Meier sample average (KMSA) estimator and the inverse probability weighting (IPW) estimator, both of which are non-parametric interval methods. “Comparisons of the IPW and KMSA estimators have shown that they both perform well over different levels of censoring […], and both are considered reasonable approaches for dealing with censoring.” One difference between the two is that the KMSA, unlike the IPW, is not appropriate for dealing with censoring due to attrition unless the attrition is MCAR (and it almost never is), because the KM estimator, and by extension the KMSA estimator, assumes that censoring is independent of the event of interest.
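For readers who like to see the weighting idea in code, below is a rough sketch of the simplest inverse-probability-weighted mean cost estimator I know of: each patient with complete cost data is weighted by one over the Kaplan-Meier estimate of the probability of still being uncensored at that patient's end of follow-up. This is my own illustration and not necessarily the interval-based version discussed in the book. The simulated cohort, the 10-year horizon, the constant cost rate, and all function names are assumptions made purely for the example, and censoring is simulated as independent of survival:

```python
import random
import statistics

HORIZON = 10.0        # years over which costs are of interest (assumed)
COST_PER_YEAR = 1000  # hypothetical constant cost rate while alive


def simulate(n, seed=0):
    """Return (full_costs, records); each record is (time, complete, observed_cost)."""
    rng = random.Random(seed)
    full_costs, records = [], []
    for _ in range(n):
        survival = min(rng.expovariate(0.2), HORIZON)  # death time, capped at horizon
        censoring = rng.uniform(0.0, 15.0)             # independent censoring time
        time = min(survival, censoring)
        complete = survival <= censoring
        full_costs.append(COST_PER_YEAR * survival)
        records.append((time, complete, COST_PER_YEAR * time))
    return full_costs, records


def ipw_mean_cost(records):
    """Simple inverse-probability-weighted mean cost under independent censoring."""
    n = len(records)
    # Sort by time; at tied times, complete cases come before censored ones.
    ordered = sorted(records, key=lambda r: (r[0], not r[1]))
    surv_censoring = 1.0  # Kaplan-Meier estimate of P(not yet censored)
    at_risk = n
    total = 0.0
    for _time, complete, cost in ordered:
        if complete:
            total += cost / surv_censoring  # weight = 1 / P(still uncensored)
        else:
            surv_censoring *= 1.0 - 1.0 / at_risk
        at_risk -= 1
    return total / n


full_costs, records = simulate(20000)
full_mean = statistics.fmean(full_costs)  # unobservable in practice
complete_case_mean = statistics.fmean(c for _, done, c in records if done)
ipw_mean = ipw_mean_cost(records)

print(f"mean cost without censoring: {full_mean:8.1f}")
print(f"complete-case mean:          {complete_case_mean:8.1f}")
print(f"IPW mean:                    {ipw_mean:8.1f}")
```

In this setup the complete-case mean is too low, because long (and therefore expensive) survivors are more likely to be censored, whereas the weighted estimate lands close to the mean you would have seen without censoring.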

The focus in chapter 8 is on decision tree models, and I decided to skip that chapter as most of it is known stuff which I felt no need to review here (do remember that I to a large extent use this blog as an extended memory, so I’m not only(/mainly?) writing this stuff for other people..). Chapter 9 deals with Markov models, and I’ll talk a little bit about those in the following.

“Markov models analyse uncertain processes over time. They are suited to decisions where the timing of events is important and when events may happen more than once, and therefore they are appropriate where the strategies being evaluated are of a sequential or repetitive nature. Whereas decision trees model uncertain events at chance nodes, Markov models differ in modelling uncertain events as transitions between health states. In particular, Markov models are suited to modelling long-term outcomes, where costs and effects are spread over a long period of time. Therefore Markov models are particularly suited to chronic diseases or situations where events are likely to recur over time […] Over the last decade there has been an increase in the use of Markov models for conducting economic evaluations in a health-care setting […]

A Markov model comprises a finite set of health states in which an individual can be found. The states are such that in any given time interval, the individual will be in only one health state. All individuals in a particular health state have identical characteristics. The number and nature of the states are governed by the decisions problem. […] Markov models are concerned with transitions during a series of cycles consisting of short time intervals. The model is run for several cycles, and patients move between states or remain in the same state between cycles […] Movements between states are defined by transition probabilities which can be time dependent or constant over time. All individuals within a given health state are assumed to be identical, and this leads to a limitation of Markov models in that the transition probabilities only depend on the current health state and not on past health states […the process is memoryless…] – this is known as the Markovian assumption”.

They note that in order to build and analyze a Markov model, you need to do the following:

* define states and allowable transitions [for example from ‘non-dead’ to ‘dead’ is okay, but going the other way is, well… For a Markov process to end, you need at least one state that cannot be left after it has been reached, and those states are termed ‘absorbing states’]
* specify initial conditions in terms of starting probabilities/initial distribution of patients
* specify transition probabilities
* specify a cycle length
* set a stopping rule
* determine rewards
* implement discounting if required
* analysis and evaluation of the model
* exploration of uncertainties

They talk about each step in more detail in the book, but I won’t go too much into this.
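The steps above can be sketched as a tiny cohort model. Everything in the example is made up by me for illustration: the three states, the transition probabilities, the per-cycle costs and QALY weights, the 40-cycle stopping rule, and the 3.5% discount rate are all assumptions, and state membership is simply counted at the end of each annual cycle:

```python
STATES = ("well", "sick", "dead")  # 'dead' is the absorbing state

# Hypothetical per-cycle (annual) transition probabilities; each row sums to 1.
P = {
    "well": {"well": 0.85, "sick": 0.10, "dead": 0.05},
    "sick": {"well": 0.00, "sick": 0.80, "dead": 0.20},
    "dead": {"well": 0.00, "sick": 0.00, "dead": 1.00},
}
COST = {"well": 100.0, "sick": 3000.0, "dead": 0.0}  # reward: cost per cycle
QALY = {"well": 0.95, "sick": 0.60, "dead": 0.0}     # reward: QALYs per cycle


def run_cohort(start, n_cycles=40, discount_rate=0.035):
    """Run the cohort for n_cycles and return discounted totals plus the trace."""
    dist = dict(start)
    trace = [dict(dist)]
    total_cost = total_qaly = 0.0
    for cycle in range(1, n_cycles + 1):
        # Move the cohort one cycle forward using the transition probabilities.
        dist = {to: sum(dist[frm] * P[frm][to] for frm in STATES) for to in STATES}
        discount = 1.0 / (1.0 + discount_rate) ** cycle
        total_cost += discount * sum(dist[s] * COST[s] for s in STATES)
        total_qaly += discount * sum(dist[s] * QALY[s] for s in STATES)
        trace.append(dict(dist))
    return total_cost, total_qaly, trace


# Initial conditions: the whole cohort starts in 'well'.
total_cost, total_qaly, trace = run_cohort({"well": 1.0, "sick": 0.0, "dead": 0.0})
print(f"discounted cost per patient:  {total_cost:.0f}")
print(f"discounted QALYs per patient: {total_qaly:.2f}")
print(f"share dead after 40 cycles:   {trace[-1]['dead']:.3f}")
```

Because the transition probabilities are held constant here, this is the 'memoryless' case described in the quote above: where the cohort goes next depends only on the current state.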

Markov models may be governed by transitions that are either constant over time or time-dependent. In a Markov *chain* transition probabilities are constant over time, whereas in a Markov *process* transition probabilities vary over time (/from cycle to cycle). In a simple Markov model the baseline assumption is that transitions only occur once in each cycle and usually the transition is modelled as taking place either at the beginning or the end of cycles, but in reality transitions can take place at any point in time during the cycle. One way to deal with the problem of misidentification (people assumed to be in one health state throughout the cycle even though they’ve transferred to another health state during the cycle) is to use half-cycle corrections, in which an assumption is made that on average state transitions occur halfway through the cycle, instead of at the beginning or the end of a cycle. They note that: “the important principle with the half-cycle correction is not when the transitions occur, but when state membership (i.e. the proportion of the cohort in that state) is counted. The longer the cycle length, the more important it may be to use half-cycle corrections.” When state transitions are assumed to take place may influence factors such as cost discounting (if the cycle is long, it can be important to get the state transition timing reasonably right).
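A small numerical illustration of the half-cycle correction may help here. The sketch is mine and uses a deliberately trivial alive/dead model with a hypothetical 10% annual death probability; it counts life-years with state membership taken at the start of each cycle, at the end of each cycle, and as the average of the two (which is one common way of implementing the correction, assumed here):

```python
def life_years(p_death, n_cycles, cycle_length=1.0):
    """Life-years per person with membership counted at the start of each
    cycle, at the end of each cycle, or half-cycle corrected (the average)."""
    alive = 1.0
    at_start = at_end = 0.0
    for _ in range(n_cycles):
        at_start += alive * cycle_length
        alive *= 1.0 - p_death
        at_end += alive * cycle_length
    return at_start, at_end, 0.5 * (at_start + at_end)


# Same 20-year horizon, modelled with annual and with monthly cycles.
annual = life_years(p_death=0.10, n_cycles=20, cycle_length=1.0)
p_month = 1.0 - (1.0 - 0.10) ** (1.0 / 12.0)  # monthly probability matching 10% a year
monthly = life_years(p_death=p_month, n_cycles=240, cycle_length=1.0 / 12.0)

for label, (start, end, half) in (("annual", annual), ("monthly", monthly)):
    print(f"{label:8s} start={start:.3f} end={end:.3f} half-cycle={half:.3f}")
```

The gap between start-of-cycle and end-of-cycle counting is much larger with annual cycles than with monthly ones, while the half-cycle corrected annual figure ends up close to the monthly figures, in line with the remark that the correction matters more the longer the cycle.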

When time dependency is introduced into the model, there are in general two types of time dependencies that impact on transition probabilities in the models. One is time dependency depending on the number of cycles since the start of the model (this is e.g. dealing with how transition probabilities depend on factors like age), whereas the other, which is more difficult to implement, deals with state dependence (curiously they don’t use these two words, but I’ve worked with state dependence models before in labour economics and this is what we’re dealing with here); i.e. here the transition probability will depend upon how long you’ve been in a given state.
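The two kinds of time dependency can be sketched side by side. In the example below, which is my own and uses made-up numbers throughout, the death probability from ‘well’ depends on the number of cycles since model start (a stand-in for ageing), whereas the death probability from ‘sick’ depends on how long the patient has already been sick. The latter is handled by splitting ‘sick’ into one sub-state per cycle of duration, an approach I have seen referred to as ‘tunnel states’:

```python
P_ONSET = 0.05  # hypothetical per-cycle probability of moving from 'well' to 'sick'


def p_death_well(cycle):
    """Depends on cycles since model start (a stand-in for ageing)."""
    return min(1.0, 0.01 * 1.08 ** cycle)


def p_death_sick(duration):
    """Depends on cycles already spent in 'sick' (higher risk right after onset)."""
    return 0.30 if duration == 0 else 0.10


def run(n_cycles):
    well, dead = 1.0, 0.0
    sick = {}  # cycles already spent in 'sick' -> share of the cohort
    for cycle in range(n_cycles):
        p_dw = p_death_well(cycle)
        # New cases enter 'sick' with duration 0.
        new_sick = {0: well * (1.0 - p_dw) * P_ONSET}
        dead += well * p_dw
        well *= (1.0 - p_dw) * (1.0 - P_ONSET)
        # Survivors in 'sick' move one step further along the tunnel.
        for duration, share in sick.items():
            p_ds = p_death_sick(duration)
            dead += share * p_ds
            new_sick[duration + 1] = share * (1.0 - p_ds)
        sick = new_sick
    return well, sick, dead


well, sick, dead = run(30)
print(f"well={well:.3f} sick={sum(sick.values()):.3f} dead={dead:.3f}")
```

The bookkeeping is what makes the second kind harder to implement: the number of sub-states grows with the number of cycles, even though conceptually there is still just one ‘sick’ state.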

Below I mostly discuss stuff covered in chapter 10; however, I also include a few observations from the final chapter, chapter 11 (on ‘Presenting cost-effectiveness results’). Chapter 10 deals with how to represent uncertainty in decision analytic models. This is an important topic because as noted later in the book, “The primary objective of economic evaluation should not be hypothesis testing, but rather the estimation of the central parameter of interest—the incremental cost-effectiveness ratio—along with appropriate representation of the uncertainty surrounding that estimate.” In chapter 10 a distinction is made between variability, heterogeneity, and uncertainty. Variability has also been termed first-order uncertainty or stochastic uncertainty, and pertains to variation observed when recording information on resource use or outcomes within a homogenous sample of individuals. Heterogeneity relates to differences between patients which can be explained, at least in part. They distinguish between two types of uncertainty, structural uncertainty – dealing with decisions and assumptions made about the structure of the model – and parameter uncertainty, which of course relates to the precision of the parameters estimated. After briefly talking about ways to deal with these, they talk about sensitivity analysis.

“Sensitivity analysis involves varying parameter estimates across a range and seeing how this impacts on the model’s results. […] The simplest form is a one-way analysis where each parameter estimate is varied independently and singly to observe the impact on the model results. […] One-way sensitivity analysis can give some insight into the factors influencing the results, and may provide a validity check to assess what happens when particular variables take extreme values. However, it is likely to grossly underestimate overall uncertainty, and ignores correlation between parameters.”
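A one-way analysis is easy to sketch. The toy model, parameter names, base-case values and ranges below are all invented by me for the purpose of illustration; the point is only the mechanics of varying one parameter at a time while holding the others at their base-case values:

```python
BASE = {
    "cost_drug": 2000.0,     # cost of the new treatment per patient
    "p_response": 0.40,      # probability that the treatment works
    "cost_averted": 1500.0,  # downstream cost avoided per responder
    "qaly_gain": 0.50,       # QALYs gained per responder
}
RANGES = {
    "cost_drug": (1500.0, 3000.0),
    "p_response": (0.25, 0.55),
    "cost_averted": (500.0, 2500.0),
    "qaly_gain": (0.30, 0.70),
}


def icer(params):
    """Incremental cost-effectiveness ratio of the toy model."""
    d_cost = params["cost_drug"] - params["p_response"] * params["cost_averted"]
    d_qaly = params["p_response"] * params["qaly_gain"]
    return d_cost / d_qaly


def one_way(base, ranges):
    """Vary each parameter singly over its range; return the ICER range for each."""
    results = {}
    for name, (low, high) in ranges.items():
        values = [icer({**base, name: v}) for v in (low, high)]
        results[name] = (min(values), max(values))
    return results


results = one_way(BASE, RANGES)
print(f"base-case ICER: {icer(BASE):.0f} per QALY")
# Sort by width of the ICER range, widest first (the ordering of a tornado diagram).
for name, (lo, hi) in sorted(results.items(), key=lambda kv: kv[1][0] - kv[1][1]):
    print(f"{name:13s} {lo:8.0f} to {hi:8.0f}")
```

Each row only tells you what happens when a single input moves on its own, which is precisely why this kind of analysis says little about overall uncertainty.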

Multi-way sensitivity analysis is a more refined approach, in which more than one parameter estimate is varied – this is sometimes termed scenario analysis. A different approach is threshold analysis, where one attempts to identify the critical value of one or more variables so that the conclusion/decision changes. All of these approaches are deterministic approaches, and they are not without problems. “They fail to take account of the joint parameter uncertainty and correlation between parameters, and rather than providing the decision-maker with a useful indication of the likelihood of a result, they simply provide a range of results associated with varying one or more input estimates.” So of course an alternative has been developed, namely probabilistic sensitivity analysis (PSA), which already in the mid-1980s started to be used in health economic decision analyses.
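As for the threshold analysis mentioned above, a minimal sketch could look like the one below. The toy model, the default parameter values, and the willingness-to-pay figure of 20,000 per QALY are all hypothetical choices of mine; the code just searches (by bisection) for the drug cost at which the ICER crosses that threshold, i.e. the point where the decision would flip:

```python
WTP = 20000.0  # hypothetical willingness to pay per QALY


def icer(cost_drug, p_response=0.40, cost_averted=1500.0, qaly_gain=0.50):
    """ICER of a toy model as a function of the drug cost."""
    return (cost_drug - p_response * cost_averted) / (p_response * qaly_gain)


def threshold_price(lo=0.0, hi=20000.0, tol=1e-6):
    """Bisection for the drug cost at which the ICER equals WTP.

    Assumes the ICER is increasing in cost_drug and crosses WTP within [lo, hi].
    """
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if icer(mid) < WTP:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


critical_cost = threshold_price()
print(f"the drug is cost-effective at WTP {WTP:.0f} up to a cost of {critical_cost:.0f}")
```

In a model this simple the threshold could of course be solved for by hand (it is 4600 here); the numerical search is only meant to show the general idea for models where that is not possible.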

“PSA permits the joint uncertainty across all the parameters in the model to be addressed at the same time. It involves sampling model parameter values from distributions imposed on variables in the model. […] The types of distribution imposed are dependent on the nature of the input parameters [but] decision analytic models for the purpose of economic evaluation tend to use homogenous types of input parameters, namely costs, life-years, QALYs, probabilities, and relative treatment effects, and consequently the number of distributions that are frequently used, such as the beta, gamma, and log-normal distributions, is relatively small. […] Uncertainty is then propagated through the model by randomly selecting values from these distributions for each model parameter using Monte Carlo simulation”.
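Here is a rough sketch of what such a simulation might look like for a toy model. The model structure and every distribution parameter are assumptions of mine, picked only so that a beta distribution is used for a probability, a log-normal for a relative risk, and gamma distributions for costs; the parameters are also drawn independently, which a real analysis would not necessarily want to assume:

```python
import math
import random
import statistics


def draw(rng):
    """One Monte Carlo draw of (incremental cost, incremental QALYs)."""
    p_event = rng.betavariate(30, 70)                   # baseline event probability
    rel_risk = rng.lognormvariate(math.log(0.7), 0.10)  # relative treatment effect
    cost_drug = rng.gammavariate(100.0, 20.0)           # shape, scale: mean 2000
    cost_event = rng.gammavariate(25.0, 400.0)          # shape, scale: mean 10000
    qaly_loss = 1.0                                     # QALYs lost per event (fixed)
    events_averted = p_event * (1.0 - rel_risk)
    return cost_drug - events_averted * cost_event, events_averted * qaly_loss


def psa(n_draws=10000, seed=0):
    rng = random.Random(seed)
    return [draw(rng) for _ in range(n_draws)]


def prob_cost_effective(draws, wtp):
    """Share of draws with positive net monetary benefit at a given WTP."""
    return sum(1 for d_cost, d_qaly in draws if wtp * d_qaly - d_cost > 0) / len(draws)


draws = psa()
mean_d_cost = statistics.fmean(c for c, _ in draws)
mean_d_qaly = statistics.fmean(q for _, q in draws)
print(f"mean incremental cost:  {mean_d_cost:.0f}")
print(f"mean incremental QALYs: {mean_d_qaly:.3f}")
for wtp in (5000, 10000, 20000, 50000):
    print(f"P(cost-effective | WTP={wtp:>5}) = {prob_cost_effective(draws, wtp):.2f}")
```

Unlike the deterministic approaches, the output here is a probability that the intervention is cost-effective at a given willingness to pay, which is the sort of ‘likelihood of a result’ the earlier quote says decision-makers are not getting from one-way or scenario analyses.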

## Random Stuff / Open Thread

This is not a very ‘meaty’ post, but it’s been a long time since I had one of these and I figured it was time for another one. As always links and comments are welcome.

…

i. The unbearable accuracy of stereotypes. I made a mental note of reading this paper later a long time ago, but I’ve been busy with other things. Today I skimmed it and decided that it looks interesting enough to give it a detailed read later. Some remarks from the summary towards the end of the paper:

“The scientific evidence provides more evidence of accuracy than of inaccuracy in social stereotypes. The most appropriate generalization based on the evidence is that people’s beliefs about groups are usually moderately to highly accurate, and are occasionally highly inaccurate. […] This pattern of empirical support for moderate to high stereotype accuracy is not unique to any particular target or perceiver group. Accuracy has been found with racial and ethnic groups, gender, occupations, and college groups. […] The pattern of moderate to high stereotype accuracy is not unique to any particular research team or methodology. […] This pattern of moderate to high stereotype accuracy is not unique to the substance of the stereotype belief. It occurs for stereotypes regarding personality traits, demographic characteristics, achievement, attitudes, and behavior. […] The strong form of the exaggeration hypothesis – either defining stereotypes as exaggerations or as claiming that stereotypes usually lead to exaggeration – is not supported by data. Exaggeration does sometimes occur, but it does not appear to occur much more frequently than does accuracy or underestimation, and may even occur less frequently.”

I should perhaps note that this research is closely linked to Funder’s research on personality judgment, which I’ve previously covered on the blog here and here.

…

ii. I’ve spent approximately 150 hours on vocabulary.com altogether at this point (having ‘mastered’ ~10,200 words in the process). A few words I’ve recently encountered on the site: Nescience (note to self: if someone calls you ‘nescient’ during a conversation, in many contexts that’ll be an insult, not a compliment) (Related note to self: I should find myself some smarter enemies, who use words like ‘nescient’…), eristic, carrel, oleaginous, decal, gable, epigone, armoire, chalet, cashmere, arrogate, ovine.

…

iii. why p = .048 should be rare (and why this feels counterintuitive).

…

iv. A while back I posted a few comments on SSC and I figured I might as well link to them here (at least it’ll make it easier *for me* to find them later on). Here is where I posted a few comments on a recent study dealing with Ramadan-related IQ effects, a topic which I’ve covered here on the blog before, and here I discuss some of the benefits of not having low self-esteem.

On a completely unrelated note, today I left a comment in a reddit thread about ‘Books That Challenged You / Made You See the World Differently’ which may also be of interest to readers of this blog. I realized while writing the comment that this question is probably getting more and more difficult for me to answer as time goes by. It really all depends upon *what part of the world* you want to see in a different light; which aspects you’re most interested in. For people wondering about where the books about mathematics and statistics were in that comment (I do like to think these fields play some role in terms of ‘how I see the world’), I wasn’t really sure which book to include on such topics, if any; I can’t think of any single math or stats textbook that’s dramatically changed the way I thought about the world – to the extent that my knowledge about these topics has changed how I think about the world, it’s been a long drawn-out process.

…

v. Chess…

People who care the least bit about such things probably already know that a really strong tournament is currently being played in St. Louis, the so-called Sinquefield Cup, so I’m not going to talk about that here (for resources and relevant links, go here).

I talked about the strong rating pools on ICC not too long ago, but one thing I did not mention when discussing this topic back then was that yes, I also occasionally win against some of those grandmasters the rating pool throws at me – at least I’ve won a few times against GMs by now in bullet. I’m aware that for many ‘serious chess players’ bullet ‘doesn’t really count’ because the time dimension is much more important than it is in other chess settings, but to people who think skill doesn’t matter much in bullet I’d say they should have a match with Hikaru Nakamura and see how well they do against him (if you’re interested in how that might turn out, see e.g. this video – and keep in mind that at the beginning of the video Nakamura had already won 8 games in a row, out of 8, against his opponent in the first games, who incidentally is not exactly a beginner). The skill-sets required do not overlap perfectly between bullet and classical time control games, but when I started playing bullet online I quickly realized that good players really require very little time to completely outplay people who just play random moves (fast). Below I have posted a screencap I took while kibitzing a game of one of my former opponents, an anonymous GM from Germany, against whom I currently have a 2.5/6 score, with two wins, one draw, and three losses (see the ‘My score vs CPE’ box).

I like to think of a score like this as at least some kind of accomplishment, though admittedly perhaps not a very big one.

Also in chess-related news, I’m currently reading Jesús de la Villa’s 100 Endgames book, which Christof Sielecki has said some very nice things about. A lot of the stuff I’ve encountered so far is stuff I’ve seen before, positions I’ve already encountered and worked on, endgame principles I’m familiar with, etc., but not all of it is known stuff and I really like the structure of the book. There are a lot of pages left, and as it is I’m planning to read this book from cover to cover, which is something I usually do not do when I read chess books (few people do, judging from various comments I’ve seen people make in all kinds of different contexts).

Lastly, a lecture:

## Cost-effectiveness analysis in health care (I)

Yesterday’s SMBC was awesome, and I couldn’t help myself from including it here (click to view full size):

…

In a way the three words I chose to omit from the post title are rather important in order to know which kind of book this is – the full title of Gray et al.’s work is: *Applied Methods of* … – but as I won’t be talking much about the ‘applied’ part in my coverage here, focusing instead on broader principles etc. which will be easier for people without a background in economics to follow, I figured I might as well omit those words from the post titles. I should also admit that I personally did not spend much time on the exercises, as this did not seem necessary in view of what I was using the book for. Despite not having spent much time on the exercises myself, I incidentally did reward the authors for including occasionally quite detailed coverage of technical aspects in my rating of the book on goodreads; I feel confident from the coverage that if I need to apply some of the methods they talk about in the book later on, the book will do a good job of helping me get things right. All in all, the book’s coverage made it hard for me not to give it 5 stars – so that was what I did.

I own an actual physical copy of the book, which makes blogging it more difficult than usual; I prefer blogging e-books. The greater amount of work involved in covering physical books is also one reason why I have yet to talk about Eysenck & Keane’s Cognitive Psychology text here on the blog, despite having read more than 500 pages of that book (it’s not that the book is boring). My coverage of the contents of both this book and the Eysenck & Keane book will (assuming I ever get around to blogging the latter, that is) be less detailed than it could have been, but on the other hand it’ll likely be very focused on key points and observations from the coverage.

I have talked about cost-effectiveness before here on the blog, e.g. here, but in my coverage of the book below I have not tried to avoid making points or including observations which I’ve already made elsewhere on the blog; it’s too much work to keep track of such things. With those introductory remarks out of the way, let’s move on to some observations made in the book:

…

“In cost-effectiveness analysis we first calculate the costs and effects of an intervention and one or more alternatives, then calculate the differences in cost and differences in effect, and finally present these differences in the form of a ratio, i.e. the cost per unit of health outcome effect […]. Because the focus is on differences between two (or more) options or treatments, analysts typically refer to incremental costs, incremental effects, and the incremental cost-effectiveness ratio (ICER). Thus, if we have two options *a* and *b*, we calculate their respective costs and effects, then calculate the difference in costs and difference in effects, and then calculate the ICER as the difference in costs divided by the difference in effects […] cost-effectiveness analyses which measure outcomes in terms of QALYs are sometimes referred to as cost-utility studies […] but are sometimes simply considered as a subset of cost-effectiveness analysis.”
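The ICER calculation described in the quote above is simple enough to sketch in a few lines. The function name and all the numbers below are made up for illustration; they are not taken from the book.

```python
def icer(cost_a, effect_a, cost_b, effect_b):
    """Incremental cost-effectiveness ratio of option a relative to option b."""
    delta_cost = cost_a - cost_b        # incremental cost
    delta_effect = effect_a - effect_b  # incremental effect (e.g. QALYs)
    if delta_effect == 0:
        raise ValueError("ICER is undefined when both options have the same effect")
    return delta_cost / delta_effect


# Hypothetical options: a costs 12,000 and yields 6.5 QALYs,
# b costs 8,000 and yields 6.0 QALYs.
print(icer(12_000, 6.5, 8_000, 6.0))  # 8000.0, i.e. 8,000 per QALY gained
```

Note that the ratio alone does not say whether option a is worth adopting; that depends on how much the decision-maker is willing to pay per unit of health gain.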

“Cost-effectiveness analysis places no monetary value on the health outcomes it is comparing. It does not measure or attempt to measure the underlying worth or value to society of gaining additional QALYs, for example, but simply indicates which options will permit more QALYs to be gained than others with the same resources, assuming that gaining QALYs is agreed to be a reasonable objective for the health care system. Therefore the cost-effectiveness approach will never provide a way of determining how much in total it is worth spending on health care and the pursuit of QALYs rather than on other social objectives such as education, defence, or private consumption. It does not permit us to say whether health care spending is too high or too low, but rather confines itself to the question of how any given level of spending can be arranged to maximize the health outcomes yielded.

In contrast, cost-benefit analysis (CBA) does attempt to place some monetary valuation on health outcomes as well as on health care resources. […] The reasons for the more widespread use of cost-effectiveness analysis compared with cost-benefit analysis in health care are discussed extensively elsewhere, […] but two main issues can be identified. Firstly, significant conceptual or practical problems have been encountered with the two principal methods of obtaining monetary valuations of life or quality of life: the human capital approach […] and the willingness to pay approach […] Second, within the health care sector there remains a widespread and intrinsic aversion to the concept of placing explicit monetary values on health or life. […] The cost-benefit approach should […], in principle, permit broad questions of **allocative efficiency** to be addressed. […] In contrast, cost-effectiveness analysis can address questions of **productive** or **production efficiency**, where a specified good or service is being produced at the lowest possible cost – in this context, health gain using the health care budget.”

“when working in the two-dimensional world of cost-effectiveness analysis, there are two uncertainties that will be encountered. Firstly, there will be uncertainty concerning the location of the intervention on the cost-effectiveness plane: how much more or less effective and how much more or less costly it is than current treatment. Second, there is uncertainty concerning how much the decision-maker is willing to pay for health gain […] these two uncertainties can be presented together in the form of the question ‘What is the probability that this intervention is cost-effective?’, a question which effectively divides our cost-effectiveness plane into just two policy spaces – below the maximum acceptable line, and above it”.
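One way to make the ‘probability that this intervention is cost-effective’ question concrete is to take a set of simulated (incremental cost, incremental effect) pairs and count how many fall below the maximum acceptable line for a given willingness-to-pay threshold. The sketch below assumes such draws are already available (e.g. from a bootstrap or a probabilistic model); the distributions and thresholds used here are invented purely for illustration.

```python
import random


def prob_cost_effective(delta_costs, delta_effects, threshold):
    """Share of draws lying below the line delta_cost = threshold * delta_effect."""
    hits = sum(
        1
        for dc, de in zip(delta_costs, delta_effects)
        if threshold * de - dc > 0
    )
    return hits / len(delta_costs)


rng = random.Random(0)
# Hypothetical uncertainty about the intervention's location on the plane.
delta_effects = [rng.gauss(0.5, 0.3) for _ in range(10_000)]
delta_costs = [rng.gauss(4_000, 2_000) for _ in range(10_000)]

# Second uncertainty: the threshold itself, so report a range of values.
for threshold in (5_000, 10_000, 20_000, 30_000):
    p = prob_cost_effective(delta_costs, delta_effects, threshold)
    print(threshold, round(p, 3))
```

Reporting the probability across a range of thresholds is what handles the second uncertainty mentioned in the quote, since the decision-maker's maximum acceptable ratio is usually not known exactly.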

“Conventionally, cost-effectiveness ratios that have been calculated against a baseline or do-nothing option without reference to any alternatives are referred to as *average* cost-effectiveness ratios, while comparisons with the next best alternative are described as *incremental* cost-effectiveness ratios […] it is quite misleading to calculate average cost-effectiveness ratios, as they ignore the alternatives available.”

“A life table provides a method of summarizing the mortality experience of a group of individuals. […] There are two main types of life table. First, there is a **cohort life table**, which is constructed based on the mortality experience of a group of individuals […]. While this approach can be used to characterize life expectancies of insects and some animals, human longevity makes this approach difficult to apply as the observation period would have to be sufficiently long to be able to observe the death of all members of the cohort. Instead, **current life tables** are normally constructed using cross-sectional data of observed mortality rates at different ages at a given point in time […] Life tables can also be classified according to the intervals over which changes in mortality occur. A **complete life table** displays the various rates for each year of life; while an **abridged life table** deals with greater periods of time, for example 5 year age intervals […] A life table can be used to generate a survival curve S(x) for the population at any point in time. This represents the probability of surviving beyond a certain age x (i.e. S(x)=Pr[X>x]). […] The chance of a male living to the age of 60 years is high (around 0.9) [in the UK, presumably – *US*] and so the survival curve is comparatively flat up until this age. The proportion dying each year from the age of 60 years rapidly increases, so the curve has a much steeper downward slope. In the last part of the survival curve there is an inflection, indicating a slowing rate of increase in the proportion dying each year among the very old (over 90 years). […] The hazard rate is the slope of the survival curve at any point, given the instantaneous chance of an individual dying.”
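To illustrate how a survival curve falls out of a life table, here is a minimal sketch which assumes the table supplies, for each age x, the probability q_x of dying before age x+1 given survival to age x. The q_x values below are invented and cover only a handful of ages. (As an aside, the hazard is usually defined as −S′(x)/S(x) rather than as the raw slope of the curve, so the wording in the quote should probably be read loosely.)

```python
def survival_curve(death_probs):
    """Return S(0), S(1), ..., where death_probs[x] = P(die before x+1 | alive at x)."""
    surv = [1.0]
    for q in death_probs:
        surv.append(surv[-1] * (1.0 - q))
    return surv


def expected_years(surv):
    """Expected number of complete years lived within the age range of the table."""
    return sum(surv[1:])


# Hypothetical one-year death probabilities for five consecutive ages.
q = [0.01, 0.02, 0.05, 0.10, 0.20]
s = survival_curve(q)
print([round(v, 4) for v in s])
print(round(expected_years(s), 3))
```

Comparing `expected_years` for two sets of death probabilities (with and without an intervention) gives a crude version of the life-expectancy gain discussed further down; a real analysis would need a table running to the oldest ages.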

“Life tables are a useful tool for estimating changes in life expectancies from interventions that reduce mortality. […] Multiple-cause life tables are a way of quantifying outcomes when there is more than one mutually exclusive cause of death. These life tables can estimate the potential gains from the elimination of a cause of death and are also useful in calculating the benefits of interventions that reduce the risk of a particular cause of death. […] One issue that arises when death is divided into multiple causes in this type of life table is **competing risk**. […] competing risk can arise ‘when an individual can experience more than one type of event and the occurrence of one type of event hinders the occurrence of other types of events’. Competing risks affect life tables, as those who die from a specific cause have no chance of dying from other causes during the remainder of the interval […]. In practice this will mean that as soon as one cause is eliminated the probabilities of dying of other causes increase […]. Several methods have been proposed to correct for competing risks when calculating life tables.”

“the use of published life-table methods may have limitations, especially when considering particular populations which may have very different risks from the general population. In these cases, there are a host of techniques referred to as **survival analysis** which enables risks to be estimated from patient-level data. […] Survival analysis typically involves observing one or more outcomes in a population of interest over a period of time. The outcome, which is often referred to as an **event** or **endpoint** could be death, a non-fatal outcome such as a major clinical event (e.g. myocardial infarction), the occurrence of an adverse event, or even the date of first non-compliance with a therapy.”

“A key feature of survival data is censoring, which occurs whenever the event of interest is not observed within the follow-up period. This does not mean that the event will not occur some time in the future, just that it has not occurred while the individual was observed. […] The most common case of censoring is referred to as **right censoring**. This occurs whenever the observation of interest occurs after the observation period. […] An alternative form of censoring is **left censoring**, which occurs when there is a period of time when the individuals are at risk prior to the observation period.

A key feature of most survival analysis methods is that they assume that the censoring process is **non-informative**, meaning that there is no dependence between the time to the event of interest and the process that is causing the censoring. However, if the duration of observation is related to the severity of a patient’s disease, for example if patients with more advanced illness are withdrawn early from the study, the censoring is likely to be informative and other techniques are required”.
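The quotes above do not name a specific estimator, but the standard way of handling right-censored data under the non-informative censoring assumption is the Kaplan–Meier (product-limit) estimator, so that is what the sketch below uses. The function name and the tiny data set are hypothetical.

```python
def kaplan_meier(times, observed):
    """Kaplan-Meier estimate of S(t) at each distinct event time.

    times:    follow-up time for each individual
    observed: True if the event was seen at that time, False if right-censored
    """
    data = sorted(zip(times, observed))
    at_risk = len(data)
    surv = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        events = 0
        leaving = 0
        # Handle everyone whose follow-up ends at time t in one step.
        while i < len(data) and data[i][0] == t:
            events += 1 if data[i][1] else 0
            leaving += 1
            i += 1
        if events:
            surv *= 1.0 - events / at_risk
            curve.append((t, surv))
        # Censored individuals leave the risk set without changing S(t).
        at_risk -= leaving
    return curve


# Hypothetical follow-up data: two of the five individuals are censored.
times = [2, 3, 3, 5, 8]
observed = [True, True, False, True, False]
for t, s in kaplan_meier(times, observed):
    print(t, round(s, 3))
```

The censored individuals still contribute to the risk set up to the point where they drop out, which is how the estimator uses the partial information they provide. If censoring is informative, as in the withdrawal example in the quote, this estimate will be biased.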

“Differences in the composition of the intervention and control groups at the end of follow-up may have important implications for estimating outcomes, especially when we are interested in extrapolation. If we know that the intervention group is older and has a lower proportion of females, we would expect these characteristics to increase the hazard mortality in this group over their remaining lifetimes. However, if the intervention group has experienced a lower number of events, this may significantly reduce the hazard for some individuals. They may also benefit from a past treatment which continues to reduce the hazard of a primary outcome such as death. This effect […] is known as the **legacy effect**”.

“Changes in life expectancy are a commonly used outcome measure in economic evaluation. […] Table 4.6 shows selected examples of estimates of the gain in life expectancy for various interventions reported by Wright and Weinstein (1998) […] Gains in life expectancy from preventative interventions in populations of average risk generally ranged from a few days to slightly more than a year. […] The gains in life expectancy from preventing or treating disease in persons at elevated risk [*this type of prevention is known as ‘secondary-‘ and/or ‘tertiary prevention’ (depending on the circumstances), as opposed to ‘primary prevention’ – the distinction between primary prevention and more targeted approaches is often important in public health contexts, because the level of targeting will often interact with the cost-effectiveness dimension* – *US*] are generally greater […*one reason why this does not necessarily mean that targeted approaches are always better is that search costs will often be an increasing function of the level of targeting – US*]. Interventions that treat established disease vary, with gains in life-expectancy ranging from a few months […] to as long as nine years […] the point that Wright and Weinstein (1998) were making was not that absolute gains vary, but that a gain in life expectancy of a month from a preventive intervention targeted at population at average risk and a gain of a year from a preventive intervention targeted at populations at elevated risk could both be considered large. It should also be noted that interventions that produce a comparatively small gain in life expectancy when averaged across the population […] may still be very cost-effective.”

## Model Selection and Multi-Model Inference (II)

I haven’t really blogged this book in anywhere near the amount of detail it deserves even though my first post about the book actually had a few quotes illustrating how much different stuff is covered in the book.

This book is technical, and even if I’m trying to make it less technical by omitting the math in this post it may be a good idea to reread the first post about the book before reading this post to refresh your knowledge of these things.

Quotes and comments below – most of the coverage here focuses on stuff covered in chapters 3 and 4 in the book.

…

“Tests of null hypotheses and information-theoretic approaches should not be used together; they are very different analysis paradigms. A very common mistake seen in the applied literature is to use AIC to rank the candidate models and then “test” to see whether the best model (the alternative hypothesis) is “significantly better” than the second-best model (the null hypothesis). This procedure is flawed, and we strongly recommend against it […] the primary emphasis should be on the size of the treatment effects and their precision; too often we find a statement regarding “significance,” while the treatment and control means are not even presented. Nearly all statisticians are calling for estimates of effect size and associated precision, rather than test statistics, P-values, and “significance.” [*Borenstein & Hedges certainly did as well in their book (written much later), and this was not an issue I omitted to talk about in my coverage of their book…*] […] Information-theoretic criteria such as AIC, AICc, and QAICc are not a “test” in any sense, and there are no associated concepts such as test power or P-values or α-levels. Statistical hypothesis testing represents a very different, and generally inferior, paradigm for the analysis of data in complex settings. **It seems best to avoid use of the word “significant” in reporting research results under an information-theoretic paradigm.** […] AIC allows a ranking of models and the identification of models that are nearly equally useful versus those that are clearly poor explanations for the data at hand […]. Hypothesis testing provides no general way to rank models, even for models that are nested. […] In general, we recommend strongly against the use of null hypothesis testing in model selection.”
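For readers who have not seen the criteria written out, the sketch below uses the standard definitions AIC = −2 log L + 2K and AICc = AIC + 2K(K+1)/(n−K−1), where K is the number of estimated parameters and n the sample size, and then ranks a few candidate models by their differences from the best one. The model names, log-likelihoods and sample size are invented for illustration.

```python
def aic(log_lik, k):
    """Akaike's information criterion from a maximized log-likelihood and K parameters."""
    return -2.0 * log_lik + 2.0 * k


def aicc(log_lik, k, n):
    """Small-sample corrected AIC; requires n > k + 1."""
    return aic(log_lik, k) + (2.0 * k * (k + 1)) / (n - k - 1)


def delta_aic(values):
    """Differences between each criterion value and the smallest one."""
    best = min(values)
    return [v - best for v in values]


# Hypothetical candidate set: name -> (maximized log-likelihood, K).
models = {"g1": (-120.4, 2), "g2": (-117.9, 3), "g3": (-117.6, 5)}
n = 40
scores = {name: aicc(ll, k, n) for name, (ll, k) in models.items()}
best = min(scores.values())
for name, score in sorted(scores.items(), key=lambda kv: kv[1]):
    print(f"{name}: AICc={score:.2f} delta={score - best:.2f}")
```

The output is a ranking with differences, not a test: there is no p-value or α-level anywhere in the calculation, which is the point the authors are making.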

“The bootstrap is a type of Monte Carlo method used frequently in applied statistics. This computer-intensive approach is based on resampling of the observed data […] The fundamental idea of the model-based sampling theory approach to statistical inference is that the data arise as a sample from some conceptual probability distribution *f*. Uncertainties of our inferences can be measured if we can estimate *f*. The bootstrap method allows the computation of measures of our inference uncertainty by having a simple empirical estimate of *f* and sampling from this estimated distribution. In practical application, the empirical bootstrap means using some form of resampling with replacement from the actual data x to generate B (e.g., B = 1,000 or 10,000) bootstrap samples […] The set of B bootstrap samples is a proxy for a set of B independent real samples from *f* (in reality we have only one actual sample of data). Properties expected from replicate real samples are inferred from the bootstrap samples by analyzing each bootstrap sample exactly as we first analyzed the real data sample. From the set of results of sample size B we measure our inference uncertainties from sample to (conceptual) population […] For many applications it has been theoretically shown […] that the bootstrap can work well for large sample sizes (n), but it is not generally reliable for small n […], regardless of how many bootstrap samples B are used. […] Just as the analysis of a single data set can have many objectives, the bootstrap can be used to provide insight into a host of questions. For example, for each bootstrap sample one could compute and store the conditional variance–covariance matrix, goodness-of-fit values, the estimated variance inflation factor, the model selected, confidence interval width, and other quantities. Inference can be made concerning these quantities, based on summaries over the B bootstrap samples.”
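A minimal sketch of the resampling idea: draw B samples with replacement from the observed data, analyze each one the same way as the original sample, and summarize the spread of the results. The data here are simulated, the statistic is just the mean, and the interval is a simple percentile interval; all of these are choices made for illustration rather than anything prescribed in the book.

```python
import random
import statistics


def bootstrap_estimates(data, statistic, b=1000, seed=0):
    """Apply `statistic` to b resamples (with replacement) of `data`."""
    rng = random.Random(seed)
    n = len(data)
    return [
        statistic([data[rng.randrange(n)] for _ in range(n)])
        for _ in range(b)
    ]


# Hypothetical observed sample of 30 values.
rng = random.Random(1)
sample = [rng.gauss(10.0, 2.0) for _ in range(30)]

boot = sorted(bootstrap_estimates(sample, statistics.mean, b=2000))
se = statistics.stdev(boot)               # bootstrap standard error
lower = boot[int(0.025 * len(boot))]      # simple percentile interval
upper = boot[int(0.975 * len(boot)) - 1]
print(f"mean={statistics.mean(sample):.2f} se={se:.2f} interval=({lower:.2f}, {upper:.2f})")
```

In a model selection setting the `statistic` would be the whole analysis (fit all candidate models, pick the best, store the estimate), which is how the bootstrap can be used to look at model selection frequencies. Increasing `b` does not help with the small-n problem the authors mention.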

“**Information criteria attempt only to select the best model from the candidate models available; if a better model exists, but is not offered as a candidate, then the information-theoretic approach cannot be expected to identify this new model**. Adjusted R^{2} […] are useful as a measure of the proportion of the variation “explained,” [but] are not useful in model selection […] adjusted R^{2} is poor in model selection; its usefulness should be restricted to description.”

“As we have struggled to understand the larger issues, it has become clear to us that inference based on only a single best model is often relatively poor for a wide variety of substantive reasons. Instead, we increasingly favor multimodel inference: procedures to allow formal statistical inference from all the models in the set. […] Such multimodel inference includes model averaging, incorporating model selection uncertainty into estimates of precision, confidence sets on models, and simple ways to assess the relative importance of variables.”

“If sample size is small, one must realize that relatively little information is probably contained in the data (unless the effect size is very substantial), and the data may provide few insights of much interest or use. Researchers routinely err by building models that are far too complex for the (often meager) data at hand. They do not realize how little structure can be reliably supported by small amounts of data that are typically “noisy.””

“Sometimes, the selected model [when applying an information criterion] contains a parameter that is constant over time, or areas, or age classes […]. This result should not imply that there is no variation in this parameter, rather that parsimony and its bias/variance tradeoff finds the actual variation in the parameter to be relatively small in relation to the information contained in the sample data. It “costs” too much in lost precision to add estimates of all of the individual *θ*_{i}. As the sample size increases, then at some point a model with estimates of the individual parameters would likely be favored. Just because a parsimonious model contains a parameter that is constant across strata does not mean that there is no variation in that process across the strata.”

“[In a significance testing context,] a significant test result does not relate directly to the issue of what approximating model is best to use for inference. One model selection strategy that has often been used in the past is to do likelihood ratio tests of each structural factor […] and then use a model with all the factors that were “significant” at, say, α = 0.05. However, there is no theory that would suggest that this strategy would lead to a model with good inferential properties (i.e., small bias, good precision, and achieved confidence interval coverage at the nominal level). […] The purpose of the analysis of empirical data is not to find the “true model”— not at all. Instead, we wish to find a best approximating model, based on the data, and then develop statistical inferences from this model. […] We search […] not for a “true model,” but rather for a parsimonious model giving an accurate approximation to the interpretable information in the data at hand. Data analysis involves the question, “What level of model complexity will the data support?” and both under- and overfitting are to be avoided. Larger data sets tend to support more complex models, and the selection of the size of the model represents a tradeoff between bias and variance.”

“The easy part of the information-theoretic approaches includes both the computational aspects and the clear understanding of these results […]. The hard part, and the one where training has been so poor, is the a priori thinking about the science of the matter before data analysis — even before data collection. It has been too easy to collect data on a large number of variables in the hope that a fast computer and sophisticated software will sort out the important things — the “significant” ones […]. Instead, a major effort should be mounted to understand the nature of the problem by critical examination of the literature, talking with others working on the general problem, and thinking deeply about alternative hypotheses. Rather than “test” dozens of trivial matters (is the correlation zero? is the effect of the lead treatment zero? are ravens pink?, Anderson et al. 2000), there must be a more concerted effort to provide evidence on *meaningful* questions that are important to a discipline. This is the critical point: the common failure to address important science questions in a fully competent fashion. […] “Let the computer find out” is a poor strategy for researchers who do not bother to think clearly about the problem of interest and its scientific setting. *The sterile analysis of “just the numbers” will continue to be a poor strategy for progress in the sciences.*

Researchers often resort to using a computer program that will examine all possible models and variables automatically. Here, the hope is that the computer will discover the important variables and relationships […] The primary mistake here is a common one: the failure to posit a small set of a priori models, each representing a plausible research hypothesis.”

“Model selection is most often thought of as a way to select just the best model, then inference is conditional on that model. However, information-theoretic approaches are more general than this simplistic concept of model selection. Given a set of models, specified independently of the sample data, we can make formal inferences based on the entire set of models. […] Part of multimodel inference includes ranking the fitted models from best to worst […] and then scaling to obtain the relative plausibility of each fitted model (*g*_{i}) by a weight of evidence (*w*_{i}) relative to the selected best model. Using the conditional sampling variance […] from each model and the Akaike weights […], unconditional inferences about precision can be made over the entire set of models. Model-averaged parameter estimates and estimates of unconditional sampling variances can be easily computed. Model selection uncertainty is a substantial subject in its own right, well beyond just the issue of determining the best model.”

“There are three general approaches to assessing model selection uncertainty: (1) theoretical studies, mostly using Monte Carlo simulation methods; (2) the bootstrap applied to a given set of data; and (3) utilizing the set of AIC differences (i.e., ∆_{i}) and model weights *w*_{i} from the set of models fit to data.”

“Statistical science should emphasize estimation of parameters and associated measures of estimator uncertainty. Given a correct model […], an MLE is reliable, and we can compute a reliable estimate of its sampling variance and a reliable confidence interval […]. If the model is selected entirely independently of the data at hand, and is a good approximating model, and if n is large, then the estimated sampling variance is essentially unbiased, and any appropriate confidence interval will essentially achieve its nominal coverage. This would be the case if we used only one model, decided on a priori, and it was a good model, *g*, of the data generated under truth, *f*. However, even when we do objective, data-based model selection (which we are advocating here), the [model] selection process is expected to introduce an added component of sampling uncertainty into any estimated parameter; hence classical theoretical sampling variances are too small: They are conditional on the model and do not reflect model selection uncertainty. One result is that conditional confidence intervals can be expected to have less than nominal coverage.”

“Data analysis is sometimes focused on the variables to include versus exclude in the selected model (e.g., important vs. unimportant). Variable selection is often the focus of model selection for linear or logistic regression models. Often, an investigator uses stepwise analysis to arrive at a final model, and from this a conclusion is drawn that the variables in this model are important, whereas the other variables are not important. While common, this is poor practice and, among other issues, fails to fully consider model selection uncertainty. […] Estimates of the relative importance of predictor variables *x*_{j} can best be made by summing the Akaike weights across all the models in the set where variable *j* occurs. Thus, the relative importance of variable *j* is reflected in the sum *w*_{+}(*j*). The larger the *w*_{+}(*j*) the more important variable *j* is, relative to the other variables. Using the *w*_{+}(*j*), all the variables can be ranked in their importance. […] This idea extends to subsets of variables. For example, we can judge the importance of a pair of variables, as a pair, by the sum of the Akaike weights of all models that include the pair of variables. […] To summarize, in many contexts the AIC selected best model will include some variables and exclude others. Yet this inclusion or exclusion by itself does not distinguish differential evidence for the importance of a variable in the model. The model weights […] summed over all models that include a given variable provide a better weight of evidence for the importance of that variable in the context of the set of models considered.” [*The reason why I’m not telling you how to calculate Akaike weights is that I don’t want to bother with math formulas in wordpress – but I guess all you need to know is that these are not hard to calculate. It should perhaps be added that one can also use bootstrapping methods to obtain relevant model weights to apply in a multimodel inference context.*]
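For completeness, the usual definition of the Akaike weight of model i is exp(−∆_{i}/2) divided by the sum of exp(−∆_{r}/2) over all models in the set, where ∆_{i} is the model's AIC difference from the best model. The sketch below computes these weights and then sums them per variable to get the relative importance measure described in the quote. The AIC values, model contents and variable names are hypothetical.

```python
import math


def akaike_weights(aic_values):
    """Akaike weights from a list of AIC (or AICc) values, one per model."""
    best = min(aic_values)
    rel_lik = [math.exp(-0.5 * (a - best)) for a in aic_values]
    total = sum(rel_lik)
    return [r / total for r in rel_lik]


def variable_importance(model_vars, weights):
    """Sum the weights of all models in which each variable occurs."""
    importance = {}
    for variables, w in zip(model_vars, weights):
        for v in variables:
            importance[v] = importance.get(v, 0.0) + w
    return importance


# Hypothetical candidate set: AIC value and included variables for each model.
aics = [204.1, 205.3, 209.8, 212.0]
model_vars = [{"x1", "x2"}, {"x1"}, {"x2", "x3"}, {"x3"}]

weights = akaike_weights(aics)
print([round(w, 3) for w in weights])
for name, imp in sorted(variable_importance(model_vars, weights).items()):
    print(name, round(imp, 3))
```

The importance sums are only comparable across variables if each variable appears in a similar number of candidate models, so the design of the model set matters here as well.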

“If data analysis relies on model selection, then inferences should acknowledge model selection uncertainty. If the goal is to get the best estimates of a set of parameters in common to all models (this includes prediction), model averaging is recommended. If the models have definite, and differing, interpretations as regards understanding relationships among variables, and it is such understanding that is sought, then one wants to identify the best model and make inferences based on that model. […] The bootstrap provides direct, robust estimates of model selection probabilities π_{i}, but we have no reason now to think that use of bootstrap estimates of model selection probabilities rather than use of the Akaike weights will lead to superior unconditional sampling variances or model-averaged parameter estimators. […] Be mindful of possible model redundancy. A carefully thought-out set of a priori models should eliminate model redundancy problems and is a central part of a sound strategy for obtaining reliable inferences. […] **Results are sensitive to having demonstrably poor models in the set of models considered; thus it is very important to exclude models that are a priori poor.** […] The importance of a small number (R) of candidate models, defined prior to detailed analysis of the data, cannot be overstated. […] One should have R much smaller than n. MMI [Multi-Model Inference] approaches become increasingly important in cases where there are many models to consider.”
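Model averaging for a parameter that appears in every candidate model can be sketched as a weighted mean of the per-model estimates. For the unconditional standard error, the sketch below uses one commonly used form: the weighted sum, over models, of the square root of the conditional variance plus the squared deviation of that model's estimate from the average. This is offered as an assumption about which estimator is meant, and the estimates, variances and weights are made up.

```python
import math


def model_average(estimates, cond_variances, weights):
    """Model-averaged estimate and an unconditional standard error.

    estimates:      per-model estimates of the same parameter
    cond_variances: per-model sampling variances, conditional on each model
    weights:        model weights (e.g. Akaike weights), summing to 1
    """
    avg = sum(w * e for w, e in zip(weights, estimates))
    se = sum(
        w * math.sqrt(v + (e - avg) ** 2)
        for w, e, v in zip(weights, estimates, cond_variances)
    )
    return avg, se


# Hypothetical results from three candidate models.
estimates = [0.82, 0.74, 0.95]
cond_variances = [0.010, 0.012, 0.020]
weights = [0.6, 0.3, 0.1]

avg, se = model_average(estimates, cond_variances, weights)
print(round(avg, 3), round(se, 3))
```

The squared-deviation term is what carries the model selection uncertainty: if the models disagree about the parameter, the unconditional standard error ends up larger than any weighted average of the conditional ones would suggest.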

“In general there is a substantial amount of model selection uncertainty in many practical problems […]. Such uncertainty about what model structure (and associated parameter values) is the K-L [Kullback–Leibler] best approximating model applies whether one uses hypothesis testing, information-theoretic criteria, dimension-consistent criteria, cross-validation, or various Bayesian methods. Often, there is a nonnegligible variance component for estimated parameters (this includes prediction) due to uncertainty about what model to use, and this component should be included in estimates of precision. […] we recommend assessing model selection uncertainty rather than ignoring the matter. […] It is […] not a sound idea to pick a single model and unquestioningly base extrapolated predictions on it when there is model uncertainty.”
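A rough Python sketch of what model averaging with a variance component for model uncertainty might look like is given below. All numbers are invented for illustration. The standard-error formula (a weighted sum of the square roots of each model's conditional variance plus its squared deviation from the averaged estimate) is my understanding of the kind of 'unconditional' estimator the book discusses, so treat it as an assumption rather than as a quote of the authors' method.

```python
import math

def akaike_weights(aic_values):
    """Akaike weights from a list of AIC values (one per model)."""
    best = min(aic_values)
    rel = [math.exp(-0.5 * (a - best)) for a in aic_values]
    total = sum(rel)
    return [r / total for r in rel]

def model_average(estimates, std_errors, weights):
    """Weighted average of a parameter over models, plus a standard error
    that includes a between-model component (assumed form, see text)."""
    theta_bar = sum(w * est for w, est in zip(weights, estimates))
    se = sum(
        w * math.sqrt(s ** 2 + (est - theta_bar) ** 2)
        for w, est, s in zip(weights, estimates, std_errors)
    )
    return theta_bar, se

# Hypothetical results: the same parameter estimated under three models.
aic = [100.1, 101.9, 104.0]
estimates = [0.52, 0.41, 0.60]
std_errors = [0.10, 0.12, 0.09]

w = akaike_weights(aic)
theta_bar, se_uncond = model_average(estimates, std_errors, w)
best = aic.index(min(aic))

print(f"best model alone: {estimates[best]:.3f} (SE {std_errors[best]:.3f})")
print(f"model-averaged:   {theta_bar:.3f} (SE {se_uncond:.3f})")
```

The point of the exercise is only that the averaged standard error can never be smaller than the weighted average of the conditional ones; disagreement between models shows up as extra uncertainty.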

## Model Selection and Multi-Model Inference (I)

“We wrote this book to introduce graduate students and research workers in various scientific disciplines to the use of information-theoretic approaches in the analysis of empirical data. These methods allow the data-based selection of a “best” model and a ranking and weighting of the remaining models in a pre-defined set. Traditional statistical inference can then be based on this selected best model. However, we now emphasize that information-theoretic approaches allow formal inference to be based on more than one model (multimodel inference). Such procedures lead to more robust inferences in many cases, and we advocate these approaches throughout the book. […] Information theory includes the celebrated Kullback–Leibler “distance” between two models (actually, probability distributions), and this represents a fundamental quantity in science. In 1973, Hirotugu Akaike derived an estimator of the (relative) expectation of Kullback–Leibler distance based on Fisher’s maximized log-likelihood. His measure, now called Akaike’s information criterion (AIC), provided a new paradigm for model selection in the analysis of empirical data. His approach, with a fundamental link to information theory, is relatively simple and easy to use in practice, but little taught in statistics classes and far less understood in the applied sciences than should be the case. […] We do not claim that the information-theoretic methods are always the very best for a particular situation. They do represent a unified and rigorous theory, an extension of likelihood theory, an important application of information theory, and they are objective and practical to employ across a very wide class of empirical problems. 
Inference from multiple models, or the selection of a single “best” model, by methods based on the Kullback–Leibler distance are almost certainly better than other methods commonly in use now (e.g., null hypothesis testing of various sorts, the use of R^{2}, or merely the use of just one available model).

This is an applied book written primarily for biologists and statisticians using models for making inferences from empirical data. […] This book might be useful as a text for a course for students with substantial experience and education in statistics and applied data analysis. A second primary audience includes honors or graduate students in the biological, medical, or statistical sciences […] Readers should ideally have some maturity in the quantitative sciences and experience in data analysis. Several courses in contemporary statistical theory and methods as well as some philosophy of science would be particularly useful in understanding the material. Some exposure to likelihood theory is nearly essential”.

…

The above quotes are from the preface of the book, which I have so far only briefly talked about here; this post will provide a lot more details. Aside from writing the post in order to mentally process the material and obtain a greater appreciation of the points made in the book, I have also as a secondary goal tried to write the post in a manner so that people who are not necessarily experienced model-builders might also derive some benefit from the coverage. Whether or not I was successful in that respect I do not know – given the outline above, it should be obvious that there are limits as to how ‘readable’ you can make stuff like this to people without a background in a semi-relevant field. I don’t think I have written specifically about the application of information criteria in the model selection context before here on the blog, at least not in any amount of detail, but I have written about ‘model-stuff’ before, also in ‘meta-contexts’ not necessarily related to the application of models in economics; so if you’re interested in ‘this kind of stuff’ but you don’t feel like having a go at a post dealing with a book which includes word combinations like ‘the (relative) expectation of Kullback–Leibler distance based on Fisher’s maximized log-likelihood’ in the preface, you can for example have a look at posts like this, this, this and this. I have also discussed here on the blog some stuff somewhat related to the multi-model inference part, how you can combine the results of various models to get a bigger picture of what’s going on, in these posts – they approach ‘the topic’ (these are in fact separate topics…) in a very different manner than does this book, but *some* key ideas *should* presumably transfer. 
Having said all this, I should also point out that many of the basic points made in the coverage below should be relatively easy to understand, and I should perhaps repeat that I’ve tried to make this post readable to people who’re not too familiar with this kind of stuff. I have deliberately chosen to include no mathematical formulas in my coverage in this post. Please do not assume this is because the book does not contain mathematical formulas.

Before moving on to the main coverage I thought I’d add a note about the remark above that stuff like AIC is “little taught in statistics classes and far less understood in the applied sciences than should be the case”. The book was written a while back, and some things may have changed a bit since then. I have done coursework on the application of information criteria in model selection as it was a topic (briefly) covered in regression analysis(? …or an earlier course), so at least this kind of stuff is now being taught to students of economics where I study and has been for a while as far as I’m aware – meaning that coverage of such topics is probably reasonably widespread at least in this field. However I can hardly claim that I obtained a ‘great’ or ‘full’ understanding of the issues at hand from the work on these topics I did back then – and so I have only gradually, while reading this book, come to appreciate some of the deeper issues and tradeoffs involved in model selection. This could probably be taken as an argument that these topics are still ‘far less understood … than should be the case’ – and another, perhaps stronger, argument would be Seber’s comments in the last part of his book; if a statistician today may still ‘overlook’ information criteria when discussing model selection in a *Springer* text, it’s not hard to argue that the methods are perhaps not as well known as should ‘ideally’ be the case. It’s obvious from the coverage that a lot of people were not using the methods when the book was written, and I’m not sure things have changed as much as would be preferable since then.

What is the book about? A starting point for understanding the sort of questions the book deals with might be to consider the simple question: When we set out to model stuff empirically and we have different candidate models to choose from, how do we decide which of the models is ‘best’? There are a lot of other questions dealt with in the coverage as well. What does the word ‘best’ mean? We might worry over both the functional form of the model and which variables should be included in ‘the best’ model – do we need separate mechanisms for dealing with concerns about the functional form and concerns about variable selection, or can we deal with such things at the same time? How do we best measure the effect of a variable which we have access to and consider including in our model(s) – is it preferable to interpret the effect of a variable on an outcome based on the results you obtain from a ‘best model’ in the set of candidate models, or is it perhaps sometimes better to combine the results of multiple models (and for example take an average of the effects of the variable across multiple proposed models to be the best possible estimate) in the choice set (as should by now be obvious for people who’ve read along here, there are some sometimes quite close parallels between stuff covered in this book and stuff covered in *Borenstein & Hedges*)? If we’re not sure which model is ‘right’, how might we quantify our uncertainty about these matters – and what happens if we don’t try to quantify our uncertainty about which model is correct? What is bootstrapping, and how can we use Monte Carlo methods to help us with model selection? If we apply information criteria to choose among models, what do these criteria tell us, and which sort of issues are they silent about? 
Are some methods for deciding between models better than others in specific contexts – might it for example be a good idea to make criteria adjustments when faced with small sample sizes which make it harder for us to rely on asymptotic properties of the criteria we apply? How might the sample size more generally relate to our decision criterion deciding which model might be considered ‘best’ – do we think that what might be considered to be ‘the best model’ might depend upon (‘should depend upon’?) how much data we have access to or not, and if how much data we have access to and the ‘optimal size of a model’ are related, *how* are the two related, and why? The questions included in the previous sentence relate to some fundamental differences between AIC (and similar measures) and BIC – but let’s not get ahead of ourselves. I may or may not go into details like these in my coverage of the book, but I certainly won’t cover stuff like that in this post. Some of the content is really technical: “Chapters 5 and 6 present more difficult material [than chapters 1-4] and some new research results. Few readers will be able to absorb the concepts presented here after just one reading of the material […] Underlying theory is presented in Chapter 7, and this material is much deeper and more mathematical.” – from the preface. The sample size considerations mentioned above relate to stuff covered in chapter 6. As you might already have realized, this book has a lot of stuff.

When dealing with models, one way to think about these things is to consider two in some sense separate issues: On the one hand we might think about which model is most appropriate (model selection), and on the other hand we might think about how best to estimate parameter values and variance-covariance matrices *given* a specific model. As the book points out early on, “if one assumes or somehow chooses a particular model, methods exist that are objective and asymptotically optimal for estimating model parameters and the sampling covariance structure, conditional on that model. […] The sampling distributions of ML [maximum likelihood] estimators are often skewed with small samples, but profile likelihood intervals or log-based intervals or bootstrap procedures can be used to achieve asymmetric confidence intervals with good coverage properties. **In general, the maximum likelihood method provides an objective, omnibus theory for estimation of model parameters and the sampling covariance matrix, given an appropriate model**.” The problem is that it’s not ‘a given’ that the model we’re working on *is* actually appropriate. That’s where model selection mechanisms enter the picture. Such methods can help us realize which of the models we’re considering might be the most appropriate one(s) to apply in the specific context (there are other things they can’t tell us, however – see below).

Below I have added some quotes from the book and some further comments:

“Generally, alternative models will involve differing numbers of parameters; the number of parameters will often differ by at least an order of magnitude across the set of candidate models. […] The more parameters used, the better the fit of the model to the data that is achieved. Large and extensive data sets are likely to support more complexity, and this should be considered in the development of the set of candidate models. If a particular model (parametrization) does not make biological [/’scientific’] sense, this is reason to exclude it from the set of candidate models, particularly in the case where causation is of interest. In developing the set of candidate models, one must recognize a certain balance between keeping the set small and focused on plausible hypotheses, while making it big enough to guard against omitting a very good a priori model. While this balance should be considered, we advise the inclusion of all models that seem to have a reasonable justification, prior to data analysis. While one must worry about errors due to both underfitting and overfitting, it seems that modest overfitting is less damaging than underfitting (Shibata 1989).” (The key word here is ‘modest’ – and please don’t take these authors to be in favour of obviously overfitted models and data dredging strategies; they spend quite a few pages criticizing such models/approaches!).

“It is not uncommon to see biologists collect data on 50–130 “ecological” variables in the blind hope that some analysis method and computer system will “find the variables that are significant” and sort out the “interesting” results […]. This shotgun strategy will likely uncover mainly spurious correlations […], and it is prevalent in the naive use of many of the traditional multivariate analysis methods (e.g., principal components, stepwise discriminant function analysis, canonical correlation methods, and factor analysis) found in the biological literature [*and elsewhere, US*]. We believe that mostly spurious results will be found using this unthinking approach […], and we encourage investigators to give very serious consideration to a well-founded set of candidate models and predictor variables (as a reduced set of possible prediction) as a means of minimizing the inclusion of spurious variables and relationships. […] Using AIC and other similar methods one can only hope to select the best model from this set; if good models are not in the set of candidates, they cannot be discovered by model selection (i.e., data analysis) algorithms. […] statistically we can infer only that a best model (by some criterion) has been selected, never that it is the true model. […] **Truth and true models are not statistically identifiable from data**.”

“It is generally a mistake to believe that there is a simple “true model” in the biological sciences and that during data analysis this model can be uncovered and its parameters estimated. Instead, biological systems [*and other systems! – US*] are complex, with many small effects, interactions, individual heterogeneity, and individual and environmental covariates (most being unknown to us); we can only hope to identify a model that provides a good approximation to the data available. The words “true model” represent an oxymoron, except in the case of Monte Carlo studies, whereby a model is used to generate “data” using pseudorandom numbers […] A model is a simplification or approximation of reality and hence will not reflect all of reality. […] While a model can never be “truth,” a model might be ranked from very useful, to useful, to somewhat useful to, finally, essentially useless. Model selection methods try to rank models in the candidate set relative to each other; whether any of the models is actually “good” depends primarily on the quality of the data and the science and a priori thinking that went into the modeling. […] Proper modeling and data analysis tell what inferences the data support, not what full reality might be […] Even if a “true model” did exist and if it could be found using some method, it would not be good as a fitted model for general inference (i.e., understanding or prediction) about some biological system, because its numerous parameters would have to be estimated from the finite data, and the precision of these estimated parameters would be quite low.”

A key concept in the context of model selection is the tradeoff between bias and variance in a model framework:

“If the fit is improved by a model with more parameters, then where should one stop? Box and Jenkins […] suggested that the *principle of parsimony* should lead to a model with “. . . the smallest possible number of parameters for adequate representation of the data.” Statisticians view the principle of parsimony as a bias versus variance tradeoff. In general, bias decreases and variance increases as the dimension of the model (K) increases […] The fit of any model can be improved by increasing the number of parameters […]; however, a tradeoff with the increasing variance must be considered in selecting a model for inference. Parsimonious models achieve a proper tradeoff between bias and variance. All model selection methods are based to some extent on the principle of parsimony […] The concept of parsimony and a bias versus variance tradeoff is very important.”
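A small simulation can make the point about fit versus model dimension concrete. The sketch below is not from the book: it assumes a made-up smooth 'truth' plus noise, and uses a deliberately simple model family (a step function with K bin means, so K is the number of parameters). The fit to the sample at hand cannot get worse as K doubles, whereas the fit to a fresh sample from the same process need not keep improving.

```python
import math
import random

def simulate(n, rng):
    """Draw n points from an invented smooth 'truth' plus Gaussian noise."""
    xs = [rng.random() for _ in range(n)]
    ys = [math.sin(2 * math.pi * x) + rng.gauss(0, 0.5) for x in xs]
    return xs, ys

def fit_bin_means(xs, ys, k):
    """Step-function model with k parameters: one mean per equal-width bin on [0, 1)."""
    sums, counts = [0.0] * k, [0] * k
    for x, y in zip(xs, ys):
        b = min(int(x * k), k - 1)
        sums[b] += y
        counts[b] += 1
    overall = sum(ys) / len(ys)
    # Bins without any data fall back to the overall mean.
    return [s / c if c else overall for s, c in zip(sums, counts)]

def rss(xs, ys, means):
    """Residual sum of squares of the step-function model on the given data."""
    k = len(means)
    return sum((y - means[min(int(x * k), k - 1)]) ** 2 for x, y in zip(xs, ys))

rng = random.Random(1)
train_x, train_y = simulate(40, rng)   # the single sample we actually have
new_x, new_y = simulate(40, rng)       # a replicate sample from the same process

results = []
for k in (1, 2, 4, 8, 16, 32):
    means = fit_bin_means(train_x, train_y, k)
    results.append((k, rss(train_x, train_y, means), rss(new_x, new_y, means)))

for k, rss_train, rss_new in results:
    print(f"K={k:2d}  RSS on own sample={rss_train:6.2f}  RSS on new sample={rss_new:6.2f}")
```

The bin counts are doubled at each step so that the models are nested, which is what guarantees the in-sample improvement; how the new-sample column behaves depends on the seed and the noise level.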

“we reserve the terms underfitted and overfitted for use in relation to a “best approximating model” […] Here, an underfitted model would ignore some important replicable (i.e., conceptually replicable in most other samples) structure in the data and thus fail to identify effects that were actually supported by the data. In this case, bias in the parameter estimators is often substantial, and the sampling variance is underestimated, both factors resulting in poor confidence interval coverage. Underfitted models tend to miss important treatment effects in experimental settings. Overfitted models, as judged against a best approximating model, are often free of bias in the parameter estimators, but have estimated (and actual) sampling variances that are needlessly large (the precision of the estimators is poor, relative to what could have been accomplished with a more parsimonious model). Spurious treatment effects tend to be identified, and spurious variables are included with overfitted models. […] The goal of data collection and analysis is to make inferences from the sample that properly apply to the population […] A paramount consideration is the repeatability, with good precision, of any inference reached. When we imagine many replicate samples, there will be some recognizable features common to almost all of the samples. Such features are the sort of inference about which we seek to make strong inferences (from our single sample). Other features might appear in, say, 60% of the samples yet still reflect something real about the population or process under study, and we would hope to make weaker inferences concerning these. Yet additional features appear in only a few samples, and these might be best included in the error term (σ^{2}) in modeling. 
If one were to make an inference about these features quite unique to just the single data set at hand, as if they applied to all (or most all) samples (hence to the population), then we would say that the sample is overfitted by the model (we have overfitted the *data*). Conversely, failure to identify the features present that are strongly replicable over samples is underfitting. […] A best approximating model is achieved by properly balancing the errors of underfitting and overfitting.”

Model selection bias is a key concept in the model selection context, and I think this problem is quite similar/closely related to problems encountered in a meta-analytical context which I believe I’ve discussed before here on the blog (see links above to the posts on meta-analysis) – if I’ve understood these authors correctly, one might choose to think of publication bias issues as partly the result of model selection bias issues. Let’s for a moment pretend you have a ‘true model’ which includes three variables (in the book example there are four, but I don’t think you need four…); one is very important, one is a sort of ‘60% of the samples variable’ mentioned above, and the last one would be a variable we might prefer to just include in the error term. Now the problem is this: When people look at samples where the last one of these variables is ‘seen to matter’, the effect size of this variable will be biased away from zero (they don’t explain where this bias comes from in the book, but I’m reasonably sure this is a result of the probability of identification/inclusion of the variable in the model depending on the (‘local’/‘sample’) effect size; the bigger the effect size of a specific variable in a specific sample, the more likely the variable is to be identified as important enough to be included in the model – *Borenstein and Hedges* talked about similar dynamics, for obvious reasons, and I think their reasoning ‘transfers’ to this situation and is applicable here as well). When models include variables such as the last one, you’ll have model selection bias: “When predictor variables [like these] are included in models, the associated estimator for a σ^{2} is negatively biased and precision is exaggerated. These two types of bias are called model selection bias”. Much later in the book they incidentally conclude that: “**The best way to minimize model selection bias is to reduce the number of models fit to the data by thoughtful a priori model formulation**.”
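The mechanism suggested above (inclusion depends on the sample effect size, so the estimates that survive selection are too large) can be checked with a small simulation. This is only a sketch under invented assumptions: one weak predictor with a true slope of 0.1, samples of 30 observations, and a simple rule that keeps the predictor whenever a Gaussian linear model with it has a lower AIC than the intercept-only model. None of these settings are from the book.

```python
import math
import random

def slope_and_aic_gain(xs, ys):
    """OLS slope of y on x, and AIC(intercept-only) minus AIC(with slope)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    rss0 = sum((y - my) ** 2 for y in ys)   # intercept-only model
    rss1 = rss0 - sxy ** 2 / sxx            # model including the slope
    # Gaussian AIC up to a constant: n*ln(RSS/n) + 2K, with K counting sigma^2.
    aic0 = n * math.log(rss0 / n) + 2 * 2
    aic1 = n * math.log(rss1 / n) + 2 * 3
    return sxy / sxx, aic0 - aic1

rng = random.Random(42)
true_beta, n, reps = 0.1, 30, 5000          # illustrative settings only
all_est, kept_est = [], []
for _ in range(reps):
    xs = [rng.gauss(0, 1) for _ in range(n)]
    ys = [true_beta * x + rng.gauss(0, 1) for x in xs]
    b, gain = slope_and_aic_gain(xs, ys)
    all_est.append(b)
    if gain > 0:                            # the variable is 'seen to matter'
        kept_est.append(b)

mean_all = sum(all_est) / len(all_est)
mean_kept_abs = sum(abs(b) for b in kept_est) / len(kept_est)
print(f"true slope:                      {true_beta}")
print(f"mean estimate, all samples:      {mean_all:.3f}")
print(f"mean |estimate|, selected only:  {mean_kept_abs:.3f}")
print(f"share of samples selecting it:   {len(kept_est) / reps:.2f}")
```

Averaged over all samples the estimator is fine; it is only the subset of samples in which the variable gets selected that paints too strong a picture.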

“Model selection has most often been viewed, and hence taught, in a context of null hypothesis testing. Sequential testing has most often been employed, either stepup (forward) or stepdown (backward) methods. Stepwise procedures allow for variables to be added or deleted at each step. These testing-based methods remain popular in many computer software packages in spite of their poor operating characteristics. […] Generally, hypothesis testing is a very poor basis for model selection […] There is no statistical theory that supports the notion that hypothesis testing with a fixed α level is a basis for model selection. […] Tests of hypotheses within a data set are not independent, making inferences difficult. The order of testing is arbitrary, and differing test order will often lead to different final models. [This is incidentally one, of several, key differences between hypothesis testing approaches and information theoretic approaches: “The order in which the information criterion is computed over the set of models is not relevant.”] […] Model selection is dependent on the arbitrary choice of α, but α should depend on both n and K to be useful in model selection”.

## Statistical Models for Proportions and Probabilities

“Most elementary statistics books discuss inference for proportions and probabilities, and the primary readership for this monograph is the student of statistics, either at an advanced undergraduate or graduate level. As some of the recommended so-called ‘‘large-sample’’ rules in textbooks have been found to be inappropriate, this monograph endeavors to provide more up-to-date information on these topics. I have also included a number of related topics not generally found in textbooks. The emphasis is on model building and the estimation of parameters from the models.

It is assumed that the reader has a background in statistical theory and inference and is familiar with standard univariate and multivariate distributions, including conditional distributions.”

…

The above quote is from the book’s preface. The book is highly technical – here’s a screencap of a page roughly in the middle:

I think the above picture provides some background as to why I do not think it’s a good idea to provide detailed coverage of the book here. Not all pages are that bad, but this *is* a book on mathematical statistics. The technical nature of the book made it difficult for me to know how to rate it – I like to ask myself when reading books like this one if I would be able to spot an error in the coverage. In some contexts here I clearly would not be able to do that (given the time I was willing to spend on the book), and when that’s the case I always feel hesitant about rating(/’judging’) books of this nature. I should note that there are pretty much no spelling/formatting errors, and the language is easy to understand (‘if you know enough about statistics…’). I did have one major problem with part of the coverage towards the end of the book, but it didn’t much alter my general impression of the book. The problem was that the author seems to apply (/recommend?) a hypothesis-testing framework for model selection, a practice which although widely used is frankly considered bad statistics by Burnham and Anderson in their book on model selection. In the relevant section of the book Seber discusses an approach to modelling which starts out with a ‘full model’ including both primary effects and various (potentially multi-level) interaction terms (he deals specifically with data derived from multiple (independent?) multinomial distributions, but where the data comes from is not really important here), and then he proceeds to use hypothesis tests of whether interaction terms are zero to determine whether or not interactions should be included in the model. For people who don’t know, this model selection method is both very commonly used and a very wrong way to do things; using hypothesis testing as a model selection mechanism is a methodologically invalid approach to model selection, something Burnham and Anderson talk a lot about in their book. 
I assume I’ll be covering Burnham and Anderson’s book in more detail later on here on the blog, so for now I’ll just make this key point here and then return to that stuff later – if you did not understand the comments above you shouldn’t worry too much about it, I’ll go into much more detail when talking about that stuff later. This problem was the only real problem I had with Seber’s book.

Although I’ll not talk a lot about what the book was about (not only because it might be hard for some readers to follow, I should point out, but also because detailed coverage would take a lot more time than I’d be willing to spend on this stuff), I decided to add a few links to relevant stuff he talks about in the book. Quite a few pages in the book are spent on talking about the properties of various distributions, how to estimate key parameters of interest, and how to construct confidence intervals to be used for hypothesis testing in those specific contexts.

Some of the links below deal with stuff covered in the book, a few others however just deal with stuff I had to look up in order to understand what was going on in the coverage:

Inverse sampling.

Binomial distribution.

Hypergeometric distribution.

Multinomial distribution.

Binomial proportion confidence interval. (Coverage of the Wilson score interval, Jeffreys interval, and the Clopper-Pearson interval included in the book).

Fisher’s exact test.

Marginal distribution.

Fisher information.

Moment-generating function.

Factorial moment-generating function.

Delta method.

Multidimensional central limit theorem (the book applies this, but doesn’t really talk about it).

Matrix function.

McNemar’s test.
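To give a flavour of the interval constructions linked above, here is a short Python sketch of the Wilson score interval next to the simple textbook 'large-sample' (Wald) interval. It uses the standard formulas as I know them rather than anything copied from the book, and the function names and example counts are made up.

```python
import math
from statistics import NormalDist

def wald_interval(successes, n, confidence=0.95):
    """Textbook large-sample interval: p_hat +/- z * sqrt(p_hat * (1 - p_hat) / n)."""
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    p_hat = successes / n
    half = z * math.sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - half, p_hat + half

def wilson_interval(successes, n, confidence=0.95):
    """Wilson score interval for a binomial proportion."""
    if n <= 0:
        raise ValueError("n must be positive")
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    p_hat = successes / n
    denom = 1 + z ** 2 / n
    centre = (p_hat + z ** 2 / (2 * n)) / denom
    half = z * math.sqrt(p_hat * (1 - p_hat) / n + z ** 2 / (4 * n ** 2)) / denom
    return centre - half, centre + half

# Hypothetical data: 0 successes in 20 trials, then 7 successes in 10 trials.
for x, n in [(0, 20), (7, 10)]:
    print(f"{x}/{n}: Wald {wald_interval(x, n)}, Wilson {wilson_interval(x, n)}")
```

With zero observed successes the Wald interval collapses to a single point, which is one simple way to see why some of the 'large-sample' rules are considered inappropriate for small samples or extreme proportions.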

## Wikipedia articles of interest

i. Pendle witches.

“The trials of the **Pendle witches** in 1612 are among the most famous witch trials in English history, and some of the best recorded of the 17th century. The twelve accused lived in the area around Pendle Hill in Lancashire, and were charged with the murders of ten people by the use of witchcraft. All but two were tried at Lancaster Assizes on 18–19 August 1612, along with the Samlesbury witches and others, in a series of trials that have become known as the Lancashire witch trials. One was tried at York Assizes on 27 July 1612, and another died in prison. Of the eleven who went to trial – nine women and two men – ten were found guilty and executed by hanging; one was found not guilty.

The official publication of the proceedings by the clerk to the court, Thomas Potts, in his *The Wonderfull Discoverie of Witches in the Countie of Lancaster*, and the number of witches hanged together – nine at Lancaster and one at York – make the trials unusual for England at that time. It has been estimated that all the English witch trials between the early 15th and early 18th centuries resulted in fewer than 500 executions; this series of trials accounts for more than two per cent of that total.”

“One of the accused, Demdike, had been regarded in the area as a witch for fifty years, and some of the deaths the witches were accused of had happened many years before Roger Nowell started to take an interest in 1612.^{[13]} The event that seems to have triggered Nowell’s investigation, culminating in the Pendle witch trials, occurred on 21 March 1612.^{[14]}

On her way to Trawden Forest, Demdike’s granddaughter, Alizon Device, encountered John Law, a pedlar from Halifax, and asked him for some pins.^{[15]} Seventeenth-century metal pins were handmade and relatively expensive, but they were frequently needed for magical purposes, such as in healing – particularly for treating warts – divination, and for love magic, which may have been why Alizon was so keen to get hold of them and why Law was so reluctant to sell them to her.^{[16]} Whether she meant to buy them, as she claimed, and Law refused to undo his pack for such a small transaction, or whether she had no money and was begging for them, as Law’s son Abraham claimed, is unclear.^{[17]} A few minutes after their encounter Alizon saw Law stumble and fall, perhaps because he suffered a stroke; he managed to regain his feet and reach a nearby inn.^{[18]} Initially Law made no accusations against Alizon,^{[19]} but she appears to have been convinced of her own powers; when Abraham Law took her to visit his father a few days after the incident, she reportedly confessed and asked for his forgiveness.^{[20]}

Alizon Device, her mother Elizabeth, and her brother James were summoned to appear before Nowell on 30 March 1612. Alizon confessed that she had sold her soul to the Devil, and that she had told him to lame John Law after he had called her a thief. Her brother, James, stated that his sister had also confessed to bewitching a local child. Elizabeth was more reticent, admitting only that her mother, Demdike, had a mark on her body, something that many, including Nowell, would have regarded as having been left by the Devil after he had sucked her blood.”

“The Pendle witches were tried in a group that also included the Samlesbury witches, Jane Southworth, Jennet Brierley, and Ellen Brierley, the charges against whom included child murder and cannibalism; Margaret Pearson, the so-called Padiham witch, who was facing her third trial for witchcraft, this time for killing a horse; and Isobel Robey from Windle, accused of using witchcraft to cause sickness.^{[33]}

Some of the accused Pendle witches, such as Alizon Device, seem to have genuinely believed in their guilt, but others protested their innocence to the end.”

“Nine-year-old Jennet Device was a key witness for the prosecution, something that would not have been permitted in many other 17th-century criminal trials. However, King James had made a case for suspending the normal rules of evidence for witchcraft trials in his *Daemonologie*.^{[42]} As well as identifying those who had attended the Malkin Tower meeting, Jennet also gave evidence against her mother, brother, and sister. […] When Jennet was asked to stand up and give evidence against her mother, Elizabeth began to scream and curse her daughter, forcing the judges to have her removed from the courtroom before the evidence could be heard.^{[48]} Jennet was placed on a table and stated that she believed her mother had been a witch for three or four years. She also said her mother had a familiar called Ball, who appeared in the shape of a brown dog. Jennet claimed to have witnessed conversations between Ball and her mother, in which Ball had been asked to help with various murders. James Device also gave evidence against his mother, saying he had seen her making a clay figure of one of her victims, John Robinson.^{[49]} Elizabeth Device was found guilty.^{[47]}

James Device pleaded not guilty to the murders by witchcraft of Anne Townley and John Duckworth. However he, like Chattox, had earlier made a confession to Nowell, which was read out in court. That, and the evidence presented against him by his sister Jennet, who said that she had seen her brother asking a black dog he had conjured up to help him kill Townley, was sufficient to persuade the jury to find him guilty.^{[50]}^{[51]}”

“Many of the allegations made in the Pendle witch trials resulted from members of the Demdike and Chattox families making accusations against each other. Historian John Swain has said that the outbreaks of witchcraft in and around Pendle demonstrate the extent to which people could make a living either by posing as a witch, or by accusing or threatening to accuse others of being a witch.^{[17]} Although it is implicit in much of the literature on witchcraft that the accused were victims, often mentally or physically abnormal, for some at least, it may have been a trade like any other, albeit one with significant risks.^{[74]} There may have been bad blood between the Demdike and Chattox families because they were in competition with each other, trying to make a living from healing, begging, and extortion.”

…

ii. Kullback–Leibler divergence.

This article is the only one of the five ‘main articles’ in this post which is not a featured article. I looked this one up because the Burnham & Anderson book I’m currently reading talks about this stuff quite a bit. The book will probably be one of the most technical books I’ll read this year, and I’m not sure how much of it I’ll end up covering here. Basically most of the book deals with the stuff ‘covered’ in the (very short) ‘Relationship between models and reality’ section of the wiki article. There are a lot of details the article left out… The same could be said about the related wiki article about AIC (both articles incidentally include the book in their references).

…

iii. Atmosphere of Jupiter.

The first thing that would spring to mind if someone asked me what I knew about it would probably be something along the lines of: “…well, it’s *huge*…”

…and it is. But we know a lot more than that – some observations from the article:

“The atmosphere of Jupiter is the largest planetary atmosphere in the Solar System. It is mostly made of molecular hydrogen and helium in roughly solar proportions; other chemical compounds are present only in small amounts […] The atmosphere of Jupiter lacks a clear lower boundary and gradually transitions into the liquid interior of the planet. […] The Jovian atmosphere shows a wide range of active phenomena, including band instabilities, vortices (cyclones and anticyclones), storms and lightning. […] Jupiter has powerful storms, always accompanied by lightning strikes. The storms are a result of moist convection in the atmosphere connected to the evaporation and condensation of water. They are sites of strong upward motion of the air, which leads to the formation of bright and dense clouds. The storms form mainly in belt regions. The lightning strikes on Jupiter are hundreds of times more powerful than those seen on Earth.” [However do note that later on in the article it is stated that: “On Jupiter lighting strikes are on average a few times more powerful than those on Earth.”]

“The composition of Jupiter’s atmosphere is similar to that of the planet as a whole.^{[1]} Jupiter’s atmosphere is the most comprehensively understood of those of all the gas giants because it was observed directly by the *Galileo* atmospheric probe when it entered the Jovian atmosphere on December 7, 1995.^{[26]} Other sources of information about Jupiter’s atmospheric composition include the *Infrared Space Observatory* (ISO),^{[27]} the *Galileo* and *Cassini* orbiters,^{[28]} and Earth-based observations.”

“The visible surface of Jupiter is divided into several bands parallel to the equator. There are two types of bands: lightly colored *zones* and relatively dark *belts.* […] The alternating pattern of belts and zones continues until the polar regions at approximately 50 degrees latitude, where their visible appearance becomes somewhat muted.^{[30]} The basic belt-zone structure probably extends well towards the poles, reaching at least to 80° North or South.^{[5]}

The difference in the appearance between zones and belts is caused by differences in the opacity of the clouds. Ammonia concentration is higher in zones, which leads to the appearance of denser clouds of ammonia ice at higher altitudes, which in turn leads to their lighter color.^{[15]} On the other hand, in belts clouds are thinner and are located at lower altitudes.^{[15]} The upper troposphere is colder in zones and warmer in belts.^{[5]} […] The Jovian bands are bounded by zonal atmospheric flows (winds), called *jets*. […] The location and width of bands, speed and location of jets on Jupiter are remarkably stable, having changed only slightly between 1980 and 2000. […] However bands vary in coloration and intensity over time […] These variations were first observed in the early seventeenth century.”

“Jupiter radiates much more heat than it receives from the Sun. It is estimated that the ratio between the power emitted by the planet and that absorbed from the Sun is 1.67 ± 0.09.”

…

iv. Wife selling (English custom).

“**Wife selling** in England was a way of ending an unsatisfactory marriage by mutual agreement that probably began in the late 17th century, when divorce was a practical impossibility for all but the very wealthiest. After parading his wife with a halter around her neck, arm, or waist, a husband would publicly auction her to the highest bidder. […] Although the custom had no basis in law and frequently resulted in prosecution, particularly from the mid-19th century onwards, the attitude of the authorities was equivocal. At least one early 19th-century magistrate is on record as stating that he did not believe he had the right to prevent wife sales, and there were cases of local Poor Law Commissioners forcing husbands to sell their wives, rather than having to maintain the family in workhouses.”

“Until the passing of the Marriage Act of 1753, a formal ceremony of marriage before a clergyman was not a legal requirement in England, and marriages were unregistered. All that was required was for both parties to agree to the union, so long as each had reached the legal age of consent,^{[8]} which was 12 for girls and 14 for boys.^{[9]} Women were completely subordinated to their husbands after marriage, the husband and wife becoming one legal entity, a legal status known as coverture. […] Married women could not own property in their own right, and were indeed themselves the property of their husbands. […] Five distinct methods of breaking up a marriage existed in the early modern period of English history. One was to sue in the ecclesiastical courts for separation from bed and board (*a mensa et thoro*), on the grounds of adultery or life-threatening cruelty, but it did not allow a remarriage.^{[11]} From the 1550s, until the Matrimonial Causes Act became law in 1857, divorce in England was only possible, if at all, by the complex and costly procedure of a private Act of Parliament.^{[12]} Although the divorce courts set up in the wake of the 1857 Act made the procedure considerably cheaper, divorce remained prohibitively expensive for the poorer members of society.^{[13]}^{[nb 1]} An alternative was to obtain a “private separation”, an agreement negotiated between both spouses, embodied in a deed of separation drawn up by a conveyancer. Desertion or elopement was also possible, whereby the wife was forced out of the family home, or the husband simply set up a new home with his mistress.^{[11]} Finally, the less popular notion of wife selling was an alternative but illegitimate method of ending a marriage.”

“Although some 19th-century wives objected, records of 18th-century women resisting their sales are non-existent. With no financial resources, and no skills on which to trade, for many women a sale was the only way out of an unhappy marriage.^{[17]} Indeed the wife is sometimes reported as having insisted on the sale. […] Although the initiative was usually the husband’s, the wife had to agree to the sale. An 1824 report from Manchester says that “after several biddings she [the wife] was knocked down for 5s; but not liking the purchaser, she was put up again for 3s and a quart of ale”.^{[27]} Frequently the wife was already living with her new partner.^{[28]} In one case in 1804 a London shopkeeper found his wife in bed with a stranger to him, who, following an altercation, offered to purchase the wife. The shopkeeper agreed, and in this instance the sale may have been an acceptable method of resolving the situation. However, the sale was sometimes spontaneous, and the wife could find herself the subject of bids from total strangers.^{[29]} In March 1766, a carpenter from Southwark sold his wife “in a fit of conjugal indifference at the alehouse”. Once sober, the man asked his wife to return, and after she refused he hanged himself. A domestic fight might sometimes precede the sale of a wife, but in most recorded cases the intent was to end a marriage in a way that gave it the legitimacy of a divorce.”

“Prices paid for wives varied considerably, from a high of £100 plus £25 each for her two children in a sale of 1865 (equivalent to about £12,500 in 2015)^{[34]} to a low of a glass of ale, or even free. […] According to authors Wade Mansell and Belinda Meteyard, money seems usually to have been a secondary consideration;^{[4]} the more important factor was that the sale was seen by many as legally binding, despite it having no basis in law. […] In Sussex, inns and public houses were a regular venue for wife-selling, and alcohol often formed part of the payment. […] in Ninfield in 1790, a man who swapped his wife at the village inn for half a pint of gin changed his mind and bought her back later.^{[42]} […] Estimates of the frequency of the ritual usually number about 300 between 1780 and 1850, relatively insignificant compared to the instances of desertion, which in the Victorian era numbered in the tens of thousands.^{[43]}”

“In 1825 a man named Johnson was charged with “having sung a song in the streets describing the merits of his wife, for the purpose of selling her to the highest bidder at Smithfield.” Such songs were not unique; in about 1842 John Ashton wrote “Sale of a Wife”.^{[nb 6]}^{[58]} The arresting officer claimed that the man had gathered a “crowd of all sorts of vagabonds together, who appeared to listen to his ditty, but were in fact, collected to pick pockets.” The defendant, however, replied that he had “not the most distant idea of selling his wife, who was, poor creature, at home with her hungry children, while he was endeavouring to earn a bit of bread for them by the strength of his lungs.” He had also printed copies of the song, and the story of a wife sale, to earn money. Before releasing him, the Lord Mayor, judging the case, cautioned Johnson that the practice could not be allowed, and must not be repeated.^{[59]} In 1833 the sale of a woman was reported at Epping. She was sold for 2s. 6d., with a duty of 6d. Once sober, and placed before the Justices of the Peace, the husband claimed that he had been forced into marriage by the parish authorities, and had “never since lived with her, and that she had lived in open adultery with the man Bradley, by whom she had been purchased”. He was imprisoned for “having deserted his wife”.^{[60]}”

…

v. Bog turtle.

“The **bog turtle** (*Glyptemys muhlenbergii*) is a semiaquatic turtle endemic to the eastern United States. […] It is the smallest North American turtle, measuring about 10 centimeters (4 in) long when fully grown. […] The bog turtle can be found from Vermont in the north, south to Georgia, and west to Ohio. Diurnal and secretive, it spends most of its time buried in mud and – during the winter months – in hibernation. The bog turtle is omnivorous, feeding mainly on small invertebrates.”

“The bog turtle is native only to the eastern United States,^{[nb 1]} congregating in colonies that often consist of fewer than 20 individuals.^{[23]} […] densities can range from 5 to 125 individuals per 0.81 hectares (2.0 acres). […] The bog turtle spends its life almost exclusively in the wetland where it hatched. In its natural environment, it has a maximum lifespan of perhaps 50 years or more,^{[47]} and the average lifespan is 20–30 years.”

“The bog turtle is primarily diurnal, active during the day and sleeping at night. It wakes in the early morning, basks until fully warm, then begins its search for food.^{[31]} It is a seclusive species, making it challenging to observe in its natural habitat.^{[11]} During colder days, the bog turtle will spend much of its time in dense underbrush, underwater, or buried in mud. […] Day-to-day, the bog turtle moves very little, typically basking in the sun and waiting for prey. […] Various studies have found different rates of daily movement in bog turtles, varying from 2.1 to 23 meters (6.9 to 75.5 ft) in males and 1.1 to 18 meters (3.6 to 59.1 ft) in females.”

“Changes to the bog turtle’s habitat have resulted in the disappearance of 80 percent of the colonies that existed 30 years ago.^{[7]} Because of the turtle’s rarity, it is also in danger of illegal collection, often for the worldwide pet trade. […] The bog turtle was listed as *critically endangered* in the 2011 IUCN Red List.^{[53]}”

## Evidence-Based Diagnosis

“Evidence-Based Diagnosis is a textbook about diagnostic, screening, and prognostic tests in clinical medicine. The authors’ approach is based on many years of experience teaching physicians in a clinical research training program. Although requiring only a minimum of mathematics knowledge, the quantitative discussions in this book are deeper and more rigorous than those in most introductory texts. […] It is aimed primarily at clinicians, particularly those who are academically minded, but it should be helpful and accessible to anyone involved with selection, development, or marketing of diagnostic, screening, or prognostic tests. […] Our perspective is that of skeptical consumers of tests. We want to make proper diagnoses and not miss treatable diseases. Yet, we are aware that vast resources are spent on tests that too frequently provide wrong answers or right answers of little value, and that new tests are being developed, marketed, and sold all the time, sometimes with little or no demonstrable or projected benefit to patients. This book is intended to provide readers with the tools they need to evaluate these tests, to decide if and when they are worth doing, and to interpret the results.”

…

I simply could not possibly justify not giving this book a shot considering the Amazon ratings – it has an insane average rating of five stars, based on nine ratings. I agree with the reviewers: This is a really nice book. It covers a lot of stuff I’ve seen before, e.g. in Fletcher and Fletcher, Petrie and Sabin, Juth and Munthe, Borenstein, Hedges et al., Adam, Baltussen et al. (listing all of these suddenly made me realize how much stuff I’ve actually read about these sorts of topics in the past…), as well as in stats courses I’ve taken, but as the book is focusing specifically on medical testing aspects there is also a lot of new stuff. It should be noted that some people will benefit a lot more from reading the book than I did; I’ve spent weeks dealing with related aspects of subtopics they cover in just a few pages, and there were a lot of familiar concepts, distinctions, etc. in the book. Even so, this book is remarkably well-written and these guys really know their stuff. If you want to read a book about the basics of how to make sense of the results of medical tests and stuff like that, this is the book you’ll want to read.

Let’s say you have a test measuring some variable which might be useful in a diagnostic context. How would we know it might be useful? Well, one might come up with some criteria such a test should meet, such as that the results of the test don’t depend on who’s doing the testing; perhaps it also shouldn’t matter when the test is done. You might also want the test to be somewhat accurate. But what do we even mean by that? There are various approaches to thinking about accuracy, and some may be better than others. So the book covers familiar topics like sensitivity and specificity, likelihood ratios, and receiver operating characteristic (ROC) curves. A test might be accurate, but if the results of a test do not change clinical decision-making it might not be worth it to do the test; so the question of whether a test is accurate or not is different from whether it’s also useful. In terms of usefulness concepts like positive and negative predictive value and distinctions such as that between absolute and relative risk become important. It might not even be a good idea to use a test even if it distinguishes reasonably well between people who are sick and people who are not, because a very accurate test might be too expensive to justify; the book also has a bit of stuff on cost-effectiveness. Of course costs associated with getting tested for a health condition are not limited to monetary costs; a test might be uncomfortable, and it may also for example be the case that a false positive or a false negative result might sometimes have quite severe consequences (e.g. in the context of cancer screening). In such contexts concepts like the number needed to treat might be useful. It might also on the other hand be that a test gives answers which are wrong so often that even if it’s very cheap to do, it still might not be worth doing.

There’s stuff in the book about how to think about, and come up with decision-rules about, how to identify things like treatment-thresholds; variables which will be determined by probability of disease and costs associated with testing (and treatment). A variable like the cost of a treatment might in an analytical framework involve both the costs of treating people with the health condition as well as the costs of treating people who tested positive without being sick and the costs of not treating sick people who tested negative. One might think in one context that it would be twice as bad to miss a diagnosis as it would be to treat someone who does not have the disease, which would lead to one set of decision-rules in terms of when to test and when to treat, whereas in another context it might be a lot worse to miss a diagnosis, so we’d be less worried about treating someone without the disease. There may be more than one relevant threshold in the diagnostic setting; usually there’ll be some range of prior probabilities of disease for which the test will add enough information to change decision-making, but at either end of the range the test might not add enough information to be justified. To be more specific, if you’re almost certain the patient has some specific disease, you’ll want to treat him because the test result will not change anything; and if on the other hand you’re almost certain that he does not have the disease, based e.g. on the prevalence rate and the clinical presentation, then you’ll want to refrain from testing if the test has costs (including time costs, inconvenience, etc.). The book includes formal and reasonably detailed analysis of such topics.
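To make the ‘twice as bad’ example a bit more concrete, here is a minimal sketch of a treatment threshold. It assumes the simplest possible expected-cost setup: treating someone without the disease has one cost, missing a diagnosis has another, and the cost of testing itself is ignored. The function name and the cost units are my own illustrative choices, not the book’s notation, and the book’s formal treatment is more detailed than this.

```python
def treatment_threshold(cost_false_positive, cost_false_negative):
    """Probability of disease above which treating has the lower expected cost.

    With probability of disease p, treating costs (1 - p) * cost_false_positive
    in expectation and not treating costs p * cost_false_negative. The two
    expected costs are equal at p = C / (C + B).
    """
    return cost_false_positive / (cost_false_positive + cost_false_negative)


# Missing a diagnosis is twice as bad as an unnecessary treatment.
twice_as_bad = treatment_threshold(1.0, 2.0)

# Missing a diagnosis is ten times as bad, so we treat at a much lower probability.
ten_times_as_bad = treatment_threshold(1.0, 10.0)

print(round(twice_as_bad, 3), round(ten_times_as_bad, 3))
```

Under these assumptions the threshold falls from one in three to one in eleven as a missed diagnosis gets relatively worse, which is the ‘less worried about treating someone without the disease’ situation described above.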

In terms of how to interpret the results of a test it matters who you’re testing, and as already indicated the authors apply a Bayesian approach to these matters and repeatedly emphasize the importance of priors when evaluating test results (or for that matter findings from the literature). In that context some important notions are included about what you can and can’t use e.g. variables like prevalence and incidence for, how best to use such variables to inform decision-making, and things like how the study design might impact which variables are available to you for analysis (don’t try to estimate prevalence if you’re dealing with a case-control setup, where this variable is determined by the study design).
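A small sketch of why it matters who you’re testing: the same positive result means very different things at different prior probabilities. The sensitivity and specificity of 0.90 and the three priors below are made-up numbers chosen for illustration, and the function name is mine.

```python
def positive_predictive_value(prior, sensitivity, specificity):
    """Probability of disease given a positive result, via Bayes' rule."""
    true_positives = prior * sensitivity
    false_positives = (1.0 - prior) * (1.0 - specificity)
    return true_positives / (true_positives + false_positives)


# Same hypothetical test, three different populations (priors).
for prior in (0.01, 0.10, 0.50):
    ppv = positive_predictive_value(prior, sensitivity=0.90, specificity=0.90)
    print(f"prior {prior:.2f} -> probability of disease after a positive test {ppv:.2f}")
```

With a 1% prior most positives are false positives even though the test looks quite accurate on paper, which is the screening situation; with a 50% prior the same positive result is fairly convincing.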

Of course most medical tests don’t just give two results. Dichotomization adds simplicity compared to more complex scenarios, so that’s where the book starts out, but it doesn’t stop there. If you have a test involving a continuous variable then dichotomizing the results will reduce the value of the test; this is equivalent to using pair-wise comparisons to make sense of continuous data in other contexts. However it’s sometimes useful to do it anyway because you may be in a situation where you need to quickly/easily separate ‘normal’ from ‘abnormal’. Likelihood ratios are really useful in the context of multi-level tests. In the simple dichotomous test, the LR for a test result is the probability of the result in a patient with disease divided by the probability of the result in a patient without disease. If you have lots of possible test results however, you’ll not be limited to two likelihood ratios; you’ll have as many likelihood ratios as there are results of the test. Those likelihood ratios are useful because the LR in the context of a multi-level test is equal to the slope of the ROC curve over the relevant interval. The ROC curve in some sense displays the tradeoff between sensitivity (‘true positive’) and specificity (‘true negative’); each point on the curve represents a different cut-off for calling a test positive. Such curves are quite useful in terms of figuring out if a test adds information or not, how well it distinguishes between patients. If you want to compare different tests and how they perform, Bland-Altman plots also seem to be useful tools.
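The claim that the likelihood ratio of a result equals the slope of the ROC curve over that interval is easy to check numerically. The sketch below uses invented counts for a three-level test (‘high’, ‘medium’, ‘low’, ordered from most to least abnormal); none of the numbers come from the book.

```python
from itertools import accumulate

# Hypothetical counts per result level, ordered from most to least abnormal.
diseased = [60, 30, 10]
healthy = [10, 30, 60]

# Probability of each result level within each group.
p_result_diseased = [n / sum(diseased) for n in diseased]
p_result_healthy = [n / sum(healthy) for n in healthy]

# One likelihood ratio per result level.
likelihood_ratios = [d / h for d, h in zip(p_result_diseased, p_result_healthy)]

# ROC points: lower the cut-off one level at a time, starting from (0, 0).
true_positive_rate = [0.0] + list(accumulate(p_result_diseased))
false_positive_rate = [0.0] + list(accumulate(p_result_healthy))

# Slope of each ROC segment.
roc_slopes = [
    (true_positive_rate[i + 1] - true_positive_rate[i])
    / (false_positive_rate[i + 1] - false_positive_rate[i])
    for i in range(len(diseased))
]

for lr, slope in zip(likelihood_ratios, roc_slopes):
    print(f"LR {lr:.3f}   ROC segment slope {slope:.3f}")
```

Each segment of the ROC curve rises by the share of diseased patients with that result and runs by the share of healthy patients with that result, so its slope is the likelihood ratio by construction.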

Sometimes the results of more than one test will be relevant to decision-making, and a key question to ask here is the extent to which tests, and test results, are independent. If tests are not independent, one should be careful about how to update the probability of disease based on a new laboratory finding, and about which conclusions can be drawn regarding the extent to which an additional test might or might not be necessary/useful/required. The book does not go into too much detail, but enough is said on this topic to make it clear that test dependence is a potential issue one should keep in mind when evaluating multiple test results. They do talk a bit about how to come up with decision-rules about which tests to prefer in situations where multiple interdependent tests are available for analysis.
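Here is a minimal sketch of what can go wrong when two tests are treated as independent when they are not. All the probabilities are hypothetical: two tests that each have a likelihood ratio of 4 for a positive result, but which tend to be positive in the same patients. The names are mine and this is not the book’s worked example.

```python
def posttest_probability(prior, likelihood_ratio):
    """Convert a prior probability to a post-test probability via odds."""
    prior_odds = prior / (1.0 - prior)
    posttest_odds = prior_odds * likelihood_ratio
    return posttest_odds / (1.0 + posttest_odds)


# Hypothetical probabilities of a positive result on test A and on test B.
p_a_positive = {"diseased": 0.80, "healthy": 0.20}
p_b_positive = {"diseased": 0.80, "healthy": 0.20}

# Hypothetical probability that BOTH are positive; the tests are strongly
# correlated, so this is far from the product of the two marginals.
p_both_positive = {"diseased": 0.75, "healthy": 0.15}

lr_a = p_a_positive["diseased"] / p_a_positive["healthy"]
lr_b = p_b_positive["diseased"] / p_b_positive["healthy"]

naive_lr = lr_a * lr_b  # valid only if the tests are independent
joint_lr = p_both_positive["diseased"] / p_both_positive["healthy"]

prior = 0.10
naive_posttest = posttest_probability(prior, naive_lr)
joint_posttest = posttest_probability(prior, joint_lr)

print(f"assuming independence: LR {naive_lr:.1f}, probability {naive_posttest:.2f}")
print(f"using joint results:   LR {joint_lr:.1f}, probability {joint_posttest:.2f}")
```

In this made-up case multiplying the two likelihood ratios overstates how much the second positive result adds, because the second test mostly repeats what the first one already told us.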

…

Sometimes blinding is difficult. The book tells us that it’s particularly important to blind when outcomes are subjective (like pain), and when prognostic factors may affect treatment in the study setting.

Medical tests can be used for different things, and not all tests are equal. One important distinction they talk about in the book is the distinction between diagnostic tests, which are done on sick people to figure out why they’re sick, and screening tests, which are mostly done on healthy people with a low prior probability of disease. There are different types of screening tests. One type of test is screening for symptomatic disease, which is sometimes done because people may be sick and have symptoms without being aware of the fact that they’re sick; screening for depression might be an example of this (that *may* even sometimes be cost-effective). These tests are reasonably similar to traditional diagnostic tests, and so can be evaluated in a similar manner. However most screening tests are of a different kind; they’re aimed at identifying *risk factors*, rather than ‘actual disease’ (a third kind is screening for presymptomatic disease). This generally tends to make them harder to justify undertaking, for reasons covered in much greater detail in *Juth and Munthe* (see the link over the word ‘may’ above). There are other differences as well; concepts such as sensitivity and specificity are for example difficult to relate to screening tests aimed at identifying risk factors, as such screening tests have as a goal to estimate incidence, rather than prevalence, which will often make it hard to compare such tests with the established ‘gold standard’ (as is usually the case). I decided to include a few quotes from this part of the coverage:

“the general public tends to be supportive of screening programs. Part of this is wishful thinking. We would like to believe that bad things happen for a reason, and that there are things we can do to prevent them […]. We also tend to be much more swayed by stories of individual patients (either those whose disease was detected early or those in whom it was found “too late”) than by boring statistics about risks, costs, and benefits […]. Because, at least in the U.S., there is no clear connection between money spent on screening tests and money not being available to spend on other things, the public tends not to be swayed by arguments about cost efficacy […]. In fact, in the general public’s view of screening, even wrong answers are not necessarily a bad thing. Schwartz et al. (2004) did a national telephone survey of attitudes about cancer screening in the U.S. They found that 38% of respondents had experienced at least one false-positive screening test. Although more than 40% of these subjects referred to that experience as “very scary” or the “scariest time of my life,” 98% were glad they had the screening test! […] Another disturbing result of the survey by Schwartz et al. was that, even though (as of 2002) the U.S. Preventive Health Services Task Force felt that evidence was insufficient to recommend prostate cancer screening, more than 60% of respondents said that a 55-year-old man who did not have a routine PSA test was “irresponsible,” and more than a third said this for an 80-year old […] Thus, regardless of the efficacy of screening tests, they have become an obligation if one does not wish to be blamed for getting some illnesses.”

There are many reasons why there may be problems with using observational studies to evaluate screening tests, and they talk a bit about those. One is what they call ‘volunteer bias’, which is just basic selection bias. Then there are the familiar problems of lead-time bias and length time bias. It should perhaps be noted here that both of the two latter problems can be handled in the context of a randomized controlled trial; neither lead-time bias nor length time bias is an issue if the study is an RCT which compares the entire screened group with the entire unscreened group. Yet another problem is stage-migration bias, which for example can be a problem when more sensitive tests allow for earlier detection which changes how people are staged; this may lead to changes in stage-specific mortality rates, without actually improving overall mortality at all. A final problem they talk about is overdiagnosis related to the problem of pseudodisease, which is disease that would never have affected the patient if it had not been diagnosed by the screening procedure. Again a quote might be in order:

“It is difficult to identify pseudodisease in an individual patient, because it requires completely ignoring the diagnosis. (If you treat pseudodisease, the treatment will always appear to be curative, and you won’t realize the patient had pseudodisease rather than real disease!) In some ways, pseudodisease is an extreme type of stage migration bias. Patients who were not previously diagnosed as having the disease are now counted as having it. Although the incidence of the disease goes up, the prognosis of those who have it improves. […] Lack of understanding of pseudodisease, including the lack of people who know they have had it, is a real problem, because most of us understand the world through stories […]. Patients whose pseudodisease has been “cured” become strong proponents of screening and treatment and can tell a powerful and easily understood story about their experience. On the other hand, there aren’t people who can tell a compelling story of pseudodisease – men who can say, “I had a completely unnecessary prostatectomy,” or women who say, “I had a completely unnecessary mastectomy,” even though we know statistically that many such people exist.

The existence of pseudo–lung cancer was strongly suggested by the results of the Mayo Lung Study, a randomized trial of chest x-rays and sputum cytology to screen for lung cancer among 9,211 male cigarette smokers (Marcus et al. 2000).”

I included the last part to indicate that this is actually a real problem also in situations where you’d be very likely to imagine it couldn’t possibly be a problem; even a disease as severe as lung cancer is subject to this kind of issue. There are also problems that may make screening tests look worse than they really are, like power issues, unskilled medical personnel doing the testing, and lack of follow-up (if a positive test result does not lead to any change in health care provision, there’s no good reason to assume earlier diagnosis as a result of screening will impact e.g. disease-specific mortality). On a related note there’s some debate about which mortality metric (general vs disease-specific) is to be preferred in the screening context, and they talk a bit about that as well.
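The stage-migration problem mentioned above is easy to illustrate with a toy example. The numbers are invented: seven patients, each with a fixed risk of dying (in percent), staged first with an old test and then with a more sensitive one that moves the sickest ‘early’ patient into the ‘late’ group. Nobody’s actual prognosis changes.

```python
from statistics import mean

# Hypothetical per-patient mortality risk in percent, grouped by stage.
old_early = [10, 10, 10, 40]
old_late = [60, 80, 80]

# A more sensitive test restages the 40% patient from early to late.
new_early = [10, 10, 10]
new_late = [40, 60, 80, 80]

print("early stage mortality:", mean(old_early), "->", mean(new_early))
print("late stage mortality: ", round(mean(old_late), 1), "->", mean(new_late))
print(
    "overall mortality:    ",
    round(mean(old_early + old_late), 1),
    "->",
    round(mean(new_early + new_late), 1),
)
```

Both stage-specific mortality figures improve after restaging, while overall mortality is exactly what it was before, which is why stage-specific comparisons over time can be misleading.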

I expected to write more about the book in this post than I have so far and perhaps include a few more quotes, but my computer broke down while I was writing this post yesterday so this is what you get. However as already mentioned this is a great book, and if you think you might like it based on the observations included in this post you should definitely read it.

## Introduction to Meta Analysis (III)

(xkcd).

…

This will be my last post about the book. Below I have included some observations from the last 100 pages.

…

“A central theme in this volume is the fact that we usually prefer to work with effect sizes, rather than p-values. […] While we would argue that researchers should shift their focus to effect sizes even when working entirely with primary studies, the shift is *absolutely critical* when our goal is to synthesize data from multiple studies. A narrative reviewer who works with p-values (or with reports that were based on p-values) and uses these as the basis for a synthesis, is facing an impossible task. Where people tend to misinterpret a single p-value, the problem is much worse when they need to compare a series of p-values. […] the p-value is often misinterpreted. Because researchers *care about* the effect size, they tend to take whatever information they have and press it into service as an indicator of effect size. A statistically significant p-value is assumed to reflect a clinically important effect, and a nonsignificant p-value is assumed to reflect a trivial (or zero) effect. However, these interpretations are not necessarily correct. […] The narrative review typically works with p-values (or with conclusions that are based on p-values), and therefore lends itself to […] mistakes. p-values that differ are assumed to reflect different effect sizes but may not […], p-values that are the same are assumed to reflect similar effect sizes but may not […], and a more significant p-value is assumed to reflect a larger effect size when it may actually be based on a smaller effect size […]. By contrast, the meta-analysis works with effect sizes. As such it not only focuses on the question of interest (what is the size of the effect) but allows us to compare the effect size from study to study.”

“To compute the summary effect in a meta-analysis we compute an effect size for each study and then combine these effect sizes, rather than pooling the data directly. […] This approach allows us to study the dispersion of effects before proceeding to the summary effect. For a random-effects model this approach also allows us to incorporate the between-studies dispersion into the weights. There is one additional reason for using this approach […]. The reason is to ensure that each effect size is based on the comparison of a group with its own control group, and thus avoid a problem known as Simpson’s paradox. In some cases, particularly when we are working with observational studies, this is a critically important feature. […] The term paradox refers to the fact that one group can do better in every one of the included studies, but still do worse when the raw data are pooled. The problem is not limited to studies that use proportions, but can exist also in studies that use means or other indices. The problem exists only when the base rate (or mean) varies from study to study and the proportion of participants from each group varies as well. For this reason, the problem is generally limited to observational studies, although it can exist in randomized trials when allocation ratios vary from study to study.” [*See the wiki article for more*]
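A quick numerical illustration of the paradox described in the quote, using made-up success counts for two hypothetical studies in which the group sizes are very unbalanced. The treated group does better within each study, but worse once the raw counts are pooled.

```python
# Hypothetical (successes, participants) for each group in two studies.
studies = [
    {"treated": (81, 87), "control": (234, 270)},
    {"treated": (192, 263), "control": (55, 80)},
]


def success_rate(successes, total):
    return successes / total


# Effect size computed within each study: treated minus its own control group.
per_study_differences = [
    success_rate(*study["treated"]) - success_rate(*study["control"])
    for study in studies
]


def pooled_rate(group):
    """Success rate after adding up the raw counts across studies."""
    successes = sum(study[group][0] for study in studies)
    total = sum(study[group][1] for study in studies)
    return successes / total


pooled_difference = pooled_rate("treated") - pooled_rate("control")

print("within-study differences:", [round(d, 3) for d in per_study_differences])
print("pooled raw-data difference:", round(pooled_difference, 3))
```

The reversal comes from the two conditions named in the quote: the base rate differs between the two studies, and so does the share of participants in each group. Combining the within-study effect sizes instead of the raw counts avoids it.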

“When studies are addressing the same outcome, measured in the same way, using the same approach to analysis, but presenting results in different ways, then the only obstacles to meta-analysis are practical. If sufficient information is available to estimate the effect size of interest, then a meta-analysis is possible. […]

When studies are addressing the same outcome, measured in the same way, but using different approaches to analysis, then the possibility of a meta-analysis depends on both statistical and practical considerations. One important point is that all studies in a meta-analysis must use essentially the same index of treatment effect. For example, we cannot combine a risk difference with a risk ratio. Rather, we would need to use the summary data to compute the same index for all studies.

There are some indices that are similar, if not exactly the same, and judgments are required as to whether it is acceptable to combine them. One example is odds ratios and risk ratios. When the event is rare, then these are approximately equal and can readily be combined. As the event gets more common the two diverge and should not be combined. Other indices that are similar to risk ratios are hazard ratios and rate ratios. Some people decide these are similar enough to combine; others do not. The judgment of the meta-analyst in the context of the aims of the meta-analysis will be required to make such decisions on a case by case basis.

When studies are addressing the same outcome measured in different ways, or different outcomes altogether, then the suitability of a meta-analysis depends mainly on substantive considerations. The researcher will have to decide whether a combined analysis would have a meaningful interpretation. […] There is a useful class of indices that are, perhaps surprisingly, combinable under some simple transformations. In particular, formulas are available to convert standardized mean differences, odds ratios and correlations to a common metric [*I should note that the book covers these data transformations, but I decided early on not to talk about that kind of stuff in my posts because it’s highly technical and difficult to blog*] […] These kinds of conversions require some assumptions about the underlying nature of the data, and violations of these assumptions can have an impact on the validity of the process. […] A report should state the computational model used in the analysis and explain why this model was selected. A common mistake is to use the fixed-effect model on the basis that there is no evidence of heterogeneity. As [already] explained […], the decision to use one model or the other should depend on the nature of the studies, and not on the significance of this test [because the test will often have low power anyway]. […] The report of a meta-analysis should generally include a forest plot.”
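The remark above about odds ratios and risk ratios is easy to check numerically. The sketch below uses hypothetical event risks of my own choosing: the treated risk is always half the control risk, so the risk ratio stays at 0.5, while the odds ratio drifts away from it as the event becomes more common.

```python
def risk_ratio(p_treated, p_control):
    """Ratio of event probabilities."""
    return p_treated / p_control


def odds_ratio(p_treated, p_control):
    """Ratio of event odds, where odds = p / (1 - p)."""
    odds_treated = p_treated / (1 - p_treated)
    odds_control = p_control / (1 - p_control)
    return odds_treated / odds_control


# Hypothetical control-group risks; the treated risk is always half of it,
# so the risk ratio is 0.5 in every row.
for p_control in (0.01, 0.10, 0.40, 0.80):
    p_treated = 0.5 * p_control
    rr = risk_ratio(p_treated, p_control)
    or_ = odds_ratio(p_treated, p_control)
    print(f"control risk {p_control:.2f}: RR = {rr:.3f}, OR = {or_:.3f}")
```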

“The issues addressed by a sensitivity analysis for a systematic review are similar to those that might be addressed by a sensitivity analysis for a primary study. That is, the focus is on the extent to which the results are (or are not) robust to assumptions and decisions that were made when carrying out the synthesis. The kinds of issues that need to be included in a sensitivity analysis will vary from one synthesis to the next. […] One kind of sensitivity analysis is concerned with the impact of decisions that lead to different data being used in the analysis. A common example of sensitivity analysis is to ask how results might have changed if different study inclusion rules had been used. […] Another kind of sensitivity analysis is concerned with the impact of the statistical methods used […] For example one might ask whether the conclusions would have been different if a different effect size measure had been used […] Alternatively, one might ask whether the conclusions would be the same if fixed-effect versus random-effects methods had been used. […] Yet another kind of sensitivity analysis is concerned with how we addressed missing data […] A very important form of missing data is the missing data on effect sizes that may result from incomplete reporting or selective reporting of statistical results within studies. When data are selectively reported in a way that is related to the magnitude of the effect size (e.g., when results are only reported when they are statistically significant), such missing data can have biasing effects similar to publication bias on entire studies. In either case, we need to ask how the results would have changed if we had dealt with missing data in another way.”

“A cumulative meta-analysis is a meta-analysis that is performed first with one study, then with two studies, and so on, until all relevant studies have been included in the analysis. As such, a cumulative analysis *is not a different analytic method* than a standard analysis, but simply *a mechanism for displaying a series of separate analyses* in one table or plot. When the series are sorted into a sequence based on some factor, the display shows how our estimate of the effect size (and its precision) shifts as a function of this factor. When the studies are sorted chronologically, the display shows how the evidence accumulated, and how the conclusions may have shifted, over a period of time.”
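A minimal sketch of the idea, assuming a fixed-effect (inverse-variance) summary and four invented studies; the function names and the data are mine. The same ordinary summary is simply recomputed each time a study is added, in whatever order the sort key gives:

```python
import math

# Hypothetical studies: (year, effect size, variance of the effect size).
studies = [
    (1995, 0.60, 0.09),
    (1998, 0.35, 0.04),
    (2003, 0.20, 0.02),
    (2009, 0.25, 0.01),
]


def fixed_effect_summary(subset):
    """Inverse-variance weighted mean and its standard error."""
    weights = [1 / variance for _, _, variance in subset]
    total = sum(weights)
    mean = sum(w * effect for w, (_, effect, _) in zip(weights, subset)) / total
    return mean, math.sqrt(1 / total)


def cumulative_meta_analysis(studies, key):
    """Re-run the summary after adding each study in the order given by `key`."""
    ordered = sorted(studies, key=key)
    rows = []
    for k in range(1, len(ordered) + 1):
        mean, se = fixed_effect_summary(ordered[:k])
        rows.append((ordered[k - 1][0], mean, se))
    return rows


# Sort chronologically (the first tuple element is the year).
rows = cumulative_meta_analysis(studies, key=lambda study: study[0])
for year, mean, se in rows:
    print(f"through {year}: summary = {mean:.3f}, SE = {se:.3f}")
```

Passing a different `key` (study size, a quality score, and so on) changes only the order in which the rows are built, not the method used for each row.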

“While cumulative analyses are most often used to display the pattern of the evidence over time, the same technique can be used for other purposes as well. Rather than sort the data chronologically, we can sort it by any variable, and then display the pattern of effect sizes. For example, assume that we have 100 studies that looked at the impact of homeopathic medicines, and we think that the effect is related to the quality of the blinding process. We anticipate that studies with complete blinding will show no effect, those with lower quality blinding will show a minor effect, those that blind only some people will show a larger effect, and so on. We could sort the studies based on the quality of the blinding (from high to low), and then perform a cumulative analysis. […] Similarly, we could use cumulative analyses to display the possible impact of publication bias. […] large studies are assumed to be unbiased, but the smaller studies may tend to over-estimate the effect size. We could perform a cumulative analysis, entering the larger studies at the top and adding the smaller studies at the bottom. If the effect was initially small when the large (nonbiased) studies were included, and then increased as the smaller studies were added, we would indeed be concerned that the effect size was related to sample size. A benefit of the cumulative analysis is that it displays not only *if* there is a shift in effect size, but also *the* *magnitude* of the shift. […] It is important to recognize that cumulative meta-analysis is a mechanism for display, rather than analysis. […] These kinds of displays are compelling and can serve an important function. However, if our goal is actually to examine the relationship between a factor and effect size, then the appropriate analysis is a meta-regression”

“John C. Bailar, in an editorial for the New England Journal of Medicine (Bailar, 1997), [wrote] that mistakes […] are common in meta-analysis. He argues that a meta-analysis is inherently so complicated that mistakes by the persons performing the analysis are all but inevitable. He also argues that journal editors are unlikely to uncover all of these mistakes. […] The specific points made by Bailar about problems with meta-analysis are entirely reasonable. He is correct that many meta-analyses contain errors, some of them important ones. His list of potential (and common) problems can serve as a bullet list of mistakes to avoid when performing a meta-analysis. However, the mistakes cited by Bailar are flaws in the application of the method, rather than problems with the method itself. Many primary studies suffer from flaws in the design, analyses, and conclusions. In fact, some serious kinds of problems are endemic in the literature. The response of the research community is to locate these flaws, consider their impact for the study in question, and (hopefully) take steps to avoid similar mistakes in the future. In the case of meta-analysis, as in the case of primary studies, we cannot condemn a method because some people have used that method improperly. […] In his editorial Bailar concludes that, until such time as the quality of meta-analyses is improved, he would prefer to work with the traditional narrative reviews […] We disagree with the conclusion that narrative reviews are preferable to systematic reviews, and that meta-analyses should be avoided. The narrative review suffers from every one of the problems cited for the systematic review. The only difference is that, in the narrative review, these problems are less obvious. […] the key advantage of the systematic approach of a meta-analysis is that all steps are clearly described so that the process is transparent.”

## Open Thread

It’s been a long time since I had one of these. Questions? Comments? Random observations?

I hate posting posts devoid of content, so here’s some random stuff:

i.

If you think the stuff above is all fun and games I should note that the topic of chirality, which is one of the things talked about in the lecture above, was actually covered in some detail in Gale’s book, which is hardly a book that spends a great deal of time talking about esoteric mathematical concepts. On a related note, the main reason I have not blogged that book is that I lost all the notes and highlights I’d made in the first 200 pages when my computer broke down, and I just can’t face reading the book again simply in order to blog it. It’s a good book, with interesting stuff, and I may decide to blog it later, but I don’t feel like doing it at the moment; without highlights and notes it’s a real pain to blog a book, and right now it’s just not worth it to reread it. Rereading books can be fun – I’ve incidentally been rereading *Darwin* lately and I may decide to blog this book soon; I imagine I might also choose to reread some of Asimov’s books before long – but it’s not much fun if you find yourself having to do it simply because the computer deleted your work.

…

ii. Beyond Power Calculations: Assessing Type S (Sign) and Type M (Magnitude) Errors.

Here’s the abstract:

“Statistical power analysis provides the conventional approach to assess error rates when designing a research study. However, power analysis is flawed in that a narrow emphasis on statistical significance is placed as the primary focus of study design. In noisy, small-sample settings, statistically significant results can often be misleading. To help researchers address this problem in the context of their own studies, we recommend design calculations in which (a) the probability of an estimate being in the wrong direction (*Type S [sign] error*) and (b) the factor by which the magnitude of an effect might be overestimated (*Type M [magnitude] error or exaggeration ratio*) are estimated. We illustrate with examples from recent published research and discuss the largest challenge in a design calculation: coming up with reasonable estimates of plausible effect sizes based on external information.”

If a study has low power, you can get into a lot of trouble. Some problems are well known, others probably aren’t. A bit more from the paper:

“design calculations can reveal three problems:

1. Most obvious, a study with low power is unlikely to “succeed” in the sense of yielding a statistically significant result.

2. It is quite possible for a result to be significant at the 5% level — with a 95% confidence interval that entirely excludes zero — and for there to be a high chance, sometimes 40% or more, that this interval is on the wrong side of zero. Even sophisticated users of statistics can be unaware of this point — that the probability of a Type S error is not the same as the p value or significance level.[3]

3. Using statistical significance as a screener can lead researchers to drastically overestimate the magnitude of an effect (Button et al., 2013).

Design analysis can provide a clue about the importance of these problems in any particular case.”

“Statistics textbooks commonly give the advice that statistical significance is not the same as practical significance, often with examples in which an effect is clearly demonstrated but is very small […]. In many studies in psychology and medicine, however, the problem is the opposite: an estimate that is statistically significant but with such a large uncertainty that it provides essentially no information about the phenomenon of interest. […] There is a range of evidence to demonstrate that it remains the case that too many small studies are done and preferentially published when “significant.” We suggest that one reason for the continuing lack of real movement on this problem is the historic focus on power as a lever for ensuring statistical significance, with inadequate attention being paid to the difficulties of interpreting statistical significance in underpowered studies. Because insufficient attention has been paid to these issues, we believe that too many small studies are done and preferentially published when “significant.” There is a common misconception that if you happen to obtain statistical significance with low power, then you have achieved a particularly impressive feat, obtaining scientific success under difficult conditions.

However, that is incorrect if the goal is scientific understanding rather than (say) publication in a top journal. In fact, statistically significant results in a noisy setting are highly likely to be in the wrong direction and invariably overestimate the absolute values of any actual effect sizes, often by a substantial factor.”
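Here is a rough Python sketch of the kind of design calculation the paper describes, assuming a normally distributed estimate with a known standard error and a positive true effect. The function name, the inputs and the two scenarios are my own illustration; the exaggeration ratio is estimated by simulation with a fixed seed.

```python
import random
from statistics import NormalDist


def design_analysis(true_effect, std_error, alpha=0.05, n_sims=100_000, seed=1):
    """Return (power, Type S error rate, Type M exaggeration ratio).

    Assumes true_effect > 0 and estimate ~ Normal(true_effect, std_error).
    """
    norm = NormalDist()
    z_crit = norm.inv_cdf(1 - alpha / 2)
    ratio = true_effect / std_error
    p_right_sign = 1 - norm.cdf(z_crit - ratio)  # significant, correct sign
    p_wrong_sign = norm.cdf(-z_crit - ratio)     # significant, wrong sign
    power = p_right_sign + p_wrong_sign
    type_s = p_wrong_sign / power

    # Exaggeration ratio: mean absolute size of the significant estimates,
    # relative to the true effect.
    rng = random.Random(seed)
    estimates = (rng.gauss(true_effect, std_error) for _ in range(n_sims))
    significant = [abs(e) for e in estimates if abs(e) > z_crit * std_error]
    type_m = (sum(significant) / len(significant)) / true_effect
    return power, type_s, type_m


# Two hypothetical designs: a noisy one and a reasonably precise one.
for label, effect, se in (("noisy", 0.5, 2.0), ("precise", 3.0, 1.0)):
    power, type_s, type_m = design_analysis(effect, se)
    print(f"{label}: power = {power:.3f}, Type S = {type_s:.3f}, Type M = {type_m:.2f}")
```

In the noisy scenario a statistically significant estimate has a substantial chance of having the wrong sign and overstates the true effect several times over; in the precise scenario both problems all but disappear.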

…

iii. I’m sure most people who might be interested in following the match are already well aware that Anand and Carlsen are currently competing for the world chess championship, and I’m not going to talk about that match here. However, I do want to mention to people interested in improving their chess that I recently came across this site, and that I quite like it. It only deals with endgames, but endgames are really important. If you don’t know much about endgames you may find the videos available here, here and here to be helpful.

…

iv. A link: Cross Validated: “Cross Validated is a question and answer site for people interested in statistics, machine learning, data analysis, data mining, and data visualization.”

A friend recently told me about this resource. I knew about the existence of StackExchange, but I haven’t really spent much time there. These days I mostly stick to books and a few sites I already know about; I rarely look for new interesting stuff online. This also means you should not automatically assume I surely already know about X when you’re considering whether to tell me about X in an Open Thread.

## Introduction to Meta Analysis (II)

You can read my first post about the book here. Some parts of the book are fairly technical, so I decided in the post below to skip some chapters in my coverage, simply because I could see no good way to cover the stuff on a WordPress blog (which, as already mentioned many times, is not ideal for math coverage) without spending a lot more time on that stuff than I wanted to. If you’re a new reader and/or you don’t know what a meta-analysis is, I highly recommend you read my first post about the book before moving on to the coverage below (and/or you can watch this brief video on the topic).

Below I have added some more quotes and observations from the book.

…

“In primary studies we use regression, or multiple regression, to assess the relationship between one or more covariates (moderators) and a dependent variable. Essentially the same approach can be used with meta-analysis, except that the covariates are at the level of the study rather than the level of the subject, and the dependent variable is the effect size in the studies rather than subject scores. We use the term *meta-regression* to refer to these procedures when they are used in a meta-analysis.

The differences that we need to address as we move from primary studies to meta-analysis for regression are similar to those we needed to address as we moved from primary studies to meta-analysis for subgroup analyses. These include the need to assign a weight to each study and the need to select the appropriate model (fixed versus random effects). Also, as was true for subgroup analyses, the *R*² index, which is used to quantify the proportion of variance explained by the covariates, must be modified for use in meta-analysis.

With these modifications, however, the full arsenal of procedures that fall under the heading of multiple regression becomes available to the meta-analyst. […] As is true in primary studies, where we need an appropriately large ratio of *subjects* to covariates in order for the analysis to be meaningful, in meta-analysis we need an appropriately large ratio of *studies* to covariates. Therefore, the use of meta-regression, especially with multiple covariates, is not a recommended option when the number of studies is small.”

“Power depends on the size of the effect and the precision with which we measure the effect. For subgroup analysis this means that power will increase as the difference between (or among) subgroup means increases, and/or the standard error within subgroups decreases. For meta-regression this means that power will increase as the magnitude of the relationship between the covariate and effect size increases, and/or the precision of the estimate increases. In both cases, a key factor driving the precision of the estimate will be the total number of individual subjects across all studies and (for random effects) the total number of studies. […] While there is a general perception that power for testing the main effect is consistently high in meta-analysis, this perception is not correct […] and certainly does not extend to tests of subgroup differences or to meta-regression. […] Statistical power for detecting a difference among subgroups, or for detecting the relationship between a covariate and effect size, is often low [and] failure to obtain a statistically significant difference among subgroups should never be interpreted as evidence that the effect is the same across subgroups. Similarly, failure to obtain a statistically significant effect for a covariate should never be interpreted as evidence that there is no relationship between the covariate and the effect size.”

“When we have effect sizes for more than one outcome (or time-point) within a study, based on the same participants, the information for the different effects is not independent and we need to take account of this in the analysis. […] When we are working with different outcomes at a single point in time, the plausible range of correlations [between outcomes] will depend on the similarity of the outcomes. When we are working with the same outcome at multiple time-points, the plausible range of correlations will depend on such factors as the time elapsed between assessments and the stability of the relative scores over this time period. […] Researchers who do not know the correlation between outcomes sometimes fall back on either of two ‘default’ positions. Some will include both [outcome variables] in the analysis and treat them as independent. Others would use the average of the [variances of the two outcomes]. It is instructive, therefore, to consider the practical impact of these choices. […] In effect, […] researchers who adopt either of these positions as a way of bypassing the need to specify a correlation, are actually adopting a correlation, albeit implicitly. And, the correlation that they adopt falls at either extreme of the possible range (either zero or 1.0). The first approach is almost certain to underestimate the variance and overestimate the precision. The second approach is almost certain to overestimate the variance and underestimate the precision.” [*A good example of a more general point in the context of statistical/mathematical modelling: Sometimes it’s really hard not to make assumptions, and trying to get around such problems by ‘ignoring them’ may sometimes lead to the implicit adoption of assumptions which are highly questionable as well.*]
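The implicit-correlation point can be seen from the standard formula for the variance of the mean of two correlated quantities. A small sketch, assuming two outcomes with equal variances (the value 0.04 is made up); with equal variances, a correlation of 1.0 gives exactly the average of the two variances, and a correlation of zero gives what you get by treating the outcomes as independent:

```python
import math


def composite_variance(v1, v2, r):
    """Variance of the mean of two effect sizes with correlation r."""
    return (v1 + v2 + 2 * r * math.sqrt(v1 * v2)) / 4


v = 0.04  # hypothetical variance, the same for both outcomes
for r in (0.0, 0.5, 1.0):
    print(f"r = {r:.1f}: variance of the composite = {composite_variance(v, v, r):.3f}")
```

Any plausible correlation between the outcomes lands somewhere between those two extremes, which is why both 'default' positions are likely to be off in the directions the authors describe.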

“Vote counting is the name used to describe the idea of seeing how many studies yielded a significant result, and how many did not. […] narrative reviewers often resort to [vote counting] […] In some cases this process has been formalized, such that one actually counts the number of significant and non-significant p-values and picks the winner. In some variants, the reviewer would look for a clear majority rather than a simple majority. […] One might think that summarizing *p*-values through a vote-counting procedure would yield more accurate decision than any one of the single significance tests being summarized. This is not generally the case, however. In fact, Hedges and Olkin (1980) showed that the power of vote-counting considered as a statistical decision procedure can not only be lower than that of the studies on which it is based, the power of vote counting can tend toward zero as the number of studies increases. […] the idea of vote counting is fundamentally flawed and the variants on this process are equally flawed (and perhaps even more dangerous, since the basic flaw is less obvious when hidden behind a more complicated algorithm or is one step removed from the *p*-value). […] The logic of vote counting says that a significant finding is evidence that an effect exists, while a non-significant finding is evidence that an effect is absent. While the first statement is true, the second is not. While a nonsignificant finding *could* be due to the fact that the true effect is nil, it can also be due simply to low statistical power. Put simply, the *p*-value reported for any study is a function of the observed effect size and the sample size. Even if the observed effect is substantial, the *p*-value will not be significant unless the sample size is adequate. In other words, as most of us learned in our first statistics course, *the absence of a statistically significant effect is not evidence that an effect is absent*.”
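A toy illustration of why counting votes fails, under assumptions that are entirely mine: ten studies that each observe the same standardized mean difference of 0.3 with 20 subjects per group, a normal approximation for the test, and the variance of the effect size approximated as 2 divided by the group size. No single study is significant, so the vote count says "no effect", while a fixed-effect pooling of the very same studies is clearly significant.

```python
import math
from statistics import NormalDist


def two_sided_p(effect, variance):
    """Two-sided p-value from a normal (z) approximation."""
    z = abs(effect) / math.sqrt(variance)
    return 2 * (1 - NormalDist().cdf(z))


# Hypothetical: ten studies, each observing d = 0.3 with 20 subjects per group.
# The variance of d is approximated as 2 / n_per_group (equal groups, small d).
n_studies, d, n_per_group = 10, 0.3, 20
study_variance = 2 / n_per_group

study_p_values = [two_sided_p(d, study_variance) for _ in range(n_studies)]
votes_for_effect = sum(p < 0.05 for p in study_p_values)

# Fixed-effect pooling of identical studies: same mean, variance divided by k.
pooled_p = two_sided_p(d, study_variance / n_studies)

print(f"significant studies: {votes_for_effect} of {n_studies}")
print(f"p-value in each study: {study_p_values[0]:.3f}")
print(f"p-value of the pooled effect: {pooled_p:.4f}")
```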

“While the term vote counting is associated with narrative reviews it can also be applied to the single study, where a significant p-value is taken as evidence that an effect exists, and a nonsignificant p-value is taken as evidence that an effect does not exist. Numerous surveys in a wide variety of substantive fields have repeatedly documented the ubiquitous nature of this mistake. […] When we are working with a single study and we have a nonsignificant result we don’t have any way of knowing whether or not the effect is real. The nonsignificant *p*-value could reflect either the fact that the true effect is nil *or* the fact that our study had low power. While we caution against accepting the former (that the true effect is nil) we cannot rule it out. By contrast, when we use meta-analysis to synthesize the data from a series of studies we can often identify the true effect. And in many cases (for example if the true effect is substantial and is consistent across studies) we can assert that the nonsignificant *p*-value in the separate studies was due to low power rather than the absence of an effect. […] vote counting is never a valid approach.”

“The fact that a meta-analysis will often [but not always] have high power is important because […] primary studies often suffer from low power. While researchers are encouraged to design studies with power of at least 80%, this goal is often elusive. Many studies in medicine, psychology, education and an array of other fields have power substantially lower than 80% to detect large effects, and substantially lower than 50% to detect smaller effects that are still important enough to be of theoretical or practical importance. By contrast, a meta-analysis based on multiple studies will have a higher total sample size than any of the separate studies and the increase in power can be substantial. The problem of low power in the primary studies is especially acute when looking for adverse events. The problem here is that studies to test new drugs are *powered* to find a treatment effect for the drug, and do not have adequate power to detect side effects (which have a much lower event rate, and therefore lower power).”

“Assuming a nontrivial effect size, power is primarily a function of the precision […] When we are working with a fixed-effect analysis, precision for the summary effect is always higher than it is for any of the included studies. Under the fixed-effect analysis precision is largely determined by the total sample size […], and it follows the total sample size will be higher across studies than within studies. […] in a random-effects meta-analysis, power depends on within-study error and between-studies variation [*…if you don’t recall the difference between fixed-effects models and random effects models, see the previous post*]. If the effect sizes are reasonably consistent from study to study, and/or if the analysis includes a substantial number of studies, then the second of these will tend to be small, and power will be driven by the cumulative sample size. In this case the meta-analysis will tend to have higher power than any of the included studies. […] However, if the effect size varies substantially from study to study, and the analysis includes only a few studies, then this second aspect will limit the potential power of the meta-analysis. In this case, power could be limited to some low value even if the analysis includes tens of thousands of persons. […] The Cochrane Database of Systematic Reviews is a database of systematic reviews, primarily of randomized trials, for medical interventions in all areas of healthcare, and currently includes over 3000 reviews. In this database, the median number of trials included in a review is six. When a review includes only six studies, power to detect even a moderately large effect, let alone a small one, can be well under 80%. While the median number of studies in a review differs by the field of research, in almost any field we do find some reviews based on a small number of studies, and so we cannot simply assume that power is high. 
[…] Even when power to test the main effect is high, many meta-analyses are not concerned with the main effect at all, but are performed solely to assess the impact of covariates (or moderator variables). […] The question to be addressed is not whether the treatment works, but whether one variant of the treatment is more effective than another variant. The test of a moderator variable in a meta-analysis is akin to the test of an interaction in a primary study, and both suffer from the same factors that tend to decrease power. First, the effect size is actually the difference between the two effect sizes and so is almost invariably smaller than the main effect size. Second, the sample size within groups is (by definition) smaller than the total sample size. Therefore, power for testing the moderator will often be very low (Hedges and Pigott, 2004).”
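The claim that power can stay low "even if the analysis includes tens of thousands of persons" follows from the variance of the random-effects summary. A simplified sketch, with assumptions of my own: every study has the same size, the between-studies variance (`tau_squared`) is treated as known, the within-study variance of a standardized mean difference is approximated as 2 divided by the group size, and the test is a two-sided z-test. All names and numbers are invented for illustration.

```python
import math
from statistics import NormalDist


def random_effects_power(true_effect, n_per_group, k_studies, tau_squared, alpha=0.05):
    """Approximate power of the test of the random-effects summary effect."""
    norm = NormalDist()
    within_var = 2 / n_per_group  # rough variance of one study's effect size
    summary_var = (within_var + tau_squared) / k_studies
    z_crit = norm.inv_cdf(1 - alpha / 2)
    shift = true_effect / math.sqrt(summary_var)
    return (1 - norm.cdf(z_crit - shift)) + norm.cdf(-z_crit - shift)


# Six studies, substantial between-studies variance (hypothetical values).
power_small_n = random_effects_power(0.2, n_per_group=50, k_studies=6, tau_squared=0.09)
power_huge_n = random_effects_power(0.2, n_per_group=5000, k_studies=6, tau_squared=0.09)
# Same small studies, but sixty of them.
power_many_studies = random_effects_power(0.2, n_per_group=50, k_studies=60, tau_squared=0.09)

print(f"6 studies, 50 per group:   power = {power_small_n:.2f}")
print(f"6 studies, 5000 per group: power = {power_huge_n:.2f}")
print(f"60 studies, 50 per group:  power = {power_many_studies:.2f}")
```

Making the six studies enormous shrinks only the within-study part of the variance; the between-studies part is divided by the number of studies, so adding studies is what helps.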

“It is important to understand that the fixed-effect model and random-effects model address different hypotheses, and that they use different estimates of the variance because they make different assumptions about the nature of the distribution of effects across studies […]. Researchers sometimes remark that power is lower under the random-effects model than for the fixed-effect model. While this statement may be true, it misses the larger point: it is not meaningful to compare power for fixed- and random-effects analyses since the two values of power are not addressing the same question. […] Many meta-analyses include a test of homogeneity, which asks whether or not the between-studies dispersion is more than would be expected by chance. The test of significance is […] based on *Q*, the sum of the squared deviations of each study’s effect size estimate (*Yi*) from the summary effect (*M*), with each deviation weighted by the inverse of that study’s variance. […] Power for this test depends on three factors. The larger the ratio of between-studies to within-studies variance, the larger the number of studies, and the more liberal the criterion for significance, the higher the power.”
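The *Q* statistic described in the quote is straightforward to compute. A minimal sketch with invented effect sizes and variances, using the inverse-variance weighted mean as the summary effect *M*; under homogeneity *Q* is compared with a chi-squared distribution with (number of studies − 1) degrees of freedom, so I just print *Q* next to the degrees of freedom instead of computing a p-value:

```python
def q_statistic(effects, variances):
    """Weighted sum of squared deviations from the summary effect, plus df."""
    weights = [1 / v for v in variances]
    summary = sum(w * y for w, y in zip(weights, effects)) / sum(weights)
    q = sum(w * (y - summary) ** 2 for w, y in zip(weights, effects))
    return q, len(effects) - 1


variances = [0.01, 0.01, 0.01]  # hypothetical within-study variances
q_same, df = q_statistic([0.5, 0.5, 0.5], variances)
q_spread, _ = q_statistic([0.1, 0.5, 0.9], variances)

print(f"identical effects: Q = {q_same:.1f} on {df} df")
print(f"dispersed effects: Q = {q_spread:.1f} on {df} df")
```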

“While a meta-analysis will yield a mathematically accurate synthesis of the studies included in the analysis, if these studies are a biased sample of all relevant studies, then the mean effect computed by the meta-analysis will reflect this bias. Several lines of evidence show that studies that report relatively high effect sizes are more likely to be published than studies that report lower effect sizes. Since published studies are more likely to find their way into a meta-analysis, any bias in the literature is likely to be reflected in the meta-analysis as well. This issue is generally known as publication bias. The problem of publication bias is not unique to systematic reviews. It affects the researcher who writes a narrative review and even the clinician who is searching a database for primary papers. […] Other factors that can lead to an upward bias in effect size and are included under the umbrella of publication bias are the following. Language bias (English-language databases and journals are more likely to be searched, which leads to an oversampling of statistically significant studies) […]; availability bias (selective inclusion of studies that are easily accessible to the researcher); cost bias (selective inclusion of studies that are available free or at low cost); familiarity bias (selective inclusion of studies only from one’s own discipline); duplication bias (studies with statistically significant results are more likely to be published more than once […]) and citation bias (whereby studies with statistically significant results are more likely to be cited by others and therefore easier to identify […]). 
[…] If persons performing a systematic review were able to locate studies that had been published in the grey literature (any literature produced in electronic or print format that is not controlled by commercial publishers, such as technical reports and similar sources), then the fact that the studies with higher effects are more likely to be published in the more mainstream publications would not be a problem for meta-analysis. In fact, though, this is not usually the case.

While a systematic review *should* include a thorough search for all relevant studies, the actual amount of grey/unpublished literature included, and the types, varies considerably across meta-analyses.”

“In sum, it is possible that the studies in a meta-analysis may overestimate the true effect size because they are based on a biased sample of the target population of studies. But how do we deal with this concern? The only true test for publication bias is to compare effects in the published studies formally with effects in the unpublished studies. This requires access to the unpublished studies, and if we had that we would no longer be concerned. Nevertheless, the best approach would be for the reviewer to perform a truly comprehensive search of the literature, in hopes of minimizing the bias. In fact, there is evidence that this approach is somewhat effective. Cochrane reviews tend to include more studies and to report a smaller effect size than similar reviews published in medical journals. Serious efforts to find unpublished, and difficult to find studies, typical of Cochrane reviews, may therefore reduce some of the effects of publication bias. Despite the increased resources that are needed to locate and retrieve data from sources such as dissertations, theses, conference papers, government and technical reports and the like, it is generally indefensible to conduct a synthesis that categorically excludes these types of research reports. Potential benefits and costs of grey literature searches must be balanced against each other.”

“Since we cannot be certain that we have avoided bias, researchers have developed methods intended to assess its potential impact on any given meta-analysis. These methods address the following questions:

* Is there evidence of any bias?

* Is it possible that the entire effect is an artifact of bias?

* How much of an impact might the bias have? […]

Methods developed to address publication bias require us to make many assumptions, including the assumption that the pattern of results is due to bias, and that this bias follows a certain model. […] In order to gauge the impact of publication bias we need a model that tells us which studies are likely to be missing. The model that is generally used […] makes the following assumptions: (a) Large studies are likely to be published regardless of statistical significance because these involve large commitments of time and resources. (b) Moderately sized studies are at risk for being lost, but with a moderate sample size even modest effects will be significant, and so only some studies are lost here. (c) Small studies are at greatest risk for being lost. Because of the small sample size, only the largest effects are likely to be significant, with the small and moderate effects likely to be unpublished.

The combined result of these three items is that we expect the bias to increase as the sample size goes down, and the methods described […] are all based on this model. […] [One problem, however, is that] when there is clear evidence of asymmetry, we cannot assume that this reflects publication bias. The effect size may be larger in small studies because we retrieved a biased sample of the smaller studies, but it is also possible that the effect size really is larger in smaller studies for entirely unrelated reasons. For example, the small studies may have been performed using patients who were quite ill, and therefore more likely to benefit from the drug (as is sometimes the case in early trials of a new compound). Or, the small studies may have been performed with better (or worse) quality control than the larger ones. Sterne et al. (2001) use the term *small-study effect* to describe a pattern where the effect is larger in small studies, and to highlight the fact that the mechanism for this effect is not known.”
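To make the three-part model quoted above a bit more concrete, here is a minimal simulation sketch of my own (it is not from the book). Everything in it is an assumption chosen for illustration: the true standardized effect of 0.3, the per-group sample sizes of 20, 100 and 500, the approximation that the standard error is sqrt(2/n), and the deliberately crude publication rule that the largest studies are always published while smaller ones are published only if they are significant in the expected direction. The function name `simulate_published_effects` is hypothetical.

```python
import math
import random
import statistics


def simulate_published_effects(true_effect=0.3, group_sizes=(20, 100, 500),
                               studies_per_size=5000, always_publish_from=500,
                               seed=0):
    """Simulate a crude publication filter and summarise what survives.

    Returns {n: (share_published, mean_published_effect)} for each
    per-group sample size n. All parameter values are illustrative.
    """
    rng = random.Random(seed)
    summary = {}
    for n in group_sizes:
        # Approximate standard error of a standardized mean difference
        # with n subjects in each of two groups (assumption).
        se = math.sqrt(2.0 / n)
        published = []
        for _ in range(studies_per_size):
            observed = rng.gauss(true_effect, se)
            significant = observed / se > 1.96
            # Assumed rule: large studies are always published,
            # smaller ones only when significant.
            if n >= always_publish_from or significant:
                published.append(observed)
        share = len(published) / studies_per_size
        summary[n] = (share, statistics.mean(published))
    return summary


if __name__ == "__main__":
    for n, (share, mean_effect) in simulate_published_effects().items():
        print(f"n per group = {n:4d}  published = {share:.0%}  "
              f"mean published effect = {mean_effect:.2f}")
```

Under these assumptions the mean effect among the published studies rises as the sample size falls, even though every simulated study shares the same true effect. That is the asymmetry the methods look for. The sketch cannot tell this apart from a genuine small-study effect of the kind described in the quote, because here the selection mechanism is built in by construction.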

“It is almost always important to include an assessment of publication bias in relation to a meta-analysis. It will either assure the reviewer that the results are robust, or alert them that the results are suspect.”