## Beyond Significance Testing (III)

There are many ways to misinterpret significance tests, and this book spends quite a bit of time and effort on these kinds of issues. I decided to include in this post quite a few quotes from chapter 4 of the book, which deals with these topics in some detail. I also included some notes on effect sizes.

…

“[*P*] < .05 means that the likelihood of the data, or of results even more extreme, given random sampling under the null hypothesis is < .05, assuming that all distributional requirements of the test statistic are satisfied and there are no other sources of error variance. […] the **odds-against-chance fallacy** […] [is] the false belief that *p* indicates the probability that a result happened by sampling error; thus, *p* < .05 says that there is less than a 5% likelihood that a particular finding is due to chance. There is a related misconception I call the **filter myth**, which says that *p* values sort results into two categories, those that are a result of “chance” (H_{0} not rejected) and others that are due to “real” effects (H_{0} rejected). These beliefs are wrong […] When *p* is calculated, it is already assumed that H_{0} is true, so the probability that sampling error is the only explanation is already taken to be 1.00. It is thus illogical to view *p* as measuring the likelihood of sampling error. […] There is no such thing as a statistical technique that determines the probability that various causal factors, including sampling error, acted on a particular result.”
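To make the definition in the first sentence concrete, here is a small Monte Carlo sketch; the data set and the N(0, 1) null population are hypothetical, chosen only for illustration. The *p* value is nothing more than the long-run proportion of random samples under H_{0} whose result is at least as extreme as the one observed.

```python
import random
from statistics import mean

random.seed(0)
observed = [0.8, 1.2, 0.3, 1.5, 0.9, 1.1, 0.7, 1.4]  # hypothetical scores
obs_mean = mean(observed)

# p estimates the probability, under H0 (here mu = 0, sd = 1), of a
# sample mean at least as extreme as the one actually observed
reps = 20_000
extreme = sum(
    abs(mean(random.gauss(0, 1) for _ in range(len(observed)))) >= abs(obs_mean)
    for _ in range(reps)
)
p = extreme / reps
print(p)  # small: such a mean is rare under H0, and that is all p says
```

Note what the simulation assumes from the outset: every replication is generated with H_{0} true, which is exactly why *p* cannot also measure the probability that H_{0} (or sampling error) is the explanation.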

“Most psychology students and professors may endorse the **local Type I error fallacy** [which is] the mistaken belief that *p* < .05 given α = .05 means that the likelihood that the decision just taken to reject H_{0} is a Type I error is less than 5%. […] *p* values from statistical tests are conditional probabilities of data, so they do not apply to any specific decision to reject H_{0}. This is because any particular decision to do so is either right or wrong, so no probability is associated with it (other than 0 or 1.0). Only with sufficient replication could one determine whether a decision to reject H_{0} in a particular study was correct. […] the **valid research hypothesis fallacy** […] refers to the false belief that the probability that H_{1} is true is > .95, given *p* < .05. The complement of *p* is a probability, but 1 – *p* is just the probability of getting a result even less extreme under H_{0} than the one actually found. This fallacy is endorsed by most psychology students and professors”.

“[S]everal different false conclusions may be reached after deciding to reject or fail to reject H_{0}. […] the **magnitude fallacy** is the false belief that low *p* values indicate large effects. […] *p* values are confounded measures of effect size and sample size […]. Thus, effects of trivial magnitude need only a large enough sample to be statistically significant. […] the **zero fallacy** […] is the mistaken belief that the failure to reject a nil hypothesis means that the population effect size is zero. Maybe it is, but you cannot tell based on a result in one sample, especially if power is low. […] The **equivalence fallacy** occurs when the failure to reject H_{0}: µ_{1} = µ_{2} is interpreted as saying that the populations are equivalent. This is wrong because even if µ_{1} = µ_{2}, distributions can differ in other ways, such as variability or distribution shape.”
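The sample-size confound behind the magnitude fallacy is easy to see with a z test, where the test statistic is just d·√n. A minimal sketch (the effect size d = .05 is hypothetical and deliberately trivial):

```python
import math

def z_test_p(d, n):
    """Two-sided p for a one-sample z test of standardized effect d
    with sample size n; the test statistic is z = d * sqrt(n)."""
    z = d * math.sqrt(n)
    return 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))

d = 0.05  # a trivially small standardized effect, held constant
for n in (100, 10_000, 1_000_000):
    print(n, z_test_p(d, n))  # p falls steadily as n grows, d unchanged
```

The same d that is nowhere near significant at n = 100 becomes overwhelmingly “significant” at n = 1,000,000, which is exactly why a low *p* says nothing about the size of an effect.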

“[T]he **reification fallacy** is the faulty belief that failure to replicate a result is the failure to make the same decision about H_{0} across studies […]. In this view, a result is not considered replicated if H_{0} is rejected in the first study but not in the second study. This sophism ignores sample size, effect size, and power across different studies. […] The **sanctification fallacy** refers to dichotomous thinking about continuous p values. […] Differences between results that are “significant” versus “not significant” by close margins, such as p = .03 versus p = .07 when α = .05, are themselves often not statistically significant. That is, relatively large changes in *p* can correspond to small, nonsignificant changes in the underlying variable (Gelman & Stern, 2006). […] Classical parametric statistical tests are not robust against outliers or violations of distributional assumptions, especially in small, unrepresentative samples. But many researchers believe just the opposite, which is the **robustness fallacy**. […] most researchers do not provide evidence about whether distributional or other assumptions are met”.
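The Gelman and Stern point about the sanctification fallacy can be checked numerically: convert p = .03 and p = .07 back to z statistics and test their difference (assuming, for illustration, two independent results with equal precision).

```python
from statistics import NormalDist

nd = NormalDist()
z1 = nd.inv_cdf(1 - 0.03 / 2)  # study 1: p = .03, "significant"
z2 = nd.inv_cdf(1 - 0.07 / 2)  # study 2: p = .07, "not significant"

# z test of the difference between the two results
z_diff = (z1 - z2) / 2 ** 0.5
p_diff = 2 * (1 - nd.cdf(z_diff))
print(round(p_diff, 2))  # the difference itself is far from significant
```

The two studies land on opposite sides of α = .05, yet the test of their difference yields a *p* of about .80: sorting them into “worked” and “didn’t work” bins is pure dichotomous thinking.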

“Many [of the above] fallacies involve wishful thinking about things that researchers really want to know. These include the probability that H_{0} or H_{1} is true, the likelihood of replication, and the chance that a particular decision to reject H_{0} is wrong. Alas, statistical tests tell us only the conditional probability of the data. […] But there is [however] a method that can tell us what we want to know. It is not a statistical technique; rather, it is good, old-fashioned replication, which is also the best way to deal with the problem of sampling error. […] Statistical significance provides even in the best case nothing more than low-level support for the existence of an effect, relation, or difference. That best case occurs when researchers estimate a priori power, specify the correct construct definitions and operationalizations, work with random or at least representative samples, analyze highly reliable scores in distributions that respect test assumptions, control other major sources of imprecision besides sampling error, and test plausible null hypotheses. In this idyllic scenario, *p* values from statistical tests may be reasonably accurate and potentially meaningful, if they are not misinterpreted. […] The capability of significance tests to address the dichotomous question of whether effects, relations, or differences are greater than expected levels of sampling error may be useful in some new research areas. Due to the many limitations of statistical tests, this period of usefulness should be brief. Given evidence that an effect exists, the next steps should involve estimation of its magnitude and evaluation of its substantive significance, both of which are beyond what significance testing can tell us. […] It should be a hallmark of a maturing research area that significance testing is not the primary inference method.”

“[An] **effect size** [is] a quantitative reflection of the magnitude of some phenomenon used for the sake of addressing a specific research question. In this sense, an effect size is a statistic (in samples) or parameter (in populations) with a purpose, that of quantifying a phenomenon of interest. More specific definitions may depend on study design. […] **cause size** refers to the independent variable and specifically to the amount of change in it that produces a given effect on the dependent variable. A related idea is that of **causal efficacy**, or the ratio of effect size to the size of its cause. The greater the causal efficacy, the more that a given change on an independent variable results in proportionally bigger changes on the dependent variable. The idea of cause size is most relevant when the factor is experimental and its levels are quantitative. […] An **effect size measure** […] is a named expression that maps data, statistics, or parameters onto a quantity that represents the magnitude of the phenomenon of interest. This expression connects dimensions or generalized units that are abstractions of variables of interest with a specific operationalization of those units.”

“A good effect size measure has the [following properties:] […] 1. Its scale (metric) should be appropriate for the research question. […] 2. It should be independent of sample size. […] 3. As a point estimate, an effect size should have good statistical properties; that is, it should be unbiased, consistent […], and efficient […]. 4. The effect size [should be] reported with a confidence interval. […] Not all effect size measures […] have all the properties just listed. But it is possible to report multiple effect sizes that address the same question in order to improve the communication of the results.”

“Examples of outcomes with meaningful metrics include salaries in dollars and post-treatment survival time in years. Means or contrasts for variables with meaningful units are **unstandardized effect sizes** that can be directly interpreted. […] In medical research, physical measurements with meaningful metrics are often available. […] But in psychological research there are typically no “natural” units for abstract, nonphysical constructs such as intelligence, scholastic achievement, or self-concept. […] Therefore, metrics in psychological research are often arbitrary instead of meaningful. An example is the total score for a set of true-false items. Because responses can be coded with any two different numbers, the total is arbitrary. Standard scores such as percentiles and normal deviates are arbitrary, too […] **Standardized effect sizes** can be computed for results expressed in arbitrary metrics. Such effect sizes can also be directly compared across studies where outcomes have different scales. This is because standardized effect sizes are based on units that have a common meaning regardless of the original metric.”

“1. It is better to report unstandardized effect sizes for outcomes with meaningful metrics. This is because the original scale is lost when results are standardized. 2. Unstandardized effect sizes are best for comparing results across different samples measured on the same outcomes. […] 3. Standardized effect sizes are better for comparing conceptually similar results based on different units of measure. […] 4. Standardized effect sizes are affected by the corresponding unstandardized effect sizes plus characteristics of the study, including its design […], whether factors are fixed or random, the extent of error variance, and sample base rates. This means that standardized effect sizes are less directly comparable over studies that differ in their designs or samples. […] 5. There is no such thing as **T-shirt effect sizes** (Lenth, 2006–2009) that classify standardized effect sizes as “small,” “medium,” or “large” and apply over all research areas. This is because what is considered a large effect in one area may be seen as small or trivial in another. […] 6. There is usually no way to directly translate standardized effect sizes into implications for substantive significance. […] It is standardized effect sizes from sets of related studies that are analyzed in most meta-analyses.”

## Beyond Significance Testing (II)

I have added some more quotes and observations from the book below.

…

“The least squares estimators *M* and *s*^{2} are not robust against the effects of extreme scores. […] Conventional methods to construct confidence intervals rely on sample standard deviations to estimate standard errors. These methods also rely on critical values in central test distributions, such as *t* and *z*, that assume normality or homoscedasticity […] Such distributional assumptions are not always plausible. […] One option to deal with outliers is to apply transformations, which convert original scores with a mathematical operation to new ones that may be more normally distributed. The effect of applying a **monotonic transformation** is to compress one part of the distribution more than another, thereby changing its shape but not the rank order of the scores. […] It can be difficult to find a transformation that works in a particular data set. Some distributions can be so severely nonnormal that basically no transformation will work. […] An alternative that also deals with departures from distributional assumptions is robust estimation. **Robust (resistant) estimators** are generally less affected than least squares estimators by outliers or nonnormality.”
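A common monotonic transformation is the logarithm, which compresses the upper tail of a positively skewed distribution while leaving the rank order of the scores intact. A small sketch with simulated lognormal scores (the distribution and sample size are assumptions made for illustration):

```python
import math
import random
from statistics import mean, stdev

random.seed(5)
scores = [math.exp(random.gauss(0, 1)) for _ in range(500)]  # positively skewed
logged = [math.log(x) for x in scores]  # monotonic transformation

def skew(xs):
    """Adjusted Fisher-Pearson sample skewness."""
    m, s, n = mean(xs), stdev(xs), len(xs)
    return sum(((x - m) / s) ** 3 for x in xs) * n / ((n - 1) * (n - 2))

print(round(skew(scores), 2))  # strongly positive
print(round(skew(logged), 2))  # close to zero after the transformation

# the shape changes, but the rank order of the scores does not:
rank_kept = (sorted(range(500), key=scores.__getitem__)
             == sorted(range(500), key=logged.__getitem__))
print(rank_kept)  # True
```

Of course, the log works here precisely because the scores were generated as lognormal; as the quote notes, for many real data sets no single transformation works this cleanly.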

“An estimator’s quantitative robustness can be described by its **finite-sample breakdown point** (BP), or the smallest proportion of scores that when made arbitrarily very large or small renders the statistic meaningless. The lower the value of BP, the less robust the estimator. For both M and s^{2}, BP = 0, the lowest possible value. This is because the value of either statistic can be distorted by a single outlier, and the ratio 1/N approaches zero as sample size increases. In contrast, BP = .50 for the median because its value is not distorted by arbitrarily extreme scores unless they make up at least half the sample. But the median is not an optimal estimator because its value is determined by a single score, the one at the 50th percentile. In this sense, all the other scores are discarded by the median. A compromise between the sample mean and the median is the **trimmed mean**. A trimmed mean M_{tr} is calculated by (a) ordering the scores from lowest to highest, (b) deleting the same proportion of the most extreme scores from each tail of the distribution, and then (c) calculating the average of the scores that remain. […] A common practice is to trim 20% of the scores from each tail of the distribution when calculating trimmed estimators. This proportion tends to maintain the robustness of trimmed means while minimizing their standard errors when sampling from symmetrical distributions […] For 20% trimmed means, BP = .20, which says they are robust against arbitrarily extreme scores unless such outliers make up at least 20% of the sample.”
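Steps (a) through (c) of the trimmed mean are easy to express in code. A minimal sketch (the data are made up, with one extreme outlier):

```python
def trimmed_mean(scores, trim=0.20):
    """(a) order the scores, (b) delete the proportion `trim` of the
    most extreme scores from each tail, (c) average what remains."""
    xs = sorted(scores)
    k = int(len(xs) * trim)  # cases deleted per tail
    kept = xs[k:len(xs) - k] if k else xs
    return sum(kept) / len(kept)

data = [2, 3, 3, 4, 4, 5, 5, 6, 6, 1000]  # one extreme outlier
print(sum(data) / len(data))   # ordinary mean, dragged out to 103.8
print(trimmed_mean(data))      # 4.5, close to the bulk of the scores
```

With 20% trimming, the single outlier is among the two cases deleted from the upper tail, so it has no influence at all, which is the BP = .20 property in action.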

“The standard H_{0} is both a point hypothesis and a nil hypothesis. A **point hypothesis** specifies the numerical value of a parameter or the difference between two or more parameters, and a **nil hypothesis** states that this value is zero. The latter is usually a prediction that an effect, difference, or association is zero. […] Nil hypotheses as default explanations may be fine in new research areas when it is unknown whether effects exist at all. But they are less suitable in established areas when it is known that some effect is probably not zero. […] Nil hypotheses are tested much more often than non-nil hypotheses even when the former are implausible. […] If a nil hypothesis is implausible, estimated probabilities of data will be too low. This means that risk for Type I error is basically zero and a Type II error is the only possible kind when H_{0} is known in advance to be false.”

“Too many researchers treat the conventional levels of α, either .05 or .01, as golden rules. If other levels of α are specified, they tend to be even lower […]. Sanctification of .05 as the highest “acceptable” level is problematic. […] Instead of blindly accepting either .05 or .01, one does better to […] [s]pecify a level of α that reflects the **desired relative seriousness** (DRS) of Type I error versus Type II error. […] researchers should not rely on a mechanical ritual (i.e., automatically specify .05 or .01) to control risk for Type I error that ignores the consequences of Type II error.”

“Although *p* and α are derived in the same theoretical sampling distribution, *p* does not estimate the conditional probability of a Type I error […]. This is because *p* is based on a range of results under H_{0}, but α has nothing to do with actual results and is supposed to be specified before any data are collected. Confusion between *p* and α is widespread […] To differentiate the two, Gigerenzer (1993) referred to *p* as the **exact level of significance**. If *p* = .032 and α = .05, H_{0} is rejected at the .05 level, but .032 is not the long-run probability of Type I error, which is .05 for this example. The exact level of significance is the conditional probability of the data (or any result even more extreme) assuming H_{0} is true, given all other assumptions about sampling, distributions, and scores. […] Because *p* values are estimated assuming that H_{0} is true, they do not somehow measure the likelihood that H_{0} is correct. […] The false belief that *p* is the probability that H_{0} is true, or the inverse probability error […] is widespread.”
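The distinction can be simulated: α is a property of the decision procedure over many replications, not of any single *p*. When H_{0} is exactly true, *p* values are uniformly distributed, and the long-run rejection rate equals α. A sketch assuming a z test with known σ (the sample size and replication count are arbitrary choices):

```python
import math
import random
from statistics import NormalDist

random.seed(6)
nd = NormalDist()
n, alpha, reps = 20, 0.05, 4000

rejections = 0
for _ in range(reps):
    sample = [random.gauss(0, 1) for _ in range(n)]  # H0 true: mu = 0
    z = (sum(sample) / n) / (1 / math.sqrt(n))
    p = 2 * (1 - nd.cdf(abs(z)))
    rejections += p < alpha

print(rejections / reps)  # close to alpha = .05, whatever each exact p was
```

Each replication produces its own exact level of significance, yet the long-run Type I error rate stays pinned at α: the two quantities answer different questions.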

“Probabilities from significance tests say little about effect size. This is because essentially any test statistic (TS) can be expressed as the product TS = ES × *f(N)* […] where ES is an effect size and *f(N)* is a function of sample size. This equation explains how it is possible that (a) trivial effects can be statistically significant in large samples or (b) large effects may not be statistically significant in small samples. So *p* is a confounded measure of effect size and sample size.”
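For the independent-samples t test with equal group sizes n, the factoring is t = d·√(n/2), where the effect size ES is Cohen's d and f(N) = √(n/2). A quick numerical check (the two groups are made-up data):

```python
import math
from statistics import mean, variance

x = [5.1, 4.9, 5.3, 5.0, 4.8, 5.2]  # hypothetical group 1
y = [4.6, 4.4, 4.9, 4.5, 4.3, 4.7]  # hypothetical group 2
n = len(x)

sp = math.sqrt((variance(x) + variance(y)) / 2)   # pooled SD
d = (mean(x) - mean(y)) / sp                      # ES: Cohen's d
t_direct = (mean(x) - mean(y)) / (sp * math.sqrt(2 / n))
t_factored = d * math.sqrt(n / 2)                 # TS = ES x f(N)

print(abs(t_direct - t_factored) < 1e-12)  # True: the same statistic
```

Because f(N) grows without bound as n increases, any nonzero d, however trivial, eventually produces a “significant” t, which restates the magnitude fallacy in algebraic form.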

“Power is the probability of getting statistical significance over many random replications when H_{1} is true. It varies directly with sample size and the magnitude of the population effect size. […] This combination leads to the greatest power: a large population effect size, a large sample, a higher level of α […], a within-subjects design, a parametric test rather than a nonparametric test (e.g., *t* instead of Mann–Whitney), and very reliable scores. […] Power ≥ .80 is generally desirable, but an even higher standard may be needed if consequences of Type II error are severe. […] Reviews from the 1970s and 1980s indicated that the typical power of behavioral science research is only about .50 […] and there is little evidence that power is any higher in more recent studies […] Ellis (2010) estimated that < 10% of studies have samples sufficiently large to detect smaller population effect sizes. Increasing sample size would address low power, but the number of additional cases necessary to reach even nominal power when studying smaller effects may be so great as to be practically impossible […] Too few researchers, generally < 20% (Osborne, 2008), bother to report prospective power despite admonitions to do so […] The concept of power does not stand without significance testing. As statistical tests play a smaller role in the analysis, the relevance of power also declines. If significance tests are not used, power is irrelevant. Cumming (2012) described an alternative called **precision for research planning**, where the researcher specifies a target margin of error for estimating the parameter of interest. […] The advantage over power analysis is that researchers must consider both effect size and precision in study planning.”
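The dependence of power on effect size, sample size, and α can be sketched for a two-sided z test. This is an approximation assuming known σ, and the inputs are illustrative, not taken from the book:

```python
import math
from statistics import NormalDist

nd = NormalDist()

def power_z(d, n, alpha=0.05):
    """Approximate power of a two-sided one-sample z test for a
    standardized population effect d and sample size n."""
    z_crit = nd.inv_cdf(1 - alpha / 2)
    nc = d * math.sqrt(n)  # noncentrality: where z actually centers
    return (1 - nd.cdf(z_crit - nc)) + nd.cdf(-z_crit - nc)

print(round(power_z(0.5, 30), 2))   # medium effect, n = 30
print(round(power_z(0.2, 30), 2))   # small effect, same n: far lower
print(round(power_z(0.2, 200), 2))  # the small effect needs a much larger n
```

The third line makes the book's point about sample size numerically: reaching even nominal power for a small effect requires several times the cases that suffice for a medium one.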

“Classical nonparametric tests are alternatives to the parametric *t* and *F* tests for means (e.g., the Mann–Whitney test is the nonparametric analogue to the t test). Nonparametric tests generally work by converting the original scores to ranks. They also make fewer assumptions about the distributions of those ranks than do parametric tests applied to the original scores. Nonparametric tests date to the 1950s–1960s, and they share some limitations. One is that they are not generally robust against heteroscedasticity, and another is that their application is typically limited to single-factor designs […] Modern robust tests are an alternative. They are generally more flexible than nonparametric tests and can be applied in designs with multiple factors. […] At the end of the day, robust statistical tests are subject to many of the same limitations as other statistical tests. For example, they assume random sampling albeit from population distributions that may be nonnormal or heteroscedastic; they also assume that sampling error is the only source of error variance. Alternative tests, such as the Welch–James and Yuen–Welch versions of a robust *t* test, do not always yield the same *p* value for the same data, and it is not always clear which alternative is best (Wilcox, 2003).”

## Beyond Significance Testing (I)

“This book introduces readers to the principles and practice of statistics reform in the behavioral sciences. It (a) reviews the now even larger literature about shortcomings of significance testing; (b) explains why these criticisms have sufficient merit to justify major changes in the ways researchers analyze their data and report the results; (c) helps readers acquire new skills concerning interval estimation and effect size estimation; and (d) reviews alternative ways to test hypotheses, including Bayesian estimation. […] I assume that the reader has had undergraduate courses in statistics that covered at least the basics of regression and factorial analysis of variance. […] This book is suitable as a textbook for an introductory course in behavioral science statistics at the graduate level.”

…

I’m currently reading this book. I have so far read 8 of the 10 chapters, and I’m currently hovering between a 3 and a 4 star Goodreads rating; some parts of the book are really great, but there are also a few aspects I don’t like. Some of the coverage is rather technical, and I’m still debating to what extent I should cover the technical stuff in detail later here on the blog; there are quite a few equations in the book, and I find it annoying to cover math using the WordPress format of this blog. For now I’ll start out with a reasonably non-technical post with some quotes and key ideas from the first parts of the book.

…

“In studies of intervention outcomes, a statistically significant difference between treated and untreated cases […] has nothing to do with whether treatment leads to any tangible benefits in the real world. In the context of diagnostic criteria, clinical significance concerns whether treated cases can no longer be distinguished from control cases not meeting the same criteria. For example, does treatment typically prompt a return to normal levels of functioning? A treatment effect can be statistically significant yet trivial in terms of its clinical significance, and clinically meaningful results are not always statistically significant. Accordingly, the proper response to claims of statistical significance in any context should be “so what?” — or, more pointedly, “who cares?” — without more information.”

“There are free computer tools for estimating power, but most researchers — probably at least 80% (e.g., Ellis, 2010) — ignore the power of their analyses. […] Ignoring power is regrettable because the median power of published nonexperimental studies is only about .50 (e.g., Maxwell, 2004). This implies a 50% chance of correctly rejecting the null hypothesis based on the data. In this case the researcher may as well not collect any data but instead just toss a coin to decide whether or not to reject the null hypothesis. […] A consequence of low power is that the research literature is often difficult to interpret. Specifically, if there is a real effect but power is only .50, about half the studies will yield statistically significant results and the rest will yield no statistically significant findings. If all these studies were somehow published, the number of positive and negative results would be roughly equal. In an old-fashioned, narrative review, the research literature would appear to be ambiguous, given this balance. It may be concluded that “more research is needed,” but any new results will just reinforce the original ambiguity, if power remains low.”
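The claim about a literature split roughly in half is easy to verify by simulation. With a real effect and power near .50, about half of the replications reject H_{0}. A sketch assuming a z test with known σ; the values d = .5 and n = 16 are chosen because they give power of roughly .52:

```python
import math
import random
from statistics import NormalDist

random.seed(1)
nd = NormalDist()
d, n, alpha, reps = 0.5, 16, 0.05, 2000  # power here is about .52

hits = 0
for _ in range(reps):
    sample = [random.gauss(d, 1.0) for _ in range(n)]  # the effect is real
    z = (sum(sample) / n) / (1.0 / math.sqrt(n))
    p = 2 * (1 - nd.cdf(abs(z)))
    hits += p < alpha

print(hits / reps)  # roughly half the studies come out "significant"
```

Every one of the 2,000 simulated studies examines the same real effect, yet a narrative reviewer counting rejections would see an evenly split, "ambiguous" literature.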

“Statistical tests of a treatment effect that is actually clinically significant may fail to reject the null hypothesis of no difference when power is low. If the researcher in this case ignored whether the observed effect size is clinically significant, a potentially beneficial treatment may be overlooked. This is exactly what was found by Freiman, Chalmers, Smith, and Kuebler (1978), who reviewed 71 randomized clinical trials of mainly heart- and cancer-related treatments with “negative” results (i.e., not statistically significant). They found that if the authors of 50 of the 71 trials had considered the power of their tests along with the observed effect sizes, those authors should have concluded just the opposite, or that the treatments resulted in clinically meaningful improvements.”

“Even if researchers avoided the kinds of mistakes just described, there are grounds to suspect that *p* values from statistical tests are simply incorrect in most studies: 1. They (*p* values) are estimated in theoretical sampling distributions that assume random sampling from known populations. Very few samples in behavioral research are random samples. Instead, most are convenience samples collected under conditions that have little resemblance to true random sampling. […] 2. Results of more quantitative reviews suggest that, due to assumption violations, there are few actual data sets in which significance testing gives accurate results […] 3. Probabilities from statistical tests (*p* values) generally assume that all other sources of error besides sampling error are nil. This includes measurement error […] Other sources of error arise from failure to control for extraneous sources of variance or from flawed operational definitions of hypothetical constructs. It is absurd to assume in most studies that there is no error variance besides sampling error. Instead it is more practical to expect that sampling error makes up the small part of all possible kinds of error when the number of cases is reasonably large (Ziliak & McCloskey, 2008).”

“The *p* values from statistical tests do not tell researchers what they want to know, which often concerns whether the data support a particular hypothesis. This is because *p* values merely estimate the conditional probability of the data under a statistical hypothesis — the null hypothesis — that in most studies is an implausible, straw man argument. In fact, *p* values do not directly “test” any hypothesis at all, but they are often misinterpreted as though they describe hypotheses instead of data. Although *p* values ultimately provide a yes-or-no answer (i.e., reject or fail to reject the null hypothesis), the question — is *p* < α?, where α is the criterion level of statistical significance, usually .05 or .01 — is typically uninteresting. The yes-or-no answer to this question says nothing about scientific relevance, clinical significance, or effect size. […] determining clinical significance is not just a matter of statistics; it also requires strong knowledge about the subject matter.”

“[M]any null hypotheses have little if any scientific value. For example, Anderson et al. (2000) reviewed null hypotheses tested in several hundred empirical studies published from 1978 to 1998 in two environmental sciences journals. They found many implausible null hypotheses that specified things such as equal survival probabilities for juvenile and adult members of a species or that growth rates did not differ across species, among other assumptions known to be false before collecting data. I am unaware of a similar survey of null hypotheses in the behavioral sciences, but I would be surprised if the results would be very different.”

“Hoekstra, Finch, Kiers, and Johnson (2006) examined a total of 266 articles published in Psychonomic Bulletin & Review during 2002–2004. Results of significance tests were reported in about 97% of the articles, but confidence intervals were reported in only about 6%. Sadly, p values were misinterpreted in about 60% of surveyed articles. Fidler, Burgman, Cumming, Buttrose, and Thomason (2006) sampled 200 articles published in two different biology journals. Results of significance testing were reported in 92% of articles published during 2001–2002, but this rate dropped to 78% in 2005. There were also corresponding increases in the reporting of confidence intervals, but power was estimated in only 8% and p values were misinterpreted in 63%. […] Sun, Pan, and Wang (2010) reviewed a total of 1,243 works published in 14 different psychology and education journals during 2005–2007. The percentage of articles reporting effect sizes was 49%, and 57% of these authors interpreted their effect sizes.”

“It is a myth that the larger the sample, the more closely it approximates a normal distribution. This idea probably stems from a misunderstanding of the central limit theorem, which applies to certain group statistics such as means. […] This theorem justifies approximating distributions of random means with normal curves, but it does not apply to distributions of scores in individual samples. […] larger samples do not generally have more normal distributions than smaller samples. If the population distribution is, say, positively skewed, this shape will tend to show up in the distributions of random samples that are either smaller or larger.”
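A quick simulation illustrates this. Scores drawn from a positively skewed (exponential) population stay skewed, with the mean above the median, no matter how large the sample gets; the population here is an assumption made for illustration:

```python
import random
from statistics import mean, median

random.seed(2)
for n in (50, 5_000, 500_000):
    sample = [random.expovariate(1.0) for _ in range(n)]
    # positive skew shows up as mean > median at every sample size
    print(n, round(mean(sample), 2), round(median(sample), 2))
```

The gap between mean (near 1) and median (near ln 2 ≈ .69) does not close as n grows; it is the distribution of *means over samples*, not the distribution of scores within a sample, that the central limit theorem normalizes.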

“A **standard error** is the standard deviation in a **sampling distribution**, the probability distribution of a statistic across all random samples drawn from the same population(s) and with each sample based on the same number of cases. It estimates the amount of sampling error in standard deviation units. The square of a standard error is the error variance. […] Variability of the sampling distributions […] decreases as the sample size increases. […] The standard error s_{M}, which estimates variability of the group statistic M, is often confused with the standard deviation *s*, which measures variability at the case level. This confusion is a source of misinterpretation of both statistical tests and confidence intervals […] Note that the standard error s_{M} itself has a standard error (as do standard errors for all other kinds of statistics). This is because the value of s_{M} varies over random samples. This explains why one should not overinterpret a confidence interval or *p* value from a significance test based on a single sample.”
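The distinction between *s* and s_{M} (where s_{M} = s/√N for the mean) can be seen directly: as N grows, *s* stabilizes near the population value while s_{M} shrinks toward zero. A sketch with simulated IQ-like scores (mean 100, SD 15, assumed for illustration):

```python
import math
import random
from statistics import stdev

random.seed(3)
results = {}
for n in (25, 2_500):
    sample = [random.gauss(100, 15) for _ in range(n)]
    s = stdev(sample)           # case-level variability: stays near 15
    s_m = s / math.sqrt(n)      # standard error of M: shrinks with n
    results[n] = (s, s_m)
    print(n, round(s, 1), round(s_m, 2))
```

Confusing the two is consequential: reading s_{M} as if it described case-level spread makes groups look far more homogeneous, and differences far more impressive, than they are.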

“Standard errors estimate sampling error under random sampling. What they measure when sampling is not random may not be clear. […] Standard errors also ignore […] other sources of error [:] 1. **Measurement error** [which] refers to the difference between an observed score X and the true score on the underlying construct. […] Measurement error reduces absolute effect sizes and the power of statistical tests. […] 2. **Construct definition error** [which] involves problems with how hypothetical constructs are defined or operationalized. […] 3. **Specification error** [which] refers to the omission from a regression equation of at least one predictor that covaries with the measured (included) predictors. […] 4. **Treatment implementation error** occurs when an intervention does not follow prescribed procedures. […] Gosset used the term **real error** to refer to all types of error besides sampling error […]. In reasonably large samples, the impact of real error may be greater than that of sampling error.”

“The technique of **bootstrapping** […] is a computer-based method of resampling that recombines the cases in a data set in different ways to estimate statistical precision, with fewer assumptions than traditional methods about population distributions. Perhaps the best known form is **nonparametric bootstrapping**, which generally makes no assumptions other than that the distribution in the sample reflects the basic shape of that in the population. It treats your data file as a pseudo-population in that cases are randomly selected with replacement to generate other data sets, usually of the same size as the original. […] The technique of nonparametric bootstrapping seems well suited for interval estimation when the researcher is either unwilling or unable to make a lot of assumptions about population distributions. […] potential limitations of nonparametric bootstrapping: 1. Nonparametric bootstrapping simulates random sampling, but true random sampling is rarely used in practice. […] 2. […] If the shape of the sample distribution is very different compared with that in the population, results of nonparametric bootstrapping may have poor external validity. 3. The “population” from which bootstrapped samples are drawn is merely the original data file. If this data set is small or the observations are not independent, resampling from it will not somehow fix these problems. In fact, resampling can magnify the effects of unusual features in a small data set […] 4. Results of bootstrap analyses are probably quite biased in small samples, but this is true of many traditional methods, too. […] [In] **parametric bootstrapping** […] the researcher specifies the numerical and distributional properties of a theoretical probability density function, and then the computer randomly samples from that distribution. When repeated many times by the computer, values of statistics in these synthesized samples vary randomly about the parameters specified by the researcher, which simulates sampling error.”
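The resampling scheme described in the quote is simple enough to sketch in a few lines of plain Python. Below is a minimal percentile-bootstrap confidence interval for the mean; the `bootstrap_ci` helper and the data values are my own invention for illustration, not anything from the book:

```python
import random
import statistics

def bootstrap_ci(data, stat=statistics.mean, n_boot=10_000, alpha=0.05, seed=1):
    """Percentile bootstrap: resample the data with replacement, recompute
    the statistic each time, and take empirical quantiles of the results."""
    rng = random.Random(seed)
    n = len(data)
    boots = sorted(stat([rng.choice(data) for _ in range(n)])
                   for _ in range(n_boot))
    lo = boots[int((alpha / 2) * n_boot)]
    hi = boots[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi

sample = [4.1, 5.0, 5.5, 3.8, 6.2, 4.9, 5.1, 4.4, 5.8, 4.6]
low, high = bootstrap_ci(sample)
print(f"mean = {statistics.mean(sample):.2f}, 95% CI ({low:.2f}, {high:.2f})")
```

Note how the sample itself plays the role of the “pseudo-population”: with only 10 observations, any oddity in the data is baked into every resample, which is exactly limitation 3 above.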

## Melanoma therapeutic strategies that select against resistance

A short lecture, but interesting:

…

If you’re not an oncologist, these two links in particular might be helpful to have a look at before you start out: BRAF (gene) & Myc. A very substantial proportion of the talk is devoted to math and stats methodology (which some people will find interesting and others …will not).

## The Personality Puzzle (I)

I don’t really like this book, which is a personality psychology introductory textbook by David Funder. I’ve read the first 400 pages (out of 700), but I’m still debating whether or not to finish it; it just isn’t very good. The level of coverage is low, it’s very fluffy, and the signal-to-noise ratio is nowhere near where I’d like it to be when I’m reading academic texts. Some parts of it frankly read like popular science. However, despite not feeling that the book is all that great, I can’t justify not blogging it; stuff I don’t blog I tend to forget, and if I’m reading a mediocre textbook anyway I should at least try to pick out some of the decent stuff in there which keeps me reading, and try to make it easier for myself to recall that stuff later. Some parts of the book, and some of the arguments/observations included in it, are in my opinion just plain silly or stupid, but I won’t go into those things in this post because I don’t really see the point of doing that.

The main reason why I decided to give the book a go was that I liked Funder’s book Personality Judgment, which I read a few years ago and which deals with some topics also covered superficially in this text. As far as I can remember, it’s a much better book (…though I have actually been starting to wonder whether it was really all that great, given that it was written by the same guy who wrote this book…), so if you’re interested in these matters that’s the one to read. If you’re interested in a more ‘pure’ personality psychology text, a significantly better alternative is Leary *et al.*‘s *Handbook of Individual Differences in Social Behavior*. Because of the multi-author format it also includes some very poor chapters, but those tend to be fairly easy to identify and skip to get to the good stuff if you’re so inclined, and the general coverage is at a much higher level than that of this book.

Below I have added some quotes and observations from the first 150 pages of the book.

…

“A theory that accounts for certain things extremely well will probably not explain everything else so well. And a theory that tries to explain almost everything […] would probably not provide the best explanation for any one thing. […] different [personality psychology] basic approaches address different sets of questions […] each basic approach usually just ignores the topics it is not good at explaining.”

“Personality psychology tends to emphasize how individuals are different from one another. […] Other areas of psychology, by contrast, are more likely to treat people as if they were the same or nearly the same. Not only do the experimental subfields of psychology, such as cognitive and social psychology, tend to ignore how people are different from each other, but also the statistical analyses central to their research literally put individual differences into their “error” terms […] Although the emphasis of personality psychology often entails categorizing and labeling people, it also leads the field to be extraordinarily sensitive — more than any other area of psychology — to the fact that people really are different.”

“If you want to “look at” personality, what do you look at, exactly? Four different things. First, and perhaps most obviously, you can have the person describe herself. Personality psychologists often do exactly this. Second, you can ask people who know the person to describe her. Third, you can check on how the person is faring in life. And finally, you can observe what the person does and try to measure her behavior as directly and objectively as possible. These four types of clues can be called S [self-judgments], I [informants], L [life], and B [behavior] data […] The point of the four-way classification […] is not to place every kind of data neatly into one and only one category. Rather, the point is to illustrate the types of data that are relevant to personality and to show how they all have both advantages and disadvantages.”

“For cost-effectiveness, S data simply cannot be beat. […] According to one analysis, 70 percent of the articles in an important personality journal were based on self-report (Vazire, 2006).”

“I data are judgments by knowledgeable “informants” about general attributes of the individual’s personality. […] Usually, close acquaintanceship paired with common sense is enough to allow people to make judgments of each other’s attributes with impressive accuracy […]. Indeed, they may be more accurate than self-judgments, especially when the judgments concern traits that are extremely desirable or extremely undesirable […]. Only when the judgments are of a technical nature (e.g., the diagnosis of a mental disorder) does psychological education become relevant. Even then, acquaintances without professional training are typically well aware when someone has psychological problems […] psychologists often base their conclusions on contrived tests of one kind or another, or on observations in carefully constructed and controlled environments. Because I data derive from behaviors informants have seen in daily social interactions, they enjoy an extra chance of being relevant to aspects of personality that affect important life outcomes. […] I data reflect the opinions of people who interact with the person every day; they are the person’s reputation. […] personality judgments can [however] be [both] unfair as well as mistaken […] The most common problem that arises from letting people choose their own informants — the usual practice in research — may be the “letter of recommendation effect” […] research participants may tend to nominate informants who think well of them, leading to I data that provide a more positive picture than might have been obtained from more neutral parties.”

“L data […] are verifiable, concrete, real-life facts that may hold psychological significance. […] An advantage of using archival records is that they are not prone to the potential biases of self-report or the judgments of others. […] [However] L data have many causes, so trying to establish direct connections between specific attributes of personality and life outcomes is chancy. […] a psychologist can predict a particular outcome from psychological data only to the degree that the outcome is psychologically caused. L data often are psychologically caused only to a small degree.”

“The idea of B data is that participants are found, or put, in some sort of a situation, sometimes referred to as a *testing situation*, and then their behavior is directly observed. […] B data are expensive [and] are not used very often compared to the other types. Relatively few psychologists have the necessary resources.”

“Reliable data […] are measurements that reflect what you are trying to assess and are not affected by anything else. […] When trying to measure a stable attribute of personality — a trait rather than a state — the question of reliability reduces to this: Can you get the same result more than once? […] Validity is the degree to which a measurement actually reflects what one thinks or hopes it does. […] for a measure to be valid, it must be reliable. But a reliable measure is not necessarily valid. […] A measure that is reliable gives the same answer time after time. […] But even if a measure is the same time after time, that does not necessarily mean it is correct.”

“[M]ost personality tests provide S data. […] Other personality tests yield B data. […] IQ tests […] yield B data. Imagine trying to assess intelligence using an S-data test, asking questions such as “Are you an intelligent person?” and “Are you good at math?” Researchers have actually tried this, but simply asking people whether they are smart turns out to be a poor way to measure intelligence”.

“The answer an individual gives to any one question might not be particularly informative […] a single answer will tend to be unreliable. But if a group of similar questions is asked, the average of the answers ought to be much more stable, or reliable, because random fluctuations tend to cancel each other out. For this reason, one way to make a personality test more reliable is simply to make it longer.”

“The factor analytic method of test construction is based on a statistical technique. Factor analysis identifies groups of things […] that seem to have something in common. […] To use factor analysis to construct a personality test, researchers begin with a long list of […] items […] The next step is to administer these items to a large number of participants. […] The analysis is based on calculating correlation coefficients between each item and every other item. Many items […] will not correlate highly with anything and can be dropped. But the items that do correlate with each other can be assembled into groups. […] The next steps are to consider what the items have in common, and then name the factor. […] Factor analysis has been used not only to construct tests, but also to decide how many fundamental traits exist […] Various analysts have come up with different answers.”

[The Big Five were derived from factor analyses.]

“The empirical strategy of test construction is an attempt to allow reality to speak for itself. […] Like the factor analytic approach described earlier, the first step of the empirical approach is to gather lots of items. […] The second step, however, is quite different. For this step, you need to have a sample of participants who have already independently been divided into the groups you are interested in. Occupational groups and diagnostic categories are often used for this purpose. […] Then you are ready for the third step: administering your test to your participants. The fourth step is to compare the answers given by the different groups of participants. […] The basic assumption of the empirical approach […] is that certain kinds of people answer certain questions on personality inventories in distinctive ways. If you answer questions the same way as members of some occupational or diagnostic group did in the original derivation study, then you might belong to that group too. […] responses to empirically derived tests are difficult to fake. With a personality test of the straightforward, S-data variety, you can describe yourself the way you want to be seen, and that is indeed the score you will get. But because the items on empirically derived scales sometimes seem backward or absurd, it is difficult to know how to answer in such a way as to guarantee the score you want. This is often held up as one of the great advantages of the empirical approach […] [However] empirically derived tests are only as good as the criteria by which they are developed or against which they are cross-validated. […] the empirical correlates of item responses by which these tests are assembled are those found in one place, at one time, with one group of participants. If no attention is paid to item content, then there is no way to be confident that the test will work in a similar manner at another time, in another place, with different participants. […] A particular concern is that the empirical correlates of item response might change over time. The MMPI was developed decades ago and has undergone a major revision only once”.

“It is not correct, for example, that the significance level provides the probability that the substantive (non-null) hypothesis is true. […] the significance level gives the probability of getting the result one found if the null hypothesis were true. One statistical writer offered the following analogy (Dienes, 2011): The probability that a person is dead, given that a shark has bitten his head off, is 1.0. However, the probability that a person’s head was bitten off by a shark, given that he is dead, is much lower. The probability of the data given the hypothesis, and of the hypothesis given the data, is not the same thing. And the latter is what we really want to know. […] An effect size is more meaningful than a significance level. […] It is both facile and misleading to use the frequently taught method of squaring correlations if the intention is to evaluate effect size.”
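The distinction between the probability of the data given the hypothesis and the probability of the hypothesis given the data can be made concrete with Bayes’ theorem. Here is a small worked example; the power and prior values are assumptions I picked for illustration, and changing them changes the answer:

```python
# P("significant data" | H0) vs P(H0 | "significant data"),
# for assumed prior and power values.
alpha = 0.05          # P(p < .05 | H0) -- what the test controls
power = 0.60          # P(p < .05 | H1) -- assumed
prior_h1 = 0.10       # assumed prior probability that H1 is true

p_sig = power * prior_h1 + alpha * (1 - prior_h1)
post_h0 = alpha * (1 - prior_h1) / p_sig   # Bayes' theorem
print(f"P(H0 | p < .05) = {post_h0:.2f}")  # 0.43, not 0.05
```

Under these (made-up but not implausible) numbers, a significant result still leaves the null hypothesis with a posterior probability of about 43% — the shark-bite asymmetry in numerical form.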

## The Mathematical Challenge of Large Networks

This is another one of the aforementioned lectures I watched a while ago, but had never got around to blogging:

…

If I had to watch this one again, I’d probably skip most of the second half; it contains highly technical coverage of topics in graph theory, and it was very difficult for me to follow (but I did watch it to the end, just out of curiosity).

The lecturer has put up a ~500 page publication on these and related topics, which is available here, so if you want to know more that’s an obvious place to go have a look. A few other relevant links to stuff mentioned/covered in the lecture:

Szemerédi regularity lemma.

Graphon.

Turán’s theorem.

Quantum graph.

## A few diabetes papers of interest

i. Association Between Blood Pressure and Adverse Renal Events in Type 1 Diabetes.

“The Joint National Committee and American Diabetes Association guidelines currently recommend a blood pressure (BP) target of <140/90 mmHg for all adults with diabetes, regardless of type (1–3). However, evidence used to support this recommendation is primarily based on data from trials of type 2 diabetes (4–6). The relationship between BP and adverse outcomes in type 1 and type 2 diabetes may differ, given that the type 1 diabetes population is typically much younger at disease onset, hypertension is less frequently present at diagnosis (3), and the basis for the pathophysiology and disease complications may differ between the two populations.

Prior prospective cohort studies (7,8) of patients with type 1 diabetes suggested that lower BP levels (<110–120/70–80 mmHg) at baseline entry were associated with a lower risk of adverse renal outcomes, including incident microalbuminuria. In one trial of antihypertensive treatment in type 1 diabetes (9), assignment to a lower mean arterial pressure (MAP) target of <92 mmHg (corresponding to ∼125/75 mmHg) led to a significant reduction in proteinuria compared with a MAP target of 100–107 mmHg (corresponding to ∼130–140/85–90 mmHg). Thus, it is possible that lower BP (<120/80 mmHg) reduces the risk of important renal outcomes, such as proteinuria, in patients with type 1 diabetes and may provide a synergistic benefit with intensive glycemic control on renal outcomes (10–12). However, fewer studies have examined the association between BP levels over time and the risk of more advanced renal outcomes, such as stage III chronic kidney disease (CKD) or end-stage renal disease (ESRD)”.

“The primary objective of this study was to determine whether there is an association between lower BP levels and the risk of more advanced diabetic nephropathy, defined as macroalbuminuria or stage III CKD, within a background of different glycemic control strategies […] We included 1,441 participants with type 1 diabetes between the ages of 13 and 39 years who had previously been randomized to receive intensive versus conventional glycemic control in the Diabetes Control and Complications Trial (DCCT). The exposures of interest were time-updated systolic BP (SBP) and diastolic BP (DBP) categories. Outcomes included macroalbuminuria (>300 mg/24 h) or stage III chronic kidney disease (CKD) […] During a median follow-up time of 24 years, there were 84 cases of stage III CKD and 169 cases of macroalbuminuria. In adjusted models, SBP in the […] category was associated with a […] times higher risk of eGFR decline to <60 mL/min/1.73 m2 (95% CI 1.05–1.21), and a 1.04 times higher risk of ESRD (95% CI 0.77–1.41) in adjusted Cox models. Every 10 mmHg increase in DBP was associated with a 1.17 times higher risk of microalbuminuria (95% CI 1.03–1.32), a 1.15 times higher risk of eGFR decline to <60 mL/min/1.73 m2 (95% CI 1.04–1.29), and a 0.80 times higher risk of ESRD (95% CI 0.47–1.38) in adjusted models. […] Because these data are observational, they cannot prove causation. It remains possible that subtle kidney disease may lead to early elevations in BP, and we cannot rule out the potential for reverse causation in our findings. However, we note similar trends in our data even when imposing a 7-year lag between BP and CKD ascertainment.”

“**CONCLUSIONS** A lower BP (<120/70 mmHg) was associated with a substantially lower risk of adverse renal outcomes, regardless of the prior assigned glycemic control strategy. Interventional trials may be useful to help determine whether the currently recommended BP target of 140/90 mmHg may be too high for optimal renal protection in type 1 diabetes.”

It’s important to keep in mind when interpreting these results that endpoints like ESRD and stage III CKD are not the only relevant outcomes in this setting; even mild-stage kidney disease in diabetics significantly increases the risk of death from cardiovascular disease, and a substantial proportion of patients may die from cardiovascular disease before reaching a late-stage kidney disease endpoint (here’s a relevant link).

…

ii. Identifying Causes for Excess Mortality in Patients With Diabetes: Closer but Not There Yet.

“A number of epidemiological studies have quantified the risk of death among patients with diabetes and assessed the causes of death (2–6), with highly varying results […] Overall, the studies to date have confirmed that diabetes is associated with an increased risk of all-cause mortality, but the magnitude of this excess risk is highly variable, with the relative risk ranging from 1.15 to 3.15. Nevertheless, all studies agree that mortality is mainly attributable to cardiovascular causes (2–6). On the other hand, studies of cancer-related death have generally been lacking despite the diabetes–cancer association and a number of plausible biological mechanisms identified to explain this link (8,9). In fact, studies assessing the specific causes of noncardiovascular death in diabetes have been sparse. […] In this issue of *Diabetes Care*, Baena-Díez et al. (10) report on an observational study of the association between diabetes and cause-specific death. This study involved 55,292 individuals from 12 Spanish population cohorts with no prior history of cardiovascular disease, aged 35 to 79 years, with a 10-year follow-up. […] This study found that individuals with diabetes compared with those without diabetes had a higher risk of cardiovascular death, cancer death, and noncardiovascular noncancer death with similar estimates obtained using the two statistical approaches. […] Baena-Díez et al. (10) showed that individuals with diabetes have an approximately threefold increased risk of cardiovascular mortality, which is much higher than what has been reported by recent studies (5,6). While this may be due to the lack of adjustment for important confounders in this study, there remains uncertainty regarding the magnitude of this increase.”

“[A]ll studies of excess mortality associated with diabetes, including the current one, have produced highly variable results. The reasons may be methodological. For instance, it may be that because of the wide range of age in these studies, comparing the rates of death between the patients with diabetes and those without diabetes using a measure based on the ratio of the rates may be misleading because the ratio can vary by age [*it almost certainly does vary by age, US*]. Instead, a measure based on the difference in rates may be more appropriate (16). Another issue relates to the fact that the studies include patients with longstanding diabetes of variable duration, resulting in so-called prevalent cohorts that can result in muddled mortality estimates since these are necessarily based on a mix of patients at different stages of disease (17). Thus, a paradigm change may be in order for future observational studies of diabetes and mortality, in the way they are both designed and analyzed. With respect to cancer, such studies will also need to tease out the independent contribution of antidiabetes treatments on cancer incidence and mortality (18–20). It is thus clear that the quantification of the excess mortality associated with diabetes per se will need more accurate tools.”
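The ratio-versus-difference point is easy to illustrate with made-up numbers: the same set of age-specific rates can show a falling rate ratio and a rising rate difference at the same time, so a single pooled ratio across a wide age range can easily mislead:

```python
# Hypothetical age-specific death rates (per 1,000 person-years) for
# people with and without diabetes -- numbers invented for illustration.
rates = {              # age band: (diabetes, no diabetes)
    "50-59": (6.0, 2.0),
    "60-69": (15.0, 7.5),
    "70-79": (40.0, 25.0),
}
for age, (d, nd) in rates.items():
    print(f"{age}: rate ratio = {d / nd:.2f}, "
          f"rate difference = {d - nd:.1f} per 1,000 person-years")
```

With these numbers the ratio falls from 3.0 to 1.6 across age bands while the absolute excess mortality rises from 4 to 15 deaths per 1,000 person-years, which is why the choice of measure matters for studies spanning a wide age range.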

…

iii. Risk of Cause-Specific Death in Individuals With Diabetes: A Competing Risks Analysis. This is the paper some of the results of which were discussed above. I’ll just include the highlights here:

“**RESULTS** We included 55,292 individuals (15.6% with diabetes and overall mortality of 9.1%). The adjusted hazard ratios showed that diabetes increased mortality risk: *1*) cardiovascular death, CSH = 2.03 (95% CI 1.63–2.52) and PSH = 1.99 (1.60–2.49) in men; and CSH = 2.28 (1.75–2.97) and PSH = 2.23 (1.70–2.91) in women; *2*) cancer death, CSH = 1.37 (1.13–1.67) and PSH = 1.35 (1.10–1.65) in men; and CSH = 1.68 (1.29–2.20) and PSH = 1.66 (1.25–2.19) in women; and *3*) noncardiovascular noncancer death, CSH = 1.53 (1.23–1.91) and PSH = 1.50 (1.20–1.89) in men; and CSH = 1.89 (1.43–2.48) and PSH = 1.84 (1.39–2.45) in women. In all instances, the cumulative mortality function was significantly higher in individuals with diabetes.

**CONCLUSIONS** Diabetes is associated with premature death from cardiovascular disease, cancer, and noncardiovascular noncancer causes.”

### “Summary

Diabetes is associated with premature death from cardiovascular diseases (coronary heart disease, stroke, and heart failure), several cancers (liver, colorectal, and lung), and other diseases (chronic obstructive pulmonary disease and liver and kidney disease). In addition, the cause-specific cumulative mortality for cardiovascular, cancer, and noncardiovascular noncancer causes was significantly higher in individuals with diabetes, compared with the general population. The dual analysis with CSH and PSH methods provides a comprehensive view of mortality dynamics in the population with diabetes. This approach identifies the individuals with diabetes as a vulnerable population for several causes of death aside from the traditionally reported cardiovascular death.”

…

iv. Disability-Free Life-Years Lost Among Adults Aged ≥50 Years With and Without Diabetes.

“**RESEARCH DESIGN AND METHODS** Adults (*n* = 20,008) aged 50 years and older were followed from 1998 to 2012 in the Health and Retirement Study, a prospective biannual survey of a nationally representative sample of adults. Diabetes and disability status (defined by mobility loss, difficulty with instrumental activities of daily living [IADL], and/or difficulty with activities of daily living [ADL]) were self-reported. We estimated incidence of disability, remission to nondisability, and mortality. We developed a discrete-time Markov simulation model with a 1-year transition cycle to predict and compare lifetime disability-related outcomes between people with and without diabetes. Data represent the U.S. population in 1998.

**RESULTS** From age 50 years, adults with diabetes died 4.6 years earlier, developed disability 6–7 years earlier, and spent about 1–2 more years in a disabled state than adults without diabetes. With increasing baseline age, diabetes was associated with significant (*P* < 0.05) reductions in the number of total and disability-free life-years, but the absolute difference in years between those with and without diabetes was less than at younger baseline age. Men with diabetes spent about twice as many of their remaining years disabled (20–24% of remaining life across the three disability definitions) as men without diabetes (12–16% of remaining life across the three disability definitions). Similar associations between diabetes status and disability-free and disabled years were observed among women.

**CONCLUSIONS** Diabetes is associated with a substantial reduction in nondisabled years, to a greater extent than the reduction of longevity. […] Using a large, nationally representative cohort of Americans aged 50 years and older, we found that diabetes is associated with a substantial deterioration of nondisabled years and that this is a greater number of years than the loss of longevity associated with diabetes. On average, a middle-aged adult with diabetes has an onset of disability 6–7 years earlier than one without diabetes, spends 1–2 more years with disability, and loses 7 years of disability-free life to the condition. Although other nationally representative studies have reported large reductions in complications (9) and mortality among the population with diabetes in recent decades (1), these studies, akin to our results, suggest that diabetes continues to have a substantial impact on morbidity and quality of remaining years of life.”

…

“People with type 1 diabetes have a documented shorter life expectancy than the general population without diabetes (1). Cardiovascular disease (CVD) is the main cause of the excess morbidity and mortality, and despite advances in management and therapy, individuals with type 1 diabetes have a markedly elevated risk of cardiovascular events and death compared with the general population (2).

Lipid-lowering treatment with hydroxymethylglutaryl-CoA reductase inhibitors (statins) prevents major cardiovascular events and death in a broad spectrum of patients (3,4). […] We hypothesized that primary prevention with lipid-lowering therapy (LLT) can reduce the incidence of cardiovascular morbidity and mortality in individuals with type 1 diabetes. The aim of the study was to examine this in a nationwide longitudinal cohort study of patients with no history of CVD. […] A total of 24,230 individuals included in 2006–2008 NDR with type 1 diabetes without a history of CVD were followed until 31 December 2012; 18,843 were untreated and 5,387 treated with LLT [Lipid-Lowering Therapy] (97% statins). The mean follow-up was 6.0 years. […] Hazard ratios (HRs) for treated versus untreated were as follows: cardiovascular death 0.60 (95% CI 0.50–0.72), all-cause death 0.56 (0.48–0.64), fatal/nonfatal stroke 0.56 (0.46–0.70), fatal/nonfatal acute myocardial infarction 0.78 (0.66–0.92), fatal/nonfatal coronary heart disease 0.85 (0.74–0.97), and fatal/nonfatal CVD 0.77 (0.69–0.87).

**CONCLUSIONS** This observational study shows that LLT is associated with 22–44% reduction in the risk of CVD and cardiovascular death among individuals with type 1 diabetes without history of CVD and underlines the importance of primary prevention with LLT to reduce cardiovascular risk in type 1 diabetes.”

…

“In many prognostic factor studies, multivariate analyses using the Cox proportional hazards model are applied to identify independent prognostic factors. However, the coefficient estimates derived from the Cox proportional hazards model may be biased as a result of violating assumptions of independence. […] RPA [Recursive Partitioning Analysis] classification is a useful tool that could prioritize the prognostic factors and divide the subjects into distinctive groups. RPA has an advantage over the proportional hazards model in identifying prognostic factors because it does not require risk factor independence and, as a nonparametric technique, makes no requirement on the underlying distributions of the variables considered. Hence, it relies on fewer modeling assumptions. Also, because the method is designed to divide subjects into groups based on the length of survival, it defines groupings for risk classification, whereas Cox regression models do not. Moreover, there is no need to explicitly include covariate interactions because of the recursive splitting structure of tree model construction.”
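Here is a heavily simplified sketch of the splitting idea behind RPA. A real implementation would use a log-rank-type statistic and handle censoring; this toy version ignores censoring and just picks the cutpoint that maximizes the difference in mean survival between the two resulting groups, and the cohort numbers are invented:

```python
import statistics

# Toy cohort: (age, survival_years), no censoring.
cohort = [(45, 12.0), (50, 11.5), (55, 10.0), (60, 6.0),
          (62, 5.5), (68, 4.0), (72, 3.5), (75, 2.0)]

def best_split(data):
    """Try every age cutpoint and keep the one that best separates
    the two groups' mean survival times."""
    candidates = sorted({age for age, _ in data})[:-1]
    def gap(cut):
        left = [s for a, s in data if a <= cut]
        right = [s for a, s in data if a > cut]
        return abs(statistics.mean(left) - statistics.mean(right))
    return max(candidates, key=gap)

cut = best_split(cohort)
print(f"split cohort at age <= {cut}")
```

Recursive partitioning then repeats the same search within each of the two resulting groups (possibly on other covariates) until some stopping rule is met, which is how the method ends up with interpretable risk groups and implicit interactions without any distributional assumptions.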

“This is the first study that characterizes the risk factors associated with the transition from one preclinical stage to the next following a recommended staging classification system (9). The tree-structured prediction model reveals that the risk parameters are not the same across each transition. […] Based on the RPA classification, the subjects at younger age and with higher GAD65Ab [*an important biomarker in the context of autoimmune forms of diabetes, US – here’s a relevant link*] titer are at higher risk for progression to multiple positive autoantibodies from a single autoantibody (seroconversion). Approximately 70% of subjects with a single autoantibody were positive for GAD65Ab, much higher than for insulin autoantibody (24%) and IA-2A [*here’s a relevant link – US*] (5%). Our study results are consistent with those of others (22–24) in that seroconversion is age related. Previous studies in infants and children at an early age have shown that progression from single to two or more autoantibodies occurs more commonly in children […] (25). The subjects ≤16 years of age had almost triple the 5-year risk compared with subjects >16 years of age at the same GAD65Ab titer level. Hence, not all individuals with a single islet autoantibody can be thought of as being at low risk for disease progression.”

“This is the first study that identifies the risk factors associated with the timing of transitions from one preclinical stage to the next in the development of T1D. Based on RPA risk parameters, we identify the characteristics of groups with similar 5-year risks for advancing to the next preclinical stage. It is clear that individuals with one or more autoantibodies or with dysglycemia are not homogeneous with regard to the risk of disease progression. Also, there are differences in risk factors at each stage that are associated with increased risk of progression. The potential benefit of identifying these groups allows for a more informed discussion of diabetes risk and the selective enrollment of individuals into clinical trials whose risk more appropriately matches the potential benefit of an experimental intervention. Since the risk levels in these groups are substantial, their definition makes possible the design of more efficient trials with target sample sizes that are feasible, opening up the field of prevention to additional at-risk cohorts. […] Our results support the evidence that autoantibody titers are strong predictors at each transition leading to T1D development. The risk of the development of multiple autoantibodies was significantly increased when the GAD65Ab titer level was elevated, and the risk of the development of dysglycemia was increased when the IA-2A titer level increased. These indicate that better risk prediction on the timing of transitions can be obtained by evaluating autoantibody titers. The results also suggest that an autoantibody titer should be carefully considered in planning prevention trials for T1D in addition to the number of positive autoantibodies and the type of autoantibody.”

## Biodemography of aging (IV)

My working assumption as I was reading part two of the book was that I would not be covering that part of the book in much detail here, because it would simply be too much work to make such posts legible to the readership of this blog. Later, while writing this post, it occurred to me that given that almost nobody reads along here anyway (I’m not complaining, mind you – this is how I like it these days), the main beneficiary of my blog posts will always be myself, which led to the related observation that I should not limit my coverage of interesting stuff here simply because some hypothetical and probably nonexistent readership out there might not be able to follow it. So although I started out writing this post under the assumption that it would be my last post about the book, I now feel sure that if I find the time I’ll add at least one more post about the book’s statistics coverage. On a related note, I am explicitly making the observation here that this post was written for *my* benefit, not yours. You can read it if you like, or not, but it was not really written for you.

I have added bold in a few places to emphasize key concepts and observations in the quoted paragraphs and to make the post easier for me to navigate later (the italics below, on the other hand, are all those of the authors of the book).

…

“**Biodemography** is a multidisciplinary branch of science that unites under its umbrella various analytic approaches aimed at integrating biological knowledge and methods and traditional demographic analyses to shed more light on variability in mortality and health across populations and between individuals. ***Biodemography of aging* is a special subfield of biodemography that focuses on understanding the impact of processes related to aging on health and longevity.”**

“Mortality rates as a function of age are a cornerstone of many demographic analyses. The longitudinal **age** **trajectories of biomarkers** add a new dimension to the traditional demographic analyses: the mortality rate becomes a function of not only age but also of these biomarkers (with additional dependence on a set of sociodemographic variables). Such analyses should incorporate dynamic characteristics of trajectories of biomarkers to evaluate their impact on mortality or other outcomes of interest. Traditional analyses using baseline values of biomarkers (e.g., Cox proportional hazards or logistic regression models) do not take into account these dynamics. One approach to the evaluation of the impact of biomarkers on mortality rates is to use the Cox proportional hazards model with time-dependent covariates; this approach is used extensively in various applications and is available in all popular statistical packages. In such a model, the biomarker is considered a time-dependent covariate of the hazard rate and the corresponding regression parameter is estimated along with standard errors to make statistical inference on the direction and the significance of the effect of the biomarker on the outcome of interest (e.g., mortality). However, **the choice of the analytic approach should not be governed exclusively by its simplicity or convenience of application. It is essential to consider whether the method gives meaningful and interpretable results relevant to the research agenda. In the particular case of biodemographic analyses, the Cox proportional hazards model with time-dependent covariates is not the best choice.**”
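The mechanics of the approach the authors caution against can be sketched in a few lines. This toy example (entirely hypothetical data; `z` is a function returning the biomarker value at a given time) shows what "time-dependent covariate" means operationally in the Cox partial likelihood: at each observed event time, the covariate of everyone still at risk is evaluated *at that time*, not at baseline.

```python
import math

def cox_partial_log_likelihood(beta, subjects):
    """subjects: list of dicts with keys 'event_time', 'event' (True if the
    event was observed rather than censored), and 'z' (a function returning
    the biomarker value at a given time)."""
    ll = 0.0
    for i in subjects:
        if not i["event"]:
            continue  # censored observations contribute only via risk sets
        t = i["event_time"]
        # risk set: everyone still under observation at time t
        risk_set = [j for j in subjects if j["event_time"] >= t]
        ll += beta * i["z"](t) - math.log(
            sum(math.exp(beta * j["z"](t)) for j in risk_set))
    return ll

# Two hypothetical subjects whose biomarkers drift linearly with time;
# the second subject is censored at t = 8:
subjects = [
    {"event_time": 5.0, "event": True,  "z": lambda t: 1.0 + 0.2 * t},
    {"event_time": 8.0, "event": False, "z": lambda t: 0.5 + 0.1 * t},
]
print(round(cox_partial_log_likelihood(0.3, subjects), 4))
```

Note that the biomarker enters only through its observed value at each event time, which is exactly why measurement error and sparse examination schedules cause the problems discussed below.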

“Longitudinal studies of aging present special methodological challenges due to inherent characteristics of the data that need to be addressed in order to avoid biased inference. The challenges are related to the fact that the populations under study (aging individuals) experience substantial **dropout rates** related to death or poor health and often have co-morbid conditions related to the disease of interest. The standard assumption made in longitudinal analyses (although usually not explicitly mentioned in publications) is that dropout (e.g., death) is not associated with the outcome of interest. While this can be safely assumed in many general longitudinal studies (where, e.g., the main causes of dropout might be the administrative end of the study or moving out of the study area, which are presumably not related to the studied outcomes), the very nature of the longitudinal outcomes (e.g., measurements of some physiological biomarkers) analyzed in a longitudinal study of aging assumes that they are (at least hypothetically) related to the process of aging. Because the process of aging leads to the development of diseases and, eventually, death, in longitudinal studies of aging an assumption of non-association of the reason for dropout and the outcome of interest is, at best, risky, and usually is wrong. As an illustration, we found that the average trajectories of different physiological indices of individuals dying at earlier ages markedly deviate from those of long-lived individuals, both in the entire Framingham original cohort […] and also among carriers of specific alleles […] In such a situation, **panel compositional changes due to attrition** affect the averaging procedure and modify the averages in the total sample. Furthermore, biomarkers are subject to **measurement error** and random biological variability. They are usually collected intermittently at examination times which may be sparse and typically biomarkers are not observed at event times. 
It is well known in the statistical literature that ignoring measurement errors and biological variation in such variables and using their observed “raw” values as time-dependent covariates in a Cox regression model may lead to biased estimates and incorrect inferences […] **Standard methods of survival analysis such as the Cox proportional hazards model** (Cox 1972) **with time-dependent covariates should be avoided in analyses of biomarkers measured with errors** because they can lead to biased estimates.”
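The bias the authors warn about is easy to demonstrate. The sketch below is not a survival model; it illustrates the same classical errors-in-variables phenomenon in the simplest possible setting (ordinary least squares on simulated data), where the attenuation has a closed form: the estimated slope shrinks by the factor var(x) / (var(x) + var(noise)).

```python
import random

random.seed(42)

def ols_slope(xs, ys):
    """Least-squares slope of ys on xs."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var = sum((x - mx) ** 2 for x in xs)
    return cov / var

true_beta = 2.0
x_true = [random.gauss(0, 1) for _ in range(20000)]          # true covariate
y = [true_beta * x + random.gauss(0, 0.5) for x in x_true]   # outcome
x_noisy = [x + random.gauss(0, 1) for x in x_true]           # measured with error

b_true = ols_slope(x_true, y)    # close to 2.0
b_noisy = ols_slope(x_noisy, y)  # attenuated toward 2.0 * 1/(1+1) = 1.0
print(round(b_true, 2), round(b_noisy, 2))
```

With measurement-error variance equal to the covariate's variance, the naive estimate is roughly halved; the effect on hazard-model coefficients is analogous in direction, if not in closed form.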

“Statistical methods aimed at analyses of time-to-event data jointly with longitudinal measurements have become known in the mainstream biostatistical literature as “**joint models for longitudinal and time-to-event data**” (“survival” or “failure time” are often used interchangeably with “time-to-event”) or simply “**joint models**.” This is an active and fruitful area of biostatistics with an explosive growth in recent years. […] The standard joint model consists of two parts, the first representing the dynamics of longitudinal data (which is referred to as the “longitudinal sub-model”) and the second one modeling survival or, generally, time-to-event data (which is referred to as the “survival sub-model”). […] Numerous extensions of this basic model have appeared in the joint modeling literature in recent decades, providing great flexibility in applications to a wide range of practical problems. […] The standard parameterization of the joint model (11.2) assumes that the risk of the event at age t depends on the current “true” value of the longitudinal biomarker at this age. While this is a reasonable assumption in general, it may be argued that additional dynamic characteristics of the longitudinal trajectory can also play a role in the risk of death or onset of a disease. For example, if two individuals at the same age have exactly the same level of some biomarker at this age, but the trajectory for the first individual increases faster with age than that of the second one, then the first individual can have worse survival chances for subsequent years. […] Therefore, extensions of the basic parameterization of joint models allowing for dependence of the risk of an event on such dynamic characteristics of the longitudinal trajectory can provide additional opportunities for comprehensive analyses of relationships between the risks and longitudinal trajectories. Several authors have considered such extended models. 
[…] **joint models are computationally intensive** and are sometimes prone to convergence problems [however such] models provide more efficient estimates of the effect of a covariate […] on the time-to-event outcome in the case in which there is […] an effect of the covariate on the longitudinal trajectory of a biomarker. This means that** analyses of longitudinal and time-to-event data in joint models may require smaller sample sizes to achieve comparable statistical power** **with analyses based on time-to-event data alone** (Chen et al. 2011).”
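As a concrete (and heavily simplified) illustration of the standard joint model's structure, the sketch below simulates data from a hypothetical two-part process: a linear longitudinal sub-model with subject-specific random intercept and slope, and a survival sub-model whose hazard depends on the current "true" biomarker value, h(t) = h0·exp(α·m(t)). All parameter values are invented; this is the data-generating process, not an estimation routine.

```python
import math
import random

random.seed(7)

def simulate_subject(h0=0.01, alpha=0.5, sigma_obs=0.3):
    b0 = random.gauss(0.0, 1.0)   # random intercept (longitudinal sub-model)
    b1 = random.gauss(0.1, 0.05)  # random slope
    m = lambda t: b0 + b1 * t     # true biomarker trajectory

    # Event time by inverting the cumulative hazard of the survival sub-model,
    # H(t) = h0*exp(alpha*b0)*(exp(alpha*b1*t) - 1)/(alpha*b1):
    target = -math.log(random.random())
    c = h0 * math.exp(alpha * b0)
    arg = 1.0 + alpha * b1 * target / c
    event_time = math.log(arg) / (alpha * b1) if arg > 0 else float("inf")

    # Intermittent noisy measurements until the event (the longitudinal data):
    times = [t for t in range(0, 100, 10) if t < event_time]
    measurements = [(t, m(t) + random.gauss(0, sigma_obs)) for t in times]
    return event_time, measurements

t_event, obs = simulate_subject()
print(round(t_event, 1), len(obs))
```

A joint model fit would run this logic in reverse: estimate the random effects and the association parameter α from the noisy measurements and event times jointly, rather than plugging raw values into a Cox model.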

“To be useful as a tool for biodemographers and gerontologists who seek biological explanations for observed processes, models of longitudinal data should be based on realistic assumptions and reflect relevant knowledge accumulated in the field. An example is the shape of the risk functions. Epidemiological studies show that **the conditional hazards of health and survival events considered as functions of risk factors often have U- or J-shapes** […], so a model of aging-related changes should incorporate this information. In addition, risk variables, and, what is very important, their effects on the risks of corresponding health and survival events, experience aging-related changes and these can differ among individuals. […] An important class of models for joint analyses of longitudinal and time-to-event data incorporating a stochastic process for description of longitudinal measurements uses an epidemiologically-justified assumption of a quadratic hazard (i.e., U-shaped in general and J-shaped for variables that can take values only on one side of the U-curve) considered as a function of physiological variables. **Quadratic hazard models** have been developed and intensively applied in studies of human longitudinal data”.
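The U-shaped assumption can be written down in a couple of lines. Below is a minimal sketch of a quadratic hazard for a single scalar biomarker; the baseline function, the age-dependent "optimal" value, and all numbers are invented for illustration (the models in the book are multivariate and carefully calibrated): risk is lowest at the optimum f(t) and grows quadratically with the deviation from it.

```python
def quadratic_hazard(t, x, mu0=0.001, q=0.0005):
    """mu(t, x) = mu0(t) + q * (x - f(t))^2 for a scalar biomarker x at age t."""
    baseline = mu0 * 1.08 ** (t / 10)   # slowly rising baseline hazard with age
    optimal = 120.0 - 0.1 * t           # hypothetical age-dependent optimum f(t)
    return baseline + q * (x - optimal) ** 2

# The hazard is minimized at the optimum and rises on both sides (U-shape):
at_optimum = quadratic_hazard(60, 114.0)   # f(60) = 114
above = quadratic_hazard(60, 134.0)
below = quadratic_hazard(60, 94.0)
print(at_optimum < above and at_optimum < below)
```

For a variable constrained to one side of the optimum, only one arm of the parabola is ever visited, which yields the J-shape mentioned in the quote.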

“Various approaches to statistical model building and data analysis that incorporate unobserved heterogeneity are ubiquitous in different scientific disciplines. **Unobserved heterogeneity** in models of health and survival outcomes can arise because there may be relevant risk factors affecting an outcome of interest that are either unknown or not measured in the data. Frailty models introduce the concept of unobserved heterogeneity in survival analysis for time-to-event data. […] Individual age trajectories of biomarkers can differ due to various observed as well as unobserved (and unknown) factors and such individual differences propagate to differences in risks of related time-to-event outcomes such as the onset of a disease or death. […] The joint analysis of longitudinal and time-to-event data is the realm of a special area of biostatistics named “joint models for longitudinal and time-to-event data” or simply “joint models” […] Approaches that incorporate heterogeneity in populations through random variables with continuous distributions (as in the standard joint models and their extensions […]) assume that the risks of events and longitudinal trajectories follow similar patterns for all individuals in a population (e.g., that biomarkers change linearly with age for all individuals). Although such homogeneity in patterns can be justifiable for some applications, generally this is a rather strict assumption […] A population under study may consist of subpopulations with distinct patterns of longitudinal trajectories of biomarkers that can also have different effects on the time-to-event outcome in each subpopulation. When such subpopulations can be defined on the base of observed covariate(s), one can perform stratified analyses applying different models for each subpopulation. 
However, observed covariates may not capture the entire heterogeneity in the population in which case it may be useful to conceive of the population as consisting of *latent* subpopulations defined by unobserved characteristics. Special methodological approaches are necessary to accommodate such hidden heterogeneity. Within the joint modeling framework, a special class of models, **joint latent class models**, was developed to account for such heterogeneity […] The joint latent class model has three components. First, it is assumed that a population consists of a fixed number of (latent) subpopulations. The latent class indicator represents the latent class membership and the probability of belonging to the latent class is specified by a multinomial logistic regression function of observed covariates. It is assumed that individuals from different latent classes have different patterns of longitudinal trajectories of biomarkers and different risks of event. The key assumption of the model is conditional independence of the biomarker and the time-to-events given the latent classes. Then the class-specific models for the longitudinal and time-to-event outcomes constitute the second and third component of the model thus completing its specification. […] **the latent class stochastic process model** […] provides a useful tool for dealing with unobserved heterogeneity in joint analyses of longitudinal and time-to-event outcomes and taking into account hidden components of aging in their joint influence on health and longevity. This approach is also helpful for sensitivity analyses in applications of the original stochastic process model. We recommend starting the analyses with the original stochastic process model and estimating the model ignoring possible hidden heterogeneity in the population. 
Then the latent class stochastic process model can be applied to test hypotheses about the presence of hidden heterogeneity in the data in order to appropriately adjust the conclusions if a latent structure is revealed.”
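The first component of the joint latent class model, class membership as a multinomial logistic function of observed covariates, can be sketched as follows. Covariates and coefficients here are invented; the first class serves as the reference (all-zero coefficients).

```python
import math

def class_probabilities(covariates, coef_per_class):
    """Multinomial logistic (softmax) over one linear score per latent class."""
    scores = [sum(b * x for b, x in zip(coefs, covariates))
              for coefs in coef_per_class]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]  # numerically stabilized softmax
    total = sum(exps)
    return [e / total for e in exps]

# Covariates: (intercept=1, age/100, sex); three hypothetical latent classes,
# the first being the reference class:
coefs = [(0.0, 0.0, 0.0), (1.0, -2.0, 0.5), (-1.0, 3.0, 0.0)]
probs = class_probabilities((1.0, 0.65, 1.0), coefs)
print([round(p, 3) for p in probs])
```

Given these probabilities, the rest of the model plugs in class-specific longitudinal and survival sub-models, with the biomarker and event time assumed conditionally independent given class membership.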

“**The longitudinal genetic-demographic model** (or the genetic-demographic model for longitudinal data) […] combines three sources of information in the likelihood function: (1) follow-up data on survival (or, generally, on some time-to-event) for genotyped individuals; (2) (cross-sectional) information on ages at biospecimen collection for genotyped individuals; and (3) follow-up data on survival for non-genotyped individuals. […] Such joint analyses of genotyped and non-genotyped individuals can result in substantial improvements in statistical power and accuracy of estimates compared to analyses of the genotyped subsample alone if the proportion of non-genotyped participants is large. Situations in which genetic information cannot be collected for all participants of longitudinal studies are not uncommon. They can arise for several reasons: (1) the longitudinal study may have started some time before genotyping was added to the study design so that some initially participating individuals dropped out of the study (i.e., died or were lost to follow-up) by the time of genetic data collection; (2) budget constraints prohibit obtaining genetic information for the entire sample; (3) some participants refuse to provide samples for genetic analyses. Nevertheless, even when genotyped individuals constitute a majority of the sample or the entire sample, application of such an approach is still beneficial […] **The genetic stochastic process model** […] adds a new dimension to genetic biodemographic analyses, combining information on longitudinal measurements of biomarkers available for participants of a longitudinal study with follow-up data and genetic information. 
Such **joint analyses of different sources of information** collected in both genotyped and non-genotyped individuals allow for more efficient use of the research potential of longitudinal data which otherwise remains underused when only genotyped individuals or only subsets of available information (e.g., only follow-up data on genotyped individuals) are involved in analyses. Similar to the longitudinal genetic-demographic model […], **the benefits of combining data** on genotyped and non-genotyped individuals in the genetic SPM come from the presence of common parameters describing characteristics of the model for genotyped and non-genotyped subsamples of the data. This takes into account the knowledge that the non-genotyped subsample is a mixture of carriers and non-carriers of the same alleles or genotypes represented in the genotyped subsample and applies the ideas of heterogeneity analyses […] When the non-genotyped subsample is substantially larger than the genotyped subsample, these joint analyses can lead to a noticeable increase in the power of statistical estimates of genetic parameters compared to estimates based only on information from the genotyped subsample. **This approach is applicable not only to genetic data but to any discrete time-independent variable that is observed only for a subsample of individuals in a longitudinal study.**”
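The core trick, treating the non-genotyped subsample as a mixture of the same carrier/non-carrier groups with shared genotype-specific parameters, can be sketched with exponential survival as a stand-in for the actual model. All rates and frequencies below are hypothetical.

```python
import math

def density(t, genotype, params):
    lam = params[genotype]              # genotype-specific hazard rate
    return lam * math.exp(-lam * t)     # exponential lifetime density

def individual_likelihood(t_death, genotype, params, freqs):
    if genotype is not None:            # genotyped: use the observed genotype
        return density(t_death, genotype, params)
    # non-genotyped: marginalize over genotypes with population frequencies,
    # reusing the *same* genotype-specific parameters as the genotyped part
    return sum(freqs[g] * density(t_death, g, params) for g in params)

params = {"carrier": 0.04, "non_carrier": 0.02}   # hypothetical hazard rates
freqs = {"carrier": 0.25, "non_carrier": 0.75}    # hypothetical frequencies

lik_carrier = individual_likelihood(70.0, "carrier", params, freqs)
lik_noncarrier = individual_likelihood(70.0, "non_carrier", params, freqs)
lik_unknown = individual_likelihood(70.0, None, params, freqs)
# the mixture likelihood lies between the two genotype-specific ones:
print(min(lik_carrier, lik_noncarrier) < lik_unknown
      < max(lik_carrier, lik_noncarrier))
```

Because the mixture contributions depend on the same parameters as the genotyped subsample's contributions, every non-genotyped observation still sharpens the estimates, which is where the power gain comes from.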

“Despite an existing tradition of interpreting differences in the shapes or parameters of the mortality rates (survival functions) resulting from the effects of exposure to different conditions or other interventions in terms of characteristics of individual aging, this practice has to be used with care. This is because such characteristics are difficult to interpret in terms of properties of external and internal processes affecting the chances of death. An important question then is: What kind of mortality model has to be developed to obtain parameters that are biologically interpretable? The purpose of this chapter is to describe an approach to mortality modeling that represents mortality rates in terms of parameters of physiological changes and declining health status accompanying the process of aging in humans. […] **A traditional (demographic) description of changes in individual health/survival status is performed using a continuous-time random Markov process** with a finite number of states, and age-dependent transition intensity functions (transitions rates). Transitions to the absorbing state are associated with death, and the corresponding transition intensity is a mortality rate. Although such a description characterizes connections between health and mortality, it does not allow for studying factors and mechanisms involved in the aging-related health decline. Numerous epidemiological studies provide compelling evidence that health transition rates are influenced by a number of factors. Some of them are fixed at the time of birth […]. 
Others experience stochastic changes over the life course […] **The presence of** such **randomly changing influential factors violates the Markov assumption, and makes the description of aging-related changes in health status more complicated.** […] The age dynamics of influential factors (e.g., physiological variables) in connection with mortality risks has been described using a stochastic process model of human mortality and aging […]. Recent extensions of this model have been used in analyses of longitudinal data on aging, health, and longevity, collected in the Framingham Heart Study […] This model and its extensions are described in terms of **a Markov stochastic process satisfying a diffusion-type stochastic differential equation.** The stochastic process is stopped at random times associated with individuals’ deaths. […] When an individual’s health status is taken into account, the coefficients of the stochastic differential equations become dependent on values of the **jumping process.** This dependence violates the Markov assumption and renders the conditional Gaussian property invalid. So the description of this (continuously changing) component of aging-related changes in the body also becomes more complicated. Since studying age trajectories of physiological states in connection with changes in health status and mortality would provide more realistic scenarios for analyses of available longitudinal data, it would be a good idea to find an appropriate mathematical description of the joint evolution of these interdependent processes in aging organisms. For this purpose, **we propose a comprehensive model of human aging, health, and mortality in which the Markov assumption is fulfilled by a two-component stochastic process consisting of jumping and continuously changing processes. The jumping component is used to describe relatively fast changes in health status occurring at random times, and the continuous component describes relatively slow stochastic age-related changes of individual physiological states.** […] The use of stochastic differential equations for random continuously changing covariates has been studied intensively in the analysis of longitudinal data […] Such a description is convenient since it captures the feedback mechanism typical of biological systems reflecting regular aging-related changes and takes into account the presence of random noise affecting individual trajectories. It also captures the dynamic connections between aging-related changes in health and physiological states, which are important in many applications.”
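A toy simulation (all parameters invented) can make the two-component construction concrete: a jumping component for health status, i.e. a finite-state process changing at random times, and a continuous component for a physiological variable following a diffusion-type stochastic differential equation, discretized here with a simple Euler scheme, whose drift depends on the current health state.

```python
import math
import random

random.seed(1)

def simulate(n_steps=3000, dt=0.01):
    health = 0                  # 0 = healthy, 1 = sick (jumping component)
    jump_rate = 0.05            # intensity of the health-state jump
    x = 100.0                   # physiological variable (continuous component)
    path = []
    for step in range(n_steps):
        # the value the variable is pulled toward depends on health state:
        target = 100.0 if health == 0 else 130.0
        # feedback drift toward the target plus diffusion noise (Euler step):
        x += -0.1 * (x - target) * dt + 2.0 * math.sqrt(dt) * random.gauss(0, 1)
        if health == 0 and random.random() < jump_rate * dt:
            health = 1          # jump to the sick state at a random time
        path.append((step * dt, health, x))
    return path

path = simulate()
print(len(path), path[-1][1] in (0, 1))
```

The mean-reverting drift is the "feedback mechanism" the quote refers to: after the jump, the continuous component is pulled toward a different equilibrium, coupling the two components.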

## Biodemography of aging (III)

Latent class representation of the Grade of Membership model.

Singular value decomposition.

Affine space.

Lebesgue measure.

General linear position.

The links above are links to topics I looked up while reading the second half of the book. The first link is quite relevant to the book’s coverage as a comprehensive longitudinal Grade of Membership (-GoM) model is covered in chapter 17. Relatedly, chapter 18 covers linear latent structure (-LLS) models, and as observed in the book LLS is a generalization of GoM. As should be obvious from the nature of the links some of the stuff included in the second half of the text is highly technical, and I’ll readily admit I was not fully able to understand all the details included in the coverage of chapters 17 and 18 in particular. On account of the technical nature of the coverage in Part 2 I’m not sure I’ll cover the second half of the book in much detail, though I probably shall devote at least one more post to some of those topics, as they were quite interesting even if some of the details were difficult to follow.

I have almost finished the book at this point, and I have already decided to both give the book five stars and include it on my list of favorite books on goodreads; it’s really well written, and it provides consistently highly detailed coverage of very high quality. As I also noted in the first post about the book the authors have given readability aspects some thought, and I am sure most readers would learn quite a bit from this text even if they were to skip some of the more technical chapters. The main body of Part 2 of the book, the subtitle of which is ‘Statistical Modeling of Aging, Health, and Longevity’, is however probably in general not worth the effort of reading unless you have a solid background in statistics.

This post includes some observations and quotes from the last chapters of the book’s Part 1.

…

“The proportion of older adults in the U.S. population is growing. This raises important questions about the increasing prevalence of aging-related diseases, multimorbidity issues, and disability among the elderly population. […] In 2009, 46.3 million people were covered by Medicare: 38.7 million of them were aged 65 years and older, and 7.6 million were disabled […]. By 2031, when the baby-boomer generation will be completely enrolled, Medicare is expected to reach 77 million individuals […]. Because the Medicare program covers 95 % of the nation’s aged population […], the prediction of future Medicare costs based on these data can be an important source of health care planning.”

“Three essential components (which could be also referred as sub-models) need to be developed to construct a modern model of forecasting of population health and associated medical costs: (i) a model of medical cost projections conditional on each health state in the model, (ii) health state projections, and (iii) a description of the distribution of initial health states of a cohort to be projected […] In making medical cost projections, two major effects should be taken into account: the dynamics of the medical costs during the time periods comprising the date of onset of chronic diseases and the increase of medical costs during the last years of life. In this chapter, we investigate and model the first of these two effects. […] the approach developed in this chapter generalizes the approach known as “life tables with covariates” […], resulting in a new family of forecasting models with covariates such as comorbidity indexes or medical costs. In sum, this chapter develops a model of the relationships between individual cost trajectories following the onset of aging-related chronic diseases. […] The underlying methodological idea is to aggregate the health state information into a single (or several) covariate(s) that can be determinative in predicting the risk of a health event (e.g., disease incidence) and whose dynamics could be represented by the model assumptions. An advantage of such an approach is its substantial reduction of the degrees of freedom compared with existing forecasting models (e.g., the FEM model, Goldman and RAND Corporation 2004). 
[…] We found that the time patterns of medical cost trajectories were similar for all diseases considered and can be described in terms of four components having the meanings of (i) the pre-diagnosis cost associated with initial comorbidity represented by medical expenditures, (ii) the cost peak associated with the onset of each disease, (iii) the decline/reduction in medical expenditures after the disease onset, and (iv) the difference between post- and pre-diagnosis cost levels associated with an acquired comorbidity. The description of the trajectories was formalized by a model which explicitly involves four parameters reflecting these four components.”
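The four-component shape can be captured by a simple function. The parameterization below is my own illustrative stand-in, not the model estimated in the chapter: a flat pre-diagnosis level (i), a cost peak at onset (ii), an exponential decline after onset (iii), and a post-diagnosis level differing from the pre-diagnosis one (iv).

```python
import math

def cost_trajectory(months_from_onset, pre=500.0, peak=8000.0,
                    post=900.0, decline_rate=0.4):
    """Monthly medical cost as a function of time relative to disease onset."""
    if months_from_onset < 0:
        return pre                                  # component (i)
    # peak at onset (ii) decaying (iii) toward the acquired-comorbidity
    # level (iv), which sits above the pre-diagnosis level:
    return post + (peak - post) * math.exp(-decline_rate * months_from_onset)

print(cost_trajectory(-6), cost_trajectory(0), round(cost_trajectory(24), 1))
```

With a two-year follow-up the post-diagnosis plateau (component iv) is only approximately identified, which is exactly the limitation discussed below.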

As I noted earlier in my coverage of the book, I don’t think the model above fully captures all relevant cost contributions of the diseases included, as the follow-up period was too short to capture all the costs relevant to the part (iv) model component. This is definitely a problem in the context of diabetes. Then again, nothing in theory stops people from combining the model above with other models that are better at dealing with the excess costs associated with long-term complications of chronic diseases, and the model results are intriguing even if the model likely underperforms in a few specific disease contexts.

Moving on…

“Models of medical cost projections usually are based on regression models estimated with the majority of independent predictors describing demographic status of the individual, patient’s health state, and level of functional limitations, as well as their interactions […]. If the health states needs to be described by a number of simultaneously manifested diseases, then detailed stratification over the categorized variables or use of multivariate regression models allows for a better description of the health states. However, it can result in an abundance of model parameters to be estimated. One way to overcome these difficulties is to use an approach in which the model components are demographically-based aggregated characteristics that mimic the effects of specific states. The model developed in this chapter is an example of such an approach: the use of a comorbidity index rather than of a set of correlated categorical regressor variables to represent the health state allows for an essential reduction in the degrees of freedom of the problem.”
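The dimension-reduction idea is simple to sketch: instead of entering many correlated disease indicators (and their interactions) as separate regressors, aggregate them into a single weighted index. The diseases and weights below are invented for illustration, not the index used in the book.

```python
# Hypothetical diseases and severity weights:
weights = {"diabetes": 1, "chf": 2, "cancer": 2, "renal_disease": 3}

def comorbidity_index(diseases):
    """One scalar covariate replacing len(weights) indicator columns."""
    return sum(weights.get(d, 0) for d in diseases)

print(comorbidity_index({"diabetes", "chf"}))   # 3
print(comorbidity_index(set()))                 # 0
```

A single covariate like this trades some descriptive resolution of the health state for a large reduction in the number of parameters to be estimated.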

“Unlike mortality, the onset time of chronic disease is difficult to define with high precision due to the large variety of disease-specific criteria for onset/incident case identification […] there is always some arbitrariness in defining the date of chronic disease onset, and a unified definition of date of onset is necessary for population studies with a long-term follow-up.”

“Individual age trajectories of physiological indices are the product of a complicated interplay among genetic and non-genetic (environmental, behavioral, stochastic) factors that influence the human body during the course of aging. Accordingly, they may differ substantially among individuals in a cohort. Despite this fact, the average age trajectories for the same index follow remarkable regularities. […] some indices tend to change monotonically with age: the level of blood glucose (BG) increases almost monotonically; pulse pressure (PP) increases from age 40 until age 85, then levels off and shows a tendency to decline only at later ages. The age trajectories of other indices are non-monotonic: they tend to increase first and then decline. Body mass index (BMI) increases up to about age 70 and then declines, diastolic blood pressure (DBP) increases until age 55–60 and then declines, systolic blood pressure (SBP) increases until age 75 and then declines, serum cholesterol (SCH) increases until age 50 in males and age 70 in females and then declines, ventricular rate (VR) increases until age 55 in males and age 45 in females and then declines. With small variations, these general patterns are similar in males and females. The shapes of the age-trajectories of the physiological variables also appear to be similar for different genotypes. […] The effects of these physiological indices on mortality risk were studied in Yashin et al. (2006), who found that the effects are gender and age specific. They also found that the dynamic properties of the individual age trajectories of physiological indices may differ dramatically from one individual to the next.”

“An increase in the mortality rate with age is traditionally associated with the process of aging. This influence is mediated by aging-associated changes in thousands of biological and physiological variables, some of which have been measured in aging studies. The fact that the age trajectories of some of these variables differ among individuals with short and long life spans and healthy life spans indicates that dynamic properties of the indices affect life history traits. Our analyses of the FHS data clearly demonstrate that the values of physiological indices at age 40 are significant contributors both to life span and healthy life span […] suggesting that normalizing these variables around age 40 is important for preventing age-associated morbidity and mortality later in life. […] results [also] suggest that keeping physiological indices stable over the years of life could be as important as their normalizing around age 40.”

“The results […] indicate that, in the quest of identifying longevity genes, it may be important to look for candidate genes with pleiotropic effects on more than one dynamic characteristic of the age-trajectory of a physiological variable, such as genes that may influence both the initial value of a trait (intercept) and the rates of its changes over age (slopes). […] Our results indicate that the dynamic characteristics of age-related changes in physiological variables are important predictors of morbidity and mortality risks in aging individuals. […] We showed that the initial value (*intercept*), the rate of changes (*slope*), and the *variability* of a physiological index, in the age interval 40–60 years, significantly influenced both mortality risk and onset of unhealthy life at ages 60+ in our analyses of the Framingham Heart Study data. That is, these dynamic characteristics may serve as good predictors of late life morbidity and mortality risks. The results also suggest that physiological changes taking place in the organism in middle life may affect longevity through promoting or preventing diseases of old age. For non-monotonically changing indices, we found that having a later age at the peak value of the index […], a lower peak value […], a slower rate of decline in the index at older ages […], and less variability in the index over time, can be beneficial for longevity. Also, the dynamic characteristics of the physiological indices were, overall, associated with mortality risk more significantly than with onset of unhealthy life.”
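The three dynamic characteristics named in the quote can be extracted from a subject's measurements with a least-squares fit. The sketch below (with hypothetical SBP values) returns the intercept, the slope, and the residual variability of a physiological index over the age 40–60 window.

```python
def dynamic_characteristics(ages, values):
    """Least-squares intercept and slope, plus residual standard deviation."""
    n = len(ages)
    ma, mv = sum(ages) / n, sum(values) / n
    slope = (sum((a - ma) * (v - mv) for a, v in zip(ages, values))
             / sum((a - ma) ** 2 for a in ages))
    intercept = mv - slope * ma
    residuals = [v - (intercept + slope * a) for a, v in zip(ages, values)]
    variability = (sum(r * r for r in residuals) / n) ** 0.5
    return intercept, slope, variability

ages = [40, 44, 48, 52, 56, 60]
sbp = [118.0, 121.0, 123.5, 127.0, 129.0, 133.0]   # hypothetical SBP values
b0, b1, s = dynamic_characteristics(ages, sbp)
print(round(b1, 2), round(s, 2))
```

These three summaries (intercept, slope, variability) are then what enters a model of late-life morbidity and mortality risk as predictors.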

“Decades of studies of candidate genes show that they are not linked to aging-related traits in a straightforward manner […]. Recent genome-wide association studies (GWAS) have reached fundamentally the same conclusion by showing that the traits in late life likely are controlled by a relatively large number of common genetic variants […]. Further, GWAS often show that the detected associations are of tiny effect […] the weak effect of genes on traits in late life can be not only because they confer small risks having small penetrance but because they confer large risks but in a complex fashion […] In this chapter, we consider several examples of complex modes of gene actions, including genetic tradeoffs, antagonistic genetic effects on the same traits at different ages, and variable genetic effects on lifespan. The analyses focus on the *APOE* common polymorphism. […] The analyses reported in this chapter suggest that the e4 allele can be protective against cancer with a more pronounced role in men. This protective effect is more characteristic of cancers at older ages and it holds in both the parental and offspring generations of the FHS participants. Unlike cancer, the effect of the e4 allele on risks of CVD is more pronounced in women. […] [The] results […] explicitly show that the same allele can change its role on risks of CVD in an antagonistic fashion from detrimental in women with onsets at younger ages to protective in women with onsets at older ages. […] e4 allele carriers have worse survival compared to non-e4 carriers in each cohort. […] Sex stratification shows sexual dimorphism in the effect of the e4 allele on survival […] with the e4 female carriers, particularly, being more exposed to worse survival. […] The results of these analyses provide two important insights into the role of genes in lifespan. First, they provide evidence on the key role of aging-related processes in genetic susceptibility to lifespan. 
For example, taking into account the specifics of aging-related processes gains 18 % in estimates of the RRs and five orders of magnitude in significance in the same sample of women […] without additional investments in increasing sample sizes and new genotyping. The second is that a detailed study of the role of aging-related processes in estimates of the effects of genes on lifespan (and healthspan) helps in detecting more homogeneous [high risk] sub-samples”.

“The aging of populations in developed countries requires effective strategies to extend healthspan. A promising solution could be to yield insights into the genetic predispositions for endophenotypes, diseases, well-being, and survival. It was thought that genome-wide association studies (GWAS) would be a major breakthrough in this endeavor. Various genetic association studies including GWAS assume that there should be a deterministic (unconditional) genetic component in such complex phenotypes. However, the idea of unconditional contributions of genes to these phenotypes faces serious difficulties which stem from the lack of direct evolutionary selection against or in favor of such phenotypes. In fact, evolutionary constraints imply that genes should be linked to age-related phenotypes in a complex manner through different mechanisms specific for given periods of life. Accordingly, the linkage between genes and these traits should be strongly modulated by age-related processes in a changing environment, i.e., by the individuals’ life course. The inherent sensitivity of genetic mechanisms of complex health traits to the life course will be a key concern as long as genetic discoveries continue to be aimed at improving human health.”

“Despite the common understanding that age is a risk factor of not just one but a large portion of human diseases in late life, each specific disease is typically considered as a stand-alone trait. Independence of diseases was a plausible hypothesis in the era of infectious diseases caused by different strains of microbes. Unlike those diseases, the exact etiology and precursors of diseases in late life are still elusive. It is clear, however, that the origin of these diseases differs from that of infectious diseases and that age-related diseases reflect a complicated interplay among ontogenetic changes, senescence processes, and damages from exposures to environmental hazards. Studies of the determinants of diseases in late life provide insights into a number of risk factors, apart from age, that are common for the development of many health pathologies. The presence of such common risk factors makes chronic diseases and hence risks of their occurrence interdependent. This means that the results of many calculations using the assumption of disease independence should be used with care. Chapter 4 argued that disregarding potential dependence among diseases may seriously bias estimates of potential gains in life expectancy attributable to the control or elimination of a specific disease and that the results of the process of coping with a specific disease will depend on the disease elimination strategy, which may affect mortality risks from other diseases.”

## Diabetes and the brain (IV)

Here’s one of my previous posts in the series about the book. In this post I’ll cover material dealing with two acute hyperglycemia-related diabetic complications (DKA and HHS – see below…) as well as multiple topics related to diabetes and stroke. I’ll start out with a few quotes from the book about DKA and HHS:

…

“DKA [diabetic ketoacidosis] is defined by a triad of hyperglycemia, ketosis, and acidemia and occurs in the absolute or near-absolute absence of insulin. […] DKA accounts for the bulk of morbidity and mortality in children with T1DM. National population-based studies estimate DKA mortality at 0.15% in the United States (4), 0.18–0.25% in Canada (4, 5), and 0.31% in the United Kingdom (6). […] Rates reach 25–67% in those who are newly diagnosed (4, 8, 9). The rates are higher in younger children […] The risk of DKA among patients with pre-existing diabetes is 1–10% annual per person […] DKA can present with mild-to-severe symptoms. […] polyuria and polydipsia […] patients may present with signs of dehydration, such as tachycardia and dry mucus membranes. […] Vomiting, abdominal pain, malaise, and weight loss are common presenting symptoms […] Signs related to the ketoacidotic state include hyperventilation with deep breathing (Kussmaul’s respiration) which is a compensatory respiratory response to an underlying metabolic acidosis. Acetonemia may cause a fruity odor to the breath. […] Elevated glucose levels are almost always present; however, euglycemic DKA has been described (19). Anion-gap metabolic acidosis is the hallmark of this condition and is caused by elevated ketone bodies.”

“Clinically significant cerebral edema occurs in approximately 1% of patients with diabetic ketoacidosis […] DKA-related cerebral edema may represent a continuum. Mild forms resulting in subtle edema may result in modest mental status abnormalities whereas the most severe manifestations result in overt cerebral injury. […] Cerebral edema typically presents 4–12 h after the treatment for DKA is started (28, 29), but can occur at any time. […] Increased intracranial pressure with cerebral edema has been recognized as the leading cause of morbidity and mortality in pediatric patients with DKA (59). Mortality from DKA-related cerebral edema in children is high, up to 90% […] and accounts for 60–90% of the mortality seen in DKA […] many patients are left with major neurological deficits (28, 31, 35).”

“The hyperosmolar hyperglycemic state (HHS) is also an acute complication that may occur in patients with diabetes mellitus. It is seen primarily in patients with T2DM and has previously been referred to as “hyperglycemic hyperosmolar non-ketotic coma” or “hyperglycemic hyperosmolar non-ketotic state” (13). HHS is marked by profound dehydration and hyperglycemia and often by some degree of neurological impairment. The term hyperglycemic hyperosmolar state is used because (1) ketosis may be present and (2) there may be varying degrees of altered sensorium besides coma (13). Like DKA, the basic underlying disorder is inadequate circulating insulin, but there is often enough insulin to inhibit free fatty acid mobilization and ketoacidosis. […] Up to 20% of patients diagnosed with HHS do not have a previous history of diabetes mellitus (14). […] Kitabchi et al. estimated the rate of hospital admissions due to HHS to be lower than DKA, accounting for less than 1% of all primary diabetic admissions (13). […] Glucose levels rise in the setting of relative insulin deficiency. The low levels of circulating insulin prevent lipolysis, ketogenesis, and ketoacidosis (62) but are unable to suppress hyperglycemia, glucosuria, and water losses. […] HHS typically presents with one or more precipitating factors, similar to DKA. […] Acute infections […] account for approximately 32–50% of precipitating causes (13). […] The mortality rates for HHS vary between 10 and 20% (14, 93).”

It should perhaps be noted explicitly that the mortality rates for these complications are particularly high in very young individuals (DKA) and in elderly individuals (HHS), who may have multiple comorbidities. Relatedly, HHS often develops acutely in settings where the precipitating factor is something really unpleasant, like pneumonia or a cardiovascular event, so a high-ish mortality rate is perhaps not that surprising. Nor is it surprising that very young brains are particularly vulnerable in the context of DKA (I already discussed some of the research on these matters in some detail in an earlier post about this book).

This post to some extent covered the topic of ‘stroke in general’; however, I wanted to include here some more data specifically on the diabetes-related aspects of this topic. Here’s a quote to start off with:

“DM [Diabetes Mellitus] has been consistently shown to represent a strong independent risk factor of ischemic stroke. […] The contribution of hyperglycemia to increased stroke risk is not proven. […] the relationship between hyperglycemia and stroke remains subject of debate. In this respect, the association between hyperglycemia and cerebrovascular disease is established less strongly than the association between hyperglycemia and coronary heart disease. […] The course of stroke in patients with DM is characterized by higher mortality, more severe disability, and higher recurrence rate […] It is now well accepted that the risk of stroke in individuals with DM is equal to that of individuals with a history of myocardial infarction or stroke, but no DM (24–26). This was confirmed in a recently published large retrospective study which enrolled all inhabitants of Denmark (more than 3 million people out of whom 71,802 patients with DM) and were followed-up for 5 years. In men without DM the incidence of stroke was 2.5 in those without and 7.8% in those with prior myocardial infarction, whereas in patients with DM it was 9.6 in those without and 27.4% in those with history of myocardial infarction. In women the numbers were 2.5, 9.0, 10.0, and 14.2%, respectively (22).”

That study incidentally is very nice for me in particular to know about, given that I am a Danish diabetic. I do not here face any of the usual tiresome questions about ‘external validity’ and issues pertaining to ‘extrapolating out of sample’ – not only is it quite likely I’ve actually looked at some of the data used in that analysis myself, I also *know* that I am almost certainly one of the people included in the analysis. Of course you need other data as well to assess risk (e.g. age, see the previously linked post), but this is pretty clean as far as it goes. Moving on…

“The number of deaths from stroke attributable to DM is highest in low-and-middle-income countries […] the relative risk conveyed by DM is greater in younger subjects […] It is not well known whether type 1 or type 2 DM affects stroke risk differently. […] In the large cohort of women enrolled in the Nurses’ Health Study (116,316 women followed for up to 26 years) it was shown that the incidence of total stroke was fourfold higher in women with type 1 DM and twofold higher among women with type 2 DM than for non-diabetic women (33). […] The impact of DM duration as a stroke risk factor has not been clearly defined. […] In this context it is important to note that the actual duration of type 2 DM is difficult to determine precisely […*and more generally: “the date of onset of a certain chronic disease is a quantity which is not defined as precisely as mortality“, as Yashin et al. put it – I also talked about this topic in my previous post, but it’s important when you’re looking at these sorts of things and is worth reiterating – US*]. […] Traditional risk factors for stroke such as arterial hypertension, dyslipidemia, atrial fibrillation, heart failure, and previous myocardial infarction are more common in people with DM […]. However, the impact of DM on stroke is not just due to the higher prevalence of these risk factors, as the risk of mortality and morbidity remains over twofold increased after correcting for these factors (4, 37). […] It is informative to distinguish between factors that are non-specific and specific to DM. DM-specific factors, including chronic hyperglycemia, DM duration, DM type and complications, and insulin resistance, may contribute to an elevated stroke risk either by amplification of the harmful effect of other “classical” non-specific risk factors, such as hypertension, or by acting independently.”

More than a few variables are known to impact stroke risk, but the fact that many of the risk factors are related to each other (‘fat people often also have high blood pressure’) makes it hard to figure out which variables are most important, how they interact with each other, etc. One might in that context conceptualize the metabolic syndrome (MS) as a sort of indicator variable for whether a relatively common *set* of such related potential risk factors is present or not. It is worth noting in that context that the authors include in the text the observation that: “it is yet uncertain if the whole concept of the MS entails more than its individual components. The clustering of risk factors complicates the assessment of the contribution of individual components to the risk of vascular events, as well as assessment of synergistic or interacting effects.” MS confers a two- to threefold increased stroke risk, depending on the definition and the population analyzed, so there’s definitely some relevant stuff included in that box, but in the context of developing new treatment options and of better risk assessment it might be helpful – to put it simplistically – to know whether variable X is significantly more important than variable Y (and how the variables interact, etc.). But this sort of information is hard to get.

There’s more than one type of stroke, and the way diabetes modifies the risk of various stroke types is not completely clear:

“Most studies have consistently shown that DM is an important risk factor for ischemic stroke, while the incidence of hemorrhagic stroke in subjects with DM does not seem to be increased. Consequently, the ratio of ischemic to hemorrhagic stroke is higher in patients with DM than in those stroke patients without DM [*recall the base rates I’ve mentioned before in the coverage of this book: 80% of strokes are ischemic strokes in Western countries, and 15 % hemorrhagic*] […] The data regarding an association between DM and the risk of hemorrhagic stroke are quite conflicting. In the most series no increased risk of cerebral hemorrhage was found (10, 101), and in the Copenhagen Stroke Registry, hemorrhagic stroke was even six times less frequent in diabetic patients than in non-diabetic subjects (102). […] However, in another prospective population-based study DM was associated with an increased risk of primary intracerebral hemorrhage (103). […] The significance of DM as a risk factor of hemorrhagic stroke could differ depending on ethnicity of subjects or type of DM. In the large Nurses’ Health Study type 1 DM increased the risk of hemorrhagic stroke by 3.8 times while type 2 DM did not increase such a risk (96). […] It is yet unclear if DM predominantly predisposes to either large or small vessel ischemic stroke. Nevertheless, lacunar stroke (small, less than 15mm in diameter infarction, cyst-like, frequently multiple) is considered to be the typical type of stroke in diabetic subjects (105–107), and DM may be present in up to 28–43% of patients with cerebral lacunar infarction (108–110).”

The Danish results mentioned above might not be as useful to me as they were before if the type is important, because the majority of the diabetics included were type 2 diabetics. I know from personal experience that it is difficult to type-identify diabetics using the Danish registry data available if you want to work with population-level data, and any scheme attempting this will be subject to potentially large misidentification problems. *Some* subgroups can presumably be correctly identified using diagnostic codes, but a very large number of individuals will be left out of the analyses if you rely only on identification strategies where you’re (at least reasonably?) certain about the type. I’ve worked on these identification problems during my graduate work, so perhaps a few more things are worth mentioning here. In the context of diabetic subgroup analyses, misidentification is in general a much larger problem for type 1 results than for type 2 results: unless the study design takes the large prevalence difference between the two conditions into account, the type 1 sample will be much smaller than the type 2 sample in pretty much all analytical contexts, so a small number of misidentified type 2 individuals can have a large impact on the results of the type 1 sample. Type 1 individuals misidentified as type 2 individuals is in general to be expected to be a much smaller problem in terms of the validity of the type 2 analysis; misidentification of that kind will cause a loss of power in the type 1 subgroup analysis, which is already low to start with (and it’ll also make the type 1 subgroup analysis even more vulnerable to misidentified type 2s), but it won’t change the results of the type 2 subgroup analysis in any significant way.
Relatedly, even if enough type 2 patients are misidentified to cause problems with the interpretation of the type 1 subgroup analysis, this would not on its own be a good reason to doubt the results of the type 2 subgroup analysis. Another thing to note is that misidentification will tend to lead to ‘mixing’, i.e. it’ll make the subgroup results look more similar than they really are. So when outcomes are *not* similar in the type 1 and the type 2 individuals, this might be taken as an indicator that something potentially interesting is going on, because most analyses will struggle with some level of misidentification, which will tend to reduce the power of tests of group differences.
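The dilution argument above can be sketched in a small simulation. Everything in it is hypothetical – the group sizes, the 5% mislabelling rate, and the assumed event risks are made-up numbers for illustration, not estimates from any registry – but it shows the asymmetry: a modest number of mislabelled type 2 patients pulls the type 1 subgroup estimate toward the type 2 value, while the type 2 estimate barely moves.

```python
import random

random.seed(0)

# Hypothetical setup: event risks differ by diabetes type (made-up numbers).
P_EVENT = {"t1": 0.30, "t2": 0.10}
N_T1, N_T2 = 2000, 20000   # the type 1 sample is much smaller
MISID_RATE = 0.05          # share of type 2 patients mislabelled as type 1

def subgroup_event_rates(misid_rate):
    """Observed event rates in the *labelled* t1 and t2 subgroups."""
    labelled = {"t1": [], "t2": []}
    for _ in range(N_T1):   # true type 1 patients, labelled correctly here
        labelled["t1"].append(random.random() < P_EVENT["t1"])
    for _ in range(N_T2):   # true type 2 patients, occasionally mislabelled
        label = "t1" if random.random() < misid_rate else "t2"
        labelled[label].append(random.random() < P_EVENT["t2"])
    return {k: sum(v) / len(v) for k, v in labelled.items()}

clean = subgroup_event_rates(0.0)
noisy = subgroup_event_rates(MISID_RATE)
# The labelled t1 rate is pulled toward the t2 rate, because the roughly
# 1,000 mislabelled type 2 patients make up a third of the labelled t1 group;
# the t2 rate is essentially unchanged.
```

With these numbers the labelled type 1 group ends up being about one-third misidentified type 2 patients, which is exactly the ‘mixing’ described above; the same absolute number of mislabelled patients is a rounding error in the 20,000-strong type 2 group.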

What about stroke outcomes? A few observations were included on that topic above, but the book has a lot more stuff on that – some observations on this topic:

“DM is an independent risk factor of death from stroke […]. Tuomilehto et al. (35) calculated that 16% of all stroke mortality in men and 33% in women could be directly attributed to DM. Patients with DM have higher hospital and long-term stroke mortality, more pronounced residual neurological deficits, and more severe disability after acute cerebrovascular accidents […]. The 1-year mortality rate, for example, was twofold higher in diabetic patients compared to non-diabetic subjects (50% vs. 25%) […]. Only 20% of people with DM survive over 5 years after the first stroke and half of these patients die within the first year (36, 128). […] The mechanisms underlying the worse outcome of stroke in diabetic subjects are not fully understood. […] Regarding prevention of stroke in patients with DM, it may be less relevant than in non-DM subjects to distinguish between primary and secondary prevention as all patients with DM are considered to be high-risk subjects regardless of the history of cerebrovascular accidents or the presence of clinical and subclinical vascular lesions. […] The influence of the mode of antihyperglycemic treatment on the risk of stroke is uncertain.”

Control of blood pressure is very important in the diabetic setting:

“There are no doubts that there is a linear relation between elevated systolic blood pressure and the risk of stroke, both in people with or without DM. […] Although DM and arterial hypertension represent significant independent risk factors for stroke if they co-occur in the same patient the risk increases dramatically. A prospective study of almost 50 thousand subjects in Finland followed up for 19 years revealed that the hazard ratio for stroke incidence was 1.4, 2.0, 2.5, 3.5, and 4.5 and for stroke mortality was 1.5, 2.6, 3.1, 5.6, and 9.3, respectively, in subjects with an isolated modestly elevated blood pressure (systolic 140–159/diastolic 90–94 mmHg), isolated more severe hypertension (systolic >159 mmHg, diastolic >94 mmHg, or use of antihypertensive drugs), with isolated DM only, with both DM and modestly elevated blood pressure, and with both DM and more severe hypertension, relative to subjects without either of the risk factors (168). […] it remains unclear whether some classes of antihypertensive agents provide a stronger protection against stroke in diabetic patients than others. […] effective antihypertensive treatment is highly beneficial for reduction of stroke risk in diabetic patients, but the advantages of any particular class of antihypertensive medications are not substantially proven.”

Treatment of dyslipidemia is also very important, but here it does seem to matter how you treat it:

“It seems that the beneficial effect of statins is dose-dependent. The lower the LDL level that is achieved the stronger the cardiovascular protection. […] Recently, the results of the meta-analysis of 14 randomized trials of statins in 18,686 patients with DM had been published. It was calculated that statins use in diabetic patients can result in a 21% reduction of the risk of any stroke per 1 mmol/l reduction of LDL achieved […] There is no evidence from trials that supports efficacy of fibrates for stroke prevention in diabetic patients. […] No reduction of stroke risk by fibrates was shown also in a meta-analysis of eight trials enrolled 12,249 patients with type 2 DM (204).”

Antiplatelets?

“Significant reductions in stroke risk in diabetic patients receiving antiplatelet therapy were found in large-scale controlled trials (205). It appears that based on the high incidence of stroke and prevalence of stroke risk factors in the diabetic population the benefits of routine aspirin use for primary and secondary stroke prevention outweigh its potential risk of hemorrhagic stroke especially in patients older than 30 years having at least one additional risk factor (206). […] both guidelines issued by the AHA/ADA or the ESC/EASD on the prevention of cardiovascular disease in patients with DM support the use of aspirin in a dose of 50–325 mg daily for the primary prevention of stroke in subjects older than 40 years of age and additional risk factors, such as DM […] The newer antiplatelet agent, clopidogrel, was more efficacious in prevention of ischemic stroke than aspirin with greater risk reduction in the diabetic cohort especially in those treated with insulin compared to non-diabetics in CAPRIE trial (209). However, the combination of aspirin and clopidogrel does not appear to be more efficacious and safe compared to clopidogrel or aspirin alone”.

When you treat all risk factors aggressively, it turns out that the elevated stroke risk can be substantially reduced. Again the data on this stuff is from Denmark:

“Gaede et al. (216) have shown in the Steno 2 study that intensive multifactorial intervention aimed at correction of hyperglycemia, hypertension, dyslipidemia, and microalbuminuria along with aspirin use resulted in a reduction of cardiovascular morbidity including non-fatal stroke […] recently the results of the extended 13.3 years follow-up of this study were presented and the reduction of cardiovascular mortality by 57% and morbidity by 59% along with the reduction of the number of non-fatal stroke (6 vs. 30 events) in intensively treated group was convincingly demonstrated (217). Antihypertensive, hypolipidemic treatment, use of aspirin should thus be recommended as either primary or secondary prevention of stroke for patients with DM.”

## Quotes

i. “The combination of some data and an aching desire for an answer does not ensure that a reasonable answer can be extracted from a given body of data.” (John Tukey)

ii. “Far better an approximate answer to the *right* question, which is often vague, than an *exact* answer to the wrong question, which can always be made precise.” (-ll-)

iii. “They who can no longer unlearn have lost the power to learn.” (John Lancaster Spalding)

iv. “If there are but few who interest thee, why shouldst thou be disappointed if but few find thee interesting?” (-ll-)

v. “Since the mass of mankind are too ignorant or too indolent to think seriously, if majorities are right it is by accident.” (-ll-)

vi. “As they are the bravest who require no witnesses to their deeds of daring, so they are the best who do right without thinking whether or not it shall be known.” (-ll-)

vii. “Perfection is beyond our reach, but they who earnestly strive to become perfect, acquire excellences and virtues of which the multitude have no conception.” (-ll-)

viii. “We are made ridiculous less by our defects than by the affectation of qualities which are not ours.” (-ll-)

ix. “If thy words are wise, they will not seem so to the foolish: if they are deep the shallow will not appreciate them. Think not highly of thyself, then, when thou art praised by many.” (-ll-)

x. “Since all models are wrong the scientist cannot obtain a “correct” one by excessive elaboration. On the contrary following William of Occam he should seek an economical description of natural phenomena. Just as the ability to devise simple but evocative models is the signature of the great scientist so overelaboration and overparameterization is often the mark of mediocrity.” (George E. P. Box)

xi. “Intense ultraviolet (UV) radiation from the young Sun acted on the atmosphere to form small amounts of very many gases. Most of these dissolved easily in water, and fell out in rain, making Earth’s surface water rich in carbon compounds. […] the most important chemical of all may have been cyanide (HCN). It would have formed easily in the upper atmosphere from solar radiation and meteorite impact, then dissolved in raindrops. Today it is broken down almost at once by oxygen, but early in Earth’s history it built up at low concentrations in lakes and oceans. Cyanide is a basic building block for more complex organic molecules such as amino acids and nucleic acid bases. Life probably evolved in chemical conditions that would kill us instantly!” (Richard Cowen, History of Life, p.8)

xii. “Dinosaurs dominated land communities for 100 million years, and it was only after dinosaurs disappeared that mammals became dominant. It’s difficult to avoid the suspicion that dinosaurs were in some way competitively superior to mammals and confined them to small body size and ecological insignificance. […] Dinosaurs dominated many guilds in the Cretaceous, including that of large browsers. […] in terms of their reconstructed behavior […] dinosaurs should be compared not with living reptiles, but with living mammals and birds. […] By the end of the Cretaceous there were mammals with varied sets of genes but muted variation in morphology. […] All Mesozoic mammals were small. Mammals with small bodies can play only a limited number of ecological roles, mainly insectivores and omnivores. But when dinosaurs disappeared at the end of the Cretaceous, some of the Paleocene mammals quickly evolved to take over many of their ecological roles” (ibid., pp. 145, 154, 222, 227-228)

xiii. “To consult the statistician after an experiment is finished is often merely to ask him to conduct a post mortem examination. He can perhaps say what the experiment died of.” (Ronald Fisher)

xiv. “Ideas are incestuous.” (Howard Raiffa)

xv. “Game theory […] deals only with the way in which ultrasmart, all knowing people should behave in competitive situations, and has little to say to Mr. X as he confronts the morass of his problem.” (-ll-)

xvi. “One of the principal objects of theoretical research is to find the point of view from which the subject appears in the greatest simplicity.” (Josiah Willard Gibbs)

xvii. “Nothing is as dangerous as an ignorant friend; a wise enemy is to be preferred.” (Jean de La Fontaine)

xviii. “Humility is a virtue all preach, none practice; and yet everybody is content to hear.” (John Selden)

xix. “Few men make themselves masters of the things they write or speak.” (-ll-)

xx. “Wise men say nothing in dangerous times.” (-ll-)

## Principles of Applied Statistics

“Statistical considerations arise in virtually all areas of science and technology and, beyond these, in issues of public and private policy and in everyday life. While the detailed methods used vary greatly in the level of elaboration involved and often in the way they are described, there is a unity of ideas which gives statistics as a subject both its intellectual challenge and its importance […] In this book we have aimed to discuss the ideas involved in applying statistical methods to advance knowledge and understanding. It is a book not on statistical methods as such but, rather, on how these methods are to be deployed […] We are writing partly for those working as applied statisticians, partly for subject-matter specialists using statistical ideas extensively in their work and partly for masters and doctoral students of statistics concerned with the relationship between the detailed methods and theory they are studying and the effective application of these ideas. Our aim is to emphasize how statistical ideas may be deployed fruitfully rather than to describe the details of statistical techniques.”

…

I gave the book five stars, but as noted in my review on goodreads I’m not sure the word ‘amazing’ is really fitting – however the book had a lot of good stuff and very little for me to quibble about, so I figured it deserved a high rating. The book deals to a very large extent with topics which are in some sense common to pretty much all statistical analyses, regardless of the research context: formulation of research questions/hypotheses, data search, study designs, data analysis, and interpretation. The authors spend quite a few pages talking about hypothesis testing but no pages talking about statistical information criteria, a topic with which I’m at this point at least reasonably familiar; had I been slightly more critical I’d have subtracted a star for this omission, but I have the impression that I’m at times perhaps too hard on non-fiction books on goodreads, so I decided not to punish the book for it. Part of the reason why I gave the book five stars is also that I’ve wanted to read a book like this one for a while; I think it’s in some sense the first of its kind I’ve read. I liked the way the book was structured.

Below I have added some observations from the book, as well as a few comments (I should note that I have had to leave out a lot of good stuff).

…

“When the data are very extensive, precision estimates calculated from simple standard statistical methods are likely to underestimate error substantially owing to the neglect of hidden correlations. A large amount of data is in no way synonymous with a large amount of information. In some settings at least, if a modest amount of poor quality data is likely to be modestly misleading, an extremely large amount of poor quality data may be extremely misleading.”
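The point about hidden correlations can be made concrete with a toy simulation (all numbers below are invented for illustration). When observations share cluster-level shocks, a naive standard error that treats all 10,000 points as independent understates the actual variability of the sample mean by roughly an order of magnitude – and collecting more observations per cluster would not fix this.

```python
import random
import statistics

random.seed(1)

def clustered_sample(n_clusters=50, per_cluster=200):
    """Observations that share a cluster-level shock (the hidden correlation)."""
    obs = []
    for _ in range(n_clusters):
        shock = random.gauss(0, 1)  # common to every observation in the cluster
        obs.extend(shock + random.gauss(0, 1) for _ in range(per_cluster))
    return obs

data = clustered_sample()  # 10,000 observations
naive_se = statistics.stdev(data) / len(data) ** 0.5  # pretends independence

# The error that actually matters: how much the sample mean moves across
# replications of the whole data-collection process.
means = [statistics.fmean(clustered_sample()) for _ in range(200)]
actual_se = statistics.stdev(means)
# actual_se is driven by the 50 cluster shocks (roughly 1/sqrt(50)),
# an order of magnitude above naive_se.
```

The sketch assumes equal-sized clusters and unit-variance shocks purely for simplicity; the qualitative conclusion – that the effective sample size is closer to the number of clusters than to the number of observations – is the quote’s point.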

“For studies of a new phenomenon it will usually be best to examine situations in which the phenomenon is likely to appear in the most striking form, even if this is in some sense artificial or not representative. This is in line with the well-known precept in mathematical research: study the issue in the simplest possible context that is not entirely trivial, and later generalize.”

“It often […] aids the interpretation of an observational study to consider the question: what would have been done in a comparable experiment?”

“An important and perhaps sometimes underemphasized issue in empirical prediction is that of stability. Especially when repeated application of the same method is envisaged, it is unlikely that the situations to be encountered will exactly mirror those involved in setting up the method. It may well be wise to use a procedure that works well over a range of conditions even if it is sub-optimal in the data used to set up the method.”

“Many investigations have the broad form of collecting similar data repeatedly, for example on different individuals. In this connection the notion of a *unit of analysis* is often helpful in clarifying an approach to the detailed analysis. Although this notion is more generally applicable, it is clearest in the context of randomized experiments. Here the unit of analysis is that smallest subdivision of the experimental material such that two distinct units *might* be randomized (randomly allocated) to different treatments. […] In general the unit of analysis may not be the same as the unit of interpretation, that is to say, the unit about which conclusions are to be drawn. The most difficult situation is when the unit of analysis is an aggregate of several units of interpretation, leading to the possibility of *ecological bias*, that is, a systematic difference between, say, the impact of explanatory variables at different levels of aggregation. […] it is important to identify the unit of analysis, which may be different in different parts of the analysis […] on the whole, limited detail is needed in examining the variation within the unit of analysis in question.”

The book briefly discusses issues pertaining to the scale of effort involved when thinking about appropriate study designs and how much/which data to gather for analysis, and notes that the associated costs are often not quantified – rather a judgment call is made. An important related point is that e.g. in survey contexts response patterns will tend to depend upon the quantity of information requested; if you ask for too much, few people might reply (…and perhaps it’s also the case that it’s ‘the wrong people’ who reply? The authors don’t touch upon the potential selection bias issue, but it seems relevant). A few key observations from the book on this topic:

“the intrinsic quality of data, for example the response rates of surveys, may be degraded if too much is collected. […] sampling may give higher [data] quality than the study of a complete population of individuals. […] When researchers studied the effect of the expected length (10, 20 or 30 minutes) of a web-based questionnaire, they found that fewer potential respondents started and completed questionnaires expected to take longer (Galesic and Bosnjak, 2009). Furthermore, questions that appeared later in the questionnaire were given shorter and more uniform answers than questions that appeared near the start of the questionnaire.”

Not surprising, but certainly worth keeping in mind. Moving on…

“In general, while principal component analysis may be helpful in suggesting a base for interpretation and the formation of derived variables there is usually considerable arbitrariness involved in its use. This stems from the need to standardize the variables to comparable scales, typically by the use of correlation coefficients. This means that a variable that happens to have atypically small variability in the data will have a misleadingly depressed weight in the principal components.”
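To illustrate the arbitrariness the authors mention, here is a small sketch of my own (not from the book): the same two-variable dataset analyzed with covariance-based and correlation-based PCA. The two variables carry the same signal but on very different scales, and the choice of scaling completely changes the weights in the first principal component:

```python
import math, random

random.seed(0)
n = 2000
# Two variables carrying the same signal but on very different scales
# (hypothetical example: x1 in metres, x2 roughly the same quantity in millimetres).
xs1, xs2 = [], []
for _ in range(n):
    s = random.gauss(0, 1)
    xs1.append(s + random.gauss(0, 0.5))
    xs2.append(1000 * (s + random.gauss(0, 0.5)))

def mean(v):
    return sum(v) / len(v)

def cov(u, v):
    mu, mv = mean(u), mean(v)
    return sum((a - mu) * (b - mv) for a, b in zip(u, v)) / (len(u) - 1)

def leading_eigvec(a, b, c):
    """Unit eigenvector for the larger eigenvalue of the 2x2 matrix [[a, b], [b, c]]."""
    lam = (a + c) / 2 + math.sqrt(((a - c) / 2) ** 2 + b ** 2)
    vx, vy = b, lam - a
    norm = math.hypot(vx, vy)
    return vx / norm, vy / norm

# PCA on the covariance matrix: dominated by the high-variance variable.
pc_cov = leading_eigvec(cov(xs1, xs1), cov(xs1, xs2), cov(xs2, xs2))
# PCA on the correlation matrix: both (equally correlated) variables weighted equally.
r = cov(xs1, xs2) / math.sqrt(cov(xs1, xs1) * cov(xs2, xs2))
pc_corr = leading_eigvec(1.0, r, 1.0)

print([round(abs(x), 3) for x in pc_cov])   # ≈ [0.001, 1.0]
print([round(abs(x), 3) for x in pc_corr])  # ≈ [0.707, 0.707]
```

Neither answer is “wrong”; the point is that the weights depend on an essentially arbitrary scaling decision, which is exactly the difficulty the quote describes.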

The book includes a few pages about the Berkson error model, which I’d never heard about. Wikipedia doesn’t have much about it – I probably wouldn’t have done more than include a link here if the wikipedia article actually covered the topic in any detail, but it doesn’t – so it seemed important enough to write a few words about it here. The basic difference between the ‘classical error model’, i.e. the one everybody knows about, and the Berkson error model is that in the former case the measurement error is statistically independent of the *true value* of X, whereas in the latter case the measurement error is independent of the *measured value*; the authors note that this implies that the true values are more variable than the measured values in a Berkson error context. Berkson errors can e.g. arise in experimental contexts where levels of a variable are pre-set at some target, for example in a medical context where a drug is supposed to be administered every X hours; the pre-set levels are then the measured values, and the true values might differ e.g. if the nurse was late. I thought it important to mention this error model not only because it’s a completely new idea to me that you might encounter this sort of error-generating process, but also because there is no statistical test you can use to figure out whether the standard error model is the appropriate one or whether a Berkson error model is better; which means that you need to be aware of the difference and think about which model works best, based on the nature of the measuring process.
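A minimal simulation (my own toy example, not from the book) makes the variance implication concrete: under classical error the measured values are more variable than the true values, under Berkson error it is the other way around.

```python
import random, statistics

random.seed(1)
n = 100000

# Classical error: measurement W = true value X plus noise independent of X.
true_classical = [random.gauss(10, 2) for _ in range(n)]
meas_classical = [x + random.gauss(0, 1) for x in true_classical]

# Berkson error: the *measured* (target) value W is fixed first, and the true
# value scatters around it - e.g. a drug nominally given every 6 hours, with
# the actual interval varying around that target.
meas_berkson = [random.gauss(10, 2) for _ in range(n)]
true_berkson = [w + random.gauss(0, 1) for w in meas_berkson]

var = statistics.variance
print(round(var(meas_classical), 2), round(var(true_classical), 2))  # ≈ 5, 4
print(round(var(meas_berkson), 2), round(var(true_berkson), 2))      # ≈ 4, 5
```

The two datasets look identical pairwise; only the direction of independence differs, which is why no statistical test on the data alone can distinguish the two models.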

Let’s move on to some quotes dealing with modeling:

“while it is appealing to use methods that are in a reasonable sense fully efficient, that is, extract all relevant information in the data, nevertheless any such notion is within the framework of an assumed model. Ideally, methods should have this efficiency property while preserving good behaviour (especially stability of interpretation) when the model is perturbed. Essentially a model translates a subject-matter question into a mathematical or statistical one and, if that translation is seriously defective, the analysis will address a wrong or inappropriate question […] The greatest difficulty with quasi-realistic models [as opposed to ‘toy models’] is likely to be that they require numerical specification of features for some of which there is very little or no empirical information. Sensitivity analysis is then particularly important.”

“Parametric models typically represent some notion of smoothness; their danger is that particular representations of that smoothness may have strong and unfortunate implications. This difficulty is covered for the most part by informal checking that the primary conclusions do not depend critically on the precise form of parametric representation. To some extent such considerations can be formalized but in the last analysis some element of judgement cannot be avoided. One general consideration that is sometimes helpful is the following. If an issue can be addressed nonparametrically then it will often be better to tackle it parametrically; however, if it cannot be resolved nonparametrically then it is usually dangerous to resolve it parametrically.”

“Once a model is formulated two types of question arise. How can the unknown parameters in the model best be estimated? Is there evidence that the model needs modification or indeed should be abandoned in favour of some different representation? The second question is to be interpreted not as asking whether the model is true [*this is the wrong question to ask, as also emphasized by Burnham & Anderson*] but whether there is clear evidence of a specific kind of departure implying a need to change the model so as to avoid distortion of the final conclusions. […] it is important in applications to understand the circumstances under which different methods give similar or different conclusions. In particular, if a more elaborate method gives an apparent improvement in precision, what are the assumptions on which that improvement is based? Are they reasonable? […] the hierarchical principle implies, […] with very rare exceptions, that models with interaction terms should include also the corresponding main effects. […] When considering two families of models, it is important to consider the possibilities that both families are adequate, that one is adequate and not the other and that neither family fits the data.” [Do incidentally recall that in the context of interactions, “the term interaction […] is in some ways a misnomer. There is no necessary implication of interaction in the physical sense or synergy in a biological context. Rather, interaction means a departure from additivity […] This is expressed most explicitly by the requirement that, apart from random fluctuations, the difference in outcome between any two levels of one factor is the same at all levels of the other factor. […] The most directly interpretable form of interaction, certainly not removable by [variable] transformation, is effect reversal.”]
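The ‘departure from additivity’ point is easy to make concrete with a toy 2×2 example (mine, not the book’s): no interaction means the effect of one factor is the same at every level of the other, and effect reversal is the starkest possible departure.

```python
# Cell means for a hypothetical 2x2 design: outcome indexed by (level of A, level of B).
additive = {("a0", "b0"): 10, ("a0", "b1"): 14,
            ("a1", "b0"): 13, ("a1", "b1"): 17}
reversal = {("a0", "b0"): 10, ("a0", "b1"): 14,
            ("a1", "b0"): 13, ("a1", "b1"): 9}

def a_effect(cells, b):
    """Effect of moving from a0 to a1 at a fixed level of B."""
    return cells[("a1", b)] - cells[("a0", b)]

# No interaction: the A effect is identical at every level of B (pure additivity).
print(a_effect(additive, "b0"), a_effect(additive, "b1"))  # 3 3
# Effect reversal: the A effect changes sign across levels of B - an
# interaction that no transformation of the outcome scale can remove.
print(a_effect(reversal, "b0"), a_effect(reversal, "b1"))  # 3 -5
```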

“The *p*-value assesses the data […] via a comparison with that anticipated if *H*_{0} were true. If in two different situations the test of a relevant null hypothesis gives approximately the same *p*-value, it does not follow that the overall strengths of the evidence in favour of the relevant *H*_{0} are the same in the two cases.”

“There are […] two sources of uncertainty in observational studies that are not present in randomized experiments. The first is that the ordering of the variables may be inappropriate, a particular hazard in cross-sectional studies. […] if the data are tied to one time point then any presumption of causality relies on a working hypothesis as to whether the components are explanatory or responses. Any check on this can only be from sources external to the current data. […] The second source of uncertainty is that important explanatory variables affecting both the potential cause and the outcome may not be available. […] Retrospective explanations may be convincing if based on firmly established theory but otherwise need to be treated with special caution. It is well known in many fields that ingenious explanations can be constructed retrospectively for almost any finding.”

“The general issue of applying conclusions from aggregate data to specific individuals is essentially that of showing that the individual does not belong to a subaggregate for which a substantially different conclusion applies. In actuality this can at most be indirectly checked for specific subaggregates. […] It is not unknown in the literature to see conclusions such as that there are no treatment differences except for males aged over 80 years, living more than 50 km south of Birmingham and life-long supporters of Aston Villa football club, who show a dramatic improvement under some treatment *T*. Despite the undoubted importance of this particular subgroup, virtually always such conclusions would seem to be unjustified.” [*I loved this example!*]

The authors included a few interesting results from an undated Cochrane publication which I thought I should mention. The file-drawer effect is well known, but a few other biases are at play in a publication bias context. One is time-lag bias: statistically significant results take less time to get published. Another is language bias: statistically significant results are more likely to be published in English-language publications. A third is multiple publication bias: papers with statistically significant results are more likely to be published more than once. The last one mentioned is citation bias: such papers are also more likely to be cited in the literature.

The authors include these observations in their concluding remarks: “The overriding general principle [in the context of applied statistics], difficult to achieve, is that there should be a seamless flow between statistical and subject-matter considerations. […] in principle seamlessness requires an individual statistician to have views on subject-matter interpretation and subject-matter specialists to be interested in issues of statistical analysis.”

As already mentioned this is a good book. It’s not long, and/but it’s worth reading if you’re in the target group.

## Quotes

i. “By all means think yourself big but don’t think everyone else small” (‘Notes on Flyleaf of Fresh ms. Book’, *Scott’s Last Expedition*).

ii. “The man who knows everyone’s job isn’t much good at his own.” (-ll-)

iii. “It is amazing what little harm doctors do when one considers all the opportunities they have” (Mark Twain, as quoted in the Oxford Handbook of Clinical Medicine, p.595).

iv. “A first-rate theory predicts; a second-rate theory forbids and a third-rate theory explains after the event.” (Aleksander Isaakovich Kitaigorodski)

v. “[S]ome of the most terrible things in the world are done by people who think, genuinely think, that they’re doing it for the best” (Terry Pratchett, Snuff).

vi. “That was excellently observ’d, say I, when I read a Passage in an Author, where his Opinion agrees with mine. When we differ, there I pronounce him to be mistaken.” (Jonathan Swift)

vii. “Death is nature’s master stroke, albeit a cruel one, because it allows genotypes space to try on new phenotypes.” (Quote from the Oxford Handbook of Clinical Medicine, p.6)

viii. “The purpose of models is not to fit the data but to sharpen the questions.” (Samuel Karlin)

ix. “We may […] view set theory, and mathematics generally, in much the way in which we view theoretical portions of the natural sciences themselves; as comprising truths or hypotheses which are to be vindicated less by the pure light of reason than by the indirect systematic contribution which they make to the organizing of empirical data in the natural sciences.” (Quine)

x. “At root what is needed for scientific inquiry is just receptivity to data, skill in reasoning, and yearning for truth. Admittedly, ingenuity can help too.” (-ll-)

xi. “A statistician carefully assembles facts and figures for others who carefully misinterpret them.” (Quote from *Mathematically Speaking – A Dictionary of Quotations*, p.329. Only source given in the book is: “Quoted in Evan Esar, *20,000 Quips and Quotes*“)

xii. “A knowledge of statistics is like a knowledge of foreign languages or of algebra; it may prove of use at any time under any circumstances.” (Quote from *Mathematically Speaking – A Dictionary of Quotations*, p. 328. The source provided is: “Elements of Statistics, Part I, Chapter I (p.4)”).

xiii. “We own to small faults to persuade others that we have not great ones.” (Rochefoucauld)

xiv. “There is more self-love than love in jealousy.” (-ll-)

xv. “We should not judge of a man’s merit by his great abilities, but by the use he makes of them.” (-ll-)

xvi. “We should gain more by letting the world see what we are than by trying to seem what we are not.” (-ll-)

xvii. “Put succinctly, a prospective study looks for the effects of causes whereas a retrospective study examines the causes of effects.” (Quote from p.49 of *Principles of Applied Statistics*, by Cox & Donnelly)

xviii. “… he who seeks for methods without having a definite problem in mind seeks for the most part in vain.” (David Hilbert)

xix. “Give every man thy ear, but few thy voice” (Shakespeare).

xx. “Often the fear of one evil leads us into a worse.” (Nicolas Boileau-Despréaux)

## The Nature of Statistical Evidence

Here’s my goodreads review of the book.

As I’ve observed many times before, a wordpress blog like mine is not a particularly nice place to cover mathematical topics involving equations and lots of Greek letters, so the coverage below will be more or less purely conceptual; don’t take this to mean that the book doesn’t contain formulas – some parts of it are quite heavy on equations and formal notation.

That of course makes the book hard to blog, also for other reasons than just the fact that it’s typographically hard to deal with the equations. In general it’s hard to talk about the content of a book like this one without going into *a lot* of details outlining how you get from A to B to C – usually you’re only really interested in C, but you need A and B to make sense of C. At this point I’ve sort of concluded that when covering books like this one I’ll only cover some of the main themes which are easy to discuss in a blog post, and skip coverage of (potentially important) points which might also be of interest if they’re difficult to discuss in a small amount of space, which is unfortunately often the case. I should perhaps observe that although I noted in my goodreads review that in a way there was a bit too much philosophy and a bit too little statistics in the coverage for my taste, you should definitely not take that objection to mean that this book is full of fluff; a lot of the philosophical stuff is ‘formal logic’-type material and related comments, and the book in general is quite dense. As I also noted in the goodreads review I didn’t read this book as carefully as I might have done – for example I skipped a couple of the technical proofs because they didn’t seem to be worth the effort – and I’d probably need to read it again to fully understand some of the minor points made throughout the more technical parts of the coverage. That’s of course a related reason why I don’t cover the book in great detail here: it’s hard work just to read the damn thing, and talking about the technical stuff in detail here as well would definitely be overkill, even if it would surely make me understand the material better.

I have added some observations from the coverage below. I’ve tried to clarify beforehand which question/topic the quote in question deals with, to ease reading/understanding of the topics covered.

…

On how statistical methods are related to experimental science:

“statistical methods have aims similar to the process of experimental science. But statistics is not itself an experimental science, it consists of models of how to do experimental science. Statistical theory is a logical — mostly mathematical — discipline; its findings are not subject to experimental test. […] The primary sense in which statistical theory is a science is that it guides and explains statistical methods. A sharpened statement of the purpose of this book is to provide explanations of the senses in which some statistical methods provide scientific evidence.”

On mathematics and axiomatic systems (the book goes into much more detail than this):

“It is not sufficiently appreciated that a link is needed between mathematics and methods. Mathematics is not about the world until it is interpreted and then it is only about models of the world […]. No contradiction is introduced by either interpreting the same theory in different ways or by modeling the same concept by different theories. […] In general, a primitive undefined term is said to be **interpreted** when a meaning is assigned to it and when all such terms are interpreted we have an **interpretation** of the axiomatic system. It makes no sense to ask which is the correct interpretation of an axiom system. This is a primary strength of the axiomatic method; we can use it to organize and structure our thoughts and knowledge by simultaneously and economically treating all interpretations of an axiom system. It is also a weakness in that failure to define or interpret terms leads to much confusion about the implications of theory for application.”

It’s all about models:

“The scientific method of theory checking is to compare predictions deduced from a theoretical model with observations on nature. Thus science must predict what happens in nature but it need not explain why. […] whether experiment is consistent with theory is relative to accuracy and purpose. All theories are simplifications of reality and hence no theory will be expected to be a perfect predictor. Theories of statistical inference become relevant to scientific process at precisely this point. […] Scientific method is a practice developed to deal with experiments on **nature. **Probability theory is a deductive study of the properties of **models **of such experiments. All of the theorems of probability are results about models of experiments.”

But given a frequentist interpretation you can test your statistical theories with the real world, right? Right? Well…

“How might we check the long run stability of relative frequency? If we are to compare mathematical theory with experiment then only finite sequences can be observed. But for the Bernoulli case, the event that frequency approaches probability is stochastically independent of any sequence of finite length. […] Long-run stability of relative frequency cannot be checked experimentally. There are neither theoretical nor empirical guarantees that, a priori, one can recognize experiments performed under uniform conditions and that under these circumstances one *will* obtain stable frequencies.”

What should we expect to get out of mathematical and statistical theories of inference?

“What can we expect of a theory of statistical inference? We can expect an internally consistent explanation of why certain conclusions follow from certain data. The theory will not be about inductive rationality but about a *model *of inductive rationality. Statisticians are used to thinking that they apply their logic to models of the physical world; less common is the realization that their logic itself is only a model. Explanation will be in terms of introduced concepts which do not exist in nature. Properties of the concepts will be derived from assumptions which merely seem reasonable. This is the only sense in which the axioms of any mathematical theory are true […] We can expect these concepts, assumptions, and properties to be intuitive but, unlike natural science, they cannot be checked by experiment. Different people have different ideas about what “seems reasonable,” so we can expect different explanations and different properties. We should not be surprised if the theorems of two different theories of statistical evidence differ. If two models had no different properties then they would be different versions of the same model […] We should not expect to achieve, by mathematics alone, a single coherent theory of inference, for mathematical truth is conditional and the assumptions are not “self-evident.” Faith in a set of assumptions would be needed to achieve a single coherent theory.”

On disagreements about the nature of statistical evidence:

“The context of this section is that there is disagreement among experts about the nature of statistical evidence and consequently much use of one formulation to criticize another. Neyman (1950) maintains that, from his behavioral hypothesis testing point of view, Fisherian significance tests do not express evidence. Royall (1997) employs the “law” of likelihood to criticize hypothesis as well as significance testing. Pratt (1965), Berger and Selke (1987), Berger and Berry (1988), and Casella and Berger (1987) employ Bayesian theory to criticize sampling theory. […] Critics assume that their findings are about evidence, but they are at most about models of evidence. Many theoretical statistical criticisms, when stated in terms of evidence, have the following outline: According to model A, evidence satisfies proposition P. But according to model B, which is correct since it is derived from “self-evident truths,” P is not true. Now evidence can’t be two different ways so, since B is right, A must be wrong. Note that the argument is symmetric: since A appears “self-evident” (to adherents of A) B must be wrong. But both conclusions are invalid since evidence can be modeled in different ways, perhaps useful in different contexts and for different purposes. From the observation that P is a theorem of A but not of B, all we can properly conclude is that A and B are different models of evidence. […] The common practice of using one theory of inference to critique another is a misleading activity.”

Is mathematics a science?

“Is mathematics a science? It is certainly systematized knowledge much concerned with structure, but then so is history. Does it employ the scientific method? Well, partly; hypothesis and deduction are the essence of mathematics and the search for counter examples is a mathematical counterpart of experimentation; but the question is not put to nature. Is mathematics about nature? In part. The hypotheses of most mathematics are suggested by some natural primitive concept, for it is difficult to think of interesting hypotheses concerning nonsense syllables and to check their consistency. However, it often happens that as a mathematical subject matures it tends to evolve away from the original concept which motivated it. Mathematics in its purest form is probably not natural science since it lacks the experimental aspect. Art is sometimes defined to be creative work displaying form, beauty and unusual perception. By this definition pure mathematics is clearly an art. On the other hand, applied mathematics, taking its hypotheses from real world concepts, is an attempt to describe nature. Applied mathematics, without regard to experimental verification, is in fact largely the “conditional truth” portion of science. If a body of applied mathematics has survived experimental test to become trustworthy belief then it is the essence of natural science.”

Then what about statistics – is statistics a science?

“Statisticians can and do make contributions to subject matter fields such as physics, and demography but statistical theory and methods proper, distinguished from their findings, are not like physics in that they are not about nature. […] Applied statistics is natural science but the findings are about the subject matter field not statistical theory or method. […] Statistical theory helps with how to do natural science but it is not itself a natural science.”

…

I should note that I am, and have for a long time been, in broad agreement with the author’s remarks on the nature of science and mathematics above. Popper, among many others, discussed this topic a long time ago, e.g. in The Logic of Scientific Discovery, and I’ve basically been of the opinion that (‘pure’) mathematics is not science (but rather ‘something else’ – which doesn’t mean it’s not useful) for probably a decade. I’ve had a harder time coming to terms with how precisely to deal with statistics in these terms, and in that context the book has been conceptually helpful.

Below I’ve added a few links to other stuff also covered in the book:

Propositional calculus.

Kolmogorov’s axioms.

Neyman-Pearson lemma.

Radon–Nikodym theorem. (not covered in the book, but the necessity of using ‘a Radon–Nikodym derivative’ to obtain an answer to a question being asked was remarked upon at one point, and I had no clue what he was talking about – it seems that the stuff in the link was what he was talking about).

A very specific and relevant link: Berger and Wolpert (1984). The stuff about Birnbaum’s argument covered from p.24 (p.40) and forward is covered in some detail in the book. The author is critical of the model and explains in the book in some detail why that is. See also: *On the foundations of statistical inference* (Birnbaum, 1962).

## Cost-effectiveness analysis in health care (III)

This will be my last post about the book. Yesterday I finished reading Darwin’s Origin of Species, which was my 100th book this year (here’s the list), but I can’t face blogging that book at the moment so coverage of that one will have to wait a bit.

In my second post about this book I had originally planned to cover chapter 7 – ‘Analysing costs’ – but as I didn’t want to spend too much time on the post I ended up cutting it short. This means that some of the themes discussed below are closely related to stuff covered in the second post, whereas most of the remaining material, more specifically the material from chapters 8, 9 and 10, deals with decision analytic modelling, a quite different topic; in other words the coverage will be slightly more fragmented and less structured than I’d have liked it to be, but there’s not really much to do about that (it doesn’t help in this respect that I decided not to cover chapter 8, but covering that as well was out of the question).

I’ll start with coverage of some of the things they talk about in chapter 7, which as mentioned deals with how to analyze costs in a cost-effectiveness analysis context. They observe in the chapter that health cost data are often skewed to the right, for several reasons (costs incurred by an individual cannot be negative; for many patients the costs may be zero; some study participants may require much more care than the rest, creating a long tail). One way to address skewness is to use the median instead of the mean as the variable of interest, but a problem with this approach is that the median is less useful to policy-makers than the mean: the mean multiplied by the size of the population of interest gives a good estimate of the total cost of an intervention, whereas the median does not. Transforming the data and analyzing the transformed data is another way to deal with skewness, but the use of transformations in cost-effectiveness analysis has been questioned for a variety of reasons discussed in the chapter (to give a couple of examples, data transformation methods perform badly if inappropriate transformations are used, and many transformations cannot be used if there are data points with zero costs in the data, which is very common). Of the non-parametric methods aimed at dealing with skewness they discuss a variety of tests which are rarely used, as well as the bootstrap, the latter being one approach which has gained widespread use.
They observe in the context of the bootstrap that “it has increasingly been recognized that the conditions the bootstrap requires to produce reliable parameter estimates are not fundamentally different from the conditions required by parametric methods” and note in a later chapter (chapter 11) that: “it is not clear that bootstrap results in the presence of severe skewness are likely to be any more or less valid than parametric results […] bootstrap and parametric methods both rely on sufficient sample sizes and are likely to be valid or invalid in similar circumstances. Instead, interest in the bootstrap has increasingly focused on its usefulness in dealing simultaneously with issues such as censoring, missing data, multiple statistics of interest such as costs and effects, and non-normality.” Going back to the coverage in chapter 7, in the context of skewness they also briefly touch upon the potential use of a GLM framework to address this problem.
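As a concrete illustration of the bootstrap in this setting, here is a toy sketch of my own (not from the book): a non-parametric percentile bootstrap confidence interval for the mean of a right-skewed cost distribution with many zero-cost patients. The data and parameters are all made up.

```python
import random

random.seed(2)
# Hypothetical right-skewed cost data: many zero-cost patients plus a
# lognormal-like tail for the patients who did incur costs.
costs = [0.0] * 40 + [random.lognormvariate(7, 1) for _ in range(160)]

def mean(v):
    return sum(v) / len(v)

# Non-parametric bootstrap: resample patients with replacement and
# recompute the mean cost each time.
boot_means = []
for _ in range(2000):
    resample = [random.choice(costs) for _ in costs]
    boot_means.append(mean(resample))
boot_means.sort()

# 95% percentile interval: the 2.5th and 97.5th percentiles of the replicates.
lo, hi = boot_means[49], boot_means[1949]
print(round(mean(costs)), round(lo), round(hi))
```

Note that the interval is typically asymmetric around the sample mean for skewed data like these, which is part of the bootstrap’s appeal here; but as the quote above stresses, it still relies on a sufficient sample size.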

Data is often missing in cost datasets. Some parts of their coverage of these topics were largely a review of stuff already covered in Bartholomew. Data can be missing for different reasons and through different mechanisms; one distinction is among data missing completely at random (MCAR), missing at random (MAR) (“missing data are correlated in an observable way with the mechanism that generates the cost, i.e. after adjusting the data for observable differences between complete and missing cases, the cost for those with missing data is the same, except for random variation, as for those with complete data”), and not missing at random (NMAR); the last type is also called non-ignorably missing data, and if you have that sort of data the implication is that the costs of those in the observed and unobserved groups differ in unpredictable ways, and if you ignore the process that drives these differences you’ll probably end up with a biased estimator. Another way to distinguish between different types of missing data is to look at patterns within the dataset, where you have:

“* **univariate missingness** – a single variable in a dataset is causing a problem through missing values, while the remaining variables contain complete information

* **unit non-response** – no data are recorded for any of the variables for some patients

* **monotone missing** – caused, for example, by drop-out in panel or longitudinal studies, resulting in variables observed up to a certain time point or wave but not beyond that

* **multivariate missing** – also called item non-response or general missingness, where some but not all of the variables are missing for some of the subjects.”

The authors note that the most common types of missingness in cost information analyses are the latter two. They discuss some techniques for dealing with missing data, such as complete-case analysis, available-case analysis, and imputation, but I won’t go into the details here. In the last parts of the chapter they talk a little bit about censoring, which can be viewed as a specific type of missing data, and ways to deal with it. Censoring happens when follow-up information on some subjects is not available for the full duration of interest, which may be caused e.g. by attrition (people dropping out of the trial), or insufficient follow-up (the final date of follow-up might be set before all patients reach the endpoint of interest, e.g. death). The two most common methods for dealing with censored cost data are the Kaplan-Meier sample average (KMSA) estimator and the inverse probability weighting (IPW) estimator, both of which are non-parametric interval methods. “Comparisons of the IPW and KMSA estimators have shown that they both perform well over different levels of censoring […], and both are considered reasonable approaches for dealing with censoring.” One difference between the two is that the KMSA, unlike the IPW, is not appropriate for dealing with censoring due to attrition unless the attrition is MCAR (and it almost never is), because the KM estimator, and by extension the KMSA estimator, assumes that censoring is independent of the event of interest.
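To illustrate the IPW idea, here is a deliberately simplified sketch of my own (not from the book): costs accrue with follow-up time, so censoring disproportionately removes the expensive long-stay patients, and re-weighting the complete cases by the inverse probability of remaining uncensored removes the resulting downward bias. For simplicity the censoring distribution is taken as known (uniform) rather than estimated via Kaplan-Meier, as it would be in practice.

```python
import random

random.seed(3)
n = 20000
CENS_MAX = 3.0   # censoring time uniform on (0, 3); assumed known for this sketch

ipw_sum = 0.0
complete = []
for _ in range(n):
    t = random.uniform(0, 2)         # true follow-up time (years, say)
    cost = 100 * t                   # cost accrues with follow-up; true mean is 100
    c = random.uniform(0, CENS_MAX)  # independent censoring time
    if t <= c:                       # cost fully observed (uncensored case)
        complete.append(cost)
        g = 1 - t / CENS_MAX         # P(censoring time > t), always > 1/3 here
        ipw_sum += cost / g          # up-weight cases that were likely to be censored

naive = sum(complete) / len(complete)  # complete-case mean: biased downwards
ipw = ipw_sum / n                      # IPW mean: close to the true mean of 100
print(round(naive, 1), round(ipw, 1))
```

The complete-case mean lands around 83 in this setup (expensive patients are censored more often), while the IPW estimate recovers roughly 100; the weighting only works because every patient has a positive probability of being observed uncensored.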

The focus in chapter 8 is on decision tree models, and I decided to skip that chapter as most of it is known stuff which I felt no need to review here (do remember that I to a large extent use this blog as an extended memory, so I’m not only(/mainly?) writing this stuff for other people..). Chapter 9 deals with Markov models, and I’ll talk a little bit about those in the following.

“Markov models analyse uncertain processes over time. They are suited to decisions where the timing of events is important and when events may happen more than once, and therefore they are appropriate where the strategies being evaluated are of a sequential or repetitive nature. Whereas decision trees model uncertain events at chance nodes, Markov models differ in modelling uncertain events as transitions between health states. In particular, Markov models are suited to modelling long-term outcomes, where costs and effects are spread over a long period of time. Therefore Markov models are particularly suited to chronic diseases or situations where events are likely to recur over time […] Over the last decade there has been an increase in the use of Markov models for conducting economic evaluations in a health-care setting […]

A Markov model comprises a finite set of health states in which an individual can be found. The states are such that in any given time interval, the individual will be in only one health state. All individuals in a particular health state have identical characteristics. The number and nature of the states are governed by the decision problem. […] Markov models are concerned with transitions during a series of cycles consisting of short time intervals. The model is run for several cycles, and patients move between states or remain in the same state between cycles […] Movements between states are defined by transition probabilities which can be time dependent or constant over time. All individuals within a given health state are assumed to be identical, and this leads to a limitation of Markov models in that the transition probabilities only depend on the current health state and not on past health states […the process is memoryless…] – this is known as the Markovian assumption”.

They note that in order to build and analyze a Markov model, you need to do the following: *define states and allowable transitions [for example from ‘non-dead’ to ‘dead’ is okay, but going the other way is, well… For a Markov process to end, you need at least one state that cannot be left after it has been reached, and those states are termed ‘absorbing states’], *specify initial conditions in terms of starting probabilities/initial distribution of patients, *specify transition probabilities, *specify a cycle length, *set a stopping rule, *determine rewards, *implement discounting if required, *analyze and evaluate the model, and *explore uncertainties. They talk about each step in more detail in the book, but I won’t go too much into this.

Markov models may be governed by transitions that are either constant over time or time-dependent. In a Markov *chain* transition probabilities are constant over time, whereas in a Markov *process* transition probabilities vary over time (/from cycle to cycle). In a simple Markov model the baseline assumption is that transitions only occur once in each cycle and usually the transition is modelled as taking place either at the beginning or the end of cycles, but in reality transitions can take place at any point in time during the cycle. One way to deal with the problem of misidentification (people assumed to be in one health state throughout the cycle even though they’ve transferred to another health state during the cycle) is to use half-cycle corrections, in which an assumption is made that on average state transitions occur halfway through the cycle, instead of at the beginning or the end of a cycle. They note that: “the important principle with the half-cycle correction is not when the transitions occur, but when state membership (i.e. the proportion of the cohort in that state) is counted. The longer the cycle length, the more important it may be to use half-cycle corrections.” When state transitions are assumed to take place may influence factors such as cost discounting (if the cycle is long, it can be important to get the state transition timing reasonably right).
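A toy illustration of these ideas, assuming a made-up three-state model (‘well’, ‘sick’, ‘dead’, with ‘dead’ absorbing), constant transition probabilities (i.e. a Markov chain), and a half-cycle-corrected count of time alive; all numbers here are invented for the sketch, not taken from the book:

```python
STATES = ["well", "sick", "dead"]
# Constant transition probabilities -> a Markov chain; each row sums to 1.
P = {
    "well": {"well": 0.85, "sick": 0.10, "dead": 0.05},
    "sick": {"well": 0.05, "sick": 0.75, "dead": 0.20},
    "dead": {"well": 0.00, "sick": 0.00, "dead": 1.00},  # absorbing state
}

def run_cohort(n_cycles, start=None):
    """Return state membership (proportions of the cohort) for each cycle."""
    start = start or {"well": 1.0, "sick": 0.0, "dead": 0.0}
    trace = [dict(start)]
    for _ in range(n_cycles):
        prev = trace[-1]
        nxt = {s: sum(prev[r] * P[r][s] for r in STATES) for s in STATES}
        trace.append(nxt)
    return trace

def life_years(trace):
    """Expected time alive with a half-cycle correction: membership in the
    first and last cycles counts half, in-between cycles count fully."""
    alive = [1.0 - row["dead"] for row in trace]
    return 0.5 * alive[0] + sum(alive[1:-1]) + 0.5 * alive[-1]

trace = run_cohort(20)
print(f"life-years per patient over 20 cycles: {life_years(trace):.2f}")
```

The half-cycle correction shows up only in how membership is counted, not in when the transitions are modelled to occur, which is the principle the quote above emphasizes.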

When time dependency is introduced into the model, there are in general two types of time dependencies that impact on transition probabilities. One depends on the number of cycles since the start of the model (this is e.g. how transition probabilities depend on factors like age), whereas the other, which is more difficult to implement, deals with state dependence (curiously they don’t use these two words, but I’ve worked with state dependence models before in labour economics and this is what we’re dealing with here); i.e. here the transition probability will depend upon how long you’ve been in a given state.

Below I mostly discuss stuff covered in chapter 10, however I also include a few observations from the final chapter, chapter 11 (on ‘Presenting cost-effectiveness results’). Chapter 10 deals with how to represent uncertainty in decision analytic models. This is an important topic because as noted later in the book, “The primary objective of economic evaluation should not be hypothesis testing, but rather the estimation of the central parameter of interest—the incremental cost-effectiveness ratio—along with appropriate representation of the uncertainty surrounding that estimate.” In chapter 10 a distinction is made between variability, heterogeneity, and uncertainty. Variability has also been termed first-order uncertainty or stochastic uncertainty, and pertains to variation observed when recording information on resource use or outcomes within a homogenous sample of individuals. Heterogeneity relates to differences between patients which can be explained, at least in part. They distinguish between two types of uncertainty, structural uncertainty – dealing with decisions and assumptions made about the structure of the model – and parameter uncertainty, which of course relates to the precision of the parameters estimated. After briefly talking about ways to deal with these, they talk about sensitivity analysis.

“Sensitivity analysis involves varying parameter estimates across a range and seeing how this impacts on the model’s results. […] The simplest form is a one-way analysis where each parameter estimate is varied independently and singly to observe the impact on the model results. […] One-way sensitivity analysis can give some insight into the factors influencing the results, and may provide a validity check to assess what happens when particular variables take extreme values. However, it is likely to grossly underestimate overall uncertainty, and ignores correlation between parameters.”
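One-way sensitivity analysis can be sketched in a few lines; the toy ‘model’ below (an ICER computed from cost and QALY inputs) and the parameter ranges are hypothetical, chosen only to show the mechanics of varying one parameter at a time while holding the rest at their base-case values:

```python
def icer(params):
    """Toy model: incremental cost-effectiveness ratio of 'new' vs 'old'."""
    d_cost = params["cost_new"] - params["cost_old"]
    d_qaly = params["qaly_new"] - params["qaly_old"]
    return d_cost / d_qaly

base = {"cost_new": 12000.0, "cost_old": 8000.0,
        "qaly_new": 6.0, "qaly_old": 5.5}
ranges = {"cost_new": (10000.0, 14000.0), "qaly_new": (5.7, 6.3)}

for name, (lo, hi) in ranges.items():
    results = []
    for value in (lo, hi):
        varied = dict(base, **{name: value})  # vary one parameter, hold the rest
        results.append(icer(varied))
    print(f"{name}: ICER ranges from {min(results):,.0f} to {max(results):,.0f}")
```

This is exactly the kind of output a tornado diagram summarizes, and it makes the quoted limitation visible: each parameter is varied alone, so joint uncertainty and correlations are ignored.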

Multi-way sensitivity analysis is a more refined approach, in which more than one parameter estimate is varied – this is sometimes termed scenario analysis. A different approach is threshold analysis, where one attempts to identify the critical value of one or more variables so that the conclusion/decision changes. All of these approaches are deterministic approaches, and they are not without problems. “They fail to take account of the joint parameter uncertainty and correlation between parameters, and rather than providing the decision-maker with a useful indication of the likelihood of a result, they simply provide a range of results associated with varying one or more input estimates.” So of course an alternative has been developed, namely probabilistic sensitivity analysis (PSA), which already in the mid-1980s started to be used in health economic decision analyses.

“PSA permits the joint uncertainty across all the parameters in the model to be addressed at the same time. It involves sampling model parameter values from distributions imposed on variables in the model. […] The types of distribution imposed are dependent on the nature of the input parameters [but] decision analytic models for the purpose of economic evaluation tend to use homogenous types of input parameters, namely costs, life-years, QALYs, probabilities, and relative treatment effects, and consequently the number of distributions that are frequently used, such as the beta, gamma, and log-normal distributions, is relatively small. […] Uncertainty is then propagated through the model by randomly selecting values from these distributions for each model parameter using Monte Carlo simulation“.
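A minimal PSA sketch along these lines, with beta, gamma, and log-normal distributions imposed on a toy decision model; the model structure and every distribution parameter below are invented for illustration, not taken from the book:

```python
import random
import statistics

random.seed(1)

def one_draw():
    """One Monte Carlo draw: sample all inputs, propagate through the model."""
    p_event_old = random.betavariate(20, 80)          # event probability, ~0.20
    rel_risk = random.lognormvariate(-0.36, 0.10)     # treatment effect, ~0.70
    p_event_new = min(1.0, p_event_old * rel_risk)
    cost_event = random.gammavariate(25, 200)         # cost per event, mean ~5000
    cost_drug = 300.0
    d_cost = cost_drug + (p_event_new - p_event_old) * cost_event
    d_effect = (p_event_old - p_event_new) * 0.4      # QALY gain per event avoided
    return d_cost, d_effect

draws = [one_draw() for _ in range(5000)]
mean_d_cost = statistics.mean(d for d, _ in draws)
mean_d_effect = statistics.mean(e for _, e in draws)
print(f"mean incremental cost: {mean_d_cost:,.0f}; "
      f"mean incremental QALYs: {mean_d_effect:.3f}")
```

The beta distribution keeps probabilities on [0, 1], the gamma keeps costs non-negative and right-skewed, and the log-normal suits relative treatment effects, which is why these three keep recurring in this setting.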

## Random Stuff / Open Thread

This is not a very ‘meaty’ post, but it’s been a long time since I had one of these and I figured it was time for another one. As always links and comments are welcome.

…

i. The unbearable accuracy of stereotypes. I made a mental note of reading this paper later a long time ago, but I’ve been busy with other things. Today I skimmed it and decided that it looks interesting enough to give it a detailed read later. Some remarks from the summary towards the end of the paper:

“The scientific evidence provides more evidence of accuracy than of inaccuracy in social stereotypes. The most appropriate generalization based on the evidence is that people’s beliefs about groups are usually moderately to highly accurate, and are occasionally highly inaccurate. […] This pattern of empirical support for moderate to high stereotype accuracy is not unique to any particular target or perceiver group. Accuracy has been found with racial and ethnic groups, gender, occupations, and college groups. […] The pattern of moderate to high stereotype accuracy is not unique to any particular research team or methodology. […] This pattern of moderate to high stereotype accuracy is not unique to the substance of the stereotype belief. It occurs for stereotypes regarding personality traits, demographic characteristics, achievement, attitudes, and behavior. […] The strong form of the exaggeration hypothesis – either defining stereotypes as exaggerations or as claiming that stereotypes usually lead to exaggeration – is not supported by data. Exaggeration does sometimes occur, but it does not appear to occur much more frequently than does accuracy or underestimation, and may even occur less frequently.”

I should perhaps note that this research is closely linked to Funder’s research on personality judgment, which I’ve previously covered on the blog here and here.

…

ii. I’ve spent approximately 150 hours on vocabulary.com altogether at this point (having ‘mastered’ ~10,200 words in the process). A few words I’ve recently encountered on the site: Nescience (note to self: if someone calls you ‘nescient’ during a conversation, in many contexts that’ll be an insult, not a compliment) (Related note to self: I should find myself some smarter enemies, who use words like ‘nescient’…), eristic, carrel, oleaginous, decal, gable, epigone, armoire, chalet, cashmere, arrogate, ovine.

…

iii. why p = .048 should be rare (and why this feels counterintuitive).

…

iv. A while back I posted a few comments on SSC and I figured I might as well link to them here (at least it’ll make it easier *for me* to find them later on). Here is where I posted a few comments on a recent study dealing with Ramadan-related IQ effects, a topic which I’ve covered here on the blog before, and here I discuss some of the benefits of not having low self-esteem.

On a completely unrelated note, today I left a comment in a reddit thread about ‘Books That Challenged You / Made You See the World Differently’ which may also be of interest to readers of this blog. I realized while writing the comment that this question is probably getting more and more difficult for me to answer as time goes by. It really all depends upon *what part of the world* you want to see in a different light; which aspects you’re most interested in. For people wondering about where the books about mathematics and statistics were in that comment (I do like to think these fields play some role in terms of ‘how I see the world‘), I wasn’t really sure which book to include on such topics, if any; I can’t think of any single math or stats textbook that’s dramatically changed the way I thought about the world – to the extent that my knowledge about these topics has changed how I think about the world, it’s been a long drawn-out process.

…

v. Chess…

People who care the least bit about such things probably already know that a really strong tournament is currently being played in St. Louis, the so-called Sinquefield Cup, so I’m not going to talk about that here (for resources and relevant links, go here).

I talked about the strong rating pools on ICC not too long ago, but one thing I did not mention when discussing this topic back then was that yes, I also occasionally win against some of those grandmasters the rating pool throws at me – at least I’ve won a few times against GMs by now in bullet. I’m aware that for many ‘serious chess players’ bullet ‘doesn’t really count’ because the time dimension is much more important than it is in other chess settings, but to people who think skill doesn’t matter much in bullet I’d say they should have a match with Hikaru Nakamura and see how well they do against him (if you’re interested in how that might turn out, see e.g. this video – and keep in mind that at the beginning of the video Nakamura had already won 8 games in a row, out of 8, against his opponent in the first games, who incidentally is not exactly a beginner). The skill-sets required do not overlap perfectly between bullet and classical time control games, but when I started playing bullet online I quickly realized that good players really require very little time to completely outplay people who just play random moves (fast). Below I have posted a screencap I took while kibitzing a game of one of my former opponents, an anonymous GM from Germany, against whom I currently have a 2.5/6 score, with two wins, one draw, and three losses (see the ‘My score vs CPE’ box).

I like to think of a score like this as at least some kind of accomplishment, though admittedly perhaps not a very big one.

Also in chess-related news, I’m currently reading Jesús de la Villa’s 100 Endgames book, which Christof Sielecki has said some very nice things about. A lot of the stuff I’ve encountered so far is stuff I’ve seen before, positions I’ve already encountered and worked on, endgame principles I’m familiar with, etc., but not all of it is known stuff and I really like the structure of the book. There are a lot of pages left, and as it is I’m planning to read this book from cover to cover, which is something I usually do not do when I read chess books (few people do, judging from various comments I’ve seen people make in all kinds of different contexts).

Lastly, a lecture:

## Cost-effectiveness analysis in health care (I)

Yesterday’s SMBC was awesome, and I couldn’t help including it here (click to view full size):

…

In a way the three words I chose to omit from the post title are rather important in order to know which kind of book this is – the full title of Gray et al.’s work is: *Applied Methods of* … – but as I won’t be talking much about the ‘applied’ part in my coverage here, focusing instead on broader principles etc. which will be easier for people without a background in economics to follow, I figured I might as well omit those words from the post titles. I should also admit that I personally did not spend much time on the exercises, as this did not seem necessary in view of what I was using the book for. Despite not having spent much time on the exercises myself, I incidentally did reward the authors for including occasionally quite detailed coverage of technical aspects in my rating of the book on goodreads; I feel confident from the coverage that if I need to apply some of the methods they talk about in the book later on, the book will do a good job of helping me get things right. All in all, the book’s coverage made it hard for me not to give it 5 stars – so that was what I did.

I own an actual physical copy of the book, which makes blogging it more difficult than usual; I prefer blogging e-books. The greater amount of work involved in covering physical books is also one reason why I have yet to talk about Eysenck & Keane’s Cognitive Psychology text here on the blog, despite having read more than 500 pages of that book (it’s not that the book is boring). My coverage of the contents of both this book and the Eysenck & Keane book will (assuming I ever get around to blogging the latter, that is) be less detailed than it could have been, but on the other hand it’ll likely be very focused on key points and observations from the coverage.

I have talked about cost-effectiveness before here on the blog, e.g. here, but in my coverage of the book below I have not tried to avoid making points or including observations which I’ve already made elsewhere on the blog; it’s too much work to keep track of such things. With those introductory remarks out of the way, let’s move on to some observations made in the book:

…

“In cost-effectiveness analysis we first calculate the costs and effects of an intervention and one or more alternatives, then calculate the differences in cost and differences in effect, and finally present these differences in the form of a ratio, i.e. the cost per unit of health outcome effect […]. Because the focus is on differences between two (or more) options or treatments, analysts typically refer to incremental costs, incremental effects, and the incremental cost-effectiveness ratio (ICER). Thus, if we have two options *a* and *b*, we calculate their respective costs and effects, then calculate the difference in costs and difference in effects, and then calculate the ICER as the difference in costs divided by the difference in effects […] cost-effectiveness analyses which measure outcomes in terms of QALYs are sometimes referred to as cost-utility studies […] but are sometimes simply considered as a subset of cost-effectiveness analysis.”
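The ICER calculation itself is trivial; a sketch with made-up cost and QALY figures for two options *a* and *b*:

```python
def incremental_cer(cost_a, effect_a, cost_b, effect_b):
    """ICER of option b vs option a: incremental cost per incremental
    unit of health outcome effect (here, per QALY gained)."""
    return (cost_b - cost_a) / (effect_b - effect_a)

# Option a: current care; option b: new intervention (illustrative figures).
icer = incremental_cer(cost_a=10000, effect_a=5.0,
                       cost_b=16000, effect_b=5.8)
print(f"ICER: {icer:,.0f} per QALY gained")  # (16000-10000)/(5.8-5.0) = 7,500
```

Note that the denominator matters as much as the numerator: the same incremental cost looks very different depending on how many QALYs it buys.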

“Cost-effectiveness analysis places no monetary value on the health outcomes it is comparing. It does not measure or attempt to measure the underlying worth or value to society of gaining additional QALYs, for example, but simply indicates which options will permit more QALYs to be gained than others with the same resources, assuming that gaining QALYs is agreed to be a reasonable objective for the health care system. Therefore the cost-effectiveness approach will never provide a way of determining how much in total it is worth spending on health care and the pursuit of QALYs rather than on other social objectives such as education, defence, or private consumption. It does not permit us to say whether health care spending is too high or too low, but rather confines itself to the question of how any given level of spending can be arranged to maximize the health outcomes yielded.

In contrast, cost-benefit analysis (CBA) does attempt to place some monetary valuation on health outcomes as well as on health care resources. […] The reasons for the more widespread use of cost-effectiveness analysis compared with cost-benefit analysis in health care are discussed extensively elsewhere, […] but two main issues can be identified. Firstly, significant conceptual or practical problems have been encountered with the two principal methods of obtaining monetary valuations of life or quality of life: the human capital approach […] and the willingness to pay approach […] Second, within the health care sector there remains a widespread and intrinsic aversion to the concept of placing explicit monetary values on health or life. […] The cost-benefit approach should […], in principle, permit broad questions of **allocative efficiency** to be addressed. […] In contrast, cost-effectiveness analysis can address questions of **productive** or **production efficiency**, where a specified good or service is being produced at the lowest possible cost – in this context, health gain using the health care budget.”

“when working in the two-dimensional world of cost-effectiveness analysis, there are two uncertainties that will be encountered. Firstly, there will be uncertainty concerning the location of the intervention on the cost-effectiveness plane: how much more or less effective and how much more or less costly it is than current treatment. Second, there is uncertainty concerning how much the decision-maker is willing to pay for health gain […] these two uncertainties can be presented together in the form of the question ‘What is the probability that this intervention is cost-effective?’, a question which effectively divides our cost-effectiveness plane into just two policy spaces – below the maximum acceptable line, and above it”.
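That question can be answered directly from simulated incremental costs and effects (e.g. PSA output): at a given willingness to pay, count the proportion of simulations falling in the ‘cost-effective’ policy space. The simulated draws and the thresholds below are made up for illustration:

```python
import random

random.seed(42)
# Fake PSA output: (incremental cost, incremental QALYs) pairs on the
# cost-effectiveness plane; in practice these come from the actual model.
draws = [(random.gauss(4000, 1500), random.gauss(0.5, 0.2))
         for _ in range(10000)]

def prob_cost_effective(draws, wtp):
    """Fraction of draws below the maximum acceptable line, i.e. with
    positive net monetary benefit: wtp * dEffect - dCost > 0."""
    return sum(1 for d_cost, d_eff in draws
               if wtp * d_eff - d_cost > 0) / len(draws)

for wtp in (5000, 10000, 20000, 30000):
    p = prob_cost_effective(draws, wtp)
    print(f"WTP {wtp:>6}: P(cost-effective) = {p:.2f}")
```

Plotting this probability against the willingness-to-pay threshold gives a cost-effectiveness acceptability curve, which presents both uncertainties (location on the plane, and the decision-maker’s threshold) at once.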

“Conventionally, cost-effectiveness ratios that have been calculated against a baseline or do-nothing option without reference to any alternatives are referred to as *average* cost-effectiveness ratios, while comparisons with the next best alternative are described as *incremental* cost-effectiveness ratios […] it is quite misleading to calculate average cost-effectiveness ratios, as they ignore the alternatives available.”

“A life table provides a method of summarizing the mortality experience of a group of individuals. […] There are two main types of life table. First, there is a **cohort life table**, which is constructed based on the mortality experience of a group of individuals […]. While this approach can be used to characterize life expectancies of insects and some animals, human longevity makes this approach difficult to apply as the observation period would have to be sufficiently long to be able to observe the death of all members of the cohort. Instead, **current life tables** are normally constructed using cross-sectional data of observed mortality rates at different ages at a given point in time […] Life tables can also be classified according to the intervals over which changes in mortality occur. A **complete life table** displays the various rates for each year of life; while an **abridged life table** deals with greater periods of time, for example 5 year age intervals […] A life table can be used to generate a survival curve S(x) for the population at any point in time. This represents the probability of surviving beyond a certain age x (i.e. S(x)=Pr[X>x]). […] The chance of a male living to the age of 60 years is high (around 0.9) [in the UK, presumably – *US*] and so the survival curve is comparatively flat up until this age. The proportion dying each year from the age of 60 years rapidly increases, so the curve has a much steeper downward slope. In the last part of the survival curve there is an inflection, indicating a slowing rate of increase in the proportion dying each year among the very old (over 90 years). […] The hazard rate is the slope of the survival curve at any point, giving the instantaneous chance of an individual dying.”
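A sketch of how a survival curve S(x) falls out of a (here heavily abridged) life table; the age-specific mortality probabilities below are invented, coarse 10-year bands chosen purely for illustration:

```python
# q[x] = made-up probability of dying during the 10-year age band starting at x.
q = {0: 0.01, 10: 0.005, 20: 0.01, 30: 0.015, 40: 0.03,
     50: 0.07, 60: 0.15, 70: 0.35, 80: 0.60, 90: 0.85}

def survival_curve(q):
    """S(x) = Pr[X > x]: probability of surviving to the start of age band x,
    built by chaining the band-specific survival probabilities 1 - q[x]."""
    s, surv = 1.0, {}
    for age in sorted(q):
        surv[age] = s          # probability of reaching this age band
        s *= 1.0 - q[age]      # survive the band with probability 1 - q
    return surv

S = survival_curve(q)
print({age: round(p, 3) for age, p in S.items()})
```

With these toy numbers S(60) comes out a bit below 0.9, flat early and falling steeply after 60, which matches the shape described in the quote.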

“Life tables are a useful tool for estimating changes in life expectancies from interventions that reduce mortality. […] Multiple-cause life tables are a way of quantifying outcomes when there is more than one mutually exclusive cause of death. These life tables can estimate the potential gains from the elimination of a cause of death and are also useful in calculating the benefits of interventions that reduce the risk of a particular cause of death. […] One issue that arises when death is divided into multiple causes in this type of life table is **competing risk**. […] competing risk can arise ‘when an individual can experience more than one type of event and the occurrence of one type of event hinders the occurrence of other types of events’. Competing risks affect life tables, as those who die from a specific cause have no chance of dying from other causes during the remainder of the interval […]. In practice this will mean that as soon as one cause is eliminated the probabilities of dying of other causes increase […]. Several methods have been proposed to correct for competing risks when calculating life tables.”

“the use of published life-table methods may have limitations, especially when considering particular populations which may have very different risks from the general population. In these cases, there are a host of techniques referred to as **survival analysis** which enables risks to be estimated from patient-level data. […] Survival analysis typically involves observing one or more outcomes in a population of interest over a period of time. The outcome, which is often referred to as an **event** or **endpoint** could be death, a non-fatal outcome such as a major clinical event (e.g. myocardial infarction), the occurrence of an adverse event, or even the date of first non-compliance with a therapy.”

“A key feature of survival data is censoring, which occurs whenever the event of interest is not observed within the follow-up period. This does not mean that the event will not occur some time in the future, just that it has not occurred while the individual was observed. […] The most common case of censoring is referred to as **right censoring**. This occurs whenever the observation of interest occurs after the observation period. […] An alternative form of censoring is **left censoring**, which occurs when there is a period of time when the individuals are at risk prior to the observation period.

A key feature of most survival analysis methods is that they assume that the censoring process is **non-informative**, meaning that there is no dependence between the time to the event of interest and the process that is causing the censoring. However, if the duration of observation is related to the severity of a patient’s disease, for example if patients with more advanced illness are withdrawn early from the study, the censoring is likely to be informative and other techniques are required”.

“Differences in the composition of the intervention and control groups at the end of follow-up may have important implications for estimating outcomes, especially when we are interested in extrapolation. If we know that the intervention group is older and has a lower proportion of females, we would expect these characteristics to increase the hazard mortality in this group over their remaining lifetimes. However, if the intervention group has experienced a lower number of events, this may significantly reduce the hazard for some individuals. They may also benefit from a past treatment which continues to reduce the hazard of a primary outcome such as death. This effect […] is known as the **legacy effect**“.

“Changes in life expectancy are a commonly used outcome measure in economic evaluation. […] Table 4.6 shows selected examples of estimates of the gain in life expectancy for various interventions reported by Wright and Weinstein (1998) […] Gains in life expectancy from preventative interventions in populations of average risk generally ranged from a few days to slightly more than a year. […] The gains in life expectancy from preventing or treating disease in persons at elevated risk [*this type of prevention is known as ‘secondary-‘ and/or ‘tertiary prevention’ (depending on the circumstances), as opposed to ‘primary prevention’ – the distinction between primary prevention and more targeted approaches is often important in public health contexts, because the level of targeting will often interact with the cost-effectiveness dimension* – *US*] are generally greater […*one reason why this does not necessarily mean that targeted approaches are always better is that search costs will often be an increasing function of the level of targeting – US*]. Interventions that treat established disease vary, with gains in life-expectancy ranging from a few months […] to as long as nine years […] the point that Wright and Weinstein (1998) were making was not that absolute gains vary, but that a gain in life expectancy of a month from a preventive intervention targeted at a population at average risk and a gain of a year from a preventive intervention targeted at populations at elevated risk could both be considered large. It should also be noted that interventions that produce a comparatively small gain in life expectancy when averaged across the population […] may still be very cost-effective.”

## Model Selection and Multi-Model Inference (II)

I haven’t really blogged this book in anywhere near the amount of detail it deserves even though my first post about the book actually had a few quotes illustrating how much different stuff is covered in the book.

This book is technical, and even though I’m trying to make it less technical by omitting the math in this post, it may be a good idea to reread the first post about the book before reading this one, to refresh your knowledge of these things.

Quotes and comments below – most of the coverage here focuses on stuff covered in chapters 3 and 4 in the book.

…

“Tests of null hypotheses and information-theoretic approaches should not be used together; they are very different analysis paradigms. A very common mistake seen in the applied literature is to use AIC to rank the candidate models and then “test” to see whether the best model (the alternative hypothesis) is “significantly better” than the second-best model (the null hypothesis). This procedure is flawed, and we strongly recommend against it […] the primary emphasis should be on the size of the treatment effects and their precision; too often we find a statement regarding “significance,” while the treatment and control means are not even presented. Nearly all statisticians are calling for estimates of effect size and associated precision, rather than test statistics, P-values, and “significance.” [*Borenstein & Hedges certainly did as well in their book (written much later), and this was not an issue I omitted to talk about in my coverage of their book…*] […] Information-theoretic criteria such as AIC, AICc, and QAICc are not a “test” in any sense, and there are no associated concepts such as test power or P-values or α-levels. Statistical hypothesis testing represents a very different, and generally inferior, paradigm for the analysis of data in complex settings. **It seems best to avoid use of the word “significant” in reporting research results under an information-theoretic paradigm.** […] AIC allows a ranking of models and the identification of models that are nearly equally useful versus those that are clearly poor explanations for the data at hand […]. Hypothesis testing provides no general way to rank models, even for models that are nested. […] In general, we recommend strongly against the use of null hypothesis testing in model selection.”
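To make the contrast with testing concrete, here is a sketch of AIC-based ranking with Akaike weights (a standard companion quantity, though the weights formula is not quoted above); the log-likelihoods and parameter counts are invented:

```python
import math

# (model name, maximized log-likelihood, number of estimated parameters K)
candidates = [("M1", -120.3, 3), ("M2", -118.9, 5), ("M3", -119.8, 4)]

aic = {name: -2 * loglik + 2 * k for name, loglik, k in candidates}
best = min(aic.values())
delta = {name: a - best for name, a in aic.items()}            # AIC differences
rel = {name: math.exp(-0.5 * d) for name, d in delta.items()}  # relative likelihoods
weights = {name: r / sum(rel.values()) for name, r in rel.items()}

for name in sorted(aic, key=aic.get):
    print(f"{name}: AIC={aic[name]:.1f}  dAIC={delta[name]:.1f}  "
          f"weight={weights[name]:.2f}")
```

Nothing here is a test: the output is a ranking plus weights expressing each model’s relative support, with no P-value or α-level anywhere in sight.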

“The bootstrap is a type of Monte Carlo method used frequently in applied statistics. This computer-intensive approach is based on resampling of the observed data […] The fundamental idea of the model-based sampling theory approach to statistical inference is that the data arise as a sample from some conceptual probability distribution *f*. Uncertainties of our inferences can be measured if we can estimate *f*. The bootstrap method allows the computation of measures of our inference uncertainty by having a simple empirical estimate of *f* and sampling from this estimated distribution. In practical application, the empirical bootstrap means using some form of resampling with replacement from the actual data x to generate B (e.g., B = 1,000 or 10,000) bootstrap samples […] The set of B bootstrap samples is a proxy for a set of B independent real samples from *f* (in reality we have only one actual sample of data). Properties expected from replicate real samples are inferred from the bootstrap samples by analyzing each bootstrap sample exactly as we first analyzed the real data sample. From the set of results of sample size B we measure our inference uncertainties from sample to (conceptual) population […] For many applications it has been theoretically shown […] that the bootstrap can work well for large sample sizes (n), but it is not generally reliable for small n […], regardless of how many bootstrap samples B are used. […] Just as the analysis of a single data set can have many objectives, the bootstrap can be used to provide insight into a host of questions. For example, for each bootstrap sample one could compute and store the conditional variance–covariance matrix, goodness-of-fit values, the estimated variance inflation factor, the model selected, confidence interval width, and other quantities. Inference can be made concerning these quantities, based on summaries over the B bootstrap samples.”
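A minimal empirical bootstrap along the lines described, estimating the uncertainty of a sample mean; the data and the choice of B are arbitrary illustrations:

```python
import random
import statistics

random.seed(0)
data = [4.1, 5.3, 2.8, 6.0, 4.7, 3.9, 5.5, 4.4, 5.1, 3.6]

B = 2000
replicates = []
for _ in range(B):
    # Resample with replacement from the actual data, same sample size n,
    # and analyze each bootstrap sample exactly as the real sample.
    sample = [random.choice(data) for _ in data]
    replicates.append(statistics.mean(sample))

se = statistics.stdev(replicates)  # bootstrap standard error of the mean
reps = sorted(replicates)
lo, hi = reps[int(0.025 * B)], reps[int(0.975 * B)]
print(f"bootstrap SE: {se:.3f}; 95% percentile interval: ({lo:.2f}, {hi:.2f})")
```

As the quote notes, any quantity computed per bootstrap sample (goodness-of-fit, selected model, interval width, …) can be summarized over the B replicates in the same way; the mean is just the simplest case.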

“**Information criteria attempt only to select the best model from the candidate models available; if a better model exists, but is not offered as a candidate, then the information-theoretic approach cannot be expected to identify this new model**. Adjusted R^{2} […] are useful as a measure of the proportion of the variation “explained,” [but] are not useful in model selection […] adjusted R^{2} is poor in model selection; its usefulness should be restricted to description.”

“As we have struggled to understand the larger issues, it has become clear to us that inference based on only a single best model is often relatively poor for a wide variety of substantive reasons. Instead, we increasingly favor multimodel inference: procedures to allow formal statistical inference from all the models in the set. […] Such multimodel inference includes model averaging, incorporating model selection uncertainty into estimates of precision, confidence sets on models, and simple ways to assess the relative importance of variables.”

“If sample size is small, one must realize that relatively little information is probably contained in the data (unless the effect size is very substantial), and the data may provide few insights of much interest or use. Researchers routinely err by building models that are far too complex for the (often meager) data at hand. They do not realize how little structure can be reliably supported by small amounts of data that are typically “noisy.””

“Sometimes, the selected model [when applying an information criterion] contains a parameter that is constant over time, or areas, or age classes […]. This result should not imply that there is no variation in this parameter, rather that parsimony and its bias/variance tradeoff finds the actual variation in the parameter to be relatively small in relation to the information contained in the sample data. It “costs” too much in lost precision to add estimates of all of the individual *θ*_{i}. As the sample size increases, then at some point a model with estimates of the individual parameters would likely be favored. Just because a parsimonious model contains a parameter that is constant across strata does not mean that there is no variation in that process across the strata.”

“[In a significance testing context,] a significant test result does not relate directly to the issue of what approximating model is best to use for inference. One model selection strategy that has often been used in the past is to do likelihood ratio tests of each structural factor […] and then use a model with all the factors that were “significant” at, say, α = 0.05. However, there is no theory that would suggest that this strategy would lead to a model with good inferential properties (i.e., small bias, good precision, and achieved confidence interval coverage at the nominal level). […] The purpose of the analysis of empirical data is not to find the “true model”— not at all. Instead, we wish to find a best approximating model, based on the data, and then develop statistical inferences from this model. […] We search […] not for a “true model,” but rather for a parsimonious model giving an accurate approximation to the interpretable information in the data at hand. Data analysis involves the question, “What level of model complexity will the data support?” and both under- and overfitting are to be avoided. Larger data sets tend to support more complex models, and the selection of the size of the model represents a tradeoff between bias and variance.”

“The easy part of the information-theoretic approaches includes both the computational aspects and the clear understanding of these results […]. The hard part, and the one where training has been so poor, is the a priori thinking about the science of the matter before data analysis — even before data collection. It has been too easy to collect data on a large number of variables in the hope that a fast computer and sophisticated software will sort out the important things — the “significant” ones […]. Instead, a major effort should be mounted to understand the nature of the problem by critical examination of the literature, talking with others working on the general problem, and thinking deeply about alternative hypotheses. Rather than “test” dozens of trivial matters (is the correlation zero? is the effect of the lead treatment zero? are ravens pink?, Anderson et al. 2000), there must be a more concerted effort to provide evidence on *meaningful* questions that are important to a discipline. This is the critical point: the common failure to address important science questions in a fully competent fashion. […] “Let the computer find out” is a poor strategy for researchers who do not bother to think clearly about the problem of interest and its scientific setting. *The sterile analysis of “just the numbers” will continue to be a poor strategy for progress in the sciences.*

Researchers often resort to using a computer program that will examine all possible models and variables automatically. Here, the hope is that the computer will discover the important variables and relationships […] The primary mistake here is a common one: the failure to posit a small set of a priori models, each representing a plausible research hypothesis.”

“Model selection is most often thought of as a way to select just the best model, then inference is conditional on that model. However, information-theoretic approaches are more general than this simplistic concept of model selection. Given a set of models, specified independently of the sample data, we can make formal inferences based on the entire set of models. […] Part of multimodel inference includes ranking the fitted models from best to worst […] and then scaling to obtain the relative plausibility of each fitted model (*g_{i}*) by a weight of evidence (*w_{i}*) relative to the selected best model. Using the conditional sampling variance […] from each model and the Akaike weights […], unconditional inferences about precision can be made over the entire set of models. Model-averaged parameter estimates and estimates of unconditional sampling variances can be easily computed. Model selection uncertainty is a substantial subject in its own right, well beyond just the issue of determining the best model.”

“There are three general approaches to assessing model selection uncertainty: (1) theoretical studies, mostly using Monte Carlo simulation methods; (2) the bootstrap applied to a given set of data; and (3) utilizing the set of AIC differences (i.e., ∆_{i}) and model weights *w_{i}* from the set of models fit to data.”

“Statistical science should emphasize estimation of parameters and associated measures of estimator uncertainty. Given a correct model […], an MLE is reliable, and we can compute a reliable estimate of its sampling variance and a reliable confidence interval […]. If the model is selected entirely independently of the data at hand, and is a good approximating model, and if n is large, then the estimated sampling variance is essentially unbiased, and any appropriate confidence interval will essentially achieve its nominal coverage. This would be the case if we used only one model, decided on a priori, and it was a good model, *g*, of the data generated under truth, *f*. However, even when we do objective, data-based model selection (which we are advocating here), the [model] selection process is expected to introduce an added component of sampling uncertainty into any estimated parameter; hence classical theoretical sampling variances are too small: They are conditional on the model and do not reflect model selection uncertainty. One result is that conditional confidence intervals can be expected to have less than nominal coverage.”

“Data analysis is sometimes focused on the variables to include versus exclude in the selected model (e.g., important vs. unimportant). Variable selection is often the focus of model selection for linear or logistic regression models. Often, an investigator uses stepwise analysis to arrive at a final model, and from this a conclusion is drawn that the variables in this model are important, whereas the other variables are not important. While common, this is poor practice and, among other issues, fails to fully consider model selection uncertainty. […] Estimates of the relative importance of predictor variables x_{j} can best be made by summing the Akaike weights across all the models in the set where variable *j* occurs. Thus, the relative importance of variable *j* is reflected in the sum w_{+ }(*j*). The larger the w_{+ }(*j*) the more important variable *j* is, relative to the other variables. Using the w_{+ }(*j*), all the variables can be ranked in their importance. […] This idea extends to subsets of variables. For example, we can judge the importance of a pair of variables, as a pair, by the sum of the Akaike weights of all models that include the pair of variables. […] To summarize, in many contexts the AIC selected best model will include some variables and exclude others. Yet this inclusion or exclusion by itself does not distinguish differential evidence for the importance of a variable in the model. The model weights […] summed over all models that include a given variable provide a better weight of evidence for the importance of that variable in the context of the set of models considered.” [*The reason why I’m not telling you how to calculate Akaike weights is that I don’t want to bother with math formulas in wordpress – but I guess all you need to know is that these are not hard to calculate. It should perhaps be added that one can also use bootstrapping methods to obtain relevant model weights to apply in a multimodel inference context.*]
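
Since the post deliberately omits the formulas, here is a small sketch of the two computations the quote relies on: Akaike weights (w_{i} ∝ exp(−Δ_{i}/2), normalized over the model set) and the variable-importance sums w_{+}(j). The predictor sets and AIC values are invented for illustration:

```python
import math

# Hypothetical models (as tuples of included predictors) with illustrative AIC values
models = {
    ("x1",): 230.8,
    ("x1", "x2"): 228.2,
    ("x1", "x3"): 229.5,
    ("x1", "x2", "x3"): 231.0,
}

best = min(models.values())
# Akaike weight: w_i proportional to exp(-delta_i / 2), normalized over the set
raw = {m: math.exp(-(aic - best) / 2) for m, aic in models.items()}
total = sum(raw.values())
weights = {m: r / total for m, r in raw.items()}

# Relative importance of variable j: sum the Akaike weights of all models containing j
def importance(var):
    return sum(w for m, w in weights.items() if var in m)

for v in ("x1", "x2", "x3"):
    print(v, round(importance(v), 3))
```

In this made-up set, x1 appears in every model, so w_{+}(x1) = 1; the comparison between w_{+}(x2) and w_{+}(x3) is the "weight of evidence" ranking the quote describes, and it is only meaningful relative to the set of models considered.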

“If data analysis relies on model selection, then inferences should acknowledge model selection uncertainty. If the goal is to get the best estimates of a set of parameters in common to all models (this includes prediction), model averaging is recommended. If the models have definite, and differing, interpretations as regards understanding relationships among variables, and it is such understanding that is sought, then one wants to identify the best model and make inferences based on that model. […] The bootstrap provides direct, robust estimates of model selection probabilities π_{i} , but we have no reason now to think that use of bootstrap estimates of model selection probabilities rather than use of the Akaike weights will lead to superior unconditional sampling variances or model-averaged parameter estimators. […] Be mindful of possible model redundancy. A carefully thought-out set of a priori models should eliminate model redundancy problems and is a central part of a sound strategy for obtaining reliable inferences. […] **Results are sensitive to having demonstrably poor models in the set of models considered; thus it is very important to exclude models that are a priori poor.** […] The importance of a small number (R) of candidate models, defined prior to detailed analysis of the data, cannot be overstated. […] One should have R much smaller than n. MMI [Multi-Model Inference] approaches become increasingly important in cases where there are many models to consider.”
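
When model averaging is the goal, Burnham & Anderson's estimator combines the per-model estimates via the Akaike weights, and the unconditional variance adds a between-model spread term to the conditional variances. The weights, estimates, and variances below are made-up numbers; the formulas are θ̄ = Σ w_{i} θ̂_{i} and v̂ar(θ̄) = [Σ w_{i} √(v̂ar_{i} + (θ̂_{i} − θ̄)²)]²:

```python
import math

# Hypothetical per-model results: (Akaike weight w_i, parameter estimate, conditional variance)
results = [
    (0.55, 1.20, 0.040),
    (0.30, 1.45, 0.055),
    (0.15, 0.95, 0.070),
]

# Model-averaged estimate: Akaike-weighted mean of the per-model estimates
theta_bar = sum(w * est for w, est, _ in results)

# Unconditional variance: adds a term for the disagreement between models,
# so it exceeds the weighted average of the purely conditional variances
var_unc = sum(w * math.sqrt(v + (est - theta_bar) ** 2)
              for w, est, v in results) ** 2

print(round(theta_bar, 4), round(var_unc, 4))
```

The excess of `var_unc` over the weighted conditional variances is exactly the "added component of sampling uncertainty" from model selection that the earlier quote warns classical conditional variances ignore.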

“In general there is a substantial amount of model selection uncertainty in many practical problems […]. Such uncertainty about what model structure (and associated parameter values) is the K-L [Kullback–Leibler] best approximating model applies whether one uses hypothesis testing, information-theoretic criteria, dimension-consistent criteria, cross-validation, or various Bayesian methods. Often, there is a nonnegligible variance component for estimated parameters (this includes prediction) due to uncertainty about what model to use, and this component should be included in estimates of precision. […] we recommend assessing model selection uncertainty rather than ignoring the matter. […] It is […] not a sound idea to pick a single model and unquestioningly base extrapolated predictions on it when there is model uncertainty.”

## Model Selection and Multi-Model Inference (I)

“We wrote this book to introduce graduate students and research workers in various scientific disciplines to the use of information-theoretic approaches in the analysis of empirical data. These methods allow the data-based selection of a “best” model and a ranking and weighting of the remaining models in a pre-defined set. Traditional statistical inference can then be based on this selected best model. However, we now emphasize that information-theoretic approaches allow formal inference to be based on more than one model (multimodel inference). Such procedures lead to more robust inferences in many cases, and we advocate these approaches throughout the book. […] Information theory includes the celebrated Kullback–Leibler “distance” between two models (actually, probability distributions), and this represents a fundamental quantity in science. In 1973, Hirotugu Akaike derived an estimator of the (relative) expectation of Kullback–Leibler distance based on Fisher’s maximized log-likelihood. His measure, now called Akaike’s information criterion (AIC), provided a new paradigm for model selection in the analysis of empirical data. His approach, with a fundamental link to information theory, is relatively simple and easy to use in practice, but little taught in statistics classes and far less understood in the applied sciences than should be the case. […] We do not claim that the information-theoretic methods are always the very best for a particular situation. They do represent a unified and rigorous theory, an extension of likelihood theory, an important application of information theory, and they are objective and practical to employ across a very wide class of empirical problems. 
Inference from multiple models, or the selection of a single “best” model, by methods based on the Kullback–Leibler distance are almost certainly better than other methods commonly in use now (e.g., null hypothesis testing of various sorts, the use of R^{2}, or merely the use of just one available model).

This is an applied book written primarily for biologists and statisticians using models for making inferences from empirical data. […] This book might be useful as a text for a course for students with substantial experience and education in statistics and applied data analysis. A second primary audience includes honors or graduate students in the biological, medical, or statistical sciences […] Readers should ideally have some maturity in the quantitative sciences and experience in data analysis. Several courses in contemporary statistical theory and methods as well as some philosophy of science would be particularly useful in understanding the material. Some exposure to likelihood theory is nearly essential”.

…

The above quotes are from the preface of the book, which I have so far only briefly talked about here; this post will provide a lot more details. Aside from writing the post in order to mentally process the material and obtain a greater appreciation of the points made in the book, I have also, as a secondary goal, tried to write the post so that people who are not necessarily experienced model-builders might derive some benefit from the coverage. Whether or not I was successful in that respect I do not know – given the outline above, it should be obvious that there are limits to how ‘readable’ you can make stuff like this for people without a background in a semi-relevant field. I don’t think I have written specifically about the application of information criteria in the model selection context before here on the blog, at least not in any amount of detail, but I have written about ‘model-stuff’ before, also in ‘meta-contexts’ not necessarily related to the application of models in economics; so if you’re interested in ‘this kind of stuff’ but you don’t feel like having a go at a post dealing with a book which includes word combinations like ‘the (relative) expectation of Kullback–Leibler distance based on Fisher’s maximized log-likelihood’ in the preface, you can for example have a look at posts like this, this, this and this. I have also discussed here on the blog some stuff somewhat related to the multi-model inference part – how you can combine the results of various models to get a bigger picture of what’s going on – in these posts; they approach ‘the topic’ (these are in fact separate topics…) in a very different manner than does this book, but *some* key ideas *should* presumably transfer.
Having said all this, I should also point out that many of the basic points made in the coverage below should be relatively easy to understand, and I should perhaps repeat that I’ve tried to make this post readable to people who’re not too familiar with this kind of stuff. I have deliberately chosen to include no mathematical formulas in my coverage in this post. Please do not assume this is because the book does not contain mathematical formulas.

Before moving on to the main coverage I thought I’d add a note about the remark above that stuff like AIC is “little taught in statistics classes and far less understood in the applied sciences than should be the case”. The book was written a while back, and some things may have changed a bit since then. I have done coursework on the application of information criteria in model selection as it was a topic (briefly) covered in regression analysis(? …or an earlier course), so at least this kind of stuff is now being taught to students of economics where I study and has been for a while as far as I’m aware – meaning that coverage of such topics is probably reasonably widespread at least in this field. However I can hardly claim that I obtained a ‘great’ or ‘full’ understanding of the issues at hand from the work on these topics I did back then – and so I have only gradually, while reading this book, come to appreciate some of the deeper issues and tradeoffs involved in model selection. This could probably be taken as an argument that these topics are still ‘far less understood … than should be the case’ – and another, perhaps stronger, argument would be Seber’s comments in the last part of his book; if a statistician today may still ‘overlook’ information criteria when discussing model selection in a *Springer* text, it’s not hard to argue that the methods are perhaps not as well known as should ‘ideally’ be the case. It’s obvious from the coverage that a lot of people were not using the methods when the book was written, and I’m not sure things have changed as much as would be preferable since then.

What is the book about? A starting point for understanding the sort of questions the book deals with might be to consider the simple question: When we set out to model stuff empirically and we have different candidate models to choose from, how do we decide which of the models is ‘best’? There are a lot of other questions dealt with in the coverage as well. What does the word ‘best’ mean? We might worry over both the functional form of the model and which variables should be included in ‘the best’ model – do we need separate mechanisms for dealing with concerns about the functional form and concerns about variable selection, or can we deal with such things at the same time? How do we best measure the effect of a variable which we have access to and consider including in our model(s) – is it preferable to interpret the effect of a variable on an outcome based on the results you obtain from a ‘best model’ in the set of candidate models, or is it perhaps sometimes better to combine the results of multiple models in the choice set, for example by taking an average of the variable’s effects across the proposed models as the best possible estimate? (As should by now be obvious to people who’ve read along here, there are sometimes quite close parallels between stuff covered in this book and stuff covered in *Borenstein & Hedges*.) If we’re not sure which model is ‘right’, how might we quantify our uncertainty about these matters – and what happens if we don’t try to quantify our uncertainty about which model is correct? What is bootstrapping, and how can we use Monte Carlo methods to help us with model selection? If we apply information criteria to choose among models, what do these criteria tell us, and which sort of issues are they silent about? 
Are some methods for deciding between models better than others in specific contexts – might it for example be a good idea to make criteria adjustments when faced with small sample sizes which make it harder for us to rely on asymptotic properties of the criteria we apply? How might the sample size more generally relate to the criterion we use to decide which model is ‘best’ – does (should?) ‘the best model’ depend upon how much data we have access to, and if the amount of data and the ‘optimal size of a model’ are related, *how* are the two related, and why? The questions included in the previous sentence relate to some fundamental differences between AIC (and similar measures) and BIC – but let’s not get ahead of ourselves. I may or may not go into details like these in my coverage of the book, but I certainly won’t cover stuff like that in this post. Some of the content is really technical: “Chapters 5 and 6 present more difficult material [than chapters 1-4] and some new research results. Few readers will be able to absorb the concepts presented here after just one reading of the material […] Underlying theory is presented in Chapter 7, and this material is much deeper and more mathematical.” – from the preface. The sample size considerations mentioned above relate to stuff covered in chapter 6. As you might already have realized, this book has a lot of stuff.

When dealing with models, one way to think about these things is to consider two in some sense separate issues: On the one hand we might think about which model is most appropriate (model selection), and on the other hand we might think about how best to estimate parameter values and variance-covariance matrices *given* a specific model. As the book points out early on, “if one assumes or somehow chooses a particular model, methods exist that are objective and asymptotically optimal for estimating model parameters and the sampling covariance structure, conditional on that model. […] The sampling distributions of ML [maximum likelihood] estimators are often skewed with small samples, but profile likelihood intervals or log-based intervals or bootstrap procedures can be used to achieve asymmetric confidence intervals with good coverage properties. **In general, the maximum likelihood method provides an objective, omnibus theory for estimation of model parameters and the sampling covariance matrix, given an appropriate model**.” The problem is that it’s not ‘a given’ that the model we’re working on *is* actually appropriate. That’s where model selection mechanisms enter the picture. Such methods can help us identify which of the models we’re considering might be the most appropriate one(s) to apply in the specific context (there are other things they can’t tell us, however – see below).

Below I have added some quotes from the book and some further comments:

“Generally, alternative models will involve differing numbers of parameters; the number of parameters will often differ by at least an order of magnitude across the set of candidate models. […] The more parameters used, the better the fit of the model to the data that is achieved. Large and extensive data sets are likely to support more complexity, and this should be considered in the development of the set of candidate models. If a particular model (parametrization) does not make biological [/’scientific’] sense, this is reason to exclude it from the set of candidate models, particularly in the case where causation is of interest. In developing the set of candidate models, one must recognize a certain balance between keeping the set small and focused on plausible hypotheses, while making it big enough to guard against omitting a very good a priori model. While this balance should be considered, we advise the inclusion of all models that seem to have a reasonable justification, prior to data analysis. While one must worry about errors due to both underfitting and overfitting, it seems that modest overfitting is less damaging than underfitting (Shibata 1989).” (The key word here is ‘modest’ – and please don’t take these authors to be in favour of obviously overfitted models and data dredging strategies; they spend quite a few pages criticizing such models/approaches!).

“It is not uncommon to see biologists collect data on 50–130 “ecological” variables in the blind hope that some analysis method and computer system will “find the variables that are significant” and sort out the “interesting” results […]. This shotgun strategy will likely uncover mainly spurious correlations […], and it is prevalent in the naive use of many of the traditional multivariate analysis methods (e.g., principal components, stepwise discriminant function analysis, canonical correlation methods, and factor analysis) found in the biological literature [*and elsewhere, US*]. We believe that mostly spurious results will be found using this unthinking approach […], and we encourage investigators to give very serious consideration to a well-founded set of candidate models and predictor variables (as a reduced set of possible prediction) as a means of minimizing the inclusion of spurious variables and relationships. […] Using AIC and other similar methods one can only hope to select the best model from this set; if good models are not in the set of candidates, they cannot be discovered by model selection (i.e., data analysis) algorithms. […] statistically we can infer only that a best model (by some criterion) has been selected, never that it is the true model. […] **Truth and true models are not statistically identifiable from data**.”

“It is generally a mistake to believe that there is a simple “true model” in the biological sciences and that during data analysis this model can be uncovered and its parameters estimated. Instead, biological systems [*and other systems! – US*] are complex, with many small effects, interactions, individual heterogeneity, and individual and environmental covariates (most being unknown to us); we can only hope to identify a model that provides a good approximation to the data available. The words “true model” represent an oxymoron, except in the case of Monte Carlo studies, whereby a model is used to generate “data” using pseudorandom numbers […] A model is a simplification or approximation of reality and hence will not reflect all of reality. […] While a model can never be “truth,” a model might be ranked from very useful, to useful, to somewhat useful to, finally, essentially useless. Model selection methods try to rank models in the candidate set relative to each other; whether any of the models is actually “good” depends primarily on the quality of the data and the science and a priori thinking that went into the modeling. […] Proper modeling and data analysis tell what inferences the data support, not what full reality might be […] Even if a “true model” did exist and if it could be found using some method, it would not be good as a fitted model for general inference (i.e., understanding or prediction) about some biological system, because its numerous parameters would have to be estimated from the finite data, and the precision of these estimated parameters would be quite low.”

A key concept in the context of model selection is the tradeoff between bias and variance in a model framework:

“If the fit is improved by a model with more parameters, then where should one stop? Box and Jenkins […] suggested that the* principle of parsimony* should lead to a model with “. . . the smallest possible number of parameters for adequate representation of the data.” Statisticians view the principle of parsimony as a bias versus variance tradeoff. In general, bias decreases and variance increases as the dimension of the model (K) increases […] The fit of any model can be improved by increasing the number of parameters […]; however, a tradeoff with the increasing variance must be considered in selecting a model for inference. Parsimonious models achieve a proper tradeoff between bias and variance. All model selection methods are based to some extent on the principle of parsimony […] The concept of parsimony and a bias versus variance tradeoff is very important.”

“we reserve the terms underfitted and overfitted for use in relation to a “best approximating model” […] Here, an underfitted model would ignore some important replicable (i.e., conceptually replicable in most other samples) structure in the data and thus fail to identify effects that were actually supported by the data. In this case, bias in the parameter estimators is often substantial, and the sampling variance is underestimated, both factors resulting in poor confidence interval coverage. Underfitted models tend to miss important treatment effects in experimental settings. Overfitted models, as judged against a best approximating model, are often free of bias in the parameter estimators, but have estimated (and actual) sampling variances that are needlessly large (the precision of the estimators is poor, relative to what could have been accomplished with a more parsimonious model). Spurious treatment effects tend to be identified, and spurious variables are included with overfitted models. […] The goal of data collection and analysis is to make inferences from the sample that properly apply to the population […] A paramount consideration is the repeatability, with good precision, of any inference reached. When we imagine many replicate samples, there will be some recognizable features common to almost all of the samples. Such features are the sort of inference about which we seek to make strong inferences (from our single sample). Other features might appear in, say, 60% of the samples yet still reflect something real about the population or process under study, and we would hope to make weaker inferences concerning these. Yet additional features appear in only a few samples, and these might be best included in the error term (σ^{2}) in modeling. 
If one were to make an inference about these features quite unique to just the single data set at hand, as if they applied to all (or most all) samples (hence to the population), then we would say that the sample is overfitted by the model (we have overfitted the *data*). Conversely, failure to identify the features present that are strongly replicable over samples is underfitting. […] A best approximating model is achieved by properly balancing the errors of underfitting and overfitting.”

Model selection bias is a key concept in the model selection context, and I think this problem is quite similar/closely related to problems encountered in a meta-analytical context which I believe I’ve discussed before here on the blog (see links above to the posts on meta-analysis) – if I’ve understood these authors correctly, one might choose to think of publication bias issues as partly the result of model selection bias issues. Let’s for a moment pretend you have a ‘true model’ which includes three variables (in the book example there are four, but I don’t think you need four…); one is very important, one is a sort of ‘60% of the samples variable’ mentioned above, and the last one would be a variable we might prefer to just include in the error term. Now the problem is this: When people look at samples where the last one of these variables is ‘seen to matter’, the effect size of this variable will be biased away from zero (they don’t explain where this bias comes from in the book, but I’m reasonably sure this is a result of the probability of identification/inclusion of the variable in the model depending on the (‘local’/’sample’) effect size; the bigger the effect size of a specific variable in a specific sample, the more likely the variable is to be identified as important enough to be included in the model – *Borenstein & Hedges* talked about similar dynamics, for obvious reasons, and I think their reasoning ‘transfers’ to this situation and is applicable here as well). When models include variables such as the last one, you’ll have model selection bias: “When predictor variables [like these] are included in models, the associated estimator for a σ^{2} is negatively biased and precision is exaggerated. These two types of bias are called model selection bias”. Much later in the book they incidentally conclude that: “**The best way to minimize model selection bias is to reduce the number of models fit to the data by thoughtful a priori model formulation**.”
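
The mechanism sketched above – an effect estimate is biased away from zero *conditional on the variable looking big enough in the sample to be kept* – is easy to demonstrate with a small simulation. All numbers here (true effect, sample size, inclusion threshold) are made up for illustration:

```python
import random
import statistics

random.seed(1)

true_beta, n, sims = 0.2, 30, 2000  # small but nonzero true effect
all_est = []  # slope estimates from every simulated sample
kept = []     # estimates retained only when the sample effect looks "large"

for _ in range(sims):
    # Simulate y = beta * x + noise and estimate beta by least squares
    x = [random.gauss(0, 1) for _ in range(n)]
    y = [true_beta * xi + random.gauss(0, 1) for xi in x]
    mx, my = statistics.mean(x), statistics.mean(y)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    sxx = sum((xi - mx) ** 2 for xi in x)
    b = sxy / sxx
    all_est.append(b)
    if abs(b) > 0.3:  # the variable is "selected" only when its sample effect is big
        kept.append(b)

print(round(statistics.mean(all_est), 3), round(statistics.mean(kept), 3))
```

Across all samples the estimator is roughly unbiased around 0.2, but among the samples where the variable would have made it into the model, the average estimate is substantially larger – the same selection dynamic that drives publication bias in the meta-analytic setting.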

“Model selection has most often been viewed, and hence taught, in a context of null hypothesis testing. Sequential testing has most often been employed, either stepup (forward) or stepdown (backward) methods. Stepwise procedures allow for variables to be added or deleted at each step. These testing-based methods remain popular in many computer software packages in spite of their poor operating characteristics. […] Generally, hypothesis testing is a very poor basis for model selection […] There is no statistical theory that supports the notion that hypothesis testing with a fixed α level is a basis for model selection. […] Tests of hypotheses within a data set are not independent, making inferences difficult. The order of testing is arbitrary, and differing test order will often lead to different final models. [This is incidentally one, of several, key differences between hypothesis testing approaches and information theoretic approaches: “The order in which the information criterion is computed over the set of models is not relevant.”] […] Model selection is dependent on the arbitrary choice of α, but α should depend on both n and K to be useful in model selection”.
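To illustrate the contrast the quote draws (my own minimal sketch, not code from either book): with an information criterion such as AIC, each candidate model gets a single score, so the candidate set can be evaluated in any order and then ranked, with no sequence of dependent hypothesis tests and no α to choose. The data-generating model below is hypothetical.

```python
import numpy as np
from itertools import combinations

rng = np.random.default_rng(1)
n = 200
X = rng.normal(size=(n, 3))
# hypothetical truth: only the first two of the three predictors matter
y = 1.0 + 0.8 * X[:, 0] + 0.3 * X[:, 1] + rng.normal(size=n)

def aic(cols):
    """AIC for an OLS fit (Gaussian errors) using the given predictor columns plus intercept."""
    Xd = np.column_stack([np.ones(n)] + [X[:, c] for c in cols])
    beta, *_ = np.linalg.lstsq(Xd, y, rcond=None)
    rss = np.sum((y - Xd @ beta) ** 2)
    k = Xd.shape[1] + 1  # regression coefficients plus the error variance
    return n * np.log(rss / n) + 2 * k

# all 2^3 subsets of the predictors; evaluation order is irrelevant to the ranking
models = [cols for r in range(4) for cols in combinations(range(3), r)]
ranked = sorted(models, key=aic)
print(ranked[0])  # the best-scoring subset of predictor indices
```

Note that this still presumes a thoughtfully formulated, small candidate set; scoring all possible subsets of a large predictor pool reintroduces the model selection bias discussed above.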

## Statistical Models for Proportions and Probabilities

“Most elementary statistics books discuss inference for proportions and probabilities, and the primary readership for this monograph is the student of statistics, either at an advanced undergraduate or graduate level. As some of the recommended so-called ‘‘large-sample’’ rules in textbooks have been found to be inappropriate, this monograph endeavors to provide more up-to-date information on these topics. I have also included a number of related topics not generally found in textbooks. The emphasis is on model building and the estimation of parameters from the models.

It is assumed that the reader has a background in statistical theory and inference and is familiar with standard univariate and multivariate distributions, including conditional distributions.”

…

The above quote is from the book‘s preface. The book is highly technical – here’s a screencap of a page roughly in the middle:

I think the above picture provides some background as to why I do not think it’s a good idea to provide detailed coverage of the book here. Not all pages are that dense, but this *is* a book on mathematical statistics. The technical nature of the book made it difficult for me to know how to rate it. When reading books like this one, I like to ask myself whether I would be able to spot an error in the coverage; in some contexts here I clearly would not (given the time I was willing to spend on the book), and when that’s the case I always feel hesitant about rating(/’judging’) books of this nature. I should note that there are pretty much no spelling/formatting errors, and the language is easy to understand (‘if you know enough about statistics…’). I did have one major problem with part of the coverage towards the end of the book, but it didn’t much alter my general impression. The problem was that the author seems to apply (/recommend?) a hypothesis-testing framework for model selection, a practice which, although widely used, is considered bad statistics by Burnham and Anderson in their book on model selection. In the relevant section Seber discusses an approach to modelling which starts out with a ‘full model’ including both primary effects and various (potentially multi-level) interaction terms (he deals specifically with data derived from multiple (independent?) multinomial distributions, but where the data comes from is not really important here), and he then uses hypothesis tests of whether the interaction terms are zero to decide whether they should be included in the model. This model selection method is very commonly used, but Burnham and Anderson argue at length in their book that using hypothesis testing as a model selection mechanism is methodologically invalid.
I assume I’ll be covering Burnham and Anderson’s book in more detail later on here on the blog, so for now I’ll just flag this key point and return to it later – if you did not understand the comments above, you shouldn’t worry too much about it. This was the only real problem I had with Seber’s book.

Although I won’t talk a lot about the book’s contents (not only because it might be hard for some readers to follow, I should point out, but also because detailed coverage would take a lot more time than I’d be willing to spend on this stuff), I decided to add a few links to relevant topics he covers in the book. Quite a few pages are spent on the properties of various distributions, how to estimate key parameters of interest, and how to construct confidence intervals to be used for hypothesis testing in those specific contexts.

Some of the links below deal with stuff covered in the book, a few others however just deal with stuff I had to look up in order to understand what was going on in the coverage:

Inverse sampling.

Binomial distribution.

Hypergeometric distribution.

Multinomial distribution.

Binomial proportion confidence interval. (Coverage of the Wilson score interval, Jeffreys interval, and the Clopper-Pearson interval is included in the book).

Fisher’s exact test.

Marginal distribution.

Fisher information.

Moment-generating function.

Factorial moment-generating function.

Delta method.

Multidimensional central limit theorem (the book applies this, but doesn’t really talk about it).

Matrix function.

McNemar’s test.
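As a small illustration of why some of the ‘large-sample’ textbook rules mentioned in the preface can be inappropriate (my own sketch, not code from the book): the usual normal-approximation (Wald) interval for a binomial proportion can extend outside [0, 1] when the observed proportion is near the boundary, while the Wilson score interval – one of the intervals Seber covers – stays inside it.

```python
import math

def wald_interval(x, n, z=1.96):
    """Textbook normal-approximation interval; misbehaves for small n or extreme p."""
    p = x / n
    half = z * math.sqrt(p * (1 - p) / n)
    return p - half, p + half

def wilson_interval(x, n, z=1.96):
    """Wilson score interval: obtained by inverting the score test for a proportion."""
    p = x / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = (z / denom) * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n))
    return center - half, center + half

# 1 success out of 20 trials: an observed proportion close to the boundary
print(wald_interval(1, 20))    # lower endpoint falls below 0
print(wilson_interval(1, 20))  # both endpoints stay inside (0, 1)
```

The Wilson interval’s better behaviour comes from solving for the parameter values consistent with the data rather than plugging the point estimate into its own standard error.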