## Quotes

i. “The combination of some data and an aching desire for an answer does not ensure that a reasonable answer can be extracted from a given body of data.” (John Tukey)

ii. “Far better an approximate answer to the *right* question, which is often vague, than an *exact* answer to the wrong question, which can always be made precise.” (-ll-)

iii. “They who can no longer unlearn have lost the power to learn.” (John Lancaster Spalding)

iv. “If there are but few who interest thee, why shouldst thou be disappointed if but few find thee interesting?” (-ll-)

v. “Since the mass of mankind are too ignorant or too indolent to think seriously, if majorities are right it is by accident.” (-ll-)

vi. “As they are the bravest who require no witnesses to their deeds of daring, so they are the best who do right without thinking whether or not it shall be known.” (-ll-)

vii. “Perfection is beyond our reach, but they who earnestly strive to become perfect, acquire excellences and virtues of which the multitude have no conception.” (-ll-)

viii. “We are made ridiculous less by our defects than by the affectation of qualities which are not ours.” (-ll-)

ix. “If thy words are wise, they will not seem so to the foolish: if they are deep the shallow will not appreciate them. Think not highly of thyself, then, when thou art praised by many.” (-ll-)

x. “Since all models are wrong the scientist cannot obtain a “correct” one by excessive elaboration. On the contrary following William of Occam he should seek an economical description of natural phenomena. Just as the ability to devise simple but evocative models is the signature of the great scientist so overelaboration and overparameterization is often the mark of mediocrity.” (George E. P. Box)

xi. “Intense ultraviolet (UV) radiation from the young Sun acted on the atmosphere to form small amounts of very many gases. Most of these dissolved easily in water, and fell out in rain, making Earth’s surface water rich in carbon compounds. […] the most important chemical of all may have been cyanide (HCN). It would have formed easily in the upper atmosphere from solar radiation and meteorite impact, then dissolved in raindrops. Today it is broken down almost at once by oxygen, but early in Earth’s history it built up at low concentrations in lakes and oceans. Cyanide is a basic building block for more complex organic molecules such as amino acids and nucleic acid bases. Life probably evolved in chemical conditions that would kill us instantly!” (Richard Cowen, History of Life, p.8)

xii. “Dinosaurs dominated land communities for 100 million years, and it was only after dinosaurs disappeared that mammals became dominant. It’s difficult to avoid the suspicion that dinosaurs were in some way competitively superior to mammals and confined them to small body size and ecological insignificance. […] Dinosaurs dominated many guilds in the Cretaceous, including that of large browsers. […] in terms of their reconstructed behavior […] dinosaurs should be compared not with living reptiles, but with living mammals and birds. […] By the end of the Cretaceous there were mammals with varied sets of genes but muted variation in morphology. […] All Mesozoic mammals were small. Mammals with small bodies can play only a limited number of ecological roles, mainly insectivores and omnivores. But when dinosaurs disappeared at the end of the Cretaceous, some of the Paleocene mammals quickly evolved to take over many of their ecological roles” (ibid., pp. 145, 154, 222, 227-228)

xiii. “To consult the statistician after an experiment is finished is often merely to ask him to conduct a post mortem examination. He can perhaps say what the experiment died of.” (Ronald Fisher)

xiv. “Ideas are incestuous.” (Howard Raiffa)

xv. “Game theory […] deals only with the way in which ultrasmart, all knowing people should behave in competitive situations, and has little to say to Mr. X as he confronts the morass of his problem.” (-ll-)

xvi. “One of the principal objects of theoretical research is to find the point of view from which the subject appears in the greatest simplicity.” (Josiah Willard Gibbs)

xvii. “Nothing is as dangerous as an ignorant friend; a wise enemy is to be preferred.” (Jean de La Fontaine)

xviii. “Humility is a virtue all preach, none practice; and yet everybody is content to hear.” (John Selden)

xix. “Few men make themselves masters of the things they write or speak.” (-ll-)

xx. “Wise men say nothing in dangerous times.” (-ll-)

## Principles of Applied Statistics

“Statistical considerations arise in virtually all areas of science and technology and, beyond these, in issues of public and private policy and in everyday life. While the detailed methods used vary greatly in the level of elaboration involved and often in the way they are described, there is a unity of ideas which gives statistics as a subject both its intellectual challenge and its importance […] In this book we have aimed to discuss the ideas involved in applying statistical methods to advance knowledge and understanding. It is a book not on statistical methods as such but, rather, on how these methods are to be deployed […] We are writing partly for those working as applied statisticians, partly for subject-matter specialists using statistical ideas extensively in their work and partly for masters and doctoral students of statistics concerned with the relationship between the detailed methods and theory they are studying and the effective application of these ideas. Our aim is to emphasize how statistical ideas may be deployed fruitfully rather than to describe the details of statistical techniques.”

…

I gave the book five stars, but as noted in my review on goodreads I’m not sure the word ‘amazing’ is really fitting – however the book had a lot of good stuff and it had very little stuff for me to quibble about, so I figured it deserved a high rating. The book deals to a very large extent with topics which are in some sense common to pretty much all statistical analyses, regardless of the research context; formulation of research questions/hypotheses, data search, study designs, data analysis, and interpretation. The authors spend quite a few pages talking about hypothesis testing but on the other hand no pages talking about statistical information criteria, a topic with which I’m at this point at least reasonably familiar, and I figure if I had been slightly more critical I’d have subtracted a star for this omission – however I have the impression that I’m at times perhaps too hard on non-fiction books on goodreads so I decided not to punish the book for this omission. Part of the reason why I gave the book five stars is also that I’ve sort of wanted to read a book like this one for a while; I think in some sense it’s the first one of its kind I’ve read. I liked the way the book was structured.

Below I have added some observations from the book, as well as a few comments (I should note that I have had to leave out a lot of good stuff).

…

“When the data are very extensive, precision estimates calculated from simple standard statistical methods are likely to underestimate error substantially owing to the neglect of hidden correlations. A large amount of data is in no way synonymous with a large amount of information. In some settings at least, if a modest amount of poor quality data is likely to be modestly misleading, an extremely large amount of poor quality data may be extremely misleading.”
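The point about hidden correlations can be made concrete with a small simulation. This is only a sketch under assumptions of my own (the cluster structure, the sample sizes and the variances are all made up for illustration and do not come from the book): observations share a cluster-level effect, so they are correlated within clusters, and the textbook standard error that treats every value as independent comes out far too small.

```python
import math
import random
import statistics

random.seed(0)

N_CLUSTERS = 50    # hypothetical number of groups (e.g. clinics, households)
PER_CLUSTER = 20   # hypothetical number of observations per group

# Each observation = shared cluster effect + individual noise, so values
# within a cluster are positively correlated (intra-class correlation 0.5).
clusters = []
for _ in range(N_CLUSTERS):
    cluster_effect = random.gauss(0.0, 1.0)
    clusters.append(
        [cluster_effect + random.gauss(0.0, 1.0) for _ in range(PER_CLUSTER)]
    )

all_values = [y for cluster in clusters for y in cluster]

# Naive standard error of the mean: treats all 1000 values as independent.
naive_se = statistics.stdev(all_values) / math.sqrt(len(all_values))

# Cluster-aware standard error: treats the 50 cluster means as the
# independent units (valid here because all clusters have the same size).
cluster_means = [statistics.fmean(cluster) for cluster in clusters]
cluster_se = statistics.stdev(cluster_means) / math.sqrt(N_CLUSTERS)

print(f"naive SE   = {naive_se:.3f}")
print(f"cluster SE = {cluster_se:.3f}")
```

With these made-up settings the 1000 observations carry roughly the information of the 50 clusters, which is one way of reading the remark that a large amount of data is not the same thing as a large amount of information.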

“For studies of a new phenomenon it will usually be best to examine situations in which the phenomenon is likely to appear in the most striking form, even if this is in some sense artificial or not representative. This is in line with the well-known precept in mathematical research: study the issue in the simplest possible context that is not entirely trivial, and later generalize.”

“It often […] aids the interpretation of an observational study to consider the question: what would have been done in a comparable experiment?”

“An important and perhaps sometimes underemphasized issue in empirical prediction is that of stability. Especially when repeated application of the same method is envisaged, it is unlikely that the situations to be encountered will exactly mirror those involved in setting up the method. It may well be wise to use a procedure that works well over a range of conditions even if it is sub-optimal in the data used to set up the method.”

“Many investigations have the broad form of collecting similar data repeatedly, for example on different individuals. In this connection the notion of a *unit of analysis* is often helpful in clarifying an approach to the detailed analysis. Although this notion is more generally applicable, it is clearest in the context of randomized experiments. Here the unit of analysis is that smallest subdivision of the experimental material such that two distinct units *might* be randomized (randomly allocated) to different treatments. […] In general the unit of analysis may not be the same as the unit of interpretation, that is to say, the unit about which conclusions are to be drawn. The most difficult situation is when the unit of analysis is an aggregate of several units of interpretation, leading to the possibility of *ecological bias*, that is, a systematic difference between, say, the impact of explanatory variables at different levels of aggregation. […] it is important to identify the unit of analysis, which may be different in different parts of the analysis […] on the whole, limited detail is needed in examining the variation within the unit of analysis in question.”

The book briefly discusses issues pertaining to the scale of effort involved when thinking about appropriate study designs and how much/which data to gather for analysis, and notes that often associated costs are not quantified – rather a judgment call is made. An important related point is that e.g. in survey contexts response patterns will tend to depend upon the quantity of information requested; if you ask for too much, few people might reply (…and perhaps it’s also the case that it’s ‘the wrong people’ that reply? The authors don’t touch upon the potential selection bias issue, but it seems relevant). A few key observations from the book on this topic:

“the intrinsic quality of data, for example the response rates of surveys, may be degraded if too much is collected. […] sampling may give higher [data] quality than the study of a complete population of individuals. […] When researchers studied the effect of the expected length (10, 20 or 30 minutes) of a web-based questionnaire, they found that fewer potential respondents started and completed questionnaires expected to take longer (Galesic and Bosnjak, 2009). Furthermore, questions that appeared later in the questionnaire were given shorter and more uniform answers than questions that appeared near the start of the questionnaire.”

Not surprising, but certainly worth keeping in mind. Moving on…

“In general, while principal component analysis may be helpful in suggesting a base for interpretation and the formation of derived variables there is usually considerable arbitrariness involved in its use. This stems from the need to standardize the variables to comparable scales, typically by the use of correlation coefficients. This means that a variable that happens to have atypically small variability in the data will have a misleadingly depressed weight in the principal components.”
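To see why the choice of scale matters so much, here is a minimal two-variable sketch. Everything in it is an assumption made for illustration (the simulated data, the factor of 1000 between the two scales, and the closed-form 2×2 eigenvector helper); it is not an example from the book. The leading principal component is computed once from the raw covariance matrix and once from the correlation matrix, i.e. after standardization.

```python
import math
import random
import statistics

random.seed(1)

def covariance(u, v):
    """Sample covariance of two equally long sequences."""
    mu, mv = statistics.fmean(u), statistics.fmean(v)
    return sum((ui - mu) * (vi - mv) for ui, vi in zip(u, v)) / (len(u) - 1)

def leading_direction(a, b, c):
    """Unit eigenvector for the largest eigenvalue of [[a, b], [b, c]]."""
    theta = 0.5 * math.atan2(2.0 * b, a - c)
    return (math.cos(theta), math.sin(theta))

# Two hypothetical variables driven by the same latent factor, but recorded
# on very different scales (think metres versus millimetres).
n = 2000
latent = [random.gauss(0.0, 1.0) for _ in range(n)]
x_small = [z + random.gauss(0.0, 0.5) for z in latent]
x_large = [1000.0 * (z + random.gauss(0.0, 0.5)) for z in latent]

var_small = covariance(x_small, x_small)
var_large = covariance(x_large, x_large)
cov_sl = covariance(x_small, x_large)
r = cov_sl / math.sqrt(var_small * var_large)

# PCA on the raw covariance matrix: dominated by the large-scale variable.
cov_direction = leading_direction(var_small, cov_sl, var_large)

# PCA on the correlation matrix: both variables get equal weight.
corr_direction = leading_direction(1.0, r, 1.0)

print("covariance-based loadings: ", cov_direction)
print("correlation-based loadings:", corr_direction)
```

The two sets of loadings describe the same data yet look nothing alike, so the weights depend on a scaling decision made by the analyst. That is the arbitrariness the authors point to; the sketch does not try to reproduce their more specific remark about variables with atypically small variability.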

The book includes a few pages about the Berkson error model, which I’d never heard of. Wikipedia doesn’t have much about it, and I debated how much to include here – if the Wikipedia article had covered the topic in any detail I probably would have just linked to it, but it doesn’t. However, it seemed important enough to write a few words about. The basic difference between the ‘classical error model’, i.e. the one everybody knows about, and the Berkson error model is that in the former case the measurement error is statistically independent of the *true value* of X, whereas in the latter case the measurement error is independent of the *measured value*; the authors note that this implies that the true values are more variable than the measured values in a Berkson error context. Berkson errors can e.g. arise in experimental contexts where levels of a variable are pre-set by some target, for example in a medical context where a drug is supposed to be administered every X hours; the pre-set levels might then be the measured values, and the true values might differ, e.g. if the nurse was late. I thought it important to mention this error model not only because the idea that you might encounter this sort of error-generating process was completely new to me, but also because there is no statistical test you can use to figure out whether the classical error model or a Berkson error model is the appropriate one; this means that you need to be aware of the difference and think about which model works best, based on the nature of the measuring process.
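A small simulation may help fix the distinction. It is a sketch under my own assumptions (normal errors, a linear outcome with a made-up slope of 2, and made-up variances), not something taken from the book. It checks which of the two quantities is more variable under each model, and also shows the standard textbook consequence for a simple linear regression on the measured values: classical error attenuates the slope, whereas Berkson error leaves it roughly unbiased.

```python
import random
import statistics

random.seed(2)

def slope(x, y):
    """Least-squares slope of y on x (with intercept)."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    sxx = sum((xi - mx) ** 2 for xi in x)
    return sxy / sxx

N = 20000
TRUE_SLOPE = 2.0   # hypothetical effect of the true value on the outcome

def outcome(true_x):
    """The outcome depends on the TRUE value, never on the measured one."""
    return TRUE_SLOPE * true_x + random.gauss(0.0, 0.5)

# Classical error: measured = true + error, error independent of the true value.
true_c = [random.gauss(0.0, 1.0) for _ in range(N)]
measured_c = [x + random.gauss(0.0, 1.0) for x in true_c]
y_c = [outcome(x) for x in true_c]

# Berkson error: true = measured (the pre-set target) + error,
# error independent of the measured value.
measured_b = [random.gauss(0.0, 1.0) for _ in range(N)]
true_b = [w + random.gauss(0.0, 1.0) for w in measured_b]
y_b = [outcome(x) for x in true_b]

slope_classical = slope(measured_c, y_c)
slope_berkson = slope(measured_b, y_b)

print("classical: var(true) =", round(statistics.variance(true_c), 2),
      " var(measured) =", round(statistics.variance(measured_c), 2),
      " slope on measured =", round(slope_classical, 2))
print("Berkson:   var(true) =", round(statistics.variance(true_b), 2),
      " var(measured) =", round(statistics.variance(measured_b), 2),
      " slope on measured =", round(slope_berkson, 2))
```

Nothing in the measured values and outcomes alone tells you which of the two processes generated them, which is why the choice has to come from knowledge of how the measurements were made.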

Let’s move on to some quotes dealing with modeling:

“while it is appealing to use methods that are in a reasonable sense fully efficient, that is, extract all relevant information in the data, nevertheless any such notion is within the framework of an assumed model. Ideally, methods should have this efficiency property while preserving good behaviour (especially stability of interpretation) when the model is perturbed. Essentially a model translates a subject-matter question into a mathematical or statistical one and, if that translation is seriously defective, the analysis will address a wrong or inappropriate question […] The greatest difficulty with quasi-realistic models [as opposed to ‘toy models’] is likely to be that they require numerical specification of features for some of which there is very little or no empirical information. Sensitivity analysis is then particularly important.”

“Parametric models typically represent some notion of smoothness; their danger is that particular representations of that smoothness may have strong and unfortunate implications. This difficulty is covered for the most part by informal checking that the primary conclusions do not depend critically on the precise form of parametric representation. To some extent such considerations can be formalized but in the last analysis some element of judgement cannot be avoided. One general consideration that is sometimes helpful is the following. If an issue can be addressed nonparametrically then it will often be better to tackle it parametrically; however, if it cannot be resolved nonparametrically then it is usually dangerous to resolve it parametrically.”

“Once a model is formulated two types of question arise. How can the unknown parameters in the model best be estimated? Is there evidence that the model needs modification or indeed should be abandoned in favour of some different representation? The second question is to be interpreted not as asking whether the model is true [*this is the wrong question to ask, as also emphasized by Burnham & Anderson*] but whether there is clear evidence of a specific kind of departure implying a need to change the model so as to avoid distortion of the final conclusions. […] it is important in applications to understand the circumstances under which different methods give similar or different conclusions. In particular, if a more elaborate method gives an apparent improvement in precision, what are the assumptions on which that improvement is based? Are they reasonable? […] the hierarchical principle implies, […] with very rare exceptions, that models with interaction terms should include also the corresponding main effects. […] When considering two families of models, it is important to consider the possibilities that both families are adequate, that one is adequate and not the other and that neither family fits the data.” [Do incidentally recall that in the context of interactions, “the term interaction […] is in some ways a misnomer. There is no necessary implication of interaction in the physical sense or synergy in a biological context. Rather, interaction means a departure from additivity […] This is expressed most explicitly by the requirement that, apart from random fluctuations, the difference in outcome between any two levels of one factor is the same at all levels of the other factor. […] The most directly interpretable form of interaction, certainly not removable by [variable] transformation, is effect reversal.”]
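The definition of interaction as a departure from additivity is easy to check numerically. The sketch below uses three made-up 2×2 tables of cell means (all numbers and function names are hypothetical, chosen only for illustration): one additive, one where the interaction disappears on the log scale, and one showing effect reversal, which no monotone transformation can remove.

```python
import math

def simple_effects(cell_means):
    """Effect of factor B at each level of factor A.

    cell_means[i][j] is the mean outcome at level i of A and level j of B.
    """
    effect_b_at_a0 = cell_means[0][1] - cell_means[0][0]
    effect_b_at_a1 = cell_means[1][1] - cell_means[1][0]
    return effect_b_at_a0, effect_b_at_a1

def interaction_contrast(cell_means):
    """Zero exactly when the effect of B is the same at both levels of A."""
    effect_b_at_a0, effect_b_at_a1 = simple_effects(cell_means)
    return effect_b_at_a1 - effect_b_at_a0

def is_effect_reversal(cell_means):
    """True when the effect of B changes sign between the levels of A."""
    effect_b_at_a0, effect_b_at_a1 = simple_effects(cell_means)
    return effect_b_at_a0 * effect_b_at_a1 < 0

def log_scale(cell_means):
    """Apply a log transformation to every cell mean."""
    return [[math.log(m) for m in row] for row in cell_means]

additive = [[10.0, 12.0], [15.0, 17.0]]        # B adds 2 at both levels of A
multiplicative = [[10.0, 20.0], [30.0, 60.0]]  # B doubles the outcome
reversal = [[10.0, 12.0], [15.0, 11.0]]        # B helps at A=0, hurts at A=1

for name, table in [("additive", additive),
                    ("multiplicative", multiplicative),
                    ("reversal", reversal)]:
    print(name,
          "| raw contrast:", interaction_contrast(table),
          "| log-scale contrast:", round(interaction_contrast(log_scale(table)), 6),
          "| reversal:", is_effect_reversal(table))
```

The multiplicative table has a non-zero contrast on the raw scale and a zero contrast on the log scale, so that "interaction" is an artefact of the scale chosen; the reversal table keeps its sign change under the log, in line with the remark that effect reversal is not removable by transformation.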

“The *p*-value assesses the data […] via a comparison with that anticipated if *H*₀ were true. If in two different situations the test of a relevant null hypothesis gives approximately the same *p*-value, it does not follow that the overall strengths of the evidence in favour of the relevant *H*₀ are the same in the two cases.”

“There are […] two sources of uncertainty in observational studies that are not present in randomized experiments. The first is that the ordering of the variables may be inappropriate, a particular hazard in cross-sectional studies. […] if the data are tied to one time point then any presumption of causality relies on a working hypothesis as to whether the components are explanatory or responses. Any check on this can only be from sources external to the current data. […] The second source of uncertainty is that important explanatory variables affecting both the potential cause and the outcome may not be available. […] Retrospective explanations may be convincing if based on firmly established theory but otherwise need to be treated with special caution. It is well known in many fields that ingenious explanations can be constructed retrospectively for almost any finding.”
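The second source of uncertainty, an explanatory variable that affects both the potential cause and the outcome, can be illustrated with a toy simulation. The setup is entirely made up (a single normally distributed confounder, unit variances, and no causal effect of the exposure at all); it is a sketch of the general idea, not an example from the book.

```python
import random
import statistics

random.seed(3)

def slope(x, y):
    """Least-squares slope of y on x (with intercept)."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    sxx = sum((xi - mx) ** 2 for xi in x)
    return sxy / sxx

def residuals(x, y):
    """Residuals of y after a least-squares fit on x (with intercept)."""
    b = slope(x, y)
    mx, my = statistics.fmean(x), statistics.fmean(y)
    return [(yi - my) - b * (xi - mx) for xi, yi in zip(x, y)]

N = 20000
confounder = [random.gauss(0.0, 1.0) for _ in range(N)]

# The confounder drives both variables; the exposure has NO effect on the outcome.
exposure = [z + random.gauss(0.0, 1.0) for z in confounder]
outcome = [z + random.gauss(0.0, 1.0) for z in confounder]

# Analysis when the confounder is not available: a clear spurious association.
naive_slope = slope(exposure, outcome)

# Analysis when the confounder is available: regress the part of the outcome
# not explained by the confounder on the part of the exposure not explained by it.
adjusted_slope = slope(residuals(confounder, exposure),
                       residuals(confounder, outcome))

print(f"slope ignoring the confounder:      {naive_slope:.3f}")
print(f"slope adjusting for the confounder: {adjusted_slope:.3f}")
```

The adjustment is only possible because the simulated confounder was recorded. In a real observational study the worry is precisely that it was not, and nothing in the exposure and outcome data alone reveals the problem.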

“The general issue of applying conclusions from aggregate data to specific individuals is essentially that of showing that the individual does not belong to a subaggregate for which a substantially different conclusion applies. In actuality this can at most be indirectly checked for specific subaggregates. […] It is not unknown in the literature to see conclusions such as that there are no treatment differences except for males aged over 80 years, living more than 50 km south of Birmingham and life-long supporters of Aston Villa football club, who show a dramatic improvement under some treatment *T*. Despite the undoubted importance of this particular subgroup, virtually always such conclusions would seem to be unjustified.” [*I loved this example!*]

The authors included a few interesting results from an undated Cochrane publication which I thought I should mention. The file-drawer effect is well known, but there are a few other interesting biases at play in a publication bias context. One is time-lag bias, which means that statistically significant results take less time to get published. Another is language bias; statistically significant results are more likely to be published in English publications. A third bias is multiple publication bias; it turns out that papers with statistically significant results are more likely to be published more than once. The last one mentioned is citation bias; papers with statistically significant results are more likely to be cited in the literature.
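As a rough illustration of why selective publication matters, here is a sketch of the plain file-drawer effect only; it does not model time-lag, language, multiple publication or citation bias, and the true effect, study size and number of studies are made-up values. Many small studies estimate the same modest effect, but only the statistically significant ones get "published".

```python
import math
import random
import statistics

random.seed(4)

TRUE_EFFECT = 0.1    # hypothetical true mean effect
N_PER_STUDY = 50     # hypothetical sample size of each study
N_STUDIES = 5000     # hypothetical number of studies carried out

# Standard error of a study's mean estimate, assuming unit-variance outcomes.
SE = 1.0 / math.sqrt(N_PER_STUDY)

# Each study's estimate is the true effect plus sampling noise.
estimates = [random.gauss(TRUE_EFFECT, SE) for _ in range(N_STUDIES)]

# Only estimates significant at the two-sided 5% level are "published".
published = [e for e in estimates if abs(e / SE) > 1.96]

mean_all = statistics.fmean(estimates)
mean_published = statistics.fmean(published)

print(f"share of studies published:  {len(published) / N_STUDIES:.2%}")
print(f"mean estimate, all studies:  {mean_all:.3f}")
print(f"mean estimate, published:    {mean_published:.3f}")
```

A reader who only sees the published studies would average estimates that are systematically too large, even though every individual study was analysed correctly.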

The authors include these observations in their concluding remarks: “The overriding general principle [in the context of applied statistics], difficult to achieve, is that there should be a seamless flow between statistical and subject-matter considerations. […] in principle seamlessness requires an individual statistician to have views on subject-matter interpretation and subject-matter specialists to be interested in issues of statistical analysis.”

As already mentioned this is a good book. It’s not long, and/but it’s worth reading if you’re in the target group.

## Quotes

i. “By all means think yourself big but don’t think everyone else small” (‘Notes on Flyleaf of Fresh ms. Book’, *Scott’s Last Expedition*. See also this).

ii. “The man who knows everyone’s job isn’t much good at his own.” (-ll-)

iii. “It is amazing what little harm doctors do when one considers all the opportunities they have” (Mark Twain, as quoted in the Oxford Handbook of Clinical Medicine, p.595).

iv. “A first-rate theory predicts; a second-rate theory forbids and a third-rate theory explains after the event.” (Aleksander Isaakovich Kitaigorodski)

v. “[S]ome of the most terrible things in the world are done by people who think, genuinely think, that they’re doing it for the best” (Terry Pratchett, Snuff).

vi. “That was excellently observ’d, say I, when I read a Passage in an Author, where his Opinion agrees with mine. When we differ, there I pronounce him to be mistaken.” (Jonathan Swift)

vii. “Death is nature’s master stroke, albeit a cruel one, because it allows genotypes space to try on new phenotypes.” (Quote from the Oxford Handbook of Clinical Medicine, p.6)

viii. “The purpose of models is not to fit the data but to sharpen the questions.” (Samuel Karlin)

ix. “We may […] view set theory, and mathematics generally, in much the way in which we view theoretical portions of the natural sciences themselves; as comprising truths or hypotheses which are to be vindicated less by the pure light of reason than by the indirect systematic contribution which they make to the organizing of empirical data in the natural sciences.” (Quine)

x. “At root what is needed for scientific inquiry is just receptivity to data, skill in reasoning, and yearning for truth. Admittedly, ingenuity can help too.” (-ll-)

xi. “A statistician carefully assembles facts and figures for others who carefully misinterpret them.” (Quote from *Mathematically Speaking – A Dictionary of Quotations*, p.329. Only source given in the book is: “Quoted in Evan Esar, *20,000 Quips and Quotes*”)

xii. “A knowledge of statistics is like a knowledge of foreign languages or of algebra; it may prove of use at any time under any circumstances.” (Quote from *Mathematically Speaking – A Dictionary of Quotations*, p. 328. The source provided is: “Elements of Statistics, Part I, Chapter I (p.4)”).

xiii. “We own to small faults to persuade others that we have not great ones.” (Rochefoucauld)

xiv. “There is more self-love than love in jealousy.” (-ll-)

xv. “We should not judge of a man’s merit by his great abilities, but by the use he makes of them.” (-ll-)

xvi. “We should gain more by letting the world see what we are than by trying to seem what we are not.” (-ll-)

xvii. “Put succinctly, a prospective study looks for the effects of causes whereas a retrospective study examines the causes of effects.” (Quote from p.49 of *Principles of Applied Statistics*, by Cox & Donnelly)

xviii. “… he who seeks for methods without having a definite problem in mind seeks for the most part in vain.” (David Hilbert)

xix. “Give every man thy ear, but few thy voice” (Shakespeare).

xx. “Often the fear of one evil leads us into a worse.” (Nicolas Boileau-Despréaux)

## The Nature of Statistical Evidence

Here’s my goodreads review of the book.

As I’ve observed many times before, a wordpress blog like mine is not a particularly nice place to cover mathematical topics involving equations and lots of Greek letters, so the coverage below will be more or less purely conceptual; don’t take this to mean that the book doesn’t contain formulas. Some parts of the book look like this:

That of course makes the book hard to blog, and not just because the equations are typographically hard to deal with. In general it’s hard to talk about the content of a book like this one without going into *a lot* of details outlining how you get from A to B to C – usually you’re only really interested in C, but you need A and B to make sense of C. At this point I’ve sort of concluded that when covering books like this one I’ll only cover some of the main themes which are easy to discuss in a blog post, and that I should skip (potentially important) points which might also be of interest if they’re difficult to discuss in a small amount of space, which is unfortunately often the case. I should perhaps observe that although I noted in my goodreads review that in a way there was a bit too much philosophy and a bit too little statistics in the coverage for my taste, you should definitely not take that objection to mean that this book is full of fluff; a lot of that philosophical stuff is ‘formal logic’ type stuff and related comments, and the book in general is quite dense. As I also noted in the goodreads review I didn’t read this book as carefully as I might have done – for example I skipped a couple of the technical proofs because they didn’t seem to be worth the effort – and I’d probably need to read it again to fully understand some of the minor points made throughout the more technical parts of the coverage. That’s of course a related reason why I don’t cover the book in a great amount of detail here: it’s hard work just to read the damn thing, and talking about the technical stuff in detail here as well would definitely be overkill, even if it would surely make me understand the material better.

I have added some observations from the coverage below. I’ve tried to clarify beforehand which question/topic the quote in question deals with, to ease reading/understanding of the topics covered.

…

On how statistical methods are related to experimental science:

“statistical methods have aims similar to the process of experimental science. But statistics is not itself an experimental science, it consists of models of how to do experimental science. Statistical theory is a logical — mostly mathematical — discipline; its findings are not subject to experimental test. […] The primary sense in which statistical theory is a science is that it guides and explains statistical methods. A sharpened statement of the purpose of this book is to provide explanations of the senses in which some statistical methods provide scientific evidence.”

On mathematics and axiomatic systems (the book goes into much more detail than this):

“It is not sufficiently appreciated that a link is needed between mathematics and methods. Mathematics is not about the world until it is interpreted and then it is only about models of the world […]. No contradiction is introduced by either interpreting the same theory in different ways or by modeling the same concept by different theories. […] In general, a primitive undefined term is said to be **interpreted** when a meaning is assigned to it and when all such terms are interpreted we have an **interpretation** of the axiomatic system. It makes no sense to ask which is the correct interpretation of an axiom system. This is a primary strength of the axiomatic method; we can use it to organize and structure our thoughts and knowledge by simultaneously and economically treating all interpretations of an axiom system. It is also a weakness in that failure to define or interpret terms leads to much confusion about the implications of theory for application.”

It’s all about models:

“The scientific method of theory checking is to compare predictions deduced from a theoretical model with observations on nature. Thus science must predict what happens in nature but it need not explain why. […] whether experiment is consistent with theory is relative to accuracy and purpose. All theories are simplifications of reality and hence no theory will be expected to be a perfect predictor. Theories of statistical inference become relevant to scientific process at precisely this point. […] Scientific method is a practice developed to deal with experiments on **nature.** Probability theory is a deductive study of the properties of **models** of such experiments. All of the theorems of probability are results about models of experiments.”

But given a frequentist interpretation you can test your statistical theories with the real world, right? Right? Well…

“How might we check the long run stability of relative frequency? If we are to compare mathematical theory with experiment then only finite sequences can be observed. But for the Bernoulli case, the event that frequency approaches probability is stochastically independent of any sequence of finite length. […] Long-run stability of relative frequency cannot be checked experimentally. There are neither theoretical nor empirical guarantees that, a priori, one can recognize experiments performed under uniform conditions and that under these circumstances one *will* obtain stable frequencies.” [related link]

What should we expect to get out of mathematical and statistical theories of inference?

“What can we expect of a theory of statistical inference? We can expect an internally consistent explanation of why certain conclusions follow from certain data. The theory will not be about inductive rationality but about a *model* of inductive rationality. Statisticians are used to thinking that they apply their logic to models of the physical world; less common is the realization that their logic itself is only a model. Explanation will be in terms of introduced concepts which do not exist in nature. Properties of the concepts will be derived from assumptions which merely seem reasonable. This is the only sense in which the axioms of any mathematical theory are true […] We can expect these concepts, assumptions, and properties to be intuitive but, unlike natural science, they cannot be checked by experiment. Different people have different ideas about what “seems reasonable,” so we can expect different explanations and different properties. We should not be surprised if the theorems of two different theories of statistical evidence differ. If two models had no different properties then they would be different versions of the same model […] We should not expect to achieve, by mathematics alone, a single coherent theory of inference, for mathematical truth is conditional and the assumptions are not “self-evident.” Faith in a set of assumptions would be needed to achieve a single coherent theory.”

On disagreements about the nature of statistical evidence:

“The context of this section is that there is disagreement among experts about the nature of statistical evidence and consequently much use of one formulation to criticize another. Neyman (1950) maintains that, from his behavioral hypothesis testing point of view, Fisherian significance tests do not express evidence. Royall (1997) employs the “law” of likelihood to criticize hypothesis as well as significance testing. Pratt (1965), Berger and Sellke (1987), Berger and Berry (1988), and Casella and Berger (1987) employ Bayesian theory to criticize sampling theory. […] Critics assume that their findings are about evidence, but they are at most about models of evidence. Many theoretical statistical criticisms, when stated in terms of evidence, have the following outline: According to model A, evidence satisfies proposition P. But according to model B, which is correct since it is derived from “self-evident truths,” P is not true. Now evidence can’t be two different ways so, since B is right, A must be wrong. Note that the argument is symmetric: since A appears “self-evident” (to adherents of A) B must be wrong. But both conclusions are invalid since evidence can be modeled in different ways, perhaps useful in different contexts and for different purposes. From the observation that P is a theorem of A but not of B, all we can properly conclude is that A and B are different models of evidence. […] The common practice of using one theory of inference to critique another is a misleading activity.”

Is mathematics a science?

“Is mathematics a science? It is certainly systematized knowledge much concerned with structure, but then so is history. Does it employ the scientific method? Well, partly; hypothesis and deduction are the essence of mathematics and the search for counter examples is a mathematical counterpart of experimentation; but the question is not put to nature. Is mathematics about nature? In part. The hypotheses of most mathematics are suggested by some natural primitive concept, for it is difficult to think of interesting hypotheses concerning nonsense syllables and to check their consistency. However, it often happens that as a mathematical subject matures it tends to evolve away from the original concept which motivated it. Mathematics in its purest form is probably not natural science since it lacks the experimental aspect. Art is sometimes defined to be creative work displaying form, beauty and unusual perception. By this definition pure mathematics is clearly an art. On the other hand, applied mathematics, taking its hypotheses from real world concepts, is an attempt to describe nature. Applied mathematics, without regard to experimental verification, is in fact largely the “conditional truth” portion of science. If a body of applied mathematics has survived experimental test to become trustworthy belief then it is the essence of natural science.”

Then what about statistics – is statistics a science?

“Statisticians can and do make contributions to subject matter fields such as physics, and demography but statistical theory and methods proper, distinguished from their findings, are not like physics in that they are not about nature. […] Applied statistics is natural science but the findings are about the subject matter field not statistical theory or method. […] Statistical theory helps with how to do natural science but it is not itself a natural science.”

…

I should note that I am, and have for a long time been, in broad agreement with the author’s remarks on the nature of science and mathematics above. Popper, among many others, discussed this topic a long time ago e.g. in The Logic of Scientific Discovery and I’ve basically been of the opinion that (‘pure’) mathematics is not science (‘but rather ‘something else’ … and that doesn’t mean it’s not useful’) for probably a decade. I’ve had a harder time coming to terms with how precisely to deal with statistics in terms of these things, and in that context the book has been conceptually helpful.

Below I’ve added a few links to other stuff also covered in the book:

Propositional calculus.

Kolmogorov’s axioms.

Neyman-Pearson lemma.

Radon-Nikodym theorem. (not covered in the book, but the necessity of using ‘a Radon-Nikodym derivative’ to obtain an answer to a question being asked was remarked upon at one point, and I had no clue what he was talking about – it seems that the stuff in the link was what he was talking about).

A very specific and relevant link: Berger and Wolpert (1984). The stuff about Birnbaum’s argument covered from p.24 (p.40) and forward is covered in some detail in the book. The author is critical of the model and explains in the book in some detail why that is. See also: *On the foundations of statistical inference* (Birnbaum, 1962).

## Cost-effectiveness analysis in health care (III)

This will be my last post about the book. Yesterday I finished reading Darwin’s Origin of Species, which was my 100th book this year (here’s the list), but I can’t face blogging that book at the moment so coverage of that one will have to wait a bit.

In my second post about this book I had originally planned to cover chapter 7 – ‘Analysing costs’ – but as I didn’t want to spend too much time on the post I ended up cutting it short. This omission of coverage in the last post means that some themes to be discussed below are closely related to stuff covered in the second post, whereas on the other hand most of the remaining material, more specifically the material from chapters 8, 9 and 10, deals with decision analytic modelling, a quite different topic; in other words the coverage will be slightly more fragmented and less structured than I’d have liked it to be, but there’s not really much to do about that (it doesn’t help in this respect that I decided to not cover chapter 8, but doing that as well was out of the question).

I’ll start with coverage of some of the things they talk about in chapter 7, which as mentioned deals with how to analyze costs in a cost-effectiveness analysis context. They observe in the chapter that health cost data are often skewed to the right, for several reasons (costs incurred by an individual cannot be negative; for many patients the costs may be zero; some study participants may require much more care than the rest, creating a long tail). One way to address skewness is to use the median instead of the mean as the variable of interest, but a problem with this approach is that the median is less useful to policy-makers than the mean: the mean times the size of the population of interest gives a good estimate of the total costs of an intervention, whereas the median is not of much help in arriving at such an estimate. Transforming the data and analyzing the transformed data is another way to deal with skewness, but the use of transformations in cost-effectiveness analysis has been questioned for a variety of reasons discussed in the chapter (to give a couple of examples, data transformation methods perform badly if inappropriate transformations are used, and many transformations cannot be used if there are data points with zero costs in the data, which is very common). Of the non-parametric methods aimed at dealing with skewness they discuss a variety of tests which are rarely used, as well as the bootstrap, the latter being one approach which has gained widespread use.
They observe in the context of the bootstrap that “it has increasingly been recognized that the conditions the bootstrap requires to produce reliable parameter estimates are not fundamentally different from the conditions required by parametric methods” and note in a later chapter (chapter 11) that: “it is not clear that bootstrap results in the presence of severe skewness are likely to be any more or less valid than parametric results […] bootstrap and parametric methods both rely on sufficient sample sizes and are likely to be valid or invalid in similar circumstances. Instead, interest in the bootstrap has increasingly focused on its usefulness in dealing simultaneously with issues such as censoring, missing data, multiple statistics of interest such as costs and effects, and non-normality.” Going back to the coverage in chapter 7, in the context of skewness they also briefly touch upon the potential use of a GLM framework to address this problem.
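To make the bootstrap idea a bit more concrete, here is a minimal sketch of a percentile bootstrap confidence interval for the mean cost. This is my own illustration, not something taken from the book: the cost data are made up, and the function name `bootstrap_mean_ci`, the number of resamples and the seed are all arbitrary assumptions.

```python
import random
import statistics


def bootstrap_mean_ci(costs, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the mean of `costs`."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    n = len(costs)
    means = []
    for _ in range(n_resamples):
        # Resample n observations with replacement and record the mean.
        resample = [costs[rng.randrange(n)] for _ in range(n)]
        means.append(statistics.fmean(resample))
    means.sort()
    lower = means[int((alpha / 2) * n_resamples)]
    upper = means[int((1 - alpha / 2) * n_resamples) - 1]
    return statistics.fmean(costs), lower, upper


# Made-up, right-skewed cost data: a few zeros and a long right tail.
costs = [0, 0, 0, 120, 150, 200, 260, 310, 400, 520, 800, 1500, 4200, 9800]
mean_cost, lower, upper = bootstrap_mean_ci(costs)
median_cost = statistics.median(costs)
```

Note how far the mean sits above the median in data like these, which is the reason the choice between the two matters for total cost estimates. With only 14 observations the interval should not be trusted much; as the authors point out, the bootstrap relies on sufficient sample sizes just as parametric methods do.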

Data is often missing in cost datasets. Some parts of their coverage of these topics were to me but a review of stuff already covered in Bartholomew. Data can be missing for different reasons and through different mechanisms; one distinction is among data missing completely at random (MCAR), missing at random (MAR) (“missing data are correlated in an observable way with the mechanism that generates the cost, i.e. after adjusting the data for observable differences between complete and missing cases, the cost for those with missing data is the same, except for random variation, as for those with complete data”), and not missing at random (NMAR); the last type is also called non-ignorably missing data, and if you have that sort of data the implication is that the costs of those in the observed and unobserved groups differ in unpredictable ways, and if you ignore the process that drives these differences you’ll probably end up with a biased estimator. Another way to distinguish between different types of missing data is to look at patterns within the dataset, where you have:

* “**univariate missingness** – a single variable in a dataset is causing a problem through missing values, while the remaining variables contain complete information

* **unit non-response** – no data are recorded for any of the variables for some patients

* **monotone missing** – caused, for example, by drop-out in panel or longitudinal studies, resulting in variables observed up to a certain time point or wave but not beyond that

* **multivariate missing** – also called item non-response or general missingness, where some but not all of the variables are missing for some of the subjects.”

The authors note that the most common types of missingness in cost information analyses are the latter two. They discuss some techniques for dealing with missing data, such as complete-case analysis, available-case analysis, and imputation, but I won’t go into the details here. In the last parts of the chapter they talk a little bit about censoring, which can be viewed as a specific type of missing data, and ways to deal with it. Censoring happens when follow-up information on some subjects is not available for the full duration of interest, which may be caused e.g. by attrition (people dropping out of the trial), or insufficient follow-up (the final date of follow-up might be set before all patients reach the endpoint of interest, e.g. death). The two most common methods for dealing with censored cost data are the Kaplan-Meier sample average (KMSA) estimator and the inverse probability weighting (IPW) estimator, both of which are non-parametric interval methods. “Comparisons of the IPW and KMSA estimators have shown that they both perform well over different levels of censoring […], and both are considered reasonable approaches for dealing with censoring.” One difference between the two is that the KMSA, unlike the IPW, is not appropriate for dealing with censoring due to attrition unless the attrition is MCAR (and it almost never is), because the KM estimator, and by extension the KMSA estimator, assumes that censoring is independent of the event of interest.

The focus in chapter 8 is on decision tree models, and I decided to skip that chapter as most of it is known stuff which I felt no need to review here (do remember that I to a large extent use this blog as an extended memory, so I’m not only(/mainly?) writing this stuff for other people..). Chapter 9 deals with Markov models, and I’ll talk a little bit about those in the following.

“Markov models analyse uncertain processes over time. They are suited to decisions where the timing of events is important and when events may happen more than once, and therefore they are appropriate where the strategies being evaluated are of a sequential or repetitive nature. Whereas decision trees model uncertain events at chance nodes, Markov models differ in modelling uncertain events as transitions between health states. In particular, Markov models are suited to modelling long-term outcomes, where costs and effects are spread over a long period of time. Therefore Markov models are particularly suited to chronic diseases or situations where events are likely to recur over time […] Over the last decade there has been an increase in the use of Markov models for conducting economic evaluations in a health-care setting […]

A Markov model comprises a finite set of health states in which an individual can be found. The states are such that in any given time interval, the individual will be in only one health state. All individuals in a particular health state have identical characteristics. The number and nature of the states are governed by the decisions problem. […] Markov models are concerned with transitions during a series of cycles consisting of short time intervals. The model is run for several cycles, and patients move between states or remain in the same state between cycles […] Movements between states are defined by transition probabilities which can be time dependent or constant over time. All individuals within a given health state are assumed to be identical, and this leads to a limitation of Markov models in that the transition probabilities only depend on the current health state and not on past health states […the process is memoryless…] – this is known as the Markovian assumption”.

They note that in order to build and analyze a Markov model, you need to do the following:

* define states and allowable transitions [for example from ‘non-dead’ to ‘dead’ is okay, but going the other way is, well… For a Markov process to end, you need at least one state that cannot be left after it has been reached, and those states are termed ‘absorbing states’]
* specify initial conditions in terms of starting probabilities/initial distribution of patients
* specify transition probabilities
* specify a cycle length
* set a stopping rule
* determine rewards
* implement discounting if required
* analyse and evaluate the model
* explore uncertainties

They talk about each step in more detail in the book, but I won’t go too much into this.
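To illustrate what those steps can look like in practice, here is a small cohort simulation I put together; it is a sketch under my own assumptions, not the book’s model. The three states, the transition probabilities, the per-cycle costs and utility weights, the one-year cycle length, the 40-cycle stopping rule and the 3.5% discount rate are all hypothetical numbers chosen for illustration.

```python
STATES = ["well", "sick", "dead"]  # "dead" is the absorbing state

# Hypothetical per-cycle transition probabilities; each row sums to 1.
P = [
    [0.85, 0.10, 0.05],  # from well
    [0.00, 0.80, 0.20],  # from sick
    [0.00, 0.00, 1.00],  # from dead (cannot be left)
]
COST = [100.0, 2000.0, 0.0]  # reward: cost per cycle spent in each state
UTILITY = [0.90, 0.60, 0.0]  # reward: QALYs per one-year cycle in each state


def run_cohort(start, n_cycles, discount_rate=0.035):
    """Run the cohort for n_cycles and return the trace plus discounted totals.

    State membership is counted at the end of each cycle.
    """
    dist = list(start)
    trace = [dist]
    total_cost = 0.0
    total_qaly = 0.0
    for cycle in range(1, n_cycles + 1):
        # New share in state j = sum over i of (share in i) * P[i][j].
        dist = [sum(dist[i] * P[i][j] for i in range(3)) for j in range(3)]
        trace.append(dist)
        discount = 1.0 / (1.0 + discount_rate) ** cycle
        total_cost += discount * sum(d * c for d, c in zip(dist, COST))
        total_qaly += discount * sum(d * u for d, u in zip(dist, UTILITY))
    return trace, total_cost, total_qaly


# Initial conditions: the whole cohort starts in the "well" state.
trace, total_cost, total_qaly = run_cohort([1.0, 0.0, 0.0], n_cycles=40)
```

Because the transition matrix here is the same in every cycle, the only thing determining where people go next is the state they are currently in, which is exactly the Markovian assumption described in the quote above.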

Markov models may be governed by transitions that are either constant over time or time-dependent. In a Markov *chain* transition probabilities are constant over time, whereas in a Markov *process* transition probabilities vary over time (/from cycle to cycle). In a simple Markov model the baseline assumption is that transitions only occur once in each cycle and usually the transition is modelled as taking place either at the beginning or the end of cycles, but in reality transitions can take place at any point in time during the cycle. One way to deal with the problem of misidentification (people assumed to be in one health state throughout the cycle even though they’ve transferred to another health state during the cycle) is to use half-cycle corrections, in which an assumption is made that on average state transitions occur halfway through the cycle, instead of at the beginning or the end of a cycle. They note that: “the important principle with the half-cycle correction is not when the transitions occur, but when state membership (i.e. the proportion of the cohort in that state) is counted. The longer the cycle length, the more important it may be to use half-cycle corrections.” When state transitions are assumed to take place may influence factors such as cost discounting (if the cycle is long, it can be important to get the state transition timing reasonably right).
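Here is a small sketch of what the half-cycle correction does to a simple outcome, life-years accumulated in an ‘alive’ state. The implementation is one common way of doing it (averaging state membership at the start and end of each cycle) and is my assumption, not necessarily the book’s exact procedure; the 10% per-cycle mortality is made up.

```python
def life_years(alive_trace, cycle_length=1.0, half_cycle=True):
    """Life-years accumulated by a cohort.

    alive_trace[k] is the proportion alive after k cycles (k = 0..n).
    """
    if half_cycle:
        # Count membership halfway through each cycle, approximated by
        # the average of the proportions at the start and end of the cycle.
        per_cycle = [(a + b) / 2 for a, b in zip(alive_trace, alive_trace[1:])]
    else:
        # Count membership at the end of each cycle only.
        per_cycle = alive_trace[1:]
    return cycle_length * sum(per_cycle)


# Hypothetical cohort with a constant 10% mortality per cycle, 20 cycles.
alive = [0.9 ** k for k in range(21)]
uncorrected = life_years(alive, half_cycle=False)
corrected = life_years(alive, half_cycle=True)
```

In this setup the corrected figure exceeds the uncorrected one by exactly half the share of the cohort that died during the model run, scaled by the cycle length, so the gap grows with the cycle length, in line with the remark quoted above.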

When time dependency is introduced into the model, there are in general two types of time dependencies that impact on transition probabilities in the models. One is time dependency depending on the number of cycles since the start of the model (this is e.g. dealing with how transition probabilities depend on factors like age), whereas the other, which is more difficult to implement, deals with state dependence (curiously they don’t use these two words, but I’ve worked with state dependence models before in labour economics and this is what we’re dealing with here); i.e. here the transition probability will depend upon how long you’ve been in a given state.

Below I mostly discuss stuff covered in chapter 10, however I also include a few observations from the final chapter, chapter 11 (on ‘Presenting cost-effectiveness results’). Chapter 10 deals with how to represent uncertainty in decision analytic models. This is an important topic because as noted later in the book, “The primary objective of economic evaluation should not be hypothesis testing, but rather the estimation of the central parameter of interest—the incremental cost-effectiveness ratio—along with appropriate representation of the uncertainty surrounding that estimate.” In chapter 10 a distinction is made between variability, heterogeneity, and uncertainty. Variability has also been termed first-order uncertainty or stochastic uncertainty, and pertains to variation observed when recording information on resource use or outcomes within a homogenous sample of individuals. Heterogeneity relates to differences between patients which can be explained, at least in part. They distinguish between two types of uncertainty, structural uncertainty – dealing with decisions and assumptions made about the structure of the model – and parameter uncertainty, which of course relates to the precision of the parameters estimated. After briefly talking about ways to deal with these, they talk about sensitivity analysis.

“Sensitivity analysis involves varying parameter estimates across a range and seeing how this impacts on the model’s results. […] The simplest form is a one-way analysis where each parameter estimate is varied independently and singly to observe the impact on the model results. […] One-way sensitivity analysis can give some insight into the factors influencing the results, and may provide a validity check to assess what happens when particular variables take extreme values. However, it is likely to grossly underestimate overall uncertainty, and ignores correlation between parameters.”
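A one-way analysis is simple enough that a few lines of code capture it. The sketch below is my own toy example: the ‘model’ is just a cost-per-QALY ratio, and the parameter names, base-case values and ranges are all invented. It also only evaluates the two ends of each range, which implicitly assumes the result moves monotonically between them.

```python
def model_icer(params):
    """Toy model result: incremental cost per QALY gained."""
    d_cost = params["cost_new"] - params["cost_old"]
    d_qaly = params["qaly_new"] - params["qaly_old"]
    return d_cost / d_qaly


# Hypothetical base case and plausible ranges for two of the parameters.
BASE = {"cost_new": 12000.0, "cost_old": 8000.0, "qaly_new": 6.5, "qaly_old": 6.0}
RANGES = {"cost_new": (10000.0, 14000.0), "qaly_new": (6.2, 6.8)}


def one_way(model, base, ranges):
    """Vary one parameter at a time, holding all others at their base value."""
    results = {}
    for name, (low, high) in ranges.items():
        outputs = []
        for value in (low, high):
            params = dict(base)  # copy, so the base case is left untouched
            params[name] = value
            outputs.append(model(params))
        results[name] = (min(outputs), max(outputs))
    return results


base_result = model_icer(BASE)
results = one_way(model_icer, BASE, RANGES)
```

The output is one range of results per parameter, which is what you would plot in a tornado diagram. Since every parameter is moved on its own, nothing here says anything about what happens when several of them are off at once, which is the limitation the authors point to.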

Multi-way sensitivity analysis is a more refined approach, in which more than one parameter estimate is varied – this is sometimes termed scenario analysis. A different approach is threshold analysis, where one attempts to identify the critical value of one or more variables so that the conclusion/decision changes. All of these approaches are deterministic approaches, and they are not without problems. “They fail to take account of the joint parameter uncertainty and correlation between parameters, and rather than providing the decision-maker with a useful indication of the likelihood of a result, they simply provide a range of results associated with varying one or more input estimates.” So of course an alternative has been developed, namely probabilistic sensitivity analysis (PSA), which started to be used in health economic decision analyses as early as the mid-1980s.

“PSA permits the joint uncertainty across all the parameters in the model to be addressed at the same time. It involves sampling model parameter values from distributions imposed on variables in the model. […] The types of distribution imposed are dependent on the nature of the input parameters [but] decision analytic models for the purpose of economic evaluation tend to use homogenous types of input parameters, namely costs, life-years, QALYs, probabilities, and relative treatment effects, and consequently the number of distributions that are frequently used, such as the beta, gamma, and log-normal distributions, is relatively small. […] Uncertainty is then propagated through the model by randomly selecting values from these distributions for each model parameter using Monte Carlo simulation”.
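Here is a rough sketch of what such a simulation might look like for a very small model. Everything specific in it is my own assumption: a beta distribution for an event probability, a log-normal for a relative risk, a gamma for the cost of an event, and the particular distribution parameters, treatment cost and QALY loss. The parameters are also drawn independently of each other, so this toy version does not capture correlation between them.

```python
import math
import random

TREATMENT_COST = 500.0      # hypothetical fixed cost of the new treatment
QALY_LOSS_PER_EVENT = 0.5   # hypothetical QALYs lost per event


def run_psa(n_draws=5000, seed=1):
    """Monte Carlo draws of (incremental cost, incremental QALYs)."""
    rng = random.Random(seed)
    draws = []
    for _ in range(n_draws):
        # Probability of an event without treatment: beta, mean 0.2.
        p_event = rng.betavariate(20, 80)
        # Relative risk of an event with treatment: log-normal, median 0.7.
        rel_risk = rng.lognormvariate(math.log(0.7), 0.1)
        # Cost of one event: gamma with shape 25 and scale 200 (mean 5000).
        cost_event = rng.gammavariate(25, 200)

        # Propagate the sampled values through the (toy) model.
        events_avoided = p_event * (1 - rel_risk)
        d_cost = TREATMENT_COST - events_avoided * cost_event
        d_qaly = events_avoided * QALY_LOSS_PER_EVENT
        draws.append((d_cost, d_qaly))
    return draws


draws = run_psa()
mean_d_cost = sum(d_cost for d_cost, _ in draws) / len(draws)
mean_d_qaly = sum(d_qaly for _, d_qaly in draws) / len(draws)
```

Each draw is one point on the cost-effectiveness plane, so the cloud of draws shows the joint uncertainty in incremental costs and effects rather than a single range per parameter.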

## Random Stuff / Open Thread

This is not a very ‘meaty’ post, but it’s been a long time since I had one of these and I figured it was time for another one. As always links and comments are welcome.

…

i. The unbearable accuracy of stereotypes. I made a mental note of reading this paper later a long time ago, but I’ve been busy with other things. Today I skimmed it and decided that it looks interesting enough to give it a detailed read later. Some remarks from the summary towards the end of the paper:

“The scientific evidence provides more evidence of accuracy than of inaccuracy in social stereotypes. The most appropriate generalization based on the evidence is that people’s beliefs about groups are usually moderately to highly accurate, and are occasionally highly inaccurate. […] This pattern of empirical support for moderate to high stereotype accuracy is not unique to any particular target or perceiver group. Accuracy has been found with racial and ethnic groups, gender, occupations, and college groups. […] The pattern of moderate to high stereotype accuracy is not unique to any particular research team or methodology. […] This pattern of moderate to high stereotype accuracy is not unique to the substance of the stereotype belief. It occurs for stereotypes regarding personality traits, demographic characteristics, achievement, attitudes, and behavior. […] The strong form of the exaggeration hypothesis – either defining stereotypes as exaggerations or as claiming that stereotypes usually lead to exaggeration – is not supported by data. Exaggeration does sometimes occur, but it does not appear to occur much more frequently than does accuracy or underestimation, and may even occur less frequently.”

I should perhaps note that this research is closely linked to Funder’s research on personality judgment, which I’ve previously covered on the blog here and here.

…

ii. I’ve spent approximately 150 hours on vocabulary.com altogether at this point (having ‘mastered’ ~10,200 words in the process). A few words I’ve recently encountered on the site: Nescience (note to self: if someone calls you ‘nescient’ during a conversation, in many contexts that’ll be an insult, not a compliment) (Related note to self: I should find myself some smarter enemies, who use words like ‘nescient’…), eristic, carrel, oleaginous, decal, gable, epigone, armoire, chalet, cashmere, arrogate, ovine.

…

iii. why p = .048 should be rare (and why this feels counterintuitive).

…

iv. A while back I posted a few comments on SSC and I figured I might as well link to them here (at least it’ll make it easier *for me* to find them later on). Here is where I posted a few comments on a recent study dealing with Ramadan-related IQ effects, a topic which I’ve covered here on the blog before, and here I discuss some of the benefits of not having low self-esteem.

On a completely unrelated note, today I left a comment in a reddit thread about ‘Books That Challenged You / Made You See the World Differently’ which may also be of interest to readers of this blog. I realized while writing the comment that this question is probably getting more and more difficult for me to answer as time goes by. It really all depends upon *what part of the world* you want to see in a different light; which aspects you’re most interested in. For people wondering about where the books about mathematics and statistics were in that comment (I do like to think these fields play some role in terms of ‘how I see the world’), I wasn’t really sure which book to include on such topics, if any; I can’t think of any single math or stats textbook that’s dramatically changed the way I thought about the world – to the extent that my knowledge about these topics has changed how I think about the world, it’s been a long drawn-out process.

…

v. Chess…

People who care the least bit about such things probably already know that a really strong tournament is currently being played in St. Louis, the so-called Sinquefield Cup, so I’m not going to talk about that here (for resources and relevant links, go here).

I talked about the strong rating pools on ICC not too long ago, but one thing I did not mention when discussing this topic back then was that yes, I also occasionally win against some of those grandmasters the rating pool throws at me – at least I’ve won a few times against GMs by now in bullet. I’m aware that for many ‘serious chess players’ bullet ‘doesn’t really count’ because the time dimension is much more important than it is in other chess settings, but to people who think skill doesn’t matter much in bullet I’d say they should have a match with Hikaru Nakamura and see how well they do against him (if you’re interested in how that might turn out, see e.g. this video – and keep in mind that at the beginning of the video Nakamura had already won 8 games in a row, out of 8, against his opponent in the first games, who incidentally is not exactly a beginner). The skill-sets required do not overlap perfectly between bullet and classical time control games, but when I started playing bullet online I quickly realized that good players really require very little time to completely outplay people who just play random moves (fast). Below I have posted a screencap I took while kibitzing a game of one of my former opponents, an anonymous GM from Germany, against whom I currently have a 2.5/6 score, with two wins, one draw, and three losses (see the ‘My score vs CPE’ box).

I like to think of a score like this as at least some kind of accomplishment, though admittedly perhaps not a very big one.

Also in chess-related news, I’m currently reading Jesús de la Villa’s 100 Endgames book, which Christof Sielecki has said some very nice things about. A lot of the stuff I’ve encountered so far is stuff I’ve seen before, positions I’ve already encountered and worked on, endgame principles I’m familiar with, etc., but not all of it is known stuff and I really like the structure of the book. There are a lot of pages left, and as it is I’m planning to read this book from cover to cover, which is something I usually do not do when I read chess books (few people do, judging from various comments I’ve seen people make in all kinds of different contexts).

Lastly, a lecture:

## Cost-effectiveness analysis in health care (I)

Yesterday’s SMBC was awesome, and I couldn’t help myself from including it here (click to view full size):

…

In a way the three words I chose to omit from the post title are rather important in order to know which kind of book this is – the full title of Gray et al.’s work is: *Applied Methods of* … – but as I won’t be talking much about the ‘applied’ part in my coverage here, focusing instead on broader principles etc. which will be easier for people without a background in economics to follow, I figured I might as well omit those words from the post titles. I should also admit that I personally did not spend much time on the exercises, as this did not seem necessary in view of what I was using the book for. Despite not having spent much time on the exercises myself, I incidentally did reward the authors for including occasionally quite detailed coverage of technical aspects in my rating of the book on goodreads; I feel confident from the coverage that if I need to apply some of the methods they talk about in the book later on, the book will do a good job of helping me get things right. All in all, the book’s coverage made it hard for me not to give it 5 stars – so that was what I did.

I own an actual physical copy of the book, which makes blogging it more difficult than usual; I prefer blogging e-books. The greater amount of work involved in covering physical books is also one reason why I have yet to talk about Eysenck & Keane’s Cognitive Psychology text here on the blog, despite having read more than 500 pages of that book (it’s not that the book is boring). My coverage of the contents of both this book and the Eysenck & Keane book will (assuming I ever get around to blogging the latter, that is) be less detailed than it could have been, but on the other hand it’ll likely be very focused on key points and observations from the coverage.

I have talked about cost-effectiveness before here on the blog, e.g. here, but in my coverage of the book below I have not tried to avoid making points or including observations which I’ve already made elsewhere on the blog; it’s too much work to keep track of such things. With those introductory remarks out of the way, let’s move on to some observations made in the book:

…

“In cost-effectiveness analysis we first calculate the costs and effects of an intervention and one or more alternatives, then calculate the differences in cost and differences in effect, and finally present these differences in the form of a ratio, i.e. the cost per unit of health outcome effect […]. Because the focus is on differences between two (or more) options or treatments, analysts typically refer to incremental costs, incremental effects, and the incremental cost-effectiveness ratio (ICER). Thus, if we have two options *a* and *b*, we calculate their respective costs and effects, then calculate the difference in costs and difference in effects, and then calculate the ICER as the difference in costs divided by the difference in effects […] cost-effectiveness analyses which measure outcomes in terms of QALYs are sometimes referred to as cost-utility studies […] but are sometimes simply considered as a subset of cost-effectiveness analysis.”
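The calculation described in the quote is just a ratio of two differences, which a couple of lines of code make explicit. The numbers below are made up for illustration, and the function name is mine.

```python
def icer(cost_a, effect_a, cost_b, effect_b):
    """Incremental cost-effectiveness ratio of option a relative to option b."""
    d_cost = cost_a - cost_b        # incremental cost
    d_effect = effect_a - effect_b  # incremental effect (e.g. QALYs)
    if d_effect == 0:
        raise ValueError("ICER is undefined when the two effects are equal")
    return d_cost / d_effect


# Hypothetical example: option a costs 12,000 and yields 6.5 QALYs,
# option b costs 8,000 and yields 6.0 QALYs.
cost_per_qaly = icer(12000, 6.5, 8000, 6.0)  # 4000 / 0.5 = 8000 per QALY
```

One thing worth keeping in mind is that the ratio on its own hides the signs of its two components: a negative ICER can mean ‘cheaper and better’ or ‘costlier and worse’, so the two differences should be looked at alongside the ratio.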

“Cost-effectiveness analysis places no monetary value on the health outcomes it is comparing. It does not measure or attempt to measure the underlying worth or value to society of gaining additional QALYs, for example, but simply indicates which options will permit more QALYs to be gained than others with the same resources, assuming that gaining QALYs is agreed to be a reasonable objective for the health care system. Therefore the cost-effectiveness approach will never provide a way of determining how much in total it is worth spending on health care and the pursuit of QALYs rather than on other social objectives such as education, defence, or private consumption. It does not permit us to say whether health care spending is too high or too low, but rather confines itself to the question of how any given level of spending can be arranged to maximize the health outcomes yielded.

In contrast, cost-benefit analysis (CBA) does attempt to place some monetary valuation on health outcomes as well as on health care resources. […] The reasons for the more widespread use of cost-effectiveness analysis compared with cost-benefit analysis in health care are discussed extensively elsewhere, […] but two main issues can be identified. Firstly, significant conceptual or practical problems have been encountered with the two principal methods of obtaining monetary valuations of life or quality of life: the human capital approach […] and the willingness to pay approach […] Second, within the health care sector there remains a widespread and intrinsic aversion to the concept of placing explicit monetary values on health or life. […] The cost-benefit approach should […], in principle, permit broad questions of **allocative efficiency** to be addressed. […] In contrast, cost-effectiveness analysis can address questions of **productive** or **production efficiency**, where a specified good or service is being produced at the lowest possible cost – in this context, health gain using the health care budget.”

“when working in the two-dimensional world of cost-effectiveness analysis, there are two uncertainties that will be encountered. Firstly, there will be uncertainty concerning the location of the intervention on the cost-effectiveness plane: how much more or less effective and how much more or less costly it is than current treatment. Second, there is uncertainty concerning how much the decision-maker is willing to pay for health gain […] these two uncertainties can be presented together in the form of the question ‘What is the probability that this intervention is cost-effective?’, a question which effectively divides our cost-effectiveness plane into just two policy spaces – below the maximum acceptable line, and above it”.
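The ‘probability that this intervention is cost-effective’ can be estimated quite directly if you have a set of simulated (or bootstrapped) points on the cost-effectiveness plane: it’s the share of points lying below the line defined by the maximum acceptable cost per unit of health gain. A minimal sketch, with invented points and invented threshold values:

```python
def prob_cost_effective(pairs, max_wtp):
    """Share of (incremental cost, incremental effect) points that lie
    below the line d_cost = max_wtp * d_effect."""
    below = sum(1 for d_cost, d_effect in pairs if d_cost < max_wtp * d_effect)
    return below / len(pairs)


# Hypothetical simulated points: (incremental cost, incremental QALYs).
pairs = [(4000, 0.5), (6000, 0.2), (-500, 0.1), (3000, -0.1), (1000, 0.05)]

p_at_0 = prob_cost_effective(pairs, max_wtp=0)        # 0.2
p_at_20k = prob_cost_effective(pairs, max_wtp=20000)  # 0.4
p_at_50k = prob_cost_effective(pairs, max_wtp=50000)  # 0.8
```

Repeating the calculation over a range of threshold values handles the second uncertainty mentioned in the quote (how much the decision-maker is willing to pay): you get one probability per threshold rather than having to commit to a single value.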

“Conventionally, cost-effectiveness ratios that have been calculated against a baseline or do-nothing option without reference to any alternatives are referred to as *average* cost-effectiveness ratios, while comparisons with the next best alternative are described as *incremental* cost-effectiveness ratios […] it is quite misleading to calculate average cost-effectiveness ratios, as they ignore the alternatives available.”

“A life table provides a method of summarizing the mortality experience of a group of individuals. […] There are two main types of life table. First, there is a **cohort life table**, which is constructed based on the mortality experience of a group of individuals […]. While this approach can be used to characterize life expectancies of insects and some animals, human longevity makes this approach difficult to apply as the observation period would have to be sufficiently long to be able to observe the death of all members of the cohort. Instead, **current life tables** are normally constructed using cross-sectional data of observed mortality rates at different ages at a given point in time […] Life tables can also be classified according to the intervals over which changes in mortality occur. A **complete life table** displays the various rates for each year of life; while an **abridged life table** deals with greater periods of time, for example 5 year age intervals […] A life table can be used to generate a survival curve S(x) for the population at any point in time. This represents the probability of surviving beyond a certain age x (i.e. S(x)=Pr[X>x]). […] The chance of a male living to the age of 60 years is high (around 0.9) [in the UK, presumably – *US*] and so the survival curve is comparatively flat up until this age. The proportion dying each year from the age of 60 years rapidly increases, so the curve has a much steeper downward slope. In the last part of the survival curve there is an inflection, indicating a slowing rate of increase in the proportion dying each year among the very old (over 90 years). […] The hazard rate is the slope of the survival curve at any point, given the instantaneous chance of an individual dying.”
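As a small illustration of how a survival curve comes out of a life table, here is a sketch that starts from age-specific probabilities of dying within each one-year interval (a complete life table, in the terminology above). The probabilities are invented and the function name is mine.

```python
def survival_curve(q):
    """Build a survival curve from one-year death probabilities.

    q[x] is the probability of dying between age x and x + 1, given
    survival to age x. Returns a list S with S[0] = 1 and S[x] equal to
    the probability of surviving to age x.
    """
    survival = [1.0]
    for q_x in q:
        # Surviving to x + 1 requires surviving to x and then not dying.
        survival.append(survival[-1] * (1.0 - q_x))
    return survival


# Hypothetical (and unrealistically high) death probabilities for ages 0-2.
q = [0.1, 0.2, 0.5]
S = survival_curve(q)  # approximately [1.0, 0.9, 0.72, 0.36]
```

The curve stays flat where the q-values are small and drops quickly where they are large, which is the pattern described in the quote for ages below and above 60.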

“Life tables are a useful tool for estimating changes in life expectancies from interventions that reduce mortality. […] Multiple-cause life tables are a way of quantifying outcomes when there is more than one mutually exclusive cause of death. These life tables can estimate the potential gains from the elimination of a cause of death and are also useful in calculating the benefits of interventions that reduce the risk of a particular cause of death. […] One issue that arises when death is divided into multiple causes in this type of life table is **competing risk**. […] competing risk can arise ‘when an individual can experience more than one type of event and the occurrence of one type of event hinders the occurrence of other types of events’. Competing risks affect life tables, as those who die from a specific cause have no chance of dying from other causes during the remainder of the interval […]. In practice this will mean that as soon as one cause is eliminated the probabilities of dying of other causes increase […]. Several methods have been proposed to correct for competing risks when calculating life tables.”

“the use of published life-table methods may have limitations, especially when considering particular populations which may have very different risks from the general population. In these cases, there are a host of techniques referred to as **survival analysis** which enables risks to be estimated from patient-level data. […] Survival analysis typically involves observing one or more outcomes in a population of interest over a period of time. The outcome, which is often referred to as an **event** or **endpoint** could be death, a non-fatal outcome such as a major clinical event (e.g. myocardial infarction), the occurrence of an adverse event, or even the date of first non-compliance with a therapy.”

“A key feature of survival data is censoring, which occurs whenever the event of interest is not observed within the follow-up period. This does not mean that the event will not occur some time in the future, just that it has not occurred while the individual was observed. […] The most common case of censoring is referred to as **right censoring**. This occurs whenever the observation of interest occurs after the observation period. […] An alternative form of censoring is **left censoring**, which occurs when there is a period of time when the individuals are at risk prior to the observation period.

A key feature of most survival analysis methods is that they assume that the censoring process is **non-informative**, meaning that there is no dependence between the time to the event of interest and the process that is causing the censoring. However, if the duration of observation is related to the severity of a patient’s disease, for example if patients with more advanced illness are withdrawn early from the study, the censoring is likely to be informative and other techniques are required”.
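As an illustration of how right-censored observations can still contribute information, here is a small sketch of the Kaplan-Meier estimator (a standard technique, though the quotes above do not name it). It assumes non-informative censoring, and the five follow-up times are invented.

```python
def kaplan_meier(times, observed):
    """Kaplan-Meier estimate of the survival function under right censoring.

    times[i]    : follow-up time of subject i
    observed[i] : True if the event was seen at times[i], False if censored
    Returns [(t, S(t))] at each distinct event time.
    """
    curve = []
    surv = 1.0
    event_times = sorted({t for t, e in zip(times, observed) if e})
    for t in event_times:
        # Subjects censored at time t still count as at risk at t.
        at_risk = sum(1 for u in times if u >= t)
        events = sum(1 for u, e in zip(times, observed) if u == t and e)
        surv *= 1.0 - events / at_risk
        curve.append((t, surv))
    return curve

# Hypothetical data: two subjects are censored (at times 2 and 4).
times = [1, 2, 2, 3, 4]
observed = [True, True, False, True, False]
km = kaplan_meier(times, observed)
print(km)
```

A censored subject stays in the risk set up to the censoring time and then drops out without counting as an event, which is why the estimate does not treat "not observed" as "did not happen".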

“Differences in the composition of the intervention and control groups at the end of follow-up may have important implications for estimating outcomes, especially when we are interested in extrapolation. If we know that the intervention group is older and has a lower proportion of females, we would expect these characteristics to increase the hazard mortality in this group over their remaining lifetimes. However, if the intervention group has experienced a lower number of events, this may significantly reduce the hazard for some individuals. They may also benefit from a past treatment which continues to reduce the hazard of a primary outcome such as death. This effect […] is known as the **legacy effect**”.

“Changes in life expectancy are a commonly used outcome measure in economic evaluation. […] Table 4.6 shows selected examples of estimates of the gain in life expectancy for various interventions reported by Wright and Weinstein (1998) […] Gains in life expectancy from preventative interventions in populations of average risk generally ranged from a few days to slightly more than a year. […] The gains in life expectancy from preventing or treating disease in persons at elevated risk [*this type of prevention is known as ‘secondary-‘ and/or ‘tertiary prevention’ (depending on the circumstances), as opposed to ‘primary prevention’ – the distinction between primary prevention and more targeted approaches is often important in public health contexts, because the level of targeting will often interact with the cost-effectiveness dimension* – *US*] are generally greater […*one reason why this does not necessarily mean that targeted approaches are always better is that search costs will often be an increasing function of the level of targeting – US*]. Interventions that treat established disease vary, with gains in life-expectancy ranging from a few months […] to as long as nine years […] the point that Wright and Weinstein (1998) were making was not that absolute gains vary, but that a gain in life expectancy of a month from a preventive intervention targeted at population at average risk and a gain of a year from a preventive intervention targeted at populations at elevated risk could both be considered large. It should also be noted that interventions that produce a comparatively small gain in life expectancy when averaged across the population […] may still be very cost-effective.”

## Model Selection and Multi-Model Inference (II)

I haven’t really blogged this book in anywhere near the detail it deserves, even though my first post about it included a few quotes illustrating how many different topics the book covers.

This book is technical, and even though I’m trying to make this post less technical by omitting the math, it may be a good idea to reread the first post about the book beforehand to refresh your knowledge of these things.

Quotes and comments below – most of the coverage here focuses on stuff covered in chapters 3 and 4 in the book.

…

“Tests of null hypotheses and information-theoretic approaches should not be used together; they are very different analysis paradigms. A very common mistake seen in the applied literature is to use AIC to rank the candidate models and then “test” to see whether the best model (the alternative hypothesis) is “significantly better” than the second-best model (the null hypothesis). This procedure is flawed, and we strongly recommend against it […] the primary emphasis should be on the size of the treatment effects and their precision; too often we find a statement regarding “significance,” while the treatment and control means are not even presented. Nearly all statisticians are calling for estimates of effect size and associated precision, rather than test statistics, P-values, and “significance.” [*Borenstein & Hedges certainly did as well in their book (written much later), and this was not an issue I omitted to talk about in my coverage of their book…*] […] Information-theoretic criteria such as AIC, AICc, and QAICc are not a “test” in any sense, and there are no associated concepts such as test power or P-values or α-levels. Statistical hypothesis testing represents a very different, and generally inferior, paradigm for the analysis of data in complex settings. **It seems best to avoid use of the word “significant” in reporting research results under an information-theoretic paradigm.** […] AIC allows a ranking of models and the identification of models that are nearly equally useful versus those that are clearly poor explanations for the data at hand […]. Hypothesis testing provides no general way to rank models, even for models that are nested. […] In general, we recommend strongly against the use of null hypothesis testing in model selection.”

“The bootstrap is a type of Monte Carlo method used frequently in applied statistics. This computer-intensive approach is based on resampling of the observed data […] The fundamental idea of the model-based sampling theory approach to statistical inference is that the data arise as a sample from some conceptual probability distribution *f*. Uncertainties of our inferences can be measured if we can estimate *f*. The bootstrap method allows the computation of measures of our inference uncertainty by having a simple empirical estimate of *f* and sampling from this estimated distribution. In practical application, the empirical bootstrap means using some form of resampling with replacement from the actual data x to generate B (e.g., B = 1,000 or 10,000) bootstrap samples […] The set of B bootstrap samples is a proxy for a set of B independent real samples from *f* (in reality we have only one actual sample of data). Properties expected from replicate real samples are inferred from the bootstrap samples by analyzing each bootstrap sample exactly as we first analyzed the real data sample. From the set of results of sample size B we measure our inference uncertainties from sample to (conceptual) population […] For many applications it has been theoretically shown […] that the bootstrap can work well for large sample sizes (n), but it is not generally reliable for small n […], regardless of how many bootstrap samples B are used. […] Just as the analysis of a single data set can have many objectives, the bootstrap can be used to provide insight into a host of questions. For example, for each bootstrap sample one could compute and store the conditional variance–covariance matrix, goodness-of-fit values, the estimated variance inflation factor, the model selected, confidence interval width, and other quantities. Inference can be made concerning these quantities, based on summaries over the B bootstrap samples.”
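The resampling idea in the quote can be sketched in a few lines. The data, the choice of the median as the statistic, B = 2000 and the percentile interval are all illustrative assumptions rather than anything the book prescribes; the toy sample is also deliberately tiny, which is exactly the situation where the quote warns the bootstrap may be unreliable.

```python
import random
import statistics

def bootstrap(data, statistic, B=1000, seed=0):
    """Return B bootstrap replicates of `statistic`.

    Each replicate is computed on a sample of len(data) values drawn
    with replacement from the observed data.
    """
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    n = len(data)
    return [statistic([rng.choice(data) for _ in range(n)]) for _ in range(B)]

# Hypothetical observed sample.
sample = [2.1, 3.4, 1.9, 5.6, 4.4, 3.8, 2.7, 4.9, 3.1, 6.2]
B = 2000

replicates = bootstrap(sample, statistics.median, B=B)
se = statistics.stdev(replicates)  # bootstrap standard error of the median

# Simple percentile interval from the sorted replicates.
ordered = sorted(replicates)
low, high = ordered[int(0.025 * B)], ordered[int(0.975 * B) - 1]
print(f"median = {statistics.median(sample):.2f}, SE = {se:.2f}, "
      f"95% interval = ({low:.2f}, {high:.2f})")
```

The same loop could store any other per-sample quantity, for example which candidate model was selected in each bootstrap sample.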

“**Information criteria attempt only to select the best model from the candidate models available; if a better model exists, but is not offered as a candidate, then the information-theoretic approach cannot be expected to identify this new model**. Adjusted R^{2} […] are useful as a measure of the proportion of the variation “explained,” [but] are not useful in model selection […] adjusted R^{2} is poor in model selection; its usefulness should be restricted to description.”

“As we have struggled to understand the larger issues, it has become clear to us that inference based on only a single best model is often relatively poor for a wide variety of substantive reasons. Instead, we increasingly favor multimodel inference: procedures to allow formal statistical inference from all the models in the set. […] Such multimodel inference includes model averaging, incorporating model selection uncertainty into estimates of precision, confidence sets on models, and simple ways to assess the relative importance of variables.”

“If sample size is small, one must realize that relatively little information is probably contained in the data (unless the effect size is very substantial), and the data may provide few insights of much interest or use. Researchers routinely err by building models that are far too complex for the (often meager) data at hand. They do not realize how little structure can be reliably supported by small amounts of data that are typically “noisy.””

“Sometimes, the selected model [when applying an information criterion] contains a parameter that is constant over time, or areas, or age classes […]. This result should not imply that there is no variation in this parameter, rather that parsimony and its bias/variance tradeoff finds the actual variation in the parameter to be relatively small in relation to the information contained in the sample data. It “costs” too much in lost precision to add estimates of all of the individual *θ*_{i}. As the sample size increases, then at some point a model with estimates of the individual parameters would likely be favored. Just because a parsimonious model contains a parameter that is constant across strata does not mean that there is no variation in that process across the strata.”
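A toy illustration of the point above: with a small real difference between two strata, AIC can prefer the model with a single common mean when n is small and switch to stratum-specific means once n is large enough. The data are constructed deterministically (alternating -1/+1 deviations around each stratum mean), the stratum means 0.0 and 0.3 are invented, and the least-squares form of AIC, n·log(RSS/n) + 2K with K counting the variance parameter, is used under an assumption of normal errors.

```python
import math

def aic_least_squares(rss, n, k):
    """AIC for a least-squares fit with normal errors; k includes the variance."""
    return n * math.log(rss / n) + 2 * k

def make_strata(n_per_stratum, means):
    """Deterministic toy data: each stratum mean plus alternating -1/+1 deviations.

    n_per_stratum should be even so the deviations cancel within a stratum.
    """
    return [
        [m + (1.0 if i % 2 else -1.0) for i in range(n_per_stratum)]
        for m in means
    ]

def rss_about(values, centre):
    return sum((v - centre) ** 2 for v in values)

def preferred_model(n_per_stratum, means=(0.0, 0.3)):
    strata = make_strata(n_per_stratum, means)
    pooled = [v for s in strata for v in s]
    n = len(pooled)
    # Model A: one common mean (K = 2: mean + variance).
    rss_constant = rss_about(pooled, sum(pooled) / n)
    # Model B: one mean per stratum (K = number of strata + 1).
    rss_stratified = sum(rss_about(s, sum(s) / len(s)) for s in strata)
    aic_constant = aic_least_squares(rss_constant, n, k=2)
    aic_stratified = aic_least_squares(rss_stratified, n, k=len(means) + 1)
    return "constant" if aic_constant < aic_stratified else "stratified"

for n_per_stratum in (20, 200):
    print(n_per_stratum * 2, "observations ->", preferred_model(n_per_stratum))
```

The true difference between the strata is the same in both runs; only the amount of data changes, so picking the constant-parameter model at small n says nothing about whether the variation exists.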

“[In a significance testing context,] a significant test result does not relate directly to the issue of what approximating model is best to use for inference. One model selection strategy that has often been used in the past is to do likelihood ratio tests of each structural factor […] and then use a model with all the factors that were “significant” at, say, α = 0.05. However, there is no theory that would suggest that this strategy would lead to a model with good inferential properties (i.e., small bias, good precision, and achieved confidence interval coverage at the nominal level). […] The purpose of the analysis of empirical data is not to find the “true model”— not at all. Instead, we wish to find a best approximating model, based on the data, and then develop statistical inferences from this model. […] We search […] not for a “true model,” but rather for a parsimonious model giving an accurate approximation to the interpretable information in the data at hand. Data analysis involves the question, “What level of model complexity will the data support?” and both under- and overfitting are to be avoided. Larger data sets tend to support more complex models, and the selection of the size of the model represents a tradeoff between bias and variance.”

“The easy part of the information-theoretic approaches includes both the computational aspects and the clear understanding of these results […]. The hard part, and the one where training has been so poor, is the a priori thinking about the science of the matter before data analysis — even before data collection. It has been too easy to collect data on a large number of variables in the hope that a fast computer and sophisticated software will sort out the important things — the “significant” ones […]. Instead, a major effort should be mounted to understand the nature of the problem by critical examination of the literature, talking with others working on the general problem, and thinking deeply about alternative hypotheses. Rather than “test” dozens of trivial matters (is the correlation zero? is the effect of the lead treatment zero? are ravens pink?, Anderson et al. 2000), there must be a more concerted effort to provide evidence on *meaningful* questions that are important to a discipline. This is the critical point: the common failure to address important science questions in a fully competent fashion. […] “Let the computer find out” is a poor strategy for researchers who do not bother to think clearly about the problem of interest and its scientific setting. *The sterile analysis of “just the numbers” will continue to be a poor strategy for progress in the sciences.*

Researchers often resort to using a computer program that will examine all possible models and variables automatically. Here, the hope is that the computer will discover the important variables and relationships […] The primary mistake here is a common one: the failure to posit a small set of a priori models, each representing a plausible research hypothesis.”

“Model selection is most often thought of as a way to select just the best model, then inference is conditional on that model. However, information-theoretic approaches are more general than this simplistic concept of model selection. Given a set of models, specified independently of the sample data, we can make formal inferences based on the entire set of models. […] Part of multimodel inference includes ranking the fitted models from best to worst […] and then scaling to obtain the relative plausibility of each fitted model (*g*_{i}) by a weight of evidence (*w*_{i}) relative to the selected best model. Using the conditional sampling variance […] from each model and the Akaike weights […], unconditional inferences about precision can be made over the entire set of models. Model-averaged parameter estimates and estimates of unconditional sampling variances can be easily computed. Model selection uncertainty is a substantial subject in its own right, well beyond just the issue of determining the best model.”

“There are three general approaches to assessing model selection uncertainty: (1) theoretical studies, mostly using Monte Carlo simulation methods; (2) the bootstrap applied to a given set of data; and (3) utilizing the set of AIC differences (i.e., ∆_{i}) and model weights *w*_{i} from the set of models fit to data.”

“Statistical science should emphasize estimation of parameters and associated measures of estimator uncertainty. Given a correct model […], an MLE is reliable, and we can compute a reliable estimate of its sampling variance and a reliable confidence interval […]. If the model is selected entirely independently of the data at hand, and is a good approximating model, and if n is large, then the estimated sampling variance is essentially unbiased, and any appropriate confidence interval will essentially achieve its nominal coverage. This would be the case if we used only one model, decided on a priori, and it was a good model, *g*, of the data generated under truth, *f*. However, even when we do objective, data-based model selection (which we are advocating here), the [model] selection process is expected to introduce an added component of sampling uncertainty into any estimated parameter; hence classical theoretical sampling variances are too small: They are conditional on the model and do not reflect model selection uncertainty. One result is that conditional confidence intervals can be expected to have less than nominal coverage.”

“Data analysis is sometimes focused on the variables to include versus exclude in the selected model (e.g., important vs. unimportant). Variable selection is often the focus of model selection for linear or logistic regression models. Often, an investigator uses stepwise analysis to arrive at a final model, and from this a conclusion is drawn that the variables in this model are important, whereas the other variables are not important. While common, this is poor practice and, among other issues, fails to fully consider model selection uncertainty. […] Estimates of the relative importance of predictor variables x_{j} can best be made by summing the Akaike weights across all the models in the set where variable *j* occurs. Thus, the relative importance of variable *j* is reflected in the sum w_{+ }(*j*). The larger the w_{+ }(*j*) the more important variable *j* is, relative to the other variables. Using the w_{+ }(*j*), all the variables can be ranked in their importance. […] This idea extends to subsets of variables. For example, we can judge the importance of a pair of variables, as a pair, by the sum of the Akaike weights of all models that include the pair of variables. […] To summarize, in many contexts the AIC selected best model will include some variables and exclude others. Yet this inclusion or exclusion by itself does not distinguish differential evidence for the importance of a variable in the model. The model weights […] summed over all models that include a given variable provide a better weight of evidence for the importance of that variable in the context of the set of models considered.” [*The reason why I’m not telling you how to calculate Akaike weights is that I don’t want to bother with math formulas in wordpress – but I guess all you need to know is that these are not hard to calculate. It should perhaps be added that one can also use bootstrapping methods to obtain relevant model weights to apply in a multimodel inference context.*]
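For readers who want to see the arithmetic, here is a short sketch of Akaike weights and the summed-weight measure of variable importance. The weight of model i is exp(-∆_i/2) divided by the sum of exp(-∆_r/2) over all models in the set, where ∆_i is the model's AIC minus the smallest AIC. The four candidate models and their AIC values are hypothetical.

```python
import math

def akaike_weights(aic_values):
    """Akaike weights: exp(-delta_i / 2), normalised to sum to 1."""
    best = min(aic_values)
    relative = [math.exp(-(a - best) / 2.0) for a in aic_values]
    total = sum(relative)
    return [r / total for r in relative]

def variable_importance(models):
    """Sum the Akaike weights over all models that contain each variable.

    `models` maps a tuple of variable names to that model's AIC.
    """
    names = list(models)
    weights = akaike_weights([models[m] for m in names])
    importance = {}
    for variables, w in zip(names, weights):
        for v in variables:
            importance[v] = importance.get(v, 0.0) + w
    return importance

# Hypothetical candidate set: each variable appears in two of the four models.
candidate_aic = {
    (): 110.0,            # intercept-only model
    ("x1",): 101.0,
    ("x2",): 108.0,
    ("x1", "x2"): 100.0,
}

weights = akaike_weights(list(candidate_aic.values()))
importance = variable_importance(candidate_aic)
for model, w in zip(candidate_aic, weights):
    print(model, round(w, 3))
print({v: round(w, 3) for v, w in importance.items()})
```

The toy set is balanced, in the sense that each variable appears in the same number of models, which keeps the summed weights comparable across variables.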

“If data analysis relies on model selection, then inferences should acknowledge model selection uncertainty. If the goal is to get the best estimates of a set of parameters in common to all models (this includes prediction), model averaging is recommended. If the models have definite, and differing, interpretations as regards understanding relationships among variables, and it is such understanding that is sought, then one wants to identify the best model and make inferences based on that model. […] The bootstrap provides direct, robust estimates of model selection probabilities π_{i} , but we have no reason now to think that use of bootstrap estimates of model selection probabilities rather than use of the Akaike weights will lead to superior unconditional sampling variances or model-averaged parameter estimators. […] Be mindful of possible model redundancy. A carefully thought-out set of a priori models should eliminate model redundancy problems and is a central part of a sound strategy for obtaining reliable inferences. […] **Results are sensitive to having demonstrably poor models in the set of models considered; thus it is very important to exclude models that are a priori poor.** […] The importance of a small number (R) of candidate models, defined prior to detailed analysis of the data, cannot be overstated. […] One should have R much smaller than n. MMI [Multi-Model Inference] approaches become increasingly important in cases where there are many models to consider.”

“In general there is a substantial amount of model selection uncertainty in many practical problems […]. Such uncertainty about what model structure (and associated parameter values) is the K-L [Kullback–Leibler] best approximating model applies whether one uses hypothesis testing, information-theoretic criteria, dimension-consistent criteria, cross-validation, or various Bayesian methods. Often, there is a nonnegligible variance component for estimated parameters (this includes prediction) due to uncertainty about what model to use, and this component should be included in estimates of precision. […] we recommend assessing model selection uncertainty rather than ignoring the matter. […] It is […] not a sound idea to pick a single model and unquestioningly base extrapolated predictions on it when there is model uncertainty.”
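A sketch of how a model-averaged estimate and an "unconditional" standard error might be computed for a parameter that appears in every model in the set. The variance formula used here, a weighted sum of sqrt(conditional variance + squared deviation from the averaged estimate), is one form associated with Burnham and Anderson; treat its use here as an assumption, since the quotes above do not spell the formula out. The estimates, standard errors and weights are invented.

```python
import math

def model_average(estimates, conditional_ses, weights):
    """Model-averaged estimate and an unconditional standard error.

    estimates[i]       : estimate of the parameter under model i
    conditional_ses[i] : its standard error, conditional on model i
    weights[i]         : model weight (e.g. Akaike weight); weights sum to 1
    """
    theta_bar = sum(w * t for w, t in zip(weights, estimates))
    # Each term combines within-model variance with the squared deviation of
    # that model's estimate from the averaged estimate.
    se = sum(
        w * math.sqrt(s ** 2 + (t - theta_bar) ** 2)
        for w, t, s in zip(weights, estimates, conditional_ses)
    )
    return theta_bar, se

# Hypothetical results from two candidate models.
theta_bar, unconditional_se = model_average(
    estimates=[0.8, 1.4],
    conditional_ses=[0.2, 0.3],
    weights=[0.7, 0.3],
)
print(f"averaged estimate = {theta_bar:.3f}, unconditional SE = {unconditional_se:.3f}")
```

When the models disagree about the parameter, the unconditional standard error exceeds the weighted average of the conditional ones; when they all agree, the two coincide.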

## Model Selection and Multi-Model Inference (I)

“We wrote this book to introduce graduate students and research workers in various scientific disciplines to the use of information-theoretic approaches in the analysis of empirical data. These methods allow the data-based selection of a “best” model and a ranking and weighting of the remaining models in a pre-defined set. Traditional statistical inference can then be based on this selected best model. However, we now emphasize that information-theoretic approaches allow formal inference to be based on more than one model (multimodel inference). Such procedures lead to more robust inferences in many cases, and we advocate these approaches throughout the book. […] Information theory includes the celebrated Kullback–Leibler “distance” between two models (actually, probability distributions), and this represents a fundamental quantity in science. In 1973, Hirotugu Akaike derived an estimator of the (relative) expectation of Kullback–Leibler distance based on Fisher’s maximized log-likelihood. His measure, now called Akaike’s information criterion (AIC), provided a new paradigm for model selection in the analysis of empirical data. His approach, with a fundamental link to information theory, is relatively simple and easy to use in practice, but little taught in statistics classes and far less understood in the applied sciences than should be the case. […] We do not claim that the information-theoretic methods are always the very best for a particular situation. They do represent a unified and rigorous theory, an extension of likelihood theory, an important application of information theory, and they are objective and practical to employ across a very wide class of empirical problems. 
Inference from multiple models, or the selection of a single “best” model, by methods based on the Kullback–Leibler distance are almost certainly better than other methods commonly in use now (e.g., null hypothesis testing of various sorts, the use of R^{2}, or merely the use of just one available model).

This is an applied book written primarily for biologists and statisticians using models for making inferences from empirical data. […] This book might be useful as a text for a course for students with substantial experience and education in statistics and applied data analysis. A second primary audience includes honors or graduate students in the biological, medical, or statistical sciences […] Readers should ideally have some maturity in the quantitative sciences and experience in data analysis. Several courses in contemporary statistical theory and methods as well as some philosophy of science would be particularly useful in understanding the material. Some exposure to likelihood theory is nearly essential”.

…

The above quotes are from the preface of the book, which I have so far only briefly talked about here; this post will provide a lot more details. Aside from writing the post in order to mentally process the material and obtain a greater appreciation of the points made in the book, I have also as a secondary goal tried to write the post in a manner so that people who are not necessarily experienced model-builders might also derive some benefit from the coverage. Whether or not I was successful in that respect I do not know – given the outline above, it should be obvious that there are limits as to how ‘readable’ you can make stuff like this to people without a background in a semi-relevant field. I don’t think I have written specifically about the application of information criteria in the model selection context before here on the blog, at least not in any amount of detail, but I have written about ‘model-stuff’ before, also in ‘meta-contexts’ not necessarily related to the application of models in economics; so if you’re interested in ‘this kind of stuff’ but you don’t feel like having a go at a post dealing with a book which includes word combinations like ‘the (relative) expectation of Kullback–Leibler distance based on Fisher’s maximized log-likelihood’ in the preface, you can for example have a look at posts like this, this, this and this. I have also discussed here on the blog some stuff somewhat related to the multi-model inference part, how you can combine the results of various models to get a bigger picture of what’s going on, in these posts – they approach ‘the topic’ (these are in fact separate topics…) in a very different manner than does this book, but *some* key ideas *should* presumably transfer. 
Having said all this, I should also point out that many of the basic points made in the coverage below should be relatively easy to understand, and I should perhaps repeat that I’ve tried to make this post readable to people who’re not too familiar with this kind of stuff. I have deliberately chosen to include no mathematical formulas in my coverage in this post. Please do not assume this is because the book does not contain mathematical formulas.

Before moving on to the main coverage I thought I’d add a note about the remark above that stuff like AIC is “little taught in statistics classes and far less understood in the applied sciences than should be the case”. The book was written a while back, and some things may have changed a bit since then. I have done coursework on the application of information criteria in model selection as it was a topic (briefly) covered in regression analysis(? …or an earlier course), so at least this kind of stuff is now being taught to students of economics where I study and has been for a while as far as I’m aware – meaning that coverage of such topics is probably reasonably widespread at least in this field. However I can hardly claim that I obtained a ‘great’ or ‘full’ understanding of the issues at hand from the work on these topics I did back then – and so I have only gradually, while reading this book, come to appreciate some of the deeper issues and tradeoffs involved in model selection. This could probably be taken as an argument that these topics are still ‘far less understood … than should be the case’ – and another, perhaps stronger, argument would be Seber’s comments in the last part of his book; if a statistician today may still ‘overlook’ information criteria when discussing model selection in a *Springer* text, it’s not hard to argue that the methods are perhaps not as well known as should ‘ideally’ be the case. It’s obvious from the coverage that a lot of people were not using the methods when the book was written, and I’m not sure things have changed as much as would be preferable since then.

What is the book about? A starting point for understanding the sort of questions the book deals with might be to consider the simple question: When we set out to model stuff empirically and we have different candidate models to choose from, how do we decide which of the models is ‘best’? There are a lot of other questions dealt with in the coverage as well. What does the word ‘best’ mean? We might worry over both the functional form of the model and which variables should be included in ‘the best’ model – do we need separate mechanisms for dealing with concerns about the functional form and concerns about variable selection, or can we deal with such things at the same time? How do we best measure the effect of a variable which we have access to and consider including in our model(s) – is it preferable to interpret the effect of a variable on an outcome based on the results you obtain from a ‘best model’ in the set of candidate models, or is it perhaps sometimes better to combine the results of multiple models (and for example take an average of the effects of the variable across multiple proposed models to be the best possible estimate) in the choice set (as should by now be obvious for people who’ve read along here, there are some sometimes quite close parallels between stuff covered in this book and stuff covered in *Borenstein & Hedges*)? If we’re not sure which model is ‘right’, how might we quantify our uncertainty about these matters – and what happens if we don’t try to quantify our uncertainty about which model is correct? What is bootstrapping, and how can we use Monte Carlo methods to help us with model selection? If we apply information criteria to choose among models, what do these criteria tell us, and which sort of issues are they silent about? 
Are some methods for deciding between models better than others in specific contexts – might it for example be a good idea to make criteria adjustments when faced with small sample sizes, which make it harder for us to rely on asymptotic properties of the criteria we apply? How might the sample size more generally relate to our decision criterion deciding which model might be considered ‘best’ – do we think that what might be considered to be ‘the best model’ might depend upon (‘should depend upon’?) how much data we have access to or not, and if how much data we have access to and the ‘optimal size of a model’ are related, *how* are the two related, and why? The questions included in the previous sentence relate to some fundamental differences between AIC (and similar measures) and BIC – but let’s not get ahead of ourselves. I may or may not go into details like these in my coverage of the book, but I certainly won’t cover stuff like that in this post. Some of the content is really technical: “Chapters 5 and 6 present more difficult material [than chapters 1-4] and some new research results. Few readers will be able to absorb the concepts presented here after just one reading of the material […] Underlying theory is presented in Chapter 7, and this material is much deeper and more mathematical.” – from the preface. The sample size considerations mentioned above relate to stuff covered in chapter 6. As you might already have realized, this book has a lot of stuff.
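For reference, the small-sample adjustment and the AIC versus BIC contrast mentioned above come down to different penalty terms on the same maximized log-likelihood. The sketch below uses the standard definitions: AIC = -2 log L + 2K, AICc = AIC + 2K(K+1)/(n-K-1), and BIC = -2 log L + K log n. The two log-likelihood values are hypothetical, and holding them fixed while varying n is purely an arithmetic illustration.

```python
import math

def aic(log_lik, k):
    """Akaike's information criterion for a model with k estimated parameters."""
    return -2.0 * log_lik + 2.0 * k

def aicc(log_lik, k, n):
    """Small-sample corrected AIC; requires n > k + 1."""
    return aic(log_lik, k) + 2.0 * k * (k + 1) / (n - k - 1)

def bic(log_lik, k, n):
    """Bayesian information criterion; the penalty grows with log(n)."""
    return -2.0 * log_lik + k * math.log(n)

# Hypothetical fits: a simple model (K = 2) and a richer one (K = 4).
simple = {"log_lik": -50.0, "k": 2}
richer = {"log_lik": -47.5, "k": 4}

print("AIC :", aic(**simple), aic(**richer))
for n in (20, 2000):
    print(f"n = {n}")
    print("  AICc:", round(aicc(n=n, **simple), 3), round(aicc(n=n, **richer), 3))
    print("  BIC :", round(bic(n=n, **simple), 3), round(bic(n=n, **richer), 3))
```

With these numbers plain AIC slightly prefers the richer model, AICc at n = 20 prefers the simple one, and the AICc correction all but vanishes at n = 2000. The BIC penalty per parameter exceeds AIC's whenever log n is greater than 2, that is for n of 8 or more.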

When dealing with models, one way to think about these things is to consider two in some sense separate issues: On the one hand we might think about which model is most appropriate (model selection), and on the other hand we might think about how best to estimate parameter values and variance-covariance matrices *given* a specific model. As the book points out early on, “if one assumes or somehow chooses a particular model, methods exist that are objective and asymptotically optimal for estimating model parameters and the sampling covariance structure, conditional on that model. […] The sampling distributions of ML [maximum likelihood] estimators are often skewed with small samples, but profile likelihood intervals or log-based intervals or bootstrap procedures can be used to achieve asymmetric confidence intervals with good coverage properties. **In general, the maximum likelihood method provides an objective, omnibus theory for estimation of model parameters and the sampling covariance matrix, given an appropriate model**.” The problem is that it’s not ‘a given’ that the model we’re working on *is* actually appropriate. That’s where model selection mechanisms enter the picture. Such methods can help us realize which of the models we’re considering might be the most appropriate one(s) to apply in the specific context (there are other things they can’t tell us, however – see below).

Below I have added some quotes from the book and some further comments:

“Generally, alternative models will involve differing numbers of parameters; the number of parameters will often differ by at least an order of magnitude across the set of candidate models. […] The more parameters used, the better the fit of the model to the data that is achieved. Large and extensive data sets are likely to support more complexity, and this should be considered in the development of the set of candidate models. If a particular model (parametrization) does not make biological [/’scientific’] sense, this is reason to exclude it from the set of candidate models, particularly in the case where causation is of interest. In developing the set of candidate models, one must recognize a certain balance between keeping the set small and focused on plausible hypotheses, while making it big enough to guard against omitting a very good a priori model. While this balance should be considered, we advise the inclusion of all models that seem to have a reasonable justification, prior to data analysis. While one must worry about errors due to both underfitting and overfitting, it seems that modest overfitting is less damaging than underfitting (Shibata 1989).” (The key word here is ‘modest’ – and please don’t take these authors to be in favour of obviously overfitted models and data dredging strategies; they spend quite a few pages criticizing such models/approaches!).

“It is not uncommon to see biologists collect data on 50–130 “ecological” variables in the blind hope that some analysis method and computer system will “find the variables that are significant” and sort out the “interesting” results […]. This shotgun strategy will likely uncover mainly spurious correlations […], and it is prevalent in the naive use of many of the traditional multivariate analysis methods (e.g., principal components, stepwise discriminant function analysis, canonical correlation methods, and factor analysis) found in the biological literature [*and elsewhere, US*]. We believe that mostly spurious results will be found using this unthinking approach […], and we encourage investigators to give very serious consideration to a well-founded set of candidate models and predictor variables (as a reduced set of possible prediction) as a means of minimizing the inclusion of spurious variables and relationships. […] Using AIC and other similar methods one can only hope to select the best model from this set; if good models are not in the set of candidates, they cannot be discovered by model selection (i.e., data analysis) algorithms. […] statistically we can infer only that a best model (by some criterion) has been selected, never that it is the true model. […] **Truth and true models are not statistically identifiable from data**.”

“It is generally a mistake to believe that there is a simple “true model” in the biological sciences and that during data analysis this model can be uncovered and its parameters estimated. Instead, biological systems [*and other systems! – US*] are complex, with many small effects, interactions, individual heterogeneity, and individual and environmental covariates (most being unknown to us); we can only hope to identify a model that provides a good approximation to the data available. The words “true model” represent an oxymoron, except in the case of Monte Carlo studies, whereby a model is used to generate “data” using pseudorandom numbers […] A model is a simplification or approximation of reality and hence will not reflect all of reality. […] While a model can never be “truth,” a model might be ranked from very useful, to useful, to somewhat useful to, finally, essentially useless. Model selection methods try to rank models in the candidate set relative to each other; whether any of the models is actually “good” depends primarily on the quality of the data and the science and a priori thinking that went into the modeling. […] Proper modeling and data analysis tell what inferences the data support, not what full reality might be […] Even if a “true model” did exist and if it could be found using some method, it would not be good as a fitted model for general inference (i.e., understanding or prediction) about some biological system, because its numerous parameters would have to be estimated from the finite data, and the precision of these estimated parameters would be quite low.”

A key concept in the context of model selection is the tradeoff between bias and variance in a model framework:

“If the fit is improved by a model with more parameters, then where should one stop? Box and Jenkins […] suggested that the *principle of parsimony* should lead to a model with “. . . the smallest possible number of parameters for adequate representation of the data.” Statisticians view the principle of parsimony as a bias versus variance tradeoff. In general, bias decreases and variance increases as the dimension of the model (K) increases […] The fit of any model can be improved by increasing the number of parameters […]; however, a tradeoff with the increasing variance must be considered in selecting a model for inference. Parsimonious models achieve a proper tradeoff between bias and variance. All model selection methods are based to some extent on the principle of parsimony […] The concept of parsimony and a bias versus variance tradeoff is very important.”
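To make the ‘fit always improves, but at a price’ point a bit more concrete, here is a minimal Python sketch. It is not taken from the book: the residual sums of squares are made-up numbers, and the least-squares forms of AIC, AICc and BIC used below are the standard textbook ones (with K counting all estimated parameters), added by me for illustration only.

```python
import math

def aic(rss, n, k):
    """AIC for a least-squares fit with normal errors (up to an additive constant)."""
    return n * math.log(rss / n) + 2 * k

def aicc(rss, n, k):
    """Small-sample corrected AIC; only defined when n > k + 1."""
    return aic(rss, n, k) + 2 * k * (k + 1) / (n - k - 1)

def bic(rss, n, k):
    """BIC: the penalty per parameter grows with log(n) instead of being fixed at 2."""
    return n * math.log(rss / n) + k * math.log(n)

n = 30
# Hypothetical residual sums of squares for nested models with K parameters.
# The fit (RSS) improves every time a parameter is added.
rss_by_k = {2: 60.0, 3: 41.0, 4: 37.0, 5: 36.6, 6: 36.4}

def best_k(criterion):
    """Return the K whose model has the smallest value of the criterion."""
    return min(rss_by_k, key=lambda k: criterion(rss_by_k[k], n, k))

for k, rss in rss_by_k.items():
    print(k, round(aic(rss, n, k), 2), round(aicc(rss, n, k), 2), round(bic(rss, n, k), 2))

print("AIC picks K =", best_k(aic))
print("AICc picks K =", best_k(aicc))
print("BIC picks K =", best_k(bic))
```

With these invented numbers the RSS keeps falling as K grows, yet none of the criteria picks the largest model: AIC and AICc settle on K = 4, and BIC, whose penalty depends on n, settles on K = 3.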

“we reserve the terms underfitted and overfitted for use in relation to a “best approximating model” […] Here, an underfitted model would ignore some important replicable (i.e., conceptually replicable in most other samples) structure in the data and thus fail to identify effects that were actually supported by the data. In this case, bias in the parameter estimators is often substantial, and the sampling variance is underestimated, both factors resulting in poor confidence interval coverage. Underfitted models tend to miss important treatment effects in experimental settings. Overfitted models, as judged against a best approximating model, are often free of bias in the parameter estimators, but have estimated (and actual) sampling variances that are needlessly large (the precision of the estimators is poor, relative to what could have been accomplished with a more parsimonious model). Spurious treatment effects tend to be identified, and spurious variables are included with overfitted models. […] The goal of data collection and analysis is to make inferences from the sample that properly apply to the population […] A paramount consideration is the repeatability, with good precision, of any inference reached. When we imagine many replicate samples, there will be some recognizable features common to almost all of the samples. Such features are the sort of inference about which we seek to make strong inferences (from our single sample). Other features might appear in, say, 60% of the samples yet still reflect something real about the population or process under study, and we would hope to make weaker inferences concerning these. Yet additional features appear in only a few samples, and these might be best included in the error term (σ^{2}) in modeling. 
If one were to make an inference about these features quite unique to just the single data set at hand, as if they applied to all (or most all) samples (hence to the population), then we would say that the sample is overfitted by the model (we have overfitted the *data*). Conversely, failure to identify the features present that are strongly replicable over samples is underfitting. […] A best approximating model is achieved by properly balancing the errors of underfitting and overfitting.”

Model selection bias is a key concept in the model selection context, and I think this problem is quite similar/closely related to problems encountered in a meta-analytical context which I believe I’ve discussed before here on the blog (see links above to the posts on meta-analysis) – if I’ve understood these authors correctly, one might choose to think of publication bias issues as partly the result of model selection bias issues. Let’s for a moment pretend you have a ‘true model’ which includes three variables (in the book example there are four, but I don’t think you need four…); one is very important, one is a sort of ‘60% of the samples variable’ mentioned above, and the last one would be a variable we might prefer to just include in the error term. Now the problem is this: When people look at samples where the last one of these variables is ‘seen to matter’, the effect size of this variable will be biased away from zero (they don’t explain where this bias comes from in the book, but I’m reasonably sure this is a result of the probability of identification/inclusion of the variable in the model depending on the (‘local’/‘sample’) effect size; the bigger the effect size of a specific variable in a specific sample, the more likely the variable is to be identified as important enough to be included in the model – *Borenstein and Hedges* talked about similar dynamics, for obvious reasons, and I think their reasoning ‘transfers’ to this situation and is applicable here as well). When models include variables such as the last one, you’ll have model selection bias: “When predictor variables [like these] are included in models, the associated estimator for a σ^{2} is negatively biased and precision is exaggerated. These two types of bias are called model selection bias”. Much later in the book they incidentally conclude that: “**The best way to minimize model selection bias is to reduce the number of models fit to the data by thoughtful a priori model formulation**.”
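The mechanism I conjecture above (inclusion depends on the sample effect size, so the included estimates are biased away from zero) is easy to check with a small simulation. The sketch below is my own illustration and not the book’s example: the true effect, sample size, known standard deviation and the ‘include if |z| > 1.96’ rule are all assumptions chosen to keep the code short.

```python
import random
import statistics

random.seed(1)

TRUE_EFFECT = 0.2   # hypothetical small effect we might rather leave in the error term
SIGMA = 1.0         # standard deviation, treated as known for simplicity
N = 25              # observations per replicate sample
REPLICATES = 5000
Z_CRIT = 1.96       # the variable is 'seen to matter' only if |z| exceeds this

standard_error = SIGMA / N ** 0.5
all_estimates = []
selected_estimates = []

for _ in range(REPLICATES):
    sample = [random.gauss(TRUE_EFFECT, SIGMA) for _ in range(N)]
    estimate = statistics.fmean(sample)
    all_estimates.append(estimate)
    # Selection step: keep the estimate only when the effect looks 'significant'.
    if abs(estimate / standard_error) > Z_CRIT:
        selected_estimates.append(estimate)

mean_all = statistics.fmean(all_estimates)
mean_selected = statistics.fmean(selected_estimates)

print("true effect:", TRUE_EFFECT)
print("mean estimate over all samples:", round(mean_all, 3))
print("mean estimate when the variable was 'selected':", round(mean_selected, 3))
print("share of samples where it was selected:", len(selected_estimates) / REPLICATES)
```

Averaged over all replicates the estimator is fine; averaged only over the replicates where the variable made it into the model, the estimate is far too large, because a sample estimate has to be well above the true effect to clear the inclusion threshold in the first place.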

“Model selection has most often been viewed, and hence taught, in a context of null hypothesis testing. Sequential testing has most often been employed, either stepup (forward) or stepdown (backward) methods. Stepwise procedures allow for variables to be added or deleted at each step. These testing-based methods remain popular in many computer software packages in spite of their poor operating characteristics. […] Generally, hypothesis testing is a very poor basis for model selection […] There is no statistical theory that supports the notion that hypothesis testing with a fixed α level is a basis for model selection. […] Tests of hypotheses within a data set are not independent, making inferences difficult. The order of testing is arbitrary, and differing test order will often lead to different final models. [This is incidentally one, of several, key differences between hypothesis testing approaches and information theoretic approaches: “The order in which the information criterion is computed over the set of models is not relevant.”] […] Model selection is dependent on the arbitrary choice of α, but α should depend on both n and K to be useful in model selection”.

## Statistical Models for Proportions and Probabilities

“Most elementary statistics books discuss inference for proportions and probabilities, and the primary readership for this monograph is the student of statistics, either at an advanced undergraduate or graduate level. As some of the recommended so-called ‘‘large-sample’’ rules in textbooks have been found to be inappropriate, this monograph endeavors to provide more up-to-date information on these topics. I have also included a number of related topics not generally found in textbooks. The emphasis is on model building and the estimation of parameters from the models.

It is assumed that the reader has a background in statistical theory and inference and is familiar with standard univariate and multivariate distributions, including conditional distributions.”

…

The above quote is from the book’s preface. The book is highly technical – here’s a screencap of a page roughly in the middle:

I think the above picture provides some background as to why I do not think it’s a good idea to provide detailed coverage of the book here. Not all pages are that bad, but this *is* a book on mathematical statistics. The technical nature of the book made it difficult for me to know how to rate it – I like to ask myself when reading books like this one if I would be able to spot an error in the coverage. In some contexts here I clearly would not be able to do that (given the time I was willing to spend on the book), and when that’s the case I always feel hesitant about rating(/’judging’) books of this nature. I should note that there are pretty much no spelling/formatting errors, and the language is easy to understand (‘if you know enough about statistics…’). I did have one major problem with part of the coverage towards the end of the book, but it didn’t much alter my general impression of the book. The problem was that the author seems to apply (/recommend?) a hypothesis-testing framework for model selection, a practice which although widely used is frankly considered bad statistics by Burnham and Anderson in their book on model selection. In the relevant section of the book Seber discusses an approach to modelling which starts out with a ‘full model’ including both primary effects and various (potentially multi-level) interaction terms (he deals specifically with data derived from multiple (independent?) multinomial distributions, but where the data comes from is not really important here), and then he proceeds to use hypothesis tests of whether interaction terms are zero to determine whether or not interactions should be included in the model. For people who don’t know, this model selection method is both very commonly used and a very wrong way to do things; using hypothesis testing as a model selection mechanism is a methodologically invalid approach to model selection, something Burnham and Anderson talk a lot about in their book. 
I assume I’ll be covering Burnham and Anderson’s book in more detail later on here on the blog, so for now I’ll just make this key point here and then return to that stuff later – if you did not understand the comments above you shouldn’t worry too much about it, I’ll go into much more detail when talking about that stuff later. This problem was the only real problem I had with Seber’s book.

Although I’ll not talk a lot about what the book was about (not only because it might be hard for some readers to follow, I should point out, but also because detailed coverage would take a lot more time than I’d be willing to spend on this stuff), I decided to add a few links to relevant stuff he talks about in the book. Quite a few pages in the book are spent on talking about the properties of various distributions, how to estimate key parameters of interest, and how to construct confidence intervals to be used for hypothesis testing in those specific contexts.

Some of the links below deal with stuff covered in the book, a few others however just deal with stuff I had to look up in order to understand what was going on in the coverage:

Inverse sampling.

Binomial distribution.

Hypergeometric distribution.

Multinomial distribution.

Binomial proportion confidence interval. (Coverage of the Wilson score interval, Jeffreys interval, and the Clopper-Pearson interval included in the book).

Fisher’s exact test.

Marginal distribution.

Fisher information.

Moment-generating function.

Factorial moment-generating function.

Delta method.

Multidimensional central limit theorem (the book applies this, but doesn’t really talk about it).

Matrix function.

McNemar’s test.
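As a small illustration of one of the topics linked above, here is a minimal Python sketch of the Wilson score interval for a binomial proportion, next to the familiar Wald (‘normal approximation’) interval. The formulas are the standard ones from the linked article, not code or notation from Seber’s book, and the choice of the Wald interval as the contrast case is my own; whether it is among the ‘large-sample’ rules the preface has in mind is an assumption on my part.

```python
import math

def wald_interval(x, n, z=1.96):
    """Normal-approximation interval: p_hat +/- z * sqrt(p_hat * (1 - p_hat) / n)."""
    p_hat = x / n
    half_width = z * math.sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - half_width, p_hat + half_width

def wilson_interval(x, n, z=1.96):
    """Wilson score interval for x successes in n trials."""
    p_hat = x / n
    denominator = 1 + z ** 2 / n
    center = (p_hat + z ** 2 / (2 * n)) / denominator
    half_width = z * math.sqrt(p_hat * (1 - p_hat) / n + z ** 2 / (4 * n ** 2)) / denominator
    return center - half_width, center + half_width

# Zero successes in 20 trials: the Wald interval collapses to a single point.
print("Wald:  ", wald_interval(0, 20))
print("Wilson:", wilson_interval(0, 20))
```

With zero observed successes the Wald interval has zero width, which is clearly not a sensible statement of uncertainty, whereas the Wilson interval still has a positive upper limit.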

## Wikipedia articles of interest

i. Pendle witches.

“The trials of the **Pendle witches** in 1612 are among the most famous witch trials in English history, and some of the best recorded of the 17th century. The twelve accused lived in the area around Pendle Hill in Lancashire, and were charged with the murders of ten people by the use of witchcraft. All but two were tried at Lancaster Assizes on 18–19 August 1612, along with the Samlesbury witches and others, in a series of trials that have become known as the Lancashire witch trials. One was tried at York Assizes on 27 July 1612, and another died in prison. Of the eleven who went to trial – nine women and two men – ten were found guilty and executed by hanging; one was found not guilty.

The official publication of the proceedings by the clerk to the court, Thomas Potts, in his *The Wonderfull Discoverie of Witches in the Countie of Lancaster*, and the number of witches hanged together – nine at Lancaster and one at York – make the trials unusual for England at that time. It has been estimated that all the English witch trials between the early 15th and early 18th centuries resulted in fewer than 500 executions; this series of trials accounts for more than two per cent of that total.”

“One of the accused, Demdike, had been regarded in the area as a witch for fifty years, and some of the deaths the witches were accused of had happened many years before Roger Nowell started to take an interest in 1612.^{[13]} The event that seems to have triggered Nowell’s investigation, culminating in the Pendle witch trials, occurred on 21 March 1612.^{[14]}

On her way to Trawden Forest, Demdike’s granddaughter, Alizon Device, encountered John Law, a pedlar from Halifax, and asked him for some pins.^{[15]} Seventeenth-century metal pins were handmade and relatively expensive, but they were frequently needed for magical purposes, such as in healing – particularly for treating warts – divination, and for love magic, which may have been why Alizon was so keen to get hold of them and why Law was so reluctant to sell them to her.^{[16]} Whether she meant to buy them, as she claimed, and Law refused to undo his pack for such a small transaction, or whether she had no money and was begging for them, as Law’s son Abraham claimed, is unclear.^{[17]} A few minutes after their encounter Alizon saw Law stumble and fall, perhaps because he suffered a stroke; he managed to regain his feet and reach a nearby inn.^{[18]} Initially Law made no accusations against Alizon,^{[19]} but she appears to have been convinced of her own powers; when Abraham Law took her to visit his father a few days after the incident, she reportedly confessed and asked for his forgiveness.^{[20]}

Alizon Device, her mother Elizabeth, and her brother James were summoned to appear before Nowell on 30 March 1612. Alizon confessed that she had sold her soul to the Devil, and that she had told him to lame John Law after he had called her a thief. Her brother, James, stated that his sister had also confessed to bewitching a local child. Elizabeth was more reticent, admitting only that her mother, Demdike, had a mark on her body, something that many, including Nowell, would have regarded as having been left by the Devil after he had sucked her blood.”

“The Pendle witches were tried in a group that also included the Samlesbury witches, Jane Southworth, Jennet Brierley, and Ellen Brierley, the charges against whom included child murder and cannibalism; Margaret Pearson, the so-called Padiham witch, who was facing her third trial for witchcraft, this time for killing a horse; and Isobel Robey from Windle, accused of using witchcraft to cause sickness.^{[33]}

Some of the accused Pendle witches, such as Alizon Device, seem to have genuinely believed in their guilt, but others protested their innocence to the end.”

“Nine-year-old Jennet Device was a key witness for the prosecution, something that would not have been permitted in many other 17th-century criminal trials. However, King James had made a case for suspending the normal rules of evidence for witchcraft trials in his *Daemonologie*.^{[42]} As well as identifying those who had attended the Malkin Tower meeting, Jennet also gave evidence against her mother, brother, and sister. […] When Jennet was asked to stand up and give evidence against her mother, Elizabeth began to scream and curse her daughter, forcing the judges to have her removed from the courtroom before the evidence could be heard.^{[48]} Jennet was placed on a table and stated that she believed her mother had been a witch for three or four years. She also said her mother had a familiar called Ball, who appeared in the shape of a brown dog. Jennet claimed to have witnessed conversations between Ball and her mother, in which Ball had been asked to help with various murders. James Device also gave evidence against his mother, saying he had seen her making a clay figure of one of her victims, John Robinson.^{[49]} Elizabeth Device was found guilty.^{[47]}

James Device pleaded not guilty to the murders by witchcraft of Anne Townley and John Duckworth. However he, like Chattox, had earlier made a confession to Nowell, which was read out in court. That, and the evidence presented against him by his sister Jennet, who said that she had seen her brother asking a black dog he had conjured up to help him kill Townley, was sufficient to persuade the jury to find him guilty.^{[50]}^{[51]}”

“Many of the allegations made in the Pendle witch trials resulted from members of the Demdike and Chattox families making accusations against each other. Historian John Swain has said that the outbreaks of witchcraft in and around Pendle demonstrate the extent to which people could make a living either by posing as a witch, or by accusing or threatening to accuse others of being a witch.^{[17]} Although it is implicit in much of the literature on witchcraft that the accused were victims, often mentally or physically abnormal, for some at least, it may have been a trade like any other, albeit one with significant risks.^{[74]} There may have been bad blood between the Demdike and Chattox families because they were in competition with each other, trying to make a living from healing, begging, and extortion.”

…

ii. Kullback–Leibler divergence.

This article is the only one of the five ‘main articles’ in this post which is not a featured article. I looked this one up because the Burnham & Anderson book I’m currently reading talks about this stuff quite a bit. The book will probably be one of the most technical books I’ll read this year, and I’m not sure how much of it I’ll end up covering here. Basically most of the book deals with the stuff ‘covered’ in the (very short) ‘Relationship between models and reality’ section of the wiki article. There are a lot of details the article left out… The same could be said about the related wiki article about AIC (both articles incidentally include the book in their references).
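For readers who have not seen the quantity before, here is a minimal Python sketch of the Kullback–Leibler divergence for discrete distributions, using the standard definition from the wiki article. The three distributions are made-up numbers, and calling one of them ‘truth’ is only for illustration (as the book stresses, in practice truth is not known).

```python
import math

def kl_divergence(p, q):
    """D(p || q) = sum_i p_i * log(p_i / q_i), in nats.

    Assumes p and q are discrete distributions over the same outcomes and
    that q_i > 0 wherever p_i > 0. Terms with p_i == 0 contribute nothing.
    """
    return sum(p_i * math.log(p_i / q_i) for p_i, q_i in zip(p, q) if p_i > 0)

truth = [0.5, 0.3, 0.2]      # hypothetical 'reality'
model_a = [0.4, 0.4, 0.2]    # hypothetical approximating model, fairly close
model_b = [0.8, 0.1, 0.1]    # hypothetical approximating model, further away

print("information lost using model_a:", kl_divergence(truth, model_a))
print("information lost using model_b:", kl_divergence(truth, model_b))
print("reversed arguments:", kl_divergence(model_a, truth))
```

The divergence is zero when the model equals the truth, it is smaller for the closer model, and it is not symmetric in its two arguments, which is why it is called a divergence and not a distance.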

…

iii. Atmosphere of Jupiter.

The first thing that would spring to mind if someone asked me what I knew about it would probably be something along the lines of: “…well, it’s *huge*…”

…and it is. But we know a lot more than that – some observations from the article:

“The atmosphere of Jupiter is the largest planetary atmosphere in the Solar System. It is mostly made of molecular hydrogen and helium in roughly solar proportions; other chemical compounds are present only in small amounts […] The atmosphere of Jupiter lacks a clear lower boundary and gradually transitions into the liquid interior of the planet. […] The Jovian atmosphere shows a wide range of active phenomena, including band instabilities, vortices (cyclones and anticyclones), storms and lightning. […] Jupiter has powerful storms, always accompanied by lightning strikes. The storms are a result of moist convection in the atmosphere connected to the evaporation and condensation of water. They are sites of strong upward motion of the air, which leads to the formation of bright and dense clouds. The storms form mainly in belt regions. The lightning strikes on Jupiter are hundreds of times more powerful than those seen on Earth.” [However do note that later on in the article it is stated that: “On Jupiter lighting strikes are on average a few times more powerful than those on Earth.”]

“The composition of Jupiter’s atmosphere is similar to that of the planet as a whole.^{[1]} Jupiter’s atmosphere is the most comprehensively understood of those of all the gas giants because it was observed directly by the *Galileo* atmospheric probe when it entered the Jovian atmosphere on December 7, 1995.^{[26]} Other sources of information about Jupiter’s atmospheric composition include the *Infrared Space Observatory* (ISO),^{[27]} the *Galileo* and *Cassini* orbiters,^{[28]} and Earth-based observations.”

“The visible surface of Jupiter is divided into several bands parallel to the equator. There are two types of bands: lightly colored *zones* and relatively dark *belts.* […] The alternating pattern of belts and zones continues until the polar regions at approximately 50 degrees latitude, where their visible appearance becomes somewhat muted.^{[30]} The basic belt-zone structure probably extends well towards the poles, reaching at least to 80° North or South.^{[5]}

The difference in the appearance between zones and belts is caused by differences in the opacity of the clouds. Ammonia concentration is higher in zones, which leads to the appearance of denser clouds of ammonia ice at higher altitudes, which in turn leads to their lighter color.^{[15]} On the other hand, in belts clouds are thinner and are located at lower altitudes.^{[15]} The upper troposphere is colder in zones and warmer in belts.^{[5]} […] The Jovian bands are bounded by zonal atmospheric flows (winds), called *jets*. […] The location and width of bands, speed and location of jets on Jupiter are remarkably stable, having changed only slightly between 1980 and 2000. […] However bands vary in coloration and intensity over time […] These variations were first observed in the early seventeenth century.”

“Jupiter radiates much more heat than it receives from the Sun. It is estimated that the ratio between the power emitted by the planet and that absorbed from the Sun is 1.67 ± 0.09.”

…

iv. Wife selling (English custom).

“**Wife selling** in England was a way of ending an unsatisfactory marriage by mutual agreement that probably began in the late 17th century, when divorce was a practical impossibility for all but the very wealthiest. After parading his wife with a halter around her neck, arm, or waist, a husband would publicly auction her to the highest bidder. […] Although the custom had no basis in law and frequently resulted in prosecution, particularly from the mid-19th century onwards, the attitude of the authorities was equivocal. At least one early 19th-century magistrate is on record as stating that he did not believe he had the right to prevent wife sales, and there were cases of local Poor Law Commissioners forcing husbands to sell their wives, rather than having to maintain the family in workhouses.”

“Until the passing of the Marriage Act of 1753, a formal ceremony of marriage before a clergyman was not a legal requirement in England, and marriages were unregistered. All that was required was for both parties to agree to the union, so long as each had reached the legal age of consent,^{[8]} which was 12 for girls and 14 for boys.^{[9]} Women were completely subordinated to their husbands after marriage, the husband and wife becoming one legal entity, a legal status known as coverture. […] Married women could not own property in their own right, and were indeed themselves the property of their husbands. […] Five distinct methods of breaking up a marriage existed in the early modern period of English history. One was to sue in the ecclesiastical courts for separation from bed and board (*a mensa et thoro*), on the grounds of adultery or life-threatening cruelty, but it did not allow a remarriage.^{[11]} From the 1550s, until the Matrimonial Causes Act became law in 1857, divorce in England was only possible, if at all, by the complex and costly procedure of a private Act of Parliament.^{[12]} Although the divorce courts set up in the wake of the 1857 Act made the procedure considerably cheaper, divorce remained prohibitively expensive for the poorer members of society.^{[13]}^{[nb 1]} An alternative was to obtain a “private separation”, an agreement negotiated between both spouses, embodied in a deed of separation drawn up by a conveyancer. Desertion or elopement was also possible, whereby the wife was forced out of the family home, or the husband simply set up a new home with his mistress.^{[11]} Finally, the less popular notion of wife selling was an alternative but illegitimate method of ending a marriage.”

“Although some 19th-century wives objected, records of 18th-century women resisting their sales are non-existent. With no financial resources, and no skills on which to trade, for many women a sale was the only way out of an unhappy marriage.^{[17]} Indeed the wife is sometimes reported as having insisted on the sale. […] Although the initiative was usually the husband’s, the wife had to agree to the sale. An 1824 report from Manchester says that “after several biddings she [the wife] was knocked down for 5s; but not liking the purchaser, she was put up again for 3s and a quart of ale”.^{[27]} Frequently the wife was already living with her new partner.^{[28]} In one case in 1804 a London shopkeeper found his wife in bed with a stranger to him, who, following an altercation, offered to purchase the wife. The shopkeeper agreed, and in this instance the sale may have been an acceptable method of resolving the situation. However, the sale was sometimes spontaneous, and the wife could find herself the subject of bids from total strangers.^{[29]} In March 1766, a carpenter from Southwark sold his wife “in a fit of conjugal indifference at the alehouse”. Once sober, the man asked his wife to return, and after she refused he hanged himself. A domestic fight might sometimes precede the sale of a wife, but in most recorded cases the intent was to end a marriage in a way that gave it the legitimacy of a divorce.”

“Prices paid for wives varied considerably, from a high of £100 plus £25 each for her two children in a sale of 1865 (equivalent to about £12,500 in 2015)^{[34]} to a low of a glass of ale, or even free. […] According to authors Wade Mansell and Belinda Meteyard, money seems usually to have been a secondary consideration;^{[4]} the more important factor was that the sale was seen by many as legally binding, despite it having no basis in law. […] In Sussex, inns and public houses were a regular venue for wife-selling, and alcohol often formed part of the payment. […] in Ninfield in 1790, a man who swapped his wife at the village inn for half a pint of gin changed his mind and bought her back later.^{[42]} […] Estimates of the frequency of the ritual usually number about 300 between 1780 and 1850, relatively insignificant compared to the instances of desertion, which in the Victorian era numbered in the tens of thousands.^{[43]}”

“In 1825 a man named Johnson was charged with “having sung a song in the streets describing the merits of his wife, for the purpose of selling her to the highest bidder at Smithfield.” Such songs were not unique; in about 1842 John Ashton wrote “Sale of a Wife”.^{[nb 6]}^{[58]} The arresting officer claimed that the man had gathered a “crowd of all sorts of vagabonds together, who appeared to listen to his ditty, but were in fact, collected to pick pockets.” The defendant, however, replied that he had “not the most distant idea of selling his wife, who was, poor creature, at home with her hungry children, while he was endeavouring to earn a bit of bread for them by the strength of his lungs.” He had also printed copies of the song, and the story of a wife sale, to earn money. Before releasing him, the Lord Mayor, judging the case, cautioned Johnson that the practice could not be allowed, and must not be repeated.^{[59]} In 1833 the sale of a woman was reported at Epping. She was sold for 2s. 6d., with a duty of 6d. Once sober, and placed before the Justices of the Peace, the husband claimed that he had been forced into marriage by the parish authorities, and had “never since lived with her, and that she had lived in open adultery with the man Bradley, by whom she had been purchased”. He was imprisoned for “having deserted his wife”.^{[60]}”

…

v. Bog turtle.

“The **bog turtle** (*Glyptemys muhlenbergii*) is a semiaquatic turtle endemic to the eastern United States. […] It is the smallest North American turtle, measuring about 10 centimeters (4 in) long when fully grown. […] The bog turtle can be found from Vermont in the north, south to Georgia, and west to Ohio. Diurnal and secretive, it spends most of its time buried in mud and – during the winter months – in hibernation. The bog turtle is omnivorous, feeding mainly on small invertebrates.”

“The bog turtle is native only to the eastern United States,^{[nb 1]} congregating in colonies that often consist of fewer than 20 individuals.^{[23]} […] densities can range from 5 to 125 individuals per 0.81 hectares (2.0 acres). […] The bog turtle spends its life almost exclusively in the wetland where it hatched. In its natural environment, it has a maximum lifespan of perhaps 50 years or more,^{[47]} and the average lifespan is 20–30 years.”

“The bog turtle is primarily diurnal, active during the day and sleeping at night. It wakes in the early morning, basks until fully warm, then begins its search for food.^{[31]} It is a seclusive species, making it challenging to observe in its natural habitat.^{[11]} During colder days, the bog turtle will spend much of its time in dense underbrush, underwater, or buried in mud. […] Day-to-day, the bog turtle moves very little, typically basking in the sun and waiting for prey. […] Various studies have found different rates of daily movement in bog turtles, varying from 2.1 to 23 meters (6.9 to 75.5 ft) in males and 1.1 to 18 meters (3.6 to 59.1 ft) in females.”

“Changes to the bog turtle’s habitat have resulted in the disappearance of 80 percent of the colonies that existed 30 years ago.^{[7]} Because of the turtle’s rarity, it is also in danger of illegal collection, often for the worldwide pet trade. […] The bog turtle was listed as *critically endangered* in the 2011 IUCN Red List.^{[53]}”

## Evidence-Based Diagnosis

“Evidence-Based Diagnosis is a textbook about diagnostic, screening, and prognostic tests in clinical medicine. The authors’ approach is based on many years of experience teaching physicians in a clinical research training program. Although requiring only a minimum of mathematics knowledge, the quantitative discussions in this book are deeper and more rigorous than those in most introductory texts. […] It is aimed primarily at clinicians, particularly those who are academically minded, but it should be helpful and accessible to anyone involved with selection, development, or marketing of diagnostic, screening, or prognostic tests. […] Our perspective is that of skeptical consumers of tests. We want to make proper diagnoses and not miss treatable diseases. Yet, we are aware that vast resources are spent on tests that too frequently provide wrong answers or right answers of little value, and that new tests are being developed, marketed, and sold all the time, sometimes with little or no demonstrable or projected benefit to patients. This book is intended to provide readers with the tools they need to evaluate these tests, to decide if and when they are worth doing, and to interpret the results.”

…

I simply could not possibly justify not giving this book a shot considering the Amazon ratings – it has an insane average rating of five stars, based on nine ratings. I agree with the reviewers: This is a really nice book. It covers a lot of stuff I’ve seen before, e.g. in Fletcher and Fletcher, Petrie and Sabin, Juth and Munthe, Borenstein, Hedges et al., Adam, Baltussen et al. (listing all of these suddenly made me realize how much stuff I’ve actually read about these sorts of topics in the past…), as well as in stats courses I’ve taken, but as the book focuses specifically on medical testing there is also a lot of new stuff. It should be noted that some people will benefit a lot more from reading the book than I did; I’ve spent weeks dealing with related aspects of subtopics they cover in just a few pages, and there were a lot of familiar concepts, distinctions, etc. in the book. Even so, this book is remarkably well-written and these guys really know their stuff. If you want to read a book about the basics of how to make sense of the results of medical tests and stuff like that, this is the book you’ll want to read.

Let’s say you have a test measuring some variable which might be useful in a diagnostic context. How would we know it might be useful? Well, one might come up with some criteria such a test should meet; for instance that the results of the test don’t depend on who’s doing the testing, and perhaps also that it shouldn’t matter when the test is done. You might also want the test to be somewhat accurate. But what do we even mean by that? There are various approaches to thinking about accuracy, and some may be better than others. So the book covers familiar topics like sensitivity and specificity, likelihood ratios, and receiver operating characteristic (ROC) curves. A test might be accurate, but if the results of a test do not change clinical decision-making it might not be worth it to do the test; so the question of whether a test is accurate or not is different from whether it’s also useful. In terms of usefulness concepts like positive- and negative predictive value and distinctions such as that between absolute and relative risk become important. It might not even be a good idea to use a test even if it distinguishes reasonably well between people who are sick and people who are not, because a very accurate test might be too expensive to be justified undertaking; the book also has a bit of stuff on cost-effectiveness. Of course costs associated with getting tested for a health condition are not limited to monetary costs; a test might be uncomfortable, and it may also for example be the case that a false positive or a false negative result might sometimes have quite severe consequences (e.g. in the context of cancer screening). In such contexts concepts like the number needed to treat might be useful. It might also on the other hand be that a test gives answers which are wrong so often that even if it’s very cheap to do, it still might not be worth doing.

There’s stuff in the book about how to think about, and come up with decision-rules about, how to identify things like treatment-thresholds; variables which will be determined by probability of disease and costs associated with testing (and treatment). A variable like the cost of a treatment might in an analytical framework involve both the costs of treating people with the health condition as well as the costs of treating people who tested positive without being sick and the costs of not treating sick people who tested negative. One might think in one context that it would be twice as bad to miss a diagnosis as it would be to treat someone who does not have the disease, which would lead to one set of decision-rules in terms of when to test and when to treat, whereas in another context it might be a lot worse to miss a diagnosis, so we’d be less worried about treating someone without the disease. There may be more than one relevant threshold in the diagnostic setting; usually there’ll be some range of prior probabilities of disease for which the test will add enough information to change decision-making, but at either end of the range the test might not add enough information to be justified. To be more specific, if you’re almost certain the patient has some specific disease, you’ll want to treat him because the test result will not change anything; and if on the other hand you’re almost certain that he does not have the disease, based e.g. on the prevalence rate and the clinical presentation, then you’ll want to refrain from testing if the test has costs (including time costs, inconvenience, etc.). The book includes formal and reasonably detailed analysis of such topics.
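To make the ‘twice as bad’ example a bit more concrete, here is a minimal sketch of the simplest version of a treatment threshold, ignoring the test itself. It assumes the standard expected-harm argument: if treating a healthy patient costs C and failing to treat a sick patient costs B, the two options break even at a disease probability of C / (C + B). The function name and the 10:1 ratio are my own illustrative choices, and the book’s notation may differ.

```python
from fractions import Fraction


def treatment_threshold(cost_false_positive, cost_false_negative):
    """Probability of disease above which treating beats not treating.

    cost_false_positive: net harm of treating a patient without the disease (C).
    cost_false_negative: net harm of not treating a patient with the disease (B).
    Treating has expected harm (1 - p) * C and not treating has p * B,
    so the two are equal at p = C / (C + B).
    """
    c = Fraction(cost_false_positive)
    b = Fraction(cost_false_negative)
    return c / (c + b)


# Missing a diagnosis is twice as bad as treating a healthy patient.
twice_as_bad = treatment_threshold(1, 2)

# Missing a diagnosis is ten times as bad (hypothetical ratio).
ten_times_as_bad = treatment_threshold(1, 10)

print(twice_as_bad, ten_times_as_bad)
```

The worse it is to miss a diagnosis, the lower the probability of disease at which you’d be willing to treat.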

In terms of how to interpret the results of a test it matters who you’re testing, and as already indicated the authors apply a Bayesian approach to these matters and repeatedly emphasize the importance of priors when evaluating test results (or for that matter findings from the literature). In that context some important notions are included about what you can and can’t use e.g. variables like prevalence and incidence for, how best to use such variables to inform decision-making, and things like how the study design might impact which variables are available to you for analysis (don’t try to estimate prevalence if you’re dealing with a case-control setup, where this variable is determined by the study design).
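A small sketch of why it matters who you’re testing: the same positive result from the same test leads to very different post-test probabilities depending on the prior. The sensitivity, specificity and priors below are made-up numbers for illustration, not figures from the book.

```python
def likelihood_ratio_positive(sensitivity, specificity):
    """LR+ = P(positive | disease) / P(positive | no disease)."""
    return sensitivity / (1.0 - specificity)


def post_test_probability(prior, likelihood_ratio):
    """Update a prior probability of disease via odds times likelihood ratio."""
    prior_odds = prior / (1.0 - prior)
    post_odds = prior_odds * likelihood_ratio
    return post_odds / (1.0 + post_odds)


# Hypothetical test with 90% sensitivity and 90% specificity.
lr_pos = likelihood_ratio_positive(0.90, 0.90)

# Screening setting: low prior probability of disease (assumed 1%).
post_low_prior = post_test_probability(0.01, lr_pos)

# Diagnostic setting: symptomatic patient, assumed prior of 50%.
post_high_prior = post_test_probability(0.50, lr_pos)

print(round(post_low_prior, 3), round(post_high_prior, 3))
```

With the low prior, most positives are still false positives; with the high prior, a positive result is quite convincing.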

Of course most medical tests don’t just give two results. Dichotomization adds simplicity compared to more complex scenarios, so that’s where the book starts out, but it doesn’t stop there. If you have a test involving a continuous variable then dichotomizing the results will reduce the value of the test; this is equivalent to using pair-wise comparisons to make sense of continuous data in other contexts. However it’s sometimes useful to do it anyway because you may be in a situation where you need to quickly/easily separate ‘normal’ from ‘abnormal’. Likelihood ratios are really useful in the context of multi-level tests. In the simple dichotomous test, the LR for a test result is the probability of the result in a patient with disease divided by the probability of the result in a patient without disease. If you have lots of possible test results however, you’ll not be limited to two likelihood ratios; you’ll have as many likelihood ratios as there are results of the test. Those likelihood ratios are useful because the LR in the context of a multi-level test is equal to the slope of the ROC curve over the relevant interval. The ROC curve in some sense displays the tradeoff between sensitivity (‘true positive’) and specificity (‘true negative’); each point on the curve represents a different cut-off for calling a test positive. Such curves are quite useful in terms of figuring out if a test adds information or not, i.e. how well it distinguishes between patients. If you want to compare different tests and how they perform, Bland-Altman plots also seem to be useful tools.
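The claim that the interval likelihood ratio equals the slope of the ROC curve is easy to check numerically. The sketch below uses invented counts for a four-level test (the level names and numbers are assumptions, not data from the book), computes one LR per result level, and compares these with the slopes of the ROC segments.

```python
import math

# Hypothetical counts of patients at each test-result level,
# ordered from most abnormal to least abnormal.
levels = ["very high", "high", "borderline", "normal"]
diseased = [40, 30, 20, 10]
non_diseased = [2, 8, 30, 60]


def interval_likelihood_ratios(diseased, non_diseased):
    """P(result | disease) / P(result | no disease) for each result level."""
    total_d, total_n = sum(diseased), sum(non_diseased)
    return [(d / total_d) / (n / total_n) for d, n in zip(diseased, non_diseased)]


def roc_points(diseased, non_diseased):
    """(1 - specificity, sensitivity) as the cut-off moves down the levels."""
    total_d, total_n = sum(diseased), sum(non_diseased)
    points = [(0.0, 0.0)]
    cum_d = cum_n = 0
    for d, n in zip(diseased, non_diseased):
        cum_d += d
        cum_n += n
        points.append((cum_n / total_n, cum_d / total_d))
    return points


lrs = interval_likelihood_ratios(diseased, non_diseased)
points = roc_points(diseased, non_diseased)
slopes = [(y1 - y0) / (x1 - x0) for (x0, y0), (x1, y1) in zip(points, points[1:])]

for level, lr, slope in zip(levels, lrs, slopes):
    print(level, round(lr, 3), round(slope, 3))

slopes_match = all(math.isclose(lr, s, rel_tol=1e-9) for lr, s in zip(lrs, slopes))
```

Each ROC segment corresponds to one result level, which is why a steep segment means a result that strongly favours disease.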

Sometimes the results of more than one test will be relevant to decision-making, and a key question to ask here is the extent to which tests, and test results, are independent. If tests are not independent, one should be careful about how to update the probability of disease based on a new laboratory finding, and about which conclusions can be drawn regarding the extent to which an additional test might or might not be necessary/useful/required. The book does not go into too much detail, but enough is said on this topic to make it clear that test dependence is a potential issue one should keep in mind when evaluating multiple test results. They do talk a bit about how to come up with decision-rules about which tests to prefer in situations where multiple interdependent tests are available for analysis.

…

Sometimes blinding is difficult. The book tells us that it’s particularly important to blind when outcomes are subjective (like pain), and when prognostic factors may affect treatment in the study setting.

Medical tests can be used for different things, and not all tests are equal. One important distinction they talk about in the book is the distinction between diagnostic tests, which are done on sick people to figure out why they’re sick, and screening tests, which are mostly done on healthy people with a low prior probability of disease. There are different types of screening tests. One type of test is screening for symptomatic disease, which is sometimes done because people may be sick and have symptoms without being aware of the fact that they’re sick; screening for depression might be an example of this (that *may* even sometimes be cost-effective). These tests are reasonably similar to traditional diagnostic tests, and so can be evaluated in a similar manner. However most screening tests are of a different kind; they’re aimed at identifying *risk factors*, rather than ‘actual disease’ (a third kind is screening for presymptomatic disease). This generally tends to make them harder to justify undertaking, for reasons covered in much greater detail in *Juth and Munthe* (see the link over the word ‘may’ above). There are other differences as well; concepts such as sensitivity and specificity are for example difficult to relate to screening tests aimed at identifying risk factors, as such screening tests have as a goal to estimate incidence, rather than prevalence, which will often make it hard to compare such tests with the established ‘gold standard’ (as is usually the case). I decided to include a few quotes from this part of the coverage:

“the general public tends to be supportive of screening programs. Part of this is wishful thinking. We would like to believe that bad things happen for a reason, and that there are things we can do to prevent them […]. We also tend to be much more swayed by stories of individual patients (either those whose disease was detected early or those in whom it was found “too late”) than by boring statistics about risks, costs, and benefits […]. Because, at least in the U.S., there is no clear connection between money spent on screening tests and money not being available to spend on other things, the public tends not to be swayed by arguments about cost efficacy […]. In fact, in the general public’s view of screening, even wrong answers are not necessarily a bad thing. Schwartz et al. (2004) did a national telephone survey of attitudes about cancer screening in the U.S. They found that 38% of respondents had experienced at least one false-positive screening test. Although more than 40% of these subjects referred to that experience as “very scary” or the “scariest time of my life,” 98% were glad they had the screening test! […] Another disturbing result of the survey by Schwartz et al. was that, even though (as of 2002) the U.S. Preventive Health Services Task Force felt that evidence was insufficient to recommend prostate cancer screening, more than 60% of respondents said that a 55-year-old man who did not have a routine PSA test was “irresponsible,” and more than a third said this for an 80-year old […] Thus, regardless of the efficacy of screening tests, they have become an obligation if one does not wish to be blamed for getting some illnesses.”

There are many reasons why there may be problems with using observational studies to evaluate screening tests, and they talk a bit about those. One is what they call ‘volunteer bias’, which is just basic selection bias. Then there are the familiar problems of lead-time bias and length time bias. It should perhaps be noted here that both of the two latter problems can be handled in the context of a randomized controlled trial; neither lead-time bias nor length time bias is an issue if the study is an RCT which compares the entire screened group with the entire unscreened group. Yet another problem is stage-migration bias, which for example can be a problem when more sensitive tests allow for earlier detection which changes how people are staged; this may lead to changes in stage-specific mortality rates, without actually improving overall mortality at all. A final problem they talk about is overdiagnosis related to the problem of pseudodisease, which is disease that would never have affected the patient if it had not been diagnosed by the screening procedure. Again a quote might be in order:

“It is difficult to identify pseudodisease in an individual patient, because it requires completely ignoring the diagnosis. (If you treat pseudodisease, the treatment will always appear to be curative, and you won’t realize the patient had pseudodisease rather than real disease!) In some ways, pseudodisease is an extreme type of stage migration bias. Patients who were not previously diagnosed as having the disease are now counted as having it. Although the incidence of the disease goes up, the prognosis of those who have it improves. […] Lack of understanding of pseudodisease, including the lack of people who know they have had it, is a real problem, because most of us understand the world through stories […]. Patients whose pseudodisease has been “cured” become strong proponents of screening and treatment and can tell a powerful and easily understood story about their experience. On the other hand, there aren’t people who can tell a compelling story of pseudodisease – men who can say, “I had a completely unnecessary prostatectomy,” or women who say, “I had a completely unnecessary mastectomy,” even though we know statistically that many such people exist.

The existence of pseudo–lung cancer was strongly suggested by the results of the Mayo Lung Study, a randomized trial of chest x-rays and sputum cytology to screen for lung cancer among 9,211 male cigarette smokers (Marcus et al. 2000).”

I included the last part to indicate that this is a real problem even in situations where you’d be very likely to imagine it couldn’t possibly be a problem; even a disease as severe as lung cancer is subject to this kind of issue. There are also problems that may make screening tests look worse than they really are, like power issues, unskilled medical personnel doing the testing, and lack of follow-up (if a positive test result does not lead to any change in health care provision, there’s no good reason to assume earlier diagnosis as a result of screening will impact e.g. disease-specific mortality). On a related note there’s some debate about which mortality metric (general vs disease-specific) is to be preferred in the screening context, and they talk a bit about that as well.
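The lead-time bias mentioned earlier is perhaps easiest to see with a toy example. The sketch below assumes, purely for illustration, that screening finds the disease a fixed two years before symptoms and that earlier diagnosis has no effect whatsoever on when patients die; the three patients and all ages are invented.

```python
# Hypothetical cohort: age at which disease becomes symptomatic, and age at death.
patients = [
    {"symptomatic_at": 60, "dies_at": 63},
    {"symptomatic_at": 65, "dies_at": 69},
    {"symptomatic_at": 70, "dies_at": 75},
]

# Assumption: screening detects disease this many years before symptoms appear,
# and earlier detection does not change the age at death at all.
LEAD_TIME_YEARS = 2


def mean(values):
    return sum(values) / len(values)


# Survival measured from diagnosis, without and with screening.
survival_unscreened = mean([p["dies_at"] - p["symptomatic_at"] for p in patients])
survival_screened = mean(
    [p["dies_at"] - (p["symptomatic_at"] - LEAD_TIME_YEARS) for p in patients]
)

# The outcome that actually matters is identical in both scenarios.
mean_age_at_death = mean([p["dies_at"] for p in patients])

print(survival_unscreened, survival_screened, mean_age_at_death)
```

Survival after diagnosis ‘improves’ by exactly the lead time even though nobody lives a day longer, which is why comparing mortality between entire screened and unscreened groups is the safer approach.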

I expected to write more about the book in this post than I have so far and perhaps include a few more quotes, but my computer broke down while I was writing this post yesterday so this is what you get. However as already mentioned this is a great book, and if you think you might like it based on the observations included in this post you should definitely read it.

## Introduction to Meta Analysis (III)

(xkcd).

…

This will be my last post about the book. Below I have included some observations from the last 100 pages.

…

“A central theme in this volume is the fact that we usually prefer to work with effect sizes, rather than p-values. […] While we would argue that researchers should shift their focus to effect sizes even when working entirely with primary studies, the shift is *absolutely critical* when our goal is to synthesize data from multiple studies. A narrative reviewer who works with p-values (or with reports that were based on p-values) and uses these as the basis for a synthesis, is facing an impossible task. Where people tend to misinterpret a single p-value, the problem is much worse when they need to compare a series of p-values. […] the p-value is often misinterpreted. Because researchers *care about* the effect size, they tend to take whatever information they have and press it into service as an indicator of effect size. A statistically significant p-value is assumed to reflect a clinically important effect, and a nonsignificant p-value is assumed to reflect a trivial (or zero) effect. However, these interpretations are not necessarily correct. […] The narrative review typically works with p-values (or with conclusions that are based on p-values), and therefore lends itself to […] mistakes. p-values that differ are assumed to reflect different effect sizes but may not […], p-values that are the same are assumed to reflect similar effect sizes but may not […], and a more significant p-value is assumed to reflect a larger effect size when it may actually be based on a smaller effect size […]. By contrast, the meta-analysis works with effect sizes. As such it not only focuses on the question of interest (what is the size of the effect) but allows us to compare the effect size from study to study.”

“To compute the summary effect in a meta-analysis we compute an effect size for each study and then combine these effect sizes, rather than pooling the data directly. […] This approach allows us to study the dispersion of effects before proceeding to the summary effect. For a random-effects model this approach also allows us to incorporate the between-studies dispersion into the weights. There is one additional reason for using this approach […]. The reason is to ensure that each effect size is based on the comparison of a group with its own control group, and thus avoid a problem known as Simpson’s paradox. In some cases, particularly when we are working with observational studies, this is a critically important feature. […] The term paradox refers to the fact that one group can do better in every one of the included studies, but still do worse when the raw data are pooled. The problem is not limited to studies that use proportions, but can exist also in studies that use means or other indices. The problem exists only when the base rate (or mean) varies from study to study and the proportion of participants from each group varies as well. For this reason, the problem is generally limited to observational studies, although it can exist in randomized trials when allocation ratios vary from study to study.” [*See the wiki article for more*]
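Here is a small numerical illustration of the paradox described in the quote. The two studies and their counts are invented so that the base rate and the proportion of treated participants both vary across studies; they are not taken from the book.

```python
# Hypothetical data: (number of successes, number of participants) per arm.
studies = {
    "study 1": {"treated": (8, 10), "control": (70, 100)},
    "study 2": {"treated": (40, 100), "control": (3, 10)},
}


def rate(events, n):
    return events / n


# Within each study the treated group is compared with its own control group.
per_study_diff = {
    name: rate(*arms["treated"]) - rate(*arms["control"])
    for name, arms in studies.items()
}


def pooled(arm):
    """Naively pool the raw counts across studies for one arm."""
    events = sum(arms[arm][0] for arms in studies.values())
    n = sum(arms[arm][1] for arms in studies.values())
    return events, n


pooled_diff = rate(*pooled("treated")) - rate(*pooled("control"))

print(per_study_diff)
print(round(pooled_diff, 3))
```

The treated group does better in both studies, yet does worse once the raw data are pooled, because most treated participants come from the study with the low base rate.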

“When studies are addressing the same outcome, measured in the same way, using the same approach to analysis, but presenting results in different ways, then the only obstacles to meta-analysis are practical. If sufficient information is available to estimate the effect size of interest, then a meta-analysis is possible. […]

When studies are addressing the same outcome, measured in the same way, but using different approaches to analysis, then the possibility of a meta-analysis depends on both statistical and practical considerations. One important point is that all studies in a meta-analysis must use essentially the same index of treatment effect. For example, we cannot combine a risk difference with a risk ratio. Rather, we would need to use the summary data to compute the same index for all studies.

There are some indices that are similar, if not exactly the same, and judgments are required as to whether it is acceptable to combine them. One example is odds ratios and risk ratios. When the event is rare, then these are approximately equal and can readily be combined. As the event gets more common the two diverge and should not be combined. Other indices that are similar to risk ratios are hazard ratios and rate ratios. Some people decide these are similar enough to combine; others do not. The judgment of the meta-analyst in the context of the aims of the meta-analysis will be required to make such decisions on a case by case basis.

When studies are addressing the same outcome measured in different ways, or different outcomes altogether, then the suitability of a meta-analysis depends mainly on substantive considerations. The researcher will have to decide whether a combined analysis would have a meaningful interpretation. […] There is a useful class of indices that are, perhaps surprisingly, combinable under some simple transformations. In particular, formulas are available to convert standardized mean differences, odds ratios and correlations to a common metric [*I should note that the book covers these data transformations, but I decided early on not to talk about that kind of stuff in my posts because it’s highly technical and difficult to blog*] […] These kinds of conversions require some assumptions about the underlying nature of the data, and violations of these assumptions can have an impact on the validity of the process. […] A report should state the computational model used in the analysis and explain why this model was selected. A common mistake is to use the fixed-effect model on the basis that there is no evidence of heterogeneity. As [already] explained […], the decision to use one model or the other should depend on the nature of the studies, and not on the significance of this test [because the test will often have low power anyway]. […] The report of a meta-analysis should generally include a forest plot.”
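Two of the points in the quote above lend themselves to a quick numerical check. The sketch first shows that odds ratios and risk ratios are close when the event is rare and diverge when it is common, and then converts a log odds ratio to a standardized mean difference using the commonly cited conversion d = ln(OR) × √3 / π. That conversion assumes an underlying logistic distribution; I’m stating it as my understanding of the standard formula rather than quoting the book, and the risks used are made up.

```python
import math


def risk_ratio(p_treated, p_control):
    return p_treated / p_control


def odds_ratio(p_treated, p_control):
    odds_treated = p_treated / (1.0 - p_treated)
    odds_control = p_control / (1.0 - p_control)
    return odds_treated / odds_control


def log_odds_ratio_to_d(log_or):
    """Convert a log odds ratio to a standardized mean difference.

    Assumes the usual logistic-distribution conversion d = ln(OR) * sqrt(3) / pi.
    """
    return log_or * math.sqrt(3.0) / math.pi


# Hypothetical risks (treated, control): a rare event and a common event.
rare = (0.01, 0.02)
common = (0.30, 0.60)

rr_rare, or_rare = risk_ratio(*rare), odds_ratio(*rare)
rr_common, or_common = risk_ratio(*common), odds_ratio(*common)

d_common = log_odds_ratio_to_d(math.log(or_common))

print(round(rr_rare, 3), round(or_rare, 3))
print(round(rr_common, 3), round(or_common, 3))
print(round(d_common, 3))
```

Both scenarios have the same risk ratio of 0.5, but only in the rare-event case is the odds ratio a reasonable stand-in for it.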

“The issues addressed by a sensitivity analysis for a systematic review are similar to those that might be addressed by a sensitivity analysis for a primary study. That is, the focus is on the extent to which the results are (or are not) robust to assumptions and decisions that were made when carrying out the synthesis. The kinds of issues that need to be included in a sensitivity analysis will vary from one synthesis to the next. […] One kind of sensitivity analysis is concerned with the impact of decisions that lead to different data being used in the analysis. A common example of sensitivity analysis is to ask how results might have changed if different study inclusion rules had been used. […] Another kind of sensitivity analysis is concerned with the impact of the statistical methods used […] For example one might ask whether the conclusions would have been different if a different effect size measure had been used […] Alternatively, one might ask whether the conclusions would be the same if fixed-effect versus random-effects methods had been used. […] Yet another kind of sensitivity analysis is concerned with how we addressed missing data […] A very important form of missing data is the missing data on effect sizes that may result from incomplete reporting or selective reporting of statistical results within studies. When data are selectively reported in a way that is related to the magnitude of the effect size (e.g., when results are only reported when they are statistically significant), such missing data can have biasing effects similar to publication bias on entire studies. In either case, we need to ask how the results would have changed if we had dealt with missing data in another way.”

“A cumulative meta-analysis is a meta-analysis that is performed first with one study, then with two studies, and so on, until all relevant studies have been included in the analysis. As such, a cumulative analysis *is not a different analytic method* than a standard analysis, but simply *a mechanism for displaying a series of separate analyses* in one table or plot. When the series are sorted into a sequence based on some factor, the display shows how our estimate of the effect size (and its precision) shifts as a function of this factor. When the studies are sorted chronologically, the display shows how the evidence accumulated, and how the conclusions may have shifted, over a period of time.”

“While cumulative analyses are most often used to display the pattern of the evidence over time, the same technique can be used for other purposes as well. Rather than sort the data chronologically, we can sort it by any variable, and then display the pattern of effect sizes. For example, assume that we have 100 studies that looked at the impact of homeopathic medicines, and we think that the effect is related to the quality of the blinding process. We anticipate that studies with complete blinding will show no effect, those with lower quality blinding will show a minor effect, those that blind only some people will show a larger effect, and so on. We could sort the studies based on the quality of the blinding (from high to low), and then perform a cumulative analysis. […] Similarly, we could use cumulative analyses to display the possible impact of publication bias. […] large studies are assumed to be unbiased, but the smaller studies may tend to over-estimate the effect size. We could perform a cumulative analysis, entering the larger studies at the top and adding the smaller studies at the bottom. If the effect was initially small when the large (nonbiased) studies were included, and then increased as the smaller studies were added, we would indeed be concerned that the effect size was related to sample size. A benefit of the cumulative analysis is that it displays not only *if* there is a shift in effect size, but also *the* *magnitude* of the shift. […] It is important to recognize that cumulative meta-analysis is a mechanism for display, rather than analysis. […] These kinds of displays are compelling and can serve an important function. However, if our goal is actually to examine the relationship between a factor and effect size, then the appropriate analysis is a meta-regression”
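A minimal sketch of the publication-bias variant of a cumulative analysis: studies are entered from largest to smallest and the running summary effect is recomputed each time. I’m assuming a fixed-effect, inverse-variance summary to keep the code short (a random-effects version would work the same way in principle), and the four studies are invented.

```python
import math

# Hypothetical studies as (label, effect size, variance),
# sorted from the largest (most precise) study to the smallest.
studies = [
    ("A", 0.10, 0.01),
    ("B", 0.15, 0.02),
    ("C", 0.40, 0.08),
    ("D", 0.60, 0.15),
]


def cumulative_fixed_effect(studies):
    """Running inverse-variance summary after adding each study in order."""
    rows = []
    sum_w = 0.0
    sum_wy = 0.0
    for label, effect, variance in studies:
        weight = 1.0 / variance
        sum_w += weight
        sum_wy += weight * effect
        summary = sum_wy / sum_w
        std_error = math.sqrt(1.0 / sum_w)
        rows.append((label, summary, std_error))
    return rows


rows = cumulative_fixed_effect(studies)
for label, summary, std_error in rows:
    print(label, round(summary, 3), round(std_error, 3))
```

In this made-up example the summary effect creeps upwards as the smaller studies are added, which is the pattern that would make you worry about a relationship between sample size and effect size.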

“John C. Bailar, in an editorial for the New England Journal of Medicine (Bailar, 1997), [wrote] that mistakes […] are common in meta-analysis. He argues that a meta-analysis is inherently so complicated that mistakes by the persons performing the analysis are all but inevitable. He also argues that journal editors are unlikely to uncover all of these mistakes. […] The specific points made by Bailar about problems with meta-analysis are entirely reasonable. He is correct that many meta-analyses contain errors, some of them important ones. His list of potential (and common) problems can serve as a bullet list of mistakes to avoid when performing a meta-analysis. However, the mistakes cited by Bailar are flaws in the application of the method, rather than problems with the method itself. Many primary studies suffer from flaws in the design, analyses, and conclusions. In fact, some serious kinds of problems are endemic in the literature. The response of the research community is to locate these flaws, consider their impact for the study in question, and (hopefully) take steps to avoid similar mistakes in the future. In the case of meta-analysis, as in the case of primary studies, we cannot condemn a method because some people have used that method improperly. […] In his editorial Bailar concludes that, until such time as the quality of meta-analyses is improved, he would prefer to work with the traditional narrative reviews […] We disagree with the conclusion that narrative reviews are preferable to systematic reviews, and that meta-analyses should be avoided. The narrative review suffers from every one of the problems cited for the systematic review. The only difference is that, in the narrative review, these problems are less obvious. […] the key advantage of the systematic approach of a meta-analysis is that all steps are clearly described so that the process is transparent.”

## Open Thread

It’s been a long time since I had one of these. Questions? Comments? Random observations?

I hate posting posts devoid of content, so here’s some random stuff:

i.

If you think the stuff above is all fun and games I should note that the topic of chirality, which is one of the things talked about in the lecture above, was actually covered in some detail in Gale’s book, which is hardly a book that spends a great deal of time talking about esoteric mathematical concepts. On a related note, the main reason why I have not blogged that book is incidentally that I lost all notes and highlights I’d made in the first 200 pages of the book when my computer broke down, and I just can’t face reading that book again simply in order to blog it. It’s a good book, with interesting stuff, and I may decide to blog it later, but I don’t feel like doing it at the moment; without highlights and notes it’s a real pain to blog a book, and right now it’s just not worth it to reread the book. Rereading books can be fun – I’ve incidentally been rereading *Darwin* lately and I may decide to blog this book soon; I imagine I might also choose to reread some of Asimov’s books before long – but it’s not much fun if you’re finding yourself having to do it simply because the computer deleted your work.

…

ii. Beyond Power Calculations: Assessing Type S (Sign) and Type M (Magnitude) Errors.

Here’s the abstract:

“Statistical power analysis provides the conventional approach to assess error rates when designing a research study. However, power analysis is flawed in that a narrow emphasis on statistical significance is placed as the primary focus of study design. In noisy, small-sample settings, statistically significant results can often be misleading. To help researchers address this problem in the context of their own studies, we recommend design calculations in which (a) the probability of an estimate being in the wrong direction (*Type S [sign] error*) and (b) the factor by which the magnitude of an effect might be overestimated (*Type M [magnitude] error or exaggeration ratio*) are estimated. We illustrate with examples from recent published research and discuss the largest challenge in a design calculation: coming up with reasonable estimates of plausible effect sizes based on external information.”

If a study has low power, you can get into a lot of trouble. Some problems are well known, others probably aren’t. A bit more from the paper:

“design calculations can reveal three problems:

1. Most obvious, a study with low power is unlikely to “succeed” in the sense of yielding a statistically significant result.

2. It is quite possible for a result to be significant at the 5% level — with a 95% confidence interval that entirely excludes zero — and for there to be a high chance, sometimes 40% or more, that this interval is on the wrong side of zero. Even sophisticated users of statistics can be unaware of this point — that the probability of a Type S error is not the same as the p value or significance level.[3]

3. Using statistical significance as a screener can lead researchers to drastically overestimate the magnitude of an effect (Button et al., 2013).

Design analysis can provide a clue about the importance of these problems in any particular case.”

“Statistics textbooks commonly give the advice that statistical significance is not the same as practical significance, often with examples in which an effect is clearly demonstrated but is very small […]. In many studies in psychology and medicine, however, the problem is the opposite: an estimate that is statistically significant but with such a large uncertainty that it provides essentially no information about the phenomenon of interest. […] There is a range of evidence to demonstrate that it remains the case that too many small studies are done and preferentially published when “significant.” We suggest that one reason for the continuing lack of real movement on this problem is the historic focus on power as a lever for ensuring statistical significance, with inadequate attention being paid to the difficulties of interpreting statistical significance in underpowered studies. Because insufficient attention has been paid to these issues, we believe that too many small studies are done and preferentially published when “significant.” There is a common misconception that if you happen to obtain statistical significance with low power, then you have achieved a particularly impressive feat, obtaining scientific success under difficult conditions.

However, that is incorrect if the goal is scientific understanding rather than (say) publication in a top journal. In fact, statistically significant results in a noisy setting are highly likely to be in the wrong direction and invariably overestimate the absolute values of any actual effect sizes, often by a substantial factor.”
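To make the Type S and Type M ideas a bit more concrete, here is a minimal Python sketch of the kind of design calculation the paper recommends. It is my own illustration, not the authors' code: the function name `design_analysis` is hypothetical, the estimate is assumed to be normally distributed around a positive true effect with a known standard error, and the exaggeration ratio is approximated by simulation.

```python
import random
from statistics import NormalDist


def design_analysis(true_effect, se, alpha=0.05, n_sims=20_000, seed=1):
    """Power, Type S rate and exaggeration ratio (Type M) for one estimate.

    Assumes estimate ~ Normal(true_effect, se) and true_effect > 0.
    """
    norm = NormalDist()
    z_crit = norm.inv_cdf(1 - alpha / 2)
    shift = true_effect / se

    # Probability of a significant result in the right / wrong direction.
    p_right = 1 - norm.cdf(z_crit - shift)
    p_wrong = norm.cdf(-z_crit - shift)
    power = p_right + p_wrong

    # Type S: share of significant results that have the wrong sign.
    type_s = p_wrong / power

    # Type M: average |estimate| among significant results, relative to truth.
    rng = random.Random(seed)
    draws = (rng.gauss(true_effect, se) for _ in range(n_sims))
    significant = [abs(est) for est in draws if abs(est) > z_crit * se]
    exaggeration = (sum(significant) / len(significant)) / true_effect

    return power, type_s, exaggeration


power, type_s, exaggeration = design_analysis(true_effect=0.5, se=1.0)
print(f"power={power:.3f}  type_s={type_s:.3f}  exaggeration={exaggeration:.1f}")
```

With a true effect that is small relative to the standard error, any estimate that clears the significance threshold has to be several times larger than the true effect, which is the point the quote makes about overestimation.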

…

iii. I’m sure most people who might be interested in following the match are already well aware that Anand and Carlsen are currently competing for the world chess championship, and I’m not going to talk about that match here. However, I do want to mention to people interested in improving their chess that I recently came across this site, and that I quite like it. It only deals with endgames, but endgames are really important. If you don’t know much about endgames you may find the videos available here, here and here to be helpful.

…

iv. A link: Cross Validated: “Cross Validated is a question and answer site for people interested in statistics, machine learning, data analysis, data mining, and data visualization.”

A friend recently told me about this resource. I knew about the existence of StackExchange, but I haven’t really spent much time there. These days I mostly stick to books and a few sites I already know about; I rarely look for new interesting stuff online. This also means you should not automatically assume I surely already know about X when you’re considering whether to tell me about X in an Open Thread.

## Introduction to Meta Analysis (II)

You can read my first post about the book here. Some parts of the book are fairly technical, so I decided in the post below to skip some chapters in my coverage, simply because I could see no good way to cover the stuff on a wordpress blog (which as already mentioned many times is not ideal for math coverage) without spending a lot more time on that stuff than I wanted to. If you’re a new reader and/or you don’t know what a meta-analysis is, I highly recommend you read my first post about the book before moving on to the coverage below (and/or you can watch this brief video on the topic).

Below I have added some more quotes and observations from the book.

…

“In primary studies we use regression, or multiple regression, to assess the relationship between one or more covariates (moderators) and a dependent variable. Essentially the same approach can be used with meta-analysis, except that the covariates are at the level of the study rather than the level of the subject, and the dependent variable is the effect size in the studies rather than subject scores. We use the term *meta-regression* to refer to these procedures when they are used in a meta-analysis.

The differences that we need to address as we move from primary studies to meta-analysis for regression are similar to those we needed to address as we moved from primary studies to meta-analysis for subgroup analyses. These include the need to assign a weight to each study and the need to select the appropriate model (fixed versus random effects). Also, as was true for subgroup analyses, the *R*² index, which is used to quantify the proportion of variance explained by the covariates, must be modified for use in meta-analysis.

With these modifications, however, the full arsenal of procedures that fall under the heading of multiple regression becomes available to the meta-analyst. […] As is true in primary studies, where we need an appropriately large ratio of *subjects* to covariates in order for the analysis to be meaningful, in meta-analysis we need an appropriately large ratio of *studies* to covariates. Therefore, the use of meta-regression, especially with multiple covariates, is not a recommended option when the number of studies is small.”
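A minimal sketch of what a meta-regression with a single study-level covariate might look like in Python. This is my own illustration rather than anything taken from the book: the function name `meta_regression` is hypothetical, and I assume a fixed-effect model in which each study is weighted by the inverse of its (known) within-study variance.

```python
def meta_regression(effects, variances, covariate):
    """Weighted least squares of study effect sizes on one study-level covariate.

    Weights are inverse within-study variances (fixed-effect assumption).
    Returns (intercept, slope, standard error of the slope).
    """
    weights = [1.0 / v for v in variances]
    w_sum = sum(weights)

    # Weighted means of the covariate and of the effect sizes.
    x_bar = sum(w * x for w, x in zip(weights, covariate)) / w_sum
    y_bar = sum(w * y for w, y in zip(weights, effects)) / w_sum

    sxx = sum(w * (x - x_bar) ** 2 for w, x in zip(weights, covariate))
    sxy = sum(
        w * (x - x_bar) * (y - y_bar)
        for w, x, y in zip(weights, covariate, effects)
    )

    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    slope_se = (1.0 / sxx) ** 0.5  # variances are treated as known
    return intercept, slope, slope_se


# Four hypothetical studies: the effect size grows with the covariate (e.g. dose).
intercept, slope, slope_se = meta_regression(
    effects=[0.1, 0.2, 0.3, 0.4],
    variances=[0.01, 0.01, 0.01, 0.01],
    covariate=[1, 2, 3, 4],
)
print(intercept, slope, slope_se)
```

The unit of analysis here is the study, not the subject, which is why the ratio of studies to covariates matters so much.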

“Power depends on the size of the effect and the precision with which we measure the effect. For subgroup analysis this means that power will increase as the difference between (or among) subgroup means increases, and/or the standard error within subgroups decreases. For meta-regression this means that power will increase as the magnitude of the relationship between the covariate and effect size increases, and/or the precision of the estimate increases. In both cases, a key factor driving the precision of the estimate will be the total number of individual subjects across all studies and (for random effects) the total number of studies. […] While there is a general perception that power for testing the main effect is consistently high in meta-analysis, this perception is not correct […] and certainly does not extend to tests of subgroup differences or to meta-regression. […] Statistical power for detecting a difference among subgroups, or for detecting the relationship between a covariate and effect size, is often low [and] failure to obtain a statistically significant difference among subgroups should never be interpreted as evidence that the effect is the same across subgroups. Similarly, failure to obtain a statistically significant effect for a covariate should never be interpreted as evidence that there is no relationship between the covariate and the effect size.”

“When we have effect sizes for more than one outcome (or time-point) within a study, based on the same participants, the information for the different effects is not independent and we need to take account of this in the analysis. […] When we are working with different outcomes at a single point in time, the plausible range of correlations [between outcomes] will depend on the similarity of the outcomes. When we are working with the same outcome at multiple time-points, the plausible range of correlations will depend on such factors as the time elapsed between assessments and the stability of the relative scores over this time period. […] Researchers who do not know the correlation between outcomes sometimes fall back on either of two ‘default’ positions. Some will include both [outcome variables] in the analysis and treat them as independent. Others would use the average of the [variances of the two outcomes]. It is instructive, therefore, to consider the practical impact of these choices. […] In effect, […] researchers who adopt either of these positions as a way of bypassing the need to specify a correlation, are actually adopting a correlation, albeit implicitly. And, the correlation that they adopt falls at either extreme of the possible range (either zero or 1.0). The first approach is almost certain to underestimate the variance and overestimate the precision. The second approach is almost certain to overestimate the variance and underestimate the precision.” [*A good example of a more general point in the context of statistical/mathematical modelling: Sometimes it’s really hard not to make assumptions, and trying to get around such problems by ‘ignoring them’ may sometimes lead to the implicit adoption of assumptions which are highly questionable as well.*]
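The point about implicit correlations is easy to see numerically. The sketch below uses the standard formula for the variance of the mean of two correlated estimates; the function name `composite_variance` and the example numbers are mine, and I am assuming that the two outcomes are combined into a simple average.

```python
def composite_variance(v1, v2, r):
    """Variance of the mean of two effect sizes measured on the same participants.

    v1, v2: variances of the two effect sizes; r: correlation between them.
    """
    return 0.25 * (v1 + v2 + 2.0 * r * (v1 * v2) ** 0.5)


v = 0.04  # hypothetical variance of each outcome's effect size
for r in (0.0, 0.5, 1.0):
    print(r, composite_variance(v, v, r))
```

With equal variances, a correlation of zero gives half the variance of a single outcome (what you implicitly assume when treating the outcomes as independent), while a correlation of 1.0 gives the full single-outcome variance (what you implicitly assume when averaging the variances). Any realistic correlation falls somewhere in between.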

“Vote counting is the name used to describe the idea of seeing how many studies yielded a significant result, and how many did not. […] narrative reviewers often resort to [vote counting] […] In some cases this process has been formalized, such that one actually counts the number of significant and non-significant p-values and picks the winner. In some variants, the reviewer would look for a clear majority rather than a simple majority. […] One might think that summarizing *p*-values through a vote-counting procedure would yield more accurate decision than any one of the single significance tests being summarized. This is not generally the case, however. In fact, Hedges and Olkin (1980) showed that the power of vote-counting considered as a statistical decision procedure can not only be lower than that of the studies on which it is based, the power of vote counting can tend toward zero as the number of studies increases. […] the idea of vote counting is fundamentally flawed and the variants on this process are equally flawed (and perhaps even more dangerous, since the basic flaw is less obvious when hidden behind a more complicated algorithm or is one step removed from the *p*-value). […] The logic of vote counting says that a significant finding is evidence that an effect exists, while a non-significant finding is evidence that an effect is absent. While the first statement is true, the second is not. While a nonsignificant finding *could* be due to the fact that the true effect is nil, it can also be due simply to low statistical power. Put simply, the *p*-value reported for any study is a function of the observed effect size and the sample size. Even if the observed effect is substantial, the *p*-value will not be significant unless the sample size is adequate. In other words, as most of us learned in our first statistics course, *the absence of a statistically significant effect is not evidence that an effect is absent*.”
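A small sketch of why the power of vote counting can tend toward zero. The setup is my own simplification, not the Hedges and Olkin derivation: I assume independent studies that all have the same power, and a vote-counting rule that declares an effect only if a strict majority of the studies is significant. The function name `vote_count_power` is hypothetical.

```python
from math import comb


def vote_count_power(n_studies, study_power):
    """Probability that a strict majority of independent studies is significant."""
    needed = n_studies // 2 + 1
    return sum(
        comb(n_studies, k) * study_power**k * (1 - study_power) ** (n_studies - k)
        for k in range(needed, n_studies + 1)
    )


# A real effect exists, but each study only has 40% power to detect it.
for n in (1, 11, 51, 101):
    print(n, round(vote_count_power(n, 0.4), 4))
```

Whenever the individual studies have power below 50%, adding more studies makes it less likely, not more, that the majority vote comes out in favour of the (real) effect.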

“While the term vote counting is associated with narrative reviews it can also be applied to the single study, where a significant p-value is taken as evidence that an effect exists, and a nonsignificant p-value is taken as evidence that an effect does not exist. Numerous surveys in a wide variety of substantive fields have repeatedly documented the ubiquitous nature of this mistake. […] When we are working with a single study and we have a nonsignificant result we don’t have any way of knowing whether or not the effect is real. The nonsignificant *p*-value could reflect either the fact that the true effect is nil *or* the fact that our study had low power. While we caution against accepting the former (that the true effect is nil) we cannot rule it out. By contrast, when we use meta-analysis to synthesize the data from a series of studies we can often identify the true effect. And in many cases (for example if the true effect is substantial and is consistent across studies) we can assert that the nonsignificant *p*-value in the separate studies was due to low power rather than the absence of an effect. […] vote counting is never a valid approach.”

“The fact that a meta-analysis will often [but not always] have high power is important because […] primary studies often suffer from low power. While researchers are encouraged to design studies with power of at least 80%, this goal is often elusive. Many studies in medicine, psychology, education and an array of other fields have power substantially lower than 80% to detect large effects, and substantially lower than 50% to detect smaller effects that are still important enough to be of theoretical or practical importance. By contrast, a meta-analysis based on multiple studies will have a higher total sample size than any of the separate studies and the increase in power can be substantial. The problem of low power in the primary studies is especially acute when looking for adverse events. The problem here is that studies to test new drugs are *powered* to find a treatment effect for the drug, and do not have adequate power to detect side effects (which have a much lower event rate, and therefore lower power).”

“Assuming a nontrivial effect size, power is primarily a function of the precision […] When we are working with a fixed-effect analysis, precision for the summary effect is always higher than it is for any of the included studies. Under the fixed-effect analysis precision is largely determined by the total sample size […], and it follows the total sample size will be higher across studies than within studies. […] in a random-effects meta-analysis, power depends on within-study error and between-studies variation [*…if you don’t recall the difference between fixed-effects models and random effects models, see the previous post*]. If the effect sizes are reasonably consistent from study to study, and/or if the analysis includes a substantial number of studies, then the second of these will tend to be small, and power will be driven by the cumulative sample size. In this case the meta-analysis will tend to have higher power than any of the included studies. […] However, if the effect size varies substantially from study to study, and the analysis includes only a few studies, then this second aspect will limit the potential power of the meta-analysis. In this case, power could be limited to some low value even if the analysis includes tens of thousands of persons. […] The Cochrane Database of Systematic Reviews is a database of systematic reviews, primarily of randomized trials, for medical interventions in all areas of healthcare, and currently includes over 3000 reviews. In this database, the median number of trials included in a review is six. When a review includes only six studies, power to detect even a moderately large effect, let alone a small one, can be well under 80%. While the median number of studies in a review differs by the field of research, in almost any field we do find some reviews based on a small number of studies, and so we cannot simply assume that power is high. 
[…] Even when power to test the main effect is high, many meta-analyses are not concerned with the main effect at all, but are performed solely to assess the impact of covariates (or moderator variables). […] The question to be addressed is not whether the treatment works, but whether one variant of the treatment is more effective than another variant. The test of a moderator variable in a meta-analysis is akin to the test of an interaction in a primary study, and both suffer from the same factors that tend to decrease power. First, the effect size is actually the difference between the two effect sizes and so is almost invariably smaller than the main effect size. Second, the sample size within groups is (by definition) smaller than the total sample size. Therefore, power for testing the moderator will often be very low (Hedges and Pigott, 2004).”
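The contrast between fixed-effect and random-effects power can be sketched in a few lines of Python. This is a rough illustration under my own assumptions: a two-sided z-test on the summary effect, known within-study variances, and a known between-studies variance `tau2`. The function names and all the numbers are hypothetical.

```python
from statistics import NormalDist


def z_test_power(effect, variance, alpha=0.05):
    """Power of a two-sided z-test for an estimate with the given variance."""
    norm = NormalDist()
    z_crit = norm.inv_cdf(1 - alpha / 2)
    shift = effect / variance**0.5
    return (1 - norm.cdf(z_crit - shift)) + norm.cdf(-z_crit - shift)


def summary_variance(within_variances, tau2=0.0):
    """Variance of the inverse-variance weighted summary effect.

    tau2 = 0 corresponds to the fixed-effect model.
    """
    return 1.0 / sum(1.0 / (v + tau2) for v in within_variances)


effect = 0.2
six_studies = [0.04] * 6    # six small studies
six_huge_studies = [1e-6] * 6  # six enormous studies

print("single study:  ", z_test_power(effect, 0.04))
print("fixed effect:  ", z_test_power(effect, summary_variance(six_studies)))
print("random effects:", z_test_power(effect, summary_variance(six_studies, tau2=0.04)))
print("huge studies:  ", z_test_power(effect, summary_variance(six_huge_studies, tau2=0.04)))
```

In the last case the within-study error is practically zero, yet the between-studies variation combined with the small number of studies still keeps power below the conventional 80% target.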

“It is important to understand that the fixed-effect model and random-effects model address different hypotheses, and that they use different estimates of the variance because they make different assumptions about the nature of the distribution of effects across studies […]. Researchers sometimes remark that power is lower under the random-effects model than for the fixed-effect model. While this statement may be true, it misses the larger point: it is not meaningful to compare power for fixed- and random-effects analyses since the two values of power are not addressing the same question. […] Many meta-analyses include a test of homogeneity, which asks whether or not the between-studies dispersion is more than would be expected by chance. The test of significance is […] based on *Q*, the sum of the squared deviations of each study’s effect size estimate (*Yi*) from the summary effect (*M*), with each deviation weighted by the inverse of that study’s variance. […] Power for this test depends on three factors. The larger the ratio of between-studies to within-studies variance, the larger the number of studies, and the more liberal the criterion for significance, the higher the power.”
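The *Q* statistic described in the quote is straightforward to compute. The sketch below follows the verbal definition (squared deviations from the summary effect, weighted by inverse variance); the function name `q_statistic` and the example data are mine, and I take *M* to be the inverse-variance weighted mean.

```python
def q_statistic(effects, variances):
    """Q: weighted sum of squared deviations of study effects from the summary effect.

    Returns (Q, degrees of freedom). Under homogeneity Q is compared against a
    chi-squared distribution with k - 1 degrees of freedom.
    """
    weights = [1.0 / v for v in variances]
    summary = sum(w * y for w, y in zip(weights, effects)) / sum(weights)
    q = sum(w * (y - summary) ** 2 for w, y in zip(weights, effects))
    return q, len(effects) - 1


q, df = q_statistic(effects=[0.10, 0.30, 0.35, 0.65], variances=[0.03, 0.03, 0.05, 0.01])
print(q, df)
```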

“While a meta-analysis will yield a mathematically accurate synthesis of the studies included in the analysis, if these studies are a biased sample of all relevant studies, then the mean effect computed by the meta-analysis will reflect this bias. Several lines of evidence show that studies that report relatively high effect sizes are more likely to be published than studies that report lower effect sizes. Since published studies are more likely to find their way into a meta-analysis, any bias in the literature is likely to be reflected in the meta-analysis as well. This issue is generally known as publication bias. The problem of publication bias is not unique to systematic reviews. It affects the researcher who writes a narrative review and even the clinician who is searching a database for primary papers. […] Other factors that can lead to an upward bias in effect size and are included under the umbrella of publication bias are the following. Language bias (English-language databases and journals are more likely to be searched, which leads to an oversampling of statistically significant studies) […]; availability bias (selective inclusion of studies that are easily accessible to the researcher); cost bias (selective inclusion of studies that are available free or at low cost); familiarity bias (selective inclusion of studies only from one’s own discipline); duplication bias (studies with statistically significant results are more likely to be published more than once […]) and citation bias (whereby studies with statistically significant results are more likely to be cited by others and therefore easier to identify […]). 
[…] If persons performing a systematic review were able to locate studies that had been published in the grey literature (any literature produced in electronic or print format that is not controlled by commercial publishers, such as technical reports and similar sources), then the fact that the studies with higher effects are more likely to be published in the more mainstream publications would not be a problem for meta-analysis. In fact, though, this is not usually the case.

While a systematic review *should* include a thorough search for all relevant studies, the actual amount of grey/unpublished literature included, and the types, varies considerably across meta-analyses.”

“In sum, it is possible that the studies in a meta-analysis may overestimate the true effect size because they are based on a biased sample of the target population of studies. But how do we deal with this concern? The only true test for publication bias is to compare effects in the published studies formally with effects in the unpublished studies. This requires access to the unpublished studies, and if we had that we would no longer be concerned. Nevertheless, the best approach would be for the reviewer to perform a truly comprehensive search of the literature, in hopes of minimizing the bias. In fact, there is evidence that this approach is somewhat effective. Cochrane reviews tend to include more studies and to report a smaller effect size than similar reviews published in medical journals. Serious efforts to find unpublished, and difficult to find studies, typical of Cochrane reviews, may therefore reduce some of the effects of publication bias. Despite the increased resources that are needed to locate and retrieve data from sources such as dissertations, theses, conference papers, government and technical reports and the like, it is generally indefensible to conduct a synthesis that categorically excludes these types of research reports. Potential benefits and costs of grey literature searches must be balanced against each other.”

“Since we cannot be certain that we have avoided bias, researchers have developed methods intended to assess its potential impact on any given meta-analysis. These methods address the following questions:

* Is there evidence of any bias?
* Is it possible that the entire effect is an artifact of bias?
* How much of an impact might the bias have? […]

Methods developed to address publication bias require us to make many assumptions, including the assumption that the pattern of results is due to bias, and that this bias follows a certain model. […] In order to gauge the impact of publication bias we need a model that tells us which studies are likely to be missing. The model that is generally used […] makes the following assumptions: (a) Large studies are likely to be published regardless of statistical significance because these involve large commitments of time and resources. (b) Moderately sized studies are at risk for being lost, but with a moderate sample size even modest effects will be significant, and so only some studies are lost here. (c) Small studies are at greatest risk for being lost. Because of the small sample size, only the largest effects are likely to be significant, with the small and moderate effects likely to be unpublished.

The combined result of these three items is that we expect the bias to increase as the sample size goes down, and the methods described […] are all based on this model. […] [One problem is however that] when there is clear evidence of asymmetry, we cannot assume that this reflects publication bias. The effect size may be larger in small studies because we retrieved a biased sample of the smaller studies, but it is also possible that the effect size really is larger in smaller studies for entirely unrelated reasons. For example, the small studies may have been performed using patients who were quite ill, and therefore more likely to benefit from the drug (as is sometimes the case in early trials of a new compound). Or, the small studies may have been performed with better (or worse) quality control than the larger ones. Sterne et al. (2001) use the term *small-study effect* to describe a pattern where the effect is larger in small studies, and to highlight the fact that the mechanism for this effect is not known.”
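A crude simulation of the model above, to show why the expected bias grows as the sample size shrinks. This is a deliberately stark simplification of my own: I assume that a study is published only if it finds a significant positive effect, and I approximate the standard error of a standardized mean difference as sqrt(2/n) with n subjects per group. The function name `simulate_published_mean` is hypothetical.

```python
import random


def simulate_published_mean(true_effect, n_per_group, n_studies=5000, seed=0):
    """Mean effect among 'published' studies when only significant positive
    results are published (a stark simplification of the selection model)."""
    rng = random.Random(seed)
    se = (2.0 / n_per_group) ** 0.5  # rough SE of a standardized mean difference
    estimates = (rng.gauss(true_effect, se) for _ in range(n_studies))
    published = [est for est in estimates if est / se > 1.96]
    return sum(published) / len(published)


for n in (10, 50, 200, 1000):
    print(n, round(simulate_published_mean(true_effect=0.3, n_per_group=n), 3))
```

Large studies are significant almost regardless, so nothing is lost and the published mean stays close to the true effect; among small studies only the overestimates survive. Note that this only illustrates the selection model; as the quote points out, an observed small-study effect may have other causes.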

“It is almost always important to include an assessment of publication bias in relation to a meta-analysis. It will either assure the reviewer that the results are robust, or alert them that the results are suspect.”

## Introduction to Meta Analysis (I)

“Since meta-analysis is a relatively new field, many people, including those who actually use meta-analysis in their work, have not had the opportunity to learn about it systematically. We hope that this volume will provide a framework that allows them to understand the logic of meta-analysis, as well as how to apply and interpret meta-analytic procedures properly.

This book is aimed at researchers, clinicians, and statisticians. Our approach is primarily conceptual. The reader will be able to skip the formulas and still understand, for example, the differences between fixed-effect and random-effects analysis, and the mechanisms used to assess the dispersion in effects from study to study. However, for those with a statistical orientation, we include all the relevant formulas, along with worked examples. […] This volume is intended for readers from various substantive fields, including medicine, epidemiology, social science, business, ecology, and others. While we have included examples from many of these disciplines, the more important message is that meta-analytic methods that may have developed in any one of these fields have application to all of them.”

…

I’ve been reading this book and I like it – I’ve read about the topic before but I’ve been missing a textbook on it, and this one is quite good so far (I’ve read roughly half of it). Below I have added some observations from the first thirteen chapters of the book:

…

“Meta-analysis refers to the statistical synthesis of results from a series of studies. While the statistical procedures used in a meta-analysis can be applied to any set of data, the synthesis will be meaningful only if the studies have been collected systematically. This could be in the context of a systematic review, the process of systematically locating, appraising, and then synthesizing data from a large number of sources. Or, it could be in the context of synthesizing data from a select group of studies, such as those conducted by a pharmaceutical company to assess the efficacy of a new drug. If a treatment effect (or effect size) is consistent across the series of studies, these procedures enable us to report that the effect is robust across the kinds of populations sampled, and also to estimate the magnitude of the effect more precisely than we could with any of the studies alone. If the treatment effect varies across the series of studies, these procedures enable us to report on the range of effects, and may enable us to identify factors associated with the magnitude of the effect size.”

“For systematic reviews, a clear set of rules is used to search for studies, and then to determine which studies will be included in or excluded from the analysis. Since there is an element of subjectivity in setting these criteria, as well as in the conclusions drawn from the meta-analysis, we cannot say that the systematic review is entirely objective. However, because all of the decisions are specified clearly, the mechanisms are transparent. A key element in most systematic reviews is the statistical synthesis of the data, or the meta-analysis. Unlike the narrative review, where reviewers implicitly assign some level of importance to each study, in meta-analysis the weights assigned to each study are based on mathematical criteria that are specified in advance. While the reviewers and readers may still differ on the substantive meaning of the results (as they might for a primary study), the statistical analysis provides a transparent, objective, and replicable framework for this discussion. […] If the entire review is performed properly, so that the search strategy matches the research question, and yields a reasonably complete and unbiased collection of the relevant studies, then (providing that the included studies are themselves valid) the meta-analysis will also be addressing the intended question. On the other hand, if the search strategy is flawed in concept or execution, or if the studies are providing biased results, then problems exist in the review that the meta-analysis cannot correct.”

“Meta-analyses are conducted for a variety of reasons […] The purpose of the meta-analysis, or more generally, the purpose of any research synthesis has implications for when it should be performed, what model should be used to analyze the data, what sensitivity analyses should be undertaken, and how the results should be interpreted. Losing sight of the fact that meta-analysis is a tool with multiple applications causes confusion and leads to pointless discussions about what is the right way to perform a research synthesis, when there is no single right way. It all depends on the purpose of the synthesis, and the data that are available.”

“The effect size, a value which reflects the magnitude of the treatment effect or (more generally) the strength of a relationship between two variables, is the unit of currency in a meta-analysis. We compute the effect size for each study, and then work with the effect sizes to assess the consistency of the effect across studies and to compute a summary effect. […] The summary effect is nothing more than the weighted mean of the individual effects. However, the mechanism used to assign the weights (and therefore the meaning of the summary effect) depends on our assumptions about the distribution of effect sizes from which the studies were sampled. Under the fixed-effect model, we assume that all studies in the analysis share the same true effect size, and the summary effect is our estimate of this common effect size. Under the random-effects model, we assume that the true effect size varies from study to study, and the summary effect is our estimate of the mean of the distribution of effect sizes. […] A key theme in this volume is the importance of assessing the dispersion of effect sizes from study to study, and then taking this into account when interpreting the data. If the effect size is consistent, then we will usually focus on the summary effect, and note that this effect is robust across the domain of studies included in the analysis. If the effect size varies modestly, then we might still report the summary effect but note that the true effect in any given study could be somewhat lower or higher than this value. If the effect varies substantially from one study to the next, our attention will shift from the summary effect to the dispersion itself.”
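Here is a minimal sketch of the two summary effects as weighted means. The fixed-effect part follows directly from the description above; for the random-effects part I assume the DerSimonian-Laird method-of-moments estimate of the between-studies variance, which the quoted passage does not specify. The function name `summary_effects` and the data are hypothetical.

```python
def summary_effects(effects, variances):
    """Fixed-effect and random-effects summary effects as weighted means.

    The between-studies variance tau2 is estimated with the DerSimonian-Laird
    method of moments (an assumption of this sketch).
    Returns (fixed, random, tau2).
    """
    # Fixed effect: weight each study by the inverse of its within-study variance.
    w = [1.0 / v for v in variances]
    fixed = sum(wi * y for wi, y in zip(w, effects)) / sum(w)

    # Method-of-moments estimate of the between-studies variance.
    q = sum(wi * (y - fixed) ** 2 for wi, y in zip(w, effects))
    c = sum(w) - sum(wi**2 for wi in w) / sum(w)
    tau2 = max(0.0, (q - (len(effects) - 1)) / c)

    # Random effects: tau2 is added to every study's variance before weighting.
    w_re = [1.0 / (v + tau2) for v in variances]
    random_effects = sum(wi * y for wi, y in zip(w_re, effects)) / sum(w_re)
    return fixed, random_effects, tau2


fixed, random_effects, tau2 = summary_effects(effects=[0.1, 0.5], variances=[0.01, 0.04])
print(fixed, random_effects, tau2)
```

Because the same `tau2` is added to every study's variance, the random-effects weights are more even than the fixed-effect weights, so the summary moves from the estimate dominated by the most precise study toward the unweighted mean.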

“During the time period beginning in 1959 and ending in 1988 (a span of nearly 30 years) there were a total of 33 randomized trials performed to assess the ability of streptokinase to prevent death following a heart attack. […] The trials varied substantially in size. […] Of the 33 studies, six were statistically significant while the other 27 were not, leading to the perception that the studies yielded conflicting results. […] In 1992 Lau et al. published a meta-analysis that synthesized the results from the 33 studies. […] [They found that] the treatment reduces the risk of death by some 21%. And, this effect was reasonably consistent across all studies in the analysis. […] The narrative review has no mechanism for synthesizing the p-values from the different studies, and must deal with them as discrete pieces of data. In this example six of the studies were statistically significant while the other 27 were not, which led some to conclude that there was evidence against an effect, or that the results were inconsistent […] By contrast, the meta-analysis allows us to combine the effects and evaluate the statistical significance of the summary effect. The p-value for the summary effect [was] p=0.0000008. […] While one might assume that 27 studies failed to reach statistical significance because they reported small effects, it is clear […] that this is not the case. In fact, the treatment effect in many of these studies was actually larger than the treatment effect in the six studies that were statistically significant. Rather, the reason that 82% of the studies were not statistically significant is that these studies had small sample sizes and low statistical power.”

“the [narrative] review will often focus on the question of whether or not the body of evidence allows us to reject the null hypothesis. There is no good mechanism for discussing the magnitude of the effect. By contrast, the meta-analytic approaches discussed in this volume allow us to compute an estimate of the effect size for each study, and these effect sizes fall at the core of the analysis. This is important because the effect size is what we care about. If a clinician or patient needs to make a decision about whether or not to employ a treatment, they want to know if the treatment reduces the risk of death by 5% or 10% or 20%, and this is the information carried by the effect size. […] The p-value can tell us only that the effect is not zero, and to report simply that the effect is not zero is to miss the point. […] The narrative review has no good mechanism for assessing the consistency of effects. The narrative review starts with p-values, and because the p-value is driven by the size of a study as well as the effect in that study, the fact that one study reported a p-value of 0.001 and another reported a p-value of 0.50 does not mean that the effect was larger in the former. The p-value of 0.001 *could* reflect a large effect size but it could also reflect a moderate or small effect in a large study […] The p-value of 0.50 *could* reflect a small (or nil) effect size but could also reflect a large effect in a small study […] This point is often missed in narrative reviews. Often, researchers interpret a nonsignificant result to mean that there is no effect. If some studies are statistically significant while others are not, the reviewers see the results as conflicting. This problem runs through many fields of research. […] By contrast, meta-analysis completely changes the landscape. First, we work with effect sizes (not p-values) to determine whether or not the effect size is consistent across studies. 
Additionally, we apply methods based on statistical theory to allow that some (or all) of the observed dispersion is due to random sampling variation rather than differences in the true effect sizes. Then, we apply formulas to partition the variance into random error versus real variance, to quantify the true differences among studies, and to consider the implications of this variance.”
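
To make the point about p-values and study size concrete, here is a minimal sketch of my own (not from the book, and not the streptokinase data): five hypothetical studies that all report exactly the same log risk ratio but with different standard errors, pooled with simple inverse-variance weights. All numbers are made up, and the Wald-type p-values are an assumption chosen for simplicity.

```python
import math
from statistics import NormalDist

# Hypothetical studies: (log risk ratio, standard error). All numbers are made up.
studies = [(-0.25, 0.20), (-0.25, 0.18), (-0.25, 0.22), (-0.25, 0.10), (-0.25, 0.25)]

def two_sided_p(estimate, se):
    """Two-sided p-value from a normal (Wald) test of estimate == 0."""
    z = abs(estimate / se)
    return 2 * (1 - NormalDist().cdf(z))

# Per-study p-values: identical effect, different precision.
study_p = [two_sided_p(y, se) for y, se in studies]
n_significant = sum(p < 0.05 for p in study_p)

# Inverse-variance pooled estimate and its p-value.
weights = [1 / se**2 for _, se in studies]
pooled = sum(w * y for w, (y, _) in zip(weights, studies)) / sum(weights)
pooled_se = math.sqrt(1 / sum(weights))
pooled_p = two_sided_p(pooled, pooled_se)

print(n_significant, round(pooled, 3), pooled_p)
```

Only the most precise study clears the 0.05 threshold on its own, even though every study estimates the same effect; a reviewer counting significant results would call this 'conflicting', while the pooled estimate is unambiguous.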

“Consider […] the case where some studies report a difference in means, which is used to compute a standardized mean difference. Others report a difference in proportions which is used to compute an odds ratio. And others report a correlation. All the studies address the same broad question, and we want to include them in one meta-analysis. […] we are now dealing with different indices, and we need to convert them to a common index before we can proceed. The question of whether or not it is appropriate to combine effect sizes from studies that used different metrics must be considered on a case by case basis. The key issue is that it only makes sense to compute a summary effect from studies that we judge to be comparable in relevant ways. If we would be comfortable combining these studies if they had used the same metric, then the fact that they used different metrics should not be an impediment. […] When some studies use means, others use binary data, and others use correlational data, we can apply formulas to convert among effect sizes. […] When we convert between different measures we make certain assumptions about the nature of the underlying traits or effects. Even if these assumptions do not hold exactly, the decision to use these conversions is often better than the alternative, which is to simply omit the studies that happened to use an alternate metric. This would involve loss of information, and possibly the systematic loss of information, resulting in a biased sample of studies. A sensitivity analysis to compare the meta-analysis results with and without the converted studies would be important. […] Studies that used different measures may [however] differ from each other in substantive ways, and we need to consider this possibility when deciding if it makes sense to include the various studies in the same analysis.”
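
As a rough illustration of what 'applying formulas to convert among effect sizes' can look like, here is a small sketch using the two standard conversions I know of: log odds ratio to standardized mean difference, and correlation to standardized mean difference. The function names are mine. As far as I understand it, the first conversion assumes an underlying continuous trait with a logistic distribution and the second assumes a continuous variable that has been dichotomized, so treat this as a sketch under those assumptions.

```python
import math

def log_odds_ratio_to_d(log_or, var_log_or):
    """Convert a log odds ratio (and its variance) to a standardized mean difference d."""
    d = log_or * math.sqrt(3) / math.pi
    var_d = var_log_or * 3 / math.pi**2
    return d, var_d

def r_to_d(r, var_r):
    """Convert a correlation r (and its variance) to a standardized mean difference d."""
    d = 2 * r / math.sqrt(1 - r**2)
    var_d = 4 * var_r / (1 - r**2) ** 3
    return d, var_d

# Made-up inputs, purely for illustration.
d_from_or, var_from_or = log_odds_ratio_to_d(0.9069, 0.0676)
d_from_r, var_from_r = r_to_d(0.5, 0.0058)
print(d_from_or, var_from_or, d_from_r, var_from_r)
```

Once everything is on the d scale the studies can be pooled as usual, and the sensitivity analysis mentioned in the quote amounts to running the pooling with and without the converted studies.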

“The precision with which we estimate an effect size can be expressed as a standard error or confidence interval […] or as a variance […] The precision is driven primarily by the sample size, with larger studies yielding more precise estimates of the effect size. […] Other factors affecting precision include the study design, with matched groups yielding more precise estimates (as compared with independent groups) and clustered groups yielding less precise estimates. In addition to these general factors, there are unique factors that affect the precision for each effect size index. […] Studies that yield more precise estimates of the effect size carry more information and are assigned more weight in the meta-analysis.”

“Under the fixed-effect model we assume that all studies in the meta-analysis share a common (true) effect size. […] However, in many systematic reviews this assumption is implausible. When we decide to incorporate a group of studies in a meta-analysis, we assume that the studies have enough in common that it makes sense to synthesize the information, but there is generally no reason to assume that they are *identical* in the sense that the true effect size is *exactly the same* in all the studies. […] Because studies will differ in the mixes of participants and in the implementations of interventions, among other reasons, there may be *different effect sizes* underlying different studies. […] One way to address this variation across studies is to perform a *random-effects* meta-analysis. In a random-effects meta-analysis we usually assume that the true effects are normally distributed. […] Since our goal is to estimate the mean of the distribution, we need to take account of two sources of variance. First, there is within-study error in estimating the effect in each study. Second (even if we knew the true mean for each of our studies), there is variation in the true effects across studies. Study weights are assigned with the goal of minimizing both sources of variance.”

“Under the fixed-effect model we assume that the true effect size for all studies is identical, and the only reason the effect size varies between studies is sampling error (error in estimating the effect size). Therefore, when assigning weights to the different studies we can largely ignore the information in the smaller studies since we have better information about the same effect size in the larger studies. By contrast, under the random-effects model the goal is not to estimate one true effect, but to estimate the mean of a distribution of effects. Since each study provides information about a different effect size, we want to be sure that all these effect sizes are represented in the summary estimate. This means that we cannot discount a small study by giving it a very small weight (the way we would in a fixed-effect analysis). The estimate provided by that study may be imprecise, but it is information about an effect that no other study has estimated. By the same logic we cannot give too much weight to a very large study (the way we might in a fixed-effect analysis). […] Under the fixed-effect model there is a wide range of weights […] whereas under the random-effects model the weights fall in a relatively narrow range. […] the relative weights assigned under random effects will be *more balanced* than those assigned under fixed effects. As we move from fixed effect to random effects, extreme studies will lose influence if they are large, and will gain influence if they are small. […] Under the fixed-effect model the only source of uncertainty is the within-study (sampling or estimation) error. Under the random-effects model there is this same source of uncertainty plus an additional source (between-studies variance). 
It follows that the variance, standard error, and confidence interval for the summary effect will always be larger (or wider) under the random-effects model than under the fixed-effect model […] Under the fixed-effect model the null hypothesis being tested is that there is zero effect in *every study*. Under the random-effects model the null hypothesis being tested is that the *mean effect* is zero. Although some may treat these hypotheses as interchangeable, they are in fact different”
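
The difference in weighting is easy to see numerically. The sketch below is my own: the effect sizes and variances are made up, and I use the DerSimonian-Laird moment estimator for the between-studies variance, which is one common choice and not something the quoted passage specifies.

```python
import math

# Hypothetical effect estimates and within-study variances (made up).
effects = [0.10, 0.30, 0.35, 0.65, 0.45, 0.15]
variances = [0.03, 0.03, 0.05, 0.01, 0.05, 0.02]

def pooled(effects, variances, tau2=0.0):
    """Inverse-variance summary; tau2 = 0 gives the fixed-effect model."""
    w = [1 / (v + tau2) for v in variances]
    mean = sum(wi * y for wi, y in zip(w, effects)) / sum(w)
    se = math.sqrt(1 / sum(w))
    relative_weights = [wi / sum(w) for wi in w]
    return mean, se, relative_weights

def dersimonian_laird_tau2(effects, variances):
    """Moment estimate of the between-studies variance, truncated at zero."""
    w = [1 / v for v in variances]
    mean_fe = sum(wi * y for wi, y in zip(w, effects)) / sum(w)
    q = sum(wi * (y - mean_fe) ** 2 for wi, y in zip(w, effects))
    df = len(effects) - 1
    c = sum(w) - sum(wi**2 for wi in w) / sum(w)
    return max(0.0, (q - df) / c)

tau2 = dersimonian_laird_tau2(effects, variances)
mean_fe, se_fe, w_fe = pooled(effects, variances)
mean_re, se_re, w_re = pooled(effects, variances, tau2)

print([round(w, 3) for w in w_fe])
print([round(w, 3) for w in w_re])
print(round(se_fe, 4), round(se_re, 4))
```

With these numbers the large fourth study loses relative weight and the small studies gain some when moving to random effects, and the standard error of the summary effect grows. Note that if the estimated between-studies variance is truncated to zero, the two sets of results coincide, so 'always larger' should be read as 'never smaller' for this estimator.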

“It makes sense to use the fixed-effect model if two conditions are met. First, we believe that all the studies included in the analysis are functionally identical. Second, our goal is to compute the common effect size for the identified population, and not to generalize to other populations. […] this situation is relatively rare. […] By contrast, when the researcher is accumulating data from a series of studies that had been performed by researchers operating independently, it would be unlikely that all the studies were functionally equivalent. Typically, the subjects or interventions in these studies would have differed in ways that would have impacted on the results, and therefore we should not assume a common effect size. Therefore, in these cases the random-effects model is more easily justified than the fixed-effect model. […] There is one caveat to the above. If the number of studies is very small, then the estimate of the between-studies variance […] will have poor precision. While the random-effects model is still the appropriate model, we lack the information needed to apply it correctly. In this case the reviewer may choose among several options, each of them problematic [and one of which is to apply a fixed effects framework].”

## Medical Statistics at a Glance

I wasn’t sure if I should blog this book or not, but in the end I decided to add a few observations here – you can read my goodreads review here.

Before I started reading the book I was considering whether it’d be worth it, as a book like this might have little to offer for someone with my background – I’ve had a few stats courses at this point, and it’s not like the specific topic of medical statistics is completely unknown to me; for example I read an epidemiology textbook just last year, and *Hill* and *Glied and Smith* covered related topics as well. It wasn’t that I thought there’s not a lot of medical statistics I don’t already know – there is – it was more of a concern that this specific (type of) book might not be the book to read if I wanted to learn a lot of new stuff in this area.

Disregarding the specific medical context of the book I knew a lot of stuff about many of the topics covered. To take an example, Bartholomew’s book devoted *a lot of pages* to the question of how to handle missing data in a sample, a question this book devotes 5 sentences to. There are a lot of details missing here and the coverage is not very deep. As I hint at in the goodreads review, I think the approach applied in the book is to some extent simply mistaken; I don’t think this (many chapters on different topics, each chapter 2-3 pages long) is a good way to write a statistics textbook. The many different chapters on a wide variety of topics give you the impression that the authors have tried to maximize the number of people who might get something out of this book, which may have ended up meaning that few people will actually get much out of it. On the plus side there are illustrated examples of many of the statistical methods used in the book, and you also get (some of) the relevant formulas for calculating e.g. specific statistics – but you get little understanding of the details of why this works, when it doesn’t, and what happens when it doesn’t. I already mentioned Bartholomew’s book – many other textbooks written about topics which they manage to cover in their two- or three-page chapters could be mentioned as well – examples include publications such as this, this and this.

Given the way the book starts out (which different types of data exist? How do you calculate an average and what is a standard deviation?) I think the people most likely to be reading a book like this are people who have a very limited knowledge of statistics and data analysis – and when people like that read stats books, you need to be very careful with your wording and assumptions. Maybe I’m just a grumpy old man, but I’m not sure I think the authors are careful enough. A couple of examples:

“Statistical modelling includes the use of simple and multiple linear regression, polynomial regression, logistic regression and methods that deal with survival data. All these methods rely on generating the mathematical model that describes the relationship between two or more variables. In general, any model can be expressed in the form:

$g(Y) = a + b_1 x_1 + b_2 x_2 + \dots + b_k x_k$

where Y is the fitted value of the dependent variable, g(.) is some optional transformation of it (for example, the logit transformation), $x_1, \dots, x_k$ are the predictor or explanatory variables”

(In case you were wondering, it took me 20 minutes to find out how to lower those 1’s and 2’s because it’s not a standard wordpress function and you need to really want to find out how to do this in order to do it. The k’s still look like crap, but I’m not going to spend more time trying to figure out how to make this look neat. I of course could not copy the book formula into the post, or I would have done that. As I’ve pointed out many times, it’s a nightmare to cover mathematical topics on a blog like this. Yeah, I know Terry Tao also blogs on wordpress, but presumably he writes his posts in a different program – I’m very much against the idea of doing this, even if I am sometimes – in situations like these – seriously reconsidering whether I should do that.)

Let’s look closer at this part again: “In general, any model can be expressed…”

This choice of words and the specific example is the sort of thing I have in mind. If you don’t know a lot about data analysis and you read a statement like this literally, which is the sort of thing I for one am wont to do, you’ll conclude that there’s no such thing as a model which is non-linear in its parameters. But there are a lot of models like that. Imprecise language like this can be incredibly frustrating because it will lead either to confusion later on, or, if people don’t read another book on any of these topics again, severe overconfidence and mistaken beliefs due to hidden assumptions.
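
To give an example of my own (not one from the book), a simple exponential curve with an unknown baseline is already outside the class described in the quote:

```latex
y = a + b\,e^{c x} + \varepsilon
```

Here c sits inside the exponential, and the transformation that would straighten out the systematic part, log(y − a), itself depends on the unknown parameter a. As far as I can see there is therefore no fixed g(.) that turns this model into the linear form above.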

Here’s another example from chapter 28, on ‘Performing a linear regression analysis’:

“**Checking the assumptions**

For each observed value of x, the residual is the observed y minus the corresponding fitted Y. Each residual may be either positive or negative. We can use the residuals to check the following assumptions underlying linear regression.

1 There is a linear relationship between x and y: Either plot y against x (the data should approximate a straight line), or plot the residuals against x (we should observe a random scatter of points rather than any systematic pattern).

2 The observations are independent: the observations are independent if there is no more than one pair of observations on each individual.”

This is not good. Arguably the independence assumption is in some contexts best conceived of as an in practice untestable assumption, but regardless of whether it ‘really’ is or not there are a lot of ways in which this assumption may be violated, and observations not being derived from the same individual is not a sufficient requirement for establishing independence. Assuming otherwise is potentially really problematic.
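
Here is a minimal simulation of the kind of thing I have in mind (my own sketch, with made-up numbers): every 'individual' contributes exactly one (x, y) pair, so the book's criterion is satisfied, yet the errors follow an AR(1) process, as they might if patients were measured on consecutive days with a drifting instrument.

```python
import random

random.seed(1)

# One (x, y) pair per individual, but serially correlated errors (AR(1), rho = 0.8).
n, rho = 500, 0.8
errors = [random.gauss(0, 1)]
for _ in range(n - 1):
    errors.append(rho * errors[-1] + random.gauss(0, 1))

x = [i / n for i in range(n)]
y = [2.0 + 3.0 * xi + e for xi, e in zip(x, errors)]

# Ordinary least squares fit of y on x.
mx, my = sum(x) / n, sum(y) / n
slope = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / sum((xi - mx) ** 2 for xi in x)
intercept = my - slope * mx
residuals = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]

# Lag-1 autocorrelation of the residuals; it should be near 0 under independence.
lag1 = sum(a * b for a, b in zip(residuals, residuals[1:])) / sum(r * r for r in residuals)
print(round(lag1, 2))
```

The residuals are strongly autocorrelated, so the usual standard errors for the slope would be off, even though no individual appears twice in the data.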

Here’s another example:

“**Some words of comfort**

Do not worry if you find the theory underlying probability distributions complex. Our experience demonstrates that you want to know only when and how to use these distributions. We have therefore outlined the essentials, and omitted the equations that define the probability distributions. You will find that you only need to be familiar with the basic ideas, the terminology and, perhaps (although infrequently in this computer age), know how to refer to the tables.”

I found this part problematic. If you want to do hypothesis testing using things like the Chi-squared distribution or the F-test (both ‘covered’, sort of, in the book), you need to be really careful about details like the relevant degrees of freedom and how these may depend on what you’re doing with the data, and stuff like this is sometimes not obvious – not even to people who’ve worked with the equations (well, sometimes it is obvious, but it’s easy to forget to correct for estimated parameters and you can’t always expect the program to do this for you, especially not in more complex model frameworks). My position is that if you’ve never even seen the relevant equations, you have no business conducting anything but the most basic of analyses involving these distributions. Of course a person who’s only read this book would not be able to do more than that, but even so instead of ‘some words of comfort’ I’d much rather have seen ‘some words of caution’.

One last one:

“**Error checking**

* **Categorical data** – It is relatively easy to check categorical data, as the responses for each variable can only take one of a number of limited values. Therefore, values that are not allowable must be errors.”

Nothing else is said about error checking of categorical data in this specific context, so it would be natural to assume from reading this that if you simply check whether values are ‘allowable’ or not, this is sufficient to catch all the errors. But this is a completely uninformative statement, as a key term remains undefined – neglected is the question of how to define (observation-specific-) ‘allowability’ in the first place, which is the real issue; a proper error-finding algorithm should apply a precise and unambiguous definition of this term, and how to (/implicitly?) construct/apply such an algorithm is likely to sometimes be quite hard, especially when multiple categories are used and allowed and the category dimension in question is hard to cross-check against other variables. Reading the above sequence, it’d be easy for the reader to assume that this is all very simple and easy.
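
A toy illustration of the difference (my own; the variable names, categories and the single cross-field rule are all hypothetical):

```python
# Allowed values per variable: the 'simple' notion of allowability.
ALLOWED = {
    "sex": {"male", "female"},
    "pregnant": {"yes", "no"},
    "smoker": {"never", "former", "current"},
}

records = [
    {"id": 1, "sex": "female", "pregnant": "yes", "smoker": "never"},
    # Every single value is allowable, but the record as a whole is impossible.
    {"id": 2, "sex": "male", "pregnant": "yes", "smoker": "current"},
    # A typo that the simple per-variable check does catch.
    {"id": 3, "sex": "femal", "pregnant": "no", "smoker": "former"},
]

def value_errors(record):
    """Variables whose value is not in the allowed set for that variable."""
    return [name for name, allowed in ALLOWED.items() if record[name] not in allowed]

def cross_field_errors(record):
    """Observation-specific checks; only one hypothetical rule is implemented here."""
    errors = []
    if record["sex"] == "male" and record["pregnant"] == "yes":
        errors.append("sex/pregnant")
    return errors

for record in records:
    print(record["id"], value_errors(record), cross_field_errors(record))
```

Record 2 sails through the per-variable check and is only caught by the cross-field rule, and that rule had to be thought up and written down by someone. For variables with no natural cross-check, an allowable-but-wrong value will not be caught at all.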

…

Oh well, all this said the book did have some good stuff as well. I’ve added some further comments and observations from the book below, with which I did not ‘disagree’ (to the extent that this is even possible). It should be noted that the book has a lot of focus on hypothesis testing and (/how to conduct) different statistical tests, and very little about statistical *modelling*. Many different tests are either mentioned and/or explicitly covered in the book, which aside from e.g. standard z-, t- and F-tests also include things like e.g. McNemar’s test, Bartlett’s test, the sign test, and the Wilcoxon rank-sum test, most of which were covered – I realized after having read the book – in the last part of the first statistics text I read, a part I was not required to study and so technically hadn’t read. So I did come across some new stuff while reading the book. Those specific parts were actually some of the parts of the book I liked best, because they contained stuff I didn’t already know, and not just stuff which I used to know but had forgotten about. The few additional quotes added below do to some extent illustrate what the book is like, but it should also be kept in mind that they’re perhaps also not completely ‘fair’, in a way, in terms of providing a balanced and representative sample of the kind of stuff included in the publication; there are many (but perhaps not enough..) equations along the way (which I’m not going to blog, for reasons already mentioned), and the book includes detailed explanations and illustrations of how to conduct specific tests – it’s quite ‘hands-on’ in some respects, and a lot of tools will be added to the toolbox of someone who’s not read a similar publication before.

…

“Generally, we make comparisons between individuals in different groups. For example, most clinical trials (Topic 14) are **parallel** trials, in which each patient receives one of the two (or occasionally more) treatments that are being compared, i.e. they result in *between-individual* comparisons.

Because there is usually less variation in a measurement within an individual than between different individuals (Topic 6), in some situations it may be preferable to consider using each individual as his/her own control. These *within-individual* comparisons provide more precise comparisons than those from between-individual designs, and fewer individuals are required for the study to achieve the same level of precision. In a clinical trial setting, the **crossover design**[1] is an example of a within-individual comparison; if there are two treatments, every individual gets each treatment, one after the other in a random order to eliminate any effect of calendar time. The treatment periods are separated by a **washout period**, which allows any residual effects (**carry-over**) of the previous treatment to dissipate. We analyse the difference in the responses on the two treatments for each individual. This design can only be used when the treatment temporarily alleviates symptoms rather than provides a cure, and the response time is not prolonged.”

“A cohort study takes a group of individuals and usually follows them forward in time, the aim being to study whether exposure to a particular aetiological factor will affect the incidence of a disease outcome in the future […]

**Advantages of cohort studies**

*The time sequence of events can be assessed.

*They can provide information on a wide range of outcomes.

*It is possible to measure the incidence/risk of disease directly.

*It is possible to collect very detailed information on exposure to a wide range of factors.

*It is possible to study exposure to factors that are rare.

*Exposure can be measured at a number of time points, so that changes in exposure over time can be studied.

*There is reduced **recall** and **selection bias** compared with case-control studies (Topic 16).

**Disadvantages of cohort studies**

*In general, cohort studies follow individuals for long periods of time, and are therefore costly to perform.

*Where the outcome of interest is rare, a very large sample size is needed.

*As follow-up increases, there is often increased loss of patients as they migrate or leave the study, leading to biased results.

*As a consequence of the long time-scale, it is often difficult to maintain consistency of measurements and outcomes over time. […]

*It is possible that disease outcomes and their probabilities, or the aetiology of disease itself, may change over time.”

“A case-control study compares the characteristics of a group of patients with a particular disease outcome (the **cases**) to a group of individuals without a disease outcome (the **controls**), to see whether any factors occurred more or less frequently in the cases than the controls […] Many case-control studies are **matched** in order to select cases and controls who are as similar as possible. In general, it is useful to sex-match individuals (i.e. if the case is male, the control should also be male), and, sometimes, patients will be age-matched. However, it is important not to match on the basis of the risk factor of interest, or on any factor that falls within the causal pathway of the disease, as this will remove the ability of the study to assess any relationship between the risk factor and the disease. Unfortunately, matching [means] that the effect on disease of the variables that have been used for matching cannot be studied.”

“**Advantages of case-control studies**

quick, cheap and easy […] particularly suitable for rare diseases. […] A wide range of risk factors can be investigated. […] no loss to follow-up.

**Disadvantages of case-control studies**

Recall bias, when cases have a differential ability to remember certain details about their histories, is a potential problem. For example, a lung cancer patient may well remember the occasional period when he/she smoked, whereas a control may not remember a similar period. […] If the onset of disease preceded exposure to the risk factor, causation cannot be inferred. […] Case-control studies are not suitable when exposures to the risk factor are rare.”

“**The P-value is the probability of obtaining our results, or something more extreme, if the null hypothesis is true.** The null hypothesis relates to the population of interest, rather than the sample. Therefore, the null hypothesis is either true or false and we *cannot* interpret the P-value as the probability that the null hypothesis is true.”
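
The definition can be computed directly in a simple case. The sketch below is mine: an exact two-sided binomial test where 'more extreme' is taken to mean 'no more probable under the null than what we observed', which is one common convention among several. The 15-out-of-20 example is made up.

```python
from math import comb

def binomial_p_value(successes, n, p0=0.5):
    """Two-sided exact p-value: total null probability of all outcomes
    that are no more likely than the observed one."""
    probs = [comb(n, k) * p0**k * (1 - p0) ** (n - k) for k in range(n + 1)]
    observed = probs[successes]
    # Small relative tolerance so that ties are not lost to rounding.
    return sum(p for p in probs if p <= observed * (1 + 1e-9))

# Hypothetical data: 15 'successes' in 20 trials, null hypothesis p = 0.5.
p = binomial_p_value(15, 20)
print(round(p, 4))
```

Everything in the calculation is conditional on the null hypothesis being true, which is exactly why the result cannot be read as the probability that the null hypothesis is true.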

“Hypothesis tests which are based on knowledge of the probability distributions that the data follow are known as **parametric tests**. Often data do not conform to the assumptions that underlie these methods (Topic 32). In these instances we can use **non-parametric tests** (sometimes referred to as **distribution-free** tests, or **rank methods**). […] Non-parametric tests are particularly useful when the sample size is small […], and when the data are measured on a categorical scale. However, non-parametric tests are generally wasteful of information; consequently they have less power […] A number of factors have a direct bearing on power for a given test.

*The **sample size**: power increases with increasing sample size. […]

*The **variability of the observations**: power increases as the variability of the observations decreases […]

*The **effect of interest**: the power of the test is greater for larger effects. A hypothesis test thus has a greater chance of detecting a large real effect than a small one.

*The **significance level**: the power is greater if the significance level is larger”
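
The four factors listed above can be checked with a few lines of code. This is a sketch of my own, assuming a two-sided two-sample z-test with equal group sizes and a known common standard deviation; the baseline numbers are made up.

```python
from statistics import NormalDist

def power_two_sample_z(effect, sd, n_per_group, alpha=0.05):
    """Approximate power of a two-sided two-sample z-test with equal group sizes."""
    z = NormalDist()
    se = sd * (2 / n_per_group) ** 0.5      # standard error of the difference in means
    z_crit = z.inv_cdf(1 - alpha / 2)
    shift = abs(effect) / se                # true effect in standard-error units
    return (1 - z.cdf(z_crit - shift)) + z.cdf(-z_crit - shift)

base = power_two_sample_z(effect=5, sd=10, n_per_group=50)
larger_sample = power_two_sample_z(effect=5, sd=10, n_per_group=100)
more_variable = power_two_sample_z(effect=5, sd=15, n_per_group=50)
larger_effect = power_two_sample_z(effect=8, sd=10, n_per_group=50)
larger_alpha = power_two_sample_z(effect=5, sd=10, n_per_group=50, alpha=0.10)

print(round(base, 3), round(larger_sample, 3), round(more_variable, 3),
      round(larger_effect, 3), round(larger_alpha, 3))
```

Each variation moves the power in the direction the book describes: up with sample size, effect size and significance level, down with variability.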

“The statistical use of the word ‘regression’ derives from a phenomenon known as **regression to the mean**, attributed to Sir Francis Galton in 1889. He demonstrated that although tall fathers tend to have tall sons, the average height of the sons is less than that of their tall fathers. The average height of the sons has ‘regressed’ or ‘gone back’ towards the mean height of all the fathers in the population. So, on average, tall fathers have shorter (but still tall) sons and short fathers have taller (but still short) sons.

We observe regression to the mean in **screening** and in **clinical trials**, when a subgroup of patients may be selected for treatment because their levels of a certain variable, say cholesterol, are extremely high (or low). If the measurement is repeated some time later, the average value for the second reading for the subgroup is usually less than that of the first reading, tending towards (i.e. regressing to) the average of the age- and sex-matched population, irrespective of any treatment they may have received. Patients recruited into a clinical trial on the basis of a high cholesterol level on their first examination are thus likely to show a drop in cholesterol levels on average at their second examination, even if they remain untreated during this period.”
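
A quick simulation shows the effect (my own sketch; the cholesterol-like numbers, the noise level and the recruitment threshold are all made up). Each person has a stable true level, each reading adds independent noise, and nobody receives any treatment.

```python
import random
from statistics import mean

random.seed(42)

# Stable 'true' level per person plus independent noise on each reading.
true_levels = [random.gauss(5.5, 0.8) for _ in range(20000)]
first = [t + random.gauss(0, 0.6) for t in true_levels]
second = [t + random.gauss(0, 0.6) for t in true_levels]

# 'Recruit' everyone whose first reading is high; no one is treated.
selected = [i for i, reading in enumerate(first) if reading > 7.0]
mean_first = mean(first[i] for i in selected)
mean_second = mean(second[i] for i in selected)

print(len(selected), round(mean_first, 2), round(mean_second, 2))
```

The selected group's second reading is lower on average than its first, but still above the population mean, purely because people were selected on a noisy measurement.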

“A systematic review[1] is a formalized and stringent process of combining the information from all relevant studies (both published and unpublished) of the same health condition; these studies are usually clinical trials […] of the same or similar treatments but may be observational studies […] a meta-analysis, because of its inflated sample size, is able to detect treatment effects with **greater power** and estimate these effects with **greater precision** than any single study. Its advantages, together with the introduction of meta-analysis software, have led meta-analyses to proliferate. However, improper use can lead to erroneous conclusions regarding treatment efficacy. The following principal **problems** should be thoroughly investigated and resolved before a meta-analysis is performed.

* **Publication bias** – the tendency to include in the analysis only the results from published papers; these favour statistically significant findings.

* **Clinical heterogeneity** – in which differences in the patient population, outcome measures, definition of variables, and/or duration of follow-up of the studies included in the analysis create problems of non-compatibility.

* **Quality differences** – the design and conduct of the studies may vary in their quality. Although giving more weight to the better studies is one solution to this dilemma, any weighting system can be criticized on the grounds that it is arbitrary.

* **Dependence** – the results from studies included in the analysis may not be independent, e.g. when results from a study are published on more than one occasion.”

## Unobserved Variables – Models and Misunderstandings

This is a neat little book in the *Springer Briefs in Statistics* series. The author is David J Bartholomew, a former statistics professor at the LSE. I wrote a brief goodreads review, but I thought that I might as well also add a post about the book here. The book covers topics such as the EM algorithm, Gibbs sampling, the Metropolis–Hastings algorithm and the Rasch model, and it assumes you’re familiar with stuff like how to do ML estimation, among many other things. I had some passing familiarity with many of the topics he talks about in the book, but I’m sure I’d have benefited from knowing more about some of the specific topics covered. Because large parts of the book are basically unreadable by people without a stats background I wasn’t sure how much of it it made sense to cover here, but I decided to talk a bit about a few of the things which I believe don’t require you to know a whole lot about this area.

…

“Modern statistics is built on the idea of models—probability models in particular. [*While I was rereading this part, I was reminded of this quote which I came across while finishing my most recent quotes post: “No scientist is as model minded as is the statistician; in no other branch of science is the word model as often and consciously used as in statistics.” Hans Freudenthal*.] The standard approach to any new problem is to identify the sources of variation, to describe those sources by probability distributions and then to use the model thus created to estimate, predict or test hypotheses about the undetermined parts of that model. […] A statistical model involves the identification of those elements of our problem which are subject to uncontrolled variation and a specification of that variation in terms of probability distributions. Therein lies the strength of the statistical approach and the source of many misunderstandings. Paradoxically, misunderstandings arise both from the lack of an adequate model and from over reliance on a model. […] At one level is the failure to recognise that there are many aspects of a model which cannot be tested empirically. At a higher level is the failure to recognise that any model is, necessarily, an assumption in itself. The model is not the real world itself but a representation of that world as perceived by ourselves. This point is emphasised when, as may easily happen, two or more models make exactly the same predictions about the data. Even worse, two models may make predictions which are so close that no data we are ever likely to have can ever distinguish between them. […] *All model-dependent inference is necessarily conditional on the model.* This stricture needs, especially, to be borne in mind when using Bayesian methods. Such methods are totally model-dependent and thus all are vulnerable to this criticism. 
The problem can apparently be circumvented, of course, by embedding the model in a larger model in which any uncertainties are, themselves, expressed in probability distributions. However, in doing this we are embarking on a potentially infinite regress which quickly gets lost in a fog of uncertainty.”

“Mixtures of distributions play a fundamental role in the study of unobserved variables […] The two important questions which arise in the analysis of mixtures concern how to identify whether or not a given distribution could be a mixture and, if so, to estimate the components. […] Mixtures arise in practice because of failure to recognise that samples are drawn from several populations. If, for example, we measure the heights of men and women without distinction the overall distribution will be a mixture. It is relevant to know this because women tend to be shorter than men. […] It is often not at all obvious whether a given distribution could be a mixture […] even a two-component mixture of normals, has 5 unknown parameters. As further components are added the estimation problems become formidable. If there are many components, separation may be difficult or impossible […] [To add to the problem,] the form of the distribution is unaffected by the mixing [in the case of the mixing of normals]. Thus there is no way that we can recognise that mixing has taken place by inspecting the form of the resulting distribution alone. Any given normal distribution could have arisen naturally or be the result of normal mixing […] if *f(x)* is normal, there is no way of knowing whether it is the result of mixing and hence, if it is, what the mixing distribution might be.”
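
The claim about normal mixing is easy to check by simulation. In this sketch of mine (the parameter values are made up) each observation first draws its own mean from a normal distribution and is then drawn from a normal distribution around that mean; the result should again be normal, with variance equal to the sum of the two variances.

```python
import random
from statistics import mean, pvariance

random.seed(7)

m, tau, sigma = 170.0, 6.0, 4.0   # made-up values
n = 50000

# Draw a mean from N(m, tau^2), then the observation from N(mean, sigma^2).
mixed = [random.gauss(random.gauss(m, tau), sigma) for _ in range(n)]

mu_hat = mean(mixed)
var_hat = pvariance(mixed, mu_hat)
sd_hat = var_hat ** 0.5

# Sample skewness and excess kurtosis; both are 0 for a normal distribution.
skewness = mean(((x - mu_hat) / sd_hat) ** 3 for x in mixed)
excess_kurtosis = mean(((x - mu_hat) / sd_hat) ** 4 for x in mixed) - 3

print(round(var_hat, 1), round(skewness, 3), round(excess_kurtosis, 3))
```

The sample variance comes out close to 6² + 4² = 52 and the shape statistics are close to zero, so nothing in the shape of the sample hints that any mixing took place. Note that this is mixing with a normal mixing distribution; a two-component mixture such as the men/women example is a different case.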

“Even if there is close agreement between a model and the data it does not follow that the model provides a true account of how the data arose. It may be that several models explain the data equally well. When this happens there is said to be a lack of identifiability. Failure to take full account of this fact, especially in the social sciences, has led to many over-confident claims about the nature of social reality. Lack of identifiability within a class of models may arise because different values of their parameters provide equally good fits. Or, more seriously, models with quite different characteristics may make identical predictions. […] If we start with a model we can predict, albeit uncertainly, what data it should generate. But if we are given a set of data we cannot necessarily infer that it was generated by a particular model. In some cases it may, of course, be possible to achieve identifiability by increasing the sample size but there are cases in which, no matter how large the sample size, no separation is possible. […] Identifiability matters can be considered under three headings. First there is lack of *parameter identifiability* which is the most common use of the term. This refers to the situation where there is more than one value of a parameter in a given model each of which gives an equally good account of the data. […] Secondly there is what we shall call lack of *model identifiability* which occurs when two or more models make exactly the same data predictions. […] The third type of identifiability is actually the combination of the foregoing types.

Mathematical statistics is not well-equipped to cope with situations where models are practically, but not precisely, indistinguishable because it typically deals with things which can only be expressed in unambiguously stated theorems. Of necessity, these make clear-cut distinctions which do not always correspond with practical realities. For example, there are theorems concerning such things as sufficiency and admissibility. According to such theorems, for example, a proposed statistic is either sufficient or not sufficient for some parameter. If it is sufficient it contains all the information, in a precisely defined sense, about that parameter. But in practice we may be much more interested in what we might call ‘near sufficiency’ in some more vaguely defined sense. Because we cannot give a precise mathematical definition to what we mean by this, the practical importance of the notion is easily overlooked. The same kind of fuzziness arises with what are called structural equation models (or structural relations models) which have played a very important role in the social sciences. […] we shall argue that structural equation models are almost always unidentifiable in the broader sense of which we are speaking here. […] [our results] constitute a formidable argument against the careless use of structural relations models. […] In brief, the valid use of a structural equations model requires us to lean very heavily upon assumptions about which we may not be very sure. It is undoubtedly true that if such a model provides a good fit to the data, then it provides a *possible* account of how the data might have arisen. It says nothing about what other models might provide an equally good, or even better fit. As a tool of inductive inference designed to tell us something about the social world, linear structural relations modelling has very little to offer.”

“It is very common for data to be missing and this introduces a risk of bias if inferences are drawn from incomplete samples. However, we are not usually interested in the missing data themselves but in the population characteristics to whose estimation those values were intended to contribute. […] A very longstanding way of dealing with missing data is to fill in the gaps by some means or other and then carry out the standard analysis on the completed data set. This procedure is known as *imputation*. […] In its simplest form, each missing data point is replaced by a single value. Because there is, inevitably, uncertainty about what the imputed values should be, one can do better by substituting a range of plausible values and comparing the results in each case. This is known as *multiple imputation*. […] missing values may occur anywhere and in any number. They may occur haphazardly or in some pattern. In the latter case, the pattern may provide a clue to the mechanism underlying the loss of data and so suggest a method for dealing with it. The conditional distribution which we have supposed might be the basis of imputation depends, of course, on the mechanism behind the loss of data. From a practical point of view the detailed information necessary to determine this may not be readily obtainable or, even, necessary. Nevertheless, it is useful to clarify some of the issues by introducing the idea of a probability mechanism governing the loss of data. This will enable us to classify the problems which would have to be faced in a more comprehensive treatment. The simplest, if least realistic approach, is to assume that the chance of being missing is the same for all elements of the data matrix. In that case, we can, in effect, ignore the missing values […] Such situations are designated as MCAR which is an acronym for Missing Completely at Random. […] In the smoking example we have supposed that men are more likely to refuse [to answer] than women. 
If we go further and assume that there are no other biasing factors we are, in effect, assuming that ‘missingness’ is completely at random for men and women, separately. This would be an example of what is known as Missing at Random (MAR) […] which means that the missing mechanism depends on the observed variables but not on those that are missing. The final category is Missing Not at Random (MNAR) which is a residual category covering all other possibilities. This is difficult to deal with in practice unless one has an unusually complete knowledge of the missing mechanism.

Another term used in the theory of missing data is that of ignorability. The conditional distribution of y given x will, in general, depend on any parameters of the distribution of M [*the variable we use to describe the mechanism governing the loss of observations*] yet these are unlikely to be of any practical interest. It would be convenient if this distribution could be ignored for the purposes of inference about the parameters of the distribution of x. If this is the case the mechanism of loss is said to be ignorable. In practice it is acceptable to assume that the concept of ignorability is equivalent to that of MAR.”
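To make the MCAR/MAR distinction concrete, here is a small simulation sketch of the smoking example. All the rates are hypothetical numbers I picked for illustration: men smoke more often and also refuse to answer more often, but within each sex the refusals are random, which is the MAR case. Sex is assumed to be observed for everyone.

```python
import random


def simulate_survey(n=200_000, seed=1):
    """Return (male, smokes, answered) tuples; all rates are made up."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        male = rng.random() < 0.5
        smokes = rng.random() < (0.30 if male else 0.15)
        # Refusal depends only on the observed variable (sex), not on smoking: MAR.
        answered = rng.random() > (0.40 if male else 0.10)
        rows.append((male, smokes, answered))
    return rows


def share(values):
    values = list(values)
    return sum(values) / len(values)


rows = simulate_survey()
true_rate = share(s for _, s, _ in rows)
complete_case = share(s for _, s, a in rows if a)   # simply ignore the refusals

# Stratify on sex: estimate the rate among responders of each sex,
# then weight by that sex's share of the full sample.
adjusted = 0.0
for sex in (True, False):
    sex_share = share(m == sex for m, _, _ in rows)
    rate_among_responders = share(s for m, s, a in rows if a and m == sex)
    adjusted += sex_share * rate_among_responders

print(true_rate, complete_case, adjusted)
```

Ignoring the missing values understates the smoking rate here because the responders are disproportionately women, while conditioning on the observed variable recovers it. Under MNAR (say, smokers refusing more often than non-smokers of the same sex) the stratified estimate would no longer be enough.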

## A couple of abstracts

**Abstract:**

“There are both costs and benefits associated with conducting scientific and technological research. Whereas the *benefits* derived from scientific research and new technologies have often been addressed in the literature (for a good example, see *Evenson et al., 1979*), few of the major non-monetary societal *costs* associated with major expenditures on scientific research and technology have so far received much attention.

In this paper we investigate one of the major non-monetary societal cost variables associated with the conduct of scientific and technological research in the United States, namely the suicides resulting from research activities. In particular, in this paper we analyze the association between scientific and technological research expenditure patterns and the number of suicides committed using one of the most common suicide methods, namely that of hanging, strangulation and suffocation (-HSS). We conclude from our analysis that there’s a very strong association between scientific research expenditures in the US and the frequency of suicides committed using the HSS method, and that this relationship has been stable for at least a decade. An important aspect in the context of the association is the precise mechanisms through which the increase in HSS suicides takes place. Although the mechanisms are still not well-elucidated, we suggest that one of the important components in this relationship may be judicial research, as initial analyses of related data have suggested that this variable may be important. We argue in the paper that our initial findings in this context provide impetus for considering this pathway a particularly important area of future research in this field.”

Graph 1:

…

**Abstract**:

“Murders by bodily force (-Mbf) make up a substantial number of all homicides in the US. Previous research on the topic has shown that this criminal activity causes the compromise of some common key biological functions in victims, such as respiration and cardiac function, and that many people with close social relationships with the victims are psychosocially affected as well, which means that this societal problem is clearly of some importance.

Researchers have known for a long time that the marital state of the inhabitants of the state of Mississippi and the dynamics of this variable have important nation-wide effects. Previous research has e.g. analyzed how the marriage rate in Mississippi determines the US per capita consumption of whole milk. In this paper we investigate how the dynamics of Mississippian marital patterns relate to the national Mbf numbers. We conclude from our analysis that it is very clear that there’s a strong association between the divorce rate in Mississippi and the national level of Mbf. We suggest that the effect may go through previously established channels such as e.g. milk consumption, but we also note that the precise relationship has yet to be elucidated and that further research on this important topic is clearly needed.”

…

This abstract is awesome as well, but I didn’t write it…

…

The ‘funny’ part is that I could actually easily imagine papers not too dissimilar to the ones just outlined getting published in scientific journals. Indeed, in terms of the structure I’d claim that many published papers *are* *exactly* *like this*. They do significance testing as well, sure, but hunting down p-values is not much different from hunting down correlations and it’s quite easy to do both. If that’s all you have, you haven’t shown much.
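How easy is it to hunt down a correlation? Here is a rough sketch, with everything in it invented for the purpose: one short "target" series and a thousand unrelated candidate series, all independent random walks of ten yearly observations, followed by a search for the candidate that correlates best with the target.

```python
import random


def random_walk(rng, length):
    """A trending-looking series built from independent normal steps."""
    level, path = 0.0, []
    for _ in range(length):
        level += rng.gauss(0.0, 1.0)
        path.append(level)
    return path


def pearson(xs, ys):
    """Plain Pearson correlation coefficient."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


rng = random.Random(2)
target = random_walk(rng, 10)                               # e.g. ten yearly observations
candidates = [random_walk(rng, 10) for _ in range(1000)]    # unrelated by construction

best_r = max(abs(pearson(target, c)) for c in candidates)
print(best_r)
```

None of the candidates has anything to do with the target, yet with short series and enough candidates the best match will usually look like a very strong relationship. Picking the smallest p-value out of many tests works the same way.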

## Clinical Epidemiology: The Essentials

Here’s the Amazon link, here’s Goodreads (3.84 average rating).

The book is an introductory textbook about clinical epidemiology. I don’t think I’ve ever actually read an epidemiology textbook (a big part of why I wanted to read this), but I’ve read a lot of other stuff on related matters and I think it would be fair to say that this is not a new field to me. Most of the stuff in the book was well known to me and there actually wasn’t a lot of new stuff in there. I happen to know *a lot* more about many of the topics covered than what’s written in the book; data analysis is data analysis, and whether you’re performing survival analysis using Cox proportional hazard models on medical data or on unemployment data doesn’t really change all that much, except perhaps how to interpret the results. But I didn’t know beforehand how much of this I already knew, and the book did contain enough interesting observations along the way to keep me going; even though many of the things covered were things I’d read about before, I did read the book from cover to cover. Though I should also point out that I emphatically did *not* spend an equal amount of time on all of it, and that I went over some of it quite fast (I didn’t really need to be reminded what a p-value is…).

The book defines a lot of concepts and shows how they are connected. I think I’d categorize it as ‘methodological, but informal’, in the sense that there’s very little explicit math here aside from the math required to introduce/explain variables of interest. Given the informal structure and the fact that it’s an introductory text it’s natural that a few imprecise statements creep in here and there; there was in particular one paragraph in the middle of the book where I felt unsure what they were actually trying to say (they were basically saying that you shouldn’t ever use a specific research method, described in a very vague manner, but I knew that if they were criticizing what I believed them to be criticizing then they were obviously overlooking/disregarding benefits of the method which should be traded off against the costs they emphasized). In a couple of places they make semi-questionable assumptions about e.g. attrition rate dynamics and selection mechanisms related to stuff like compliance patterns – but given the nature of the book this is perfectly understandable and, I think, forgivable, as a person who’s completely new to the field probably shouldn’t worry too much about those kinds of details yet (you can easily end up confusing people more than you help them by adding too much complexity to an intro textbook). Most of the time the language is precise and to the point.

If you don’t know a lot about this stuff, I think this book is a good starting point. It provides you with a good basis. It’s well written. But it’s not detailed enough (/too informal) and it didn’t teach me enough new stuff for me to really start getting excited about the book – I ended up giving it 3 stars on Goodreads. To give you a sense of what the book is like, below I’ve added a few quotes from the book, with my own comments italicized and bracketed:

…

“Although clinical distributions often resemble a normal distribution the resemblance is superficial. As one statistician (4) put it, “The experimental fact is that for most physiological variables the distribution is smooth, unimodal, and skewed, and that mean +/- 2 standard deviations does not cut off the desired 95%. We have no mathematical, statistical, or other theorems that enable us to predict the shape of the distributions of physiologic measurements.”

The shapes of clinical distributions differ from one another because many differences among people, other than random variation, contribute to distributions of clinical measurements. Therefore, if distributions of clinical measurements resemble normal curves, it is largely by accident. Even so, it is often assumed, as a matter of convenience […] that clinical measurements are “normally distributed”.”
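A quick way to see what the quoted statistician means is to take a smooth, unimodal, skewed distribution and check what mean +/- 2 standard deviations actually cuts off. The sketch below uses a lognormal distribution purely as a stand-in of my own choosing (the book does not name one), with arbitrary parameters, and computes the two tail shares analytically.

```python
import math
from statistics import NormalDist


def lognormal_tail_shares(mu=0.0, sigma=1.0):
    """Shares of a lognormal population below mean - 2 SD and above mean + 2 SD."""
    mean = math.exp(mu + sigma ** 2 / 2)
    sd = mean * math.sqrt(math.exp(sigma ** 2) - 1)
    log_scale = NormalDist(mu, sigma)        # distribution of log(X)
    lower, upper = mean - 2 * sd, mean + 2 * sd
    # A lognormal variable is always positive, so a non-positive cutoff excludes nobody.
    below = log_scale.cdf(math.log(lower)) if lower > 0 else 0.0
    above = 1.0 - log_scale.cdf(math.log(upper))
    return below, above


below, above = lognormal_tail_shares()
normal_tail = 1.0 - NormalDist().cdf(2.0)    # what each tail would be under normality
print(below, above, normal_tail)
```

With these parameters the lower cutoff is negative, so nobody is flagged as abnormally low, while about 3.7% are flagged as abnormally high, compared with roughly 2.3% in each tail under a normal curve. The interval neither contains 95% of the population nor splits the remainder evenly.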

“most distributions of clinical variables are not easily divided into “normal” and “abnormal.” They are not inherently dichotomous and do not display sharp breaks or two peaks that characterize normal and abnormal results. This is because disease is usually acquired by degrees, so there is a smooth transition from low to high values with increasing degrees of dysfunction. Laboratory tests reflecting organ failure, such as serum creatinine for kidney failure or ejection fraction for heart failure, behave in this way. Another reason why normals and abnormals are not seen as separate distributions is that even when people with and without a disease have substantially different frequency distributions, the distributions almost always overlap.”

“patients in clinical trials are usually a highly selected, biased sample of all patients with the condition of interest. As heterogeneity is restricted, the internal validity of the study is improved; in other words, there is less opportunity for differences in outcome that are not related to treatment itself. But exclusions come at the price of diminished generalizability.” [*As RCTs’ (potentially) limited external validity is closely related to questions about compliance/adherence, this aspect was naturally extensively covered in Davies and Kermani – so it’s not exactly news to me. Nevertheless it’s an important aspect often overlooked when evaluating studies of this type, so I figured I should include a remark/quote on the topic here*].

“**Lead time** is the period of time between the detection of a medical condition by screening and when it ordinarily would be diagnosed because a patient experiences symptoms and seeks medical care […] The amount of lead time for a given disease depends on the biological rate of progression of the disease and how early the screening test can detect the disease. When lead time is very short […] treatment of medical conditions picked up on screening is likely to be no more effective than treatment after symptoms appear. On the other hand, when lead time is long […] treatment of the medical condition found on screening can be very effective. […] How can lead time cause biased results in a study of the efficacy of early treatment? […] because of screening, a disease is found earlier than it would have been after the patient developed symptoms. As a result, people who are diagnosed by screening for a deadly disease will, on average, survive longer from the time of diagnosis than people who are diagnosed after they get symptoms, even if early treatment is no more effective than treatment at the time of clinical presentation. In such a situation, screening would appear to help people live longer, spuriously improving survival rates when, in reality, they have been given not more “survival time” but more “disease time.”” [*I knew about this phenomenon, but I didn’t know that it had a name*. *I do now*.]
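A small simulation sketch of lead-time bias, under assumptions of my own: every patient dies at the same age whether or not they are screened (so early treatment does nothing), and screening simply moves the diagnosis a fixed `lead_time` earlier. The ages, the mean survival of three years and the two-year lead time are invented for illustration.

```python
import random
import statistics


def simulate_lead_time(n=50_000, lead_time=2.0, horizon=5.0, seed=3):
    """Survival measured from symptomatic diagnosis vs. from screen detection.

    The age at death is identical in both scenarios; only the date of
    diagnosis moves, so any difference in 'survival' is pure lead time.
    """
    rng = random.Random(seed)
    from_symptoms, from_screening = [], []
    for _ in range(n):
        symptom_age = rng.uniform(50.0, 70.0)
        death_age = symptom_age + rng.expovariate(1.0 / 3.0)   # mean 3 years
        screen_age = symptom_age - lead_time                   # found earlier by screening
        from_symptoms.append(death_age - symptom_age)
        from_screening.append(death_age - screen_age)
    return {
        "mean_survival_symptoms": statistics.fmean(from_symptoms),
        "mean_survival_screening": statistics.fmean(from_screening),
        "horizon_survival_symptoms": sum(t >= horizon for t in from_symptoms) / n,
        "horizon_survival_screening": sum(t >= horizon for t in from_screening) / n,
    }


result = simulate_lead_time()
print(result)
```

Mean survival after diagnosis goes up by exactly the lead time and the five-year survival rate improves markedly, even though nobody in the simulation lives a day longer.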

“**Length-time bias** […] occurs because the proportion of slow-growing lesions diagnosed during screening is greater than the proportion of those diagnosed during usual medical care. [*This is because the speed of tumor growth is related to the likelihood that a screening test will ‘be necessary’ to find the cancer – a fast-moving cancer will produce symptoms before the screening test/in between screening tests, whereas a slow-moving cancer will not*]. As a result, length-time bias makes it seem that screening and early treatment are more effective than usual care.” [*For those who don’t know, I should note that some similar problems may pop up in models applied in labour economics, e.g. when dealing with unemployment data. So the math/intuition behind problems such as these is not unknown to me – as mentioned in the beginning, data analysis is to some extent just data analysis, no matter which field of inquiry it’s being applied to.*]
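The sampling mechanism behind length-time bias can be sketched in a few lines. The setup is hypothetical: half of all tumors are slow-growing and stay detectable-but-symptomless for four years, the other half are fast-growing and do so for one year, and a single screening round takes place at a fixed point in time.

```python
import random


def simulate_length_bias(n=100_000, screen_time=10.0, seed=4):
    """Share of slow-growing tumors among all tumors and among screen-detected ones."""
    rng = random.Random(seed)
    all_tumors, screen_detected = [], []
    for _ in range(n):
        start = rng.uniform(0.0, 20.0)        # when the tumor becomes detectable
        slow = rng.random() < 0.5             # half the tumors are slow-growing
        preclinical_years = 4.0 if slow else 1.0
        all_tumors.append(slow)
        # Screening only finds tumors that are detectable but not yet symptomatic.
        if start <= screen_time < start + preclinical_years:
            screen_detected.append(slow)
    slow_share_all = sum(all_tumors) / len(all_tumors)
    slow_share_screened = sum(screen_detected) / len(screen_detected)
    return slow_share_all, slow_share_screened


slow_share_all, slow_share_screened = simulate_length_bias()
print(slow_share_all, slow_share_screened)
```

A tumor's chance of being caught is proportional to how long it sits in the detectable window, so slow-growing tumors make up about four fifths of the screen-detected cases despite being half of all cases. The same thing happens when you sample ongoing unemployment spells at a point in time: long spells are over-represented.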

“To many people in the Western world, the suggestion that sticking needles into the body and twirling them can decrease pain and control vomiting seems biologically implausible. However, randomized controlled trials of acupuncture have found acupuncture effective.” [*I include this quote here only because it cost the book one star in my evaluation. I was probably at around 3.5 before reading this. After reading this, I was at ~2.5. I considered it a serious hit to the credibility of the authors, because to me it displayed a lack of critical thinking and signalled poor judgment. Here’s a Cochrane review on the matter (“We concluded that there was insufficient evidence to judge whether acupuncture is effective in relieving cancer-related pain in adults.”).*]