## Medical Statistics at a Glance

I wasn’t sure if I should blog this book or not, but in the end I decided to add a few observations here – you can read my goodreads review here.

Before I started reading the book I was considering whether it’d be worth it, as a book like this might have little to offer for someone with my background – I’ve had a few stats courses at this point, and it’s not like the specific topic of medical statistics is completely unknown to me; for example I read an epidemiology textbook just last year, and *Hill* and *Glied and Smith* covered related topics as well. It wasn’t that I thought there’s not a lot of medical statistics I don’t already know – there is – it was more of a concern that this specific (type of) book might not be the book to read if I wanted to learn a lot of new stuff in this area.

Disregarding the specific medical context of the book, I knew a lot about many of the topics covered. To take an example, Bartholomew’s book devoted *a lot of pages* to the question of how to handle missing data in a sample, a question this book devotes 5 sentences to. A lot of details are missing here, and the coverage is not very deep. As I hint at in the goodreads review, I think the approach applied in the book is to some extent simply mistaken; I don’t think this (many chapters on different topics, each chapter 2-3 pages long) is a good way to write a statistics textbook. The many different chapters on a wide variety of topics give you the impression that the authors have tried to maximize the number of people who might get something out of this book, which may have ended up meaning that few people will actually get much out of it. On the plus side there are illustrated examples of many of the statistical methods used in the book, and you also get (some of) the relevant formulas for calculating e.g. specific statistics – but you get little understanding of the details of why this works, when it doesn’t, and what happens when it doesn’t. I already mentioned Bartholomew’s book – many other textbooks devoted entirely to topics this book covers in its two- or three-page chapters could be mentioned as well – examples include publications such as this, this and this.

Given the way the book starts out (which different types of data exist? How do you calculate an average, and what is a standard deviation?), I think the people most likely to be reading a book like this are people with a very limited knowledge of statistics and data analysis – and when writing for readers like that, you need to be very careful with your wording and assumptions. Maybe I’m just a grumpy old man, but I’m not sure the authors are careful enough. A couple of examples:

“Statistical modelling includes the use of simple and multiple linear regression, polynomial regression, logistic regression and methods that deal with survival data. All these methods rely on generating the mathematical model that describes the relationship between two or more variables. In general, any model can be expressed in the form:

*g(Y) = a + b_{1}x_{1} + b_{2}x_{2} + … + b_{k}x_{k}*

where Y is the fitted value of the dependent variable, g(.) is some optional transformation of it (for example, the logit transformation), x_{1}, …, x_{k} are the predictor or explanatory variables”

(In case you were wondering, it took me 20 minutes to find out how to lower those 1’s and 2’s because it’s not a standard wordpress function and you need to really want to find out how to do this in order to do it. The k’s still look like crap, but I’m not going to spend more time trying to figure out how to make this look neat. I of course could not copy the book formula into the post, or I would have done that. As I’ve pointed out many times, it’s a nightmare to cover mathematical topics on a blog like this. Yeah, I know Terry Tao also blogs on wordpress, but presumably he writes his posts in a different program – I’m very much against the idea of doing this, even if I am sometimes – in situations like these – seriously reconsidering whether I should do that.)

Let’s look closer at this part again: “In general, any model can be expressed…”

This choice of words and the specific example are the sort of thing I have in mind. If you don’t know a lot about data analysis and you read a statement like this literally – which is the sort of thing I for one am wont to do – you’ll conclude that there’s no such thing as a model which is non-linear in its parameters. But there are a lot of models like that. Imprecise language like this can be incredibly frustrating, because it will lead either to confusion later on or, if people never read another book on any of these topics, to severe overconfidence and mistaken beliefs due to hidden assumptions.
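To make the point concrete, here’s a minimal sketch (Python, made-up data – none of this is from the book) of a model that is non-linear in its parameters, an exponential decay with an offset, which cannot be written in the *g(Y) = a + b_{1}x_{1} + …* form no matter which transformation g you pick:

```python
import numpy as np
from scipy.optimize import curve_fit

rng = np.random.default_rng(0)

# Exponential decay with an offset: y = a*exp(-b*x) + c. The parameter b
# sits inside the exponential, so no transformation g(.) of y turns this
# into the linear-in-parameters form a + b1*x1 + ... + bk*xk.
def model(x, a, b, c):
    return a * np.exp(-b * x) + c

x = np.linspace(0, 4, 50)
y = model(x, 2.5, 1.3, 0.5) + rng.normal(0, 0.05, x.size)

# Models like this need iterative non-linear least squares rather than
# the closed-form solution available for linear regression.
params, _ = curve_fit(model, x, y, p0=(1.0, 1.0, 0.0))
```

The true parameters (2.5, 1.3, 0.5) are recovered to good accuracy here, but only because an iterative fitter was used – the point being that the book’s “any model” claim quietly excludes this whole class.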

Here’s another example from chapter 28, on ‘Performing a linear regression analysis’:

“**Checking the assumptions**

For each observed value of x, the residual is the observed y minus the corresponding fitted Y. Each residual may be either positive or negative. We can use the residuals to check the following assumptions underlying linear regression.

1 There is a linear relationship between x and y: Either plot y against x (the data should approximate a straight line), or plot the residuals against x (we should observe a random scatter of points rather than any systematic pattern).

2 The observations are independent: the observations are independent if there is no more than one pair of observations on each individual.”

This is not good. Arguably the independence assumption is, in some contexts, best conceived of as an assumption that is untestable in practice, but regardless of whether it ‘really’ is or not, there are a lot of ways in which this assumption may be violated, and observations not being derived from the same individual is not a sufficient condition for establishing independence. Assuming otherwise is potentially really problematic.
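One way this can bite: one observation per individual, but data collected over time with an error process that drifts. A small simulation (Python, made-up data) where the book’s criterion is satisfied yet the residuals are clearly not independent:

```python
import numpy as np

rng = np.random.default_rng(1)

# One observation per "individual", but the errors drift over time
# (e.g. instrument calibration drift): independence fails even though
# no individual appears twice.
n = 200
x = np.linspace(0, 10, n)
drift = np.cumsum(rng.normal(0, 0.3, n))   # autocorrelated error process
y = 1.0 + 2.0 * x + drift

# Ordinary least-squares fit and residuals
X = np.column_stack([np.ones(n), x])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ beta

# Lag-1 autocorrelation of the residuals; near 0 would be consistent
# with independence, but here it is far from 0.
r1 = np.corrcoef(resid[:-1], resid[1:])[0, 1]
```

Plotting residuals against collection order (or computing a lag-1 autocorrelation as above) catches this; the book’s “no more than one pair of observations per individual” rule does not.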

Here’s another example:

“**Some words of comfort**

Do not worry if you find the theory underlying probability distributions complex. Our experience demonstrates that you want to know only when and how to use these distributions. We have therefore outlined the essentials, and omitted the equations that define the probability distributions. You will find that you only need to be familiar with the basic ideas, the terminology and, perhaps (although infrequently in this computer age), know how to refer to the tables.”

I found this part problematic. If you want to do hypothesis testing using things like the Chi-squared distribution or the F-test (both ‘covered’, sort of, in the book), you need to be really careful about details like the relevant degrees of freedom and how these may depend on what you’re doing with the data, and stuff like this is sometimes not obvious – not even to people who’ve worked with the equations (well, sometimes it is obvious, but it’s easy to forget to correct for estimated parameters and you can’t always expect the program to do this for you, especially not in more complex model frameworks). My position is that if you’ve never even seen the relevant equations, you have no business conducting anything but the most basic of analyses involving these distributions. Of course a person who’s only read this book would not be able to do more than that, but even so instead of ‘some words of comfort’ I’d much rather have seen ‘some words of caution’.
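To illustrate the degrees-of-freedom point with a concrete (made-up) example: a chi-squared goodness-of-fit test against a Poisson distribution whose mean was estimated from the same data. Each estimated parameter costs a degree of freedom, and in e.g. scipy you have to tell the test about this yourself via `ddof` – the software will not guess it for you:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)

# Goodness-of-fit test where a parameter (the Poisson mean) is estimated
# from the data: each estimated parameter costs one degree of freedom.
data = rng.poisson(2.0, 500)
lam_hat = data.mean()                      # estimated from the sample

# Observed counts for outcomes 0..5, lumping 6+ into a final bin
k = 6
observed = np.array([(data == i).sum() for i in range(k)]
                    + [(data >= k).sum()])
probs = np.array([stats.poisson.pmf(i, lam_hat) for i in range(k)]
                 + [1 - stats.poisson.cdf(k - 1, lam_hat)])
expected = probs * data.size

# ddof=1 reduces the degrees of freedom from (bins - 1) to (bins - 1 - 1)
# to account for the estimated mean; forgetting it makes the test
# anti-conservative in ways that are easy to miss.
chi2, p = stats.chisquare(observed, expected, ddof=1)
```

Someone who has never seen the defining equations has no way of knowing that the default degrees of freedom are wrong here – which is exactly why ‘some words of caution’ would have been more appropriate.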

One last one:

“**Error checking**

* **Categorical data** – It is relatively easy to check categorical data, as the responses for each variable can only take one of a number of limited values. Therefore, values that are not allowable must be errors.”

Nothing else is said about error checking of categorical data in this specific context, so it would be natural to assume from reading this that simply checking whether values are ‘allowable’ or not is sufficient to catch all the errors. But this is a largely uninformative statement, as a key term remains undefined – neglected is the question of how to define (observation-specific) ‘allowability’ in the first place, which is the real issue; a proper error-finding algorithm should apply a precise and unambiguous definition of this term, and constructing/applying (perhaps implicitly) such an algorithm is likely to sometimes be quite hard, especially when multiple categories are used and allowed and the category dimension in question is hard to cross-check against other variables. Reading the above passage, it’d be easy for the reader to assume that this is all very simple and easy.
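A sketch of what I mean (Python; the field names and rules are entirely made up for illustration, not from the book). Plain set membership catches ‘impossible’ codes, but it takes observation-level cross-checks to catch values that are allowable in isolation yet inconsistent with other fields:

```python
# Hypothetical record check. Membership tests alone implement the book's
# notion of 'allowable'; the cross-field rule shows why that notion is
# observation-specific.
ALLOWED = {
    "sex": {"M", "F"},
    "smoker": {"yes", "no", "ex"},
    "pregnant": {"yes", "no", "n/a"},
}

def find_errors(record):
    errors = []
    # Simple membership check: flags codes outside the allowed set
    for field, allowed in ALLOWED.items():
        if record.get(field) not in allowed:
            errors.append(f"{field}: value {record.get(field)!r} not allowable")
    # Cross-field check: 'pregnant: yes' is allowable on its own,
    # but not consistent with 'sex: M'
    if record.get("sex") == "M" and record.get("pregnant") == "yes":
        errors.append("pregnant inconsistent with sex")
    return errors

errs = find_errors({"sex": "M", "smoker": "yse", "pregnant": "yes"})
```

The typo `"yse"` is caught by the membership check, but the pregnant/sex conflict is only caught because someone sat down and encoded that rule – and real datasets need many such rules, which is where the actual work is.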

…

Oh well, all this said the book did have some good stuff as well. I’ve added some further comments and observations from the book below, with which I did not ‘disagree’ (to the extent that this is even possible). It should be noted that the book has a lot of focus on hypothesis testing and (/how to conduct) different statistical tests, and very little about statistical *modelling*. Many different tests are mentioned or explicitly covered in the book, which aside from e.g. standard z-, t- and F-tests also include things like McNemar’s test, Bartlett’s test, the sign test, and the Wilcoxon rank-sum test, most of which were covered – I realized after having read the book – in the last part of the first statistics text I read, a part I was not required to study and so technically hadn’t read. So I did come across some new stuff while reading the book. Those specific parts were actually some of the parts of the book I liked best, because they contained stuff I didn’t already know, and not just stuff which I used to know but had forgotten about. The few additional quotes added below do to some extent illustrate what the book is like, but it should also be kept in mind that they’re perhaps not completely ‘fair’, in a way, in terms of providing a balanced and representative sample of the kind of stuff included in the publication; there are many (but perhaps not enough…) equations along the way (which I’m not going to blog, for reasons already mentioned), and the book includes detailed explanations and illustrations of how to conduct specific tests – it’s quite ‘hands-on’ in some respects, and a lot of tools will be added to the toolbox of someone who’s not read a similar publication before.

…

“Generally, we make comparisons between individuals in different groups. For example, most clinical trials (Topic 14) are **parallel** trials, in which each patient receives one of the two (or occasionally more) treatments that are being compared, i.e. they result in *between-individual* comparisons.

Because there is usually less variation in a measurement within an individual than between different individuals (Topic 6), in some situations it may be preferable to consider using each individual as his/her own control. These *within-individual* comparisons provide more precise comparisons than those from between-individual designs, and fewer individuals are required for the study to achieve the same level of precision. In a clinical trial setting, the **crossover design** is an example of a within-individual comparison; if there are two treatments, every individual gets each treatment, one after the other in a random order to eliminate any effect of calendar time. The treatment periods are separated by a **washout period**, which allows any residual effects (**carry-over**) of the previous treatment to dissipate. We analyse the difference in the responses on the two treatments for each individual. This design can only be used when the treatment temporarily alleviates symptoms rather than provides a cure, and the response time is not prolonged.”
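The precision gain they describe is easy to demonstrate with a small simulation (Python, made-up numbers): large between-individual variation, small within-individual noise, and the same data analysed both ways:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)

# Within-individual comparison as in a two-treatment crossover:
# large between-individual variation, small within-individual noise.
n = 20
individual = rng.normal(100, 15, n)          # each person's baseline level
effect = 5.0                                  # true treatment effect
resp_a = individual + rng.normal(0, 3, n)     # response on treatment A
resp_b = individual + effect + rng.normal(0, 3, n)  # response on treatment B

# Analysing per-individual differences removes the between-individual
# variation, which is what buys the extra precision.
t_paired, p_paired = stats.ttest_rel(resp_b, resp_a)

# Treating the same data as two independent groups leaves the large
# between-individual variance in the comparison.
t_unpaired, p_unpaired = stats.ttest_ind(resp_b, resp_a)
```

With these (made-up) variance components, the paired analysis detects the effect easily while the unpaired comparison of the same responses is swamped by between-individual variation.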

“A cohort study takes a group of individuals and usually follows them forward in time, the aim being to study whether exposure to a particular aetiological factor will affect the incidence of a disease outcome in the future […]

**Advantages of cohort studies**

* The time sequence of events can be assessed.

* They can provide information on a wide range of outcomes.

* It is possible to measure the incidence/risk of disease directly.

* It is possible to collect very detailed information on exposure to a wide range of factors.

* It is possible to study exposure to factors that are rare.

* Exposure can be measured at a number of time points, so that changes in exposure over time can be studied.

* There is reduced **recall** and **selection bias** compared with case-control studies (Topic 16).

**Disadvantages of cohort studies**

* In general, cohort studies follow individuals for long periods of time, and are therefore costly to perform.

* Where the outcome of interest is rare, a very large sample size is needed.

* As follow-up increases, there is often increased loss of patients as they migrate or leave the study, leading to biased results.

* As a consequence of the long time-scale, it is often difficult to maintain consistency of measurements and outcomes over time. […]

* It is possible that disease outcomes and their probabilities, or the aetiology of disease itself, may change over time.”

“A case-control study compares the characteristics of a group of patients with a particular disease outcome (the **cases**) to a group of individuals without a disease outcome (the **controls**), to see whether any factors occurred more or less frequently in the cases than the controls […] Many case-control studies are **matched** in order to select cases and controls who are as similar as possible. In general, it is useful to sex-match individuals (i.e. if the case is male, the control should also be male), and, sometimes, patients will be age-matched. However, it is important not to match on the basis of the risk factor of interest, or on any factor that falls within the causal pathway of the disease, as this will remove the ability of the study to assess any relationship between the risk factor and the disease. Unfortunately, matching [means] that the effect on disease of the variables that have been used for matching cannot be studied.”

“**Advantages of case-control studies**

quick, cheap and easy […] particularly suitable for rare diseases. […] A wide range of risk factors can be investigated. […] no loss to follow-up.

**Disadvantages of case-control studies**

Recall bias, when cases have a differential ability to remember certain details about their histories, is a potential problem. For example, a lung cancer patient may well remember the occasional period when he/she smoked, whereas a control may not remember a similar period. […] If the onset of disease preceded exposure to the risk factor, causation cannot be inferred. […] Case-control studies are not suitable when exposures to the risk factor are rare.”

“**The P-value is the probability of obtaining our results, or something more extreme, if the null hypothesis is true.** The null hypothesis relates to the population of interest, rather than the sample. Therefore, the null hypothesis is either true or false and we *cannot* interpret the P-value as the probability that the null hypothesis is true.”
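This is worth internalizing, and it’s easy to see in simulation (Python, made-up data): when the null is true, the P-value is approximately uniform, so about 5% of tests come out ‘significant’ at the 0.05 level purely by chance – the P-value describes the data given H0, not the probability of H0:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(5)

# 2000 two-sample t-tests on data where the null hypothesis is true
# (both groups drawn from the same distribution).
pvals = np.array([
    stats.ttest_ind(rng.normal(0, 1, 30), rng.normal(0, 1, 30)).pvalue
    for _ in range(2000)
])

# Under the null, P-values are (approximately) uniform on [0, 1],
# so roughly 5% fall below 0.05.
frac_below_05 = (pvals < 0.05).mean()
```

Note that in every one of these 2000 tests the null hypothesis is simply true; the scatter of P-values is entirely a property of the sampling, not of the hypothesis.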

“Hypothesis tests which are based on knowledge of the probability distributions that the data follow are known as **parametric tests**. Often data do not conform to the assumptions that underly these methods (Topic 32). In these instances we can use **non-parametric tests** (sometimes referred to as **distribution-free** tests, or **rank methods**). […] Non-parametric tests are particularly useful when the sample size is small […], and when the data are measured on a categorical scale. However, non-parametric tests are generally wasteful of information; consequently they have less power […] A number of factors have a direct bearing on power for a given test.

* The **sample size**: power increases with increasing sample size. […]

* The **variability of the observations**: power increases as the variability of the observations decreases […]

* The **effect of interest**: the power of the test is greater for larger effects. A hypothesis test thus has a greater chance of detecting a large real effect than a small one.

* The **significance level**: the power is greater if the significance level is larger”
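All four factors fall out of one simple formula. For a two-sided one-sample z-test, an approximate power is Φ(effect·√n/σ − z₍α/₂₎), and a few lines of Python (my sketch, not from the book) show each monotonicity directly:

```python
import numpy as np
from scipy import stats

# Approximate power of a two-sided one-sample z-test (ignoring the tiny
# contribution from the opposite tail):
#   power ≈ Phi(effect * sqrt(n) / sigma - z_{alpha/2})
def power(n, effect, sigma, alpha):
    z_crit = stats.norm.ppf(1 - alpha / 2)
    return stats.norm.cdf(effect * np.sqrt(n) / sigma - z_crit)

# Baseline: n=25 observations, effect 0.5, sd 1, alpha 0.05
base = power(n=25, effect=0.5, sigma=1.0, alpha=0.05)
```

Increasing n, shrinking sigma, enlarging the effect, or loosening alpha each raises `power(...)` above `base` – exactly the four bullet points above.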

“The statistical use of the word ‘regression’ derives from a phenomenon known as **regression to the mean**, attributed to Sir Francis Galton in 1889. He demonstrated that although tall fathers tend to have tall sons, the average height of the sons is less than that of their tall fathers. The average height of the sons has ‘regressed’ or ‘gone back’ towards the mean height of all the fathers in the population. So, on average, tall fathers have shorter (but still tall) sons and short fathers have taller (but still short) sons.

We observe regression to the mean in **screening** and in **clinical trials**, when a subgroup of patients may be selected for treatment because their levels of a certain variable, say cholesterol, are extremely high (or low). If the measurement is repeated some time later, the average value for the second reading for the subgroup is usually less than that of the first reading, tending towards (i.e. regressing to) the average of the age- and sex-matched population, irrespective of any treatment they may have received. Patients recruited into a clinical trial on the basis of a high cholesterol level on their first examination are thus likely to show a drop in cholesterol levels on average at their second examination, even if they remain untreated during this period.”
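This is one of those phenomena that’s easier to believe after you’ve simulated it (Python; the cholesterol-like numbers are invented). Two noisy measurements of the same underlying level, a selection on the first, and the second reading drops ‘on its own’:

```python
import numpy as np

rng = np.random.default_rng(4)

# Regression to the mean: two noisy measurements of the same underlying
# level; selecting on a high first reading guarantees that the second
# reading is lower on average, with no treatment at all.
n = 10_000
true_level = rng.normal(5.2, 0.6, n)         # underlying level (made-up units)
first = true_level + rng.normal(0, 0.4, n)   # first reading, with noise
second = true_level + rng.normal(0, 0.4, n)  # second reading, same person

selected = first > 6.5                        # "recruit" the extreme readings
drop = first[selected].mean() - second[selected].mean()
```

`drop` comes out clearly positive even though nothing was done to anyone between the two readings – which is why untreated control groups matter when recruiting on extreme values.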

“A systematic review is a formalized and stringent process of combining the information from all relevant studies (both published and unpublished) of the same health condition; these studies are usually clinical trials […] of the same or similar treatments but may be observational studies […] a meta-analysis, because of its inflated sample size, is able to detect treatment effects with **greater power** and estimate these effects with **greater precision** than any single study. Its advantages, together with the introduction of meta-analysis software, have led meta-analyses to proliferate. However, improper use can lead to erroneous conclusions regarding treatment efficacy. The following principal **problems** should be thoroughly investigated and resolved before a meta-analysis is performed.

* **Publication bias** – the tendency to include in the analysis only the results from published papers; these favour statistically significant findings.

* **Clinical heterogeneity** – in which differences in the patient population, outcome measures, definition of variables, and/or duration of follow-up of the studies included in the analysis create problems of non-compatibility.

* **Quality differences** – the design and conduct of the studies may vary in their quality. Although giving more weight to the better studies is one solution to this dilemma, any weighting system can be criticized on the grounds that it is arbitrary.

* **Dependence** – the results from studies included in the analysis may not be independent, e.g. when results from a study are published on more than one occasion.”
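The ‘greater precision’ claim is just inverse-variance weighting at work. A minimal fixed-effect pooling sketch (Python; the effect sizes and standard errors below are made up for illustration, and real meta-analyses also need the heterogeneity checks listed above):

```python
import numpy as np

# Fixed-effect (inverse-variance) meta-analysis: hypothetical effect
# estimates (e.g. log odds ratios) and standard errors from four studies.
effects = np.array([0.30, 0.45, 0.12, 0.38])
se = np.array([0.15, 0.20, 0.10, 0.25])

w = 1.0 / se**2                        # inverse-variance weights
pooled = np.sum(w * effects) / np.sum(w)
pooled_se = 1.0 / np.sqrt(np.sum(w))   # smaller than any single study's SE
```

The pooled standard error is necessarily smaller than the smallest single-study standard error, which is the extra precision the quote refers to – and also why the problems listed above (publication bias, dependence, etc.) are so dangerous: the machinery happily produces a very precise estimate of a biased quantity.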
