
Evidence-Based Diagnosis

“Evidence-Based Diagnosis is a textbook about diagnostic, screening, and prognostic tests in clinical medicine. The authors’ approach is based on many years of experience teaching physicians in a clinical research training program. Although requiring only a minimum of mathematics knowledge, the quantitative discussions in this book are deeper and more rigorous than those in most introductory texts. […] It is aimed primarily at clinicians, particularly those who are academically minded, but it should be helpful and accessible to anyone involved with selection, development, or marketing of diagnostic, screening, or prognostic tests. […] Our perspective is that of skeptical consumers of tests. We want to make proper diagnoses and not miss treatable diseases. Yet, we are aware that vast resources are spent on tests that too frequently provide wrong answers or right answers of little value, and that new tests are being developed, marketed, and sold all the time, sometimes with little or no demonstrable or projected benefit to patients. This book is intended to provide readers with the tools they need to evaluate these tests, to decide if and when they are worth doing, and to interpret the results.”

I simply could not possibly justify not giving this book a shot considering the Amazon ratings – it has an insane average rating of five stars, based on nine ratings. I agree with the reviewers: this is a really nice book. It covers a lot of stuff I’ve seen before, e.g. in Fletcher and Fletcher, Petrie and Sabin, Juth and Munthe, Borenstein, Hedges et al., Adam, Baltussen et al. (listing all of these suddenly made me realize how much I’ve actually read about these sorts of topics in the past…), as well as in stats courses I’ve taken, but as the book focuses specifically on medical testing there is also a lot of new material. It should be noted that some people will benefit a lot more from reading the book than I did; I’ve spent weeks dealing with related aspects of subtopics the authors cover in just a few pages, and there were a lot of familiar concepts, distinctions, etc. in the book. Even so, the book is remarkably well written and the authors really know their stuff. If you want a book about the basics of how to make sense of the results of medical tests, this is the one to read.

Let’s say you have a test measuring some variable which might be useful in a diagnostic context. How would we know it might be useful? Well, one might come up with some criteria such a test should meet; for example that the result of the test doesn’t depend on who’s doing the testing, and perhaps that it also shouldn’t matter when the test is done. You might also want the test to be somewhat accurate. But what do we even mean by that? There are various approaches to thinking about accuracy, and some may be better than others. So the book covers familiar topics like sensitivity and specificity, likelihood ratios, and receiver operating characteristic (ROC) curves.

A test might be accurate, but if the results of the test do not change clinical decision-making it might not be worth doing; so the question of whether a test is accurate is different from the question of whether it is also useful. In terms of usefulness, concepts like positive and negative predictive value and distinctions such as that between absolute and relative risk become important. It might not be a good idea to use a test even if it distinguishes reasonably well between people who are sick and people who are not, because a very accurate test might be too expensive to justify; the book also has a bit of material on cost-effectiveness. Of course the costs associated with getting tested for a health condition are not limited to monetary costs; a test might be uncomfortable, and a false positive or a false negative result might sometimes have quite severe consequences (e.g. in the context of cancer screening). In such contexts concepts like the number needed to treat can be useful. On the other hand, a test might give answers which are wrong so often that even if it’s very cheap to do, it still might not be worth doing.

The book has material on how to think about, and come up with decision rules for identifying, treatment thresholds; these are determined by the probability of disease and the costs associated with testing (and treatment). A variable like the cost of a treatment might in an analytical framework include the costs of treating people with the health condition, the costs of treating people who tested positive without being sick, and the costs of not treating sick people who tested negative. In one context one might think it twice as bad to miss a diagnosis as to treat someone who does not have the disease, which would lead to one set of decision rules for when to test and when to treat, whereas in another context it might be far worse to miss a diagnosis, so we’d be less worried about treating someone without the disease. There may be more than one relevant threshold in the diagnostic setting; usually there’ll be some range of prior probabilities of disease for which the test adds enough information to change decision-making, but at either end of that range the test might not add enough information to be justified. To be more specific, if you’re almost certain the patient has some specific disease, you’ll want to treat him because the test result will not change anything; and if on the other hand you’re almost certain that he does not have the disease, based e.g. on the prevalence rate and the clinical presentation, then you’ll want to refrain from testing if the test has costs (including time costs, inconvenience, etc.).
The book includes formal and reasonably detailed analysis of such topics.
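
To make the threshold idea a bit more concrete, here is a minimal sketch in Python of how one might compute a treatment threshold from the relative harms of missing a diagnosis versus treating someone unnecessarily, and the range of prior probabilities over which a given test can actually change the treat/don't-treat decision. This is my own illustration, not code or numbers from the book, and all the inputs are made up.

```python
def odds(p):
    """Convert a probability to odds."""
    return p / (1 - p)

def prob(o):
    """Convert odds back to a probability."""
    return o / (1 + o)

# Made-up example: missing the disease is assumed to be twice as bad
# as treating someone who doesn't have it.
harm_missed = 2.0      # harm of not treating a diseased patient
harm_overtreat = 1.0   # harm of treating a non-diseased patient

# Treat whenever p * harm_missed > (1 - p) * harm_overtreat,
# i.e. whenever the probability of disease exceeds this threshold:
p_treat = harm_overtreat / (harm_missed + harm_overtreat)   # = 1/3 here

# A hypothetical test:
sensitivity, specificity = 0.80, 0.90
lr_pos = sensitivity / (1 - specificity)   # likelihood ratio of a positive result
lr_neg = (1 - sensitivity) / specificity   # likelihood ratio of a negative result

# Ignoring the cost of the test itself, the test can only change management
# for priors where a positive result pushes you above the treatment threshold
# or a negative result pulls you below it.
prior_low = prob(odds(p_treat) / lr_pos)    # below this: don't test, don't treat
prior_high = prob(odds(p_treat) / lr_neg)   # above this: treat without testing

print(f"treatment threshold: {p_treat:.2f}")
print(f"testing only changes management for priors between "
      f"{prior_low:.2f} and {prior_high:.2f}")
```

With these invented numbers the treatment threshold is 1/3, and the test only changes management for priors roughly between 6% and 69%; adding a cost for the test itself would narrow that band further.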

In terms of how to interpret the results of a test it matters who you’re testing, and as already indicated the authors apply a Bayesian approach to these matters and repeatedly emphasize the importance of priors when evaluating test results (or, for that matter, findings from the literature). In that context there are some important points about what you can and cannot use variables like prevalence and incidence for, how best to use such variables to inform decision-making, and how the study design might limit which variables are available to you for analysis (don’t try to estimate prevalence if you’re dealing with a case-control setup, where this variable is determined by the study design).
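
As a quick illustration of why the prior matters (again my own sketch, with invented test characteristics rather than anything from the book), here is how the predictive values of the same hypothetical test change with the pretest probability of disease:

```python
def predictive_values(prevalence, sensitivity, specificity):
    """Positive and negative predictive value for a given pretest probability."""
    tp = prevalence * sensitivity              # true positives
    fp = (1 - prevalence) * (1 - specificity)  # false positives
    fn = prevalence * (1 - sensitivity)        # false negatives
    tn = (1 - prevalence) * specificity        # true negatives
    return tp / (tp + fp), tn / (tn + fn)

# The same hypothetical test used in very different populations:
for prevalence in (0.50, 0.10, 0.01):
    ppv, npv = predictive_values(prevalence, sensitivity=0.80, specificity=0.90)
    print(f"pretest probability {prevalence:.0%}: PPV {ppv:.2f}, NPV {npv:.2f}")
```

With these made-up numbers the positive predictive value drops from roughly 0.89 at a 50% pretest probability to well under 0.10 at 1%, which is one reason a test that works fine in a specialist clinic may mostly generate false positives when applied to a low-risk population.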

Of course most medical tests don’t just give two results. Dichotomization keeps things simple, so that’s where the book starts out, but it doesn’t stop there. If you have a test involving a continuous variable then dichotomizing the results will throw away information and reduce the value of the test, much as collapsing continuous data into categories does in other contexts. It’s nevertheless sometimes useful to do it anyway, because you may be in a situation where you need to quickly and easily separate ‘normal’ from ‘abnormal’. Likelihood ratios are really useful in the context of multi-level tests. The LR for a test result is the probability of that result in a patient with the disease divided by the probability of the result in a patient without the disease. If you have many possible test results you’re not limited to two likelihood ratios; you’ll have as many likelihood ratios as there are result categories. Those likelihood ratios are useful because the LR of a given result in a multi-level test is equal to the slope of the ROC curve over the corresponding interval. The ROC curve displays the tradeoff between sensitivity (the true positive rate) and specificity (the true negative rate); each point on the curve represents a different cut-off for calling the test positive. Such curves are quite useful for figuring out whether a test adds information and how well it discriminates between patients with and without the condition. If you want to compare two tests that measure the same quantity and see how well they agree, Bland-Altman plots also seem to be useful tools.
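
Here is a small sketch of the interval-likelihood-ratio point, using invented counts rather than data from the book: for a test reported in several ordered categories, the likelihood ratio of each result category equals the slope of the corresponding segment of the ROC curve.

```python
# Invented counts for a test with four ordered result categories,
# listed from most to least abnormal.
categories  = ["very high", "high", "borderline", "normal"]
diseased    = [40, 30, 20, 10]   # patients with the disease
nondiseased = [ 5, 15, 30, 50]   # patients without the disease

n_d, n_nd = sum(diseased), sum(nondiseased)

# Interval likelihood ratio: P(this result | disease) / P(this result | no disease).
interval_lr = [(d / n_d) / (nd / n_nd) for d, nd in zip(diseased, nondiseased)]

# ROC points from cumulative counts: each possible cut-off for calling the
# test "positive" gives one (false positive rate, true positive rate) pair.
tpr, fpr, cum_d, cum_nd = [0.0], [0.0], 0, 0
for d, nd in zip(diseased, nondiseased):
    cum_d, cum_nd = cum_d + d, cum_nd + nd
    tpr.append(cum_d / n_d)
    fpr.append(cum_nd / n_nd)

# The slope of each ROC segment matches the interval LR of that category.
for i, cat in enumerate(categories):
    slope = (tpr[i + 1] - tpr[i]) / (fpr[i + 1] - fpr[i])
    print(f"{cat:>10}: interval LR {interval_lr[i]:.2f}, ROC segment slope {slope:.2f}")
```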

Sometimes the results of more than one test will be relevant to decision-making, and a key question to ask here is the extent to which tests, and test results, are independent. If tests are not independent, one should be careful about how to update the probability of disease based on a new laboratory finding, and about which conclusions can be drawn regarding the extent to which an additional test might or might not be necessary/useful/required. The book does not go into too much detail, but enough is said on this topic to make it clear that test dependence is a potential issue one should keep in mind when evaluating multiple test results. They do talk a bit about how to come up with decision-rules about which tests to prefer in situations where multiple interdependent tests are available for analysis.
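
Here is a hedged sketch of why independence matters when combining tests (my own toy numbers): if two tests really are conditionally independent given disease status, you can update with one likelihood ratio after the other; if they largely measure the same thing, doing so will overstate what the second test adds.

```python
def update(prior, lr):
    """Posterior probability after applying one likelihood ratio (Bayes via odds)."""
    posterior_odds = (prior / (1 - prior)) * lr
    return posterior_odds / (1 + posterior_odds)

prior = 0.20      # pretest probability of disease
lr_test_a = 5.0   # likelihood ratio of a positive result on test A
lr_test_b = 4.0   # likelihood ratio of a positive result on test B

p_after_a = update(prior, lr_test_a)
# Only valid if the tests are conditionally independent given disease status;
# if B mostly duplicates A, its effective LR after A is closer to 1 and this
# calculation overstates the posterior probability.
p_after_both = update(p_after_a, lr_test_b)

print(f"after a positive test A:                 {p_after_a:.2f}")
print(f"after positive A and B (if independent): {p_after_both:.2f}")
```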

[Figure: blind_trials]
Sometimes blinding is difficult. The book tells us that it’s particularly important to blind when outcomes are subjective (like pain), and when prognostic factors may affect treatment in the study setting.

Medical tests can be used for different things, and not all tests are equal. One important distinction they talk about in the book is that between diagnostic tests, which are done on sick people to figure out why they’re sick, and screening tests, which are mostly done on healthy people with a low prior probability of disease. There are different types of screening tests. One type is screening for symptomatic disease, which is sometimes done because people may be sick and have symptoms without being aware of the fact that they’re sick; screening for depression might be an example of this (and may even sometimes be cost-effective). These tests are reasonably similar to traditional diagnostic tests, and so can be evaluated in a similar manner. However most screening tests are of a different kind; they’re aimed at identifying risk factors rather than ‘actual disease’ (a third kind is screening for presymptomatic disease). This generally tends to make them harder to justify undertaking, for reasons covered in much greater detail in Juth and Munthe. There are other differences as well; concepts such as sensitivity and specificity are for example difficult to relate to screening tests aimed at identifying risk factors, as the goal of such tests is to predict incidence rather than to estimate prevalence, which often makes it hard to compare the test result against a ‘gold standard’ the way one usually would. I decided to include a few quotes from this part of the coverage:

“the general public tends to be supportive of screening programs. Part of this is wishful thinking. We would like to believe that bad things happen for a reason, and that there are things we can do to prevent them […]. We also tend to be much more swayed by stories of individual patients (either those whose disease was detected early or those in whom it was found “too late”) than by boring statistics about risks, costs, and benefits […]. Because, at least in the U.S., there is no clear connection between money spent on screening tests and money not being available to spend on other things, the public tends not to be swayed by arguments about cost efficacy […]. In fact, in the general public’s view of screening, even wrong answers are not necessarily a bad thing. Schwartz et al. (2004) did a national telephone survey of attitudes about cancer screening in the U.S. They found that 38% of respondents had experienced at least one false-positive screening test. Although more than 40% of these subjects referred to that experience as “very scary” or the “scariest time of my life,” 98% were glad they had the screening test! […] Another disturbing result of the survey by Schwartz et al. was that, even though (as of 2002) the U.S. Preventive Health Services Task Force felt that evidence was insufficient to recommend prostate cancer screening, more than 60% of respondents said that a 55-year-old man who did not have a routine PSA test was “irresponsible,” and more than a third said this for an 80-year old […] Thus, regardless of the efficacy of screening tests, they have become an obligation if one does not wish to be blamed for getting some illnesses.”

There are many reasons why there may be problems with using observational studies to evaluate screening tests, and they talk a bit about those. One is what they call ‘volunteer bias’, which is just basic selection bias. Then there are the familiar problems of lead-time bias and length-time bias. It should perhaps be noted here that both of the latter problems can be handled in the context of a randomized controlled trial; neither lead-time bias nor length-time bias is an issue if the study is an RCT which compares the entire screened group with the entire unscreened group. Yet another problem is stage-migration bias, which can arise when more sensitive tests allow for earlier detection and thereby change how people are staged; this may lead to apparent improvements in stage-specific mortality rates without actually improving overall mortality at all. A final problem they talk about is overdiagnosis, related to the problem of pseudodisease, which is disease that would never have affected the patient if it had not been diagnosed by the screening procedure. Again a quote might be in order:

“It is difficult to identify pseudodisease in an individual patient, because it requires completely ignoring the diagnosis. (If you treat pseudodisease, the treatment will always appear to be curative, and you won’t realize the patient had pseudodisease rather than real disease!) In some ways, pseudodisease is an extreme type of stage migration bias. Patients who were not previously diagnosed as having the disease are now counted as having it. Although the incidence of the disease goes up, the prognosis of those who have it improves. […] Lack of understanding of pseudodisease, including the lack of people who know they have had it, is a real problem, because most of us understand the world through stories […]. Patients whose pseudodisease has been “cured” become strong proponents of screening and treatment and can tell a powerful and easily understood story about their experience. On the other hand, there aren’t people who can tell a compelling story of pseudodisease – men who can say, “I had a completely unnecessary prostatectomy,” or women who say, “I had a completely unnecessary mastectomy,” even though we know statistically that many such people exist.
The existence of pseudo–lung cancer was strongly suggested by the results of the Mayo Lung Study, a randomized trial of chest x-rays and sputum cytology to screen for lung cancer among 9,211 male cigarette smokers (Marcus et al. 2000).”

I included the last part to indicate that this is a real problem even in situations where you’d be very likely to imagine it couldn’t possibly be one; even a disease as severe as lung cancer is subject to this kind of issue. There are also problems that may make screening tests look worse than they really are, like power issues, unskilled medical personnel doing the testing, and lack of follow-up: if a positive test result does not lead to any change in health care provision, there’s no good reason to assume earlier diagnosis as a result of screening will impact e.g. disease-specific mortality. On a related note, there’s some debate about which mortality metric (all-cause vs. disease-specific) is to be preferred in the screening context, and the book talks a bit about that as well.
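
To make the lead-time point from a couple of paragraphs back a bit more concrete, here is a toy simulation, entirely my own and with invented parameters: screening moves the date of diagnosis earlier without changing the date of death, so measured ‘survival from diagnosis’ gets longer even though nobody actually lives a day longer.

```python
import random

random.seed(0)
n = 10_000
screened_survival, unscreened_survival = [], []

for _ in range(n):
    # Invented natural history: the disease becomes clinically apparent at some
    # point, and death follows a few years later regardless of when it's found.
    clinical_dx = random.uniform(0.0, 10.0)         # years from disease onset
    death = clinical_dx + random.uniform(1.0, 3.0)  # death date fixed by biology here
    lead_time = random.uniform(0.0, 2.0)            # screening detects it this much earlier

    unscreened_survival.append(death - clinical_dx)
    screened_survival.append(death - (clinical_dx - lead_time))

print(f"mean survival from diagnosis, unscreened: {sum(unscreened_survival) / n:.2f} years")
print(f"mean survival from diagnosis, screened:   {sum(screened_survival) / n:.2f} years")
# Same deaths in both groups, longer apparent survival in the screened one:
# pure lead-time bias.
```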

I expected to write more about the book in this post than I have so far and perhaps include a few more quotes, but my computer broke down while I was writing this post yesterday, so this is what you get. However, as already mentioned, this is a great book, and if you think you might like it based on the observations included in this post you should definitely read it.

December 22, 2014 - Posted in Books, Cancer/oncology, Epidemiology, Health Economics, Medicine, Statistics
