Model Selection and Multi-Model Inference (I)
“We wrote this book to introduce graduate students and research workers in various scientific disciplines to the use of information-theoretic approaches in the analysis of empirical data. These methods allow the data-based selection of a “best” model and a ranking and weighting of the remaining models in a pre-defined set. Traditional statistical inference can then be based on this selected best model. However, we now emphasize that information-theoretic approaches allow formal inference to be based on more than one model (multimodel inference). Such procedures lead to more robust inferences in many cases, and we advocate these approaches throughout the book. […] Information theory includes the celebrated Kullback–Leibler “distance” between two models (actually, probability distributions), and this represents a fundamental quantity in science. In 1973, Hirotugu Akaike derived an estimator of the (relative) expectation of Kullback–Leibler distance based on Fisher’s maximized log-likelihood. His measure, now called Akaike’s information criterion (AIC), provided a new paradigm for model selection in the analysis of empirical data. His approach, with a fundamental link to information theory, is relatively simple and easy to use in practice, but little taught in statistics classes and far less understood in the applied sciences than should be the case. […] We do not claim that the information-theoretic methods are always the very best for a particular situation. They do represent a unified and rigorous theory, an extension of likelihood theory, an important application of information theory, and they are objective and practical to employ across a very wide class of empirical problems. Inference from multiple models, or the selection of a single “best” model, by methods based on the Kullback–Leibler distance are almost certainly better than other methods commonly in use now (e.g., null hypothesis testing of various sorts, the use of R2, or merely the use of just one available model).
This is an applied book written primarily for biologists and statisticians using models for making inferences from empirical data. […] This book might be useful as a text for a course for students with substantial experience and education in statistics and applied data analysis. A second primary audience includes honors or graduate students in the biological, medical, or statistical sciences […] Readers should ideally have some maturity in the quantitative sciences and experience in data analysis. Several courses in contemporary statistical theory and methods as well as some philosophy of science would be particularly useful in understanding the material. Some exposure to likelihood theory is nearly essential”.
The above quotes are from the preface of the book, which I have so far only briefly talked about here; this post will provide a lot more details. Aside from writing the post in order to mentally process the material and obtain a greater appreciation of the points made in the book, I have also as a secondary goal tried to write the post in a manner so that people who are not necessarily experienced model-builders might also derive some benefit from the coverage. Whether or not I was successful in that respect I do not know – given the outline above, it should be obvious that there are limits as to how ‘readable’ you can make stuff like this to people without a background in a semi-relevant field. I don’t think I have written specifically about the application of information criteria in the model selection context before here on the blog, at least not in any amount of detail, but I have written about ‘model-stuff’ before, also in ‘meta-contexts’ not necessarily related to the application of models in economics; so if you’re interested in ‘this kind of stuff’ but you don’t feel like having a go at a post dealing with a book which includes word combinations like ‘the (relative) expectation of Kullback–Leibler distance based on Fisher’s maximized log-likelihood’ in the preface, you can for example have a look at posts like this, this, this and this. I have also discussed here on the blog some stuff somewhat related to the multi-model inference part, how you can combine the results of various models to get a bigger picture of what’s going on, in these posts – they approach ‘the topic’ (these are in fact separate topics…) in a very different manner than does this book, but some key ideas should presumably transfer. Having said all this, I should also point out that many of the basic points made in the coverage below should be relatively easy to understand, and I should perhaps repeat that I’ve tried to make this post readable to people who’re not too familiar with this kind of stuff. I have deliberately chosen to include no mathematical formulas in my coverage in this post. Please do not assume this is because the book does not contain mathematical formulas.
Before moving on to the main coverage I thought I’d add a note about the remark above that stuff like AIC is “little taught in statistics classes and far less understood in the applied sciences than should be the case”. The book was written a while back, and some things may have changed a bit since then. I have done coursework on the application of information criteria in model selection as it was a topic (briefly) covered in regression analysis(? …or an earlier course), so at least this kind of stuff is now being taught to students of economics where I study and has been for a while as far as I’m aware – meaning that coverage of such topics is probably reasonably widespread at least in this field. However I can hardly claim that I obtained a ‘great’ or ‘full’ understanding of the issues at hand from the work on these topics I did back then – and so I have only gradually, while reading this book, come to appreciate some of the deeper issues and tradeoffs involved in model selection. This could probably be taken as an argument that these topics are still ‘far less understood … than should be the case’ – and another, perhaps stronger, argument would be Seber’s comments in the last part of his book; if a statistician today may still ‘overlook’ information criteria when discussing model selection in a Springer text, it’s not hard to argue that the methods are perhaps not as well known as should ‘ideally’ be the case. It’s obvious from the coverage that a lot of people were not using the methods when the book was written, and I’m not sure things have changed as much as would be preferable since then.
What is the book about? A starting point for understanding the sort of questions the book deals with might be to consider the simple question: When we set out to model stuff empirically and we have different candidate models to choose from, how do we decide which of the models is ‘best’? There are a lot of other questions dealt with in the coverage as well. What does the word ‘best’ mean? We might worry over both the functional form of the model and which variables should be included in ‘the best’ model – do we need separate mechanisms for dealing with concerns about the functional form and concerns about variable selection, or can we deal with such things at the same time? How do we best measure the effect of a variable which we have access to and consider including in our model(s) – is it preferable to interpret the effect of a variable on an outcome based on the results you obtain from a ‘best model’ in the set of candidate models, or is it perhaps sometimes better to combine the results of multiple models (and for example take an average of the effects of the variable across multiple proposed models to be the best possible estimate) in the choice set (as should by now be obvious for people who’ve read along here, there are some sometimes quite close parallels between stuff covered in this book and stuff covered in Borenstein & Hedges)? If we’re not sure which model is ‘right’, how might we quantify our uncertainty about these matters – and what happens if we don’t try to quantify our uncertainty about which model is correct? What is bootstrapping, and how can we use Monte Carlo methods to help us with model selection? If we apply information criteria to choose among models, what do these criteria tell us, and which sort of issues are they silent about? Are some methods for deciding between models better than others in specific contexts – might it for example be a good idea to make criteria adjustments when faced with small sample sizes which makes it harder for us to rely on asymptotic properties of the criteria we apply? How might the sample size more generally relate to our decision criterion deciding which model might be considered ‘best’ – do we think that what might be considered to be ‘the best model’ might depend upon (‘should depend upon’?) how much data we have access to or not, and if how much data we have access to and the ‘optimal size of a model’ are related, how are the two related, and why? The questions included in the previous sentence relate to some fundamental differences between AIC (and similar measures) and BIC – but let’s not get ahead of ourselves. I may or may not go into details like these in my coverage of the book, but I certainly won’t cover stuff like that in this post. Some of the content is really technical: “Chapters 5 and 6 present more difficult material [than chapters 1-4] and some new research results. Few readers will be able to absorb the concepts presented here after just one reading of the material […] Underlying theory is presented in Chapter 7, and this material is much deeper and more mathematical.” – from the preface. The sample size considerations mentioned above relate to stuff covered in chapter 6. As you might already have realized, this book has a lot of stuff.
When dealing with models, one way to think about these things is to consider two in some sense separate issues: On the one hand we might think about which model is most appropriate (model selection), and on the other hand we might think about how best to estimate parameter values and variance-covariance matrices given a specific model. As the book points out early on, “if one assumes or somehow chooses a particular model, methods exist that are objective and asymptotically optimal for estimating model parameters and the sampling covariance structure, conditional on that model. […] The sampling distributions of ML [maximum likelihood] estimators are often skewed with small samples, but profile likelihood intervals or log-based intervals or bootstrap procedures can be used to achieve asymmetric confidence intervals with good coverage properties. In general, the maximum likelihood method provides an objective, omnibus theory for estimation of model parameters and the sampling covariance matrix, given an appropriate model.” The problem is that it’s not ‘a given’ that the model we’re working on is actually appropriate. That’s where model selection mechanisms enters the picture. Such methods can help us realize which of the models we’re considering might be the most appropriate one(s) to apply in the specific context (there are other things they can’t tell us, however – see below).
Below I have added some quotes from the book and some further comments:
“Generally, alternative models will involve differing numbers of parameters; the number of parameters will often differ by at least an order of magnitude across the set of candidate models. […] The more parameters used, the better the fit of the model to the data that is achieved. Large and extensive data sets are likely to support more complexity, and this should be considered in the development of the set of candidate models. If a particular model (parametrization) does not make biological [/’scientific’] sense, this is reason to exclude it from the set of candidate models, particularly in the case where causation is of interest. In developing the set of candidate models, one must recognize a certain balance between keeping the set small and focused on plausible hypotheses, while making it big enough to guard against omitting a very good a priori model. While this balance should be considered, we advise the inclusion of all models that seem to have a reasonable justification, prior to data analysis. While one must worry about errors due to both underfitting and overfitting, it seems that modest overfitting is less damaging than underfitting (Shibata 1989).” (The key word here is ‘modest’ – and please don’t take these authors to be in favour of obviously overfitted models and data dredging strategies; they spend quite a few pages criticizing such models/approaches!).
“It is not uncommon to see biologists collect data on 50–130 “ecological” variables in the blind hope that some analysis method and computer system will “find the variables that are significant” and sort out the “interesting” results […]. This shotgun strategy will likely uncover mainly spurious correlations […], and it is prevalent in the naive use of many of the traditional multivariate analysis methods (e.g., principal components, stepwise discriminant function analysis, canonical correlation methods, and factor analysis) found in the biological literature [and elsewhere, US]. We believe that mostly spurious results will be found using this unthinking approach […], and we encourage investigators to give very serious consideration to a well-founded set of candidate models and predictor variables (as a reduced set of possible prediction) as a means of minimizing the inclusion of spurious variables and relationships. […] Using AIC and other similar methods one can only hope to select the best model from this set; if good models are not in the set of candidates, they cannot be discovered by model selection (i.e., data analysis) algorithms. […] statistically we can infer only that a best model (by some criterion) has been selected, never that it is the true model. […] Truth and true models are not statistically identifiable from data.”
“It is generally a mistake to believe that there is a simple “true model” in the biological sciences and that during data analysis this model can be uncovered and its parameters estimated. Instead, biological systems [and other systems! – US] are complex, with many small effects, interactions, individual heterogeneity, and individual and environmental covariates (most being unknown to us); we can only hope to identify a model that provides a good approximation to the data available. The words “true model” represent an oxymoron, except in the case of Monte Carlo studies, whereby a model is used to generate “data” using pseudorandom numbers […] A model is a simplification or approximation of reality and hence will not reflect all of reality. […] While a model can never be “truth,” a model might be ranked from very useful, to useful, to somewhat useful to, finally, essentially useless. Model selection methods try to rank models in the candidate set relative to each other; whether any of the models is actually “good” depends primarily on the quality of the data and the science and a priori thinking that went into the modeling. […] Proper modeling and data analysis tell what inferences the data support, not what full reality might be […] Even if a “true model” did exist and if it could be found using some method, it would not be good as a fitted model for general inference (i.e., understanding or prediction) about some biological system, because its numerous parameters would have to be estimated from the finite data, and the precision of these estimated parameters would be quite low.”
A key concept in the context of model selection is the tradeoff between bias and variance in a model framework:
“If the fit is improved by a model with more parameters, then where should one stop? Box and Jenkins […] suggested that the principle of parsimony should lead to a model with “. . . the smallest possible number of parameters for adequate representation of the data.” Statisticians view the principle of parsimony as a bias versus variance tradeoff. In general, bias decreases and variance increases as the dimension of the model (K) increases […] The fit of any model can be improved by increasing the number of parameters […]; however, a tradeoff with the increasing variance must be considered in selecting a model for inference. Parsimonious models achieve a proper tradeoff between bias and variance. All model selection methods are based to some extent on the principle of parsimony […] The concept of parsimony and a bias versus variance tradeoff is very important.”
“we reserve the terms underfitted and overfitted for use in relation to a “best approximating model” […] Here, an underfitted model would ignore some important replicable (i.e., conceptually replicable in most other samples) structure in the data and thus fail to identify effects that were actually supported by the data. In this case, bias in the parameter estimators is often substantial, and the sampling variance is underestimated, both factors resulting in poor confidence interval coverage. Underfitted models tend to miss important treatment effects in experimental settings. Overfitted models, as judged against a best approximating model, are often free of bias in the parameter estimators, but have estimated (and actual) sampling variances that are needlessly large (the precision of the estimators is poor, relative to what could have been accomplished with a more parsimonious model). Spurious treatment effects tend to be identified, and spurious variables are included with overfitted models. […] The goal of data collection and analysis is to make inferences from the sample that properly apply to the population […] A paramount consideration is the repeatability, with good precision, of any inference reached. When we imagine many replicate samples, there will be some recognizable features common to almost all of the samples. Such features are the sort of inference about which we seek to make strong inferences (from our single sample). Other features might appear in, say, 60% of the samples yet still reflect something real about the population or process under study, and we would hope to make weaker inferences concerning these. Yet additional features appear in only a few samples, and these might be best included in the error term (σ2) in modeling. If one were to make an inference about these features quite unique to just the single data set at hand, as if they applied to all (or most all) samples (hence to the population), then we would say that the sample is overfitted by the model (we have overfitted the data). Conversely, failure to identify the features present that are strongly replicable over samples is underfitting. […] A best approximating model is achieved by properly balancing the errors of underfitting and overfitting.”
Model selection bias is a key concept in the model selection context, and I think this problem is quite similar/closely related to problems encountered in a meta-analytical context which I believe I’ve discussed before here on the blog (see links above to the posts on meta-analysis) – if I’ve understood these authors correctly, one might choose to think of publication bias issues as partly the result of model selection bias issues. Let’s for a moment pretend you have a ‘true model’ which includes three variables (in the book example there are four, but I don’t think you need four…); one is very important, one is a sort of ‘60% of the samples variable’ mentioned above, and the last one would be a variable we might prefer to just include in the error term. Now the problem is this: When people look at samples where the last one of these variables is ‘seen to matter’, the effect size of this variable will be biased away from zero (they don’t explain where this bias comes from in the book, but I’m reasonably sure this is a result of the probability of identification/inclusion of the variable in the model depending on the (‘local’/’sample’) effect size; the bigger the effect size of a specific variable in a specific sample, the more likely the variable is to be identified as important enough to be included in the model – Bohrenstein and Hedges talked about similar dynamics, for obvious reasons, and I think their reasoning ‘transfers’ to this situation and is applicable here as well). When models include variables such as the last one, you’ll have model selection bias: “When predictor variables [like these] are included in models, the associated estimator for a σ2 is negatively biased and precision is exaggerated. These two types of bias are called model selection bias”. Much later in the book they incidentally conclude that: “The best way to minimize model selection bias is to reduce the number of models fit to the data by thoughtful a priori model formulation.”
“Model selection has most often been viewed, and hence taught, in a context of null hypothesis testing. Sequential testing has most often been employed, either stepup (forward) or stepdown (backward) methods. Stepwise procedures allow for variables to be added or deleted at each step. These testing-based methods remain popular in many computer software packages in spite of their poor operating characteristics. […] Generally, hypothesis testing is a very poor basis for model selection […] There is no statistical theory that supports the notion that hypothesis testing with a fixed α level is a basis for model selection. […] Tests of hypotheses within a data set are not independent, making inferences difficult. The order of testing is arbitrary, and differing test order will often lead to different final models. [This is incidentally one, of several, key differences between hypothesis testing approaches and information theoretic approaches: “The order in which the information criterion is computed over the set of models is not relevant.”] […] Model selection is dependent on the arbitrary choice of α, but α should depend on both n and K to be useful in model selection”.
No comments yet.