Model Selection and Multi-Model Inference (II)
I haven’t blogged this book in anywhere near the detail it deserves, even though my first post about it included a few quotes illustrating how much ground the book covers.
The book is technical. I’m trying to make this post less technical by omitting the math, but it may still be a good idea to reread the first post about the book before reading this one, to refresh your memory.
Quotes and comments below – most of the coverage here focuses on stuff covered in chapters 3 and 4 in the book.
“Tests of null hypotheses and information-theoretic approaches should not be used together; they are very different analysis paradigms. A very common mistake seen in the applied literature is to use AIC to rank the candidate models and then “test” to see whether the best model (the alternative hypothesis) is “significantly better” than the second-best model (the null hypothesis). This procedure is flawed, and we strongly recommend against it […] the primary emphasis should be on the size of the treatment effects and their precision; too often we find a statement regarding “significance,” while the treatment and control means are not even presented. Nearly all statisticians are calling for estimates of effect size and associated precision, rather than test statistics, P-values, and “significance.” [Borenstein & Hedges certainly did as well in their book (written much later), and this was not an issue I omitted to talk about in my coverage of their book…] […] Information-theoretic criteria such as AIC, AICc, and QAICc are not a “test” in any sense, and there are no associated concepts such as test power or P-values or α-levels. Statistical hypothesis testing represents a very different, and generally inferior, paradigm for the analysis of data in complex settings. It seems best to avoid use of the word “significant” in reporting research results under an information-theoretic paradigm. […] AIC allows a ranking of models and the identification of models that are nearly equally useful versus those that are clearly poor explanations for the data at hand […]. Hypothesis testing provides no general way to rank models, even for models that are nested. […] In general, we recommend strongly against the use of null hypothesis testing in model selection.”
“The bootstrap is a type of Monte Carlo method used frequently in applied statistics. This computer-intensive approach is based on resampling of the observed data […] The fundamental idea of the model-based sampling theory approach to statistical inference is that the data arise as a sample from some conceptual probability distribution f. Uncertainties of our inferences can be measured if we can estimate f. The bootstrap method allows the computation of measures of our inference uncertainty by having a simple empirical estimate of f and sampling from this estimated distribution. In practical application, the empirical bootstrap means using some form of resampling with replacement from the actual data x to generate B (e.g., B = 1,000 or 10,000) bootstrap samples […] The set of B bootstrap samples is a proxy for a set of B independent real samples from f (in reality we have only one actual sample of data). Properties expected from replicate real samples are inferred from the bootstrap samples by analyzing each bootstrap sample exactly as we first analyzed the real data sample. From the set of results of sample size B we measure our inference uncertainties from sample to (conceptual) population […] For many applications it has been theoretically shown […] that the bootstrap can work well for large sample sizes (n), but it is not generally reliable for small n […], regardless of how many bootstrap samples B are used. […] Just as the analysis of a single data set can have many objectives, the bootstrap can be used to provide insight into a host of questions. For example, for each bootstrap sample one could compute and store the conditional variance–covariance matrix, goodness-of-fit values, the estimated variance inflation factor, the model selected, confidence interval width, and other quantities. Inference can be made concerning these quantities, based on summaries over the B bootstrap samples.”
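To make the resampling idea concrete, here is a minimal sketch of the empirical bootstrap for a single statistic. The data values, the function name `bootstrap_se`, and the use of a simple percentile interval are my own illustrative assumptions and do not come from the book.

```python
import random
import statistics

def bootstrap_se(data, estimator, B=1000, seed=0):
    """Bootstrap standard error and 95% percentile interval of estimator(data)."""
    rng = random.Random(seed)
    n = len(data)
    estimates = []
    for _ in range(B):
        # Resample n observations with replacement from the actual data.
        sample = [data[rng.randrange(n)] for _ in range(n)]
        # Analyze each bootstrap sample exactly as the real sample was analyzed.
        estimates.append(estimator(sample))
    estimates.sort()
    se = statistics.stdev(estimates)
    lower = estimates[int(0.025 * B)]
    upper = estimates[int(0.975 * B) - 1]
    return se, (lower, upper)

# Hypothetical data, for illustration only.
data = [2.1, 3.4, 1.9, 5.0, 4.2, 3.3, 2.8, 4.7, 3.9, 3.1]
se, (lower, upper) = bootstrap_se(data, statistics.mean, B=2000)
print(f"bootstrap SE = {se:.3f}, 95% interval = ({lower:.3f}, {upper:.3f})")
```

With only ten observations this is exactly the small-n situation the authors warn about, so the numbers should be read as a demonstration of the mechanics rather than as a reliable interval.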
“Information criteria attempt only to select the best model from the candidate models available; if a better model exists, but is not offered as a candidate, then the information-theoretic approach cannot be expected to identify this new model. Adjusted R² […] are useful as a measure of the proportion of the variation “explained,” [but] are not useful in model selection […] adjusted R² is poor in model selection; its usefulness should be restricted to description.”
“As we have struggled to understand the larger issues, it has become clear to us that inference based on only a single best model is often relatively poor for a wide variety of substantive reasons. Instead, we increasingly favor multimodel inference: procedures to allow formal statistical inference from all the models in the set. […] Such multimodel inference includes model averaging, incorporating model selection uncertainty into estimates of precision, confidence sets on models, and simple ways to assess the relative importance of variables.”
“If sample size is small, one must realize that relatively little information is probably contained in the data (unless the effect size is very substantial), and the data may provide few insights of much interest or use. Researchers routinely err by building models that are far too complex for the (often meager) data at hand. They do not realize how little structure can be reliably supported by small amounts of data that are typically “noisy.””
“Sometimes, the selected model [when applying an information criterion] contains a parameter that is constant over time, or areas, or age classes […]. This result should not imply that there is no variation in this parameter, rather that parsimony and its bias/variance tradeoff finds the actual variation in the parameter to be relatively small in relation to the information contained in the sample data. It “costs” too much in lost precision to add estimates of all of the individual θi. As the sample size increases, then at some point a model with estimates of the individual parameters would likely be favored. Just because a parsimonious model contains a parameter that is constant across strata does not mean that there is no variation in that process across the strata.”
“[In a significance testing context,] a significant test result does not relate directly to the issue of what approximating model is best to use for inference. One model selection strategy that has often been used in the past is to do likelihood ratio tests of each structural factor […] and then use a model with all the factors that were “significant” at, say, α = 0.05. However, there is no theory that would suggest that this strategy would lead to a model with good inferential properties (i.e., small bias, good precision, and achieved confidence interval coverage at the nominal level). […] The purpose of the analysis of empirical data is not to find the “true model”— not at all. Instead, we wish to find a best approximating model, based on the data, and then develop statistical inferences from this model. […] We search […] not for a “true model,” but rather for a parsimonious model giving an accurate approximation to the interpretable information in the data at hand. Data analysis involves the question, “What level of model complexity will the data support?” and both under- and overfitting are to be avoided. Larger data sets tend to support more complex models, and the selection of the size of the model represents a tradeoff between bias and variance.”
“The easy part of the information-theoretic approaches includes both the computational aspects and the clear understanding of these results […]. The hard part, and the one where training has been so poor, is the a priori thinking about the science of the matter before data analysis — even before data collection. It has been too easy to collect data on a large number of variables in the hope that a fast computer and sophisticated software will sort out the important things — the “significant” ones […]. Instead, a major effort should be mounted to understand the nature of the problem by critical examination of the literature, talking with others working on the general problem, and thinking deeply about alternative hypotheses. Rather than “test” dozens of trivial matters (is the correlation zero? is the effect of the lead treatment zero? are ravens pink?, Anderson et al. 2000), there must be a more concerted effort to provide evidence on meaningful questions that are important to a discipline. This is the critical point: the common failure to address important science questions in a fully competent fashion. […] “Let the computer find out” is a poor strategy for researchers who do not bother to think clearly about the problem of interest and its scientific setting. The sterile analysis of “just the numbers” will continue to be a poor strategy for progress in the sciences.
Researchers often resort to using a computer program that will examine all possible models and variables automatically. Here, the hope is that the computer will discover the important variables and relationships […] The primary mistake here is a common one: the failure to posit a small set of a priori models, each representing a plausible research hypothesis.”
“Model selection is most often thought of as a way to select just the best model, then inference is conditional on that model. However, information-theoretic approaches are more general than this simplistic concept of model selection. Given a set of models, specified independently of the sample data, we can make formal inferences based on the entire set of models. […] Part of multimodel inference includes ranking the fitted models from best to worst […] and then scaling to obtain the relative plausibility of each fitted model (gi) by a weight of evidence (wi) relative to the selected best model. Using the conditional sampling variance […] from each model and the Akaike weights […], unconditional inferences about precision can be made over the entire set of models. Model-averaged parameter estimates and estimates of unconditional sampling variances can be easily computed. Model selection uncertainty is a substantial subject in its own right, well beyond just the issue of determining the best model.”
“There are three general approaches to assessing model selection uncertainty: (1) theoretical studies, mostly using Monte Carlo simulation methods; (2) the bootstrap applied to a given set of data; and (3) utilizing the set of AIC differences (i.e., ∆i) and model weights wi from the set of models fit to data.”
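For the third approach, the arithmetic is short. The AIC difference for model i is its AIC minus the smallest AIC in the set, and the Akaike weight is exp(−∆i/2) divided by the sum of that quantity over all models. A minimal sketch follows; the AIC values and the function name `akaike_weights` are made up for illustration.

```python
import math

def akaike_weights(aic_values):
    """Return (deltas, weights) for a list of AIC values, one per model."""
    best = min(aic_values)
    # Delta_i: how much worse each model is than the best one in the set.
    deltas = [aic - best for aic in aic_values]
    # Relative likelihood of each model given the data.
    rel_likelihoods = [math.exp(-0.5 * d) for d in deltas]
    total = sum(rel_likelihoods)
    # Normalize so that the weights sum to 1 over the candidate set.
    weights = [r / total for r in rel_likelihoods]
    return deltas, weights

# Hypothetical AIC values for three candidate models.
aic_values = [100.0, 102.0, 110.0]
deltas, weights = akaike_weights(aic_values)
for d, w in zip(deltas, weights):
    print(f"delta = {d:5.1f}   weight = {w:.3f}")
```

The same function works unchanged if AICc or QAICc values are passed in, since only the differences within the set matter.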
“Statistical science should emphasize estimation of parameters and associated measures of estimator uncertainty. Given a correct model […], an MLE is reliable, and we can compute a reliable estimate of its sampling variance and a reliable confidence interval […]. If the model is selected entirely independently of the data at hand, and is a good approximating model, and if n is large, then the estimated sampling variance is essentially unbiased, and any appropriate confidence interval will essentially achieve its nominal coverage. This would be the case if we used only one model, decided on a priori, and it was a good model, g, of the data generated under truth, f. However, even when we do objective, data-based model selection (which we are advocating here), the [model] selection process is expected to introduce an added component of sampling uncertainty into any estimated parameter; hence classical theoretical sampling variances are too small: They are conditional on the model and do not reflect model selection uncertainty. One result is that conditional confidence intervals can be expected to have less than nominal coverage.”
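One way to see how model selection uncertainty enters the precision estimate is the model-averaged estimator with its unconditional standard error. The sketch below uses the estimator as I understand it from the book: a weighted sum of the per-model estimates, with a standard error that adds each model's squared deviation from the average to its conditional variance. The numbers are invented, and the weights are assumed to have been computed already.

```python
import math

def model_average(estimates, cond_variances, weights):
    """Model-averaged estimate and unconditional standard error.

    estimates      -- estimate of the same parameter from each model
    cond_variances -- sampling variance of each estimate, conditional on its model
    weights        -- model weights summing to 1 (e.g. Akaike weights)
    """
    theta_bar = sum(w * t for w, t in zip(weights, estimates))
    # Each term combines within-model variance with between-model spread.
    se = sum(
        w * math.sqrt(v + (t - theta_bar) ** 2)
        for w, t, v in zip(weights, estimates, cond_variances)
    )
    return theta_bar, se

# Hypothetical values for three models.
estimates = [1.0, 1.4, 0.6]
cond_variances = [0.04, 0.05, 0.09]
weights = [0.6, 0.3, 0.1]
theta_bar, se = model_average(estimates, cond_variances, weights)
print(f"model-averaged estimate = {theta_bar:.3f}, unconditional SE = {se:.3f}")
print(f"conditional SE of best model = {math.sqrt(cond_variances[0]):.3f}")
```

In this made-up example the unconditional standard error comes out larger than the conditional one from the top-weighted model, which is the direction of the effect the quote describes.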
“Data analysis is sometimes focused on the variables to include versus exclude in the selected model (e.g., important vs. unimportant). Variable selection is often the focus of model selection for linear or logistic regression models. Often, an investigator uses stepwise analysis to arrive at a final model, and from this a conclusion is drawn that the variables in this model are important, whereas the other variables are not important. While common, this is poor practice and, among other issues, fails to fully consider model selection uncertainty. […] Estimates of the relative importance of predictor variables xj can best be made by summing the Akaike weights across all the models in the set where variable j occurs. Thus, the relative importance of variable j is reflected in the sum w+ (j). The larger the w+ (j) the more important variable j is, relative to the other variables. Using the w+ (j), all the variables can be ranked in their importance. […] This idea extends to subsets of variables. For example, we can judge the importance of a pair of variables, as a pair, by the sum of the Akaike weights of all models that include the pair of variables. […] To summarize, in many contexts the AIC selected best model will include some variables and exclude others. Yet this inclusion or exclusion by itself does not distinguish differential evidence for the importance of a variable in the model. The model weights […] summed over all models that include a given variable provide a better weight of evidence for the importance of that variable in the context of the set of models considered.” [The reason why I’m not telling you how to calculate Akaike weights is that I don’t want to bother with math formulas in wordpress – but I guess all you need to know is that these are not hard to calculate. It should perhaps be added that one can also use bootstrapping methods to obtain relevant model weights to apply in a multimodel inference context.]
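The summing step for variable importance is simple enough to show without formulas. In this minimal sketch the model set, the variable names, and the weights are all hypothetical; the weights are taken as given and assumed to sum to 1.

```python
def relative_importance(models, weights):
    """Sum the model weights over every model in which each variable occurs."""
    importance = {}
    for variables, w in zip(models, weights):
        for v in variables:
            importance[v] = importance.get(v, 0.0) + w
    return importance

def subset_importance(models, weights, subset):
    """Sum the weights of all models that contain every variable in subset."""
    subset = set(subset)
    return sum(w for variables, w in zip(models, weights) if subset <= set(variables))

# Hypothetical candidate set: each model is the set of predictors it includes.
models = [{"x1", "x2"}, {"x1"}, {"x2", "x3"}, {"x3"}]
weights = [0.5, 0.3, 0.15, 0.05]

importance = relative_importance(models, weights)
for name in sorted(importance, key=importance.get, reverse=True):
    print(f"{name}: {importance[name]:.2f}")
print(f"pair (x1, x2): {subset_importance(models, weights, ['x1', 'x2']):.2f}")
```

Note that the comparison is only fair if each variable appears in a similar number of candidate models; a variable included in most of the set gets a head start.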
“If data analysis relies on model selection, then inferences should acknowledge model selection uncertainty. If the goal is to get the best estimates of a set of parameters in common to all models (this includes prediction), model averaging is recommended. If the models have definite, and differing, interpretations as regards understanding relationships among variables, and it is such understanding that is sought, then one wants to identify the best model and make inferences based on that model. […] The bootstrap provides direct, robust estimates of model selection probabilities πi , but we have no reason now to think that use of bootstrap estimates of model selection probabilities rather than use of the Akaike weights will lead to superior unconditional sampling variances or model-averaged parameter estimators. […] Be mindful of possible model redundancy. A carefully thought-out set of a priori models should eliminate model redundancy problems and is a central part of a sound strategy for obtaining reliable inferences. […] Results are sensitive to having demonstrably poor models in the set of models considered; thus it is very important to exclude models that are a priori poor. […] The importance of a small number (R) of candidate models, defined prior to detailed analysis of the data, cannot be overstated. […] One should have R much smaller than n. MMI [Multi-Model Inference] approaches become increasingly important in cases where there are many models to consider.”
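The bootstrap estimate of a model selection probability is just the fraction of bootstrap samples in which that model comes out best. Here is a rough sketch under strong simplifying assumptions of my own: only two candidate models (a constant mean and a straight line), least-squares fits, AIC computed as n·log(RSS/n) + 2K with K counting the residual variance as a parameter, and simulated data. With n this small, AICc would arguably be the better criterion; plain AIC is used to keep the code short.

```python
import math
import random

def aic_least_squares(rss, n, k):
    # The floor on rss guards against log(0) in a degenerate resample.
    return n * math.log(max(rss, 1e-12) / n) + 2 * k

def aic_constant(xs, ys):
    n = len(ys)
    mean = sum(ys) / n
    rss = sum((y - mean) ** 2 for y in ys)
    return aic_least_squares(rss, n, k=2)  # mean + residual variance

def aic_linear(xs, ys):
    n = len(ys)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx if sxx > 0 else 0.0  # all-identical x in a resample
    intercept = my - slope * mx
    rss = sum((y - intercept - slope * x) ** 2 for x, y in zip(xs, ys))
    return aic_least_squares(rss, n, k=3)  # intercept + slope + residual variance

CANDIDATES = {"constant": aic_constant, "linear": aic_linear}

def bootstrap_selection_frequencies(xs, ys, B=1000, seed=0):
    """Fraction of bootstrap samples in which each candidate has the lowest AIC."""
    rng = random.Random(seed)
    n = len(xs)
    counts = {name: 0 for name in CANDIDATES}
    for _ in range(B):
        # Resample (x, y) pairs with replacement, then redo the model selection.
        idx = [rng.randrange(n) for _ in range(n)]
        bx = [xs[i] for i in idx]
        by = [ys[i] for i in idx]
        best = min(CANDIDATES, key=lambda name: CANDIDATES[name](bx, by))
        counts[best] += 1
    return {name: c / B for name, c in counts.items()}

# Simulated data with a weak linear trend, for illustration only.
gen = random.Random(1)
xs = [float(i) for i in range(20)]
ys = [1.0 + 0.1 * x + gen.gauss(0.0, 1.0) for x in xs]
pi_hat = bootstrap_selection_frequencies(xs, ys, B=500)
print(pi_hat)
```

These frequencies could be used in place of Akaike weights, although, as the quote says, the authors see no reason to expect that to work better.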
“In general there is a substantial amount of model selection uncertainty in many practical problems […]. Such uncertainty about what model structure (and associated parameter values) is the K-L [Kullback–Leibler] best approximating model applies whether one uses hypothesis testing, information-theoretic criteria, dimension-consistent criteria, cross-validation, or various Bayesian methods. Often, there is a nonnegligible variance component for estimated parameters (this includes prediction) due to uncertainty about what model to use, and this component should be included in estimates of precision. […] we recommend assessing model selection uncertainty rather than ignoring the matter. […] It is […] not a sound idea to pick a single model and unquestioningly base extrapolated predictions on it when there is model uncertainty.”