A little stuff about modelling
(No, not that type of modelling! – I was rather thinking about the type below…)
Anyway, I assume not all readers are equally familiar with this stuff, which I’ve incidentally written about before e.g. here. Some of you will know all this stuff already and you do not need to read on (well, maybe you do – in order to realize that you do not..). Some of it is recap, some of it I don’t think I’ve written about before. Anyway.
i. So, a model is a representation of the world. It’s a simplified version of it, which helps us think about the matters at hand.
ii. Models always make a lot of assumptions. A perhaps surprising observation is that, from a certain point of view, models which might be categorized as more ‘simple’ (few explicit assumptions) can be said to make just as many assumptions as more ‘complex’ models (many explicit assumptions) do; it’s just that the underlying assumptions are different. To illustrate this, let’s have a look at two different models, model 1 and model 2. Model 1 is a model which states that ‘Y = aX’. Model 2 is a model which states that ‘Y = aX + bZ’.
Model 1 assumes that b is equal to 0, so that Z is not a relevant variable to include, whereas model 2 assumes that b is not zero – but both models make assumptions about this variable ‘Z’ (and the parameter ‘b’). Models will often differ along such lines, making different assumptions about variables and how they interact (incidentally, here we’re implicitly assuming in both models that X and Z are independent). A ‘simple’ model does make fewer (explicit) assumptions about the world than a ‘complex’ model does – but that question is different from the question of which restrictions the two models impose on the data. If we think in binary terms and ask ourselves, ‘Are we making an assumption about this variable or this relationship?’, the answer will always be ‘yes’ either way. Does the variable Z contribute information relevant to Y? Does it interact with other variables in the model? Both the simple model and the complex model include assumptions about this stuff. At every branching point where the complex model departs from the simple one, you have one assumption in one model (‘the distinction between f and g matters’, ‘alpha is non-zero’) and another assumption in the other (‘the distinction between f and g doesn’t matter’, ‘alpha is zero’). You always make assumptions; it’s just that the assumptions are different. In simple models the assumptions are often not spelled out, which is presumably part of why some of the assumptions made in such models are easy to overlook; and it makes sense that they’re not all spelled out, incidentally, because there’s an infinite number of ways one could adjust a model. It’s true that branching out takes place in some complex models in ways that do not occur in simple models, and once you’re more than one branching point away from the point where the two models first differ, the behaviour of the complex model may start to be determined by additional new assumptions, whereas the behaviour of the simple model might still rely on the same assumption that determined its behaviour at the first departure point – so the number of explicit assumptions will differ, but an assumption is made in either case at every junction.
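To make the point a bit more concrete, here’s a minimal sketch in Python – the data is simulated and the numbers are made up purely for illustration, and the statsmodels library is just my choice here, not anything implied by the discussion above. Fitting model 1 amounts to fixing b at zero; model 2 estimates b from the data, but both take a stand on b either way.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 500
X = rng.normal(size=n)
Z = rng.normal(size=n)
Y = 2.0 * X + 0.5 * Z + rng.normal(size=n)   # 'true' world for the simulation: b = 0.5

model_1 = sm.OLS(Y, X).fit()                          # Y = aX   (implicitly assumes b = 0)
model_2 = sm.OLS(Y, np.column_stack([X, Z])).fit()    # Y = aX + bZ  (assumes b may be non-zero)

print(model_1.params)   # estimate of a only
print(model_2.params)   # estimates of a and b
```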
As might be inferred from the comments above, ‘the simple model’ will usually be the one with the more restrictive assumptions, in terms of what the data is ‘allowed’ to do. Fewer assumptions usually means stronger assumptions. It’s a much stronger assumption to assume that e.g. males and females are identical than to assume that they are not; there are many ways in which they could fail to be identical, but only one way in which they can be. The restrictiveness of a model does not equal the number of assumptions (explicitly) made. No, on a general note it is rather the case that more assumptions make your model less restrictive, because additional assumptions allow more stuff to vary – this is indeed a big part of why model-builders generally don’t just stick to very simple models; if you do that, you don’t get the details right. Adding more assumptions may allow you to build a more correct model, one that better explains the data. It is my experience (not that I have much of it, but..) that people who’re unfamiliar with modelling think of additional assumptions as somehow ‘problematic’ – ‘more stuff can go wrong if you add more assumptions; the more assumptions you have, the more likely it is that one of them is violated’. The problem is that not making assumptions is not really an option; you’ll basically assume something no matter what you do. ‘That variable/distinction/connection is irrelevant’, which is often the default assumption, is also just that – an assumption. If you do modelling you don’t ever get to not make assumptions; they’re always there, lurking in the background, whether you like it or not.
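A loose illustration of the ‘identical vs. not identical’ point, again with simulated data and made-up numbers (the group variable and formulas below are just hypothetical examples): the pooled model imposes the restriction that the two groups behave identically, whereas the interacted model lets both the intercept and the slope differ between them – one way to be identical, many ways to differ.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(1)
n = 400
df = pd.DataFrame({
    "x": rng.normal(size=n),
    "group": rng.integers(0, 2, size=n),   # e.g. 0 = male, 1 = female (purely illustrative)
})
# simulated world: the slope actually differs between the groups
df["y"] = 1.0 + 1.5 * df["x"] + 0.8 * df["group"] * df["x"] + rng.normal(size=n)

pooled = smf.ols("y ~ x", data=df).fit()               # restrictive: assumes groups are identical
flexible = smf.ols("y ~ x * C(group)", data=df).fit()  # intercept and slope may differ by group

print(pooled.params)
print(flexible.params)
```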
iii. A big problem is that we don’t know a priori which assumptions are correct before we’ve actually tested the models – indeed, we often build models mainly in order to figure out which assumptions are correct. (Sometimes we can’t even test the assumptions we’re making in a model, but let’s ignore this problem here…) A more complex model will not always be more correct or perform better. Sometimes it’ll actually do a worse job at explaining the variation in the data than a simple one would have done. When you add more variables to a model, you also add more uncertainty, because of things like measurement error. Sometimes it’s worth it, because the new variable explains a lot of the variation in the data. Sometimes it’s not – sometimes the noise you add matters far more than the additional information the variable contributes about how the data behaves.
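A rough simulation of that point (all numbers made up, and the block of ‘noise’ variables is just a stand-in for variables that carry little signal): adding a pile of irrelevant regressors always improves the in-sample fit a little, but it can easily make predictions on new data from the same process worse.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(2)
n, n_noise = 100, 30
X = rng.normal(size=(n, 1))
noise_vars = rng.normal(size=(n, n_noise))        # irrelevant, noisy regressors
y = 1.0 + 2.0 * X[:, 0] + rng.normal(size=n)

X_new = rng.normal(size=(n, 1))                   # fresh data from the same process
noise_new = rng.normal(size=(n, n_noise))
y_new = 1.0 + 2.0 * X_new[:, 0] + rng.normal(size=n)

simple = sm.OLS(y, sm.add_constant(X)).fit()
complex_ = sm.OLS(y, sm.add_constant(np.hstack([X, noise_vars]))).fit()

mse_simple = np.mean((y_new - simple.predict(sm.add_constant(X_new))) ** 2)
mse_complex = np.mean((y_new - complex_.predict(sm.add_constant(np.hstack([X_new, noise_new])))) ** 2)

print(simple.rsquared, complex_.rsquared)   # the complex model fits the sample better...
print(mse_simple, mse_complex)              # ...but will often predict new data worse
```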
There are various ways to try to figure out whether the amount of noise added by an additional variable is too high for it to be a good idea to include the variable in a model, but they’re not perfect and there are always tradeoffs. There are many different methods for estimating which model performs better, and the different methods apply different criteria – so you can easily get into a situation where the choice of which variable to include in your ‘best model’ depends on e.g. which information criterion you choose to apply.
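For instance, a small sketch of comparing the two candidate models from before with information criteria (simulated data again; AIC and BIC are just two common examples of such criteria). Because they penalize extra parameters differently, they can disagree about which model is ‘best’ when the extra variable only matters a little.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(3)
n = 200
X = rng.normal(size=n)
Z = rng.normal(size=n)
y = 1.0 + 2.0 * X + 0.15 * Z + rng.normal(size=n)   # Z matters, but only a little

m_small = sm.OLS(y, sm.add_constant(X)).fit()
m_large = sm.OLS(y, sm.add_constant(np.column_stack([X, Z]))).fit()

print("AIC:", m_small.aic, m_large.aic)
print("BIC:", m_small.bic, m_large.bic)   # BIC penalizes the extra parameter harder than AIC
```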
Anyway the key point is this: You can’t just add everything (all the variables you could imagine playing a role) and assume you’ll be able to explain everything that way – adding another variable may indeed sometimes be a very bad idea.
iv. If you test a lot of hypotheses simultaneously, each of which has some positive probability of being evaluated as correct, then as you add more variables to your model it becomes more and more likely that at least one of those hypotheses will be evaluated as correct (relevant link) – unless you somehow adjust the probability of a given hypothesis being evaluated as correct as you add more hypotheses along the way. This is another reason why adding more variables to a model can sometimes be problematic. There are ways around this particular problem, but if they are not used, which they often are not, then you need to be careful.
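A quick back-of-the-envelope illustration of that point, with made-up numbers (the Bonferroni-style adjustment below is just one of the standard ways of dealing with this, not necessarily the one alluded to in the link): if each of k irrelevant variables has a 5% chance of looking ‘significant’, the chance that at least one of them does grows quickly with k unless the threshold is adjusted.

```python
k = 20        # number of simultaneous (independent) tests
alpha = 0.05  # per-test probability of a false positive

print(1 - (1 - alpha) ** k)      # ~0.64: chance of at least one false positive
print(1 - (1 - alpha / k) ** k)  # ~0.05: after a Bonferroni-style adjustment of the threshold
```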
v. Adding more variables is not always preferable, but then what about throwing more data at the problem by adding to the sample size? Surely if you add more data to the sample that should increase your confidence in the model results, right? Well… No – bigger is actually not always better. This is related to the concept of consistency in statistics. “A consistent estimator is one for which, when the estimate is considered as a random variable indexed by the number n of items in the data set, as n increases the estimates converge to the value that the estimator is designed to estimate,” as the wiki article puts it. Consistency is one of the key properties we rely on when we work with statistical models – we care a lot about consistency, and all else equal you should always prefer a consistent estimator to an inconsistent one (though it should be noted that all else is not always equal; a consistent estimator may have larger variance than an inconsistent one in a finite sample, which means that we may actually sometimes prefer the latter to the former in specific situations). But the thing is, not all estimators are consistent. There are always some critical assumptions which need to be satisfied in order for the consistency requirement to be met, and in a bad model these requirements will not be met. If you have a bad model – for example if you’ve incorrectly specified the relationships between the variables, or included the wrong variables in your model – then increasing the sample size will do nothing to help you; additional data will not somehow magically make the estimates more reliable ‘because of asymptotics’. In fact if your model’s performance is very sensitive to the sample size to which you apply it, that may well indicate that there’s a problem with the model, i.e. that the model is misspecified (see e.g. this).
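A rough simulation of the ‘bigger is not always better’ point (simulated data, made-up coefficients): the model below omits a variable Z that is correlated with X, so the estimator of a is inconsistent, and throwing more observations at it just makes the estimate converge more tightly to the wrong value.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(4)
for n in (100, 10_000, 1_000_000):
    Z = rng.normal(size=n)
    X = Z + rng.normal(size=n)            # X and Z are correlated
    y = 1.0 * X + 1.0 * Z + rng.normal(size=n)   # true coefficient on X is 1.0
    bad_fit = sm.OLS(y, X).fit()          # misspecified: Z is omitted
    print(n, bad_fit.params[0])           # stays near 1.5, not 1.0, no matter how large n gets
```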
vi. Not all model assumptions are equal – some assumptions will usually be much more critical than others. As already mentioned, consistency of the estimators is very important, and here it is important to note that not all violations of model assumptions will lead to inconsistent estimators. An example of a violated assumption which does not cause inconsistency is the homoskedasticity assumption (see also this) in regression analysis. Here you can actually find yourself in a situation where you deliberately apply a model in which you know that one of your assumptions about how the data behaves is violated, yet this is not much of a problem, because you can correct for the violation separately, so that it ends up being of little practical importance. As mentioned in the beginning, most models will be simplified versions of the stuff that goes on in the real world, so you’ll expect to see some ‘violations’ here and there – the key questions to ask are then: is the violation important, and which consequences does it have for the estimates we’ve obtained? If you do not ask yourself such questions when evaluating a model, you may easily end up quibbling about details which don’t really matter. And remember that the assumptions made in a model are not always spelled out, and that some of the important ones may have been overlooked.
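A sketch of ‘violating an assumption, then correcting for it’ (simulated data; the HC3 option below is one of several heteroskedasticity-robust covariance estimators available in statsmodels, chosen here just as an example): the errors are heteroskedastic, which doesn’t make the OLS coefficients inconsistent, but does make the default standard errors unreliable – the robust standard errors deal with that while leaving the coefficients untouched.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(5)
n = 500
X = rng.uniform(1, 5, size=n)
y = 1.0 + 2.0 * X + rng.normal(scale=X, size=n)   # error variance grows with X: heteroskedasticity

exog = sm.add_constant(X)
default_fit = sm.OLS(y, exog).fit()                 # standard errors assume homoskedasticity
robust_fit = sm.OLS(y, exog).fit(cov_type="HC3")    # heteroskedasticity-robust standard errors

print(default_fit.bse)   # standard errors under the (violated) assumption
print(robust_fit.bse)    # corrected standard errors; the coefficient estimates are unchanged
```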
vii. Which causal inferences can we make from the model? Correlation != causation. To some extent the question of whether the statistical link is causal relates to whether we’ve picked the right variables and the right way to relate them to each other. But as I’ve remarked upon before, some model types are better suited for establishing causal links than others – there are good ways and bad ways to get at the heart of the matter (one application here; I believe I’ve linked to this before). Different fields have often developed different approaches, see e.g. this, this and this. Correlation on its own will probably tell you next to nothing about anything you might be interested in; as I believe my stats prof put it last semester, ‘we don’t care about correlation, correlation means nothing’. Randomization schemes with treatment groups and control groups are great. If we can’t do those, we can still try to build models to get around the problems. Those models make assumptions, but so do the other models you’re comparing them with, and in order to properly evaluate them you need to be explicit about the assumptions made by the competing models as well.
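To close with a small simulation of the correlation-vs-causation point (everything below is made up – the ‘health’ confounder and the treatment story are just a hypothetical example): the treatment has no effect at all, but a confounder drives both treatment uptake and the outcome, so the naive comparison finds a large ‘effect’; randomizing the treatment breaks that link.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(6)
n = 5_000
health = rng.normal(size=n)                                      # unobserved confounder
treated_obs = (health + rng.normal(size=n) > 0).astype(float)    # healthier people opt in to treatment
outcome_obs = 2.0 * health + rng.normal(size=n)                  # true treatment effect is zero

treated_rct = rng.integers(0, 2, size=n).astype(float)           # randomized assignment
outcome_rct = 2.0 * health + rng.normal(size=n)

naive = sm.OLS(outcome_obs, sm.add_constant(treated_obs)).fit()
rct = sm.OLS(outcome_rct, sm.add_constant(treated_rct)).fit()

print(naive.params[1])   # large, spurious 'treatment effect' driven by the confounder
print(rct.params[1])     # close to the true effect of zero under randomization
```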