## A little stuff about modelling

(No, not *that type* of modelling! – I was rather thinking about the type below…)

…

Anyway, I assume not all readers are equally familiar with this stuff, which I’ve incidentally written about before e.g. here. Some of you will know all this stuff already and you do not need to read on (well, maybe you do – in order to realize that you do not..). Some of it is recap, some of it I don’t think I’ve written about before. Anyway.

i. So, a model is a representation of the world. It’s a simplified version of it, which helps us think about the matters at hand.

ii. Models always have a lot of assumptions. A perhaps surprising observation is that, from a certain point of view, models which might be categorized as more ‘simple’ (few explicit assumptions) can be said to make as many assumptions as do more ‘complex’ models (many explicit assumptions); it’s just that the underlying assumptions are different. To illustrate this, let’s have a look at two different models, model 1 and model 2. Model 1 is a model which states that ‘Y = aX’. Model 2 is a model which states that ‘Y = aX + bZ’.

Model 1 assumes b is equal to 0 so that Z is not a relevant variable to include, whereas model 2 assumes b is not zero – but both models make assumptions about this variable ‘Z’ (and the parameter ‘b’). Models will often differ along such lines, making different assumptions about variables and how they interact (incidentally here we’re implicitly assuming in both models that X and Z are independent). A ‘simple’ model does make fewer (explicit) assumptions about the world than does a ‘complex’ model – but that question is different from the question of which restrictions the two models impose on the data. And thinking in binary terms when we ask ourselves the question, ‘Are we making an assumption about this variable or this relationship?’, then the answer will always be ‘yes’ either way. Does the variable Z contribute information relevant to Y? Does it interact with other variables in the model? Both the simple model and the complex model include assumptions about this stuff. At every branching point where the complex model departs from the simple one, you have one assumption in one model (‘the distinction between f and g matters’, ‘alpha is non-zero’) and another assumption in the other (‘the distinction between f and g doesn’t matter’, ‘alpha is zero’). You always make assumptions, it’s just that the assumptions are different. In simple models assumptions are often not spelled out, which is presumably part of why some of the assumptions made in such models are easy to overlook; it makes sense that they’re not, incidentally, because there’s an infinite number of ways to make adjustments to a model. 
It’s true that branching out does take place in some complex models in ways that do not occur in simple models, and once you’re more than one branching point away from the departure point where the two models first differ then the behaviour of the complex model may start to be determined by additional new assumptions where on the other hand the behaviour of the simple model might still rely on the same assumption that determined the behaviour at the first departure point – so the number of explicit assumptions will be different, but *an assumption is made in either case at every junction*.
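To make the ‘b = 0 is also an assumption’ point concrete, here is a minimal sketch. The data are simulated and the true values a = 2 and b = 0.5 are my own assumptions, as are the helper names. It fits both models by least squares and treats model 1 as model 2 with b forced to zero.

```python
import random

def fit_model_1(x, y):
    """Least-squares fit of Y = a*X, i.e. Y = a*X + b*Z with b forced to 0."""
    return sum(xi * yi for xi, yi in zip(x, y)) / sum(xi * xi for xi in x)

def fit_model_2(x, z, y):
    """Least-squares fit of Y = a*X + b*Z via the 2x2 normal equations."""
    sxx = sum(xi * xi for xi in x)
    szz = sum(zi * zi for zi in z)
    sxz = sum(xi * zi for xi, zi in zip(x, z))
    sxy = sum(xi * yi for xi, yi in zip(x, y))
    szy = sum(zi * yi for zi, yi in zip(z, y))
    det = sxx * szz - sxz ** 2
    a = (szz * sxy - sxz * szy) / det
    b = (sxx * szy - sxz * sxy) / det
    return a, b

def rss(x, z, y, a, b):
    """Residual sum of squares for given parameter values."""
    return sum((yi - a * xi - b * zi) ** 2 for xi, zi, yi in zip(x, z, y))

# Simulated data (assumed values): X and Z independent, true a = 2.0, true b = 0.5.
rng = random.Random(0)
n = 500
x = [rng.gauss(0, 1) for _ in range(n)]
z = [rng.gauss(0, 1) for _ in range(n)]
y = [2.0 * xi + 0.5 * zi + rng.gauss(0, 1) for xi, zi in zip(x, z)]

a1 = fit_model_1(x, y)
a2, b2 = fit_model_2(x, z, y)
rss_1 = rss(x, z, y, a1, 0.0)   # model 1 evaluated as "model 2 with b = 0"
rss_2 = rss(x, z, y, a2, b2)
print(f"model 1: a={a1:.3f}, b=0 (assumed), RSS={rss_1:.1f}")
print(f"model 2: a={a2:.3f}, b={b2:.3f}, RSS={rss_2:.1f}")
```

In-sample, model 2 can never fit worse than model 1, because model 1 is the special case in which the data are not ‘allowed’ to load on Z.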

As might be inferred from the comments above, ‘the simple model’ will usually be the one with the more restrictive assumptions, in terms of what the data is ‘allowed’ to do. Fewer assumptions usually mean stronger assumptions. It’s a much stronger assumption to assume that e.g. males and females are identical than is the alternative that they are not; there are many ways they could be not identical but only one way in which they can be identical. *The restrictiveness of a model does not equal the number of assumptions (explicitly) made*. No, on a general note it is rather the case that more assumptions mean that your model becomes *less* restrictive, because additional assumptions allow for more stuff to vary – this is indeed a big part of why model-builders generally don’t just stick to very simple models; if you do that, you don’t get the details right. Adding more assumptions may allow you to make a more correct model that better explains the data. It is my experience (not that I have much of it, but..) that people who’re unfamiliar with modelling think of additional assumptions as somehow ‘problematic’ – ‘more stuff can go wrong if you add more assumptions, the more assumptions you have the more likely it is that one of them is violated’. The problem is that not making assumptions is not really an option; you’ll basically assume *something* no matter what you do. ‘That variable/distinction/connection is irrelevant’, which is often the default assumption, is also just that – an assumption. If you do modelling you don’t ever get to not make assumptions, they’re always there lurking in the background whether you like it or not.

iii. A big problem is that we don’t know *a priori* which assumptions are correct before we’ve actually tested the models – indeed, we often make models mainly in order to figure out which assumptions are correct. (Sometimes we can’t even test the assumptions we’re making in a model, but let’s ignore this problem here…). *A more complex model may not always be more correct or perform better*. Sometimes it’ll actually do a worse job at explaining the variation in the data than a simple one would have done. When you add more variables to a model, you also add more uncertainty because of things like measurement error. Sometimes it’s worth it, because the new variable explains a lot of the variation in the data. Sometimes it’s not – sometimes the noise you add is far more relevant than is the additional information contribution about how the data behaves.

There are various ways to try to figure out if the amount of noise added from an additional variable is too high for it to be a good idea to include the variable in a model, but they’re not perfect and you always have tradeoffs. There are many different methods to estimate which model performs better, and the different methods apply different criteria – so you can easily get into a situation where the choice of which variable to include in your ‘best model’ depends on e.g. which information criterion you choose to apply.
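As a small illustration of how the choice of criterion can flip the answer, here is a sketch comparing AIC and BIC for the two models from above. The sample size and residual sums of squares are hypothetical numbers picked for the example, and the formulas are the usual Gaussian linear-model versions, up to an additive constant.

```python
import math

def aic(rss, n, k):
    """AIC for a Gaussian linear model, up to an additive constant."""
    return n * math.log(rss / n) + 2 * k

def bic(rss, n, k):
    """BIC for a Gaussian linear model, up to an additive constant."""
    return n * math.log(rss / n) + k * math.log(n)

n = 100                  # hypothetical sample size
rss_1, k_1 = 100.0, 1    # hypothetical fit of Y = aX (one parameter)
rss_2, k_2 = 96.0, 2     # hypothetical fit of Y = aX + bZ (two parameters)

# Lower is better for both criteria.
aic_prefers_model_2 = aic(rss_2, n, k_2) < aic(rss_1, n, k_1)
bic_prefers_model_2 = bic(rss_2, n, k_2) < bic(rss_1, n, k_1)
print("AIC prefers model 2:", aic_prefers_model_2)
print("BIC prefers model 2:", bic_prefers_model_2)
```

With these numbers the small improvement in fit is enough for AIC to include Z, but not for BIC, which penalizes each extra parameter by ln(n) rather than by 2.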

Anyway the key point is this: You can’t just add everything (all possible variables you could imagine play a role) and assume you’ll be able to explain everything that way – adding another variable may indeed sometimes be a very bad idea.
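A rough sketch of the ‘adding everything’ problem, on simulated data. The setup is an assumption of mine: one variable that matters, fifteen that are pure noise, and only forty training observations. The tiny least-squares solver is there only to keep the example self-contained.

```python
import random

def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [bi] for row, bi in zip(A, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (M[r][n] - sum(M[r][c] * x[c] for c in range(r + 1, n))) / M[r][r]
    return x

def ols(X, y):
    """Least-squares coefficients from the normal equations (X'X) beta = X'y."""
    p = len(X[0])
    XtX = [[sum(row[i] * row[j] for row in X) for j in range(p)] for i in range(p)]
    Xty = [sum(row[i] * yi for row, yi in zip(X, y)) for i in range(p)]
    return solve(XtX, Xty)

def mse(X, y, beta):
    """Mean squared prediction error."""
    return sum((yi - sum(b * v for b, v in zip(beta, row))) ** 2
               for row, yi in zip(X, y)) / len(y)

def make_data(rng, n, n_noise):
    """Column 0 drives y; the remaining columns are irrelevant by construction."""
    X = [[rng.gauss(0, 1) for _ in range(1 + n_noise)] for _ in range(n)]
    y = [1.0 * row[0] + rng.gauss(0, 1) for row in X]
    return X, y

def first_column(X):
    return [row[:1] for row in X]

rng = random.Random(1)
X_train, y_train = make_data(rng, 40, 15)
X_test, y_test = make_data(rng, 2000, 15)

beta_small = ols(first_column(X_train), y_train)
beta_big = ols(X_train, y_train)

train_small = mse(first_column(X_train), y_train, beta_small)
train_big = mse(X_train, y_train, beta_big)
test_small = mse(first_column(X_test), y_test, beta_small)
test_big = mse(X_test, y_test, beta_big)
print(f"in-sample MSE:     small={train_small:.2f}  big={train_big:.2f}")
print(f"out-of-sample MSE: small={test_small:.2f}  big={test_big:.2f}")
```

The big model always looks at least as good on the data it was fitted to, and that is exactly why in-sample fit alone is a poor guide to whether a variable belongs in the model.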

iv. If you test a lot of hypotheses simultaneously, which all have some positive probability of being evaluated as correct, then as you add more variables to your model it becomes more and more likely that at least one of those hypotheses will be evaluated as being correct (relevant link) unless you somehow adjust the probability of a given hypothesis being evaluated as correct as you add more hypotheses along the way. This is another reason adding more variables to a model can sometimes be problematic. There are ways around this particular problem, but if they are not used, which they often are not, then you need to be careful.
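A back-of-the-envelope version of this, assuming the tests are independent and each is run at the same significance level (both are simplifying assumptions, and the function name is mine). The Bonferroni correction used here is one of the simplest of the adjustments alluded to above.

```python
def familywise_error_rate(alpha, m):
    """P(at least one false positive) across m independent tests at level alpha."""
    return 1.0 - (1.0 - alpha) ** m

# 20 irrelevant variables, each tested at the 5% level.
unadjusted = familywise_error_rate(0.05, 20)

# Bonferroni: test each hypothesis at alpha / m instead.
bonferroni = familywise_error_rate(0.05 / 20, 20)

print(f"unadjusted: {unadjusted:.3f}")   # roughly 0.64
print(f"Bonferroni: {bonferroni:.3f}")   # just under 0.05
```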

v. Adding more variables is not always preferable, but then what about throwing more data at the problem by adding to the sample size? Surely if you add more data to the sample that should increase your confidence in the model results, right? Well… No – bigger is actually not always better. This is related to the concept of consistency in statistics. “A consistent estimator is one for which, when the estimate is considered as a random variable indexed by the number *n* of items in the data set, as *n* increases the estimates converge to the value that the estimator is designed to estimate,” as the wiki article puts it. You can imagine that consistency is one of the key assumptions underlying statistical models – it really is, we care a lot about consistency, and all else equal you should always prefer a consistent estimator to an inconsistent one (however it should be noted that all else is not always equal; a consistent estimator may have larger variance than an inconsistent estimator in a finite sample, which means that we may actually sometimes prefer the latter to the former in specific situations). But the thing is, not all estimators are consistent. There are always some critical assumptions which need to be satisfied in order for the consistency requirement to be met, and in a bad model these requirements will not be met. If you have a bad model, for example if you’ve incorrectly specified the relationships between the variables or included the wrong variables in your model, then increasing the sample size will do nothing to help you – additional data will not somehow magically make the estimates more reliable ‘because of asymptotics’. In fact if your model’s performance is very sensitive to the sample size to which you apply it, it may well indicate that there’s a problem with the model, i.e. that the model is misspecified (see e.g. this).
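Here is a minimal simulated example of an estimator that more data will not rescue. The assumptions are mine: the true model is Y = 2X + Z + noise, Z is correlated with X (so the independence assumed in point ii. is deliberately broken), and we wrongly fit model 1, Y = aX.

```python
import random

def slope_without_z(n, rng):
    """Least-squares slope of Y on X alone when the true model also contains Z."""
    sxy = sxx = 0.0
    for _ in range(n):
        x = rng.gauss(0, 1)
        z = 0.8 * x + rng.gauss(0, 1)             # Z is correlated with X
        y = 2.0 * x + 1.0 * z + rng.gauss(0, 1)   # true a = 2.0, true b = 1.0
        sxy += x * y
        sxx += x * x
    return sxy / sxx

rng = random.Random(0)
estimates = {n: slope_without_z(n, rng) for n in (100, 1_000, 50_000)}
for n, a_hat in estimates.items():
    print(f"n={n:>6}: a_hat={a_hat:.3f}")
```

As n grows the estimate settles down, but around a + b·cov(X, Z)/var(X) = 2.8 rather than around the true a = 2. The larger sample only makes us more confident in the wrong number.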

vi. Not all model assumptions are equal – some assumptions will usually be much more critical than others. As already mentioned consistency of estimators is very important, and here it is important to note that not all model assumption violations will lead to inconsistent estimators. An example of where this is not the case is the homoskedasticity assumption (see also this) in regression analysis. Here you can actually find yourself in a situation where you deliberately apply a model where you *know* that one of your assumptions about how the data behaves is violated, yet this is not a problem at all because you can deal with the problem separately so that that violation is of no practical importance as you can correct for it. As already mentioned in the beginning most models will be simplified versions of the stuff that goes on in the real world, so you’ll expect to see some ‘violations’ here and there – the key question to ask here is then, is the violation important and which consequences does it have for the estimates we’ve obtained? If you do not ask yourself such questions when evaluating a model, you may easily end up quibbling about details which are of no importance anyway because they don’t really matter. And remember that not all the assumptions made in the model are always spelled out, and that some of the important ones may have been overlooked.
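A sketch of the homoskedasticity case, on simulated data where the error variance grows with |X|. The data-generating process is an assumption of mine, and the correction shown is the White (HC0) heteroskedasticity-robust standard error for a single regressor without intercept.

```python
import math
import random

rng = random.Random(0)
n = 5000
x = [rng.gauss(0, 1) for _ in range(n)]
# True slope is 2.0; the error's standard deviation is |x|, so errors are heteroskedastic.
y = [2.0 * xi + abs(xi) * rng.gauss(0, 1) for xi in x]

sxx = sum(xi * xi for xi in x)
a_hat = sum(xi * yi for xi, yi in zip(x, y)) / sxx
resid = [yi - a_hat * xi for xi, yi in zip(x, y)]

# Conventional standard error: assumes one common error variance.
se_conventional = math.sqrt(sum(e * e for e in resid) / (n - 1) / sxx)

# White / HC0 robust standard error: lets the error variance differ by observation.
se_robust = math.sqrt(sum((xi * e) ** 2 for xi, e in zip(x, resid))) / sxx

print(f"a_hat={a_hat:.3f}")
print(f"conventional SE={se_conventional:.4f}, robust SE={se_robust:.4f}")
```

The slope estimate is still close to the true value despite the violated assumption; what changes is how uncertain we should say we are, and that part can be corrected separately.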

vii. Which causal inferences to make from the model? Correlation != causation. How far the statistical link is causal depends in part on whether we’ve picked the right variables and the right way to relate them to each other. But as I’ve remarked upon before, some model types are better suited for establishing causal links than are others – there are good ways and bad ways to get at the heart of the matter (one application here, I believe I’ve linked to this before). Different fields will often have developed different approaches, see e.g. this, this and this. Correlation on its own will probably tell you next to nothing about anything you might be interested in; as I believe my stats prof put it last semester, ‘we don’t care about correlation, correlation means nothing’. Randomization schemes with treatment groups and control groups are great. If we can’t do those, we can still try to make models to get around the problems. Those models make assumptions, but so do the other models you’re comparing them with and in order to properly evaluate them you need to be explicit about the assumptions made by the competing models as well.
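To see why randomization is so attractive, here is a simulated comparison. All numbers are assumptions of mine: the true treatment effect is 1.0, and a confounder both raises the outcome and makes people more likely to select into treatment.

```python
import random

def diff_in_means(outcomes, treated):
    """Mean outcome of the treated minus mean outcome of the untreated."""
    t = [y for y, d in zip(outcomes, treated) if d]
    c = [y for y, d in zip(outcomes, treated) if not d]
    return sum(t) / len(t) - sum(c) / len(c)

rng = random.Random(0)
n = 20000
confounder = [rng.gauss(0, 1) for _ in range(n)]

def outcome(treated, c):
    # True causal effect of treatment is 1.0; the confounder adds 2.0 per unit.
    return 1.0 * treated + 2.0 * c + rng.gauss(0, 1)

# Observational data: people with a high confounder value tend to pick treatment.
self_selected = [c + rng.gauss(0, 1) > 0 for c in confounder]
# Experiment: treatment assigned by coin flip, independent of the confounder.
randomized = [rng.random() < 0.5 for _ in range(n)]

naive = diff_in_means(
    [outcome(d, c) for d, c in zip(self_selected, confounder)], self_selected)
rct = diff_in_means(
    [outcome(d, c) for d, c in zip(randomized, confounder)], randomized)
print(f"observational difference in means: {naive:.2f}")
print(f"randomized difference in means:    {rct:.2f}")
```

The observational comparison mixes the treatment effect with the confounder’s effect; the randomized one recovers something close to 1.0 without our having to model the confounder at all.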

There is nothing here that is new to me, but I finished the post anyway, because your writing is clear and engaging. Becoming a pop economics writer might be an option for you — but it is a risky bet, since you do require quite a bit of luck in the publishing world.

Comment by Miao | July 24, 2013 |

Thanks for the kind words. I’d prefer a less risky career. If I can’t do any better with the credentials I’ll have obtained by the time I graduate (…if I graduate…) than that, I should probably consider something like that as a fall-back option. As I’m sure I’ve mentioned before I was approached by a publisher at one point because of stuff I’d written here on the blog, so the idea of writing a book at some point is not *that* foreign to me – though that’s still very different from becoming a professional writer. I’ll probably not want to go down that route unless no-one wants to hire me when I’ve finished my education, and I really don’t hope I’ll find myself in that situation.

Comment by US | July 24, 2013 |

“I was approached by a publisher at one point because of stuff I’d written here on the blog” — This might just be the motivation I need to start writing/maintaining a blog myself, though I am not sure if I will be able to rival your level of consistency, both in terms of frequency and quality 🙂

Comment by Miao | July 24, 2013

If it is (the motivation you need) then I shouldn’t really say anything because I’d love to see you start blogging…

But I’ll say this anyway; if you want to become an author, waiting around for people to find you isn’t going to work. I highly doubt many people are discovered that way and the odds are surely against you. The direct approach is far more likely to work – write something, then find someone whom you can trick into publishing it. Also blogging may make you a better writer, but people don’t like to pay for stuff they can get for free so there are some tradeoffs here regarding what to share online, and you should certainly keep those tradeoffs in mind if you plan on using your blog in manners such as these (as well as other tradeoffs that might apply).

Incidentally if you can’t convince anyone to publish your stuff, you always have the option of self-publishing the material as an e-book.

[Do recall here at the end my meta-advice on how to react when I’ve advised someone…]

Comment by US | July 24, 2013

There is no ‘reply’ button to your comment so I’ll write a new one here.

What you said has already occurred to me (considering the fact that there are many millions of blogs out there, it is exceedingly unlikely that I can just sit on my ass and expect my blog to go viral without any promotion on my part), but nevertheless it is a good reminder. 🙂

Comment by Miao | July 24, 2013 |

“Anyway the key point is this: You can’t just add everything (all possible variables you could imagine play a role) and assume you’ll be able to explain everything that way – adding another variable may indeed sometimes be a very bad idea.” – That never stopped some from trying; Leontief got a Nobel Prize for it. I have not done this particular kind of analysis, but I have heard that back in the day they tried to solve linear systems of 5,000 or so equations. Unsurprisingly, if computational intractability does not kill such a model, GIGO will.

CSB (a problem I had to wrestle with not that long ago – hopefully useful as folklore “from the trenches”): My company gathers open-high-low-close-volume data for every stock traded in the US every trading day, plus quite a few ETFs, plus quite a few indices. Out of these, we calculate a boatload of statistics – some widely known, some proprietary. Some of them have some merit (momentum, relative strength), some are stupid (up and down volume) and some are so ridiculous it stops being funny at some point – things like “take the ratio of the Nasdaq100 and S&P500 closing prices, calculate X% and Y% exponential moving averages (X and Y are derived from Fibonacci numbers), build a cumulative sum of the differences of the two EMAs, and calculate its 10-day rolling standard deviation”. My task was to find which of our variables best predict whether a specific combination of five tradable market indices will be up or down the next day. I knew this would be a wildly overfit model with a gazillion “spurious successes”, and said as much, but I get paid to do what I am told 🙂 My results (there were dozens of them) were of the kind “for days where variable X>12.75 and variable Y<-3.5, the next day was up in 64 out of 67 cases in my training sample, and this held well to 18 out of 21 on average in the out-of-sample validation runs". My bosses were very impressed and happy, I got pats on the back galore, and (the reward for hard work is more work!) a task to check how these same variables predict the returns over the 2 following days and the 5 following days. As you can imagine, over the next year, these "magic" combinations/strategies mostly sucked – reversion to the mean is a bitch. On more than one occasion, I felt very much like the main character from the movie Pi (highly recommended). (CSB over)

“Different fields will often have developed different approaches (to causality)…” – A word of advice: I am sure you are familiar with Granger causality, as it was linked in one of the wikis you linked to. But on the off chance you are not – acquaint yourself with it, and if you do know it – master it, and the generalized vector autoregression version of it. This is powerful stuff in the hands of someone who is familiar with its limitations. I mentioned it, in passing, to a very bright fellow who does bio-informatics for Bristol-Myers, and he is now using it to locate molecules that shrink tumors, apparently with great success. The “randomization schemes with treatment groups and control groups” you mention are indeed the best way, but have this annoying limitation of the real world/industry – they are expensive, so you need to justify the budget to run them.

Best!

Comment by Plamus | July 30, 2013 |

“That never stopped some from trying” – part of why I mentioned this. Plus it’s my impression that it’s a wildly common idea among statistically illiterate people.

“for days where variable X>12.75 and variable Y<-3.5, the next day was up in 64 out of 67 cases in my training sample” – how large was the average gain? What if you looked two days ahead instead of one? On the one hand more competent management would lead to more work, on the other hand less – they probably wouldn’t ask you to derive variables from Fibonacci numbers… I always thought simple illustrative examples of *what you’re actually doing* when you’re overfitting a model were a good way to conceptualize the kinds of problems overfitting might lead to: ‘This indicator performs very well … when we apply it to lefthanded men at the age of 34 who bought a green Mercedes last Thursday’. It’s probably true, but which conclusion would any sane man draw from that observation? 🙂

We spent a lot of time on study designs and how they relate to the problem of causal evaluation, and stuff like how to identify and isolate ATEs in the courses I had last semester, so of course I’m aware of some of the tradeoffs and reasons why RCTs aren’t more widespread than they are. This kind of stuff plays a much bigger role in some parts of economics/econometrics than you’d think if you’d only listen to ‘critics of the social sciences (/…from the outside)’.

Comment by US | July 30, 2013 |

As soon as I read your reply, I knew I had bollocksed the explanation. I used X and Y twice, for different things – my apologies. The first X and Y – the EMA factors – are just constants: ((√5 – 1)/2)^2 and ((√5 – 1)/2)^3, i.e. (Phi – 1)^2 and (Phi – 1)^3, in percentage terms, so X~38.2 and Y~23.6. I was illustrating one of our ludicrous indicators. The second X and Y were indicators I was testing. The goal was to find “successful” X’s and Y’s with the optimal thresholds, which I randomly put here as 12.75 and -3.5. So, in summary, the conceptual problem was: you have a ton of indicators (pre-calculated, so you have values), and next-day returns on an index (coded 1 for UP, and 0 for down); find signals (e.g. indicator X>12.75 AND indicator Y<3.5), which, when they occur, predict an up or a down day with high probability; do not care about the days when the signal is off (say, from above, when indicator X3.5). If I have managed to explain the setup adequately, I’d be curious how you’d attack this problem. I’ll tell you what I did, if you are interested.

Best!

Comment by Plamus | July 31, 2013

“I’d be curious how you’d attack this problem” – I’d spend some (too much…) time thinking about various approaches, how this problem relates to problems I’ve seen before, etc., before doing anything. And I’m not sure my real-life time is well spent thinking about this particular problem.

You’re welcome to tell me how you approached the problem if you like – you have a lot more experience doing these kinds of things than I do and your strategy was successful, so I might learn something.

Comment by US | July 31, 2013

I did not mean to ask you for a complete solution – just thought you may know of some method that smarter people than either you or me have developed, as while I may have a bit more hands-on experience, your theoretical background is much deeper and more recent than mine. I do not know of such a method. Since we are not looking to best predict all of our next-day returns, but just a subset, regressions and PCA do not work, although I tried both in order to extract a few indicators to try to massage further – did not give me much (mind, quite a few of the indicators are highly correlated, e.g. the returns of most equity indices for the day).

What I did is by no means efficient (I have no way of knowing I got most or the best signals) or elegant (as you’ll see), but it gave me what was asked of me. I used decision trees. I loaded all of my indicators into data-mining software, and had it build for me a decision tree that explained almost all of the next-day returns. The output was something like: “If signal A is true (indicator X > a AND indicator Y < b), then the next day is up; if not signal A, then signal B (indicator Z > c AND indicator W < d), then next day is down (say, 43 out of 48); if not signal A and not signal B, then signal C…” and so on all the way down. I took only signal A, if it was “good enough” – sometimes it was useless, as it would only predict, say, 11 out of 12, and a signal that only lights up 12 times in 5 years is not very useful. Then I removed indicator X from my data set, and reran it, getting a totally new decision tree. Then I added X back in, but removed Y. Basically, I tried to make the software give me numerous decision trees, by tweaking the indicators I fed in as well as some of the parameters of the decision tree search (type of validation procedure, number of iterations for the validation, number of allowable indicators that make up a signal, etc.) All in all, I generated about 200 decision trees, and by skimming the top branches, got about 40 “promising” signals that looked great on paper (on screen?). I fully expected most of them to suck as actual predictors, and of course they did, but that was not my assignment, so… 🙂 I could easily, given enough time, have generated not 200, but 2,000 trees, and yielded not 40, but 200-300 signals. This is what overfitting does for you, and what I meant to illustrate initially anyway – you get lots of beautiful-looking results, but for the most part that’s all they do – look pretty, while being useless in terms of making you wiser.

Comment by Plamus | August 1, 2013 |
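For readers who want to see the ‘spurious successes’ effect in isolation, here is a deliberately simplified sketch. It is not the decision-tree procedure described above. Every indicator is pure noise, the next-day direction is a coin flip, and the signals are single-indicator threshold rules; all of these are my own assumptions for illustration. We search for the best-looking signal on a training sample and then check it on fresh data.

```python
import random

def make_days(rng, n_days, n_indicators):
    """Pure-noise indicators and coin-flip next-day directions (1 = up, 0 = down)."""
    indicators = [[rng.gauss(0, 1) for _ in range(n_indicators)]
                  for _ in range(n_days)]
    up = [rng.randint(0, 1) for _ in range(n_days)]
    return indicators, up

def hit_rate(indicators, up, idx, sign, threshold):
    """Share of up days among days where sign * indicator[idx] > threshold."""
    fired = [u for row, u in zip(indicators, up) if sign * row[idx] > threshold]
    return (sum(fired) / len(fired), len(fired)) if fired else (0.0, 0)

rng = random.Random(42)
n_indicators = 30
train_x, train_up = make_days(rng, 250, n_indicators)     # roughly one trading year
test_x, test_up = make_days(rng, 20000, n_indicators)     # large fresh sample

# Try every indicator, both directions, and a few thresholds; keep the best
# "predicts an up day" signal that fired on at least 15 training days.
best = None
for idx in range(n_indicators):
    for sign in (1, -1):
        for threshold in (0.5, 1.0, 1.5):
            rate, n_fired = hit_rate(train_x, train_up, idx, sign, threshold)
            if n_fired >= 15 and (best is None or rate > best[0]):
                best = (rate, idx, sign, threshold)

best_train_rate, idx, sign, threshold = best
best_test_rate, _ = hit_rate(test_x, test_up, idx, sign, threshold)
print(f"best signal in training sample: {best_train_rate:.0%} up days")
print(f"same signal on fresh data:      {best_test_rate:.0%} up days")
```

With 180 candidate signals and nothing real to find, the winner still looks clearly better than a coin in the training sample and then drifts back towards 50% on new data, which is the reversion to the mean described earlier in the thread.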

Ah, crap… It seems the blogging software interprets all “more than” and “less than” signs as HTML tags – so it ate a part of my post, and made the rest bold. “Yc” should be “Y is less than b, then the next day is up; if not signal A, then signal B (indicator Z is higher than”. I apologize – a preview option would have been helpful.

Comment by Plamus | August 1, 2013 |

Ha, yep – sounds neither efficient nor very ‘convincing’ (“I’ve found the fundamental law behind the numbers!”). 🙂 But if something like that’s what the guy in charge is asking for (and it was, even though he might not know it…), well…

It’s my impression that time series work and forecasting in particular is quite vulnerable to stuff like overfitting and borderline nonsensical models – stuff like AR(/I)MA models for example in my mind are just curve-fitting exercises – but I haven’t actually done much work in these areas as that kind of stuff doesn’t really interest me (I’d rather work on stuff like microeconometrics). Incidentally my lack of interest in time series econometrics is actually a bit of a shame, all things considered, given the place I’m studying; we have some people who’re really good at that stuff.

I agree that a preview comment feature would be nice, I may look into that (but I promise nothing).

Comment by US | August 2, 2013 |

I meant to add this link as well in my last comment. “Institut for Økonomi, Aarhus Universitet” is where I study. It’s literally at the top of the list. I figured that if they evaluated CREATES separately it would be higher placed than IfØ but apparently that’s not the case; that’s probably related to how the rankings are constructed (haven’t looked at this).

Comment by US | August 3, 2013