Fixed Effects and Hierarchical Models

As a data scientist, you will often work with data that has a nested structure, meaning individual observations can also be divided into groups. In a dataset on student test scores, for example, individual students can be grouped by classroom, or by school. In data on store revenues, it may be possible to group stores by the city in which they are located.

When this happens, we often want to take this group structure into account. For students, for example, we might want to take into account that all the students in a good school may do better than average (because the school is good), or that all the stores in a poor city may be likely to under-perform (not because of anything the store has done, but because the city is poor).

There are several strategies for making these adjustments. The one that is most popular among statisticians (and most familiar to MIDS students) is the hierarchical model. Hierarchical models accomplish two things: (a) they allow for different intercepts (or different slopes) for the observations in each group, and (b) they adjust our standard errors to account for the possibility of group-level shocks. If the 300 kids in a single school are all over-performing, a hierarchical model doesn’t treat that as 300 observations when calculating standard errors for group effects (which would generally result in small standard errors, since we would have a sample size of 300 kids); it treats it as a single observation of a good school (with larger standard errors, since we only have a sample size of one school).

Hierarchical models are popular in many settings, and for good reason: they are statistically efficient (meaning they generate estimates that tend to have smaller standard errors than what you get from other models designed to accomplish the same things). But they also have limitations.

The chief limitation of hierarchical models is that they assume that the magnitudes of the group differences (i.e., the group intercepts or differences in slope) are uncorrelated with the other explanatory variables in the model. Indeed, it is this assumption that makes them statistically efficient.

But there are many settings where this assumption breaks down. For example, suppose we’re estimating the test scores of children, and we’re controlling for whether children are low-income. If we estimate a hierarchical model that allows different schools to have different average test scores, then those estimates of school-level differences will be biased if low-income students tend to go to under-performing schools (because our explanatory variable, a child’s income, is correlated with the school’s average performance).
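To make the problem concrete, here is a minimal simulation sketch. Everything in it is an assumption made for illustration: the data are simulated, the variable names are hypothetical, and the random-intercept model is fit with the textbook quasi-demeaning (random effects GLS) transform, using the variance components we simulated with rather than estimated ones. In this sketch the bias shows up in the coefficient on low-income status, because the model attributes part of the school effect to income.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical simulated data: 2,000 schools with 5 students each.
n_schools, n_per = 2000, 5
true_beta = -1.0  # assumed true effect of being low-income on test scores

# School quality (the group intercept), simulated with variance 1.
school_effect = rng.normal(0.0, 1.0, n_schools)

# Low-income students are more likely to attend weaker schools, so the
# explanatory variable is correlated with the group intercept.
p_low_income = np.clip(0.5 - 0.3 * school_effect, 0.05, 0.95)
low_income = (rng.random((n_schools, n_per)) < p_low_income[:, None]).astype(float)

# Student-level noise, also simulated with variance 1.
noise = rng.normal(0.0, 1.0, (n_schools, n_per))
score = 2.0 + true_beta * low_income + school_effect[:, None] + noise

# Random-intercept estimate via quasi-demeaning for balanced groups:
# theta = 1 - sqrt(var_noise / (var_noise + n_per * var_school)).
theta = 1.0 - np.sqrt(1.0 / (1.0 + n_per * 1.0))
x_star = (low_income - theta * low_income.mean(axis=1, keepdims=True)).ravel()
y_star = (score - theta * score.mean(axis=1, keepdims=True)).ravel()

# OLS on the transformed data (a constant column stands in for the
# transformed intercept, which is also constant when groups are balanced).
X = np.column_stack([np.ones_like(x_star), x_star])
beta_hierarchical = np.linalg.lstsq(X, y_star, rcond=None)[0][1]

print("true effect:          ", true_beta)
print("hierarchical estimate:", round(beta_hierarchical, 3))
```

Under these assumptions the hierarchical estimate should land noticeably away from the true effect, even with 10,000 students, because the model’s key assumption is violated by construction.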

With that in mind, social scientists (who work with data where almost everything is extremely correlated) often prefer a different strategy: fixed effects with clustered standard errors.

Fixed Effects & Clustered Standard Errors

Fixed effects are just like hierarchical group estimates, except they don’t require that the level (or slope) differences of groups be uncorrelated with other explanatory variables. This does make them less statistically efficient (you tend to get larger standard errors than in hierarchical models), but it also makes them more robust (they give you unbiased estimates whether a correlation with explanatory variables exists or not).
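As a rough illustration, the sketch below estimates a fixed effects model two equivalent ways on simulated data: by subtracting each school’s mean from its students’ values (the “within” transformation), and by adding one dummy variable per school. The data, the variable names, and the true effect of -1.0 are all assumptions made up for this example. Low-income status is deliberately correlated with school quality, which is the situation described above.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical simulated data: 200 schools with 20 students each.
n_schools, n_per = 200, 20
true_beta = -1.0  # assumed true effect of being low-income on test scores

# School quality is correlated with the share of low-income students.
school_effect = rng.normal(0.0, 1.0, n_schools)
p_low_income = np.clip(0.5 - 0.3 * school_effect, 0.05, 0.95)
low_income = (rng.random((n_schools, n_per)) < p_low_income[:, None]).astype(float)
score = (2.0 + true_beta * low_income + school_effect[:, None]
         + rng.normal(0.0, 1.0, (n_schools, n_per)))

# Approach 1: the within transformation. Subtract each school's mean so
# only variation among students in the same school remains.
x_within = low_income - low_income.mean(axis=1, keepdims=True)
y_within = score - score.mean(axis=1, keepdims=True)
beta_within = (x_within * y_within).sum() / (x_within ** 2).sum()

# Approach 2: one dummy variable per school (no separate intercept).
school_id = np.repeat(np.arange(n_schools), n_per)
dummies = (school_id[:, None] == np.arange(n_schools)).astype(float)
X = np.column_stack([low_income.ravel(), dummies])
beta_dummies = np.linalg.lstsq(X, score.ravel(), rcond=None)[0][0]

print("true effect:       ", true_beta)
print("within estimate:   ", round(beta_within, 3))
print("dummies estimate:  ", round(beta_dummies, 3))
```

The two approaches give the same coefficient, and because both compare students only to others in the same school, the correlation between income and school quality does not contaminate the estimate.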

But fixed effects don’t address the second purpose of hierarchical models: correcting your standard errors. For that, we use clustered standard errors. The idea of clustered standard errors is to account for the fact that if everyone in a given group has a high (or low) error, we shouldn’t treat that like lots of individual high (or low) errors (which would generate small standard errors because of the implied large sample size), but rather as a single high (or low) error for the group (with a correspondingly large standard error since, really, there’s only one observation of a group error common to all observations). As a result, we generally get larger standard errors when we cluster.
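The sketch below shows the effect on simulated data, computing both kinds of standard errors by hand with NumPy. It is an illustration under stated assumptions, not a reference implementation: the data and the school-level `has_program` variable are hypothetical, there are no fixed effects (to keep the focus on the standard errors), and the clustered estimate omits the small-sample corrections that statistical packages typically apply.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical simulated data: 50 schools with 30 students each.
n_schools, n_per = 50, 30
school_id = np.repeat(np.arange(n_schools), n_per)

# A school-level explanatory variable (every other school has a program)
# and a school-level shock shared by all students in the school.
has_program = (np.arange(n_schools) % 2).astype(float)
school_shock = rng.normal(0.0, 1.0, n_schools)
x = has_program[school_id]
y = 1.0 + 0.5 * x + school_shock[school_id] + rng.normal(0.0, 1.0, n_schools * n_per)

# Ordinary least squares.
X = np.column_stack([np.ones_like(x), x])
beta = np.linalg.solve(X.T @ X, X.T @ y)
resid = y - X @ beta
bread = np.linalg.inv(X.T @ X)

# Naive standard errors: treat all 1,500 students as independent.
sigma2 = resid @ resid / (len(y) - X.shape[1])
naive_se = np.sqrt(np.diag(sigma2 * bread))

# Clustered standard errors: sum X'e within each school first, so each
# school contributes one term to the middle of the sandwich.
meat = np.zeros((X.shape[1], X.shape[1]))
for g in range(n_schools):
    in_g = school_id == g
    score_g = X[in_g].T @ resid[in_g]
    meat += np.outer(score_g, score_g)
clustered_se = np.sqrt(np.diag(bread @ meat @ bread))

print("naive SE on has_program:    ", round(naive_se[1], 3))
print("clustered SE on has_program:", round(clustered_se[1], 3))
```

Because the shocks here are shared within schools, the clustered standard error on `has_program` comes out larger than the naive one, reflecting that we effectively have 50 schools rather than 1,500 independent students.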

(Fixed effects are most often used to allow for different group intercepts; it is less common to use them to allow for differential group slopes, though this can be done.)

I recognize this all feels hand-wavy. If you’re interested in the math, you can find a more detailed mathematical discussion, but I’ll confess that this is an area where I haven’t found the math to offer much intuition. Technically, what we’re doing is allowing for off-the-diagonal terms in our variance-covariance matrix for members of each group to allow for correlated shocks. But if that explanation helps you understand in principle what we’re doing, you’re smarter than me. :)
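For readers who do find the matrix view useful, here is a tiny sketch of what “off-the-diagonal terms” means. The nine observations, three groups, and stand-in residuals are all made up for illustration. The clustered error covariance keeps products of residuals only for pairs of observations in the same group, which makes it block-diagonal, and plugging it into the middle of the usual sandwich formula gives the same result as summing group by group.

```python
import numpy as np

rng = np.random.default_rng(2)

# Nine hypothetical observations in three groups.
group = np.array([0, 0, 0, 1, 1, 2, 2, 2, 2])
n = len(group)
X = np.column_stack([np.ones(n), rng.normal(size=n)])
resid = rng.normal(size=n)  # stand-in for regression residuals

# Without clustering: only diagonal terms (each observation on its own).
omega_independent = np.diag(resid ** 2)

# With clustering: also keep off-diagonal terms for pairs of observations
# in the same group; pairs in different groups stay at zero.
same_group = group[:, None] == group[None, :]
omega_clustered = np.outer(resid, resid) * same_group

# The middle of the sandwich, computed from the full matrix...
meat_matrix = X.T @ omega_clustered @ X

# ...equals the sum of per-group outer products of X'e.
meat_by_group = sum(
    np.outer(s, s)
    for s in (X[group == g].T @ resid[group == g] for g in np.unique(group))
)

print(np.round(omega_clustered, 2))
print("same result both ways:", np.allclose(meat_matrix, meat_by_group))
```

Printing `omega_clustered` shows three square blocks along the diagonal (sizes 3, 2, and 4) with zeros everywhere else.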

So if you’re using fixed effects for groups that you think might be subject to common “shocks” (things you aren’t directly measuring that affect the outcomes of all the observations in a group, like how all students in a school may see test scores change because of an especially good principal, or how all stores in a city may have low sales because the city is unusually poor), you should cluster your standard errors by that group (we’ll discuss how to cluster in Python in the next exercises).

Last note: hierarchical models are equivalent to what social scientists often call “random effects models with clustered standard errors.” So just think “hierarchical models” when you hear “random effects.”

Things To Remember:

  • Everything you learned about when and how you can use hierarchical models applies to fixed effects + clustered standard errors, except that fixed effects + clustered standard errors are even more robust.

  • If you’re using fixed effects for groups that might be subject to common shocks, you should probably also be using clustered standard errors.

  • Random effects + clustered standard errors is just a different term for hierarchical models, and often you’ll just hear them called “random effects models” or “mixed effects models”.
