Fixed Effects: Indicator Variables for Groups

One common use of indicator variables are as fixed effects. Fixed effects are used when our data as a “nested” structure (we think individual observations belong to groups), and we suspect different things may be happening in each group.

For example, suppose we have a dataset of student test scores, and students are all grouped into different schools; or perhaps we have data on earnings and gender across US cities. In these examples, individual observations can be thought of as being grouped into schools or cities.

One option with this kind of data is to just ignore the groups. For example, if we want to know about differences in the academic performance of minority children across the school system, then we might not want to add controls for students’ schools because we think that part of way race impacts performance is though sorting of minority students into worse schools. If we added school fixed effects, we’d lose that variation.

But suppose we were interested in understanding whether school administrators treat minority children differently, and whether this affects academic performance. Principles, for example, may be more likely to suspect Black children than White children. If that were our interest, then what we really want to know about is how race impacts academic performance among students in the same school. And that’s where fixed effects are useful – they let us control for group-level effects (like the fact all children in one school might tend to get lower grades) so we can focus on explaining intra-group variation (differences among children at the same school).

In this regard, fixed effects are analogous in purpose to hierarchical models, though they are slightly different in implementation (differences between fixed effects and hierarchical models are discussed here).

Implementing Fixed Effects

To illustrate, let’s try and estimate how gender impacts earnings in the US using data from the US Current Population Survey (CPS) on US wages in 2019. We’ll begin with a simple model of earnings:

[1]:
import pandas as pd

# Load survey
cps = pd.read_stata('https://github.com/nickeubank/MIDS_Data/blob/master/Current_Population_Survey/morg18.dta?raw=true')

# Limit to people currently employed and working full time.
cps = cps[cps.lfsr94 == 'Employed-At Work']
cps = cps[cps.uhourse >= 35]

# And we can adjust earnings per hour (in cents) into dollars,
cps['earnhre_dollars'] = cps['earnhre'] / 100
cps['annual_earnings'] = cps['earnhre_dollars'] * cps['uhourse'] * 52

# And create gender and college educ variable
cps['female'] = (cps.sex == 2).astype('int')
cps['has_college_educ'] = (cps.grade92 > 43).astype('int')
[2]:
import statsmodels.formula.api as smf
smf.ols('annual_earnings ~ female + age + has_college_educ', cps).fit().summary()
[2]:
OLS Regression Results
Dep. Variable: annual_earnings R-squared: 0.102
Model: OLS Adj. R-squared: 0.102
Method: Least Squares F-statistic: 2490.
Date: Mon, 17 Feb 2020 Prob (F-statistic): 0.00
Time: 09:27:22 Log-Likelihood: -7.5063e+05
No. Observations: 65755 AIC: 1.501e+06
Df Residuals: 65751 BIC: 1.501e+06
Df Model: 3
Covariance Type: nonrobust
coef std err t P>|t| [0.025 0.975]
Intercept 3.239e+04 281.264 115.142 0.000 3.18e+04 3.29e+04
female -8052.8166 172.035 -46.809 0.000 -8390.004 -7715.629
age 290.0363 6.255 46.370 0.000 277.777 302.296
has_college_educ 2.367e+04 410.579 57.646 0.000 2.29e+04 2.45e+04
Omnibus: 31647.447 Durbin-Watson: 1.882
Prob(Omnibus): 0.000 Jarque-Bera (JB): 264260.027
Skew: 2.152 Prob(JB): 0.00
Kurtosis: 11.828 Cond. No. 209.


Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

In this model, we’re getting estimates of how education and gender explain variation across all Americans.

But in this dataset, we also have a variable that tells us the industry in which each respondent is employed. If we want to understand the relationship between gender and income through both workplace bias and sectoral sorting, we can use the model above. But suppose we want to estimate wage discrimination in the workplace after controlling for the industry into which someone chooses to work. In other words, we want to know about the impact of gender on wages within industries.

To do so, we can add an indicator for each respondent’s industry (in the ind02 variable):

Then we can run the following regression:

[3]:
# Adding semi-colon to suppress output because too long...
smf.ols('annual_earnings ~ female + age + has_college_educ + C(ind02)', cps).fit().summary();

which will generate output that will look approximately like this (note your output will be VERY long – I’m omitting all the county coeffients for space. We’ll talk later about how to suppress those in your output):

cps_fe_unclustered_p1

.
.
.

cps_fe_unclustered_p2

Voila! What you’ve just estimated is no longer the relationship between gender and income across all Americans, but rather the relationship between gender and income within each industry.

To be clear, fixed effects aren’t mathematically different from adding a normal control variable. One could say that adding has_college_educ means that we’re now estimating the relationship between gender and income among college educated and among non-college educated. Mechnically, fixed effects are just additional indicator variables. But because we often use them for groups, thinking about the fact that, when added, one is effectively estimating variation within the groups specified by the fixed effects is a powerful idea.

Perhaps no place is this more clear than in full panel data, where you have data on the same entities over time. In a panel regression, the addition of entity fixed effects allow you to difference out any constant differences between entities, and focus only on changes within each entity over time. This even works for people! In a panel with individuals observed over time, adding individual fixed effects means you’re effectively controlling for anything constant about each individual (things that don’t change over time), and now you’re just studying changes over time for each individual.

Clustering

When working with fixed effects, however, it’s also often a good idea to cluster your standard errors by your fixed effect variable. Clustering is a method for taking into account some of the variation in your data isn’t coming from the individual level (where you have lots of observations), but rather from the group level. Since you have fewer groups than observations, clustering corrects your standard errors to reflect the smaller effective sample size being used to estimate those fixed effects (clustering only affects standard errors – it has no impact on coefficients themselves. This is just about adjustments to our confidence in our inferences).

Clustering is thankfully easy to do – just use the get_robustcov_results method from statsmodels, and use the groups keyword to pass the group assignments for each observation.

(R users: as we’ll discuss below, I think the easiest way to do this is to use the plm package.)

Note that if you’re using formulas in statsmodels, the regression is automatically dropping observations that can’t be estimated because of missing data, so you have to do the same before passing your group assignments to get_robustcov_results – otherwise you’ll get the error:

ValueError: The weights and list don't have the same length.

because the number of observations in the model doesn’t match the number of observations in the group assignment vector you pass!

[4]:
model = smf.ols('annual_earnings ~ female + age + has_college_educ + C(ind02)', cps).fit()

# Drop any entries with missing data from the model
fe_groups = cps.copy()
for i in ['annual_earnings', 'female', 'age', 'ind02', 'has_college_educ']:
    fe_groups = fe_groups[pd.notnull(fe_groups[i])]
[5]:
# Adjust SEs
# Again, suppressing actual output for space with semicolon
model.get_robustcov_results(cov_type='cluster',
                            groups=fe_groups.ind02).summary();
/Users/Nick/miniconda3/lib/python3.7/site-packages/statsmodels/base/model.py:1752: ValueWarning: covariance of constraints does not have full rank. The number of constraints is 253, but rank is 3
  'rank is %d' % (J, J_), ValueWarning)

cps_fe_clustered_p1

.
.
.

cps_fe_clustered_p2

As you can see, while our point estimates haven’t changed at all (the coefficient on female, for example, is still -6,892), we have increased the size of our standard errors. The SE on female, for example, has gone from 190 without clustering to 420 with clustering.

Computationally Efficient Fixed Effects

OK, so everything we’ve describe up till here is a reasonable approach to fixed effects, but it has two limitations: our regression output looks terrible, and computing all those intercepts was slow.

This brings us to some of the specialized methods for calculating fixed effects. It turns out that if you aren’t interested in the coefficient on each fixed effect, there are much more computationally efficient methods of calculating fixed effects. But to use them, we’ll have to use a different library: linearmodels (installable using pip install linearmodels).

(R users: see note at bottom on doing this in R)

In particular, we’ll be using the PanelOLS function from linearmodels. As the name implies, PanelOLS is designed for linear regression (social scientists call linear regression Ordinary Least Squares, or OLS) with panel data, which is really any form of data organized along two dimensions. Normally a panel has data on many entities observed several times, so the first dimension is the entity dimension, and the second is the time dimension.

In this case, we don’t really have a panel – just nested data – but because fixed effects are commonly used in panels, we’ll use this tool.

The only catch is: you have to use multiindexes in pandas. I know, I hate them too. But the multi-index is required by the library for it to understand what variable constitutes the “group” for which you want to add fixed effects. Basically PanelOLS calls the first level of the multi-index the entity and the second level time. In this case, though, we’ll just make the first level our counties, and the second level individual identifiers, then use entity fixed effects (and clustering).

[6]:
cps.head()
[6]:
county smsastat age sex grade92 race ethnic marital uhourse earnhre ... occ2012 lfsr94 class94 unioncov ind02 stfips earnhre_dollars annual_earnings female has_college_educ
2 0 1.0 52 2 39 2 NaN 5 40.0 2084.0 ... 5700.0 Employed-At Work Government - State Residential care facilities, without nursing (... AL 20.84 43347.2 1 0
3 0 1.0 19 2 39 2 NaN 7 40.0 1000.0 ... 5240.0 Employed-At Work Private, For Profit No Business support services (5614) AL 10.00 20800.0 1 0
4 0 1.0 56 2 43 2 NaN 5 40.0 2500.0 ... 3255.0 Employed-At Work Government - Federal Hospitals (622) AL 25.00 52000.0 1 0
6 97 1.0 48 1 39 1 NaN 7 40.0 1700.0 ... 9130.0 Employed-At Work Private, For Profit No Truck transportation (484) AL 17.00 35360.0 0 0
17 97 1.0 59 1 39 2 NaN 7 40.0 2000.0 ... 9620.0 Employed-At Work Private, For Profit No ****Department stores and discount stores (s45... AL 20.00 41600.0 0 0

5 rows × 29 columns

[7]:
# Move county groups into highest level of multi-index,
# with old index in second level.
# PanelOLS will then see the first level as the `entity`
# identifier.
cps_w_multiindex = cps.set_index(['ind02', cps.index])
cps_w_multiindex.head()
[7]:
county smsastat age sex grade92 race ethnic marital uhourse earnhre ... ms123 occ2012 lfsr94 class94 unioncov stfips earnhre_dollars annual_earnings female has_college_educ
ind02
Residential care facilities, without nursing (6232, 6233, 6239) 2 0 1.0 52 2 39 2 NaN 5 40.0 2084.0 ... NaN 5700.0 Employed-At Work Government - State AL 20.84 43347.2 1 0
Business support services (5614) 3 0 1.0 19 2 39 2 NaN 7 40.0 1000.0 ... NaN 5240.0 Employed-At Work Private, For Profit No AL 10.00 20800.0 1 0
Hospitals (622) 4 0 1.0 56 2 43 2 NaN 5 40.0 2500.0 ... NaN 3255.0 Employed-At Work Government - Federal AL 25.00 52000.0 1 0
Truck transportation (484) 6 97 1.0 48 1 39 1 NaN 7 40.0 1700.0 ... NaN 9130.0 Employed-At Work Private, For Profit No AL 17.00 35360.0 0 0
****Department stores and discount stores (s45211) 17 97 1.0 59 1 39 2 NaN 7 40.0 2000.0 ... NaN 9620.0 Employed-At Work Private, For Profit No AL 20.00 41600.0 0 0

5 rows × 28 columns

[8]:
from linearmodels import PanelOLS
mod = PanelOLS.from_formula('annual_earnings ~ female + age + has_college_educ + EntityEffects',
                            data=cps_w_multiindex)
mod.fit(cov_type='clustered', cluster_entity=True)
/Users/Nick/miniconda3/lib/python3.7/site-packages/linearmodels/utility.py:478: MissingValueWarning:
Inputs contain missing values. Dropping rows with missing observations.
  warnings.warn(missing_value_warning_msg, MissingValueWarning)
[8]:
PanelOLS Estimation Summary
Dep. Variable: annual_earnings R-squared: 0.0783
Estimator: PanelOLS R-squared (Between): 0.3536
No. Observations: 65755 R-squared (Within): 0.0783
Date: Mon, Feb 17 2020 R-squared (Overall): 0.2878
Time: 09:27:28 Log-likelihood -7.465e+05
Cov. Estimator: Clustered
F-statistic: 1853.9
Entities: 251 P-value 0.0000
Avg Obs: 261.97 Distribution: F(3,65501)
Min Obs: 2.0000
Max Obs: 5603.0 F-statistic (robust): 242.78
P-value 0.0000
Time periods: 65755 Distribution: F(3,65501)
Avg Obs: 1.0000
Min Obs: 1.0000
Max Obs: 1.0000
Parameter Estimates
Parameter Std. Err. T-stat P-value Lower CI Upper CI
female -6891.8 418.12 -16.483 0.0000 -7711.3 -6072.3
age 247.57 14.917 16.597 0.0000 218.34 276.81
has_college_educ 1.997e+04 1555.4 12.839 0.0000 1.692e+04 2.302e+04


F-test for Poolability: 34.810
P-value: 0.0000
Distribution: F(250,65501)

Included effects: Entity
id: 0x7f86332e4fd0

As you can see, the coefficients of the PanelOLS model are exactly the same as those we calculated above, and the standard errors are nearly identical (there are a few ways to calculate clustered standard errors, so they aren’t numerically equivalent). But the way PanelOLS added fixed effects was much more computationally efficient efficient, and in these situations, we don’t usually want to see the coefficients, so they’re suppressed.

Panel Analysis in R

If you’re an R user, the best library I am aware of for these types of analyses is lfe, which you can read about here.

The analogue of the PanelOLS regression above with lfe would be:

# estimate the fixed effects regression with felm()
library(lfe) #fixed effect model package

## Loading required package: Matrix
library(haven) # load stata package

# load data
cps = read_dta('https://github.com/nickeubank/MIDS_Data/blob/master/Current_Population_Survey/morg18.dta?raw=true')

# create dv
cps$earnhre_dollars = cps$earnhre / 100
cps['annual_earnings'] = cps['earnhre_dollars'] * cps['uhourse'] * 52

# model specification
model <- felm(annual_earnings ~ age  | county | 0 |county, data = cps)

# model summary
summary(model)

##
## Call:
##    felm(formula = annual_earnings ~ age | county | 0 | county, data = cps)
##
## Residuals:
##    Min     1Q Median     3Q    Max
## -63187 -14735  -3622   9682 330934
##
## Coefficients:
##     Estimate Cluster s.e. t value Pr(>|t|)
## age  296.703        5.347   55.49   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 24510 on 93599 degrees of freedom
##   (208632 observations deleted due to missingness)
## Multiple R-squared(full model): 0.03591   Adjusted R-squared: 0.03488
## Multiple R-squared(proj model): 0.03234   Adjusted R-squared: 0.0313
## F-statistic(full model, *iid*):34.87 on 100 and 93599 DF, p-value: < 2.2e-16
## F-statistic(proj model):  3079 on 1 and 99 DF, p-value: < 2.2e-16

Where felm is the panel regression.

The syntax for felm is:

felm(dependent variable ~ independent variable | fixed effects variables (if you have 2+, add them like 'fe1 + fe2')  | instrumental variables (0 if no instrument) | level of clustered se, data )

We haven’t covered (and probably won’t cover) instrumental variables, so you’ll usually leave the middle entry as 0:

felm(dependent variable ~ independent variables | fixed effects variables | 0 | level of clustered se, data )