{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Using and Interpreting Indicator (Dummy) Variables\n", "\n", "We often gloss over indicator variables in our statistics courses. That's a shame: not only are they (in my view) one of the most powerful tools in a data scientist's toolbox, but I cannot tell you how often I see people struggle with *interpreting* indicator variables in their regressions. So in this tutorial, I'll try to give them the treatment they deserve, and hopefully by the end you'll have a firm understanding of how to both use *and interpret* indicator variables." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## What are indicator variables?\n", "\n", "Indicator variables -- sometimes also referred to as dummy variables, though I don't know why -- are variables that take on only the values 0 and 1, and are used to *indicate* whether a given observation belongs to a discrete category in a way that can be used in statistical models. \n", "\n", "For example, indicator variables can be used to indicate if a survey respondent is a woman (if the variable is 1 for women, 0 otherwise) or a Democrat (if the variable is 1 for Democrats, 0 otherwise). In addition, as discussed in more detail below, collections of indicator variables can also be used to code categorical variables that take on more than 2 values using a method called \"one-hot encoding\". This allows us to work with variables that have many levels, like an individual's political party registration (which could be Democrat, Independent, or Republican). 
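
As a minimal sketch of what this coding looks like in practice (the small DataFrame `toy` is made up for illustration and is not the voter file used later), a single indicator can be built from a comparison, and a full set of indicators with `pd.get_dummies`:

```python
import pandas as pd

# Hypothetical toy data, invented purely for illustration.
toy = pd.DataFrame(
    {'party': ['DEMOCRATIC', 'REPUBLICAN', 'UNAFFILIATED', 'DEMOCRATIC']}
)

# A single 0/1 indicator: 1 for Democrats, 0 for everyone else.
toy['democrat'] = (toy['party'] == 'DEMOCRATIC').astype(int)

# One indicator column per category (one-hot encoding).
party_dummies = pd.get_dummies(toy['party']).astype(int)

print(toy['democrat'].tolist())
print(party_dummies)
```

Each row of `party_dummies` has exactly one 1, which is where the name one-hot comes from.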
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## The ONE thing that you must understand when using indicator variables:\n", "\n", "When you put an indicator variable in a regression model, there are two things you must always keep in mind about interpreting the coefficients associated with the indicator variable: \n", "\n", "1) The coefficient on an indicator variable is an estimate of the average **DIFFERENCE** in the dependent variable for the group identified by the indicator variable (after taking into account other variables in the regression) and\n", "\n", "2) the **REFERENCE GROUP**, which is the set of observations for which the indicator variable is always zero. \n", "\n", "If you always remember that the coefficient on an indicator variable is an estimate of a **DIFFERENCE** with respect to a **REFERENCE GROUP** (also sometimes referred to as the \"omitted category\"), you're 90% of the way to understanding indicator variables. \n", "\n", "I recognize this may feel obvious, but trust me: I've literally reviewed papers from tenured faculty at major Universities that get this wrong. This is something people get confused about constantly, so I promise it's worth this treatment. \n", "\n", "OK, let's get concrete. " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Indicator Variables with Two Category Variable\n", "\n", "Let's start with a simple model in which we wish to predict voter turnout using data from North Carolina. Suppose we're interested in looking at how turnout varies by gender, which is dichotomous in the North Carolina voter file (obviously this is somewhat problematic given what we've come to know about gender, but in most datasets you'll find a dichotomous coding). " ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " age gender voted party race\n", "0 71 FEMALE 1 UNAFFILIATED WHITE\n", "1 47 MALE 1 UNAFFILIATED WHITE\n", "2 29 MALE 0 DEMOCRATIC WHITE\n", "3 60 MALE 1 REPUBLICAN WHITE\n", "4 84 MALE 0 DEMOCRATIC WHITE" ] }, "execution_count": 1, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Load data we'll use. Should work for anyone.\n", "import pandas as pd\n", "\n", "voters = pd.read_csv(\n", " \"https://raw.githubusercontent.com/nickeubank/\"\n", " \"css_tutorials/master/exercise_data/voter_turnout.csv\"\n", ")\n", "voters.head()" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " age gender voted party race female\n", "0 71 FEMALE 1 UNAFFILIATED WHITE True\n", "1 47 MALE 1 UNAFFILIATED WHITE False\n", "2 29 MALE 0 DEMOCRATIC WHITE False\n", "3 60 MALE 1 REPUBLICAN WHITE False\n", "4 84 MALE 0 DEMOCRATIC WHITE False" ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Create a 0/1 variable for female.\n", "voters[\"female\"] = voters.gender == \"FEMALE\"\n", "voters.head()" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "
[1] Standard Errors are heteroscedasticity robust (HC3)" ], "text/plain": [ "\n", "\"\"\"\n", " OLS Regression Results \n", "==============================================================================\n", "Dep. Variable: voted R-squared: 0.000\n", "Model: OLS Adj. R-squared: -0.000\n", "Method: Least Squares F-statistic: 0.7416\n", "Date: Sun, 16 Feb 2020 Prob (F-statistic): 0.389\n", "Time: 11:07:14 Log-Likelihood: -5768.3\n", "No. Observations: 9919 AIC: 1.154e+04\n", "Df Residuals: 9917 BIC: 1.156e+04\n", "Df Model: 1 \n", "Covariance Type: HC3 \n", "==================================================================================\n", " coef std err t P>|t| [0.025 0.975]\n", "----------------------------------------------------------------------------------\n", "Intercept 0.7461 0.007 113.927 0.000 0.733 0.759\n", "female[T.True] 0.0075 0.009 0.861 0.389 -0.010 0.025\n", "==============================================================================\n", "Omnibus: 1890.028 Durbin-Watson: 1.974\n", "Prob(Omnibus): 0.000 Jarque-Bera (JB): 2391.714\n", "Skew: -1.156 Prob(JB): 0.00\n", "Kurtosis: 2.337 Cond. No. 2.77\n", "==============================================================================\n", "\n", "Warnings:\n", "[1] Standard Errors are heteroscedasticity robust (HC3)\n", "\"\"\"" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import statsmodels.formula.api as smf\n", "\n", "model = smf.ols(\"voted ~ female\", voters).fit()\n", "model.get_robustcov_results(\"HC3\").summary()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "OK, so how do we interpret this coefficient of 0.0075 on female? As we said before, it is the average **DIFFERENCE** in the dependent variable (Whether the person votes) with respect to a **REFERENCE GROUP**. The reference group is *the group for whom the indicator is always equal to zero*, which in this case is the set of male voters. 
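
To make the DIFFERENCE idea concrete, here is a small sketch on made-up data (the arrays below are invented, not taken from the voter file, and plain least squares from numpy stands in for statsmodels). With a single indicator on the right-hand side, the intercept equals the mean of the reference group and the coefficient equals the gap between the two group means:

```python
import numpy as np

# Hypothetical outcomes (1 = voted) and a 0/1 indicator, invented for illustration.
voted = np.array([1, 0, 1, 1, 0, 1, 1, 0], dtype=float)
female = np.array([1, 1, 1, 1, 0, 0, 0, 0], dtype=float)

# OLS with an intercept and the indicator as the only regressor.
X = np.column_stack([np.ones_like(female), female])
intercept, coef = np.linalg.lstsq(X, voted, rcond=None)[0]

mean_reference = voted[female == 0].mean()  # reference group (indicator = 0)
mean_indicated = voted[female == 1].mean()  # indicated group (indicator = 1)

print(intercept, mean_reference)
print(coef, mean_indicated - mean_reference)
```
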
\n", "\n", "So this says that women are about 0.75 percentage points *more likely to vote (in North Carolina) than men*, although with a p-value of 0.389 we can't distinguish that difference from zero.\n", "\n", "Now let's try a more interesting example: Democrats. " ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "DEMOCRATIC 4426\n", "REPUBLICAN 3365\n", "UNAFFILIATED 2128\n", "Name: party, dtype: int64" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "voters.party.value_counts() # OK, note here that there are THREE party registrations in this data." ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " age gender voted party race female democrat\n", "0 71 FEMALE 1 UNAFFILIATED WHITE True False\n", "1 47 MALE 1 UNAFFILIATED WHITE False False\n", "2 29 MALE 0 DEMOCRATIC WHITE False True\n", "3 60 MALE 1 REPUBLICAN WHITE False False\n", "4 84 MALE 0 DEMOCRATIC WHITE False True" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# So let's do the same thing as before for Democrats\n", "voters[\"democrat\"] = voters.party == \"DEMOCRATIC\"\n", "voters.head()" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "
[1] Standard Errors are heteroscedasticity robust (HC3)" ], "text/plain": [ "\n", "\"\"\"\n", " OLS Regression Results \n", "==============================================================================\n", "Dep. Variable: voted R-squared: 0.004\n", "Model: OLS Adj. R-squared: 0.004\n", "Method: Least Squares F-statistic: 40.81\n", "Date: Sun, 16 Feb 2020 Prob (F-statistic): 1.76e-10\n", "Time: 11:07:15 Log-Likelihood: -5748.0\n", "No. Observations: 9919 AIC: 1.150e+04\n", "Df Residuals: 9917 BIC: 1.151e+04\n", "Df Model: 1 \n", "Covariance Type: HC3 \n", "====================================================================================\n", " coef std err t P>|t| [0.025 0.975]\n", "------------------------------------------------------------------------------------\n", "Intercept 0.7754 0.006 137.665 0.000 0.764 0.786\n", "democrat[T.True] -0.0562 0.009 -6.388 0.000 -0.073 -0.039\n", "==============================================================================\n", "Omnibus: 1865.360 Durbin-Watson: 1.973\n", "Prob(Omnibus): 0.000 Jarque-Bera (JB): 2361.772\n", "Skew: -1.149 Prob(JB): 0.00\n", "Kurtosis: 2.343 Cond. No. 2.51\n", "==============================================================================\n", "\n", "Warnings:\n", "[1] Standard Errors are heteroscedasticity robust (HC3)\n", "\"\"\"" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Note that because we're estimating a linear probability\n", "# model we need to use heteroskedastic robust\n", "# standard errors.\n", "\n", "model = smf.ols(\"voted ~ democrat\", voters).fit()\n", "model.get_robustcov_results(\"HC3\").summary()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "So now how do we interpret this coefficient on Democrats (-0.056)? As before it's the average **DIFFERENCE** in the dependent variable between the indicated group (Democrats) and the reference group. But what's the reference group? 
Republicans?\n", "\n", "No -- the **reference group** or **omitted category** is anyone for whom the indicator variable is zero -- in this case, all non-Democrats, whether they're Republicans or Unaffiliated. \n", "\n", "So this result says that Democrats are about 5.6 percentage points less likely to vote than non-Democrats, but NOT that they're less likely to vote than Republicans per se. \n", "\n", "So how do we deal with multiple categories? With multiple indicator variables!" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Indicator Variables for variables with more than 2 categories\n", "\n", "To deal with categorical variables with more than 2 categories, we create indicator variables for all values of the variable *except one*. The one group for which we do not create an indicator variable will become the **reference group** for the regression. The choice of which value to make the reference category won't substantively change the results of the regression -- for example, if you also have a control for age, the coefficient on age will always be the same regardless of the reference group used -- but it does influence how easily you can interpret the results of the regression. \n", "\n", "This practice of creating a *collection* of indicator variables to encode a single categorical variable is what's called \"one-hot encoding\" by computer scientists / machine learning people. (Strictly speaking, one-hot encoding creates an indicator for *every* category; in a regression with an intercept we leave one out, and that one becomes the reference group.) \n", "\n", "Since we're interested in the difference in turnout between Democrats and Republicans, let's make Republicans the reference category, and make indicators for Democrats and Unaffiliated voters. " ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " age gender voted party race female democrat unaffiliated\n", "0 71 FEMALE 1 UNAFFILIATED WHITE True False True\n", "1 47 MALE 1 UNAFFILIATED WHITE False False True\n", "2 29 MALE 0 DEMOCRATIC WHITE False True False\n", "3 60 MALE 1 REPUBLICAN WHITE False False False\n", "4 84 MALE 0 DEMOCRATIC WHITE False True False" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "voters[\"unaffiliated\"] = voters.party == \"UNAFFILIATED\"\n", "voters.head()" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "
[1] Standard Errors are heteroscedasticity robust (HC3)" ], "text/plain": [ "\n", "\"\"\"\n", " OLS Regression Results \n", "==============================================================================\n", "Dep. Variable: voted R-squared: 0.007\n", "Model: OLS Adj. R-squared: 0.007\n", "Method: Least Squares F-statistic: 35.67\n", "Date: Sun, 16 Feb 2020 Prob (F-statistic): 3.67e-16\n", "Time: 11:07:15 Log-Likelihood: -5735.2\n", "No. Observations: 9919 AIC: 1.148e+04\n", "Df Residuals: 9916 BIC: 1.150e+04\n", "Df Model: 2 \n", "Covariance Type: HC3 \n", "========================================================================================\n", " coef std err t P>|t| [0.025 0.975]\n", "----------------------------------------------------------------------------------------\n", "Intercept 0.7988 0.007 115.554 0.000 0.785 0.812\n", "democrat[T.True] -0.0797 0.010 -8.240 0.000 -0.099 -0.061\n", "unaffiliated[T.True] -0.0606 0.012 -5.143 0.000 -0.084 -0.037\n", "==============================================================================\n", "Omnibus: 1854.446 Durbin-Watson: 1.973\n", "Prob(Omnibus): 0.000 Jarque-Bera (JB): 2341.346\n", "Skew: -1.144 Prob(JB): 0.00\n", "Kurtosis: 2.344 Cond. No. 
3.85\n", "==============================================================================\n", "\n", "Warnings:\n", "[1] Standard Errors are heteroscedasticity robust (HC3)\n", "\"\"\"" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "model = smf.ols(\"voted ~ democrat + unaffiliated\", voters).fit()\n", "model.get_robustcov_results(\"HC3\").summary()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "(Note that we also could have used the syntax `smf.ols('voted ~ C(party)', voters)`, which will automatically convert your data into indicator variables. By default the omitted category is then picked for you (the first level in sorted order, here `DEMOCRATIC`); you can override that with something like `C(party, Treatment(reference=\"REPUBLICAN\"))`, but sometimes it's nice to be explicit.)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "How do we interpret these results? \n", "\n", "First, we see that the coefficient on `democrat` is -0.08. That means that the **DIFFERENCE** in turnout between Democrats and the reference group (here, Republicans) is 8 percentage points. So Democrats have 8 percentage points lower turnout on average in this data than Republicans. \n", "\n", "Second, we see that the coefficient on `unaffiliated` is -0.06. That means that the **DIFFERENCE** in turnout between Unaffiliated voters and the reference group (here, Republicans) is 6 percentage points. So Unaffiliated voters have 6 percentage points lower turnout on average in this data than Republicans. \n", "\n", "Moreover, the p-values on these indicator variables tell us if these differences are significant. And indeed, they show clearly that the difference between Democrats and Republicans and the difference between Unaffiliated voters and Republicans are both significant. \n", "\n", "But what about the difference between Democrats and Unaffiliated voters? Well, turns out the regression doesn't give us that directly. To get that, we have to do some additional math. 
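
The logic behind that extra math can be sketched on made-up data (the three small groups below are invented for illustration, and numpy least squares stands in for statsmodels). Each coefficient is a group's mean minus the reference group's mean, so subtracting one coefficient from the other cancels the reference group and leaves the gap between the two non-reference groups:

```python
import numpy as np

# Hypothetical turnout (1 = voted) for three invented groups of four people.
party = np.array(['REP'] * 4 + ['DEM'] * 4 + ['UNA'] * 4)
voted = np.array([1, 1, 1, 0, 1, 0, 0, 0, 1, 1, 0, 0], dtype=float)

# Indicators for Democrats and Unaffiliated; Republicans are the reference group.
democrat = (party == 'DEM').astype(float)
unaffiliated = (party == 'UNA').astype(float)
X = np.column_stack([np.ones(len(voted)), democrat, unaffiliated])
b0, b_dem, b_una = np.linalg.lstsq(X, voted, rcond=None)[0]

# Raw group means, for comparison with the regression coefficients.
means = {g: voted[party == g].mean() for g in ['REP', 'DEM', 'UNA']}

print(b_dem, means['DEM'] - means['REP'])
print(b_una, means['UNA'] - means['REP'])
print(b_dem - b_una, means['DEM'] - means['UNA'])
```
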
\n", "\n", "First, it's easy to estimate the difference in coefficients:\n", "\n", "```\n", "dem - unaffiliated = (dem - republican) - (unaffiliated - republican)\n", " = -0.08 - (-0.06)\n", " = -0.02\n", "``` \n", "So in other words, Democrats have about 2 percentage points lower turnout than Unaffiliated voters. \n", "\n", "But is this difference statistically significant? For that we have to run a post-regression test. (In R, you can do these with the `car` library using the `linearHypothesis` function.)" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "\n", " Test for Constraints \n", "==============================================================================\n", " coef std err t P>|t| [0.025 0.975]\n", "------------------------------------------------------------------------------\n", "c0 -0.0191 0.012 -1.634 0.102 -0.042 0.004\n", "==============================================================================" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "model = smf.ols(\"voted ~ democrat + unaffiliated\", voters).fit()\n", "model = model.get_robustcov_results(\"HC3\")\n", "\n", "hypothesis = \"democrat[T.True] = unaffiliated[T.True]\"\n", "model.t_test(hypothesis)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Voila -- p-value of 0.1.\n", "\n", "Wanna confirm it? Let's change our reference group to `unaffiliated`. Then when we look at the coefficient on `democrat`, that will be the difference between Democrats and the new reference group (Unaffiliated voters)." ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "
[1] Standard Errors are heteroscedasticity robust (HC3)" ], "text/plain": [ "\n", "\"\"\"\n", " OLS Regression Results \n", "==============================================================================\n", "Dep. Variable: voted R-squared: 0.007\n", "Model: OLS Adj. R-squared: 0.007\n", "Method: Least Squares F-statistic: 35.67\n", "Date: Sun, 16 Feb 2020 Prob (F-statistic): 3.67e-16\n", "Time: 11:07:16 Log-Likelihood: -5735.2\n", "No. Observations: 9919 AIC: 1.148e+04\n", "Df Residuals: 9916 BIC: 1.150e+04\n", "Df Model: 2 \n", "Covariance Type: HC3 \n", "======================================================================================\n", " coef std err t P>|t| [0.025 0.975]\n", "--------------------------------------------------------------------------------------\n", "Intercept 0.7383 0.010 77.436 0.000 0.720 0.757\n", "democrat[T.True] -0.0191 0.012 -1.634 0.102 -0.042 0.004\n", "republican[T.True] 0.0606 0.012 5.143 0.000 0.037 0.084\n", "==============================================================================\n", "Omnibus: 1854.446 Durbin-Watson: 1.973\n", "Prob(Omnibus): 0.000 Jarque-Bera (JB): 2341.346\n", "Skew: -1.144 Prob(JB): 0.00\n", "Kurtosis: 2.344 Cond. No. 4.60\n", "==============================================================================\n", "\n", "Warnings:\n", "[1] Standard Errors are heteroscedasticity robust (HC3)\n", "\"\"\"" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "voters[\"republican\"] = voters.party == \"REPUBLICAN\"\n", "model = smf.ols(\"voted ~ democrat + republican\", voters).fit()\n", "model.get_robustcov_results(\"HC3\").summary()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As we can now see, the coefficient on `democrat` (now the difference between Democrats and Unaffiliated voters) is exactly what we'd calculated above (-0.019) and has the same p-value we calculated previously (0.1). 
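
The same switch of reference group can also be made inside the formula instead of by building indicators by hand. The following is only a sketch on made-up data: it assumes statsmodels is using its patsy formula backend (where `Treatment(reference=...)` is available inside `C()`), and `formula_with_reference` is a hypothetical helper written for this example:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical data, invented for illustration only.
toy = pd.DataFrame({
    'party': ['REPUBLICAN'] * 4 + ['DEMOCRATIC'] * 4 + ['UNAFFILIATED'] * 4,
    'voted': [1, 1, 1, 0, 1, 0, 0, 0, 1, 1, 0, 0],
})


def formula_with_reference(level):
    # Hypothetical helper: Treatment(reference=...) names the omitted category.
    return f'voted ~ C(party, Treatment(reference={level!r}))'


fit_rep = smf.ols(formula_with_reference('REPUBLICAN'), toy).fit()
fit_una = smf.ols(formula_with_reference('UNAFFILIATED'), toy).fit()

# The coefficients are labelled and valued differently...
print(fit_rep.params)
print(fit_una.params)

# ...but both fits describe the same three group means.
same_fit = np.allclose(fit_rep.fittedvalues, fit_una.fittedvalues)
print(same_fit)
```
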
\n", "\n", "This just goes to show that the choice of reference group doesn't change what's actually being estimated, *it just changes the interpretation of coefficients*: which statistics pop right out of the regression output, and which values require a little extra work to get." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Interactions with Continuous Variables\n", "\n", "**Like regular Indicators, but for differences in SLOPE rather than differences in LEVELS!**\n", "\n", "Congratulations! You're a pro at indicator variables. Now we can turn to INTERACTIONS!\n", "\n", "Interactions (at least when you interact an indicator variable with a continuous variable), just like regular indicator variables, report **differences** between a group and the reference group. The difference is that instead of reporting the difference in **average value** of the dependent variable between the indicated group and the reference group, the coefficient on an interaction term is the average **DIFFERENCE** in the **SLOPE** associated with the continuous variable between the indicated group and the reference group. \n", "\n", "Let's be concrete: let's suppose we think that turnout among men increases as they get older by a larger amount than for women. In other words, we think that turnout increases with age for both groups, but that there's a **DIFFERENCE** in the amount it increases with age. \n", "\n", "To test this, we need to create some interaction terms. But first, a quick note: when doing interactions, it's critical to not only include all the interaction terms that interest you, **but also all the variables in the interaction as stand-alone variables.** So for this we want `age` interacted with `female`. But while the coefficient on that interaction is what we're interested in, to get the right results we also need to include just `age` and just `female`."
] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "
[1] Standard Errors are heteroscedasticity robust (HC3)" ], "text/plain": [ "\n", "\"\"\"\n", " OLS Regression Results \n", "==============================================================================\n", "Dep. Variable: voted R-squared: 0.001\n", "Model: OLS Adj. R-squared: 0.001\n", "Method: Least Squares F-statistic: 1.917\n", "Date: Sun, 16 Feb 2020 Prob (F-statistic): 0.124\n", "Time: 11:07:16 Log-Likelihood: -5764.7\n", "No. Observations: 9919 AIC: 1.154e+04\n", "Df Residuals: 9915 BIC: 1.157e+04\n", "Df Model: 3 \n", "Covariance Type: HC3 \n", "==================================================================================\n", " coef std err t P>|t| [0.025 0.975]\n", "----------------------------------------------------------------------------------\n", "Intercept 0.6961 0.029 23.789 0.000 0.639 0.753\n", "female[T.True] 0.0933 0.039 2.370 0.018 0.016 0.171\n", "age 0.0009 0.000 1.759 0.079 -9.81e-05 0.002\n", "age_x_female -0.0015 0.001 -2.237 0.025 -0.003 -0.000\n", "==============================================================================\n", "Omnibus: 1883.068 Durbin-Watson: 1.976\n", "Prob(Omnibus): 0.000 Jarque-Bera (JB): 2388.508\n", "Skew: -1.156 Prob(JB): 0.00\n", "Kurtosis: 2.340 Cond. No. 643.\n", "==============================================================================\n", "\n", "Warnings:\n", "[1] Standard Errors are heteroscedasticity robust (HC3)\n", "\"\"\"" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "voters[\"age_x_female\"] = voters.age * voters.female\n", "model = smf.ols(\"voted ~ age + female + age_x_female\", voters).fit()\n", "model.get_robustcov_results(\"HC3\").summary()" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "The coefficient on our interaction is -0.0015. Does that mean that as women get older, their turnout rate declines by -0.15 percentage points per year? 
**NO!**\n", "\n", "It says that however turnout varies with age for men, turnout will vary with age by 0.0015 *less* for women. It is the **DIFFERENCE** in slopes between the two groups.\n", "\n", "The coefficient on `age` tells us how turnout varies with age **for the reference group**. So it says that for men, turnout increases by 0.09 percentage points per year. \n", "\n", "But if you want to know how women's turnout varies with age, you have to **ADD** the coefficient on `age` (the rate of change for men) and the coefficient on `age_x_female` (the difference between the rates for men and women). \n", "\n", "So going through all these coefficients, we have: \n", "\n", "- `female` (0.09): At age zero, where the interaction term drops out, women are predicted to be 9 percentage points more likely to vote than men. Because of the interaction, this gap changes with age, so it is *not* the gap at typical voter ages. \n", "- `age` (0.0009): As men get one year older, they become 0.09 percentage points more likely to vote. \n", "- `age_x_female` (-0.0015): As women get one year older, the likelihood they vote increases by 0.15 percentage points *less* than it does for men. \n", "\n", "So how much does women's turnout increase if they age one year? 0.0009 + -0.0015 = -0.0006. So female turnout actually *declines* by 0.06 percentage points a year.\n", "\n", "Now let's talk statistical significance. The p-value on `age` (0.079) speaks to the relationship between age and turnout **for men**; here it is significant at the 10% level but not at the conventional 5% level. The p-value for `age_x_female` (0.025) tells us that there's a statistically significant **difference** between men and women in how turnout varies with age. But is there a statistically significant relationship between age and turnout for women?\n", "\n", "Again, we don't actually get an answer from our regression. 
To find out, we have to run the following: " ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "\n", " Test for Constraints \n", "==============================================================================\n", " coef std err t P>|t| [0.025 0.975]\n", "------------------------------------------------------------------------------\n", "c0 -0.0006 0.000 -1.389 0.165 -0.001 0.000\n", "==============================================================================" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "model = smf.ols(\"voted ~ age + female + age_x_female\", voters).fit()\n", "model = model.get_robustcov_results(\"HC3\")\n", "\n", "# Test whether the slope for women (age + age_x_female) differs from zero\n", "hypothesis = \"age + age_x_female = 0\"\n", "model.t_test(hypothesis)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "So the p-value for the relationship between age and turnout among women is 0.17, meaning we can't reject the hypothesis that women's turnout doesn't vary with age (and the estimated slope is -0.0006, just like we calculated above!)\n", "\n", "### Keeping Things Straight\n", "\n", "When you want to know a quantity, how can you figure out which coefficients to look at if you don't remember these rules?\n", "\n", "The simplest way to make sure you're interpreting indicators correctly is to think about what our model looks like for different kinds of people. So suppose we wanted to figure out how men's turnout varies with age. Let's look at our model:\n", "\n", "$$voted = \\beta_0 + \\beta_1 * age + \\beta_2 * female + \\beta_3 * (age * female) + \\epsilon$$\n", "\n", "Well, for men `female` and `age_x_female` will always be zero, so the model for men is actually just:\n", "\n", "$$voted_{men} = \\beta_0 + \\beta_1 * age + \\epsilon$$\n", "\n", "And how does this vary with age? Linearly by $\\beta_1$ per year.\n", "\n", "What about for women? 
For women `female` will always be 1 and `age_x_female` will just equal `age`, so the equation will effectively be:\n", "\n", "$$voted_{women} = \\beta_0 + \\beta_1 * age + \\beta_2 + \\beta_3 * age + \\epsilon$$\n", "$$voted_{women} = \\beta_0 + (\\beta_1 + \\beta_3)* age + \\beta_2 + \\epsilon$$\n", "\n", "And how does that vary with age? Linearly by $\\beta_1 + \\beta_3$ per year. \n", "\n", "Finally, if we want the *difference* between how men and women respond to age, we can write this out:\n", "\n", "$$voted_{women} - voted_{men} = $$\n", "$$= (\\beta_0 + \\beta_1 * age + \\beta_2 + \\beta_3 * age + \\epsilon) - (\\beta_0 + \\beta_1 * age + \\epsilon)$$\n", "$$=\\beta_2 + \\beta_3 * age$$\n", "\n", "So the gap between women and men at a given age is $\\beta_2 + \\beta_3 * age$, and the difference in how men and women respond to age (how much that gap changes with each additional year) is $\\beta_3$. " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### The * operator\n", "\n", "Finally, as with using `C()` to convert categoricals to indicators, you can also use the `*` notation in statsmodels for interactions -- it not only creates the interaction term, but also adds all the level effects:" ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "
OLS Regression Results
Dep. Variable: voted R-squared: 0.001
Model: OLS Adj. R-squared: 0.001
Method: Least Squares F-statistic: 1.917
Date: Sun, 16 Feb 2020 Prob (F-statistic): 0.124
Time: 11:07:17 Log-Likelihood: -5764.7
No. Observations: 9919 AIC: 1.154e+04
Df Residuals: 9915 BIC: 1.157e+04
Df Model: 3
Covariance Type: HC3
\n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "
coef std err t P>|t| [0.025 0.975]
Intercept 0.6961 0.029 23.789 0.000 0.639 0.753
female[T.True] 0.0933 0.039 2.370 0.018 0.016 0.171
age 0.0009 0.000 1.759 0.079 -9.81e-05 0.002
age:female[T.True] -0.0015 0.001 -2.237 0.025 -0.003 -0.000
\n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "
Omnibus: 1883.068 Durbin-Watson: 1.976
Prob(Omnibus): 0.000 Jarque-Bera (JB): 2388.508
Skew: -1.156 Prob(JB): 0.00
Kurtosis: 2.340 Cond. No. 643.


Warnings:
[1] Standard Errors are heteroscedasticity robust (HC3)" ], "text/plain": [ "\n", "\"\"\"\n", " OLS Regression Results \n", "==============================================================================\n", "Dep. Variable: voted R-squared: 0.001\n", "Model: OLS Adj. R-squared: 0.001\n", "Method: Least Squares F-statistic: 1.917\n", "Date: Sun, 16 Feb 2020 Prob (F-statistic): 0.124\n", "Time: 11:07:17 Log-Likelihood: -5764.7\n", "No. Observations: 9919 AIC: 1.154e+04\n", "Df Residuals: 9915 BIC: 1.157e+04\n", "Df Model: 3 \n", "Covariance Type: HC3 \n", "======================================================================================\n", " coef std err t P>|t| [0.025 0.975]\n", "--------------------------------------------------------------------------------------\n", "Intercept 0.6961 0.029 23.789 0.000 0.639 0.753\n", "female[T.True] 0.0933 0.039 2.370 0.018 0.016 0.171\n", "age 0.0009 0.000 1.759 0.079 -9.81e-05 0.002\n", "age:female[T.True] -0.0015 0.001 -2.237 0.025 -0.003 -0.000\n", "==============================================================================\n", "Omnibus: 1883.068 Durbin-Watson: 1.976\n", "Prob(Omnibus): 0.000 Jarque-Bera (JB): 2388.508\n", "Skew: -1.156 Prob(JB): 0.00\n", "Kurtosis: 2.340 Cond. No. 643.\n", "==============================================================================\n", "\n", "Warnings:\n", "[1] Standard Errors are heteroscedasticity robust (HC3)\n", "\"\"\"" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "model = smf.ols(\"voted ~ age * female\", voters).fit()\n", "model.get_robustcov_results(\"HC3\").summary()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Interactions Between Multiple Indicator Variables" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You can also interact indicator variables with one another, and these interactions have similar interpretations to interactions with continuous variables. 
\n", "\n", "For example, suppose instead of `age`, we just had a binary variable `old`:" ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "
OLS Regression Results
Dep. Variable: voted R-squared: 0.010
Model: OLS Adj. R-squared: 0.010
Method: Least Squares F-statistic: 31.53
Date: Sun, 16 Feb 2020 Prob (F-statistic): 2.82e-20
Time: 11:07:18 Log-Likelihood: -5717.1
No. Observations: 9919 AIC: 1.144e+04
Df Residuals: 9915 BIC: 1.147e+04
Df Model: 3
Covariance Type: HC3
\n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "
coef std err t P>|t| [0.025 0.975]
Intercept 0.6761 0.013 53.519 0.000 0.651 0.701
old[T.True] 0.1015 0.015 6.902 0.000 0.073 0.130
female[T.True] 0.0138 0.017 0.806 0.420 -0.020 0.047
old_x_female -0.0111 0.020 -0.562 0.574 -0.050 0.028
\n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "
Omnibus: 1820.991 Durbin-Watson: 1.973
Prob(Omnibus): 0.000 Jarque-Bera (JB): 2319.326
Skew: -1.140 Prob(JB): 0.00
Kurtosis: 2.357 Cond. No. 9.37


Warnings:
[1] Standard Errors are heteroscedasticity robust (HC3)" ], "text/plain": [ "\n", "\"\"\"\n", " OLS Regression Results \n", "==============================================================================\n", "Dep. Variable: voted R-squared: 0.010\n", "Model: OLS Adj. R-squared: 0.010\n", "Method: Least Squares F-statistic: 31.53\n", "Date: Sun, 16 Feb 2020 Prob (F-statistic): 2.82e-20\n", "Time: 11:07:18 Log-Likelihood: -5717.1\n", "No. Observations: 9919 AIC: 1.144e+04\n", "Df Residuals: 9915 BIC: 1.147e+04\n", "Df Model: 3 \n", "Covariance Type: HC3 \n", "==================================================================================\n", " coef std err t P>|t| [0.025 0.975]\n", "----------------------------------------------------------------------------------\n", "Intercept 0.6761 0.013 53.519 0.000 0.651 0.701\n", "old[T.True] 0.1015 0.015 6.902 0.000 0.073 0.130\n", "female[T.True] 0.0138 0.017 0.806 0.420 -0.020 0.047\n", "old_x_female -0.0111 0.020 -0.562 0.574 -0.050 0.028\n", "==============================================================================\n", "Omnibus: 1820.991 Durbin-Watson: 1.973\n", "Prob(Omnibus): 0.000 Jarque-Bera (JB): 2319.326\n", "Skew: -1.140 Prob(JB): 0.00\n", "Kurtosis: 2.357 Cond. No. 
9.37\n", "==============================================================================\n", "\n", "Warnings:\n", "[1] Standard Errors are heteroscedasticity robust (HC3)\n", "\"\"\"" ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "voters[\"old\"] = voters.age > 50\n", "\n", "# Note the interaction has to be converted to integers first -- not booleans\n", "voters[\"old_x_female\"] = voters.old.astype(\"int\") * voters.female.astype(\"int\")\n", "model = smf.ols(\"voted ~ old + female + old_x_female\", voters).fit()\n", "model.get_robustcov_results(\"HC3\").summary()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now we interpret `old_x_female` as the *difference* in how age affects men and women. It is negative because, as we saw before, as women age, their turnout rate does not increase as much as men's (though here the difference is not statistically significant: the p-value is 0.57). Here we find that men's turnout increases by 10 percentage points moving from under 50 to over 50. By contrast, when moving from under 50 to over 50, a woman's turnout increases by 1 percentage point less (in total, it changes by 0.10 + (-0.01) = 0.09, or 9 percentage points).\n", "\n", "(This may seem inconsistent with the results above, but that's because above we were modeling age linearly; in reality, the relationship between age and turnout is quadratic -- it increases initially, peaks in middle-age, then declines, so neither of these models fits the data perfectly)" ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "/Users/Nick/miniconda3/lib/python3.7/site-packages/plotnine/stats/smoothers.py:168: PlotnineWarning: Confidence intervals are not yet implementedfor lowess smoothings.\n", " \"for lowess smoothings.\", PlotnineWarning)\n" ] }, { "data": { "image/png": "", "text/plain": [ "
" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "" ] }, "execution_count": 15, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from plotnine import aes, geom_smooth, ggplot\n", "\n", "ggplot(voters, aes(x=\"age\", y=\"voted\")) + geom_smooth(method=\"lowess\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Fixed Effects\n", "\n", "The final use of indicator variables is for estimating fixed effects, which [we discuss next](fixed_effects.ipynb). " ] } ], "metadata": { "kernelspec": { "display_name": "base", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.10.8" }, "vscode": { "interpreter": { "hash": "718fed28bf9f8c7851519acf2fb923cd655120b36de3b67253eeb0428bd33d2d" } } }, "nbformat": 4, "nbformat_minor": 4 }