Interpreting Indicator Variables
In this exercise we’ll work with a (somewhat canonical) dataset on the prices, mileages, weights, and other characteristics of 74 automobiles. These data originally came from the April 1979 issue of Consumer Reports and from the United States Government EPA statistics on fuel consumption; they were compiled and published by Chambers et al. (1983).
To get the data, go to http://www.github.com/nickeubank/MIDS_Data and download the `automobile_dataset.dta` file. This is a canonical example dataset used in coding examples all over the internet, and the codebook is roughly:
Variable | Description
---|---
make | Make and Model
price | Price
mpg | Mileage (mpg)
rep78 | Repair Record 1978
headroom | Headroom (in.)
trunk | Trunk space (cu. ft.)
weight | Weight (lbs.)
length | Length (in.)
turn | Turn Circle (ft.)
displacement | Displacement (cu. in.)
gear_ratio | Gear Ratio
foreign | Car type
Indicator Variables and Omitted Variable Bias

Exercise 1

Create a new variable named `guzzler` that takes the value 1 if the car's miles per gallon (`mpg`) is less than 18 and 0 otherwise ("guzzler" is a term for a car that consumes gas very quickly, or "guzzles gas"). Regress `price` on `guzzler` and interpret the coefficients. Do gas guzzlers cost more than the other cars? How much more?
[1]:
import pandas as pd
cars = pd.read_stata(
    "https://github.com/nickeubank/MIDS_Data/"
    "blob/master/automobile_dataset.dta?raw=true"
)
cars.sample().T
[1]:
 | 49 |
---|---|
make | Pont. Le Mans |
price | 4723 |
mpg | 19 |
rep78 | 3.0 |
headroom | 3.5 |
trunk | 17 |
weight | 3200 |
length | 199 |
turn | 40 |
displacement | 231 |
gear_ratio | 2.93 |
foreign | Domestic |
[2]:
cars["guzzler"] = (cars.mpg < 18).astype("int")
import statsmodels.formula.api as smf
smf.ols("price ~ guzzler", cars).fit().summary()
[2]:
Dep. Variable: | price | R-squared: | 0.379 |
---|---|---|---|
Model: | OLS | Adj. R-squared: | 0.370 |
Method: | Least Squares | F-statistic: | 43.90 |
Date: | Tue, 28 Feb 2023 | Prob (F-statistic): | 5.38e-09 |
Time: | 13:49:05 | Log-Likelihood: | -678.10 |
No. Observations: | 74 | AIC: | 1360. |
Df Residuals: | 72 | BIC: | 1365. |
Df Model: | 1 | ||
Covariance Type: | nonrobust |
coef | std err | t | P>|t| | [0.025 | 0.975] | |
---|---|---|---|---|---|---|
Intercept | 5143.0893 | 312.807 | 16.442 | 0.000 | 4519.521 | 5766.658 |
guzzler | 4202.2440 | 634.243 | 6.626 | 0.000 | 2937.904 | 5466.584 |
Omnibus: | 37.244 | Durbin-Watson: | 1.348 |
---|---|---|---|
Prob(Omnibus): | 0.000 | Jarque-Bera (JB): | 111.225 |
Skew: | 1.565 | Prob(JB): | 7.04e-25 |
Kurtosis: | 8.126 | Cond. No. | 2.50 |
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[3]:
# Yes: gas guzzlers cost about 4,202 dollars more on average
# (the coefficient on guzzler).
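As a sanity check on that interpretation: with a single 0/1 regressor, the OLS intercept is exactly the mean of the omitted group and the slope is exactly the difference in group means. A minimal numpy-only sketch with made-up prices (not the cars data):

```python
import numpy as np

# Toy prices (made up, not the cars data): three non-guzzlers, three guzzlers.
guzzler = np.array([0.0, 0.0, 0.0, 1.0, 1.0, 1.0])
price = np.array([4000.0, 5000.0, 6000.0, 8000.0, 9000.0, 10000.0])

# OLS of price on an intercept and the 0/1 indicator:
X = np.column_stack([np.ones_like(guzzler), guzzler])
intercept, slope = np.linalg.lstsq(X, price, rcond=None)[0]

# The intercept recovers the non-guzzler mean, and the slope recovers
# the difference in group means:
print(round(intercept, 6))  # 5000.0
print(round(slope, 6))      # 4000.0
```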
Exercise 2

Create a scatter plot of `price` against `weight` and color code your markers by the value of `guzzler` (red for `guzzler = 1` and green for `guzzler = 0`).

Based on the graph you just created, do you think not controlling for `weight` might lead to omitted variable bias in the regression in Exercise 1? What is the direction of the bias?
[4]:
import altair as alt
alt.Chart(cars).encode(
    x=alt.X("weight", scale=alt.Scale(zero=False)),
    y=alt.Y("price", scale=alt.Scale(zero=False)),
    # Color guzzlers red and non-guzzlers green, per the prompt:
    color=alt.Color(
        "guzzler:N", scale=alt.Scale(domain=[1, 0], range=["red", "green"])
    ),
).mark_point()
[4]:
Exercise 3

Regress `price` on `guzzler`, `weight`, `foreign`, `headroom`, and `displacement`. Interpret the coefficients. Do the regression results confirm your guess in Exercise 2?
[5]:
smf.ols(
    "price ~ guzzler + weight + foreign + headroom + displacement", cars
).fit().summary()
[5]:
Dep. Variable: | price | R-squared: | 0.596 |
---|---|---|---|
Model: | OLS | Adj. R-squared: | 0.566 |
Method: | Least Squares | F-statistic: | 20.04 |
Date: | Tue, 28 Feb 2023 | Prob (F-statistic): | 3.14e-12 |
Time: | 13:49:05 | Log-Likelihood: | -662.20 |
No. Observations: | 74 | AIC: | 1336. |
Df Residuals: | 68 | BIC: | 1350. |
Df Model: | 5 | ||
Covariance Type: | nonrobust |
coef | std err | t | P>|t| | [0.025 | 0.975] | |
---|---|---|---|---|---|---|
Intercept | -782.5353 | 1612.628 | -0.485 | 0.629 | -4000.484 | 2435.414 |
foreign[T.Foreign] | 3278.9827 | 671.826 | 4.881 | 0.000 | 1938.375 | 4619.591 |
guzzler | 1977.1796 | 711.055 | 2.781 | 0.007 | 558.291 | 3396.068 |
weight | 1.9634 | 0.702 | 2.797 | 0.007 | 0.563 | 3.364 |
headroom | -736.7997 | 309.009 | -2.384 | 0.020 | -1353.418 | -120.182 |
displacement | 8.9667 | 5.819 | 1.541 | 0.128 | -2.646 | 20.579 |
Omnibus: | 22.179 | Durbin-Watson: | 1.409 |
---|---|---|---|
Prob(Omnibus): | 0.000 | Jarque-Bera (JB): | 37.284 |
Skew: | 1.118 | Prob(JB): | 8.01e-09 |
Kurtosis: | 5.663 | Cond. No. | 2.36e+04 |
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 2.36e+04. This might indicate that there are
strong multicollinearity or other numerical problems.
[6]:
# After taking into account weight and other factors,
# guzzlers are now only about 2,000 dollars more expensive because,
# as we see, price rises with weight (by about 2 dollars for
# each additional pound).
# This shows that weight and being a guzzler were positively correlated,
# and the "guzzler" price difference in our first regression was capturing
# not just the actual "guzzler" difference, but also the fact that bigger cars
# are more expensive AND tend to be guzzlers.
Exercise 4

The variable `rep78` indicates the car's repair record. The variable is poorly documented (we don't know exactly what the values mean), but take our word for it that the values from 1-5 indicate "very poor", "poor", "acceptable", "good", and "very good" records, respectively.

Regress `price` on indicators for the different categories of `rep78`. Also control for `headroom`, `weight`, `foreign`, and `displacement`. Interpret the coefficient on the indicator for `rep78 == 3`.

(Note: You can use the `C()` method for creating indicator variables, but your answers will only be right if the omitted category is `rep78 == 1`.)

(Note: If you create indicators manually, beware how Python deals with equality tests involving missing values (e.g., `np.nan == 1`), or you may inadvertently "create" data out of thin air!)
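Both warnings can be illustrated on a tiny, hypothetical `rep78` column (the values below are made up):

```python
import numpy as np
import pandas as pd

# Hypothetical repair-record values, including one missing entry:
rep = pd.Series([1.0, 2.0, np.nan, 3.0], name="rep78")

# A naive equality test silently treats the missing value as "not a 3",
# manufacturing a 0 out of thin air:
naive = (rep == 3.0).astype(int)
print(naive.tolist())  # [0, 0, 0, 1] -- row 2 now looks observed

# Safer: keep missing rows missing, so the regression drops them:
safe = (rep == 3.0).astype(float).mask(rep.isna())
print(safe.tolist())  # [0.0, 0.0, nan, 1.0]

# And if you build the full set of dummies by hand, drop the
# rep78 == 1.0 column so "very poor" is the omitted category:
dummies = pd.get_dummies(rep, prefix="rep78").drop(columns="rep78_1.0")
print(list(dummies.columns))  # ['rep78_2.0', 'rep78_3.0']
```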
[7]:
smf.ols(
    "price ~ C(rep78) + guzzler + weight + foreign + headroom + displacement", cars
).fit().summary()
[7]:
Dep. Variable: | price | R-squared: | 0.593 |
---|---|---|---|
Model: | OLS | Adj. R-squared: | 0.531 |
Method: | Least Squares | F-statistic: | 9.557 |
Date: | Tue, 28 Feb 2023 | Prob (F-statistic): | 8.11e-09 |
Time: | 13:49:05 | Log-Likelihood: | -616.77 |
No. Observations: | 69 | AIC: | 1254. |
Df Residuals: | 59 | BIC: | 1276. |
Df Model: | 9 | ||
Covariance Type: | nonrobust |
coef | std err | t | P>|t| | [0.025 | 0.975] | |
---|---|---|---|---|---|---|
Intercept | -1852.4340 | 2284.628 | -0.811 | 0.421 | -6423.964 | 2719.096 |
C(rep78)[T.2.0] | 951.3308 | 1676.704 | 0.567 | 0.573 | -2403.746 | 4306.408 |
C(rep78)[T.3.0] | 1352.9162 | 1539.781 | 0.879 | 0.383 | -1728.179 | 4434.011 |
C(rep78)[T.4.0] | 994.5742 | 1609.415 | 0.618 | 0.539 | -2225.857 | 4215.006 |
C(rep78)[T.5.0] | 1325.1545 | 1705.943 | 0.777 | 0.440 | -2088.429 | 4738.738 |
foreign[T.Foreign] | 3240.0049 | 807.012 | 4.015 | 0.000 | 1625.177 | 4854.833 |
guzzler | 1644.4650 | 769.925 | 2.136 | 0.037 | 103.848 | 3185.082 |
weight | 1.7544 | 0.883 | 1.988 | 0.051 | -0.012 | 3.520 |
headroom | -755.1823 | 341.695 | -2.210 | 0.031 | -1438.912 | -71.452 |
displacement | 12.0412 | 7.472 | 1.612 | 0.112 | -2.910 | 26.992 |
Omnibus: | 21.610 | Durbin-Watson: | 1.430 |
---|---|---|---|
Prob(Omnibus): | 0.000 | Jarque-Bera (JB): | 33.697 |
Skew: | 1.181 | Prob(JB): | 4.82e-08 |
Kurtosis: | 5.478 | Cond. No. | 4.54e+04 |
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 4.54e+04. This might indicate that there are
strong multicollinearity or other numerical problems.
[8]:
# The coefficient on rep78 == 3 (C(rep78)[T.3.0])
# says that an "acceptable" car is worth about 1,353 dollars
# more than a "very poor" quality car, after controlling for
# other attributes.
Interaction Effects

Exercise 5

You suspect that the effect of `guzzler` on `price` may depend on whether or not the car is manufactured abroad. Regress `price` on `guzzler`, `foreign`, and their interaction, controlling for `headroom`, `weight`, and `displacement`. Without using mathematical language, explain to your grandma what the coefficient on the interaction term means.
[9]:
model = smf.ols(
    "price ~ guzzler * foreign", cars
).fit()
model.summary()
[9]:
Dep. Variable: | price | R-squared: | 0.415 |
---|---|---|---|
Model: | OLS | Adj. R-squared: | 0.390 |
Method: | Least Squares | F-statistic: | 16.53 |
Date: | Tue, 28 Feb 2023 | Prob (F-statistic): | 3.20e-08 |
Time: | 13:49:12 | Log-Likelihood: | -675.89 |
No. Observations: | 74 | AIC: | 1360. |
Df Residuals: | 70 | BIC: | 1369. |
Df Model: | 3 | ||
Covariance Type: | nonrobust |
coef | std err | t | P>|t| | [0.025 | 0.975] | |
---|---|---|---|---|---|---|
Intercept | 4925.0000 | 378.835 | 13.000 | 0.000 | 4169.438 | 5680.562 |
foreign[T.Foreign] | 642.7895 | 650.380 | 0.988 | 0.326 | -654.352 | 1939.931 |
guzzler | 3977.7333 | 705.352 | 5.639 | 0.000 | 2570.953 | 5384.513 |
guzzler:foreign[T.Foreign] | 2012.8105 | 1595.941 | 1.261 | 0.211 | -1170.193 | 5195.814 |
Omnibus: | 47.269 | Durbin-Watson: | 1.353 |
---|---|---|---|
Prob(Omnibus): | 0.000 | Jarque-Bera (JB): | 182.153 |
Skew: | 1.940 | Prob(JB): | 2.79e-40 |
Kurtosis: | 9.635 | Cond. No. | 6.73 |
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[9]:
model = smf.ols(
    "price ~ guzzler * foreign + weight + headroom + displacement", cars
).fit()
model.summary()
[9]:
Dep. Variable: | price | R-squared: | 0.619 |
---|---|---|---|
Model: | OLS | Adj. R-squared: | 0.585 |
Method: | Least Squares | F-statistic: | 18.15 |
Date: | Tue, 01 Mar 2022 | Prob (F-statistic): | 2.21e-12 |
Time: | 10:18:55 | Log-Likelihood: | -660.00 |
No. Observations: | 74 | AIC: | 1334. |
Df Residuals: | 67 | BIC: | 1350. |
Df Model: | 6 | ||
Covariance Type: | nonrobust |
coef | std err | t | P>|t| | [0.025 | 0.975] | |
---|---|---|---|---|---|---|
Intercept | -391.7038 | 1588.834 | -0.247 | 0.806 | -3563.030 | 2779.622 |
foreign[T.Foreign] | 2929.3402 | 679.319 | 4.312 | 0.000 | 1573.413 | 4285.267 |
guzzler | 1354.9011 | 760.244 | 1.782 | 0.079 | -162.552 | 2872.354 |
guzzler:foreign[T.Foreign] | 2797.6787 | 1381.501 | 2.025 | 0.047 | 40.190 | 5555.167 |
weight | 1.6417 | 0.705 | 2.330 | 0.023 | 0.235 | 3.048 |
headroom | -736.8717 | 302.195 | -2.438 | 0.017 | -1340.056 | -133.688 |
displacement | 12.6296 | 5.972 | 2.115 | 0.038 | 0.710 | 24.549 |
Omnibus: | 26.353 | Durbin-Watson: | 1.421 |
---|---|---|---|
Prob(Omnibus): | 0.000 | Jarque-Bera (JB): | 46.874 |
Skew: | 1.311 | Prob(JB): | 6.63e-11 |
Kurtosis: | 5.885 | Cond. No. | 2.39e+04 |
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 2.39e+04. This might indicate that there are
strong multicollinearity or other numerical problems.
[10]:
# (My grandmother passed away a long time ago, so I'll have to tell my mom...)
# Hi Mom! How are you doing?
# I know this is gonna sound silly, but I wanted to let you know that according
# to an admittedly dated dataset on car prices,
# gas guzzlers cost quite a bit more when the car is foreign than when
# it's domestic.
#
# Or, to be more precise: guzzlers cost more than non-guzzlers whether the car
# is foreign or domestic, but this difference between guzzlers and
# non-guzzlers is about 2,797 dollars greater for foreign cars than
# for domestic cars.
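The arithmetic behind that comparison is easiest to see in a stripped-down model with no controls: with two 0/1 indicators and their interaction, the coefficients are just re-arrangements of the four group means. A numpy-only sketch with made-up cell averages (not estimates from the cars data):

```python
import numpy as np

# One observation per (guzzler, foreign) cell, at made-up average prices:
g = np.array([0.0, 1.0, 0.0, 1.0])
f = np.array([0.0, 0.0, 1.0, 1.0])
price = np.array([5000.0, 9000.0, 5600.0, 12400.0])

# OLS of price on guzzler, foreign, and their interaction:
X = np.column_stack([np.ones(4), g, f, g * f])
intercept, b_g, b_f, b_gf = np.linalg.lstsq(X, price, rcond=None)[0]

print(round(b_g, 6))         # guzzler premium among domestic cars: 4000.0
print(round(b_gf, 6))        # how much BIGGER that premium is abroad: 2800.0
print(round(b_g + b_gf, 6))  # guzzler premium among foreign cars: 6800.0
```

The last line is exactly the sum computed in Exercise 6, just with hypothetical numbers.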
Exercise 6

What is the price difference between a foreign guzzler and a foreign non-guzzler?
[11]:
print(f"${model.params['guzzler:foreign[T.Foreign]'] + model.params['guzzler']:.2f}")
$4152.58
Exercise 7

What is the price difference between a domestic non-guzzler and a foreign non-guzzler?
[12]:
print(f"${model.params['foreign[T.Foreign]']:.2f}")
$2929.34
Exercise 8

Regress `price` on `foreign`, `mpg`, and their interaction, controlling for `headroom`, `weight`, and `displacement`. Interpret the coefficients of the main independent variables. Explain in layman's terms the coefficient on the interaction term.
[13]:
m2 = smf.ols("price ~ foreign * mpg + weight + headroom + displacement", cars).fit()
m2.summary()
[13]:
Dep. Variable: | price | R-squared: | 0.599 |
---|---|---|---|
Model: | OLS | Adj. R-squared: | 0.564 |
Method: | Least Squares | F-statistic: | 16.71 |
Date: | Tue, 01 Mar 2022 | Prob (F-statistic): | 1.12e-11 |
Time: | 10:18:55 | Log-Likelihood: | -661.86 |
No. Observations: | 74 | AIC: | 1338. |
Df Residuals: | 67 | BIC: | 1354. |
Df Model: | 6 | ||
Covariance Type: | nonrobust |
coef | std err | t | P>|t| | [0.025 | 0.975] | |
---|---|---|---|---|---|---|
Intercept | -1.232e+04 | 4465.992 | -2.758 | 0.007 | -2.12e+04 | -3404.206 |
foreign[T.Foreign] | 1.176e+04 | 2796.011 | 4.208 | 0.000 | 6184.000 | 1.73e+04 |
mpg | 259.8139 | 109.998 | 2.362 | 0.021 | 40.257 | 479.371 |
foreign[T.Foreign]:mpg | -314.4806 | 109.360 | -2.876 | 0.005 | -532.764 | -96.197 |
weight | 3.4327 | 0.856 | 4.008 | 0.000 | 1.723 | 5.142 |
headroom | -484.5821 | 319.958 | -1.515 | 0.135 | -1123.222 | 154.058 |
displacement | 14.4670 | 5.839 | 2.478 | 0.016 | 2.813 | 26.121 |
Omnibus: | 22.563 | Durbin-Watson: | 1.442 |
---|---|---|---|
Prob(Omnibus): | 0.000 | Jarque-Bera (JB): | 33.595 |
Skew: | 1.228 | Prob(JB): | 5.07e-08 |
Kurtosis: | 5.204 | Cond. No. | 6.89e+04 |
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 6.89e+04. This might indicate that there are
strong multicollinearity or other numerical problems.
[14]:
# For domestic cars, prices go up with the car's miles per gallon
# by:
print(f"${m2.params['mpg']:.2f} per mpg")
$259.81 per mpg
[15]:
# But the "price premium" for mileage is
print(f"${m2.params['foreign[T.Foreign]:mpg']:.2f} per mpg")
$-314.48 per mpg
[16]:
# lower for foreign cars than domestic cars.
# That means that for every additional mpg, the price of a
# foreign car changes by:
print(f"${m2.params['foreign[T.Foreign]:mpg'] + m2.params['mpg']:.2f} per mpg")
$-54.67 per mpg
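That slope arithmetic can be checked on noiseless toy data, where the true slopes are fixed by construction (the numbers below are made up and only loosely echo the estimates above):

```python
import numpy as np

# Toy data built so price rises $260 per mpg for domestic cars and
# $260 - $315 = -$55 per mpg for foreign cars:
mpg = np.tile(np.arange(10.0, 30.0), 2)
foreign = np.repeat([0.0, 1.0], 20)
price = 1000.0 + 260.0 * mpg + 500.0 * foreign - 315.0 * foreign * mpg

# OLS with the foreign x mpg interaction recovers both slopes:
X = np.column_stack([np.ones(40), foreign, mpg, foreign * mpg])
b0, b_f, b_mpg, b_int = np.linalg.lstsq(X, price, rcond=None)[0]

print(round(b_mpg, 6))          # slope for domestic cars: 260.0
print(round(b_mpg + b_int, 6))  # slope for foreign cars: -55.0
```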