Interpreting Indicator Variables

In this exercise we’ll work with a (somewhat canonical) dataset on the prices, mileages, weights, and other characteristics of 74 automobiles. These data originally came from the April 1979 issue of Consumer Reports and from the United States Government EPA statistics on fuel consumption; they were compiled and published by Chambers et al. (1983).

To get the data, go to http://www.github.com/nickeubank/MIDS_Data and download the automobile_dataset.dta file. This is a canonical example dataset used in coding examples all over the internet, and the codebook is roughly:

make            Make and Model
price           Price
mpg             Mileage (mpg)
rep78           Repair Record 1978
headroom        Headroom (in.)
trunk           Trunk space (cu. ft.)
weight          Weight (lbs.)
length          Length (in.)
turn            Turn Circle (ft.)
displacement    Displacement (cu. in.)
gear_ratio      Gear Ratio
foreign         Car type

Indicator Variables and Omitted Variable Bias

Exercise 1

Create a new variable named guzzler that takes the value of 1 if the car’s miles per gallon (mpg) is less than 18 and takes value 0 otherwise (“guzzler” is a term for a car that consumes gas very quickly, or “guzzles gas”). Regress price on guzzler and interpret the coefficients. Do gas guzzlers cost more than the other cars? How much more?

[1]:
import pandas as pd

cars = pd.read_stata(
    "https://github.com/nickeubank/MIDS_Data/"
    "blob/master/automobile_dataset.dta?raw=true"
)
cars.sample().T

[1]:
49
make Pont. Le Mans
price 4723
mpg 19
rep78 3.0
headroom 3.5
trunk 17
weight 3200
length 199
turn 40
displacement 231
gear_ratio 2.93
foreign Domestic
[2]:
cars["guzzler"] = (cars.mpg < 18).astype("int")
import statsmodels.formula.api as smf

smf.ols("price ~ guzzler", cars).fit().summary()

[2]:
OLS Regression Results
Dep. Variable: price R-squared: 0.379
Model: OLS Adj. R-squared: 0.370
Method: Least Squares F-statistic: 43.90
Date: Tue, 28 Feb 2023 Prob (F-statistic): 5.38e-09
Time: 13:49:05 Log-Likelihood: -678.10
No. Observations: 74 AIC: 1360.
Df Residuals: 72 BIC: 1365.
Df Model: 1
Covariance Type: nonrobust
coef std err t P>|t| [0.025 0.975]
Intercept 5143.0893 312.807 16.442 0.000 4519.521 5766.658
guzzler 4202.2440 634.243 6.626 0.000 2937.904 5466.584
Omnibus: 37.244 Durbin-Watson: 1.348
Prob(Omnibus): 0.000 Jarque-Bera (JB): 111.225
Skew: 1.565 Prob(JB): 7.04e-25
Kurtosis: 8.126 Cond. No. 2.50


Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[3]:
# 4,202 dollars more on average.
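Why does the coefficient on a 0/1 indicator read as "dollars more on average"? Because regressing an outcome on an intercept and a single binary variable exactly reproduces the two group means: the intercept is the mean for the 0 group, and the slope is the difference in means. A minimal sketch with simulated data (not the Consumer Reports dataset), using plain numpy so it runs standalone:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical illustration: price depends on a 0/1 "guzzler" flag plus noise.
guzzler = rng.integers(0, 2, size=200)
price = 5000 + 4000 * guzzler + rng.normal(0, 500, size=200)

# OLS of price on an intercept and the 0/1 indicator...
X = np.column_stack([np.ones_like(guzzler), guzzler])
beta, *_ = np.linalg.lstsq(X, price, rcond=None)

# ...recovers the group means exactly: the intercept is the mean price of
# non-guzzlers, and the slope is the difference in mean prices between groups.
diff_in_means = price[guzzler == 1].mean() - price[guzzler == 0].mean()
print(np.isclose(beta[1], diff_in_means))  # True
```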

Exercise 2

Create a scatter plot of price against weight and color code your markers by the value of guzzler (red for guzzler = 1 and green for guzzler = 0).

Based on the graph you just created, do you think not controlling for weight might lead to omitted variable bias in the regression in Exercise 1? What is the direction of the bias?

[4]:
import altair as alt

alt.Chart(cars).encode(
    x=alt.X("weight", scale=alt.Scale(zero=False)),
    y=alt.Y("price", scale=alt.Scale(zero=False)),
    color="guzzler:N",
).mark_point()

[4]:

Exercise 3

Regress price on guzzler, weight, foreign, headroom, and displacement. Interpret the coefficients. Do the regression results confirm your guess in Exercise 2?

[5]:
smf.ols(
    "price ~ guzzler + weight + foreign + headroom + displacement", cars
).fit().summary()

[5]:
OLS Regression Results
Dep. Variable: price R-squared: 0.596
Model: OLS Adj. R-squared: 0.566
Method: Least Squares F-statistic: 20.04
Date: Tue, 28 Feb 2023 Prob (F-statistic): 3.14e-12
Time: 13:49:05 Log-Likelihood: -662.20
No. Observations: 74 AIC: 1336.
Df Residuals: 68 BIC: 1350.
Df Model: 5
Covariance Type: nonrobust
coef std err t P>|t| [0.025 0.975]
Intercept -782.5353 1612.628 -0.485 0.629 -4000.484 2435.414
foreign[T.Foreign] 3278.9827 671.826 4.881 0.000 1938.375 4619.591
guzzler 1977.1796 711.055 2.781 0.007 558.291 3396.068
weight 1.9634 0.702 2.797 0.007 0.563 3.364
headroom -736.7997 309.009 -2.384 0.020 -1353.418 -120.182
displacement 8.9667 5.819 1.541 0.128 -2.646 20.579
Omnibus: 22.179 Durbin-Watson: 1.409
Prob(Omnibus): 0.000 Jarque-Bera (JB): 37.284
Skew: 1.118 Prob(JB): 8.01e-09
Kurtosis: 5.663 Cond. No. 2.36e+04


Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 2.36e+04. This might indicate that there are
strong multicollinearity or other numerical problems.
[6]:
# After taking weight and other factors into account,
# guzzlers are now only about 2,000 dollars more expensive. That's because,
# as we can see, price rises with weight (by about 2 dollars
# per pound).

# This does show that weight and being a guzzler were positively correlated,
# and the "guzzler" price difference in our first regression was capturing
# not just the actual "guzzler" difference, but also the fact that bigger cars
# are more expensive AND tend to be guzzlers.
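The mechanics of this bias can be seen directly in a small simulation (synthetic data, not the cars dataset): when the omitted variable (weight) positively affects price and is positively correlated with the included variable (guzzler), the short regression's guzzler coefficient absorbs part of weight's effect and comes out too large.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 1000

# Hypothetical data-generating process: weight drives price AND is
# correlated with being a guzzler (heavier cars guzzle more).
weight = rng.normal(3000, 500, n)
guzzler = (weight + rng.normal(0, 300, n) > 3200).astype(float)
price = 4000 + 2000 * guzzler + 2.0 * weight + rng.normal(0, 500, n)

def ols(y, *cols):
    # OLS via least squares; first column is the intercept.
    X = np.column_stack([np.ones(len(y))] + list(cols))
    return np.linalg.lstsq(X, y, rcond=None)[0]

b_short = ols(price, guzzler)          # omits weight
b_long = ols(price, guzzler, weight)   # controls for weight

# The short regression's guzzler coefficient is biased upward because
# guzzler proxies for the omitted, positively correlated weight.
print(b_short[1] > b_long[1])  # True
```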

Exercise 4

Variable rep78 indicates the car’s repair record. The variable is poorly documented (we don’t know exactly what the values mean), but take our word for it that the values 1-5 indicate a “very poor”, “poor”, “acceptable”, “good”, and “very good” record, respectively.

Regress price on indicators for the different categories of rep78. Also control for headroom, weight, foreign, and displacement. Interpret the coefficient on the indicator for rep78 == 3.

(Note: You can use C() in your formula to create indicator variables, but your answers will only be right if the omitted category is rep78 == 1.)

(Note: If you create indicators manually, beware how Python deals with equality tests with missing values (e.g., np.nan == 1), or you may inadvertently “create” data out of thin air!)
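To see the pitfall the note warns about: `np.nan` compares unequal to everything (including itself), so a naive equality test silently converts missing values to `False`/0 instead of keeping them missing. A short illustration with a toy Series:

```python
import numpy as np
import pandas as pd

# np.nan fails every equality test, even against itself:
print(np.nan == 1)       # False
print(np.nan == np.nan)  # False

s = pd.Series([1.0, 3.0, np.nan])

# A naive indicator turns the missing row into a 0 -- data out of thin air!
naive = (s == 3).astype(int)
print(naive.tolist())  # [0, 1, 0]

# Safer: explicitly propagate missingness into the new indicator.
safe = pd.Series(np.where(s.isna(), np.nan, (s == 3).astype(float)))
print(safe.tolist())  # [0.0, 1.0, nan]
```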

[7]:
smf.ols(
    "price ~ C(rep78) + guzzler + weight + foreign + headroom + displacement", cars
).fit().summary()

[7]:
OLS Regression Results
Dep. Variable: price R-squared: 0.593
Model: OLS Adj. R-squared: 0.531
Method: Least Squares F-statistic: 9.557
Date: Tue, 28 Feb 2023 Prob (F-statistic): 8.11e-09
Time: 13:49:05 Log-Likelihood: -616.77
No. Observations: 69 AIC: 1254.
Df Residuals: 59 BIC: 1276.
Df Model: 9
Covariance Type: nonrobust
coef std err t P>|t| [0.025 0.975]
Intercept -1852.4340 2284.628 -0.811 0.421 -6423.964 2719.096
C(rep78)[T.2.0] 951.3308 1676.704 0.567 0.573 -2403.746 4306.408
C(rep78)[T.3.0] 1352.9162 1539.781 0.879 0.383 -1728.179 4434.011
C(rep78)[T.4.0] 994.5742 1609.415 0.618 0.539 -2225.857 4215.006
C(rep78)[T.5.0] 1325.1545 1705.943 0.777 0.440 -2088.429 4738.738
foreign[T.Foreign] 3240.0049 807.012 4.015 0.000 1625.177 4854.833
guzzler 1644.4650 769.925 2.136 0.037 103.848 3185.082
weight 1.7544 0.883 1.988 0.051 -0.012 3.520
headroom -755.1823 341.695 -2.210 0.031 -1438.912 -71.452
displacement 12.0412 7.472 1.612 0.112 -2.910 26.992
Omnibus: 21.610 Durbin-Watson: 1.430
Prob(Omnibus): 0.000 Jarque-Bera (JB): 33.697
Skew: 1.181 Prob(JB): 4.82e-08
Kurtosis: 5.478 Cond. No. 4.54e+04


Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 4.54e+04. This might indicate that there are
strong multicollinearity or other numerical problems.
[8]:
# The coefficient on rep78 == 3 (C(rep78)[T.3.0])
# says that an "acceptable" car is worth about 1,353 dollars
# more than a "very poor" quality car, after controlling for
# other attributes.

Interaction Effects

Exercise 5

You suspect that the effect of guzzler on price may be conditioned by whether or not the car is manufactured abroad. Regress price on guzzler, foreign and their interaction, controlling for headroom, weight and displacement. Without using mathematical language, explain to your grandma what the coefficient on the interaction term means.

[9]:
model = smf.ols(
    "price ~ guzzler * foreign", cars
).fit()
model.summary()

[9]:
OLS Regression Results
Dep. Variable: price R-squared: 0.415
Model: OLS Adj. R-squared: 0.390
Method: Least Squares F-statistic: 16.53
Date: Tue, 28 Feb 2023 Prob (F-statistic): 3.20e-08
Time: 13:49:12 Log-Likelihood: -675.89
No. Observations: 74 AIC: 1360.
Df Residuals: 70 BIC: 1369.
Df Model: 3
Covariance Type: nonrobust
coef std err t P>|t| [0.025 0.975]
Intercept 4925.0000 378.835 13.000 0.000 4169.438 5680.562
foreign[T.Foreign] 642.7895 650.380 0.988 0.326 -654.352 1939.931
guzzler 3977.7333 705.352 5.639 0.000 2570.953 5384.513
guzzler:foreign[T.Foreign] 2012.8105 1595.941 1.261 0.211 -1170.193 5195.814
Omnibus: 47.269 Durbin-Watson: 1.353
Prob(Omnibus): 0.000 Jarque-Bera (JB): 182.153
Skew: 1.940 Prob(JB): 2.79e-40
Kurtosis: 9.635 Cond. No. 6.73


Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[9]:
model = smf.ols(
    "price ~ guzzler * foreign + weight + headroom + displacement", cars
).fit()
model.summary()

[9]:
OLS Regression Results
Dep. Variable: price R-squared: 0.619
Model: OLS Adj. R-squared: 0.585
Method: Least Squares F-statistic: 18.15
Date: Tue, 01 Mar 2022 Prob (F-statistic): 2.21e-12
Time: 10:18:55 Log-Likelihood: -660.00
No. Observations: 74 AIC: 1334.
Df Residuals: 67 BIC: 1350.
Df Model: 6
Covariance Type: nonrobust
coef std err t P>|t| [0.025 0.975]
Intercept -391.7038 1588.834 -0.247 0.806 -3563.030 2779.622
foreign[T.Foreign] 2929.3402 679.319 4.312 0.000 1573.413 4285.267
guzzler 1354.9011 760.244 1.782 0.079 -162.552 2872.354
guzzler:foreign[T.Foreign] 2797.6787 1381.501 2.025 0.047 40.190 5555.167
weight 1.6417 0.705 2.330 0.023 0.235 3.048
headroom -736.8717 302.195 -2.438 0.017 -1340.056 -133.688
displacement 12.6296 5.972 2.115 0.038 0.710 24.549
Omnibus: 26.353 Durbin-Watson: 1.421
Prob(Omnibus): 0.000 Jarque-Bera (JB): 46.874
Skew: 1.311 Prob(JB): 6.63e-11
Kurtosis: 5.885 Cond. No. 2.39e+04


Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 2.39e+04. This might indicate that there are
strong multicollinearity or other numerical problems.
[10]:
# (My grandmother passed away a long time ago, so I'll have to tell my mom...)
# Hi Mom! How are you doing?
# I know this is gonna sound silly, but I wanted to let you know that according
# to an admittedly dated database on car prices,
# gas guzzlers cost quite a bit more when the car is foreign than when
# they're domestic.
#
# Or, to be more precise: guzzlers cost more than non-guzzlers everywhere,
# but the guzzler/non-guzzler price difference is about 2,800 dollars
# greater for foreign cars than for domestic cars.
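Another way to convince yourself of this reading: in a saturated regression on two binary variables, the interaction coefficient is exactly a difference-in-differences of the four group means. A sketch with simulated cell means (the numbers below are made up for illustration, not fitted from the cars data):

```python
import itertools
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical mean prices for the four (guzzler, foreign) cells.
means = {(0, 0): 4925.0, (1, 0): 8900.0, (0, 1): 5570.0, (1, 1): 11560.0}

# 50 noisy observations per cell.
rows = [(g, f, means[(g, f)] + rng.normal(0, 100))
        for g, f in itertools.product([0, 1], [0, 1])
        for _ in range(50)]
g, f, price = (np.array(c, dtype=float) for c in zip(*rows))

# Saturated OLS: intercept, guzzler, foreign, and their interaction.
X = np.column_stack([np.ones_like(price), g, f, g * f])
beta, *_ = np.linalg.lstsq(X, price, rcond=None)

# The interaction coefficient equals a difference-in-differences:
# (guzzler premium among foreign cars) minus (guzzler premium among domestic).
did = ((price[(g == 1) & (f == 1)].mean() - price[(g == 0) & (f == 1)].mean())
       - (price[(g == 1) & (f == 0)].mean() - price[(g == 0) & (f == 0)].mean()))
print(np.isclose(beta[3], did))  # True
```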

Exercise 6

What is the price difference between a foreign guzzler and a foreign non-guzzler?

[11]:
print(f"${model.params['guzzler:foreign[T.Foreign]'] + model.params['guzzler']:.2f}")

$4152.58

Exercise 7

What is the price difference between a domestic non-guzzler and a foreign non-guzzler?

[12]:
print(f"${model.params['foreign[T.Foreign]']:.2f}")

$2929.34

Exercise 8

Regress price on foreign, mpg and their interaction, controlling for headroom, weight and displacement. Interpret the coefficients of the main independent variables, and explain the coefficient on the interaction term in layman's terms.

[13]:
m2 = smf.ols("price ~ foreign * mpg + weight + headroom + displacement", cars).fit()
m2.summary()

[13]:
OLS Regression Results
Dep. Variable: price R-squared: 0.599
Model: OLS Adj. R-squared: 0.564
Method: Least Squares F-statistic: 16.71
Date: Tue, 01 Mar 2022 Prob (F-statistic): 1.12e-11
Time: 10:18:55 Log-Likelihood: -661.86
No. Observations: 74 AIC: 1338.
Df Residuals: 67 BIC: 1354.
Df Model: 6
Covariance Type: nonrobust
coef std err t P>|t| [0.025 0.975]
Intercept -1.232e+04 4465.992 -2.758 0.007 -2.12e+04 -3404.206
foreign[T.Foreign] 1.176e+04 2796.011 4.208 0.000 6184.000 1.73e+04
mpg 259.8139 109.998 2.362 0.021 40.257 479.371
foreign[T.Foreign]:mpg -314.4806 109.360 -2.876 0.005 -532.764 -96.197
weight 3.4327 0.856 4.008 0.000 1.723 5.142
headroom -484.5821 319.958 -1.515 0.135 -1123.222 154.058
displacement 14.4670 5.839 2.478 0.016 2.813 26.121
Omnibus: 22.563 Durbin-Watson: 1.442
Prob(Omnibus): 0.000 Jarque-Bera (JB): 33.595
Skew: 1.228 Prob(JB): 5.07e-08
Kurtosis: 5.204 Cond. No. 6.89e+04


Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 6.89e+04. This might indicate that there are
strong multicollinearity or other numerical problems.
[14]:
# For domestic cars, prices go up with the car's miles per gallon
# by:

print(f"${m2.params['mpg']:.2f} per mpg")

$259.81 per mpg
[15]:
# But the "price premium" for milage is

print(f"${m2.params['foreign[T.Foreign]:mpg']:.2f} per mpg")

$-314.48 per mpg
[16]:
# lower for foreign cars than domestic cars.
# That means that for every additional mpg, the price of a
# foreign car changes by:

print(f"${m2.params['foreign[T.Foreign]:mpg'] + m2.params['mpg']:.2f} per mpg")

$-54.67 per mpg