
## Efficient Markets Hypothesis

[ ]: # Initialize Otter
import otter
1 Econ 140 – Problem Set 4
Before getting started on the assignment, run the cell at the very top that imports otter and the
cell below which will import the packages we need.
Important: As mentioned in problem set 0, if you leave this notebook alone for a while and come
back, to save memory datahub will “forget” which code cells you have run, and you may need to
restart your kernel and run all of the cells from the top. That includes this code cell that imports
packages. If you get “not defined” errors, it is because you didn’t run an earlier
code cell that you needed to run. It might be this cell or the otter cell above.
[4]: import numpy as np
import pandas as pd
import statsmodels.api as sm
1.1 Problem 1. Efficient Markets Hypothesis
Does the stock market efficiently use information in valuing stocks? The Efficient Markets Hypothesis (“EMH”), developed by Nobel-prize winner Eugene Fama, maintains that current stock
prices fully reflect all available information. An implication of this hypothesis is that returns in
the current period should not be systematically related to information known in earlier periods.
Otherwise, we could use this information to predict stock returns, thus violating EMH. As an analyst at an investment management company, you have been tasked with examining the validity
of the EMH. You obtained a dataset of 142 randomly-selected firms listed on the New York Stock
Exchange, consisting of the following four variables:
| Variable | Description |
| --- | --- |
| return | Total return from holding a firm’s stock over a one-year period, from January 2014 to December 2014. Note that an annual return such as 31.4% is entered in the dataset as 31.4. |
| dkr | A firm’s debt to capital ratio in 2013. |
| lnetincome | Natural log of the net income for a firm in 2013. |
| lsalary | Natural log of the total compensation for a firm’s CEO in 2013. |
Using these data, you estimated the following two regressions.
Regression 1
Regression 2
Question 1.a. Based on the results for the two OLS regressions, what is the sign of the correlation
between dkr and lnetincome? Alternatively, is there not enough information to determine the sign
of the correlation?
Question 1.b. Interpret the coefficient on lnetincome in Regression 2.
Now suppose you added another variable to the regression, and obtained the following regression
results.
Regression 3
Question 1.c. Suppose that you use Regression 3 to examine whether EMH holds. What are the
null and alternative hypotheses?
Question 1.d. Carry out the test in part (c) at the 5% level. Do you reject or fail to reject the
null hypothesis?
Question 1.e. Interpret the result you obtained in part (d), in light of your task of examining the
validity of EMH.
Question 1.f. Provide (at least) two reasons why there might be imperfect multicollinearity
present in Regression 3.
Question 1.g. Which of the following statements is true based on a comparison of Regression
2 and Regression 3?
(i) dkr and lnetincome are highly correlated.
(ii) dkr and lsalary are highly correlated.
(iii) lnetincome and lsalary are highly correlated.
(iv) All of the above.
(v) None of the above.
Question 1.h. The sample of 142 stocks only includes companies that were traded on the NYSE
as of the end of 2013. A company that went out of business, for instance, before the end of that
year could not enter the sample. How would this sampling affect the estimated coefficient relative
to the population regression?
1.2 Problem 2. Airlines and Antitrust
Antitrust authorities have long been concerned that airline carriers may exercise their market power
by charging higher fares. The greatest concern arises when one airline runs the vast majority of
flights in and out of an airport. Usually this happens when an airline designates an airport as
a national or regional “hub” of their operations. The dataset airfares.csv consists of average
fares and other characteristics of popular U.S. origin-destination pairs (e.g., Boston-Chicago) for
the year 2000.
| Variable | Description | Units |
| --- | --- | --- |
| lfare | logarithm of the average fare on the route | log of fare in 2000 dollars |
| dist | distance of the route | thousands of miles |
| passen | average number of passengers per day | thousands of passengers |
| concen | market share of biggest airline carrier on the route, measured in terms of passengers carried | fraction (e.g., 0.55 = 55% market share) |
| origin | city of origin of flight | |
| destin | city of destination of flight | |
Question 2.a. Regress lfare on dist, passen and concen, with robust standard errors. Make
sure the cell below (and all regression questions in this assignment) shows your regression results
like you’ve done in previous assignments, otherwise we cannot give credit. This assignment will be
a little less guided. Make sure to use different variable names for each separate coding part to avoid
unexpected errors from reusing variables. Refer to previous assignments if you need a refresher on
how we performed different regressions. Don’t forget to add a constant to your regressions.
[ ]: …
Question 2.b. What is the interpretation of the coefficient on passen?
Question 2.c. Based on your OLS estimates, and assuming the OLS assumptions hold, what is the partial
effect of the market share of the largest carrier on air fares? Is your answer consistent with the
hypothesis that firms use their market power to charge higher prices?
Question 2.d. How would you test whether market power is used the same way on more popular
and less popular routes? Write down the model and the hypothesis, carry out the estimation and
the test.
This question is for your code, the next is for your explanation.
[ ]: …
Question 2.e. Explain.
Question 2.f. We need to question whether the results of the regression in part (d) are revealing
a causal relationship between concentration and airfares. In particular, we are concerned whether
our estimation results on U.S. data are valid for other markets, such as Europe and Asia. Give one
reason why the results would not be “externally valid” if applied to the airline industry in one of
these other two regions.
Question 2.g. We are also aware of several potential threats to “internal validity” of the results.
For each one of the five main internal validity threats, describe one possibility that could plausibly
1.3 Problem 3. World Health Organization
The World Health Organization (“WHO”) collects data which assesses the health care outcomes
of the populations in 191 countries across the globe, as well as exploring potential explanations for
those outcomes. These data are published in the annual “World Health Report.” The file who.csv
contains five years (1993-1997) of these data. The variables in the panel of countries include:
| Variable | Description |
| --- | --- |
| comp | composite measure of health care attainment |
| year | 1993, 1994, 1995, 1996, 1997 |
| hexp | per capita health expenditure |
| hc3 | educational attainment (tertiary schooling) |
| country | number assigned to country |
| oecd | dummy indicator for an OECD member country |
| gini | Gini coefficient for income inequality |
| geff | World Bank measure of government effectiveness |
| voice | World Bank measure of democratization of the political process |
| tropics | dummy indicator of tropical location |
| popden | population density (people per square mile) |
| pubthe | proportion of health expenditure paid by public authorities |
| gdpc | normalized per-capita GDP |
Question 3.a. Create a new variable for the dataset that is the square of educational attainment
(hc3). Then regress life expectancy (dale) on health expenditures (hexp), the educational attainment
in the country (hc3), and its square (the variable you created). For now, select rows from
1997 and use only these rows in the regression. Use robust standard errors and don’t forget to
add a constant term. Comment on whether you think the relationship between life expectancy and
education is linear or quadratic and why you came to that conclusion.
This question is for your code, the next is for your explanation.
[6]: …
Question 3.b. Explain.
Question 3.c. To the specification in part (a), add the additional control variables: gini, tropics,
popden, pubthe, gdpc, voice, and geff. Test whether these additional regressors are jointly
significant (we do the F-test for you in this part, you just have to interpret it). What effect does
inclusion of these additional controls have on the coefficients of the other included regressors?
This question is for your code, the next is for your explanation.
[7]: # This is the code for your regression.
# We give you starter code for this one so that we know what the variable name is
# for the regression results, which we use in the code cell below.
model_3b = …
results_3b = …
results_3b.summary()
[8]: # Please don't change this cell, just run it.
# This is how you do an F-test. Notice that we do .f_test on the results of the
# unrestricted model, and then we give the names of the variables we want to
# test inside quotation marks.
results_3b.f_test("gini, tropics, popden, pubthe, gdpc, voice, geff").summary()
Question 3.d. Explain.
Question 3.e. Return to the simpler regression specification in part (a). We want to see if the
determinants of life expectancy are different for rich and poor countries. Use membership in the
“Organisation for Economic Co-operation and Development” (oecd) as the indicator of a rich country.
The OECD had 30 member countries during this time period. Perform a test of the hypothesis
that all three of the coefficients in the population regression are equal for OECD and non-OECD
countries.
Hint: You will need to create three new variables.
This question is for your code, the next is for your explanation.
[9]: …
[52]: # This extra code cell may be helpful

Question 3.f. Explain.
Question 3.g. Give an example of a time-invariant variable that would result in different life
expectancy across countries.
Question 3.h. Estimate the regression having a fixed effect for each country in the sample. We
have defined the endogenous and exogenous variables for you, you just have to fill in the rest.
Notice how we converted the country variable to a set of dummy variables for each country. You
can ignore the coefficients for every country variable. What change took place in the coefficients
on the education variables? Explain why you think there was a change in these coefficients.
This question is for your code, the next is for your explanation.
[49]: # .get_dummies transforms a categorical variable into a dataframe of dummy variables,
# one for each category. The prefix and prefix_sep part just makes sure the variable
# names are strings and not integers. dtype=float keeps the dummies numeric, since
# recent pandas versions return booleans by default.
countries = pd.get_dummies(who['country'], prefix='', prefix_sep='', dtype=float)
# This just joins the dummy dataframe with the original
who_country = who[['dale', 'hexp', 'hc3', 'hc3^2']].join(countries)
y_3h = who_country['dale']
# Here we drop country 191, since otherwise there would be perfect collinearity in
# the columns. We also have to drop dale since that's the endogenous variable we
# regress on.
model_3h = sm.OLS(…, …)
results_3h = model_3h.fit(…)
results_3h.summary()
Question 3.i. Explain.
Question 3.j. Give an example of an entity-invariant variable, which is excluded from the estimated regression model in part (a), that would result in variation in life expectancy over time.
Question 3.k. Perform regression with time fixed effects. Are the results consistent with your
reasoning about the entity-invariant variables? The procedure for this question will be similar to
3.h. Drop the dummy variable for 1993 for this question.
This question is for your code, the next is for your explanation.
[50]: …
Question 3.l. Explain.