Empirical Finance: Meeting Fiduciary Standards Through Skepticism, not Cynicism
Michael Edesess has published a scathing article lambasting the field of empirical finance. He draws inspiration from Harvey, Liu and Zhu’s (HLZ) recent article, entitled “…and the Cross Section of Expected Returns”, but extends HLZ’s conclusions to an absurd limit. In this article, we discuss why we embrace the framework of healthy skepticism described by HLZ, but in the context of a more optimistic and constructive view of empirical finance.
A Primer on Empirical Finance
Academics and quantitatively minded practitioners in finance spend much of their time trying to discover new sources of excess returns in markets. For the purpose of this article, let’s call these sources of excess returns ‘factors’. The most robust studies begin with a logical premise based on a reasonable theory about sources of risk and how investors behave. Typically a theory proposes that investors misbehave in a consistent way, which helps to inform a hypothesis that the researchers can test by examining data. The hypothesis often asserts that the proposed mistakes provide an opportunity for others to earn excess profits, and the researchers believe the data will support this view.
Good science, however, starts by stating a so-called ‘null hypothesis’, which is consistent with the dominant paradigm. In finance, the dominant paradigm is the ‘Efficient Markets Hypothesis’, which generally states that investors do not make mistakes, and that there are no persistent sources of excess returns in markets. As such, researchers who present a theory for why investors do make persistent mistakes, which provide opportunities for excess profits, face a high hurdle if they expect their peers to embrace their theory.
In finance, these hurdles are represented by statistical thresholds. For example, in order for a finance researcher to reasonably argue that the effect he has discovered is ‘significant’, there must be less than a 5% probability that the effect might have been observed purely by chance. If there is greater than 5% likelihood that the observed effect is the result of luck, the researcher cannot reject the null hypothesis that markets are efficient. Only when the observed effect is substantially large, and/or has been observed across a large number of observations, such that there is less than a 5% chance that the results are due to random chance, can the researcher reject the null hypothesis and claim that his theory has merit.
By convention, most finance researchers prefer to think of statistical significance not in terms of probability values explicitly, but rather in terms of ‘t-scores’. Basic statistics maps any t-score to a probability value, so the two lead to the same conclusions. Without going too far down the rabbit hole, suffice it to say that under standard assumptions, if an effect presents with a t-score greater than ~2, there is less than a 5% chance that the effect is random; it is therefore considered statistically significant. Where a researcher wants to be extra cautious, he may require the effect to be significant at a 1% threshold, which corresponds to a t-score of about 2.6.
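The correspondence between t-scores and probability values can be sketched in a few lines of Python. This is a minimal illustration that assumes a two-sided test and the large-sample normal approximation to the t-distribution (with small samples the thresholds would be somewhat higher). The function names are ours, chosen for illustration.

```python
from statistics import NormalDist

STANDARD_NORMAL = NormalDist()  # mean 0, standard deviation 1


def two_sided_p_value(t_score: float) -> float:
    """Probability of a result at least this extreme if the null hypothesis holds."""
    return 2.0 * (1.0 - STANDARD_NORMAL.cdf(abs(t_score)))


def t_threshold(alpha: float) -> float:
    """Smallest absolute t-score that is significant at level alpha (two-sided)."""
    return STANDARD_NORMAL.inv_cdf(1.0 - alpha / 2.0)


if __name__ == "__main__":
    for alpha in (0.05, 0.01):
        print(f"alpha = {alpha:.2f} -> t-score threshold = {t_threshold(alpha):.2f}")
    print(f"t-score = 2.00 -> p-value = {two_sided_p_value(2.0):.4f}")
```

Running the script prints the thresholds behind the ‘~2’ and ‘about 2.6’ rules of thumb quoted above.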
But there are Many Tests
In their paper, HLZ assert, quite rightly, that as more researchers run experiments to find factors that explain (i.e. forecast) sources of excess returns, there is an increasing probability that researchers will stumble on spurious factors. In other words, if many researchers are running many tests, it is inevitable that some researchers will find effects that appear to be statistically significant, but which have in fact occurred purely by chance. As such, observers of these studies should become more skeptical of statistically significant results over time.
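To see how quickly spurious discoveries accumulate, consider the following rough sketch. It assumes every tested factor is truly useless and that the tests are independent of one another, which is a simplification (real factor tests are correlated). The test counts are hypothetical and are not taken from HLZ.

```python
def prob_at_least_one_false_positive(num_tests: int, alpha: float = 0.05) -> float:
    """Chance that one or more of `num_tests` independent tests of truly
    useless factors clears the significance level `alpha` by luck alone."""
    return 1.0 - (1.0 - alpha) ** num_tests


def expected_false_positives(num_tests: int, alpha: float = 0.05) -> float:
    """Average number of useless factors that look significant by chance."""
    return num_tests * alpha


if __name__ == "__main__":
    # Hypothetical numbers of tests, for illustration only.
    for num_tests in (1, 10, 50, 100, 300):
        prob = prob_at_least_one_false_positive(num_tests)
        expected = expected_false_positives(num_tests)
        print(f"{num_tests:>3} tests: P(at least one fluke) = {prob:.3f}, "
              f"expected flukes = {expected:.1f}")
```

Under these assumptions, a single test at the 5% level is rarely fooled, but a few dozen tests are almost certain to produce at least one ‘significant’ result that is pure luck.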
HLZ suggest that, to counter this known issue, thresholds for statistical significance should rise with the number of tests that have been conducted to date. They propose three adjustments to account for this issue, and demonstrate that, after applying these adjustments, the majority of historical ‘discoveries’ in finance, including those that meet traditional tests of significance, are probably due to random chance.
Figure 1 is drawn from HLZ. The red, green and blue lines on the chart show three methods for how the threshold for statistical significance should rise in recognition of an accelerating number of factor tests through time. For example, according to the Bonferroni test (blue line), the t-score that would indicate statistical significance in 1995 was about 3.25, while new factors tested today should be rejected unless they exceed a t-score of 3.75. While this might seem like a small increase in threshold value, in fact it represents a hurdle that is roughly 6x harder to overcome, as measured by the probability of clearing it purely by chance.
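The Bonferroni idea can be sketched as follows. This is the textbook form of the adjustment (divide the significance level by the number of tests), again using a two-sided normal approximation; HLZ’s own implementation may differ in its details, and the test counts in the loop are hypothetical. The last lines compare the chance probabilities implied by the two t-scores quoted above, which is one way to quantify how much harder the higher hurdle is.

```python
from statistics import NormalDist

STANDARD_NORMAL = NormalDist()


def two_sided_p_value(t_score: float) -> float:
    """Probability of a result at least this extreme if the null hypothesis holds."""
    return 2.0 * (1.0 - STANDARD_NORMAL.cdf(abs(t_score)))


def bonferroni_t_threshold(num_tests: int, alpha: float = 0.05) -> float:
    """t-score hurdle after dividing alpha equally across `num_tests` tests."""
    adjusted_alpha = alpha / num_tests
    return STANDARD_NORMAL.inv_cdf(1.0 - adjusted_alpha / 2.0)


# How many times less likely is it to clear 3.75 by chance than 3.25?
hurdle_ratio = two_sided_p_value(3.25) / two_sided_p_value(3.75)

if __name__ == "__main__":
    # Hypothetical cumulative test counts, for illustration only.
    for num_tests in (1, 50, 100, 300):
        print(f"{num_tests:>3} tests -> t-score hurdle = "
              f"{bonferroni_t_threshold(num_tests):.2f}")
    print(f"Chance of clearing 3.25 vs 3.75 by luck: {hurdle_ratio:.1f}x")
```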
Figure 1. Published t-scores and statistical significance of select equity market factors at time of publication adjusted for data-mining bias.
Source: Harvey, Liu and Zhu, 2014
Factors in Figure 1 are highlighted according to their dates of discovery and the statistical significance (t-score) found in the original papers. Note that the market (MRT), value (HML), momentum (MOM), durable capital goods (DCG), and short-run volatility (SRV) factors exceed even the most conservative adjusted thresholds for statistical significance. Importantly, these factors would still be deemed statistically significant even if they had just been discovered recently, because their t-scores are so high. Perhaps surprisingly for some, the small-cap premium (SMB) did not survive these more rigorous tests.
Michael Edesess argues that the thresholds above rest on a substantial undercount of the true number of tests that have been conducted, because research that does not reach a significant conclusion is rarely published. This did not go unnoticed by HLZ. In fact, they calculate that the number of published factor papers understates the number of actual factor tests by about 71%. However, when they make adjustments based on these more conservative estimates, the five factors above remain statistically significant.
From Skepticism to Cynicism
HLZ demonstrate a healthy skepticism in acknowledging the prevalence of Type I error in financial research. In contrast, Michael extends HLZ’s conclusions from healthy skepticism to zealous cynicism. Where HLZ propose structured and well established methods to account for data-mining bias in research, Michael wonders, “Is it impossible to raise the bar high enough?”, and answers: yes!
Unfortunately, Michael offers no evidence to support this contention, and no solution in the event that it is true. As such, Michael leaves financial practitioners in the predicament of having to make decisions in markets, but where no research that might inform these decisions can be trusted. This is hardly constructive.
We contend that HLZ provide a rational framework for raising the bar on financial research. By applying their guidelines, investment practitioners have tools at their disposal to differentiate between spurious conclusions and meaningfully prospective methods, and can make appropriate informed decisions in this context.
Confidence Inspired by More Evidence
HLZ are not alone in their desire to bring statistical rigour to the financial research process. Many well respected practitioners share HLZ’s concerns and apply similar methods in their own practices.
One way for practitioners to gain greater confidence in prospective factors is through out-of-sample testing. Fortunately, there is an abundance of out-of-sample analysis validating the most robust factors. One obvious out-of-sample test involves testing the factor on a brand new universe. For example, if a method worked on U.S. stocks, it should also work on stocks in international markets. In addition, if a factor was identified in 1993, then tests over the 20-year period from 1994-2013 are also considered out-of-sample. One might also ‘perturb’ a factor’s specification to test for robustness, say by changing the definition of ‘value’ from price-to-book value to price-to-cash-flow or price-to-earnings.
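An out-of-sample test in time can be sketched as below: split a factor’s annual returns at the year of discovery and compute the t-score of the average return on each side. The returns here are simulated placeholders, not real factor data, and the 1967 start date, mean and volatility parameters, and function names are our own assumptions for illustration.

```python
import random
from math import sqrt
from statistics import mean, stdev


def t_score(returns: list[float]) -> float:
    """t-score of the mean return against a null hypothesis of zero."""
    return mean(returns) / (stdev(returns) / sqrt(len(returns)))


def split_at_discovery(returns_by_year: dict[int, float], discovery_year: int):
    """Separate returns up to the discovery year from those that came after."""
    in_sample = [r for year, r in returns_by_year.items() if year <= discovery_year]
    out_of_sample = [r for year, r in returns_by_year.items() if year > discovery_year]
    return in_sample, out_of_sample


# Simulated placeholder data, NOT real factor returns (assumed 4% mean, 10% volatility).
rng = random.Random(42)
simulated_returns = {year: rng.gauss(0.04, 0.10) for year in range(1967, 2014)}

in_sample, out_of_sample = split_at_discovery(simulated_returns, discovery_year=1993)

if __name__ == "__main__":
    print(f"In-sample:     {len(in_sample)} years, t-score = {t_score(in_sample):.2f}")
    print(f"Out-of-sample: {len(out_of_sample)} years, t-score = {t_score(out_of_sample):.2f}")
```

A factor that is real should show a t-score of similar sign and reasonable size in the years after discovery; one that was mined from the data will tend to fade.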
In “Finding Smart Beta in the Factor Zoo”, Jason Hsu and Vitali Kalesnik at Research Affiliates performed tests of the value, momentum, low beta, quality and size factors on stocks across U.S. and international markets. For tests on U.S. markets they used data back to 1967, while international tests were run from 1987. Recall that the size, value and momentum factors were first documented in the early 1990s, and the low beta anomaly was first catalogued by Haugen in the mid-1970s. In addition, all factors were first identified using exclusively U.S. stocks. As such, by testing on international markets over the period 1987-2013, their analysis was legitimately ‘out of sample’. That is, they tested on out-of-sample universes over a 26-year horizon, of which 20 years were also out of sample in time. Results in international markets were consistent with the results of the seminal papers.
In addition, Hsu and Kalesnik tested using different definitions of the factors. For example, they tested ‘value’ as defined by dividends-to-price, cash-flow-to-price, and earnings-to-price as well as the original book-to-price metric. They also varied the lookback horizons and skip-months for momentum, and tested both beta and volatility for the low-beta factor, again with different lookback horizons. As you can see from Figure 2, the value, momentum and low beta factors all proved robust to alternative definitions.
Figure 2. Value, low beta and momentum factors prove robust to alternative specifications
Source: Research Affiliates using CRSP/Compustat data
Clearly Hsu and Kalesnik at Research Affiliates take seriously the concerns raised by HLZ, and have taken steps to increase the empirical rigour of their solutions. But they are not alone.
Researchers at AQR, including Cliff Asness and colleagues, performed their own analysis of the value and momentum factors across both a universe of global stocks and a universe of global asset class indexes. Their tests span the period 1972-2011, so about 40% of their analysis period is out of sample in time. Of course, about half of their global stock universe, and the entire global asset class universe, is also out of sample for the entire period. Their results are summarized in Figure 3 below.
Figure 3. Statistical significance of value and momentum factors across global stocks and asset classes, 1972-2011
Source: Asness, Moskowitz and Pedersen, “Value and Momentum Everywhere”
Note the statistical significance, highlighted in green, of risk-adjusted excess returns from the value and momentum factors in global stocks (top) and global asset classes (bottom). This analysis validates the persistence of the value and momentum factors across a largely out-of-sample data set. Even better, the t-scores exceed the higher thresholds proposed by HLZ, and tests on the asset class universe overcome HLZ’s higher hurdles with substantial margin to spare (full disclosure: ReSolve investment solutions rely largely on asset class momentum and low beta factors).
The Recipe for Success
HLZ, Hsu and Kalesnik, AQR, ourselves, and most other reputable practitioners agree that empirical evidence is necessary, but not sufficient to validate prospective investment strategies. Rather, in addition to empirical evidence, Hsu and Kalesnik maintain that, “The factor has [to have] a credible reason to offer a persistent premium”. Specifically, they would find empirical evidence of a factor compelling if:
- It is related to a macro risk exposure, or
- It is related to a deep-rooted behavioral bias that is present in a meaningful fraction of investors, or
- It is related to an institutional feature that cannot be easily changed.
Wes Gray of AlphaArchitect makes almost exactly the same point in a recent article. Basically, Wes says that even if you identify a statistically significant market opportunity, you must still ask yourself these four questions before you can have confidence that the effect is real:
- Who is the sucker – that is, from which investors do we expect to extract excess returns?
- Why are the suckers making mistakes, and why should they be expected to continue making the same mistakes in the future?
- Who are the pros – that is, who are the natural arbitrageurs of the opportunity?
- What is preventing the pros from taking advantage of the opportunity?
If you can answer these four questions with confidence, and the effect demonstrates statistically significant results, then a reasonable practitioner would be remiss in ignoring the potential opportunity.
Where We Stand
We firmly believe that the scientific process is alive and well in empirical finance, and that HLZ’s guidelines are an excellent example of the process at work. Financial practitioners should evaluate research with a healthy skepticism, and an awareness of the implications of data mining. Further, investors should be especially skeptical of papers claiming the existence of new factors with no foundation in theories of risk or investor behaviour, and where there are no clear limits to arbitrage.
Where possible, investors should seek out validation through out-of-sample testing. This includes tests on new investment universes; tests on different time frames; and tests that perturb the definition of the phenomenon. Where a factor has good theoretical roots and proves resilient to a wide variety of empirical tests, rational and dispassionate practitioners must answer a difficult question: why would a responsible fiduciary ignore the opportunity?
Ideology of any kind is the enemy of constructive thinking. The hallmark of a cynic is sweeping disregard for evidence where it runs counter to his worldview. If a person’s worldview holds that all financial professionals are charlatans, he will view all financial research through the prism of this ideology. But this is profoundly counterproductive. After all, the practice of financial advice and management demands real-time decisions every day that affect people’s lives. How is one supposed to act? Worse, how is one supposed to uphold his fiduciary standard?
The truth is, an investor can choose to act in a way that is consistent with how he feels the world should work, or he can choose to act in a way that is consistent with a thoughtful interpretation of the evidence. Empirical finance is the manifestation of this latter worldview, and is consistent with all reasonable thresholds of professionalism and fiduciary standards.
We know where we stand, and now so do you.