Working Paper: Random Forests and Selected Samples

Paper Authors: Jonathan Cook and Saad Siddiqui

Research Focus: A common goal of economic analysis is to understand how a variable of interest (e.g., auditor effort) relates to an outcome (e.g., PCAOB inspection findings). Linear regression is a powerful tool for uncovering causal relationships, but it can be misleading when applied to selected samples. This concern is important for analyses at the PCAOB, because the PCAOB collects information only for inspected audits of issuers. Although a large economics literature develops estimators that recover causal relationships from selected samples, these estimators typically require strong assumptions about the selection mechanism.

The paper, "Random Forests and Selected Samples," provides causal estimates while leaving the selection mechanism largely unspecified. The paper's procedure builds on existing work on sample selection problems as well as a recent development in machine learning. Specifically, the authors lean on a variation of random forest regression developed by Susan Athey and Stefan Wager at Stanford University.

The intuition for the paper's estimator is that using random forests to model the effects of selection makes it possible to subtract these effects from the data. Once these effects are subtracted, one can proceed with analyses using linear regression.
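
To make this intuition concrete, the following is a minimal sketch of a "model the selection effects, subtract them, then run a linear regression" workflow on simulated data. It is not the authors' estimator: it substitutes standard scikit-learn random forests for the Athey and Wager variant, and it uses a Robinson-style partialling-out step that is assumed here purely for illustration. The data-generating process, the excluded variable `w`, the true coefficient of 2.0, and the helper names (`oob_fit`, `ols_slope`, `simulate`, `compare_estimators`) are all hypothetical.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor


def oob_fit(features, target, seed):
    """Out-of-bag random forest predictions of target given features."""
    forest = RandomForestRegressor(
        n_estimators=200, min_samples_leaf=20,
        bootstrap=True, oob_score=True, random_state=seed,
    )
    forest.fit(features, target)
    return forest.oob_prediction_


def ols_slope(x, y):
    """Slope from a simple linear regression of y on x with an intercept."""
    xc = x - x.mean()
    return float(xc @ (y - y.mean()) / (xc @ xc))


def simulate(n=4000, beta=2.0, seed=0):
    """Hypothetical data: the outcome is observed only when x + w + v > 0."""
    rng = np.random.default_rng(seed)
    x = rng.normal(size=n)   # variable of interest
    w = rng.normal(size=n)   # affects selection but not the outcome (assumed)
    v = rng.normal(size=n)   # unobserved selection shock
    # Outcome error is correlated with v, so selection distorts the sample.
    u = 2.0 * (0.9 * v + np.sqrt(1 - 0.9**2) * rng.normal(size=n))
    y = 1.0 + beta * x + u
    selected = (x + w + v) > 0
    return x, w, y, selected


def compare_estimators(seed=0):
    """Return (naive OLS slope, slope after subtracting selection effects)."""
    x, w, y, s = simulate(seed=seed)
    # Step 1: a forest estimates each observation's selection probability.
    p_hat = oob_fit(np.column_stack([x, w]), s.astype(float), seed)
    xs, ys, ps = x[s], y[s], p_hat[s].reshape(-1, 1)
    # Step 2: in the selected sample, subtract the part of y and x that is
    # predictable from the selection probability.
    y_res = ys - oob_fit(ps, ys, seed + 1)
    x_res = xs - oob_fit(ps, xs, seed + 2)
    # Step 3: ordinary linear regression on the adjusted data.
    return ols_slope(xs, ys), ols_slope(x_res, y_res)
```

Calling `compare_estimators()` returns the slope from a regression on the raw selected sample alongside the slope from the adjusted data, so the two can be compared with the simulated coefficient of 2.0. Out-of-bag predictions are used so that no observation's own outcome is used in computing the adjustment applied to it.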

To examine the performance of the procedure, the authors compare the paper's estimator with several existing estimators from the economics literature, using both simulated data and data on married women's wages. Because not all married women enter the workforce, estimates of the effect of education on wages from the selected sample may differ from the causal effect. The results obtained from the paper's estimator differ from those obtained from other methods, and the estimator performs well in settings where the selection mechanism is unknown.