Working Paper: Heckroc: ROC Curves for Selected Samples

Paper Authors: Jonathan Cook and Ashish Rajbhandari
Publication: Accepted for publication in The Stata Journal

Research Focus: Evaluating model performance is a crucial step in developing a predictive model. In some settings, the available data are drawn nonrandomly from the population. For example, the Basel accords require banks to estimate the probability of default for their loans. To assess the predictive performance of their probability-of-default models, banks could evaluate predictions using the sample of loan applicants who were granted loans. The sample of applicants who received a loan, however, is likely to differ from the full population of applicants. Similarly, a regulator may be interested in the performance of a model designed to predict regulatory infractions. If the regulator can learn whether an infraction has occurred only after completing an inspection, the data available to evaluate prediction may differ from the population in key ways.

Receiver Operating Characteristic (ROC) curves, which are a common tool for evaluating predictions, provide a misleading picture of a model's predictive power when used with selected samples. A recent PCAOB Office of Economic and Risk Analysis paper (published as Cook, J. "ROC Curves and Nonrandom Data," Pattern Recognition Letters, 2017, 85(1): 35-41) provides a procedure to consistently estimate the ROC curve that would be obtained with a random sample.
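To make the problem concrete, the following is a minimal simulation sketch of how selection can distort the area under the ROC curve (AUC). It is not the heckroc procedure or the estimator from the paper; it only illustrates the distortion described above. The data-generating process, the variable names, and the cutoff are all assumptions chosen for illustration: a lender grants loans only to applicants who look low-risk on both the model score and information the analyst does not observe.

```python
import random


def auc(scores, labels):
    """Area under the ROC curve via the rank-sum (Mann-Whitney) identity."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    # Sum of 1-based ranks of the positive cases; ties are negligible
    # because the simulated scores are continuous.
    rank_sum = sum(rank for rank, i in enumerate(order, start=1) if labels[i] == 1)
    return (rank_sum - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)


def simulate(n=20000, cutoff=0.5, seed=0):
    """Return (AUC in the full population, AUC in the selected sample).

    Hypothetical setup, assumed for illustration only:
      x = model score, v = risk information seen only by the lender,
      e = noise; default occurs when x + v + e > 0, and a loan is
      granted (the case is observed) only when x + v < cutoff.
    """
    rng = random.Random(seed)
    scores, labels, selected = [], [], []
    for _ in range(n):
        x = rng.gauss(0, 1)
        v = rng.gauss(0, 1)
        e = rng.gauss(0, 1)
        scores.append(x)
        labels.append(1 if x + v + e > 0 else 0)
        selected.append(x + v < cutoff)
    auc_full = auc(scores, labels)
    sel_scores = [s for s, keep in zip(scores, selected) if keep]
    sel_labels = [y for y, keep in zip(labels, selected) if keep]
    auc_selected = auc(sel_scores, sel_labels)
    return auc_full, auc_selected


if __name__ == "__main__":
    auc_full, auc_selected = simulate()
    print(f"AUC, full population: {auc_full:.3f}")
    print(f"AUC, selected sample: {auc_selected:.3f}")
```

In this assumed setup the AUC computed on granted loans understates the AUC that the same score would achieve on all applicants, because selection narrows the range of risk among the observed cases. Other selection mechanisms could bias the estimate in the opposite direction.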

In "Heckroc: ROC Curves for Selected Samples," the authors describe a Stata module that implements the procedure described in "ROC Curves and Nonrandom Data." The module, called heckroc, is available from the Boston College Statistical Software Components (SSC) archive and can be installed by typing "ssc install heckroc" in Stata.

The heckroc module estimates the area under the ROC curve and provides a graphical display of the curve. A variety of plot options are available, including the ability to add confidence bands to the plot. The module also comes with a data set that illustrates the effect of sample selection on ROC curves.