Health Research and Policy

Abstract

DATE: October 10, 2013
TIME: 1:15 - 3:00 pm
LOCATION: Medical School Office Building, Rm x303
TITLE: Local Case-Control Sampling: Efficient Subsampling for Imbalanced Data Sets
SPEAKER: Will Fithian
Department of Statistics, Stanford

For classification problems with significant class imbalance, subsampling can reduce computational costs at the price of inflated variance in estimating model parameters. We propose a method for subsampling efficiently for logistic regression by adjusting the class balance locally in feature space via an accept-reject scheme that generalizes standard case-control sampling. Standard case-control sampling is inconsistent for the population coefficients under model misspecification; our method, by contrast, is consistent provided that the pilot estimate is. In extremely imbalanced data sets the subsample may contain only a tiny fraction of the full data set. Still, under correct specification and with a consistent, independent pilot estimate, the subsampled estimate has exactly twice the asymptotic variance of the full-sample MLE, and this factor of two can be improved by increasing the number of data points accepted.
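
The accept-reject idea can be illustrated with a minimal sketch. It is not the speakers' implementation. It assumes the acceptance rule given in the suggested reading: a point (x, y) is kept with probability |y - p̃(x)|, where p̃ is the pilot model's fitted probability, and the pilot coefficients are added back to the coefficients fitted on the subsample. The one-feature model, the Newton solver, and all function names (`sigmoid`, `fit_logistic`, `lcc_subsample`, `lcc_estimate`, `simulate`) are hypothetical choices made for illustration, and the pilot is fitted here on a separate simulated sample to keep it independent.

```python
import math
import random


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def fit_logistic(xs, ys, iters=25):
    """Newton's method for the model P(y=1 | x) = sigmoid(a + b*x)."""
    a, b = 0.0, 0.0
    for _ in range(iters):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for x, y in zip(xs, ys):
            p = sigmoid(a + b * x)
            w = p * (1.0 - p)
            g0 += y - p            # score, intercept component
            g1 += (y - p) * x      # score, slope component
            h00 += w               # entries of the 2x2 information matrix
            h01 += w * x
            h11 += w * x * x
        det = h00 * h11 - h01 * h01
        if det <= 1e-12:           # information matrix is (near) singular
            break
        a += (h11 * g0 - h01 * g1) / det
        b += (h00 * g1 - h01 * g0) / det
    return a, b


def lcc_subsample(xs, ys, pilot, rng):
    """Keep (x, y) with probability |y - p_tilde(x)| (assumed rule)."""
    a0, b0 = pilot
    kept_x, kept_y = [], []
    for x, y in zip(xs, ys):
        p_tilde = sigmoid(a0 + b0 * x)
        if rng.random() < abs(y - p_tilde):
            kept_x.append(x)
            kept_y.append(y)
    return kept_x, kept_y


def lcc_estimate(xs, ys, pilot, rng):
    """Fit on the accepted points, then add the pilot coefficients back."""
    sx, sy = lcc_subsample(xs, ys, pilot, rng)
    a_s, b_s = fit_logistic(sx, sy)
    return (a_s + pilot[0], b_s + pilot[1]), len(sx)


def simulate(n, a, b, rng):
    """Draw x ~ N(0, 1) and y ~ Bernoulli(sigmoid(a + b*x))."""
    xs = [rng.gauss(0.0, 1.0) for _ in range(n)]
    ys = [1 if rng.random() < sigmoid(a + b * x) else 0 for x in xs]
    return xs, ys


if __name__ == "__main__":
    rng = random.Random(0)
    pilot_x, pilot_y = simulate(2000, -4.0, 1.5, rng)   # independent pilot data
    pilot = fit_logistic(pilot_x, pilot_y)
    xs, ys = simulate(20000, -4.0, 1.5, rng)            # imbalanced full sample
    estimate, n_kept = lcc_estimate(xs, ys, pilot, rng)
    print("kept", n_kept, "of", len(xs), "points; estimate:", estimate)
```

The add-back step rests on a short calculation: under this acceptance rule, the log-odds of y = 1 among accepted points equal the population log-odds minus the pilot log-odds, so the subsample fit estimates the difference between the true and pilot coefficients.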

This is joint work with Trevor Hastie.

Suggested readings:
Fithian, William, and Trevor Hastie. Local Case-Control Sampling: Efficient Subsampling in Imbalanced Data Sets. arXiv preprint arXiv:1306.3706 (2013).
