The Two Cultures of Statistics

article · philosophical · 2001

There are two cultures in the use of statistical modeling to reach conclusions from data. One assumes that the data are generated by a given stochastic data model. The other uses algorithmic models and treats the data mechanism as unknown.

This paper appeared in August 2001 in “Statistical Science”, volume 16, and was written by Prof. Leo Breiman († July 5, 2005), a statistician at the University of California, Berkeley.

1. Introduction

There are two cultures in the statistical modeling of data. The traditional Data Modeling Culture assumes that the data are generated by a certain stochastic mechanism, a model. The Algorithmic Modeling Culture treats the mechanism as unknown and seeks to find a formula that operates on the data.

2. The Data Modeling Culture

The Data Modeling Culture is characterized by:

  • Assumption of a stochastic mechanism that generates data
  • Maximum likelihood estimation
  • Hypothesis testing and p-values
  • Focus on parameters and their significance

2.1 Example: Linear Regression

\begin{equation} y = X\beta + \epsilon \end{equation}

where $\epsilon \sim N(0, \sigma^2 I)$. The focus is on estimating $\beta$ and establishing the statistical significance of its components.
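As a minimal sketch of this workflow (using synthetic data, with numpy as an assumed dependency): under the Gaussian error model, the maximum likelihood estimate of $\beta$ coincides with ordinary least squares, and the estimated covariance of $\hat\beta$ yields the t-statistics used for significance testing.

```python
import numpy as np

# Synthetic data from the assumed model y = X beta + eps, eps ~ N(0, sigma^2 I).
rng = np.random.default_rng(0)
n = 200
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])  # intercept + 2 covariates
beta_true = np.array([1.0, 2.0, 0.0])
y = X @ beta_true + rng.normal(scale=0.5, size=n)

# Under the Gaussian model, the MLE of beta is the least-squares solution.
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)

# Residual variance estimate and covariance of beta_hat: sigma^2 (X^T X)^{-1}.
resid = y - X @ beta_hat
sigma2_hat = resid @ resid / (n - X.shape[1])
cov = sigma2_hat * np.linalg.inv(X.T @ X)

# t-statistics: coefficient divided by its standard error.
t_stats = beta_hat / np.sqrt(np.diag(cov))
```

The culture's conclusions are read off the fitted parameters: large |t| values mark coefficients deemed significant, and inference is valid only insofar as the assumed stochastic model holds.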

3. The Algorithmic Modeling Culture

The Algorithmic Modeling Culture is characterized by:

  • Focus on predictive accuracy
  • Complex models like neural networks and random forests
  • Validation on test data
  • Black-box methodology

The algorithmic approach makes no assumptions about the data-generating mechanism; instead, it focuses on finding algorithms that predict well in practice, judged by accuracy on held-out data.
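A minimal sketch of this stance (synthetic two-class data, a deliberately simple 1-nearest-neighbor rule standing in for the black-box model): no stochastic model is assumed, and the sole measure of quality is predictive accuracy on a held-out test set.

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic two-class data: class 0 near the origin, class 1 shifted by 3 units.
X = np.vstack([rng.normal(0, 1, (100, 2)), rng.normal(3, 1, (100, 2))])
y = np.array([0] * 100 + [1] * 100)

# Random train/test split; the test set is the arbiter of model quality.
idx = rng.permutation(200)
train, test = idx[:150], idx[150:]

def predict_1nn(X_train, y_train, x):
    # Predict the label of the closest training point (squared Euclidean distance).
    d = np.sum((X_train - x) ** 2, axis=1)
    return y_train[np.argmin(d)]

preds = np.array([predict_1nn(X[train], y[train], x) for x in X[test]])
accuracy = np.mean(preds == y[test])
```

Note what is absent: no likelihood, no parameter estimates, no p-values. The algorithm is treated as a prediction machine, and swapping in a random forest or neural network changes nothing about the evaluation protocol.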