What is an improper prior?

A brief introduction to an essential concept in Bayesian statistics.

Author

Ben Ewing

Published

August 23, 2022


// Imports
stdlib = require( "https://unpkg.com/@stdlib/dist-flat@0.0.96/build/bundle.min.js" )


// Settings
scrubberOptions = ({
  format: (d) => d.toFixed(2),
  autoplay: true,
  loopDelay: 500,
  alternate: true
})

theme_obj = FileAttachment("../../themes/improper-prior-observable.json").json()
theme = theme_obj.scheme

Introduction

While improperprior.com is meant to act as my personal website, it seems irresponsible not to include a page explaining the concept of improper priors.

What follows is a basic overview of improper priors, while I start with an example using a proper conjugate prior, I presume some basic knowledge of the Bayesian process. Additional resources are included at the bottom of this post.

Using a Proper Prior

To set the stage, let’s suppose I am an avid runner, and I’d like to better understand how quickly my heart rate recovers after running a hard (for me) mile. The data from my runs might look something like this.

Day of Month	Mile Time (Seconds)	Recovery Heart Rate (Beats per Minute)
Table 1: Sample Heart Rate Data
1	489	165
5	465	169
7	470	177
18	491	175
27	477	180
30	455	167
Note: heart rate is measured one minute after a hard one mile effort.

For the sake of simplicity¹, let us assume that my recovery heart rate is normally distributed, and that the variance is fixed to some known population value. Formally, let \(\mu\) represent my heart rate, the quantity we’d like to estimate, and let \(\sigma^2\) be the known variance. We’ll fix \(\sigma^2=12\).

In a Bayesian framework, we assume that our data drawn from some distribution (normal, in this case) that is specified by some set of parameters (the unknown parameter \(\mu\), and the known parameter \(\sigma^2\)). Without looking at the data, we don’t know what \(\mu\) should be, but we can specify a distribution of possible values. For this example we’ll hypothesize that \(\mu\) is also normally distributed, \(\mu \sim N(160, 20)\) where the high variance expresses our uncertainty. This is our prior distribution.

With normally distributed data and a normal prior for \(\mu\), we’ve specified a normal-normal model with known variance.

Once I’ve collected data, denoted by \(Y\), we can use Bayes’ theorem to update our estimate of \(\mu\).

\[ P(\mu|Y) = \frac{L(Y|\mu)P(\mu)}{P(Y)}. \tag{1}\]

Where:

\(P(\mu|Y)\) is the distribution of possible values of \(\mu\), now informed by data. This is what we’d like to estimate.
\(L(Y|\mu)\) is the likelihood of observing the data that was collected, given the true value of \(\mu\).
\(P(\mu)\) is the prior distribution.
\(P(Y)\) is the unconditional likelihood of observing the data. We will ignore this term, as it is constant once the data has been observed.

If we evaluate this, we end up with²:

\[ P(\mu|Y) \propto N \left( \frac{1}{\frac{1}{20} + \frac{n}{12}} \left( \frac{160}{20} + \overline Y \frac{n}{12} \right), \frac{1}{\frac{1}{20} + \frac{n}{12}} \right). \tag{2}\]

Where:

\(n\) is the number of observations.
\(\overline Y\) is the mean observed recovery heart rate.
\(12\) is the known population variance of recovery heart rate recordings.
\(160\) is the prior mean recovery heart rate.
\(20\) is the variance we’ve put on the prior.

This is our posterior distribution, the distribution of possible values for my recovery heart rate taking into account both the observed data and the prior. Because we’ve used a conjugate prior we’ve ended up with a normal distribution, which is obviously very convenient.

viewof prior_µ = Inputs.range([100, 200], {value: 160, step: 1, label: "Prior µ:"})
viewof prior_σ_sq = Inputs.range([1, 50], {value: 20, step: 1, label: "Prior σ^2:"})

viewof observed_hr_input = Inputs.text({label: "Observed Recovery Heart Rates: ", 
                                        value: "165, 169"})
observed_hr = observed_hr_input.split`,`
  .map(x => +x)
  .filter( value => !Number.isNaN(value) && value != 0)

observed_µ = observed_hr.reduce((a, b) => a + b) / observed_hr.length
observed_n = observed_hr.length

posterior_µ = (1/(1/prior_σ_sq + observed_n/12)) * (prior_µ/prior_σ_sq + observed_µ*(observed_n/12))
posterior_σ_sq = 1/(1/prior_σ_sq + observed_n/12)

x = stdlib.linspace( 80, 240, 1000 )
y_prior = x.map((y => stdlib.base.dists.normal.pdf( y, prior_µ, Math.sqrt(prior_σ_sq) ) ))
y_posterior = x.map((y => stdlib.base.dists.normal.pdf( y, posterior_µ, Math.sqrt(posterior_σ_sq) ) ))

length = x.length

prior_data = Array.from({length}, (_, i) => ({
  x: x[i],
  y: y_prior[i],
  type: "prior"
}))

posterior_data = Array.from({length}, (_, i) => ({
  x: x[i],
  y: y_posterior[i],
  type: "posterior"
}))

plot_data = prior_data.concat(posterior_data)


Plot.plot({
  x: {
    label: "x",
    domain: [100, 200]
  },
  y: {
    label: "p(x)"
  },
  color: {
    legend: true,
    range: theme
  },
  marks: [
    Plot.line(plot_data, {x: "x", y: "y", stroke: "type"}),
    Plot.tickX(observed_hr, {strokeOpacity: 0.2})
  ],
  height: 300
})

Using an Improper Prior

While this seems like a good estimate, a skeptic may reasonably argue that the choice of a \(\mu \sim N(160, 20)\) prior is biasing the estimate. It’s a best guess, when the reality is that we don’t really know what \(\mu\) should be before observing any data. As a naive alternative, what would happen if we just set the prior probability of each possible value of \(\mu\) to 1?

Well it turns out that we can do exactly this - we can use any prior, even an improper prior as long as the posterior comes out to be a proper distribution. In other words, the prior need not be a proper distribution so long as the posterior is. In this particular case we end up with up with a posterior that is informed entirely by observed data.

\[ \begin{align*} P(\mu|Y) &\propto L(Y|\mu)P(\mu) \\ &\propto L(Y|1)\times1 \\ &\propto L(Y) \end{align*} \tag{3}\]

Choosing an improper prior that generates a valid posterior can be tricky, but Jeffreys prior is a generally good place to start.

Further Resources

Craig Gidney has a nice blog post walking through a slightly more technical example of improper priors. Likewise, Andy Jones has a great podcast with a few additional examples. For a more general treatment A First Course in Bayesian Statistical Methods by Hoff and Bayesian Data Analysis by Gelman et al are the standard introductory Bayesian statistics textbooks.

As ever, Wikipedia has very detailed articles on priors, more suitable for reference than learning:

Footnotes

The normal distribution is not perfect for this problem as it includes all real valued numbers - I can’t possibly have a negative heart rate. The gamma distribution may be a more natural fit, however the normal distribution has easily interpretable parameters, and the known variance means that we only need to focus on a single parameter.↩︎
See this Statlect page for the actual derivation of the posterior.↩︎