Advanced Sampling

A Slide Presentation on advanced sampling methods.

(Christian Steinruecken, Tomoharu Iwata)



- Introduction
- Basic sampling techniques
- Markov-Chain Monte Carlo
- Advanced MCMC

"That's what the brain does."

-- Máté

Looking at samples from a model can help to understand the model's properties.

\[ \PP{x \| y}\ =\ \frac{\PP{y \| x} \mul \PP{x}}{\PP{y}} \] where: \[ \PP{y}\ =\ \int \PP{y \| x} \PP{x}\ dx\ =\ Z \]

- Posterior expectations: \(\displaystyle \int f(x)\ \PP{x \| y}\ dx \)
- Evidence and model selection: \(\displaystyle \int \PP{y \| x}\ \PP{x}\ dx \)
- Prediction: \(\displaystyle \int \PP{y_* \| x}\ \PP{x \| y}\ dx \)

\(\displaystyle
\approx \frac{1}{N} \sum_i f(x^{(i)},y)
\)

- Replace hard integrals with summations.
- Sampling methods
- Central problem: how to sample \(x^{(i)}\)
- Monte Carlo, MCMC, Gibbs, etc.

\(\displaystyle
\approx \int g(x,y) \ dx
\)

- Replace hard integrals with easier integrals.
- Message passing on factor graph
- Central problem: how to find \(g(\cdot) \in \set{G}\)
- VB, EP, etc.

\(\displaystyle
\approx \int h(x,y; x^*) \ dx
\)

- Replace hard integrals with estimators.
- "Non-Bayesian" methods
- Central problem: how to find \(x^*\)
- MAP, ML, Laplace, etc.

- A source of randomness (a generator of random events).
- A method of transforming the randomness into samples from our chosen distribution.

Sources of natural randomness:

- chaotic or unstable dynamical systems
- thermal noise from transistors
- clock drift
- shot noise from photo diodes
- particle emissions from radioactive materials
- semi-transparent mirror + photon detectors
- measurement of a quantum system
- keystroke latencies of typed text

There are many kinds of hardware random number generators. E.g. the UK government's ERNIE. Thomson (1959)

A radioactive Caesium-137 probe next to a Geiger-Müller detector.

\[ ^{137}\mathrm{Cs}\ \stackrel{\scriptscriptstyle 30.17\mathrm{y}}{\longrightarrow}\ ^{137\mathrm{m}}\mathrm{Ba}+\beta^-+\overline{\nu_e}\ \stackrel{\scriptscriptstyle 156\mathrm{s}}{\longrightarrow}\ ^{137}\mathrm{Ba}+\gamma \] The number of β-particles arriving in fixed time intervals is Poisson-distributed.

Instructions: Walker (1996).

Natural phenomena usually produce weakly random or biased random events.


In general, this can be surprisingly difficult. Gifford (1988)

Suppose you're given a series of coin flips of
unknown bias \(\theta\).

Can you use these to create fair random bits?
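One classical answer is von Neumann's trick (von Neumann, 1951, listed later in this presentation): read the flips in pairs, output 1 for heads-tails, 0 for tails-heads, and discard equal pairs. A minimal sketch in Java; class and method names are illustrative, not from the slides:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

public class VonNeumann {
    /** Extracts fair bits from a stream of (possibly biased) coin flips
      * by von Neumann's method: read flips in pairs, emit 1 for HT,
      * 0 for TH, and discard HH and TT. */
    public static List<Boolean> extract(boolean[] flips) {
        List<Boolean> bits = new ArrayList<>();
        for (int i = 0; i + 1 < flips.length; i += 2) {
            if (flips[i] != flips[i + 1]) {
                bits.add(flips[i]); // HT -> true, TH -> false
            }
        }
        return bits;
    }

    public static void main(String[] args) {
        Random rnd = new Random();
        boolean[] flips = new boolean[100000];
        for (int i = 0; i < flips.length; i++) {
            flips[i] = rnd.nextDouble() < 0.8; // heavily biased coin
        }
        List<Boolean> bits = extract(flips);
        long ones = bits.stream().filter(b -> b).count();
        // the extracted bits should be close to 50/50
        System.out.println(ones + " / " + bits.size());
    }
}
```

Note the cost: with bias \(\theta\), a pair yields a bit with probability \(2\theta(1-\theta)\), so many flips are discarded.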


Given a supply of fair random bits,
can you create fair dice rolls?

\(r \in\) { ⚀ , ⚁ , ⚂ , ⚃ , ⚄ , ⚅ }

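One standard answer (an assumption here, not spelled out on the slide) is rejection: draw 3 fair bits to form a number in \(\{0 \mdots 7\}\), and retry if the result is 6 or 7. A sketch in Java with illustrative names:

```java
import java.util.Random;

public class FairDice {
    /** Rolls a fair six-sided die from fair random bits:
      * draw 3 bits to form an integer in 0..7, and reject
      * (retry) if the result is 6 or 7. On average this
      * costs 3 * 8/6 = 4 bits per roll. */
    public static int roll(Random bits) {
        while (true) {
            int r = (bits.nextBoolean() ? 4 : 0)
                  + (bits.nextBoolean() ? 2 : 0)
                  + (bits.nextBoolean() ? 1 : 0);
            if (r < 6) return r + 1; // 1..6
        }
    }

    public static void main(String[] args) {
        Random rnd = new Random();
        for (int i = 0; i < 10; i++) System.out.print(roll(rnd) + " ");
        System.out.println();
    }
}
```

Arithmetic coding (mentioned later in this presentation) can get closer to the entropy bound of \(\log_2 6 \approx 2.58\) bits per roll.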

Suppose a sequence of people is leaving a train.
How can we choose a person uniformly at random,
if the total number of people is not known
in advance?

- Keep a reservoir of \(K\) randomly selected elements.
- Initialise the reservoir to the first \(K\) elements \(x_1 \dots x_K\).
- For each subsequent element \(x_n\), replace a uniformly chosen element of the reservoir with \(x_n\), with probability \(\frac{K}{n}\).

Algorithm R, published in chapter 3.4.2 of Knuth (1969).

More detailed treatment is given by Vitter (1985) and also Devroye (1986).

/** Reservoir-samples an element, uniformly at random.
  * Reservoir sampling has linear time complexity, but avoids
  * the need to compute the collection's size in advance.
  * If the specified collection is empty, null is returned.
  * @param rnd random source
  * @param col iterable collection of elements
  * @return a uniformly random element */
public static <X> X rsvsample(Random rnd, Iterable<X> col) {
  X e = null;
  int k = 0; /* elements seen so far */
  for (X x : col) {
    k++;
    e = (e == null) ? x : (rnd.nextInt(k) == 0 ? x : e);
  }
  return e;
}

- Deterministic random bit generators.

- Middle square method (obsolete), John von Neumann (1951)
- Blum Blum Shub, Blum & al (1986)
- Mersenne Twister, Matsumoto & Nishimura (1997)

- The decimal (or binary) expansion of π, etc. Dodge (1996)

Tradeoffs:

- **Least time / memory.** E.g. direct sampling.
- **Ease of implementation.** E.g. Metropolis MCMC.
- **Fewest random bits.** E.g. arithmetic coding.
- **Independence guarantees.** E.g. coupling from the past.

- Generating samples from a target distribution, using random numbers.
- Computing expectations of some function \(f(x)\) under a distribution \(\txtPP\).

Sample points uniformly from the area under the density curve.

Discard the \(y\) coordinates, and return the \(x\) coordinates as fair samples.

Suppose we have some discrete distribution:

and we want *millions* of samples from it.

- Draw a uniform random number \(r \sim \Uniform{0,1}\).
- Initialize \(x \assign \text{smallest } x \in \set{X}\).
- Initialize \(g \assign \PP{x}\).
- While \(g \lt r\) do:
  - \(x \assign \next{x}\)
  - \(g \assign g + \PP{x}\)
- Return \(x\)

This method implicitly computes the cumulative distribution function of \(\PP{x}\).

Time complexity: O(N).
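The linear-scan method above can be sketched in Java as follows (illustrative names; assumes a normalised probability array):

```java
import java.util.Random;

public class LinearSampler {
    /** Samples an index from the discrete distribution p
      * (assumed normalised) by scanning the implicit CDF.
      * Time complexity: O(N) per sample. */
    public static int sample(Random rnd, double[] p) {
        double r = rnd.nextDouble();
        double g = 0.0;
        for (int x = 0; x < p.length; x++) {
            g += p[x];
            if (r < g) return x;
        }
        return p.length - 1; // guard against rounding error
    }

    public static void main(String[] args) {
        double[] p = {0.413, 0.248, 0.339};
        Random rnd = new Random();
        int[] counts = new int[3];
        for (int i = 0; i < 100000; i++) counts[sample(rnd, p)]++;
        System.out.println(counts[0] + " " + counts[1] + " " + counts[2]);
    }
}
```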

**Preparation:**
- Build a weighted decision tree using \(\p{x}\), with elements at the leaf nodes.

**Sampling:**
- Start from the root, branch with probability according to weight until a leaf \(x\) is reached.
- Return \(x\).

Time complexity: O(log N).

Suppose the following distribution over \(\bset{\text{A,B,C}}\):

To sample from this distribution:
A: 0.413
B: 0.248
C: 0.339

- Allocate an array m of size 1000.
- Fill it with 413 × A, 248 × B and 339 × C.
- Draw a uniform integer k in \(\bset{1 \mdots 1000}\).
- Return m[k].

Proof of concept that it's possible to sample faster than O(log N).

We can do better... :-)

Observations:

We can build a fast sampler by storing, for each element \(x_n\), a switch probability \(s_n\) and an alias \(y_n\):

- Allocate an array \(s_1 \mdots s_N\) of switch probabilities.
- Allocate an array \(y_1 \mdots y_N\) of aliases \(\in \set{X}\).

To sample:

- Draw \(n \sim \Uniform{1,N}\).
- Flip a coin with bias \(s_n\):
  - if outcome is false, return \(x_n\)
  - if outcome is true, return \(y_n\)

It's the fastest known method for sampling from discrete distributions.
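As a concrete illustration, here is a sketch of the alias method in Java. The table construction shown is Vose's O(N) variant, which is an assumption here (the slides cite Walker's original); the sampling step matches the description above:

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.Random;

public class AliasSampler {
    private final double[] s; // switch probabilities
    private final int[] y;    // aliases

    /** Builds the switch and alias tables from a normalised
      * distribution p, using Vose's O(N) construction. */
    public AliasSampler(double[] p) {
        int n = p.length;
        s = new double[n];
        y = new int[n];
        double[] q = new double[n];
        Deque<Integer> small = new ArrayDeque<>(), large = new ArrayDeque<>();
        for (int i = 0; i < n; i++) {
            q[i] = p[i] * n; // scaled probability: average bin holds mass 1
            (q[i] < 1.0 ? small : large).push(i);
        }
        while (!small.isEmpty() && !large.isEmpty()) {
            int a = small.pop(), b = large.pop();
            s[a] = 1.0 - q[a]; // probability of switching to the alias
            y[a] = b;
            q[b] -= 1.0 - q[a]; // donate mass from the large bin
            (q[b] < 1.0 ? small : large).push(b);
        }
        // leftover bins hold exactly mass 1 (up to rounding): never switch
        while (!large.isEmpty()) s[large.pop()] = 0.0;
        while (!small.isEmpty()) s[small.pop()] = 0.0;
    }

    /** Draws one sample in O(1): a uniform bin, then one biased coin flip. */
    public int sample(Random rnd) {
        int n = rnd.nextInt(s.length);
        return rnd.nextDouble() < s[n] ? y[n] : n;
    }

    public static void main(String[] args) {
        AliasSampler as = new AliasSampler(new double[]{0.413, 0.248, 0.339});
        Random rnd = new Random();
        int[] counts = new int[3];
        for (int i = 0; i < 100000; i++) counts[as.sample(rnd)]++;
        System.out.println(counts[0] + " " + counts[1] + " " + counts[2]);
    }
}
```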

Further reading:

- Walker (1977) for the original idea and FORTRAN code
- Ahrens (1989) for alias sampling from the Gaussian distribution.
- Marsaglia & al (2004) for a modern review.
- Fishman & Yarberry (1993) for an alias-like sampling method for dynamically changing distributions.

This is used e.g. in

- uniformly distributed random bits
- a target distribution

Arithmetic decoding of uniform random bits generates samples from the target distribution.

We want independent samples \(x_1 \dots x_N\) from a distribution \(\txtPP\).

\(\PP{x}\): a difficult distribution, from which we want samples.

\(\QQ{x}\): a simple distribution, from which we can easily sample.

We can use \(\txtQQ\) to help us sample from \(\txtPP\).

- Sample \(x \sim \txtQQ\).
- Sample \(u\) uniformly from \([0 , c \mul \QQ{x}]\).
- If \(u \lt \PP{x}\), accept. Otherwise, reject and go back to step 1.
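The steps above can be sketched in Java (illustrative names; the functional interfaces stand in for the densities, and the envelope condition \(\PP{x} \le c \mul \QQ{x}\) is assumed to hold):

```java
import java.util.Random;
import java.util.function.DoubleUnaryOperator;
import java.util.function.Supplier;

public class RejectionSampler {
    /** Draws one sample from the (possibly unnormalised) density p,
      * using a proposal with density qPdf and sampler qDraw,
      * where p(x) <= c * qPdf(x) must hold for all x. */
    public static double sample(Random rnd, DoubleUnaryOperator p,
                                DoubleUnaryOperator qPdf,
                                Supplier<Double> qDraw, double c) {
        while (true) {
            double x = qDraw.get();                            // x ~ Q
            double u = rnd.nextDouble() * c * qPdf.applyAsDouble(x);
            if (u < p.applyAsDouble(x)) return x;              // accept
            // otherwise reject and retry
        }
    }

    public static void main(String[] args) {
        Random rnd = new Random();
        // target: triangular density p(x) = 2x on [0,1];
        // proposal: Uniform(0,1), with envelope constant c = 2
        double x = sample(rnd, t -> 2 * t, t -> 1.0, rnd::nextDouble, 2.0);
        System.out.println(x);
    }
}
```

The acceptance rate is \(1/c\) (for normalised densities), so a tight envelope constant matters.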

- Draw samples \(x_1 \dots x_N \sim \txtQQ\)
- Compute:

\(\displaystyle \hat{\Phi} = \frac1{\class{blue}{W}} \mul \sum_{n=1}^N f(x_n) \mul \frac{\PP{x_n}}{\QQ{x_n}} \) where \(\displaystyle \class{blue}{W = \sum_{n=1}^N \frac{\PP{x_n}}{\QQ{x_n}}} \)
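The self-normalised estimator \(\hat{\Phi}\) above translates directly into code. A sketch in Java (illustrative names; p and q may be unnormalised, since constants cancel in the ratio):

```java
import java.util.function.DoubleUnaryOperator;

public class ImportanceSampler {
    /** Self-normalised importance sampling estimate of E_P[f(x)],
      * given samples xs drawn from Q, and (possibly unnormalised)
      * densities p and q. */
    public static double estimate(double[] xs, DoubleUnaryOperator f,
                                  DoubleUnaryOperator p, DoubleUnaryOperator q) {
        double sum = 0.0, W = 0.0;
        for (double x : xs) {
            double w = p.applyAsDouble(x) / q.applyAsDouble(x); // importance weight
            sum += f.applyAsDouble(x) * w;
            W += w;
        }
        return sum / W; // normalise by the total weight W
    }
}
```

When \(\txtPP = \txtQQ\), all weights are 1 and the estimator reduces to the plain sample mean.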

The Metropolis Method uses a conditional proposal distribution
\(\QQ{ x \| x'}\) which depends on the current state \(x'\).

- Draw \(x \sim \QQ{ x \| \xold } \).
- Compute the acceptance factor \(\displaystyle a = \frac{\PP{x}}{\PP{\xold}}\).
- If \(a \ge 1\), accept: \( \xnew \assign x \).
  If \(a \lt 1\), accept with probability \(a\);
  otherwise reject by setting \( \xnew \assign \xold \).
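A single Metropolis step can be sketched in Java (illustrative names; the Gaussian random walk is one common choice of symmetric proposal, and p may be unnormalised since only the ratio is used):

```java
import java.util.Random;
import java.util.function.DoubleUnaryOperator;

public class Metropolis {
    /** One Metropolis step with a symmetric Gaussian random-walk proposal.
      * p may be unnormalised; only the ratio p(x)/p(xold) is used. */
    public static double step(Random rnd, DoubleUnaryOperator p,
                              double xold, double stepSize) {
        double x = xold + stepSize * rnd.nextGaussian(); // symmetric proposal
        double a = p.applyAsDouble(x) / p.applyAsDouble(xold);
        if (a >= 1.0 || rnd.nextDouble() < a) return x;  // accept
        return xold;                                     // reject
    }

    public static void main(String[] args) {
        Random rnd = new Random();
        DoubleUnaryOperator p = v -> Math.exp(-0.5 * v * v); // unnormalised N(0,1)
        double x = 0.0, sum = 0.0;
        int n = 100000;
        for (int i = 0; i < n; i++) { x = step(rnd, p, x, 1.0); sum += x; }
        System.out.println("chain mean: " + sum / n); // should be near 0
    }
}
```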

For the Metropolis method to work, the proposal distribution must be symmetric:

\[ \QQ{x \| x'} \ =\ \QQ{x' \| x} \]

However, for asymmetric proposals, the Metropolis-Hastings method corrects the acceptance factor with the proposal ratio:

\[
a \ =\ \frac{\PP{\xprop}}{\PP{\xold}} \mul \class{hl}{
\frac{\QQ{\xold \| \xprop}}{\QQ{\xprop \| \xold}}
}
\]

Simply compute: \[ \hat{\Phi} \ =\ \frac{1}{N} \sum_{n=1}^N f(x_n) \] over the samples \(x_1 \dots x_N\) produced by the chain.

(Still, the chain must run long enough.)

- Gibbs Sampling is a Metropolis-Hastings algorithm whose proposals are always accepted.
- For each step, replace the value of a variable using the distribution conditioned on the remaining variables.

- \(\gg{x_{1}^{(\tau+1)}} \sim \PP{ x_{1} \| \rr{ x_{2}^{(\tau)}}, \rr{ x_{3}^{(\tau)} }, \rr{\cdots}, \rr{ x_{M}^{(\tau)}} } \)
- \(\gg{x_{2}^{(\tau+1)}} \sim \PP{ x_{2} \| \gg{ x_{1}^{(\tau+1)}}, \rr{x_{3}^{(\tau)}}, \rr{\cdots}, \rr{x_{M}^{(\tau)}} } \) \(\vdots\)
- \(\gg{x_{M}^{(\tau+1)}} \sim \PP{ x_{M} \| \gg{x_{1}^{(\tau+1)}}, \gg{x_{2}^{(\tau+1)}}, \gg{\cdots}, \gg{x_{M-1}^{(\tau+1)}} } \)

- Acceptance rate:
\[
a = \frac{\PP{x'} \QQ{x \| x'}}
{\PP{x} \QQ{x' \| x}}
= \frac{P(x_{m}'|x_{\setminus m}')P(x_{\setminus m}')P(x_{m}|x_{\setminus m}')}{P(x_{m}|x_{\setminus m})P(x_{\setminus m})P(x_{m}'|x_{\setminus m})}
=1
\]
since \(x_{\setminus m}=x_{\setminus m}'\).

\(x_{\setminus m}\) represents variables except for \(x_{m}\).
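As a small worked example (an illustration, not from the slides), consider Gibbs sampling for a bivariate Gaussian with unit variances and correlation \(\rho\), where each full conditional is \(\mathcal{N}(\rho \, x_{\text{other}},\ 1-\rho^2)\):

```java
import java.util.Random;

public class GibbsBivariate {
    /** Gibbs sampling for a bivariate Gaussian with unit variances and
      * correlation rho: each conditional is N(rho * other, 1 - rho^2). */
    public static double[][] sample(Random rnd, double rho, int n) {
        double sd = Math.sqrt(1.0 - rho * rho); // conditional std deviation
        double[][] out = new double[n][2];
        double x1 = 0.0, x2 = 0.0;
        for (int t = 0; t < n; t++) {
            x1 = rho * x2 + sd * rnd.nextGaussian(); // resample x1 | x2
            x2 = rho * x1 + sd * rnd.nextGaussian(); // resample x2 | x1
            out[t][0] = x1;
            out[t][1] = x2;
        }
        return out;
    }
}
```

With \(\rho\) close to 1, successive samples are strongly correlated, illustrating why Gibbs sampling can mix slowly on tightly coupled variables.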

- exhibits random walk behaviour
- can be inefficient (small steps or a high rejection rate)

- uses gradient information to make large steps with low rejection rate
- is applicable to distributions over continuous variables,
for which we can evaluate the gradient of the log probability

- The distribution can be written in this form: \[ P(x) = \frac{1}{Z}\exp(-E(x)) \]
- Introduce
momentum variables \(r\), and define the kinetic energy: \[ K(r) = \tfrac{1}{2}r^{\top}r \] - Hamiltonian or total energy: \[ H(x,r) = E(x)+K(r) \]
- Obtain samples \(\{x_{n}\}\) by discarding \(r\) from samples of the joint density \[ P(x,r) = \frac{1}{Z_H}\exp(-E(x))\exp(-K(r)) \]

- Draw a momentum from Gaussian: \(r\sim\exp(-K(r))/Z_K\)
- Simulate Hamiltonian dynamics by \(L\) leapfrog steps \[ r(t+\epsilon/2)=r(t)-\frac{\epsilon}{2}\frac{\partial E(x(t))}{\partial x} \] \[ x(t+\epsilon)=x(t)+\epsilon r(t+\epsilon/2) \] \[ r(t+\epsilon)=r(t+\epsilon/2)-\frac{\epsilon}{2}\frac{\partial E(x(t+\epsilon))}{\partial x} \]
- Decide whether to accept the state \((x',r')\) after the leapfrog steps \[ \min(1,\exp[H(x,r)-H(x',r')]) \]

- The optimal acceptance rate is about 65%, Neal (2011)
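The leapfrog steps and Metropolis correction above can be sketched for a 1-D target \(P(x) \propto \exp(-E(x))\) as follows (a sketch in Java; the interfaces for \(E\) and its gradient are illustrative):

```java
import java.util.Random;
import java.util.function.DoubleUnaryOperator;

public class HMC {
    /** One HMC step for a 1-D target P(x) = exp(-E(x))/Z, using L
      * leapfrog steps of size eps; grad evaluates dE/dx. */
    public static double step(Random rnd, DoubleUnaryOperator E,
                              DoubleUnaryOperator grad,
                              double x0, double eps, int L) {
        double r = rnd.nextGaussian();                 // draw momentum
        double x = x0;
        double H0 = E.applyAsDouble(x) + 0.5 * r * r;  // initial total energy
        r -= 0.5 * eps * grad.applyAsDouble(x);        // half step for momentum
        for (int l = 0; l < L; l++) {
            x += eps * r;                              // full step for position
            double g = grad.applyAsDouble(x);
            r -= (l < L - 1 ? eps : 0.5 * eps) * g;    // momentum (half at the end)
        }
        double H1 = E.applyAsDouble(x) + 0.5 * r * r;
        // Metropolis correction for the discretisation error
        return (rnd.nextDouble() < Math.exp(H0 - H1)) ? x : x0;
    }
}
```

Because the leapfrog integrator nearly conserves \(H\), the acceptance probability \(\min(1, \exp[H(x,r)-H(x',r')])\) stays high even for long trajectories.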

- Stuck at a local minimum of the energy \(E(x)\): slow mixing

[Hukushima, 2003]

- \(\beta\): inverse temperature, \(0 \leq \beta \leq 1\) \[ P(x|\beta)=\frac{1}{Z(\beta)}\exp(-\beta E(x)) \]
- \(\beta\rightarrow 1\): target distribution, \(\beta\rightarrow 0\): flat distribution
- Switching to smooth landscapes with high temperature: fast mixing

- multiple simulations with different temperatures
- can use any Monte Carlo methods at a single temperature
- exchange states with probability
\[
\min\left(1,\ \exp\left[(\beta-\beta')(E-E')\right]\right)
\]
- use samples from \(\beta=1\)
- samples from \(\beta \neq 1 \) can be used for estimating the normalizing constant
- choose a set of temperatures according to a geometric series

- also called sequential Monte Carlo
- sequential (online) version of importance sampling
- often used for nonlinear and non-Gaussian dynamical systems

- Draw \(S\) particles from the proposal \[ x_k^{(s)} \sim \pi(x_k|x_{1:k-1}^{(s)},y_{1:k-1}) \]
- Update the importance weights, and normalize them \[ w_{k}^{(s)}=w_{k-1}^{(s)}\frac{p(y_k|x_{k}^{(s)})p(x_{k}^{(s)}|x_{k-1}^{(s)})}{\pi(x_k|x_{1:k-1}^{(s)},y_{1:k-1})} \]
- Resample particles from the current particle set
proportional to their weights
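The resampling step (step 3 above) can be sketched as multinomial resampling in Java; the names are illustrative, and systematic resampling is a common lower-variance alternative:

```java
import java.util.Random;

public class Resampling {
    /** Multinomial resampling: draws s particle indices with
      * probability proportional to the weights w. */
    public static int[] resample(Random rnd, double[] w, int s) {
        // cumulative weights
        double[] cdf = new double[w.length];
        double acc = 0.0;
        for (int i = 0; i < w.length; i++) { acc += w[i]; cdf[i] = acc; }
        int[] idx = new int[s];
        for (int j = 0; j < s; j++) {
            double r = rnd.nextDouble() * acc; // uniform in [0, total weight)
            int i = 0;
            while (i < w.length - 1 && cdf[i] <= r) i++;
            idx[j] = i;
        }
        return idx;
    }
}
```

After resampling, all weights are reset to \(1/S\); this combats weight degeneracy, where a few particles carry almost all the weight.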

Slice sampling: an MCMC instance of the idea of sampling uniformly from the area under the density.

Given a point \(x\), sample a height \(y\) uniformly from \([0, \p{x}]\); then sample a new point uniformly from the slice \(\{x' : \p{x'} > y\}\).

/* $Id: sampling.html,v 1.59 2013/08/13 21:16:23 chris Exp $ */
/* Author: Christian Steinruecken */

import java.util.Random;
import java.util.Vector;
import java.util.Collection;

/** Implements a slice sampler.
 * <p>Slice sampling is a Markov chain Monte Carlo method which
 * produces samples from a probability density <var>f</var>.
 * This is done by constructing a uniform random walk over the
 * <i>area</i> under <var>f</var>.
 * (For details, please see the techreport cited below.)</p>
 * <p>This implementation works for distributions on the 1-dimensional
 * real line. Bounding intervals for the slices are found by starting
 * from an initial interval of length <var>w</var>, which is repeatedly
 * doubled until both ends of the interval lie outside the area under
 * <var>f</var>.</p>
 * <b>References:</b><ul>
 * <li><a name="neal2000b">Radford M. Neal. <i>Slice Sampling,</i>
 * 2000-08-29. Technical Report, No. 2005. Departments of Statistics and
 * Computer Science, University of Toronto.
 * [<a href="http://www.cs.utoronto.ca/~radford/ftp/slc-samp.pdf">
 * pdf</a>]</li></ul> */
public class SliceSampler {

  /** The target distribution. */
  Distribution<Double> dist;
  /** Current point. */
  double x;
  /** Height of the current slice. */
  double y;
  /** Bounds of the current slice interval. */
  double a, z;
  /** Initial interval width. */
  double w = 1.0;
  /** Maximal number of interval doublings. */
  int maxk = 64;

  /** Constructs a new interval doubling slice sampler for
   * distribution <i>dist</i> and starting point <var>x</var>.
   * @param dist distribution to be sampled from
   * @param x starting point */
  public SliceSampler(Distribution<Double> dist, Double x) {
    this.dist = dist;
    this.x = x;
  }

  /** Constructs a new interval doubling slice sampler with
   * initial interval width <var>w</var>. */
  public SliceSampler(Distribution<Double> dist, Double x, double w) {
    this(dist, x);
    this.w = w;
  }

  /** Checks acceptance of a proposed point <var>newx</var>
   * when interval (<var>a</var>,<var>z</var>) was found by
   * doubling procedure. */
  protected boolean acceptable(double newx) {
    // current point is still x; work on local copies of the bounds
    double la = a, lz = z;
    boolean det = false;
    while (lz - la > 1.1*w) {
      double m = (la+lz)/2;
      if ((x < m && newx >= m) || (x >= m && newx < m)) { det = true; }
      // keep the half of the interval that contains newx
      if (newx < m) { lz = m; } else { la = m; }
      if (det && y >= dist.p(lz) && y >= dist.p(la)) { return false; }
    }
    return true;
  }

  /** This method finds a bounding interval of the current
   * slice by Neal's "doubling" procedure. */
  protected void doubleBounder(Random rnd) {
    // position a window [a,z] of size w randomly around x
    a = x - Samplers.uniform(rnd,w);
    z = a+w;
    // probability density at the endpoints
    double ap = dist.p(a);
    double zp = dist.p(z);
    // expand until slice [a,z] is acceptable
    int k = maxk;
    while (k > 0 && (y < ap || y < zp)) {
      if (Samplers.flip(rnd)) {
        a -= z-a; ap = dist.p(a);
      } else {
        z += z-a; zp = dist.p(z);
      }
      k--;
    }
  }

  /** This method finds a bounding interval of the current
   * slice by Neal's "stepping out" procedure. */
  protected void stepBounder(Random rnd) {
    // position a window [a,z] of size w randomly around x
    a = x - Samplers.uniform(rnd,w);
    z = a+w;
    int m = 1000;
    int j = rnd.nextInt(m);
    int k = m-1-j;
    while (j > 0 && y < dist.p(a)) { a -= w; j--; }
    while (k > 0 && y < dist.p(z)) { z += w; k--; }
  }

  /** Gets a sample using slice sampling. */
  public Double sample(Random rnd) {
    // current point: x
    // sample y uniformly between 0 and p(x)
    y = Samplers.uniform(rnd)*dist.p(x);
    // find slice bounds (by the "doubling" procedure)
    doubleBounder(rnd);
    // now sample a point on slice [a,z]
    Double newx = Samplers.uniform(rnd,a,z);
    while (dist.p(newx) <= y || !acceptable(newx)) {
      // shrink the interval around the current point
      if (newx < x) { a = newx; } else { z = newx; }
      // resample newx
      newx = Samplers.uniform(rnd,a,z);
    }
    x = newx;
    return newx;
  }

  /** Gets samples using <var>n</var> steps of slice sampling.
   * This method runs for <var>n</var> iterations and adds
   * all obtained samples to the specified collection. */
  public void sample(Random rnd, int n, Collection<Double> samples) {
    for (int j=0; j<n; j++) { samples.add(sample(rnd)); }
  }

  /** Runs a simple slice sampling demo. */
  public static void main(String[] args) {
    Random r = new Random();
    Gaussian g1 = new Gaussian(-2.0, 1);
    Gaussian g2 = new Gaussian(2.0, 1);
    Mixture<Double> d = new Mixture<Double>(0.2,g1,g2);
    double range_a = -5.0;  // the interval we're histogramming
    double range_z = +5.0;
    int n = 100000;         // number of samples
    System.out.println("SLICE SAMPLING DEMO");
    System.out.println("Distribution: "+d);
    System.out.println();
    Histogram hd = Histogram.fromDoubles(d, range_a, range_z, 70, n);
    System.out.println("Sampled directly ("+n+" samples):");
    hd.print(System.out,8);
    System.out.println();
    SliceSampler ss = new SliceSampler(d, -2.0, 10);
    System.out.println("Sampled with slice sampling ("+n+" steps):");
    Histogram hr = Histogram.fromDoubles(ss, range_a, range_z, 70, n);
    hr.print(System.out,8);
  }
}

Given a source of random bits, exact samples from a target distribution can be produced by:

- Arithmetic decoding of random bits
- Coupling from the past

A method for getting an exact sample from a Markov chain.

- Fix a sequence of random numbers.
- Run the chain from the infinite past to the present.
- The state at time \(t=0\) is an exact sample.

- To begin, start \(T\) steps in the past.
- Create \(N\) chains, one for every possible input state.
- Simulate those chains from time \(t = -T\) forwards to \(t=0\), always using the same random numbers.
- Sometimes, chains will move to the same state. For all subsequent \(t\), these chains follow an identical path. This is called coalescence.

**Iff** at time \(t=0\) all chains have converged, we're done.

**Otherwise**, extend the simulation further back in time, e.g. at \(t=-2T\).

*Must* simulate from past to present. (Sampling from present to future, stopping at coalescence, produces biased samples.)

- The algorithm must always run to completion. Stopping early biases the sampler.
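The procedure above can be sketched for a tiny chain (a toy random walk chosen purely for illustration). Note how the random numbers for the most recent steps are fixed and reused when the start time is pushed further back:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

public class CFTP {
    static final int N = 4; // number of states in the toy chain

    /** Transition function: given a state and one uniform random number,
      * returns the next state (a clamped random walk on {0,..,N-1}). */
    static int next(int state, double u) {
        if (u < 0.5) return Math.max(0, state - 1);
        return Math.min(N - 1, state + 1);
    }

    /** Coupling from the past: simulate every start state from time -T
      * with the SAME random numbers; if all chains coalesce by t = 0,
      * the common state at t = 0 is an exact sample. Otherwise double T,
      * reusing the old random numbers for the most recent T steps. */
    public static int sample(Random rnd) {
        List<Double> us = new ArrayList<>(); // us.get(k) drives the step ending at t = -k
        int T = 1;
        while (true) {
            while (us.size() < T) us.add(rnd.nextDouble());
            int[] s = new int[N];
            for (int i = 0; i < N; i++) s[i] = i;       // one chain per start state
            for (int t = T - 1; t >= 0; t--) {          // oldest step first
                double u = us.get(t);
                for (int i = 0; i < N; i++) s[i] = next(s[i], u);
            }
            boolean coalesced = true;
            for (int i = 1; i < N; i++) if (s[i] != s[0]) coalesced = false;
            if (coalesced) return s[0];                 // exact sample
            T *= 2;                                     // go further back in time
        }
    }
}
```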

Combines

See also MacKay (2006) for a seminar presentation on Nested Sampling.

- "From imperfect randomness to perfect samples."
- Overview over sampling algorithms and related methods.
- Check out the linked resources in this presentation.
- There are plenty of interesting papers, please explore! :)


- Information Theory, Inference and Learning Algorithms (David MacKay, 2003)
- Non-Uniform Random Variate Generation (Luc Devroye, 1986)
- Advanced MCMC methods (Iain Murray, 2006)
- Extended ensemble Monte Carlo (Cristina Savin, 2012)

See also: Propp & Wilson (1997)

A | B | C | D | E | F | G | H | I | J | K | L | M | N | O | P
D | K | C | C | C | B | C | L | L | C | C | P | D | K | C | C