A Slide Presentation on advanced sampling methods.
(Christian Steinruecken, Tomoharu Iwata)

Use arrow keys, PgUp, PgDn to move between slides, or press 'dot' to toggle the toolbar.
Typing 'HELP' + return jumps to a help screen.

click me

A histogram sampled from a negative binomial distribution.
$G \sim \text{DP}(\alpha=50, G_0 = \text{Uniform})$

## Contents

• Introduction
• Basic sampling techniques
• Markov-Chain Monte Carlo

## Why sampling?

"That's what the brain does."
-- Máté

Looking at samples from a model can help to understand the model's properties.

## Probabilistic Inference

$\PP{x \| y}\ =\ \frac{\PP{y \| x} \mul \PP{x}}{\PP{y}}$ where: $\PP{y}\ =\ \int \PP{y \| x} \PP{x}\ dx\ =\ Z$
Cunningham (2011)

## Approximate Inference Taxonomy

Many problems of interest in inference can be written as an integral of type:

$$\displaystyle \int f(x,y)\ dx$$
Examples:
• Posterior expectations: $$\displaystyle \int f(x)\ \PP{x \| y}\ dx$$
• Evidence and model selection: $$\displaystyle \int \PP{y \| x}\ \PP{x}\ dx$$
• Prediction: $$\displaystyle \int \PP{y_* \| x}\ \PP{x \| y}\ dx$$
In practice, these integrals can rarely be evaluated exactly.
Cunningham (2011)

## Approximate Inference Taxonomy

$$\displaystyle \int f(x,y)\ dx$$

$$\displaystyle \approx \sum_i f(x^{(i)},y)$$
1. Replace hard integrals with summations.
2. Sampling methods
3. Central problem: how to sample $$x^{(i)}$$
4. Monte Carlo, MCMC, Gibbs, etc.
$$\displaystyle \approx \int g(x,y) \ dx$$
1. Replace hard integrals with easier integrals.
2. Message passing on factor graph
3. Central problem: how to find $$g(\cdot) \in \set{G}$$
4. VB, EP, etc.
$$\displaystyle \approx \int h(x,y; x^*) \ dx$$
1. Replace hard integrals with estimators.
2. "Non-Bayesian" methods
3. Central problem: how to find $$x^*$$
4. MAP, ML, Laplace, etc.

## Sampling

Samplers are mechanisms for obtaining random samples of a chosen probability distribution.

Requirements:
• A source of randomness (generator of random events)
• Method of transforming the randomness to samples from our chosen distribution.

## Where does randomness come from?

Sources of natural randomness:
• chaotic or unstable dynamical systems
• thermal noise from transistors
• clock drift
• shot noise from photo diodes
• particle emissions from radioactive materials
• semi-transparent mirror + photon detectors
• measurement of a quantum system
• keystroke latencies of typed text

There are many kinds of hardware random number generators. E.g. the UK government's ERNIE. Thomson (1959)

## Random sampling: hardware solution

Example:
A radioactive Caesium-137 probe next to a Geiger-Müller detector.
$^{137}\mathrm{Cs}\ \stackrel{\scriptscriptstyle 30.17\mathrm{y}}{\longrightarrow}\ ^{137\mathrm{m}}\mathrm{Ba}+\beta^-+\overline{\nu_e}\ \stackrel{\scriptscriptstyle 156\mathrm{s}}{\longrightarrow}\ ^{137}\mathrm{Ba}+\gamma$ The number of β-particles arriving in fixed time intervals is Poisson-distributed.
Interface this to a serial port and extract the randomness with a computer.
Instructions: Walker (1996).

## Randomness extraction

Natural phenomena usually produce weakly random or biased random events.

A randomness extractor attempts to attempts to convert weak randomness into strong unbiased random numbers.

In general, this can be surprisingly difficult. Gifford (1988)

## Puzzles

Suppose you're given a series of coin flips of unknown bias $$\theta$$.
Can you use these to create fair random bits?

Given a supply of fair random bits, can you create fair dice rolls?
$$r \in$$ { ⚀ , ⚁ , ⚂ , ⚃ , ⚄ , ⚅ }

## Puzzles

Suppose a sequence of people leaving a train. How to choose a person uniformly at random, if the total number of people is not known in advance?

## Reservoir Sampling

A reservoir algorithm samples $$K$$ elements uniformly at random from a sequence $$x_1 \dots x_N$$ of unknown length $$N$$, using at most 1 pass through the sequence.

Solution
• Keep a reservoir of $$K$$ randomly selected elements.
• Initialise the reservoir to the first $$K$$ elements $$x_1 \dots x_K$$.
• For each subsequent element $$x_n$$, replace one of the elements in the reservoir with $$x_n$$ with probability $$\frac{k}{n}$$.

References
Algorithm R, published in chapter 3.4.2 of Knuth (1969).
More detailed treatment is given by Vitter (1985) and also Devroye (1986).

Simplest case: $$K=1$$.
Vitter (1985)

## Reservoir Sampling: Code

  /** Reservoir-samples an element, uniformly at random.
* Reservoir sampling has linear time complexity, but avoids
* the need to compute the collection's size in advance.
* If the specified collection is empty, null is returned.
* @param rnd random source
* @param col iterable collection of elements
* @return a uniformly random element */
public static <X> X rsvsample(Random rnd, Iterable<X> col) {
X e = null;
int k = 0;   /* elements seen so far */
for (X x : col) {
k++;
e = (e == null) ? x : (rnd.nextInt(k) == 0 ? x : e);
}
return e;
}


## Sources of deterministic randomness

Pseudo-random. Basically "non-random", but good enough in practice.

• Deterministic random bit generators.
• Middle square method (obsolete), John von Neumann (1951)
• Blum Blum Shub, Blum & al (1986)
• Mersenne Twister, Matsumoto & Nishimura (1997)
• The decimal (or binary) expansion of π, etc. Dodge (1996)

## Sampling

Sampling algorithms
transform random bits into samples from the distribution of our choice. are representations of the distributions they produce samples for.

• Least time / memory. E.g. direct sampling.
• Ease of implementation. E.g. Metropolis MCMC.
• Fewest random bits. E.g. arithmetic coding.
• Independence guarantees. E.g. coupling from the past.
Metropolis & Ulam (1949)

## Monte-Carlo Methods

Computational techniques that make use of random numbers.
For example:
• Generating samples from a target distribution, using random numbers.
• Computing expectations of some function $$f(x)$$ under a distribution $$\txtPP$$.

## Area under the curve method

Sample points uniformly at random from the area under the curve.
Discard $$y$$ coordinates, and return $$x$$ coordinates as a fair samples.

## Sampling from discrete distributions

Suppose we have some discrete distribution:

$$\p{x}$$ over elements $$x \in \set{X}$$

and we want millions of samples from it.

## Linear sampling

• Draw a uniform random number $$r \sim \Uniform{0,1}$$.
• Initialize $$g \assign 0$$
• Initialize $$x \assign \text{smallest } x \in \set{X}$$.
• While $$g \lt r$$ do:
• $$x \assign \next{x}$$
• $$g \assign g + \PP{x}$$
• Return $$x$$

This method implicitly computes the cumulative distribution $$\cuma{P}(x)$$.

Time complexity: O(N).

## Decision tree

• Preparation: Build a weighted decision tree using $$\p{x}$$, with elements at the leaf nodes.
• Sampling: Start from the root, branch with probability according to weight until a leaf $$x$$ is reached.
• Return $$x$$.

Time complexity: O(log N).

## Array method

Suppose the following distribution over $$\bset{\text{A,B,C}}$$:
 A 0.413 B 0.248 C 0.339
To sample from this distribution:
• Allocate an array m of size 1000.
• Fill with 413 × A, 248 × B and 339 × C.
• Draw a uniform integer k in $$\bset{1 \mdots 1000}$$.
• Return m[k].

## Array method

The array method draws samples in O(1) time Takes up a lot of memory

Proof of concept that it's possible to sample faster than O(log N).
We can do better... :-)
Walker (1977)

## Alias Sampling

Observations:
Discrete uniform samples are cheap. Binary decisions are cheap.

We can build a fast sampler by modulating the amplitude of a uniform distribution.
Walker (1977)

## Alias Sampling

Given a distribution $$\dist{P}$$ over $$N$$ elements $$x_1 \mdots x_N \in \set{X}$$:

Preparation:
• Allocate an array $$s_1 \mdots s_N$$ of switch probabilities
• Allocate an array $$y_1 \mdots y_N$$ of aliases $$\in \set{X}$$

Sampling:
• Draw $$n \sim \Uniform{1,N}$$
• Flip a coin with bias $$s_n$$
• if outcome is false, return $$x_n$$
• if outcome is true, return $$y_n$$
Walker (1977)

## Alias Sampling: Summary

Takes O(N²) setup time. Takes O(N) space. All samples are drawn in O(1) time.

It's the fastest known method for sampling from discrete distributions.

• Walker (1977) for the original idea and FORTRAN code
• Ahrens (1989) for alias sampling from the Gaussian distribution.
• Marsaglia & al (2004) for a modern review.
• Fishman & Yarberry (1993) for an alias-like sampling method for dynamically changing distributions.

## Arithmetic Coding

Arithmetic coding offers a way of deterministically transducing between probability distributions.

This is used e.g. in data compression to transduce between:
• uniformly distributed random bits
• a target distribution

Arithmetic decoding of uniform random bits generates independent random samples of the target distribution.

More about arithmetic coding: data compression slides (2012)

## Using a proposal distribution

We want independent samples $$x_1 \dots x_N$$ from a distribution $$\txtPP$$.

 $$\PP{x}$$ difficult distribution, cannot sample directly $$\QQ{x}$$ simple distribution, can sample directly

We can use $$\txtQQ$$ to help us sample from $$\txtPP$$.

## Rejection Sampling

1. Sample $$x \sim \txtQQ$$
2. Sample $$u$$ uniformly between $$[0 , c \mul \QQ{x} ]$$
3. If $$u \lt \PP{x}$$, accept. Otherwise, reject and go back to step 1.

## Importance Sampling

Rejecting samples seems wasteful, if we only want an expectation. Importance sampling solves this as follows: $\Phi = \int \PP{x}\ f(x)\ dx = \int \class{red}{\QQ{x}} \frac{\PP{x}} {\class{red}{\QQ{x}}} f(x)\ dx$ Method:
1. Draw samples $$x_1 \dots x_N \sim \txtQQ$$
2. Compute:
$$\displaystyle \hat{\Phi} = \frac1{\class{blue}{W}} \mul \sum_{n=1}^N f(x_n) \mul \frac{\PP{x_n}}{\QQ{x_n}}$$ where $$\displaystyle \class{blue}{W = \sum_{n=1}^N \frac{\PP{x_n}}{\QQ{x_n}}}$$

## Importance Sampling

The distributions $$\txtPP$$ or $$\txtQQ$$ need not be normalised for importance sampling to work. $\hat{\Phi}\ =\ \frac{1}{N} \sum_{n=1}^N f(x_n) \mul \frac{Z_Q}{Z_P} \mul \frac{\PP{x}^*}{\QQ{x}^*}$ But note that: $\begin{eqnarray*} \frac{Z_P}{Z_Q} &\ =\ & \int \QQ{x} \frac{\PP{x}^*}{\QQ{x}^*} \ dx \\ &\ \approx\ & \frac{1}{N} \mul \sum_{n=1}^N \frac{\PP{x_n}^*}{\QQ{x_n}^*} \end{eqnarray*}$
Marsaglia & Tsang (1984)

## The Ziggurat algorithm

Combines inverse CDF and rejection sampling Generates independent samples from continuous distributions Produces samples in $$O(1)$$ time (on average)
$$\def\xold{x_{\text{old}}} \def\xnew{x_{\text{new}}} \def\xprop{x}$$ Metropolis (1953)

## The Metropolis Method

The Metropolis Method uses a conditional proposal distribution $$\QQ{ x \|x'}$$ which depends on a previous sample $$x'$$.

Algorithm:
1. Draw $$\left. x \| \xold \right. \sim \QQ{ \xold }$$.
2. Compute acceptance factor $$\displaystyle a = \frac{\PP{x}}{\PP{\xold}}$$
3. If $$a \ge 1$$, then accept $$\xnew \assign x$$.
If $$a \lt 1$$, then accept with probability $$a$$,
otherwise reject by setting $$\xnew \assign \xold$$.

Hastings (1970) Metropolis (1953)

## Metropolis-Hastings

For the Metropolis method to work, the proposal distribution must be symmetric:

$\QQ{x \| x'} \ =\ \QQ{x' \| x}$
However, for non-symmetric $$\txtQQ$$, we can correct for the bias by adjusting the acceptance factor:
$a \ =\ \frac{\PP{\xprop}}{\PP{\xold}} \mul \class{hl}{ \frac{\QQ{\xold \| \xprop}}{\QQ{\xprop \| \xold}} }$

Hastings (1970) Metropolis (1953)

## Metropolis-Hastings

Some remarkable properties:
$$\QQ{x \| x'}$$ does not need to be similar to $$\PP{x}$$. Can be used to draw samples from (probably) any distribution.

Important caveats:
Successive samples in the chain are not indepedent. It is difficult to assess convergence.

Hastings (1970) Metropolis (1953)

## Metropolis-Hastings

Samples from a Metropolis-MCMC chain are not independent, but it is not necessary to thin them if we want an expectation.

Simply compute: $\hat{\Phi} \ =\ \frac{1}{N} \sum_{n=1}^N f(x_n)$ over all samples.

(Still, the chain must run long enough.)
Geman & Geman (1984)

## Gibbs Sampling

• Gibbs Sampling is a Metropolis-Hastings algorithm whose proposals are always accepted.
• For each step, replace the value of a variable using the distribution conditioned on the remaining variables.

Algorithm:$$\def\rr#1{\class{magenta}{#1}} \def\gg#1{\class{hl}{#1}}$$
1. $$\gg{x_{1}^{(\tau+1)}} \sim \PP{ x_{1} \| \rr{ x_{2}^{(\tau)}}, \rr{ x_{3}^{(\tau)} }, \rr{\cdots}, \rr{ x_{M}^{(\tau)}} }$$
2. $$\gg{x_{2}^{(\tau+1)}} \sim \PP{ x_{2} \| \gg{ x_{1}^{(\tau+1)}}, \rr{x_{3}^{(\tau)}}, \rr{\cdots}, \rr{x_{M}^{(\tau)}} }$$
3. $$\vdots$$
4. $$\gg{x_{M}^{(\tau+1)}} \sim \PP{ x_{M} \| \gg{x_{1}^{(\tau+1)}}, \gg{x_{2}^{(\tau+1)}}, \gg{\cdots}, \gg{x_{M-1}^{(\tau+1)}} }$$

Geman & Geman (1984)

## Gibbs Sampling

• Acceptance rate: $a = \frac{\PP{x'} \QQ{x \| x'}} {\PP{x} \QQ{x' \| x}} = \frac{P(x_{m}'|x_{\setminus m}')P(x_{\setminus m}')P(x_{m}|x_{\setminus m}')}{P(x_{m}|x_{\setminus m})P(x_{\setminus m})P(x_{m}'|x_{\setminus m})} =1$ since $$x_{\setminus m}=x_{\setminus m}'$$.
$$x_{\setminus m}$$ represents variables except for $$x_{m}$$.
No rejection No tuning parameters Useful when the conditional distributions are easy to sample (e.g. exponential family) Strong dependencies between successive samples Explores slowly in strongly correlated spaces

## Hamiltonian Monte Carlo

### Problems in Metropolis sampling

• exhibits random walk behaviour
• can be inefficient (small steps or a high rejection rate)
MacKay (2003)

## HMC (Hamiltonian or Hybrid Monte Carlo)

• uses gradient information to make large steps with low rejection rate
• is applicable to distributions over continuous variables, for which we can evaluate the gradient of the log probability

## Hamiltonian Monte Carlo

$$E(x)$$ is the potential energy
• The distribution can be written in this form: $P(x) = \frac{1}{Z}\exp(-E(x))$
• Introduce momentum variables $$r$$, and define the kinetic energy: $K(r) = \tfrac{1}{2}r^{\top}r$
• Hamiltonian or total energy: $H(x,r) = E(x)+K(r)$
• Obtain samples $$\{x_{n}\}$$ by discarding $$r$$ from samples of the joint density $P(x,r) = \frac{1}{Z_H}\exp(-E(x))\exp(-K(r))$

## Hamiltonian Monte Carlo: Procedures

1. Draw a momentum from Gaussian: $$r\sim\exp(-K(r))/Z_K$$
2. Simulate Hamiltonian dynamics by $$L$$ leapfrog steps $r(t+\epsilon/2)=r(t)-\frac{\epsilon}{2}\frac{\partial E(x(t))}{\partial x}$ $x(t+\epsilon)=x(t)+\epsilon r(t+\epsilon/2)$ $r(t+\epsilon)=r(t+\epsilon/2)-\frac{\epsilon}{2}\frac{\partial E(x(t+\epsilon))}{\partial x}$
3. Decide whether to accept the state $$(x',r')$$ after the leapfrog steps $\min(1,\exp[H(x,r)-H(x',r')])$

## Hamiltonian Monte Carlo: Summary

can take bigger steps to high probability states with low rejection rate applicable when we can evaluate the gradient need to tune the parameters (step size $$\epsilon$$ and #iterations $$L$$)
• The optimal acceptance rate is 65% Neal (2011)

## Parallel Tempering

### Sampling from a multi-modal distribution

• Stuck at local minimum: slow mixing
E(x)
[Hukushima, 2003]
Swendsen and Wang (1986)

## Parallel Tempering

• $$\beta$$: inverse temperature, $$0 \leq \beta \leq 1$$ $P(x|\beta)=\frac{1}{Z(\beta)}\exp(-\beta E(x))$
• $$\beta\rightarrow 1$$: target distribution, $$\beta\rightarrow 0$$: flat distribution
• Switching to smooth landscapes with high temperature: fast mixing

## Parallel Tempering

• multiple simulations with different temperatures
• can use any Monte Carlo methods at a single temperature
• exchange states with probability $\min(1,\exp(\beta-\beta')(E-E'))$
• use samples from $$\beta=1$$
• samples from $$\beta \neq 1$$ can be used for estimating the normalizing constant
• choose a set of temperatures according to a geometric series

## Parallel Tempering: Summary

fast mixing allow re-use existing code easy to compute in parallel

## Particle Filter

• also called sequential Monte Carlo
• sequential (online) version of importance sampling
• often used for nonlinear and non-Gaussian dynamical system

Caesarendra et al. 2010

## Particle Filter: Procedure

1. Draw $$S$$ particles from the proposal $x_k^{(s)} \sim \pi(x_k|x_{1:k-1}^{(s)},y_{1:k-1})$
2. Update the importance weights, and normalize them $w_{k}^{(s)}=w_{k-1}^{(s)}\frac{p(y_k|x_{k}^{(s)})p(x_{k}^{(s)}|x_{k-1}^{(s)})}{\pi(x_k|x_{1:k-1}^{(s)},y_{1:k-1})}$
3. Resample particles from the current particle set proportional to their weights

## Particle Filter: Summary

can model nonlinear and non-Gaussian dynamics can infer in an online fashion may need big memory space to generate enough number of particles for good approximation
Neal (2000)

## Slice Sampling

An MCMC instance of the area under the curve method:

Given a point $$x$$, samples a $$y$$-coordinate uniformly between $$0$$ and $$\PP{x}$$. Samples a new point $$x'$$ from the slice through the distribution at height $$y$$.
Neal (2000)

## Slice Sampling: Code

/* $Id: sampling.html,v 1.59 2013/08/13 21:16:23 chris Exp$ */
/* Author: Christian Steinruecken */

import java.util.Random;
import java.util.Vector;
import java.util.Collection;

/** Implements a slice sampler.
* <p>Slice sampling is a Markov chain Monte Carlo method which
* produces samples from a probability density <var>f</var>.
* This is done by constructing a uniform random walk over the
* <i>area</i> under <var>f</var>.
* (For details, please see the techreport cited below.)</p>
* <p>This implementation works for distributions on the 1-dimensional
* real line.  Bounding intervals for the slices are found by starting
* from an initial interval of length <var>w</var>, which is repeatedly
* doubled until both ends of the interval lie outside the area under
* <var>f</var>.</p>
* <b>References:</b><ul>
* <li><a name="neal2000b">Radford M. Neal.&nbsp; <i>Slice Sampling,</i>
* 2000-08-29. Technical Report, No. 2005. Departments of Statistics and
* Computer Science, University of Toronto.
* </a></li></ul>
* @see Metropolis
* @see RejectSampler */
public class SliceSampler implements Sampler<Double> {

/** Target distribution. */
Distribution<Double> dist = null;

/** Step size. */
double w = 1.0;

/** Maximal number of times a slice's bounding interval may be doubled.
* This setting limits the maximum slice width to
* <var>w</var>*2^<var>maxk</var>. */
int maxk = 64;

/** Current sample. */
Double x = null;

/** Current slice. */
double y = -1;

/** Lower end of bounding interval. */
double a = -w/2;

/** Upper end of bounding interval. */
double z = +w/2;

/** Constructs a new interval doubling slice sampler for
* distribution <i>dist</i>, starting point <var>x</var>
* and initial interval size <var>w</var>.
* @param dist distribution to be sampled from
* @param w initial interval size (typical slice width)
* @param x starting point */
public SliceSampler(Distribution<Double> dist, Double x, double w) {
assert (w > 0.0);
this.dist = dist;
this.x = x;
this.w = w;
}
/** Constructs a new interval doubling slice sampler for
* distribution <i>dist</i> and starting point <var>x</var>.
* @param dist distribution to be sampled from
* @param x starting point */
public SliceSampler(Distribution<Double> dist, Double x) {
this.dist = dist;
this.x = x;
}

/** Checks acceptance of a proposed point <var>newx</var>
* when interval (<var>a</var>,<var>z</var>) was found by
* doubling procedure. */
protected boolean acceptable(double newx) {
// current point is still x
boolean det = false;
while (z-a > 1.1*w) {
double m = (a+z)/2;
if ((x < m && newx >= m) || (x >= m && newx < m)) {
det = true;
}
if (newx < m) {
a = m;
} else {
z = m;
}
if (det && y >= dist.p(z) && y >= dist.p(a)) {
return false;
}
}
return true;
}

/** This method finds a bounding interval of the current
* slice by Neal's "doubling" procedure. */
protected void doubleBounder(Random rnd) {
// find slice bounds (by "doubling" procedure)
// position a window [a,z] of size w randomly around x
a = x - Samplers.uniform(rnd,w);
z = a+w;
// probability density at the endpoints
double ap = dist.p(a);
double zp = dist.p(z);
// expand until slice [a,z] is acceptable
int k = maxk;
while (k > 0 && (y < ap || y < zp)) {
if (Samplers.flip(rnd)) {
a  -= z-a;
ap = dist.p(a);
} else {
z  += z-a;
zp = dist.p(z);
}
k--;
}
}

/** This method finds a bounding interval of the current
* slice by Neal's "stepping" procedure. */
protected void stepBounder(Random rnd) {
// position a window [a,z] of size w randomly around x
a = x-Samplers.uniform(rnd,w);
z = a+w;
// probability density at the endpoints
int m = 1000;
int j = rnd.nextInt(m);
int k = m-1-j;
while (j>0 && y < dist.p(a)) {
a -= w;
j--;
}
while (k>0 && y < dist.p(z)) {
z += w;
k--;
}
}

/** Get a sample using slice sampling. */
public Double sample(Random rnd) {
// current point: x
// sample y uniformly between 0 and p(x)
y = Samplers.uniform(rnd)*dist.p(x);
// find slice bounds
//stepBounder(rnd);
doubleBounder(rnd);
//a = Double.NEGATIVE_INFINITY;
//z = Double.POSITIVE_INFINITY;
// now sample a point on slice [a,z]
Double newx = Samplers.uniform(rnd,a,z);
while (dist.p(newx) <= y || !acceptable(newx)) {
//while (dist.p(newx) < y) {
//System.out.println("["+a+","+z+"] x="+x+" newx="+newx);
// shrink interval
if (newx < x) {
a = newx;
} else {
z = newx;
}
// resample newx
newx = Samplers.uniform(rnd,a,z);
}
//System.out.println("["+a+","+z+"] newx="+newx
//                  +", shrinks: "+shrinks+", grows: "+grows);
x = newx;
return newx;
}

/** Get samples using <var>n</var> steps of slice sampling.
* This method runs for <var>n</var> iterations and adds
* all obtained samples to the specified collection. */
public void sample(Random rnd, int n, Collection<Double> samples) {
for (int j=0; j<n; j++) {
}
}

/** Runs a simple slice sampling demo. */
public static void main(String[] args) {
Random r = new Random();

Gaussian g1 = new Gaussian(-2.0, 1);
Gaussian g2 = new Gaussian(2.0, 1);
Mixture<Double> d = new Mixture<Double>(0.2,g1,g2);
//Beta dist = new Beta(4.5,1.75);
//LogitBeta dist = new LogitBeta(0.5,0.5);

double range_a = -5.0; // the interval we're histogramming
double range_z = +5.0;
int n = 100000; // number of samples

System.out.println("SLICE SAMPLING DEMO");
System.out.println("Distribution: "+d);
System.out.println();
Histogram hd = Histogram.fromDoubles(d, range_a, range_z, 70, n);
System.out.println("Sampled directly ("+n+" samples):");
hd.print(System.out,8);

System.out.println();

SliceSampler ss = new SliceSampler(d, -2.0, 10);
System.out.println("Sampled with slice sampling ("+n+" steps):");
Histogram hr = Histogram.fromDoubles(ss, range_a, range_z, 70, n);
hr.print(System.out,8);
}

}



## Perfect Sampling

Given a source of random bits, a perfect sampler constructs provably independent samples.

Methods for getting perfect samples:
• Arithmetic decoding of random bits
• Coupling from the past
Propp & Wilson (1996)

## Coupling from the past

A method for getting an exact sample from an ergodic Markov chain (e.g. Metropolis MCMC).

How?
• Fix a sequence of random numbers
• Run the chain from the infinite past to the present.
• The state at time $$t=0$$ is an exact sample.
Propp & Wilson (1996)

## Coupling from the past

How to simulate a chain from the infinite past:
• To begin, start $$T$$ steps in the past.
• Create $$N$$ chains, 1 for every possible input state.
• Simulate those chains from time $$t = -T$$ forwards to $$t=0$$. Always using the same random numbers.
• Sometimes, chains will move to the same state. For all subsequent $$t$$, these chains follow an identical path. This is called coalescence.
• Iff at time $$t=0$$ all chains have converged, we're done.
Otherwise, extend the simulation further back in time, e.g. at $$t=-2T$$.

Propp & Wilson (1996)

## Coupling from the past

Produces exact (provably independent) samples Super elegant algorithm Subtle things can go wrong:
• Must simulate from past to present.
(Sampling from present to future, stopping at coalescence, produces biased samples.)
• Algorithm must always run to completion. Stopping early biases the sampler.
Chain design matters: coalescence can take a long time. Running time is random.
Mira & al. (2001)

## Perfect Slice Samplers

Combines coupling from the past with slice sampling.

## Estimating Z

Skilling (2006) Skilling (2004)

## Nested Sampling

See also MacKay (2006) for a a seminar presentation on Nested Sampling.

## Summary

Tutorial Summary:
• "From imperfect randomness to perfect samples."
• Overview over sampling algorithms and related methods.
• Check out the linked resources in this presentation.
• There are plenty of interesting papers, please explore! :)

Freely available text books:
• Information Theory, Inference and Learning Algorithms
(David MacKay, 2003)
• Non-Uniform Random Variate Generation
(Luc Devroye, 1986).
Other presentations:
• Advanced MCMC methods (Iain Murray, 2006)
• Extended ensemble Monte Carlo (Cristina Savin, 2012)

## References

Luc Devroye.  Non-Uniform Random Variate Generation, 1986. Book. School of Computer Science, McGill University. Springer. [Website] [PDF] Luc Devroye.  Random Sampling, 1986. Chapter 12 in Non-Uniform Random Variate Generation. School of Computer Science, McGill University. Springer. [PDF] David J. C. MacKay.  Information Theory, Inference, and Learning Algorithms, 2003. Book. Cambridge University Press. ISBN 978-0-521-64298-9. [Website] [PDF] Nicholas Metropolis; Stanislaw Ulam.  The Monte-Carlo Method, 1949. In Journal of the American Statistical Association, Vol. 44, pp. 335-341. ASA. ISSN 0162-1459. [PDF] John von Neumann.  Various Techniques used in Connection with Random Digits, 1951. In National Bureau of Standards, Applied Math Series, Vol. 11, pp. 36-38. ISSN 1049-4685. David K. Gifford.  Natural Random Numbers, 1988-08. No. MIT-LCS-TM-371. MIT. Lecture Notes. [PDF] Lenore Blum; Manuel Blum; Michael Shub.  A Simple Unpredictable Pseudo-Random Number Generator, 1986-05. In SIAM Journal on Computing, Vol. 15, No. 2, pp. 364-383. SIAM. ISSN 0097-5397. Makoto Matsumoto; Takuji Nishimura.  Mersenne twister: a 623-dimensionally equidistributed uniform pseudo-random number generator, 1998-01. In ACM Transactions on Modeling and Computer Simulation, Vol. 8, No. 1, pp. 3-30. ACM. ISSN 1049-3301. [PDF] Yadolah Dodge.  A Natural Random Number Generator, 1996-12. In International Statistical Review / Revue Internationale de Statistique, Vol. 64, No. 3, pp. 329-344. International Statistical Institute (ISI). [download] George Marsaglia; Wai Wan Tsang.  A fast, easily implemented method for sampling from decreasing or symmetric unimodal density functions, 1984. In SIAM Journal on Scientific and Statistical Computing, Vol. 5, No. 2, pp. 349–359.
Superseded by Marsaglia & Tsang (2000).