# Using BayeSys

Notes by David MacKay [BayeSys home | David MacKay home]

### Software by John Skilling     |     Website by David MacKay

Mon 29/3/04

Here are my notes on implementing a factor analysis model in BayeSys. I started from the BayeSys3 tar file provided in the test directory.

1. In the toy examples, the likelihood function is located in the file bayesapp.c, in the function UserBuild;
2. the data are defined in bayestoy.c, and the likelihood is initialized there in the lines
```
// Initialise Likelihood
Common->Ndata    = Ndata;         // # data
Common->Data     = Data;          // data                         [Ndata]
Common->Acc      = Acc;           // accuracies = 1/sigma         [Ndata]
```
which link the Data array so that it's accessible from the Common structure.
3. In order to write my own likelihood, I first need to instantiate the parameters in such a way that the prior over each is uniform on BayeSys's unit "Cube". This means that we need inverse distribution functions.

In a factor analysis model I will need Gaussian and Gamma priors.

Gaussian (mean mu, precision tau):
```
x ~ dnorm( mu , tau )
```
This can be mapped from a uniformly distributed "Cube" variable c via BayeShape Shape=5 `Bell'.
Gamma:

Generating a sample from a Gamma is not simple. In /home/mackay/_doc/dirichlet/sample/sample.p I get Gamma variates using

```
$command = sprintf ( "/home/mackay/ansi/ranlib.C/test/tst %d %f %s%s | " , $n , $uo, $seed.$r );
```
which invokes a test program built from /home/mackay/ansi/ranlib.C/src/ranlib.c; that code uses various methods, including rejection sampling, depending on the mean of the standard gamma distribution required.

The cdf of the standard Gamma (f(x)=x^{gamma-1} e^{-x}/Gamma(gamma)) is F(x) = Gamma_x(gamma)/Gamma(gamma), where Gamma_x is the incomplete gamma function.

Hmm, let's read the manual. On page 23, the BayeShape facility is described; it doesn't mention Gamma explicitly. Perhaps the simplest course of action is to replace Gamma priors by log-normal.

[I see that Shape 2, Simplex volume, has a connection to incomplete Gamma.]

## Start simple

As a first step, I implemented the model
```
mu ~ dnorm(0,1)
Data ~ dnorm(mu,1)
```
which should give (since the marginal variance of D is 2.0)
```
Evidence = log P(D) = -D**2/4 - log(sqrt(2*pi * 2))   ( = -2.2655 for D = 2 )
```
I ran the model using bayestoy.c (modified by me) and onegaussian.c (modified from bayesapp.c) (Makefile). I tried setting the Data and Accuracy to various values, checking the resulting values of <Mock> and Evidence. For extreme data values, the answers were incorrect. (Which seems reasonable!) Here are two examples, a moderately extreme case with perfect answer, and a really extreme case with wrong answer. All results came out much the same for any value of rate from 0.1 down to 0.0001.
| Data, Acc | Correct &lt;Mock&gt; and Evidence | Resulting &lt;Mock&gt; and Evidence | Verdict |
|-----------|------------------------------|---------------------------------|---------|
| 4, 10     | 3.96 and -8.8447             | 3.96 and -8.75                  | Good.   |
| 20, 1     | 10 and -101.2655             | 6.3 and -116.44                 | Bad (result did not improve when a small rate was used). |

Further investigation indicates that "Bad" results can occur when a datum requires a parameter to take on a 7-sigma outlying value. 6-sigma outliers seem to be all right.

This is exactly what's expected: real numbers are represented in 32 bits, so anything beyond mass 2^{-32} in the tail of the prior will be ill-represented. Now, at 7-sigma, the tail probability is roughly exp(-49/2)/(7 sqrt(2 pi)), say 1e-12, which is 2^{-40}; so in fact we should not expect a 7-sigma event to be represented at all. 6-sigma corresponds to about 30 bits.

## Next steps

I made a sequence of factor analysis models. They are available in fa.c, fa2.c, fa3.c, (mainfa.c, mainfa2.c, mainfa3.c), and have increasing numbers of parameters of an increasing number of types. To do simple Bayesian models like FA I force the number of "atoms" to be 1. I encountered a bug in my first attempt, in which BayeSys requested the likelihood for an object that had Natoms=0. I was not expecting this. I fixed the problem by modifying UserBuild so that it returns "0" instead of "1" in this situation.

fa3.c implements a 3-dimensional factor analysis model in which there are six parameters: three components V1, V2, V3 of the loading vector, which come from a unit normal prior, and three diagonal noise variances D1, D2, D3, which have a log-normal prior. The generative model for x (a 3-component vector) is

```
P(x[1..3]) = Normal( 0, C )
```
where the covariance matrix C is
```
C = D + s2 V V^T
```
where D is the diagonal matrix made from D1, D2, D3, and s2 is a fixed variance parameter.

Rather than explicitly enumerate the N data points x, I work with the sufficient statistics {N, &lt;x x^T&gt;} and write down the likelihood analytically. The parameter Ndata counts the number of sufficient statistics, not the number of data points. The number of data points is placed in the UserCommonStr, which is defined in userstr3.h along with other essential fixed model parameters such as the variance s2 and the fixed hyperparameters of the log-normal prior.

In the output vector Mock, I keep track of what the sufficient statistics would be if perfect data were generated from the present parameters.

Results look fine. Mock matches Data nicely.

Aside: I notice that the results obtained differ depending on which of my linux machines runs the program. A Red Hat machine and a Debian machine give slightly different answers.

## More parameters

In fa4.c (mainfa4.c), I make the parameter s2 a free parameter with a log-normal prior. When it has a fairly narrow prior (an s.d. of +/-2 over log(s2)), the evidence changes almost negligibly,

```
from Evidence    =  -6843.40  (fa3)
to   Evidence    =  -6845.81  (fa4)
```
which is very appropriate. Increasing the prior width over log(s2) by a factor of e^2 should simply decrease the evidence by 2.0 nats, since the Occam factor is inversely proportional to the prior width. Check:
```
Evidence    =  -6848.04
```
Bingo.

## Fewer parameters

In fa5.c, (mainfa5.c), I reduce the number of free parameters again (in order to carry out model comparison between two factor analysis models, one of which claims to know the loading vector V).

I set the known V to (1,1,1), which is not well-matched to the data. The evidence drops to

```
Evidence    =  -6980.84
```
(a drop of 130, with a data set of 1000 points, so a KL divergence of 0.13 nats per data point, which seems reasonable).

## Test on real data

To make fa6, I linked fa5.c to mainfa6.c (known factor), which contains data from R E Turner. Similarly, fa7 links fa4.c to mainfa7.c (unknown factor vector).
```
# sufficient stats for aa.dat
# N = 152
# sums:
# -0.00037766	-0.0002665	-0.0002048
#
# Covariance sums:
3.2469e+06	2.81641e+06	5.69258e+06
3.75222e+06	4.88881e+06
2.13158e+07
```

#### output (from fa6, using (1,1,1) as the expected factor):

```
Evidence    =  -3004.90 (log[e])
Information =     42.62 (log[e])

Data    Mock
3246900.00 3203229.85
2816410.00 3202008.81
5692580.00 3202008.81
3752220.00 4547676.00
4888810.00 3202008.81
21315800.00 16026978.76
```

#### output (from fa7, using unknown factor):

```
Evidence    =  -3015.90 (log[e])
Information =     53.98 (log[e])

Data    Mock
3246900.00 3167927.62
2816410.00 2741839.82
5692580.00 4119870.49
3752220.00 3701758.68
4888810.00 4680292.26
21315800.00 20394359.60
```

Conclusion: while neither model reproduces the six data values perfectly, the evidence favours the (1,1,1) hypothesis by a factor of exp(11).

## Moving on: get a sensible number of factors

For realistic problems, we want more factors. Rewrite the likelihood routine for the general case.