For a review paper on Bayesian methods for neural networks, please see my publications page, in particular the papers `Bayesian Interpolation' and `A Practical Bayesian Framework for Backpropagation Networks' and `Probable Networks and Plausible Predictions'. Most of these FAQs are from people who have read those papers and wanted clarification or further information.
If the answer to your question is not here in the FAQ, please proceed to the Bayesian Methods automated FAQ service; for other MacKay group FAQs, click here.
where M = E(data) + E(weights) = total "cost function"ie. in
P(w|D,alpha,beta,H) = f(M) / Zwhy are we constrained to use exp for f: couldn't we use something else and thereby derive a completely different probabilistic interpretation? Is it that we also want the addition of an arbitrary constant to have no effect? [ie. we want Posterior_with_M = Posterior_with_M+C, and exp is the only function that'll do the trick].
P(w|D,alpha,beta,H) = exp(-M)/Zand integrate over alpha and beta to represent the fact that we don't know them. Then the posterior probability P(w|D,H) has the form of a product of student distributions in which the w-dependent bits are just E_D and E_W. The reason for getting a different form from exp is that we are no longer assuming independence of the residuals--if we see how big some residuals are, it gives us an idea how bog other residuals are going to be. When you compute the derivative you get something almost equivalent to the standard derivative of standard learning. For more on this see the paper of Weigend and Buntine and my paper Hyperparameters: optimize, or integrate out? alpha.ps.gz, abstract.
t = f(x;w) + nuWe could spell this formula out longhand as a statement about a probability distribution,
P ( t | w , H ) = Normal ( mean = f(x,w) , variance = sigma^2_nu )And the assumption that all the noise terms nu are independent means that the probability of all the target values D = {t} is :
P ( {t} | w , H ) = exp - 1/2 ( sum of residuals^2 ) / Z_DThen we can ask the question `how probable are weights w?', and get the unique answer (given our assumptions, and leaving out alpha and beta for brevity) by Bayes theorem -
P( w | D , H ) = P ( D | w , H ) P ( w | H ) --------------------------- P ( D | H )Notice the first term that measures how good w is according to the data is P ( D | w , H ), which is just the probability distribution of the assumed noise, P ( {t} | w , H ).
From: udah222@bay.cc.kcl.ac.uk Dirk HusmeierI have read your introductory papers on the Bayesian framework for training multilayer perceptrons ("Bayesian Interpolation", "A Practical Bayesian Framework for Backpropagation Networks", Neural Computation 4 (1992)) with great interest, but have a question concerning the practical implementation. The scale parameters alpha and beta can be calculated via the number of free parameters, gamma= k-alpha*Trace(A^-1), according to 2*alpha*Ew(w_mp)=gamma, 2*beta*Ed(w_mp)=N-gamma (equations (4.8),(4.10) in "Bayesian Interpolation"). This assumes a quadratic approximation for M, equation (4.5), which certainly holds locally in the neighbourhood of w_mp. However, the parameter configuration w_mp is not known a priori and actually shall be found by minimizing M(w)= alpha*Ew+beta*Ed. It does not become clear how the scale parameters are chosen during the training process. Are they calculated using modified equations (4.8) and (4.10), replacing Ew(w_mp) and Ed(w_mp) by their current values Ew(w) and Ed(w) ? However, then grad(M) != 0, so the quadratic approximation, equation (4.5), does not hold any more. Can one really assume that the algorithm based on the calculation of Ew(w) and Ed(w) converges in some self-consistent way and finally leads to the minimum of M(w)?
do { partial optimization of w ; re-estimation of alpha, beta ; (using either expensive or cheap-and-cheerful formulae) } lots of timesdoes converge to some state {w,alpha}, then that state is a self-consistent solution, i.e. a pair such that w minimizes M(w|alpha) and alpha maximizes the evidence subject to the approximations and assumptions. So, if it works, it is OK. And empirically I find it works OK. I have never encountered instability or other problems with this algorithm. But there are no cast-iron guarantees in this field.
In the special case of linear models, it is possible to justify as a safe and convergent strategy the updating of alpha by the given formula before w has converged. This update procedure can be viewed as an E-M algorithm. See `Ensemble Learning and Evidence Maximization'-- nips.ps.gz, abstract, and the reference to Neal and Hinton therein, for the concepts needed to confirm this assertion.
d -1 dA 2 -- log det A = Trace (A --), and your paragraph on chi expectation being da da N or N-1 ?Thanks, Eric Dedieu
For hints on how to understand
d -1 dA -- log det A = Trace (A --), da datake a look at this section.
P( t_1...t_N | H ) = P( t_1 | H ) P( t_2 | t_1, H ) P( t_3 | t_1,t_2, H ) * ... P( t_N | t_1..t_N-1, H )-- i.e. the product of all the predictive probabilities for each of the data points, using the model `trained' on the previous data points. Now by arbitrary reordering of the points, I can expand the evidence in N! different such ways. The individual terms in the expansion will vary in value as I do this, but the overall product will stay the same. One way of representing this is to make a plot of the log predictive probability for each data point:
| - log P( t_i | t_1...t_i-1, H ) |* | * | ** | * * | * * * | * * | ** | ** ** | *** * * | * ***** **** ** | * ** * | ---+-----------------------------------------------> i NNow, as we vary the arbitrary labelling of the points, the *'s will wander up and down, but the log evidence, being (minus) the INTEGRAL under this curve, will stay the same. In the above I have drawn the typical behaviour for the terms as a function of i: the late data points are more predictable than the early ones (as the model learns), so the predictive error bars get smaller and the predictive probability gets bigger, typically. Anyway, it's definitely not flat, except in the most trivial case in which there is nothing to model. Only once the parameters have become fully determined will the average height of this curve level out to a constant value determined by the noise level. OK, how does this relate to CV? Well, CV, as I understand it, involves looking at the average value, over many data orderings, of
LEAVE-ONE-OUT-ERROR = - log P( t_N | t_1...t_N-1, H )which is precisely the value of the *last* point on this graph. And looking at the graph, I see no way that examining the average value of the last point in the graph can tell you what the integral under the whole graph is!
Pragmatically, I think that what a *user* often cares about is what the predictive performance of his model is probably going to be for the next data point N+1:
PRAGMATIC-PERFORMANCE = - log P( t_N+1 | t_1...t_N, H );and it seems reasonable that LEAVE-ONE-OUT-ERROR is going to be a good predictor of this, for N>>1. Furthermore, the evidence is not necessarily going to be so well correlated with - log P( t_N+1 | t_1...t_N, H ), since the evidence is the integral, which may include a big transient tail at the i<
P(r) = exp ( - beta r^p ) / Z(beta, p)where p is in [1,2], how should I optimise the model, and how should I do cross validation? How should I determine beta and p? What about the case of a classification model?
I have never seen these questions addressed in cross validation terms. Of course, a Bayesian has no problem addressing these questions.
Also, if the evidence and cross validation are strongly in disagreement, I would predict that cross validation would be the better method for predicting generalisation error.
{LPE} &=& \sum_m - \log P(t^{(m)}|D,\H) \\ & = & \sum_m \left[ \frac{1}{2} \left( t^{(m)} - y^{(m)} \right)^2/ {\sigma^{(m)}_y}^2 + \log (\sqrt{2 \pi} \sigma^{(m)}_y ) \right]Also, going back to scalar predictions: consider using more robust error measures than the mean-square error. After all, is it usually the case that if an error of size 2.0 costs 3 pounds more than an error of size 1.0, then an error of size 3.0 costs 5 pounds more than the size 2.0 error, and an error of size 4.0 costs 7 pounds more than the size 3.0 error? Often a more robust measure such as the mean absolute deviation, or the mean squared error of the smallest 90\% of the residuals, may be a more realistic loss function.
But how do I compute \sigma^{(m)}_y?
These are the error bars on the output of your model. Please read my PhD thesis.
P(D_2|D,alpha,H) = \int d^k w P(D_2|w,H) P(w|D,alpha,H) . (8)
Normal ( x ; mean , variance )In the simplest case, w is a scalar and there are two data points. One training point, D = t_1, and one test point, D_2 = t_2, where
P ( t | w ) = Normal ( t ; w , sigma^2 ).Let the prior on w be
P ( w | alpha ) = Normal ( w ; 0 , 1 / alpha ).The posterior probability of w given t_1 is:
P(w|t_1,alpha) \propto P(t_1|w)P(w|alpha) = Normal ( w ; w_mp , sig_w1^2 )where
w_mp = t_1 * ( 1/sigma^2 ) / ( 1/sigma^2 + alpha )and
sig_w1^2 = 1 / ( 1/sigma^2 + alpha )Let the new datum be t_2.
Equation 8 becomes:
P ( t_2 | t_1 , alpha ) = \int dw Normal ( t_2 ; w , sigma^2 ) Normal ( w ; w_mp , sig_w1^2 )in words, t_2 has a probability distribution that is given by convolving the Normal posterior distribution of w with the Normal distribution of the noise. The result is:
P( t_2 | t_1 alpha ) = Normal ( t_2 ; w_mp , ( sigma^2 + sig_w1^2 ) )
However, this is not always the case. If we take Takeuchi and MacKay's interpolation model and increase the number of hyperparameters, the predictive properties do change as the number of hyperparameters increases. My intuition is that the predictive properties of the model change only in a small and unimportant way, but all the same, the question you ask is a valid one. How should the prior on {alpha} be set?
My answer would be that it is up to the user to define a prior that corresponds to the user's prior beliefs. In the case of the Takeuchi and MacKay model, the prior on the hyperparameters defines the simplicity and complexity properties that we would expect typical functions to have. Consider the simple case of an interpolation model H_2 with alpha(x) = alpha_1, x<0; alpha(x) = alpha_2, x>0. We might wish to compare this model with the simpler model H_1, alpha(x) = alpha_1 for all x.
[I note in passing that for practical purposes, it would not be unwise to always use H_2, because the extra hyperparameters of H_2 do not significantly increase its ability to overfit the data. The Occam factors penalizing the extra hyperparameters are correspondingly weak. The posterior error bars on alpha do not become ever smaller as the number of data increases.]
How should the prior on alpha_1, alpha_2 be set? A particular value for alpha_1 defines the characteristic power of the signal y(x) for x<0, and a prior on alpha_1 therefore expresses our prior uncertainty about how much power the signal y(x) might have in x<0. I think the correct way to set the prior on alpha_1 and alpha_2 is therefore to consider how uncertain one is about the power one expects the signal to have in x<0 and in x>0. In a typical problem I expect a reasonable value for the prior would say that the power is uncertain within a range of about two orders of magnitude--- log_10 alpha_1^max / alpha_1^min ~= 2, and log_10 alpha_2^max / alpha_2^min ~= 2. In some applications, the prior uncertainty might be smaller. I think it would be unusual for the prior uncertainty to be much larger. In this way, a reasonable prior can be assigned to {alpha}, and model comparison can be performed.
The transformation beta->u has no effect on the location of the evidence maximum, i.e., beta* maximises P(D|beta) iff u* := log(beta*) maximises p(D|u).
However, since P(beta) != P(u), the maximum for the posterior is shifted, i.e., if u** maximises P(u|D) and beta** maximises P(beta|D), then u** != log(beta**).
Hence the a posteriori most probable value of the hyperparameter seems to depend on the chosen coordinate representation.
??????? Where is the catch ?
Some people (including many Bayesians!) have the impression that Bayesian inference is all about finding MAP parameters (maximum a posteriori) but this is a mistaken view. Ideal Bayesian inferences are independent of chosen representation, and so do not depend on maximization. The only reason I ever maximize a posterior probability is this: *if* there is a representation in which a probability distribution is *Gaussian* (or well approximated by a Gaussian) then maximization over that probability distribution in that representation is useful, *not* because the particular parameters at the maximum have any special status, but because predictive distributions can be simply expressed (mean, variance) in terms of the maximum of the Gaussian and its curvature. Back to top
> From: Herman BruyninckxBack to top> Subject: Non-invariance of MAP... > > In Bayesian methods for neural networks - FAQ, you make the > remark that ``Some people (including many Bayesians!) have the impression > that Bayesian inference is all about finding MAP parameters (maximum a > posteriori) but this is a mistaken view.'' This answer leaves me burning > with curiosity about what your answer is to the following question: what > then *is* Bayesian inference all about? I mean, what decision processes are > `allowed' (in the sense of being invariant)? David J.C. MacKay responds: Decision theory very rarely involves MAP. The optimal decision is the one that minimizes the expected cost. The expected cost is found by MARGINALIZING over the posterior distribution. Only in toy problems will you find that the decision from marginalizing is the same as MAP. > Subject: Re: Non-invariance of MAP... > > Thanks for your prompt reply! > > > Decision theory very rarely involves MAP. > > > > The optimal decision is the one that minimizes the > > expected cost. > And what about other decision approaches, such as minmax? I guess they > happen to be non-invariant much of the time, but they do have some > intuitive appeal, don't they? David J.C. MacKay responds: minmax can be obtained as a special case of the general statement 'The optimal decision is the one that minimizes the expected cost.' Let the quantity 'u' that is minimaxed be related to the 'effective cost' by c = exp ( beta u ) and consider what happens as you increase beta. (And do the normal minimization of the expectation of c.) In the limit of infinite beta, you end up choosing the minimax action. > Thanks *a lot* for this insight! In order to give credit to the proper > source, allow me to ask this question: Is this a `classic' in decision > theory (if so, I apologize for not knowing it!), or is it impossible to > find it in the literature because it is an original idea of yours? I am afraid I don't know whether this is written down anywhere, but I would be surprised if I am the first person to have noticed it. David
However, one could proceed in another way. Rather than integrate out w, adapt w and beta simultaneously, i.e., maximise P(w,beta|D,alpha)~ P(D|w,beta,alpha)*P(w|beta,alpha)= P(D|w,beta)*P(w|alpha) with respect to w and beta.
This is equivalent to maximising log[P(D|w,beta)]= log[{beta/(2*PI))^(T/2)}*exp(-beta*Ed)]= -beta*Ed+ (T/2)*log(beta)+ const, so that from d/d_beta(log[P(D|w,beta)]= -Ed+ T/(2*beta) = 0 we arrive at 2*beta*Ed= T (without the additive constant -gamma).
What is the reason for this deviation ?
Consider a data set {x} modelled as coming from a Gaussian distribution of mean mu and s.d. sigma.
If you maximize the joint likelihood function over mu and sigma you get (xbar, sigmaN). [Get a piece of paper and do it!]
If you integrate over mu and ask what is the most probable value of sigma, instead, you'll get a maximum at sigmaN-1.
To understand this it helps to make a contour plot of the likelihood function. The crucial concept is that it has a skew peak.
See this paper: ensemble.ps.gz. abstract. ` Developments in Probabilistic Modelling with Neural Networks - Ensemble Learning '. And also Hyperparameters: optimize, or integrate out? alpha.ps.gz, abstract.
Your use of the term "likelihood" conflicts with what I believe to be the standard definition (although there are plenty of other people making the same "mistake" :-).
When you talk about "the likelihood of the data given the parameters w", you might just as well have said "the probability of the data given the parameters w" - you seem to view "likelihood" as a synonym for "probability", to be used when talking about the data rather than about parameters.
Amongst non-Bayesians, "likelihood" means something entirely different from "probability". If I recall correctly, the term was invented by Fischer. Since he had rejected Bayesian arguments, he wasn't allowed to talk about the "probability" of a parameter, so he invented the term "likelihood" instead. Likelihood is a function of the _parameters_ not the data, though its value is conditional on the data. Thus the "maximum likelihood" method consists of choosing the parameter set with largest likelihood - the phrase makes no sense if you view the likelihood as a function of the data, since you don't get to change the data to maximize anything.
So, the quantity P(data|params) is the likelihood of the parameters. If you *want* to say "likelihood of the data", just say "probability of the data" instead.
@INCOLLECTION{Jaynes.intervals, KEY ="Jaynes", AUTHOR ="E. T. Jaynes", TITLE ="{B}ayesian Intervals versus Confidence Intervals", BOOKTITLE ="{E.T. Jaynes}. Papers on Probability, Statistics and Statistical Physics", EDITOR ="R. D. Rosenkrantz", PUBLISHER ="Kluwer", YEAR ="1983", PAGES ="151"} % PUBLISHER ="Kluwer Academic Publishers", % reprinted in paperback 1989, % I just read utterly the best Jaynes essay ever. It is SO good; so even % handed and confrontational; rubbing the noses of the opposition in the % examples he gives, using the opponents of Galileo as analogy -- some of % his opponents refused to look through his telescope to see Jupiter's % moons, because they `already knew'. It's a very pragmatic argument he uses, % not philosophical -- just look at the results of the two approaches % and see where they give different answers, then magnify those differences % and ask your common sense which answer makes sense.
@INPROCEEDINGS{Loredo, KEY ="Loredo", AUTHOR ="T. J. Loredo", TITLE ="From {L}aplace to Supernova {SN} {1987A}: {B}ayesian Inferencein Astrophysics", BOOKTITLE ="Maximum Entropy and {B}ayesian Methods, {D}artmouth, {U.S.A.}, 1989", EDITOR ="P. Fougere", PUBLISHER ="Kluwer", YEAR ="1990", PAGES ="81--142"}
% Nice quote: [from savage originally] `Indeed to many {B}ayesians, belief % in the LP is the big difference between {B}ayesians and frequentists, % not the desire to involve prior information' @BOOK{Berger, KEY ="Berger", AUTHOR ="J. Berger", TITLE ="Statistical Decision theory and {B}ayesian Analysis", PUBLISHER ="Springer", YEAR ="1985"}--- you might also like this
@Book{Sivia:96, author = "D. S. Sivia", title = "Data Analysis. A {B}ayesian Tutorial", publisher = "Oxford University Press", year = 1996, annote="ISBN 0-19-851889-7" }good value. Sivia was a Gull student in the generation before me.
I think the above process is the way it works, and the question then is (a) can the above process be a Bayesian process? (b) if not, how does it relate to Cox?
I think that the step at which we become disatisfied with hypotheses A aand B is rarely Bayesian. It often involves an ALTERNATIVE-FREE test. Whereas Bayesian model criticism always depends on having an alternative model.
So what is wrong with cox axioms ? Why does the above science story not yield to the all-encompassing Cox? The reason, I reckon, is that Cox's second axiom says that P(A) is related to P(not-A). This is the closed-hypothesis-space assumption. And in fact, INFERENCE IS OPEN-ENDED (as Skilling says).
evaluate the Hessian A, and invert it -> A^-1 evaluate the sensitivity matrix G_o,i === dy_o / dw_i [sensitivity of output y_o to weight w_i] for all outputs o and weights i. Then the Variance-Covariance matrix for the outputs is G^T A^-1 G where ^T means transpose, to make the product work right.This is also written down on p.597 of Neural Computation -- not very useful for people wanting to understand the paper in the preceding issue! :-)
Should I ever get negative values for G^T A^{-1} G? The above approximation for the error bars rests on a gaussian approximation which assumes that the Hessian A is positive definite. (If your A is not positive definite, then I recommend forcing it to be so.) If A is pos def then G^T A^-1 G also must be positive definite. This means for example that all the diagonal elements of this matrix should be positive. To get the error bars on a single prediction, just take the square root of the corresponding diagonal element. To get error ellipsoids for two correlated predictions, plot contours of the function (delta y)(Y^{-1})(delta y), where Y = G^T A^{-1} G, and delta y is the deviation from the most probable outputs.
Another way of representing error bars is to draw typical samples from the posterior distribution P(w) propto exp( -1/2 (delta w)(A^{-1})(delta w) ). This gives entire typical functions y(x;w).
NB Error bars in classifiers are a whole different story. See my thesis or my Neural Comp papers of 1992. And see The FAQ on Classifier error bars.
There are several ways. The first two ways I used are
A \simeq \beta \sum_n g^(n) g^(n)^T + \alpha Iwhere g^(n) = d y / d w, for example n. (This is a different g from the ones above.) Here terms of order d^2 y / d w^2 are omitted, which means that the net is treated as being locally linear.
This calculation scales OK with problem size. It only costs the same amount of backpropagation as does a single pass through the training set. But accumulating the entries of the matrix A, i.e., adding up the terms
g g^T,may take a lot of time.
@Article {Pearlmutter, author = "B. A. Pearlmutter", title = "Fast Exact Multiplication by the {H}essian", journal = "Neural Computation", year = 1994, volume = 6, number = 1, pages = "147--160", annote = "Also available by ftp archive.cis.ohio-state.edu: /pub/neuroprose/pearlmutter.hessian.ps.Z" }
Since risk prediction falls under the auspices of "classification" rather than "regression", the difference between predicted and actual outcome is no longer normally distributed and I'm not sure how I would compare the error bars produced by different models.
Are you familiar with a practical method to compare the error bars on probability predictions produced by different models in a classification setting?
In a nutshell, the question is, how should we include the concept of error bars when we are dealing with a classification problem, e.g. a binary (0/1) problem.
The crucial point is to get away from the idea that what you want, when making a 0/1 prediction, is a probability "with error bars".
If I am attempting to model the probability that it's in class 1, and I know I have got a lousy model somewhere, then that means the P( ) that I spit out ought to be close to 0.5,0.5 (in the two class case). Thus the effect of "error bars" should be to "moderate" your predictions. If you have poorly determined parameters (big error bars on the parameters), your predictions will be close to 0.5, 0.5, and when the performance is assessed using the log predictive probability
sum_examples log P(actual class)you will be penalised by log(0.5) per prediction. On ther other hand if you make a poor assessment of the error bars on your parameters (e.g. if you ignore them totally) then you will make overconfident predictions; you will be automatically penalized, because when you say "I'm 0.9999 sure that that is a 1," when actually it is a 0, you will get badly burned to the tune of log 0.0001. - a big negative number.
For the case of binary classifiers this is all discussed in my paper "The evidence framework applied to classification networks". You'll note that I compare alternative models in terms of their performance on a test set in two ways: (a) the traditional way, which is to evaluate sum_examples log P(actual class| best fit parameters) and (b) the Bayesian way, which is to make an attempt at marginalization, to get P(actual class|Data ) = \int d params P(actual class| parameters) P(params | Data) and then evaluate sum_examples log P(actual class|Data) .
I call the classifier output P(class| best fit parameters) the "most probable outputs" and P(class|Data ) the "moderated outputs" -- a name which I regret introducing, since "marginalized outputs" would have done fine.
I haven't seen anyone make use of my techniques in real problems, except for my collaborator in Cambridge, Ichikawa.
An alternative way of implementing the Bayesian marginalization would be with a monte carlo approach like Radford Neal's. For multi-class problems this is the only practical approach to marginalization that I know of. (My paper only gives a gaussian approximation method for binary classification.)
Many thanks for your reply. I understand and am no longer confused 8) At the risk of asking another dumb question, does model averaging give you the same "moderation" of predictions as marginalization over network parameters? ...or does it only approximate that marginalization?
Model averaging and marginalizing over parameters both give you a "moderation" effect, yes. Both are forms of marginalization and are strongly recommended. NB, they are not the same. The ideal approach is to do both.
Is there an easy way to change the smoothing formula that you briefly discuss in your PhD for the continous outputs of the bayesian learning neural network (0/1 classification) to allow smoothing of the predicted probabilities towards 0.35 instead of standard 0.5 ?
0.35 is the natural prior probability of my positive cases in my data set and I would think that smoothing the probability estimates towards 0.35 is a better choice in this case.
Great question - This is a question that has bothered me for 9 years! The bottom line is that I think this exposes a problem with the use of neural nets that have standard sigmoid or softmax outputs. It would be better to use probabilistic models that have a nonuniform measure built into them (as does the Dirichlet distribution).
The best idea I have for patching up the sigmoid model is to treat the network as a likeihood-ratio machine, and multiply by your priors (and renormalize) after the marginalization.
When training your network to produce likelihood ratios, you need to factor out the prior probability if possible. One way to do that would be to boost all the data points by a factor inverse to the prior probability when computing the gradient.
If you have difficulties with this methodology I recommend using Radford Neal's mcmc software, which uses correct Bayesian sampling of the hyperparameters and parameters, and often works a little bit better, probably because it is more robust.
y*(m) = y(x(m); w*).If you use your definition of t1(m) then it seems that you are incorporating too much noise. The whole procedure seems to be a bootstrap-type approach, of which there are various variations for regression problems in the statistical literature, both parametric and nonparametric bootstraps, and even so-called Bayesian bootstraps. Your version is parametric in that you are adding Gaussian noise. In the nonparametric case you would create a set of residuals, defined by
r*(m) = t(m) - y*(m),sample from these at random, with replacement, to generate a bootstrap sample of residuals, r1(m), and define t1 by
t1(m) = y*(m) + r1(m) .I apologise in case this is all old-hat to you by now.
P(y) = Normal ( 0 , alpha ) [NB here I use `BUGS' convention, Normal (mean,1/variance)]And assume that we observe t_n = Normal( 0 , beta ), n = 1 ... N. Then the posterior distribution of y is
P(y|Data,alpha) = Normal ( gamma t_bar , alpha + N beta ). [t_bar = sum_n t_n / N, and gamma = N beta / ( alpha + N beta ) ]Let us now consider the strategy I describe and see if it does generate samples having this probability distribtuion. We add noise to t_n, obtaining a new data error function
E_D1 = sum_n ( t_n + n_n - y )^2 / 2where n_n ~ Normal ( 0 , beta ). We also perturb the regularizer, by changing the mean towards which y is drawn from 0 to mu, with mu ~ Normal ( 0, alpha ). The total objective function that we then minimize is
M(y) = beta E_D1 + alpha E_W1where E_W1 = (y - mu)^2 / 2.
What is the y that minimizes this? Answer,
y_min1 = gamma t_bar + ( N beta n_bar + alpha mu ) / ( N beta + alpha )OK, now what is the added term here?
1/(N beta + alpha) .Thus this procedure generates samples from precisely the posterior distribution.
l1 0 0 0 ... 0 l2 0 0 ... 0 0 l3 0 ... . . . . .where l1, l2 and l3 are its first three eigenvalues. This diagonalization involves an orthogonal transformation, i.e. a rotation, and such transformations preserve both traces and determinants; and entropies too. So whenever you have an integral of the form
\int d^k w exp ( -1/2 w A w ) ...you can (for convenience of thought) pretend that A is in fact a diagonal matrix, in which case, the integral has the separable form:
\int d^k w exp ( -1/2 \sum_i l_i w_i^2 )Such separable integrals are straightforward. All that remains is to compute the entropy of a one-dimensional gaussian, and that is left as an exercise for the reader.
How to derive w_mp=A^(-1)*B*w_ml?
Write out the expression for M(w). Differentiate with respect to w.
Set derivative equal to zero to find the minimum.
H_{odd/even}: the sequence will be the odd numbers in order, or the even numbers in order.Is this the same as ST?
D = { 2 , ... }what is your predictive distribution? I think it is reasonable to put quite a lot of predictive probability on "4", because of the existence of H_{odd/even}, and also the similar hypothesis:
H_multiple: the sequence is a list of multiples of x, starting with x.However, I don't think I would put as much as 1/2 of my predicitve distbn on "4"; from which I can conclude that less than half of my prior probability is placed on H_{odd/even} and H_multiple.
p(x) = 2^{-l(x)}---assuming that l(x) is a length in bits. If these probabilities don't add up to one, then you have a suboptimal coding system. If they do add up to one, then you have a valid code which is optimal if the true probability distribution of x (whatever we mean by that) is in fact p(x). Optimal in the sense that the expected message length is minimized. Now, you said Note also that the SC formulation does not need to make any assumptions about precision of parameters. Well, I maybe haven't read the SC formulation. But if a message coding scheme uses a parameter block followed by a data block then the parameter precision has to be taken into account. The optimal precision relates to the occam factor, i.e. to the posterior error bars, in the case of a gaussian posterior on the parameter. But there are indeed many other ways to send a message describing the data. One is not obliged to send parameters to some precision or other. Instead one can for example use on-line predictive coding, in which one chops the data x into a sequence x=(x1,x2,x3,...xN), and sends x1 using the prior predictive distribution p(x1|H) to define the message lengths, then x2 using p(x2|x1,H), etc. This can be done nicely by using the arithmetic coding algorithm invented by Rissanen. Another coding scheme in the MDL literature is Hinton's 'bits back' coding technique. See my review paper to appear in Network (ps.gz, abstract) for further discussion of this. You also said that The $(k/2)\log n$ factor comes out automatically. Yes, it does, in Rissanen's work, but NB, it is an assymptotic result. k/2 log N is not the right penalty in the general case of a well-regularised model with finite amounts of data. In such cases the evidence (and the minimal message length) is not related to k at all. The relevant quantity might be the quantity gamma, see my chapter 'Bayesian interpolation' in my thesis. I think the k/2 log N formula is often very misleading. I think the way forward is to use models with infinite numbers of parameters controlled by interesting priors. See Radford Neal's papers.
@INPROCEEDINGS{MacKay.nips4, KEY ="MacKay", AUTHOR ="D. J. C. MacKay", TITLE ="Bayesian Model Comparison and Backprop Nets", BOOKTITLE ="Advances in Neural Information Processing Systems 4", EDITOR ="J. E. Moody and S. J. Hanson and R. P. Lippmann", PUBLISHER ="Morgan Kaufmann", ADDRESS ="San Mateo, California", YEAR ="1992", PAGES ="839-846"}
My conclusion would be in general that evidence is appropriate only for large sample sizes (where for example the Gaussian assumption may be valid) and assuming the true model is on the list of models considered. For unrealizable cases, the problem is (I think) that minimizing the evidence is not necessarily the same as minimizing the generalization error EVEN asymptotically. They are just two different (albeit related) criteria.
But I would like to emphasise that there is much more to Bayesian methods than using the evidence as some sort of estimator of generalization error. This concentration on generalization error is very narrow minded. Bayesian methods can be used to optimize hyperparameters (by maximizing the evidence implicitly); and other model properties. Thus if one imagines that a noise level might vary with x, then one construct a model of sigma(x) and learn it under the guidance of the appropriate evidence function. One can use the evidence to optimize the radius of a radial basis function model. One can use multiple hyperparameters to represent a spatially varying smoothness and can infer these using the evidence. All these model improvements are likely to improve generalization error, but that's just a by-product of finding a more probable model.
The evidence is a crucial guide for one's search in model space. The evidence is a far richer guide than some mere generalization error criterion, which gives you only a single scalar, and a noisy one at that. With the evidence we can evaluate its derivatives with respect to multiple hyperparameters simultaneously.
I should also emphasise that the above uses of the evidence are very much alive and well at the small N end of things. I strongly disagree with the assertion that the evidence is only a useful creature in the infinite data limit. I now regularly use neural net models that have more parameters than there are data points. The evidence method is excellent at controlling the hyperparameters in these cases.
I think the way forward is not to concentrate on comparison of two discrete models (by the evidence or whatever) but rather to define huge single models spaces that contain alternative models as special cases.
From my understanding the initial data set was split into three parts,
a training set, used during training a test set, used during testing and a validation set.
Could you explain the purpose for the validation set.
At the end of an investigation, we want to have an idea how well our method would predict on truly unseen data. This is the role of the test set. If someone doesn't cheat, they will set aside the test set, never look at the performance on it, then they will finally test their method and publish the results.
What many people do is just divide the data into a training set and a test set, which they look at frequently; they tweak their algorithm until they get good performance on the test set, and then they report this performance. I hope it is obvious that this introduces a bias, such that the reported performance is probably optimistic.
OK - that has motivated splitting off a test set that we never look at until the very end. So what about the validation set? Well, in many learning methods there are hyperparameters and other control parameters that you can tweak, and what you would like to do is adjust them so as to get the best perfroamnce on the test set. But of course that would be cheating. So what we often do is divide the data that we *are* allowed to look at into a training set and a "pretend" test set, called a validation set.
It's not obligatory to divide the data you are given in this way. Because I use Bayesian methods, I can control my hyperparameters and parameters all using a single training set -- I don't need a validation set - - so I can make better use of available data. All the same, for peace of mind I will usually divide the data I am allowed to look at into trainign and validation data just to sanity-check the Bayesian method. Back to top
P(\alpha,|w,D) = P(\alpha,|w)so \alpha only depends on D via w's dependence on D?
alpha -> w -> Data(The w come from a prior defined by w; the data come from a noise model with mean defined by w.)
lightning storms influence probability of electricity failure
if electriicty failure then your beer (in fridge) goes warm; if no elec failure then your beer is cold.
Now you observe the data "elec failure". You might infer "hmm, maybe lightning storm happened".
If Joe then runs in and says, "hey, look, the beer is warm, it *must* have been a lightning storm!" you would slap him about the head and say, "come off it, we already know the power has gone". Observing D doesn't tell you anything more about alpha if you already know w.
First, A note about tar files. The Mosaic browser is stupid about tar files, I find. It gets the file, then doesn't know what to do with it, so it throws it away. If you use Mosaic, you probably have to find another way to transfer the tar files that follow. If you want, you can anonymous ftp to this machine (wol.ra.phy.cam.ac.uk) then cd pub/mackay/www to get to the same location.
My question surrounds the fact that I get quite varied results between the two test sets. In Neil's recent book on Bayesian neural networks he compares performance of his models on the robot data set against the results given in your 1992 Neural Computation paper: A practical bayesian framework..
Can you let me know which test set was used and give me any clues as to why I get better results on the second series?
imagine that you *know* that the true function is y=0 everywhere, and you evaluate your performance on a variety of independent test sets, each containing 200 data points.
The data has noise level 1.0.
What is the *probability distribution* of the test error Sum ( t-y )^2 ?
Answer, a Chi-squared distbn with mean 200 and variance 200, i.e. s.d. 14.
Thus it is perfectly normal and to be expected that the value of the test error will be different by more than 10% between any two given test sets.
The test set used by Radford and me was...... I can't remember, but you can tell by looking in my thesis and comparing with your own results.
Data_files: inputs targets Number_data 11000 Train from line 1 to 150, 300, 600, whatever you want up to 1000 Test_set_1 1001 6000 Test_set_2 6001 11000
Above I have provided this data in the form of the original competition.
Another format
that may be of interest is the training + validation files
that I used for my neural networks. Here, the data has been substantially
preprocessed. It is not clear how much my winning the competition owed to
the preprocessing, and how much to the use of Bayesian methods! (Of
course the Bayesian methods helped refine the preprocessing, so it is
not completely separable. Any work that makes use of this representation
of the data should acknowledge that the preprocessing and glitch
removal were both based on examination of Bayesian models.)
The tar file
contains these files:
[There is a bibliography for Donald MacCrimmon MacKay (DMM).]
I am familiar with a small fraction of my father's ideas and I think many of them were really neat, some of them were a bit simple in retrospect, and some of them (eg his views on science, religion, free will) were unfortunate.
If you want to find out what he thought about measures of information and the status of shannon information as "a" measure of information, then an ideal place to look is his book `information mechanism and meaning', published by MIT press, but probably out of print, so try a library. It's a good book, though it is dated in parts.
It is a book on information theory written when there was still confusion around, and when the Shannon information had not become *the* conventional view of `information'. Just to give you an idea of alternative measures of information (which do make perfectly good sense) --- there is the number-of-independent-degrees-of-freedom in a signal. There is the signal to noise ratio of the signal. He suggests names `metrical information content; selective information content' for the alternative measures of information.
These names are no longer in use because it turned out people stopped being confused without the help of these names. (Except maybe philosophers who are permanently confused.)
While I am writing about him, let me mention his idea that I think is really good. He had a radical view of how the brain works which I think is spot on. He viewed it not as a feedforward device from retina to V1 to V2 etc., but as being fundamentally a generative modelling machine, in which (for example) V1 contains an internal model of the world, which allows it to anticipate what is coming in from the retina, i.e. to predict it. It sends these predictions *back* from V1 to LGN; and the LGN has the role of comparator, taking the difference between what comes in and what was already anticipated, and just sending up some representation of that difference information. This is the elementary event of perception at that level. Something not anticipated by the internal model is detected and communicated up the feedforward pathway. This causes the internal model to adjust itself in such a way as to cancel out the feedforward signal, so that there is no longer a mismatch. Notice two things. This control circuit doesn't need a high quality feedforward bit. If the ff signal just gives an indication of mismatch then the internal model can hunt around under its guidance for a hypothesis that eliminates the mismatch signal. But what *is* needed is a good quality feedback bit; we need a good quality prediction circuit from V1 back to LGN. And FACT: V1 to LGN projects 10 times as many fibres as LGN projects feedforward to V1. This fact is desperately neglected by the orthodox feedforwardist school, who have a pathetic view of the feedback circuit ("it just provides global inhibition all over the LGN", I hav heard said. What hogwash! As if evolution would use ten times as many fibres to just perform global inhibition!) This idea of my father's is present in many of his papers, all of which are rather innaccessible; for example `Towards an information flow model of visual processing'.
Incidentally, the `Helmholtz machine' of Hinton et al works on the principles outlined above (roughly!).