Comments on 'Maximum Likelihood Estimation of Intrinsic Dimension' by E. Levina and P. Bickel (2004) 
David J.C. MacKay^{+*} and Zoubin Ghahramani^{+} 
+ Gatsby Computational Neuroscience Unit, University College London 
We like Levina and Bickel's paper, because it confirms our longstanding view that the way to solve most inference problems is to write down an appropriate likelihood function.
However, we feel that the authors missed the chance to apply this `likelihood principle' all the way to the final hurdle. We came to this conclusion when we saw their figure 1a, which shows the estimated dimension of a five-dimensional manifold as a function of the number of nearest neighbours k used to obtain the estimate. We were surprised that for small k, there is a strong bias in the estimated dimension, even when there are large quantities of data. We would not expect a maximum likelihood estimator to misbehave in this way. For large k, yes, we would expect biases because of the breakdown of the 'locally-uniform density' assumption. But for small k, we expected that the estimator would be unbiased, albeit with increasing variance as k decreased.
Fortunately there is a simple explanation of the bias, and the problem is easily fixed.
At equations (7) and (8), the authors write down two nice likelihood-derived estimators of dimension m based on the distances {T_j} to the nearest neighbours of a single point on the manifold.
The question then is how to combine the results obtained in the neighbourhood of all n datapoints to give a single inference of the dimension m.
Surely the correct way to proceed is to write down a likelihood function? Sadly, the authors decide to average the n estimators:
\[ \hat{m}_k = \frac{1}{n} \sum_{i=1}^{n} \hat{m}_k(x_i) . \]
It's not hard to come up with a better-motivated way of combining the statistics. Treating the data surrounding each of the n points as if they are independent (which they are not, of course), we can write down a likelihood of exactly the same form we already encountered. This approach motivates an estimator just like (7), with the log-ratios pooled over all n neighbourhoods before inverting:
\[ \hat{m}_R = \left[ \frac{1}{\sum_{i=1}^{n} N(R, x_i)} \sum_{i=1}^{n} \sum_{j=1}^{N(R, x_i)} \log \frac{R}{T_j(x_i)} \right]^{-1} . \]
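To make the contrast between the two combination rules concrete, here is a minimal sketch of the fixed-R case. It is in Python rather than the Octave of our scripts, and the function names are ours, not the paper's; each function takes, for each point, the list of distances to its neighbours within radius R.

```python
import math

def inv_mhat_R(T, R):
    """Inverse of the single-point fixed-radius estimator (eq. 7):
    1/mhat(x) = (1/N) * sum_j log(R / T_j), where T holds the N
    neighbour distances (all < R) from one point x."""
    return sum(math.log(R / t) for t in T) / len(T)

def averaged_mhat_R(all_T, R):
    """The paper's combination: average the n per-point estimates mhat(x_i)."""
    return sum(1.0 / inv_mhat_R(T, R) for T in all_T) / len(all_T)

def pooled_mhat_R(all_T, R):
    """Likelihood-based combination: pool every log-ratio into one sum,
    then invert once (equivalently, invert the average inverse estimate)."""
    total = sum(math.log(R / t) for T in all_T for t in T)
    count = sum(len(T) for T in all_T)
    return count / total
```

Since 1/x is convex, Jensen's inequality guarantees that averaging the per-point estimates always gives a value at least as large as the pooled maximum likelihood estimate, which is the direction of the bias seen at small k.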
Figures 2 and 3 below show the results of experiments involving 2000 uniformly-distributed points on a manifold of dimension 5. We fixed R to a sequence of values and used the above estimator to obtain individual estimates of m as illustrated in figure 3b (for one particular choice of R). The inverses of those estimates are shown in figure 3a. Figure 2 shows, for each value of R (controlling position on the horizontal axis), the average of the m̂s (i.e. the authors' estimator) (yellow) and the inverse of the average of the inverse m̂s (orange), which is our preferred maximum likelihood estimator (just equation (7) again). [2a shows a log scale on the horizontal axis, which is the average number of enclosed points.] Of course, where the data density varies, we can't expect to use a single value of R. So figure 4 shows similar results for the case where we fix k. One way of writing the estimator is
\[ \hat{m}_k = \left[ \frac{1}{n(k-1)} \sum_{i=1}^{n} \sum_{j=1}^{k-1} \log \frac{T_k(x_i)}{T_j(x_i)} \right]^{-1} . \]

The key point is that the maximum likelihood estimator inverts the average of the inverse per-point estimates, rather than averaging the per-point estimates themselves.
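The same point can be made in code for the fixed-k estimator. The sketch below (Python again, a hypothetical stand-in for the Octave in RUNME3.m) computes each point's inverse estimate from its sorted nearest-neighbour distances and contrasts the two ways of combining them.

```python
import math

def inv_mhat_k(T):
    """Inverse of the single-point fixed-k estimator (eq. 8):
    1/mhat(x) = (1/(k-1)) * sum_{j<k} log(T_k / T_j), where T holds
    the sorted distances from x to its k nearest neighbours."""
    k = len(T)
    return sum(math.log(T[-1] / t) for t in T[:-1]) / (k - 1)

def averaged_mhat_k(all_T):
    """The paper's combination: average the n per-point estimates."""
    return sum(1.0 / inv_mhat_k(T) for T in all_T) / len(all_T)

def pooled_mhat_k(all_T):
    """Maximum likelihood combination: pool all n(k-1) log-ratios,
    i.e. invert the average of the inverse estimates."""
    return len(all_T) / sum(inv_mhat_k(T) for T in all_T)
```

At small k, a single neighbourhood whose nearest-neighbour distances happen to lie close to T_k has an inverse estimate near zero, so its m̂ is enormous; averaging lets that one term dominate, whereas the pooled estimator is unaffected by it.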
Fixing the poor behaviour of the estimator at small k simplifies matters arising later in the paper: for example, the fine-tuning of the range of k values (k_{1}, k_{2}) can probably be done away with; instead, there could be a single controlling hyperparameter describing the lengthscale at which the assumptions of the model break down. This hyperparameter could itself be inferred automatically with a Bayesian approach.
Take-home message: never average anything without consulting a likelihood function.
P.S. Octave code used to generate these figures is supplied in RUNME2.m, doplots.m, and RUNME3.m.