A Hierarchical Dirichlet Language Model.

David J C MacKay and Linda C B Peto

We discuss a hierarchical probabilistic model whose predictions are similar to those of the popular language modelling procedure known as `smoothing'. A number of interesting differences from smoothing emerge. The insights gained from a probabilistic view of this problem point towards new directions for language modelling. The ideas of this paper are also applicable to other problems such as the modelling of triphones in speech, and DNA and protein sequences in molecular biology.

The new algorithm is compared with smoothing on a two-million-word corpus. The methods prove to be about equally accurate, with the hierarchical model using fewer computational resources.

