Hierarchical Dirichlet Language Model

This work is based on the Hierarchical Dirichlet Language Model by David J. C. MacKay and Linda C. B. Peto.

Context Redundancy

I made a slight modification to the Dirichlet language model in which contexts (in a bigram model, the context is the preceding word) are clustered, so that contexts in the same cluster share one predictive distribution. A Bayesian method is used to discover the context clusters.
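
To make the idea of shared contexts concrete, here is a minimal sketch of a Dirichlet-smoothed bigram predictor in which clustered contexts pool their counts. It is an illustration under stated assumptions, not the code used for this work: the cluster assignments are taken as given (the Bayesian procedure that discovers them is not shown), the prior mean m is assumed uniform, alpha is fixed by hand, and the names train_counts, predictive_prob and cluster_of are hypothetical.

```python
from collections import Counter, defaultdict


def train_counts(tokens, cluster_of):
    """Count next-word occurrences per context cluster.

    cluster_of maps a context word to a cluster label (assumed given);
    a word without an entry forms its own singleton cluster.
    """
    counts = defaultdict(Counter)
    for prev, word in zip(tokens, tokens[1:]):
        counts[cluster_of.get(prev, prev)][word] += 1
    return counts


def predictive_prob(word, prev, counts, cluster_of, m, alpha):
    """Dirichlet predictive probability of `word` after `prev`.

    Computes (F + alpha * m[word]) / (F_total + alpha), where the counts F
    are pooled over every context in the cluster that `prev` belongs to.
    """
    row = counts.get(cluster_of.get(prev, prev), Counter())
    total = sum(row.values())
    return (row[word] + alpha * m[word]) / (total + alpha)


# Toy data: "the" and "a" are assumed to belong to one cluster, "DET".
tokens = "the cat sat on a mat the dog sat on a rug".split()
cluster_of = {"the": "DET", "a": "DET"}
vocab = sorted(set(tokens))
m = {w: 1.0 / len(vocab) for w in vocab}  # uniform prior mean (assumption)

counts = train_counts(tokens, cluster_of)
print(predictive_prob("cat", "a", counts, cluster_of, m, alpha=1.0))
```

Because "the" and "a" share one row of counts, "cat" receives the same probability after either word, and the model stores one context row fewer than an unclustered bigram model would.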

List of clustered contexts for a model trained on 6 MB of text from Project Gutenberg.

This method reduces the size of the model and typically decreases the perplexity on unseen text.
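
For reference, perplexity on unseen text is the exponential of the average negative log-probability the model assigns to each word. The sketch below shows one way to compute it for any bigram predictor; the function name perplexity and the prob(word, prev) callback signature are assumptions made for illustration, and the callback must return a strictly positive probability for every pair.

```python
import math


def perplexity(tokens, prob):
    """Per-word perplexity of a bigram model on a token sequence.

    prob(word, prev) is a hypothetical callback returning P(word | prev) > 0.
    The first token is used only as context, so it is not scored.
    """
    pairs = list(zip(tokens, tokens[1:]))
    if not pairs:
        raise ValueError("need at least two tokens")
    log_sum = sum(math.log(prob(word, prev)) for prev, word in pairs)
    return math.exp(-log_sum / len(pairs))
```

A model that assigns every word probability 1/V has perplexity V, so lower values mean the model is less surprised by the held-out text.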

This work is described in more detail in my thesis.

