Compression algorithms in Python

All files pointed to by this page are free software © David MacKay December 2005. License: GPL

This page offers a library of compression algorithms in Python.

Several of these packages use the doctest module to offer automated testing. For example, running

   python IntegerCodes.py 
causes 12 tests to be executed, all of them defined in the documentation of the individual functions. The same tests are also run if you open IntegerCodes.py in emacs and type
   C-c C-c 
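
Doctest works by executing the examples embedded in each function's docstring and checking that they print what the docstring says they print. As a sketch of the pattern (the function square here is just a hypothetical stand-in; the real tests live in the docstrings of IntegerCodes.py):

    def square(x):
        """Return x squared.

        >>> square(4)
        16
        """
        return x * x

    if __name__ == '__main__':
        import doctest
        doctest.testmod()   # executes every '>>>' example in the docstrings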

Contents
General utilities IntegerCodes.py
General compression algorithms Huffman algorithm
Probability-based compressors for the Bent Coin
Run-length encoder using Huffman
Block coder (using Huffman)
Arithmetic code
Compressors whose probability distributions are implicit rather than explicit
Golomb coder (Run-length encoder)
IRL (General purpose Run-length encoder, works quite well for all redundant sources)
Position code
Lempel-Ziv

General utilities

IntegerCodes.py

This module supplies five functions that implement three codes for integers.
(a) regular binary - encode: dec_to_bin(n,d) ; decode: bin_to_dec(cl,d,0)
(b) headless binary - encode: dec_to_headless(n) ; decode: bin_to_dec(cl,d,1)
(c) C_alpha(n) - encode: encoded_alpha(n) ; decode: get_alpha_integer(cl)
C_alpha(n) is a self-delimiting code for integers. Neither regular binary nor headless binary is a self-delimiting code, hence their decoders must be supplied with a parameter d that tells how many bits to read.

See Information Theory, Inference, and Learning Algorithms. (Ch 7: Codes for Integers)
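
For concreteness, here is a minimal sketch of the three encoders (illustrative only, not the module's own code; it assumes the Ch 7 conventions, under which C_alpha(n) is the standard binary form of n preceded by one 0 for every bit after the first):

    def dec_to_bin(n, d):
        # (a) regular binary: n in exactly d bits
        return format(n, 'b').zfill(d)

    def dec_to_headless(n):
        # (b) headless binary: the standard binary form of n with its
        # leading 1 chopped off (every such form starts with a 1)
        return format(n, 'b')[1:]

    def encoded_alpha(n):
        # (c) C_alpha: self-delimiting, because the run of leading 0s
        # tells the decoder how many further bits to read
        b = format(n, 'b')
        return '0' * (len(b) - 1) + b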

This package uses the doctest module to test that it is functioning correctly.

General compression algorithms

Huffman3.py

The Huffman3 package provides a Huffman algorithm, spitting out an optimal binary symbol code for a given set of probabilities. It also returns two objects that can be used for encoding and decoding with the functions encode and decode. The two objects are

  1. a list of "nodes", one for each symbol; this list is used for encoding; and
  2. a tree of "internalnodes", accessed via the root of the tree, used for decoding.
The package can be used in many ways. If you just want to quickly find the Huffman code for a set of relative frequencies, you can run Huffman3.py from a shell like this:

~/python/compression/huffman$ python Huffman3.py 
Reading in frequencies (and optional symbol names) from stdin
50    <<<|
25    <<<| These four lines are the user's input, terminated by Control-D
12    <<<|
13    <<<|
#Symbol Count   Codeword
1       (50)    1
2       (25)    01
3       (12)    000
4       (13)    001

Alternatively, a slightly fancier usage includes symbol names in each line of stdin:


~/python/compression/huffman$ echo -e " 50 a \n 25 b \n 12 c \n 13 d" > ExampleCounts
~/python/compression/huffman$ python Huffman3.py < ExampleCounts 
Reading in frequencies (and optional symbol names) from stdin
#Symbol Count   Codeword
a       (50)    1
b       (25)    01
c       (12)    000
d       (13)    001

Functions supplied by the package:

iterate(code)
runs the Huffman algorithm on a list of "nodes". It returns a pointer to the root of a new tree of "internalnodes".
encode(sourcelist,code)
encodes a list of source symbols.
decode(string,root)
decodes a binary string into a list.
huffman(counts):
Takes in a list of counts. Runs the Huffman algorithm 'iterate', and optionally sorts and prints the resulting tree.
makenodes(probs):
Creates a list of nodes ready for the Huffman algorithm 'iterate'. probs should be a list of (symbol, probability) pairs.

Examples illustrating use of this package:

easytest():
simple example of making a code, encoding and decoding
oldtest():
this is run when the program is run from a shell.

See also the file Example.py for a small python program that uses this package, and RLE.py for a bigger example.
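
In outline, a program that uses the package looks something like this (a sketch based on the function descriptions above; Example.py is the authoritative version):

    from Huffman3 import makenodes, iterate, encode, decode

    probs = [('a', 0.5), ('b', 0.25), ('c', 0.125), ('d', 0.125)]
    symbols = makenodes(probs)   # one "node" per source symbol
    root = iterate(symbols)      # run the Huffman algorithm
    zipped = encode(['a', 'b', 'a', 'd'], symbols)
    assert decode(zipped, root) == ['a', 'b', 'a', 'd']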

Example.py

This is a compression algorithm for compressing files containing the 4 symbols {a,b,c,d}. The assumed probabilities are {0.5,0.25,0.125,0.125}.

This example uses the Huffman package to create its Huffman code and to handle encoding and decoding.

If you run this package from within emacs with C-c C-c, it runs a test called easytest().

The package can also be used directly from a shell to compress or uncompress data, reading from stdin and writing to stdout.

The default behaviour is compression; to get uncompression, give an additional argument (for example --uncompress).

Usage: at shell prompt - compression first, then uncompression


    $ echo -n "aaaabbcd" > Example.txt
    $ python Example.py              < Example.txt > Example.zip
    $ python Example.py --uncompress < Example.zip > Example.unc
    $ diff Example.unc Example.txt

HuffmanL.py

The alternate package HuffmanL.py returns a simpler structure for decoding. Instead of using the concept of an "internalnode" class, the Huffman algorithm returns a binary list of binary lists of binary lists...

For example, the decoder for the code

#Symbol Count           Codeword
0       (10      )      00
1       (9       )      111
2       (8       )      101
3       (7       )      100
4       (6       )      011
5       (5       )      1101
6       (4       )      1100
7       (3       )      0101
8       (2       )      01001
9       (1       )      01000
can be represented by the simple list of lists:
A = [[0, [[[9, 8], 7], 4]], [[3, 2], [[6, 5], 1]]]
Decoding can then be done like this:

    def decode(string, A):
        # walk the nested lists one bit at a time
        output = []
        p = A
        for c in string:
            p = p[int(c)]                 # branch on '0' or '1'
            if not isinstance(p, list):   # reached a leaf symbol
                output.append(p)
                p = A                     # restart from the root
        return output

Compression algorithms for the Bent Coin File

RLE.py

This is a run-length encoding compression algorithm that uses the Huffman algorithm to define a code for run lengths.

The package contains the following functions:

findprobs(f=0.01,N=69):
Find probabilities of all the events (a sketch follows at the end of this section):

    - No 0s followed by a 1;
    - 1 0  followed by a 1;
    - 2 0s  followed by a 1;
    - ...
    - (N-1) 0s followed by a 1;
    - N 0s
RLencode(string,symbols,N):
Reads in a string of 0s and 1s. Creates a list of run lengths and special symbols 'Z', which denote 'a lot of Zeroes'. ('A lot' is N.) Sends this list to the general-purpose Huffman encoder defined by the nodes in the list "symbols".
RLdecode(string,root,N):
Decode a binary string into runs, then return appropriate 0s and 1s
compress_it( inputfile, outputfile ):
Make Huffman code using runlengths, and compress
uncompress_it( inputfile, outputfile ):
Make Huffman code using runlengths, and uncompress

There are also three test functions.

If RLE.py is run from a terminal, it invokes compress_it (if there are zero arguments) or uncompress_it (if there is one argument), in both cases reading from stdin.
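
The probability model behind findprobs is the bent coin: a run of r 0s followed by a 1 has probability (1-f)^r f, and the event of N 0s with no 1 has probability (1-f)^N. A sketch of how such a function could be written (illustrative, not the file's own code):

    def findprobs(f=0.01, N=69):
        # (symbol, probability) pairs for the N+1 run-length events
        probs = [(str(r), (1 - f)**r * f) for r in range(N)]  # r 0s then a 1
        probs.append(('Z', (1 - f)**N))                       # N 0s, no 1
        return probs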

block.py

This is a Block compression algorithm that uses the Huffman algorithm.

This simple block compressor assumes that the source file is an exact multiple of the block length. The encoding does not itself delimit the size of the file, so the decoder needs to know where the end of the compressed file is. Thus we must either ensure the decoder knows the original file's length, or use a special end-of-file character. A superior compressor would first encode the source file's length at the head of the compressed file; then the decoder would be able to stop at the right point and correct any truncation errors. I'll fix this in block2.py.

The package contains the following functions:

findprobs(f=0.01,N=6):

    Find probabilities of all the events (a sketch follows at the end of this section)
    000000
    000001
     ...
    111111
    <-N ->
Bencode(string,symbols,N):
Reads in a string of 0s and 1s, forms blocks, and encodes with Huffman.
Bdecode(string,root,N):
Decode a binary string into blocks, then return appropriate 0s and 1s
compress_it( inputfile, outputfile ):
Make Huffman code, and compress
uncompress_it( inputfile, outputfile ):
Make Huffman code, and uncompress

There are also three test functions. If block.py is run from a terminal, it invokes compress_it (if there are zero arguments) or uncompress_it (if there is one argument), in both cases reading from stdin.
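
Under the same bent-coin model, a block's probability is f raised to its number of 1s times (1-f) raised to its number of 0s. A sketch (illustrative, not the file's own code):

    from itertools import product

    def findprobs(f=0.01, N=6):
        # (block, probability) pairs for all 2**N blocks of length N
        probs = []
        for bits in product('01', repeat=N):
            block = ''.join(bits)
            k = block.count('1')
            probs.append((block, f**k * (1 - f)**(N - k)))
        return probs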

ac_encode.py

Arithmetic coding -- the king of compressors

For another arithmetic coder by Philip, including file input and output functions, see Philip's page.
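
The heart of an arithmetic coder is interval narrowing: each symbol shrinks the current interval in proportion to its probability, and any binary fraction inside the final interval identifies the whole string. A floating-point toy illustration for a bent coin (suitable for short strings only; a practical coder like ac_encode.py has to emit bits incrementally and sidestep rounding problems):

    def narrow_interval(s, f=0.1):
        # toy sketch for a bent coin with P('1') = f; the final interval's
        # width equals the string's probability, so about -log2(hi - lo)
        # bits suffice to point into it
        lo, hi = 0.0, 1.0
        for c in s:
            width = hi - lo
            if c == '0':
                hi = lo + width * (1 - f)   # '0' takes the lower (1-f) share
            else:
                lo = lo + width * (1 - f)   # '1' takes the upper f share
        return lo, hi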

Compression algorithms with IMPLICIT rather than EXPLICIT probability models

golomb.py

A 'Golomb' compressor for a sparse file with many 0s and few 1s.
It has one parameter m, which defines, via M=2^m, the typical number of consecutive 0s expected.
The Golomb code can be viewed as an approximate arithmetic code.
The Golomb code with parameter m is also identical to the Huffman run-length code for a particular value of P(1).
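
For the power-of-two case M = 2^m this is the Rice code, which is easy to sketch (illustrative; golomb.py's exact conventions may differ):

    def golomb_encode(r, m):
        # run of r 0s: quotient r // 2**m in unary (q 1s then a 0),
        # followed by the remainder in m fixed bits (assumes m >= 1)
        q, rem = divmod(r, 1 << m)
        return '1' * q + '0' + format(rem, 'b').zfill(m)

    def golomb_decode(bits, m):
        # inverse of golomb_encode for a single codeword
        q = bits.index('0')                 # length of the unary part
        return q * (1 << m) + int(bits[q + 1:], 2)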

IRL.py

Run-length compressor, "without explicit probabilistic model", for redundant files.

Compresses lengths of runs of 1s or 0s into a uniquely decodeable representation, C_alpha, as described in Information Theory, Inference, and Learning Algorithms. (Ch 7: Codes for Integers) Runs of 0s and 1s are treated identically.

So this compressor is expected to perform 'quite well' for a wide range of redundant sources. It's not only quite good for files like '000000000000000100000000000000000001010000000' and '1111111111111111011111111111111111111111001111111' but also especially good for files like '11111111111111111111000000000000000000000', which none of the other 'Bent Coin' compressors handle well.

usage I: This package can be directly executed as follows:


 COMPRESS:
 python IRL.py < BentCoinFile > tmp
 UNCOMPRESS: 
 python IRL.py --uncompress < tmp > tmp2

If this package is executed by C-c C-c from within emacs, it runs the function "test" which performs a few small encodings and decodings.

usage II: see IRLc.py and IRLd.py for examples of programs that use this package.

More details

Encode one string's runs using the code C_alpha. If self-delimiting, prepend the number of runs, so the decoder will know when to stop reading. The first run's first character is sent in the clear; all subsequent runs alternate between 0 and 1.

This compressor can be used in two ways: the 'smart' way and the 'nonsmart' way. The 'smart' way generates slightly longer compressed files that are fully self-delimiting. So you can concatenate the compressed files together end to end, and the decoder can read along, uncompressing each file separately. This smartness is achieved by prepending the Number Of Runs to the compressed string.

The package includes functions encode and decode, which produce the compressed file and optionally prepend an encoded integer to its head, thus making it self-delimiting. decode calls either smartdecode (which carefully reads just the right number of bits from the compressed list in order to recover one file) or simpledecode (which reads until the end of the compressed list is reached, and which thus requires either that the receiver know the file length, or that the file system have a special end-of-file character).
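
In code, the encoding step just described might look like this (illustrative; it reuses the C_alpha construction from IntegerCodes.py and omits the optional number-of-runs header):

    import re

    def irl_encode(s):
        out = [s[0]]                             # first character in the clear
        for match in re.finditer(r'0+|1+', s):   # maximal runs of 0s or 1s
            b = format(len(match.group()), 'b')
            out.append('0' * (len(b) - 1) + b)   # C_alpha(run length)
        return ''.join(out)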

position.py

Position codes


In position.py, the decoder must know the compressed file's length.
This position code illustrates a way to read command-line arguments.
 usage examples: 
  position.py -bits 5 < /home/mackay/compress/BentCoinFile > encoded.pos5
  position.py -bits 5  -decode 1 < encoded.pos5  > recovered5
  position.py -bits 8 < /home/mackay/compress/BentCoinFile > encoded.pos8
  position.py -bits 8  -decode 1 < encoded.pos8  > recovered8
  position.py -bits 6 < /home/mackay/compress/BentCoinFile > encoded.pos6
  position.py -bits 6  -decode 1 < encoded.pos6  > recovered6
  position.py -bits 7 < /home/mackay/compress/BentCoinFile > encoded.pos7
  position.py -bits 7  -decode 1 < encoded.pos7  > recovered7
  position.py -multipleblocks 0 -bits 14 < /home/mackay/compress/BentCoinFile > encoded.pos
  position.py -multipleblocks 0 -bits 14 -decode 1 < encoded.pos  > recovered
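
The idea of a position code is to record where the 1s are, rather than sending every bit. A purely hypothetical sketch (position.py's own conventions may well differ): give each block length 2^bits - 1, send the 1-based position of each 1 as a fixed-width word, and use the all-zero word as a terminator.

    def encode_block(block, bits=5):
        # hypothetical: the block has length 2**bits - 1, so every 1-based
        # position fits in `bits` bits and the all-zero word is free to
        # mean "no more 1s in this block"
        words = [format(i + 1, 'b').zfill(bits)
                 for i, c in enumerate(block) if c == '1']
        words.append('0' * bits)
        return ''.join(words)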


In position2.py, the encoded file is self-delimiting: the decoder does not need to know the compressed file's length.

LZ.py

Lempel-Ziv compression

LZ.py implements a simple Lempel-Ziv algorithm, as described in MacKay (2003), Chapter 6, page 119. Each chunk conveys a pointer to the dictionary and one bit.
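
The parsing stage can be sketched as follows (illustrative; LZ.py's bookkeeping may differ). The source is split into phrases never seen before; each new phrase is a known phrase plus one extra bit, so a (pointer, bit) pair identifies it:

    def lz_parse(source):
        dictionary = {'': 0}                # phrase -> index
        pairs, phrase = [], ''
        for bit in source:
            if phrase + bit in dictionary:
                phrase += bit               # keep extending a known phrase
            else:
                pairs.append((dictionary[phrase], bit))
                dictionary[phrase + bit] = len(dictionary)
                phrase = ''
        if phrase:                          # flush an incomplete final phrase
            pairs.append((dictionary[phrase], ''))
        return pairs

For example, lz_parse('1011010100010') gives [(0, '1'), (0, '0'), (1, '1'), (2, '1'), (4, '0'), (2, '0'), (1, '0')], reproducing the parsing of the example string on page 119.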

LZ2.py

Lempel-Ziv compression (2)

LZ2.py implements another version of the Lempel-Ziv algorithm, in which each chunk conveys a pointer into the source file and the length of the match.
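
A toy sketch of this pointer-plus-length idea (illustrative; LZ2.py's exact format may differ): at each step, find the longest match between the upcoming text and the portion already seen, and emit a (position, length, next character) triple.

    def lz77_parse(s):
        i, triples = 0, []
        while i < len(s):
            best_pos, best_len = 0, 0
            for j in range(i):              # candidate match start points
                k = 0
                while (i + k < len(s) - 1 and j + k < i
                       and s[j + k] == s[i + k]):
                    k += 1
                if k > best_len:
                    best_pos, best_len = j, k
            triples.append((best_pos, best_len, s[i + best_len]))
            i += best_len + 1               # skip past the match + literal
        return triples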