PPMZ
High Compression Markov Predictive Coder
PPMZ 2!
Download : PPMZ2 v0.81 . Here's the source code and a WinNT executable. The code is much cleaner than the PPMZ code, so you
might actually have a chance of figuring it out :^) It also runs in much less memory than PPMZ,
so packing 'pic' and such is actually reasonable. As usual, you'll need crblib
to compile.
July 22 2009 - Ian Cadieu has made a managed wrapper for PPMZ2; it includes a test harness. See :
managed PPMZ2 start page ;
code trunk
May 09 2004 - PPMZ2 v0.81 , with cross-correlation measurement. I've also rolled in some of the
improvements from Cynbe ru Taren. Cross-correlation is an excellent way of measuring how similar
two files are. Use PPMZ2 -x to build the model from one file and then use it to compress another.
If the files are totally unrelated, you'll get an entropy >= 8 bits per symbol. If the files are
closely related, you'll get an entropy similar to (or even less than) the self-entropy of that file.
For example, if you take an excerpt from a larger text and use that larger text as the condition, the
"cross-entropy" is lower than the self-entropy. On some of the files of the corpus
(this is PPMZ2 -x) :
- paper1 self entropy : 2.210 bpc
- paper1 conditioned on paper2 : 3.247 bpc
- paper1 conditioned on bib : 4.297 bpc
- paper1 conditioned on pic : 12.263 bpc
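The idea is easy to see even without the full PPM machinery. Here's a minimal sketch of cross-entropy measurement using only a static order-0 byte model (PPMZ2 -x of course uses the whole context model; the function name and smoothing choice here are mine, just for illustration):

```python
import math
from collections import Counter

def cross_entropy_bps(model_data: bytes, test_data: bytes) -> float:
    """Bits per symbol of test_data under a static order-0 model
    built from model_data (Laplace-smoothed so no byte gets zero
    probability)."""
    counts = Counter(model_data)
    total = len(model_data) + 256  # +1 smoothing over the byte alphabet
    bits = 0.0
    for b in test_data:
        p = (counts.get(b, 0) + 1) / total
        bits += -math.log2(p)
    return bits / len(test_data)

english = b"the quick brown fox jumps over the lazy dog " * 50
excerpt = b"the lazy dog jumps over the quick brown fox"
noise   = bytes(range(256)) * 8

# related text codes in far fewer bits per symbol than unrelated data
assert cross_entropy_bps(english, excerpt) < cross_entropy_bps(english, noise)
```

With a real PPM model the separation is much sharper, since unrelated files miss on long contexts as well as on symbol frequencies.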
Another interesting metric is to do the self-compression with pre-condition (the -c option). You can
then measure similarity by taking the self-entropy divided by the conditioned entropy (then I do a -1 and x100% to make it
more clear). For example :
- paper1 self entropy : 2.210 bpc
- paper1 self conditioned on paper2 : 1.992 bpc
- similarity = (2.210/1.992 - 1)*100% = 10.944 %
- paper1 self conditioned on pic : 2.313 bpc
- similarity = (2.210/2.313 - 1)*100% = -4.453 %
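The arithmetic of that similarity score is trivial; a minimal sketch (the function name is mine, not part of PPMZ2):

```python
def similarity_pct(self_bpc: float, conditioned_bpc: float) -> float:
    # self-entropy divided by the conditioned entropy, minus 1, as a
    # percent; positive means the pre-condition helped, negative means
    # it hurt
    return (self_bpc / conditioned_bpc - 1.0) * 100.0

print(round(similarity_pct(2.210, 1.992), 3))  # paper2 condition -> 10.944
print(round(similarity_pct(2.210, 2.313), 3))  # pic condition -> -4.453
```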
This is nice because it completely normalizes out the self-compressability,
but it's a much more transient measure than the -x cross-correlation mode where the
model isn't updated from self at all.
April 2002 : Hannu Peltola and Jorma Tarhio have produced a port of
PPMZ for Linux ; their
work and web page are very nice, even if you don't care for Linux. They've
also generated results on the Canterbury corpus.
May 2004 - Cynbe ru Taren is porting PPMZ2 to Linux. He's cleaning up and optimizing the code,
and it looks nice. The PPMZ2 code base was a research project designed to easily explore new ideas in
PPM, and I wasn't a great coder at the time, so the improvements are welcome.
PPMZ2 for Linux . Also some little bug fixes to PPMZ2 : there was a bug in the "text mode" order -1 coder, and
the MTF I was doing was actually a "rotate".
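To illustrate that bug: a correct MTF update pulls the accessed symbol to the front and shifts only the entries that were ahead of it, while a rotate drags the whole table around. A minimal sketch (one plausible reading of the bug; the helper names are mine):

```python
def mtf_update(table, sym):
    # correct move-to-front: sym goes to index 0, only the entries
    # that were in front of it slide down one slot
    i = table.index(sym)
    return [sym] + table[:i] + table[i+1:]

def rotate_update(table, sym):
    # the accidental "rotate": cyclically shift the whole table so
    # sym lands at the front; entries behind it get dragged along too
    i = table.index(sym)
    return table[i:] + table[:i]

t = ['a', 'b', 'c', 'd']
print(mtf_update(t, 'c'))     # ['c', 'a', 'b', 'd']
print(rotate_update(t, 'c'))  # ['c', 'd', 'a', 'b']
```

The entries behind the accessed symbol keep their relative recency in a true MTF; the rotate scrambles that, which is why it costs compression.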
June 1999 - I'm now working on PPMZ version 2. Some exciting things are in the works. As usual, you can follow the
developments on comp.compression.
Results for ppmz2 v0.7 on the Calgary Corpus :
bib : 111261 -> 23873 = 1.717 bpc
book1 : 768771 -> 210952 = 2.195 bpc
book2 : 610856 -> 140932 = 1.846 bpc
geo (de-interleaved) : 102404 -> 52446 = 4.097 bpc
news : 377109 -> 103951 = 2.205 bpc
obj1 : 21504 -> 9841 = 3.661 bpc
obj2 : 246814 -> 69137 = 2.241 bpc
paper1 : 53161 -> 14711 = 2.214 bpc
paper2 : 82199 -> 22449 = 2.185 bpc
pic (transposed) : 513216 -> 30814 = 0.480 bpc
progc : 39611 -> 11178 = 2.258 bpc
progl : 71646 -> 12938 = 1.445 bpc
progp : 49379 -> 8948 = 1.450 bpc
trans : 93695 -> 14224 = 1.214 bpc
total : 726400
average : 2.086 bpc
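The bpc figures above are just compressed bits over raw bytes; for example, checking bib and pic:

```python
def bpc(raw_bytes: int, compressed_bytes: int) -> float:
    # bits per character = compressed size in bits / raw size in bytes
    return compressed_bytes * 8 / raw_bytes

print(round(bpc(111261, 23873), 3))  # bib -> 1.717
print(round(bpc(513216, 30814), 3))  # pic (transposed) -> 0.48
```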
The performance on medium sized "text-like" files here is by far the
best of any "pure" compressor (eg. not counting the switching scheme of Volf).
Medium sized text-like files are news, bib, paper2, trans, progl,
book2, etc. On book1 our compression is hurt by the LRU limiting the number of
contexts (by an unknown amount; it is unlikely we would beat CTW's rate of 2.158 bpc).
On many of the "medium text-like" files (eg. progl, trans, news) we even beat the
best switching schemes of Volf!! This is at first quite an astounding result, because
our computational complexity is orders of magnitude lower. However, we should perhaps
not be surprised. Volf's best switching/mixing schemes mix between CTW and something
like PPM* or LZ77. In both cases, he is simply trying to mix the good low-order
performance of CTW with a good high-order coder. We, on the other hand, have got such a
hybrid automatically by using infinite-length PPM and LOE. If you like, the PPMDet and LOE
scheme can be seen as a way of using the local character to switch algorithms (in my case,
between PPMDet and PPMZ-finite-order), instead of weighting all possible switches by
their performance.
Note that on the very small files (obj1,progc) we're worse than the old ppmz, but we
don't really care about that.
Note that this means I can now send the corpus self-extracting (separate files, with a totally
generic 1->1 coder) in about 759,000 bytes. Hey Leonid - you owe me fifty bucks :^)
A run-down of things we do that make us so good :
- SEE (Secondary Escape Estimation) using 22 bits of data, and then weighting by entropy
between various orders (9 and 15 bit orders). PPMZ 2 now uses 2 bits for the number of characters
in the parent context (thanks to Malcolm Taylor) and that helps a bit. I also use the *post-exclusion*
statistics to compute the escape.
- LOE (Local Order Estimation) every time we step down a context. eg. we start at infinity and
then keep LOE'ing down to some best lower order until we hit order (-1). I use the excluded stats
and the SEE escapes to make the LOE decision, and I use the MPS probability with a penalty for non-deterministic
contexts as the LOE metric (this is the same thing done in PPMZ and was one of the primary innovations there).
In v0.5 I do LOE between orders 16, 12, 8, and 5 through 0.
- PPMdet : infinite length deterministic context modeller. This is why our performance on trans is
so much better than ever. I now have a custom SEE module coupled to PPMdet. The basic idea here is that
we track the deterministic "front" of a PPM* model. However, there are some heuristic modifications that
improve compression : I don't always take the longest context matched, but instead take ones which have high
likelihoods of succeeding (eg. high match counts or low minimum match lengths). In v0.5 I use PPMdet with
a minimum match length of 24.
- The source code is all brand new (except for some stuff in crblib) and I'm using a more elegant
Trie structure to do my context lookups. The Trie is not a win if you're doing old-school PPMD, but if you
need to get all the contexts to do LOE anyway, you might as well walk bottom to top (instead of top to bottom).
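To sketch the LOE piece of the scheme above: this is not the PPMZ2 code, just an illustration of the decision rule under my reading of it, with made-up names and a made-up penalty constant. Each candidate order offers its most-probable-symbol (MPS) probability after the SEE-estimated escape is folded in, and we code from the order with the best penalized confidence:

```python
# Hypothetical sketch of the LOE decision: pick the coding order whose
# MPS confidence, after folding in the SEE escape estimate and
# penalizing non-deterministic contexts, is highest.

NONDET_PENALTY = 0.5  # illustrative constant, not PPMZ2's actual value

def loe_pick_order(candidates):
    """candidates: list of (order, mps_prob, see_escape_prob,
    is_deterministic) tuples, highest order first.
    Returns the order to code from."""
    best_order, best_score = None, -1.0
    for order, mps_prob, escape_prob, deterministic in candidates:
        # probability of both staying in this context (no escape)
        # and then hitting the most probable symbol
        score = (1.0 - escape_prob) * mps_prob
        if not deterministic:
            score *= NONDET_PENALTY
        if score > best_score:
            best_order, best_score = order, score
    return best_order

# a long deterministic context beats shorter, noisier ones here
cands = [(16, 1.0, 0.30, True), (8, 0.60, 0.10, False), (5, 0.40, 0.05, False)]
print(loe_pick_order(cands))  # 16
```

The real decision also uses the excluded statistics, but the shape is the same: SEE supplies the escape estimate, and the deterministic/non-deterministic distinction tilts the choice toward confident long contexts.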
PPMZ version 9.1 released March 1998
See release notes below for more info.
A report on the files of the Calgary Corpus
| PPMZ v9.1 results on the Calgary Corpus |

| file    | raw size | compressed | bits per byte | tuned (best) |
|---------|----------|------------|---------------|--------------|
| bib     |   111261 |      24256 |         1.744 |        1.743 |
| book1   |   768771 |     212733 |         2.213 |        2.210 |
| book2   |   610856 |     143075 |         1.873 |        1.869 |
| geo     |   102404 |      51635 |         4.033 |        3.874 |
| news    |   377109 |     105725 |         2.242 |        2.242 |
| obj1    |    21504 |       9854 |         3.665 |        3.644 |
| obj2    |   246814 |      68804 |         2.230 |        2.215 |
| paper1  |    53161 |      14772 |         2.222 |        2.221 |
| paper2  |    82199 |      22749 |         2.214 |        2.210 |
| pic     |   513216 |      50685 |         0.790 |        0.787 |
| progc   |    39611 |      11180 |         2.257 |        2.250 |
| progl   |    71646 |      13185 |         1.472 |        1.467 |
| progp   |    49379 |       9122 |         1.477 |        1.471 |
| trans   |    93695 |      14508 |         1.238 |        1.231 |
| average |          |            |         2.119 |        2.101 |
For more information...
PPM is a classic algorithm that many luminaries have done extensive work
on. My PPMZ represents only a few small steps away from their body of
work. The major contributors to PPM have been: Alistair Moffat, John
Cleary, Ian Witten, and Bill Teahan.
Of most immediate interest is Bill Teahan's PPMD+ , which was the starting point for PPMZ.
The fact that something better was possible was suggested by Cleary
and Teahan's paper on escape estimation .
PPMdet was inspired by the PPM* coder by Cleary, Witten, and Teahan :
PPM* paper
and the LZ77 context index structures of Peter Fenwick.
Most recently, a weighting method has been inspired by the work
of the CTW algorithm.
As always, a good summary of the original PPM algorithms can be found in:
T. Bell, J. Cleary, and I. Witten, Text Compression, Prentice Hall, 1990
Charles Bloom / cb at my domain Send Me Email