We had an issue at work. Our dev team is managing a PHP server, which sends and receives binary KET files of constant size (say, 5 KB) to and from production servers. I was rewriting some of the production servers and I noticed that sometimes I was getting TLE files instead of KET files, and that they were 11 bytes too big.
I checked it out. Apparently the dev team was sending the KET files for compression (via PHP’s gzcompress). They were very much surprised to learn that the output files were larger than the original KETs. “Apparently”, they sighed, “gzip is not very good compression. Let’s find another one”.
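Those 11 extra bytes aren’t a fluke – they’re the fixed framing that the zlib format (which gzcompress produces) wraps around data it can’t shrink. A quick sketch, using Python’s zlib module as a stand-in for PHP’s gzcompress (both emit the same format), with random bytes standing in for an incompressible KET file:

```python
import os
import zlib

# Random bytes stand in for an incompressible 5 KB KET file.
data = os.urandom(5 * 1024)
packed = zlib.compress(data)

# deflate keeps incompressible input verbatim inside a "stored" block,
# then adds framing: a 2-byte zlib header, a 4-byte Adler-32 checksum,
# and a few bytes of stored-block bookkeeping.
print(len(packed) - len(data))  # a small positive overhead, not a saving
```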
Which is the wrong conclusion. They have a meagre chance of finding a compression engine which wouldn’t expand the files. And this is a good excuse to explain this to the rest of the world.
An introduction to data compression
Pigeonhole principle → no general compression scheme possible.
Range limitation → Data redundancy → range-specific compression possible.
Prima facie, compression is an impossible task.
Let’s suppose, for simplicity, that you’re only ever interested in 100-bit files. What should compression do with those files? One would expect a compression engine to take those files, and transform them into something else, which would be shorter. This is “zipping” the files. But you also expect to be able to “unzip”, that is, expand a compressed file back to its full 100-bit glory. Well, suppose we have a gloriously efficient algorithm, which shrinks all files in half (to 50 bits). This means that there are only 2^50 different outputs. Now, 2^50 is a huge number. But it’s negligibly small compared to the vast 2^100 different files we started with. It means that our glorious compression algorithm would yield identical results on very different inputs.
Let’s put it differently. This is a mathematical principle called “the pigeonhole principle”, and it’s really simple. One cannot put 9 pigeons into 8 pigeonholes. In our case, if you want to handle 2^100 possible inputs, you need 2^100 distinct outputs. So, on average (and this is a point we’ll return to later) – to compress 100 bits you need 100 bits. Tough luck.
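The counting behind that claim is easy to check: the strings strictly shorter than n bits number 2^0 + 2^1 + … + 2^(n-1) = 2^n − 1, one fewer than the 2^n inputs, so any lossless scheme must leave at least one n-bit file unshrunk. A sketch:

```python
# Counting argument behind the pigeonhole principle.
n = 100
inputs = 2 ** n                          # distinct 100-bit files
shorter = sum(2 ** k for k in range(n))  # all strings of length 0..n-1

print(shorter == 2 ** n - 1)  # True: the geometric series sums to 2^n - 1
print(inputs - shorter)       # 1: at least one file cannot get shorter
```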
So. How is compression at all possible? There are three openings through which compression can enter:
- Lossy coding
- Domain specific coding
- Probabilistic coding
But take note – no matter what method you use – it will never ever be completely general. All compression is specific.
The first issue is that some information loss is, at times, allowed. Take a 5-megapixel image. Do you really care what the specific value of every pixel is? You don’t. If I take your photo, and change a few pixels by a little bit, you’ll never notice.
In fact, I can even change the entire image by a little amount, and you’ll never know. This means that the data has redundancy. Much of the information isn’t really important. This allows us to take a few thousand pigeons, and stuff them into the same pigeonhole, because the difference between those specific pigeons is not important to us. For exmaple, tihs snetnence has a few wrods mxised up. Yuo can sitll raed it. Tihs maens that an Egnlish text has redanduncy. A lot of redundancy. If we are allowed to throw away the non-important information, we can compress data.
But an English text is not an image. Garbled grammar is annoying. Even so, English is very redundant. How so? Take 7-letter words. There are 26^7 possible letter combinations, which is more than 8 billion. The complete Oxford English Dictionary has only 600,000 definitions – and that includes longer words. This means that we don’t really use all letter combinations. We use a meagre percentage of them. The same is true of all digital media. There are fewer than a googol particles in the entire universe. This means that there’s no use even contemplating data of size 2^100. Even if you could turn the entire universe into a computer and use every single atom as a digital bit, you wouldn’t get anywhere near needing 2^100 bits. So, of all possible digital files, we are (and always will be) using an unimaginably negligible fraction. For all practical purposes, redundancy is limitless.
Notice what this means already: how many films have ever been made? A million? That’s nothing. A perfect hash table could have compressed them all to a mere 20 bits! If all you need is to be able to tell the difference between the different feature films, then a mere 20 bits is more than enough. Of course, the “compression” engine itself would have to be very large… This is exactly what happens when you name a movie. You use a shorthand to access it.
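The 20-bit figure is just ceil(log2(1,000,000)), with the million-film count taken from the text. A sketch of the arithmetic:

```python
import math

films = 1_000_000  # rough count of all feature films, per the text
bits_needed = math.ceil(math.log2(films))
print(bits_needed)  # 20: enough to give every film a unique index
```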
The third issue is probability – and it’s here that the most interesting part of compression theory sits. Not all words are equally likely. Not all letters are equally likely. Take an English text. Count the number of times you find the letters ‘a’ and ‘z’. ‘a’ is a hundred times more likely. So, you can allow yourself to encode the letter ‘z’ with even 50 bits, and you’d still get compression. That’s the idea behind information theory, and most coding schemes.
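Shannon’s entropy formula makes this concrete: it sets the average bits-per-symbol floor for a source. For a toy two-letter alphabet where ‘a’ is a hundred times more likely than ‘z’ (a hypothetical ratio, for illustration only), that floor sits far below the 1 bit a fixed-width code would spend:

```python
import math

# Hypothetical two-symbol source: 'a' is 100x more likely than 'z'.
p_a = 100 / 101
p_z = 1 / 101

# Shannon entropy: the minimum average bits per symbol for this source.
entropy = -(p_a * math.log2(p_a) + p_z * math.log2(p_z))
print(entropy)  # roughly 0.08 bits/symbol, versus 1 bit for a fixed code
```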
We can now rephrase my claim from the beginning of this page more clearly: if we needed to distinguish between all word sequences, and all sequences had equal probabilities, then there would be no way to compress data. Luckily for us, this is not the case.
But what this means, also, is that the best compression schemes must also be the most limited ones. When you compress English text, you utilize to the maximum the redundancy built into the English language. The more you know about the specifics of the English language, the better your compression. Compression needs context. The narrower the context, the better the theoretical compression. Outside of that context, the compression won’t work. A perfect English-text compressor would not perform as well when applied to a different language. Applied to binary machine code, it won’t compress at all. This is why you can’t compress text with JPEG, and why you can’t compress arbitrary binary data with GZip.
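The KET story from the top is exactly this effect. A sketch with Python’s zlib (the same deflate family as gzcompress): redundant English-like text shrinks dramatically, while random bytes of the same length – data with no redundancy for deflate’s context to exploit – do not shrink at all:

```python
import os
import zlib

text = b"the quick brown fox jumps over the lazy dog " * 100
noise = os.urandom(len(text))  # same length, zero redundancy

print(len(text), len(zlib.compress(text)))    # redundant text shrinks a lot
print(len(noise), len(zlib.compress(noise)))  # random data does not shrink
```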