DATA COMPRESSION

7/30/2004

Data Compression is the process of reducing digital data to an equivalent representation requiring fewer bits. This is generally done to reduce storage requirements, reduce transmission time, or reduce the bandwidth needed for data transmission. Data compression can be either lossless -- the original bit pattern can be recovered intact -- or lossy -- data (hopefully not essential) is discarded. Generally, lossy encoding produces greater compression than lossless compression. Compression ratios vary with the material being compressed and may even be negative (the "compressed" data is longer than the uncompressed data) in pathological cases. Compression ratios around 2:1 are typical for lossless compression. Lossy compression can do much better although high compression ratio material may be substantially degraded.

Two early lossless compression techniques are Run Length Encoding and Huffman Encoding. Run Length encoding replaces long runs of identical bit configurations with a flag that indicates that the following data is compressed; a set of bits to be repeated; and a count of the number of repetitions. Huffman encoding was developed in the 1950s. It uses short codes for frequently used bit combinations, longer codes for less frequently used bit combinations.

In the 1980s, dictionary based lossless encoders were developed. The best known of these is probably Lemple-Zip-Welch (LZW) encoding. Dictionary encoders build a dictionary of previously seen strings of bits as they move through the data. Bit strings are replaced by their dictionary indices. The dictionary -- which may be quite lengthy -- is not sent with the data. It is reconstructed by the decoder as it moves through the data.

Lossy data compression schemes are generally applied to sensory data such as video and audio that include a great deal of redundancy. They attempt to reduce data volume by discarding data that would not be distinguished by a listener or viewer. Typically they use a priori knowledge of what users can distinguish and/or describe the data using a mathematical algorithm such as the Discrete Cosine Transform or Fractal encoding that describes the data in a series of terms and discards those terms that appear to have little potential meaning.