We started by experimenting with several lossless compression techniques on adir512, a 512×512 reduction of the USGS Lake Champlain West DEM, shown in Figure f:512, which was obtained from the full DEM with the program pnmscale. This is not one corner of the DEM, but the whole DEM at a lower resolution. We made this conservative choice since a 512×512 corner of the DEM, with the points still at their original spacing, would have compressed either better or worse depending on which corner we used, as shown later in Table t:part.
The file contains a mix of terrain: a mountainous region in the southwest, a plain in the west center, and a lake in the east center. The 262,144 points range in elevation from 22 to 1,568 meters.
The purpose of these experiments was to determine the relative performance of the various methods, so that we could select a few methods to apply to files of various sizes. The next test case was adir1024, a 1024×1024 corner of the Lake Champlain West DEM, first whole, then partitioned into smaller and smaller blocks. This demonstrated that altho a particular method might compress different files of the same size by amounts ranging over a factor of three, the relative performance of different methods applied to the same file did not vary much. This suggests that these experiments may have some general validity on other data sets.
Finally, we tested a few methods on a block from each of 24 other randomly selected DEMs. We saw the same behavior again. Altho compression ratios varied over a factor of 10, the ratio of the size achieved by the progcode method to the size achieved by gzip varied by only a factor of 2.
That is,
\[ \frac{\max_i \bigl( s_i(\mathrm{progcode}) / s_i(\mathrm{gzip}) \bigr)}{\min_i \bigl( s_i(\mathrm{progcode}) / s_i(\mathrm{gzip}) \bigr)} \approx 2 , \]
where $s_i(\cdot)$ is the compressed size of the $i$-th test file.
In the numbers that follow, we generally ignore header information in the files, since the headers are a small fraction of the file size, and that fraction shrinks as the data grows.
The raw ASCII file, with each point stored as three to five characters depending on the number of digits in the elevation (plus a separator), takes 964,532 bytes. Using a binary file with two bytes per number reduces the space to 524,288 bytes. This is a smaller reduction than expected, since 33.6% of the elevations have only two digits. Nevertheless, the binary file is useful because it is much faster to read; formatted I/O is rather compute-bound.
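As a concrete illustration, the ASCII-to-binary conversion could be sketched in C as follows. The file names and the assumption that the elevations are whitespace-separated decimal integers fitting in 16 bits are ours for illustration, not a description of the programs actually used.

/* Sketch: convert whitespace-separated ASCII elevations to 2-byte
 * binary values.  File names and the 16-bit assumption are illustrative. */
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    FILE *in  = fopen("adir512.asc", "r");   /* hypothetical input name  */
    FILE *out = fopen("adir512.bin", "wb");  /* hypothetical output name */
    int elev;

    if (!in || !out) { perror("fopen"); return EXIT_FAILURE; }

    while (fscanf(in, "%d", &elev) == 1) {
        short e = (short) elev;              /* 2 bytes per elevation */
        fwrite(&e, sizeof e, 1, out);
    }
    fclose(in);
    fclose(out);
    return 0;
}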
The Unix compress command, applied to the ASCII file, produces a 289,408 byte file, altho in only 3 CPU seconds on a Sun Sparc 10/30 workstation. A better compression program is gzip, part of the Free Software Foundation's suite of free GNU software. gzip has an option controlling how much time it spends compressing. At the default setting, applied to the original ASCII file, the compressed file is 263,270 bytes, produced in 21 seconds. gzip -9, the slowest and smallest setting, produces a 258,016 byte file in 37 seconds. Since the latter took almost twice as long for a file only about 2% smaller, the default setting is probably appropriate.
Gzip, in 7 seconds, compresses the above binary file to 230,889 bytes; gzip -9 creates a slightly smaller 230,179 byte file in 14 seconds. This shows that treating the file as numbers rather than as simple text is somewhat better. How well might we potentially do?
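The same default-versus-maximum-effort tradeoff can be reproduced programmatically with zlib's compress2(); this is our own sketch, not the tool used in the experiments above (those used the gzip command), and the file name is illustrative.

/* Sketch: compare zlib's default compression level with level 9. */
#include <stdio.h>
#include <stdlib.h>
#include <zlib.h>

static uLong deflate_size(const Bytef *src, uLong n, int level)
{
    uLongf outLen = compressBound(n);
    Bytef *out = malloc(outLen);
    if (!out || compress2(out, &outLen, src, n, level) != Z_OK) {
        fprintf(stderr, "compression failed\n");
        exit(EXIT_FAILURE);
    }
    free(out);
    return outLen;
}

int main(void)
{
    /* Read the whole binary elevation file (name is illustrative). */
    FILE *f = fopen("adir512.bin", "rb");
    if (!f) { perror("fopen"); return EXIT_FAILURE; }
    fseek(f, 0, SEEK_END);
    long n = ftell(f);
    rewind(f);
    Bytef *buf = malloc(n);
    fread(buf, 1, n, f);
    fclose(f);

    printf("default level: %lu bytes\n",
           deflate_size(buf, n, Z_DEFAULT_COMPRESSION));
    printf("level 9:       %lu bytes\n", deflate_size(buf, n, 9));
    free(buf);
    return 0;
}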
Among the 262,144 points, there are only 1,436 different elevations. Thus, a table of the existing elevations plus 10 bpp is almost sufficient (11 bpp would certainly suffice, since $2^{11} = 2048 > 1436$). What is the information-theoretic Huffman lower bound, ignoring relations between adjacent points? Let $n_i$ be the number of points with the $i$-th elevation, counting from lowest to highest, and let the total number of points be $N = \sum_i n_i$. Then the lower bound, also called the entropy of the data, is, in bits,
\[ H = \sum_i n_i \log_2 \frac{N}{n_i} . \]
For this data, the frequencies range from 71 elevations occurring only once to one elevation occurring 28,674 times. The entropy is 280,753 bytes, or 8.6 bpp. Clearly, ignoring correlations in the elevations is expensive.
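To make the calculation concrete, the following C sketch tabulates the elevation frequencies and evaluates the sum above. It assumes the 2-byte binary file described earlier, with nonnegative elevations below 65,536; the file name is again illustrative.

/* Sketch: zeroth-order entropy of the elevations, in bytes and bpp. */
#include <stdio.h>
#include <stdlib.h>
#include <math.h>

int main(void)
{
    static long count[65536];             /* frequency of each elevation */
    long N = 0;
    unsigned short e;
    FILE *f = fopen("adir512.bin", "rb");  /* illustrative file name */
    if (!f) { perror("fopen"); return EXIT_FAILURE; }

    while (fread(&e, sizeof e, 1, f) == 1) { count[e]++; N++; }
    fclose(f);

    double bits = 0.0;                     /* H = sum_i n_i log2(N / n_i) */
    for (long i = 0; i < 65536; i++)
        if (count[i] > 0)
            bits += count[i] * log2((double) N / count[i]);

    printf("entropy: %.0f bits = %.0f bytes = %.2f bpp\n",
           bits, bits / 8.0, bits / N);
    return 0;
}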
The above entropy calculation ignores autocorrelations in the data. Therefore consider a 1-D differencing along a linear array of the points. There were 501 different differences, ranging from -158 to +750, with 156 deltas occurring only once, the most frequent delta occurring 80,125 times, and the next most frequent delta occurring 15,577 times. The resulting file's entropy is 163,125 bytes, or 5.0 bpp, so using deltas gained us a factor of 1.7.
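The delta experiment can be sketched the same way. The version below linearizes the points in row-major order and keeps the first point as-is; both of those choices are our assumptions for illustration, as is the offset of 32,768 used so that negative deltas can index the frequency table.

/* Sketch: entropy of 1-D first differences of the elevations. */
#include <stdio.h>
#include <stdlib.h>
#include <math.h>

int main(void)
{
    static long count[65536];              /* frequency of each delta */
    long N = 0;
    unsigned short e;
    int prev = 0, first = 1;
    FILE *f = fopen("adir512.bin", "rb");   /* illustrative file name */
    if (!f) { perror("fopen"); return EXIT_FAILURE; }

    while (fread(&e, sizeof e, 1, f) == 1) {
        int d = first ? (int) e : (int) e - prev;  /* first point kept as-is */
        first = 0;
        prev = (int) e;
        count[d + 32768]++;                /* offset so negatives index the table */
        N++;
    }
    fclose(f);

    double bits = 0.0;                     /* same entropy sum as before */
    for (long i = 0; i < 65536; i++)
        if (count[i] > 0)
            bits += count[i] * log2((double) N / count[i]);

    printf("delta entropy: %.0f bytes = %.2f bpp\n", bits / 8.0, bits / N);
    return 0;
}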