From www@deja.com Sun Oct 22 13:17:19 2000
From: oleg@pobox.com
Message-ID: <8svi74$h1i$1@nnrp1.deja.com>
Subject: _wide-range_ optimized bit reader and its performance
Date: Sun, 22 Oct 2000 20:20:56 GMT
Reply-To: oleg@pobox.com
Keywords: bit stream, binary parsing, Scheme, Gambit
Newsgroups: comp.lang.scheme
Summary: announcement of the portable, general, and optimized bit reader;
  timing benchmarks and their discussion
X-Article-Creation-Date: Sun Oct 22 20:20:56 2000 GMT
Status: OR

This is to announce a better -- more generic and portable -- bit
reader. It is the first part of a binary parsing framework. The bit
reader code was intended to be as general and as optimal as possible.
The timing study at the end of this article shows the extent to which
_both_ goals have been met. The timings also give an insight into the
performance of a Scheme system.

The bit reader lets you read one bit from an arbitrary (possibly
overlaid) stream. You can then read one more bit -- or two more bits,
or 3, 7, 8, 23, 30, or 32 more bits. And then 33 bits, 65535,
81920,... That's in bits. Following the Scheme tradition, the bit
reader does not impose any artificial upper limit whatsoever on the
number of bits you can attempt to read from a stream. You are
naturally bounded by the size of the stream and the amount of virtual
memory on your system. The validation tests included with the source
code really do read all the bits of a file in one swoop -- as well as
one bit at a time, and many cases in-between.

Variable-bit i/o is very common, and necessary, in data compression.
Now we can easily implement Huffman and other compression schemes.
The ability to read more than 8 bits at a time can be handy. For
example, GRIB files, which store results of weather model runs, often
encode temperature, pressure and other quantities in 11 or 13 bits.
With the wide bit reader, we can really implement the original
Langdon idea for arithmetic compression (provided the IBM patent has
expired). You can load a private key in one binary read operation,
and then use Scheme's arbitrary-precision arithmetic to perform a
cryptographic operation.

The bit reader intentionally does not read ahead: no byte is read
until the very moment we really need (some of) its bits. Therefore,
the reader can be used to handle a concatenation of different
bit/byte streams _strictly_ sequentially, _without_ 'backing up a
char', 'unreading a char' and similar tricks. For example,
make-bit-reader has been used to read GRIB files of meteorological
data, which are made of several bitstreams with headers and tags.
Careful attention to byte buffering and optimization are thus the
distinguishing features of this bit reader. Although the bit reader
is very general, it is nevertheless optimized for common uses. The
optimized cases stand out in the timing benchmarks discussed below.

The source code is available from
	http://pobox.com/~oleg/ftp/Scheme/binary-parse.scm
The title comments in that file outline the entire binary parsing
project. At present the project implements reading of a variable
number of bits. The code includes validation tests as well as a
timing benchmark. I tested the code under Gambit-C 3.0 and MIT Scheme
5d2. The code has R5RS versions of the needed logical primitives, so
it should work in any R5RS Scheme system.

Usage example:

	(define bit-reader (make-bit-reader (lambda () #b11000101)))
	(bit-reader 3) ==> 6
	(bit-reader 4) ==> 2

The test driver in the source code shows other examples. The source
code also includes a timing benchmark.
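Before turning to the benchmark figures, here is a small sketch of
how the reader could be applied to one of the uses mentioned above:
extracting 11-bit quantities, GRIB-style, from a file. This is an
illustration rather than code taken from binary-parse.scm. It assumes,
as the usage example suggests, that the thunk given to make-bit-reader
returns successive bytes of the stream as exact integers (and, as a
further assumption, the EOF object at the end of the stream). The
helper names byte-thunk and read-11-bit-values are made up for this
sketch.

	; Hedged sketch -- not part of binary-parse.scm.
	(define (byte-thunk port)	; yield successive bytes of port
	  (lambda ()
	    (let ((c (read-char port)))
	      (if (eof-object? c) c (char->integer c)))))

	(define (read-11-bit-values port n)  ; read n 11-bit fields from port
	  (let ((bit-reader (make-bit-reader (byte-thunk port))))
	    (let loop ((i 0) (acc '()))
	      (if (>= i n)
	          (reverse acc)
	          (loop (+ i 1) (cons (bit-reader 11) acc))))))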
The following are the results of running the benchmark on a FreeBSD
4.0 box (Pentium III/Pentium III Xeon, 500 MHz, 128 MB of memory),
with Gambit-C 3.0, in compiled and in interpreted modes.

	Reading a 10240-byte file by various bit chunks
	Timing information: total, in msec, and per operation, in usec

	            Interpreted code        ||       Compiled code
	          Total, msec  Per op, usec || Total, msec  Per op, usec
	  char   |     31     |      3.0    ||      11     |     1.07
	  byte   |     73     |      7.1    ||      15     |     1.46
	  1 bit  |   1383     |     16.9    ||     471     |     5.75
	  2      |    896     |     21.9    ||     282     |     6.88
	  3      |    796     |     29.2    ||     254     |     9.30
	  4      |    527     |     25.7    ||     149     |     7.28
	  5      |    631     |     38.5    ||     194     |    11.8
	  6      |    560     |     41.0    ||     165     |    12.1
	  7      |    564     |     48.2    ||     165     |    14.1
	  8      |    153     |     14.9    ||      33     |     3.22
	  9      |    478     |     52.5    ||     144     |    15.8
	  10     |    411     |     50.2    ||     124     |    15.1
	  11     |    414     |     55.6    ||     128     |    17.2
	  15     |    346     |     63.4    ||     106     |    19.4
	  16     |    147     |     28.7    ||      31     |     6.05
	  17     |    337     |     69.9    ||     103     |    21.4
	  23     |    354     |     99.4    ||     119     |    33.4
	  24     |    140     |     41.0    ||      32     |     9.38
	  25     |    352     |    107      ||     121     |    36.9
	  30     |    320     |    117      ||     115     |    42.1
	  32     |    246     |     96.1    ||      83     |    32.4
	  65535  |   3110     | 3110000     ||    2601     |  2601000
	  81920  |   4395     | 4395000     ||    3745     |  3745000

The first row is a baseline: the time to read the 10240-byte file one
character at a time, using the read-char function. The second row,
reading the file by bytes, is also a baseline: it shows the overhead
added by char->integer and simple conditionals. The other rows give
the timings for reading the file n bits at a time, where n ranges
from 1 through 81920. The first column of each pair shows the CPU
(user+system) time to read the whole file, in msec. The second column
shows the CPU time per single application of the bit reader, in usec.
For example, it takes 81920 invocations of the bit reader to read the
file one bit at a time; OTOH, it takes only a single call to get all
81920 bits of the file.

In all the cases, the total system CPU time for the interpreted code
was under 8 msec, with the exception of reading by 65535 bits (23
msec) and by 81920 bits (16 msec). When running the compiled code,
the system CPU time for the whole file-reading expression never
exceeded 2 msec in any of the tests.

The cases of reading the file by 1, 8, 16, or 24 bits stand out in
the table: these are the optimized cases. We also note that reading
2*n bits at a time is generally faster than reading 2*n-1 or 2*n+1
bits (n>1). For the compiled code, it takes only about 5 usec to read
a bit -- roughly 5 times slower than reading a character, and only 4
times slower than reading a byte. The overhead can be further reduced
by using the optimized Gambit primitives ##fixnum.zero?, ##fixnum.-,
etc.

	GC cycles, interpreted vs. compiled code

	         Interpreted   Compiled
	  char   |      0     |     0
	  byte   |      2     |     0
	  1 bit  |     17     |     0
	  2      |     12     |     0
	  3      |     10     |     0
	  4      |      7     |     0
	  5      |      8     |     0
	  6      |      7     |     0
	  7      |      7     |     0
	  8      |      3     |     0
	  9      |      6     |     0
	  10     |      6     |     0
	  11     |      5     |     0
	  15     |      5     |     0
	  16     |      3     |     0
	  17     |      5     |     0
	  23     |      5     |     0
	  24     |      3     |     0
	  25     |      5     |     0
	  30     |      5     |     0
	  32     |      5     |     1
	  65535  |    136     |   119
	  81920  |    196     |   174

For the compiled code, reading by characters, by bytes, and by up to
17 bits at a time caused no memory allocation whatsoever. Reading by
larger chunks (except by 24 bits) requires memory allocation,
presumably for BIGNUMs.
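The BIGNUM case of the last rows is easy to reproduce: reading the
whole file in a single call yields an 81920-bit integer. The
following is a hedged sketch of such a one-swoop read. It is my
illustration, not the actual benchmark code from binary-parse.scm;
the file name is made up, and the byte-supplying thunk rests on the
same assumptions as the earlier sketch.

	; Hedged sketch: read an entire 10240-byte file as one
	; 81920-bit integer, the case of the last row of the tables.
	; "data.bin" is a made-up name; this is not the benchmark itself.
	(call-with-input-file "data.bin"
	  (lambda (port)
	    (let ((bit-reader
	           (make-bit-reader
	            (lambda ()
	              (let ((c (read-char port)))
	                (if (eof-object? c) c (char->integer c)))))))
	      (bit-reader 81920))))	; the result is a BIGNUM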