From www@deja.com Sun Oct 22 13:17:19 2000
From: oleg@pobox.com
Message-ID: <8svi74$h1i$1@nnrp1.deja.com>
Subject: _wide-range_ optimized bit reader and its performance
Date: Sun, 22 Oct 2000 20:20:56 GMT
Reply-To: oleg@pobox.com
Keywords: bit stream, binary parsing, Scheme, Gambit
Newsgroups: comp.lang.scheme
Summary: announcement of the portable, general, and optimized bit reader;
  timing benchmarks and their discussion
X-Article-Creation-Date: Sun Oct 22 20:20:56 2000 GMT
Status: OR

This is to announce a better -- more generic and portable -- bit
reader. It is the first part of a binary parsing framework. The bit
reader code was intended to be as general and as optimal as possible.
The timing study at the end of this article shows the extent to which
_both_ goals have been met. The timings also give an insight into the
performance of a Scheme system.

The bit reader lets you read one bit from an arbitrary (possibly
overlaid) stream. You can then read one more bit -- or two more bits,
or 3, 7, 8, 23, 30, or 32 more bits. And then 33 bits, 65535,
81920,... That's in bits. Following the Scheme tradition, the bit
reader does not impose any artificial upper limit whatsoever on the
number of bits you can attempt to read from a stream. You are
naturally bounded by the size of the stream and the amount of virtual
memory on your system. The validation tests included with the source
code really do read all the bits of a file in one swoop -- as well as
one bit at a time, and many cases in-between.

Variable-bit i/o is very common, and necessary, in data compression.
Now we can easily implement Huffman and other compression schemes.
The ability to read more than 8 bits at a time can be handy. For
example, GRIB files, which store results of weather model runs, often
encode temperature, pressure and other quantities in 11 or 13 bits.
With the wide bit reader, we can really implement the original
Langdon idea for arithmetic compression (provided the IBM patent has
expired). You can load a private key in one binary read operation,
and then use Scheme's arbitrary-precision arithmetic to perform a
cryptographic operation.

The bit reader intentionally does not read ahead: no byte is read
until the very moment we really need (some of) its bits. Therefore,
the reader can be used to handle a concatenation of different
bit/byte streams _strictly_ sequentially, _without_ 'backing up a
char', 'unreading a char' and similar tricks. For example,
make-bit-reader has been used to read GRIB files of meteorological
data, which are made of several bitstreams with headers and tags.
Careful attention to byte buffering and optimization are thus the
distinguishing features of this bit reader. Although the bit reader
is very general, it is nevertheless optimized for common uses. The
optimized cases stand out in the timing benchmarks discussed below.

The source code is available from
	http://pobox.com/~oleg/ftp/Scheme/binary-parse.scm
The title comments in that file outline the entire binary parsing
project. At present the project implements reading of a variable
number of bits. The code includes validation tests as well as a
timing benchmark. I tested the code under Gambit-C 3.0 and MIT Scheme
5d2. The code has R5RS versions of the needed logical primitives, so
it should work in any R5RS Scheme system.

Usage example:

	(define bit-reader (make-bit-reader (lambda () #b11000101)))
	(bit-reader 3) ==> 6
	(bit-reader 4) ==> 2

The test driver in the source code shows other examples. The source
code also includes a timing benchmark.
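Before turning to the benchmark figures, here is a small sketch of
how the reader could be applied to one of the uses mentioned above:
extracting 11-bit quantities, GRIB-style, from a file. This is an
illustration rather than code taken from binary-parse.scm. It assumes,
as the usage example suggests, that the thunk given to make-bit-reader
returns successive bytes of the stream as exact integers (and, as a
further assumption, the EOF object at the end of the stream). The
helper names byte-thunk and read-11-bit-values are made up for this
sketch.

	; Hedged sketch -- not part of binary-parse.scm.
	(define (byte-thunk port)	; yield successive bytes of port
	  (lambda ()
	    (let ((c (read-char port)))
	      (if (eof-object? c) c (char->integer c)))))

	(define (read-11-bit-values port n)  ; read n 11-bit fields from port
	  (let ((bit-reader (make-bit-reader (byte-thunk port))))
	    (let loop ((i 0) (acc '()))
	      (if (>= i n)
	          (reverse acc)
	          (loop (+ i 1) (cons (bit-reader 11) acc))))))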
The following are the results of running the benchmark on a FreeBSD
4.0 box (Pentium III/Pentium III Xeon, 500 MHz, 128 MB of memory),
with Gambit-C 3.0, in compiled and in interpreted modes.

	Reading a 10240-byte file by various bit chunks
	Timing information: total, in msec, and per operation, in usec

	            Interpreted code        ||       Compiled code
	          Total, msec  Per op, usec || Total, msec  Per op, usec
	  char   |     31     |      3.0    ||      11     |     1.07
	  byte   |     73     |      7.1    ||      15     |     1.46
	  1 bit  |   1383     |     16.9    ||     471     |     5.75
	  2      |    896     |     21.9    ||     282     |     6.88
	  3      |    796     |     29.2    ||     254     |     9.30
	  4      |    527     |     25.7    ||     149     |     7.28
	  5      |    631     |     38.5    ||     194     |    11.8
	  6      |    560     |     41.0    ||     165     |    12.1
	  7      |    564     |     48.2    ||     165     |    14.1
	  8      |    153     |     14.9    ||      33     |     3.22
	  9      |    478     |     52.5    ||     144     |    15.8
	  10     |    411     |     50.2    ||     124     |    15.1
	  11     |    414     |     55.6    ||     128     |    17.2
	  15     |    346     |     63.4    ||     106     |    19.4
	  16     |    147     |     28.7    ||      31     |     6.05
	  17     |    337     |     69.9    ||     103     |    21.4
	  23     |    354     |     99.4    ||     119     |    33.4
	  24     |    140     |     41.0    ||      32     |     9.38
	  25     |    352     |    107      ||     121     |    36.9
	  30     |    320     |    117      ||     115     |    42.1
	  32     |    246     |     96.1    ||      83     |    32.4
	  65535  |   3110     | 3110000     ||    2601     |  2601000
	  81920  |   4395     | 4395000     ||    3745     |  3745000

The first row is a baseline: the time to read the 10240-byte file one
character at a time, using the read-char function. The second row,
reading the file by bytes, is also a baseline: it shows the overhead
added by char->integer and simple conditionals. The other rows give
the timings for reading the file n bits at a time, where n ranges
from 1 through 81920. The first column of each pair shows the CPU
(user+system) time to read the whole file, in msec. The second column
shows the CPU time per single application of the bit reader, in usec.
For example, it takes 81920 invocations of the bit reader to read the
file one bit at a time; OTOH, it takes only a single call to get all
81920 bits of the file.

In all the cases, the total system CPU time for the interpreted code
was under 8 msec, with the exception of reading by 65535 bits (23
msec) and by 81920 bits (16 msec). When running the compiled code,
the system CPU time for the whole file-reading expression never
exceeded 2 msec in any of the tests.

The cases of reading the file by 1, 8, 16, or 24 bits stand out in
the table: these are the optimized cases. We also note that reading
2*n bits at a time is generally faster than reading 2*n-1 or 2*n+1
bits (n>1). For the compiled code, it takes only about 5 usec to read
a bit -- roughly 5 times slower than reading a character, and only 4
times slower than reading a byte. The overhead can be further reduced
by using the optimized Gambit primitives ##fixnum.zero?, ##fixnum.-,
etc.

	GC cycles, interpreted vs. compiled code

	         Interpreted   Compiled
	  char   |      0     |     0
	  byte   |      2     |     0
	  1 bit  |     17     |     0
	  2      |     12     |     0
	  3      |     10     |     0
	  4      |      7     |     0
	  5      |      8     |     0
	  6      |      7     |     0
	  7      |      7     |     0
	  8      |      3     |     0
	  9      |      6     |     0
	  10     |      6     |     0
	  11     |      5     |     0
	  15     |      5     |     0
	  16     |      3     |     0
	  17     |      5     |     0
	  23     |      5     |     0
	  24     |      3     |     0
	  25     |      5     |     0
	  30     |      5     |     0
	  32     |      5     |     1
	  65535  |    136     |   119
	  81920  |    196     |   174

For the compiled code, reading by characters, by bytes, and by up to
17 bits at a time caused no memory allocation whatsoever. Reading by
larger chunks (except by 24 bits) requires memory allocation,
presumably for BIGNUMs.
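The BIGNUM case of the last rows is easy to reproduce: reading the
whole file in a single call yields an 81920-bit integer. The
following is a hedged sketch of such a one-swoop read. It is my
illustration, not the actual benchmark code from binary-parse.scm;
the file name is made up, and the byte-supplying thunk rests on the
same assumptions as the earlier sketch.

	; Hedged sketch: read an entire 10240-byte file as one
	; 81920-bit integer, the case of the last row of the tables.
	; "data.bin" is a made-up name; this is not the benchmark itself.
	(call-with-input-file "data.bin"
	  (lambda (port)
	    (let ((bit-reader
	           (make-bit-reader
	            (lambda ()
	              (let ((c (read-char port)))
	                (if (eof-object? c) c (char->integer c)))))))
	      (bit-reader 81920))))	; the result is a BIGNUM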