From oleg-at-okmij.org Wed Sep 15 00:55:25 2004
To: haskell-cafe@haskell.org
Subject: Layered I/O
From: oleg-at-pobox.com
Message-ID: <20040915065525.7A20FA973@Adric.metnet.navy.mil>
Date: Tue, 14 Sep 2004 23:55:25 -0700 (PDT)
Status: OR

...

The key feature is a layered, or stackable, i/o. Enclosed is a
justification message that I wrote two years ago for Haskell-Cafe, but
somehow did not post. An early draft of the ambitious proposal, cast
in the context of Scheme, is available here:
    http://pobox.com/~oleg/ftp/Scheme/io.txt
More polished drafts exist, and even a prototype implementation.
Unfortunately, once it became clear that the ideas were working out,
the motivation fizzled.

The discussion of i18n i/o highlighted the need for general overlay
streams. We should be able to place a processing layer onto a handle
-- and to peel it off and place another one. The layers can do
character encoding, subranging (limiting the stream to the specified
number of basic units), base64 and other decoding, signature
collecting and verification, etc.

Processing of a response from a web server illustrates a genuine need
for such overlaid processing. Suppose we have established a connection
to a web server, sent a GET or POST request and are now reading the
response. It starts as follows:

    HTTP/1.1 200 Have it
    Content-type: text/plain; charset=iso-2022-jp
    Content-length: 12345
    Date: Tuesday, August 13, 2002

To read the response line and the content headers, our stream must be
in an ASCII, Latin-1 or UTF-8 encoding (regardless of the current
locale). The body of the message is encoded in iso-2022-jp. This
encoding may have nothing to do with the current locale. Furthermore,
many character encodings cannot be reliably detected automatically.
Therefore, after we read the headers we must forcibly set our stream
to use the iso-2022-jp encoding.
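
As an illustration of the header-reading step, here is a minimal
Haskell sketch. It is not a full HTTP parser: it assumes the header
block has already been separated from the body and is available as a
list of octets, and it ignores quoted parameter values and folded
header lines. The names parseHeaders, contentLength and charsetOf are
made up for this example.

```haskell
import Data.Char (chr, isDigit, isSpace, toLower)
import Data.List (isPrefixOf)
import Data.Word (Word8)

-- Interpret octets as Latin-1: octet n becomes code point n.
-- This does not depend on the current locale.
latin1 :: [Word8] -> String
latin1 = map (chr . fromIntegral)

type Header = (String, String)   -- (lower-cased name, trimmed value)

trim :: String -> String
trim = f . f where f = reverse . dropWhile isSpace

-- Split a string into lines terminated by CRLF.
splitCRLF :: String -> [String]
splitCRLF s = case breakCRLF s of
  (l, Nothing)   -> [l | not (null l)]
  (l, Just rest) -> l : splitCRLF rest
  where
    breakCRLF ('\r' : '\n' : xs) = ([], Just xs)
    breakCRLF (x : xs)           = let (l, r) = breakCRLF xs in (x : l, r)
    breakCRLF []                 = ([], Nothing)

-- Split a string on a separator character.
splitOn :: Char -> String -> [String]
splitOn c s = case break (== c) s of
  (a, [])       -> [a]
  (a, _ : rest) -> a : splitOn c rest

-- Parse "Name: value" lines. Lines without a colon, such as the
-- status line in the example response, are skipped.
parseHeaders :: [Word8] -> [Header]
parseHeaders octets =
  [ (map toLower (trim name), trim (drop 1 rest))
  | line <- splitCRLF (latin1 octets)
  , let (name, rest) = break (== ':') line
  , not (null rest) ]

-- The number of body octets announced by the server, if any.
contentLength :: [Header] -> Maybe Int
contentLength hs = do
  v <- lookup "content-length" hs
  if not (null v) && all isDigit v then Just (read v) else Nothing

-- The charset parameter of Content-type, lower-cased, if present.
charsetOf :: [Header] -> Maybe String
charsetOf hs = do
  v <- lookup "content-type" hs
  let params = map trim (drop 1 (splitOn ';' v))
  case [drop 8 p | p <- params, "charset=" `isPrefixOf` map toLower p] of
    (c : _) -> Just c
    []      -> Nothing
```

The point of the sketch is that the headers are decoded with a fixed,
locale-independent encoding, and only their content (charsetOf,
contentLength) tells us which decoder and which octet limit to apply
to the body that follows.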

ISO-2022-JP is an encoding for Japanese characters
[http://www.faqs.org/rfcs/rfc1554.html]. It is a variable-length
stateful encoding: the start of the specifically Japanese encoding is
indicated by \e$B. After that, the reader should read _two octets_ per
character from the input stream (and pass them to the application as
they are). The server has indicated that it was sending 12345 _octets_
of data. We cannot tell offhand how many _characters_ of data we will
read, because of the variable-length encoding. However, we must not
even attempt to read the 12346-th octet: HTTP/1.1 connections are, in
general, persistent, and we must not read more data than were sent.
Otherwise, we deadlock. Therefore, our stream must be able to give us
Japanese characters and still must count octets. The HTTP stream will
not, in general, give an EOF condition at the end of data.

That is not the only complication. Suppose the web server replied:

    Content-type: text/plain; charset=iso-2022-jp
    Transfer-encoding: chunked

We should expect to read a sequence of chunks of the format:

    <length> CRLF <body> CRLF

where <length> is a hexadecimal number, and <body> is encoded as
indicated in the Content-type. Therefore, after we read the header, we
should keep our stream in the ASCII mode to read the <length> field.
After that, we should switch the encoding into ISO-2022-JP. After we
have consumed <length> octets, we should switch the stream back to
ASCII, verify the trailing CRLF, and read the <length> of the next
chunk. The ISO-2022-JP encoding is stateful and wide (a character is
formed by several octets). It may well happen that a wide character is
split between two chunks: one octet of a character will be in one
chunk and the other octet in the following chunk. Therefore, when we
switch from the ISO-2022-JP encoding to ASCII and back, we must
preserve the state of the encoding.

This is not the end of the story, however.
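
The chunk-by-chunk switching and the need to preserve the decoder
state can be sketched in Haskell as follows. This is a toy under
stated assumptions, not a real ISO-2022-JP decoder: it recognizes only
the escapes ESC $ B (two-octet mode) and ESC ( B (back to ASCII),
passes two-octet characters through as raw octet pairs, and ignores
chunk extensions and trailers. All names (dechunk, feed, DecState,
Token, decodeChunks) are invented for the example.

```haskell
import Data.Char (chr, isHexDigit)
import Data.Word (Word8)
import Numeric (readHex)

-- Read an ASCII line up to CRLF; return it and the remaining octets.
lineCRLF :: [Word8] -> Maybe (String, [Word8])
lineCRLF (13 : 10 : rest) = Just ("", rest)
lineCRLF (b : rest)
  | b < 128   = fmap (\(s, r) -> (chr (fromIntegral b) : s, r)) (lineCRLF rest)
  | otherwise = Nothing
lineCRLF [] = Nothing

parseHex :: String -> Maybe Int
parseHex s
  | not (null s) && all isHexDigit s = case readHex s of
      [(n, "")] -> Just n
      _         -> Nothing
  | otherwise = Nothing

stripCRLF :: [Word8] -> Maybe [Word8]
stripCRLF (13 : 10 : rest) = Just rest
stripCRLF _                = Nothing

-- Split a chunked body into chunk payloads, counting octets exactly.
-- A zero-length chunk ends the body. Nothing signals malformed input.
dechunk :: [Word8] -> Maybe [[Word8]]
dechunk input = do
  (lenField, rest) <- lineCRLF input
  n <- parseHex lenField
  if n == 0
    then Just []
    else do
      let (payload, rest') = splitAt n rest
      after <- if length payload == n then stripCRLF rest' else Nothing
      (payload :) <$> dechunk after

data Mode  = AsciiMode | WideMode deriving (Eq, Show)
data Token = Ascii Char | Wide Word8 Word8 deriving (Eq, Show)

-- Decoder state: the current mode plus octets seen but not yet consumed
-- (a partial escape sequence or the first half of a wide character).
data DecState = DecState Mode [Word8] deriving (Eq, Show)

initial :: DecState
initial = DecState AsciiMode []

-- Feed one chunk payload; return the completed tokens and the state
-- from which decoding of the next chunk must resume.
feed :: DecState -> [Word8] -> ([Token], DecState)
feed (DecState mode pending) octets = go mode (pending ++ octets)
  where
    go _ (27 : 36 : 66 : xs) = go WideMode xs            -- ESC $ B
    go _ (27 : 40 : 66 : xs) = go AsciiMode xs           -- ESC ( B
    go m xs@(27 : _) | length xs < 3 = ([], DecState m xs) -- partial escape
    go AsciiMode (x : xs)    = emit (Ascii (chr (fromIntegral x))) (go AsciiMode xs)
    go WideMode (a : b : xs) = emit (Wide a b) (go WideMode xs)
    go m xs                  = ([], DecState m xs)       -- empty, or half a character
    emit t (ts, st) = (t : ts, st)

-- Decode all chunks, threading the decoder state across boundaries.
decodeChunks :: [[Word8]] -> ([Token], DecState)
decodeChunks = foldl step ([], initial)
  where step (ts, st) chunk = let (ts', st') = feed st chunk in (ts ++ ts', st')
```

The octet counting lives entirely in dechunk, and the character
decoding entirely in feed; the only thing that crosses a chunk
boundary is the DecState value. Restarting from initial at each chunk
would lose both the current mode and any half-read character.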

A web server may send us a multi-part reply: a multi-part MIME entity
made of several MIME entities, each with its own encoding and transfer
modes. None of these encodings has anything to do with the current
locale. Therefore, we may need to switch encodings back and forth
quite a few times.

Decoding of such complex streams becomes easier if we can overlay
different processing layers on a stream. We start with a TCP handle,
overlay an ASCII stream and read the headers, then overlay a stream
that reads a specified number of units (and returns EOF once that many
have been read). On top of the latter we place an ISO-2022-JP decoder.
Or we choose a base64 decoder overlaid with a PKCS#7 signed entity
decoder and with a signature verification layer.

OpenSSL is one package that offers i/o overlays and stream
composition. Overlaying of parsers, encoders and hash accumulators is
very common in that particular domain. I have implemented such a
facility in two languages, e.g., to overlay an endian stream on top of
a raw stream, then a bit stream and an arithmetic compression stream.
In a functional world, the Ensemble system (Liu, Kreitz, et al.)
[http://citeseer.ist.psu.edu/liu99building.html] supports such a
stackable i/o.

When using overlaid streams, we should remember that all the layers
are synchronized. If we did

    let iso2022_stream = makeiso2022 hFile in body

then the raw hFile is still available in the body. Reading from hFile
and from iso2022_stream indiscriminately will surely wreak havoc.
Clean's uniqueness types are an excellent feature to guard against
such mistakes.

Exactly the same considerations apply if we were using a TIFF reader
or a PNG reader rather than ISO-2022-JP.
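
To make the idea of stacking concrete, here is a minimal Haskell
sketch of overlay streams. It is an illustration only, not the
prototype mentioned above: Stream, fromList, limit, pairs and drain
are names invented for the example, an in-memory list stands in for
the TCP handle, and the pairs layer stands in for a wide-character
decoder.

```haskell
import Data.IORef (newIORef, readIORef, writeIORef)

-- A stream is an action that yields the next unit, or Nothing at EOF.
newtype Stream a = Stream { next :: IO (Maybe a) }

-- Bottom layer: a stream over an in-memory list (stand-in for a handle).
fromList :: [a] -> IO (Stream a)
fromList xs0 = do
  ref <- newIORef xs0
  return $ Stream $ do
    xs <- readIORef ref
    case xs of
      []       -> return Nothing
      (y : ys) -> writeIORef ref ys >> return (Just y)

-- Subranging layer: pass through at most n units, then report EOF
-- without touching the layer below again.
limit :: Int -> Stream a -> IO (Stream a)
limit n inner = do
  ref <- newIORef n
  return $ Stream $ do
    left <- readIORef ref
    if left <= 0
      then return Nothing
      else do
        r <- next inner
        case r of
          Nothing -> writeIORef ref 0
          Just _  -> writeIORef ref (left - 1)
        return r

-- Decoding layer: combine two units into one. A dangling odd unit
-- at EOF is dropped.
pairs :: Stream a -> Stream (a, a)
pairs inner = Stream $ do
  a <- next inner
  case a of
    Nothing -> return Nothing
    Just x  -> fmap (fmap ((,) x)) (next inner)

-- Read a stream to its EOF.
drain :: Stream a -> IO [a]
drain s = do
  r <- next s
  case r of
    Nothing -> return []
    Just x  -> (x :) <$> drain s
```

With raw <- fromList [1 .. 10] and body <- limit 4 raw, draining
pairs body yields two pairs and leaves raw positioned at the fifth
unit, so another layer can be placed on raw afterwards. Note that raw
stays in scope the whole time: nothing in this sketch stops a caller
from reading raw between two reads of pairs body, which is exactly
the synchronization hazard described above.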