Making sense of an input stream

The following simple functions surprisingly often suffice when parsing an input stream. They either skip, or build and return tokens according to the inclusion or delimiter semantics. The list of characters to expect, include, or to break at may vary from one invocation of a function to another. This feature allows the functions to easily parse even context-sensitive languages.

EOF is generally frowned upon, and thrown up upon if encountered. Still, sometimes the EOF "symbol" can be expected: it may appear in a list of break-characters, coded as a symbol *eof*.

The input stream to parse is specified as a PORT, which is usually the last (and optional) argument. It defaults to the current input port if omitted.

Import

This package relies on a function parser-error, which must be defined by a user of the package. The function has the following signature:

     parser-error PORT MESSAGE SPECIALISING-MSG*

Many procedures of this package call the parser-error to report a parsing error. The first argument is a port, which typically points to the offending character or its neighborhood. Most of the Scheme systems let the user query a PORT for the current position. MESSAGE is the description of the error. Other arguments supply more details about the problem.

Looking for a string in a sequential input stream
Input parsing primitives
assert-curr-char, skip-until, skip-while, peek-next-char
Stream tokenizers
next-token, next-token-of, read-text-line, read-string
Sample Code
CGI:url-unquote: a CGI script utility for parsing of a QUERY_STRING [in a separate document]
A functional-style framework to parse XML/HTML documents [in a separate document]

Looking for a string in a sequential input stream

find-string-from-port? STR IN-PORT [MAX-NO-CHARS]

- looks for a string STR within the first MAX-NO-CHARS chars of the input port IN-PORT
- MAX-NO-CHARS may be omitted: in that case, the search span would be limited only by the end of the input stream
- if the string is found, the function returns the number of characters it has read from the port, and the port is set to read the first char after that (that is, after the STR)
- the function returns #f if STR was not found
Note the function reads the port strictly sequentially, and does not perform any buffering. The function thus can be used even if the port is open on a pipe or other communication channel.

Platforms

The package is tested on Gambit-C 3.0, SCM 5d2, Bigloo 2.2b and MIT Scheme 7.5.2. The function find-string-from-port? is a part of SLIB. Therefore, it is available on all platforms supported by SLIB.

Version

The current version is 1.3, Jan 18, 1996.

References

look-for-str.scm [5K]
The commented source code.

vinput-parse.scm [18K]
Verification code.

Input parsing primitives

assert-curr-char CHAR-LIST STRING [PORT]

reads a character from the PORT and looks it up in the CHAR-LIST of expected characters. If the read character was found among the expected, it is returned. Otherwise, the procedure writes a nasty message using STRING as a comment, and quits.

skip-until CHAR-LIST [PORT]

reads and skips characters from the PORT until one of the break characters is encountered. This break character is returned. The break characters are specified as the CHAR-LIST. This list may include EOF, which is to be coded as a symbol *eof*.

skip-until NUMBER [PORT]

skips the specified NUMBER of characters from the PORT and returns #f.

skip-while CHAR-LIST [PORT]

advances the PORT to the first character that is not a member of the CHAR-LIST -- or till the EOF, whichever occurs sooner. This character or the EOF object is returned. This character is left on the stream.

peek-next-char [PORT]

advances to the next character in the PORT and peeks at it. This function is useful when parsing LR(1)-type languages (one-char-read-ahead).

Platforms

The package is tested on Gambit-C 3.0, SCM 5d2, Bigloo 2.2b and MIT Scheme 7.5.2

Version

The current version is 3.1, Mar 14, 2001.

References

input-parse.scm [14K]
The commented source code.

vinput-parse.scm [18K]
A thorough verification code.

Stream tokenizers

The following tokenizers support both delimiting and inclusion semantics for a token.

next-token PREFIX-CHAR-LIST BREAK-CHAR-LIST [COMMENT-STRING] [PORT]

skips any number of the prefix characters (members of the PREFIX-CHAR-LIST), if any, and reads a sequence of characters up to (but not including) a break character, one of the BREAK-CHAR-LIST. The string of characters thus read is returned. The break character is left on the input stream.

The list of break characters may include EOF, which is to be coded as a symbol *eof*. Otherwise, EOF is fatal, generating an error message including a specified COMMENT-STRING (if any).

As we cannot tell offhand the final size of the token we are reading, we make a guess, pre-allocate a string, and grow it by quanta if necessary. The quantum is always the length of the token buffer before it was extended the last time. This amounts to a Fibonacci-type extension, which has been shown optimal.

An internal procedure input-parse:init-buffer determines the initial buffer allocation policy. By default, the policy is to reuse a statically-allocated buffer. For Scheme systems without preemptive threads or shared substrings, this policy proved optimal.

Another implementation of the same procedure is next-token-list-based, which accumulates token characters in a list rather than placing them into an extendible string buffer. In reality, the list version works just as fast as the string buffer version above, yet the list version allocates 50% more memory and has to run garbage collection 50% as many times (Gambit-C 3.0).

     ; Example: a snippet from CGI:url-unquote
     (let ((read-primitive-token
            (lambda ()
              (if (eof-object? (peek-char)) ""
                (next-token '() '(#\= #\+ #\& #\% *eof*)
     				   "URL-unquoting"))))) )
       ...)
     
     ; Parsing of a comma-delimited list
     (let loop ()
       (cons (next-token '(#\space) '(#\, *eof*) "" port)
     	(if (eof-object? (read-char port)) '() (loop))))

next-token-of INC-CHARSET [PORT]

reads characters from the PORT that belong to the list of characters INC-CHARSET. The reading stops at the first character which is not a member of the set. This character is left on the stream. All the read characters are returned in a string.

next-token-of PRED [PORT]

reads characters from the PORT for which PRED (a procedure of one argument) returns non- #f. The reading stops at the first character for which PRED returns #f. That character is left on the stream. All the results of evaluating of PRED up to #f are returned in a string.

PRED is a procedure that takes one argument (a character or the EOF object) and returns a character or #f. The returned character does not have to be the same as the input argument to the PRED. For example,

     (next-token-of
      (lambda (c)
        (cond
          ((eof-object? c) #f)
          ((char-alphabetic? c) (char-downcase c))
          (else #f))))

will try to read an alphabetic token from the current input port, and return it in lower case.

read-text-line [PORT]

is a particular application of next-token. It reads one line of text from the PORT, and returns it as a string. A line is a (possibly empty) sequence of characters terminated by CR, CRLF or LF (or even the end of file).
The terminating character (or CRLF combination) is removed from the input stream. The terminating character(s) is not a part of the return string either.
If EOF is encountered before any character is read, the return value is EOF.
This function used to be called read-line. It is renamed to avoid conflicts with many Scheme implementations, which define a similar but not identical function.

read-string N [PORT]

reads N characters from the PORT, and returns them in a string. If EOF is encountered before N characters are read, a shorter string will be returned.
If N is not positive, an empty string will be returned.

Platforms

The package is tested on Gambit-C 3.0, SCM 5d2, Bigloo 2.2b and MIT Scheme 7.5.2

Version

The current version is 3.3, Jul 11, 2001.

References

input-parse.scm [14K]
The commented source code.

vinput-parse.scm [18K]
A thorough verification code.

Sample Code

Count the number of occurrences of a pattern in a file

Count the number of occurrences of a string pattern pat in a file filename.

     (define (count-occurrences pat filename)
       (call-with-input-file filename
         (lambda (port)
           (let loop ((count 0))
            (if (find-string-from-port? pat port)
                (loop (+ 1 count))
                count)))))

or, more succinctly,

     (define (count-occurrences-do pat filename)
       (call-with-input-file filename
         (lambda (port)
           (do ((count 0 (+ 1 count)))
               ((not (find-string-from-port? pat port)) count)))))

If a file /tmp/a.t contains

     ab abra   abracadabra
     		abracadabra
     abracaDabracadabra abracadabraabracadabra

then

     (count-occurrences "abracadabra" "/tmp/a.t")
     (count-occurrences-do "abracadabra" "/tmp/a.t")

both return 5.
There is no need to read lines. In fact, there is no need to hold more than (string-length pat) characters in memory, no matter how big the input file is.

Rough-and-ready HTML-to-text converter

dehtmlize port

Given an input port that (possibly) contains an HTML or XML text, the function copies the text without the "tags" to the current-output-port. To be more precise, the function omits from copying anything between < and >, and the brackets themselves. Comments, , are disregarded as well. Note, the comments may contain nested tags.

     (define (dehtmlize port)
       (let loop ()
         (let ((token (next-token '() '(#\< *eof*) "opening tag" port)))
           (display token)
           (and (eq? #\< (read-char port))
     	   (or (and (eq? #\! (peek-char port))	; check for possible comment
     		    (eq? #\- (peek-next-char port))
     		    (eq? #\- (peek-next-char port))
     		    (find-string-from-port? "-->" port)
     		    (loop))
     	       (and
     		(find-string-from-port? ">" port)
     		(loop)))))))

The article referenced below embeds the dehtmlize function in a complete program that fetches a URL and prints out the corresponding HTML document without tags and comments.

The article also compares dehtmlize to the corresponding Perl code. The Scheme code has a notable advantage over the Perl code in its ability to dehtmlize the input stream "on the fly". The Perl code loads the entire text of the input document into memory. Any Perl code must do that as you cannot apply regular expression substitutions to a port as an input port generally does not allow "backtracking" by an arbitrary amount. The Scheme code works with a strictly sequential input port; it needs as much memory as the longest token -- which is far smaller than the whole document. The Perl code does four passes through the whole input (one to read it all into memory; two others do regular expression substitutions specified in dehtmlize; and yet another pass writes out the result). The Scheme code accomplishes all the work in only one pass. The size of the Scheme code is comparable with that of the Perl code.

References

A quick comparison of Scheme and Perl regarding parsing of input [plain text file]
The article itself. It was originally posted as HTTP fetching on a newsgroup comp.lang.schemeon 28 Feb 2001 03:07:49 +0100.

How to read the whole file or a port into a string

The task is to write a function read-all, which takes an input port and returns the string of all characters read from the port till the end-of-file. The port can be opened on a file, terminal or a communication channel.

The straightforward solution: building a list of characters

The most straightforward (if not to say naive) implementation is to accumulate read characters into a list, in reverse order. The list is reversed and converted to a string at the very end:

     (define (read-all port)
       (let loop ((accum '()))
         (let ((c (read-char port)))
           (if (eof-object? c)
               (apply string (reverse accum))
               (loop (cons c accum))))))

This method is clumsy and slow: all the characters read from the port have to be scanned once more during the reversal of the list, and one more time when building a string. Furthermore, applying a function (in our case, string) to a large argument list is inefficient. Some Scheme systems even limit the number of arguments that can be passed to a function.

Reading the port by chunks

Rather than reading the port character-by-character, we read it by chunks. We place the chunks into a list. At the end, we reverse the list and concatenate its elements.

     (define (read-all port)
       (let ((quantum 8192))
         (let loop ((chunks-rev '()))
           (let ((chunk (read-string quantum port)))
             (if (or (eof-object? chunk) (zero? (string-length chunk)))
                 (string-concatenate-reverse chunks-rev)
                 (loop (cons chunk chunks-rev)))))))

where read-string is one of the "standard" functions implemented on many systems. It is also a part of the parsing library above. The function string-concatenate-reverse is in SRFI-13. For completeness, here is the code of that function:

     ; See SRFI-13 for details
     (define (string-concatenate-reverse string-list)
       (let* ((final-size
              ; total the length of strings in string-list
               (let loop ((size 0) (lst string-list))
                 (if (pair? lst)
                     (loop (+ size (string-length (car lst))) (cdr lst))
                     size)))
              (final-str (make-string final-size))
              (dest-i (- final-size 1)))
         (let outer ((lst string-list))
           (if (pair? lst)
               (let ((str (car lst)))
                 (do ((i (- (string-length str) 1) (- i 1)))
                     ((negative? i) (outer (cdr lst)))
                   (string-set! final-str dest-i (string-ref str i))
                   (set! dest-i (- dest-i 1))))
               final-str))))

Using next-token

     (define (read-all port)
       (next-token '() '(*eof*) "reading string" port))

This version of read-all runs 4 times faster in elapsed time, 3 times faster in CPU time and allocates 3 times less memory, compared with the chunked read procedure above. The benchmark was run by a Gambit-C 3.0 interpreter, on a Pentium III Xeon 500 MHz computer with 128 MB of main memory and FreeBSD 4.0-RELEASE operating system.

Assuming the size of the port or its upper bound is known

If the input port is open on a file and we happen to know the size of the file, or if we know an upper bound for the size of the file, we need to invoke read-string only once:

     (define (read-all port)
       (read-string (get-upper-bound-size-estimate) port))

This procedure allocates no garbage and is twice as fast as the previous one (the one that used next-token).

Using port-copy

Finally we can use a function port-copy, if it is available:

     (define (read-all port)
       (call-with-output-string
         (lambda (o-port)
           (port-copy port o-port))))

This version is about 20% slower than the read-string-based implementation above. However this version does not require any estimate for the size of the port.

References

An article Re: Nicest and most efficient way to (read-all port)=>string posted on a newsgroup comp.lang.scheme on Fri, 2 Nov 2001 11:31:14 -0800

The function port-copy: Ports and the enhanced i/o

Functions next-token and read-string: Stream tokenizers

Exhaustive parsing of nested strings

This problem, posed on comp.lang.scheme in June 2002, is about extracting all substrings that are delimited by parentheses. Normally a parser is only asked for the longest or the shortest match. A string delimited by parentheses may include substrings delimited by parentheses. The depth of nesting is arbitrary. For example, given the input string

     aa(AA(BB)CC)aaaa((DD)EE)aaa(FF(GG))aaa(HH)aaa

we should produce the list of strings (AA(BB)CC) (BB) (DD) ((DD)EE) (FF(GG)) (GG) (HH)

A solution is given below. It has been tested on Gambit-C 3.0, Bigloo 2.4b and SCM 5d6. We assume that myenv-bigloo.scm or a similar platform-specific prelude and input-parse.scm have been loaded.

parse-nested-strs port accum

This function performs the parsing task. It reads the input string from port and adds the list of found parenthesized strings to accum. The latter is the list of paren-delim strings seen so far. Initially, accum should be an empty list.

     (define (parse-nested-strs port accum)
       (if (eof-object? (skip-until '(#\( *eof*) port))
           accum ; done reading, no more open paren encountered
           ; we have seen and read the opening paren
           (let ((accum
                  (read-accum-within-paren port '("(") accum)))
             (parse-nested-strs port accum))))

read-accum-within-paren port fragments accum

The function read-accum-within-paren is a helper. We invoke it when we saw an opening parenthesis. The function reads port until the corresponding closing parenthesis and returns the updated accum. If there are nested paren-delimited strings, they will be added to accum as well. At the top of the returned accum is the longest paren-delimited string. The argument fragments is the list of seen text fragments, in the reverse order.

     (define (read-accum-within-paren port fragments accum)
       (let* ((token (next-token '() '(#\( #\) ) "reading body" port))
              (delim (read-char port)) ; what we've seen after the token
              (fragments (cons token fragments)) ; update the list of fragments
              )
         (case delim
           ((#\) ) ; we've seen the matching closing paren;
                   ; return what we've got
            (cons (apply string-append (reverse (cons ")" fragments))) accum)
            )
           (else   ; we've seen an opening paren: recurse
            (let ((accum (read-accum-within-paren port '("(") accum)))
              (read-accum-within-paren port
                                       (cons (car accum) fragments)
                                       accum))))))

As a test, the following code:

     (write (call-with-input-string
       "aa(AA(B(X())(Y)B)CC)aaaa((DD)EE)aaa(FF(GG))aaa(HH)aaa"
       (lambda (port)
         (parse-nested-strs port '()))))

prints the following expected result:

     ("(HH)" "(FF(GG))" "(GG)" "((DD)EE)" "(DD)" "(AA(B(X())(Y)B)CC)"
      "(B(X())(Y)B)" "(Y)" "(X())" "()")

References

An article Re: Parsing of nested strings, help needed posted on a newsgroup comp.lang.scheme on 1 Jul 2002 19:17:48 -0700

Functions next-token and read-string: Stream tokenizers

Reading numbers from a file with comments

This problem, posed on comp.lang.scheme in August 2003, was to read (floating point) numbers from a file and return them as a list. The file may contain comments, which are marked by the character #\# in the first position, and span through the end of the line. Files of that structure are often used as configuration files for numeric code. Given the sample file /tmp/a.txt of the following content

     # comment
     # another line of comment
      1.23 
      1.223e12  1.e0
           2.3e23 
     # comment again
     2.344 
     -1.223e23  2.2334
     +1.23e0
     #
     #

the expression (call-with-input-file "/tmp/a.txt" read-numbers) should yield

     (1.23 1.223e12 1.0 2.3e23 2.344 -1.223e23 2.2334 1.23)

The function read-numbers is given below. It is portable and should work on any R5RS system. The code makes no assumptions about the distribution of the numbers or comments. The code scans the input exactly once. There is no need to read a string and scan it over again.

     (define (read-numbers port)
       ; we're at the beginning of the line. Check the first char for
       ; #\# to see if it starts a comment
       (define (state-1 port accum)
         (let ((c (peek-char port)))
           (if (eof-object? c) (finish accum)
             (case c
               ((#\#) ; it was a comment -- skip to the end of the line
                 (skip-until '(#\newline *eof*) port)
                 (state-1 port accum))               ; recur
               (else (state-2 port accum))))))       ; it was probably a number
     
       ; we're sure we're not at the comment. Read the number.
       ; keep track if we're at the end of the line or not
       (define (state-2 port accum)
         (let ((c (skip-while '(#\space) port)))
           (if (or (eof-object? c) (char=? c #\newline))
             (begin (read-char port) (state-1 port accum))
             (let ((token 
                     (next-token-of
                       (lambda (c)           ; recognizer for valid tokens
                         (cond
                           ((eof-object? c) #f)
                           ((string-index "0123456789eE+-." c) c)
                           (else #f)))
                       port)))
               (if (equal? token "")
                 (error "bad number")
                 (state-2 port (cons (string->number token) accum)))))))
     
       (define (finish accum) (reverse accum))
     
       ; Get the ball rolling
       (state-1 port '()))

References

An article Re: Reading a file in Scheme posted on a newsgroup comp.lang.scheme on Mon, 25 Aug 2003 12:10:50 -0700

Functions next-token-of, skip-until and skip-while : Stream tokenizers

Reading comma-separated values (CSV)

The problem is to read numbers, strings and tokens from a file of comma-separated-values. The input port should contain comma-separated tokens, interspersed with the arbitrary amount of white space. The tokens are: numbers (that is, anything parseable by number->string), strings and anything else. The input file may also contain `missing' tokens, that is, two consecutive commas or the trailing comma, within the arbitrary whitespace. Such missing tokens are considered `symbols' whose string representation is the empty string. The string token in the input stream should be a R5RS string datum: the string should be enclosed in double quotes and may contain (escaped) backslashes, escaped quotes, and other escaped characters as permitted by R5RS or the particular Scheme implementation.

read-csv nan-interp port -> list-of-values

Here nan-interp is either a procedure string-or-symbol -> val or an associative list with string or symbolic keys. The function returns the list of read values: a numeric token is converted to a number; strings and anything else are passed to the nan-interp function and its result is used in the list of return vals. If nan-interp is an associative list, we do the lookup in the list then.

For example,

     (write
      (call-with-input-string
        "    1.223, \"str\" , \"complex \\\\ string, indeed\\\"\"   , 2.23 ,   nil"
        (lambda (port) (read-csv (lambda (x) x) port))))
     ;==> (1.223 "str" "complex \\ string, indeed\"" 2.23 nil)

     (display
       (call-with-input-string
         "-1,+2,-15e-15,,  nan,   22.3334, 2.23 , ,   \"nil\","
         (lambda (port) (read-csv 
             `((,(string->symbol "") . 0.0) (nan . -999) ("nil" . -1.0)) port))))
     ;==> (-1 2 -1.5e-14 0.0 -999 22.3334 2.23 0.0 -1.0 0.0)

The latter example exhibits `skipped' tokens: two commas in a row and the trailing comma. The skipped tokens are represented by 0.0 in the result list, as specified in the associative list in the first argument to read-csv.

References

read-csv.scm [5K]
The implementation and the examples
An earlier version of the code for a simpler problem has been posted as Re: regexp, parsing on the newsgroup comp.lang.scheme on Tue, 3 Feb 2004 16:40:14 -0800

Last updated April 1, 2006

This site's top page is http://okmij.org/ftp/

oleg-at-okmij.org
Your comments, problem reports, questions are very welcome!

Converted from SXML by SXML->HTML