The following simple functions surprisingly often suffice when parsing an input stream. They either skip, or build and return tokens according to the inclusion or delimiter semantics. The list of characters to expect, include, or to break at may vary from one invocation of a function to another. This feature allows the functions to easily parse even context-sensitive languages.
EOF is generally frowned upon, and thrown up upon if encountered. Still, sometimes the EOF "symbol" can be expected: it may appear in a list of break-characters, coded as a symbol *eof*.
The input stream to parse is specified as a PORT, which is usually the last (and optional) argument. It defaults to the current input port if omitted.
This package relies on a function parser-error, which must be defined by a user of the package. The function has the following signature:
parser-error PORT MESSAGE SPECIALISING-MSG*
Many procedures of this package call parser-error to report a parsing error. The first argument is a port, which typically points at the offending character or its neighborhood; most Scheme systems let the user query a PORT for its current position. MESSAGE is the description of the error. The other arguments supply further details about the problem.
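For instance, a user could define parser-error as a thin wrapper over the host system's error procedure (a minimal sketch, assuming the host Scheme provides error; a real implementation would also report the position of PORT):

(define (parser-error port message . specialising-msgs)
  ; A sketch: ignore the port position and signal an error with
  ; the message and all the specialising messages.
  (apply error message specialising-msgs))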
The package provides the following procedures: assert-curr-char, skip-until, skip-while, peek-next-char, next-token, next-token-of, read-text-line, read-string. Parsing of a QUERY_STRING is described in a separate document.
find-string-from-port? STR IN-PORT [MAX-NO-CHARS]
Looks for the string STR within the first MAX-NO-CHARS characters of the input port IN-PORT. MAX-NO-CHARS may be omitted: in that case, the search span is limited only by the end of the input stream. When STR is found, the function returns the number of characters it has read from the port (the port is then positioned to read the first character after STR). The function returns #f if STR was not found.
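For instance (a small sketch, not from the package's documentation), one could skip past the "-->" that closes an HTML comment, giving up after at most 1000 characters:

(if (find-string-from-port? "-->" port 1000)
    (display "comment closed")
    (display "no --> within the first 1000 characters"))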
find-string-from-port? is a part of SLIB. Therefore, it is available on all platforms supported by SLIB.
look-for-str.scm [5K]
The commented source code.
vinput-parse.scm [18K]
Verification code.
assert-curr-char CHAR-LIST STRING [PORT]
Reads a character from the PORT and looks it up in the CHAR-LIST of expected characters. If the read character was found among the expected, it is returned. Otherwise, the procedure writes a nasty message using STRING as a comment, and quits.
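For example (a sketch, assuming port is positioned at the character in question): insisting that the next character is a decimal point,

(assert-curr-char '(#\.) "expecting a decimal point" port)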
skip-until CHAR-LIST [PORT]
Reads and skips characters from the PORT until one of the break characters is encountered. This break character is returned. The break characters are specified as the CHAR-LIST. This list may include EOF, which is to be coded as a symbol *eof*.
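For instance (a sketch): skipping the rest of the current line, stopping at the newline or at the end of the file,

(skip-until '(#\newline *eof*) port)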
skip-until NUMBER [PORT]
Skips the specified NUMBER of characters from the PORT and returns #f.
skip-while CHAR-LIST [PORT]
Advances the PORT to the first character that is not a member of the CHAR-LIST -- or till the EOF, whichever occurs sooner. This character or the EOF object is returned. This character is left on the stream.
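For instance (a sketch): skipping leading blanks, so that the first non-blank character (which remains on the stream) can then be examined,

(skip-while '(#\space #\tab) port)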
peek-next-char [PORT]
Advances to the next character in the PORT and peeks at it. This function is useful when parsing LR(1)-type languages (one-char-read-ahead).
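In other words, (peek-next-char port) behaves like the following sketch:

(begin (read-char port) (peek-char port))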
input-parse.scm [14K]
The commented source code.
vinput-parse.scm [18K]
A thorough verification code.
next-token PREFIX-CHAR-LIST BREAK-CHAR-LIST [COMMENT-STRING] [PORT]
Skips any number of prefix characters (members of the PREFIX-CHAR-LIST), if any, and reads a sequence of characters up to (but not including) a break character, one of the BREAK-CHAR-LIST. The string of characters thus read is returned. The break character is left on the input stream.
The list of break characters may include EOF, which is to be coded as a symbol *eof*. Otherwise, EOF is fatal, generating an error message including a specified COMMENT-STRING (if any).
As we cannot tell offhand the final size of the token we are reading, we make a guess, pre-allocate a string, and grow it by quanta if necessary. The quantum is always the length of the token buffer before it was extended the last time. This amounts to a Fibonacci-type extension, which has been shown optimal.
An internal procedure
input-parse:init-buffer
determines the initial buffer allocation policy. By default, the
policy is to reuse a statically-allocated buffer. For Scheme systems
without preemptive threads or shared substrings, this policy proved
optimal.
Another implementation of the same procedure is
next-token-list-based
, which accumulates token characters
in a list rather than placing them into an extendible string buffer.
In reality, the list version works just as fast as the string buffer
version above, yet the list version allocates 50% more memory and has
to run garbage collection 50% as many times (Gambit-C 3.0).
; Example: a snippet from CGI:url-unquote
(let ((read-primitive-token
       (lambda ()
         (if (eof-object? (peek-char)) ""
             (next-token '() '(#\= #\+ #\& #\% *eof*) "URL-unquoting")))))
  ...)

; Parsing of a comma-delimited list
(let loop ()
  (cons (next-token '(#\space) '(#\, *eof*) "" port)
        (if (eof-object? (read-char port)) '()
            (loop))))
next-token-of INC-CHARSET [PORT]
Reads characters from the PORT that belong to the list of characters INC-CHARSET. The reading stops at the first character which is not a member of the set. This character is left on the stream. All the read characters are returned in a string.
next-token-of PRED [PORT]
Reads characters from the PORT for which PRED (a procedure of one argument) returns non-#f. The reading stops at the first character for which PRED returns #f. That character is left on the stream. All the results of evaluating PRED up to #f are returned in a string.
PRED is a procedure that takes one argument (a character or the EOF object) and returns a character or #f. The returned character does not have to be the same as the input argument to PRED. For example,
(next-token-of
  (lambda (c)
    (cond
      ((eof-object? c) #f)
      ((char-alphabetic? c) (char-downcase c))
      (else #f))))

will try to read an alphabetic token from the current input port, and return it in lower case.
read-text-line [PORT]
This procedure is akin to next-token. It reads one line of text from the PORT, and returns it as a string. A line is a (possibly empty) sequence of characters terminated by CR, CRLF or LF (or even the end of file). The terminating character (or the CRLF combination) is removed from the input stream; the terminating character(s) is not a part of the returned string either. If EOF is encountered before any character is read, the return value is EOF. This function is a variant of the conventional read-line. It is renamed to avoid conflicts with many Scheme implementations, which define a similar but not identical function.
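As an illustration (a sketch, not part of the package), the following collects all lines of a port into a list:

(define (read-lines port)
  (let loop ((lines '()))
    (let ((line (read-text-line port)))
      (if (eof-object? line)
          (reverse lines)
          (loop (cons line lines))))))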
read-string N [PORT]
Reads the first N characters from the PORT, and returns them in a string. If EOF is encountered before N characters are read, a shorter string will be returned. If N is not positive, an empty string will be returned.
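For instance (a sketch):

(call-with-input-string "GIF89a..."
  (lambda (port) (read-string 6 port)))  ; ==> "GIF89a"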
input-parse.scm [14K]
The commented source code.
vinput-parse.scm [18K]
A thorough verification code.
The following procedure counts the occurrences of a string pat in a file filename.

(define (count-occurrences pat filename)
  (call-with-input-file filename
    (lambda (port)
      (let loop ((count 0))
        (if (find-string-from-port? pat port)
            (loop (+ 1 count))
            count)))))
or, more succinctly,
(define (count-occurrences-do pat filename)
  (call-with-input-file filename
    (lambda (port)
      (do ((count 0 (+ 1 count)))
          ((not (find-string-from-port? pat port)) count)))))
If a file
/tmp/a.t
contains
ab abra abracadabra abracadabra abracaDabracadabra abracadabraabracadabra

then

(count-occurrences "abracadabra" "/tmp/a.t")
(count-occurrences-do "abracadabra" "/tmp/a.t")

both return 5.
The search needs to keep no more than (string-length pat) characters in memory, no matter how big the input file is.
dehtmlize PORT
Given an input port that (possibly) contains an HTML or XML text, the function copies the text without the "tags" to the current-output-port. To be more precise, the function omits from copying anything between < and >, and the brackets themselves. Comments, <!-- ... -->, are disregarded as well. Note that the comments may contain nested tags.
(define (dehtmlize port)
  (let loop ()
    (let ((token (next-token '() '(#\< *eof*) "opening tag" port)))
      (display token)
      (and (eq? #\< (read-char port))
           (or (and (eq? #\! (peek-char port)) ; check for possible comment
                    (eq? #\- (peek-next-char port))
                    (eq? #\- (peek-next-char port))
                    (find-string-from-port? "-->" port)
                    (loop))
               (and (find-string-from-port? ">" port)
                    (loop)))))))
The article referenced below embeds the dehtmlize function in a complete program that fetches a URL and prints out the corresponding HTML document without tags and comments.
The article also compares dehtmlize to the corresponding Perl code. The Scheme code has a notable advantage over the Perl code in its ability to dehtmlize the input stream "on the fly". The Perl code loads the entire text of the input document into memory. Any Perl code must do that: regular expression substitutions cannot be applied to a port, because an input port generally does not allow "backtracking" by an arbitrary amount. The Scheme code works with a strictly sequential input port; it needs only as much memory as the longest token -- which is far smaller than the whole document. The Perl code makes four passes through the whole input (one to read it all into memory, two to perform the regular expression substitutions specified in dehtmlize, and yet another to write out the result). The Scheme code accomplishes all the work in a single pass. The size of the Scheme code is comparable with that of the Perl code.
A quick comparison of Scheme and Perl regarding parsing of input [plain text file]
The article itself. It was originally posted as "HTTP fetching" on the newsgroup comp.lang.scheme on 28 Feb 2001 03:07:49 +0100.
The task is to implement a function read-all, which takes an input port and returns the string of all characters read from the port till the end-of-file. The port can be opened on a file, terminal or a communication channel. The obvious solution accumulates the characters in a list:

(define (read-all port)
  (let loop ((accum '()))
    (let ((c (read-char port)))
      (if (eof-object? c)
          (apply string (reverse accum))
          (loop (cons c accum))))))

This method is clumsy and slow: all the characters read from the port have to be scanned once more during the reversal of the list, and one more time when building a string. Furthermore, applying a function (in our case, string) to a large argument list is inefficient. Some Scheme systems even limit the number of arguments that can be passed to a function. A better approach reads the port in chunks:

(define (read-all port)
  (let ((quantum 8192))
    (let loop ((chunks-rev '()))
      (let ((chunk (read-string quantum port)))
        (if (or (eof-object? chunk) (zero? (string-length chunk)))
            (string-concatenate-reverse chunks-rev)
            (loop (cons chunk chunks-rev)))))))
where read-string is one of the "standard" functions implemented on many systems. It is also a part of the parsing library above. The function string-concatenate-reverse is in SRFI-13. For completeness, here is the code of that function:

; See SRFI-13 for details
(define (string-concatenate-reverse string-list)
  (let* ((final-size ; total the length of strings in string-list
           (let loop ((size 0) (lst string-list))
             (if (pair? lst)
                 (loop (+ size (string-length (car lst))) (cdr lst))
                 size)))
         (final-str (make-string final-size))
         (dest-i (- final-size 1)))
    (let outer ((lst string-list))
      (if (pair? lst)
          (let ((str (car lst)))
            (do ((i (- (string-length str) 1) (- i 1)))
                ((negative? i) (outer (cdr lst)))
              (string-set! final-str dest-i (string-ref str i))
              (set! dest-i (- dest-i 1))))
          final-str))))
A far better solution relies on next-token:

(define (read-all port)
  (next-token '() '(*eof*) "reading string" port))

This version of read-all runs 4 times faster in elapsed time, 3 times faster in CPU time and allocates 3 times less memory, compared with the chunked read procedure above. The benchmark was run by a Gambit-C 3.0 interpreter, on a Pentium III Xeon 500 MHz computer with 128 MB of main memory and the FreeBSD 4.0-RELEASE operating system.

If we can estimate an upper bound on the amount of data in the port, we can do even better:

(define (read-all port)
  (read-string (get-upper-bound-size-estimate) port))

This procedure allocates no garbage and is twice as fast as the previous one (the one that used next-token).
Yet another solution relies on a function port-copy, if it is available:

(define (read-all port)
  (call-with-output-string
    (lambda (o-port) (port-copy port o-port))))

This version is about 20% slower than the read-string-based implementation above. However, this version does not require any estimate for the size of the port.

An article
Re: Nicest and most efficient way to (read-all port)=>string, posted on the newsgroup comp.lang.scheme on Fri, 2 Nov 2001 11:31:14 -0800
The function port-copy: Ports and the enhanced i/o
Functions next-token and read-string: Stream tokenizers
The problem: given an input stream such as

aa(AA(BB)CC)aaaa((DD)EE)aaa(FF(GG))aaa(HH)aaa

we should produce the list of strings

(AA(BB)CC) (BB) (DD) ((DD)EE) (FF(GG)) (GG) (HH)
A solution is given below. It has been tested on Gambit-C 3.0, Bigloo 2.4b and SCM 5d6. We assume that myenv-bigloo.scm or a similar platform-specific prelude and input-parse.scm have been loaded.
The main function parse-nested-strs scans the port and adds the list of found parenthesized strings to accum. The latter is the list of paren-delimited strings seen so far. Initially, accum should be an empty list.

(define (parse-nested-strs port accum)
  (if (eof-object? (skip-until '(#\( *eof*) port))
      accum  ; done reading, no more open paren encountered
      ; we have seen and read the opening paren
      (let ((accum (read-accum-within-paren port '("(") accum)))
        (parse-nested-strs port accum))))
read-accum-within-paren is a helper. We invoke it when we have seen an opening parenthesis. The function reads the port until the corresponding closing parenthesis and returns the updated accum. If there are nested paren-delimited strings, they will be added to accum as well. At the top of the returned accum is the longest paren-delimited string. The argument fragments is the list of seen text fragments, in reverse order.

(define (read-accum-within-paren port fragments accum)
  (let* ((token (next-token '() '(#\( #\) ) "reading body" port))
         (delim (read-char port))            ; what we've seen after the token
         (fragments (cons token fragments))  ; update the list of fragments
        )
    (case delim
      ((#\) )   ; we've seen the matching closing paren;
                ; return what we've got
       (cons (apply string-append (reverse (cons ")" fragments)))
             accum))
      (else     ; we've seen an opening paren: recurse
       (let ((accum (read-accum-within-paren port '("(") accum)))
         (read-accum-within-paren port (cons (car accum) fragments) accum))))))

As a test, the following code:
(write
  (call-with-input-string
    "aa(AA(B(X())(Y)B)CC)aaaa((DD)EE)aaa(FF(GG))aaa(HH)aaa"
    (lambda (port) (parse-nested-strs port '()))))

prints the following expected result:
("(HH)" "(FF(GG))" "(GG)" "((DD)EE)" "(DD)" "(AA(B(X())(Y)B)CC)" "(B(X())(Y)B)" "(Y)" "(X())" "()")
An article Re: Parsing of nested strings, help needed, posted on the newsgroup comp.lang.scheme on 1 Jul 2002 19:17:48 -0700
Functions next-token and read-string: Stream tokenizers
This problem, posed on comp.lang.scheme in August 2003, was to
read (floating point) numbers from a file and return them as a list.
The file may contain comments, which are marked by the character
#\#
in the first position, and span through the end of
the line. Files of that structure are often used as configuration
files for numeric code. Given the sample file
/tmp/a.txt
of the following content
# comment
# another line of comment
1.23 1.223e12 1.e0 2.3e23
# comment again
2.344 -1.223e23 2.2334 +1.23e0
#
#

the expression
(call-with-input-file "/tmp/a.txt" read-numbers)
should yield

(1.23 1.223e12 1.0 2.3e23 2.344 -1.223e23 2.2334 1.23)
The function
read-numbers
is given below. It is
portable and should work on any R5RS system. The code makes no
assumptions about the distribution of the numbers or comments. The
code scans the input
exactly once. There is no need to read
a string and scan it over again.
(define (read-numbers port)
  ; we're at the beginning of the line. Check the first char for
  ; #\# to see if it starts a comment
  (define (state-1 port accum)
    (let ((c (peek-char port)))
      (if (eof-object? c) (finish accum)
          (case c
            ((#\#)  ; it was a comment -- skip to the end of the line
             (skip-until '(#\newline *eof*) port)
             (state-1 port accum))  ; recur
            (else (state-2 port accum)))))) ; it was probably a number

  ; we're sure we're not at the comment. Read the number.
  ; keep track if we're at the end of the line or not
  (define (state-2 port accum)
    (let ((c (skip-while '(#\space) port)))
      (if (or (eof-object? c) (char=? c #\newline))
          (begin (read-char port) (state-1 port accum))
          (let ((token (next-token-of
                         (lambda (c) ; recognizer for valid tokens
                           (cond
                             ((eof-object? c) #f)
                             ((string-index "0123456789eE+-." c) c)
                             (else #f)))
                         port)))
            (if (equal? token "") (error "bad number")
                (state-2 port (cons (string->number token) accum)))))))

  (define (finish accum) (reverse accum))

  ; Get the ball rolling
  (state-1 port '()))
An article Re: Reading a file in Scheme, posted on the newsgroup comp.lang.scheme on Mon, 25 Aug 2003 12:10:50 -0700
Functions next-token-of, skip-until and skip-while: Stream tokenizers
The problem is to read numbers, strings and tokens from a file of comma-separated values. The input port should contain comma-separated tokens, interspersed with an arbitrary amount of white space. The tokens are: numbers (that is, anything parseable by string->number), strings, and anything else. The input
file may also contain `missing' tokens, that is, two consecutive
commas or the trailing comma, within the arbitrary whitespace. Such
missing tokens are considered `symbols' whose string representation is
the empty string. The string token in the input stream should be a
R5RS string datum: the string should be enclosed in double quotes and
may contain (escaped) backslashes, escaped quotes, and other escaped
characters as permitted by R5RS or the particular Scheme
implementation.
The reading function, read-csv, takes two arguments: nan-interp and an input port. nan-interp is either a procedure string-or-symbol -> val or an associative list with string or symbolic keys. The function returns the list of read values: a numeric token is converted to a number; strings and anything else are passed to the nan-interp function and its result is used in the list of return vals. If nan-interp is an associative list, we do the lookup in the list instead.

(write
  (call-with-input-string
    " 1.223, \"str\" , \"complex \\\\ string, indeed\\\"\" , 2.23 , nil"
    (lambda (port) (read-csv (lambda (x) x) port))))
;==> (1.223 "str" "complex \\ string, indeed\"" 2.23 nil)
(display
  (call-with-input-string
    "-1,+2,-15e-15,, nan, 22.3334, 2.23 , , \"nil\","
    (lambda (port)
      (read-csv `((,(string->symbol "") . 0.0) (nan . -999) ("nil" . -1.0))
                port))))
;==> (-1 2 -1.5e-14 0.0 -999 22.3334 2.23 0.0 -1.0 0.0)

The latter example exhibits `skipped' tokens: two commas in a row and the trailing comma. The skipped tokens are represented by 0.0 in the result list, as specified in the associative list in the first argument to
read-csv.

An article posted on the newsgroup comp.lang.scheme on Tue, 3 Feb 2004 16:40:14 -0800