This article analyzes various ways of traversing a collection. We pay a particular attention to the most complex and general case of enumerating only a subset of a collection and performing actions that depend on the previously seen elements. We also discuss generators such as those in Icon and Python. A proposal is made for a convenient generic enumerator, which could be used, for example, in a truly functional database interface.
This article was originally written in April 2000. Its analysis of enumerators has since been researched further and extended to other languages. The paper [LL3-enumerators] is the latest presentation on this topic. The proposal for a convenient and generic enumerator first suggested here has been tested in practice. We implemented the left-fold enumerator to access (object-)relational databases in a functional way [database-access]. This interface has been used in a production environment. The generic-left-fold-based interface underlies the Collection SRFI. It was also used in a TIFF library [TIFF], and even in other languages [nonrec-fold-stream].
This present article also demonstrates a non-trivial use of call/cc -- to invert a traversal "inside out." In this application, call/cc cannot be replaced with a simple throw/catch construction. Later on the inversion procedure was extended to the most general enumerator [inversion-src]. It turns out that with a small modification, we can perform the iteration inversion in languages without first-class continuations, e.g., Haskell [nonrec-fold-stream].
Suppose we have a collection with a
for-each-type
enumerator: a file system directory, an XML document, a suffix tree,
or a "dynamic" collection that is the result of executing a
SELECT DBMS/ODBC statement. We wish to obtain the first
10 elements of the collection that satisfy a certain criterion:
contain specific keywords, sound alike, etc.
For the purpose of this example, let us consider a collection
of numbers of the form
65537x+5 mod 1997; The selection
criterion is divisibility by 7. To make the problem less trivial, let
the collection be represented by a treap [Treap]. We will not use the full power of the treap in this example: what
matters to us is that the treap is an advanced collection and that it
implements a
for-each enumerator. A more general example is
given in [inversion-src].
A word about terminology is in order, as it is rather
confusing. We will call a
for-each-type procedure
enumerator. Note the active suffix
OR. It is
this procedure that takes a collection and a handler, and applies the
handler to each element of the collection in turn. The handler itself
will be called an
iteratee, because it is
being applied -- by an active entity, the enumeratOR.
Enumerator is presumably privy to the internals of a collection; it
knows how to fetch elements in some (specified) order, and knows when
to stop. A handler does not have this knowledge, nor is it supposed
to. The job of an iteratee is to do something with the current
element, irrespective of how this current element was obtained.
for-each enumeratorWe will first consider the most straightforward implementation: it will be our baseline. The complete code for this and the following examples is given in [enumerators-src].
(define (get-good-numbers-dumb treap count-max)
(let ((count 0) (result '()))
((treap 'for-each-ascending)
(lambda (key-value)
(let ((x (car key-value)))
(and (< count count-max) (good-value? x)
(begin
(set! count (+ 1 count))
(set! result (cons x result)))))))
(reverse result)))
(cerr (time (get-good-numbers-dumb treap pagesize)) nl)
23 ms real time
20 ms cpu time (20 user, 0 system)
1 collection accounting for 0 ms cpu time (0 user, 0 system)
171136 bytes allocated
no minor faults
no major faults
(42 49 56 63 70 112 119 126 133 140)
Assignments are prominent in this solution: they are necessary to
advance the state. The latter must be represented by global (to the
iteratee) variables, which can persists from one invocation of the
iteratee to another.
A smarter approach is to escape from the enumerator when we are done
(define (get-good-numbers-cc treap count-max)
(let ((count 0) (result '()))
(call-with-current-continuation
(lambda (escape)
((treap 'for-each-ascending)
(lambda (key-value)
(let ((x (car key-value)))
(and (good-value? x)
(begin
(set! count (+ 1 count))
(set! result (cons x result))
(if (>= count count-max)
(escape #f)))))))))
(reverse result)))
(cerr (time (get-good-numbers-cc treap pagesize)) nl)
2 ms real time
0 ms cpu time (0 user, 0 system)
no collections
15416 bytes allocated
no minor faults
no major faults
(42 49 56 63 70 112 119 126 133 140)
Let us take a different turn: view a collection as if it was a
lazy list. Lazy list is a promise, which when applied
returns
'() or
(cons value lazy-list)
data LazyList a = Promise (Nil | (Cons a (LazyList a)))
(define (treap->lazy-list treap)
(delay
(call-with-current-continuation
(lambda (k-main)
((treap 'for-each-ascending)
(lambda (key-value)
; (cerr "\nIn treap->lazy-list's iteratee: " key-value "\n")
(call-with-current-continuation
(lambda (k-reenter)
(let ((x (car key-value)))
(k-main (cons x
(delay
(call-with-current-continuation
(lambda (k-new-main)
(set! k-main k-new-main)
(k-reenter #f)))))))))))
(k-main '())))))
With this in mind, we can do (define lazy-treap-list (treap->lazy-list treap))
(define (get-good-numbers-ll lazy-treap-list count-max)
(let loop ((count 0) (llst lazy-treap-list) (result '()))
(cond
((or (>= count count-max) (fnull? llst)) (reverse result))
((good-value? (fcar llst))
(loop (+ 1 count) (fcdr llst) (cons (fcar llst) result)))
(else
(loop count (fcdr llst) result)))))
(cerr (time (get-good-numbers-ll lazy-treap-list pagesize)) nl)
5 ms real time
0 ms cpu time (0 user, 0 system)
no collections
47752 bytes allocated
no minor faults
no major faults
(42 49 56 63 70 112 119 126 133 140)
Note a commented out line at the very beginning of the treap
iteratee that calls the
cerr procedure . If you uncomment
that line you can see for yourself that during the execution of
get-good-numbers-ll, the iteratee is invoked only as many times
as absolutely necessary. In all other respects
lazy-treap-list looks like a regular list. This makes even
complex, history-dependent traversal easily realizable: lists
can be walked in a variety of ways. The
get-good-numbers-ll is a thinly-veiled finite state machine implemented in a
pure-functional style.
Performance-wise, the lazy list is definitely better than the dumb enumeration: the lazy list does not fetch elements if they are not needed. Yet the lazy list loses out to the early escape.
The
get-good-numbers-ll procedure is purely
functional: The state of traversal is abstracted into a lazy pair, a
first-class object. To advance the traversal, we have to
explicitly apply methods like
fcdr (aka
next). We also have to explicitly check for EOF. In this
respect, a lazy list is very similar to an STL iterator. On the other
hand, a for-each enumerator does not give an iteratee any hint about
the current state of traversal, or its history.
The procedure
treap->lazy-list shows off call/cc at
its best -- its ability not only to escape a scope but to re-enter it
as well.
treap->lazy-list cannot be implemented with
simpler, throw-like constructions. In this example, the enumerator and
its caller act as
coroutines. The full power of call/cc is
required to implement them in Scheme.
There has been a discussion about the fate of call/cc. No doubt
call/cc causes many a grief to implementors of Scheme. Is it worth the
trouble? The conclusions of this article are mixed. The early-escape
solution needs call/cc only to throw a non-reenterable exception. This
solution will work even if call/cc did not exist, being replaced by
its weak relative,
throw or
raise. Only
treap->lazy-list requires the full call/cc. The latter
seems indispensable in "inverting" of an iteration, when we
have an implementation control over an iteratee but not over its
iteratOR. We gain flexibility, but we have to brace for the problems with
performance, finalization, and excessive resource usage.
A more general iteration-inversion procedure along with a few validation tests can be found in [inversion-src].
A programming language Icon popularized generators as a way to traverse real and virtual collections [Generators-Icon]. Generators are also present in languages Ruby and Python:
Python 2.2 introduces a new construct accompanied by a new keyword. The construct is generators; the keyword is yield. Generators make possible several new, powerful, and expressive programming idioms, but are also a little bit hard to get one's mind around at first glance.[Generators-Python]
A generator traverses a real or an ephemeral collection, retrieves the values or computes them on the fly -- and returns all of them, potentially. The context of the generating expression -- the consumer -- determines how many values will actually be produced. The documentation of Icon defines a generator as an expression that can produce several values, on demand. An article [Generators-Icon] gave a good illustration of generators in Icon:
If the result produced by a generator does not lead to the success of an enclosing expression, however, the generator is resumed to produce another value. An example issentence := "Store it in the neighboring harbor" if (i := find("or", sentence)) > 5 then write(i)Here the first result produced by the generator, 3, is assigned toi, but this value is not greater than 5 and the comparison operation fails. At this point, the generator is resumed and produces the second position, 23, which is greater than 5. The comparison operation then succeeds and the value 23 is written.
In this Section, we show how generators are related to lazy lists, and how to translate the above "generative" code into Scheme, almost verbatim. The complete code for the examples is given in [generators-src].
Let us first re-write the above two lines of Icon code into Scheme:
(let ((sentence "Store it in the neighboring harbor"))
(display
(fcar
(lfilter (lambda (val) (> val 5))
(generative (find-str-generator "or" sentence))))))
Indeed, the translation is almost literal. Functions
fcar (a forcing
car) and
lfilter (a
lazy filter) are defined in [generators-src]. Informally
speaking, the function
generative takes a generator -- an
expression that can produce several results -- and collects all such
results in a lazy list. The laziness is important: we do not want to
generate more results than we are able to consume. In the above
example, the expression
find-str-generator is capable of
producing several values that are greater than five. We are interested
only in the first one. The other values are not demanded, and hence,
are not even generated.
To be more formal, our generators are procedures of one argument:
suspend, which is also a procedure of one argument. A
generator procedure is called to perform the iteration. Whenever the
procedure is about to yield a value, it should call
suspend and pass it that value. When
suspend returns, the
generator should continue the traversal and attempt to retrieve or
compute another value. The generator returns when there is no more
values to generate. The function
generative takes a
generator as its sole argument, makes a suspend procedure, invokes the
generator on the suspend procedure, and returns a lazy list of all the
values that have been passed to
suspend.
(define (generative generator)
(delay
(call-with-current-continuation
(lambda (k-main)
(define (suspend val)
(call-with-current-continuation
(lambda (k-reenter)
(k-main
(cons val
(delay
(call-with-current-continuation
(lambda (k-new-main)
(set! k-main k-new-main)
(k-reenter #f)))))))))
(generator suspend)
(k-main '())))))
The function
find-str-generator is equivalent to Icon's
find generator, which searches for occurrences of a
pattern in a
str: (define (find-str-generator pattern str)
(lambda (suspend)
(let* ((pat-len (string-length pattern))
(str-len (string-length str))
(last-pos (- str-len pat-len))); last position where match can occur
(do ((i 0 (+ 1 i)))
((> i last-pos)) ; a naive algorithm
(if (string=? pattern (substring str i (+ i pat-len)))
(suspend i))))))
To print the indices of all occurrences of a pattern in a string, we write, in Icon:
every write(find("or", sentence))
and in Scheme: (ffor-each (lambda (val) (display val) (display " "))
(generative (find-str-generator "or" sentence)))
The similarity is remarkable. Here
ffor-each is a
forcing version of the ordinary Scheme function
for-each.
Finally, an example Python programmers seem to be proud of, an in-order traversal of a tree:
>>>> # A recursive generator that generates Tree leaves in in-order.
>>> def inorder(t):
... if t:
... for x in inorder(t.left):
... yield x
... yield t.label
... for x in inorder(t.right):
... yield x
Here is how the same example looks in Scheme: ; tree:: '() | (list label tree-left tree-right)
(define (inorder tree)
(lambda (suspend)
(if (pair? tree)
(apply
(lambda (label left right)
(ffor-each suspend (generative (inorder left)))
(suspend label)
(ffor-each suspend (generative (inorder right))))
tree))))
and how it works: (let ((tree (make-binary-tree 3)))
(display "Original tree") (newline)
(display tree) (newline)
(display "In-order-traversal: ")
(ffor-each (lambda (val) (display val) (display " "))
(generative (inorder tree))))
Output:
Original tree
(1 (2 (4 () ()) (5 () ())) (3 (6 () ()) (7 () ())))
In-order-traversal: 4 2 5 1 6 3 7
Again, this is a literally literal translation.
We should stress that our
suspend is an ordinary,
first-class value. A generator may use that value as any other value,
for instance, to pass it to
ffor-each. This is in stark
contrast with Python, where the corresponding
yield keyword
is a
keyword.
Another approach of traversing a collection is to use a
cursor, which is often confusingly called a stream or an
iterator. Examples of cursors include C++ STL iterators, SQL cursors,
C streams, UNIX files. A (read-only) cursor lets the user read the current
value and advance the cursor. Often these operations are combined into
one (cf.
getc() in C).
Ben Escoto, in his thoughtful follow-up discussion about generators in Scheme [generators-disc], has observed a few displeasing facts about cursors.
The first problem is about telling the user that there are no
more values to access. When the collection is exhausted, the cursor
operations typically return an "out-of-band" value such as EOF (cf.
getc()), raise an error (cf. SQL cursors) or an
exception (Python's
iterator.next()). None of these approaches
appear entirely satisfactory.
Some collections may allow to backtrack a cursor -- but some may not. Cursors associated with network connections or those that compute values on the fly often do not permit backtracking. If we cannot look ahead by more than one item and then backtrack, some important stream processing functions become quite difficult to write.
Unlike cursors, an enumerator does not need to do anything special to tell the iteratee that the traversal is finished: the enumerator will simply return. There is no longer a problem with out-of-band values and user's disregarding such values.
Although lazy lists look rather similar to cursors, lazy lists do not have the above drawbacks either. A lazy list is either a pair -- in which case its first element is the current value -- or an empty list, in which case the traversal is finished. We should stress that a promise of a pair in a lazy list can be forced more than once. All further forcing are equivalent to the identity function. If a lazy pair is forced it yields an ordinary pair, which can be incorporated in other lists. Thus, lazy lists always permit backtracking -- transparent to the user and in arbitrary amount.
The only inconvenience with lazy lists, also noted by Ben Escoto, is that many standard list operations such as map, filter, and fold have to be re-written specifically for lazy lists. This problem does not arise in those Scheme systems that force values implicitly.
If one has the power over implementing a collection, he can
build an enumerator that allows for an early exit. For example, such
an enumerator may examine the return result of an iteratee. If the
latter yields
#f, the iteration is abandoned; the result
'() means the traversal will go on. Any other value is
added to a list that accumulates meaningful results of all iteratee
invocations, in order.
The best enumerator ought to let an iteratee explicitly pass
its private state from one invocation to another -- similar to the the
way an
unfold combinator (see SRFI-1) tosses the
seed
around. Some syntax however ought to be employed to make it easier
for an iteratee to pack its state data into a 'parcel' -- only to
unpack the data on the next iteration. For example, one can imagine
the following enumerator:
(enumerate
(lambda (curr-elem count result)
(let ((x (car key-value)))
(if (good-value? x)
(let ((new-count (+ 1 count)))
(values (< new-count count-max) ; #f as the first arg
new-count ; will terminate the iter
(cons x result))) ; early
(values #t count result)
)))
treap
0 ; Initial value for the count
'() ; Initial value for the result
)
The iteratee in the example above takes three arguments. The
first one is the current element -- similar to the argument the
ordinary
for-each passes to its iteratee. The initial values for
the other two arguments are given as invocation parameters of the
enumerator. The iteratee must return as many values as the arguments
it received. The first return value must be either
#t or
#f. In the latter case the traversal is prematurely
terminated; otherwise it may continue, if there are elements to
enumerate, of course. The other return values are passed to the next
invocation of the iteratee as the second,... arguments. The
enumerator returns several values, which is the result yielded by the
last invocation of the iteratee save the first value of the
multi-value. If the iteratee was never called, the initial values
will be returned. One can envision similar procedures
enumerate-replace or
enumerate-replace!,
which use the first return value of the iteratee to update the current
element, either functionally or destructively.
This is just a suggestion. It may also be considered a blueprint for an ODBC-like interface in a true Scheme spirit; the collection in question is the result of a SELECT statement. I'm interested to hear if all this makes any sense.
Since the previous phrase was written, the proposed enumerator interface has indeed been tried and found satisfactory. It was first implemented in a relational database access library [database-access]. This interface has been used in a production environment. The generic-left-fold-based interface underlies the Collection SRFI. It was also used in a TIFF library [TIFF], and even in other languages [nonrec-fold-stream]. The LL3 presentation [LL3-enumerators] discussed the interface in more detail.
Joe Marshall remarked:
Have you considered passing the iteratee a function that
continues the iteration? The iteratee would look something like this: (lambda (this-element current-state continue-iteration)
(let ((next-state (combine this-element current-state)))
(if (done? next-state)
next-state
(continue-iteration next-state))))
This solution -- along with a lazy list solution above -- has one troublesome property. A special function has to be called to continue the traversal. To stop the traversal, iteratee doesn't have to do anything special: just return. This is somewhat similar to the lazy-list solution. There, the element handler must force a promise to get to the next element in a collection. In contrast, in the enumerator proposal, to stop a traversal the iteratee must return a special kind of value to the caller. If some other value is returned, the iteration implicitly continues.
Traversing a collection may require resources, e.g, a stack, a connection to a database, a DIR handle, a DOM tree of an XML document, etc. An enumerator takes care of allocating these resources. When iteratee explicitly tells the enumerator to stop -- or when the collection is exhausted -- the enumerator can orderly free the resources.
On the other hand, if iteration advances only upon an explicit
call to a
continue-iteration function, it is not so clear
cut when to dispose of resources. Obviously an enumeration should be
assumed terminated when the
continue-iteration
procedure/closure is garbage-collected. This assumes that we can
associate
finalizers with objects -- which is not a
standard feature. Fortunately, Gambit provides for 'wills', which it
uses specifically for that purpose, to free a file handle when a port
is GCed. Alas, the moment garbage collection (GC) is performed is not
precisely defined. Often GC runs only when memory is tight. In that
case, the following example
(do ((i 0 (+ i 1))) ((>= i 1000))
(open-input-file "/dev/null" (lambda () #f)))
runs out of file handles way before it runs out of memory. On the
other hand, the following code, although just as silly, completes
without problems (do ((i 0 (+ i 1))) ((>= i 1000))
(with-input-from-file "/dev/null" (lambda () #f)))
Incidentally, the example is stolen from a Python FAQ, from a section that
aimed to illustrate why Python's reference-counting GC may be better
than Java's mark-and-sweep GC.
The lazy-list solution we considered earlier, as well STL iterators
exhibit this finalization problem, too. Therefore, there does appear
to be a merit in insisting that iteratee do not escape an
enumerator. Granted, the iteratee may call a continuation captured
before enumerator was entered. For that reason, enumerator should
employ a
dynamic-wind to detect such an attempt to
escape, shed excessive resources and switch to a "low-power", "sleep"
mode while the iteratee is on the run.
In this article we attempted to imagine a "perfect" enumerator, enumerator interface, or an enumerator pattern. If all collections were to support this general interface, there will be far less need to hack them with call/cc.
Actually, there may still be a need to invert an iteration: for example, to move some elements from one collection into another. It seems very difficult for a single iteratee-copier to be controlled by two enumerators at the same time. Is there a generic way to chain two or more enumerators? I don't even know where to begin.
We should note that some problems seem to lend themselves to
an event callback, dont-call-us-we-call-you-style: an enumerator
driving an iteratee. Some other problems are more amenable to a
read-char-,
GetNextEvent-style: enumeration
inversion.
The proposed left-fold-based enumerator with the premature
termination has been implemented and tested in practice, in
particular, in a database access library [database-access]. The example quoted in that reference shows that the interface can
deal with
EXISTS and other subqueries, table expressions,
collection datatypes, geodetic datablade datatypes and functions, and
other frills. The enumerator makes it possible indeed to present a
functional interface even to such complex collections as
object-relational databases.
[LL3-enumerators]
Towards the best collection traversal interface
[enumerators-src]
enumerators-callcc-code-old.scm [8K]
The complete code for the treap enumeration examples.
[inversion-src]
enumerators-callcc-code.scm [4K]
The complete code for the enumerator inversion procedure and a
few tests.
[nonrec-fold-stream]
The enumerator inversion procedure in Haskell
[database-access]
db-util1.scm [23K]
Scheme database access tools
vdbaccess.scm [5K]
Illustrative regression tests
<
http://www.metnet.navy.mil/Metcast/Code/gief-serve.scm>
An example of using the left-fold-based database interface in a
production environment.
[TIFF]
Handling a TIFF file in Scheme
[Treap]
Treaps: A sorted dictionary data structure based on randomized search trees.
[Generators-Python]
<
http://www-106.ibm.com/developerworks/linux/library/l-pycon?t=grl,l=252,p=iterators>
Charming Python: Iterators and simple generators
By David Mertz, Ph.D. (mertz@gnosis.cx) Autodidact, Gnosis
Software, Inc. September 2001.
[Generators-Icon]
<
http://lambda.weblogs.com/discuss/msgReader$1789>
Ehud Lamm - Re: Charming Python : Iterators and simple generators
Lambda the Ultimate 9/28/2001; 9:16:15 AM
[generators-src]
generators.scm [6K]
The complete code for the generator examples.
[generators-disc]
Ben Escoto. Re: Generators in Python, Icon and Scheme. A message
posted on comp.lang.scheme on 2001-10-03 01:50:02 PST.
The present page has started as a compilation of three articles General ways to traverse collections, and the fate of call/cc posted on a comp.lang.scheme newsgroup Apr 16 through Apr 20, 2000. The section about generators is based on an article Generators in Python, Icon and Scheme posted on comp.lang.scheme on Oct 2, 2001.
This site's top page is http://pobox.com/~oleg/ftp/
Converted from SXML by SXML->HTML