This article analyzes various ways of traversing a collection. We pay a particular attention to the most complex and general case of enumerating only a subset of a collection and performing actions that depend on the previously seen elements. We also discuss generators such as those in Icon and Python. A proposal is made for a convenient generic enumerator, which could be used, for example, in a truly functional database interface.
This article was originally written in April 2000. Its analysis of enumerators has since been researched further and extended to other languages. The pages [generators] and [iteratees] are the latest presentations on this topic. The proposal for a convenient and generic enumerator first suggested here has been tested in practice. We implemented the left-fold enumerator to access (object-)relational databases in a functional way [database-access]. This interface has been used in a production environment. The generic-left-fold-based interface underlies the Collection SRFI. It was also used in a TIFF library [TIFF], and even in other languages [nonrec-fold-stream].
This present article also demonstrates a non-trivial use of call/cc -- to invert a traversal "inside out." In this application, call/cc cannot be replaced with a simple throw/catch construction. Later on the inversion procedure was extended to the most general enumerator [inversion-src]. It turns out that with a small modification, we can perform the iteration inversion in languages without first-class continuations, e.g., Haskell [nonrec-fold-stream].
Suppose we have a collection with a for-each
-type
enumerator: a file system directory, an XML document, a suffix tree,
or a "dynamic" collection that is the result of executing a SELECT
DBMS/ODBC statement. We wish to obtain the first
10 elements of the collection that satisfy a certain criterion:
contain specific keywords, sound alike, etc.
For the purpose of this example, let us consider a collection
of numbers of the form 65537x+5 mod 1997
; The selection
criterion is divisibility by 7. To make the problem less trivial, let
the collection be represented by a treap [Treap]. We will not use the full power of the treap in this example: what
matters to us is that the treap is an advanced collection and that it
implements a for-each
enumerator. A more general example is
given in [inversion-src].
A word about terminology is in order, as it is rather
confusing. We will call a for-each
-type procedure
enumerator. Note the active suffix OR. It is
this procedure that takes a collection and a handler, and applies the
handler to each element of the collection in turn. The handler itself
will be called an iteratee, because it is
being applied -- by an active entity, the enumeratOR.
Enumerator is presumably privy to the internals of a collection; it
knows how to fetch elements in some (specified) order, and knows when
to stop. A handler does not have this knowledge, nor is it supposed
to. The job of an iteratee is to do something with the current
element, irrespective of how this current element was obtained.
for-each
enumeratorWe will first consider the most straightforward implementation: it will be our baseline. The complete code for this and the following examples is given in [enumerators-src].
(define (get-good-numbers-dumb treap count-max) (let ((count 0) (result '())) ((treap 'for-each-ascending) (lambda (key-value) (let ((x (car key-value))) (and (< count count-max) (good-value? x) (begin (set! count (+ 1 count)) (set! result (cons x result))))))) (reverse result))) (cerr (time (get-good-numbers-dumb treap pagesize)) nl) 23 ms real time 20 ms cpu time (20 user, 0 system) 1 collection accounting for 0 ms cpu time (0 user, 0 system) 171136 bytes allocated no minor faults no major faults (42 49 56 63 70 112 119 126 133 140)Assignments are prominent in this solution: they are necessary to advance the state. The latter must be represented by global (to the iteratee) variables, which can persists from one invocation of the iteratee to another.
A smarter approach is to escape from the enumerator when we are done
(define (get-good-numbers-cc treap count-max) (let ((count 0) (result '())) (call-with-current-continuation (lambda (escape) ((treap 'for-each-ascending) (lambda (key-value) (let ((x (car key-value))) (and (good-value? x) (begin (set! count (+ 1 count)) (set! result (cons x result)) (if (>= count count-max) (escape #f))))))))) (reverse result))) (cerr (time (get-good-numbers-cc treap pagesize)) nl) 2 ms real time 0 ms cpu time (0 user, 0 system) no collections 15416 bytes allocated no minor faults no major faults (42 49 56 63 70 112 119 126 133 140)
Let us take a different turn: view a collection as if it was a
lazy list. Lazy list is a promise, which when applied
returns '()
or (cons value lazy-list)
data LazyList a = Promise (Nil | (Cons a (LazyList a)))
(define (treap->lazy-list treap) (delay (call-with-current-continuation (lambda (k-main) ((treap 'for-each-ascending) (lambda (key-value) ; (cerr "\nIn treap->lazy-list's iteratee: " key-value "\n") (call-with-current-continuation (lambda (k-reenter) (let ((x (car key-value))) (k-main (cons x (delay (call-with-current-continuation (lambda (k-new-main) (set! k-main k-new-main) (k-reenter #f))))))))))) (k-main '())))))With this in mind, we can do
(define lazy-treap-list (treap->lazy-list treap)) (define (get-good-numbers-ll lazy-treap-list count-max) (let loop ((count 0) (llst lazy-treap-list) (result '())) (cond ((or (>= count count-max) (fnull? llst)) (reverse result)) ((good-value? (fcar llst)) (loop (+ 1 count) (fcdr llst) (cons (fcar llst) result))) (else (loop count (fcdr llst) result))))) (cerr (time (get-good-numbers-ll lazy-treap-list pagesize)) nl) 5 ms real time 0 ms cpu time (0 user, 0 system) no collections 47752 bytes allocated no minor faults no major faults (42 49 56 63 70 112 119 126 133 140)Note a commented out line at the very beginning of the treap iteratee that calls the
cerr
procedure . If you uncomment
that line you can see for yourself that during the execution of get-good-numbers-ll
, the iteratee is invoked only as many times
as absolutely necessary. In all other respects lazy-treap-list
looks like a regular list. This makes even
complex, history-dependent traversal easily realizable: lists
can be walked in a variety of ways. The get-good-numbers-ll
is a thinly-veiled finite state machine implemented in a
pure-functional style.
Performance-wise, the lazy list is definitely better than the dumb enumeration: the lazy list does not fetch elements if they are not needed. Yet the lazy list loses out to the early escape.
The get-good-numbers-ll
procedure is purely
functional: The state of traversal is abstracted into a lazy pair, a
first-class object. To advance the traversal, we have to explicitly apply methods like fcdr
(aka next). We also have to explicitly check for EOF. In this
respect, a lazy list is very similar to an STL iterator. On the other
hand, a for-each enumerator does not give an iteratee any hint about
the current state of traversal, or its history.
The procedure treap->lazy-list
shows off call/cc at
its best -- its ability not only to escape a scope but to re-enter it
as well. treap->lazy-list
cannot be implemented with
simpler, throw-like constructions. In this example, the enumerator and
its caller act as coroutines. The full power of call/cc is
required to implement them in Scheme.
There has been a discussion about the fate of call/cc. No doubt
call/cc causes many a grief to implementors of Scheme. Is it worth the
trouble? The conclusions of this article are mixed. The early-escape
solution needs call/cc only to throw a non-reenterable exception. This
solution will work even if call/cc did not exist, being replaced by
its weak relative, throw
or raise
. Only
treap->lazy-list
requires the full call/cc. The latter
seems indispensable in "inverting" of an iteration, when we
have an implementation control over an iteratee but not over its
iteratOR. We gain flexibility, but we have to brace for the problems with
performance, finalization, and excessive resource usage.
A more general iteration-inversion procedure along with a few validation tests can be found in [inversion-src].
The section on generators has been expanded to delimited control, ML and Haskell and moved to a separate document: [generators]
Another approach of traversing a collection is to use a cursor, which is often confusingly called a stream or an
iterator. Examples of cursors include C++ STL iterators, SQL cursors,
C streams, UNIX files. A (read-only) cursor lets the user read the current
value and advance the cursor. Often these operations are combined into
one (cf. getc()
in C).
Ben Escoto, in his thoughtful follow-up discussion about generators in Scheme [generators-disc], has observed a few displeasing facts about cursors.
The first problem is about telling the user that there are no
more values to access. When the collection is exhausted, the cursor
operations typically return an "out-of-band" value such as EOF (cf.
getc()
), raise an error (cf. SQL cursors) or an
exception (Python's iterator.next()
). None of these approaches
appear entirely satisfactory.
Some collections may allow to backtrack a cursor -- but some may not. Cursors associated with network connections or those that compute values on the fly often do not permit backtracking. If we cannot look ahead by more than one item and then backtrack, some important stream processing functions become quite difficult to write.
Unlike cursors, an enumerator does not need to do anything special to tell the iteratee that the traversal is finished: the enumerator will simply return. There is no longer a problem with out-of-band values and user's disregarding such values.
Although lazy lists look rather similar to cursors, lazy lists do not have the above drawbacks either. A lazy list is either a pair -- in which case its first element is the current value -- or an empty list, in which case the traversal is finished. We should stress that a promise of a pair in a lazy list can be forced more than once. All further forcing are equivalent to the identity function. If a lazy pair is forced it yields an ordinary pair, which can be incorporated in other lists. Thus, lazy lists always permit backtracking -- transparent to the user and in arbitrary amount.
The only inconvenience with lazy lists, also noted by Ben Escoto, is that many standard list operations such as map, filter, and fold have to be re-written specifically for lazy lists. This problem does not arise in those Scheme systems that force values implicitly.
If one has the power over implementing a collection, he can
build an enumerator that allows for an early exit. For example, such
an enumerator may examine the return result of an iteratee. If the
latter yields #f
, the iteration is abandoned; the result
'()
means the traversal will go on. Any other value is
added to a list that accumulates meaningful results of all iteratee
invocations, in order.
The best enumerator ought to let an iteratee explicitly pass
its private state from one invocation to another -- similar to the the
way an unfold
combinator (see SRFI-1) tosses the seed
around. Some syntax however ought to be employed to make it easier
for an iteratee to pack its state data into a 'parcel' -- only to
unpack the data on the next iteration. For example, one can imagine
the following enumerator:
(enumerate (lambda (curr-elem count result) (let ((x (car key-value))) (if (good-value? x) (let ((new-count (+ 1 count))) (values (< new-count count-max) ; #f as the first arg new-count ; will terminate the iter (cons x result))) ; early (values #t count result) ))) treap 0 ; Initial value for the count '() ; Initial value for the result )
The iteratee in the example above takes three arguments. The
first one is the current element -- similar to the argument the
ordinary for-each
passes to its iteratee. The initial values for
the other two arguments are given as invocation parameters of the
enumerator. The iteratee must return as many values as the arguments
it received. The first return value must be either #t
or
#f
. In the latter case the traversal is prematurely
terminated; otherwise it may continue, if there are elements to
enumerate, of course. The other return values are passed to the next
invocation of the iteratee as the second,... arguments. The
enumerator returns several values, which is the result yielded by the
last invocation of the iteratee save the first value of the
multi-value. If the iteratee was never called, the initial values
will be returned. One can envision similar procedures
enumerate-replace
or enumerate-replace!
,
which use the first return value of the iteratee to update the current
element, either functionally or destructively.
This is just a suggestion. It may also be considered a blueprint for an ODBC-like interface in a true Scheme spirit; the collection in question is the result of a SELECT statement. I'm interested to hear if all this makes any sense.
Since the previous phrase was written, the proposed enumerator interface has indeed been tried and found satisfactory. It was first implemented in a relational database access library [database-access]. This interface has been used in a production environment. The generic-left-fold-based interface underlies the Collection SRFI. It was also used in a TIFF library [TIFF], and even in other languages [nonrec-fold-stream]. The LL3 presentation [LL3-enumerators] discussed the interface in more detail.
Joe Marshall remarked:
Have you considered passing the iteratee a function that continues the iteration? The iteratee would look something like this:(lambda (this-element current-state continue-iteration) (let ((next-state (combine this-element current-state))) (if (done? next-state) next-state (continue-iteration next-state))))
This solution -- along with a lazy list solution above -- has one troublesome property. A special function has to be called to continue the traversal. To stop the traversal, iteratee doesn't have to do anything special: just return. This is somewhat similar to the lazy-list solution. There, the element handler must force a promise to get to the next element in a collection. In contrast, in the enumerator proposal, to stop a traversal the iteratee must return a special kind of value to the caller. If some other value is returned, the iteration implicitly continues.
Traversing a collection may require resources, e.g, a stack, a connection to a database, a DIR handle, a DOM tree of an XML document, etc. An enumerator takes care of allocating these resources. When iteratee explicitly tells the enumerator to stop -- or when the collection is exhausted -- the enumerator can orderly free the resources.
On the other hand, if iteration advances only upon an explicit
call to a continue-iteration
function, it is not so clear
cut when to dispose of resources. Obviously an enumeration should be
assumed terminated when the continue-iteration
procedure/closure is garbage-collected. This assumes that we can
associate finalizers with objects -- which is not a
standard feature. Fortunately, Gambit provides for 'wills', which it
uses specifically for that purpose, to free a file handle when a port
is GCed. Alas, the moment garbage collection (GC) is performed is not
precisely defined. Often GC runs only when memory is tight. In that
case, the following example
(do ((i 0 (+ i 1))) ((>= i 1000)) (open-input-file "/dev/null" (lambda () #f)))runs out of file handles way before it runs out of memory. On the other hand, the following code, although just as silly, completes without problems
(do ((i 0 (+ i 1))) ((>= i 1000)) (with-input-from-file "/dev/null" (lambda () #f)))Incidentally, the example is stolen from a Python FAQ, from a section that aimed to illustrate why Python's reference-counting GC may be better than Java's mark-and-sweep GC.
The lazy-list solution we considered earlier, as well STL iterators
exhibit this finalization problem, too. Therefore, there does appear
to be a merit in insisting that iteratee do not escape an
enumerator. Granted, the iteratee may call a continuation captured
before enumerator was entered. For that reason, enumerator should
employ a dynamic-wind
to detect such an attempt to
escape, shed excessive resources and switch to a "low-power", "sleep"
mode while the iteratee is on the run.
In this article we attempted to imagine a "perfect" enumerator, enumerator interface, or an enumerator pattern. If all collections were to support this general interface, there will be far less need to hack them with call/cc.
Actually, there may still be a need to invert an iteration: for example, to move some elements from one collection into another. It seems very difficult for a single iteratee-copier to be controlled by two enumerators at the same time. Is there a generic way to chain two or more enumerators? I don't even know where to begin.
We should note that some problems seem to lend themselves to
an event callback, dont-call-us-we-call-you-style: an enumerator
driving an iteratee. Some other problems are more amenable to a
read-char
-, GetNextEvent
-style: enumeration
inversion.
The proposed left-fold-based enumerator with the premature
termination has been implemented and tested in practice, in
particular, in a database access library [database-access]. The example quoted in that reference shows that the interface can
deal with EXISTS
and other subqueries, table expressions,
collection datatypes, geodetic datablade datatypes and functions, and
other frills. The enumerator makes it possible indeed to present a
functional interface even to such complex collections as
object-relational databases.
[generators]
Generators: the API for traversal and non-determinism
[iteratees]
Iteratees: Incremental multi-level input processing and collection enumeration
[LL3-enumerators]
Towards the best collection traversal interface
[enumerators-src]
enumerators-callcc-code-old.scm [8K]
The complete code for the treap enumeration examples.
[inversion-src]
enumerators-callcc-code.scm [4K]
The complete code for the enumerator inversion procedure and a
few tests.
[generators-src]
generators.scm [6K]
The complete code for the generator examples.
[nonrec-fold-stream]
The enumerator inversion procedure in Haskell
[database-access]
db-util1.scm [24K]
Scheme database access tools
vdbaccess.scm [5K]
Illustrative regression tests
<http://www.metnet.navy.mil/Metcast/Code/gief-serve.scm>
An example of using the left-fold-based database interface in a
production environment.
[TIFF]
Handling a TIFF file in Scheme
[Treap]
Treaps: A sorted dictionary data structure based on randomized search trees.
[generators-disc]
Ben Escoto. Re: Generators in Python, Icon and Scheme. A message
posted on comp.lang.scheme on 2001-10-03 01:50:02 PST.
The present page has started as a compilation of three articles General ways to traverse collections, and the fate of call/cc posted on a comp.lang.scheme newsgroup Apr 16 through Apr 20, 2000. The section about generators is based on an article Generators in Python, Icon and Scheme posted on comp.lang.scheme on Oct 2, 2001.
This site's top page is http://okmij.org/ftp/
Converted from SXML by SXML->HTML