;; -*-mode: Outline; -*- Towards the best collection API * Abstract Most programming languages support collections, represented by an in-memory data structure, a file, a database, or a generating function. A programming language system gives us typically one of the two interfaces to systematically access elements of a collection. One traversal API is based on enumerators -- e.g., for-each, map, filter higher-order procedures -- of which the most general is fold. The second approach relies on streams, a.k.a. cursors, lazy lists. Generators such as the ones in Icon, Ruby and Python are a hybrid approach. This article presents a design of the overall optimal collection traversal interface, which is based on a left-fold-like combinator with premature termination. We argue from the practical standpoint that such an interface is superior: in efficiency; in ease of programming; in more predictable resource utilization and avoidance of resource leaks. We also demonstrate a procedure that converts _any_ left-fold-like enumerator for any collection into the corresponding cursor. Therefore, the stream-based interface, when it is truly needed or preferred, can be automatically and generically derived from the proposed interface. We present two variants of the left-fold enumerator: for languages with and without first-class continuations. The proposed traversal interface has been implemented in a Scheme database access API used in a production environment. The generic-left-fold-based interface underlies the Collection SRFI (Scheme Request for Implementation) and is being considered for an Oracle database interface in Haskell. * Introduction and terminology Access to collections cuts through many programming languages. By a collection we mean a hash table; a file as a collection of lines, words, or bytes; a file system directory; an XML document; a suffix tree; or a "dynamic" collection produced by a database query or any other generating function. There are two basic approaches to systematically access collection elements. One approach relies on enumerators such as 'map', 'for-each' and 'filter'. The most general of those is fold, which lets us maintain our state during the traversal. The other traversal API is based on lazy lists (a.k.a. cursors, or streams). A word about terminology is in order, as it is rather confusing. We will call a higher-order for-each-type procedure an enumerator. It is this procedure that takes a collection and a handler, and applies the handler to each element of the collection in turn. The handler itself will be called an iteratee, because it is being applied -- by an active entity, the enumeratOR. An enumerator is presumably privy to the internals of a collection; it knows how to fetch elements in some (specified) order, and knows when to stop. A handler does not have this knowledge, nor is it supposed to. Throughout the article we eschew the commonly used term 'iterator': in C++ it means an accessor, which does not iterate over a collection. Rather, it is _being_ iterated. On the other hand, languages like OCaml provide a procedure 'iter' that actively traverses a collection, applying a user-defined function to elements. The next section argues from the practical standpoint against cursors and for enumerators. We contend that an enumerator such as left fold should be the primary means of traversing and obtaining all values of a collection. Enumerators are superior to cursors: - in efficiency - in ease of programming - in more predictable resource utilization and avoidance of resource leaks. The latter is of special importance for collections backed by external resources such as file handles and database connections. Section 3 proposes the overall optimal enumeration interface, which adds premature termination and explicit multiple state variables to the left fold. Section 4 discusses generators. Given a stream, we can always construct the corresponding enumerator. It is perhaps less appreciated that given a left fold enumerator, we can always construct the corresponding stream: we can invert an enumerator inside out. Section 5 describes the inversion procedure, which is fully generic and works for any collection. Because the enumerator interface subsumes the cursor-based one and because the former is more efficient and less error-prone, a collection API should offer enumerators as the primary, native traversal interface. The automatical enumerator inversion procedure in Section 5 relies on first-class (delimited) continuations. However, we can write a similar enumerator inversion even in a language that lacks first-class continuations. Section 6 demonstrates a non-recursive left-fold-like combinator that can be instantiated into the ordinary left fold or into a stream. Both instantiation procedures are generic and do not depend on the type of the collection. In conclusion, we recap our recommendations for the optimal collection traversal API and point out to the existing implementations of the API. * Why an enumerator is better than a cursor There are many fundamental reasons why enumerators such as fold are profoundly better than cursors. One such reason is the ease of program analysis and optimizations [Sheard, Bananas, Hutton]. This article will not discuss them. Instead, we concentrate on utterly practical considerations: performance, ease of programming, and resource utilization. Writing a procedure that traverses, e.g., an AVL tree is far easier than writing a cursor that remembers the current position and can advance to the next one. The latter is especially tricky if the tree in question does not have parent pointers (which waste space and lead to sharing and resource leakage problems). One look at the C++ STL should convince one how wretched programming of iterators could be. The "current element" of a cursor is a state, which can be altered by advancing the cursor. This state introduces an implicit dependency among all expressions that use the cursor, similar to the dependency imposed by global mutable variables. Therefore, programming with cursors is error-prone. In contrast, an enumerator does not expose its traversal state to an iteratee and does not permit any alteration to that state by an iteratee. The difference between the leaky and the perfect encapsulations of the traversal state is well explained in [ULLMAN] when comparing a cursor-based network database query language DBTG with SQL. Incidentally, streams, another cursor-based interface, expose the current element and hence eliminate the implicit state dependency. Another problem of cursors is telling the user that there are no more values to access. When a collection is exhausted, cursor operations typically return an "out-of-band" value such as EOF (cf. getc() in C) or raise an exception (cf. iterator.next() in Python). None of these approaches appear entirely satisfactory. Unlike cursors, an enumerator does not need to do anything special to tell the iteratee that the traversal is finished: the enumerator will simply return. There is no longer a problem with out-of-band values and user's disregarding such values. Cursors are simply less efficient. A cursor must maintain the state of the traversal. That state may be invalid. Each operation on a cursor, therefore, must first verify that the cursor is in the valid state. The cost of the checks quickly adds up. This issue of the inefficiency of cursors is well known. For example, it is highly advisable to use "native" operations to move large sections of an array (ArrayCopy in Java) rather than copy element-by-element. The cost of an array bound check on each access to an element is one of the considerations. Enumerators can do more than just eliminate redundant bound checks. Enumerators, if implemented as staged functions (e.g., C++ templates) can tile or unroll loops and thus partially or completely remove the traversal iteration overhead [Veldhuizen]. The perfect encapsulation of the traversal state makes enumerators particularly adaptable for multi-stage programming. Databases give another example of the efficiency of enumerators: "The performance of cursors is horrible in almost all systems. One of us (Shasha) once had an experience of re-writing an eight-hour query having nested cursors into a cursor-free query that took 15 seconds." [Shasha] It is often said that the key to the best performance is to do more on the server. To find the maximum value of a database table column, it's far faster to evaluate select MAX(col) from collection than to open a cursor on the collection and retrieve all values from 'col' searching for the maximum. Stored procedure languages (some of which are quite sophisticated) were introduced to make the server-side processing easier and powerful. We should stress that select MAX(col) from collection is precisely equivalent to foldl max lower_bound collection Traversing a collection often requires resources, e.g., a stack space, a connection to a database, a DIR handle, a DOM tree of an XML document, etc. An enumerator takes care of allocating these resources. When iteratee explicitly tells the enumerator to stop -- or when the collection is exhausted -- the enumerator can orderly free the resources. An enumerator can be programmed as (define (enum proc collection) (let ((database-connection #f)) (dynamic-wind (lambda () (set! database-connection (open-connection collection))) (lambda () (iterate proc)) (lambda () (set! database-connection (close-connection database-connection)))))) If 'proc' does not capture its continuation [Footnote-1] -- as it is often the case -- the database connection is opened (taken from the pool) at the beginning of the iteration and is returned to the pool at the end. If we were to provide access to our collection in the form of a cursor, we would have to place the variable 'database-connection' into that cursor. We cannot close the database until the cursor is alive. But how do we know when there are no alive cursors and therefore it is safe to recycle the database connection? We must rely either on the programmer's explicitly closing the connection, or on a finalizer. The sheer number of internet security advisories concerning memory allocation problems indicates that manual management of resources is greatly error-prone. The finalizer solution is not satisfactory either: finalizers are rarely supported and when they are, they are an unreliable tool to manage precious resources. The execution of finalizers is unpredictable and is generally beyond programmer's control. [Footnote-1] An iteratee may call a continuation captured before the enumerator was entered. For that reason, an enumerator should employ a dynamic-wind to detect such an attempt to escape, shed excessive resources and switch to a "low-power" mode while the iteratee is on the run. * The most general enumeration interface We propose the following general enumeration interface. It is based on a left-fold enumerator and explicitly supports multiple state variables (i.e., seeds) and a premature termination of the iteration. In this Section we describe the interface in pseudo-code, which can be instantiated in Scheme [SRFI-44], Haskell ([Daume], Section 6) or other concrete language. The enumeration procedure coll-fold-left COLL PROC SEED SEED ... -> [SEED ... ] traverses the collection denoted by COLL. The procedure takes one or more state arguments denoted by SEED, and returns just as many. PROC is an iteratee procedure: PROC VAL SEED SEED ... -> [INDIC SEED SEED ...] It takes n+1 arguments and returns n+1 values. The first argument is the current value retrieved from the collection. The other n arguments are the seeds. The first return value is a boolean indicator, whose false value causes the premature termination of the iteration. The other return values from PROC are the new values of the seeds. The procedure coll-fold-left enumerates the collection in some order and invokes PROC passing it the current value of the collection and the current seeds. The first invocation of PROC receives SEEDs that were the arguments of coll-fold-left. The further invocations of PROC receive SEEDs that were produced by the previous invocation of PROC. When the collection is exhausted or when the PROC procedure returns the indicator value of false, coll-fold-left terminates the iterations, disposes of allocated resources, and returns the current SEEDs. If the collection COLL is empty, coll-fold-left does not invoke PROC and returns its argument SEEDs. * Enumerators and generators A programming language Icon popularized generators as a way to traverse actual and virtual collections. Generators are also supported in Ruby and Python. The documentation of Icon defines a generator as an expression that can produce several values, on demand. Multiple-valued expressions can also be implemented with shift and reset [SHIFT]. The paper [SHIFT] gives many examples of their use. Generators occupy an intermediate place between enumerators and cursors. A generator is just as easy to write as an enumerator. It traverses a collection, fetches the current element and _yields_ it, by passing the element to a dedicated procedure or syntax form. When the latter returns, the traversal continues. On the other hand, generators are trivially related to streams [ENUM-CC]. The latter article discusses generators and enumerators in more detail. Incidentally, the article demonstrates that a generator-based code in Python can be translated into Scheme almost verbatim. Generators give the first hint that enumerators and cursors are related via first-class continuations. Like cursors and streams, generators are demand-driven. A user must explicitly request a new value to advance the traversal. Therefore, like cursors generators leak resources: it is not clear when the iteration should be assumed terminated and the associated resources can be safely disposed of. * How to invert an enumerator in a language with first-class continuations Sometimes we indeed need to traverse a collection via a cursor. Reasons may include moving data from one collection to another, or interfacing legacy code. If a collection API provides enumerators, we obtain cursors for free. We pass an enumerator to a generic translation procedure, which inverts the enumerator "inside out" and returns a cursor. The following code illustrates the conversion in Scheme, a language with first-class continuations. The procedure lfold->lazy-list is a fully generic translation procedure: it takes a left-fold enumerator for _any_ collection, and converts the enumerator to a stream (lazy list). The latter is a realization of a cursor. (define (lfold->lazy-list lfold collection) (delay (call-with-current-continuation (lambda (k-main) (lfold collection (lambda (val seed) (values (call-with-current-continuation (lambda (k-reenter) (k-main (cons val (delay (call-with-current-continuation (lambda (k-new-main) (set! k-main k-new-main) (k-reenter #t)))))))) seed)) '()) ; Initial seed (k-main '()))))) The present article is an abstract. Code in a functional language is the best abstract. The discussion is delegated to the talk and the full paper. The article [ENUM-CC] discusses the inversion procedure in more detail and points out to the complete code and the test cases. * How to invert an enumerator in a language without first-class continuations If a programming language lacks first-class continuations, the conversion from an enumerator to a cursor is still possible. However, we need to generalize the enumerator interface and make it non-recursive. For concreteness, this section demonstrates our approach for one particular language without first-class continuations: Haskell. Applications to other languages are straightforward. In Haskell, the general left-fold enumerator has the following interface [HINV]: > type CFoldLeft coll val m seed = coll -> CollEnumerator val m seed > type CollEnumerator val m seed = > Iteratee val seed > -> seed -- the initial seed > -> m seed > type Iteratee val seed = seed -> val -> Either seed seed where 'coll' is the type of a collection with elements of the type 'val', and 'm' is an arbitrary monad. The type 'seed' is the type of a state variable or variables (if the seed is a tuple). If an iteratee returns Right seed', the iteration continues with seed' as the new seed. If the iteratee returns Left seed'', the enumerator immediately stops further iterations, frees all the resources, and returns seed'' as the final result. Incidentally, Hal Daume III mentioned [DAUME] that such left-fold enumerator is indeed useful in practice. It is his preferred method of iterating over a file considered as a collection of characters, lines, or words. To make the enumerator non-recursive, we need to add an additional argument -- self: > type CFoldLeft' val m seed = > Self (Iteratee val seed) m seed > -> CollEnumerator val m seed > type Self iter m seed = iter -> seed -> m seed > type CFoldLeft1Maker coll val m seed = coll -> m (CFoldLeft' val m seed) A function of the type CFoldLeft' is also an enumerator. However, that enumerator does not recurse to advance the traversal. It invokes Self instead. Given CFoldLeft1Maker, we can obtain either the CFoldLeft enumerator, or a stream. The former translation procedure amounts to taking a fixpoint: > hfold_nonrec_to_rec:: (Monad m) => > coll -> (CFoldLeft1Maker coll val m seed) > -> m (CollEnumerator val m seed) > hfold_nonrec_to_rec coll hfold1_maker = do > hfold_left' <- hfold1_maker coll > return $ fix hfold_left' > fix f = f g where g = f g Converting CFoldLeft' into a stream is equally simple: > data MyStream m a = MyNil (Maybe a) | MyCons a (m (MyStream m a)) > hfold_nonrec_to_stream:: > (Monad m) => CFoldLeft' val m (MyStream m val) > -> m (MyStream m val) > hfold_nonrec_to_stream hfold_left' = do > let k fn (MyNil Nothing) = return $ MyNil Nothing > k fn (MyNil (Just c)) > = return $ MyCons c (hfold_left' k fn (MyNil Nothing)) > hfold_left' k (\_ c -> Right $ MyNil $ Just c) (MyNil Nothing) The polymorphic types of both conversion procedures indicate that the procedures are generic and apply to any collection and any traversal. The article [HINV] demonstrates both translations of CFoldLeft' on a concrete example of a file taken as a collection of characters. Haskell provides a cursor interface to that collection: hGetChar. We implement a left fold enumerator CFoldLeft'. We then show how to turn that enumerator back to a stream: how to express functions myhgetchar and myhiseof only in terms of the left fold enumerator. The derivation of these functions is independent of the precise nature of the enumerator. Incidentally, if we turn two enumerators into streams, we can safely interleave these streams. * Conclusions In a language with first-class continuations, we propose coll-fold-left as the overall optimal interface to systematically access values of a collection (Section 3): coll-fold-left COLL PROC SEED SEED ... -> [SEED ... ] PROC VAL SEED SEED ... -> [INDIC SEED SEED ...] In a language without first-class continuations, we propose CFoldLeft1Maker (Section 6) as such optimal interface. The enumerator-based interface is optimal because enumerators: are easier to write; are less error-prone to use; are more efficient; provide a better encapsulation of the state of the traversal; avoid resource leaks. We have presented generic conversion procedures that turn enumerators into cursors. The existence of these procedures demonstrates that enumerator- and cursor-based interfaces are inter-convertible. We have argued however that the enumerator interface should be considered primary and offered natively in a collection API. It is far more efficient and easy for a programmer to implement cursors via enumerators, than the other way around. The coll-fold-left interface has indeed been implemented and tested in practice. We have written a relational database interface for Scheme [DBINTF], which we have been using in the production environment. We have also implemented coll-fold-left to enumerate entries in a TIFF image tag directory. The enumerator coll-fold-left has been chosen to be the primary traversal interface in Scheme Collections SRFI [SRFI-44]. A similar interface is being considered for an Oracle RDBMS binding in Haskell. * References [SHIFT] Olivier Danvy and Andrzej Filinski. Abstracting Control. Proc. 1990 ACM Conf. on LISP and Functional Programming, pp. 151-160, Nice, France, June 1990. [Daume] Hal Daume III. Re: From enumerators to cursors: turning the left fold inside out. A message posted on the Haskell mailing list on 24 Sep 2003 07:47:23 -0700. [Hutton] Graham Hutton. A tutorial on the universality and expressiveness of fold. Journal of Functional Programming, 9(4):355-372, July 1999. [ENUM-CC] Oleg Kiselyov. General ways to traverse collections. January 1, 2004. http://pobox.com/~oleg/ftp/Scheme/enumerators-callcc.html [DBINTF] Oleg Kiselyov. Scheme database access tools. May 10, 2003. http://pobox.com/~oleg/ftp/Scheme/lib/db-util1.scm http://pobox.com/~oleg/ftp/Scheme/tests/vdbaccess.scm [HINV] Oleg Kiselyov. From enumerators to cursors: turning the left fold inside out. January 1, 2004. http://pobox.com/~oleg/ftp/Haskell/misc.html#fold-stream The first draft was posted on the Haskell mailing list on 23 Sep 2003 23:59:45 -0700. [Bananas] Erik Meijer, Maarten M. Fokkinga, and Ross Paterson. Functional programming with bananas, lenses, envelopes, and barbed wire. In J. Hughes, editor, FPCA'91: Functional Programming Languages and Computer Architecture, volume 523 of LNCS, pp. 124-144. Springer-Verlag, 1991. [SRFI-44] Scott G. Miller. Collections. Scheme Request for Implementation SRFI-44. October 2003. http://srfi.schemers.org/srfi-44/srfi-44.html [Shasha] Dennis E. Shasha and Philippe Bonnet. Smooth Talking Your Databases. Dr.Dobbs Journal, July 2002, pp. 46-54. [Sheard] Tim Sheard and Leonidas Fegaras. A fold for all seasons. Proc. Conf. on Functional Programming and Computer Architecture (FPCA'93), pp. 233-242, Copenhagen, Denmark, June 1993. [ULLMAN] Jeffrey Ullman. Principles of Database Systems. Second Edition, 484 pp. Computer Science Press, 1982. [Veldhuizen] T. Veldhuizen. Expression Templates. C++ Report, Vol. 7 No. 5 June 1995, pp. 26-31 http://osl.iu.edu/~tveldhui/papers/Expression-Templates/exprtmpl.html