Algorithms and Data Structures


 

Selecting a random node from a tree in one pass: from proof to code

We present a procedure that picks a uniformly distributed random node from a tree. We traverse the tree only once and we do not know beforehand the number of nodes in the tree. The provably correct algorithm is an instance of a Reservoir sampling.

The procedure is written in a pure functional subset of R5RS Scheme and comes with a correctness proof. We must stress that the proof was developed not after the implementation but along with the implementation. In our experience, thinking about the proof and writing it down notably helped design and code the algorithm. Once the proof was written, the code followed immediately. The code worked on the first try.

Version
The current version is 1.1, Apr 15, 2003.
References
random-tree-node.scm [6K]
The complete source code, the proof of the algorithm, and validation tests.
The code was originally posted in an article Re: random node in tree on the newsgroup comp.lang.scheme on Tue, 15 Apr 2003 22:17:15 -0700
 

Optimal justification of text with Dynamic Programming

This code demonstrates Dynamic Programming on the problem of pretty printing a paragraph of text on a printer with fixed-width fonts. The goal is to tightly arrange a given sequence of n words within page margins, maximizing overall neatness. To be more precise, we wish to minimize the sum, over all lines except the last, of the cubes of the number of blank characters at the end of each line. See the comments in the code for more details.

The algorithm has O(n^2) time and space complexities.

Version
The current version is 2.0, April 2001.
References

word_layout.cc [9K]
Commented C++ source code and sample output, with many annotations.

Z. Galil, K. Park: A linear-time algorithm for concave one-dimensional dynamic programming
Information Processing Letters, v33, N6, 309-311, 1990.
<http://portal.acm.org/citation.cfm?id=79800>
David Eppstein, livejournal.com user 11011110, pointed out that the present problem is an instance of concave 1d dynamic programming, which admits a linear-time solution.

 

Deriving fast functions to compute all subsets of size N

The following two-part article attempts to design the fastest solution to the problem of finding all subsets of a given size from a given set. The precise problem is: given a set L and a number N, return the set of all subsets of L of cardinality N. Sets are represented by lists. We will be using R5RS Scheme.

In part 1, we start with the mathematical definition of the problem, which leads to a simple, correct, but inefficient solution. We then try to systematically optimize the function until we end up with the fastest function, which is notably faster than the other solutions proposed so far. The final solution is still pure functional. We also demonstrate that the choice of the Scheme interpreter does matter in relative performance of various algorithms.

In part 2, we again start with the mathematical definition of the problem, which leads to a simple, correct, and stunningly efficient solution. The final, so far the fastest solution is still pure functional. The key was to choose the right definition.

In the discussion, Doug Quale presented lazy stream implementations in Haskell and Scheme, and compared them with the above. Eli Barzilay described various memozied versions, which have even better performance. The two USENET threads contain the excellent discussion of the relative merits of memoization and laziness, contributed by Doug Quale and Eli Barzilay. The threads also include many timing comparisons.

References

Part 1 of the article [plain text file]
It was originally posted as Re: Subsets of a list on the newsgroup comp.lang.scheme on Sat, 12 Jan 2002 00:52:23 -0800

Part 2 of the article [plain text file]
It was originally posted as The FASTEST subsets function [Was: Subsets of a list] on the newsgroup comp.lang.scheme on Sat, 12 Jan 2002 00:56:01 -0800
The article is updated with a more optimized solution, which should perform better when compiled.

Discussion threads of the above titles, comp.lang.scheme, Jan 9-18, 2002.
<http://google.com/group/comp.lang.scheme/browse_thread/thread/671474460d3e31d8>
<http://google.com/group/comp.lang.scheme/browse_thread/thread/ba04bcb97d8a6c2b>

 

Representing knowledge about knowledge: ``Mr.S and Mr.P'' puzzle

We describe a concise Haskell solution to the ``Mr.S and Mr.P'' puzzle. We rely on the straightforward encoding of multiple-world semantics of modalities.

The problem was posed by John McCarthy as follows. We pick two numbers a and b, so that a>=b and both numbers are within the range [2,99]. We give Mr.P the product a*b and give Mr.S the sum a+b. The following dialog takes place:

Can we find the numbers a and b?

The following Haskell code demonstrates a generic method of encoding facts, and the knowledge about facts, and the knowledge of the knowledge, etc. Incidentally, compared to the notation in McCarthy's paper, the Haskell notation is notably concise.

Chung-chieh Shan commented: ``The basic idea is to think of a set of possible worlds. Corresponding to each person (whose knowledge is being modeled) is a partition of this set of possible worlds; each partition contains one or more worlds that this person cannot distinguish. For someone to know a fact is for all of that person's indistinguishable possible worlds to verify that fact. For Alice to know that Bob doesn't know the weather, is for all of Alice's possible worlds (relative to the real world) to reside within a Bob-partition in which the weather is not consistent across all worlds.''

Version
The current version is 1.3, Jun 23, 2006.
References

John McCarthy: Formalization of two Puzzles Involving Knowledge. 1987.
<http://www-formal.stanford.edu/jmc/puzzles.html>

Mr-S-P.lhs [3K]
Complete literate Haskell98 code. It was first mentioned in the message posted on Lambda-the-Ultimate on Jan 27, 2003. The present version adds a straightforward memoization.

 

How to zip folds: A library of fold transformers (streams)

We show how to merge two folds `elementwise' into the resulting fold. Furthermore, we present a library of potentially infinite ``lists'' represented as folds (aka streams, aka success-failure-continuation--based generators). Whereas the standard Haskell Prelude functions such as map and take transform lists, we transform folds. We implement the range of progressively more complex transformers -- from map, filter, takeWhile to take, to drop and dropWhile, and finally, zip and zipWith. The standard list API is also provided.

Emphatically we never convert a stream to a list and so we never use recursion or recursive types. All iterative processing is driven by the fold itself. We only need higher-ranked types, because lists cannot be fully implemented in simply typed lambda-calculus.

The implementation of zip also solves the problem of ``parallel loops''. One can think of a fold as an accumulating loop and realize a nested loop as a nested fold. Representing a parallel loop as a fold is a challenge, answered at the end of the article. This becomes especially interesting in the case of general backtracking computations, or backtracking computations in direct style, with delimited continuations modeling `lists'.

Version
The current version is 1.6, Jun 16, 2008.
References

zip-folds.lhs [13K]
Complete literate Haskell code. An earlier version was posted on the Haskell mailing list on Tue, 11 Oct 2005 17:25:24 -0700 (PDT). That version implemented zip with the help of a recursive type. The present version, inspired by a question from Chung-chieh Shan, introduces no extra data types, no recursion, and rather relies on the already defined functions to deconstruct a fold. Version 1.6 adds two examples of surprisingly simple expressions of list intersperse and Fibonacci in terms of fold.

How to take a tail of a functional stream

LogicT - backtracking monad transformer with fair operations and pruning
which illustrates the close connection with foldr/build list-fusion, aka ``short-cut deforestation''. The FR representation of lists is what one passes to build.

Predecessor and lists are not representable in simply typed lambda-calculus
Therefore, higher-rank or recursive/inductive types are necessary for lists.

 

Total stream processors and their applications to all infinite streams

In the article on seemingly impossible functional programs, Marti'n Escardo' wrote about decidable checking of satisfaction of a total computable predicate on Cantor numerals. The latter represent infinite bit strings, or all real numbers within [0,1]. Mart'n Escardo's technique can tell, in finite time, if a given total computable predicate is satisfied over all possible infinite bit strings. Furthermore, for so-called sparse predicates, Marti'n Escardo's technique is very fast.

We re-formulate the problem in terms of streams and depth-limited depth-first search, and thus cast off the mystery of deciding the satisfiability of a total computable predicate over the set of all Cantor numerals (whose cardinality is greater than that of natural numbers!)

As an additional contribution, we show how to write functions over Cantor numerals in a `natural' monadic style so that those functions become self-partially evaluating. The instantiation of the functions in an appropriate pure monad gives us transparent memoization, without any changes to the functions themselves. The monad in question is pure and involves no reference cells.

On `dense' functions on numerals (i.e., those that need to examine most of the bits of its argument, up to a limit), our technique performs about 9 times faster than the most sophisticated one by Marti'n Escardo'.

Version
The current version is October 2, 2007.
References

Total stream processors and quantification over infinite number of infinite streams. The complete article.
<http://conway.rutgers.edu/~ccshan/wiki/blog/posts/StreamPEval/>

StreamPEval.hs [12K]
Extensively commented Haskell98 code

Marti'n Escardo': Seemingly impossible functional programs.
<http://math.andrej.com/2007/09/28/seemingly-impossible-functional-programs/>

 

Pure functional, mutation-free, efficient double-linked lists

It is always an interesting challenge to write a pure functional and efficient implementation of an imperative algorithm destructively operating a data structure. The functional implementation has a significant benefit of equational reasoning and modularity. We can comprehend the algorithm without keeping the implicit global state in mind. The mutation-free, functional realization has practical benefits: the ease of adding checkpointing, undo and redo. The absence of mutations makes the code multi-threading-safe and helps in porting to distributed or non-shared-memory parallel architectures. On the other hand, an imperative implementation has the advantage of optimality: mutating a component in a complex data structure is a constant-time operation, at least on conventional architectures. Imperative code makes sharing explicit, and so permits efficient implementation of cyclic data structures.

We show a simple example of achieving all the benefits of an imperative data structure -- including sharing and the efficiency of updates -- in a pure functional program. Our data structure is a doubly-linked, possibly cyclic list, with the standard operations of adding, deleting and updating elements; traversing the list in both directions; iterating over the list, with cycle detection. The code:

The algorithm is essentially imperative, thus permitting identity checking and in-place `updates', but implemented purely functionally. Although the code uses many local, type safe `heaps', there is emphatically no global heap and no global state.

It is not for nothing that Haskell has been called the best imperative language. One can implement imperative algorithms just as they are -- yet genuinely functionally, without resorting to the monadic sub-language but taking the full advantage of clausal definitions, pattern guards, laziness.

Version
The current version is 1.2, Jan 7, 2009.
References

FDList.hs [8K]
The complete, commented Haskell98 code and many tests

Haskell-Cafe discussion ``Updating doubly linked lists''. January 2009.

 

Annotating trees post factum

A common problem is annotating nodes of an already constructed tree or other such data structure with arbitrary new data. We should accept that the original tree has been defined with no provision for node attributes; we should not make any changes to the data type definition. We should not even require rebuilding of the tree as we add annotations to its nodes. Our code must be pure functional; in particular, the tree to annotate should remain as it was. Finally, our solution should be expressible in a typed language without resorting to the Universal type.

Our problem is an instance of a more general one, of attribute, or property lists. In Common Lisp, a property list is a list of arbitrary key-value pairs attached to a symbol as an annotation. Crucially, the set of possible keys and the types of possible associated values are not known beforehand. One can always add an arbitrarily new annotation to a symbol. Property lists are especially handy in a compiler, letting each node of an abstract syntax tree be annotated with source location data, with the inferred type of the corresponding sub-expression, or with the results of various analyses. MLton, for example, uses property lists extensively for that purpose. However, MLton destructively modifies the tree as new annotations are added, and MLton resorts to the Universal type. Pure functional, typed implementation of post-factum--added property lists is an interesting problem.

Our solution relies on the observation that each node in an (ordered) tree can be identified by a path -- a sequence of integers. The root node has the empty sequence as its path. The path (h:t) identifies the node that is the h-th child of the node reachable by the path t. The path is the `natural' identification for each tree node; it can be constructed and resolved purely functionally. Furthermore, there is a decidable notion of identity and order on paths. To annotate the tree with values of a certain type, we build a separate finite map data structure associating paths with the annotations on the corresponding nodes. To add annotations of a different type, we build another map.

We have used this solution in a type checking code, to annotate each node of the syntax tree with the type. Each sub-expression gets annotated with its reconstructed type. The annotations are attached without re-defining the data type of expressions or even rebuilding the syntax tree. Since the code was intended for teaching, clarity was important and so was the avoidance of STRefs let alone StableNames and any IO. Our code, extending the base type checker TEvalNC.hs, is in the file TEvalNR.hs.

A bonus exercise was to modify the code to make the type checker return the reconstructed types of sub-terms even if the entire term turns out ill-typed. The intention was to model OCaml -- which, given a special flag, can dump the inferred types of all sub-expressions, even if the overall type checking failed. In Emacs and vi, one can highlight an expression or variable and see its inferred type. If the type checking failed, one can still explore what the type checker did manage to figure out.

Version
The current version is 1.1, July 2008.
References

TEvalNC.hs [4K]
Complete Haskell code of the type checker for simply-typed lambda calculus with constants and the fix-point. Type reconstruction is implemented as an evaluator with non-standard semantics.

TEvalNR.hs [4K]
A version of TEvalNC.hs that reports not only the inferred type of the whole term but also the inferred types for each subterm. The latter data are returned in a `virtual' typed AST -- virtual because the original AST is not modified and the inferred types are attached to AST nodes, well, virtually.

Interpreting types as abstract values
Lecture notes with the explanation of the type checking code.

MLton: Property Lists
<http://mlton.org/PropertyList>
The description of property lists and their implementation using mutation and the Universal type.

 

Cross-product of sequences

This article shows a solution (in Scheme) to a problem of computing a cross-product of n sequences. For example
     (cross-product '((0) (0 1 2 3) (1 2)))
should evaluate to
     ((0 0 1) (0 0 2) (0 1 1) (0 1 2) (0 2 1) (0 2 2) (0 3 1) (0 3 2))
The solution is especially elegant if we use the standard (SRFI-1) append-map function.
Version
The current version is 1.1, Jan 11, 2001.
References
Re: combinatorial stuff ?? [plain text file]
An article with the complete code and transcripts, posted on a newsgroup comp.lang.scheme on Thu, 11 Jan 2001 19:59:45 GMT
 

Powers of three

A primitive self-contained C code that computes and prints out 3^N very fast. A non-negative integer exponent N may be as big as 2000.

This is an intended solution to a problem presented at the 1995 Programming Contest organized by University of North Texas' Chapter of ACM. The full text of the problem is given in the title comments to the code.

What counts is the overall speed, of computing the result and converting it to ASCII. And fast the code is: on HP 9000/770/J210, it completes 3^2000 under 0.09 seconds, whereas bc takes 0.3 seconds of user time. The present code uses no multiplications or divisions to compute and print 3^N.

Version
The current version is 1.1, March 1995.
References
pow3.c [6K]
Commented C source code
 

Fast computation of Bernoulli numbers

This Haskell98 code quickly computes Bernoulli numbers. The code avoids explicit recursion, explicit factorials and (most) computing with rationals, demonstrating stream-wise processing and CAF memoization. For example, the following snippet defines a 2D table of pre-computed powers r^n for all r>=2 and n>1. Thanks to lazy evaluation, the table is automatically `sized' as needed. There is no need to guess the maximal size of the table so to allocate it.

     powers = [2..] : map (zipWith (*) (head powers)) powers
The rest of the algorithm exhibits similar stream-wise processing and computations of tables in terms of themselves.
Version
The current version is 1.1, March 2003.
References

Bernoulli.hs [3K]
Commented Haskell98 source code and tests

Messages speedup help by Damien R. Sullivan, Bill Wood, Andrew J Bromage, Mark P Jones, and many others posted on the Haskell-Cafe mailing list on March 6-8, 2003
<http://www.haskell.org/pipermail/haskell-cafe/2003-March/004065.html>
<http://www.haskell.org/pipermail/haskell-cafe/2003-March/004075.html>

 

Eratosthenes and other number sieves

Computing prime numbers is an important practical task as well as a common example for programming language tutorials. The Eratosthenes sieve is probably the most familiar algorithm for determining prime numbers. Alas, quite many implementations that call themselves Eratosthenes sieve do not actually implement that algorithm. For example, the classic Haskell code

     primes = sieve [ 2.. ] where
       sieve (p:x) = p : sieve [ n | n <- x, n `mod` p > 0 ]
is not Eratosthenes sieve. For one thing, it uses the division operation (or, mod). The Eratosthenes sieve specifically avoids both division and multiplication, which were quite difficult in old times. Mainly, as Melissa O'Neill explains, the above code tests every number for divisibility by all previously found primes. In contrast, the true Eratosthenes sieve affects only composite numbers. The prime numbers are `left out'; they are not checked by division or numeric comparison.

Given below are several implementations of the true Eratosthenes sieve algorithm, in Scheme, Scheme macros, and Haskell. The algorithm is usually formulated in terms of marks and crossing off marks, suggesting the imperative implementation with mutable arrays. The Scheme code follows that suggestion, using two important optimizations kindly described by George Kangas.

Eratosthenes sieve, however, can be implemented purely functionally, as Scheme macros and Haskell code demonstrate. The Haskell implementation is not meant to be efficient -- rather, it is meant to be purely functional, insightful, minimalist, and generalizable to other number sieves, e.g., `lucky numbers'. Like other Haskell algorithms it produces a stream of prime numbers. The Haskell implementation stores only marks signifying the numbers, but never the numbers themselves. Not only the implementation avoids multiplication, division or the remainder operations. We also avoid general addition and number comparison. We rely exclusively on the successor, predecessor and zero comparison. The predecessor can be easily eliminated. Thus the algorithm can be used with Church and Peano numerals, or members of Elliptic rings, where zero comparison and successor take constant time but other arithmetic operations are more involved.

Version
The current version is 1.5, Feb 12, 2007.
References

Melissa O'Neill: Re: Genuine Eratosthenes sieve
<http://www.haskell.org/pipermail/haskell-cafe/2007-February/022666.html>
Messages explaining the sieve algorithm and its differences from impostors; posted on Haskell-Cafe mailing list, February 2007.

Eratosthenes sieve and its optimal implementation [plain text file]
The explanation of the original Eratosthenes sieve and its optimizations.
The original article was posted as Re: arbitrary precision rationals on a newsgroup comp.lang.scheme on Tue, 13 Nov 2001 15:07:34 -0800

number-sieve.lhs [4K]
The literate Haskell98 source code for pure functional, minimalist Eratosthenes and lucky number sieves
The code was originally posted in an article Even better Eratosthenes sieve and lucky numbers on the Haskell-Cafe mailing list on Mon, 12 Feb 2007 18:37:46 -0800 (PST)

A stress test of the syntax-rule macro-expander
Eratosthenes sieve as a syntax-rule macro, to perform primality test of Church-Peano numerals at macro-expand time

Lucky numbers: another number sieve
<http://mathworld.wolfram.com/LuckyNumber.html>
<http://www.research.att.com/~njas/sequences/A000959>

Random Relative Primes

 

A neuron that learns how to play blackjack

A trivial C code that learns how to play blackjack, by playing it with itself. The code implements a bona fide neural network with back-propagation. The network is made of a single neuron, possessing a single byte of intelligence. The neuron has ``weights'' and a ``threshold''; the neuron ``fires'' when the value of the activation function computed over the current input exceeds the current threshold. In game terms, using the current score and the number of aces on hand the neuron decides if to draw another card or to stay. The threshold (the neuron's memory) is adjusted as the game proceeds, using back-propagation -- if you can discern it in such a primitive setting.
Version
The current version is 1.1, December 18, 1997.
References
black-jack.c [10K]
ANSI C source code, albeit written in a C++ style. The code and its sample output are well annotated.
 

Iterated zipping puzzles

The following is the refined code for the challenge originally posted by Joe English on a Haskell-Cafe thread about homework-like puzzles. The challenge was to figure out what the code does without first loading it up in a Haskell interpreter.

     s f g x = f x (g x)
     
     puzzle = (!!) $ iterate (s (lzw (+)) (0:)) [1] where
         lzw op [] ys = ys
         lzw op (x:xs) (y:ys) = op x y : lzw op xs ys
Incidentally, a small change gives a different series:
     puzzle1 = (!!) $ iterate (s ((lzw (+)).(0:)) (1:)) [] where
         lzw op [] ys = ys
         lzw op (x:xs) (y:ys) = op x y : lzw op xs ys
Finally, how can we possibly live without the following:
     puzzle2 = (!!) $ iterate (s ((lzw (+)).(1:).(0:)) (0:)) [1,1] where
         lzw op xs [] = []
         lzw op (x:xs) (y:ys) = op x y : lzw op xs ys
Hints:
     *Main> puzzle 5
     [1,5,10,10,5,1]
     *Main> puzzle1 5
     [1,2,4,8,16]
     *Main> puzzle2 5
     [1,1,2,3,5,8,13]
Version
The current version is August 2003.
References

Message Homework posted on the Haskell-Cafe mailing list on Mon, 25 Aug 2003 18:50:42 -0700 (PDT)

Discussion thread, started on Haskell-Cafe by Thomas Bevan on Aug 22, 2003.
<http://www.haskell.org/pipermail/haskell-cafe/2003-August/004977.html>



Last updated April 5, 2009

This site's top page is http://pobox.com/~oleg/ftp/

oleg-at-pobox.com or oleg-at-okmij.org
Your comments, problem reports, questions are very welcome!

Converted from SXML by SXML->HTML