Algorithms and Data Structures, using Haskell

General-purpose algorithms and data structures, illustrated in Haskell

Provably perfect random shuffling and its pure functional implementations

We show two purely functional programs that perfectly, randomly and uniformly shuffle a sequence of arbitrary elements. We prove that the algorithms are correct. We also show why a commonly used sort-based shuffling algorithm is not perfect.

The article first formally defines the perfect random shuffle, outlines the algorithm, and discusses the well-known imperative shuffler that swaps elements in the array. We present and analyze two purely functional implementations. The naive one runs in `O(n^2)` time. A more sophisticated code has the `O(n*logn)` time complexity. The article contains the full code to arrange a sequence in a complete binary tree and to extract an arbitrary element from the tree maintaining the completeness of the remaining tree.

Version
The current version is 1.2, Jul 14, 2002.
References
perfect-shuffle.txt [15K]
The article with the full source code for the two implementations, including the code for building and deconstructing a complete binary tree. It was originally posted as Provably perfect shuffle algorithms on the newsgroup comp.lang.functional on Mon, 3 Sep 2001 13:59:46 -0700

Dave Bayer, Persi Diaconis: Trailing the dovetail shuffle to its lair
Ann. Appl. Probab. 2 (1992), no. 2, pp. 294-313.
In the context of card shuffling, `perfect shuffle' means a deterministic interlacing of the two halves of the deck. The paper studies the perfect and other card shuffles in detail.

Interpreting types as abstract values: a tutorial on Hindley-Milner type inference

We expound a view of type checking as evaluation with `abstract values'. Whereas dynamic semantics, evaluation, deals with (dynamic) values like 0, 1, etc., static semantics, type checking, deals with approximations like `int`. A type system is sound if it correctly approximates the dynamic behavior and predicts its outcome: if the static semantics predicts that a term has the type `int`, the dynamic evaluation of the term, if it terminates, will yield an integer.

As object language, we use simply-typed and let-polymorphic lambda calculi with integers and integer operations as constants. We use Haskell as a metalanguage in which to write evaluators, type checkers, type reconstructors and inferencers for the object language.

We explore the deep relation between parametric polymorphism and `inlining'. Polymorphic type checking then is an optimization allowing us to type check a polymorphic term at the place of its definition rather than at the places of its use.

Joint work with Chung-chieh Shan.

Version
The current version is 1.1, July 2008.
References
lecture.pdf [199K]
Lecture notes
The notes, which explain all the code below, are for the course given at the Formosan Summer School on Logic, Language, and Computation (FLOLAC08). Taipei, Taiwan. July 9-10, 2008.

EvalN.hs [4K]
Evaluator for the untyped (nominal) lambda-calculus with integers and zero-comparison

TEvalNC.hs [4K]
Type reconstruction for the simply-typed Church-style lambda-calculus with integers and zero-comparison
Source terms have type annotations on bound variables. Type reconstruction is implemented as an evaluator with a non-standard semantics.

TEvalNR.hs [4K]
A version of `TEvalNC.hs` that reports not only the inferred type of the whole term but also the inferred types for each subterm

TInfT.hs [7K]
Type reconstruction for the simply-typed Curry-style lambda-calculus with integers and zero-comparison
Source terms have no type annotations whatsoever.

TInfTM.hs [10K]
A version of `TInfT.hs` hiding the single-threaded state through simple algebraic transformations (beta-expansions)

TInfLetI.hs [10K]
The Curry-style calculus with `Let` : polymorphic type checking via `inlining'

TInfLetP.hs [14K]
Let-polymorphism via type schemes
This is a non-canonical presentation of Hindley-Milner type inference, treated as an optimization of polymorphic type checking by inlining.

TInfTEnv.hs [10K]
Simply-typed Curry-style lambda-calculus with integers and zero-comparison: reconstructing not only types but also the type environment
This approach lets us type check open terms. A type checking failure means the term cannot be typed in any type environment.
One often hears a complaint against typing: we can evaluate terms we can't type check. This code shows that we can type check terms we can't even evaluate. We cannot evaluate an open term at all, but we can type check it, inferring both its type as well as the requirement on the environments in which the term must be used.

EvalTaglessI.hs [9K]
Initial Tagless Typed lambda-calculus with integers and the conditional in the higher-order abstract syntax
Haskell itself ensures the object terms are well-typed. This code relies on GADT.

EvalTaglessF.hs [8K]
Final Tagless Typed lambda-calculus with integers and the conditional in the higher-order abstract syntax
Haskell itself ensures the object terms are well-typed. No GADTs are used. Apart from switching off the monomorphism restriction for convenience, the code is in Haskell98.

Pure-functional transformations of cyclic graphs and the Credit Card Transform

Cycles certainly make it difficult to transform graphs in a pure non-strict language. Cycles in a source graph require us to devise a way to mark traversed nodes -- however we cannot mutate nodes and cannot even compare nodes with a generic ('derived') equality operator. Cycles in a destination graph call for keeping track of already constructed nodes so we can make back-references. An obvious solution is to resort to the state monad or `IORef`s. There is also a monad-less solution, which maintains a dictionary of already constructed nodes. The latter is less obvious: seemingly we cannot add a node to the dictionary until we have built the node. Therefore, we cannot use the updated dictionary when building the descendants of the node -- which need the updated dictionary to link back. The problem can be overcome however with a credit card transform (a.k.a. "buy now, pay later" transform). To avoid hitting the bottom, we just have to "pay" by the "due date".

For illustration, the article below considers the problem of printing out a non-deterministic finite automaton (NFA) and transforming it into a deterministic finite automaton (DFA). Both NFA and DFA are represented as cyclic graphs.

Version
The current version is 1.3, Aug 18, 2002.
References
CCard-transform-DFA.lhs [10K]
A literate Haskell code expounding the technique
The code was first posted in a message Transformations of cyclic graphs [Was: Efficiency of list indices in definitions] on Thu, 08 Aug 2002 16:11:30 -0700 on the Haskell-Cafe mailing list.

Tying the knot: How to build a cyclic data structure

Trying to tie the knot
A message on the Haskell mailing list about a fixed-point combinator helping to tie a knot with forward links. Posted on Fri, 12 Apr 2002 18:44:07 -0700.

De-biforestation

Deforestation is usually defined as the elimination of an intermediate data structure, often a collection of items, sent from a single producer to a single consumer. Jorge Adriano showed an interesting example of one producer feeding two independent consumers. The two streams of data are mutually dependent. Furthermore, the rate of production is non-uniform: Generating an element in one stream may require generating trillion items in the other stream. If the first consumer is being evaluated, that consumer causes the evaluation of the producer until the latter yields an item. The trillion of items incidentally produced in the other stream have to be stored somewhere. In more OS-centric terms, the evaluator can't generally reduce two expressions in parallel. If one stream consumer is running, the other consumer sits in the ready queue. Therefore, data sent to the second consumer must be buffered, sometimes at great expense.

We consider the problem of one producer and two consumers in the general setting. We derive a deforested version, which no longer needs to buffer any produced items. When we run that version, the memory retaining profile shows no space leaks. We also discuss a ``parallel'' writing of several streams into two distinct files. Our solution is safe and yet the i/o is effectively interleaved. The deforested solution is derived, via a sequence of equivalent transformations. The derived code worked on the first try.

Version
The current version is 1.2, Dec 5, 2002.
References
de-biforestation.lhs [12K]
The article as a well-commented literate Haskell code
An earlier version was posted on the Haskell-Cafe mailing list on Tue, 3 Dec 2002 22:05:05 -0800 (PST).

Jorge Adriano: producing and consuming lists
A message posted on the Haskell-Cafe mailing list posted on Tue, 5 Nov 2002 16:04:58 +0000. The message describes the problem of consuming two inter-dependent streams while avoiding space leaks.

Logic programming in Haskell optimized for reasoning

We demonstrate an executable model of the evaluation of definite logic programs, i.e., of resolving Horn clauses presented in the form of definitional trees. Our implementation, `DefinitionTree`, is yet another embedding of Prolog in Haskell. It is distinguished not by speed or convenience. Rather, it is explicitly designed to formalize evaluation strategies such as SLD and SLD-interleaving, to be easier to reason about and so help prove termination and other properties of the evaluation strategies. The main difference of the `DefinitionTree` from other embeddings of Prolog in Haskell is the absence of name-generation effects. We need neither `gensym` nor the state monad to ensure the unique naming of logic variables. Since naming and evaluation are fully separated, the evaluation strategy is no longer concerned with fresh name generation and so is easier to reason about equationally.
Version
The current version is January 2008.
References
DefinitionTree.hs [14K]
The commented source code

Pure, declarative, and constructive arithmetic relations
`DefinitionTree` has been used to describe the SLD and SLD-interleave evaluation strategies and to prove the basic properties of solution sets.

Controlled memoization

The problem is adding memoization to a recursive `Int->Int` function. That seems simple until we realize that finding the value of that particular function at the argument `x` requires evaluating the function for smaller arguments and may require evaluating the function for larger arguments. Thus straight-forward memoization leads to a large and sparsely populated memoization table. As the original poster observed, the memoized function begins thrashing sooner than the original code.

The solution is to balance the cost of recomputing the value versus the cost of maintaining the memoization table and doing the lookup. It is worth memoize the values of the function for some small `x` -- which are used frequently. For larger `x`, re-computation is better. The parameter `sc_upb` in the code determines the balance, the largest argument value for which we still memoize. The simple benchmark shows that this controlled memoization does improve both the space and time performances.

A notable feature of the code is the transparency of adding the controlled memoization. The body of the algorithm remains essentially the same.

Version
The current version is 1.1, Mar 16, 2005.
References
memo-array.lhs [2K]
The literate Haskell code with explanations and the benchmark
The code was originally posted as Re: (Newbie) Dynamic Programming, Memoizing Etc. on the Haskell-Cafe mailing list on Wed, 16 Mar 2005 18:57:48 -0800 (PST)

Annotating trees post factum

A common problem is annotating nodes of an already constructed tree or other such data structure with arbitrary new data. We should accept that the original tree has been defined with no provision for node attributes; we should not make any changes to the data type definition. We should not even require rebuilding of the tree as we add annotations to its nodes. Our code must be pure functional; in particular, the tree to annotate should remain as it was. Finally, our solution should be expressible in a typed language without resorting to the Universal type.

Our problem is an instance of a more general one, of attribute, or property lists. In Common Lisp, a property list is a list of arbitrary key-value pairs attached to a symbol as an annotation. Crucially, the set of possible keys and the types of possible associated values are not known beforehand. One can always add an arbitrarily new annotation to a symbol. Property lists are especially handy in a compiler, letting each node of an abstract syntax tree be annotated with source location data, with the inferred type of the corresponding sub-expression, or with the results of various analyses. MLton, for example, uses property lists extensively for that purpose. However, MLton destructively modifies the tree as new annotations are added, and MLton resorts to the Universal type. Pure functional, typed implementation of post-factum--added property lists is an interesting problem.

Our solution relies on the observation that each node in an (ordered) tree can be identified by a path -- a sequence of integers. The root node has the empty sequence as its path. The path `(h:t)` identifies the node that is the `h`-th child of the node reachable by the path `t`. The path is the `natural' identification for each tree node; it can be constructed and resolved purely functionally. Furthermore, there is a decidable notion of identity and order on paths. To annotate the tree with values of a certain type, we build a separate finite map data structure associating paths with the annotations on the corresponding nodes. To add annotations of a different type, we build another map.

We have used this solution in a type checking code, to annotate each node of the syntax tree with the type. Each sub-expression gets annotated with its reconstructed type. The annotations are attached without re-defining the data type of expressions or even rebuilding the syntax tree. Since the code was intended for teaching, clarity was important and so was the avoidance of `STRefs` let alone `StableNames` and any IO. Our code, extending the base type checker `TEvalNC.hs`, is in the file `TEvalNR.hs`.

A bonus exercise was to modify the code to make the type checker return the reconstructed types of sub-terms even if the entire term turns out ill-typed. The intention was to model OCaml -- which, given a special flag, can dump the inferred types of all sub-expressions, even if the overall type checking failed. In Emacs and vi, one can highlight an expression or variable and see its inferred type. If the type checking failed, one can still explore what the type checker did manage to figure out.

Version
The current version is July 2008.
References
TEvalNC.hs [4K]
Complete Haskell code of the type checker for simply-typed lambda calculus with constants and the fix-point
Type reconstruction is implemented as an evaluator with a non-standard semantics.

TEvalNR.hs [4K]
A version of `TEvalNC.hs` that reports not only the inferred type of the whole term but also the inferred types for each subterm. The latter data are returned in a `virtual' typed AST -- virtual because the original AST is not modified and the inferred types are attached to AST nodes, well, virtually.

Interpreting types as abstract values: a tutorial on Hindley-Milner type inference
Lecture notes with the explanation of the type checking code.

MLton: Property Lists
< http://mlton.org/PropertyList >
The description of property lists and their implementation using mutation and the Universal type.

Similarity between instruction scheduling and imperative functional programming

The article makes the case that a transformation of a typical imperative code into functional code is similar to the data and control dependency analyses in compilers for modern CPUs. By typical imperative code we mean a typical numerical code with nested loops and updateable arrays.

We consider three examples of such transformations, including two pure functional implementations of a bubble-sort. The imperative code is quite complex: nested loops, data dependencies across iterations, reading from and writing into the array, conditional branches that depend on data and the iteration history, control dependencies across nested loops. We show two solutions in Haskell. Both use fast, imperative `STArrays`; both try to separate the imperative part from a functional part. The imperative part is performed in a monad while the functional part is written in ordinary Haskell. The two solutions are closely related, and written to emphasize their similarity. One of the solution is an interpreter for a trivial virtual machine.

The analysis of data and control dependencies across loop iterations, scheduling of instructions and interlocking of their results, pre-fetching of data -- all are the job of an advanced compiler or the CPU. On the other hand, we have to do the same kind of analyses to separate the functional part of a computation from its imperative part. Functional programming appears closer to ``the bare metal'' than one would have thought.

The relationship of the dependency analysis of the imperative array code with the extraction of a functional program from an imperative one has been stated by Andrew W. Appel, in his paper "SSA is Functional Programming". The present article has shown several illustrations of his thesis. We have done manual dependency analyses of several code snippets that update arrays in place. The analyses let us extract pure-functional parts of imperative programs. The rest requires serialization and can be handled by an `ST` monad.

Version
The current version is 1.2, Feb 10, 2002.
References

fp-arrays-assembly.txt [12K]
The article with the full source code
It was originally posted as FP vs assembly programming for super-scalar CPUs [Was: Nested Loops in Clean] on the newsgroup comp.lang.functional on Thu, 16 Aug 2001 18:24:24 -0700

Andrew W. Appel SSA is Functional Programming
ACM SIGPLAN Notices v. 33, no. 4, pp. 17-20, April 1998.

A character-efficient Mastermind game code

The following code is a space-efficient implementation of the game Mastermind. It is efficient not in terms of the heap or stack space but in the source text space. Jan Rochel wrote the original code and put it in his e-mail signature. He asked for a shorter version. The present code relies on a different algorithm to compute exact and inexact matches in two strings. The new algorithm made the code a bit faster, but more importantly, a bit (7%) shorter. The new code is also more obscure and exhibits some interesting patterns, in particular `f x=(x\\).(x\\)`.

The version to put in a .signature (275 chars):

```     import Random;import List;p=putStrLn;u=uncurry;f x=(x\\).(x\\)
main=mapM(\x->randomRIO(49,54))[1..4]>>=n 0.map toEnum>>=p.("Tries: "++).show
e=((partition\$u(==)).).zip;h(p,q)=['*'|x<-p]++['+'|x<-u f\$unzip q]
n a s=getLine>>=m where{m i|i==s=return a;m i=p(h\$e i s)>>n(a+1)s}
```

An expanded version that is easier to read:

```     import Random;import List

main = mapM(\x->randomRIO(49,54))[1..4]
>>= n 0 . map toEnum >>= p . ("Tries: "++) . show
p=putStrLn
u=uncurry
e=((partition\$u(==)).) . zip
f x=(x\\).(x\\)
h(p,q)=['*'|x<-p] ++ ['+'|x<-u f \$ unzip q]
n a s=getLine>>=m a s
m a s i | i==s = return a
m a s i = p(h\$e i s) >> n(a+1)s
```
The code runs on GHC(i).
Version
The current version is 1.1, May 24, 2003.
References
The code was originally posted as an article Haskell: Mastermind code for a .sig on the newsgroup comp.lang.functional on Sat, 24 May 2003 00:04:40 -0700

Last updated November 5, 2013

This site's top page is http://okmij.org/ftp/

oleg-at-okmij.org
Your comments, problem reports, questions are very welcome!

Converted from HSXML by HSXML->HTML