previous   next   up   top

Haskell Programming

 


 

Preventing memoization in (AI) search problems

In lazy evaluation, an expression is not evaluated unless its result is needed. Once an expression is evaluated, its result is remembered, or memoized, in case the value of the same expression is needed again. This on-demand evaluation and memoization happens transparently and `for free', sparing us from explicitly programming these features. Another great attraction of lazy evaluation is the potential to improve algorithmic complexity: As Bird et al. wrote, ``for some problems at least -- lazy evaluators are strictly more powerful than eager ones.''

The implicit memoization however is a trade-off. We trade space to store expression's result for time gained in avoiding recomputations. The trade-off is particularly worthy if the result takes much less memory than needed for the closure (thunk) of the expression. This is often the case with numeric code. Non-deterministic, probabilistic programming and general AI search problems are the opposite case. A non-deterministic expression is typically represented as a lazy search tree, which is often huge even for small expressions. It becomes a better trade-off to re-evaluate an expression rather than to fill all memory with results.

Alas, GHC is designed for the opposite trade-off. Therefore, using Haskell even for simple search problems is quite a challenge since memoization gets in the way. Preventing the memoization is surprisingly hard, since GHC is very good at finding the opportunities for it, even within thunks. This article uses a typical example of non-deterministic search to illustrate the problem posed by lazy evaluation and to describe a few tricks to prevent memoization. Some of them are unexpected.

Our running example computes and prints the first n elements of the infinite stream of Pythagorean triples pyth, using three infinite streams of integers from 1. As typical for non-deterministic programs, the example generates candidate solutions and rejects most of them.

     from :: MonadPlus m => Int -> m Int
     from i = return i `mplus` from (i+1)
           
     pyth :: MonadPlus m => m (Int,Int,Int)
     pyth = do
       x <- from 1
       y <- from 1
       z <- from 1
       if x*x + y*y == z*z then return (x,y,z) else mzero
The interleaving of three infinite streams precludes the List monad (depth-first search), see below. Instead of a lazy list, we use a search tree to represent the result of a non-deterministic computation:
     data Tree1 a = Fail1 | Val1 a | Node1 (Tree1 a) (Tree1 a)
We rely on the non-strictness of Haskell to prevent evaluation of tree nodes until we traverse to them: after all, the tree may be infinite -- as is the case in our example. Tree1 is an instance of Monad and MonadPlus; here are the most complex parts of these instances (see the accompanying code for the rest):
     Node1 e1 e2 >>= f = Node1 (e1 >>= f) (e2 >>= f)
     mplus = Node1

To `run' the non-deterministic computation and produce the the stream of triples, we traverse the Tree, extract the successfully produced results from the Val leaves and return them as a lazy list. Different tree traversals correspond to different non-deterministic search strategies. Depth-first traversal (DFS) is the most efficient, needing only O(d) space to examine a node at depth d. Alas, an infinite branch in the tree traps DFS. In our pyth tree, DFS will get stuck chasing an infinite chain of Fail. Breadth-first traversal (BFS) in contrast shall visit any node in a tree, given time. BFS is a complete strategy: if a solution (leaf Val) exists, BFS will find it. Alas, BFS needs a lot of space to maintain the job queue, the frontier of the search. At search depth d the frontier may take O(2^d) space. Iterative deepening is a hybrid method, complete as BFS yet needing as little of working space as DFS. Iterative deepening explores the progressively long `prefix' of tree with DFS. Each new exploration phase repeats all the work of the previous explorations of shallower prefixes. Iterative deepening clearly trades time for space. Despite its gross wastefulness, the method is quite popular, for example, in automated theorem proving. Its trade-off has proved worthwhile.

Here are the results of computing and printing the first n Pythagorean triples. The code was compiled by GHC 7.0.4 with optimization -O2.

   Mutator time, sec   CG time, sec   Memory in use, MB   Average residency, KB 
BFS, n=30 13.0 5.0 3 465
Iter Deep, n=30 0.15 0.06 5 1506
Iter Deep, n=100 4.8 1.3 56 20832
We expected iterative deepening to take more time than BFS but significantly less memory. We observed just the opposite.

Recall that iterative deepening keeps re-traversing the tree. Each exploration cycle redoes all the previous explorations. Lazy evaluation helps, it seems. When we first reach Node e1 e2, we evaluate e1 and e2 that were stored in the node unevaluated (otherwise, we would have diverged constructing the tree, which is infinite). Lazy evaluation replaces e1 and e2 with their results. When iterative deepening comes across the same node in a new cycle, it gets the results of e1 and e2 right away. That seems like a good thing, until we look at the space. As iterative deepening explores the Tree, it needs more and more memory to store the explored prefix, which is about twice the size of the BFS frontier. Lazy evaluation thus defeats the purpose of iterative deepening, of recomputing the revisited tree nodes to avoid storing them. Lazy evaluation does exactly the wrong thing.

In a strict language, we would have used thunks to represent infinite trees. If tree nodes store thunks, lazy evaluation would memoize thunks -- which evaluate to themselves rather than to trees. It seems therefore the following modification should stop lazy evaluation's meddling in iterative deepening.

     data Tree2 a = Fail2 | Val2 a | Node2 (() -> Tree2 a) (() -> Tree2 a)
     Node2 e1 e2 >>= f = Node2 (\() -> e1 () >>= f) (\() -> e2 () >>= f)
     mplus e1 e2 = Node2 (\() -> e1) (\() -> e2)
Every time we need traverse through a Node, we have to force the thunks and re-compute the branches. At least, in theory. Here is the practice.
   Mutator time, sec   CG time, sec   Memory in use, MB   Average residency, KB 
BFS, n=30 13.0 5.0 3 509
Iter Deep, n=30 0.3 0.1 8 2964
Iter Deep, n=100 10.6 1.7 96 39244
The fix that was supposed to help iterative deepening has made it worse.

Such an unexpected result was quite a puzzle. It seems GHC is just too smart. Apparently it notices that a thunk (\() -> e) can only be applied to the same argument. Therefore, the first time the thunk is forced by applying it to (), the result can justifiably be memoized: the next time around the thunk will be applied to the same (), and hence, will give the same result anyway.

The new fix is to deliberately confuse GHC. We obfuscate the tree-construction operations (>>=) and mplus with auxiliary functions app and app1.

     Node3 e1 e2 >>= f = Node3 (app1 e1 f) (app1 e2 f)
     mplus e1 e2 = Node3 (app e1) (app e2)
     
     {-# NOINLINE app #-}
     app e () = e
     
     {-# NOINLINE app1 #-}
     app1 e f () = e () >>= f
That does the trick. Here are the results.
   Mutator time, sec   CG time, sec   Memory in use, MB   Average residency, KB 
BFS, n=30 13.2 4.7 3 413
Iter Deep, n=30 0.4 0.03 2 78
Iter Deep, n=100 13.4 0.9 2 413
Finally we see the agreement with our expectations. Iterative deepening now needs only 2MB (down from 56MB in the first try) to compute the first 100 Pythagorean triples. We may now try bigger searches.

We have seen that lazy evaluation is a trade-off, which may be hurtful in some cases, in particular, in search problems over huge data structures, where it is often beneficial to recompute the result than to store it. Preventing lazy evaluation is possible but surprisingly tricky.

Version
The current version is May 2012.
References
Richard S. Bird, Geraint Jones and Oege de Moor: More Haste, Less Speed: Lazy versus Eager Evaluation
J. Functional Programming, 1997, v7(5), pp. 541--547

STrees.hs [11K]
Complete code for our example and the benchmark.

Delimited control and breadth-first, depth-first, and iterative deepening search
with more details on BFS and iterative deepening

 

Breaking referential transparency with unsafeInterleaveST

Among unsafe functions, unsafeInterleaveST (and its close relative, or specialization, unsafeInterleaveIO) is often viewed as `mostly harmless'. The function unsafeInterleaveIO underlies Lazy IO, an (unfortunately) widely used feature. The interleaving functions bear hardly any stigma. On the contrary: Hackage has the monad-interleave package dedicated to them, encouraging their use.

It is unfortunate that the admonition of the people who introduced interleaveST has gone unheeded. In their 1995 paper ``State in Haskell'' Launchbury and Peyton Jones wrote: ``It should be clear by now that interleaveST has very undesirable properties. It duplicates and discards the state, which gives rise to a very subtle class of programming errors. We have so far failed to develop good techniques for reasoning about its correctness'' [Sec 10.5]. Their fears are justified: unsafeInterleaveST is unsafe. It lets us write Bool expressions with observable mutable-state side-effects, hence proving that True equals False. In other words, referential transparency, or the substitution of equals for equals, may fail even for Bool expressions.

Our example is about Boolean equality, defined as

     (==) :: Bool -> Bool -> Bool
     True  == True  = True
     False == False = True
     _     == _     = False
GHC's standard Prelude derivation is identical. Clearly, it is symmetric, as behooves of equality: x == y is the same as y == x -- even if either x::Bool or y::Bool (or both) are undefined.

And yet there exists a context that distinguishes x == y from y == x. That is, there exists

     bad_ctx :: ((Bool,Bool) -> Bool) -> Bool
such that
     *R> bad_ctx $ \(x,y) -> x == y
     True
     *R> bad_ctx $ \(x,y) -> y == x
     False

Here is the complete code:

     module R where
     
     import Control.Monad.ST.Lazy (runST)
     import Control.Monad.ST.Lazy.Unsafe (unsafeInterleaveST)
     import Data.STRef.Lazy
     
     bad_ctx :: ((Bool,Bool) -> Bool) -> Bool
     bad_ctx body = body $ runST (do
        r <- newSTRef False
        x <- unsafeInterleaveST (writeSTRef r True >> return True)
        y <- readSTRef r
        return (x,y))
     
     t1 = bad_ctx $ \(x,y) -> x == y
     t2 = bad_ctx $ \(x,y) -> y == x

To see how bad this is, recall the claims about ST from Launchbury and Peyton Jones' paper, excerpted from Section 1 below:

The function unsafeInterleaveST breaks each of these claims. It is unsafe. (The authors define safety as the Church-Rosser property: the program giving the same result regardless of the evaluation order provided data dependencies are respected; see the beginning of Sec. 10.) Launchbury and Peyton Jones knew that very well. In the hindsight, the bad_ctx example has the same kernel of badness as the unique-supply tree example in Sec 10.2 of their paper. The authors point out a subtlety in that code, admitting that they fell into the trap themselves. The fact that unsafeInterleaveST is unsafe and dangerous, and also apparently useful lead the authors to pessimistically conclude Sec 10.5: ``We fear that there may be no absolutely secure system -- that is, one which guarantees the Church-Rosser property -- which is also expressive enough to describe the programs which systems programmers (at least) want to write, such as those above. We do, however, regard interleaveST as useful primarily for systems programmers.''

The function unsafeInterleaveST ought to bear the same stigma as does unsafePerformIO. After all, both masquerade side-effecting computations as pure. Both break the equational reasoning, the greatest asset of Haskell.

Version
The current version is April 2013.
References
John Launchbury and Simon L. Peyton Jones: State in Haskell
Lisp and Symbolic Computation, 1995, v8 N4, pp. 293-341.
This paper has introduced the ST monad and also lazy IO.

 

Observing strictness

A strictness test takes a function and returns True if the function is strict and False otherwise. The article shows several implementations of the test. After a brief reminder of strictness, we explain why the strictness test is impossible to implement in Haskell. Nonetheless several implementations exist -- which should tell us about the features of GHC and Haskell to be leery of.

In a non-strict language, a function generally receives as its argument an unevaluated computation rather than a value. If a function can produce the result value without executing the argument computation -- without really needing the value of the argument -- the function is called non-strict. It is strict otherwise. For example, the function const 1 :: Int -> Int is non-strict (since it returns 1 without looking at its argument), and succ :: Int -> Int is strict since it needs the value to increment. The function id :: Int -> Int is also strict: obtaining the value (the weak head normal form) of the result requires the value of the argument. A well-known non-strict function is cons, or (:). For the reason why, see the famous paper that begat lazy evaluation, on why cons should not evaluate its arguments.

To tell if a function is strict we should be able to observe the evaluation of its argument. Functions are opaque and we cannot look inside them. Hence the only way to tell if the argument has been evaluated is by observing the side-effect of the evaluation. However, the type of the argument may preclude side effects. For example, the argument of an Int->Int function has the pure type Int. Computations of that type cannot have observable side-effects in a pure language. To put this another way: if x::Int is a computation whose evaluation is observable and if v is the value of x, then v and x are not substitutable for each other since f x and f v are observably different if f is strict. Therefore, the strictness test should not be possible. If it were, it would break the referential transparency.

There is an escape hatch in Haskell: a non-termination is not considered a side-effect. Therefore, a computation of the type Int is allowed not to evaluate to anything and loop forever. Haskell has a convenient term for such a looping computation: undefined. Thus we come to the conventional definition of non-strict functions: a function f is non-strict if f undefined returns a value.

This escape hatch does not contradict the impossibility to observe strictness: if f undefined takes long time to compute a value, we cannot be sure that it really loops (and the cause of looping is really the attempt to evaluate the looping argument). Perhaps f just takes a very long time to finish. In short, we cannot really observe divergence -- otherwise, we would solve the Halting problem.

And yet there are at least two ways to write the strictness test in practice, because GHC lets us create observably side-effecting computations of pure types, and IO contexts are not referentially transparent. Lazy IO makes the test possible already in Haskell98. The computation length str is side-effecting if str::String is the result of getContents. Evaluating length str has the side-effect of reading the file and closing the handle. The status of the handle can be detected. GHC brings imprecise exceptions, thus permitting `multiple bottoms'. Here is a simple test of strictness with imprecise exceptions.

     import Control.Exception
     
     handler :: SomeException -> IO Bool
     handler _ = return True
     
     is_strict :: (Int -> Int) -> IO Bool
     is_strict f = handle handler $
       if f (error "Bang!") > 0 then return False else return False
     
     main_s = is_strict succ 
     -- True
     
     main_ns = is_strict (const 1)
     -- False

Finally, even `mildly unsafe' functions like unsafeInterleaveST (let alone really unsafe ones) break referential transparency in any contexts. Here is the strictness test based in unsafeInterleaveST.

     import Control.Monad.ST.Lazy
     import Data.STRef.Lazy
     
     is_strict :: (Int -> Int) -> Bool
     is_strict f = f (snd t4) `seq` fst t4
      where
      t4 = runST (do
       r <- newSTRef False
       y <- unsafeInterleaveST (writeSTRef r True >> return 0)
       x <- readSTRef r
       return (x,y))
     
     tr1 = is_strict id            -- True
     tr2 = is_strict (const 1)     -- False

The demonstrated implementations of the strictness test may be practically useful, but they are theoretically worrying.

Version
The current version is April 2011.
References
StrictnessTestLIO.hs [3K]
The Lazy IO-based strictness test in Haskell98
The code was posted at the discussion thread Strictness is observable on the Haskell-Cafe mailing list on Fri, 1 Apr 2011 02:23:24 -0700 (PDT)

Daniel P. Friedman and David S. Wise: CONS should not Evaluate its Arguments
ICALP 1976 and also Indiana University Technical Report TR44. January 1976.
< http://www.cs.indiana.edu/cgi-bin/techreports/TRNNN.cgi?trnum=TR44 >

 

Lazy IO breaks equational reasoning

Lazy IO creates an attractive illusion that files are ordinary Haskell strings, letting the pure Haskell code handle a file as if it were loaded in memory. Furthermore, just as an ordinary string is built lazily, so could a file be read as needed. The illusion has a dark side: the real danger of deadlocks; running out of file descriptors and similar scarce resources; unpredictable, volatile and sometimes unbearably excessive use of memory. Lazy IO also has the fundamental problem: it breaks the most attractive feature of Haskell, equational reasoning. The problem was alluded to in Launchbury and Peyton Jones' paper that introduced the ST monad, interleaveST, and lazy IO. We demonstrate that the problem is as bad as they feared.

UnsafePerformIO is known to be unsafe, breaking equational reasoning; unsafeInterleaveIO, which underlies Lazy IO, gets a free pass because any computation with it has to be embedded in the IO context in order to be evaluated -- and we can expect anything from IO. But unsafeInterleaveIO has essentially the same code as unsafeInterleaveST: compare unsafeInterleaveST from GHC/ST.lhs with unsafeDupableInterleaveIO from GHC/IO.hs keeping in mind that IO and ST have the same representation, as described in GHC/IO.hs. And unsafeInterleaveST is really unsafe -- not just mildly or somewhat or vaguely unsafe. In breaks equational reasoning, in pure contexts.

Lazy IO is likewise problematic: On one hand, a simple equational proof shows that for all Boolean x and y, x < y is the same as not (y <= x). On the other hand, we exhibit a context that distinguishes the two expressions: plugging them into the context gives two programs that print different results. Equational reasoning is thus unsound.

Equational reasoning is regarded as the greatest advantage of pure functional languages: we may derive programs or prove their correctness using the elementary, high-school algebra. Equational reasoning is based on the principle of `substitution of equals for equals'. If an expression e1 is equal to e2 (which we write as e1 === e2) and e1 occurs as part of a larger expression e (written as e[e1]), then we may replace that e1 with e2 obtaining the equal expression: e[e1] === e[e2]. The familiar symmetry and transitivity properties of equality can be easily derived from the substitution principle. Here a few more examples of substitution: 2+2 === 4 derives (2+2) * 5 === 4 * 5, (\x -> (2 + 2) + x) === (\x -> 4 + x), and print (2+2) === print 4. We may thus substitute within a function and within an action. The expression to substitute into may be as big as the entire program. Therefore, we may reason equationally about program behavior -- which is the reason of existence of equational reasoning in Computer Science. The whole program in Haskell has the type of IO () and its behavior is doing an observable action, for example, printing something. If the equational reasoning is sound, the programs that are proven equal must behave the same: print the same results.

As an example of equational reasoning, we prove that for all Boolean x and y, x < y === not (y <= x), using elementary algebra:

     x < y
      === {- total order on Bool -}
     not (x >= y)
      === {- x >= y === y <= x -}
     not (y <= x)
We could have proved by cases: there are only 4 to examine.

The fact that x < y is equal to not (y <= x) lets us substitute the former expression with the latter in any program it may appear. The result should be an equal program. Alas, the enclosed code shows a program in which such a substitution changes the printed result. The program

     main = counterex_ctx $ \(x,y) -> x < y
prints True whereas
     main = counterex_ctx $ \(x,y) -> not (y <= x)
prints False.

The problematic counterex_ctx used lazy IO, which, recall, creates an illusion that file data is just a String. Pattern-matching on such a string may however cause IO, to read a chunk of a file. IO has side-effects, affecting other, side, expressions in context. Therefore, one may not freely substitute side-effecting expressions. Although reading a file may appear a benign effect, it has side-effects still: the file-descriptor and buffer allocation, locking, advancing the file position, or consuming data from a pipe. They are all observable. Our expression x < y appeared pure and had the pure type Bool, and so its substitution looked justified. With lazy IO, even a pure expression may have an observable side-effect. Equational reasoning becomes unsound.

Some object by saying that IO is inherently non-deterministic. If there is a race condition, one should not be surprised if a program that printed True prints False on the next run. The non-determinism of IO will hence explain why unsafeInterleaveIO is `fine' whereas unsafeInterleaveST is truly unsafe: ST is supposed to be deterministic.

Recall however that IO is just an instance of ST, by design. Launchbury and Peyton Jones, in the paper ``State in Haskell'' (1995) that introduced ST and Lazy IO, stated as one of the contributions: ``Input/output takes its place as a specialised form of stateful computation. Indeed, the type of I/O-performing computations is an instance of the (more polymorphic) type of stateful computations. Along with I/O comes the ability to call imperative procedures written in other languages.''

To expand on this connection, here is the IO version of the earlier ST demonstration that the equality on Booleans is, worryingly, not symmetric.

     import Data.IORef
     import System.IO.Unsafe
     
     bad_ctx :: ((Bool,Bool) -> Bool) -> IO Bool
     bad_ctx body = do
       r <- newIORef False
       x <- unsafeInterleaveIO (writeIORef r True >> return True)
       y <- unsafeInterleaveIO (readIORef r)
       return $ body (x,y)
     
     t1 = bad_ctx $ \(x,y) -> x == y
     -- True
     
     t2 = bad_ctx $ \(x,y) -> y == x
     -- False
This code has no ``real'' IO, no interaction with the external non-deterministic world. Furthermore, all the IO operations in the code are the type-specialized versions of the corresponding ST operations, with essentially the same code. For example, the type IORef is defined in GHC/IORef.hs as
     newtype IORef a = IORef (STRef RealWorld a)
It is hard to believe that the supposedly deterministic ST code becomes non-deterministic when the fully polymorphic state type is instantiated to RealWorld (recall, types are erased at run-time.)

We have seen that, disturbingly, replacing one pure Bool expression with another equal expression changes the result printed by the program. Now that there are many alternatives to lazy IO for incremental file processing, it is high time to banish lazy IO from Haskell.

Version
The current version is May 2013.
References
LazyIONotTrue.hs [4K]
The complete code for the example

Breaking referential transparency with unsafeInterleaveST
unsafeInterleaveST is the general version of unsafeInterleaveIO, which underlies Lazy IO.

John O'Donnell, Cordelia Hall, Rex Page: Discrete Mathematics Using a Computer
Springer, 2006 (Second Edition) Chaper 2. Equational Reasoning

describe.pdf [334K]
The full version of the paper presented at FLOPS 2012. A shorter version is published in FLOPS Proceedings, Springer's LNCS 7294, pp. 166-181.
Section 2 of the paper illustrates many practical problems with Lazy IO.

David Sabel and Manfred Schmidt-Schauss: Conservative Concurrency in Haskell
Logic in Computer Science (LICS), 2012
< http://www.ki.informatik.uni-frankfurt.de/papers/sabel/chf-conservative-lics.pdf >
< http://www.ki.informatik.uni-frankfurt.de/persons/sabel/chf-conservative.pdf >
Their conclusions: Program transformations valid for pure, deterministic, core Haskell (with no futures or IO) remain valid if concurrency (threads), MVars and futures are added. However, adding unsafeInterleaveIO or even lazy futures breaks this conservativity property because the order of evaluation becomes observable.

 

Bit-level, monad- and applicative-free parsing of AIS ship reports

Universal Shipborne Automatic Identification System (AIS) is a communication system ``developed to provide automatic reporting between ships and to shore, which contributes to the safety of navigation and facilitates traffic management by exchanging information such as identity, position, time, course and speed, autonomously and continuously.'' Class A ship AIS units (mandatory on most commercial ships) autonomously report their position (messages 1/2/3) every 2-10 seconds (or every 3 minutes or less if at anchor or moored), and the vessel's static and voyage related information (message 5) every 6 minutes. There is quite a bit of AIS data constantly coming in.

The goal of this old project was to put the AIS data, as they are received, into a database, to serve to remote clients on request. A Google Earth application was developed to display ship positions and tracks in real time. Clicking on ship's icon would show its destination, full name, dimensions, etc.

The first step in dealing with AIS messages is to parse them. That is not an easy task. The AIS format was designed quite a long time ago, for the telecommunication equipment of those times (still used today). Since AIS messages are typically transmitted in VHS maritime band using time-division multiple access, they have to be compact. AIS encoding is hence at the level of bits, not bytes. For example, the following two records (transmitted separately, one after another) encode an AIS message 5 (Class A vessel data report):

     !AIVDM,2,1,8,A,577n`T02<teAI88;J20`Tr0d4pN2222222222216@0vB?7?aN?43lU30,0*55,r000006098,1167638398
     !AIVDM,2,2,8,A,CQ8888888888880,2*3E,r000006098,1167638398
Can you make any sense of it? You probably cannot even tell the name of the ship. The 6-bit ASCII characters of the name are tightly packed in the message body, which is then ASCII encoded (in AIS-specific, BASE64-like encoding). Here are all the data from the message, returned by the parser:
     [Destination "PORTLAND",
      Draught 6.0,
      ETAMin 30, ETAHour 9, ETADay 31, ETAMonth 12,
      ShipDimD 15, ShipDimC 18, ShipDimB 62, ShipDimA 128,
      CatShipCargo "CARGO",
      ShipName "JIN KANG", Callsign "VRBB6",
      IMOnumber 9237204, MMSI 477997200, TStampRec 1167638398,
      MsgId 5]

To give the further sense of encoding, longitude is represented as a signed 28-bit (!) number, in 1/10000 minutes. Latitude is encoded as a signed 27-bit number, also in 1/10000 minutes.

The parser below parses AIS message 1 through 5: position reports and ship data/voyage reports. The bit-stream parsing engine is general and can be easily extended to other messages. The parser produces a CSV file, ready to be uploaded into a database.

The parser was written as an experiment in `pure' parsing, without applicatives or monads. We could therefore take the full advantage of (extended) pattern guards, clausal definitions and other niceties of Haskell. The monadic sub-language, in contrast, is less expressive and ugly. The problem for pure parsing is error reporting. Errors are expected (and actually rather frequent): after all, AIS messages are transmitted by radio. To indicate a parsing error and abort the processing of the current message, we rely Haskell's error, which may be used in pure code. Although elegant, such errors cannot be handled in pure code. More interestingly, such errors are not even guaranteed to be raised! If an error is encountered when parsing a datum that turns out unnecessary for the final result, no error is actually raised. Hence call-by-need is truly problematic: rather than returning valid parsed data or an error indicator, our parser just returns data -- which may be invalid. That is, when trying to use the data one can run into a parsing error.

Luckily, this very late reporting of errors was not a problem for our parser. Recall, it converts one or several lines of input into one line of the output (the CSV file to load into the database). At the point of writing the output line, the evaluation is forced, and whatever the problem was encountered in processing the input message becomes manifest. To assure this `delimited laziness', we merely need to note the position in the output stream before writing a line, so we can back to it if writing the output line ends in error (see the function write_parsed in the code). For this particular application, it turns out possible to write most of the code outside a monad.

The parser code also contrives poor-man's extensible records for parsing results: a list fields. This lets us easily omit a field.

When the parser was developed back in July 2007, I also evaluated its performance. It took the compiled Haskell program 59 secs (on a typical 2GHz desktop computer of that time) to parse 100,000 messages (about 1/17 of the daily amount) and create the output CSV file. If the input is already in the file system cache, the parsing time drops down to 45 secs. The output CSV file has 78,934 rows. The unparsed 21% correspond mostly to link management messages and proprietary reports.

Version
The current version is July 2007.
References
An Overview of AIS
Edition 2.0, June 2016. International Association of Marine Aids to Navigation and Lighthouse Authorities
< https://www.navcen.uscg.gov/pdf/IALA_Guideline_1082_An_Overview_of_AIS.pdf >

AISDecoder.hs [34K]
The complete code, which includes many tests

AIS-20070101-small.log [2K]
AIS-20070101-small.out [3K]
AIS-20070101-small.csv [2K]
Sample of AIS messages and the corresponding parsed data

 

Catching exceptions in a general MonadIO

[update] Extensible effects give a quite more satisfactory solution to the old problem described below.

Handling exceptions via catch, bracket, catchDyn, etc. in a MonadIO other than the IO itself has been a fairly frequently requested feature. The requests tend to recur, probably because these functions cannot be defined for a general MonadIO. However, we can define the generic exception handling for a large and interesting subset of MonadIO, which includes various (repeated) transformations of IO by ReaderT, WriterT, ErrorT, StateT, and newtype wrapping.

The generic catch has been successfully used since 2006 in a database library Takusen, where many operations work in the monad ReaderT Session IO; the database session data are always available as the environment. Many other foreign libraries are structured around sessions, connections, library environments, which can be encapsulated in a monad. We should nevertheless be able to handle IO errors and user exceptions that arise in such computations.

Back in 2006 Jesse Tov has done an admirably thorough job of implementing exception handling in general monads. His code defines two classes: EMonad and EMonadIO -- which contain most of the interesting monads. The latter is the subclass of the former, permitting arbitrary IO via liftIO. In either case, we use gthrow, gbracket, gcatch, ghandle, gfinally, etc. -- without even thinking which Monad we are in and how error handling is actually implemented, via ErrorT or via IO exceptions. It works universally for most of monads of interest. The experience of using EMonadIO since 2006 in large production code has been most positive.

Version
The current version is 1.1, February 2006; 1.2, September 2015.
References
Catching dynamic and asynchronous exceptions in the presence of other effects
Section 6 of the Haskell Symposium 2015 paper on extensible effects describes a new, much more satisfactory solution.

CaughtMonadIO.lhs [6K]
[OBSOLETE] The complete literate Haskell code
The code includes two tests, illustrating throwing and catching of (dynamic) exceptions in monads obtained from IO by several applications of ReaderT, WriterT and ErrorT.
The code was originally posted as generic catch in a MonadIO on the Haskell mailing list on Tue Feb 7 22:48:24 EST 2006

Takusen's MonadIO, with gtry, gtryJust, gbracket, gfinally. The code supports both the new- and the old-style exceptions. Joint work with Alistair Bayley.
< http://code.haskell.org/takusen/Control/Exception/MonadIO.hs >

Discussion threads:

Haskell' ticket, introduced as the result of the above Haskell-Cafe discussions:
< http://hackage.haskell.org/trac/haskell-prime/ticket/110 >

 

How to make a function strict without changing its body

This is a simple trick of making a function strict in some or all of its arguments, without affecting the body of the function.
     iterate2 f x n | seq f $ seq x $ seq n $ False = undefined
     iterate2 f x n = --- as before
That is, we add one simple line to the definition of the function, without changing the proper body of the function at all. We prepend an extra clause with the guard that always yields False. Before failing, however, the guard forces the evaluation of the selected arguments of the function. If needed, seq can be replaced with deepSeq.

Making (some of the) arguments of a function strict can make the function run in constant space, and so an iteration (Runge-Kutta iteration in the above example) can proceed without running out of (stack) space. The function iterate2 (and a similar function rk4Next) were applied to arguments like n-1 and y + a1/6 -- which is a red flag. These are exactly the kinds of expressions that lead to memory exhaustion. Perhaps it is because the size of an unevaluated thunk for n-1 is so much bigger than the size of the evaluated thunk. It seems that arithmetic expressions are the best candidates for some opportunistic evaluation...

Version
The current version is 1.1, Oct 2004.
References
Re: How do I get a long iteration to run in constant space
< http://www.haskell.org/pipermail/haskell-cafe/2004-October/006983.html >
Discussion thread on the Haskell-Cafe mailing list, where the message was originally posted on Tue, 5 Oct 2004 00:44:40 -0700 (PDT)

Partial signatures
Another application of the trick of adding a clause with an always failing guard

 

Simple and reliable uni- and bi-directional pipes

The MySysOpen module offers a reliable, proven way of interacting with another local or remote process via a unidirectional or bidirectional channel. It supports pipes and Unix and TCP sockets. MySysOpen is a simple and explicit alternative to the multi-threaded IO processing of the GHC run-time system. The module is the Haskell binding to sys_open -- the extended, user-level file opening interface.

The second half of MySysOpen.hs contains several bi-directional channel interaction tests. One checks repeated sending and receiving of data; the amount of received data is intentionally large, about 510K. Two other tests interact with programs that are not specifically written for interactive use, such as sort. The latter cannot produce any output before it has read all of the input, accepting no input terminator other than the EOF condition. One test uses shutdown to set the EOF. The other test programs the handler for a custom EOF indicator, literally in the file name of the communication pipe. The final example tests the handling of the forceful closure of the communication channel in the middle of writing.

Version
The current version is February 2013.
References
MySysOpen.hs [9K]
The code for the sys_open interface and tests

sys_open.c [21K]
The complete, well-commented code for sys_open

Applications of sys_open

 

Newer FastCGI and the memory-efficient IO interface

We present an efficient library for writing CGI and FastCGI programs. Our goal is the least latency and the least memory consumption. The library underlies a production web application server Metcast. The server has been running since February 2007, sending large amounts of text and binary data in response to a continuous stream of requests. The server processes a request in constant and small memory imposing little latency, encoding and sending to the client chunks of data as soon as the database delivers them. The server handles hundreds of such requests without allocating memory; in fact, it uses only one 16KB buffer for all of its I/O. With the exception of an occasional rank-2 type and extended pattern guards, all of the code is in Haskell98. Not a single unsafeHaskell function occurs in the code.

The library is originally based on the code written by Jesse Tov, who in turn used NewCGI by Bjorn Bringert et al. Most of the code has been re-written. The biggest change is minimizing the amount of state. The library is centered around generalized input and output ports: a simple IO interface to read, write and copy via a single, large, once allocated buffer. The buffer and its dimensions are hidden, to discourage aliasing. We use this interface for the IO to and from the client, to and from files or the PostgreSQL database, and for reading from external servers.

A generalized input port is a procedure to read at most the specified number of bytes into a given buffer. The procedure should return the number of bytes actually read, or 0 on EOF. It may throw various exceptions if reading didn't go well. The procedure is like hGetBuf partially applied to a handle.

     newtype Input = Input (forall m. EMonadIO m => Ptr Word8 -> Int -> m Int)
     inputFd       :: Fd -> Input
     inputStr      :: EMonadIO m => String -> m Input
     inputCombined :: EMonadIO m => Input -> Input -> m Input
One can construct an Input from a Posix file descriptor or a String. One can combine two Inputs into one, which reads from the first generalized port until EOF and then reads from the second. The frequently mentioned EMonadIO is a class of monads permitting IO along with throwing and catching of arbitrary errors. The IO monad and most of its transformations are in that class. EMonadIO lets us write gthrow, gcatch, gbracket, etc. without even thinking of the current monad.

Dually, a generalized output port is a procedure to write the specified number of bytes from the given buffer. It may throw various exceptions if writing didn't go well.

     newtype Output = Output (forall m. EMonadIO m => Ptr Word8 -> Int -> m ())
     outputFd :: Fd -> Output
     newtype BCopy = BCopy (forall m. EMonadIO m =>
                             Input -> Output -> Maybe Int -> m (Maybe Int))
The CGI monad exports (Fast)CGI output and error streams as Output, and lets us access request content as Input. We can use the generalized ports for reading and writing strings. The ports are intended though to be connected via BCopy, which copies from Input to Output the desired number of bytes or till EOF.
Version
The current version is February 2007.
References

NewerCGI.hs [29K]
The commented source code of the library for writing CGI and FastCGI programs
The code was first mentioned in a message Takusen and large PostgreSQL blobs [was: Handling custom types in Takusen] on the Haskell-Cafe mailing list on Fri, 27 Jul 2007 20:34:29 -0700 (PDT)

FastCGI.hsc [3K]
Bindings to the FastCGI C client library

 

Takusen: a DBMS access library built around a left-fold enumerator

Takusen is a joint project with Alistair Bayley. Takusen is a DBMS access library. It has a backend for Oracle on Unix, Linux or Windows via OCI, and a stub to test the library without any database. The infrastructure and the stub let one interface natively with other databases.

The distinguished feature of Takusen is processing query results using a left-fold enumerator. The user supplies an iteratee function, which receives rows one-at-a-time from the result-set. The number of the arguments to the iteratee is the number of the columns in the result-set, plus the seed. Each column in the result-set has its own Haskell type. The latter could be a Maybe type if the particular iteratee wishes to process NULLs.

The benefits are: more convenient and intuitive enumeration, iteration, and accumulation (see tests for examples); the retrieved data are not merely strings but have native Haskell types: Int, Float, Date, etc.; buffer preallocation; pre-fetching; support for both enumerators and cursors, proper handling of all errors including various IO errors. No unsafe operations are used.

Version
The current version is 0.8.6, 2009.
References
From enumerators to cursors: turning the left fold inside out
The invertible non-recursive enumerator designed in that article is the foundation of the Takusen interface.

< http://code.haskell.org/takusen/ >

 

How to implement AND without pattern-matching

How to implement the function AND, i.e., (&&), without the explicit or even the implicit pattern-matching? Keep in mind that the standard Prelude not and (||) are defined in terms of pattern-matching. The article below gives several answers. One of them relies on pointer arithmetic, in Haskell.
Version
The current version is 1.1, Jul 13, 2001.
References
bothTrue.txt [2K]
The article with the source code for three solutions. It was originally posted as Re: bothTrue on the newsgroup comp.lang.functional on Fri, 13 Jul 2001 17:45:22 -0700


Last updated April 1, 2018

This site's top page is http://okmij.org/ftp/

oleg-at-okmij.org
Your comments, problem reports, questions are very welcome!

Converted from HSXML by HSXML->HTML