Type-safe functional formatted IO

Introduction
Formatted IO as an embedded DSL: the initial view
Formatted IO as an embedded DSL: the final view
Generic polyvariadic printf in Haskell98
Safe and generic printf/scanf with C-like format string
Functional un|unparsing
Delimited continuations with answer-type modification and polymorphism
Type-safe printf was the main example
Genuine dependent types and faking them

Introduction

Formatted input and output facilities -- epitomized by printf and scanf of C -- are present in many languages. With the exception of Danvy's functional unparsing and its implementation in SML/NJ, formatted input/output facilities are either type-unsafe or not applicative. Even in Haskell the type checker does not stop the programmer from passing to printf more or fewer arguments than required by the format string. OCaml supports typed sprintf and sscanf but relies on a particular weird typing of these functions and on ad hoc overloading of string literals.

Our goal is type-safe formatted i/o as an ordinary library of formatters, format parsers and format descriptors. Our library interface is essentially that of OCaml formatted i/o, without the weird typing. For concreteness, we will be discussing sprintf and sscanf . Other formatted i/o functions are similar. The function sprintf takes as arguments a format descriptor and the values to format, returning the formatted string as the result. The types and the number of the other arguments of the function depend on the format descriptor. Conversely, sscanf receives as arguments the input string, a format descriptor and a consumer function. The function sscanf parses the received string according to the format descriptor and passes the extracted data to the consumer function, whose result is returned. Again, the number and the types of the arguments to the consumer function depend on the format descriptor. The function sscanf is partial because parsing may fail. We further require format descriptors to be first-class: we should be able to pass them as arguments, return as results, and build incrementally.

The function sprintf is polyvariadic: it accepts an arbitrary number of arbitrarily typed arguments. Furthermore, the first argument to sprintf determines the number and the types of the subsequent arguments. It seems writing such a function requires dependent types. Olivier Danvy's surprising discovery was type-safe sprintf written as an ordinary SML function. Hinze and Asai demonstrated more type-safe applicative formatters. Although sscanf takes the fixed number of arguments, the consumer function passed as the last argument does not have the fixed type. The number and the types of its arguments are determined by the format descriptor. Again dependent types seem to be called for. The fact that one can write type-safe sscanf in any mainstream functional language is less known. Apparently it has been unknown if type-safe sprintf and sscanf could share the same format descriptor.

References

Olivier Danvy: Functional Unparsing
J. Functional Programming, 1998, v8, N6, pp. 621-625.

Ralf Hinze: Formatting: A Class Act
J. Functional Programming, 2003, v13, N5, pp. 935-944.

Kenichi Asai: On Typing Delimited Continuations: Three New Solutions to the Printf Problem
Ochanomizu U. Technical Report OCHA-IS 08-2, September 2008, 17 pp.
Higher-Order and Symbolic Computation, 2009, v22, N3, pp. 275--291.
< http://pllab.is.ocha.ac.jp/~asai/papers/tr08-2.pdf >

Formatted IO as an embedded DSL: the initial view

We demonstrate type-safe sprintf and sscanf sharing the same format descriptor. The gist of our surprisingly trivial solution is a simple embedded domain-specific language (DSL) of format descriptors. The functions sprintf and sscanf are two interpreters of the language, building or parsing a string according in the given format. The format descriptor, a term in our DSL, can be interpreted in far more than two ways. For example, one may write an interpreter that counts the number of integer descriptors or converts a format descriptor to a C format string. One can write sscanf-like functions whose input comes from a file, a buffer, etc.

This article uses initial tagless typed embedding of the formatting DSL into Haskell. The terms of the language are represented as GADTs. We also demonstrate that lambda-abstractions at the type level are expressible already in the Hindley-Milner type system; GADT with the inherent polymorphic recursion help us use these abstractions.

Here is the typical example of formatting and unformatting:

     fmt3 = lit "The value of " ^ char ^ lit " is " ^ int
     tp3 = sprintf fmt3 'x' 3
     -- "The value of x is 3"
     ts3 = sscanf "The value of x is 3" fmt3 (\c i -> (c,i))
     -- Just ('x',3)

A format descriptor is built by connecting primitive descriptors (such as lit "string" , int , char) with (^) . Whereas sprintf $ lit "Hello world" has the type String , sprintf $ fmt3 has the type Char -> Int -> String . Likewise, the type of the consumer of the values parsed by sscanf varies with the format descriptor. Our sprintf and sscanf indeed use the same format descriptor, fmt3, which is a first-class value.

The implementation is based on Hinze's observation (`Formatting: a class act' JFP 2003) that the format descriptor is a `type transformer', a functor. For example, the format descriptor int is associated with a functor transforming the type t to the type Int -> t. Hinze represented functors in Curry style: the functor associated with int is then (Int ->). We use a more direct, Church-style representation. Let us consider a type F a b where F is a binary type constructor, a is a type variable, and b is a type that may include one or more occurrences of a. Such F a b may be regarded a type abstraction, a type-level lambda-function. To apply that function to an argument, we unify a with the argument; the second parameter of F becomes the result of the application. This idea comes from the encoding of lambda-terms in Prolog as pairs (X,body) whose first component is a logic variable. Even the standard lambda-notation is a fancy way of pairing a variable with the function body.

Version

The current version is 1.1, Aug 31, 2008.

References

PrintScan.hs [4K]
The complete code with many examples

GADT.ml [15K]
The same code in OCaml (see Example 1 in the file)

PrintScanI.txt [6K]
The initial view on typed sprintf and sscanf
The message with more explanations, posted on the Haskell mailing list on Sun, 31 Aug 2008 19:40:41 -0700 (PDT)

Relating Final and Initial typed tagless representations

Formatted IO as an embedded DSL: the final view

We present a dual solution to the problem of type-safe sprintf and sscanf sharing the same format descriptor. Previously, we defined the embedded domain-specific language (DSL) of format descriptors in the initial style, as a data type. The language can also be defined in the final style. To the end user, the difference is hardly noticeable: all the examples of the initial solution work as they are (modulo a few adjustments caused by the monomorphism restriction). However, whereas the initial style required GADT, the final solution is entirely in Haskell98. Despite often-heard disparagement, Haskell98 is powerful enough for type-indexed terms, thought to require GADTs or similar dependent-type-like extensions.

The transformation from the initial to the final style follows the correspondence referenced below. In fact, the final implementation was `derived' from the initial implementation by Emacs `code movements'. Once the syntax errors have been fixed, the code worked on the first try.

Version

The current version is 1.1, Sep 2, 2008.

References

PrintScanF.hs [8K]
The complete Haskell98 code with many examples
The code has been used in the course on typed tagless-final interpreters and so has been embellished with many comments.

PrintScanF.txt [4K]
The final view on typed sprintf and sscanf
The message posted on the Haskell mailing list on Tue, 2 Sep 2008 00:57:18 -0700 (PDT)

Relating Final and Initial typed tagless representations

Generic polyvariadic printf in Haskell98

This generalization of Text.Printf.printf is inspired by the message of Evan Klitzke, who wrote on Haskell-Cafe about frequent occurrences in his code of the lines like

     	 infoM $ printf "%s saw %s with %s" (show x) (show y) (show z)

Writing show on and on quickly becomes tiresome. It turns out, we can avoid these repeating show , still conforming to Haskell98.

Our polyvariadic generic printf is like polyvariadic show with the printf-like format string. Our printf handles values of any present and future type for which there is a Show instance. For example:

     t1 = unR $ printf "Hi there"
     -- "Hi there"
     
     t2 = unR $ printf "Hi %s!" "there"
     -- "Hi there!"
     
     t3 = unR $ printf "The value of %s is %s" "x" 3
     -- "The value of x is 3"
     
     t4 = unR $ printf "The value of %s is %s" "x" [5]
     -- "The value of x is [5]"

The unsightly unR appears solely for Haskell98 compatibility: flexible instances remove the need for it. On the other hand, Evan Klitzke's code post-processes the result of formatting with infoM , which can subsume unR .

A bigger problem with our generic printf , shared with the original Text.Printf.printf , is partiality: The errors like passing too many or too few arguments to printf are caught only at run-time. We can certainly do better.

Version

The current version is 1.1, June 5, 2009.

References

GenPrintF.hs [2K]
The complete source code with the tests. It was published in the message posted on the Haskell-Cafe mailing list on Fri, 5 Jun 2009 00:57:00 -0700 (PDT)

Safe and generic printf/scanf with C-like format string

We implement printf that takes a C-like format string and the variable number of the subsequent arguments. Unlike C of Haskell's printf , ours is total: if the types or the number of the subsequent arguments, the values to format, do not match the format string, a type error is reported at compile time. To the familiar format descriptors %s and %d we add %a to format any showable value. The latter is like the format descriptor ~a of Common Lisp. Likewise, we build scanf that takes a C-like format string and the consumer function with the variable number of arguments. The types and the number of the arguments must match the format string; a type error is reported otherwise.

Our approach is a variation of the safe printf and scanf described elsewhere on this page. We use Template Haskell to translate the format string to a phrase in the DSL of format descriptors. We use the final approach to embed that DSL into Haskell.

Unlike the safe printf explained in the Template Haskell documentation, in our implementation, format descriptors are first class. They can be built incrementally. The same descriptor can be used both for printing and for parsing. Our printf and scanf are user-extensible: library users can write functions to direct format output to any suitable data sink, or to read parsed data from any suitable data source such as a string or a file. Finally, the result of formatting is parseable back using the same format descriptor.

Here are some of the tests from the test collection referenced below. The evaluation result is given in the comments below each binding. Example t31 shows that format descriptors are indeed first-class. The definition t32 , when uncommented, raises the shown type error because the format descriptor does not match the type of the corresponding argument.

     t4 = let x = [9,16,25]
              i = 2
          in sprintf $(spec "The element number %d of %a is %a") i x (x !! i)
     -- "The element number 2 of [9,16,25] is 25"
     
     t4s = sscanf "The element number 2 of [9,16,25] is 25"
             $(spec "The element number %d of %a is %a")
             (\i a e -> (i,a::[Int],e::Int))
     -- Just (2,[9,16,25],25)
     
     -- What we print we can parse, using the same descriptor
     t31 = let printed = sprintf desc "x" 3
               parsed  = sscanf printed desc (\s i -> (s,i))
           in (printed, parsed)
       where desc = $(spec "The value of %s is %d")
     -- ("The value of x is 3",Just ("x",3))
     
     -- t32 = sprintf $(spec "The value of %s is %d") "x" True
     --     Couldn't match expected type `Bool' against inferred type `Int'

Version

The current version is 1.2, July 4, 2009.

References

TFTest.hs [4K]
Tests of printing, parsing, parsing the printed data, and of extending the library to format to the standard output.

TotalPrintF.hs [4K]
The code to convert format strings to DSL phrases

Formatted IO as an embedded DSL: the final view

Functional un|unparsing

[The Abstract of the paper]
Danvy's functional unparsing problem (JFP 1998) is to implement a type-safe printf function, which converts a sequence of heterogeneous arguments to a string according to a given format. The dual problem is to implement a type-safe scanf function, which extracts a sequence of heterogeneous arguments from a string by interpreting (Friedman & Wand LFP 1984, Friedman & Wand 2008) the same format as an equally heterogeneous sequence of patterns that binds zero or more variables. We derive multiple solutions to both problems from their formal specifications (Wand JACM 1980, TCS 1982).

On one hand, our solutions show how the Hindley-Milner type system, unextended, permits accessing heterogeneous sequences with the static assurance of type safety. On the other hand, our solutions demonstrate the use of control operators (Felleisen et al. LFP 1988, Meyer & Wand 1985) to communicate with formats as coroutines (Wand LFP 1980, Haynes et al. LFP 1984).

This is joint work with Kenichi Asai and Chung-chieh Shan.

Version

The current version is August 2009.

References

un-unparsing.pdf [290K]
The paper published in Higher-Order and Symbolic Computation, v24, N4, 2011, 311-340
doi:10.1007/s10990-012-9087-2

Chung-chieh Shan. Slides of the talk presented at the Symposium in honor of Mitchell Wand. August 23, 2009, Boston.
< http://www.cs.rutgers.edu/~ccshan/binding/printf-scanf-slides.pdf >

derive5.ml [26K]
An extensively commented OCaml code explaining the step-wise derivation of typed printf and scanf . The code derives several novel implementations of printf that rely on multi-prompt delimited continuations.

Derive5.hs [4K]
Deriving type-safe printf in direct style, using shift/reset with effectful typing that supports the answer-type modification and polymorphism. The code rigorously derives printf of Asai (2009) along with an implementation that encloses in reset the whole application of printf to values to format.

derive6.ml [8K]
Deriving a context-passing implementation of printf

Genuine dependent types and faking them

Type-safe printf is a good example to illustrate the deep difference between genuine dependent types and their sometimes very close emulations in OCaml, Haskell and seemingly GCC. We will see how to derive these emulations, how we gain progressively simpler types and what we have to sacrifice.

On the surface of it, typing the C-like formatter printf requires dependent types. After all, the type of printf fmt (the number and the types of its arguments) depends on the value of the format string, fmt. Listed on this page however are many implementations of type-safe printf in the Hindley-Milner system, OCaml and Haskell. One way or another, first-class delimited continuation appeared, either directly or encoded in continuation-passing style (CPS). Does it mean that delimited continuations somehow `implement' dependent types?

Further, OCaml has its own type system for typed formatting, refusing to accept, for example, printf "the value of x is %d" 1.0 because 1.0 has the type float rather than the expected int. Even GCC issues a warning in similar cases. Does it mean that OCaml or GCC are dependently typed?

The answer to all these questions is negative. The C-like type-safe printf does require dependent types, which neither OCaml nor Haskell, let alone GCC, have. A simple test will demonstrate this, hence distinguishing a truly dependently-typed system from a fake. We systematically construct two archetype fake dependently-typed printfs, tracing the progression from dependent types to fancy non-dependent types to the Hindley-Milner system, and discussing what burden we shift from the type checker to the user and what we lose in expressiveness.

Recall that in C, printf is a polyvariadic function whose first argument is the format string, containing ordinary characters and conversion specifications. The latter are introduced by the % character and define how to format the corresponding printf argument. The number and the types of the subsequent arguments to printf depend on the number and the content of conversion specifications in the format string. Hence printf has the dependent-function type Pi x:String. PA x, which is also written as {x:String} -> PA x. Here x is a value, String is a Type (that is, String :: *) and PA :: String -> *, which computes the type of printf from the contents of the format string, has the dependent kind. With such a printf, we can give a type to the function

     fts x = "The formatting result is " ++ sprintf ("%" ++ x) 1

namely,

     fts :: {x:String | isPrefix "d" x && not (elem '%' x)} -> String

using so-called qualified types. To avoid the irrelevant i/o, we will be silently using a version of printf that returns the formatted result as a string rather than outputs it; that version is often called sprintf. Dependent-functions and qualified types are not available in OCaml or Haskell. Even if Haskell let us write the fts signature, it could not have checked that the signature matches the code of fts. Although OCaml can type check simple printf expressions as shown earlier, OCaml cannot type check fts:

     # let fts x = "The formatting result is " ^ sprintf ("%" ^^ x) 1;;
                                                          ^^^
     Error: Premature end of format string ``"%"''

At the opposite extreme, GCC accepts the C version of fts with no warnings. Just as happily GCC accepts the invocations of fts that lead to segmentation faults. None of the mainstream languages admit fts with a sound type, statically ensuring crash-free applications of the function.

The example fts confirmed that the C-like type-safe printf requires dependent types. There are however approximations, also type safe, that do not impose such requirement. We now design two characteristic approximations. With dependently-typed sprintf, sprintf "Hello" has the type String and sprintf "The value of %c is %d" has the different type Char -> Int -> String. And yet in both cases the format descriptor had the same type String. Therefore, the type checker has to parse and reason about the content of that string, its value. In fact, the OCaml error message about fts above betrays type checker's actual parsing of the format string. Alas, OCaml lacks the type system to express the result of parsing "%".

What if we parse the format string ourselves, separating out the literal text and explicating conversion specifications? Instead of "The value of %c is %d" we would give printf as the format descriptor the following parsed form:

     fmtG = LIT "The value of " :^ CHAR :^ LIT " is " :^ INT

Each component, LIT str, CHAR, INT, etc. can then be annotated with a distinct type; the connector (:^) will combine the annotations:

     data Desc a where
        LIT  :: String -> Desc ()
        INT  :: Desc Int
        CHAR :: Desc Char
        (:^) :: Desc a -> Desc b -> Desc (a,b)

Since each data constructor builds Desc a of a different type, GADTs are called for. The inferred type fmtG :: Desc ((((), Char), ()), Int) is a concise summary of fmtG, describing the contained conversion specifications and their sequence. The type abstracts over the content of LIT descriptors, which does not affect the type of printf. The type of fmtG contains just the right amount of information to compute the type of printf. No longer the type checker has to parse format strings. It can work entirely from the type annotations:

     printf :: Desc a -> PS a
     type PS a = PFormat a String
     
     type family PFormat a w :: *
     type instance PFormat () w    = w
     type instance PFormat Int w   = Int -> w
     type instance PFormat Char w  = Char -> w
     type instance PFormat (a,b) w = PFormat a (PFormat b w)

Here PS :: * -> * and the auxiliary PFormat :: * -> * -> * are non-dependent, purely type functions. Their arguments are types (kind *) rather than terms such as String.

There are people who would call a language with type functions `dependently typed'. After all, the type Desc a reflects the particular value of the format descriptor. For example, the type Desc Int corresponds to the value INT, and no other value has that type. Since the value of the format descriptor influences the type so directly and unambiguously, it seems the type of printfsort of depends on the format descriptor value. This dependence however is not sufficient to type check fts. Our emulation gives printf the already parsed format string. We have shifted the burden of parsing the format string from the type checker to the programmer -- gaining simpler, non-dependent typing but losing the flexibility of building arbitrary conversion specifications such as those in fts.

The signature of printf guides, even compels the implementation. Just as PS used an auxiliary type function PFormat, printf relies on the auxiliary function interp to interpret the format descriptor. The structure of PFormat drives the implementation of interp:

     printf desc = interp desc id
     
     interp :: Desc a -> (String -> w) -> PFormat a w
     interp (LIT str)  = \k -> k str
     interp INT        = \k -> k . show
     interp CHAR       = \k -> k . \c -> [c]
     interp (x :^ y)   = \k -> interp x (\sx -> interp y (\sy -> k (sx ++ sy)))

The interpreter is clearly compositional: the interpretation of a descriptor depends only on the components of the descriptor regardless of the overall context. The compositionality gives us an idea to name the applications of interp to primitive descriptors:

     lit :: String -> (String -> w) -> w   -- cf. PFormat () w  = w
     lit str = \k -> k str                 -- which is, interp (LIT str)
     
     int :: (String -> w) -> (Int -> w)    -- cf. PFormat Int w = Int -> w 
     int = \k -> k . show                  -- which is, interp INT
     
     char :: (String -> w) -> (Char -> w)  -- cf. PFormat Char w = Int -> w 
     char = \k -> k . \c -> [c]            -- which is, interp CHAR

and introduce the operator (^) for combining these interpreted descriptors. The operator must satisfy the equality

     interp (desc1 :^ desc2) === (interp desc1) ^ (interp desc2)

which immediately gives us the implementation

     (^)  :: ((String -> t1) -> t) -> ((String -> t2) -> t1) -> (String -> t2) -> t
     (^) x y = \k -> x (\sx -> y (\sy -> k (sx ++ sy)))

Instead of writing the format descriptor as a GADT such as fmtG, we write the interpreted descriptor, interp fmtG, directly in terms of the interpreted primitive descriptors lit, int and char:

     -- The interpreted descriptor: interp fmtG
     fmtI = lit "The value of " ^ char ^ lit " is " ^ int
     -- fmtI :: (String -> w) -> Char -> Int -> w

The main function printf now receives the already interpreted descriptor as the argument, so it becomes

     printf' idesc = idesc id

whose sample application is

     tIr = printf' fmtI 'x' 3
     -- "The value of x is 3"

We have thus derived one of Danvy's original implementations of type-safe printf. By making the programmer write interpreted format descriptors (which is hardly a burden in this case: fmtI requires fewer keystrokes than fmtG) we gain even simpler typing. We no longer need the GADT Desc a; furthermore, the type function PFormat is incorporated into the types of lit, int, etc. and does not have to be defined explicitly. The type-safe printf' and fmtI are hence typeable in the Hindley-Milner system. Since we have wired-in the interpreter interp, we have lost the ability to change it, that is, to interpret the format descriptor in a different way -- to parse input, to format to a network pipe, etc. Incidentally, we regain the ability to change the interpreter if we abstract over it, using type classes -- obtaining the tagless-final type-safe printf.

The interpreter interp is written in the continuation k passing style -- which is forced upon us by the structure of PFormat. The last clause of that type function

     type instance PFormat (a,b) w = PFormat a (PFormat b w)

says that format descriptors are interpreted in sequence: first the descriptor a receives and formats its arguments, then the descriptor b. In other words, the meaning of the annotation (a,b) in the context w is the meaning of a in the context (PFormat b w). CPS immediately springs to mind. Incidentally, the inferred type for (^) looks exactly like the type for string concatenation in the type-and-effect system of Danvy and Filinski for shift.

In conclusion, we showed a simple example of printf that is only typeable in a dependently-typed system. The example, fts, is the touchstone test for dependent types, distinguishing genuine dependently-typed systems from emulations. We have described a method to derive these emulations -- progressively shifting the burden of parsing and interpreting the format string from the type checker to the programmer -- and hence moving from dependent functions to non-dependent type functions to no type-functions at all. Luckily for printf, the loss in expressiveness is minimal and the burden is light and arguably worth carrying anyway.

Version

The current version is November 2012.

References

Undp.hs [3K]
The complete code for the article

Last updated January 1, 2014

This site's top page is http://okmij.org/ftp/

oleg-at-okmij.org
Your comments, problem reports, questions are very welcome!

Converted from HSXML by HSXML->HTML