XML and Scheme

Tools: SSAX, SXML, SXPath, SXSLT

Applications, Examples, Sample Code

Papers and Presentations

SSAX-SXML Mailing list  SSAX-SXML SourceForge Project

Miscellanea


 

Papers and Presentations

SXML short presentation
A micro-talk presentation at the Workshop on Scheme and Functional Programming 2000. Montreal, September 17, 2000.

A better XML parser through functional programming
<http://link.springer.de/link/service/series/0558/tocs/t2257.htm>
XML-parsing.ps.gz [69K]
XML-parsing-talk.ps.gz [93K]
The paper is published in LNCS volume 2257 with the Copyright held by Springer-Verlag. The file XML-parsing.ps.gz is a local copy of that copyrighted paper. The file XML-parsing-talk.ps.gz is the transcript of a presentation at PADL 2002, the 4th International Symposium on Practical Aspects of Declarative Languages. The topic of the presentation was the derivation of the SSAX API and the comparison of SSAX with other functional-style XML parsers and with Expat.

SXML Specification, in SXML/HTML/LaTeX/PDF
ACM SIGPLAN Notices, v.37(6), pp. 52-58, June 2002.

XML, XPath, XSLT implementations as SXML, SXPath, and SXSLT
SXs.pdf [175K]
SXs-talk.pdf [200K]
A paper and a transcript of a complementary talk presented at the International Lisp Conference 2002, October 27-31, 2002.

SXSLT: Manipulation Language for XML
<http://link.springer.de/link/service/series/0558/tocs/t2562.htm>
SXSLT.ps.gz [68K]
SXSLT-talk.pdf [331K]
The paper (with Shriram Krishnamurthi) is published in LNCS volume 2562 with the Copyright held by Springer-Verlag. The file SXSLT.ps.gz is a local copy of that copyrighted paper. The file SXSLT-talk.pdf is the transcript of a presentation at PADL 2003, the 5th International Symposium on Practical Aspects of Declarative Languages. The paper introduces SXSLT and compares it with XSLT. Our experience and user comments show that SXSLT is expressive and easy to use. We argue that this outcome is a consequence of SXSLT providing right abstractions for XML transformations, of being higher-order, declarative and extensible.

Advanced SXLST tutorial
sxslt-advanced.scm [16K]
This file describes common patterns of SXSLT on an interesting example. We demonstrate higher-order tags, pre-order and post-order transformations, re-writing of SXML elements in regular and special ways, context-sensitive applications of re-writing rules, and reflection. Although the file is technically a Scheme source code, it is actually a tutorial. Of 357 lines in the file, 241 are comments and 24 are just blank lines.

 

Functional XML parsing framework
SAX/DOM and SXML parsers with support for XML Namespaces and validation

The framework consists of a DOM/SXML parser, a SAX parser, and a supporting library of lexing and parsing procedures. The procedures in the package can be used separately to tokenize or parse various pieces of XML documents. The framework supports XML Namespaces, character, internal and external parsed entities, xml:space, attribute value normalization, processing instructions and CDATA sections. The package includes a semi-validating SXML parser: a DOM-mode parser that is an instantiation of a SAX parser (called SSAX).

The parsing framework offers support for XML validation, to the full or any user-specified degree. Framework's procedures by themselves detect great many validation errors and almost all well-formedness errors. Furthermore, a user is given a chance to set his own handlers to capture the rest of the errors. Content is validated given user-specified constraints, which the user can derive from a DTD, from an XML schema, or from other competing doctype specification formats.

SSAX is a full-featured, algorithmically optimal, pure-functional parser, which can act as a stream processor. SSAX is an efficient SAX parser that is easy to use. SSAX minimizes the amount of application-specific state that has to be shared among user-supplied event handlers. SSAX makes the maintenance of an application-specific element stack unnecessary, which eliminates several classes of common bugs. SSAX is written in a pure-functional subset of Scheme. Therefore, the event handlers are referentially transparent, which makes them easier for a programmer to write and to reason about. The more expressive, reliable and easier to use application interface for the event-driven XML parsing is the outcome of implementing the parsing engine as an enhanced tree fold combinator, which fully captures the control pattern of the depth-first tree traversal.

Platforms
The package has been tested on PLT Scheme (versions 103 and 200), Bigloo 2.4b, GambitC 3.0, Chicken, Guile, SCM, MIT Scheme 7.5.2, Gauche, and Petite Chez Scheme 6.0a
Version
The current version is 5.1, Sep 16, 2004.
References

ssax.sourceforge.net
SSAX and SXML at SourceForge

CVS tree at SourceForge
The CVS tree includes the complete SSAX/SXML code with all its dependencies, some documentation, validation tests, as well as several sample applications, benchmarks and usable examples. You can browse the files in the CVS tree from any web browser.

SSAX.scm [122K]
Well-commented source code of the package. The code comes with an extensive set of self-tests, which verify not only the correct behavior but also the detection of constraint violations.

http://metapaper.net/xml/ssax/
Current versions of the SSAX ports to various Scheme systems (including Bigloo, Guile, Chicken, and PLT Scheme). These versions are kindly maintained by Kirill Lisovsky, whose efforts are greatly appreciated.

Papers and Presentations

Haskell Iteratee version of the XML parsing framework

SSAX/SXML mailing list

Comparing SSAX and Expat XML parsers in features and performance

SAX/DOM/SXML parsers with support for XML Namespaces and validation [plain text file]
An article posted on comp.lang.scheme and comp.text.xml newsgroups on Thu, 15 Mar 2001 19:31:16 GMT

A simple XML _lexer_ in Scheme [plain text file]
Announcement of version 1 of SSAX. An article posted on comp.lang.scheme and comp.text.xml newsgroups on Sun, 28 Nov 1999 23:12:50 GMT

Requires
Input parsing primitives
 

SXML specification

SXML is an abstract syntax tree of an XML document. SXML is also a concrete representation of the XML Infoset in the form of S-expressions. The generic tree structure of SXML lends itself to a compact library of combinators for querying and transforming SXML.

The master SXML specification file is written in SXML itself. With appropriate stylesheets, the master file has been converted to a web page and to a LaTeX file. The latter was processed to a PDF document.

Version
The current version is 3.0, Mar 12, 2004.
References

SXML.scm [52K]
The master file, written in SXML. The second part of the file is a stylesheet for rendering the specification in HTML.

SXML.html [51K]
The specification web page, converted from SXML above.

SXML-paper.tex [40K]
The specification in LaTeX, converted from SXML above.

SXML-paper.scm [16K]
The conversion stylesheet, from SXML to LaTeX.

SXML-paper.pdf [71K]
The specification in the PDF format.

ACM SIGPLAN Notices, v.37(6), pp. 52-58, June 2002.

 

SXPath -- SXML query language, XPath implementation

SXPath is a query language for SXML, an instance of an XML Information set (Infoset) in the form of s-expressions. On the other hand, SXPath is an implementation of XPath: there exists a formal re-writing from XPath to SXPath. Abbreviated-form SXPath expression look the same as abbreviated-form XPath modulo delimiters and parentheses. XPath underlies both XSLT and XPointer W3C Recommendations.

The full-form SXPath notation is a combinator library for querying SXML. It is comprised of several primitive converters that can be joined by a small number of powerful combinators. Primitive and composite converters closely correspond to XPath location paths. This makes SXPath an XPath's virtual machine.

The new version of SXPath.scm has more and better comments, more converters and combinators. It supports following-sibling:: and preceding-sibling:: axes of XPath. The better half of the SXPath.scm file (lines 463 through 1154) is a built-in test suite. The tests make sure that examples given in the XPath Recommendation indeed work in SXPath. The regression tests show the correspondence between full- and abbreviated-form XPath expressions, and the corresponding notations of SXPath. One of the examples is to illustrate and test the difference between //para[1] and /descendant::para[1]. The difference is explained in a note to Section 2.5 of the XPath Recommendation. After considerable practice, I need to read the note only twice or thrice to remember what this distinction is. SXPath somehow is capable of reproducing the said difference.

Platforms
I have tested on Gambit-C 3.0, SCM 5d2 and Bigloo 2.4b
Version
The current version is 3.5, Jan 12, 2001.
References
Papers and Presentations

SXPath.scm [41K]
Complete, very commented source code with included extensive validation tests.

Requires
My own standard prelude
 

SXML traversals and transformations

SXSLT is a practical XML transformation language. It is a higher-order pure functional transformation language based on flexible tree traversals. SXSLT is implemented as a library in a pure functional subset of Scheme. At the core of this library is a function pre-post-order, which takes an SXML tree and and a stylesheet, and returns a transformed tree.

pre-post-order:: TREE x BINDINGS -> TREE

where

     type TREE = SXML-NODE | SXML-NODELIST
     type BINDINGS = (BINDING ...)
     type BINDING = (TRIGGER-SYMB *preorder* . HANDLER)
                  | (TRIGGER-SYMB *macro* . HANDLER)
                  | (TRIGGER-SYMB NEW-BINDINGS . HANDLER)
                  | (TRIGGER-SYMB . HANDLER)
     type TRIGGER-SYMB = SXML-NAME | *text* | *default*
     type NEW-BINDINGS = BINDINGS
     type HANDLER = TRIGGER-SYMB x TREE -> TREE

The function pre-post-order traverses a TREE, which is either a SXML-NODE or SXML-NODELIST. For each SXML-NODE of the form (SXML-NAME SXML-NODE...) the function looks up an association with the given SXML-NAME among its BINDINGS. If failed, the function tries to locate a *default* binding. It is an error if the latter attempt fails as well.

Having found a binding, the function pre-post-order first checks whether the binding prescribes a pre-order traversal. If this is the case, the handler is 'applied' to the current node. Otherwise, the pre-post-order function first calls itself recursively on each child of the current node, with NEW-BINDINGS prepended to the current bindings. The handler is then applied to the head of the current node and the transformed children. The result of the handler, which should also be an TREE, becomes a SXML-NODE of the transformed tree. If the current SXML-NODE is a text string, a special *text* binding is looked up.

The NEW-BINDINGS that appears in (TRIGGER-SYMB NEW-BINDINGS . HANDLER) has the same form as BINDINGS. The NEW-BINDINGS is a local stylesheet. It takes effect only when traversing the children of a particular node (whose name is the TRIGGER-SYMBOL). The bindings of NEW-BINDINGS override the bindings from the "parent" stylesheet.

A *macro* binding also specifies a pre-order traversal. The handler of a *macro* binding is applied to the current node. The result however does not become a node in the output tree. Rather it is traversed again. A *macro* binding is thus equivalent to a syntactic binding of Scheme. Although *macro* could be written as a *preorder* binding whose handler recursively invokes pre-post-order, this pattern turns out to be quite frequent to justify its own notation. The word 'macro' highlights the computation of a fixpoint: we successively re-write an SXML node into other nodes until we reach text fragments representing the target HTML, XML or LaTeX document.

SRV:send-reply:: FRAGMENT x FRAGMENT ... -> BOOL

The FRAGMENTs are lists of strings, characters, numbers, thunks, #f -- and other fragments. The function traverses the tree of FRAGMENTs depth-first pre-order, writes out strings and characters, executes thunks, and ignores #f and '(). The function returns #t if it wrote anything.

replace-range:: BEG-PRED x END-PRED x FOREST -> FOREST

The function replace-range traverses the FOREST depth-first and cut/replaces ranges of nodes. Here

     type FOREST = (NODE ...)
     type NODE = Atom | (Name . FOREST) | FOREST
     beg-pred:: NODE -> #f | FOREST
     end-pred:: NODE -> #f | FOREST

The range of nodes to cut/replace is specified by two predicates, beg-pred and end-pred. The nodes that define a range do not have to have the same immediate parent, do not have to be on the same level, and the end node of a range does not even have to exist. The replace-range procedure removes nodes from the beginning node of the range up to (but not including) the end node of the range. In addition, the beginning node of the range can be replaced by a node or a list of nodes. The range of nodes is cut while depth-first traversing the forest. If all branches of the node are cut a node is removed as well. The procedure can cut several non-overlapping ranges from a forest.

Platforms
I have tested on Bigloo 2.4b, Gambit-C 3.0, and SCM 5d6
Version
The current version is 1.8, March 24, 2003.
References

SXML-tree-trans.scm [10K]
SXML expression tree transformers

vSXML-tree-trans.scm [3K]
The validation code

SXML specification
The HTML conversion stylesheet in the master SXML Specification file is a good example of *preorder*, local stylesheet, and *macro* bindings.

apply-templates.scm [11K]
Another approach to SXML transformations, in the spirit of XSLT.

Requires
My own standard prelude
 

HSXML: Typed SXML

This message reports on several experiments in Haskell to ascertain plausibility, benefits and drawbacks of typed SXML in a mainstream functional language. It turns out that Haskell as it is can represent semi-structured data in SXML-conforming syntax, with the extensible set of `tags' and statically enforced content model restrictions. Querying, transforming and advanced rendering into HTML and XML are possible. The experience of writing many moderately large and complex web pages in HSXML shows that the typed SXML can be used in practice.

The benefit of representing XML-like documents as a typed data structure/Haskell code is static rejection of bad documents -- not only those with undeclared tags but also those where elements appear in wrong contexts. Using HTML-like HSXML as an example, a document with H1 within the P element is rejected as invalid. No HSXML transformation may introduce nested Hn or P elements: the code will not compile otherwise due to a type error. Thus the generated XML or HTML document will not only be well-formed but also will satisfy some validity constraints. Types also guide queries, transformations, and rendering, making them context sensitive. The type inference of such expressive types is possible; in fact, no type annotations are needed in an HSXML document. It used to be that the type inference and type checking took time in GHC(i). The latest version is quite fast. Static typing does not inhibit extensibility. The HSXML library user may always define new `tags', new transformation rules and new contexts.

Version
The current version is 3.1, May 2014.
References

Description.txt [8K]
HSXML Haskell library, sample code and descriptions

Takahiro Kido (shelarcy): Typed S-expression-based XML
<http://page.freett.com/shelarcy/log/2012/diary_12.html#typed_s_expression_xml>
Detailed and easy-to-understand introduction to HSXML (in Japanese)

Typed SXML [plain text file]
An old message posted on the SSAX-SXML mailing list on Mon, 18 Dec 2006 18:02:01 -0800 (PST). The message described version 1.0. In the current version HSXML not longer uses square brackets and looks almost the same as SXML: see the examples below.

An example of authoring web pages in HSXML
undelimited.hs [35K]
The master HSXML file
The rendered page


Applications, Examples, Sample Code

 

HTML/XML authoring in Scheme

The converse to XML parsing is SXML pretty-printing: a translation from an abstract to a concrete XML syntax, to an XML or HTML document. The pretty-printing makes it possible to author and compose XML documents in their SXML form. SXML is more concise and expressive than a raw markup language. SXML representing regular Scheme code can be entered in any Scheme-sensitive editor. SXML as a data structure -- a list -- can likewise be composed as a literal or quasi-literal expression. Furthermore, SXML can be produced by regular Scheme functions, which may make authoring more succinct, advanced, and less tedious, as the code below illustrates.

The pretty-printing consists of two passes. The first one traverses an s-expression tree in post-order, and converts it into a tree of HTML fragments. The tree of fragments, when flattened and concatenated, yields the resulting HTML document. Of course there is no need to flatten the tree and concatenate strings explicitly. Rather, we traverse the tree of fragments depth-first and write out fragments. In this approach, all character and other data, once created, stay where they were originally allocated. No data are ever copied. Functions such as string->number are implied as well.

The second pass, a pre-order traversal of a tree of fragments, is performed by a function SRV:send-reply. Besides writing out character data it encounters, the function executes thunks and ignores '() and #f. This feature lets us, for example, embed conditional forms into an s-expression representing HTML:

     (SXML->HTML
      `(p "par1" "par2" ,(and test (list "par3" "par4"))))
Depending on the value of the test, strings "par3" and "par4" will be either both included, or both omitted from the marked-up paragraph.

Here is a more meaningful example, which intelligently generates a "navigation bar", a pair of links to the preceding and the following slides. Obviously the first and the last page must have only one navigation link.

     (define (print-slide n max-count)
       (SXML->HTML
        `((h2 "Slide number:" ,n)     ; Note n is used in its native form
          ,(and (positive? n)
      	  `(a (@ (href "base-url&slide=" ,(- n 1))) "prev"))
          ,(and (< (+ n 1) max-count)
      	  `(a (@ (href "base-url&slide=" ,(+ n 1))) "next"))
          (p "the text of the slide"))))

SXML makes composing XML or HTML documents even more convenient by enabling higher-order "tags" -- just as LaTeX helps typeset documents by offering higher-order "macros". LaTeX macros are eventually expanded by TeX; SXML "tags" are evaluated by Scheme. The SXML specification is an example of such an advanced composition. The SXML.html web page that describes SXML is actually written in SXML itself.

A better example of a logical markup is the following excerpt from SXML.scm. It defines a *TOP* element of SXML:
     (productions
          (production 1
            (nonterm "TOP")
            ((sexp
              (term-lit "*TOP*")
              (ebnf-opt (nonterm "aux-list"))
              (ebnf-* (nonterm "PI"))
              (ebnf-* (nonterm "comment"))
              (nonterm "Element")))))
When converted to HTML, the above fragment becomes
     <table border=0 bgcolor="#f5dcb3">
     <tr valign=top><td align=right><a name="prod-1">[1]</a>&nbsp;</td>
     <td align=right><code>&lt;TOP&gt;</code></td>
     <td align=center><code> ::= </code></td>
     <td align=left><code><strong>(</strong> <em>*TOP*</em> &lt;aux-list&gt;? 
     &lt;PI&gt;* &lt;comment&gt;* &lt;Element&gt; <strong>)</strong></code> </td>
     </tr></table>
In LaTeX, it does not look much better:
     \begin{tabular}{rrcp{2.8in}}
     {[}1{]} & \texttt{<TOP>} &  $::=$ & \texttt{\textbf{(} {\itshape *TOP*} <aux-list>? <PI>* <comment>* <Element> \textbf{)} } \\
     \end{tabular}
     \\
The SXML form appears most lucid. It was also easier to type, especially given suitable Emacs key bindings, similar to those designed for LAML.

Using SXML to express its grammar has another important advantage: you can easily write a transformation or an SXPath query on the whole SXML.scm to make sure that every nonterm mentioned on the right-hand of some production appears on the left-hand side of exactly one production. The SXML.scm code thus lends itself not only to a flexible presentation to a human but to a formal reasoning about as well.

As a matter of fact, the ability to treat SXML as a "code" or as "data" is used during SXML transformations. The expansion process of a tag can re-scan the SXML document, with a different stylesheet. That is how hierarchical tables of contents are generated. Unlike LaTeX, you do not need to write auxiliary files and do not need to re-run the document processor.

The definition of SXML->HTML is given below. The function also handles "higher-level" HTML tags, for example, html:begin. These tags make HTML code even more succinct and expressive, as the SXML code for this page clearly shows. This page is authored in SXML. Local and site links are generated automatically.

The advantage of the present approach compared to LAML is that we do not define any tag-related functions in the interaction-environment. HTML tags appear to live in their own "namespace". The function SXML->HTML handles the arbitrary set of tags: we do not need to pre-define any tag or attribute for it to work. On the other hand, if validation is desired, we can introduce SXML->HTML's bindings for all allowed tags, and bind the *default* wildcard to a procedure that prints an error. Therefore, our approach is just as flexible and expressive as LAML. Furthermore, since SXML is also a data structure, you can store it in a file. Or you you can store standard headers, footers, etc. templates in files and "pull" them in when needed. This inclusion of standard templates can be done lazily, for example, through a higher-level include-and-process tag.

Platforms
The package is tested on Gambit-C 3.0, SCM 5d6, Bigloo 2.4b and Petite Chez Scheme 6.0a
Version
The current version is 1.5, Aug 10, 2002.
References

SXML specification
An extensive example of HTML authoring in SXML, with a great number of high-level tags and the automatic generation of a table of contents. Evaluation of SXML.scm file generates SXML.html page.

xml.scm [60K]
This present page in SXML. This is the master file for the present web page.

SXML-to-HTML.scm [5K]
Commented source code for the top-level HTML transformer, SXML node handlers that generate the markup, and auxiliary functions to quote "bad" characters.

vSXML-to-HTML.scm [6K]
Validation (sample) code

SXML-to-HTML-ext.scm [15K]
Definitions of several functions and higher-order SXML "tags" that are used to compose HTML pages on this web site. In LaTeX terms, this file is similar to article.cls.

LaXmL: SXML as a higher-order markup language [plain text file]
An article posted on a comp.text.xml and comp.lang.scheme newsgroups on Thu, 5 Apr 2001 20:15:12 GMT.

sxslt-advanced.scm [16K]
An extensively commented example that transforms an SHTML-like SXML document with nested sections into a web page with recursively numbered sections and subsections. We use appropriate HTML tags (H2, H3, ...) to set off the title of the sections and subsections. The example also generates a hierarchical table of contents and places it before the first section.

"Active Server Pages" in Scheme
An example of using the present HTML generation approach to write active server pages with complex stateful forms.

LAML
Kurt Normark, "Programming World Wide Web Pages in Scheme," ACM SIGPLAN Notices, vol. 34, No. 12 - December 1999, pp. 37-46.

HTML authoring in the typed HSXML

Requires
SXML traversals and transformations
 

Writing LaTeX/PDF mathematical papers with SXML

The following message discusses a fairly non-trivial example of writing a theoretical computer science paper in SXML. The message shows many snippets of SXML and the corresponding LaTeX code. SXML helped not only in transcribing (hand-written) mathematical notation. The proofs themselves were conducted in Emacs/SXML, rather than on a piece of paper. The Emacs/SXML combination turns out a good assistant for equivalence-based proofs.

The SXML->LaTeX formatter is quite distinct from SLaTeX. We are not concerned with the typesetting of Scheme code. Rather, our aim is to use Scheme notation as an easy-to-type and easy-to-manipulate realization of mathematical notation. The end result must be conventional mathematical notation, with (lambda (x) (f g)) typeset as $\lambda x. f g$, (compose f g h) as $f\,\cdot\,g\,\cdot\,h$, (let ((x 1)) body) as $\mathrm{let}\,x=1 \mathrm{in} body$, and the reduction of transformed A to transformed B, (red* *A *B), as $\overline{A} \vartriangleright^{*} \overline{B}$.

We also explain that the placement of parentheses is quite a non-trivial problem. The message gives many an example. To determine whether a particular expression should be typeset enclosed in parentheses, we need to know the property of the expression and the property of the context. An expression may be enclosed in an arbitrary number of decorators (i.e., subscripts, overlines, etc), which complicates the matter. The typesetter is implemented as an evaluator of the SXML notation using mixed call-by-name and call-by-value rules and the explicit environment passing.

Version
The current version is Mar 23, 2005.
References

Writing LaTeX/PDF mathematical papers with SXML [plain text file]
The message with many examples of writing and conducting proofs in SXML. The messages also discusses the problem of parentheses. The message was posted on the SSAX-SXML mailing list on Tue, 26 Apr 2005 16:01:30 -0700.

Technical Report TR611, Department of Computer Science, Indiana University, 2005
<http://www.cs.indiana.edu/cgi-bin/techreports/TRNNN.cgi?trnum=TR611>

impromptu-shift-tr.scm [61K]
The master SXML file of the report
The first part of the file is the SXML source for the report. The second part is the formatter, from SXML to LaTeX (from which PDF was eventually built).

 

Joint processing of two immutable SXML documents with update cursors

We give an example of traversing two SXML documents ``side-by-side'' and making pure functional updates to arbitrary encountered nodes. The example illustrates the benefits of zipper -- an update cursor into an immutable data structure.

Our not-so-trivial example is detecting the differences in character content within the corresponding paragraphs of two SXML documents. We assume that two SXML documents in question are made of the same number of paragraphs. The paragraphs may be enclosed in other elements such as div, chapters, etc., and that enclosing may vary between the two documents. We disregard such differences when comparing the documents. Paragraphs contain text, perhaps interspersed with markup such as em, etc. We compare the text of the respective paragraphs from two SXML documents and annotate the paragraphs that differ. We return two other SXML documents with so annotated paragraphs. We compare only the character content, regardless of how that content is arranged into strings or enclosed into additional markup elements. For example, the following two paragraphs are considered the same:

     (p "This" " " "is a " (strong "simple") " example!")
     (p "This is a simple " (br) (em "example") "!")
Our example is a non-trivial generalization of the ``same fringe problem''.
Version
The current version is 1.1, Aug 23, 2005.
References

Stream-wise processing (differencing) of two SXML documents [plain text file]
The message with the explanation and examples of finding differences in two SXML documents. The message was posted on the SSAX-SXML mailing list on Tue, 23 Aug 2005 01:47:49 -0700.

streams-diff.scm [10K]
The implementation of the zipper, of the paragraph differencing function and of the examples.

Zipper in Scheme

 

Literate XML/DTD programming

Designing an XML format for a particular domain involves more than just creating a DTD for a collection of elements, attributes and entities. An important part of the format is the documentation. Ideally it is a hyperlinked document that explains the background, the motivation, and the design principles; describes and cross-references each tag and attribute; and shows representative examples of the proposed markup. This article will demonstrate a tool that helps design a specification and the documentation at the same time -- and to keep them consistent.

A literate design document should permit a transformation into a well laid-out, easy-to-read hyperlinked user manual. A literate design document should be easy to write. And yet the user manual should be precise enough to allow automatical extraction of a formal specification. We will demonstrate that SXML is rather suitable for literate XML programming. SXML is similar to TeX, but is far easier to write and read. SXML transformations do the job of "weaving" a document type specification and of "typesetting" the user manual.

Version
The current version is 1.1, Apr 19, 2001.
References

Literate XML/DTD programming [plain text file]
An article posted on a comp.text.xml and comp.lang.scheme newsgroups on Wed, 23 May 2001 04:26:31 GMT.

JMGRIB.html
JMGRIB.dtd
JMGRIB.scm
JMGRIB: Joint METOC XML format for grid data. A User Manual, DTD and the master file, i.e., literate DTD.

HTML/XML authoring in Scheme

 

SXML as a normalized database

S-expression-based files are handy not only for authoring Web pages. They can be used to build XML documents. The following article shows a real-life and less-trivial example of that. It is straightforward to convert <tag>data</tag> into (tag "data") and vice versa. The SSAX parser and the SXML manipulation tools can do that easily. However, exporting relational sources into XML often runs into an impedance mismatch: XML by its nature is a hierarchical database. We will show an example of generating XML from s-expressions that involves denormalizations, "table joins" and third-order tags. The s-expression format turns out not only more understandable and insightful, but more concise as well, by a factor of four.
Version
The current version is 1.1, Apr 17, 2001.
References

Re: s-expressions as a file format? [plain text file]
An article posted on a comp.lang.scheme newsgroup on Wed, 29 Aug 2001 19:13:29 -0700.

SXs.pdf paperSection 5 of that paper describes this example in great detail.

Manifest.scm [from www.metnet.navy.mil]
Manifest.scm [from DISA]
Normalized SXML database, which represents a collection of OMF resources. The file Manifest.scm also includes a stylesheet that converts the SXML database to XML, and "joins" s-expression-based tables in the process.

Manifest.xml [from www.metnet.navy.mil]
Manifest.xml [from DISA]
Denormalized XML database of the collection of OMF resources, in the DISA XML repository format.

OMF: Weather observations markup format

HTML/XML authoring in Scheme

 

Complete examples of practical (context-sensitive) SXML Transformations

The following examples demonstrate SXML transformations, from trivial to advanced. The advanced examples exhibit context-sensitive transformations: the result of the conversion of an SXML tag depends on that tag's context, i.e., siblings, parents, or descendants. We also demonstrate: higher-order tags, pre-order and post-order transformations, re-writing of SXML elements in regular and special ways. Finally, we illustrate SXSLT reflection: an ability of a rule to query its own stylesheet and to re-apply the stylesheet with "modifications".
Version
The current version is Dec 31, 2003.
References

apply-templates.scm [11K]
An example of a XSLT-like apply-templates form.

sxml-db-conv.scm [8K]
Using higher-order transformers for context-sensitive transformations. The example also shows the treatment of XML Namespaces in SXML and the namespace-aware pretty-printing of SXML into XML.

pull-punct-sxml.scm [9K]
A context-sensitive recursive transformation and breadth-first-like traversals of a SXML document.

sxslt-advanced.scm [16K]
An extensively commented example of SXSLT. The example transforms an SHTML-like SXML document with nested sections into a web page with recursively numbered sections and subsections. We use appropriate HTML tags (H2, H3, ...) to set off the title of the sections and subsections. The example also generates a hierarchical table of contents and places it before the first section.

sxml-nesting-depth-label.scm [5K]
Labeling nested sections of an SXML document by their nesting depth. This is another example of traversing an SXML document with a stylesheet that differs from node to node.

sxml-to-sxml.scm [9K]
Transforming SXML to SXML: Composing SXML transformations. The code introduces and tests a version of pre-post-order that transforms an SXML document into a strictly conformant SXML document. That is, the result of a pre-post-order transformation can be queried with SXPath or transformed again with SXSLT.

SSAX/SXML examples
A directory of the examples, including the Makefile to run them.

 

Complete examples of stream-wise (SAX) and DOM parsing

The following examples show how to use the SSAX library for a process-as-we-parse, stream-wise (SAX) XML parsing. We also illustrate a DOM parsing, and handling of namespace-rich documents. The examples are the complete, ready to run code.
Version
The current version is Jan 6, 2007.
References

remove-markup.scm [2K]
Given an XML document, remove all markup. In other words, compute the string-value of a DOM tree of the document, without constructing the DOM tree.

outline.scm [2K]
Pretty-print the structure of an XML document, disregarding the character data. This is yet another example of transforming an XML document on the fly, as we parse it. This example corresponds to outline.c of the Expat distribution.

run-sxml.scm [<1K]
A complete example of parsing an XML into SXML: DOM parsing with SSAX.

daml-parse-unparse.scm [17K]
Parsing and unparsing (and parsing again) of a namespace-rich XML document: DAML/RDF. The example illustrates that the parser and the pretty-printer maintain the invariant PARSE . UNPARSE . PARSE == PARSE

ssax-extraction.scm [7K]
SSAX parsing of a large XML document and recording only `interesting' parts, namely, the content of CDATA sections from from a (small) subset of elements. Because the document is large, it is important to avoid reading it all in memory and retaining irrelevant data.

SSAX/SXML examples
A directory of the examples, including the Makefile to run them.

On parent pointers in SXML trees

 

Permissive parsing of perhaps invalid HTML

SSAX is more than an XML parser -- it is a library, which includes lexers and parsers of various kinds. The library comes in handy when we need to permissively parse (or better say, lex) HTML documents or chunks, which may contain mismatched tags, unclosed tags, or other imperfections. Because browsers are so lax, many web pages on the Internet are ill-coded. For example, the FrontPage editor is notorious for creating such invalid sequences as <b><i>text</b></i>.

Because we accept even HTML with unmatched tags, we make no attempt in this example to recover the structure. We faithfully record all occurring tags and let the user sort out what matches what. For example, the following ill-composed sample document, which includes comments, parsed entities and attributes of various kinds:

      <html><head><title> </title> <title> whatever </title></head>
       <body> <a href="url">link</a> <p align=center> <ul compact
       style='aa'>
       <p> BLah <!-- comment <comment> --> <i> italic <b> bold <tt> end </i> still  &lt; bold </b>
       </body>
       <P> But not done yet...
parses to a token stream:
     (#(START html ()) #(START head ()) #(START title ())
       " " #(END title) " " #(START title ()) " whatever " #(END title)
       #(END head) "\n  "
       #(START body ()) " " #(START a ((href . "url"))) "link" #(END a) " "
       #(START p ((align . "center"))) " "
       #(START ul ((compact . "compact") (style . "aa")))
       "\n  " #(START p ()) " BLah  " #(START i ()) " italic "
       #(START b ()) " bold " #(START tt ()) " end " #(END i)
       " still  < bold " #(END b) "\n  " #(END body)
      "\n  " #(START P ()) " But not done yet...")
The parser has handled &lt; and other parsed entity references and fully preserved all the whitespace, including newlines. The parser obviously supports the three possible styles of HTML attributes: <tag name="value">, <tag name=value>, and <tag name>. Single quotes are also allowed. Comments are silently ignored.

This is an example of using the SSAX library for tasks other than XML parsing.

Version
The current version is 1.1, Nov 3, 2001.
References

html-parser.scm [8K]
The source code of the parser and of the test to run the above example

SSAX examplesA directory of SSAX examples, including the HTML parser, and the Makefile to run them.

HtmlPrag: a permissive HTML parser that emits SXML, by Neil W. Van Dyke
<http://www.neilvandyke.org/htmlprag/>

 

XML pull parsing and SSAX

SSAX parser can be considered a for-each-like enumerator. The collection in question is the XML document itself (to be more precise, the corresponding input port). Indeed, a sequential reading of an XML document is equivalent to a depth-first traversal of the corresponding XML Infoset tree. An article General ways to traverse collections demonstrated that every for-each enumerator can be turned inside-out. The result is a lazy list, a demand-driven producer of values, which is called generator in languages such as Icon, Python and Ruby. The user of the inverted SSAX is no longer being pushed by the parser -- on contrary, the user pulls markup and character data from the inverted-SSAX stream. Thus SSAX can be turned into an XML Pull parser mechanically.

We should note that many parts of the SSAX parser toolkit have been designed with the pull model in mind. The function lex-html of the permissive HTML parser illustrates how to pull with SSAX.

References

General ways to traverse collections


<http://www.xmlpull.org/>
XML Pull: Common API for XML Pull Parsing

Permissive parsing of perhaps invalid HTML

Differentiating Parsers

 

On parent pointers in SXML trees

In this article we discuss and compare five different methods of determining the parent of an SXML node. The existence of these methods is a crucial step in a constructive proof that SXML is a complete model of the XML Information set, and SXPath is a complete implementation of the XPath Recommendation. Four of the five techniques of determining the parenthood require no mutations, and can be implemented in a pure and strict functional language. Three of the five techniques rely on annotations attached to SXML nodes. The techniques differ in the nature of the annotations and in the time complexity of attaching and using these annotations.

The source code shows several advanced examples of XML SAX-style parsing: custom parsers that maintain node dictionaries and parent links as they read the document.

Version
The current version is 1.1 Feb 12, 2003.
References

The article itself [plain text file]

parent-pointers.scm [23K]
The complete source code of four techniques.

Discussion forum
Comments posted by Dmitry Lizorkin on Feb 22 and 27, 2003 are particularly interesting. He has suggested a way of accounting for the parent pointers during the evaluation of an (S)XPath expression.

 

SSAX parsing with limited XML doctype validation and datatype conversion

XML Schema introduces a number of atomic XML datatypes and the corresponding validation rules. SXML likewise supports boolean, numerical, and other Scheme values [1] in the "character" content of SXML elements and attributes. As we parse the document, we can validate and de-serialize character data from the source XML document into the corresponding numerical etc. Scheme values. The present sample code shows an example of such datatype conversion and validation. This code also demonstrates element content validation: making sure that the sequence of child elements matches a user-specified template.

The validation in the present example is intentionally limited, to keep the code simple and clear. The code can be extended to more complex validation rules.

[1] One may use any Scheme value in the "character" content of an SXML element or attribute as long as this value is not a pair or a symbol. If we need to insert a pair or a symbol in the "character" content, we should wrap them in a vector or a procedure.

Version
The current version is 1.1 Aug 22, 2003.
References
validate-doctype-simple.scm [12K]
The source code that instantiates the SSAX parser framework to do on-the-fly document type validation and datatype conversion. The tests at the end of the file demonstrate the datatype conversion and the detection of invalid atomic and structural XML fragments.


Miscellanea

 

CDATA, #PCDATA, and ANY

A short article explaining the difference among CDATA, #PCDATA and ANY -- as well as the distinction between a mixed content and any content.
Version
The current version is 1.1, Jan 5, 2001.
References
Re: XEXPR needs Schemers' help [plain text file]
An article posted on a comp.lang.scheme newsgroup on Fri, 05 Jan 2001 22:20:06 GMT.
 

SOAP 1.2 and SXML

SOAP 1.2 became a W3C Recommendation on June 24, 2003. SOAP 1.2, which used to be known as XML protocol, is quite different from SOAP 1.1 and XML-RPC. In particular, SOAP 1.2 permits SXML (s-expression) message systems to claim SOAP 1.2 compliance. Indeed, SOAP does not require XML! The Recommendation says,

A SOAP message is specified as an XML Information Set [XML InfoSet]. While all SOAP message examples in this document are shown using XML 1.0 syntax, other representations MAY be used to transmit SOAP messages between nodes (see 4. SOAP Protocol Binding Framework).
...
The binding framework does not require that every binding use the XML 1.0 serialization as the "on the wire" representation of the XML infoset; compressed, encrypted, fragmented representations and so on can be used if appropriate.

In other words, the W3 Consortium specifically leaves it to applications to choose the appropriate on-the-wire representation of the XML Infoset. S-expressions is one of the realizations of the XML Infoset. Thus serializing SOAP messages using SXML is specifically permitted. Indeed, S-expressions may reasonably be characterized as a "compressed representation" of the XML Infoset.

Version
The current version is 1.1, May 26, 2003.
References
SOAP 1.2 [plain text file]
A message with more discussion of the SOAP processing with SXML. The message was posted on a SSAX-SXML mailing list on Mon, 26 May 2003 12:53:24 -0700.
 

Evaluating SXML: XSL Transformations in Scheme

Write a procedure xml such that an expression
     (xml '((url "mypage.html" (name "FirstName LastName's page")) (p "my paragraph")))
evaluates to a string
     <a href="mypage.html"><name>FirstName LastName's page</name></a><p>my paragraph</p>

and

     (define myname '(name "FirstName LastName"))
     (xml `( ((p "par1") (p "par2")) (p "This is " ,myname " page") ))
evaluates to
     <p>par1</p><p>par2</p><p>This is <name>FirstName LastName</name> page</p>
Version
The current version is 1.2, Jun 7, 2000.
References

Evaluating Scheme to XML/HTML [Was: eval to invoke a local function?] [plain text file]
An article posted on a comp.lang.scheme newsgroup on Thu, 27 Jul 2000 01:41:52 GMT. It contains the source code for the xml function above, and discusses several generalizations.

SXML traversals and transformations

 

XEXPR, SXML, and "executable" XML code

This set of two articles was prompted by a W3C technical note that introduces XEXPR -- a programming (scripting) language with XML notation. The second article offers a detailed critique. What if we still want a programming language with an XML notation? The articles consider two ways of designing such languages. One is to assign an executable semantics, reduction rules, to an abstract syntax tree representation of an XML document. The other approach is to start with an extant programming language and to introduce rules to pretty-print its abstract syntax tree into the XML format.

Both approaches are especially easy if the programming language is Scheme. The concrete syntax of Scheme is basically its abstract syntax as well. On the other hand, SXML -- an abstract syntax for XML -- has a representation that parses not only as Scheme "data" but also as a Scheme "expression" (by R5RS syntax definitions). In SXML, abstract syntaxes of Scheme and XML unify.

The two approaches are illustrated in the articles below. In particular, the second article demonstrates that a SXML pretty-printer is general enough to permit a well-defined XML syntax for Scheme. This is not trivial as the following Scheme expression may show:

     (begin (append a (x y) '(b) '(c "d" 1 (f g) '(h i) `g () '() (()) j)))
It exhibits a great deal of context-sensitivity. What looks like an application may in reality be something else, if it is preceded by a quote. A quote within a quote loses its special meaning. An XML syntax should be able to capture these nuances -- and it does as the articles show.

To see how we can assign an XML notation to Scheme, or an evaluation semantics to XML, let us consider an expression:

     (string-join '("foo" "bar" "baz") ":")
The code in the first article pretty-prints this expression as
     <string-join>
       <list> <str>foo</str><str>bar</str><str>baz</str> </list>
       <str>:</str>
     </string-join>
On the other hand, we start with this XML string and feed it into a SSAX parser. We will immediately see that the output:
     (string-join (list (str "foo") (str "bar") (str "baz")) (str ":"))
can be assigned a well-defined evaluation semantics -- the semantics of Scheme. Indeed, this is a genuine Scheme expression, and can be evaluated as such. The result will be the same as that of the original expression, provided that we define str as the identity function.

The second article discusses another, more general XML notation for our example:

     <apply>
       <var>string-join</var>
          <list> <str>foo</str><str>bar</str><str>baz</str> </list>
          <str>:</str>
     </apply>
This representation holds a stronger invariant with respect to SXML pretty-printing and parsing.

The first article notes that a function post-order at the heart of SXML pretty-printing can also evaluate a Scheme expression directly. Indeed, this function is an implementation of eval -- or eval is a post-order traversal for a Scheme expression tree.

Version
The current version is 1.1, Jan 17, 2001.
References

XML as Scheme, or Scheme as XML [plain text file]
The first article, posted on comp.lang.scheme and comp.text.xml newsgroups on Sat Jan 06 02:45:03 2001 GMT.

XML as Scheme, Scheme as XML; XEXPR considered invalid [plain text file]
The second article, posted on comp.lang.scheme and comp.text.xml newsgroups on Wed Jan 17 06:58:14 2001 GMT.

XEXPR - A Scripting Language for XML. Gavin Thomas Nicol, Ed. W3C Note 21 November 2000.

The Next 700 Markup Languages by Philip Wadler.
Invited Talk, Second Conference on Domain Specific Languages (DSL'99), Austin, Texas, October 1999.
Among other things, this talk gives an interesting example of a functional programming language with an XML syntax.

XEXPR and Scheme discussion thread

Requires
SXML expression tree transformers


Last updated May 7, 2014

This site's top page is http://okmij.org/ftp/

oleg-at-okmij.org
Your comments, problem reports, questions are very welcome!

Converted from SXML by SXML->HTML