From oleg@pobox.com Thu Mar 15 11:27:59 2001
Newsgroups: comp.text.xml,comp.lang.scheme
References: <200103151927.LAA68674@adric.cs.nps.navy.mil>
Date: Thu, 15 Mar 2001 19:31:16 +0000 (UTC)
Date-Sent: Thu, 15 Mar 2001 11:27:57 -0800 (PST)
To: comp.lang.scheme@mailgate.org, comp.text.xml@mailgate.org
Subject: SAX/DOM/SXML parsers with support for XML Namespaces and validation
Keywords: XML, SAX, SXML, parsing, Namespaces
X-Comment: Added URL to ports by Kirill Lisovsky
Summary: Announcing a newer version of a framework to parse XML documents
Status: OR

This is to announce version 4 of SSAX, a package of low-to-high level lexing and parsing procedures that can be combined to yield SAX, DOM, or validating parsers, or a parser intended for a particular document type.

Salient features of this version are: _complete_ support for XML Namespaces; support for xml:space; support for character, external and internal parsed entities, with detection of violations of the non-recursion constraint; and support for validation of element and attribute content. The previous versions already handled attribute value normalization, processing instructions and CDATA sections. The parser by itself detects a great many validation errors and almost all well-formedness errors; a user can install custom handlers to capture the rest of the errors. Content is validated against user-specified constraints, which the user can derive from a DTD, from an XML schema, or from other competing doctype specification formats. Low- and intermediate-level SSAX procedures help parse these doctype formats.

The procedures in the package can be used separately to tokenize or parse various pieces of XML documents. This package is intended to be a framework, a set of "Lego blocks" you can use to build a parser that follows DOM, SAX, or another discipline and performs validation to any degree.

As an example of such parser construction, the package includes a semi-validating SXML parser.
It converts XML to SXML, an instance of XML Infoset as S-expressions, an abstract syntax tree of an XML document. As the web page referenced below explains, SXML can be queried (in an XPath style), transformed and evaluated.

The present framework parses XML in a purely functional style, as a fold over the text of an XML document regarded as a spread-out tree. The input port is treated as a linear, read-once parameter. The framework's code does not use assignments at all.

Platforms: SSAX was written and tested on Gambit-C 3.0; it also runs on SCM 5d3. Kirill Lisovsky (http://pair.com/lisovsky/) has kindly notified me that he has ported SSAX to Bigloo, Guile, Chicken, and PLT Scheme.

References:

http://pobox.com/~oleg/ftp/Scheme/SSAX.scm
    The code comes with an extensive set of self-tests, which verify not only the correct behavior but the detection of constraint violations as well.

http://pobox.com/~oleg/ftp/Scheme/SXML.html
    Definition of SXML: an instance of XML Infoset as S-expressions, an Abstract Syntax Tree of an XML document.

http://pobox.com/~oleg/ftp/Scheme/xml.html
    SXPath (SXML query) and SXML Transformations and Evaluations

http://pair.com/lisovsky/download/ssax/
    Current versions of the SSAX ports to Bigloo, Guile, Chicken, and PLT Scheme.

License type: Public Domain

There exists a myth that parsing of XML is easy. An article "Parsing XML" in the January 2000 issue of Dr. Dobb's Journal asserts the ease of parsing as if it were a fact. The author of that article must have overlooked that there is more to XML than the grammar presented in the XML Recommendation. There are attribute normalization rules and well-formedness constraints, let alone validation constraints. XML Namespaces add another layer of complexity.

One example is the attribute value normalization rules (Section 3.3.3 of the XML Recommendation). To obtain the value of an attribute, it is emphatically not enough to merely read the text between two single or double quotes.
  - A parser shall replace TAB, CR, LF characters and a CRLF character combination with a single space.
  - A parser shall expand character references, e.g., references of the form &#NN; or &#xHH;.
  - A parser shall expand internal entity references, e.g., &lt;. The expansion text of an internal entity shall also be normalized.
  - A parser shall check that the replacement text does not contain a character '<' verbatim.
  - A parser shall treat quotes that may appear in the replacement text as regular characters of no special meaning.
  - The replacement text may contain references to other character and internal entities, which must be expanded and normalized in turn.
  - Furthermore, the parser must keep in mind a non-recursion constraint: "A parsed entity must not contain a recursive reference to itself, either directly or indirectly".

Even a _non_-validating XML parser _must_ follow all these rules.

Here is another example. Of the following three documents, one is OK, another is invalid but well-formed, and the third one is not even well-formed. The root element of all three documents is the same.

]>
]>
]>
Which is which is left as an exercise for the reader. The SSAX parser can tell the difference. In fact, these examples are from its self-tests.

Examples

The following examples are excerpts from SSAX's built-in self-tests. They present the input to and the output from the parser:

In:  "<BR/>"
Out: (BR)

In:  "<BR></BR>"
Out: (BR)

In:  "<A HREF='URL' xml:space='preserve'> link <I xml:space='default'>itlink</I> &amp;</A>"
Out: (A (@ (XML:space "preserve") (HREF "URL")) " link " (I (@ (XML:space "default")) "itlink") " &")

In:  "<P><?pi1 p1 content ?>?<?pi2 pi2? content? ??></P>"
Out: (P (*PI* pi1 "p1 content ") "?" (*PI* pi2 "pi2? content? ?"))

In:  "<P><![CDATA[<BR>\n<![CDATA[<BR>]]&gt;]]></P>"
Out: (P "<BR>\n<![CDATA[<BR>]]>")

An example from the XML Namespaces Recommendation. No user-prefixes.

"<x xmlns:edi='http://ecommerce.org/schema'> <lineItem edi:taxClass='exempt'>Baby food</lineItem> </x>"
Out: (*TOP* (x (lineItem (@ (http://ecommerce.org/schema:taxClass "exempt")) "Baby food")))

Another example from the XML Namespaces Recommendation, with user-defined namespace prefixes:

"<RESERVATION xmlns:HTML='http://www.w3.org/TR/REC-html40'> <NAME HTML:CLASS='largeSansSerif'>Layman, A</NAME> <SEAT CLASS='Y' HTML:CLASS='largeMonotype'>33B</SEAT> <HTML:A HREF='/cgi-bin/ResStatus'>Check Status</HTML:A> <DEPARTURE>1997-05-24T07:55:00+1</DEPARTURE></RESERVATION>"
Out: (*TOP* (*NAMESPACES* (HTML "http://www.w3.org/TR/REC-html40")) (RESERVATION (NAME (@ (HTML:CLASS "largeSansSerif")) "Layman, A") (SEAT (@ (HTML:CLASS "largeMonotype") (CLASS "Y")) "33B") (HTML:A (@ (HREF "/cgi-bin/ResStatus")) "Check Status") (DEPARTURE "1997-05-24T07:55:00+1")))

The SSAX source code has more examples, including a part of the RDF from the XML Infoset (with three namespaces) and RDF from the RSS of the Daemon News Mall. All these examples are part of the self-test suite.
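
To make the attribute value normalization rules listed earlier more concrete, here is a minimal sketch of them in Python. It is not SSAX code and does not mirror SSAX's procedures; the names normalize_attr_value, NotWellFormed and the entities table are hypothetical, chosen for illustration. The sketch assumes a CDATA-typed attribute, takes the raw text found between the quotes, and treats the five predefined entities as literal characters.

```python
import re

# The five entities predefined by XML, treated here as literal characters.
PREDEFINED = {"lt": "<", "gt": ">", "amp": "&", "apos": "'", "quot": '"'}

# A character reference (&#NN; or &#xHH;) or a named entity reference.
# The name pattern is a simplification of the XML Name production.
_REF = re.compile(r"&(#x[0-9A-Fa-f]+|#[0-9]+|[A-Za-z_:][\w.:-]*);")


class NotWellFormed(Exception):
    """Hypothetical error type for well-formedness violations."""


def normalize_attr_value(raw, entities=None, _open=()):
    """Normalize the raw text of a CDATA attribute value.

    entities maps internal entity names to their replacement text.
    _open holds the names of the entities currently being expanded,
    which is how a recursive reference is detected.
    """
    entities = entities or {}
    out = []
    i = 0
    while i < len(raw):
        ch = raw[i]
        if ch == "<":
            # Forbidden verbatim, in the value and in replacement text.
            raise NotWellFormed("'<' in attribute value")
        if ch == "&":
            m = _REF.match(raw, i)
            if m is None:
                raise NotWellFormed("malformed reference")
            name = m.group(1)
            if name.startswith("#x"):
                out.append(chr(int(name[2:], 16)))
            elif name.startswith("#"):
                out.append(chr(int(name[1:])))
            elif name in PREDEFINED:
                out.append(PREDEFINED[name])
            elif name in _open:
                raise NotWellFormed("recursive reference to entity " + name)
            elif name in entities:
                # Replacement text is expanded and normalized in turn.
                out.append(
                    normalize_attr_value(entities[name], entities, _open + (name,))
                )
            else:
                raise NotWellFormed("undeclared entity " + name)
            i = m.end()
        elif ch == "\r" and raw[i + 1:i + 2] == "\n":
            out.append(" ")  # a CRLF pair becomes a single space
            i += 2
        elif ch in "\t\r\n":
            out.append(" ")
            i += 1
        else:
            # Everything else, quotes included, is an ordinary character.
            out.append(ch)
            i += 1
    return "".join(out)
```

With a table such as {"e1": "say 'hi'\n&e2;", "e2": "&#x26;"}, the value "&e1;" normalizes to "say 'hi' &", whereas a pair of entities that refer to each other raises NotWellFormed. The sketch leaves out the extra whitespace trimming required for non-CDATA attribute types, external entities, and checks that a referenced character is a legal XML character.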