
MicroXML

MicroXML is a delightful tiny subset of XML which, in my view, restores most of what was good about SGML to XML, while ending up significantly simpler than both. Notably, it gets rid of namespaces, the doctype and XML declarations, entity references other than the five default character entities, and processing instructions. This might seem radical, but the entire second edition of the Oxford English Dictionary was marked up in a format with only one more feature than this. (That feature was named character entity references, a necessity in the age before Unicode, but arguably less relevant today.)

The intention was to remove everything that encouraged the abuse of XML 1.0 to build things other than document processing apps. (Like RPC systems and other nonsense.)

Use cases for the removed features

Custom character entity references served two purposes. Firstly, they gave a recognizable, visible mnemonic form to characters as yet unknown to Unicode, one divorced from any current Private Use Area allocation, thus easing the transition when and if the referenced character is finally encoded in Unicode itself. Secondly, entity references can be entered on an ASCII keyboard. These advantages are arguably outweighed, though, by the increased complexity of needing a list of such character entity references in order to parse a document, and by the fact that entity references can actually expand to more than one character. A global search and replace through a document database to convert PUA characters to newly standardized Unicode equivalents is certainly a simpler solution, and input methods in text editors solve the problem of how to insert characters easily.
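
For illustration, here is the kind of machinery MicroXML drops (the entity name and PUA code point are hypothetical):

    <!DOCTYPE dictionary [
      <!-- give a mnemonic name to a character parked in the Private Use Area -->
      <!ENTITY zilf "&#xE001;">
    ]>
    <dictionary>
      <entry>... the character &zilf; awaits encoding ...</entry>
    </dictionary>

If Unicode later encoded the character, only the entity declaration would need to change; with raw PUA characters in the text, it takes the global search and replace described above.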

Namespaces have some uses, but I probably ultimately have to admit they’re overkill for the problem they solve. One use case, from when I was investigating writing a new Dictionary Writing System for Green’s Dictionary of Slang, was cross-references. To understand this use case, some background information: a cross-reference in an historical dictionary might point not just to an entry, but (for example) to a specific sense, or even to a specific quotation/usage example; the reader-visible text of cross-references is computer-generated, to ensure that if the spelling of a headword changes, or sense numbers change as senses are re-ordered within an entry, the visible cross-reference continues to reflect the right information as it appears at its destination. So the computer just sees something like <xref dest="abcdef#123" /> and, when the time comes to publish the dictionary entry, looks up what’s under the magic internal ID abcdef#123 and converts that tag to something like <xref dest="abcdef#123">hot adj. (sense 2)</xref>.

The plan was to use two namespaces for this: one for the actual content elements, and one for cross-references, containing exactly the same element names. So a cross-reference to a dict:entry would be an xr:entry, a cross-reference to a dict:quotation would be an xr:quotation, and so on. The code generating the visible cross-reference text would use this as both a formatting hint and a type check.
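
A sketch of what this would have looked like (the namespace URIs are invented for illustration):

    <dict:entry xmlns:dict="http://example.org/ns/dict"
                xmlns:xr="http://example.org/ns/xref"
                id="ghijkl">
      <dict:sense n="1">... see <xr:sense dest="abcdef#123"/> ...</dict:sense>
    </dict:entry>

Here xr:sense tells the formatter, before it has looked anything up, that the target of the cross-reference is supposed to be a dict:sense.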

But again, perhaps in reality, namespaces were overkill for this. After all, I’ve just outlined above a perfectly cromulent way to do this without using two distinct namespaces.

There’s also the more obvious use case of allowing one language (say, HTML) to be embedded inside another (say, Atom). But this is both rare and often easily handled by context, and by the fact that RELAX NG, for example, allows you to embed one schema inside another (with include). In the case of a MicroXML equivalent of HTML content inside Atom feeds, the MicroXML-ized version of the Atom schema would include a MicroXML-ized HTML schema, and the content element would point to the MicroXML-ized HTML schema’s div element.
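
A minimal sketch of that include, in RELAX NG’s XML syntax (the file and pattern names are hypothetical):

    <grammar xmlns="http://relaxng.org/ns/structure/1.0">
      <!-- hypothetical schema fragment defining the html.div pattern -->
      <include href="micro-html.rng"/>
      <define name="atom.content">
        <element name="content">
          <ref name="html.div"/>
        </element>
      </define>
    </grammar>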

Processing instructions were handy for specifying stylesheets for rendering XML in browsers, but in reality I know of exactly one web page which works like this. The XML declaration isn’t needed because the only allowed encoding for MicroXML is UTF-8, which is fine. (The theoretical advantages of UTF-16 for e.g. CJKV text were apparently mostly negated in practice by the overhead of the mostly-ASCII element and attribute names etc.)
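
For reference, these are the two constructs in question, both gone from MicroXML (the stylesheet name is hypothetical; xml-stylesheet is the processing instruction browsers honour):

    <?xml version="1.0" encoding="UTF-8"?>
    <?xml-stylesheet type="text/xsl" href="render.xsl"?>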

Criticism of the spec

Why did they keep the restriction on the appearance of -- inside comment text? There was no real reason to keep that restriction in XML, and even less in MicroXML now that markup declarations are gone entirely. See also Ian Hickson’s ‘People who don’t realise that they’re wrong’ (2006), explaining why he was finally convinced to allow -- in comments in HTML.
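
For example, this comment is harmless, yet it is a well-formedness error in XML and remains one in MicroXML:

    <!-- pages 10--12 were renumbered in the 2nd edition -->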

The requirement that newlines in attribute values be preserved rather than normalized to spaces, on the other hand, is a gratuitous difference from XML 1.0, bound only to cause confusion for as long as XML processing tools lack MicroXML-specific parsers.
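
That is, given an attribute value containing a literal newline:

    <quotation source="line one
    line two">...</quotation>

an XML 1.0 parser reports the value as ‘line one line two’, while a MicroXML parser must report the newline intact.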

A specified error recovery model, even if non-normative, would have been a good idea.

Erik Naggum

Apropos of nothing — not because I take it particularly seriously, but because it’s well known — let’s compare what MicroXML has to offer in view of Erik Naggum’s infamous rant about XML’s disimprovements upon SGML.

    SGML was a major improvement on the markup languages that preceded it (including GML), which helped create better publishing systems and helped people think about information in much improved ways, but when the zealots forgot the publishing heritage and took the notion that information can be separated from presentation out of the world of publishing into general data representation because SGML had had some success in ‘database publishing’, something went awry, not only superficially, but fundamentally.

    [stretched analogy elided]

    SGML is a good idea when the markup overhead is less than 2%. Even attributes is a good idea when the textual element contents is the ‘real meat’ of the document and attributes only aid processing, so that the printed version of a fully marked-up document has the same characters as the document sans tags.

This is more about philosophy of markup and serialization than anything else, but an explicit design goal of MicroXML is to move back into the document markup space and away from the misguided application to general data serialization.

    Explicit end-tags is a good idea when the distance between start- and end-tag is more than the 20-line terminal the document is typed on. Minimization is a good idea in an already sparsely tagged document, both because tags are hard to keep track of and because clusters of tags are so intrusive.

MicroXML keeps explicit end-tags in all cases, and has no minimization features except for XML’s self-closing slash.
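
That is, the only shorthand on offer is writing (to pick an arbitrary element name)

    <pagebreak/>

as an exact equivalent of <pagebreak></pagebreak>.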

    Character entities is a good idea when your entire character set is EBCDIC or ASCII.

MicroXML enforces the use of UTF-8, making character entities obsolete by this measure.

    Validating the input prior to processing is a good idea when processing would take minutes, if not hours, and consume costly resources, only to abend.

The schema and validation story for MicroXML is still unclear.

Then there are Naggum’s suggestions for how to do things better:

    Remove the syntactic mess that is attributes. (You will then find that you do not need them at all.)

MicroXML keeps attributes, but Naggum himself admitted above that they are useful, and even suggested a rule of thumb for when it would be appropriate to use them: when there is useful metadata which enriches the meaning of a span of text but which itself is not part of the human-readable content. (A canonical example would be the destination of a hyperlink.)

    Enclose the element in matching delimiters, not the tag. These simple things makes people think differently about how they use the language. Contrary to the foolish notion that syntax is immaterial, people optimize the way they express themselves, and so express themselves differently with different syntaxes.

MicroXML retains the XML/SGML way of doing things. I’m not sure I understand Naggum’s point here anyway.

    Next, introduce macros that look exactly like elements, but that are expanded in place between the reader and the ‘object model’.

This is an intriguing idea (with obvious Lisp influences). One would achieve it fairly trivially with a transformation stage between initial parsing and processing. That would effectively provide minimization without tag omission etc., at the cost of having to run the transformation stage before further processing. Again, the MicroXML transformation story is still unclear.
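
A sketch of the idea, with hypothetical element names: an authored macro form like

    <q year="1884" src="Twain">You don't know about me without you have read a book...</q>

would be expanded in place, before any further processing sees it, into

    <quotation>
      <date>1884</date>
      <source>Twain</source>
      <text>You don't know about me without you have read a book...</text>
    </quotation>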

    Then, remove the obnoxious character entities and escape special characters with a single character, like \, and name other entities with letters following the same character. If you need a rich set of publishing symbols, discover Unicode.

MicroXML does some of this, cutting down named character entities to the standard five, and using only Unicode hex character escapes otherwise. Unicode is the only way to use special characters. The XML/SGML syntax for them is retained, though.
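
The complete repertoire of escapes in MicroXML is therefore:

    &amp; &lt; &gt; &quot; &apos;   <!-- the five retained named entities -->
    &#xE9;                          <!-- everything else by hex code point -->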

    Finally, introduce a language for micro-parsers that can take more convenient syntaxes for commonly used elements with complex structure and make them return element structures more suitable for processing on the receiving end, and which would also make validation something useful.

MicroXML doesn’t do this at all, and I’m not surprised, as the features like this which SGML had were widely regarded as being among its worst and most error-prone. A framework for creating micro-markup-languages comparable to, say, Markdown or Textile, but targeting a custom tag set with totally distinct and presumably richer semantics than HTML, is not really a bad idea: it’s just not at all in scope for MicroXML.