FtanML - A new markup language

By Michael Kay on August 14, 2012 at 12:29 p.m.

FtanML is a notation for data and documents designed to combine the simplicity of JSON with the expressive power of XML. Key aims are ease of reading and writing by human beings, and ease of processing by software in conventional programming languages.

It is named after Ftan, the village in the Swiss Alps where it was created during a two-week summer school course in August 2012. The tutors on the course were Michael Kay and Stephanie Haupt, and the students were Max Altgelt, Julien Bergner, Lukas Graf, Dominik Helm, Axel Kroschk, Uwe von Lüpke, My-Tien Nguyen, Sebastian Meßmer, Suhanyaa Nitkunanantharajah, Jan Landelin Pawellek, and Martin Schmitt.

The FtanML Grammar

value ::= array | element | string | number | "true" | "false" | "null"

string ::= (like a JSON string)

number ::= (like a JSON number)

array ::= "[" (value ("," value)* )? "]"

element ::= "<" element-name? attribute* ("|" content)? ">"

element-name ::= string | name

attribute ::= (string | name) "=" value

content ::= (content-char | escaped-char | element ) *

content-char ::= any character except \, <, >, and control characters

escaped-char ::= as in JSON strings with the addition of "\<", "\>" (to escape "<" and ">")

name ::= [\p{L}\p{N}:_$]+
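
To make the grammar concrete, here is a minimal recursive-descent parser sketched in Python. It is not the JavaCC parser written on the course, and it rests on several assumptions the grammar does not state: whitespace is skipped between tokens outside element content, elements become Python dicts with the element name and content stored under "name" and "content", the name rule is approximated with \w (Python's re module has no \p{L} classes), and \uXXXX escapes and the ban on control characters in content are left out. The names Parser and parse are hypothetical.

```python
import json
import re

NAME = re.compile(r"[\w:$]+")  # approximates [\p{L}\p{N}:_$]+ from the grammar
SCALAR = re.compile(
    r"true|false|null|-?(?:0|[1-9][0-9]*)(?:\.[0-9]+)?(?:[eE][+-]?[0-9]+)?")
ESCAPES = {"n": "\n", "t": "\t", "r": "\r", "b": "\b", "f": "\f",
           "\\": "\\", "/": "/", '"': '"', "<": "<", ">": ">"}

class Parser:
    def __init__(self, text):
        self.s, self.i = text, 0

    def peek(self, skip_ws=False):
        while skip_ws and self.s[self.i:self.i + 1].isspace():
            self.i += 1
        return self.s[self.i:self.i + 1]  # "" at end of input

    def expect(self, ch):
        if self.peek() != ch:
            raise ValueError(f"expected {ch!r} at offset {self.i}")
        self.i += 1

    def string(self):  # JSON string syntax, delegated to the json module
        value, self.i = json.JSONDecoder().raw_decode(self.s, self.i)
        return value

    def value(self):
        c = self.peek(skip_ws=True)
        handler = {"[": self.array, "<": self.element, '"': self.string}.get(c)
        if handler:
            return handler()
        m = SCALAR.match(self.s, self.i)
        if not m:
            raise ValueError(f"unexpected input at offset {self.i}")
        self.i = m.end()
        return json.loads(m.group())  # true/false/null and JSON numbers

    def array(self):
        self.expect("[")
        items = []
        if self.peek(skip_ws=True) != "]":
            items.append(self.value())
            while self.peek(skip_ws=True) == ",":
                self.i += 1
                items.append(self.value())
        self.expect("]")
        return items

    def element(self):
        self.expect("<")
        elem = {}
        while self.peek(skip_ws=True) not in ("|", ">", ""):
            m = NAME.match(self.s, self.i)
            if self.peek() == '"':
                key = self.string()
            elif m:
                key, self.i = m.group(), m.end()
            else:
                raise ValueError(f"expected a name at offset {self.i}")
            if self.peek(skip_ws=True) == "=":
                self.i += 1
                elem[key] = self.value()
            elif not elem:
                elem["name"] = key  # a leading bare name is the element name
            else:
                raise ValueError(f"expected '=' at offset {self.i}")
        if self.peek() == "|":
            self.i += 1
            elem["content"] = self.content()
        self.expect(">")
        return elem

    def content(self):
        parts = []
        while self.peek() not in (">", ""):
            if self.peek() == "<":
                parts.append(self.element())
                continue
            if self.peek() == "\\":  # escaped character; \uXXXX is omitted here
                text = ESCAPES.get(self.s[self.i + 1:self.i + 2])
                if text is None:
                    raise ValueError(f"bad escape at offset {self.i}")
                self.i += 2
            else:
                text = self.peek()
                self.i += 1
            if parts and isinstance(parts[-1], str):
                parts[-1] += text  # keep adjacent text in a single string
            else:
                parts.append(text)
        return parts

def parse(text):
    p = Parser(text)
    result = p.value()
    if p.peek(skip_ws=True):
        raise ValueError(f"trailing input at offset {p.i}")
    return result
```

Under these assumptions, parse('<para|Here is some <b|bold> text>') yields a dict whose "content" is a list of two strings with the dict for the b element between them.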

The FtanML Data Model

A Value is either an Array, an Element, a string, a number, a boolean or null.

An Array is a sequence of Values.

An Element is a set of attributes. An attribute is a name-value pair, where the name is any string and the value is any Value. (An element is thus equivalent to a JSON Object.)

Two particular attributes are treated specially in the surface syntax, but not in the data model. The attribute named "name" represents what in XML is called the "element type name", and the syntax <name="hr"> may therefore be abbreviated to <hr>. The attribute named "content" represents what in XML is modelled as the children of the element, and its value is an array consisting of a sequence of elements and strings (with adjacent strings and zero-length strings not permitted). The content of an element is normally written in the form <p|Here is <b|bold> text> but in data model terms this is equivalent to <p content=["Here is ", <b content=["bold"]>, " text"]>.
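
As an illustration only, the equivalence can be pictured in Python by modelling an element as a dict and an array as a list. This representation is an assumption of the sketch (the course implementation used Java objects), and normalize_content is a hypothetical helper that enforces the rule that content holds no zero-length strings and no adjacent strings.

```python
# <p|Here is <b|bold> text> in data model terms: "name" and "content"
# are ordinary attributes of the element.
para = {
    "name": "p",
    "content": ["Here is ", {"name": "b", "content": ["bold"]}, " text"],
}


def normalize_content(items):
    """Drop zero-length strings and merge adjacent strings in a content array."""
    out = []
    for item in items:
        if isinstance(item, str):
            if item == "":
                continue  # zero-length strings are not permitted
            if out and isinstance(out[-1], str):
                out[-1] += item  # adjacent strings are merged into one
                continue
        out.append(item)
    return out
```

A program that builds content piece by piece could call normalize_content before handing the element on, so that the array always has the shape the data model requires.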

Examples

A JSON-like array:

[1, 2, "abc", [1, 2]]

A JSON-like map:

<x=1 y=2 label="box" "corner coordinates"=[[0,1], [0,3], [1,3], [1,1]]>

Note: the attribute names do not require quotes unless they contain special characters not allowed in a name. The presence or absence of quotes is not exposed in the data model. The quotes around attribute values are required if the value is a string, but not otherwise.

An XML-like element:

<para|Here is some <b|bold> text>

The equivalent of the XML "start tag" is the material before the vertical bar; to the right of the vertical bar is the (optional) content; the end tag is reduced to a right angle bracket.

An XML-like element with attributes, and an empty element:

<para id="p123"|Line 1<br>Line 2>

An element named "list" containing three unnamed elements:

<list|<|red><|green><|blue>>

An element having a class attribute but no name:

<p|This is <class="u"|underlined>>

An element with no name, attributes, or content:

<>

Postscript

By the end of the course we implemented a parser for FtanML (using JavaCC), which generated Java objects corresponding to the object model. We also implemented serialization of the object model to FtanML, JSON, and XML, and wrote some parser tests and test applications for the object model.
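
The serialization step is easy to picture. The following Python sketch is not the course's Java code; it assumes elements are dicts and arrays are lists, uses the abbreviated syntax for "name" and "content" where possible, and approximates the name rule with \w. The function names are hypothetical, and control characters in content are not escaped.

```python
import json
import re

NAME = re.compile(r"[\w:$]+")  # approximates [\p{L}\p{N}:_$]+ from the grammar


def write_name(s):
    """Write a name bare if the grammar allows it, otherwise as a quoted string."""
    return s if NAME.fullmatch(s) else json.dumps(s, ensure_ascii=False)


def escape_text(s):
    """Escape the characters that may not appear literally in element content."""
    return s.replace("\\", "\\\\").replace("<", "\\<").replace(">", "\\>")


def to_ftanml(value):
    if isinstance(value, dict):
        name = value.get("name")
        named = isinstance(name, str)  # only string names use the short form
        parts = [write_name(name)] if named else []
        for key, val in value.items():
            if key == "content" or (key == "name" and named):
                continue
            parts.append(f"{write_name(key)}={to_ftanml(val)}")
        out = "<" + " ".join(parts)
        if "content" in value:  # assumed to be a list of strings and elements
            out += "|" + "".join(
                escape_text(c) if isinstance(c, str) else to_ftanml(c)
                for c in value["content"])
        return out + ">"
    if isinstance(value, list):
        return "[" + ", ".join(to_ftanml(v) for v in value) + "]"
    return json.dumps(value, ensure_ascii=False)  # string, number, boolean, null
```

Strings, numbers, booleans, and null are delegated to json.dumps because the grammar defines them as in JSON.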

Analyzing what we did, I think there are two things FtanML does particularly well:

(a) It adds mixed content to JSON at the lexical level, without adding any complexity to the data model. So programming against JSON is just as easy as it was before, but "document-like" information is no longer excluded.

(b) It provides a document-oriented markup language that is both richer and simpler than XML. Richer, because it generalizes what attributes can contain (not just strings, but numbers, booleans, arrays, and elements); simpler, because (i) there is less syntactic baggage (<b|bold text> instead of <b>bold text</b>, and a simpler escape mechanism), and (ii) it cuts out many rarely used but complicating features such as comments, processing instructions, CDATA sections, entity references, and XML declarations.

Of course, the fact that we have a better mousetrap will not in itself cause the world to beat a path to our door; but I hope the ideas will prove influential.

There are some things we didn't tackle or decide; here is a list of what one might want to do next.

Parent pointers in the data model

Should an element (= map, = JSON object) have a pointer to its parent? In the XML world we expect this; in the JSON world we don't. There are advantages and disadvantages both ways. I think this remains an open question.

Whitespace

In element content, is whitespace significant? I think it's important that it should be possible to include both significant and insignificant whitespace, and that the two should be clearly distinguished. One suggestion is that any sequence of whitespace characters that follows a backslash should be considered insignificant. Another is the reverse: whitespace is normalized/collapsed except for any whitespace character that is escaped with "\".
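
The two suggestions can be compared with a small Python sketch that works on the raw text of element content, before any other escape processing. Neither rule was decided on the course; the function names are hypothetical, and reading "normalized/collapsed" as "each run of unescaped whitespace becomes a single space" is an assumption.

```python
import re


def drop_backslashed_whitespace(raw):
    """Suggestion 1: a backslash followed by whitespace marks that run as insignificant."""
    # Other escape pairs such as \\ or \< are matched as a unit and kept as-is.
    return re.sub(
        r"\\(\s+|.)",
        lambda m: "" if m.group(1).isspace() else m.group(0),
        raw,
        flags=re.DOTALL,
    )


def collapse_unescaped_whitespace(raw):
    """Suggestion 2: collapse whitespace runs, keeping backslash-escaped whitespace."""
    def fix(m):
        token = m.group(0)
        if token[0] == "\\":
            # An escaped whitespace character is kept literally, minus its backslash.
            return token[1] if token[1].isspace() else token
        return " "  # an unescaped run collapses to one space
    return re.sub(r"\\.|\s+", fix, raw, flags=re.DOTALL)
```

Under the first rule an author can break and indent a long line by ending it with a backslash; under the second, indentation is free everywhere and only deliberately escaped whitespace survives.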

Namespaces

I think there are three possible approaches to namespaces. (i) do nothing, as in JSON. (ii) do namespaces the XML way. (iii) provide a minimal namespaces facility.

Since FtanML generally adopts the approach of doing what is right without concern for compatibility, my preference is (iii). I think a simple namespace scheme might work as follows:

The "absolute name" of an element is in the form ":com.saxonica.project.para", that is a name rather like a Java class name using inverse DNS names by convention to achieve uniqueness. The element name can always be written literally in this form. Alternatively, it can be written as a short name "para" in which case it is implicitly in the same namespace as its parent element. The name that appears in the data model is always the absolute name; the short name is merely an authoring convenience. Attribute names are in no namespace unless they are written in full as absolute names.
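
A sketch of how short names might be resolved, in Python and under several assumptions: elements are dicts with the name under "name" and the children under "content"; the namespace of an absolute name is everything before its last dot; an unnamed element passes its parent's namespace through; a short name with no enclosing namespace is left alone; and elements nested inside other attribute values are not visited. The function resolve_names is hypothetical.

```python
def resolve_names(elem, parent_ns=None):
    """Return a copy of elem in which short element names are made absolute."""
    out = dict(elem)  # shallow copy; attributes are left untouched
    name = elem.get("name")
    ns = parent_ns
    if isinstance(name, str):
        if name.startswith(":"):
            # Absolute name: children inherit the part before the last dot.
            ns = name.rpartition(".")[0] or None
        elif parent_ns is not None:
            out["name"] = f"{parent_ns}.{name}"
    if "content" in elem:
        out["content"] = [
            resolve_names(child, ns) if isinstance(child, dict) else child
            for child in elem["content"]
        ]
    return out
```

With the document <:com.saxonica.project.para|Here is <b|bold> text>, the inner element would come out named ":com.saxonica.project.b".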

Path/query language

We clearly need an XPath-like language to address into FtanML documents, and the data model is sufficiently different from XDM that XPath itself won't really do the job. For example, the arrays in the data model are subtly different from XDM sequences (a singleton item is not the same as an array containing that item) and this has considerable implications.
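
The difference can be illustrated in Python, with lists standing in for FtanML arrays. The helper xdm_flatten is hypothetical and only mimics the way XDM sequences behave: they never nest, and a single item is indistinguishable from a sequence of length one.

```python
def xdm_flatten(value):
    """Flatten nested lists into one flat list, as XDM sequences would."""
    if not isinstance(value, list):
        return [value]  # a single item behaves like a one-item sequence
    out = []
    for item in value:
        out.extend(xdm_flatten(item))
    return out


# In the FtanML data model these are three distinct values.
single, wrapped, nested = 1, [1], [[1, 2], 3]
```

In the FtanML model, single and wrapped differ and nested keeps its inner array, while after xdm_flatten the first two become the same list and the nesting is gone. A path language for FtanML would have to preserve those distinctions.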

Schema language

Given time and effort, I would propose a schema language for FtanML that is rule-based and that incorporates predicate-based validation, type assignment, and grammar rules. That is, the ability to say "for an element that matches this pattern, the following rules must be satisfied", where the rules include the ability to specify a named type, to add restrictions such as min and max values and regex patterns, or, if the type is "element", to specify the required content of the element as a grammar.
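
No such schema language exists yet, so the following Python sketch only illustrates the shape of the idea. Everything in it is an assumption: elements are dicts, a schema is a list of (pattern, attribute, check) triples, and the content grammar is expressed as a regular expression over the space-terminated names of the child elements.

```python
import re


def check_number(lo=None, hi=None):
    """Rule: the value is a number within optional min and max bounds."""
    def rule(v):
        return (isinstance(v, (int, float)) and not isinstance(v, bool)
                and (lo is None or v >= lo) and (hi is None or v <= hi))
    return rule


def check_string(pattern):
    """Rule: the value is a string matching a regex pattern in full."""
    rx = re.compile(pattern)
    return lambda v: isinstance(v, str) and rx.fullmatch(v) is not None


def check_children(grammar):
    """Rule: the child element names, each followed by a space, match a grammar."""
    rx = re.compile(grammar)
    def rule(v):
        if not isinstance(v, dict):
            return False
        names = "".join(f"{c.get('name', '')} "
                        for c in v.get("content", []) if isinstance(c, dict))
        return rx.fullmatch(names) is not None
    return rule


def validate(value, schema, path="/"):
    """Apply every rule whose pattern matches an element; return error messages."""
    errors = []
    if isinstance(value, dict):
        for matches, attr, check in schema:
            if matches(value):
                target = value if attr is None else value.get(attr)
                if not check(target):
                    errors.append(f"{path}: rule on {attr or 'element'} failed")
        for key, child in value.items():
            errors += validate(child, schema, f"{path}{key}/")
    elif isinstance(value, list):
        for n, child in enumerate(value):
            errors += validate(child, schema, f"{path}{n}/")
    return errors


schema = [
    (lambda e: e.get("name") == "box", "x", check_number(0, 100)),
    (lambda e: e.get("name") == "box", "label", check_string(r"[a-z]+")),
    (lambda e: e.get("name") == "list", None, check_children(r"(item )+")),
]
```

Here the first element of each triple plays the role of "an element that matches this pattern", and the check is the rule that must then be satisfied.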

Finally

I don't know if FtanML will go any further. But developing it was a great experience and a good way of spending a couple of weeks with a group of very talented students. I think it was a good learning experience for them; and perhaps, just perhaps, it will sow some ideas in the community that will influence the future of markup languages.