For various internal and performance reasons, we're making some changes to Saxon's internal Receiver interface for the next release. This interface is a SAX-like interface for sending an XML document (or in general, any XDM instance) from one processing component to another, as a sequence of events such as startElement(), attributes(), characters(), and so on.
The interface is very widely used within Saxon: it handles communication from the
XML parser to the document builder, document validation, serialization, and much else.
It also allows instructions to be executed in "push mode", so for example when XSLT
constructs a result tree, the tree is never actually constructed in memory, but instead
events representing the tree are sent straight from the transformer to the serializer.
I know that although this interface is labelled as internal, some user applications
attempt either to implement the interface or to act as a client, sending events to
one of Saxon's many implementations of the interface. So in making changes, it seems
a good time to recognize that there is a need for an interface at this level, and
that existing candidates are really rather clumsy to use.
Among those candidates are the venerable SAX ContentHandler interface, and the newer StAX XMLStreamWriter interface.
There are a number of structural reasons that make the ContentHandler hard to use:
- SAX offers a number of different configuration options for XML parsers, which cause namespace information to be provided in different ways. But the ContentHandler has no way of discovering which of these options the XML parser (or other originator of events) is actually using.
- It's not actually one interface but several: some events are sent not to the ContentHandler, but to a LexicalHandler or DTDHandler.
- The information available to the ContentHandler doesn't align well with the information defined in the XDM data model; for example, comments are available only to the LexicalHandler, not to the ContentHandler.
In some ways the XMLStreamWriter is an improvement, and I've certainly used it in preference when writing an application that has to construct XML documents in this way. But a major problem with the XMLStreamWriter is that it's underspecified, to the extent that there is a separate guidance document from a third party suggesting how implementations should interpret the spec. Again, the main culprit is namespaces.
One of the practical problems with all these event-based interfaces is that debugging can be very difficult. In particular, if you forget to issue an endElement() call, you don't find out until the endDocument() event finds there's a missing end tag somewhere, and tracking down where the unmatched startElement() is in a complex program can be a nightmare. I decided that addressing this problem should be one of the main design aims of a new interface -- and it turns out that it isn't difficult.
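To make the problem concrete, here is a minimal sketch using the standard XMLStreamWriter, assuming the StAX implementation bundled with the JDK. The class name UnbalancedDemo and the element names are invented for illustration. The Javadoc for writeEndDocument() says it closes any start tags that are still open, so in this case the forgotten end tag is not reported at all: the output is well-formed, but b ends up nested inside a rather than beside it.

```java
import java.io.StringWriter;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamWriter;

// Hypothetical demo class, not part of Saxon.
final class UnbalancedDemo {

    /** Writes root/a/b but "forgets" to close a; returns whatever was serialized. */
    static String write() {
        StringWriter sw = new StringWriter();
        try {
            XMLStreamWriter writer = XMLOutputFactory.newInstance().createXMLStreamWriter(sw);
            writer.writeStartElement("root");
            writer.writeStartElement("a");
            // Deliberate bug: the writeEndElement() call for <a> is missing here.
            writer.writeStartElement("b");
            writer.writeEndElement();   // closes <b>
            writer.writeEndDocument();  // documented to close any start tags still open
            writer.close();
        } catch (XMLStreamException e) {
            throw new IllegalStateException(e);
        }
        return sw.toString();
    }

    public static void main(String[] args) {
        // Inspect the output to see where <b> ended up.
        System.out.println(write());
    }
}
```

Nothing in the calling code points at the line where the end tag was omitted; the only symptom is a wrongly shaped document.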
Let's show off the new design with an example. Here is some code from Saxon's InvalidityReportGenerator, which generates an XML report of errors found during a schema validation episode, using the XMLStreamWriter interface:
    writer.writeStartElement(REPORT_NS, "meta-data");
    writer.writeStartElement(REPORT_NS, "validator");
    writer.writeAttribute("name", Version.getProductName() + "-" + getConfiguration().getEditionCode());
    writer.writeAttribute("version", Version.getProductVersion());
    writer.writeEndElement(); // </validator>
    writer.writeStartElement(REPORT_NS, "results");
    writer.writeAttribute("errors", "" + errorCount);
    writer.writeAttribute("warnings", "" + warningCount);
    writer.writeEndElement(); // </results>
    writer.writeStartElement(REPORT_NS, "schema");
    if (schemaName != null) {
        writer.writeAttribute("file", schemaName);
    }
    writer.writeAttribute("xsd-version", xsdversion);
    writer.writeEndElement(); // </schema>
    writer.writeStartElement(REPORT_NS, "run");
    writer.writeAttribute("at", DateTimeValue.getCurrentDateTime(null).getStringValue());
    writer.writeEndElement(); // </run>
    writer.writeEndElement(); // </meta-data>
And here is the equivalent using the new push API:
    Push.Element metadata = report.element("meta-data");
    metadata.element("validator")
        .attribute("name", Version.getProductName() + "-" + getConfiguration().getEditionCode())
        .attribute("version", Version.getProductVersion());
    metadata.element("results")
        .attribute("errors", "" + errorCount)
        .attribute("warnings", "" + warningCount);
    metadata.element("schema")
        .attribute("file", schemaName)
        .attribute("xsd-version", xsdversion);
    metadata.element("run")
        .attribute("at", DateTimeValue.getCurrentDateTime(null).getStringValue());
    metadata.close();
What's different? The most obvious difference is that the method for creating a new element returns an object (a Push.Element) which is used for constructing the attributes and children of the element. This gives it an appearance rather like a tree-building API, but this is an illusion: the objects created are transient. Methods such as attribute() use the "chaining" design - they return the object to which they are applied - making it easy to apply further methods to the same object, without the need to bind variables. The writeEndElement() calls have disappeared - an element is closed automatically when the next child is written to the parent element, which we can do because we know which element the child is being attached to.
There are a few other features of the design worthy of attention:
- Names of elements and attributes can be supplied either as a plain local name, or as a QName object. A plain local name is interpreted as being in the default namespace in the case of elements (the default namespace can be set at any level), or as being in no namespace in the case of attributes. For the vast majority of documents, there is never any need to use QNames; very often the only namespace handling is a single call on setDefaultNamespace().
- The close() method on elements (which generates the end tag) is optional. If you write another child element, the previous child is closed automatically. If you close a parent element, any unclosed child element is closed automatically. The specimen code above shows one call on close(), which is useful in this case for readability: the reader can see that no further children are going to be added.
- The argument of methods such as attribute() and text() that supplies the content may always be null. If the content is null, no attribute or text node is written. This makes it easier to handle optional content without disrupting the method chaining.
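The auto-closing and null-skipping behaviour described above can be sketched in a few dozen lines. This is a toy illustration only, not Saxon's implementation: the class MiniPush, its root() method and the StringBuilder target are all invented for the example, and namespaces are ignored entirely. It shows the general technique: each element handle remembers its most recent open child, and closes it when the next child is started or when the parent itself is closed.

```java
// Toy illustration only: this is NOT Saxon's Push implementation.
final class MiniPush {

    /** Starts a (hypothetical) root element that writes markup to 'out'. */
    static Element root(StringBuilder out, String name) {
        return new Element(out, name);
    }

    /** A transient handle on one element; no tree is built. */
    static final class Element {
        private final StringBuilder out;
        private final String name;
        private Element openChild;            // most recent child, if not yet closed
        private boolean startTagOpen = true;  // true until the start tag is completed
        private boolean closed = false;

        private Element(StringBuilder out, String name) {
            this.out = out;
            this.name = name;
            out.append('<').append(name);
        }

        /** Writes an attribute unless value is null; returns this for chaining. */
        Element attribute(String attName, String value) {
            ensureOpen();
            if (!startTagOpen) {
                throw new IllegalStateException("Attribute written after content of <" + name + ">");
            }
            if (value != null) {
                out.append(' ').append(attName).append("=\"").append(escape(value)).append('"');
            }
            return this;
        }

        /** Writes a text node unless value is null; returns this for chaining. */
        Element text(String value) {
            ensureOpen();
            if (value != null) {
                prepareForContent();
                out.append(escape(value));
            }
            return this;
        }

        /** Starts a child element, first closing any previous child still open. */
        Element element(String childName) {
            prepareForContent();
            openChild = new Element(out, childName);
            return openChild;
        }

        /** Writes the end tag, closing any open child first. Calling it twice is harmless. */
        void close() {
            if (closed) {
                return;
            }
            if (startTagOpen) {
                out.append("/>");             // no content was written
            } else {
                if (openChild != null) {
                    openChild.close();
                }
                out.append("</").append(name).append('>');
            }
            closed = true;
        }

        private void prepareForContent() {
            ensureOpen();
            if (startTagOpen) {
                out.append('>');
                startTagOpen = false;
            }
            if (openChild != null) {
                openChild.close();
                openChild = null;
            }
        }

        private void ensureOpen() {
            if (closed) {
                throw new IllegalStateException("Element <" + name + "> is already closed");
            }
        }

        private static String escape(String s) {
            return s.replace("&", "&amp;").replace("<", "&lt;").replace("\"", "&quot;");
        }
    }

    public static void main(String[] args) {
        StringBuilder sb = new StringBuilder();
        Element meta = root(sb, "meta-data");
        meta.element("results").attribute("errors", "2").attribute("warnings", "0");
        meta.element("schema").attribute("file", null).attribute("xsd-version", "1.1");
        meta.close();
        System.out.println(sb);
        // prints <meta-data><results errors="2" warnings="0"/><schema xsd-version="1.1"/></meta-data>
    }
}
```

In this sketch, writing to an element that has already been closed fails immediately with an IllegalStateException, so a misplaced call is reported at the point where it happens rather than at the end of the document. Whether and how a real implementation reports such errors is a separate design decision.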
I have rewritten several classes that construct content using push APIs to use this interface, and the resulting readability is very encouraging.