Navigating XML trees using Java Streams

For the next major Saxon release I am planning an extension to the s9api interface to exploit the facilities of Java 8 streams to allow powerful navigation of XDM trees: the idea is that navigation should be as easy as using XPath, but without the need to drop out of Java into a different programming language. To give a flavour, here is how you might select the elements within a document that have @class='hidden':

doc.select(descendant(isElement())
.where(attribute("class").eq("hidden")))

We'll see how that works in due course.

Why do we need it?

The combination of Java and XML is as powerful and ubiquitous today as it as been for nearly twenty years. Java has moved on considerably (notably, as far as this article is concerned, with the Java 8 Streams API), and the world of XML processing has also made great strides (we now have XSLT 3.0, XPath 3.1, and XQuery 3.1), but for some reason the two have not moved together. The bulk of Java programmers manipulating XML, if we can judge from the questions they ask on forums such as StackOverflow, are still using DOM interfaces, perhaps with a bit of XPath 1.0 thrown in.

DOM shows its age. It was originally designed for HTML, with XML added as an afterthought, and XML namespaces thrown in as a subsequent bolt-on. Its data model predates the XML Infoset and the (XPath-2.0-defined) XDM model. It was designed as a cross-language API and so the designers deliberately eschewed the usual Java conventions and interfaces in areas such as the handling of collections and iterators, not to mention exceptions. It does everything its own way. As a navigational API it carries a lot of baggage because the underlying tree is assumed to be mutable. Many programmers only discover far too late that it's not even thread-safe (even when you confine yourself to retrieval-only operations).

There are better APIs than DOM available (for example JDOM2 and XOM) but they're all ten years old and haven't caught up with the times. There's nothing in the Java world that compares with Linq for C# users, or ElementTree in Python.

The alternative of calling out from Java to execute XPath or XQuery expressions has its own disadvantages. Any crossing of boundaries from one programming language to another involves data conversions and a loss of type safety. Embedding a sublanguage in the form of character strings within a host language (as with SQL and regular expressions) means that the host language compiler can't do any static syntax checking or type checking of the expressions in the sublanguage. Unless users go to some effort to avoid it, it's easy to find that the cost of compiling XPath expressions is incurred on each execution, rather than being incurred once and amortized. And the API for passing context from the host language to the sublanguage can be very messy. It doesn't have to be quite as messy as the JAXP interface used for invoking XPath from Java, but it still has to involve a fair bit of complexity.

Of course, there's the alternative of not using Java (or other general-purpose programming languages) at all: you can write the whole application in XSLT or XQuery. Given the capability that XSLT 3.0 and XQuery 3.1 have acquired, that's a real possibility far more often than most users realise. But it remains true that if only 10% of your application is concerned with processing XML input, and the rest is doing something more interesting, then writing the whole application in XQuery would probably be a poor choice.

Other programming languages have developed better APIs. Javascript has JQuery, C# programmers have Linq, Scala programmers have something very similar, and PHP users have SimpleXML. These APIs all have the characteristic that they are much more deeply integrated into the host language, and in particular they exploit the host language primitives for manipulation of sequences through functional programming constructs, with a reasonable level of type safety given that the actual structure of the XML document is not statically known.

That leads to the question of data binding interfaces: broadly, APIs that exploit static knowledge of the schema of the source document. Such APIs have their place, but I'm not going to consider them any further in this article. In my experience they can work well if the XML schema is very simple and very stable. If the schema is complex or changing, data binding can be a disaster.

The Java 8 Streams API

This is not the place for an extended tutorial on the new Streams API introduced in Java 8. If you haven't come across it, I suggest you find a good tutorial on the web and read it before you go any further.

Java Streams are quite unrelated to XSLT 3.0 streaming. Well, almost unrelated: they share the same high-level objectives of processing large collections of data in a declarative way, making maximum use of lazy evaluation to reduce memory use, and permitting parallel execution. But that's where the similarity ends. Perhaps the biggest difference is that Java 8 streams are designed to process linear data structures (sequences), whereas XSLT 3.0 streaming is designed to process trees.

But just to summarise:

Java 8 introduces a new interface, Stream<X>, representing a linear sequence of items of type X
Like iterators, streams are designed to be used once. Unlike iterators, they are manipulated using functional operations, most notably maps and filters, rather than being processed one item at a time. This makes for less error-prone programming, and allows parallel execution.

The functional nature of the Java 8 Streams API means it has much in common with the processing model of XPath. The basic thrust of the API design presented in this article is therefore to reproduce the primitives of the XPath processing model, re-expressing them in terms of the constructs provided by the Java 8 Streams API.

If the design appears to borrow concepts from other APIs such as LINQ and Scala and SimpleXML, that's not actually because I have a deep familiarity with those APIs: in fact, I have never used them in anger, and I haven't attempted to copy anything across literally. Rather, any similarity is because the functional concepts of XPath processing map so cleanly to this approach.

The Basics of the Saxon s9api API

The Saxon product primarily exists to enable XSLT, XQuery, XPath, and XML Schema processing. Some years ago I decided that the standard APIs (JAXP and XQJ) for invoking such functionality were becoming unfit for purpose. They had grown haphazardly over the years, the various APIs didn't work well together, and they weren't being updated to exploit the newer versions of the W3C specifications. Some appalling design mistakes had been unleashed on the world, and the strict backwards compatibility policy of the JDK meant these could never be corrected.

To take one horrid example: the NamespaceContext interface is used to pass a set of namespace bindings from a Java application to an XPath processor. To implement this interface, you need to implement three methods, of which the XPath processor will only ever use one (getNamespaceURI(prefix)). Yet at the same time, there is no way the XPath processor can extract the full set of bindings defined in the NamespaceContext and copy them into its own data structures.

So I decided some years ago to introduce a proprietary alternative called s9api into the Saxon product (retaining JAXP support alongside), and it has been a considerable success, in that it has withstood the test of time rather well. The changes to XSLT transformation in 3.0 were sufficiently radical that I forked the XsltTransformer interface to create a 3.0 version, but apart from that it has been largely possible to add new features incrementally. That's partly because of a slightly less obsessive attitude to backwards compatibility: if I decide that something was a bad mistake, I'm prepared to change it.

Although s9api is primarily about invoking XSLT, XQuery, and XPath processing, it does include classes that represent objects in the XDM data model, and I will introduce these briefly because the new navigation API relies on these objects as its foundation. The table below lists the main classes.

Class	Description
`XdmValue`	Every value is the XDM model is a sequence of items. The `XdmValue` class is therefore the top of the class hierarchy. Because it's a sequence, it implements `Iterable<XdmItem>`, so you can use a Java `foreach` loop to process the items sequentially. In the latest version I have used Java generics to add a type parameter, so `XdmValue<XdmNode>` is a sequence of nodes, and `XdmValue<XdmAtomicValue>` is a sequence of atomic values. As well as an `iterator()` method, it has an `itemAt()` method to get the Nth item, and a `size()` method to count the items. Internally an `XdmValue` might exist as an actual sequence in memory, or as a "promise": sufficient data to enable the items to be materialized when they are needed.
`XdmItem`	This class represents an Item in the XDM model. As such it is both a component of an `XdmValue`, and also an `XdmValue` (of length one) in its own right. It's an abstract class, because every item is actually something more specific (a node, an atomic value, a function). Some of the methods inherited from `XdmValue` become trivial (for example `size()` always returns 1).
`XdmNode`	This is a subclass of `XdmItem` used to represent nodes. Unlike many models of XML, we don't subclass this for different kinds of node: that's mainly because XDM has deliberately aimed at uniformity, with the same accessors available for all node kinds. Many of the methods on `XdmNode`, such as `getNodeName()`, `getStringValue()`, `getTypedValue()`, and `getNodeKind()`, are directly equivalent to accessors defined in the W3C XDM specification. But in addition, `XdmNode` has a method `axisIterator` to navigate the tree using any of the XPath axes, the result being returned as an iterator over the selected nodes.
`XdmAtomicValue`	Another subclass of `XdmItem`, this is used to represent atomic values in the XDM model. As with `XdmNode`, we don't define further subclasses for different atomic types. There are convenience methods to convert `XdmAtomicValue` instances to and from equivalent (or near-equivalent) Java classes such as `String`, `Double`, `BigInteger`, and `Date`.
`XdmFunctionItem`	From XPath 3.0, functions are first-class values alongside nodes and atomic values. These are represented in s9api as instances of `XdmFunctionItem`. Two specific subclasses of function, with their own behaviours, are represented using the subclasses `XdmMap` and `XdmArray`. I won't be saying much about these in this article, because I'm primarily concerned with navigating XML trees.

The new API: Steps and Predicates

The basic concept behind the new extensions to the s9api API is navigation using steps and predicates. I'll introduce these concepts briefly in this section, and then go on to give a more detailed exposition.

The class XdmValue<T> acquires a new method:

XdmStream select(Step step)

The Step here is a function that takes an item of class T as its input, and returns a stream of items. If we consider a very simple Step, namely child(), this takes a node as input and returns a stream of nodes as its result. We can apply this step to an XdmValue consisting entirely of nodes, and it returns the concatenation of the streams of nodes obtained by applying the step to each node in the input value. This operation is equivalent to the "!" operator in XPath 3.0, or to the flatMap() method in many functional programming languages. It's not quite the same as the familiar "/" operator in XPath, because it doesn't eliminate duplicates or sort the result into document order. But for most purposes it does the same job.

There's a class net.sf.saxon.s9api.streams.Steps containing static methods which provide commonly-used steps such as child(). In my examples, I'll assume that the Java application has import net.sf.saxon.s9api.streams.Steps.*; in its header, so it can use these fields and methods without further qualification.

One of the steps defined by this class is net.sf.saxon.s9api.streams.Steps.child(): this step is a function which, given a node, returns its children. There are other similar steps for the other XPath axes. So you can find the children of a node N by writing N.select(child()).

Any two steps S and T can be combined into a single composite step by writing S.then(T): for example Step grandchildren = child().then(child()) gives you a step which can be used in the expression N.select(grandchildren) to select all the grandchildren.

The class Step inherits from the standard Java class Function, so it can be used more generally in any Java context where a Function is required.

Predicate<T> is a standard Java 8 class: it defines a function that can be applied to an object of type T to return true or false. The class net.sf.saxon.s9api.streams.Predicates defines some standard predicates that are useful when processing XML. For example isElement() gives you a predicate that can be applied to any XdmItem to determine if it is an element node.

Given a Step A and a Predicate P, the expression A.where(P) returns a new Step that filters the results of A to include only those items that satisfy the predicate P. So, for example, child().where(isElement()) is a step that selects the element children of a node, so that N.select(child().where(isElement())) selects the element children of N. This is sufficiently common that we provide a shorthand: it can also be written N.select(child(isElement())).

The predicate hasLocalName("foo") matches nodes having a local name of "foo": so N.select(child().where(hasLocalName("foo")) selects the relevant children. Again this is so common that we provide a shorthand: N.select(child("foo")). There is also a two argument version child(ns, "foo") which selects children with a given namespace URI and local name.

Another useful predicate is exists(step) which tests whether the result of applying a given step returns at least one item. So, for example N.select(child().where(exists(attribute("id")))) returns those children of N that have an attribute named "id".

The result of the select() method is always a stream of items, so you can use methods from the Java Stream class such as filter() and flatMap() to process the result. Here are some of the standard things you can do with a stream of items in Java:

You can get the results as an array: N.select(child()).toArray()
Or as a list: N.select(child()).collect(Collectors.toList())
You can apply a function to each item in the stream: N.select(child()).forEach(System.err::println)
You can get the first item in the stream: N.select(child()).findFirst().get()

However, Saxon methods such as select() always return a subclass of Stream called XdmStream, and this offers additional methods. For example:

You can get the results as an XdmValue: N.select(child()).asXdmValue()
A more convenient way to get the results as a Java List: N.select(child()).asList()
If you know that the stream contains a single node (or nothing), you can get this using the methods asNode() or asOptionalNode()
Similarly, if you know that the stream contains a single atomic value (or nothing), you can get this using the methods asAtomic() or asOptionalAtomic()
You can get the last item in the stream: N.select(child("para")).last()

More about Steps

The actual definition of the Step class is:

public abstract class Step<T extends XdmItem> implements Function<XdmItem, Stream<? extends T>>

What that means is that it's a function that any XdmItem as input, and delivers a stream of U items as its result (where U is XdmItem or some possibly-different subclass). (I experimented by also parameterizing the class on the type of items accepted, but that didn't work out well.)

Because the types are defined, Java can make type inferences: for example it knows that N.select(child()) will return nodes (because child() is a step that returns nodes).

As a user of this API, you can define your own kinds of Step if you want to: but most of the time you will be able to do everything you need with the standard Steps available from the class net.sf.saxon.s9api.stream.Steps. The standard steps include:

The axis steps ancestor(), ancestor-or-self(), attribute(), child(), descendant(), descendantOrSelf(), following(), followingSibling(), namespace(), parent(), preceding(), precedingSibling(), self().
For each axis, three filtered versions: for example child("foo") filters the axis to select elements by local name (ignoring the namespace if any); child(ns, local) filters the axis to select elements by namespace URI and local name, and child(predicate) filters the axis using an arbitrary predicate: this is a shorthand for child().where(predicate).
A composite step can be constructed using the method step1.then(step2). This applies step2 to every item in the result of step1, retaining the order of results and flattening them into a single stream.
A filtered step can be constructed using the method step1.where(predicate1). This selects those items in the result of step1 for which predicate1 returns true.
A path with several steps can be constructed using a call such aspath(child(isElement()), attribute("id")). This returns a step whose effect is to return the id attributes of all the children of the target node.
If the steps are sufficiently simple, a path can also by written means of a simple micro-syntax similar to XPath abbreviated steps. The previous example could also be written path("*", "@id"). Again, this returns a step that can be used like any other step. (In my own applications, I have found myself using this approach very extensively).
The step atomize() extracts the typed values of nodes in the input, following the rules in the XPath specification. The result is a stream of atomic values
The step toString() likewise extracts the string values, while toNumber() has the same effect as the XPath number() function

Last but not least, xpath(path) returns a Step that evaluates an XPath expression. For example, doc.select(xpath("//foo")) has the same effect as doc.select(descendant("foo")). A second argument to the xpath() method may be used to supply a static context for the evaluation. Note that compilation of the XPath expression occurs while the step is being created, not while it is being evaluated; so if you bind the result of xpath("//foo") to a variable, then the expression can be evaluated repeatedly without recompilation.

More about Predicates

The Predicate class is a standard Java 8 interface: it is a function that takes any object as input, and returns a boolean. You can use any predicates you like with this API, but the class net.sf.saxon.s9api.streams.Predicates provides some implementations of Predicate that are particularly useful when navigating XML documents. These include the following:

isElement(), isAttribute(), isText(), isComment(), isDocument(), isProcessingInstruction(), isNamespace() test that the item is a node of a particular kind
hasName("ns", "local"), hasLocalName("n"), and hasNamespaceUri("ns") make tests against the name of the node
hasType(t) tests the type of the item: for example hasType(ItemType.DATE) tests for atomic values of type xs:date
exists(step) tests whether the result of applying the given step is a sequence containing at least one item; conversely empty(step) tests whether the result of the step is empty. For example, exists(CHILD) is true for a node that has children.
some(step, predicate) tests whether at least one item selected by the step satisfies the given predicate. For example, some(CHILD, IS_ELEMENT) tests whether the item is a node with at least one element child. Similarly every(step, predicate) tests whether the predicate is true for every item selected by the step.
eq(string) tests whether the string value of the item is equal to the given string; while eq(double) does a numeric comparison. A two-argument version eq(step, string) is shorthand for some(step, eq(string)). For example, descendant(eq(attribute("id"), "ABC")) finds all descendant elements having an "id" attribute equal to "ABC".
Java provides standard methods for combining predicates using and, or, and not. For example isElement().and(eq("foo")) is a predicate that tests whether an item is an element with string-value "foo".

The XdmStream class

The fact that all this machinery is built on Java 8 streams and functions is something that many users can safely ignore; they are essential foundations, but they are hidden below the surface. At the same time, a user who understands that steps and predicates are Java Functions, and that the result of the select() method is a Java Stream, can take advantage of this knowledge.

One of the key ideas that made this possible was the idea of subclassing Stream with XdmStream. This idea was shamelessly stolen from the open-source StreamEx library by Tagir Valeev (though no StreamEx code is actually used). Subclassing Stream enables additional methods to be provided to handle the results of the stream, avoiding the need for clumsy calls on the generic collect() method. Another motivating factor here is to allow for early exit (short-circuit evaluation) when a result can be delivered without reading the whole stream. Saxon handles this by registering onClose() handlers with the stream pipeline, so that when the consumer of the stream calls the XdmStream.close() method, the underlying supplier of data to the stream is notified that no more data is needed.

Examples

This section provides some examples extracted from an actual program that uses s9api interfaces and does a mixture of Java navigation and XPath and XQuery processing to extract data from an input document.

First, some very simple examples. Constructs like this are not uncommon:

XdmNode testInput = (XdmNode) xpath.evaluateSingle("test", testCase);

This can be replaced with the much simpler and more efficient:

XdmNode testInput = testCase.selectFirst(child("test"));

Similarly, the slightly more complex expression:

XdmNode principalPackage = (XdmNode) xpath.evaluateSingle("package[@role='principal']", testInput);

becomes:

XdmNode principalPackage = testInput.selectFirst(child("package").where(eq(attribute("role"), "principal"));

A more complex example from the same application is this one:

boolean definesError = xpath.evaluate("result//error[starts-with(@code, 'XTSE')]", testCase).size() > 0;

Note here how the processing is split between XPath code and Java code. This is also using an XPath function for which we haven't provided a built-in predicate in s9api. But that's no problem, because we can invoke Java methods as predicates. So this becomes:

boolean definesError = testCase.selectFirst(child("result"), descendant("error").where(
                     some(attribute("code"), (XdmNode n) -> n.getStringValue().startsWith("XTSE"))) != null;