Navigating XML trees using Java Streams
For the next major Saxon release I am planning an extension to the s9api interface to exploit the facilities of Java 8 streams to allow powerful navigation of XDM trees: the idea is that navigation should be as easy as using XPath, but without the need to drop out of Java into a different programming language. To give a flavour, here is how you might select the elements within a document that have @class='hidden':
doc.select(descendant(isElement())
.where(attribute("class").eq("hidden")))
We'll see how that works in due course.
Why do we need it?
The combination of Java and XML is as powerful and ubiquitous today as it as been for nearly twenty years. Java has moved on considerably (notably, as far as this article is concerned, with the Java 8 Streams API), and the world of XML processing has also made great strides (we now have XSLT 3.0, XPath 3.1, and XQuery 3.1), but for some reason the two have not moved together. The bulk of Java programmers manipulating XML, if we can judge from the questions they ask on forums such as StackOverflow, are still using DOM interfaces, perhaps with a bit of XPath 1.0 thrown in.
DOM shows its age. It was originally designed for HTML, with XML added as an afterthought, and XML namespaces thrown in as a subsequent bolt-on. Its data model predates the XML Infoset and the (XPath-2.0-defined) XDM model. It was designed as a cross-language API and so the designers deliberately eschewed the usual Java conventions and interfaces in areas such as the handling of collections and iterators, not to mention exceptions. It does everything its own way. As a navigational API it carries a lot of baggage because the underlying tree is assumed to be mutable. Many programmers only discover far too late that it's not even thread-safe (even when you confine yourself to retrieval-only operations).
There are better APIs than DOM available (for example JDOM2 and XOM) but they're all ten years old and haven't caught up with the times. There's nothing in the Java world that compares with Linq for C# users, or ElementTree in Python.
The alternative of calling out from Java to execute XPath or XQuery expressions has its own disadvantages. Any crossing of boundaries from one programming language to another involves data conversions and a loss of type safety. Embedding a sublanguage in the form of character strings within a host language (as with SQL and regular expressions) means that the host language compiler can't do any static syntax checking or type checking of the expressions in the sublanguage. Unless users go to some effort to avoid it, it's easy to find that the cost of compiling XPath expressions is incurred on each execution, rather than being incurred once and amortized. And the API for passing context from the host language to the sublanguage can be very messy. It doesn't have to be quite as messy as the JAXP interface used for invoking XPath from Java, but it still has to involve a fair bit of complexity.
Of course, there's the alternative of not using Java (or other general-purpose programming languages) at all: you can write the whole application in XSLT or XQuery. Given the capability that XSLT 3.0 and XQuery 3.1 have acquired, that's a real possibility far more often than most users realise. But it remains true that if only 10% of your application is concerned with processing XML input, and the rest is doing something more interesting, then writing the whole application in XQuery would probably be a poor choice.
Other programming languages have developed better APIs. Javascript has JQuery, C# programmers have Linq, Scala programmers have something very similar, and PHP users have SimpleXML. These APIs all have the characteristic that they are much more deeply integrated into the host language, and in particular they exploit the host language primitives for manipulation of sequences through functional programming constructs, with a reasonable level of type safety given that the actual structure of the XML document is not statically known.
That leads to the question of data binding interfaces: broadly, APIs that exploit static knowledge of the schema of the source document. Such APIs have their place, but I'm not going to consider them any further in this article. In my experience they can work well if the XML schema is very simple and very stable. If the schema is complex or changing, data binding can be a disaster.
This is not the place for an extended tutorial on the new Streams API introduced in
Java 8. If you haven't come across
it, I suggest you find a good tutorial on the web and read it before you go any further. Java Streams are quite unrelated to XSLT 3.0 streaming. Well, almost unrelated: they
share the same high-level objectives
of processing large collections of data in a declarative way, making maximum use of
lazy evaluation to reduce memory use,
and permitting parallel execution. But that's where the similarity ends. Perhaps the
biggest difference is that Java 8 streams
are designed to process linear data structures (sequences), whereas XSLT 3.0 streaming
is designed to process trees. But just to summarise: The functional nature of the Java 8 Streams API means it has much in common with the
processing model of XPath. The basic thrust
of the API design presented in this article is therefore to reproduce the primitives
of the XPath processing model, re-expressing
them in terms of the constructs provided by the Java 8 Streams API. If the design appears to borrow concepts from other APIs such as LINQ and Scala and
SimpleXML, that's not actually because I have a deep familiarity
with those APIs: in fact, I have never used them in anger, and I haven't attempted
to copy anything across literally. Rather, any similarity is because
the functional concepts of XPath processing map so cleanly to this approach. The Saxon product primarily exists to enable XSLT, XQuery, XPath, and XML Schema processing.
Some years ago I decided that the standard
APIs (JAXP and XQJ) for invoking such functionality were becoming unfit for purpose.
They had grown haphazardly over the years, the various APIs
didn't work well together, and they weren't being updated to exploit the newer versions
of the W3C specifications. Some appalling design mistakes
had been unleashed on the world, and the strict backwards compatibility policy of
the JDK meant these could never be corrected. To take one horrid example: the
So I decided some years ago to introduce a proprietary alternative called s9api into the Saxon product (retaining JAXP support alongside), and it has been a considerable
success, in that it has withstood the test of time rather well. The changes to XSLT
transformation in 3.0 were sufficiently radical that I forked the
Although s9api is primarily about invoking XSLT, XQuery, and XPath processing, it
does include classes that represent objects in the XDM data model,
and I will introduce these briefly because the new navigation API relies on these
objects as its foundation. The table below lists the main classes. The basic concept behind the new extensions to the s9api
API is navigation using steps and predicates. I'll introduce
these concepts briefly in this section, and then go on to
give a more detailed exposition. The class The There's a class One of the steps defined by this class is Any two steps The class Given a The predicate Another useful predicate is The result of the However, Saxon methods such as The actual definition of the What that means is that it's a function that any Because the types are defined, Java can make type inferences: for example it knows
that As a user of this API, you can define your own kinds of Last but not least, The The fact that all this machinery is built on Java 8 streams and functions is something
that many users can safely ignore; they are essential foundations, but they are hidden
below the surface. At the same time, a user who understands that steps and predicates
are Java Functions, and that the result of the select() method is a Java Stream, can
take advantage of this knowledge.
One of the key ideas that made this possible was the idea of subclassing This section provides some examples extracted from an actual program that uses s9api
interfaces and does a
mixture of Java navigation and XPath and XQuery processing
to extract data from an input document. First, some very simple examples. Constructs like this are not uncommon: This can be replaced with the much simpler and more efficient: Similarly, the slightly more complex expression: becomes: A more complex example from the same application is this one: Note here how the processing is split between XPath code and Java code. This is also
using an XPath function for which we haven't provided a built-in
predicate in s9api. But that's no problem, because we can invoke Java methods as predicates.
So this becomes:The Java 8 Streams API
Stream<X>
, representing a linear sequence of items of type X
The Basics of the Saxon s9api API
NamespaceContext
interface is used to pass a set of namespace bindings from a Java application to
an XPath processor. To implement this
interface, you need to implement three methods, of which the XPath processor will
only ever use one (getNamespaceURI(prefix)
). Yet
at the same time, there is no way the XPath processor can extract the full set of
bindings defined in the NamespaceContext
and copy them
into its own data structures.XsltTransformer
interface to create a 3.0 version, but apart from that it has been largely possible
to add new features incrementally. That's partly
because of a slightly less obsessive attitude to backwards compatibility: if I decide
that something was a bad mistake, I'm prepared to change it.
Class
Description
XdmValue
Every value is the XDM model is a sequence of items. The
XdmValue
class is therefore
the top of the class hierarchy. Because it's a sequence, it implements Iterable<XdmItem>
,
so you can use a Java foreach
loop to process the items sequentially. In the latest version
I have used Java generics to add a type parameter, so XdmValue<XdmNode>
is a sequence
of nodes, and XdmValue<XdmAtomicValue>
is a sequence of atomic values. As well as
an iterator()
method, it has an itemAt()
method to get the Nth item,
and a size()
method to count the items.
Internally an XdmValue
might exist as an actual sequence in memory, or as a "promise": sufficient
data to enable the items to be materialized when they are needed.
XdmItem
This class represents an Item in the XDM model. As such it is both a component of
an
XdmValue
,
and also an XdmValue
(of length one) in its own right. It's an abstract class, because every
item is actually something more specific (a node, an atomic value, a function). Some
of the methods
inherited from XdmValue
become trivial (for example size()
always returns 1).
XdmNode
This is a subclass of
XdmItem
used to represent nodes. Unlike many models of XML, we don't
subclass this for different kinds of node: that's mainly because XDM has deliberately
aimed at uniformity,
with the same accessors available for all node kinds. Many of the methods on XdmNode
, such
as getNodeName()
, getStringValue()
, getTypedValue()
, and
getNodeKind()
, are directly equivalent to accessors defined in the W3C XDM specification.
But in addition, XdmNode
has a method axisIterator
to navigate the tree
using any of the XPath axes, the result being returned as an iterator over the selected
nodes.
XdmAtomicValue
Another subclass of
XdmItem
, this is used to represent atomic values in the XDM model.
As with XdmNode
, we don't define further subclasses for different atomic types. There are convenience
methods to convert XdmAtomicValue
instances to and from equivalent (or near-equivalent) Java
classes such as String
, Double
, BigInteger
, and Date
.
XdmFunctionItem
From XPath 3.0, functions are first-class values alongside nodes and atomic values.
These are
represented in s9api as instances of
XdmFunctionItem
. Two specific subclasses of function,
with their own behaviours, are represented using the subclasses XdmMap
and XdmArray
.
I won't be saying much about these in this article, because I'm primarily concerned
with navigating
XML trees.The new API: Steps and Predicates
XdmValue<T>
acquires a new method:XdmStream select(Step step)
Step
here is a function that takes an item of class T
as its input, and returns a stream of items. If we consider a very simple Step
, namely child()
, this takes a node as input and returns a stream of nodes
as its result. We can apply this step to an XdmValue
consisting entirely of nodes, and it returns the concatenation of the streams
of nodes obtained by applying the step to each node in the input value. This operation
is equivalent to the "!" operator in XPath 3.0, or to the
flatMap()
method in many functional programming languages. It's not quite the same as the familiar
"/" operator in XPath, because
it doesn't eliminate duplicates or sort the result into document order. But for most
purposes it does the same job.net.sf.saxon.s9api.streams.Steps
containing static methods which provide commonly-used steps such
as child()
. In my examples, I'll assume
that the Java application has import net.sf.saxon.s9api.streams.Steps.*;
in its header, so it can use these fields and methods without further
qualification.net.sf.saxon.s9api.streams.Steps.child()
: this step is a function which, given a node, returns its children. There are
other similar steps for the other XPath axes. So you can find the children of a node
N
by writing N.select(child())
.S
and T
can be combined into a single composite step by writing S.then(T)
: for example Step grandchildren = child().then(child())
gives you
a step which can be used in the expression N.select(grandchildren)
to select all the grandchildren.Step
inherits from the standard Java class Function
, so it can be used more generally in any Java context where a
Function
is required.Predicate<T>
is a standard Java 8 class: it defines a function that can be applied to an object
of type T
to return
true or false. The class net.sf.saxon.s9api.streams.Predicates
defines some standard predicates that are useful when processing XML. For example
isElement()
gives you a predicate that can be applied to any XdmItem
to determine if it is an element node.Step
A
and a Predicate
P
, the expression A.where(P)
returns a new
Step
that filters the results of A
to include only those items that satisfy the predicate P
. So, for example,
child().where(isElement())
is a step that selects the element children of a node, so that N.select(child().where(isElement()))
selects the element children of N
. This is sufficiently common that we provide a shorthand: it can also be written
N.select(child(isElement()))
.hasLocalName("foo")
matches nodes having a local name of "foo": so N.select(child().where(hasLocalName("foo"))
selects the relevant children. Again this is so common that we provide a shorthand:
N.select(child("foo"))
. There is also a two argument version
child(ns, "foo")
which selects children with a given namespace URI and local name.exists(step)
which tests whether the result of applying a given step returns at least one item.
So, for example
N.select(child().where(exists(attribute("id"))))
returns those children of N
that have an attribute named "id".select()
method is always a stream of items, so you can use methods from the Java Stream class
such as filter() and flatMap() to process the result. Here are some of the standard
things you can do
with a stream of items in Java:
N.select(child()).toArray()
N.select(child()).collect(Collectors.toList())
N.select(child()).forEach(System.err::println)
N.select(child()).findFirst().get()
select()
always return a subclass of Stream
called XdmStream
, and this offers additional methods. For example:
XdmValue
: N.select(child()).asXdmValue()
List
: N.select(child()).asList()
asNode()
or asOptionalNode()
asAtomic()
or asOptionalAtomic()
N.select(child("para")).last()
More about Steps
Step
class is:public abstract class Step<T extends XdmItem>
implements Function<XdmItem, Stream<? extends T>>
XdmItem
as input, and delivers a stream of U
items as its result (where U
is XdmItem
or some possibly-different subclass). (I experimented by also parameterizing the class
on the type of items accepted, but that didn't work out well.)N.select(child())
will return nodes
(because child()
is a step that returns nodes).Step
if you want to: but most of the time you will be able to do everything
you need with the standard Steps available from the class net.sf.saxon.s9api.stream.Steps
. The standard steps include:
ancestor()
, ancestor-or-self()
, attribute()
, child()
,
descendant()
, descendantOrSelf()
, following()
, followingSibling()
,
namespace()
, parent()
, preceding()
, precedingSibling()
,
self()
.child("foo")
filters the axis to select elements by
local name (ignoring the namespace if any); child(ns, local)
filters the axis to select elements by
namespace URI and local name, and child(predicate)
filters the axis using an arbitrary predicate: this is
a shorthand for child().where(predicate)
.step1.then(step2)
. This applies step2
to every item in the result of step1
, retaining the order of results and flattening them into a single stream.step1.where(predicate1)
. This selects those items in the
result of step1
for which predicate1
returns true.path(child(isElement()), attribute("id"))
. This returns a step whose effect is to return the id
attributes of all the children of the target node.path("*", "@id")
. Again, this returns a step that can be used like any other step.
(In my own applications, I have found myself using this approach very extensively).atomize()
extracts the typed values of nodes in the input, following the rules in the XPath
specification. The result
is a stream of atomic valuestoString()
likewise extracts the string values, while toNumber()
has the same effect as the
XPath number()
functionxpath(path)
returns a Step
that evaluates an XPath expression. For example,
doc.select(xpath("//foo"))
has the same effect as doc.select(descendant("foo"))
. A second argument
to the xpath()
method may be used to supply a static context for the evaluation. Note that compilation
of the XPath expression
occurs while the step is being created, not while it is being evaluated; so if you
bind the result of xpath("//foo")
to
a variable, then the expression can be evaluated repeatedly without recompilation.More about Predicates
Predicate
class is a standard Java 8 interface: it is a function that takes any object as input,
and returns a boolean.
You can use any predicates you like with this API, but the class net.sf.saxon.s9api.streams.Predicates
provides some implementations
of Predicate
that are particularly useful when navigating XML documents. These include the following:
isElement()
, isAttribute()
, isText()
, isComment()
,
isDocument()
, isProcessingInstruction()
, isNamespace()
test that the item
is a node of a particular kindhasName("ns", "local")
, hasLocalName("n")
, and hasNamespaceUri("ns")
make tests
against the name of the nodehasType(t)
tests the type of the item: for example hasType(ItemType.DATE)
tests for
atomic values of type xs:date
exists(step)
tests whether the result of applying the given step is a sequence containing at least
one item; conversely empty(step)
tests whether the result of the step is empty. For example, exists(CHILD)
is true for a node that has children.some(step, predicate)
tests whether at least one item selected by the step satisfies the given predicate.
For example,
some(CHILD, IS_ELEMENT)
tests whether the item is a node with at least one element child. Similarly every(step, predicate)
tests whether the predicate is true for every item selected by the step.eq(string)
tests whether the string value of the item is equal to the given string; while eq(double)
does
a numeric comparison. A two-argument version eq(step, string)
is shorthand for some(step, eq(string))
.
For example, descendant(eq(attribute("id"), "ABC"))
finds all descendant elements
having an "id" attribute equal to "ABC".and
, or
, and not
. For example
isElement().and(eq("foo"))
is a predicate that tests whether an item is an element with string-value "foo".The XdmStream class
Stream
with XdmStream
. This idea was shamelessly stolen from the open-source StreamEx library by Tagir Valeev (though no StreamEx code is actually used). Subclassing Stream
enables additional methods to be provided to handle the results of the stream, avoiding
the need for clumsy calls on the generic collect()
method. Another motivating factor here is to allow for early exit (short-circuit
evaluation) when a result can be delivered without reading the whole stream. Saxon
handles this by registering onClose()
handlers with the stream pipeline, so that when the consumer of the stream calls
the XdmStream.close()
method, the underlying supplier of data to the stream is notified that no more data
is needed.
Examples
XdmNode testInput = (XdmNode) xpath.evaluateSingle("test", testCase);
XdmNode testInput = testCase.selectFirst(child("test"));
XdmNode principalPackage = (XdmNode) xpath.evaluateSingle("package[@role='principal']",
testInput);
XdmNode principalPackage = testInput.selectFirst(child("package").where(eq(attribute("role"),
"principal"));
boolean definesError = xpath.evaluate("result//error[starts-with(@code, 'XTSE')]",
testCase).size() > 0;
boolean definesError = testCase.selectFirst(child("result"), descendant("error").where(
some(attribute("code"), (XdmNode n) -> n.getStringValue().startsWith("XTSE"))) != null;