Since I last wrote about ordered maps, the XPath 4.0 specifications have now introduced the idea of JNodes, which adds a new set of requirements to the specification.
I wrote about JNodes in my paper at Balisage 2025, and I won't repeat the introduction here. Suffice it to say that a JNode can wrap a specific entry in an XDM map, and it allows you (among other things) to:
- Establish which of two entries in the same map comes first in document order (or equivalently, in map order);
- Get the preceding or following siblings of a map entry (the entries that precede it or follow it in map order).
The prototype implementation of JNodes that I demonstrated at Balisage included support for these
operations, but the implementation would be hopelessly inefficient in production. The current implementation
of ordered maps on the Saxon 13 development branch uses the LinkedHashMap
implementation
from the VAVR library, and this doesn't have anything that would support either of these operations.
I'm therefore proposing to use a home-grown map implementation as described below.
First note that map:put
and map:remove
operations are rare, and it's OK
to have a map implementation that allows the map to be constructed by a mutable builder class in one phase of operation,
and thereafter to be read-only. If a map:put
or map:remove
operation is attempted,
we can rebuild the map using a different (immutable | persistent | functional) data structure.
So let's first consider a structure that
doesn't support map:put
and map:remove
operations directly. I'm calling this
a FixedMap
.
A FixedMap
is underpinned by three Java structures:
keys
: An array of keys: typeAtomicValue[]
.values
: An array of values: typeGroundedValue[]
.index
: A map from atomic match keys to integers: typeMap<AtomicMatchKey, Integer>>
. Note that this map doesn't need to be ordered. The classAtomicMatchKey
is a surrogate for an XDM atomic value whose Javaequals()
method respects the XDM equality semantics: for example, if the key is anxs:string
then theAtomicMatchKey
is an instance of Saxon'sUnicodeString
class.
The basic operations on XDM maps are simply and efficiently implemented as follows:
- Get the value corresponding to a given key: set
i = index.get(key)
; return ifi == null
thennull
elsevalues[i]
. - Get all keys in map order: iterate over
keys
. - Get all values in map order: iterate over
values
.
The new operations required to support JNode navigation can also be efficiently implemented:
- Determine which of two entries with keys
X
andY
is first in document order: compareindex.get(X)
withindex.get(Y)
(an integer comparison). - Get preceding or following siblings of an entry with key
X
: findindex.get(X)
and then iterate forwards or backwards in thekeys
andvalues
arrays from this integer offset.
All very straightforward. Now, what about map:put
and map:remove
?
When one of these operations occurs, I propose to copy the FixedMap
to an
ExtensibleMap
. This will have a very similar structure to the FixedMap
, except
that:
- The arrays of keys and values are replaced by immutable lists of keys and values.
I propose to implement these immutable lists using the existing Saxon class
ZenoChain
. This is described in a previous blog. This uses the same logic as theZenoString
which I introduced in a 2021 Balisage paper. It provides efficient support for getting an entry at a particular integer offset, and for appending individual entries at the end of the list: therefore formap:get
andmap:put
. - The map from
AtomicMatchKey
toInteger
is replaced by an immutable map. There are plenty of implementations available (we could revert to the Michael Froh implementation used in earlier Saxon releases), because there is no need for this structure to support ordering.
The design relies on the fact that adding or removing entries to the map does not change the integer
offset of existing entries. That's fine for map:put
, which always puts new entries at the end
(in terms of map ordering). To support map:remove
, I propose to keep a separate list holding
the integer offsets of map entries that have been removed; a traversal of the map keys or arrays needs
to check this list and skip an entry if it is present. When the number of removed entries exceeds some
threshold (a rare event) we can rebuild the map from scratch.
The new design looks as if all the main operations perform in constant or logarithmic time, with traversal of the map in linear time; and it seems to be reasonably economical in terms of memory usage.
The design allows the use of an optimization whereby the index (from atomic match keys to integer offsets)
is built lazily, the first time map:get
is called. This is useful when processing JSON, because
it's likely that many maps constructed during JSON parsing will either be copied unchanged to the
transformation output, or will never be referenced at all, so the work needed to construct an index can be
saved (it's still necessary to check for duplicate keys, but this can be done cheaply with a simple Bloom
filter).
We currently have an optimized implementation of maps for cases where all the keys are known in advance
to be instances of xs:string
. This can potentially save a bit of space because in place
of the StringValue
representing each string-valued key, we can simply store the underlying
UnicodeString
. This saves holding the wrapper object and the type annotation, which can
be significant in the case of very large maps.
One of the benefits of this design is that it reduces our dependencies on third-party code, which in turn simplifies porting Saxon across different platforms, and transpiling to C# for the .NET platform. In the current Saxon 13 code base, we're not only dependent on the VAVR library, but we're actually dependent on a fork of that library that uses a different implentation of linked hash maps, to avoid a performance bug.