There's been a lot of discussion recently in the QT4 community group about introducing ordered maps: that is, maps in which there is a defined and predictable ordering of entries, typically "last-in, last-out" where the most recently added entry always appears at the end. The main motivation for this is that JSON is designed (like XML) to be human-readable, but JSON content in which the entries appear in random order is anything but: if a phone bill contains a name, address, account number, a summary of charges, and an itemized list of calls, then you don't want the phone number appearing in the middle, sandwiched between the list of calls and the list of charges. Currently when data is serialized as JSON, we provide an option to indent it for readability, but indentation isn't going to make the data readable if it's in random order.
Retaining order is particularly useful for visual inspection of changes: if you write code that modifies one entry in a JSON document, you want to satisfy yourself that the transformation did exactly what you expected, and the best way to convince yourself is by placing the input and output side-by-side and comparing them visually.
There seems to be consensus that support for ordered maps, at least in some circumstances,
is desirable. There is debate about whether all maps should be ordered, and about whether
ordering should be the default, and about whether ordering should be supported if a map
is built incrementally using map:put
operations. The answer to those questions
depends at least in part on understanding how great the overhead is in retaining order
in maps: if the overhead is negligible, then we might as well make all maps ordered.
Normally I'm the first to argue that the language specification should not be driven by performance concerns: we should design a clean language and leave implementors to worry about how to implement it efficiently. But in this case, if we're making a change to the language semantics that affects users whether they want the feature or not, I think we need to understand clearly whether we are asking users to pay a performance price.
Both JavaScript (from ES2015) and Python (from 3.7) have moved in the direction of making all maps (objects/dictionaries) ordered, so we wouldn't be on our own in doing this. However, JavaScript objects and Python dictionaries are mutable, whereas XDM maps are functionally persistent (adding an entry creates a new map, leaving the original unchanged), so the performance constraints are somewhat different.
So let's look now at how Saxon implements maps.
In SaxonJ 12.x there are two main implementations (ignoring special cases such as empty
maps and singleton maps). The default implementation is in the class
net.sf.saxon.ma.map.HashTrieMap
, and this is built using an open source
implementation of immutable hash tries written by Michael Froh; it has been in the
product since 9.6. In SaxonCS 12.x we replace this with the functionally equivalent Microsoft
class System.Collections.Immutable.ImmutableDictionary
. Both these library
implementations are unordered.
There is a minor tweak that complicates the implementation. In an ideal world,
we would create an underlying map of type Map<AtomicValue, GroundedValue>
,
where AtomicValue
is the Saxon class used to hold all atomic values,
and GroundedValue
is the Saxon class used to hold all sequences other than
those that are lazily-evaluated. However, AtomicValue.equals()
does
not implement the equality semantics defined by XDM for comparing map keys. This
is because XPath has different rules for equality comparisons in different circumstances.
The Microsoft ImmutableDictionary
can take a custom KeyComparer
parameter, which would solve this problem, but there is no equivalent in the Froh
library that we use in SaxonJ. So instead we implement an underlying map of type
Map<AtomicMatchKey, Tuple<AtomicValue, GroundedValue>>
, where
AtomicMatchKey
is a value derived from the AtomicValue
that has the correct equality semantics. We need to hold the AtomicValue
because in general two atomic values can have the same AtomicMatchKey
(for example this is the case when the keys are a mix of different numeric types):
and the XPath functionality for maps requires the original key value (including
its type annotation) to be retained.
The second implementation of maps found in SaxonJ and SaxonCS is the class
net.sf.saxon.ma.map.DictionaryMap
. This is implemented over a standard
mutable java.util.HashMap<String, GroundedValue>>
on Java, or
System.Collections.Generic.Dictionary<string, GroundedValue>
on .NET. It is suitable only where the keys are all instances of xs:string
(which means we don't need to retain the type annotation), and where no in-situ
modification takes place. As soon as an operation such as map:put
or map:remove
is applied to the map, we make a copy using the
more general HashTrieMap
implementation. But for many maps,
especially those derived from JSON parsing, incremental modification is rare,
and the lower-overhead DictionaryMap
is perfectly satisfactory.
In Saxon 13 (not yet released), a third map implementation has been introduced:
the ShapedMap
. This is described in the article
Maps and Records,
and it is particularly useful in cases where many maps have exactly the same structure.
This often happens when parsing CSV or JSON files. A ShapedMap
is in two
parts: a Shape
object which holds a mapping from keys to integer slot numbers,
and a simple array of slots holding the values of the fields. The Shape
object can be shared between all map instances having a common structure. As with the
DictionaryMap
, if a ShapedMap
is subjected to map:put
or map:remove
operations, it is immediately copied to a HashTrieMap
.
How are these map implementations affected by the requirement to maintain order of entries?
For the ShapedMap
, order is already maintained, so it isn't a problem.
The only impact is that two maps can only share the same Shape
object
if their keys are in the same order. There isn't going to be any observable performance
regression.
For the DictionaryMap
, on the Java platform we can replace the
underlying HashMap<String, GroundedValue>
by a
LinkedHashMap<String, GroundedValue>
. That's easily done,
because it supports the same interface. I don't yet know how much overhead
it imposes (in space or time); that requires some measurements.
On .NET, unfortunately, there is no equivalent to Java's LinkedHashMap
.
I have therefore implemented my own: this comprises a Dictionary<string, int>
that maps string-valued keys to integer positions in the sequence, and two lists:
a list of AtomicValue
for the keys and a list of GroundedValue
for the values.
For the HashTrieMap
on Java, my plan is to scrap the immutable map implemented
by Michael Froh, and substitute it with the io.vavr.collection.LinkedHashMap
from the VAVR library, which appears to have the required semantics. Again, there appears
to be no direct equivalent on .NET, so a home grown solution is again called for. My
current implementation uses the same apprach as for the DictionaryMap
:
an immutable unordered map from atomic keys to integers, supplemented by ordered
immutable lists of AtomicValue
for the keys and GroundedValue
for the values.
Which brings us to the question, what are the overheads? Answering that question
means making some assumptions about the workloads we want to measure. For example,
how important are map:put
and map:remove
operations?
Anecdotal evidence suggests these are rather rare, and that most maps are read-only
once built. But they might be important to some use cases.
The other complication is that we might be able to mitigate the overheads of making
maps ordered by introducing new optimisations. We've already introduced the
ShapedMap
idea, where ordering hopefully imposes very little overhead.
On .NET we could consider taking advantage of the ability to use a custom
KeyComparer
to avoid the overhead of effectively storing the keys twice.
We could also get smarter about choosing which implementation of maps to use under
which circumstances. One change that I'm making is to introduce a MapBuilder
class: during the initial construction of a map (for example during JSON parsing or
during processing of map:merge
or map:build
, or during
evaluation of a map constructor) we can add entries to a mutable builder object, and
this then gives us the opportunity to choose the final map implementation when we
know what all the keys and values are. For example, if all the keys have the same
type annotation, then in principle we don't need to save the type annotations with
every key value. We also know the size of the map at this stage.
We can even go further and avoid indexing the map until the first lookup
(or map:get
) operation. It might seem surprising, but there
are many maps that are never used for lookup. For example, a JSON document
might contain thousands of maps that are simply copied unchanged to the output,
or that are discarded because they are irrelevant to the particular query.
Perhaps the map builder should simply maintain a list of keys and values,
and do nothing else until the first map:get
? The only complication
here is the need to detect duplicate keys, but that could be done cheaply
using a Bloom filter.
So we need to do some measurements. But there's a good chance that if it does turn out that ordered maps impose an overhead, we can find compensating optimisations that mean there's no regression on the bottom line.
My first experiments looking at the cost of parsing and re-serializing JSON actually suggest that most of the cost is in the parsing and serializing, and that the choice of data structure for the XDM maps has very little impact on the bottom line. But that's provisional and subject to confirmation.