Ordered Maps

There's been a lot of discussion recently in the QT4 community group about introducing ordered maps: that is, maps in which there is a defined and predictable ordering of entries, typically "last-in, last-out" where the most recently added entry always appears at the end. The main motivation for this is that JSON is designed (like XML) to be human-readable, but JSON content in which the entries appear in random order is anything but: if a phone bill contains a name, address, account number, a summary of charges, and an itemized list of calls, then you don't want the phone number appearing in the middle, sandwiched between the list of calls and the list of charges. Currently when data is serialized as JSON, we provide an option to indent it for readability, but indentation isn't going to make the data readable if it's in random order.

Retaining order is particularly useful for visual inspection of changes: if you write code that modifies one entry in a JSON document, you want to satisfy yourself that the transformation did exactly what you expected, and the best way to convince yourself is by placing the input and output side-by-side and comparing them visually.

There seems to be consensus that support for ordered maps, at least in some circumstances, is desirable. There is debate about whether all maps should be ordered, and about whether ordering should be the default, and about whether ordering should be supported if a map is built incrementally using map:put operations. The answer to those questions depends at least in part on understanding how great the overhead is in retaining order in maps: if the overhead is negligible, then we might as well make all maps ordered.

Normally I'm the first to argue that the language specification should not be driven by performance concerns: we should design a clean language and leave implementors to worry about how to implement it efficiently. But in this case, if we're making a change to the language semantics that affects users whether they want the feature or not, I think we need to understand clearly whether we are asking users to pay a performance price.

Both JavaScript (from ES2015) and Python (from 3.7) have moved in the direction of making all maps (objects/dictionaries) ordered, so we wouldn't be on our own in doing this. However, JavaScript objects and Python dictionaries are mutable, whereas XDM maps are functionally persistent (adding an entry creates a new map, leaving the original unchanged), so the performance constraints are somewhat different.

So let's look now at how Saxon implements maps.

In SaxonJ 12.x there are two main implementations (ignoring special cases such as empty maps and singleton maps). The default implementation is in the class net.sf.saxon.ma.map.HashTrieMap, and this is built using an open source implementation of immutable hash tries written by Michael Froh; it has been in the product since 9.6. In SaxonCS 12.x we replace this with the functionally equivalent Microsoft class System.Collections.Immutable.ImmutableDictionary. Both these library implementations are unordered.

There is a minor tweak that complicates the implementation. In an ideal world, we would create an underlying map of type Map<AtomicValue, GroundedValue>, where AtomicValue is the Saxon class used to hold all atomic values, and GroundedValue is the Saxon class used to hold all sequences other than those that are lazily-evaluated. However, AtomicValue.equals() does not implement the equality semantics defined by XDM for comparing map keys. This is because XPath has different rules for equality comparisons in different circumstances. The Microsoft ImmutableDictionary can take a custom KeyComparer parameter, which would solve this problem, but there is no equivalent in the Froh library that we use in SaxonJ. So instead we implement an underlying map of type Map<AtomicMatchKey, Tuple<AtomicValue, GroundedValue>>, where AtomicMatchKey is a value derived from the AtomicValue that has the correct equality semantics. We need to hold the AtomicValue because in general two atomic values can have the same AtomicMatchKey (for example this is the case when the keys are a mix of different numeric types): and the XPath functionality for maps requires the original key value (including its type annotation) to be retained.

The second implementation of maps found in SaxonJ and SaxonCS is the class net.sf.saxon.ma.map.DictionaryMap. This is implemented over a standard mutable java.util.HashMap<String, GroundedValue>> on Java, or System.Collections.Generic.Dictionary<string, GroundedValue> on .NET. It is suitable only where the keys are all instances of xs:string (which means we don't need to retain the type annotation), and where no in-situ modification takes place. As soon as an operation such as map:put or map:remove is applied to the map, we make a copy using the more general HashTrieMap implementation. But for many maps, especially those derived from JSON parsing, incremental modification is rare, and the lower-overhead DictionaryMap is perfectly satisfactory.

In Saxon 13 (not yet released), a third map implementation has been introduced: the ShapedMap. This is described in the article Maps and Records, and it is particularly useful in cases where many maps have exactly the same structure. This often happens when parsing CSV or JSON files. A ShapedMap is in two parts: a Shape object which holds a mapping from keys to integer slot numbers, and a simple array of slots holding the values of the fields. The Shape object can be shared between all map instances having a common structure. As with the DictionaryMap, if a ShapedMap is subjected to map:put or map:remove operations, it is immediately copied to a HashTrieMap.

How are these map implementations affected by the requirement to maintain order of entries?

For the ShapedMap, order is already maintained, so it isn't a problem. The only impact is that two maps can only share the same Shape object if their keys are in the same order. There isn't going to be any observable performance regression.

For the DictionaryMap, on the Java platform we can replace the underlying HashMap<String, GroundedValue> by a LinkedHashMap<String, GroundedValue>. That's easily done, because it supports the same interface. I don't yet know how much overhead it imposes (in space or time); that requires some measurements.

On .NET, unfortunately, there is no equivalent to Java's LinkedHashMap. I have therefore implemented my own: this comprises a Dictionary<string, int> that maps string-valued keys to integer positions in the sequence, and two lists: a list of AtomicValue for the keys and a list of GroundedValue for the values.

For the HashTrieMap on Java, my plan is to scrap the immutable map implemented by Michael Froh, and substitute it with the io.vavr.collection.LinkedHashMap from the VAVR library, which appears to have the required semantics. Again, there appears to be no direct equivalent on .NET, so a home grown solution is again called for. My current implementation uses the same apprach as for the DictionaryMap: an immutable unordered map from atomic keys to integers, supplemented by ordered immutable lists of AtomicValue for the keys and GroundedValue for the values.

Which brings us to the question, what are the overheads? Answering that question means making some assumptions about the workloads we want to measure. For example, how important are map:put and map:remove operations? Anecdotal evidence suggests these are rather rare, and that most maps are read-only once built. But they might be important to some use cases.

The other complication is that we might be able to mitigate the overheads of making maps ordered by introducing new optimisations. We've already introduced the ShapedMap idea, where ordering hopefully imposes very little overhead. On .NET we could consider taking advantage of the ability to use a custom KeyComparer to avoid the overhead of effectively storing the keys twice.

We could also get smarter about choosing which implementation of maps to use under which circumstances. One change that I'm making is to introduce a MapBuilder class: during the initial construction of a map (for example during JSON parsing or during processing of map:merge or map:build, or during evaluation of a map constructor) we can add entries to a mutable builder object, and this then gives us the opportunity to choose the final map implementation when we know what all the keys and values are. For example, if all the keys have the same type annotation, then in principle we don't need to save the type annotations with every key value. We also know the size of the map at this stage.

We can even go further and avoid indexing the map until the first lookup (or map:get) operation. It might seem surprising, but there are many maps that are never used for lookup. For example, a JSON document might contain thousands of maps that are simply copied unchanged to the output, or that are discarded because they are irrelevant to the particular query. Perhaps the map builder should simply maintain a list of keys and values, and do nothing else until the first map:get? The only complication here is the need to detect duplicate keys, but that could be done cheaply using a Bloom filter.

So we need to do some measurements. But there's a good chance that if it does turn out that ordered maps impose an overhead, we can find compensating optimisations that mean there's no regression on the bottom line.

My first experiments looking at the cost of parsing and re-serializing JSON actually suggest that most of the cost is in the parsing and serializing, and that the choice of data structure for the XDM maps has very little impact on the bottom line. But that's provisional and subject to confirmation.