For many years we have used the XMark benchmark to check for performance regression between Saxon releases, and to evaluate the impact of internal changes on query performance.
XMark originated with the MonetDB project and is described at https://projects.cwi.nl/xmark/. It consists of a scalable XML data file (produced using a data generator), and a set of 20 XQuery queries to be run against that data file. We have run the data generator to produce files with nominal sizes of 100Kb, 1Mb, 4Mb, 10Mb, and 100Mb; we use the original queries as published, except for one or two small changes to correct errors in the original publication.
Recently we have converted these data files to JSON, and have produced equivalent XQuery 3.1 queries designed to deliver the same results as the originals. The queries still produce XML rather than JSON output, so that we can compare the results, except in a few cases where large chunks of the original XML are copied to the output: these we cannot reproduce exactly, because the original XML nodes are no longer available in the JSON representation. The results can also differ because JSON maps do not retain the order of entries.
In this article I will report first on the data conversion; then on the query conversion; and finally on performance results.
Converting the Data
I didn't attempt to use any off-the-shelf XML-to-JSON conversion tools. My instinct is that they wouldn't have done a very good job, and I would have needed an XSLT transformation to refine the output anyway, so I decided to do the whole job using XSLT 3.0.
The conversion stylesheet is not particularly interesting; in fact, it's rather tedious. A few points are worth mentioning:
- As mentioned in What should we do about arrays?, the XSLT 3.0 spec is weak on capabilities for constructing arrays. This is the only area where we used Saxon extensions. I'm not convinced we yet have a perfect solution to this problem, and I've proposed some new ideas at Constructing Arrays.
- The XMark data files are mainly structured data, but there is some use of mixed content for narrative text. Mixed content is bad news for JSON. In real life, I would probably have handled this by embedding XML or HTML fragments within character strings in the JSON structure; but that seemed to be against the spirit of this exercise, so I instead expanded the node structure of the mixed content into JSON maps and arrays (see the illustration after this list). As we'll see later, this had a severe effect on the ease of writing the queries and on their performance.
- Most structured data elements in the original file fall into two categories: elements with homogeneous element content (multiple children all having the same name), and elements with heterogeneous element content (multiple children with different names). These translate very naturally into JSON arrays and maps respectively. A few cases weren't quite so simple: for example, the content model for open_auction has the form (initial, bidder*, current, privacy, seller, ...). We handle this as if there were a wrapper element bidders around the sequence of bidder elements, so open_auction translates to a map, and bidders converts to an array (again, see the illustration below). The names of elements within an array are generally dropped.
- There are a few attributes in the XML; these posed no particular problem.
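To make this concrete, here is a small illustration of the kind of output the conversion produces. The values here are invented, and the real converter's output may differ in detail. An open_auction element such as:

<open_auction id="open_auction0">
  <initial>15.50</initial>
  <bidder><date>01/05/1999</date><increase>4.50</increase></bidder>
  <bidder><date>02/21/1999</date><increase>10.50</increase></bidder>
  <current>30.50</current>
</open_auction>

becomes a map with an explicit bidders array:

{
  "id": "open_auction0",
  "initial": 15.50,
  "bidders": [
    { "date": "01/05/1999", "increase": 4.50 },
    { "date": "02/21/1999", "increase": 10.50 }
  ],
  "current": 30.50
}

while a mixed-content fragment such as <description>A <emph>fine</emph> gold ring</description> is expanded into arrays and singleton maps:

"description": [ "A ", { "emph": [ "fine" ] }, " gold ring" ]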
The nominal 10Mb file is actually 11,875,066 bytes in its XML form, and 10,464,266 bytes when converted to JSON, a reduction of about 12%. Some of this difference (perhaps 200Kb) is due to unnecessary whitespace in the XML; the rest is the overhead of element end tags.
Parsing the XML and building a Saxon TinyTree took 353ms; parsing the JSON and building a structure of XDM maps and arrays took 636ms. I haven't attempted to assess the memory usage of the two data structures, but the maps and arrays are almost certainly larger. This is despite the fact that for maps derived directly from JSON parsing, we use a specialized representation that exploits the fact that all keys are instances of xs:string, and therefore don't need to retain a type annotation.
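For reference, here is a minimal sketch of how the two forms of the data are opened in XQuery 3.1. The file names are illustrative, and I'm assuming (as the converted queries suggest, since they begin with expressions like ?open_auctions) that the parsed JSON map is then supplied as the context item:

doc("xmark10.xml")/site        (: XML: parse the document and select the root element :)
json-doc("xmark10.json")       (: JSON: parse the file into an XDM map :)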
Converting the Queries
The queries were converted by hand. Generally we tried to change the query so it continued to produce the same (XML) output as the original, for ease of comparing results; but for queries whose output copies sizeable chunks of the input XML, we abandoned this principle, instead replicating the intent of the query as far as we could.
In most cases the conversion is very straightforward. For example, this is Q3:
(: Q3. Return the IDs of all open auctions whose current
increase is at least twice as high as the initial increase. :)
for $b in /site/open_auctions/open_auction
where $b/bidder[1]/increase * 2 <= $b/bidder[last()]/increase
return <increase first="{$b/bidder[1]/increase}"
last="{$b/bidder[last()]/increase}"/>
Which turns into:
(: Q3. Return the IDs of all open auctions whose current
increase is at least twice as high as the initial increase. :)
for $b in ?open_auctions?*
where $b?bidders?*[1]?increase * 2 <= $b?bidders?*[last()]?increase
return <increase first="{$b?bidders?*[1]?increase}"
last="{$b?bidders?*[last()]?increase}"/>
Some observations:
- We have to use bidders?*[1] rather than bidders?1 because the latter expression throws a dynamic error (rather than returning an empty sequence) for an auction in which there are no bidders (see the sketch after this list).
- We use bidders?*[last()] to get the last item in an array because converting the array to a sequence and using a filter is simpler than the alternative of writing bidders?(array:size($b?bidders)).
- The element name site is dropped because the JSON file is a map in which open_auctions is a top-level entry.
- The element name open_auction is dropped because the open_auctions entry in the JSON contains an array of objects which do not need to be named; the ?* in the JSON query corresponds to the /open_auction in the original.
- The JSON form introduces the name bidders as a wrapper for the group of individual bidder elements in the XML (which become anonymous in the JSON).
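To illustrate the first point, here is a minimal sketch (assuming the converter emits an empty bidders array for an auction with no bidders); FOAY0001 is the standard error code for an out-of-bounds array index:

let $auction := map { "bidders": [] }
return $auction?bidders?*[1]    (: returns the empty sequence :)
(: $auction?bidders?1 would instead raise FOAY0001, array index out of bounds :)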
Some specific difficulties that were encountered in converting other queries:
- Query Q4 looks for auctions where one bid differs from a previous bid, and for this purpose it uses the operator << to test the relative order of two nodes in document order. The JSON model offers no equivalent. To solve this I introduced a higher-order function index-where(array, predicate) which returns the index positions in an array of the members that satisfy the given predicate; it is then possible to find the required two items and compare their index positions (a possible implementation appears after this list).
- For any query using the mixed-content description field, I needed to include a recursive function that reconstructs the description as text by flattening the arrays and maps that make it up. This is tedious and expensive. Queries that do a deep dive into the description, like looking for text marked up with keyword tags at any depth, are even more complicated. The map:find function sometimes does what's needed, but not if any context is involved (for example, finding a keyword that's nested within emph markup).
- Debugging mistakes can be tricky. The diagnostics you get when you write ?closed_auctions?annotation instead of ?closed_auctions?*?annotation aren't always helpful. I've tried to improve them. I've also proposed a language change so the first expression becomes valid: see Lookup operator on arrays of maps.
- It's very easy to forget that if $A is an array, then $A[$index] and $A[condition] are both valid, but neither means what you think, because they treat the array as a single item, not as a collection. With arrays derived from JSON, every member of the array (discounting any nulls) is a singleton, so you can always write $A?*[$index] or $A?*[condition] instead.
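Here is a possible implementation of the index-where helper mentioned in the first point above; this is a sketch, and the function actually used in the converted queries may differ:

declare function local:index-where(
  $array     as array(*),
  $predicate as function(item()*) as xs:boolean
) as xs:integer* {
  (: the 1-based positions of the array members that satisfy the predicate :)
  (1 to array:size($array))[$predicate($array(.))]
};

For example, local:index-where([10, 50, 30], function($x) { $x ge 30 }) returns (2, 3).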
Query Performance
For most of the queries, the JSON version was somewhat slower than the XML version, typically by between 20% and 160%. Queries in this category include:
Query | XML timing (ms) | JSON timing (ms) | Ratio JSON/XML (%) |
---|---|---|---|
q1 | 0.2649 | 0.6845 | 258% |
q2 | 0.4861 | 0.6588 | 136% |
q5 | 0.2711 | 0.3190 | 118% |
q8 | 1.9359 | 2.3572 | 122% |
q10 | 11.3329 | 14.3428 | 127% |
q11 | 93.5360 | 144.1105 | 154% |
q16 | 0.4183 | 0.8489 | 203% |
q17 | 0.5964 | 0.8887 | 149% |
q20 | 1.2380 | 2.2084 | 178% |
How do we account for these numbers? My theory (based on gut feeling) is that the XML queries are faster because of the use of integer fingerprints for name matching in the TinyTree. Look at q1, for example, which in the original is:
Q1: for $b in /site/people/person[@id="person0"] return $b/name
(The XMark queries were written by someone who felt that everything ought to be written as a FLWOR expression. It can of course be simplified to a simple XPath. I'm surprised they didn't use a where clause...)
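For the record, the simplified form is just:

/site/people/person[@id="person0"]/name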
The child and attribute axis steps here (child::people, child::person, attribute::id etc) are implemented in the TinyTree by a sequential search of node entries, testing each one for an integer namecode. By contrast, the JSON equivalent is:
Q1: for $b in ?people?*[?id="person0"] return $b?name
and this involves string-based lookups in a hash table. Because the fan-out is fairly small, the sequential search wins.
To test this theory, I ran the XML queries using DOM rather than TinyTree as the tree model. Navigation in the DOM uses string matching on element and attribute names. With the exception of q1, the DOM queries are dramatically slower than the TinyTree:

Query | DOM timing (ms) |
---|---|
q1 | 0.2947 |
q2 | 9.1684 |
q5 | 5.1841 |
q8 | 49.4798 |
q10 | 116.8379 |
q11 | 402.2151 |
q16 | 6.5635 |
q17 | 44.1887 |
q20 | 179.2854 |
In the next group of queries, the JSON query is faster, though generally not by much:
Query | XML timing (ms) | JSON timing (ms) | Ratio JSON/XML (%) |
---|---|---|---|
q3 | 1.3507 | 1.2656 | 94% |
q6 | 0.2870 | 0.0316 | 11% |
q9 | 3.2959 | 2.2320 | 68% |
q12 | 32.3911 | 29.2320 | 90% |
q18 | 0.3134 | 0.2865 | 91% |
q19 | 4.9937 | 4.6699 | 93% |
Query q6 is clearly an outlier. This query counts descendants: the original XML formulation is:
Q6: for $b in /site/regions/* return count ($b//item)
As it happens, item elements cannot appear at arbitrary depth, so the return clause could equally have been written count($b/item). In writing the JSON query I took advantage of this knowledge, and wrote the query as:
Q6: map:for-each(?regions, function($k, $v){array:size($v)})
This runs faster firstly because of this simplification, and secondly because the size of an array can be determined in constant time, whereas counting the number of children of an element requires actually scanning them.
For the other queries where there is a small speed-up, the cause is less obvious, but it's usually possible to hazard a guess. Some of them, for example, involve arithmetic and numeric comparisons, and the JSON queries in such cases avoid the overhead of converting strings to numbers on the fly (instead, the conversion is done during JSON parsing). We know from profiling that these conversions, especially if they occur in a filter predicate, can dominate query execution time.
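As an illustration (these are hypothetical filters, not among the published queries), the XML predicate here must convert the untyped text content of current to a number every time it is evaluated, whereas in the JSON version ?current is already an xs:double, converted once during parsing:

/site/open_auctions/open_auction[current > 40.0]   (: XML: string-to-double conversion per evaluation :)
?open_auctions?*[?current > 40.0]                  (: JSON: the value is already numeric :)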
For the final group of queries, the JSON performance is dramatically worse:
Query | XML timing (ms) | JSON timing (ms) | Ratio JSON/XML (%) |
---|---|---|---|
q7 | 1.0953 | 87.4869 | 7987% |
q13 | 0.3635 | 15.1646 | 4171% |
q14 | 12.4252 | 138.0764 | 1111% |
These three queries all involve access to the description of an item, which in the XML representation is a mixed-content field (text with inline markup). As remarked earlier, this has been represented in JSON by expanding the node tree into a structure of arrays and singleton maps. As a result, a query like this one:
Q14: for $i in /site//item where contains($i/description, "gold") return ($i/name, $i/description)
becomes thoroughly contorted (and inefficient) in the JSON representation: it is necessary to write a recursive function that assembles the description (sans markup) as a string before executing the contains() function. Even then, the JSON query doesn't faithfully reproduce the original, because it outputs the description as a string, losing the internal markup.
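For illustration, such a flattening function might look like this; it is a sketch, assuming the representation of mixed content described earlier (arrays whose members are strings or singleton maps):

declare function local:flatten($value as item()) as xs:string {
  typeswitch ($value)
    case xs:string return $value
    case array(*)  return string-join($value?* ! local:flatten(.), "")
    (: the maps here are singletons, so the undefined ordering of map:for-each is harmless :)
    case map(*)    return string-join(map:for-each($value,
                          function($k, $v) { local:flatten($v) }), "")
    default        return string($value)
};

The where clause of Q14 then becomes something like: where contains(local:flatten($i?description), "gold").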
Conclusions
First, if you've got mixed content (text with inline markup) then you probably don't want to be using JSON. If you must use JSON, use XML or HTML within character strings in cases where inline markup is needed.
Secondly, for structured data it's a fairly even match; the differences aren't large enough to be critical for most applications. In Saxon, XML does slightly better on balance. This assumes, however, that for the XML case you are using an efficient model like the Saxon TinyTree, rather than a general-purpose DOM.
We found a few cases where XQuery 3.1's capabilities for querying JSON structures have gaps and omissions: notably, searching for descendants in the tree is difficult, and operations that depend on the ordering of items within arrays are also tricky.