For many years we have used the XMark benchmark to check for performance regression between Saxon releases, and to evaluate the impact of internal changes on query performance.
XMark originated with the MonetDB project and is described at https://projects.cwi.nl/xmark/. It consists of a scalable XML data file (produced using a data generator), and a set of 20 XQuery queries to be run against that data file. We have run the data generator to produce files with nominal sizes of 100Kb, 1Mb, 4Mb, 10Mb, and 100Mb; we use the original queries as published, except for one or two small changes to correct errors in the original publication.
Recently we have converted these data files to JSON, and have produced equivalent XQuery 3.1 queries designed to deliver the same results as the originals. The queries still produce XML rather than JSON output, so that we can compare the results, except in a few cases where large chunks of the original XML are copied to the output: these we cannot reproduce exactly, because the original XML nodes are no longer available in the JSON representation. The results can also differ because JSON maps do not retain the order of entries.
In this article I will report first on the data conversion; then on the query conversion; and finally on performance results.
Converting the Data
I didn't attempt to use any off-the-shelf XML-to-JSON conversion tools. My instinct is that they wouldn't have done a very good job, and I would have needed an XSLT transformation to refine the output anyway, so I decided to do the whole job using XSLT 3.0.
The conversion stylesheet is not particularly interesting; in fact, it's rather tedious. A few points are worth mentioning:
- As mentioned in What should we do about arrays?, the XSLT 3.0 spec is weak on capabilities for constructing arrays. This is the only area where we used Saxon extensions. I'm not convinced we yet have a perfect solution to this problem, and I've proposed some new ideas at Constructing Arrays.
- The XMark data files are mainly structured data, but there is some use of mixed content for narrative text. Mixed content is bad news for JSON. In real life, I would probably have handled this by embedding XML or HTML fragments within character strings in the JSON structure; but that seemed to be against the spirit of this exercise, so I instead expanded the node structure of the mixed content into JSON maps and arrays (see the illustration after this list). As we'll see later, this had a severe effect on the ease of writing the queries and on their performance.
- Most structured data elements in the original file fall into two categories: elements with homogeneous element content (multiple children all having the same name), and elements with heterogeneous element content (multiple children with different names). These translate very naturally into JSON arrays and maps respectively. A few cases weren't quite so simple: for example, the content model for open_auction has the form (initial, bidder*, current, privacy, seller, ...). We handle this as if there were a wrapper element bidders around the sequence of bidder elements, so open_auction translates to a map, and bidders converts to an array (again, see the illustration below). The names of elements within an array are generally dropped.
- There are a few attributes in the XML; these posed no particular problem.
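To make this concrete, here is a small illustration of the kind of output the conversion produces. The values here are invented, and the real converter's output may differ in detail. An open_auction element such as:

<open_auction id="open_auction0">
  <initial>15.50</initial>
  <bidder><date>01/05/1999</date><increase>4.50</increase></bidder>
  <bidder><date>02/21/1999</date><increase>10.50</increase></bidder>
  <current>30.50</current>
</open_auction>

becomes a map with an explicit bidders array:

{
  "id": "open_auction0",
  "initial": 15.50,
  "bidders": [
    { "date": "01/05/1999", "increase": 4.50 },
    { "date": "02/21/1999", "increase": 10.50 }
  ],
  "current": 30.50
}

while a mixed-content fragment such as <description>A <emph>fine</emph> gold ring</description> is expanded into arrays and singleton maps:

"description": [ "A ", { "emph": [ "fine" ] }, " gold ring" ]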
The nominal 10Mb file is actually 11,875,066 bytes in its XML form, and 10,464,266 bytes when converted to JSON, a reduction of about 12%. Some of this difference (perhaps 200Kb) is due to unnecessary whitespace in the XML; the rest is the overhead of element end tags.
Parsing the XML and building a Saxon TinyTree took 353ms; parsing the JSON and building a structure of XDM maps and arrays took 636ms. I haven't attempted to assess the memory usage of the two data structures, but the maps and arrays are almost certainly larger. This is despite the fact that for maps derived directly from JSON parsing, we use a specialized representation that exploits the fact that all keys are instances of xs:string, and therefore don't need to retain a type annotation.
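For reference, here is a minimal sketch of how the two forms of the data are opened in XQuery 3.1. The file names are illustrative, and I'm assuming (as the converted queries suggest, since they begin with expressions like ?open_auctions) that the parsed JSON map is then supplied as the context item:

doc("xmark10.xml")/site        (: XML: parse the document and select the root element :)
json-doc("xmark10.json")       (: JSON: parse the file into an XDM map :)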
Converting the Queries
The queries were converted by hand. Generally we tried to change the query so it continued to produce the same (XML) output as the original, for ease of comparing results; but for queries whose output copies sizeable chunks of the input XML, we abandoned this principle, instead replicating the intent of the query as far as we could.
In most cases the conversion is very straightforward. For example, this is Q3:
(: Q3. Return the IDs of all open auctions whose current
increase is at least twice as high as the initial increase. :)
for $b in /site/open_auctions/open_auction
where $b/bidder[1]/increase * 2 <= $b/bidder[last()]/increase
return <increase first="{$b/bidder[1]/increase}"
last="{$b/bidder[last()]/increase}"/>
Which turns into:
(: Q3. Return the IDs of all open auctions whose current
increase is at least twice as high as the initial increase. :)
for $b in ?open_auctions?*
where $b?bidders?*[1]?increase * 2 <= $b?bidders?*[last()]?increase
return <increase first="{$b?bidders?*[1]?increase}"
last="{$b?bidders?*[last()]?increase}"/>
Some observations:
- We have to use bidders?*[1] rather than bidders?1 because the latter expression throws a dynamic error (rather than returning an empty sequence) for an auction in which there are no bidders (see the sketch after this list).
- We use bidders?*[last()] to get the last item in an array because converting the array to a sequence and using a filter is simpler than the alternative of writing bidders?(array:size($b?bidders)).
- The element name site is dropped because the JSON file is a map in which open_auctions is a top-level entry.
- The element name open_auction is dropped because the open_auctions entry in the JSON contains an array of objects which do not need to be named; the ?* in the JSON query corresponds to the /open_auction in the original.
- The JSON form introduces the name bidders as a wrapper for the group of individual bidder elements in the XML (which become anonymous in the JSON).
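To illustrate the first point, here is a minimal sketch (assuming the converter emits an empty bidders array for an auction with no bidders); FOAY0001 is the standard error code for an out-of-bounds array index:

let $auction := map { "bidders": [] }
return $auction?bidders?*[1]    (: returns the empty sequence :)
(: $auction?bidders?1 would instead raise FOAY0001, array index out of bounds :)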
Some specific difficulties that were encountered in converting other queries:
- Query Q4 looks for auctions where one bid differs from a previous bid, and for this purpose it uses the operator << to test the relative order of two nodes in document order. The JSON model offers no equivalent. To solve this I introduced a higher-order function index-where(array, predicate) which returns the index positions in an array of the members that satisfy the given predicate; it is then possible to find the required two items and compare their index positions (a possible implementation appears after this list).
- For any query using the mixed-content description field, I needed to include a recursive function that reconstructs the description as text by flattening the arrays and maps that make it up. This is tedious and expensive. Queries that do a deep dive into the description, like looking for text marked up with keyword tags at any depth, are even more complicated. The map:find function sometimes does what's needed, but not if any context is involved (for example, finding a keyword that's nested within emph markup).
- Debugging mistakes can be tricky. The diagnostics you get when you write ?closed_auctions?annotation instead of ?closed_auctions?*?annotation aren't always helpful. I've tried to improve them. I've also proposed a language change so the first expression becomes valid: see Lookup operator on arrays of maps.
- It's very easy to forget that if $A is an array, then $A[$index] and $A[condition] are both valid, but neither means what you think, because they treat the array as a single item, not as a collection. With arrays derived from JSON, every member of the array (discounting any nulls) is a singleton, so you can always write $A?*[$index] or $A?*[condition] instead.
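Here is a possible implementation of the index-where helper mentioned in the first point above; this is a sketch, and the function actually used in the converted queries may differ:

declare function local:index-where(
  $array     as array(*),
  $predicate as function(item()*) as xs:boolean
) as xs:integer* {
  (: the 1-based positions of the array members that satisfy the predicate :)
  (1 to array:size($array))[$predicate($array(.))]
};

For example, local:index-where([10, 50, 30], function($x) { $x ge 30 }) returns (2, 3).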
Query Performance
For most of the queries, the JSON version was somewhat slower than the XML version, typically by between 20% and 160%. Queries in this category include:
Query | XML timing (ms) | JSON timing (ms) | Ratio JSON/XML (%) |
---|---|---|---|
q1 | 0.2649 | 0.6845 | 258% |
q2 | 0.4861 | 0.6588 | 136% |
q5 | 0.2711 | 0.3190 | 118% |
q8 | 1.9359 | 2.3572 | 122% |
q10 | 11.3329 | 14.3428 | 127% |
q11 | 93.5360 | 144.1105 | 154% |
q16 | 0.4183 | 0.8489 | 203% |
q17 | 0.5964 | 0.8887 | 149% |
q20 | 1.2380 | 2.2084 | 178% |
How do we account for these numbers? My theory (based on gut feeling) is that the XML queries are faster because of the use of integer fingerprints for name matching in the TinyTree. Look at q1, for example, which in the original is:
Q1: for $b in /site/people/person[@id="person0"] return $b/name
(The XMark queries were written by someone who felt that everything ought to be written as a FLWOR expression. It can of course be simplified to a simple XPath. I'm surprised they didn't use a where clause...)
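For the record, the simplified form is just:

/site/people/person[@id="person0"]/name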
The child and attribute axis steps here (child::people, child::person, attribute::id etc) are implemented in the TinyTree by a sequential search of node entries, testing each one for an integer namecode. By contrast, the JSON equivalent is:
Q1: for $b in ?people?*[?id="person0"] return $b?name
and this involves string-based lookups in a hash table. Because the fan-out is fairly small, the sequential search wins.
To test this theory, I ran the XML queries using DOM rather than TinyTree as the tree model. Navigation in the DOM uses string matching on element and attribute names. With the exception of q1, the DOM queries are dramatically slower than the TinyTree:

Query | DOM timing (ms) |
---|---|
q1 | 0.2947 |
q2 | 9.1684 |
q5 | 5.1841 |
q8 | 49.4798 |
q10 | 116.8379 |
q11 | 402.2151 |
q16 | 6.5635 |
q17 | 44.1887 |
q20 | 179.2854 |
In the next group of queries, the JSON query is faster, though generally not by much:
Query | XML timing (ms) | JSON timing (ms) | Ratio JSON/XML (%) |
---|---|---|---|
q3 | 1.3507 | 1.2656 | 94% |
q6 | 0.2870 | 0.0316 | 11% |
q9 | 3.2959 | 2.2320 | 68% |
q12 | 32.3911 | 29.2320 | 90% |
q18 | 0.3134 | 0.2865 | 91% |
q19 | 4.9937 | 4.6699 | 93% |
Query q6 is clearly an outlier. This query counts descendants: the original XML formulation is:
Q6: for $b in /site/regions/* return count ($b//item)
As it happens, item elements cannot appear at arbitrary depth, so the return clause could equally have been written count($b/item). In writing the JSON query I took advantage of this knowledge, and wrote the query as:
Q6: map:for-each(?regions, function($k, $v){array:size($v)})
This runs faster firstly because of this simplification, and secondly because the size of an array can be determined in constant time, whereas counting the number of children of an element requires actually scanning them.
For the other queries where there is a small speed-up, the cause is less obvious, but it's usually possible to hazard a guess. Some of them, for example, involve arithmetic and numeric comparisons, and the JSON queries in such cases avoid the overhead of converting strings to numbers on the fly (instead, the conversion is done during JSON parsing). We know from profiling that these conversions, especially if they occur in a filter predicate, can dominate query execution time.
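As an illustration (these are hypothetical filters, not among the published queries), the XML predicate here must convert the untyped text content of current to a number every time it is evaluated, whereas in the JSON version ?current is already an xs:double, converted once during parsing:

/site/open_auctions/open_auction[current > 40.0]   (: XML: string-to-double conversion per evaluation :)
?open_auctions?*[?current > 40.0]                  (: JSON: the value is already numeric :)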
For the final group of queries, the JSON performance is dramatically worse:
Query | XML timing (ms) | JSON timing (ms) | Ratio JSON/XML (%) |
---|---|---|---|
q7 | 1.0953 | 87.4869 | 7987% |
q13 | 0.3635 | 15.1646 | 4171% |
q14 | 12.4252 | 138.0764 | 1111% |
These three queries all involve access to the description of an item, which in the XML representation is a mixed-content field (text with inline markup). As remarked earlier, this has been represented in JSON by expanding the node tree into a structure of arrays and singleton maps. As a result, a query like this one:
Q14: for $i in /site//item where contains($i/description, "gold") return ($i/name, $i/description)
becomes thoroughly contorted (and inefficient) in the JSON representation: it is necessary to write a recursive function that assembles the description (sans markup) as a string before executing the contains() function. Even then, the JSON query doesn't faithfully reproduce the original, because it outputs the description as a string, losing the internal markup.
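For illustration, such a flattening function might look like this; it is a sketch, assuming the representation of mixed content described earlier (arrays whose members are strings or singleton maps):

declare function local:flatten($value as item()) as xs:string {
  typeswitch ($value)
    case xs:string return $value
    case array(*)  return string-join($value?* ! local:flatten(.), "")
    (: the maps here are singletons, so the undefined ordering of map:for-each is harmless :)
    case map(*)    return string-join(map:for-each($value,
                          function($k, $v) { local:flatten($v) }), "")
    default        return string($value)
};

The where clause of Q14 then becomes something like: where contains(local:flatten($i?description), "gold").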
Conclusions
First, if you've got mixed content (text with inline markup) then you probably don't want to be using JSON. If you must use JSON, use XML or HTML within character strings in cases where inline markup is needed.
Secondly, for structured data it's a fairly even match; the differences aren't large enough to be critical for most applications. In Saxon, XML does slightly better on balance. This assumes, however, that for the XML case you are using an efficient model like the Saxon TinyTree, rather than a general-purpose DOM.
We found a few cases where XQuery 3.1's capabilities for querying JSON structures have gaps and omissions: notably, searching for descendants in the tree is difficult, and operations that depend on the ordering of items within arrays are also tricky.