Schema-Awareness and XMark performance

When people ask what performance benefits they can expect from using schema-aware transformations and queries, I've often replied in a way that avoids setting expectations too high. Some queries can benefit significantly; others actually slow down, because the extra cost of validating the input is not recovered by improvements in query execution speed. I've often stressed that the main benefit of schema-awareness is in the speed and ease of debugging and testing the query, not primarily in performance. But I've been taking another look at it, and I think I can probably start to be a bit more upbeat.

We need to distinguish the performance boost you get from Saxon-SA from the performance boost you get by making your queries schema-aware. Saxon-SA has an improved optimizer which will often give you benefits whether or not you use a schema. Let's focus on a couple of the queries in the XMark benchmark: q9, which is a simple equijoin, and q11, which is a non-equijoin (it joins two node sequences, comparing nodes using the "<" operator). Here are the figures for Saxon-B against the 1Mb and 10Mb versions of the database:

        1Mb       10Mb
q9      39ms      3533ms
q11     26ms      2447ms

Clearly both queries show quadratic performance: ten times the data size, 100 times the execution time.
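
As a quick check on that arithmetic, here is a small sketch that estimates the scaling exponent from the two Saxon-B measurements above. The only assumption is that the 10Mb document is exactly ten times the size of the 1Mb one; the function name is mine, purely for illustration.

```python
import math

# Saxon-B timings quoted above, in milliseconds: (1Mb run, 10Mb run)
timings_ms = {"q9": (39.0, 3533.0), "q11": (26.0, 2447.0)}
size_ratio = 10.0  # assumed: the larger document is ten times the smaller


def scaling_exponent(t_small, t_large, ratio):
    """Return k such that time grows roughly like size**k between two runs."""
    return math.log(t_large / t_small) / math.log(ratio)


exponents = {
    query: scaling_exponent(small, large, size_ratio)
    for query, (small, large) in timings_ms.items()
}
for query, k in exponents.items():
    print(f"{query}: exponent {k:.2f}")
```

Both exponents come out at roughly 1.96 and 1.97, which is as close to 2 (quadratic) as two data points can reasonably show.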

Now run the same queries with Saxon-SA:

        1Mb       10Mb
q9      1.7ms     19ms
q11     27ms      2429ms

An enormous improvement for q9, showing the effects of the equijoin optimization in Saxon-SA; but no impact at all on q11.
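
To see why an equijoin can be rescued from quadratic behaviour while a "<" join cannot be helped the same way, here is a generic sketch contrasting a nested-loop join with a hash-based equijoin. This is the textbook technique, not a description of what Saxon-SA actually does internally; the data and all the names are hypothetical.

```python
from collections import defaultdict


def nested_loop_join(left, right, predicate):
    """Compare every pair: len(left) * len(right) predicate calls."""
    return [(l, r) for l in left for r in right if predicate(l, r)]


def hash_equijoin(left, right, left_key, right_key):
    """Index one side by key, then probe the index once per item on the other."""
    index = defaultdict(list)
    for r in right:
        index[right_key(r)].append(r)
    return [(l, r) for l in left for r in index.get(left_key(l), [])]


# Hypothetical stand-ins for person ids and the buyers of closed auctions.
people = [{"id": f"person{i}"} for i in range(5)]
auctions = [{"buyer": "person1"}, {"buyer": "person3"}, {"buyer": "person1"}]

slow = nested_loop_join(people, auctions, lambda p, a: p["id"] == a["buyer"])
fast = hash_equijoin(people, auctions, lambda p: p["id"], lambda a: a["buyer"])
```

Both calls return the same pairs, but the second does work proportional to the sum of the two input sizes rather than their product. A hash lookup only answers "which items have exactly this key", so it offers nothing when the comparison is "<", and a join like q11 is left with the nested loop.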

Now let's see what happens if we make the queries schema-aware. The benchmark rules don't actually allow us to change the queries, but we're not trying to score benchmark points here, we're trying to improve our understanding of Saxon performance. To make the queries schema-aware we first have to write a schema for the XMark database, and then we have to add a few type declarations to the query source text. For example, q9 is changed to start:

import schema "" at "schema.xsd";
declare variable $auction as document-node(schema-element(site)) := .;

let $ca := $auction/site/closed_auctions/closed_auction
return ...

With Saxon 9.1.0.3 this changes the figures to

        1Mb       10Mb
q9      2.1ms     22.7ms
q11     17ms      1492ms

Note that q9 is a little bit slower, q11 fairly significantly faster. Which exemplifies why I'm cautious about telling people what to expect.

Now: something that I've had in mind for a very long time is to investigate what happens if I store typed values as well as string values in the tiny tree. In theory this gives a benefit because you only convert a string to a value such as a number or a date once, you don't do it each time the value is referenced. Today I finally got around to implementing this feature - it turns out to be only 100 lines of code or so. This is the impact:
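
The idea can be sketched in a few lines. This is only an illustration of caching a converted value next to the string, with a hypothetical Node class and made-up data; it is not the tiny tree code. It runs a "<" join over two small sequences and counts how often a string is actually parsed.

```python
class Node:
    """Hypothetical stand-in for a tree node (not Saxon's tiny tree)."""

    def __init__(self, string_value):
        self.string_value = string_value
        self._typed = None      # cached typed value, filled in on first use
        self.conversions = 0    # how many times the string was parsed

    def typed_uncached(self):
        self.conversions += 1
        return float(self.string_value)

    def typed_cached(self):
        if self._typed is None:
            self.conversions += 1
            self._typed = float(self.string_value)
        return self._typed


def count_pairs(incomes, prices, get):
    """Non-equijoin: count (income, price) pairs where price < income."""
    return sum(1 for p in incomes for i in prices if get(i) < get(p))


def run(get):
    incomes = [Node(s) for s in ("40.0", "10.0", "55.0")]
    prices = [Node(s) for s in ("7.5", "60.0", "36.25", "15.5")]
    pairs = count_pairs(incomes, prices, get)
    return pairs, sum(n.conversions for n in incomes + prices)


uncached_result = run(Node.typed_uncached)
cached_result = run(Node.typed_cached)
```

With three values on one side and four on the other, the uncached run parses a string 24 times (twice per comparison), while the cached run parses each of the seven values once. The join is still quadratic in comparisons, but each comparison becomes cheaper.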

        1Mb       10Mb
q9      1.6ms     17.6ms
q11     8.9ms     746.3ms

That is: the performance of q9 is pulled back to the non-schema-aware level, while the speed of q11 doubles! It's still quadratic, of course, but a 100% speed improvement in a slow-running query is always welcome. In principle, this need not depend on the compiler having any extra knowledge: one could potentially get similar improvements simply by validating the input document and running the non-schema-aware query. When I try this, the results are:

        1Mb       10Mb
q9      1.75ms    18.9ms
q11     14.64ms   1249.5ms

Although there's an improvement, it seems that the full benefit comes from having the typed values available as numbers at run-time, and from the compiled code also knowing that they will be numbers at run-time.

So: there seem to be considerable benefits in caching typed values. It's such a simple thing to do, it's really rather surprising it took me this long to realise it.

At the same time, don't ignore the other benefits of schema-awareness. In this round, I discovered a problem with the version of q7 that I have been using:

Q7. How many pieces of prose are in our database?

for $p in /site
return count($p//description) + count($p//annotation) + count($p//email)

The schema-aware compiler points out that there is no email element in the schema. I think that emailaddress was intended. But the query is strange anyway: annotation is a child element of description. Which all goes to underline the message, that performance really doesn't matter much unless the code is correct.
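
For a feel of the kind of check involved, here is a deliberately crude sketch that compares the element names used in q7's path steps against a set of declared names. A real schema-aware compiler does proper static type inference over the path expressions; this just matches names with a regular expression, and the `declared` set is a hypothetical subset containing only element names mentioned in this post.

```python
import re

# Hypothetical subset of the element names declared in the schema.
declared = {"site", "description", "annotation", "emailaddress"}


def undeclared_steps(path_expressions, declared_names):
    """Return element names used as path steps that are not declared."""
    names = set()
    for expr in path_expressions:
        # Pick up the name following each "/" or "//" step separator.
        names.update(re.findall(r"/+([A-Za-z_][\w.-]*)", expr))
    return sorted(names - declared_names)


q7_paths = ["/site", "$p//description", "$p//annotation", "$p//email"]
problems = undeclared_steps(q7_paths, declared)
print(problems)
```

Run against the q7 paths, the only name reported is `email`, which is the mistake the compiler caught.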