Schema-Awareness and XMark performance

By Michael Kay on November 30, 2008 at 02:18p.m.

When people ask what performance benefits they can expect from using schema-aware transformations and queries, I've often replied in a way that avoids setting expectations too high. Some queries benefit significantly; others actually slow down, because the extra cost of validating the input is not recovered by improvements in query execution speed. I've often stressed that the main benefit of schema-awareness is in the speed and ease of debugging and testing the query, not primarily in performance. But I've been taking another look at it, and I think I can probably start to be a bit more upbeat.

We need to distinguish the performance boost you get from Saxon-SA from the performance boost you get by making your queries schema-aware. Saxon-SA has an improved optimizer which will often give you benefits whether or not you use a schema. Let's focus on a couple of the queries in the XMark benchmark: q9, which is a simple equijoin, and q11, which is a non-equijoin (it joins two node sequences, comparing nodes using the "<" operator). Here are the figures for Saxon-B against the 1Mb and 10Mb versions of the database:

         1Mb       10Mb
q9       39ms      3533ms
q11      26ms      2447ms

Clearly both queries show quadratic performance: ten times the data size, roughly 100 times the execution time.
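As a quick check on the "quadratic" claim, the scaling exponent can be estimated from the two measurements for each query. This is a small Python sketch using only the figures in the table above; the function name is mine, and it assumes the running time follows a simple power law in the data size.

```python
import math

# Saxon-B timings from the table above, in milliseconds: (1Mb, 10Mb)
timings_ms = {"q9": (39.0, 3533.0), "q11": (26.0, 2447.0)}
size_ratio = 10.0  # the 10Mb database is ten times the size of the 1Mb one


def scaling_exponent(t_small, t_large, size_ratio):
    """Return k such that time grows as size**k between the two measurements."""
    return math.log(t_large / t_small) / math.log(size_ratio)


for name, (t_small, t_large) in timings_ms.items():
    k = scaling_exponent(t_small, t_large, size_ratio)
    print(f"{name}: time ratio {t_large / t_small:.1f}, exponent {k:.2f}")
```

Both exponents come out just under 2, which is what you would expect from a nested-loop join over two sequences that each grow with the data.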

Now run the same queries with Saxon-SA:

         1Mb       10Mb
q9       1.7ms     19ms
q11      27ms      2429ms

An enormous improvement for q9, showing the effects of the equijoin optimization in Saxon-SA; but no impact at all on q11.
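To illustrate why an equijoin is so much easier to optimize than a "<" join, here is a generic Python sketch contrasting a nested-loop join with a hash-based equijoin. It is not Saxon's implementation; the function names and the sample data are hypothetical.

```python
from collections import defaultdict


def nested_loop_join(left, right, predicate):
    """Compare every pair: O(len(left) * len(right)), works for any predicate."""
    return [(l, r) for l in left for r in right if predicate(l, r)]


def hash_equijoin(left, right, left_key, right_key):
    """Index one side by its key, then probe: roughly O(len(left) + len(right))."""
    index = defaultdict(list)
    for r in right:
        index[right_key(r)].append(r)
    return [(l, r) for l in left for r in index.get(left_key(l), [])]


# Hypothetical data: (person id, name) and (buyer id, item)
people = [("person0", "Ann"), ("person1", "Bob")]
auctions = [("person1", "item7"), ("person0", "item3"), ("person1", "item9")]

slow = nested_loop_join(people, auctions, lambda p, a: p[0] == a[0])
fast = hash_equijoin(people, auctions, lambda p: p[0], lambda a: a[0])
```

Both calls return the same pairs, but the second does not examine every combination. A hash lookup only finds equal keys, so this particular technique cannot be applied to a join on "<".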

Now let's see what happens if we make the queries schema-aware. The benchmark rules don't actually allow us to change the queries, but we're not trying to score benchmark points here, we're trying to improve our understanding of Saxon performance. To make the queries schema-aware we first have to write a schema for the XMark database, and then we have to add a few type declarations to the query source text. For example, q9 is changed to start:

import schema "" at "schema.xsd";
declare variable $auction as document-node(schema-element(site)) := .;

let $ca := $auction/site/closed_auctions/closed_auction
return ...

With Saxon 9.1.0.3 this changes the figures to

         1Mb       10Mb
q9       2.1ms     22.7ms
q11      17ms      1492ms

Note that q9 is a little slower and q11 significantly faster, which exemplifies why I'm cautious about telling people what to expect.

Now: something that I've had in mind for a very long time is to investigate what happens if I store typed values as well as string values in the tiny tree. In theory this gives a benefit because you convert a string to a value such as a number or a date only once, rather than each time the value is referenced. Today I finally got around to implementing this feature - it turns out to be only 100 lines of code or so. This is the impact:

         1Mb       10Mb
q9       1.6ms     17.6ms
q11      8.9ms     746.3ms

That is: the performance of q9 is pulled back to the non-schema-aware level, while the speed of q11 doubles! It's still quadratic, of course, but a 100% speed improvement in a slow-running query is always welcome. Interestingly, in principle this doesn't depend on the compiler having any extra knowledge: one could potentially get similar improvements simply by validating the input document and running the non-schema-aware query. When I try this, the results are:

         1Mb        10Mb
q9       1.75ms     18.9ms
q11      14.64ms    1249.5ms

Although there's an improvement, it seems that the full benefit comes from having the typed values available as numbers at run-time, and the compiled code also knowing that they will be numbers at run-time.

So: there seem to be considerable benefits in caching typed values. It's such a simple thing to do, it's really rather surprising it took me this long to realise it.
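The idea of caching typed values can be sketched in a few lines of Python. This is only an illustration of the principle, not the tiny tree code: the Node class, its methods and the sample values are all hypothetical, and Decimal stands in for whatever typed value the schema assigns.

```python
from decimal import Decimal


class Node:
    """Hypothetical tree node with a string value and an optional cached typed value."""

    conversions = 0  # counts string-to-number conversions, for illustration only

    def __init__(self, string_value):
        self.string_value = string_value
        self._typed_value = None

    def typed_value_uncached(self):
        # Convert the string every time the value is asked for
        Node.conversions += 1
        return Decimal(self.string_value)

    def typed_value_cached(self):
        # Convert on first use only, then reuse the stored value
        if self._typed_value is None:
            self._typed_value = self.typed_value_uncached()
        return self._typed_value


def count_pairs_less_than(xs, ys, get_value):
    """A quadratic "<" join: compare every x with every y."""
    return sum(1 for x in xs for y in ys if get_value(x) < get_value(y))


# Hypothetical values standing in for the two node sequences being joined
left = [Node(s) for s in ("100.50", "2500", "40")]
right = [Node(s) for s in ("99.99", "3000", "41.5")]

Node.conversions = 0
uncached = count_pairs_less_than(left, right, Node.typed_value_uncached)
uncached_conversions = Node.conversions  # two conversions per comparison

Node.conversions = 0
cached = count_pairs_less_than(left, right, Node.typed_value_cached)
cached_conversions = Node.conversions  # one conversion per node
```

The join is still quadratic in the number of comparisons, but the number of string-to-number conversions drops from two per comparison to one per node, which matches the pattern seen for q11: faster, but with the same growth rate.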

At the same time, don't ignore the other benefits of schema-awareness. In this round, I discovered a problem with the version of q7 that I have been using:

Q7. How many pieces of prose are in our database?

for $p in /site
return count($p//description) + count($p//annotation) + count($p//email)

The schema-aware compiler points out that there is no email element in the schema. I think that emailaddress was intended. But the query is strange anyway: annotation is a child element of description. All of which underlines the message that performance really doesn't matter much unless the code is correct.