The last remaining query in the XMark benchmark where the compiled code was running significantly slower than the interpreted code was Q10. The query goes along these lines:
for $i in distinct-values(/site/people/person/profile/interest/@category)
let $p := for $t in /site/people/person
          where $t/profile/interest/@category = $i
          return
            <personne>
              <statistiques>
                <sexe>{ $t/profile/gender }</sexe>
                <age>{ $t/profile/age }</age>
                <education>{ $t/profile/education }</education>
                <revenu>{ $t/profile/@income }</revenu>
              </statistiques>
              <otherStuff>................</otherStuff>
            </personne>
return
  <categorie>
    <id>{ $i }</id>
    { $p }
  </categorie>
It turned out that the interpreted code was writing all the constructed elements directly to the result stream, whereas the compiled code was constructing an intermediate tree representing the contents of variable $p, and then copying this to the result stream, which meant that the execution time was nearly doubled. The reason for this is that the interpreted code represents $p in a Closure object, essentially a value subject to lazy evaluation, and then makes a run-time decision whether to evaluate this in "pull" or "push" mode depending on how the value is actually used. The compiled code was always evaluating such variables in "pull" mode, leading to the unnecessary copy.
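The distinction can be illustrated with a minimal sketch. These are hypothetical classes, not Saxon's actual internals: in pull mode the caller iterates over the value, which forces constructed content to be materialised as an intermediate tree; in push mode the expression writes its items straight to a receiver, so constructed elements go directly to the result stream with no copy.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.function.Consumer;

// Hypothetical interface (not Saxon's) showing the two evaluation modes.
interface Expression {
    // Pull mode: the caller asks for items one by one, which typically
    // means building the constructed content first so it can be handed back.
    Iterator<String> iterate();

    // Push mode: the expression sends its items straight to a receiver,
    // avoiding the intermediate copy.
    void process(Consumer<String> receiver);
}

class ElementConstructor implements Expression {
    private final List<String> content;

    ElementConstructor(List<String> content) {
        this.content = content;
    }

    public Iterator<String> iterate() {
        // Pull: materialise the whole constructed tree, then iterate over it.
        List<String> tree = new ArrayList<>(content);
        return tree.iterator();
    }

    public void process(Consumer<String> receiver) {
        // Push: no intermediate tree; each item goes straight out.
        for (String item : content) {
            receiver.accept(item);
        }
    }
}
```

The run-time decision the interpreter makes is simply which of these two methods to call, based on how the value of the variable is actually consumed.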
I considered various solutions. One could reproduce the interpretive behaviour by allowing a compiled closure to be evaluated in either mode, still making a run-time decision; or one could make a decision based on compile-time analysis that the reference to $p occurs in a context where push evaluation is advantageous. Instead I decided to do something more radical, and rewrite the tree to expand the variable reference $p inline. This is something I've been meaning to do for a long time, but I was always a bit concerned about context dependencies: what if the reference to $p occurs somewhere where the context is different from the place where $p is declared? However, on checking the code, it seems that all the analysis to resolve this has already been done. Saxon is already making a compile-time decision to use a "non-memo-closure" for $p (one which evaluates the expression without remembering the result), and the conditions for that seem to be exactly the same as the conditions that allow inlining. Moreover, there's already a tree-rewriting operation that does variable inlining, which is used for a more restricted purpose. So I discovered to my great surprise that the code to inline $p was only an extra four lines!
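The rewrite itself is conceptually simple. Here is a minimal sketch, using a toy expression tree rather than Saxon's real classes: inlining a LET variable means substituting the variable's initialising expression for every reference to it in the body, after which the binding itself disappears. As noted above, this is only safe when the initialiser has no dependencies on the context at the binding site.

```java
// Toy expression tree (hypothetical, not Saxon's) demonstrating
// variable inlining as a tree-rewriting operation.
abstract class Expr {
    // Return a tree in which references to `var` are replaced by `value`.
    abstract Expr inline(String var, Expr value);
    abstract String eval();
}

class VarRef extends Expr {
    final String name;
    VarRef(String name) { this.name = name; }
    Expr inline(String var, Expr value) {
        // The substitution: a matching reference becomes the initialiser.
        return name.equals(var) ? value : this;
    }
    String eval() { throw new IllegalStateException("unbound: " + name); }
}

class Literal extends Expr {
    final String text;
    Literal(String text) { this.text = text; }
    Expr inline(String var, Expr value) { return this; }
    String eval() { return text; }
}

class Concat extends Expr {
    final Expr left, right;
    Concat(Expr left, Expr right) { this.left = left; this.right = right; }
    Expr inline(String var, Expr value) {
        return new Concat(left.inline(var, value), right.inline(var, value));
    }
    String eval() { return left.eval() + right.eval(); }
}

class Let extends Expr {
    final String var;
    final Expr init, body;
    Let(String var, Expr init, Expr body) {
        this.var = var; this.init = init; this.body = body;
    }
    Expr inline(String v, Expr value) {
        return new Let(var, init.inline(v, value), body.inline(v, value));
    }
    // The rewrite: drop the binding and substitute the initialiser for
    // each reference in the body.
    Expr inlined() { return body.inline(var, init); }
    String eval() { return inlined().eval(); }
}
```

After inlining, the former initialiser sits directly in the context where its value is used, so the compiler can choose push evaluation for it just as it would for any other expression written at that position.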
The effect on the performance of the interpreted code is an improvement from 167ms to 161ms - nothing dramatic, as one would expect, because all we've done is to eliminate the small overhead of remembering the variable definition and evaluating it on first use. For the compiled code, however, inlining eliminates the tree copy operation, and improves the performance from 301ms to 104ms.
All the more complex queries are now showing a speed up (for compiled versus interpreted execution) of between 25% and 50%. That's good enough for a first release, although I hope to improve on it. I would hope that real user workloads will see higher figures: as the work on the Knight's Tour showed, much higher speed-up is possible in cases where there are lots of variables, function calls, and computation within the query.
But before the code can be released there is still a lot to do. Mainly this means filling in features that the test suites don't exercise, such as external Java functions, user-defined collations, and so on; plus more work on the functions and expressions that XQTS tests only with constant expressions that can be evaluated at compile time. And then a review of the hundreds of TODO statements scattered around the code.