I would like to report on some Saxon performance measure on a Word ladder solution implemented in XSLT.
Firstly, some background information on the Word ladder problem. From Wikipedia, the free encyclopedia:
A word ladder (also known as a doublets, word-links, or Word golf) is a word game invented by Lewis Carroll. A word ladder puzzle begins with two given words, and to solve the puzzle one must find the shortest chain of other words to link the two given words, in which chain every two adjacent words (that is, words in successive steps) differ by exactly by one letter.
XSLT interest in this problem was first started (to the best of my knowledge) by Dimitre Novatchev through the mulberry mailing list, who provides a 20 step guide to create a stylesheet in his blog to solve the Word ladder problem (FindChainOfWordsHamming.xsl). Following the post on the list, there has been some interest; another solution to this problem was given by Wolfgang Laun (please see thread, file: FindChainOfWordsHamming2.xsl).
Experimental Evaluation
Our interest resides in the Saxon performances only. I was curious and surprised by the results reported by Dimitre. The question I had is why Dimitre's stylesheet was much slower than Wolfgang's stylesheet in Saxon and faster in another XSLT processor: there must be some optimization step we were not making. I was motivated to understand were the bottle necks were and how we could improve the performance in Saxon.
Wolfgang wrote: "The XSLT program is three times faster on one XSLT implementation than on another one is strange, 'very' strange".
Mike Kay addressed Wolfgang's comment by writing in the thread: "No, it's extremely common. In fact, very much larger factors than this are possible. Sometimes Saxon-EE runs 1000 times faster than Saxon-HE. This effect is normal with declarative languages where powerful optimizations are deployed - SQL users will be very familiar with the effect."
The table below shows the execution times of the stylesheets in Saxon 9.XX (for some recent X). Time were reported by Dimitre.
Transformation | Times (secs) |
---|---|
Dimitre | 39 |
Wolfgang | 25 |
We observe that Wolfgang's transformation is 1.56 times faster. Please note that with Wolfgang's stylesheet his results lists all solutions (i.e. ladders), whereas Dimitre only finds one.
Saxon represents a stylesheet as a compiled abstract syntax tree (AST) which is processed in a interpreted manner. Since the release of Saxon 9.4 we have included the bytecode generation feature, which allows us at the compilation phase to generate directly the byte code representation of the entire AST or sub-trees of it where performance benefits can be achieved. We make use of properties we know at compile time (See full paper).
Analysis of Dimitre's Stylesheet
Step one was to see how well Saxon does with the bytecode feature switched on. This proved inconclusive because we discovered a bug in the bytecode generated. A useful exercise already, we managed to fix the bug (see bug issue: #1653). The problem was in the function processQueue the tail recursive call was not being properly generated into bytecode.
The Table below shows running times of the stylesheets under Saxon 9.4.0.6. We observe that Wolfgang's stylesheet was 2.07 and 3.22 faster in Saxon Intepreted and bytecode, respectively.
Transformation | Interpreted - Times (secs) | With bytecode generation - Times (secs) |
---|---|---|
Dimitre | 7.95 | 7.78 |
Wolfgang | 3.83 | 2.41 |
Analyzing Dimitre's stylesheet with the Saxon tracing profile (i.e. option -TP) proved useful. See the html output produced by Saxon below. We observe that there is a big hit on the processNode method, with the most time spent in this function.
Analysis of Stylesheet Execution Time
Total time: 9498.871 milliseconds
Time spent in each template or function:
The table below is ordered by the total net time spent in the template or function. Gross time means the time including called templates and functions; net time means time excluding time spent in called templates and functions.
file | line | instruction | count | avg time (gross) | total time (gross) | avg time (net) | total time (net) |
---|---|---|---|---|---|---|---|
"*rdsHamming.xsl" | 79 | function my:processNode | 2053 | 4.12 | 8470.67 | 3.729 | 7655.792 |
"*rdsHamming.xsl" | 21 | function my:chainOfWords | 1 | 9491.1 | 9491.12 | 993.34 | 993.34 |
"*rdsHamming.xsl" | 131 | function f:eq | 3993 | 0.06 | 230.02 | 0.058 | 230.26 |
"*rdsHamming.xsl" | 131 | function my:HammingDistance | 3993 | 0.20 | 807.38 | 0.049 | 194.77 |
"*func-apply.xsl" | 21 | function f:apply | 15972 | 0.01 | 290.01 | 0.011 | 175.00 |
"*-Operators.xsl" | 244 | template f:eq | 15972 | 0.01 | 115.01 | 0.004 | 68.23 |
"*-Operators.xsl" | 248 | function f:eq | 15972 | 0.003 | 46.77 | 0.003 | 46.77 |
"*nc-zipWith.xsl" | 21 | function f:zipWith | 19965 | 0.002 | 33.11 | 0.002 | 33.11 |
"*nc-zipWith.xsl" | 9 | function f:zipWith | 19965 | 0.003 | 57.67 | 0.001 | 24.56 |
"*func-apply.xsl" | 16 | function f:apply | 15972 | 0.019 | 309.52 | 0.001 | 19.52 |
"*rdsHamming.xsl" | 70 | function my:processQueue | 2053 | 0.009 | 18.35 | 0.009 | 18.35 |
"*hFunctions.xsl" | 498 | function f:string-to-codepoints | 3993 | 0.003 | 10.52 | 0.003 | 10.52 |
"*rdsHamming.xsl" | 120 | function my:HammingDistance | 3993 | 0.204 | 814.48 | 0.002 | 7.09 |
"*hFunctions.xsl" | 498 | function f:string-to-codepoints | 3993 | 0.001 | 4.88 | 0.001 | 4.88 |
"*rdsHamming.xsl" | 73 | function my:processNode | 2053 | 4.128 | 8475.2 | 0.002 | 4.57 |
"*rdsHamming.xsl" | 54 | function my:processQueue | 2053 | 0.011 | 22.20 | 0.002 | 3.85 |
"*rdsHamming.xsl" | 17 | template /* | 1 | 9491.87 | 9491.9 | 0.756 | 0.76 |
"*rdsHamming.xsl" | 40 | function my:chainOfWords | 1 | 0.344 | 0.34 | 0.344 | 0.34 |
"*rdsHamming.xsl" | 117 | function my:enumerate | 10 | 0.166 | 1.65 | 0.029 | 0.29 |
"*rdsHamming.xsl" | 111 | function my:enumerate | 10 | 0.176 | 1.76 | 0.010 | 0.10 |
In addition to the Saxon tracing profile I ran the Java hrof profiling tool, which
showed up that most time was spent in comparing strings. See the Java profile results
below. It was now obvious that the GeneralComparison expression was in question. Specifically
we narrowed it down to the instruction: <xsl:for-each select="$vNeighbors[not(. = $pExcluded)]">
. For the interpreted code we were doing some unnecessary runtime type checking when
we know statically at compile time that we are comparing string values. More Specifically,
we know at compile time that $vNeighbors is a sequence of untyped atomic values and
$pExcluded is a sequence of strings. We were unnecessarily checking at runtime that
untyped atomic and string literal were comparable and we were doing an unnecessary
conversion from an untyped atomic to string.
CPU SAMPLES BEGIN (total = 1213) Thu Nov 29 14:42:47 2012 rank self accum count trace method 1 24.24% 24.24% 294 300547 java.lang.Integer.hashCode 2 19.13% 43.36% 232 300581 net.sf.saxon.expr.GeneralComparison.compare 3 7.75% 51.11% 94 300613 java.util.HashMap.getEntry 4 2.14% 53.26% 26 300570 java.util.LinkedHashMap$Entry.recordAccess 5 2.06% 55.32% 25 300234 java.lang.ClassLoader.defineClass1 6 2.06% 57.38% 25 300616 com.saxonica.expr.ee.GeneralComparisonEE.effectiveBooleanValue 7 1.98% 59.36% 24 300603 java.util.LinkedHashMap$Entry.recordAccess 8 1.98% 61.34% 24 300609 net.sf.saxon.type.Converter.convert ....
See full hprof results: java.hprof-DN.txt
Improvements in Bytecode generation
In the bytecode we discovered we were missing out on opportunities to capitalise on static properties we know at compile time. For example during atomization we were doing an instanceof test to see whether each item was a node when we already know from static analysis that this was the case. We were also able to avoid unnecessary conversions of the strings, checking of instanceof and we found we could avoid repeated conversions by saving of string values for reuse when appropriate.
With the code improvements discussed above we were able to apply them in Saxon-EE 9.5 (pre-release). The table below shows these running times on the stylesheet written by Dimitre and Wolfgang. We observe that in the interpreted code that Wolfgang's XSL is 2.13 times faster than Dimitre (This is similar to Dimitre results above). With the bytecode generation feature switched on: Dimitre's stylesheet has dramatically improved in performance and is now 1.19 times faster than Wolfgang's XSL.
Transformation | Interpreted - Times (secs) | With bytecode generation - Times (secs) |
---|---|---|
Dimitre | 7.373 | 1.938 |
Wolfgang | 3.450 | 2.17 |
We have not done any similar analysis on Wolfgang's stylesheet, we will now attempt to do this.
To be continued....