Saxon performance measures of the Word Ladders problem in XSLT

I would like to report on some Saxon performance measure on a Word ladder solution implemented in XSLT.

Firstly, some background information on the Word ladder problem. From Wikipedia, the free encyclopedia:

A word ladder (also known as a doublets, word-links, or Word golf) is a word game invented by Lewis Carroll. A word ladder puzzle begins with two given words, and to solve the puzzle one must find the shortest chain of other words to link the two given words, in which chain every two adjacent words (that is, words in successive steps) differ by exactly by one letter.

XSLT interest in this problem was first started (to the best of my knowledge) by Dimitre Novatchev through the mulberry mailing list, who provides a 20 step guide to create a stylesheet in his blog to solve the Word ladder problem (FindChainOfWordsHamming.xsl). Following the post on the list, there has been some interest; another solution to this problem was given by Wolfgang Laun (please see thread, file: FindChainOfWordsHamming2.xsl).

Experimental Evaluation

Our interest resides in the Saxon performances only. I was curious and surprised by the results reported by Dimitre. The question I had is why Dimitre's stylesheet was much slower than Wolfgang's stylesheet in Saxon and faster in another XSLT processor: there must be some optimization step we were not making. I was motivated to understand were the bottle necks were and how we could improve the performance in Saxon.

Wolfgang wrote: "The XSLT program is three times faster on one XSLT implementation than on another one is strange, 'very' strange".

Mike Kay addressed Wolfgang's comment by writing in the thread: "No, it's extremely common. In fact, very much larger factors than this are possible. Sometimes Saxon-EE runs 1000 times faster than Saxon-HE. This effect is normal with declarative languages where powerful optimizations are deployed - SQL users will be very familiar with the effect."

The table below shows the execution times of the stylesheets in Saxon 9.XX (for some recent X). Time were reported by Dimitre.

Transformation	Times (secs)
Dimitre	39
Wolfgang	25

We observe that Wolfgang's transformation is 1.56 times faster. Please note that with Wolfgang's stylesheet his results lists all solutions (i.e. ladders), whereas Dimitre only finds one.

Saxon represents a stylesheet as a compiled abstract syntax tree (AST) which is processed in a interpreted manner. Since the release of Saxon 9.4 we have included the bytecode generation feature, which allows us at the compilation phase to generate directly the byte code representation of the entire AST or sub-trees of it where performance benefits can be achieved. We make use of properties we know at compile time (See full paper).

Analysis of Dimitre's Stylesheet

Step one was to see how well Saxon does with the bytecode feature switched on. This proved inconclusive because we discovered a bug in the bytecode generated. A useful exercise already, we managed to fix the bug (see bug issue: #1653). The problem was in the function processQueue the tail recursive call was not being properly generated into bytecode.

The Table below shows running times of the stylesheets under Saxon 9.4.0.6. We observe that Wolfgang's stylesheet was 2.07 and 3.22 faster in Saxon Intepreted and bytecode, respectively.

Transformation	Interpreted - Times (secs)	With bytecode generation - Times (secs)
Dimitre	7.95	7.78
Wolfgang	3.83	2.41

Analyzing Dimitre's stylesheet with the Saxon tracing profile (i.e. option -TP) proved useful. See the html output produced by Saxon below. We observe that there is a big hit on the processNode method, with the most time spent in this function.

Analysis of Stylesheet Execution Time

Total time: 9498.871 milliseconds

Time spent in each template or function:

The table below is ordered by the total net time spent in the template or function. Gross time means the time including called templates and functions; net time means time excluding time spent in called templates and functions.

file	line	instruction	count	avg time (gross)	total time (gross)	avg time (net)	total time (net)
"*rdsHamming.xsl"	79	function my:processNode	2053	4.12	8470.67	3.729	7655.792
"*rdsHamming.xsl"	21	function my:chainOfWords	1	9491.1	9491.12	993.34	993.34
"*rdsHamming.xsl"	131	function f:eq	3993	0.06	230.02	0.058	230.26
"*rdsHamming.xsl"	131	function my:HammingDistance	3993	0.20	807.38	0.049	194.77
"*func-apply.xsl"	21	function f:apply	15972	0.01	290.01	0.011	175.00
"*-Operators.xsl"	244	template f:eq	15972	0.01	115.01	0.004	68.23
"*-Operators.xsl"	248	function f:eq	15972	0.003	46.77	0.003	46.77
"*nc-zipWith.xsl"	21	function f:zipWith	19965	0.002	33.11	0.002	33.11
"*nc-zipWith.xsl"	9	function f:zipWith	19965	0.003	57.67	0.001	24.56
"*func-apply.xsl"	16	function f:apply	15972	0.019	309.52	0.001	19.52
"*rdsHamming.xsl"	70	function my:processQueue	2053	0.009	18.35	0.009	18.35
"*hFunctions.xsl"	498	function f:string-to-codepoints	3993	0.003	10.52	0.003	10.52
"*rdsHamming.xsl"	120	function my:HammingDistance	3993	0.204	814.48	0.002	7.09
"*hFunctions.xsl"	498	function f:string-to-codepoints	3993	0.001	4.88	0.001	4.88
"*rdsHamming.xsl"	73	function my:processNode	2053	4.128	8475.2	0.002	4.57
"*rdsHamming.xsl"	54	function my:processQueue	2053	0.011	22.20	0.002	3.85
"*rdsHamming.xsl"	17	template /*	1	9491.87	9491.9	0.756	0.76
"*rdsHamming.xsl"	40	function my:chainOfWords	1	0.344	0.34	0.344	0.34
"*rdsHamming.xsl"	117	function my:enumerate	10	0.166	1.65	0.029	0.29
"*rdsHamming.xsl"	111	function my:enumerate	10	0.176	1.76	0.010	0.10

In addition to the Saxon tracing profile I ran the Java hrof profiling tool, which showed up that most time was spent in comparing strings. See the Java profile results below. It was now obvious that the GeneralComparison expression was in question. Specifically we narrowed it down to the instruction: <xsl:for-each select="$vNeighbors[not(. = $pExcluded)]">. For the interpreted code we were doing some unnecessary runtime type checking when we know statically at compile time that we are comparing string values. More Specifically, we know at compile time that $vNeighbors is a sequence of untyped atomic values and $pExcluded is a sequence of strings. We were unnecessarily checking at runtime that untyped atomic and string literal were comparable and we were doing an unnecessary conversion from an untyped atomic to string.

CPU SAMPLES BEGIN (total = 1213) Thu Nov 29 14:42:47 2012
rank   self  accum   count trace method
   1 24.24% 24.24%     294 300547 java.lang.Integer.hashCode
   2 19.13% 43.36%     232 300581 net.sf.saxon.expr.GeneralComparison.compare
   3  7.75% 51.11%      94 300613 java.util.HashMap.getEntry
   4  2.14% 53.26%      26 300570 java.util.LinkedHashMap$Entry.recordAccess
   5  2.06% 55.32%      25 300234 java.lang.ClassLoader.defineClass1
   6  2.06% 57.38%      25 300616 com.saxonica.expr.ee.GeneralComparisonEE.effectiveBooleanValue
   7  1.98% 59.36%      24 300603 java.util.LinkedHashMap$Entry.recordAccess
   8  1.98% 61.34%      24 300609 net.sf.saxon.type.Converter.convert
....

See full hprof results: java.hprof-DN.txt

Improvements in Bytecode generation

In the bytecode we discovered we were missing out on opportunities to capitalise on static properties we know at compile time. For example during atomization we were doing an instanceof test to see whether each item was a node when we already know from static analysis that this was the case. We were also able to avoid unnecessary conversions of the strings, checking of instanceof and we found we could avoid repeated conversions by saving of string values for reuse when appropriate.

With the code improvements discussed above we were able to apply them in Saxon-EE 9.5 (pre-release). The table below shows these running times on the stylesheet written by Dimitre and Wolfgang. We observe that in the interpreted code that Wolfgang's XSL is 2.13 times faster than Dimitre (This is similar to Dimitre results above). With the bytecode generation feature switched on: Dimitre's stylesheet has dramatically improved in performance and is now 1.19 times faster than Wolfgang's XSL.

Transformation	Interpreted - Times (secs)	With bytecode generation - Times (secs)
Dimitre	7.373	1.938
Wolfgang	3.450	2.17

We have not done any similar analysis on Wolfgang's stylesheet, we will now attempt to do this.

To be continued....