Tuning the Chess-FO stylesheet

By Michael Kay on July 27, 2006 at 02:18p.m.

In an interesting post (at least partly influenced by my compilation experiments) Anton Lapounov has published some Saxon performance measurements for a number of 1.0 stylesheets. I'm intrigued that I get faster performance for Saxon 8.7.3 than for 6.5.5, whereas he gets the opposite. I decided to look into the performance of one of these stylesheets a bit more closely and see how it can be improved. The one I chose is the Chess-FO stylesheet which takes as input the record of moves in a chess game and produces as output an XSL-FO document showing the successive positions.

Firstly, it's clear just from looking at the source code that evaluating path expressions doesn't play much of a role in this code. It's spending most of its time constructing output. There are a fair number of template calls, some of them tail-recursive. The ratio of output size to input size (377Kb from 3Kb) is unusually large, so we should expect that the cost of creating elements and serializing them will dominate.

I started by seeing how far I could improve the performance simply by tuning the XSLT code as written. The baseline execution time was 301ms (in this, and all subsequent figures, I'm reporting the second-best out of nine runs, running from the command line with the -9 option to run the transformation nine times, and the -t option to output the results.

First I modified the stylesheet to say version="2.0", and put in a couple of explicit type conversions in places where the code had been implicitly converting between strings and numbers. Time now 280ms.

My next move was to profile the stylesheet execution using the -TP option on the command line. This generates a trace file with timing information, which can be analyzed using the timing-profile.xsl stylesheet (in the samples/analysis directory of the Saxon distribution). This showed one template dominating the execution time: template render-chessboard-cell. So I focused my efforts on this template, which reads:

  <xsl:template name="render-chessboard-cell">
    <xsl:param name="col-number" select="1"/>
    <xsl:param name="row-number"/>
    <xsl:param name="row-content"/>
    <xsl:variable name="cell-background">
      <xsl:choose>
        <xsl:when test="($col-number mod 2 = 0 and $row-number mod 2 = 0) or
                        ($col-number mod 2 != 0 and $row-number mod 2 != 0)">
          <xsl:value-of select="'b'"/>
        </xsl:when>
        <xsl:otherwise>
          <xsl:value-of select="'w'"/>
        </xsl:otherwise>
      </xsl:choose>
    </xsl:variable>
    <xsl:variable name="cell-content" select="substring($row-content, $col-number, 1)"/>
    <fo:table-cell>
      <xsl:if test="$cell-background = 'b'">
        <xsl:attribute name="background-color">#C0C0C0</xsl:attribute>
      </xsl:if>
      <fo:block>
        <xsl:if test="$cell-content != 'x'">
          <xsl:call-template name="piece-image">
            <xsl:with-param name="piece-name" select="$cell-content"/>
            <xsl:with-param name="cell-background" select="$cell-background"/>
          </xsl:call-template>
        </xsl:if>
      </fo:block>
    </fo:table-cell>
    <xsl:if test="$col-number &lt; 8">
      <!-- recurse to the next cell -->
      <xsl:call-template name="render-chessboard-cell">
        <xsl:with-param name="row-number"  select="$row-number"/>
        <xsl:with-param name="col-number"  select="$col-number + 1"/>
        <xsl:with-param name="row-content" select="$row-content"/>
      </xsl:call-template>
    </xsl:if>
  </xsl:template>

I started by declaring types on the parameters and variables, and changing xsl:value-of to xsl:sequence (no need to create a text node when you only want a string). Time now down to 210ms.

Then I dealt with the recursion in this template, replacing it by an iteration. The call on this template, as written, was:

    <fo:table-row height="24pt">
      <xsl:call-template name="render-chessboard-cell">
        <xsl:with-param name="row-number" select="$row-number"/>
        <xsl:with-param name="count" select="1"/>
        <xsl:with-param name="row-content">
          <xsl:value-of select="substring($chessboard, $row-number * 8 - 7, 8)"/>
        </xsl:with-param>
      </xsl:call-template>
    </fo:table-row>

The count parameter is noise (it isn't even declared as a parameter: XSLT 2.0 spots this error). The row-content parameter uses the horribly unnecessary RTF construction: it should simply be written as

<xsl:with-param name="row-content" 
     select="substring($chessboard, $row-number * 8 - 7, 8)"/>

And then we can change it to call the template 8 times using <xsl:for-each select="1 to 8">, passing the column number as a parameter. This gave no further time savings, however. Using recursion to do simple iteration is a pain from the usability point of view, but it's not actually inefficient. But to tidy things up further, I removed the recursion to scan over the rows as well. This time the timing came down a little to 191ms.

Then I collapsed the three templates into one, so that the board is printed using two nested for-each loops. Time now down to 130ms!

The body of the template is still inefficient: the calculation of the colour of a cell can be simplified, and we don't need the variable $row-content. It's also passing a parameter cell-background to the template piece-image, which the called template doesn't actually use. So the template now looks like this:

      <fo:table-body>
        <xsl:for-each select="1 to 8">
          <xsl:variable name="row-number" as="xs:integer" select="."/>
          <xsl:variable name="row-content" select="substring($chessboard, $row-number * 8 - 7, 8)"/>
                <fo:table-row height="24pt">
                  <xsl:for-each select="1 to 8">
                    <xsl:variable name="col-number" as="xs:integer" select="."/>
                    <xsl:variable name="cell-number" as="xs:integer" 
                                  select="($row-number - 1) * 8 + $col-number"/>
                        <xsl:variable name="cell-content"  as="xs:string" 
                                      select="substring($chessboard, $cell-number, 1)"/>
                        <fo:table-cell>
                          <xsl:if test="$cell-number mod 2 = 0">
                            <xsl:attribute name="background-color">#C0C0C0</xsl:attribute>                    
                          </xsl:if>
                    
                          <fo:block>
                            <xsl:if test="$cell-content != 'x'">
                              <xsl:call-template name="piece-image">
                                <xsl:with-param name="piece-name" select="$cell-content"/>
                              </xsl:call-template>
                            </xsl:if>
                          </fo:block>
                        </fo:table-cell>
                  </xsl:for-each>  
                </fo:table-row>
        </xsl:for-each>  
      </fo:table-body>

No further improvement in performance, though (in fact, this time the second-best time is 140ms, but I'll put that down to random fluctuation).

The next obvious target is the piece-image template, which is using a 16-way xsl:choose statement to replace characters such as P, B, or K with a single Unicode character representing the chess piece. This can clearly be done in a single translate() call.

I thought this would make a significant difference, but the time is stubbornly sticking at 130ms. Still, that's down below 40% of the starting figure.

Rerunning the profiling, the render-chessboard template is still accounting for 80% of the execution time, so there's not much point looking anywhere else. There are still a couple of small tweaks we can make: removing variables that are only used once, and expanding them inline, for example. The best time is now down to 110ms: but I said I would use the second-best, and that's still 130ms. This is probably as good as we are going to get by tuning the stylesheet. The relevant template now looks like this:

  <xsl:template name="render-chessboard">
    <xsl:param name="chessboard"/>
    <fo:table width="192pt"
              height="192pt"
              border="thin solid black"
              start-indent="0.26in"
              end-indent="0.26in"
              font="20pt/0pt Chess">
      <fo:table-column column-width="24pt" number-columns-repeated="8"/>
      <fo:table-body>
        <xsl:for-each select="0 to 7">
          <xsl:variable name="row-offset" as="xs:integer" select=". * 8"/>
                <fo:table-row height="24pt">
                  <xsl:for-each select="1 to 8">
                    <xsl:variable name="cell-number" as="xs:integer" select="$row-offset + ."/>                    
                        <fo:table-cell>
                          <xsl:if test="$cell-number mod 2 = 0">
                            <xsl:attribute name="background-color">#C0C0C0</xsl:attribute>                    
                          </xsl:if>                    
                          <fo:block>
                            <xsl:value-of select="translate(substring($chessboard, $cell-number, 1), 
                              'KQRBNPkqrbnpx', 
       '&#x2654;&#x2655;&#x2656;&#x2657;&#x2658;&#x2659;&#x265A;&#x265B;&#x265C;&#x265D;&#x265E;&#x265F;')"/>
                          </fo:block>
                        </fo:table-cell>
                  </xsl:for-each>  
                </fo:table-row>
        </xsl:for-each>  
      </fo:table-body>
    </fo:table>
  </xsl:template>


I expect that the costs are now dominated by constructing and serializing the various FO elements and attributes, which we can't reduce any further. Let's run some profiling at the Java level, by adding -Xrunhprof:cpu=samples to the command line. The only sample points which occur more than once are these:

CPU SAMPLES BEGIN (total = 185) Thu Jul 27 11:58:26 2006
rank   self  accum   count trace method
   1  8.11%  8.11%      15 300106 java.lang.ClassLoader.defineClass1
   2  4.86% 12.97%       9 300316 java.io.FileOutputStream.close0
   3  3.24% 16.22%       6 300001 java.lang.ClassLoader.findBootstrapClass
   4  2.16% 18.38%       4 300311 java.lang.Runtime.gc
   5  1.62% 20.00%       3 300306 net.sf.saxon.functions.Substring.substring
   6  1.62% 21.62%       3 300011 java.util.jar.JarFile.getManifest
   7  1.08% 22.70%       2 300214 com.sun.org.apache.xerces.internal.util.XMLChar.<clinit>
   8  1.08% 23.78%       2 300112 java.lang.Class.forName0
   9  1.08% 24.86%       2 300052 sun.misc.FloatingDecimal.<clinit>
  10  1.08% 25.95%       2 300227 net.sf.saxon.style.StyleElement.processVersionAttribute
  11  1.08% 27.03%       2 300257 java.util.AbstractList.iterator
  12  1.08% 28.11%       2 300256 net.sf.saxon.pattern.Pattern.make
  13  1.08% 29.19%       2 300041 sun.security.provider.Sun$1.run
  14  1.08% 30.27%       2 300207 com.sun.org.apache.xerces.internal.impl.XMLEntityManager.setupCurrentEntity
  15  1.08% 31.35%       2 300332 java.io.FileOutputStream.open


This doesn't give us much to go on, unfortunately. There's a fair bit of system overhead loading classes, opening files, and running the garbage collector. The most-sampled bit of Saxon code is the substring function, and that's followed by the code for parsing XSLT patterns at compile time. With only a couple of hits on each sample point, this data is pretty random: we need a much longer run for the samples to be meaningful.

So I changed the command-line code in my IDE to run the transformation 1000 times, with these results:

   1 49.97% 49.97%    7979 300137 java.net.PlainSocketImpl.socketAccept
   2  2.12% 52.09%     339 300289 sun.nio.cs.UTF_8$Encoder.encodeArrayLoop
   3  1.79% 53.89%     286 300293 net.sf.saxon.event.XMLEmitter.testCharacters
   4  1.70% 55.58%     271 300312 net.sf.saxon.functions.Substring.substring
   5  1.61% 57.19%     257 300314 java.io.FileOutputStream.writeBytes
   6  1.57% 58.76%     250 300296 net.sf.saxon.value.StringValue.getStringLength
   7  1.42% 60.18%     227 300297 net.sf.saxon.value.StringValue.getStringLength
   8  1.23% 61.41%     196 300305 java.io.FileOutputStream.open
   9  0.97% 62.38%     155 300333 java.io.FileOutputStream.close0
  10  0.96% 63.34%     154 300304 java.lang.AbstractStringBuilder.append
  11  0.86% 64.20%     137 300345 net.sf.saxon.instruct.LocalParam.evaluateVariable
  12  0.85% 65.05%     136 300327 net.sf.saxon.om.NamePool.getCodeForPrefix
  13  0.83% 65.89%     133 300307 java.lang.StringBuffer.toString
  14  0.80% 66.68%     127 300379 net.sf.saxon.expr.Assignation.evaluateVariable
  15  0.79% 67.47%     126 300318 java.lang.AbstractStringBuilder.<init>
  16  0.75% 68.22%     120 300366 java.io.FileInputStream.open
  17  0.75% 68.97%     120 300468 java.lang.String.<init>

I don't actually know what that dominant call on socketAccept is doing, but I'll assume it's an artefact of the CPU sampling and ignore it. Next to that in importance, we have a number of routines concerned with outputting of characters: UTF-8 encoding (I should have mentioned that I changed the output to UTF-8 to see what the effect would be, and there was none noticeable), Saxon's testCharacters which tests that characters are encodable, and various methods in FileOutputStream. Apart from that there are various methods for string manipulation, and routines for evaluating local variables and parameters.

I see now why changing to UTF-8 encoding made no difference. The routine testCharacters is still called to test whether every character appearing in an element name is present in the chosen encoding. If the encoding is UTF-8, that's clearly unnecessary, so there's something I can fix.

The call on Substring is evaluating the first argument of translate() in the expression

translate(substring($chessboard, $cell-number, 1),

This is a more expensive call than you might imagine, because Saxon has to scan the string looking for Unicode surrogate pairs, which occupy four bytes rather than the usual two: it's not possible to simply index into the string. Similarly, translate() itself has to step through the various strings supplied as arguments, rather than indexing into them directly. This suggests an opportunity to mark strings as "safe" once they have been examined and found to contain no characters outside the 2-byte range. However, there's another clue here: another trace shows that Substring() is calling ItemChecker and CardinalityChecker to perform type checking on one of the arguments. That reminds me that I didn't declare the type of $chessboard (because I wasn't sure what type it actually was). Back to the stylesheet, declare the type as xs:string. Second-best time is now 120ms.

There are quite a few calls on getDisplayName() in the NamePool, which does string concatenation to construct (prefix + ':' + localname). This is happening in the serializer when writing start tags. There's scope here to reduce the cost by caching, or by getting the components of the name separately, or perhaps even by storing the display name explicitly in the NamePool rather than reconstructing it on demand. But at user level, perhaps I could reduce the cost by making the FO namespace be the default namespace? Wow: second-best time is now 100ms.

I'll leave it there for now: I've cut the execution time to a third, by some pretty simple source-level changes to the stylesheet. It's encouraging to know that this can be achieved with a stylesheet even when there are no obvious gross inefficiencies in the coding. It does require some exploitation of XSLT 2.0 features, of course. I've also identified one or two places where I think Saxon could be tweaked internally. And I hope I've given you, dear reader, a demonstration of the methodology I use when tackling performance problems.

Saxonica, of course, is always pleased to tackle performance-related consultancy work.