Capturing Accumulators

A recent post on StackOverflow made me realise that streaming accumulators in XSLT 3.0 are much harder to use than they need to be.

A reminder about what accumulators do. The idea is that as you stream your way through a large document, you can have a number of tasks running in the background (called accumulators) which observe the document as it goes past, and accumulate information which is then available to the "main" line of processing in the foreground. For example, you might have an accumulator that simply keeps a note of the most recent section heading in a document; that's useful because the foreground processing can't simply navigate around the document to find the current section heading when it finds that it's needed.

Accumulator rules can fire on start tags, on end tags, or on both, and they can also be associated with text nodes or attributes. But there's a severe limitation: a streaming accumulator must be motionless, which is XSLT 3.0 streaming jargon meaning that it can only see what's on the parser's stack at the time the accumulator triggers. This affects both the pattern that controls when the accumulator is triggered, and the action that it can take when the rule fires.

For example, you can't fire a rule with the pattern match="section[title='introduction']" because navigation to child elements (title) is not allowed in a motionless pattern. Similarly, if the rule fires on match="section", then you can't access the title in the rule action (select="title") because the action too must be motionless. In some cases a workaround is to have an accumulator that matches the text nodes (match="section/title/text()[.='introduction']") but that doesn't work if section titles can have mixed content.
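
As a minimal sketch of that text-node workaround (the accumulator name current-heading is made up for illustration, the xs prefix is assumed to be bound to the XML Schema namespace, and each title is assumed to contain a single text node):

<xsl:accumulator name="current-heading" as="xs:string" initial-value="''" streamable="yes">
    <!-- Motionless: the pattern only looks upwards from the text node, and the
         action reads only the string value of the text node itself -->
    <xsl:accumulator-rule match="section/title/text()" select="string(.)"/>
</xsl:accumulator>

The foreground processing should then be able to call accumulator-before('current-heading') wherever it needs the heading, provided the accumulator is listed in use-accumulators for the streamed mode or source document.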

It turns out there's a simple fix, which I call a capturing accumulator rule. A capturing accumulator rule is indicated by the extension attribute saxon:capture="yes" on the xsl:accumulator-rule element, as in <xsl:accumulator-rule saxon:capture="yes" phase="end">; it will always be a rule that fires on an end-element tag. For a capturing rule, the background process listens to all the parser events that occur between the start tag and the end tag, and uses these to build a snapshot copy of the node. A snapshot copy is like the result of the fn:snapshot function: it's a deep copy of the matched node, with ancestor elements and their attributes tagged on for good measure. This snapshot copy is then available to the action part of the rule processing the end tag. The match patterns that trigger the accumulator rule still need to be motionless, but the action part now has access to a complete copy of the element (plus its ancestor elements and their attributes).
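
To relate this to the section example above, here is a hedged sketch of a capturing rule that counts the sections titled 'introduction'. The accumulator name intro-count is invented for illustration, and the saxon and xs prefixes are assumed to be bound to http://saxon.sf.net/ and the XML Schema namespace respectively:

<xsl:accumulator name="intro-count" as="xs:integer" initial-value="0" streamable="yes">
    <!-- The pattern stays motionless; the action sees a snapshot of the whole
         section, so it is free to navigate down to the child title -->
    <xsl:accumulator-rule match="section" phase="end" saxon:capture="yes"
        select="if (title = 'introduction') then $value + 1 else $value"/>
</xsl:accumulator>

The price is that each section is buffered in memory until its end tag arrives, so a rule like this is best suited to elements of modest size.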

Here's an example. Suppose you've got a large document like the XSLT specification, and you want to produce a sorted glossary at the end, and you want to do it all in streamed mode. Scattered throughout the document are term definitions like this:

<termdef id="dt-stylesheet" term="stylesheet">A  <term>stylesheet</term> consists of one or more packages: specifically, one
   <termref def="dt-top-level-package">top-level package</termref> and zero or
   more <termref def="dt-library-package">library packages</termref>.</termdef>

Now we can write an accumulator which simply accumulates these term definitions as they are encountered:

<xsl:accumulator name="terms" as="element(termdef)*" initial-value="()" streamable="yes">
    <xsl:accumulator-rule match="termdef" phase="end" select="($value, .)" saxon:capture="yes"/>
</xsl:accumulator>

(The accumulator starts out as the empty sequence. The select expression takes the existing value of the accumulator, $value, and appends the snapshot of the current termdef element, which is available as the context item ".".)

And now, at the end of the processing, we can output the glossary like this:

<xsl:template match="/" mode="streamable-mode">
    <html> 
        <!-- main foreground processing goes here -->
        <xsl:apply-templates mode="#current"/>
        <!-- now output the glossary -->
        <div id="glossary" class="glossary">
            <xsl:apply-templates select="accumulator-after('terms')" mode="glossary">
                <xsl:sort select="@term" lang="en"/>
            </xsl:apply-templates>
        </div>
    </html>
</xsl:template>

The value of the accumulator is a list of snapshots of termdef elements, and because these are snapshots, the processing at this point does not need to be streamable (snapshots are ordinary trees held in memory).

The amount of memory needed to accomplish this is whatever is needed to hold the glossary entries. This follows the design principle behind XSLT 3.0 streaming: the aim was not to restrict the programmer to operations that need zero working memory, but to allow things that aren't purely streamable while leaving the programmer in control of the amount of memory used.
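
As one way of exercising that control, the rule could keep only the strings the glossary needs rather than the whole snapshot. This is a sketch under assumptions: the name term-entries is invented, the xs and saxon prefixes are assumed to be declared, and the saving depends on the processor discarding each snapshot once the rule has been evaluated:

<xsl:accumulator name="term-entries" as="map(xs:string, xs:string)*" initial-value="()" streamable="yes">
    <!-- Reduce each captured termdef to a small map of two strings -->
    <xsl:accumulator-rule match="termdef" phase="end" saxon:capture="yes"
        select="($value, map{'term': string(@term), 'definition': normalize-space(.)})"/>
</xsl:accumulator>

The glossary could then be produced from the maps, sorting on the term entry:

<xsl:for-each select="accumulator-after('term-entries')">
    <xsl:sort select="?term" lang="en"/>
    <tr>
        <td><xsl:value-of select="?term"/></td>
        <td><xsl:value-of select="?definition"/></td>
    </tr>
</xsl:for-each>

The trade-off is that any markup inside the definition is lost, so this only suits a glossary that renders definitions as plain text.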

I think it's hard to find an easy way to tackle this particular problem without the new feature of capturing accumulator rules, so I hope it will prove a useful extension.

I've implemented this for Saxon 9.9. Interestingly, it only took about 25 lines of code: half a dozen to enable the new extension attribute, half a dozen to allow it to be exported to SEF files and re-imported, two or three to change the streamability analysis, and a few more to invoke the existing streaming implementation of the snapshot function from the accumulator watch code. Testing and documenting the feature was a lot more work than implementing it.

Here's a complete stylesheet that fleshes out the creation of a (skeletal) glossary:

<?xml version="1.0" encoding="UTF-8"?>
<xsl:package
  name="http://www.w3.org/xslt30-test/accumulator/capture-203"
  package-version="1.0"
  declared-modes="no"
  xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
  xmlns:xs="http://www.w3.org/2001/XMLSchema" xmlns:f="http://accum001/"
  xmlns:saxon="http://saxon.sf.net/"
  exclude-result-prefixes="#all" version="3.0">

  <!-- Stylesheet to produce a glossary using capturing accumulators -->
  
  <!-- The source document is a W3C specification in xmlspec format, containing
    term definitions in the form <termdef term="banana">A soft <termref def="fruit"/></termdef> -->
  
  <!-- This test case shows the essential principles of how to render such a document
    in streaming mode, with an alphabetical glossary of defined terms at the end -->
  
  <xsl:param name="streamable" static="yes" select="'yes'"/>
  
  <xsl:accumulator name="glossary" as="element(termdef)*" initial-value="()" streamable="yes">
    <xsl:accumulator-rule match="termdef" phase="end" saxon:capture="yes" select="($value, .)"/>
  </xsl:accumulator>

  <xsl:mode streamable="yes" on-no-match="shallow-skip" use-accumulators="glossary"/>
  
  <xsl:template name="main">
    <xsl:source-document href="xslt.xml" streamable="yes" use-accumulators="glossary">
      <xsl:apply-templates select="."/>
    </xsl:source-document>
  </xsl:template>
  
  <xsl:template match="/">
    <out>
      <!-- First render the body of the document -->
      <xsl:apply-templates/>
      <!-- Now generate the glossary -->
      <table>
        <tbody>
          <xsl:apply-templates select="accumulator-after('glossary')" mode="glossary">
            <xsl:sort select="@term" lang="en"/>
          </xsl:apply-templates>
        </tbody>
      </table>
    </out>
  </xsl:template>
  
  <xsl:template match="div1|inform-div1">
    <div id="{@id}">
      <xsl:apply-templates/>
    </div>
  </xsl:template>
  
  <!-- Main document processing: just output the headings -->
  
  <xsl:template match="div1/head | inform-div1/head">
    <xsl:attribute name="title" select="."/>
  </xsl:template>
  
  <!-- Glossary processing -->
  
  <xsl:mode name="glossary" streamable="no"/>
  
  <xsl:template match="termdef" mode="glossary">
    <tr>
      <td>
        <xsl:value-of select="@term"/>
      </td>
      <td>
        <xsl:value-of select="."/>
      </td>
    </tr>
  </xsl:template>

</xsl:package>