I can't help feeling that many simple transformations on XML documents could be expressed more simply (an idea I explored with the interactive Gizmo tool available in Saxon 10).
I also think that if we had simpler syntax for simple operations, it might become easier to optimise. In particular, I'd like to be able to do simple operations like adding an attribute to the outermost element of a tree without doing a physical copy of the entire tree. (I wrote about that in an XML Prague paper, but I had to abandon the idea because the code became too complicated. As often happens, the bugbear was namespaces. For example, if you add a namespaced attribute to the outermost element, then the new namespace declaration has to propagate all the way down the tree.)
I'd also like to see transformations on JSON structures (maps and arrays) become much easier.
I've prototyped these ideas in the saxon:deep-update and saxon:update extensions, but I don't think these are the last word on the subject. (Please try them and give feedback.)
A simpler update syntax might also be very useful for updating the HTML page in Saxon-JS.
I think we can pick up ideas from XQuery Update, but without the complications of pending update lists and in-situ modification.
Let's start with:
<xsl:update>
<xsl:delete match="note"/>
</xsl:update>
The idea is that xsl:update
is an instruction that returns a deep copy of the context item (or other selected item if there's a select attribute), applying changes defined by the contained rules. In this case there is one rule, to delete elements that match the pattern note.
So it's rather like the copy-modify
instruction in XQuery; it makes a copy of a supplied tree, with
defined changes.
Other rules that might appear within <xsl:update>
(for updating XML) might include:
<xsl:rename match="note" name="comment"/>
<xsl:rename match="a:*" name="{local-name()}"/>
<xsl:replace-value match="@status" value="accepted"/>
<xsl:add-attribute match="proposal(not(@status))" name="status" value="accepted"/>
<xsl:replace-content match="cite[@ref]" select="//bib[@id=current()/@ref]"/>
<xsl:insert match="section(not(head))" position="first">
<head>{@title}</head>
</xsl:insert>
Hopefully the intent is reasonably intuitive. The idea is to base the primitives on those available in XQuery Update.
However, I'm not proposing to allow flow-of-control structures such as conditionals and function calls: each invocation
of xsl:update
will simply process the selected tree recursively, applying matching rules to nodes as they
are found, based on pattern matching.
Defining the semantics
We can define the semantics of <xsl:update>
as being equivalent to <xsl:apply-templates>
using a mode that contains a number of implicit template rules, with a default action of shallow-copy
(but extended to handle maps and arrays, see below).
For example, the implicit template rule for the <xsl:rename>
rule might be (roughly):
<xsl:template match="note">
<xsl:element name="comment">
<xsl:apply-templates select="@*, node()"/>
</xsl:element>
</xsl:template>
Now, what if there's a rule to rename an element and another rule to add an attribute to the same element?
The way XQuery Update handles that is to process the rules in a number of phases: for example rename
operations are handled in phase 1, delete
operations in phase 5.
It's a bit hard to replicate that behaviour using template rules (in fact, this is something users often ask for). We could run a multiphase transformation using multiple modes, but it's not quite the same thing, because the match patterns would apply to the output of the previous phase, not to the original node in the input. And xsl:next-match
doesn't do the job either, because we want the effect of the rules to be cumulative.
We could try another approach, which is to have the template rules return functions, so the <xsl:rename>
rule becomes:
<xsl:template match="note" priority="1">
<xsl:sequence select="function($x) {upd:rename($x, 'comment')}"/>
<xsl:next-match/>
</xsl:template>
so the effect of apply-templates
is to return a sequence of functions (in the order determined by the priority
attributes) which are then applied to the node in turn.
This still doesn't exactly mirror what XQuery Update does, because after processing a node, it's then going to apply the rules to the new content of the node, not to the old content. But perhaps that actually makes more sense?
Implementation
Part of the aim is not just to have simpler syntax for the user, but also to make the implementation more efficient than the standard transformation approach which always involves physical copying of a tree, no matter how small the changes.
What I want to achieve is to have a data structure, rather like the HashTrie that we use for representing XDM maps, in which changing one entry doesn't involve copying the whole tree, but at the same time leaves the original value intact. The first essential for such a structure is that it doesn't contain parent pointers: instead upwards navigation is achieved by remembering, when we get to a node, how we got there: this means the same node can be reached by multiple routes, allowing subtrees to be shared between different trees.
Suppose we are changing the value of a single attribute. It ought to be possible to achieve this by the following steps:
- Find the element we are modifying, remembering the ancestor path of that element.
- Create a "virtual copy" of this element (we already have this capability in Saxon)
- Modify the virtual copy to add the attribute. Only one element is affected; the descendant tree of the virtual copy is shared with the original tree.
- Work back through the ancestors; for each one, create a copy in which the affected child is replaced with the modified child, and all other children are virtual copies of the original.
- Return the copied root node.
I'm hoping that it will be a lot easier to achieve this with the new syntax than it is with the current processing model, where we have to deal with all kinds of messiness like namespace inheritance. For example, we can define the new syntax so that it's equivalent to inherit-namespaces="no"
.
What about JSON?
I would like this mechanism to work just as well with JSON trees (that is, structures of maps and arrays) as with XML trees.
We're starting with some advantages: these structures don't have so much baggage. There's no node identity to worry about, no parent navigation, no namespaces. Also, the implementation data structures that we use for maps and arrays already allow efficient constant-time update.
I've experimented with mechanisms for deep update of a JSON structure with extension functions such as [saxon:pedigree()](https://www.saxonica.com/documentation/index.html#!functions/saxon/with-pedigree)
. and saxon:with-pedigree()
. That's not exactly usable. But it might be the right primitive to implement something more usable.
I've also proposed better pattern syntax for maps and arrays. For example, match="tuple(first, last, *)"
matches any map that has entries with keys "first" and "last".
One problem with using the XSLT recursive-descent approach for maps and arrays is that map entries (and indeed array members) aren't actually items. You can match a map as a whole, but it's hard to match one of its entries on its own. Again, I've experimented with various approaches to this. I think the introduction of tuples may help with this: we can define the recursive-descent operation on maps to process (match) each entry in the map in turn, where the entry is presented and matched as a tuple containing key and value. And then we allow syntax such as match="tuple(key: keyPattern, value: valuePattern)"
to match these entries.
But perhaps we don't need to expose this. Perhaps we can define a good enough set of primitive actions that match at the level of the map itself, for example:
<xsl:remove-entry match="tuple(first, last, *)" key="'salary"/>
<xsl:replace-entry match="tuple(product-code, *) key="'price'" value="?price * 1.05"/>
<xsl:add-entry match="tuple(x, y, *)" key="'area'" value="?x * ?y"/>
I think this could fly: but there's a lot of detail to be worked out. Shame we don't have a WG any more to bounce ideas off (and get the bugs out).