Tweaking the TinyTree

It's unusual to find anything these days that gives Saxon a 5% performance boost across the board, but I seem to have achieved that with some tweaks to the TinyTree data structure.

Text nodes in the TinyTree aren't stored as separate String objects, but rather as start/length offsets into a single concatenated text buffer. The reason for that is that is to avoid the high Java space overhead associated with every String. At one time I stored this buffer as a single array (that is, char[]), but that requires a lot of character copying when the size increases, so I changed it a while ago to a custom data structure, the LargeStringBuffer, which breaks the text up into an array of segments. The segments were variable length, so that the string value of a text node could always be stored in a single segment, meaning that retrieving the string value never involves character copying. I've been aware for a while, however, from the results of profiling, that with a large document, finding the right segment can take a significant time (it uses a binary chop search).

So I've been doing some experiments with a revised implementation of LargeStringBuffer that uses an array of fixed-length character arrays (65536 seems to work best). These means that text nodes can cross a segment boundary, in which case they will need to be copied, but the vast majority of text nodes still reside in one segment, and finding the text corresponding to a start position and length is now much faster. For any document above 1Mb or so the savings seem to be worthwhile, and there are no extra costs for small documents except perhaps that the minimum memory size is now 65536 characters (and it may be possible to avoid that).

The other long-standing idea on the TODO list for the TinyTree is to identify text nodes and attribute nodes that share the same value and use this fact to reduce the storage requirement. Some data-oriented XML should see a large space reduction from this. The downside, of course, as with any compression technique, is that it will take longer to build the tree. It's not difficult to do - but I hate adding performance features that are only used if the user sets a switch, because 99% of users will never discover its existence.