The Zeno Chain: a new data structure for XDM sequences

By Michael Kay on March 18, 2021 at 03:34p.m.

This article presents the Zeno Chain, a new data structure used to underpin the implementation of XDM sequences in Saxon. It is also designed to be capable of supporting XDM arrays, and might also have potential for holding long character strings.

The current implementation of the Zeno Chain is a mutable list, but the design lends itself easily to creating an immutable variant. It also makes it easy to construct an immutable list from a mutable one, making it efficient to construct a sequence with in-situ modification, and then "freeze" it once construction is complete.

Saxon currently uses a variety of structures for holding sequences and arrays. This variety is a problem in itself. Choosing the right structure for a particular scenario involves somewhat hit-or-miss decision making; it would be better to have a single "all-rounder" structure that performs well in a variety of situations.

There are of course vast numbers of data structures for sequences available in the computer science literature. One promising one, for example, is the "finger tree" which supports a wide range of access patterns efficiently. But it also has drawbacks: any tree structure that requires a node for each item in a list is going to have a large memory overhead when storing a long sequence, and the use of a fine-grained structure like this tends to mean that there is little locality of reference for memory addressing, leading to poor CPU caching performance.

The Zeno chain stores a sequence as a list of lists of items: that is, it is a tree with a constant depth of 2. In the Java implementation, both levels of list are instances of java.util.ArrayList. The key to the performance of the structure is managing the number and size of the second-level lists, which I call segments.

In a list that is constructed by appending individual items on the end (a common scenario), the length of a segment increases the closer it is to the start. For a list of 20,000 items, there are ten segments whose sizes are (8192, 4096, 4096, 2048, 1024, 256, 128, 64, 64, 32). (Now you know why I called it a Zeno chain.) The exact numbers don't matter here: what is important is that the total number of segments increases only logarithmically with the length of the sequence, and that the segments on the right are short, which makes further append operations efficient.

In a list constructed by prepending individual items, the distribution of lengths will be the other way round: shortest segments near the front. In the rare case where both append and prepend operations occur, both ends will have short segments, while longer segments will cluster around the middle.

Here's a summary of the major operations performed on the sequence:

It's important to note that most operations on sequences don't actually result in a new sequence being constructed. Calling tail(), for example, doesn't copy any data: it delivers an iterator over a portion of the original sequence. The sequence only gets materialized if, for example the result is stored in a variable (and even then, not always).

Saxon's default implementation for a sequence is simply a Java List. Appending an item to a list generally copies the whole list. Where Saxon can detect that this is going to be inefficient, it instead uses a structure called a Chain: this is effectively a tree of segments. But there's little serious attempt to manage the depth of the tree or the size of the segments, and the results in some cases can be rather poor.The Zeno chain offers a signficant improvement; it also looks as if it can be used for arrays as well as sequences.

For managing long strings, I invented a similar structure, which I then discovered already existed in the literature and is known as a Rope: a Rope represents a string as a tree of substrings. The literature on Ropes describes how to keep the tree balanced, but it has nothing to say about how to decide how many substrings to hold, and how long to make them. The Zeno chain might turn out to provide an answer to that question.