Compiling Saxon using GWT

After my TPAC rant a couple of weeks ago (see http://localhost:8126/mike/2010/11/tpac-the-morning-after.html), a number of people whom I respect said, in effect, "don't dismiss Javascript". So I decided to do some experiments that I've had in mind for quite a while, attempting to compile Saxon's Java source code using the Google Web Toolkit (GWT), which produces Javascript output capable of running in the browser. After much hacking, I've managed to get it to compile. The next stage is to try and get it to run, and after that, to see if it will perform. But meanwhile I thought I would try and report on what I've been doing.

My first step was to take the Saxon-HE 9.3 source code and cut away whatever wasn't needed for a minimally-conforming XSLT 2.0 processor on the browser. The code base I started from was about 143K ncloc (non-comment lines of code). I reckoned 80K might be a reasonable target to aim for. As with incoming governments slashing overblown public expenditure, the first cuts were quite easy, but they became progressively harder. First to go were XQuery-specific code, post-XSLT 2.0 extensions such as higher-order functions, Saxon-specific extensions, support for XOM, DOM4J, and JDOM, support for JAXP APIs, miscellaneous code only there for diagnostics or testing, and so on. Although schema-awareness, updates, streaming and various other features aren't supported in Saxon-HE, there's a fair bit of code that is only there to support these features, and I hacked much of it away. I got rid of the tiny tree - the idea is that the client-side product will operate directly on the DOM, though I retained the linked tree because it's needed for stylesheets. I also threw out the serializer - the aim is to deliver the result tree as a DOM. (We'll talk later about how to deliver secondary result trees).

This exercise got the size down to about 100K ncloc, and also got rid of a lot of the dependencies that would prevent the code running under GWT. The next phase was driven by the error messages obtained when compiling under GWT. These errors fell into a number of categories:

dependencies on classes not supported by GWT: Properties, URI, BitSet, WeakReference, Thread, StringTokenizer, File, ByteOutputStream. In many cases it was possible to rewrite the code to avoid the dependency; in other cases it was easy to produce a replacement for the class either from scratch or by borrowing from OpenJDK.
regular expressions. This falls into two groups: Saxon makes some internal use of regular expressions for things such as parsing dates and durations; these uses were easily replaced using String.matches(), which GWT supports. The more complex use of regexes is to support XSLT/XPath functionality like analyze-string. So far I have adapted this code to use GWT's wrapper around the Javascript RegExp class. I haven't yet modified it to handle the differences between Java regular expressions and JS regular expressions. I think I will probably need to revert to the JDK1.4 version of the XPath-to-Java regex translator, since this works on the bases of "." matching UTF16 code units rather than Unicode code points. Alternatively, I could consider throwing conformance out of the window and simply supporting JS regular expressions instead of XPath regular expressions - I expect a lot of users would thank me for this.
document input and output. All the functionality that handles principal and secondary input and output documents needed to be rewritten. The basic model is to use the Javascript DOM, as exposed by GWT, for both input and output. Fortunately the existing Saxon DOM wrapper code can be used almost without change. There are a few things that may cause hassle, however. The GWT DOM does not expose ID-typed attributes, so the id() function won't work except for attributes named xml:id. It's not possible to get from an attribute to its parent element. More particularly, the GWT DOM is namespace-unaware, so operations like matching the expanded name of a node could turn out to be quite expensive. It also has nothing to reduce the pain of sorting nodes into document order. I may consider doing a one-off conversion from an input DOM to a Saxon linked tree if this improves performance (which was one reason I retained the linked tree in the code base); I might also consider doing some kind of hybrid between the current alternatives of wrapping and conversion, where there's a single pass operation over the DOM tree to collect and add additional information such as namespaces and position in document order - a kind of "eager wrapping" process in place of the current lazy wrapping of nodes that are actually visited. The other problem in this area is that GWT only supports asynchronous fetching of documents from the server (with a callback mechanism activated when the transfer is complete). That really won't work for calls on doc() and document(): fortunately Vojtech Toman (who wrote about his GWT port of the Calumet XProc engine at Balisage last year) showed me how to do synchronous access by making native calls on the underlying XmlHttpRequest object.
dynamic loading. GWT can't do dynamic instantiation of classes. So I scrapped all the interfaces where Saxon dynamically instantiates user-supplied classes (such as URIResolvers and ErrorListeners), and I rewrote all the places where Saxon dynamically instantiates its own classes (such as system function calls).

The code is now about 83K ncloc, still a little above my first target, but comfortably close. It compiles, and the compiler generates 5 files of Javascript (packaged in HTML for reasons I haven't fully understood) each around 1Mb in size. I think the files are variants of each other, one per supported browser, so you're basically faced with a 1Mb download the first time you use a page that invokes an XSLT 2.0 stylesheet. That's big, but not impossibly so; it would be nice to get it down further.

The next stage is to design and implement an API suitable for the client-side developer, and then to test that it works. Then comes performance work. Then, before we release, we'll have to think about such matters as licensing and pricing. My initial thought is some kind of model where it's free for non-commercial use but chargeable on a per-domain basis for large enterprises. But we don't have total freedom on licensing terms: we're constrained by the fact that there is some code in the product for which Saxonica does not own the copyright.

Watch this space for further developments.

======

Two further points:

(a) the only thing I changed that explicitly causes a non-conformance is to remove the code for converting floating-point numbers to strings, using the underlying platform code instead. The existing code requires bit-twiddling on the internal representation of floating point, which isn't possible under GWT. What we lose is the precise rules for output of floating point - which given that GWT arithmetic isn't going to follow the XPath rules anyway, isn't a big loss.

(b) I've done some compile reports on the size of the generated code. As one might expect, a large proportion of the code comes from the classes that are data-intensive, in particular those holding information about character properties. The class CaseVariants which holds upper/lower-case mappings is particularly heavy, and could probably be thrown out entirely, relying instead on whatever JS provides us with.