Compiling Saxon using GWT

By Michael Kay on November 16, 2010 at 02:18p.m.

After my TPAC rant a couple of weeks ago (see http://localhost:8126/mike/2010/11/tpac-the-morning-after.html), a number of people whom I respect said, in effect, "don't dismiss JavaScript". So I decided to do some experiments that I've had in mind for quite a while, attempting to compile Saxon's Java source code using the Google Web Toolkit (GWT), which produces JavaScript output capable of running in the browser. After much hacking, I've managed to get it to compile. The next stage is to try and get it to run, and after that, to see how well it performs. But meanwhile I thought I would report on what I've been doing.

My first step was to take the Saxon-HE 9.3 source code and cut away whatever wasn't needed for a minimally-conforming XSLT 2.0 processor in the browser. The code base I started from was about 143K ncloc (non-comment lines of code). I reckoned 80K might be a reasonable target to aim for. As with incoming governments slashing overblown public expenditure, the first cuts were quite easy, but they became progressively harder. First to go were XQuery-specific code, post-XSLT 2.0 extensions such as higher-order functions, Saxon-specific extensions, support for XOM, DOM4J, and JDOM, support for JAXP APIs, miscellaneous code only there for diagnostics or testing, and so on. Although schema-awareness, updates, streaming and various other features aren't supported in Saxon-HE, there's a fair bit of code that is only there to support these features, and I hacked much of it away. I got rid of the tiny tree, since the idea is that the client-side product will operate directly on the DOM, though I retained the linked tree because it's needed for stylesheets. I also threw out the serializer, since the aim is to deliver the result tree as a DOM. (We'll talk later about how to deliver secondary result trees.)

This exercise got the size down to about 100K ncloc, and also got rid of a lot of the dependencies that would prevent the code running under GWT. The next phase was driven by the error messages obtained when compiling under GWT. These errors fell into a number of categories:

The code is now about 83K ncloc, still a little above my first target, but comfortably close. It compiles, and the compiler generates five files of JavaScript (packaged in HTML for reasons I haven't fully understood), each around 1MB in size. I think the files are variants of each other, one per supported browser, so you're basically faced with a 1MB download the first time you use a page that invokes an XSLT 2.0 stylesheet. That's big, but not impossibly so; it would be nice to get it down further.

The next stage is to design and implement an API suitable for the client-side developer, and then to test that it works. Then comes performance work. Then, before we release, we'll have to think about such matters as licensing and pricing. My initial thought is some kind of model where it's free for non-commercial use but chargeable on a per-domain basis for large enterprises. But we don't have total freedom on licensing terms: we're constrained by the fact that there is some code in the product for which Saxonica does not own the copyright.

Watch this space for further developments.

======

Two further points:

(a) The only change I made that explicitly causes a non-conformance was to remove the code for converting floating-point numbers to strings, using the underlying platform code instead. The existing code requires bit-twiddling on the internal representation of floating point, which isn't possible under GWT. What we lose are the precise rules for the output of floating-point numbers, which, given that GWT arithmetic isn't going to follow the XPath rules anyway, isn't a big loss.

(b) I've done some compile reports on the size of the generated code. As one might expect, a large proportion of the code comes from the classes that are data-intensive, in particular those holding information about character properties. The class CaseVariants which holds upper/lower-case mappings is particularly heavy, and could probably be thrown out entirely, relying instead on whatever JS provides us with.
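To make point (a) a little more concrete, here is a minimal Java sketch of the kind of bit-level access involved. The class and method names (FloatBits, exponent, fraction and so on) are invented for illustration and are not taken from the Saxon source; the sketch only shows how the sign, exponent and fraction fields of an IEEE 754 double can be extracted with Double.doubleToLongBits, next to the platform fallback of simply calling Double.toString.

```java
// Illustrative only: hypothetical names, not Saxon's actual code.
public final class FloatBits {

    private FloatBits() {
        // static helpers only
    }

    /** True if the sign bit is set (this includes -0.0). */
    public static boolean isNegative(double d) {
        return (Double.doubleToLongBits(d) >>> 63) != 0;
    }

    /** Unbiased binary exponent; meaningful for normal (non-zero, finite) values. */
    public static int exponent(double d) {
        long bits = Double.doubleToLongBits(d);
        // 11 exponent bits sit above the 52 fraction bits; the bias is 1023
        return (int) ((bits >>> 52) & 0x7FFL) - 1023;
    }

    /** The raw 52-bit fraction field, without the implicit leading 1. */
    public static long fraction(double d) {
        return Double.doubleToLongBits(d) & 0x000FFFFFFFFFFFFFL;
    }

    /** The fallback: let the platform decide how the number is formatted. */
    public static String platformString(double d) {
        return Double.toString(d);
    }
}
```

For example, FloatBits.exponent(8.0) is 3, and FloatBits.fraction(1.5) has only its top bit set. A formatter that wants exact control over the digits it emits needs this sort of access; the fallback hands that control to the platform.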
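On point (b), relying on the platform for case mappings might look something like the following sketch. It is plain Java, the class name PlatformCaseVariants and the method variantsOf are hypothetical, and I am assuming (not having verified it under GWT) that equivalent character functions are available on the target. It does not attempt to reproduce everything a precomputed table can do.

```java
import java.util.Set;
import java.util.TreeSet;

// Illustrative only: hypothetical names, not Saxon's CaseVariants class.
public final class PlatformCaseVariants {

    private PlatformCaseVariants() {
        // static helpers only
    }

    /**
     * Returns, in ascending order, the code points that the platform's
     * upper/lower-case functions associate with the given code point,
     * excluding the code point itself.
     */
    public static int[] variantsOf(int codePoint) {
        Set<Integer> found = new TreeSet<>();
        int upper = Character.toUpperCase(codePoint);
        int lower = Character.toLowerCase(codePoint);
        found.add(upper);
        found.add(lower);
        // One extra hop, for characters whose mappings do not round-trip
        found.add(Character.toLowerCase(upper));
        found.add(Character.toUpperCase(lower));
        found.remove(codePoint);

        int[] result = new int[found.size()];
        int i = 0;
        for (int cp : found) {
            result[i++] = cp;
        }
        return result;
    }
}
```

The trade-off is that the mappings are computed on demand from whatever the platform knows, rather than shipped as data in the download, and they only run in the direction the platform functions provide.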