Saxon-CS says Hello World

The Saxon product on .NET has been living on borrowed time for a while. It's built by converting the Java bytecode of the Java product to the equivalent .NET intermediate language, using the open-source IKVM converter produced by Jeroen Frijters. Jeroen after many years of devoted service decided to give up further development and maintenance of IKVM a few years ago, which didn't immediately matter because the product worked perfectly well. But then Microsoft in 2019 announced that future .NET developments would be based on .NET Core, and IKVM has never supported .NET Core, so we clearly had a problem.

There's a team attempting to produce a fork of IKVM that supports the new .NET, but we've never felt we could put all our eggs in that basket. In any case, we also have performance problems with IKVM that we've never managed to resolve: some applications run 5 times slower than Java, and despite a lot of investigation, we've never worked out why.

So we decided to try a new approach, namely Java-to-C# source code conversion. After a lot of work, we've now achieved successful compilation and execution of a subset of the the code, and for the first time this morning, Saxon-CS successfully ran the minimal "Hello World" query.

We're a long way from having a product we can release, but we can now have confidence that this approach is going to be viable.

How does the conversion work? We looked at some available tools, notably the product from Tangible Solutions, and this gave us many insights into what could be readily converted, and where the remaining difficulties lay; it also convinced us that we'd be better off writing our own converter.

The basic workflow is:

Using the open source JavaParser library, parse the Java code, generate an XML abstract syntax tree for each module, and annotate the syntax tree with type information where needed.
Using XSLT code, do a cross-module analysis to determine which methods override each other, which have covariant return types, etc: information needed when generating the C# code.
Perform an XSLT transformation on each module to generate C# code.

We can't convert everything automatically, so there's a range of strategies we use to deal with the remaining issues:

Some constructs can simply be avoided. We have trouble, for example, converting Java method references like Item::toString, because it needs a fair bit of context information to distinguish the various possible translations. But it's no great hardship to write the Java code a different way, for example as a lambda expression item -> item.toString(). Another example is naming conflicts: C# doesn't allow you, for example, to have a variable with the same name as a method in the containing class. It's no hardship to rename the variables so the problem doesn't arise.
We can use Java annotations to steer the conversion. For example, sometimes we want to generate C# code that's completely unrelated to the Java code. We can move this code into a method of its own, and then add an annotation @CSharpReplaceMethodBody which substitutes different code for the existing method body. The annotation is copied into the XML syntax tree by the JavaParser, and our converter can pick it up from there.
We already have a preprocessor mechanism to mark chunks of code as being excluded from particular variants of the product (such as Saxon-HE or Saxon-PE). We can make further use of this mechanism. However, it's limited by the fact that the code, prior to preprocessing, must be valid Java so that it works in the IDE.

The areas that have caused most trouble in conversion are:

Inner classes. C# has no anonymous inner classes, and its named inner classes correspond only to Java's static inner classes. Guided by the way the Tangible converter handles these, we've found a way of translating them that handles most cases, and we've added Java annotations that provide the converter with extra information where additional complexities arise.
Enumeration types. C#'s enumeration types are much more limited than the equivalent in Java, because enumeration constants can't have custom methods associated with them. We distinguish three kinds of enumeration classes: singleton enumerations (used to implement classes that will only have a single instance); simple enumerations with no custom behaviour, which can be translated to C# enumerations very directly, and more complex enumerations, that result in the generation of two separate C# classes, one to hold the enumeration constants, the other to accommodate the custom methods.
Generics. C# is much stricter about generic types than Java, because the type information is carried through to run-time, whereas in Java it is used only for compile-time type checking, which can be subverted by use of casting. So the rule in C# is, either use generics properly, or don't use them at all. We anticipated some of these issues a year or two ago when we first started thinking about this project: see Java Generics Revisited. The result is that the classes representing XDM sequences and sequence iterators no longer use generics, which has saved a lot of hassle in this conversion. But there are still many problems, notably (a) the type inference needed to support Java's diamond operator (as in new ArrayList<>(), where an explicit type parameter is needed in C#), and (b) the handling of covariant and contravariant wildcards (? extends T, ? super T.)
Iterators and enumerators. A for-each loop in Java (for (X x : collection)) relies on the collection operand implementing the java.lang.Iterable interface. To translate this into a C# for-each loop (foreach (X x in collection)) the collection needs to implement IEnumerable. So we convert all Iterables to IEnumerables, and that means we have to convert Iterators to Enumerators. Unfortunately Java's Iterator interface doesn't lend itself to static translation to a c# IEnumerator: in Java, the hasNext() method is stateless (so you can call it repeatedly), whereas C#'s MoveNext changes the current position (so you can't). We're fortunate that we only make modest use of Java iterators; in most of the code, we use Saxon's SequenceIterator interface in preferance, and this converts without trouble. We examined all the cases where Saxon explicitly uses hasNext() and next(), and made sure these followed the discipline of calling hasNext() exactly once before each call on next(); with this discipline, converting the calls to MoveNext() and Current works without problems.
Lambda expressions and delegates. In Java, lambda expressions can be used where the expected type is a functional interface; a functional interface in other ways is just an ordinary interface, and you can have concrete classes that implement it. So for example the second argument of NodeInfo.iterateAxis(axis, nodeTest) is a NodeTest, for which we can supply either a lambda expression (such as it -> it instanceof XSLExpose), or one of a whole range of implementation classes such as a SchemaElementTest, which tests whether an element belongs to an XSD-defined substitution group. In C#, lambda expressions can only be used when the expected type is a delegate, and if the expected type is a delegate, then (in effect) a lambda expression is the only thing you can supply. The way we've handled this is generally to make the main method (like iterateAxis() expect a non-delegate interface, and then to supply a proxy implementation of this interface that accepts a delegate. It's not a very satisfactory solution, but it works.

One area where we could have had trouble, but avoided it, is in the use of the Java CharSequence class. I wrote about this issue last year at String, CharSequence, IKVM, and .NET. As described in that article, we decided to eliminate our dependence on the CharSequence interface. For a great many internal uses of strings in Saxon, we now use a new interface UnicodeString which as the name implies is much more Unicode-friendly than Java's String and CharSequence. It also reduces memory usage, especially in the TinyTree. But there is a small overhead in the places where we have to convert strings to or from UnicodeStrings, which we can't hide entirely: it represents about 5% on the bottom line. But it does make all this code much easier to port between Java and C#.

What about dependencies? So far we've just been tackling the Saxon-HE code base, and that has very few dependencies that have caused any difficulty. Most of the uses of standard Java library classes (maps, lists, input and output streams, and the like) are handled by the converter, simply translating calls into the nearest C# equivalent. In some cases such as java.util.Properties we've written en emulation of the Java interface (or the parts of it that we actually use). In other cases we've redirected calls to helper methods. For example we don't always have enough type information to know whether Java's List.remove() should be translated to List.Remove() or List.RemoveAt(); so instead we generate a call on a static helper method, which makes the decision at runtime based on the type of the supplied argument.

The only external dependency we've picked up so far is for handling big decimal numbers. We're currently evaluating the BigDecimal library from Singulink, which appears to offer all the required functionality, though its philosophy is sufficiently different from the Java BigDecimal to make conversion non-trivial.

One thing I should stress is that we haven't written a general purpose Java to C# converter. Our converter is designed to handle the Saxon codebase, and nothing else. Some of the conversion rules are specific to particular Saxon classes, and as a general principle, we only convert the subset of the language and of the class library that we actually need. Some of the conversion rules assume that the code is written to the coding conventions that we use in Saxon, but which might not be followed in other projects.

So, Hello World to Saxon-CS. There's still a lot of work to do, but we've reached a significant milestone.