The Saxon product on .NET has been living on borrowed time for a while. It's built by converting the Java bytecode of the Java product to the equivalent .NET intermediate language, using the open-source IKVM converter produced by Jeroen Frijters. After many years of devoted service, Jeroen decided a few years ago to give up further development and maintenance of IKVM, which didn't immediately matter because the product worked perfectly well. But then in 2019 Microsoft announced that future .NET developments would be based on .NET Core, and IKVM has never supported .NET Core, so we clearly had a problem.
There's a team attempting to produce a fork of IKVM that supports the new .NET, but we've never felt we could put all our eggs in that basket. In any case, we also have performance problems with IKVM that we've never managed to resolve: some applications run 5 times slower than Java, and despite a lot of investigation, we've never worked out why.
So we decided to try a new approach, namely Java-to-C# source code conversion. After a lot of work, we've now achieved successful compilation and execution of a subset of the code, and for the first time this morning, Saxon-CS successfully ran the minimal "Hello World" query.
We're a long way from having a product we can release, but we can now have confidence that this approach is going to be viable.
How does the conversion work? We looked at some available tools, notably the product from Tangible Solutions, and this gave us many insights into what could be readily converted, and where the remaining difficulties lay; it also convinced us that we'd be better off writing our own converter.
The basic workflow is:
- Using the open source JavaParser library, parse the Java code, generate an XML abstract syntax tree for each module, and annotate the syntax tree with type information where needed.
- Using XSLT code, do a cross-module analysis to determine which methods override each other, which have covariant return types, etc: information needed when generating the C# code.
- Perform an XSLT transformation on each module to generate C# code.
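To make the second step more concrete, here is a minimal sketch of the kind of fact the cross-module analysis has to establish: which methods override a superclass method, and which of those narrow the return type. It is not how the converter works (that analysis is done in XSLT over the XML syntax trees); the sketch swaps in Java reflection purely for illustration, and the classes Shape, Circle and OverrideAnalysis are hypothetical.

```java
import java.lang.reflect.Method;
import java.util.ArrayList;
import java.util.List;

// Hypothetical example classes, not taken from the Saxon code base.
class Shape {
    Shape copy() { return new Shape(); }
    String name() { return "shape"; }
}

class Circle extends Shape {
    @Override Circle copy() { return new Circle(); }  // covariant return type
    @Override String name() { return "circle"; }      // same return type as the parent
}

class OverrideAnalysis {
    /**
     * Returns the names of methods declared in cls that override a method of
     * some superclass while declaring a different (narrower) return type.
     * Simplification: visibility and static-ness are ignored.
     */
    static List<String> covariantOverrides(Class<?> cls) {
        List<String> result = new ArrayList<>();
        for (Method m : cls.getDeclaredMethods()) {
            // Skip the bridge methods that javac generates for covariant overrides.
            if (m.isBridge() || m.isSynthetic()) {
                continue;
            }
            for (Class<?> c = cls.getSuperclass(); c != null; c = c.getSuperclass()) {
                try {
                    Method overridden = c.getDeclaredMethod(m.getName(), m.getParameterTypes());
                    if (!overridden.getReturnType().equals(m.getReturnType())) {
                        result.add(m.getName());
                    }
                    break; // nearest overridden declaration found
                } catch (NoSuchMethodException e) {
                    // Not declared at this level; keep looking further up the hierarchy.
                }
            }
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(covariantOverrides(Circle.class));
    }
}
```

The point is only that this information cannot be read off a single module: deciding how to emit Circle.copy() requires looking at Shape as well.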
We can't convert everything automatically, so there's a range of strategies we use to deal with the remaining issues:
- Some constructs can simply be avoided. We have trouble, for example, converting Java method references like Item::toString, because it needs a fair bit of context information to distinguish the various possible translations. But it's no great hardship to write the Java code a different way, for example as a lambda expression item -> item.toString(). Another example is naming conflicts: C# doesn't allow you, for example, to have a variable with the same name as a method in the containing class. It's no hardship to rename the variables so the problem doesn't arise.
- We can use Java annotations to steer the conversion. For example, sometimes we want to generate C# code that's completely unrelated to the Java code. We can move this code into a method of its own, and then add an annotation @CSharpReplaceMethodBody which substitutes different code for the existing method body. The annotation is copied into the XML syntax tree by the JavaParser, and our converter can pick it up from there.
- We already have a preprocessor mechanism to mark chunks of code as being excluded from particular variants of the product (such as Saxon-HE or Saxon-PE). We can make further use of this mechanism. However, it's limited by the fact that the code, prior to preprocessing, must be valid Java so that it works in the IDE.
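The first two strategies might look roughly like this on the Java side. This is a sketch under assumptions: the post names the @CSharpReplaceMethodBody annotation but not its elements, so the `code` element, the retention policy and the ConversionFriendly class are all hypothetical, and Object stands in for Saxon's Item so that the example is self-contained.

```java
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;
import java.util.List;
import java.util.function.Function;
import java.util.stream.Collectors;

// Hypothetical declaration: the element name "code" is an assumption.
// SOURCE retention is enough for a tool that reads the parsed source tree.
@Retention(RetentionPolicy.SOURCE)
@Target(ElementType.METHOD)
@interface CSharpReplaceMethodBody {
    String code();
}

class ConversionFriendly {
    // A method reference: concise, but translating it needs context
    // (static method? instance method? constructor?).
    static final Function<Object, String> BY_REFERENCE = Object::toString;

    // The same function as a lambda: the parameter and the call are explicit.
    static final Function<Object, String> BY_LAMBDA = item -> item.toString();

    // Platform-specific code isolated in a method of its own, so that the
    // whole body can be swapped for hand-written C#.
    @CSharpReplaceMethodBody(code = "return System.Environment.NewLine;")
    static String lineSeparator() {
        return System.lineSeparator();
    }

    // Ordinary code written in the converter-friendly style.
    static List<String> render(List<?> items) {
        return items.stream().map(item -> item.toString()).collect(Collectors.toList());
    }

    public static void main(String[] args) {
        System.out.println(render(List.of(1, "two", 3.0)));
    }
}
```

Both forms of the function behave identically in Java; the lambda is simply easier to translate one-for-one.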
The areas that have caused most trouble in conversion are:
- Inner classes. C# has no anonymous inner classes, and its named inner classes correspond only to Java's static inner classes. Guided by the way the Tangible converter handles these, we've found a way of translating them that handles most cases, and we've added Java annotations that provide the converter with extra information where additional complexities arise.
- Enumeration types. C#'s enumeration types are much more limited than the equivalent in Java, because enumeration constants can't have custom methods associated with them. We distinguish three kinds of enumeration classes: singleton enumerations (used to implement classes that will only have a single instance); simple enumerations with no custom behaviour, which can be translated to C# enumerations very directly; and more complex enumerations, which result in the generation of two separate C# classes, one to hold the enumeration constants, the other to accommodate the custom methods.
- Generics. C# is much stricter about generic types than Java, because the type information is carried through to run-time, whereas in Java it is used only for compile-time type checking, which can be subverted by use of casting. So the rule in C# is, either use generics properly, or don't use them at all. We anticipated some of these issues a year or two ago when we first started thinking about this project: see Java Generics Revisited. The result is that the classes representing XDM sequences and sequence iterators no longer use generics, which has saved a lot of hassle in this conversion. But there are still many problems, notably (a) the type inference needed to support Java's diamond operator (as in new ArrayList<>(), where an explicit type parameter is needed in C#), and (b) the handling of covariant and contravariant wildcards (? extends T, ? super T).
- Iterators and enumerators. A for-each loop in Java (for (X x : collection)) relies on the collection operand implementing the java.lang.Iterable interface. To translate this into a C# for-each loop (foreach (X x in collection)) the collection needs to implement IEnumerable. So we convert all Iterables to IEnumerables, and that means we have to convert Iterators to Enumerators. Unfortunately Java's Iterator interface doesn't lend itself to static translation to a C# IEnumerator: in Java, the hasNext() method is stateless (so you can call it repeatedly), whereas C#'s MoveNext changes the current position (so you can't). We're fortunate that we only make modest use of Java iterators; in most of the code, we use Saxon's SequenceIterator interface in preference, and this converts without trouble. We examined all the cases where Saxon explicitly uses hasNext() and next(), and made sure these followed the discipline of calling hasNext() exactly once before each call on next(); with this discipline, converting the calls to MoveNext() and Current works without problems.
- Lambda expressions and delegates. In Java, lambda expressions can be used where the expected type is a functional interface; a functional interface in other respects is just an ordinary interface, and you can have concrete classes that implement it. So for example the second argument of NodeInfo.iterateAxis(axis, nodeTest) is a NodeTest, for which we can supply either a lambda expression (such as it -> it instanceof XSLExpose), or one of a whole range of implementation classes such as a SchemaElementTest, which tests whether an element belongs to an XSD-defined substitution group. In C#, lambda expressions can only be used when the expected type is a delegate, and if the expected type is a delegate, then (in effect) a lambda expression is the only thing you can supply. The way we've handled this is generally to make the main method (like iterateAxis()) expect a non-delegate interface, and then to supply a proxy implementation of this interface that accepts a delegate. It's not a very satisfactory solution, but it works.
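The iterator discipline described above can be illustrated with a small sketch. Everything here is hypothetical and written in Java for convenience: Enumerator is a Java rendering of the general shape of C#'s IEnumerator (an advancing moveNext() plus a current() accessor), and IteratorEnumerator adapts a Java Iterator to it. The last method shows what goes wrong when code that calls hasNext() twice per element is translated mechanically.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

// Java rendering of the shape of a C# enumerator, for illustration only.
interface Enumerator<T> {
    boolean moveNext();  // advances the position; false when exhausted
    T current();         // the element at the current position
}

// Adapts a Java Iterator to the enumerator shape.
class IteratorEnumerator<T> implements Enumerator<T> {
    private final Iterator<T> base;
    private T current;

    IteratorEnumerator(Iterator<T> base) { this.base = base; }

    public boolean moveNext() {
        if (base.hasNext()) {
            current = base.next();
            return true;
        }
        return false;
    }

    public T current() { return current; }
}

class IterationDiscipline {
    // Java code following the discipline: one hasNext() before each next().
    static List<String> viaIterator(List<String> input) {
        List<String> out = new ArrayList<>();
        Iterator<String> it = input.iterator();
        while (it.hasNext()) {
            out.add(it.next());
        }
        return out;
    }

    // Mechanical translation: hasNext() becomes moveNext(), next() becomes current().
    static List<String> viaEnumerator(List<String> input) {
        List<String> out = new ArrayList<>();
        Enumerator<String> en = new IteratorEnumerator<>(input.iterator());
        while (en.moveNext()) {
            out.add(en.current());
        }
        return out;
    }

    // The same translation applied to code that tested hasNext() twice:
    // harmless in Java, but here every second moveNext() skips an element.
    static List<String> brokenTranslation(List<String> input) {
        List<String> out = new ArrayList<>();
        Enumerator<String> en = new IteratorEnumerator<>(input.iterator());
        while (en.moveNext()) {
            if (en.moveNext()) {  // was a redundant second hasNext() in the Java source
                out.add(en.current());
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> data = List.of("a", "b", "c", "d");
        System.out.println(viaIterator(data));
        System.out.println(viaEnumerator(data));
        System.out.println(brokenTranslation(data));
    }
}
```

With the discipline in place the two loops produce the same sequence; without it, the translated loop silently drops elements, which is why every explicit use of hasNext() and next() had to be checked.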
One area where we could have had trouble, but avoided it, is in the use of the Java CharSequence interface. I wrote about this issue last year at String, CharSequence, IKVM, and .NET. As described in that article, we decided to eliminate our dependence on the CharSequence interface. For a great many internal uses of strings in Saxon, we now use a new interface UnicodeString which as the name implies is much more Unicode-friendly than Java's String and CharSequence. It also reduces memory usage, especially in the TinyTree. But there is a small overhead in the places where we have to convert strings to or from UnicodeStrings, which we can't hide entirely: it represents about 5% on the bottom line. But it does make all this code much easier to port between Java and C#.
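As background to "Unicode-friendly": Java's String and CharSequence index by UTF-16 code units rather than by Unicode characters, so any character outside the Basic Multilingual Plane occupies two positions. The sketch below shows only that standard Java behaviour; it says nothing about the API of UnicodeString, which the post does not describe, and the class name Utf16Pitfalls is hypothetical.

```java
class Utf16Pitfalls {
    // 'a', then U+1D54F (a mathematical double-struck X, outside the BMP), then 'b'.
    static final String TEXT = "a\uD835\uDD4Fb";

    // Length as String and CharSequence report it: UTF-16 code units.
    static int codeUnits(String s) {
        return s.length();
    }

    // Length in Unicode code points, which is what XPath string functions count.
    static int codePoints(String s) {
        return s.codePointCount(0, s.length());
    }

    // True if charAt(index) returns half of a surrogate pair rather than a whole character.
    static boolean isHalfCharacter(String s, int index) {
        return Character.isSurrogate(s.charAt(index));
    }

    public static void main(String[] args) {
        System.out.println(codeUnits(TEXT) + " code units, " + codePoints(TEXT) + " code points");
        System.out.println("charAt(1) is half a character: " + isHalfCharacter(TEXT, 1));
    }
}
```

Code that needs to count or index characters therefore has to work in code points throughout, which is awkward to do on top of an interface whose every position is a code unit.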
What about dependencies? So far we've just been tackling the Saxon-HE code base, and that has very few dependencies that have caused any difficulty. Most of the uses of standard Java library classes (maps, lists, input and output streams, and the like) are handled by the converter, simply translating calls into the nearest C# equivalent. In some cases such as java.util.Properties we've written an emulation of the Java interface (or the parts of it that we actually use). In other cases we've redirected calls to helper methods. For example we don't always have enough type information to know whether Java's List.remove() should be translated to List.Remove() or List.RemoveAt(); so instead we generate a call on a static helper method, which makes the decision at runtime based on the type of the supplied argument.
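The List.remove() ambiguity exists because Java overloads the method: remove(int) deletes by position, while remove(Object) deletes by value. The sketch below shows the two overloads, followed by a Java rendering of the kind of runtime decision such a helper might make. The real helper is generated for C# and is not shown in this post, so the ListHelper class, its signature and its logic are assumptions.

```java
import java.util.ArrayList;
import java.util.List;

class ListHelper {
    /**
     * Hypothetical helper: removes by position when the argument is an
     * integer, otherwise removes the first matching element.
     * Limitation of this sketch: for a list of integers it can no longer
     * tell "remove the value 1" from "remove the item at index 1".
     */
    static void remove(List<?> list, Object arg) {
        if (arg instanceof Integer) {
            list.remove(((Integer) arg).intValue());  // by position (RemoveAt in C#)
        } else {
            list.remove(arg);                         // by value (Remove in C#)
        }
    }

    // The two Java overloads side by side, on a list where they differ.
    static List<Integer> overloadDemo() {
        List<Integer> nums = new ArrayList<>(List.of(10, 20, 30));
        nums.remove(1);                    // remove(int): drops the item at index 1
        nums.remove(Integer.valueOf(10));  // remove(Object): drops the value 10
        return nums;
    }

    public static void main(String[] args) {
        List<String> names = new ArrayList<>(List.of("a", "b", "c"));
        remove(names, 0);    // the int is boxed, so the helper removes by position
        remove(names, "c");  // a String, so the helper removes by value
        System.out.println(names);
        System.out.println(overloadDemo());
    }
}
```

In Java the compiler picks the overload from the static type of the argument; a converter that lacks that type information has to defer the choice, which is the trade-off described above.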
The only external dependency we've picked up so far is for handling big decimal numbers. We're currently evaluating the BigDecimal library from Singulink, which appears to offer all the required functionality, though its philosophy is sufficiently different from the Java BigDecimal to make conversion non-trivial.
One thing I should stress is that we haven't written a general-purpose Java to C# converter. Our converter is designed to handle the Saxon codebase, and nothing else. Some of the conversion rules are specific to particular Saxon classes, and as a general principle, we only convert the subset of the language and of the class library that we actually need. Some of the conversion rules assume that the code is written to the coding conventions that we use in Saxon, but which might not be followed in other projects.
So, Hello World to Saxon-CS. There's still a lot of work to do, but we've reached a significant milestone.