Analyzing dependencies in a class library: a use case for XSLT streaming

I'm in the middle of building Saxon 9.2 for the .NET platform, and part of that was to move forward to the latest IKVM release. IKVM (in version 0.40) now has much more complete coverage of the Java class library, but an unfortunate consequence of this is that the libraries have become rather large. They clearly contain main things Saxon doesn't need - for example the XML part of the class library is over 8Mb, containing the whole of Xerces, Xalan, XSLTC, a streaming parser, RelaxNG, fast infoset, JAXP databinding, SOAP, web services and so on. All we need is Xerces (and possibly not even that).

To cut down the size of the library, we need to understand the dependencies between its parts. My first attempt to partition the library based on knowledge of the functionality yielded JAR files (and hence DLLs) with suprising dependencies, for example a reference from the Xerces library into Xalan.

After looking at a couple of dependency analysis tools, I decided it would be useful to have something that essentially just collects the raw data, allowing me to analyze it myself. So I downloaded Dependency Finder from depfind.sourceforge.net, and ran its analysis tool on the OpenJDK library. This took a fair while, but it ran to completion, and output a 230Mb XML file listing all the dependencies, at the granularity of individual methods (that is, for every method, which methods it calls, and which methods it is called by).

Processing a file of this size is close to the limits of what Saxon can handle on my laptop (the limits of course depend on physical memory available) without using streaming. So it seemed a good opportunity to apply some of the new streaming facilities in Saxon 9.2. It seemed that the first useful step in processing would be to filter the data (a) to remove the redundant "called-by" links, (b) to remove references to classes in the "java.lang" package, since these are ubiquitous and therefore uninteresting, and (c) to reduce the granularity so it only contained class->class references rather than method->method references. After some experimentation using a smaller class library, it proved reasonable easy to do this.

The processing falls into a very common model for streaming, whereby each <class> element in the input file can be transformed independently. So each time a <class> is encountered, it is copied into a variable, and conventional processing (in this case, for-each-group to group dependencies over all the methods in the class) is applied within that <class>. The resulting file is a mere 5Mb, making it much more amenable to further processing.

I did hit some difficulties in achieving this. The main one was that it was working in streaming mode when running inside my Java IDE (IntelliJ), but was running out of memory when run on the actual Saxon 9.2 product build. Since I'm in the process of validating the product for release, I couldn't let this pass without investigation. It turned out that it runs out of memory when using JDK 1.6 with its built-in version of Xerces, but works fine on JDK 1.5, or on JDK 1.6 with the Apache release of Xerces. I've already seen so many bugs in the Sun JDK 1.6 Xerces version that I recommend people never to use it for production work - this seems to be yet another reason to avoid it.

So I've now got a 5Mb XML file holding the class->class dependencies, and that's as far as I've got today. The next step is to understand specific cross-package dependencies.

(continuing the story)

My next step was to reduce the graph of class->class dependencies to a graph of package->package dependencies. In fact I initially attempted to find the transitive dependencies between packages, but this turned out not to be needed. Finding the direct package uses is a simple case of stripping the package names out of the class names, and then eliminating the duplicate dependencies.

We now use this to evaluate the proposed module structure of the classpath library (that is, the way it is partitioned into DLL assemblies). The IKVM build uses a file called (strangely) response.txt to define which Java packages go in which module. This is in a custom non-XML syntax, but it's a simple matter using unparsed-text() and regular expressions to convert it into an XML representation looking something like this:

   <module>
      <name>IKVM.OpenJDK.Corba.dll</name>
      <package>com.sun.corba.se.impl.activation</package>
      <package>com.sun.corba.se.impl.copyobject</package>
      <package>com.sun.corba.se.impl.corba</package>
      <package>com.sun.corba.se.impl.dynamicany</package>
      <package>com.sun.corba.se.impl.encoding</package>
      <package>com.sun.corba.se.impl.interceptors</package>
      <package>com.sun.corba.se.impl.io</package>    </module>
And so on for each module.

We can now determine the module->module dependencies by reducing the package->package dependencies in much the same way as we reduced the class dependencies to package dependencies earlier; the only difference is that converting a package name to a module name isn't a string manipulation, it's a look-up in this module structure file. Also, the amount of data we are dealing with is now so far reduced that we don't have to throw information away: we can record the way in which one module depends on another, like this:

<module name="IKVM.OpenJDK.XMLstream.dll">
      <uses-module name="IKVM.OpenJDK.XMLparse.dll">
         <via>com.sun.xml.internal.stream</via>
         <via>com.sun.xml.internal.stream.dtd</via>
         <via>com.sun.xml.internal.stream.dtd.nonvalidating</via>
         <via>com.sun.xml.internal.stream.events</via>
         <via>com.sun.xml.internal.stream.writers</via>
      </uses-module>
      <uses-module name="IKVM.OpenJDK.Core.dll">
         <via>com.sun.xml.internal.stream</via>
         <via>com.sun.xml.internal.stream.buffer</via>
         <via>com.sun.xml.internal.stream.buffer.sax</via>
         <via>com.sun.xml.internal.stream.buffer.stax</via>
         <via>com.sun.xml.internal.stream.dtd.nonvalidating</via>
         <via>com.sun.xml.internal.stream.events</via>
         <via>com.sun.xml.internal.stream.util</via>
         <via>com.sun.xml.internal.stream.writers</via>
      </uses-module>
      <uses-module name="IKVM.OpenJDK.XMLapi.dll">
         <via>com.sun.xml.internal.stream</via>
         <via>com.sun.xml.internal.stream.buffer</via>
         <via>com.sun.xml.internal.stream.buffer.sax</via>
         <via>com.sun.xml.internal.stream.buffer.stax</via>
         <via>com.sun.xml.internal.stream.events</via>
         <via>com.sun.xml.internal.stream.writers</via>
      </uses-module>
      <uses-module name="IKVM.OpenJDK.XMLwebservices.dll">
         <via>com.sun.xml.internal.stream.buffer.stax</via>
      </uses-module>
      <uses-module name="IKVM.OpenJDK.Misc.dll">
         <via>com.sun.xml.internal.stream.buffer.stax</via>
      </uses-module>
   </module>

In theory of course we could do some kind of cluster analysis on the dependency graph to automate the allocation of packages to modules. But it seems more practical to do an initial modularization based on known characteristics of the software architectural structure, and then analyse the structure looking for anomalies, and adjusting the mapping accordingly.

(continuing the story again...)

After this it's a question of ad-hoc manual analysis. Some observations:

There are two dependencies that shouldn't be there "in theory": XMLparse -> XMLwebservices, and XMLsecurity -> XMLtransform. Without these dependencies, there would be many fewer circular dependencies between assemblies, and the story would be much cleaner.

The dependency of XMLparse (which now includes XMLstream) on XMLwebservices is via package com.sun.xml.internal.stream.buffer.stax, which references com.sun.xml.internal.org.jvnet.staxex, which is part of XMLwebservices. This package appears to have no other dependencies within XMLwebservices. So I've moved it into XMLparse, and this seems to solve the problem: XMLparse no longer has any dependencies on other assemblies in the XML family, other than XMLapi.

The dependency of XMLsecurity on XMLtransform is via two packages:

com.sun.org.apache.xml.internal.security.transforms.implementations
com.sun.org.apache.xml.internal.security.utils

In both cases the dependency is on XPath, not on Xalan proper. The obvious thing is to split com.sun.org.apache.xpath.internal and its subpackages into a separate assembly (XMLxpath). When we do this we find that xpath has a dependency on transform, via com.sun.org.apache.xpath.internal which references com.sun.org.apache.xalan.internal.extensions. However, com.sun.org.apache.xalan.internal.extensions has no dependencies on anything else in transform, so we can move it into xpath.

After this change, the dependencies within the XML family are:

api -> (nothing)
parse -> api
transform -> api, parse, xpath
xpath -> api, parse
binding -> api, webservices, relaxng
webservices -> api, parse, binding
relaxng -> api
security -> api, parse, xpath

leaving the only circularity within this family of assemblies as between binding and webservices.

Intuitively, I would expect webservices to depend on binding and not the other way around. The dependency of binding on webservices is via two packages:

com.sun.xml.internal.bind.v2.runtime.output
com.sun.xml.internal.bind.v2.runtime.unmarshaller

which both reference fastinfoset. So we'll try putting the fastinfoset packages (all those with "fastinfoset" in their name) into a separate assembly, and we find dependencies are now:

binding -> api, fastinfoset, relaxng
fastinfoset -> api
webservices -> api, parse, binding, fastinfoset

eliminating the remaining cycle. There would also be no cycles and no new dependencies if fastinfoset is moved into binding, so I've done that.

The resulting DLL sizes in Kb are:

API          204
Binding     1574
Parse       2474
RelaxNG      262
Security     338
Transform   1845
WebServices 1377
XPath        448

As far as Saxon is concerned, the only assemblies from this lot that we need to issue are API and Parse. In fact, at some stage I'd like to get rid of the Parse module: this module holds the Sun fork of the Apache Xerces parser, which is horribly buggy; I'd much rather use the Apache original which is much more reliable, or indeed make everything use the Microsoft System.Xml parser (the only drawback of that is its inability to report ID attributes, which an XSLT processor needs.)

Reflections: there are tools that do dependency analysis. I felt I wanted to play with the raw data. The Java tool that extracted the dependency information into XML proved just the right thing. Although the data was voluminous, Saxon in streaming mode can be used to bring it down to a practical size. The facilities in XSLT 2.0 (grouping, unparsed-text, regular expressions) make ad-hoc analysis of this kind of data fairly straightforward, and with an approach that relied partly on automation and partly on manual analysis, it was easy to find a mapping of packages to assemblies that eliminated cyclic dependencies.

Postscript: Jeroen Frijters used the packaging I proposed, and discovered using ndepend that there were dependencies between the various assemblies that I had not predicted. Looking back on the work, I discovered that the first stylesheet, to reduce method->method dependencies to class->class dependencies, was losing some dependencies (because of my inadequate understanding of the format of the output from the dependency analyzer). Correcting this reveals an unfortunate dependency from the parse assembly into the xpath assembly.

The dependency parse->xpath emanates from package com.sun.org.apache.xml.internal.dtm.ref, specifically from the fact that class com.sun.org.apache.xml.internal.dtm.ref.DTMNodeProxy has a reference to class com.sun.org.apache.xpath.internal.NodeSet. Looking at the code, this is an example of inappropriate re-use; it's using this NodeSet class as a convenient implementation of the DOM NodeList interface, but one that has far more functionality than is needed to deliver the interface, and that therefore creates further dependencies into the XPath package. It would be trivial to replace it with a simple inline implementation of NodeList, but of course we want to avoid source change if possible; apart from anything else, it's always conceivable that there's a client that relies on the returned NodeList being an instance of com.sun.org.apache.xpath.internal.NodeSet.

Without source change, I don't think we can remove this dependency. However, I think it's probably still worth keeping Parse and XPath separate, because at worst in only affects people using DOM parsing (as distinct from SAX or StAX parsing). I don't think it's possible to separate the functionality needed for DOM out of the Parse module.

Meanwhile, as far as Saxon is concerned, this strengthens the case for shipping with an IKVMC'd version of Apache Xerces, in place of the XML.Parse assembly derived from OpenJDK.