Transforming 20 Gigabytes

By Michael Kay on September 25, 2007 at 02:18p.m.

My heart sank when I received an email yesterday from a Saxonica customer reporting a failure when running a transformation on a 20Gb source file (using the "streaming" capabilities described at http://www.saxonica.com/documentation/sourcedocs/serial.html.) It was apparently failing half way through with no error message of any kind. How do you go about debugging such a problem, especially when you're in a different continent from the client?

Well, every problem is an opportunity, and this was my first chance to experiment with a transformation on this kind of scale - in fact, it's probably 100 times bigger than anything I have personally run before. And I was quite pleasantly surprised that it's more manageable than I expected. Downloading the 2Gb compressed file ran without problems while I was having dinner; decompressing it (using WinRAR) ran again without problems while I watched the news on TV.

I decided first to try my old DTDGenerator application on the source data, to check that the XML was well-formed. This is a pure SAX application with no dependencies on Saxon code. The first attempt ran out of memory; I discovered that the DTDGenerator is (unintentionally) heeping a hash table of all distinct attribute values. So I fixed this bug (first code change since 2001!) and fired it off again. After 40 minutes I decided I needed to get on with other work on my laptop and killed it, starting it again on the shiny new Vista machine sitting in the corner, once I had installed Saxon and Java on it. The DTDGenerator ran successfully on that machine in about 25 minutes, so I started up the transformation proper, which took around 50 minutes, producing a 1.9Gb CSV file as output, running with a constant memory footprint of around 20Mb. That's a processing rate of about 6Mb/sec, which doesn't seem bad.

It's noticeable that the new machine is only running at 25-30% CPU - it's one of the shiny new quad-core processors and of course Saxon isn't fully taking advantage of it. The streaming process actually uses two threads, but I think one of them is doing most of the work. When I bought the box I was hoping to do some experiments on multithreading, and this job seems a good one to start on.

I haven't solved the customer's problem yet, but I've shown that there's at least one environment in which Saxon can handle the job, which is reassuring. Now it's a question of finding out what's different in their environment - or indeed, whether they can reproduce the problem systematically at all.