Unicode, regular expressions, and Java

By Michael Kay on January 13, 2010 at 02:18p.m.

Many moons ago, when I first introduced regular expression support to Saxon's XSLT processor, I picked up a piece of software written by James Clark to translate regular expressions as defined in the XML Schema specification to regular expressions as understood by Java. Like any software written by James, it was extremely robust, handled all the quirks of the underlying specifications with unfailing accuracy, was tightly coded and fast, and was totally undocumented. 

One of its particular tasks was to cope with the fact that the Schema/XPath regex dialect counts a character above 65535 as one character, whereas the Java regex library, up to and including JDK 1.4, treated it as two. 
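
As a minimal illustration of the one-character-versus-two distinction, the following Java sketch compares the number of UTF-16 code units in a string with the number of Unicode code points, and checks whether a single regex "." consumes the whole character. The class name, method names, and the sample character U+1D538 are chosen purely for illustration and are not taken from Saxon or from James's code; the sketch assumes JDK 1.5 or later.

```java
import java.util.regex.Pattern;

final class SupplementaryCharDemo {
    // U+1D538 lies above 65535, so Java stores it as a surrogate pair
    // (two char values) inside a String.
    static final String SAMPLE = new String(Character.toChars(0x1D538));

    // Number of 16-bit code units: what a char-based regex engine sees.
    static int utf16Units(String s) {
        return s.length();
    }

    // Number of Unicode code points: what the Schema/XPath dialect counts.
    static int codePoints(String s) {
        return s.codePointCount(0, s.length());
    }

    // True if a single "." matches the entire string.
    static boolean matchesSingleDot(String s) {
        return Pattern.matches(".", s);
    }

    public static void main(String[] args) {
        System.out.println("UTF-16 units: " + utf16Units(SAMPLE));
        System.out.println("Code points:  " + codePoints(SAMPLE));
        System.out.println("Matches '.':  " + matchesSingleDot(SAMPLE));
    }
}
```

On a code-point-aware regex engine the single "." matches the whole string even though the string's length is 2.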

Over the years I've modified the code a bit. When JDK 1.5 came along and handled high-end characters correctly, I forked the code and produced one version for JDK 1.4, another for JDK 1.5. A third version targeted the .NET regex dialect. In Saxon 9.2 I finally got rid of the JDK 1.4 version, and I was also able to get rid of the .NET version by switching from using the .NET regular expression library to the library in OpenJDK, which had finally become reliable enough. 

Another of the tasks performed by James's code was to map character classes such as \p{Lu} (matching any upper-case letter) from what XPath said it should mean to what Java thought it meant. This code has been untouched until now, but I've decided to take a fresh look at it and see whether it is really needed. Apart from the problem of high-end characters, it seems that what it was really doing was coping with differences between Unicode versions. It's a little hard to unearth the history now of which specifications mandated which Unicode version, but the current situation seems to be that JDK 1.5 and 1.6 support Unicode 4.0, while the schema (and hence XPath) specs originally specified Unicode 3.1, but now allow you to support whatever later Unicode version you like. So 4.0 support would be fine. 
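
For readers less familiar with category escapes: in both the XPath and the Java dialects, lower-case \p{Lu} matches a character in the category Lu (upper-case letters) and upper-case \P{Lu} matches its complement. The sketch below, with hypothetical class and method names, exercises both forms against the JDK regex engine and, for comparison, asks Character.getType for the category the running JDK assigns to the same code point.

```java
import java.util.regex.Pattern;

final class CategoryEscapeDemo {
    static final Pattern UPPER = Pattern.compile("\\p{Lu}");
    static final Pattern NOT_UPPER = Pattern.compile("\\P{Lu}");

    // Turns a code point into a String, handling values above 65535.
    static String str(int codePoint) {
        return new String(Character.toChars(codePoint));
    }

    // True if the regex engine places the code point in category Lu.
    static boolean isUpper(int codePoint) {
        return UPPER.matcher(str(codePoint)).matches();
    }

    // True if the regex engine places the code point outside category Lu.
    static boolean isNotUpper(int codePoint) {
        return NOT_UPPER.matcher(str(codePoint)).matches();
    }

    // The category according to the JDK's own character database.
    static boolean jdkSaysUpper(int codePoint) {
        return Character.getType(codePoint) == Character.UPPERCASE_LETTER;
    }

    public static void main(String[] args) {
        int[] samples = {'A', 'a', 0x0391 /* GREEK CAPITAL LETTER ALPHA */, '1'};
        for (int cp : samples) {
            System.out.println(Integer.toHexString(cp)
                    + " \\p{Lu}=" + isUpper(cp)
                    + " \\P{Lu}=" + isNotUpper(cp)
                    + " getType=" + jdkSaysUpper(cp));
        }
    }
}
```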

I've generated XML documents showing the mapping of characters to classes by three different methods: direct Java coding using the JDK regex engine; XSLT code using Saxon 9.2 which uses the Clark translation of regular expressions to the JDK engine; and analysis of the data files published by the Unicode consortium. 
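
The first of these methods can be sketched roughly as follows. This is not the code actually used for the comparison; the class name, the method name, and the XML output shape are all assumptions made for illustration. It walks a range of code points, skips the surrogates, and counts how many the JDK regex engine places in each major category.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class CategoryTally {
    // Counts the code points in [from, to] that the JDK regex engine places
    // in the given Unicode category (e.g. "Lu", "Nd", or a major class "L").
    static int countMatches(String category, int from, int to) {
        Matcher m = Pattern.compile("\\p{" + category + "}").matcher("");
        int count = 0;
        for (int cp = from; cp <= to; cp++) {
            // Skip surrogate code points, which are not characters in XPath.
            if (cp >= 0xD800 && cp <= 0xDFFF) {
                continue;
            }
            if (m.reset(new String(Character.toChars(cp))).matches()) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        // Hypothetical output format: one element per major category.
        String[] categories = {"L", "M", "N", "P", "S", "Z", "C"};
        System.out.println("<classes>");
        for (String category : categories) {
            int n = countMatches(category, 0, Character.MAX_CODE_POINT);
            System.out.println("  <class name=\"" + category
                    + "\" count=\"" + n + "\"/>");
        }
        System.out.println("</classes>");
    }
}
```

The counts printed depend on the Unicode version supported by the JDK that runs the program, which is exactly the kind of variation the comparison is meant to expose.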

Between Saxon and the JDK there is a very close match. The only difference is that the JDK category C excludes subcategory Cn (unassigned characters) whereas Saxon includes this subcategory in its parent class. 
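
One way to probe how a particular JDK treats Cn relative to C is sketched below, using hypothetical class and method names. It finds the first code point that the running JDK regards as unassigned and then tests it against both \p{Cn} and \p{C}. The outcome of the \p{C} test is deliberately not stated here, since it is a property of the JDK being examined.

```java
import java.util.regex.Pattern;

final class UnassignedCategoryDemo {
    // Returns the first code point the running JDK reports as unassigned,
    // or -1 if there is none.
    static int firstUnassigned() {
        for (int cp = 0; cp <= Character.MAX_CODE_POINT; cp++) {
            if (Character.getType(cp) == Character.UNASSIGNED) {
                return cp;
            }
        }
        return -1;
    }

    // True if the regex matches the single code point in its entirety.
    static boolean matches(String regex, int codePoint) {
        return Pattern.matches(regex, new String(Character.toChars(codePoint)));
    }

    public static void main(String[] args) {
        int cp = firstUnassigned();
        if (cp < 0) {
            System.out.println("No unassigned code point found");
            return;
        }
        System.out.println("First unassigned code point: U+"
                + Integer.toHexString(cp).toUpperCase());
        System.out.println("\\p{Cn} matches: " + matches("\\p{Cn}", cp));
        System.out.println("\\p{C}  matches: " + matches("\\p{C}", cp));
    }
}
```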

Between the data coming from Unicode and the JDK there is a less close match. This was because I worked with Unicode 5.2 data files, which include many more characters than the JDK understands. But I've repeated the comparison with Unicode 4.0.0 files, and this gives a very close match, after allowing for expected discrepancies such as the omission of surrogates (which are non-characters in XPath) in one of the lists. 

So, it's looking as if I can cut out a lot more of James's translation code. 

There does seem to be one snag that I need to look at further: in one of my attempts to collect this data, I used the complement classes such as \P{Cn}, replacing characters that matched this term with an empty string. The result was a string containing unmatched surrogates, which immediately crashes Saxon. This could be a bug in the JDK handling of surrogate pairs, or it could be something else that I don't yet understand. One of the difficulties with regex handling has always been that you're very exposed to bugs in the regex library for the final ounce of conformance, and if you hit them, there's sometimes not much you can do about them.
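
A defensive check for this kind of damage might look like the sketch below. The class and method names are hypothetical and this is not Saxon's code: it simply scans a string for a high surrogate not followed by a low surrogate, or a low surrogate not preceded by a high one. The main method applies the same \P{Cn} replacement to a small sample; what it prints depends on the regex library in the JDK being used.

```java
final class SurrogateCheck {
    // True if the string contains a surrogate that is not part of a
    // well-formed high/low pair.
    static boolean hasUnpairedSurrogate(String s) {
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (Character.isHighSurrogate(c)) {
                if (i + 1 < s.length()
                        && Character.isLowSurrogate(s.charAt(i + 1))) {
                    i++; // well-formed pair: skip its second half
                } else {
                    return true; // high surrogate with no partner
                }
            } else if (Character.isLowSurrogate(c)) {
                return true; // low surrogate with no preceding high one
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // A BMP letter followed by a supplementary character (U+1D538).
        String input = "A" + new String(Character.toChars(0x1D538));
        String stripped = input.replaceAll("\\P{Cn}", "");
        System.out.println("input well-formed:  " + !hasUnpairedSurrogate(input));
        System.out.println("output well-formed: " + !hasUnpairedSurrogate(stripped));
    }
}
```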