Schema Modularity

By Michael Kay on June 20, 2023 at 12:00 p.m.

Saxon, ever since we first introduced schema awareness back in 2004, has always worked with a single global schema maintained at the level of the Configuration object. This article discusses the advantages and disadvantages of this approach, and looks at possible alternatives.

Let's start by getting some terminology clear. A schema, in the terminology of the XSD specification, is a set of schema components (such as element declarations and type definitions). Don't confuse it with a schema document, which is an XML document rooted at an xs:schema element. A schema is what you get when you compile a collection of schema documents linked to each other using xs:include and xs:import declarations.

Not every set of schema components constitutes a valid schema. Most obviously you can't have two different components (for example, two type definitions) with the same name, unless they are identical. The XSD specification is a bit fuzzy about what it means for two components to be identical.

This means that in general, you can try to combine two schemas into one by taking their union, but the operation won't always succeed, because the two schemas may turn out to be inconsistent with each other.
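To make the idea of a union that can fail concrete, here is a toy model in Python. It is a sketch only, not Saxon code: the Component class, the union function, and the sample names are all invented for illustration, and "identical" is taken to mean plain structural equality, which is a simplifying assumption given that the specification is fuzzy on this point.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Component:
    """A toy schema component (hypothetical, for illustration only)."""
    kind: str        # e.g. "element" or "type"
    name: str        # expanded name, simplified here to a plain string
    definition: str  # stand-in for the component's properties


class InconsistentSchemas(Exception):
    """Raised when two schemas cannot be combined."""


def union(x, y):
    """Combine two schemas (sets of components), or fail if they conflict."""
    merged = {(c.kind, c.name): c for c in x}
    for c in y:
        existing = merged.get((c.kind, c.name))
        # Same kind and name is allowed only if the components are identical
        # (assumed here to mean structurally equal).
        if existing is not None and existing != c:
            raise InconsistentSchemas(
                f"conflicting {c.kind} definitions for {c.name}")
        merged[(c.kind, c.name)] = c
    return frozenset(merged.values())


x = frozenset({Component("type", "addressType", "street, city")})
y = frozenset({Component("type", "addressType", "street, city"),
               Component("element", "address", "addressType")})
z = frozenset({Component("type", "addressType", "street, city, country")})

combined = union(x, y)   # succeeds: the shared type definition is identical

try:
    union(x, z)          # fails: two different definitions of addressType
    conflict = False
except InconsistentSchemas:
    conflict = True
```

In this model X+Y is a valid schema with two components, while X+Z is rejected because the two definitions of addressType differ.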

The Global Schema

Saxon currently maintains a global schema at the level of the Configuration object. This means that every time you introduce a new schema, for example by compiling a schema-aware query or stylesheet that has an import schema declaration, or by validating a document against a schema loaded using the SchemaManager API, or referenced using xsi:schemaLocation, the schema components from that schema are added to the global pool, provided they are consistent with the declarations already present in the pool.

These consistency checks are of two kinds:

The main benefit of the global schema approach is that you can always be sure that type annotations in validated instance documents are consistent with types that are mentioned (or inferred) in compiled queries and stylesheets. If a query is compiled believing that element E will always be empty, then you can be sure that every validated instance of E will be empty, because no-one is allowed, between compilation and validation, to add a type definition that extends or redefines E allowing it to be non-empty. That's the theory, anyway.

The most obvious disadvantage of the approach is that an application can't work with two different versions of the same schema. If you want to write a stylesheet that transforms input documents from V1 to V2 of the same schema, you can't import both versions into the same stylesheet, one to validate the input and one to validate the output. In fact, you can't even have both versions in the same Configuration — which means you can't process an input collection containing a mix of different versions (or if you do, you have to forgo validation).

There are other less obvious disadvantages. One of them is revealed by a recent embarrassing bug where we discovered that schema compilation isn't thread safe: you can't reliably run two schema compilations within a single Configuration at the same time. We've patched that by adding some locking, but it's an imperfect solution because the lock is rather coarse-grained. We need to find a better solution, and that gives us an opportunity to re-examine the design and see whether we can fix some other long-standing issues at the same time.

Another unresolved issue is the long-standing bug #3531, concerning a situation where two independently-loaded schemas X and Y both extend the same substitution group. It has stayed open because we have had no reports of users being affected by it; but it remains an unsatisfactory state of affairs.

The X+Y Problem

Suppose that X and Y are valid schemas. Then we've already seen that their union, which I will call X+Y, is not necessarily a valid schema; their declarations might be inconsistent. Apart from the obvious inconsistencies where X and Y contain different elements or types with the same name, there can be much more subtle inconsistencies:

It doesn't really help that all these situations are rare. Should the processor simply ignore the problem and hope it doesn't happen? For the first three cases above, Saxon prevents the situation occurring, which imposes an inconvenience on users who are actually doing something completely safe. For the final case (wildcards), Saxon ignores the problem, which creates the theoretical risk that queries and stylesheets are not type-safe: a document that has been validated against a type T might not satisfy all the constraints that the query or stylesheet processor assumes to be true for any valid instance of T.

An Alternative: Modular Schemas

Let's consider an alternative model, where instead of adding all schema components to a single global schema at Configuration level, we keep schemas independent and modular. So two stylesheets that import different schema documents have separate unrelated schemas, and there are no requirements that the two schemas should be consistent with each other.

The challenge now is to ensure that a source document validated against a schema S1 is consistent with a stylesheet that imports schema S2. If the two schemas are identical, there's no problem (and it's not too hard to detect that they are identical, for example if they load the same schema document as their starting point).

But what if S2 is a superset of S1? Suppose the document is validated against a schema with target namespace X, while the stylesheet has two xsl:import-schema declarations, for namespaces X and Y? We're now back with the X+Y problem: a document that is valid against X is not necessarily valid against X+Y.

It gets worse: if we have a pipeline of stylesheets, each of which imports schemas for both its input document and its output document, then the first stylesheet might import schemas for X+Y, and the second for Y+Z, and we need to be sure that when the first stylesheet validates its output against X+Y, the result will also be valid input against Y+Z.

One possible solution here is to keep the imported schemas within a single stylesheet separate. Import one schema for the input, and another for the output, and don't require the two to be consistent. This also solves the problem of transforming from V1 to V2 of the same schema. So in our pipeline, the output of the first stylesheet would be validated not against X+Y, but merely against Y, which is the same schema used for the input of the second stylesheet. This would need language changes: xsl:import-schema declarations would need to identify which schema they belong to, and type names used in "type" or "as" attributes would need to be qualified with a schema name.

I've started doing work to allow free-standing schemas to be constructed and used for validation, independently of the Configuration. There are clearly cases where this is useful. However, there's a lot more work to be done on ensuring consistency of free-standing schemas when a validated document is used as input to a schema-aware stylesheet or query. Expect a new class of (initially bewildering) error messages saying that element E is known to be valid against type T in schema X, but it isn't known to be valid against type T in schema Y. Hopefully these will be rare.

What about Wildcards?

I mentioned that there's an open issue with wildcards: if a schema type includes a lax wildcard, then an element that's valid against that schema (because there's no element declaration matching the actual element name) can become invalid when more element declarations are added.
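The effect can be illustrated with a toy model in Python. This is a sketch under simplifying assumptions, not Saxon's validator: the validate_lax function, the dictionaries standing in for schemas, and the element names are all hypothetical, and an element declaration is reduced to a predicate on text content.

```python
def validate_lax(element_name, content, declarations):
    """Toy lax wildcard: validate only if a global declaration matches.

    `declarations` maps element names to a predicate on the content
    (a stand-in for a real element declaration).
    """
    check = declarations.get(element_name)
    if check is None:
        return True        # no declaration found: lax processing accepts it
    return check(content)  # declaration found: the content must satisfy it


# Schema X declares only "order"; schema Y declares "note" as always empty.
schema_x = {"order": lambda content: content.isdigit()}
schema_y = {"note": lambda content: content == ""}

# In this toy model, X+Y is simply the union of the two sets of declarations.
schema_x_plus_y = {**schema_x, **schema_y}

# The same <note> element, matched by a lax wildcard in some type of X:
valid_under_x = validate_lax("note", "call back later", schema_x)
valid_under_x_plus_y = validate_lax("note", "call back later", schema_x_plus_y)
```

Here the note element is accepted when only X is known, because nothing declares it, and rejected once Y's declaration joins the pool, even though neither the document nor X has changed.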

This isn't the only issue with wildcards. XSD 1.1 allows you to say, for example, notQName="##defined", which means that the name used for an element or attribute must be one that has no global declaration anywhere in the schema. That's another example of how adding new declarations to a schema can make existing content invalid.

I think the answer to this problem is to interpret these definitions in the context of a "schema compilation unit". That is, when you compile a schema, notQName="##defined" is interpreted as meaning "not a name used for a global element/attribute declaration in that schema"; any names or declarations added later (by merging this schema with others) have no effect on the meaning.
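The difference between the two interpretations can be sketched in a few lines of Python. Again this is a toy model rather than anything in Saxon: the function names allows_live and compile_wildcard are invented, and a schema is reduced to a set of globally declared names.

```python
def allows_live(name, pool):
    """##defined evaluated against whatever the pool contains right now."""
    return name not in pool


def compile_wildcard(pool):
    """##defined evaluated against a snapshot taken at compile time.

    The snapshot plays the role of the schema compilation unit.
    """
    snapshot = frozenset(pool)
    return lambda name: name not in snapshot


pool = {"order", "item"}          # global names when schema X is compiled
frozen = compile_wildcard(pool)   # wildcard bound to X's compilation unit

pool.add("note")                  # later, schema Y is merged into the pool

live_result = allows_live("note", pool)   # the merge changed the answer
frozen_result = frozen("note")            # the merge had no effect
```

Under the live interpretation, content using the name note stops matching the wildcard as soon as Y is merged in. Under the compilation-unit interpretation, the wildcard keeps the meaning it had when X was compiled.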

This seems to solve the problem whether using a global schema or local free-standing schemas, and makes the two cases behave more consistently and predictably.