Saxon, ever since we first introduced schema awareness back in 2004, has always
worked with a single global schema maintained at the level of the Configuration
object. This article discusses the advantages and disadvantages of this approach, and
looks at possible alternatives.
Let's start by getting some terminology clear. A schema, in the terminology
of the XSD specification, is a set of schema components (such as element declarations
and type definitions). Don't confuse it with a schema document, which is an
XML document rooted at an xs:schema element. A schema is what you get when you
compile a collection of schema documents linked to each other using xs:include
and xs:import declarations.
Not every set of schema components constitutes a valid schema. Most obviously you can't have two different components (for example, two type definitions) with the same name, unless they are identical. The XSD specification is a bit fuzzy about what it means for two components to be identical.
This means that in general, you can try to combine two schemas into one by taking their union, but the operation won't always succeed, because the two schemas may turn out to be inconsistent with each other.
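To make the idea of a failing union concrete, here is a minimal sketch in Python. It is a toy model, not how Saxon represents schemas: a schema is assumed to be a plain dictionary from component name to an opaque definition, "identical" is assumed to mean simple equality, and the names union and SchemaConflict are invented for the illustration.

```python
class SchemaConflict(Exception):
    """Raised when two toy schemas cannot be merged into one valid schema."""


def union(x, y):
    """Return the union of two toy schemas, or raise SchemaConflict.

    A toy schema is a dict mapping a component name to an opaque definition.
    Two components with the same name are tolerated only if they compare equal.
    """
    merged = dict(x)
    for name, definition in y.items():
        if name in merged and merged[name] != definition:
            raise SchemaConflict(f"conflicting definitions for {name!r}")
        merged[name] = definition
    return merged


schema_x = {"type:Address": "street, city", "element:order": "type:Order"}
schema_y = {"type:Address": "street, city, postcode"}
schema_z = {"type:Address": "street, city", "element:invoice": "type:Invoice"}

# X and Z agree on Address, so their union is a valid toy schema.
x_plus_z = union(schema_x, schema_z)

# X and Y disagree on Address, so the union is rejected.
try:
    union(schema_x, schema_y)
    conflict = None
except SchemaConflict as err:
    conflict = str(err)
```

The real difficulty, of course, is hidden inside the equality test: deciding when two components are identical is exactly the part the XSD specification leaves fuzzy.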
The Global Schema
Saxon currently maintains a global schema at the level of the Configuration
object. This means that every time you introduce a new schema, for example by
compiling a schema-aware query or stylesheet that has an import schema declaration,
or by validating a document against a schema loaded using the SchemaManager
API, or referenced using xsi:schemaLocation, the schema components
from that schema are added to the global pool, provided they are consistent with
the declarations already present in the pool.
These consistency checks are of two kinds:
Firstly, the new components must not have names that clash with the old. You can't have two different types with the same name. To test whether two types are the same, Saxon basically checks that they came from the same place: the same line number in the same source schema document.
Secondly, the new components must not change the semantics of existing components, if the existing components have already been used for validation purposes. The test for whether components have been used is made at the level of a target namespace: if any declaration or definition in a target namespace has been used for validating a source document, or if it has been used when compiling a stylesheet or query, then the namespace is sealed, which means that its components must not be extended or redefined by any new components. We'll talk later about exactly what this means.
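The two checks can be sketched as a toy model in Python. This is not Saxon's code: the class GlobalSchema, its methods, and the base_namespace parameter are all hypothetical names chosen for the illustration, and a component's identity is assumed to be just a (document, line number) pair.

```python
class GlobalSchema:
    """Toy model of a Configuration-level pool of schema components."""

    def __init__(self):
        self.components = {}   # (namespace, local name) -> source location
        self.sealed = set()    # namespaces already used for validation or compilation

    def mark_used(self, namespace):
        """Record that a component in this namespace has been used; this seals it."""
        self.sealed.add(namespace)

    def add(self, namespace, name, location, base_namespace=None):
        """Add one component, applying both consistency checks.

        base_namespace is the namespace of a component that the new one
        extends or redefines, or None if it does neither.
        """
        key = (namespace, name)
        existing = self.components.get(key)
        # Check 1: the same name is accepted only if it came from the same place.
        if existing is not None and existing != location:
            raise ValueError(f"{name} is already defined at {existing}")
        # Check 2: components in a sealed namespace must not be extended or redefined.
        if base_namespace is not None and base_namespace in self.sealed:
            raise ValueError(f"namespace {base_namespace} is sealed")
        self.components[key] = location


pool = GlobalSchema()
pool.add("urn:x", "Address", ("x.xsd", 12))
pool.add("urn:x", "Address", ("x.xsd", 12))   # same place: accepted
pool.mark_used("urn:x")                       # e.g. a document has been validated

errors = []
attempts = [
    ("urn:x", "Address", ("other.xsd", 40), None),    # name clash
    ("urn:y", "UKAddress", ("y.xsd", 7), "urn:x"),    # extends a sealed namespace
]
for args in attempts:
    try:
        pool.add(*args)
    except ValueError as err:
        errors.append(str(err))
```

Both attempts are rejected, and the pool is left containing only the original Address definition.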
The main benefit of the global schema approach is that you can always be sure that type annotations in validated instance documents are consistent with types that are mentioned (or inferred) in compiled queries and stylesheets. If a query is compiled believing that element E will always be empty, then you can be sure that every validated instance of E will be empty, because no-one is allowed, between compilation and validation, to add a type definition that extends or redefines E allowing it to be non-empty. That's the theory, anyway.
The most obvious disadvantage of the approach is that an application can't
work with two different versions of the same schema. If you want to write a stylesheet
that transforms input documents from V1 to V2 of the same schema, you
can't import both versions into the same stylesheet, one to validate the input and one
to validate the output. In fact, you can't even have both versions in the same
Configuration, which means you can't process an input collection containing
a mix of different versions (or if you do, you have to forgo validation).
There are other less obvious disadvantages. One of them is revealed by a recent
embarrassing bug where we discovered that schema compilation isn't thread safe: you
can't reliably run two schema compilations within a single Configuration
at the same time. We've patched that by adding some locking, but it's an imperfect
solution because the lock is rather coarse-grained. We need to find a better solution,
and that gives us an opportunity to re-examine the design and see whether we can fix some
other long-standing issues at the same time.
Another outstanding issue is the long-standing bug #3531, concerning a situation where two independently-loaded schemas X and Y both extend the same substitution group. This has remained outstanding because we have had no reports of users being affected by it; but it remains an unsatisfactory state of affairs.
The X+Y Problem
Suppose that X and Y are valid schemas. Then we've already seen that their union, which I will call X+Y, is not necessarily a valid schema; their declarations might be inconsistent. Apart from the obvious inconsistencies where X and Y contain different elements or types with the same name, there can be much more subtle inconsistencies:
Y might contain a type that redefines a type in X. The semantics of xs:redefine as defined in XSD are a bit fuzzy round the edges, but the effect is supposed to be global: if a type is redefined, then all references to that type are affected. Exactly what happens when the same type is redefined more than once isn't made clear. The introduction of xs:override in XSD 1.1 makes this problem even worse, because xs:override allows existing definitions to be changed in completely arbitrary ways.
Y might contain a type that extends a type in X, by allowing additional child elements or attributes. To be valid, instances that use the extended type have to identify themselves with an xsi:type attribute, but this doesn't alter the fact that adding the extended type to the global schema makes some things valid that were previously invalid. It's bad news if a document that was invalid at breakfast time suddenly becomes valid at tea time because a new schema has been added to the Configuration. It's even worse news if the schema changes during the course of validating an input document.
Y might contain an element declaration that adds to the substitution group of an element defined in X. Again, this means that a document that was invalid when validated against schema X becomes valid when validated against X+Y.
X might contain a wildcard, for example <xs:any processContents="lax"/>. Suppose an instance document matches this wildcard with an element named W, and is valid (under the rules for lax validation) because the schema contains no element declaration for W. Now schema Y comes along and adds a definition for element W, and suddenly our source document is no longer valid.
It doesn't really help that all these situations are rare. Should the processor simply ignore the problem and hope it doesn't happen? For the first three cases above, Saxon prevents the situation occurring, which imposes an inconvenience on users who are actually doing something completely safe. For the final case (wildcards), Saxon ignores the problem, which creates the theoretical risk that queries and stylesheets are not type-safe: a document that has been validated against a type T might not satisfy all the constraints that the query or stylesheet processor assumes to be true for any valid instance of T.
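The substitution group case is perhaps the easiest of these to picture. The following Python sketch is a deliberately crude model, with invented element names (item, book, dvd) and invented helper functions; it only tracks which element names may stand in for the head of a substitution group, and ignores everything else a real validator does.

```python
def allowed_children(schema, head):
    """Names accepted wherever the content model refers to `head`."""
    return {head} | schema["substitution_groups"].get(head, set())


def is_valid(schema, child_name, head="item"):
    """Toy validation: is this child name acceptable in place of `head`?"""
    return child_name in allowed_children(schema, head)


def merge(x, y):
    """Toy X+Y: substitution groups with the same head are combined."""
    groups = {h: set(m) for h, m in x["substitution_groups"].items()}
    for head, members in y["substitution_groups"].items():
        groups.setdefault(head, set()).update(members)
    return {"substitution_groups": groups}


schema_x = {"substitution_groups": {"item": {"book"}}}
schema_y = {"substitution_groups": {"item": {"dvd"}}}

# A document using <dvd> where <item> is expected:
valid_against_x = is_valid(schema_x, "dvd")                         # False
valid_against_x_plus_y = is_valid(merge(schema_x, schema_y), "dvd")  # True
```

Nothing in X has changed, yet the same document gets a different answer once Y has been loaded.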
An Alternative: Modular Schemas
Let's consider an alternative model, where instead of adding all schema
components to a single global schema at Configuration level,
we keep schemas independent and modular. So two stylesheets that import
different schema documents have separate unrelated schemas, and there are
no requirements that the two schemas should be consistent with each other.
The challenge now is to ensure that a source document validated against a schema S1 is consistent with a stylesheet that imports schema S2. If the two schemas are identical, there's no problem (and it's not too hard to detect that they are identical, for example if they load the same schema document as their starting point).
But what if S2 is a superset of S1? Suppose the document is
validated against a schema with target namespace X, while the stylesheet
has two xsl:import-schema declarations, for namespaces X
and Y? We're now back with the X+Y problem: a document that is valid
against X is not necessarily valid against X+Y.
It gets worse: if we have a pipeline of stylesheets, each of which imports schemas for both its input document and its output document, then the first stylesheet might import schemas for X+Y, and the second for Y+Z, and we need to be sure that when the first stylesheet validates its output against X+Y, the result will also be valid input against Y+Z.
One possible solution here is to keep the imported schemas within a single
stylesheet separate. Import one schema for the input, and another for the output,
and don't require the two to be consistent. This also solves the problem of transforming
from V1 to V2 of the same schema. So in our pipeline, the output
of the first stylesheet would be validated not against X+Y, but merely
against Y, which is the same schema used for the input of the second stylesheet.
This would need language changes: xsl:import-schema declarations
would need to identify which schema they belong to, and type names used in
type or as attributes would need to qualify the type
name with a schema name.
I've started doing work to allow free-standing schemas to be constructed
and used for validation, independently of the Configuration.
There are clearly cases where this is useful. However, there's a lot more
work to be done on ensuring consistency of free-standing schemas, when a validated
document is used as input to a schema-aware stylesheet or query. Expect a new
class of (initially bewildering) error messages saying that element E
is known to be valid against type T in schema X, but it isn't known
to be valid against type T in schema Y. Hopefully these will be rare.
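To show where such a message might come from, here is a small Python sketch. It is purely an assumption about how type annotations could carry the identity of their schema; the names TypeRef, schema_id and check_annotation are hypothetical, and nothing here describes an actual Saxon API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TypeRef:
    """A type name qualified by the (hypothetical) identity of its schema."""
    schema_id: str
    name: str


def check_annotation(element, annotation, expected):
    """Return None if the annotation satisfies the expectation, else a message."""
    if annotation == expected:
        return None
    return (
        f"element {element} is known to be valid against type "
        f"{annotation.name} in schema {annotation.schema_id}, but it isn't known "
        f"to be valid against type {expected.name} in schema {expected.schema_id}"
    )


validated_as = TypeRef("X", "T")       # annotation produced by validation
stylesheet_wants = TypeRef("Y", "T")   # type named in the stylesheet

ok = check_annotation("E", validated_as, TypeRef("X", "T"))
message = check_annotation("E", validated_as, stylesheet_wants)
```

In this model two types with the same name are unrelated unless they come from the same schema, which is what makes the message bewildering at first sight: the names match, but the schemas don't.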
What about Wildcards?
I mentioned that there's an open issue with wildcards: if a schema type includes a lax wildcard, then an element that's valid against that schema (because there's no element declaration matching the actual element name) can become invalid when more element declarations are added.
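A toy version of lax validation shows the effect. In this Python sketch, which is only an illustration and not real XSD processing, a schema is assumed to be a dictionary from element name to a checking function, and the rule Y imposes on W (digits only) is invented for the example.

```python
def validate_lax(element_name, content, declarations):
    """Toy lax wildcard: validate only if a declaration exists for the name."""
    check = declarations.get(element_name)
    if check is None:
        return True           # no declaration: lax validation lets it through
    return check(content)     # a declaration exists, so it must be satisfied


schema_x = {}                                         # X declares nothing named W
schema_y = {"W": lambda content: content.isdigit()}   # Y: W must contain digits only
x_plus_y = {**schema_x, **schema_y}

# The same element, <W>hello</W>, matched by the lax wildcard:
valid_before = validate_lax("W", "hello", schema_x)   # True
valid_after = validate_lax("W", "hello", x_plus_y)    # False
```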
This isn't the only issue with wildcards. XSD 1.1 allows you to say, for example,
notQName="##defined", which means that the name used for an element or
attribute must be one that has no global declaration anywhere in the schema. That's
another example of how adding new declarations to a schema can make existing content
invalid.
I think the answer to this problem is to interpret these definitions in the context
of a "schema compilation unit". That is, when you compile a schema,
notQName="##defined" is interpreted as meaning "not a name used for
a global element/attribute declaration in that schema"; any
names or declarations added later (by merging this schema with others) have no effect
on the meaning.
This seems to solve the problem whether using a global schema or local free-standing schemas, and makes the two cases behave more consistently and predictably.
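The difference between the two interpretations can be sketched in a few lines of Python. This is a toy model under the assumption that a compiled wildcard simply takes a snapshot of the names declared when it was compiled; the class CompiledWildcard and the element names are invented, and this is not a description of what Saxon does today.

```python
class CompiledWildcard:
    """Toy wildcard with notQName="##defined", bound to its compilation unit."""

    def __init__(self, declared_names):
        # Snapshot of the global declarations at compile time; frozen thereafter.
        self.defined_at_compile_time = frozenset(declared_names)

    def matches(self, name):
        """True if the name had no global declaration in the compilation unit."""
        return name not in self.defined_at_compile_time


declared = {"order", "item"}            # global declarations when X is compiled
wildcard = CompiledWildcard(declared)

declared.add("note")                    # a later merge adds a declaration for note

# Compilation-unit interpretation: the later declaration has no effect.
still_matches = wildcard.matches("note")       # True
# "Live" interpretation against the merged pool: the meaning has changed.
live_matches = "note" not in declared          # False
```

With the snapshot, content that matched the wildcard when the schema was compiled continues to match it, whatever is loaded afterwards.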