Alphacodes for Sequence Types

By Michael Kay on October 15, 2019 at 02:06p.m.

For the next releases of Saxon and Saxon-JS we have devised a compact notation for representing SequenceType syntax in the exported SEF file. This note documents that notation.

The main aims in devising the syntax were compactness, together with fast generation and fast parsing. In addition it has the benefit that some operations are possible on the raw lexical form without doing a full parse.

The syntax actually handles ItemTypes as well as SequenceTypes; in addition, it can handle the two forms of NodeTest that are not item types, namely *:local and uri:*. It can therefore be used in the SEF wherever a SequenceType, ItemType, or NodeTest is required.

The first character of an alphacode is the occurrence indicator. This is one of: * (zero or more), + (one or more), ? (zero or one), 0 (exactly zero), 1 (exactly one). If the first character is not one of these, then "1" is assumed; but the occurrence indicator is generally omitted only when representing an item type as distinct from a sequence type.
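As a minimal sketch of this rule in Python (not Saxon's actual code; the function name split_occurrence is made up for illustration), the occurrence indicator can be separated from the rest of the alphacode by looking at the first character only:

```python
# The five occurrence indicators listed above.
OCCURRENCE_INDICATORS = {
    "*": "zero or more",
    "+": "one or more",
    "?": "zero or one",
    "0": "exactly zero",
    "1": "exactly one",
}


def split_occurrence(alphacode):
    """Return (occurrence indicator, remainder of the alphacode).

    If the first character is not an occurrence indicator, "1" is assumed.
    """
    if alphacode and alphacode[0] in OCCURRENCE_INDICATORS:
        return alphacode[0], alphacode[1:]
    return "1", alphacode


print(split_occurrence("*FM k[1AS] v[?AS]"))
print(split_occurrence("AS"))
```

This works without lookahead because no primary alphacode in the table below starts with one of the five indicator characters.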

The occurrence indicator is immediately followed by the "primary alphacode" for the item type. These are chosen so that alphacode(T) is a prefix of alphacode(U) if and only if T is a supertype of U. For example, the primary alphacode for xs:integer is "ADI", and the primary alphacode for xs:decimal is "AD", reflecting the fact that xs:integer is a subtype of xs:decimal. The primary alphacodes are as follows:

"" (zero-length string): item()

A: xs:anyAtomicType
AB: xs:boolean

AS: xs:string
ASN: xs:normalizedString
ASNT: xs:token
ASNTL: xs:language
ASNTM: xs:NMTOKEN
ASNTN: xs:Name
ASNTNC: xs:NCName
ASNTNCI: xs:ID
ASNTNCE: xs:ENTITY
ASNTNCR: xs:IDREF

AQ: xs:QName
AU: xs:anyURI
AA: xs:date
AM: xs:dateTime
AMP: xs:dateTimeStamp
AT: xs:time
AR: xs:duration
ARD: xs:dayTimeDuration
ARY: xs:yearMonthDuration
AG: xs:gYear
AH: xs:gYearMonth
AI: xs:gMonth
AJ: xs:gMonthDay
AK: xs:gDay

AD: xs:decimal
ADI: xs:integer
ADIN: xs:nonPositiveInteger
ADINN: xs:negativeInteger
ADIP: xs:nonNegativeInteger
ADIPP: xs:positiveInteger
ADIPL: xs:unsignedLong
ADIPLI: xs:unsignedInt
ADIPLIS: xs:unsignedShort
ADIPLISB: xs:unsignedByte
ADIL: xs:long
ADILI: xs:int
ADILIS: xs:short
ADILISB: xs:byte

AO: xs:double
AF: xs:float
A2: xs:base64Binary
AX: xs:hexBinary
AZ: xs:untypedAtomic

N: node()
NE: element(*)
NA: attribute(*)
NT: text()
NC: comment()
NP: processing-instruction()
ND: document-node()
NN: namespace-node()

F: function(*)
FM: map(*)
FA: array(*)

E: xs:error

X: external (wrapped) object
XJ: external Java object
XN: external .NET object
XS: external JavaScript object

Every item belongs to one or more of these types, and there is always a "most specific" type, which is the one that we choose.
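The prefix property means that a subtype test on primary alphacodes reduces to a string comparison on the raw lexical form. The following Python sketch illustrates the idea; the function names are hypothetical, and it deliberately looks only at the primary alphacode, ignoring the occurrence indicator and any supplementary codes, so it is not a complete subtype test:

```python
def primary_code(alphacode):
    """Extract the primary alphacode: drop the occurrence indicator (if any)
    and everything from the first space onwards (the supplementary codes)."""
    head = alphacode.split(" ", 1)[0]
    if head[:1] in ("*", "+", "?", "0", "1"):
        return head[1:]
    return head


def is_same_or_subtype(sub, sup):
    """True if the primary type of `sub` is the same as, or a subtype of,
    the primary type of `sup`, judged purely by the prefix property."""
    return primary_code(sub).startswith(primary_code(sup))


print(is_same_or_subtype("1ADI", "1AD"))        # xs:integer vs xs:decimal
print(is_same_or_subtype("1AD", "1ADI"))        # xs:decimal vs xs:integer
print(is_same_or_subtype("1NE nQ{}BOOK", "*"))  # element(BOOK) vs item()*
```

Because item() has the zero-length primary alphacode, every primary alphacode has it as a prefix, which matches item() being the supertype of everything.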

Following the occurrence indicator and primary alphacode are zero or more supplementary codes. Each is preceded by a single space, is identified by a single letter, and is followed by a parameter value. For example the sequence type "element(BOOK)" is coded as "1NE nQ{}BOOK" - here 1 is the occurrence indicator, NE indicates an element node, and nQ{}BOOK is the required element name. The identifying letter here is "n". The supplementary codes (which may appear in any order) are as follows:

n - Name, as a URI-qualified name. Used for node names when the primary alphacode is one of (NE, NA, NP). Also used for the XSD type name when the type is a user-defined atomic or union type: the primary alphacode then represents the lowest common supertype that is a built-in type. (Note: we assume that type names are globally unique. This cannot be guaranteed when deploying a SEF file: the schema at the receiving end might differ from that of the sender.) Also used for the class name in the case of external object types (in this case the namespace part will always be "Q{}"). Note that, strictly speaking, the forms *:name and prefix:* can appear in a NameTest, but never in a SequenceType. However, they can be represented in alphacodes using the syntax "n*:name" and "nQ{uri}*" respectively. The syntax "~localname" is used for a name in the XSD namespace.

c - Node content type (XSD type annotation), as a URI-qualified name optionally followed by "?" to indicate nillable. The syntax "~localname" is used for a name in the XSD namespace. Optionally present when the basic code is (NE, NA); omitted for NE when the content is xs:untyped, and for NA when the content is xs:untypedAtomic. Only relevant for schema-aware code.

k - Key type, present when the basic code is FM (i.e. for maps), omitted if the key type is xs:anyAtomicType. The value is the alphacode of the key type, enclosed in square brackets: it will always start with "1A".

v - Value type, present when the basic code is (FM, FA) (i.e. for maps and arrays), omitted if the value type is item()*. The value is the alphacode of the value type, enclosed in square brackets. For example the alphacode for array(xs:string+)* is "*FA v[+AS]".

r - Return type, always present for functions. The value is the alphacode of the return type, enclosed in square brackets.

a - Argument types, always present for functions. The value is an array of alphacodes, enclosed in square brackets and separated by commas. For example, the alphacode for the function fn:dateTime#2 (with signature ($arg1 as xs:date?, $arg2 as xs:time?) as xs:dateTime?) is "1F r[?AM] a[?AA,?AT]".

m - Member types of an anonymous union type. The value is an array of alphacodes for the member types (these will always be atomic types), enclosed in square brackets and comma-separated. The basic code in this case will be "A", indicating xs:anyAtomicType. This is not used for the built-in union type xs:numeric, nor for user-defined atomic types defined in a schema; it is used only for anonymous union types defined using the Saxon extension syntax "union(a, b, c)".

e - Element type of a document-node() type, present optionally when the basic code is ND. The value is an alphacode, which will always start with "1NE".

t - Components of a tuple type (Saxon extension). The value is an array of tokens, enclosed in square brackets, where each token comprises the name of the component (an NCName), a colon, and the alphacode of the component type.

i, u, d - Venn type. The item type is the intersection, union, or difference of two item types. The letter "i", "u", or "d" indicates intersection, union, or difference respectively, followed by a list of (currently always two) item types enclosed in square brackets and separated by a comma. The principal type will typically be "N" or "NE". Saxon uses Venn types internally to give a more precise inferred type for expressions; this information is probably largely unused at run-time, and can therefore be safely ignored when reading a SEF file.
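To illustrate the generating direction, here is a small Python sketch that assembles alphacodes for function, map, and array types from the alphacodes of their components, following the r, a, k, and v rules above. The helper names are invented for this note, and the sketch covers only these three cases:

```python
def function_alphacode(arg_codes, return_code, occurrence="1"):
    """Alphacode for a function type: r and a are always present."""
    return "%sF r[%s] a[%s]" % (occurrence, return_code, ",".join(arg_codes))


def map_alphacode(key_code="1A", value_code="*", occurrence="1"):
    """Alphacode for a map type. k is omitted when the key type is
    xs:anyAtomicType ("1A"); v is omitted when the value type is item()* ("*")."""
    parts = [occurrence + "FM"]
    if key_code != "1A":
        parts.append("k[%s]" % key_code)
    if value_code != "*":
        parts.append("v[%s]" % value_code)
    return " ".join(parts)


def array_alphacode(value_code="*", occurrence="1"):
    """Alphacode for an array type; v is omitted for item()* members."""
    parts = [occurrence + "FA"]
    if value_code != "*":
        parts.append("v[%s]" % value_code)
    return " ".join(parts)


print(function_alphacode(["?AA", "?AT"], "?AM"))  # fn:dateTime#2
print(array_alphacode("+AS", occurrence="*"))     # array(xs:string+)*
print(map_alphacode("1AS", "?AS", occurrence="*"))
```

Since the supplementary codes may appear in any order, the order chosen here (r before a, k before v) is just one valid choice.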

Named union types have a basic alphacode of "A", followed by the name of the union type in the form "A nQ{uri}local". The syntax "~localname" is used for a name in the XSD namespace, so the built-in union types xs:numeric and xs:error are represented as "A n~numeric" and "A n~error" respectively.

TODO: the documentation for union types is not aligned with the current implementation

Examples:

0 - empty-sequence()

1AS - xs:string

1N - node()

1 - item()

* - item()*

1NE nQ{}item - element(item)

1ND e[1NE nQ{}item] - document-node(element(item))

*FM k[1AS] v[?AS] - map(xs:string, xs:string?)*

1F a[?AS,*AO] r[1AB] - function(xs:string?, xs:double*) as xs:boolean
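The examples above can be taken apart with a short parser. The following Python sketch is not the Saxon implementation; the function names are hypothetical, and it assumes that square brackets are balanced and that a "{" always opens a Q{uri} namespace part that is closed by the next "}". It splits an alphacode into occurrence indicator, primary alphacode, and a dictionary of supplementary codes, leaving bracketed values unparsed so they can be processed recursively:

```python
def split_top_level(text, sep):
    """Split text on sep, ignoring separators nested inside [...] or {...}."""
    parts, current = [], []
    depth, in_braces = 0, False
    for ch in text:
        if in_braces:
            in_braces = ch != "}"   # a Q{uri} part ends at the next "}"
        elif ch == "{":
            in_braces = True
        elif ch == "[":
            depth += 1
        elif ch == "]":
            depth -= 1
        if ch == sep and depth == 0 and not in_braces:
            parts.append("".join(current))
            current = []
        else:
            current.append(ch)
    parts.append("".join(current))
    return parts


def parse_alphacode(code):
    """Return (occurrence, primary alphacode, {letter: parameter value})."""
    tokens = split_top_level(code, " ")
    head = tokens[0]
    if head[:1] in ("*", "+", "?", "0", "1"):
        occurrence, primary = head[0], head[1:]
    else:
        occurrence, primary = "1", head
    supplementary = {tok[0]: tok[1:] for tok in tokens[1:] if tok}
    return occurrence, primary, supplementary


def unbracket(value):
    """Turn a bracketed value such as "[?AS,*AO]" into a list of alphacodes."""
    return split_top_level(value[1:-1], ",")


occ, primary, supp = parse_alphacode("1F a[?AS,*AO] r[1AB]")
print(occ, primary, unbracket(supp["a"]), unbracket(supp["r"]))
print(parse_alphacode("1ND e[1NE nQ{}item]"))
```

Tracking bracket depth is what keeps the space inside "e[1NE nQ{}item]" from being mistaken for a separator between supplementary codes.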

Version: 2019-10-30