XML is at heart a set of lexical rules that provide a means to present data in a structured format. It is both the simplicity of the rules, and the richness of the structures that they enable, that have facilitated the great success of the XML paradigm.
A schema language is used as a way to tame the potential complexity-via-combinatorics that raw XML presents. A schema document provides rigid lexical constraints on what is permissible in an XML document. Typically, these constraints are so strict as to eliminate the extensibility that is built into XML's name - eXtensible Markup Language.
This last point is worth some comment, as the Skeleton Schema approach avoids this limitation. Once a document is submitted for validation against a schema, the validation does two things. The first is to confirm that all the structural rules defined by the schema are adhered to. The second is to enforce that no extra-schema content is present. It is this second feature that destroys the extensibility of XML when schemas are applied. The Skeleton Schema approach differs on this point. The structure that it defines is to be considered as being limited in scope to the instance document content that matches the schema artifacts. In other words, when an element or attribute is found in an instance document that matches a QName from the schema in the correct location, then the data contained in the instance document must conform to the schema rules. However, if any extra elements or attributes appear in the instance document, then they are treated as not existing at all with respect to schema validation. So the Skeleton Schema formalism enforces what is defined, and permits all that is not defined. Of course, there are situations when it is desired to prevent extra content from being included in an instance document. The Skeleton Schema does provide a means to lock up the content model in such circumstances. However, it is important to understand that this is not the default behavior.
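The open content model described above can be illustrated with a minimal Python sketch. It is not a Skeleton Schema validator: it assumes that every element named in the schema is required at least once, it ignores attributes and data types, and the function name find_missing and the Order documents are hypothetical.

```python
import xml.etree.ElementTree as ET

def find_missing(schema_elem, instance_elem, path=""):
    """Return paths of schema-defined elements that are absent from the instance.

    Assumption for this sketch: every schema child is required at least once.
    Instance content that the schema does not mention is never inspected.
    """
    here = f"{path}/{schema_elem.tag}"
    missing = []
    for schema_child in schema_elem:
        matches = instance_elem.findall(schema_child.tag)
        if not matches:
            missing.append(f"{here}/{schema_child.tag}")
        for match in matches:
            missing.extend(find_missing(schema_child, match, here))
    return missing

schema = ET.fromstring("<Order><Id/><Customer><Name/></Customer></Order>")
ok = ET.fromstring(
    "<Order><Id>7</Id><Note>rush</Note>"
    "<Customer><Name>Ann</Name><Vip/></Customer></Order>"
)
bad = ET.fromstring("<Order><Id>7</Id><Customer/></Order>")

print(find_missing(schema, ok))   # the extra Note and Vip elements are ignored
print(find_missing(schema, bad))  # the missing Name element is reported
```

The point of the sketch is only that undefined content never causes a failure, while defined content is still enforced.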
The two prevalent schema languages are DTD (Document Type Definition) and XSDL (XML Schema Definition Language). Each has its advantages and disadvantages. Perhaps the largest disadvantage that they share is that it is difficult to discern the structure of an instance document that any given schema describes. For example, a DTD schema isn't even written as an XML document. An XSDL schema is written in XML, but its structure differs drastically from that of the associated instance documents.
When working with these schemas, it is a common practice to attempt to manually generate an instance document to get a better sense of the structure. It is also common, when a schema designer distributes the schema to people charged with the generation of instance documents, to include a number of example instance documents to assist in communicating the intent of the schema. Since the schema itself should be, and technically is, sufficient to describe the structure of instance documents, clearly these common practices indicate that these particular schema languages are inadequate - at least in communicating structure to a human audience.
So the goal of the Skeleton Schema language is to be people-friendly. The design goal is to provide a schema that looks very close to an instance document. All the schema artifacts are collected into a small set of namespaces so that they stand out. The number of artifacts is purposely small to limit the complexity of the schema language. While the schema language must be complete so that it can, at least in principle, be used to validate instance documents, the syntax has several shortcuts that are intended for ease of comprehension for human readers. The human reader is the primary target of the Skeleton Schema language. The chief design goal is to permit a human reader, unfamiliar with the schema language, to understand the meaning of a schema and discern the structure of instance documents at first reading.
There are typically two main types of xml documents. The first is a data document and is intended for long term storage of data. Traditional schema languages are quite good at describing data documents. In the Skeleton Schema language these documents are described by a schema carrying the /DataDocument/ namespace.
The other type of xml document is used for making requests to, and sending responses from, application service providers. These are, here, referred to as protocol documents. These instance documents are usually composed using a common set of xml artifacts, but the multiplicities for these artifacts vary greatly among the different request and response types. This variability in the multiplicities causes traditional schema languages to fail. There are two approaches that can be taken with these schema languages. The first is to make most artifacts optional. Clearly this is a severe problem, and forces the application developer to handle the validation of multiplicity constraints himself. The second approach is to create a separate schema for each request and response type. This introduces the need for the application developer to introduce his own methods for determining which schema to apply during validation. The Skeleton Schema language handles protocol xml documents within a single schema, and provides a way to declare the rule for matching instance documents with the appropriate schema artifacts. In the Skeleton Schema language these documents are described by a schema carrying the /Protocol/ namespace.
Different schema languages have different strengths and weaknesses. It is not a bad idea to create schemas for the same xml domain using different schema languages. There is no claim that the Skeleton Schema language is a replacement for the alternatives. For the interested reader, there is a discussion of the relative strengths and weaknesses of the different schema languages here.
Before giving the technical details on the Skeleton Schema language, I'd like to make a brief comment on attributes and namespaces, as this is often a confusing issue and traditional schemas only cloud the issue further. Within an XML instance document, an element is associated with a namespace via the explicit use of a namespace prefix, via an xmlns declaration on the element that doesn't declare a prefix, or by inheriting such a namespace declaration from its parent. Attributes are completely different. The ONLY way an attribute gets associated with a namespace is via an explicit namespace prefix. If there is no namespace prefix then the attribute is considered to not be in any namespace. Since a Skeleton Schema is designed to look like an instance document (plus adornment with attributes from one or more of the Skeleton Schema namespaces), the same rule will hold. Any attributes that are declared in a Skeleton Schema will not be associated with any namespace unless they are given an explicit namespace prefix.
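The difference between elements and attributes can be checked with any namespace-aware parser. The following is a small sketch using Python's standard xml.etree.ElementTree module; the document and the urn:example URIs are made up for illustration.

```python
import xml.etree.ElementTree as ET

doc = ET.fromstring(
    '<Root xmlns="urn:example:default" xmlns:x="urn:example:x">'
    '<Child plain="1" x:marked="2"/>'
    '</Root>'
)
child = doc[0]

# The element inherits the default namespace declared on its parent.
print(child.tag)

# The unprefixed attribute stays in no namespace; only the prefixed one
# is reported with a namespace URI.
print(sorted(child.attrib))
```

ElementTree reports namespaced names in the form {uri}local, so the attribute named plain appears without any braces while marked carries the urn:example:x URI.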
A /DataDocument/ schema describes the structure of an instance document that is used to store data long-term.
The root element of the schema is named d:DataDocument and lives in the /DataDocument/ namespace. The first child of the document element is the document element of the instance documents. There may be a second child, an s:DataTypeDefinitions element, as described below.
A /Protocol/ schema describes a number of packets that are used in a communication protocol.
The root element of the schema is named p:Protocol and lives in the /Protocol/ namespace. Each protocol document type is declared as a child of the p:Protocol element. If protocol packets are to be validated against the skeleton schema, then each child of the p:Protocol element must declare the @p:selector. This is used by a validator to locate the correct protocol packet type from the schema to use for validation. On the other hand, if the skeleton schema is not going to be used for validation, then the @p:selector is not required.
When a protocol packet is validated against the schema, the validator will step through each child of the schema's p:Protocol element. If a schema-child element has the same QName as the root element of the protocol packet, then the schema element's @p:selector is examined. The value is an xpath expression which is evaluated against the protocol packet, with the root element as the context node. If the expression evaluates as false( ) (or an empty node set) then the validator continues searching for a matching schema element. If the xpath expression evaluates as true( ) (or a non-empty node set) then the schema element and its content are taken as the relevant schema to use when validating the protocol packet. From then on the validation proceeds similarly to the validation with a /DataDocument/ schema.
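The selection procedure just described might be sketched in Python as follows. Several things here are assumptions rather than part of the language definition: the namespace URI bound to p:, the Request packets, and the function name select_packet_schema are invented for illustration, and the selectors are limited to the small XPath subset that xml.etree.ElementTree supports, so only the "non-empty node set" case is modeled.

```python
import xml.etree.ElementTree as ET

# Assumed URI for the /Protocol/ namespace; the text does not spell it out.
P_NS = "http://SkeletonSchema.info/Protocol/"

SCHEMA = f"""
<p:Protocol xmlns:p="{P_NS}">
  <Request p:selector="Type[.='ping']" p:name="Ping request"><Type/></Request>
  <Request p:selector="Type[.='order']" p:name="Order request"><Type/><Item/></Request>
</p:Protocol>
""".strip()

def select_packet_schema(schema_root, packet_root):
    """Return the first schema child whose QName and selector match the packet."""
    for candidate in schema_root:
        if candidate.tag != packet_root.tag:
            continue  # different QName, keep searching
        selector = candidate.get(f"{{{P_NS}}}selector")
        if selector is None:
            continue  # not usable for validation without a selector
        # The packet root is the context node; a non-empty result is a match.
        if packet_root.findall(selector):
            return candidate
    return None

schema_root = ET.fromstring(SCHEMA)
packet = ET.fromstring("<Request><Type>order</Type><Item>widget</Item></Request>")
chosen = select_packet_schema(schema_root, packet)
print(chosen.get(f"{{{P_NS}}}name"))
```

A real validator would need a full XPath engine so that boolean-valued selector expressions can be evaluated as well.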
In addition to the attribute @p:selector, each packet-root element can also have the optional attribute @p:name. This can be used to give a user-friendly name to each protocol packet. It is anticipated that any tools that are implemented to work with skeleton schemas will use the @p:name while interacting with users. For example, a validator may report the @p:name as part of a validation error message, which will permit human investigators to more easily locate the error source.
Most of the schema-like information is given via entities in the /Structure/ namespace. Consequently most of this discussion is devoted to this namespace. In what follows, the prefix s: will be used to denote this namespace.
The /Catalog/ namespace is used in the definition of a separate (optional) Catalog document. The purpose of a Catalog document is to provide a way for a Skeleton Schema validator to locate the schemas to use when validating an xml instance document. There are two reasons why this may be necessary.
The first is when a schema references namespaces defined elsewhere. Since a Skeleton Schema models the structure of the instance document, the external schema isn't required for validating structure. Data types are another matter. It shouldn't be necessary to redeclare all external data types within a schema in order to use them. The mechanism of a Catalog file permits the mapping of an external namespace to a schema location. A Skeleton Schema validator can then locate the external schema and extract the relevant data type declarations. In the scenario where externally declared data types appear in a Skeleton Schema document and the validator cannot locate the declaring schema, the undetermined types will be treated as s:string for validation purposes. The exceptions are the types in the XSDL namespace http://www.w3.org/2001/XMLSchema. These types are well-known and Skeleton Schema validators are required to recognize them intrinsically.
The second use for a Catalog file is to provide the validator with a mechanism to handle version changes in an xml instance document. It is quite common for the definition of xml documents within an application to evolve over time. As they evolve their structures are enriched and/or modified. These changes are reflected in changes to their descriptive schemas. Such changes are typically flagged with a version identifier embedded within the instance document. This should permit a validator to select the matching schema version.
The structure of a Skeleton Schema Catalog file will be defined at the end of this document.
There are at least three version numbers that should be relevant when working with Skeleton Schemas. The first is the version of the Skeleton Schema language itself. This will allow the language to evolve without having to maintain absolute backward compatibility. The second is the version of a given schema instance. XML schemas tend to evolve over time so it is useful to capture versioning. One approach for this is to include the version in the namespace, but that can become unwieldy over time. Also, that would mean a different mechanism than that for other version information. The third (and possibly other) version numbers arise from the application domain being modeled. The Skeleton Schema language recommends handling all of these version numbers in a uniform fashion, with attributes of a particular data type. For the two version numbers associated with schemas, we have two defined attributes on the document element of the schema:
There is yet another version that can come into play. Consider the scenario where a standards body develops a schema. Many entities will use this schema as the cornerstone of their communication needs. Ideally, the standard schema would suffice for all needs, but in practice this doesn't always work out perfectly. For a given pair of communication partners, there may arise a need to deviate from the schema or add to it. A typical example would be the need to require certain data items whereas the standard schema has set them as optional. In the case of adding to the schema (i.e., data items that weren't anticipated by the authors of the standard schema) it becomes necessary to add new elements and/or attributes to the standard schema. In order to satisfy these needs, the skeleton schema language provides for two more attributes that can be attached to the schema's root element. They are:
It should be noted that these branding attributes are for documentation only. They play no role in the workings of a skeleton schema validator. There is also no way to explicitly declare where and how a branded schema deviates from the standard. If an entity is going to work with multiple brandings of the same standard schema, then there will need to be an indicator within the instance documents themselves to permit a validator to select the correct branded schema via a Catalog file.
The entities described below give the meta-information that defines the structure of the instance documents.
For elements, these artifacts are attributes declared on the element. If the element is to take a constant value the corresponding schema element will take that value. If the element is to take a value from an enumeration, the corresponding schema element will take the full enumeration, as described under the Data Types section below. In the case of s:type, a notational shortcut is to declare the type as the element's value. This will work as long as the value isn't a constant or enumeration. However, since in this case the s:type keyword doesn't appear, the declared type must include the namespace prefix.
To clarify the last point, you can define an element's value to be of type s:int in any of the following ways:
<ElementName s:type="int"/>
or
<ElementName s:type="s:int"/>
or
<ElementName>s:int</ElementName>
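A tool reading a schema would have to normalize these three spellings to the same type. The following Python sketch shows one way this might be done. The namespace URI bound to s: and the function name declared_type are assumptions, and the sketch does not try to tell a type name in the element value apart from a constant or an enumeration.

```python
import xml.etree.ElementTree as ET

# Assumed URI for the /Structure/ namespace (the s: prefix).
S_NS = "http://SkeletonSchema.info/Structure/"

def declared_type(schema_elem):
    """Return the prefixed type name declared on a schema element."""
    attr = schema_elem.get(f"{{{S_NS}}}type")
    if attr is not None:
        # An unprefixed value in s:type is read as a type from the s: namespace.
        return attr if ":" in attr else f"s:{attr}"
    # Shortcut form: the type is written as the element's value,
    # so it must already carry its namespace prefix.
    return (schema_elem.text or "").strip()

forms = [
    '<ElementName xmlns:s="%s" s:type="int"/>',
    '<ElementName xmlns:s="%s" s:type="s:int"/>',
    '<ElementName xmlns:s="%s">s:int</ElementName>',
]
results = [declared_type(ET.fromstring(form % S_NS)) for form in forms]
print(results)
```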
For attributes, the full structural information must be included within the value of the attribute. In this case the full format is:
@attribute_name="s:mult='mult-indicator';s:type='type-indicator';s:cond='cond-indicator';s:vals='valuespace-indicator'"
However, this is usually overkill. It is only necessary to include the artifacts that do not take a default indicator. In these cases a shorthand syntax can be used.
The four shorthand scenarios are:
@attribute_name="" or @attribute_name="?" to express s:mult, which can only be 1 or ? for an attribute
@attribute_name="s:int" to express s:type
@attribute_name="the constant value" to express that s:vals is a constant value of s:type='string'
@attribute_name="A|B|C|s:space|s:empty" to express that s:vals is an enumeration (here consisting of 5 possible values)
When the shorthand is used to indicate s:vals, the s:type is usually implicit in the format. However, there is always ambiguity, since the type s:string can take any value and other types may have overlapping value spaces. To remove the ambiguity, the type can be stated explicitly in braces ahead of the value space:
@attribute_name="{type-indicator}valuespace-indicator"
For example, to declare an enumerated set of prices one could have
@price="{s:money}1.99|2.99|3.99"
One can also use a similar shorthand to express s:mult, so if the previous attribute is optional, you could have
@price="{?}{s:money}1.99|2.99|3.99"
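To make the shorthand forms concrete, here is a rough Python sketch of how a tool might expand them. It is an illustration under stated assumptions, not the language's reference parser: the function name parse_attr_shorthand is hypothetical, the rule used to tell a type name from a constant (a prefix:name pattern that is not one of the defined constants) is a guess, and the full s:mult/s:type/s:cond/s:vals format is not handled.

```python
import re

_BRACE = re.compile(r"\{([^{}]*)\}")
# Heuristic (an assumption): something shaped like prefix:name is a type name.
_TYPE_NAME = re.compile(r"[A-Za-z_][\w.-]*:[A-Za-z_][\w.-]*(\(.*\))?")
_CONSTANTS = {"s:space", "s:empty", "s:null", "s:posinf", "s:neginf"}

def parse_attr_shorthand(text):
    """Expand an attribute shorthand into mult, type and vals."""
    mult, type_, vals = "1", None, None
    rest = text
    # Leading {...} groups carry an explicit multiplicity or type.
    while True:
        group = _BRACE.match(rest)
        if group is None:
            break
        inner = group.group(1)
        if inner in ("1", "?"):
            mult = inner
        else:
            type_ = inner
        rest = rest[group.end():]
    if rest == "?":
        mult = "?"                      # optional, nothing else declared
    elif rest == "":
        pass                            # required, nothing else declared
    elif "|" in rest:
        vals = rest.split("|")          # enumeration
    elif _TYPE_NAME.fullmatch(rest) and rest not in _CONSTANTS:
        type_ = rest                    # bare type name
    else:
        vals = [rest]                   # constant value
        if type_ is None:
            type_ = "s:string"
    return {"mult": mult, "type": type_, "vals": vals}

print(parse_attr_shorthand("{?}{s:money}1.99|2.99|3.99"))
print(parse_attr_shorthand("s:int"))
print(parse_attr_shorthand("the constant value"))
```

When no type is stated for an enumeration, the sketch leaves the type as None instead of guessing it from the values.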
An s:space is useful as it is often difficult to discern the presence of a typographic space. Also, xml parsers may eliminate spaces as irrelevant white space.
An s:empty is useful to indicate that an element or attribute can take an empty value.
Both s:posinf and s:neginf are useful in defining unbounded ranges.
An s:null is useful in declaring a NULL value, which has special meaning in certain application domains. One example is in serializing for transport the contents of a table in a relational database. Another example is when the xml models containment relationships between instances of objects in an OOP system. Of course, in an instance document there will already be a mechanism for expressing NULL values that is traditional for a given application domain. The s:null is a means to uniformly handle NULL values in schemas across application domains. It is similar to the xsi:nil="true" from the xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance", only much easier to work with.
A final comment on the defined constants is to stress that they are used within schema documents, not in instance documents. In fact, all artifacts defined within the Skeleton Schema namespace are to be understood as appearing in schema documents only. This is what distinguishes s:null from xsi:nil="true". The former is for the skeleton schema and would map to a domain-specific representation of NULL in instance documents. The xsi:nil attribute is intended to appear in instance documents, providing a universal representation of NULL across all domains. The Skeleton Schema language recommends the use of s:null in schema documents and xsi:nil="true" in all instance documents.
The following pre-defined data types can be used in Skeleton Schemas:
A constant value can be declared by using the value. An enumeration can be declared by listing the values separated by '|', for example A|B|C.
It is possible to use the not( ) operator on most types. What that means is that when validating the instance document, the validator will determine whether a datum conforms to the type as specified in the schema. The answer to this determination will be a logical TRUE or FALSE. If the not( ) operator is applied to the type in the schema, then the validator checks against the contained type and it will pass the test when the test against the contained type is FALSE (because not( ) converts it to TRUE). Let's examine a particularly useful example. Consider the type expression not(s:empty). This can be used to declare that a datum can take any value, but it cannot be empty (this example is equivalent to using the data type s:vstring). An example of when the not( ) operator cannot be used would be not(s:string). Since in xml all data is lexically a string, it is impossible to satisfy the condition not(s:string).
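If each type is modeled as a function that answers TRUE or FALSE for a datum, the not( ) operator is simply a wrapper that inverts the answer. A minimal Python sketch, with hypothetical function names:

```python
def s_empty(value):
    """Type check standing in for s:empty: only the empty string conforms."""
    return value == ""

def not_(type_check):
    """Wrap a type check so that it passes exactly when the original fails."""
    def check(value):
        return not type_check(value)
    return check

non_empty = not_(s_empty)  # plays the role of not(s:empty)
print(non_empty("abc"), non_empty(""))
```

The same wrapper applied to a check that accepts every string would reject every datum, which is why not(s:string) can never be satisfied.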
An interesting question is how would a validator deal with a declared data type when the xml artifact is not required. For example, consider an attribute declared att_name="{?}s:int". How would validation treat an instance document that has att_name=""? There are two possible answers. The first is that since the attribute is optional, it should be acceptable to have an empty value. The second answer would be to note that if a blank value were acceptable the schema should be att_name="{?}s:int|s:empty". To resolve the issue, one must remember that the design goal is to be user-friendly. It is a common practice for developers to create xml instance documents (especially protocol packets) by starting with a template string consisting of all the elements and attributes with empty values, then populating those values for which there are data. This practice forces the viewpoint that an empty value must be acceptable for any optional artifacts. Of course, for required artifacts an empty value will only be acceptable if the schema specifically allows for it.
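The rule for attributes can be stated in a few lines of Python. This is only a sketch: the function names are hypothetical, and the lexical form assumed for an integer (an optional sign followed by digits) is a simplification chosen for the example.

```python
import re

def is_int(value):
    # Assumed lexical form for an integer in this sketch: optional sign, digits.
    return re.fullmatch(r"[+-]?[0-9]+", value) is not None

def attribute_value_ok(value, optional, type_check):
    """Accept an empty value for an optional attribute, else apply the type."""
    if value == "" and optional:
        return True
    return type_check(value)

print(attribute_value_ok("", True, is_int))    # optional and empty
print(attribute_value_ok("", False, is_int))   # required and empty
print(attribute_value_ok("42", False, is_int)) # required with a valid value
```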
For attributes the above discussion is sufficient. An optional element can contain a value, attributes with their own values, and subelements. The rule for an optional element is that it is okay to pass empty data if and only if all the data is empty, except for constants, which must be populated. Even when an optional element contains a required attribute and/or subelement, that data must be empty for the rule that allows empty strings for optional values to apply. Again, this is motivated by the common usage of beginning with a template xml document that has all non-constant data as empty.
Other types can be used by including a reference to their namespace – for example, the types familiar from XSDL can be used. If using types that aren't native to XSDL or the /Structure/ namespace, one can reference the defining schema for those types using a Catalog file, as described below.
Alternatively, new types can also be defined by including, as the last children of the document element, the schema element s:DataTypeDefinitions. This element must declare the @s:types_namespace to provide the namespace within which the data types are defined - one s:DataTypeDefinitions is required for each namespace. Each subelement of s:DataTypeDefinitions is an xsd:simpleType and uses the syntax of XSDL. Of course, the xsd:simpleType must have a non-empty @name so that it can be referenced from within the body of the schema.
These defined types should declare unique value spaces. It is considered an abuse of the Skeleton Schema philosophy to use a defined type to provide an alias for an existing type. In particular, one should not attempt to provide semantically meaningful types on the back of simple value-space types. Semantics should be confined to the application realm. While it is true that semantically meaningful names are typically used in designing xml, this is really somewhat of an illusion. The true semantic meaning only holds when the content of an xml document is consumed. Also, these names appear in the instance documents, whereas the names given to schema types do not – they only appear within schema documents. One should maintain clarity on this point and confine semantic names to elements and attributes and leave data types at the lower level of lexical value spaces, which means the permissible combinations of typographic characters.
As an example, consider attempting to define a type t:zipcode to be 5 digits. The schema would include
<Zipcode s:type="t:zipcode"/>.
This is incorrect because it attempts to place semantic information in the type when it belongs in the element name.
The correct schema would be:
<Zipcode s:type="int(5)"/>.
On the other hand, an application may need a full Zip+4 value. In that case it would be permissible to define a new lexical type t:zip4 which should look like s:int(5)–s:int(4). While this can be handled via s:regex, it is considered good form to define a distinct lexical type; s:regex is a catch-all type, and isn't very user friendly.
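For illustration, the value space of such a Zip+4 type can be checked with a regular expression. The Python sketch below assumes that the separator is an ASCII hyphen and that each part is a fixed-width run of decimal digits; the function name is_zip4 is hypothetical.

```python
import re

# Five digits, a hyphen, four digits (the separator is an assumption).
ZIP4 = re.compile(r"[0-9]{5}-[0-9]{4}")

def is_zip4(value):
    """Check that a string lies in the assumed Zip+4 lexical value space."""
    return ZIP4.fullmatch(value) is not None

print(is_zip4("12345-6789"), is_zip4("12345"), is_zip4("1234-56789"))
```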
I want to present a real world example. I'm going to use an existing DTD schema and translate it to the Skeleton Schema language. The DTD in question is from the Mismo organization that is charged with defining standards for communicating between various entities involved in the mortgage industry. I've taken their merge-only credit report request schema and placed it here. Notice that the DTD defines a protocol, and as such nearly all of the defined entities are optional. However, in an actual Mismo request packet of a given request type, most entities are required. I've placed the corresponding Skeleton schema here. As you can see, the multiplicities are specified accurately because a /Protocol/ schema is possible. The better support of data types is obvious, but that is only because the Mismo schema is written in DTD. Mismo has recently moved to an XSDL schema, so the data types are handled better. The lack of support for protocols is still an obvious limitation in the new Mismo schemas, which are still forced to define nearly all entities as optional.
One last point on the Mismo skeleton schema is to notice the attributes @s:brand and @s:brand_version on the root element. Since Mismo is a standards body, they are unable to anticipate every need of all the entities that have adopted their standard. As such, the move from the DTD to the skeleton schema had somewhat arbitrary choices made – though typical of actual practice. There is therefore a need to 'brand' the schema as deviating from the standard according to the particular needs of the communicating partners. It is also possible that these needs will change in time while the standard schema remains fixed. Pretending that the @s:version was set by the Mismo organization, the evolving variations from the standard are marked with the @s:brand_version.
A Skeleton Schema Catalog file has the root element c:SchemaCatalog in the namespace xmlns:c="http://SkeletonSchema.info/Catalog/". The structure is defined by this Skeleton Schema.
Each schema in the catalog is defined by an @instance_namespace and an optional set of Selector elements.
When a validator is using the catalog to locate a data type definition, the namespace is that to which the data type belongs. In this case there is no need for Selector elements. If no element has a matching namespace then the type is treated as equivalent to s:string. If multiple elements have a matching namespace then each schema is searched, in order, for a definition of the data type in question. The first definition found is the one that is used.
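The lookup order for data types might be sketched as follows. The catalog is modeled here as a plain list of (namespace, location) pairs, and load_types is a hypothetical callable that returns the type definitions found at a location; neither reflects the actual Catalog file format.

```python
def resolve_type(type_namespace, type_name, catalog, load_types):
    """Search matching catalog entries in order; fall back to s:string."""
    for entry_namespace, location in catalog:
        if entry_namespace != type_namespace:
            continue
        definitions = load_types(location)
        if type_name in definitions:
            return definitions[type_name]  # first definition found wins
    return "s:string"

# Stand-in for reading schema documents from their locations.
fake_store = {
    "types-a.xml": {"zip4": "definition from types-a"},
    "types-b.xml": {"zip4": "definition from types-b", "sku": "definition from types-b"},
}
catalog = [("urn:example:t", "types-a.xml"), ("urn:example:t", "types-b.xml")]

print(resolve_type("urn:example:t", "zip4", catalog, fake_store.get))
print(resolve_type("urn:example:t", "sku", catalog, fake_store.get))
print(resolve_type("urn:example:other", "zip4", catalog, fake_store.get))
```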
When a validator is using the catalog to locate a schema, the namespace refers to the namespace of the document element of the instance document. If there is more than one Schema element with the same namespace, then the Selector elements are used to identify the correct schema. The first set of Selector elements that all evaluate to true( ) (or non-empty node set) defines the chosen schema. If no set of Selectors evaluates to true( ), then the first Schema with the correct namespace is the chosen schema. If none of the Schema elements has a matching namespace, then the validator must return an error as this is most likely the result of an error on the part of either the schema author or the author of the instance document.
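The selection rules above can be sketched in Python. The SchemaEntry class, the file names, and the urn:example namespace are hypothetical. Selectors are limited to the XPath subset of xml.etree.ElementTree, so only node-set results are modeled, and an entry without any Selector is assumed to be reachable only through the fallback rule.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass, field

@dataclass
class SchemaEntry:
    instance_namespace: str
    location: str
    selectors: list = field(default_factory=list)

def namespace_of(elem):
    """Extract the namespace URI from an ElementTree tag of the form {uri}local."""
    return elem.tag[1:].split("}", 1)[0] if elem.tag.startswith("{") else ""

def choose_schema(catalog, instance_root):
    ns = namespace_of(instance_root)
    candidates = [entry for entry in catalog if entry.instance_namespace == ns]
    if not candidates:
        raise LookupError(f"no schema registered for namespace {ns!r}")
    for entry in candidates:
        # Every selector must yield a non-empty node set.
        if entry.selectors and all(instance_root.findall(s) for s in entry.selectors):
            return entry
    return candidates[0]  # fallback: first entry with the right namespace

catalog = [
    SchemaEntry("urn:example:orders", "orders-v1.xml", [".//*[@version='1']"]),
    SchemaEntry("urn:example:orders", "orders-v2.xml", [".//*[@version='2']"]),
]
doc = ET.fromstring('<Order xmlns="urn:example:orders"><Meta version="2"/></Order>')
print(choose_schema(catalog, doc).location)
```

This also shows how a version identifier embedded in the instance document can drive the choice between schema versions.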
In all scenarios the Location element gives the location where the schema document can be found.
Frequently you may use elements and attributes from well-known namespaces within your own xml grammars. When doing so, you may find that the tool you use to write the schema is aware of these well-known namespaces. When that is the case, you'll find that your skeleton schema gets flagged as erroneous for these artifacts. For example, consider the attribute xml:id. The value of this must match the DTD type ID. If you were to include it as an optional attribute in a skeleton schema you might enter xml:id="?". This is an error since any xml parser that is aware of xml:id knows that "?" is not a valid value because it is not an NCName. There are many well-known namespaces and the number is growing all the time. As xml tools get updated they will become aware of this growing list of namespaces, so the Skeleton Schema language is faced with a challenge.
The Skeleton Schema Language offers a pseudo-namespace, http://SkeletonSchema.info/NamespaceAlias/. This is used to create a mapping from well-known namespaces into the skeleton schema arena, for which xml tools are not prepared to enforce any rules. Let's look at a simple example:
<d:DataDocument xmlns:d="http://SkeletonSchema.info/DataDocument/" xmlns:s="http://SkeletonSchema.info/Structure/" xmlns:_xml="http://SkeletonSchema.info/NamespaceAlias/xml" xmlns:_xlink="http://SkeletonSchema.info/NamespaceAlias/xlink">
<ElementExample xmlns:xlink="http://www.w3.org/1999/xlink" _xml:id="?" _xlink:type="simple" _xlink:href="?"/>
</d:DataDocument>
and an associated instance document:
<ElementExample xmlns:xlink="http://www.w3.org/1999/xlink" xml:id="id_001132" xlink:type="simple" xlink:href="http://mydomain.com/image.gif">
This is an example document.
</ElementExample>
The schema uses the attribute named 'id' in the namespace
http://SkeletonSchema.info/NamespaceAlias/xml. This is perfectly legal xml and
any tools that parse the XML will treat it without imposing any interpretation.
A Skeleton Schema validator will recognize the pseudo namespace, extract the
terminating 'xml' and then apply the rules to _xml:id as if it were xml:id.
In this case the namespace prefix 'xml' is well-known and so no actual
namespace mapping is required. In general this isn't the case so the
appropriate namespace declaration will be required as well. Notice that the
example contains a similar usage for xlink, and for that the namespace is
declared correctly in the schema even though there are no xml artifacts from that namespace
in the schema. In the instance document the correct prefixes are used.
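The prefix extraction that a validator performs on aliased names might look like the following Python sketch. The function name unalias is hypothetical, and the sketch stops at recovering the prefix: checking that a matching true namespace declaration is in scope is left out, because xml.etree.ElementTree does not keep namespace declarations after parsing.

```python
import xml.etree.ElementTree as ET

ALIAS_BASE = "http://SkeletonSchema.info/NamespaceAlias/"

def unalias(clark_name):
    """Map '{alias-namespace}local' to (prefix, local).

    Names outside the alias pseudo-namespace are returned unchanged
    with a prefix of None.
    """
    if not clark_name.startswith("{"):
        return (None, clark_name)
    uri, local = clark_name[1:].split("}", 1)
    if uri.startswith(ALIAS_BASE) and len(uri) > len(ALIAS_BASE):
        return (uri[len(ALIAS_BASE):], local)  # terminating characters = prefix
    return (None, clark_name)

schema_elem = ET.fromstring(
    '<ElementExample'
    ' xmlns:_xml="http://SkeletonSchema.info/NamespaceAlias/xml"'
    ' xmlns:_xlink="http://SkeletonSchema.info/NamespaceAlias/xlink"'
    ' _xml:id="?" _xlink:type="simple" _xlink:href="?"/>'
)
resolved = sorted(unalias(name) for name in schema_elem.attrib)
print(resolved)
```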
Just as normal namespace declarations have the concept of scope attached to them, namespace aliases also have scope attached to them. This must be true as they appear in a Skeleton Schema document and in that context they are true namespace declarations. Accordingly, these can be placed on any elements within the schema and the scoping rules are applied. A parser will just extract the terminating characters and treat that as a namespace prefix. When a namespace alias is referenced via a namespace prefix, a matching true namespace prefix must be in scope or the parser will report an error. It is probably a good idea, though, to limit the usage of namespace alias declarations to the root element of the schema. This is desirable because a namespace alias declaration takes up a lot of space and would present a distraction if embedded within an element that is part of the defined grammar. It is a distraction there because a namespace alias declaration should never appear in an instance document. Since well-known namespaces typically have customary namespace prefixes associated with them, a namespace alias should usually require a single alias prefix and that permits a single namespace alias declaration on the schema's root element.