Validation Tips and Techniques

Validation is a procedure that ensures an XML document conforms to the rules governing its logical structure as specified in a language schema such as DTD (Document Type Definition). An XML document might be well-formed—that is, it obeys the syntactical rules of XML—and at the same time be invalid. For example, an element might include a child element when it is supposed to have only textual content, or a required attribute of an element might be missing.

To perform validation it helps to construct a tree of an XML document’s schema that is parallel to a tree structure representing the document’s actual content (see “Constructing XML Tree Structures”). The schema tree presents a simple abstract view of how the document should be structured. Instead of nodes of objects representing the actual elements and text of the document, the schema tree contains nodes that express the rules by which the parts of the document can be combined. Validation tests the actual elements, attributes, and other parts of the document against the rules of the schema to see if the document conforms. If your application finds any violation of conformance, it can notify the user and perhaps require the user to fix the error. You can validate an XML document when it is first read and processed and later when users attempt to make any changes to it.

Because the programmatic interface of NSXMLParser is designed to report only XML constructs and DTD declarations, this article focuses on DTD as the language schema. However, if you use an XML-based language schema, such as RELAX NG, NSXMLParser can process the schema just as it would any other XML file, reporting what it finds to its delegate. You can use the data you thereby acquire for validation.

The sections on constructing rules focus primarily on element and attribute declarations because these are by far the most common and most important type of declaration. “Handling Other Declarations” briefly discusses what to do with other kinds of declarations, such as those for entities and notations.

Using NSXMLParser to Handle DTD Declarations

The NSXMLParser class reports to its delegate DTD declarations it encounters in a document (assuming the delegate implements the necessary methods). If the language schema you use is DTD, NSXMLParser helps you acquire the data you need either for validation or for other purposes, such as enforcing correctness when dynamically constructing objects (for example, a menu template).

The DTD Delegation Methods

The NSXMLParser class defines a half dozen delegation methods that the parser invokes when it encounters a DTD declaration in an internal or external source. These methods are of the form:

parser:foundTypeDeclarationWithName:...

The third parameter and any subsequent parameters depend on the type of declaration. The following list briefly describes the NSXMLParser delegation methods related to DTD declarations.

- parser:foundElementDeclarationWithName:model:

Example: <!ELEMENT dictionary (documentation?, suite+)>

- parser:foundAttributeDeclarationWithName:forElement:type:defaultValue:

Example: <!ATTLIST dictionary title CDATA #IMPLIED >

- parser:foundInternalEntityDeclarationWithName:value:

Example: <!ENTITY % OSType "CDATA">

- parser:foundExternalEntityDeclarationWithName:publicID:systemID:

Example: <!ENTITY name SYSTEM "name.xml">

- parser:foundNotationDeclarationWithName:publicID:systemID:

Example: <!NOTATION img PUBLIC "urn:mime:image/jpeg">

- parser:foundUnparsedEntityDeclarationWithName:publicID:systemID:notationName:

Example: <!ENTITY corplogo SYSTEM "logo.jpg" NDATA img>
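The delegation methods listed above have close counterparts in other event-driven parsers. Purely as a runnable illustration, the following sketch uses Python's xml.parsers.expat module (not NSXMLParser) to collect element and attribute-list declarations from an internal DTD subset into dictionaries keyed by element name. The DeclarationCollector class, its dictionaries, and the sample document are hypothetical. A Cocoa delegate could store the same kind of data in its implementations of parser:foundElementDeclarationWithName:model: and parser:foundAttributeDeclarationWithName:forElement:type:defaultValue:.

```python
import xml.parsers.expat as expat

class DeclarationCollector:
    """Hypothetical helper that stores DTD declarations as they are reported."""

    def __init__(self):
        self.element_models = {}   # element name -> content model as reported by expat
        self.attribute_rules = {}  # element name -> {attribute: (type, default, required)}

    def element_decl(self, name, model):
        # Called once for each <!ELEMENT ...> declaration.
        self.element_models[name] = model

    def attlist_decl(self, elname, attname, type_, default, required):
        # Called once for each attribute named in an <!ATTLIST ...> declaration.
        rules = self.attribute_rules.setdefault(elname, {})
        rules[attname] = (type_, default, bool(required))

    def parse(self, data):
        parser = expat.ParserCreate()
        parser.ElementDeclHandler = self.element_decl
        parser.AttlistDeclHandler = self.attlist_decl
        parser.Parse(data, True)

# Hypothetical sample document with an internal DTD subset.
SAMPLE = b"""<?xml version="1.0"?>
<!DOCTYPE dictionary [
  <!ELEMENT dictionary (documentation?, suite+)>
  <!ELEMENT documentation (#PCDATA)>
  <!ELEMENT suite EMPTY>
  <!ATTLIST dictionary title CDATA #IMPLIED>
]>
<dictionary><suite/></dictionary>
"""

collector = DeclarationCollector()
collector.parse(SAMPLE)
```

Keying the collected declarations by element name makes it straightforward to look up the applicable rules later, when the parser reports each element of the document.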

Resolving External DTD Entities

An XML document, in the DOCTYPE declaration that occurs near its beginning, often identifies an external DTD file whose declarations prescribe its logical structure. For example, the following DOCTYPE declaration says that the DTD related to the root element “addresses” can be located by the system identifier “addresses.dtd”.

<!DOCTYPE addresses SYSTEM "addresses.dtd">

Often the system identifier assumes a standard file-system location for DTDs—for example, /System/Library/DTDs. At the start of processing, the NSXMLParser delegate is given an opportunity to resolve this external entity and give the parser a list of DTD declarations to parse.

  1. When you prepare the NSXMLParser instance, send it the setShouldResolveExternalEntities: message with an argument of YES.

  2. Implement the delegation method parser:resolveExternalEntityName:systemID: to return the declarations in the external DTD file as an NSData object.

If the DTD declarations are internal to an XML document, then the delegate will receive the DTD-declaration messages automatically (assuming, of course, that it implements the related methods).
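The delegate's task in step 2 amounts to mapping an entity name and a system identifier to the bytes of a DTD. The following Python sketch shows that lookup in a language-neutral way; it is not Cocoa code. The function name, the KNOWN_DTDS table, and the DTD text in it are hypothetical, and returning None stands in for declining to resolve the entity. Restricting resolution to a table of known identifiers is one possible design that avoids loading arbitrary files named by an untrusted document.

```python
# Hypothetical table of DTDs the application is willing to supply,
# keyed by system identifier. The declarations are invented for illustration.
KNOWN_DTDS = {
    "addresses.dtd": (
        b"<!ELEMENT addresses (address*)>\n"
        b"<!ELEMENT address (#PCDATA)>\n"
    ),
}

def resolve_external_entity(name, system_id):
    """Return the DTD bytes for a known system identifier, otherwise None.

    Plays the role of parser:resolveExternalEntityName:systemID:, which
    hands the entity's data back to the parser.
    """
    return KNOWN_DTDS.get(system_id)
```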

Constructing Rules for Elements

Just as elements are typically the most common kind of construct in an XML document, element declarations are the most common kind of declaration in a DTD. They express rules for the composition of elements from child elements, text, and other constituents.

An element declaration has three parts: the !ELEMENT keyword, the element name, and a content model. The content model is everything after the name up to the terminating angle bracket. Consider the following examples:

<!ELEMENT cocoa EMPTY>
<!ELEMENT keyboard (layouts+, modifierMap+, keyMapSet+, actions*, terminators*)>
<!ELEMENT dict (key, %plistObject;)*>
<!ELEMENT string (#PCDATA)>

The content model can specify no content (EMPTY), any content (ANY, which is rare), textual content (#PCDATA), and child elements. It may identify child elements by name or by an entity reference (such as %plistObject; in the third example above). The model can also specify mixed content—that is, the element can contain text and child elements in any order. Through occurrence modifiers (*, +, ?) and other syntactical conventions, the content model can also specify the order of child elements, whether an element is required or optional, how many times an element may occur, and acceptable choices between elements. Occurrence modifiers can be applied to groups of elements (in parentheses) as well as individual elements.

For validation, you examine the content model of each element declaration and derive rules for the composition of that element. As one approach, you might design a class for each type of rule as well as for the scope of a rule (an individual element or a group of elements). You could then associate instances of those rule classes with an element through the name of the element. During validation, you query the instances about a current or potential child of the element.
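As a minimal sketch of deriving rule objects from a content model, the following Python code parses a content-model string into a small tree of Rule instances. The Rule class, its field names, and the function names are hypothetical, and the sketch assumes that parameter-entity references such as %plistObject; have already been expanded (it rejects them otherwise). It is one possible design, not the approach NSXMLParser itself prescribes.

```python
import re
from dataclasses import dataclass, field

@dataclass
class Rule:
    kind: str              # "empty", "any", "text", "mixed", "name", "sequence", or "choice"
    name: str = ""         # element name, used when kind == "name"
    occurrence: str = ""   # "", "?", "*", or "+"
    children: list = field(default_factory=list)

TOKEN = re.compile(r"\s*(#PCDATA|[A-Za-z_:][\w.:-]*|[()|,?*+])")

def tokenize(model):
    model, tokens, pos = model.strip(), [], 0
    while pos < len(model):
        match = TOKEN.match(model, pos)
        if match is None:
            raise ValueError(f"unexpected text in content model: {model[pos:]!r}")
        tokens.append(match.group(1))
        pos = match.end()
    return tokens

def parse_particle(tokens):
    """Parse one name or parenthesized group; return (rule, remaining tokens)."""
    if not tokens:
        raise ValueError("content model ended unexpectedly")
    head, rest = tokens[0], tokens[1:]
    if head == "(":
        children, separator = [], None
        while True:
            child, rest = parse_particle(rest)
            children.append(child)
            if not rest:
                raise ValueError("missing closing parenthesis")
            mark, rest = rest[0], rest[1:]
            if mark == ")":
                break
            # A group uses either commas or vertical bars, never both.
            if mark not in (",", "|") or separator not in (None, mark):
                raise ValueError(f"unexpected {mark!r} in group")
            separator = mark
        elements = [c for c in children if c.kind != "text"]
        if len(elements) < len(children):   # the group contains #PCDATA
            rule = Rule("mixed", children=elements) if elements else Rule("text")
        else:
            rule = Rule("choice" if separator == "|" else "sequence", children=children)
    elif head == "#PCDATA":
        rule = Rule("text")
    elif head in (")", "|", ",", "?", "*", "+"):
        raise ValueError(f"unexpected {head!r}")
    else:
        rule = Rule("name", name=head)
    if rest and rest[0] in ("?", "*", "+"):
        rule.occurrence, rest = rest[0], rest[1:]
    return rule, rest

def parse_content_model(model):
    tokens = tokenize(model)
    if tokens in (["EMPTY"], ["ANY"]):
        return Rule(tokens[0].lower())
    rule, rest = parse_particle(tokens)
    if rest:
        raise ValueError(f"unexpected trailing tokens: {rest}")
    return rule
```

An application could keep the results in a dictionary keyed by element name, for example rules["dictionary"] = parse_content_model("(documentation?, suite+)"), and consult that dictionary whenever an element is added or changed.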

Table 1 lists the most important rules derivable from an element declaration’s content model.

Table 1  Possible rules for element validation

Rule                       Sample content model        Comments
-------------------------  --------------------------  --------
Textual content only       (#PCDATA)
Mixed content              (#PCDATA | bold | italic)*  Vertical bars in this case do not mean choice; when #PCDATA is present, they mean that text and the listed child elements can be intermixed in any order. The trailing asterisk is required.
No content                 EMPTY                       For flag-type values.
Required sequence          (name, address, phone)      Commas indicate a prescribed sequence.
Choice                     (read | write | readwrite)  Without #PCDATA as a member (see Mixed content), the vertical bars mean that exactly one of the listed elements must be used.
Occurs exactly once        (name, address, phone)      No occurrence modifier. Can apply to an individual element or a group.
Occurs zero or more times  (%plistObject;)*            Occurrence modifier is an asterisk (“*”). Can apply to an individual element or a group.
Occurs one or more times   (property+)                 Occurrence modifier is a plus sign (“+”). Can apply to an individual element or a group.
Occurs zero or one time    (%implementation;?)         Occurrence modifier is a question mark (“?”). Can apply to an individual element or a group.
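One way to enforce the sequence, choice, and occurrence rules in Table 1 is to translate a content model into a regular expression over child element names. The Python sketch below does this for element-only content models. The function names are hypothetical, and the sketch assumes that parameter-entity references have already been expanded; it deliberately rejects ANY and models containing #PCDATA, which would need separate handling.

```python
import re

TOKEN = re.compile(r"[A-Za-z_:][\w.:-]*|[()|,?*+]")

def model_to_regex(model):
    """Translate an element-only content model into a compiled regex.

    Each child name becomes a group matching the name followed by one space,
    so "(a, b+)" becomes "(?:(?:a )(?:b )+)".
    """
    model = model.strip()
    if model == "ANY" or "#" in model or "%" in model:
        raise ValueError("only element-only content models are supported")
    if model == "EMPTY":
        return re.compile("")          # matches only an empty child list
    parts = []
    for token in TOKEN.findall(model):
        if token == "(":
            parts.append("(?:")
        elif token == ",":
            continue                   # sequence is plain concatenation
        elif token in (")", "|", "?", "*", "+"):
            parts.append(token)        # same meaning in regex syntax
        else:
            parts.append("(?:" + re.escape(token) + " )")
    return re.compile("".join(parts))

def children_conform(model, child_names):
    """Return True if the ordered child element names satisfy the model."""
    text = "".join(name + " " for name in child_names)
    return model_to_regex(model).fullmatch(text) is not None
```

For example, with the keyboard declaration shown earlier, children_conform accepts the child list layouts, modifierMap, keyMapSet but rejects the same names in a different order.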

Constructing Rules for Attributes

Elements frequently have attributes associated with them, and consequently attribute-list declarations are frequently encountered in DTDs. Attribute-list declarations specify the rules for attributes using a syntax that is different from that of element declarations. They specify, in order, the associated element, the name of the attribute, the type of the attribute, and a default value. For example, the declaration

<!ATTLIST modifierMap defaultIndex NMTOKEN #REQUIRED >

states that the defaultIndex attribute, which is associated with the modifierMap element, is of type NMTOKEN (meaning that its value must be a valid XML name token); the #REQUIRED keyword given in place of a default value means that a value for the attribute must be supplied.

When an NSXMLParser instance encounters an attribute-list declaration, it sends parser:foundAttributeDeclarationWithName:forElement:type:defaultValue: to its delegate. Passed in as parameters are the attribute name, the associated element, the attribute type, and its default value. The rules for attributes derive from combinations of the last two parameters (type and default value). Table 2 lists some of the possible rules that you can construct from attribute-list declarations.

Table 2  Possible rules for attribute validation

Rule                              Keywords or example          Type or default  Comments
--------------------------------  ---------------------------  ---------------  --------
Unique value                      ID                           type             The attribute value must be unique in the XML document.
Required value                    #REQUIRED                    default          The value of the attribute must be specified in the document.
Refers to unique attribute value  IDREF | IDREFS               type             Value must refer to a valid ID-type value elsewhere in the document. IDREFS specifies a whitespace-separated list of ID references.
Valid XML name token              NMTOKEN | NMTOKENS           type             Value must be a valid XML name token (name characters only; unlike an XML name, it can begin with a digit). NMTOKENS specifies a whitespace-separated list of name tokens.
Value is fixed                    #FIXED "value"               default          Value must be “value”.
Valid name token in list          (name | address | phone)     type             Attribute enumeration: value must be one of the name tokens in parentheses.
Valid defined type in list        NOTATION (tiff | gif | jpg)  type             Attribute enumeration: value must be one of the declared notation names in parentheses.
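Several of the rules in Table 2 can be checked one element at a time. The Python sketch below illustrates this for required, fixed, enumerated, NMTOKEN, and ID attributes. The function name and data layout are hypothetical, and the sketch assumes that the type and default value are stored as strings in the form in which they appear in the declaration (for example, "#REQUIRED" or '#FIXED "1.0"'); the exact strings a given parser reports may differ. The NMTOKEN pattern is only an approximation of XML name characters.

```python
import re

NMTOKEN = re.compile(r"[\w.:-]+")   # approximation of XML name characters

def check_attributes(rules, attributes, seen_ids):
    """Return a list of error strings for one element's attributes.

    rules:      {attribute name: (type, default)} from ATTLIST declarations
    attributes: {attribute name: value} found on the element
    seen_ids:   set of ID values already used in the document (updated in place)
    """
    errors = []
    for name, (type_, default) in rules.items():
        value = attributes.get(name)
        if value is None:
            if default == "#REQUIRED":
                errors.append(f"missing required attribute {name!r}")
            continue
        if default.startswith("#FIXED"):
            fixed = default[len("#FIXED"):].strip().strip("\"'")
            if value != fixed:
                errors.append(f"attribute {name!r} must be {fixed!r}")
        if type_ == "ID":
            if value in seen_ids:
                errors.append(f"duplicate ID value {value!r}")
            seen_ids.add(value)
        elif type_ == "NMTOKEN":
            if not NMTOKEN.fullmatch(value):
                errors.append(f"attribute {name!r} is not a name token")
        elif type_ == "NMTOKENS":
            if not value.split() or not all(NMTOKEN.fullmatch(t) for t in value.split()):
                errors.append(f"attribute {name!r} is not a list of name tokens")
        elif type_.startswith("("):
            allowed = [t.strip() for t in type_.strip("()").split("|")]
            if value not in allowed:
                errors.append(f"attribute {name!r} must be one of {allowed}")
    for name in attributes:
        if name not in rules:
            errors.append(f"undeclared attribute {name!r}")
    return errors
```

IDREF and IDREFS values are not checked here because they can refer to IDs that appear later in the document; one option is to record them during parsing and compare them against seen_ids after the parser reaches the end of the document.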

Handling Other Declarations

Other DTD declarations such as those for entities and notations are less common than element and attribute-list declarations. You can easily derive rule constructions for these other declarations after reviewing some DTD documentation. However, there are a couple of things to keep in mind: