XML processing and DOM Document Type Definitions

Abstract

This specification defines how various implementations of XML by the manakai project process XML documents, including: how to parse XML documents; how information in XML DTDs is represented in DOM; how base URLs are determined in XML documents; how namespace prefixes are serialized; and how conformance of XML documents is checked.

Parsing XML documents

This section defines additional requirements for an XML parser.

An XML parser MUST parse an XML document as specified in the XML5 [XML5] specification.

XXX If steps for an XML parser are not yet defined, similar steps for an HTML parser have to be used instead. Changes to the HTML tokenizer made after XML5 was specced must also be applied to the XML tokenizer, where possible. Even in such cases, if the input is not well-formed, it is a parse error. There are a before XML declaration phase, a before document type phase, and a document type phase in the tree construction stage. XML namespaces must also be supported. The XML parser has the entity body phase, which is initially set to the before document type phase. It is used as the next phase of the before XML declaration phase.

An XML tokenizer is a component of the XML parser performing tokenization. An input stream is associated with the tokenizer. Initially, an XML parser has an associated tokenizer and input stream. An XML parser has a stack of tokenizers, which is initially empty.

Unless explicitly specified, various states of the XML parser are shared among tokenizers associated with it.

The XML parser has standalone document, expose DTD content, don't process, and expand external entities flags. Unless explicitly stated, these flags are unset.

DOM APIs such as innerHTML and DOMParser do not set these flags.

Each entry in the list of entities of an XML parser has an open flag, which is initially unset.

When the XML parser has to append an entity, it MUST do nothing in the following cases:

If the entity flag is "normal" and the entity name is one of the predefined entity names,
If the entity flag is "normal" and the list of entities contains an entity with the same name,
If the entity flag is "parameter" and the list of parameter entities contains an entity with the same name, or
If the don't process flag is set.
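
The cases above can be sketched in Python as follows. This is only an illustration under stated assumptions: the names `ParserState`, `Entity`, and `append_entity` are hypothetical and not defined by this specification, and the predefined entity names are assumed to be the five names amp, lt, gt, quot, and apos.

```python
from dataclasses import dataclass, field

# Assumption: the five entity names predefined by XML.
PREDEFINED_ENTITY_NAMES = {"amp", "lt", "gt", "quot", "apos"}


@dataclass
class Entity:
    name: str
    flag: str  # "normal" (general entity) or "parameter"
    replacement_text: str = ""
    open_flag: bool = False  # the open flag, initially unset


@dataclass
class ParserState:
    entities: dict = field(default_factory=dict)            # list of entities
    parameter_entities: dict = field(default_factory=dict)  # list of parameter entities
    dont_process: bool = False                               # the don't process flag


def append_entity(state, entity):
    """Append entity unless one of the "do nothing" cases applies.

    Returns True if the entity was appended and False if it was ignored.
    """
    if state.dont_process:
        return False
    if entity.flag == "normal":
        if entity.name in PREDEFINED_ENTITY_NAMES:
            return False
        if entity.name in state.entities:
            return False  # the first declaration wins
        state.entities[entity.name] = entity
        return True
    if entity.flag == "parameter":
        if entity.name in state.parameter_entities:
            return False
        state.parameter_entities[entity.name] = entity
        return True
    raise ValueError("unknown entity flag: " + entity.flag)
```

In this sketch general and parameter entities live in separate tables, so a parameter entity named amp is still appended while a general entity named amp is ignored.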

If the expose DTD content flag is set, the element, attribute list, unparsed entity, and notation declarations, as well as processing instructions in DTD, MUST be exposed to the DOM through the DocumentType object.

Everything must be ignored in the document type phase, unless the previous phase is the start phase or the input stream currently being processed is an entity referenced by the document type declaration or by an entity reference (rather than the document entity).

An XML parser has the need predefined entity declarations flag and the has character entity declarations flag. They are initially unset.

In the document type phase, the need predefined entity declarations flag MUST be set. It MUST be restored to the original value when the phase is set to another value.

Character encodings

The value of the encoding pseudo-attribute or the charset parameter MUST be interpreted as an encoding label defined by the Encoding Standard.

Expansion of entities

The external subset entity

Just before the phase is switched from the before document type phase to the start phase, the XML parser MUST run the following steps:

Let doctype be the doctype of the Document.
If doctype is null, abort these steps.
Let public ID be the publicId of doctype.
If public ID is one of the predefined public IDs, set the has character entity declarations flag and abort these steps.
Set the need predefined entity declarations flag.
Let system ID be the systemId of doctype.
If system ID is the empty string, abort these steps.
Invoke the parse an entity steps with the following parameters:

entity
The external subset.
url
The result of resolving system ID against the base URL of doctype.
original phase
The phase of the XML parser.
entity body phase
The document type phase.

The external subset is parsed and processed after the internal subset, if any.
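
As an illustration, the steps above might be sketched in Python like this. The function name, the dictionary representation of the doctype (keys `publicId`, `systemId`, and `baseURL`), and the `flags` dictionary are all hypothetical. The contents of the set of predefined public IDs are not given here, so the caller supplies them. `urllib.parse.urljoin` is used as a rough stand-in for URL resolution and is not guaranteed to match the resolution algorithm a conforming parser uses.

```python
from urllib.parse import urljoin


def external_subset_url(doctype, predefined_public_ids, flags):
    """Return the URL of the external subset to parse, or None if the steps abort.

    doctype is None or a dict with "publicId", "systemId", and "baseURL" keys
    (a hypothetical stand-in for the DocumentType node).
    """
    if doctype is None:
        return None
    if doctype["publicId"] in predefined_public_ids:
        flags["has_character_entity_declarations"] = True
        return None
    flags["need_predefined_entity_declarations"] = True
    system_id = doctype["systemId"]
    if system_id == "":
        return None
    # Approximation of "resolving system ID against the base URL of doctype".
    return urljoin(doctype["baseURL"], system_id)
```

When a URL is returned, the caller would then invoke the parse an entity steps with the external subset as the entity and the document type phase as the entity body phase.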

Parameter entities

To consume a parameter entity, the XML parser MUST run these steps:

Consume the next input character while it is not one of the following characters:
- A space character
- A U+0022 QUOTATION MARK character (")
- A U+0025 PERCENT SIGN character (%)
- A U+0026 AMPERSAND character (&)
- A U+0027 APOSTROPHE character (')
- A U+003B SEMICOLON character (;)
- A U+003C LESS-THAN SIGN character (<)
- A U+003D EQUAL SIGN character (=)
- A U+003E GREATER-THAN SIGN character (>)
- A U+0060 GRAVE ACCENT character (`)
- An implied EOF character
Let name be the characters consumed by the previous step.
If the next input character is a U+003B SEMICOLON character (;), consume the character and append the character to name.
Otherwise, this is a parse error.
Let entity be the parameter entity whose name is name in the list of parameter entities. If there is no such parameter entity:
- Parse error.
- If the standalone document flag is not set, set the don't process flag.
- Abort these steps.
Names of parameter entities in the list of parameter entities always contain the ; character.
If the don't process flag is set and the standalone document flag is not set, return nothing and abort these steps.
If the open flag of the entry for entity in the list of parameter entities of the XML parser is set, this is a parse error. Return nothing and abort these steps.
Let public ID be the public ID of entity.
If public ID is one of the predefined public IDs:

If the state is the document type internal subset state
Set the has character entity declarations flag and abort these steps.
Otherwise
Parse error. Abort these steps.
Invoke the parse an entity steps with the following parameters:

entity
entity
url
The result of resolving the system ID of entity against the effective declaration base URL of entity, if entity is an external entity, or null.
original phase
The current phase of the tree construction stage.
entity body phase
The current phase of the tree construction stage.
state
If the state is one of document type entity value double quoted state, document type entity value single quoted state, or entity value state, the entity value state.
in markup declaration flag
Set if the state of the XML parser is not one of document type internal subset state, document type entity value double quoted state, document type entity value single quoted state, or entity value state, or unset otherwise.
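
The name-scanning part of the consume a parameter entity steps (the first three steps) could look roughly like the following Python sketch. It assumes that the space characters are tab, line feed, form feed, carriage return, and space, and it treats the end of the string as the implied EOF character. The function name and return shape are hypothetical.

```python
# Assumption: "space character" means one of these five characters.
SPACE_CHARACTERS = "\t\n\x0c\r "
STOP_CHARACTERS = set(SPACE_CHARACTERS + "\"%&';<=>`")


def consume_parameter_entity_name(text, pos):
    """Scan a parameter entity name starting at text[pos].

    Returns (name, new_pos, parse_error). When the name is terminated by a
    semicolon, the semicolon is consumed and appended to the name, matching
    the convention that names in the list of parameter entities end with ";".
    """
    start = pos
    while pos < len(text) and text[pos] not in STOP_CHARACTERS:
        pos += 1
    name = text[start:pos]
    if pos < len(text) and text[pos] == ";":
        return name + ";", pos + 1, False
    # Missing semicolon: this is a parse error.
    return name, pos, True
```

For example, scanning `%draft;` from the position just after the percent sign yields the name `draft;`, which can then be looked up in the list of parameter entities.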

General entities

When the steps to consume a character reference are applied to an XML parser, the requirements in this section apply.

The steps to consume a character reference are defined in the HTML Standard for the HTML parser. This section monkeypatches the steps to implement XML-specific rules.

If the stack of open elements is empty and the state is not the character reference in attribute value state when the steps to consume a character reference are invoked, this is a parse error.

Each entity in the list of entities MUST be appended to the table of the named character references as a row, where the first column is the name of the entity followed by a U+003B SEMICOLON character (;). If the table already has a row whose first column is equal to the first column of the new row, the existing row MUST be replaced by the new row.

If all of the following conditions are met, this is a parse error:

A row from the original named character references table is used
The first column of the row ends with a U+003B SEMICOLON character (;)
The has character entity declarations flag is unset
The first column of the row is none of the predefined entity names

If only the last condition is not met and the need predefined entity declarations flag is set, the XML parser MAY report a warning.
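
The conditions above can be read as a small decision function. The Python sketch below is one possible reading, not a normative restatement: the function name and return values are hypothetical, the predefined entity names are assumed to be amp, lt, gt, quot, and apos, and the first column is compared with those names after stripping its trailing semicolon.

```python
PREDEFINED_ENTITY_NAMES = {"amp", "lt", "gt", "quot", "apos"}  # assumption


def classify_named_reference(first_column, from_original_table,
                             has_character_entity_declarations,
                             need_predefined_entity_declarations):
    """Return "parse-error", "warning", or None for a matched table row."""
    if not from_original_table:
        return None
    if not first_column.endswith(";"):
        return None
    if has_character_entity_declarations:
        return None
    if first_column[:-1] not in PREDEFINED_ENTITY_NAMES:
        return "parse-error"  # all four conditions are met
    # Only the last condition is not met: the parser MAY report a warning.
    if need_predefined_entity_declarations:
        return "warning"
    return None
```

Under this reading, `&nbsp;` in a document without character entity declarations is a parse error, while `&amp;` at most produces an optional warning.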

When character tokens are returned using the second column of the table, an XML parser MUST run these steps:

Let entity be the entity of the selected row.
If entity is an unparsed entity, this is a parse error. Return nothing and abort the consume a character reference steps.
If the open flag of the entry for entity in the list of entities of the XML parser is set, this is a parse error. Return nothing and abort the consume a character reference steps.
If these steps are invoked in the character reference in attribute value state:
1. If entity is an external entity, this is a parse error. Return nothing and abort the consume a character reference steps.
2. Let replacement text be the replacement text of entity.
3. If replacement text contains a U+003C LESS-THAN SIGN character (<), this is a parse error. Return nothing and abort the consume a character reference steps.
4. Set the open flag of the entry for entity in the list of entities of the XML parser.
5. Let s be the empty string.
6. Loop: If replacement text is the empty string, go to the step labeled end.
7. If the first character in replacement text is not a U+0026 AMPERSAND character (&), append the character to s and remove the character from replacement text.
8. If the first character in replacement text is a U+0026 AMPERSAND character (&):
  1. Attempt to consume a character reference steps recursively, with no additional allowed character. Those steps MUST behave as if the input stream were replacement text and remove the characters consumed from replacement text.
  2. If nothing is returned, append a U+0026 AMPERSAND character (&) to s and remove the first character from replacement text.
  3. Otherwise, append the characters represented by the returned character tokens to s.
9. Go to the step labeled loop.
10. End: Unset the open flag of the entry for entity in the list of entities of the XML parser.
11. Return character tokens equivalent to characters in s and abort the consume a character reference steps.
Otherwise:
1. If the phase of the tree construction stage is not the main phase, return nothing and abort the consume a character reference steps.
2. Invoke the parse an entity steps with the following parameters:
  
  entity
  entity
  url
  The result of resolving the system ID of entity against the effective declaration base URL of entity, if entity is an external entity, or null.
  original phase
  The main phase.
  entity body phase
  The main phase.
3. Return zero character tokens and abort the consume a character reference steps.

The number of character tokens returned by these steps can be zero, which is different from returning nothing.
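
The attribute value branch of the steps above can be sketched in Python as follows. This is a simplified illustration: entities are hypothetical dictionaries, only references of the form `&name;` are recognized inside replacement text (numeric character references and the table lookup of the HTML Standard are left out), and `None` stands for "return nothing" while a string, possibly empty, stands for the returned character tokens.

```python
import re

# Simplification: a reference is "&", a name without "&", ";" or whitespace, then ";".
_REFERENCE = re.compile(r"&([^&;\s]+);")


def expand_in_attribute(entities, name):
    """Expand the general entity `name` as if referenced in an attribute value."""
    entity = entities.get(name)
    if entity is None or entity.get("unparsed") or entity.get("open"):
        return None  # parse error (unknown, unparsed, or recursive reference)
    if entity.get("external"):
        return None  # parse error: external entity in attribute value
    text = entity["replacement_text"]
    if "<" in text:
        return None  # parse error: "<" in replacement text
    entity["open"] = True  # set the open flag while expanding
    out = []
    pos = 0
    while pos < len(text):
        if text[pos] != "&":
            out.append(text[pos])
            pos += 1
            continue
        match = _REFERENCE.match(text, pos)
        expanded = expand_in_attribute(entities, match.group(1)) if match else None
        if expanded is None:
            out.append("&")  # nothing returned: keep the ampersand literally
            pos += 1
        else:
            out.append(expanded)
            pos = match.end()
    entity["open"] = False  # unset the open flag
    return "".join(out)
```

An entity whose replacement text is empty expands to the empty string here, which is distinct from `None`, mirroring the difference between zero character tokens and nothing.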

Fetching and parsing external entities

To parse an entity entity with url, original phase, entity body phase, state whose default is null, and in markup declaration flag whose default is unset, the XML parser MUST run these steps:

If in markup declaration flag is set, act as if a U+0020 SPACE character were processed by the tokenizer.
Set the open flag of the entry for entity in the list of entities of the XML parser if entity is a general entity.
Push a marker to the stack of open elements. This marker is referenced later in this section, but is ignored for any other purposes. (Especially, the current element can never be a marker.)
Set the parser pause flag of the tokenizer of the XML parser to true.
Block the tokenizer of the XML parser, such that the event loop will not run tasks that invoke the tokenizer.
Push the current tokenizer of the XML parser to the stack of tokenizers.
If entity is an internal entity, if the expand external entities flag is unset, if url is null, if url is in error, or if there are more entity references than an implementation-specific limit, such that the entity reference that caused these steps to be invoked ought to be ignored:
1. Let replacement be the replacement text of entity, if it is an internal entity, or the empty string.
2. Set the tokenizer of the XML parser to a new tokenizer whose input stream is replacement.
If the expand external entities flag is unset, a reference to the external entity is expanded to the empty string.

There should be an implementation-specific limit on how many entity references are expanded to defend against the billion laughs attack.
Otherwise:
1. Let referrer be XXX document's address of the Document of the XML parser, if referrer is enabled, or null.
2. Let req be a request with url url and referrer referrer.
3. Invoke the fetch steps using req. The tasks queued by the fetch algorithm MUST run these steps:
  
  If it is a task to process response
  If the type of the response of the fetch is not default, append the "EOF" character to the input byte stream of the current tokenizer of the XML parser.
  If it is a task to process response body
  If the type of the response of the fetch is default, append newly-arrived bytes in body of the response to the input byte stream of the current tokenizer of the XML parser.
  If it is a task to process response end-of-file
  Append the "EOF" character to the input byte stream of the current tokenizer of the XML parser.
4. Set the tokenizer of the XML parser to a new tokenizer.
5. Set the phase of the tree construction stage of the XML parser to before XML declaration phase and entity body phase to entity body phase.
  
  The before XML declaration phase is used to parse the text declaration in the external entity, if any.
6. If state is not null, let original state be the state of the XML parser and set the state of the XML parser to state.
When decoding the input byte stream, the character encoding given in the Content-Type metadata (e.g. the charset parameter) MUST be taken into account. However, the MIME type itself is ignored.

An external general entity can be served with various MIME types, including but not limited to: application/xml-dtd, text/xml-parsed-entity, application/xml-parsed-entity, various XML MIME types, text/sgml, application/sgml, text/plain, and application/octet-stream.
XXX Should the charset parameter be taken into account even when the charset parameter is not defined for the MIME type?

XXX set some error flag if external entity is not expanded
XXX interaction of standalone=yes and external entity reference??

While the stack of tokenizers is not empty, for the purpose of the processing of an end tag token only, the XML parser MUST act as if the stack of open elements does not contain the marker and any other element added to the stack before the marker.

While the stack of tokenizers is not empty and in markup declaration flag is set upon the last invocation of the parse an entity steps:

If the tokenizer consumes a U+003E GREATER-THAN SIGN character (>) and the state is changed to the document type internal subset state, it MUST be a parse error and the state MUST be changed to the bogus markup declaration state instead.
If the tokenizer consumes an implied EOF character, the steps to stop parsing MUST be run instead. If this is within an entity value, public literal, system literal, or content model group opened by current entity, this is a parse error and the state MUST be changed to the bogus markup declaration state.

While the stack of tokenizers is not empty, instead of the stop parsing steps, the XML parser MUST run these steps:

Pop the most recently added marker, as well as any element added after the marker, from the stack of open elements.
If the last invocation of the parse an entity steps sets original state, set the state of the XML parser to original state.
Set the phase of the tree construction stage of the XML parser to original phase for the last invocation of the parse an entity steps.
Pop a tokenizer from the stack of tokenizers and set it as the tokenizer of the XML parser.
Unblock the tokenizer of the XML parser, such that tasks that invoke the tokenizer can again be run.
Set the parser pause flag of the tokenizer of the XML parser to false.
Unset the open flag of the entry for entity in the list of entities of the XML parser if entity is a general entity.
If in markup declaration flag is set upon the last invocation of the parse an entity steps, act as if a U+0020 SPACE character were processed by the tokenizer.
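
The bookkeeping around the stack of tokenizers and the marker can be illustrated with a small Python model. It is a sketch only: `MiniParser` and its method names are hypothetical, a tokenizer is reduced to a dictionary holding its input, fetching is omitted so that an external entity always expands to the empty string (as when the expand external entities flag is unset), and phases, states, and the in markup declaration flag are not modeled.

```python
MARKER = object()  # stands in for the marker pushed to the stack of open elements


class MiniParser:
    def __init__(self, text):
        self.tokenizer = {"input": text}
        self.tokenizer_stack = []    # stack of tokenizers, initially empty
        self.open_elements = []      # stack of open elements
        self.active_entities = []    # entities being parsed, innermost last

    def parse_entity(self, entity):
        if entity["general"]:
            entity["open"] = True
        self.open_elements.append(MARKER)
        self.tokenizer_stack.append(self.tokenizer)
        replacement = entity["replacement_text"] if entity["internal"] else ""
        self.tokenizer = {"input": replacement}
        self.active_entities.append(entity)

    def elements_visible_to_end_tag(self):
        """Elements added after the most recent marker, oldest first."""
        visible = []
        for element in reversed(self.open_elements):
            if element is MARKER:
                break
            visible.append(element)
        return list(reversed(visible))

    def finish_entity(self):
        """Run in place of the stop parsing steps when the entity's input ends."""
        while self.open_elements.pop() is not MARKER:
            pass  # drop elements opened inside the entity, then the marker
        self.tokenizer = self.tokenizer_stack.pop()
        entity = self.active_entities.pop()
        if entity["general"]:
            entity["open"] = False
```

With this model, an end tag seen while an entity is being parsed can only match elements opened inside that entity, and any element left open by the entity is discarded when its input ends.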

For each external entity (including the document entity and the external subset entity, if any)

If there is a byte sequence that is not legal in the encoding in use, then the parser MUST raise an xml-misc-error.

If it is the document entity or a general entity, then:

If the input byte sequence for the entity begins with the BOM, then the parser MUST set the BOM flag of the node corresponding to the entity (the Document node for the document entity or an Entity node for a general entity) to true. @@ flag must be checked later

If it is a parameter entity or the external subset entity, then:

If the character encoding of the entity is UTF-16 but the input byte stream for the entity does not begin with the BOM, then the parser MUST raise an xml-misc-error.
@@ encoding="" preferred name?
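
The byte-level checks for external entities described above can be sketched in Python as follows. The function name, the `kind` strings, and the return shape are hypothetical. The sketch assumes that "the BOM" means a UTF-8 or UTF-16 byte order mark, and it relies on Python's codec names rather than on encoding labels from the Encoding Standard.

```python
import codecs

UTF16_BOMS = (codecs.BOM_UTF16_BE, codecs.BOM_UTF16_LE)
ALL_BOMS = (codecs.BOM_UTF8,) + UTF16_BOMS  # assumption: UTF-32 is not considered


def check_external_entity_bytes(data, encoding, kind):
    """Check the bytes of an external entity.

    kind is "document", "general", "parameter", or "external-subset".
    Returns (bom_flag, errors), where bom_flag is None when no node carries
    the BOM flag for this kind of entity.
    """
    errors = []
    try:
        data.decode(encoding)  # strict decoding rejects illegal byte sequences
    except UnicodeDecodeError:
        errors.append("xml-misc-error")
    bom_flag = None
    if kind in ("document", "general"):
        bom_flag = data.startswith(ALL_BOMS)
    elif kind in ("parameter", "external-subset"):
        if encoding.lower() == "utf-16" and not data.startswith(UTF16_BOMS):
            errors.append("xml-misc-error")
    return bom_flag, errors
```

A caller would copy `bom_flag` to the Document or Entity node and report each entry of `errors` through its error handler.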

For the document

If the XML document does not begin with an XML declaration, then the parser MUST raise an xml-misc-recommendation.

If the document does not contain the document type declaration, or if it does but the document type definition does not contain entity declaration for any of amp, lt, gt, apos, or quot, then the parser MUST raise xml-misc-recommendation(s).

For the document type declaration

@@ read external entity

The entities attribute of the DocumentType node MUST contain a NamedNodeMap object whose first five items are as follows:

An Entity node whose nodeName attribute is amp. It contains a Text node whose data attribute is set to &.
An Entity node whose nodeName attribute is lt. It contains a Text node whose data attribute is set to <.
An Entity node whose nodeName attribute is gt. It contains a Text node whose data attribute is set to >.
An Entity node whose nodeName attribute is quot. It contains a Text node whose data attribute is set to ".
An Entity node whose nodeName attribute is apos. It contains a Text node whose data attribute is set to '.

For each internal general entity declaration being processed by the parser

If the EntityValue part of the general entity declaration contains a bare U+003C LESS-THAN SIGN (<) character, then the parser MUST raise an xml-misc-warning.

For each element type declaration being processed by the parser

If there is another processed element type declaration whose Name is equal to the Name of the element type declaration, then the parser MUST raise an xml-validity-error.

For each attribute definition list declaration being processed by the parser

If there is another processed attribute definition list declaration whose Name is equal to the Name of the attribute definition list declaration, then the parser MUST raise an xml-misc-warning.

For each attribute definition in the attribute definition list declaration, if there is another processed attribute definition whose Name is equal to the Name of the attribute definition (whether or not in the same attribute definition list declaration), then the parser MUST raise an xml-misc-warning.

For each entity declaration being processed by the parser

Handle as follows:

If the entity declaration declares a general entity, the following is applied:

If the Name is lt or amp
If the entity declaration does not declare an internal entity, or if the replacement text of the entity is not the escaped form of < (if lt) or & (if amp), then the parser MUST raise an xml-misc-error.

In other words, the character in the EntityValue has to be double-escaped.

If the Name is gt, quot, or apos
If the entity declaration does not declare an internal entity, or if the replacement text of the entity is not equal to or not the escaped form of > (if gt), " (if quot), or ' (if apos), then the parser MUST raise an xml-misc-error.

In other words, the character in the EntityValue has to be single- or double-escaped.
If the entity declaration has to be ignored because an entity with the same Name has already been declared, then the parser MUST raise a misc-info and abort these steps.

Five predefined entities, i.e. amp, lt, gt, quot, and apos, are always declared implicitly and therefore any declaration for such an entity always raises a misc-info.
If the entity declaration declares a parameter entity and the Name of the entity begins with the string xml (in any combination of upper- and lowercase letters), then the parser MUST raise an xml-misc-warning.
If the entity declaration contains the EntityValue, then for each occurrence of any references to unparsed entities in the EntityValue, the parser MUST raise an xml-misc-error.
If the entity declaration declares a general entity, then an Entity node MUST be created and appended to the NamedNodeMap object in the entities attribute of the DocumentType node.
Read the external entity
If the replacement text of the entity is read, then parse the replacement text as if it were referenced from the content of an element (with no namespace bindings). If no @@ parse error is raised by the parsing process, then the nodes generated by the parsing MUST be appended to the Entity node. The parse error MUST NOT be propagated to the entire parsing process. Other kinds of errors MUST be propagated. The first parse error MUST abort the internal parsing process. @@ better wording
@@ prop
Then, the Entity node and its descendants MUST be marked as read-only.
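
The check on redeclarations of the five predefined entities, described earlier in these steps, might be sketched in Python like this. It assumes that "the escaped form" of a character means a single numeric character reference to it (for example `&#60;` or `&#x3C;` for <). The function names are hypothetical.

```python
import re

PREDEFINED = {"lt": "<", "amp": "&", "gt": ">", "quot": '"', "apos": "'"}
MUST_BE_ESCAPED = {"lt", "amp"}

# A single decimal or hexadecimal character reference, and nothing else.
_CHAR_REF = re.compile(r"&#(?:x([0-9A-Fa-f]+)|([0-9]+));")


def is_escaped_form(text, char):
    """True if text is exactly one numeric character reference to char."""
    match = _CHAR_REF.fullmatch(text)
    if match is None:
        return False
    code = int(match.group(1), 16) if match.group(1) else int(match.group(2))
    return code == ord(char)


def predefined_declaration_ok(name, replacement_text, internal=True):
    """Return False where the parser would raise an xml-misc-error."""
    char = PREDEFINED[name]
    if not internal:
        return False
    if is_escaped_form(replacement_text, char):
        return True
    # gt, quot, and apos may also be declared as the bare character.
    return name not in MUST_BE_ESCAPED and replacement_text == char
```

For instance, the declaration `<!ENTITY lt "&#38;#60;">` has the replacement text `&#60;` and passes, whereas a replacement text of a bare < does not.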

For each notation declaration being processed by the parser

If there is another processed notation declaration whose Name is equal to the Name of the notation declaration, then the parser MUST raise an xml-validity-error.

For each empty-element tag

If the Name of the tag is not declared by a processed element type declaration as EMPTY content, then the parser MUST raise an xml-misc-recommendation.

For each start-tag

If the Name of the tag is declared by a processed element type declaration as EMPTY content, then the parser MUST raise an xml-misc-recommendation.

For each attribute

The parser MUST set the normalized value of the attribute to the value attribute of the Attr node created for the attribute.

That is, any entity reference has to be expanded. Unexpanded entity references in attribute values are discarded.

For each xml:space attribute

The parser MUST set the normalized value of the xml:space attribute to the value attribute of the Attr node created for the attribute even if the normalized value is different from default or preserve.

For each parameter entity reference

Process as follows:

If the declaration for the entity is not processed, then:

If the document contains no external entity or if the document contains the standalone pseudo-attribute set to yes
The parser MUST raise an xml-well-formedness-error.
Otherwise
The parser MUST raise an xml-validity-error.
If the declaration for the entity is processed but the referenced entity cannot be retrieved, then the parser MUST raise an @@ ??-error.

In either of the two cases above, process as follows:

If the parameter entity reference is contained in a declaration, then the declaration MUST be ignored except that any error before the parameter entity MUST be raised as usual.
If the parameter entity reference is contained in the status portion of a conditional section, then the conditional section MUST be processed as if it were an IGNOREd section.
The parser MUST NOT process any entity or attribute-list declaration after the parameter entity reference in the DTD except when the standalone pseudo-attribute of the XML declaration (if any) is set to yes.
If the parameter entity reference is the first reference to an entity that is not read, then the parser MUST raise an entity-error.
The allDeclarationsProcessed @@ ref attribute of the Document node MUST be set to false.
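
Two of the decisions above are simple enough to express as small predicates. The following Python sketch is illustrative only. The function names and the string labels for declaration kinds are hypothetical, and the error for an entity that is declared but cannot be retrieved is left out because its type is still undecided above.

```python
def undeclared_reference_error(has_external_entity, standalone_yes):
    """Error type for a reference to an entity whose declaration is not processed."""
    if not has_external_entity or standalone_yes:
        return "xml-well-formedness-error"
    return "xml-validity-error"


def should_process_declaration(kind, saw_unread_pe_reference, standalone_yes):
    """Whether a later declaration in the DTD is still processed.

    kind is "entity", "attlist", "element", or "notation". After a parameter
    entity reference that could not be processed or read, entity and
    attribute-list declarations are skipped unless standalone is "yes".
    """
    if kind in ("entity", "attlist") and saw_unread_pe_reference and not standalone_yes:
        return False
    return True
```

A parser using these helpers would also set allDeclarationsProcessed to false once `saw_unread_pe_reference` becomes true.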

For each general entity reference in an attribute value or in the content of an element

Process as follows:

If the Name of the entity reference is either amp, lt, gt, quot, or apos, then abort these steps.
If the declaration for the entity is not processed, then:

If the document contains no external entity or if the document contains the standalone pseudo-attribute set to yes
The parser MUST raise an xml-well-formedness-error.
Otherwise
The parser MUST raise an xml-validity-error.
If the declaration for the entity is processed but the referenced entity cannot be retrieved, then the parser MUST raise an @@ ??-error.

In either of the two cases above, process as follows:

If the general entity reference is the first reference to an entity that is not read, then the parser MUST raise an entity-error. @@ entity declared WFC?
An unexpanded entity reference node MUST be inserted to the current node.

For each comment outside of document type declaration

A Comment node MUST be created and inserted appropriately.

The parser MUST try to read any entity referenced by general or parameter entity references and the external subset entity, if any in the document type definition.

Well-formedness constraints. When the parser detects a violation of one of certain well-formedness constraints, it MUST raise an xml-well-formedness-error. The list of such well-formedness constraints is as follows:

Validity constraints. When the parser detects a violation of one of certain validity constraints, it MUST raise an xml-validity-error. The list of such validity constraints is as follows:

Other criteria. If the parser detects a violation of one of certain additional constraints, it MUST raise an xml-misc-recommendation. The list of such constraints is as follows:

For interoperability, if a parameter-entity reference appears in a choice, seq, or Mixed construct, its replacement text SHOULD contain at least one non-blank character, and neither the first nor last non-blank character of the replacement text SHOULD be a connector (| or ,).
External parsed entities SHOULD each begin with a text declaration.
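
The first of these interoperability checks can be sketched in Python as follows. The function name and the problem labels are hypothetical, and "blank" is assumed to mean the XML white space characters (space, tab, carriage return, line feed).

```python
XML_WHITE_SPACE = " \t\r\n"  # assumption: "blank" means XML white space


def check_content_model_replacement(replacement_text):
    """Return a list of problems for a parameter entity used in a choice,
    seq, or Mixed construct. An empty list means no recommendation is raised."""
    stripped = replacement_text.strip(XML_WHITE_SPACE)
    if not stripped:
        return ["no-non-blank-character"]
    problems = []
    if stripped[0] in "|,":
        problems.append("leading-connector")
    if stripped[-1] in "|,":
        problems.append("trailing-connector")
    return problems
```

A parser could raise one xml-misc-recommendation per returned problem.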

The parser MUST act as if it is a validating XML processor for the purpose of informing of white space characters appearing in element content (See Section 2.10 of the XML specification).

In other words, the isElementContentWhitespace attribute of Text nodes has to be set appropriately. Note that the value of the attribute will be set to false for any Text node in the content of an element whose declaration is not processed.

The parser MUST raise at least one xml-well-formedness-error if the entity it parses does not match the appropriate production rule in the XML specification. As an exception to this requirement, it MAY choose not to raise such an error if the error will be raised by the conformance checker when the conformance checker checks the Document object produced by the parser.

DOM Document Type Definitions

This section defines various extensions to DOM to represent definitions that can be contained in XML DTDs. They are referred to as DOM Document Type Definitions, which is a set of DOM interfaces, including both new interfaces and modifications to existing standard DOM interfaces.

The primary goal of those features is to make it possible to build an XML DTD validator on top of the extended DOM API.

Although these features are defined as extensions to the standard DOM interfaces, they are not expected to be implemented by Web browsers.

Interfaces defined in this section are partially modeled and inspired by early Working Drafts of DOM Level 1 [DOM1WD], DOM Level 3 Abstract Schemas draft [DOM3AS], and XML Schema API specification [XSAPI], but they are not compatible with any of them as a whole.

Features in this section are applied to both XML documents and HTML documents.

Nodes

This specification introduces two kinds of nodes: element types and attribute definitions. Also, this specification reuses two kinds of nodes defined by DOM3 Core specification [DOM3CORE] (but obsoleted by current DOM Standard [DOM]): entities and notations. Requirements on nodes are applied by the following rules:

Node types introduced by this specification (ElementTypeDefinition and AttributeDefinition): If this specification explicitly states requirements for them, they have to be followed. Otherwise, requirements mentioned in the DOM Standard have to be followed.
DocumentType: If this specification explicitly states requirements for it, they have to be followed. Otherwise, requirements mentioned in the DOM Standard have to be followed.

This specification defines how nodes attached to the DocumentType (e.g. ElementTypeDefinition and Entity) affect attributes and methods of the nodes. When the DocumentType has no such attached node, they are expected to behave as specified by the DOM Standard.
Node types defined in DOM3 Core specification but obsoleted by DOM Standard (Entity and Notation): If this specification explicitly states requirements for them, they have to be followed. Otherwise, requirements mentioned in the DOM3 Core specification have to be followed.
Other node types: Requirements in the DOM Standard have to be followed.

Nodes of types ElementTypeDefinition, AttributeDefinition, Entity, and Notation cannot contain, or cannot be contained by, any kind of node as child.

Historically, DOM3 Core specification and earlier versions of this specification have allowed AttributeDefinition and Entity nodes containing children.

partial interface Node {
  const unsigned short ELEMENT_TYPE_DEFINITION_NODE = 81001;
  const unsigned short ATTRIBUTE_DEFINITION_NODE = 81002;
};

The nodeType attribute MUST return the following, depending on the context object:

ElementTypeDefinition: ELEMENT_TYPE_DEFINITION_NODE (81001)
AttributeDefinition: ATTRIBUTE_DEFINITION_NODE (81002)

The nodeName attribute MUST return the name (element type name or attribute definition name) of the node.

If the node has the attributes, localName, namespaceURI, and prefix attributes, they MUST return null.

The nodeValue attribute and the textContent attribute MUST return the following, depending on the context object:

AttributeDefinition: The default value of the context object.
Entity: The replacement text of the context object.

Setting the nodeValue attribute or the textContent attribute MUST do the following, depending on the context object:

AttributeDefinition: Set the default value of the context object to the new value.
Entity: Set the replacement text of the context object to the new value.

On setting, the textContent attribute MUST act as if it was the empty string instead if the new value is null.

When a node is cloned, the following values MUST be copied:

DocumentType: Its element types, general entities, and notations.

The DOM Standard defines more values to copy.
ElementTypeDefinition: Its name and attribute definitions.
AttributeDefinition: Its name, declared type, allowed tokens, default type, and default value.
Entity: Its name, public ID, system ID, notation name, and replacement text.

The encoding is not copied.
Notation: Its name, public ID, and system ID.

For the equality of two nodes, equality of the following values MUST also be taken into account:

DocumentType: Its element types, general entities, and notations.
ElementTypeDefinition: Its name and attribute definitions.
AttributeDefinition: Its name, declared type, allowed tokens, default type, and default value.
Entity: Its name, public ID, system ID, notation name, and replacement text.
Notation: Its name, public ID, and system ID.

For the comparison of node sets, only the length and equality of nodes with the same nodeName in the set are significant. The order of the nodes is not ignored for the purpose of the comparison. For the comparison of allowed tokens, the length of items and existence of the values in both are taken into account.

If at least one of the two nodes compared by the compareDocumentPosition method (i.e. reference and other) is an ElementTypeDefinition, AttributeDefinition, Entity, or Notation and they are different nodes, the following rules are applied:

If both nodes are of these four node types and their owners are the same node,
- If other's node type is less than reference's node type, the method MUST return the result of adding DOCUMENT_POSITION_PRECEDING to DOCUMENT_POSITION_IMPLEMENTATION_SPECIFIC.
- Otherwise, if other's node type is greater than reference's node type, the method MUST return the result of adding DOCUMENT_POSITION_FOLLOWING to DOCUMENT_POSITION_IMPLEMENTATION_SPECIFIC.
- Otherwise, if other's nodeName is less than reference's nodeName in code point order, the method MUST return the result of adding DOCUMENT_POSITION_PRECEDING to DOCUMENT_POSITION_IMPLEMENTATION_SPECIFIC.
- Otherwise, the method MUST return the result of adding DOCUMENT_POSITION_FOLLOWING to DOCUMENT_POSITION_IMPLEMENTATION_SPECIFIC.
Otherwise, the owner MUST be considered as the parent of the node for the purpose of comparison. The nodes of these four types MUST be considered as preceding any child node of the owner.
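The case in which both nodes are of these four node types and share an owner might be sketched as follows. This is an illustrative Python sketch only: the function name and its parameters are hypothetical, and the three constants use the bit values defined for compareDocumentPosition in DOM.

```python
# Bit values as defined for compareDocumentPosition in DOM.
DOCUMENT_POSITION_PRECEDING = 0x02
DOCUMENT_POSITION_FOLLOWING = 0x04
DOCUMENT_POSITION_IMPLEMENTATION_SPECIFIC = 0x20


def compare_same_owner(reference_type: int, reference_name: str,
                       other_type: int, other_name: str) -> int:
    """Order two distinct nodes with the same owner: node type, then nodeName."""
    if other_type < reference_type:
        direction = DOCUMENT_POSITION_PRECEDING
    elif other_type > reference_type:
        direction = DOCUMENT_POSITION_FOLLOWING
    elif other_name < reference_name:
        # Python compares str values by code point, as the rule requires.
        direction = DOCUMENT_POSITION_PRECEDING
    else:
        direction = DOCUMENT_POSITION_FOLLOWING
    return direction + DOCUMENT_POSITION_IMPLEMENTATION_SPECIFIC
```

The sketch assumes the caller has already established that the two nodes are different and have the same owner.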

A node of type ElementTypeDefinition, Entity, or Notation has an associated owner document type definition. The AttributeDefinition node has an associated owner element type definition. Their values are null unless otherwise specified. When not ambiguous, they are simply referred to as owner of the node.

The ownerDocumentTypeDefinition attribute of the ElementTypeDefinition, Entity, and Notation interfaces MUST return the owner document type definition of the context object. The ownerElementTypeDefinition attribute of the AttributeDefinition interface MUST return the owner element type definition of the context object.

The publicId attribute and the systemId attribute of the Entity and Notation interfaces MUST return the public ID and system ID of the context object, respectively.

On setting, the publicId attribute and the systemId attribute of the DocumentType, Entity, and Notation interfaces MUST set the public ID and system ID of the context object, respectively, to the new value.

In DOM Standard and DOM3 Core specifications, these attributes are read-only.

Setting an invalid identifier to these attributes might make the node unserializable in the XML syntax.

When a NamedNodeMap collection represents element types, general entities, notations, or attribute definitions, methods getNamedItemNS, setNamedItemNS, and removeNamedItemNS of the NamedNodeMap interface [DOM3CORE] MUST throw a "NotSupportedError" exception.

Documents

Several factory methods are added to the Document interface.

partial interface Document {
  DocumentType createDocumentTypeDefinition(DOMString name);
  ElementTypeDefinition createElementTypeDefinition(DOMString name);
  AttributeDefinition createAttributeDefinition(DOMString name);
  Entity createGeneralEntity(DOMString name);
  Notation createNotation(DOMString name);
};

Earlier versions of this specification called this WebIDL fragment the DocumentXDoctype interface.

The createDocumentTypeDefinition(name) method MUST run these steps:

If name does not match the Name production, throw an "InvalidCharacterError" exception and terminate these steps.
Return a new doctype, with name as its name and with its node document set to the context object.

The createElementTypeDefinition(name) method MUST run these steps:

If name does not match the Name production, throw an "InvalidCharacterError" exception and terminate these steps.
Return a new element type, with name as its name and with its node document set to the context object.

The createAttributeDefinition(name) method MUST run these steps:

If name does not match the Name production, throw an "InvalidCharacterError" exception and terminate these steps.
Return a new attribute definition, with name as its name and with its node document set to the context object.

The createGeneralEntity(name) method MUST run these steps:

If name does not match the Name production, throw an "InvalidCharacterError" exception and terminate these steps.
Return a new general entity, with name as its name and with its node document set to the context object.

The createNotation(name) method MUST run these steps:

If name does not match the Name production, throw an "InvalidCharacterError" exception and terminate these steps.
Return a new notation, with name as its name and with its node document set to the context object.

name does not have to be a namespace qualified name.

Document types

Each doctype has three associated unordered sets of nodes: element types, general entities, and notations. Unless otherwise specified, they MUST be empty when the doctype is created.

The element types set can only contain element types.

The elementTypes attribute of the DocumentType interface MUST return a NamedNodeMap collection containing the nodes in the element types set of the context object, sorted by their nodeName in code point order.

The general entities of the doctype are exposed by the entities attribute of the DocumentType interface [DOM3CORE]. The generalEntities attribute of the DocumentType interface MUST return the same object as the entities attribute. The nodes in the entities collection MUST be sorted by their nodeName in code point order.

The notations of the doctype are exposed by the notations attribute of the DocumentType interface [DOM3CORE]. The nodes in the notations collection MUST be sorted by their nodeName in code point order.

Need to merge with manakai-allow-doctype-children configuration parameter spec...

A DocumentType node MAY contain zero or more ProcessingInstruction nodes in the NodeList object contained in its childNodes attribute.

If the DocumentType node is created during the process to create a DOM from an XML document, the NodeList object in the childNodes attribute MUST contain the ProcessingInstruction nodes representing the processing instructions in the document type definition of the document processed [XML, XML11] by the XML processor.

If a DocumentType node is created from a document type declaration information item [INFOSET], the NodeList object in the childNodes attribute of the node MUST contain the ProcessingInstruction nodes created from any processing instruction information items in the list in the [children] property of the document type declaration item, in the same order.

If a DocumentType node is mapped to a document type declaration information item, the list in the [children] property MUST contain the processing instruction information items created from the ProcessingInstruction nodes in the NodeList object in the childNodes attribute of the DocumentType node.

partial interface DocumentType {
  [TreatNullAs=EmptyString] attribute DOMString publicId;
  [TreatNullAs=EmptyString] attribute DOMString systemId;
  attribute DOMString? declarationBaseURI;
  attribute DOMString? manakaiDeclarationBaseURI;

  readonly attribute NamedNodeMap elementTypes;
  readonly attribute NamedNodeMap generalEntities;

  ElementTypeDefinition? getElementTypeDefinitionNode(DOMString name);
  Entity? getGeneralEntityNode(DOMString name);
  Notation? getNotationNode(DOMString name);

  ElementTypeDefinition? setElementTypeDefinitionNode(ElementTypeDefinition node);
  Entity? setGeneralEntityNode(Entity node);
  Notation? setNotationNode(Notation node);
  ElementTypeDefinition removeElementTypeDefinitionNode(ElementTypeDefinition node);
  Entity removeGeneralEntityNode(Entity node);
  Notation removeNotationNode(Notation node);
};

Earlier versions of this specification named this WebIDL fragment the DocumentTypeDefinition interface.

If the DocumentType node is created during the process to create a DOM from an XML document, the following requirements are applied: The NamedNodeMap object in the elementTypes attribute MUST be so transformed that the object contains the ElementTypeDefinition nodes for the element types whose name is presented as the Name of the element type or attribute definition list declarations processed [XML, XML11] by the XML processor. If there is more than one element type declaration for an element type, then the declarations other than the first one MUST be ignored for the purpose of constructing the NamedNodeMap object.

Not all entities declared in the document type definition contained in or referenced from the document entity are necessarily exposed through this collection, depending on the information provided by the XML processor to the DOM implementation. In particular, it might not contain any entity if entity references are expanded at parse time. An implementation MUST NOT expose an Entity node whose nodeName is equal to the name of one of the five predefined general entities in XML through the collection as the result of parsing an XML document that has no error. Duplicate entity declarations are also discarded.

The entities attribute MUST return the NamedNodeMap object that contains all the Entity nodes representing general entities belonging to the node.

If the DocumentType node is created from an XML document, duplicate notation declarations, if any, in the DTD MUST NOT result in a node in the NamedNodeMap object, and only the first declaration MUST be made available as a Notation node.

This definition is based on the one for the notations attribute of the DocumentType interface in the DOM XML module. Since duplication is a violation of the validity constraint, XML parsers might vary in how notations are reported to the application. In particular, the [notations] property of the document information item in the XML Information Set is so defined that if any notation is declared multiple times then the property has no value.

The notations attribute MUST return the NamedNodeMap object that contains all the Notation nodes representing notations belonging to the node.

getElementTypeDefinitionNode, method

Returns the ElementTypeDefinition node with the specified name.

The name parameter is the name of the element type.

When invoked, the method MUST return the ElementTypeDefinition node, whose nodeName is equal to name, in the NamedNodeMap object in the elementTypes attribute of the node. If there is no such node, it MUST return null.

getGeneralEntityNode, method

Returns the Entity node with the specified name.

The name parameter is the name of the general entity.

When invoked, the method MUST return the Entity node, whose nodeName is equal to name, in the NamedNodeMap object in the entities attribute of the node. If there is no such node, it MUST return null.

getNotationNode, method

Returns the Notation node with the specified name.

The name parameter is the name of the notation.

When invoked, the method MUST return the Notation node, whose nodeName is equal to name, in the NamedNodeMap object in the notations attribute of the node. If there is no such node, it MUST return null.

setElementTypeDefinitionNode, method

The setElementTypeDefinitionNode(node) method MUST return the result of setting node to the context object's element types.

setGeneralEntityNode, method

The setGeneralEntityNode(node) method MUST return the result of setting node to the context object's general entities.

setNotationNode, method

The setNotationNode(node) method MUST return the result of setting node to the context object's notations.

removeElementTypeDefinitionNode, method

The removeElementTypeDefinitionNode(node) method MUST return the result of removing node from the context object's element types.

removeGeneralEntityNode, method

The removeGeneralEntityNode(node) method MUST return the result of removing node from the context object's general entities.

removeNotationNode, method

The removeNotationNode(node) method MUST return the result of removing node from the context object's notations.

To remove a node node from the set of nodes set, run these steps:

If set does not contain node, throw a "NotFoundError" exception and abort these steps.
Remove node from set.
Set the owner of node to null.
Return node.

To set a node node to the set of nodes set associated to the node owner, run these steps:

If the owner of node is not null and is not equal to owner, throw a "HierarchyRequestError" and abort these steps.
Adopt node into the node document of owner.
Let oldNode be null.
If there is a node whose nodeName is equal to node's nodeName in set, let oldNode be the node and remove oldNode from set.
Add node to set.
Set the owner of node to owner.
Return oldNode.
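The remove and set steps above might be sketched as follows. This is a minimal Python sketch under stated assumptions: NodeSketch, the two exception classes, and the use of a plain list for the set are all hypothetical stand-ins, and the adopt step is reduced to copying the owner's node document.

```python
class NotFoundError(Exception):
    """Stand-in for the "NotFoundError" exception."""


class HierarchyRequestError(Exception):
    """Stand-in for the "HierarchyRequestError" exception."""


class NodeSketch:
    def __init__(self, node_name, node_document=None):
        self.node_name = node_name
        self.node_document = node_document
        self.owner = None


def remove_node(node, node_set):
    if node not in node_set:
        raise NotFoundError(node.node_name)
    node_set.remove(node)
    node.owner = None
    return node


def set_node(node, node_set, owner):
    if node.owner is not None and node.owner is not owner:
        raise HierarchyRequestError(node.node_name)
    # Simplified stand-in for "adopt node into the node document of owner".
    node.node_document = owner.node_document
    # Find an existing node with the same nodeName, if any, and replace it.
    old_node = next((n for n in node_set if n.node_name == node.node_name), None)
    if old_node is not None:
        node_set.remove(old_node)
    node_set.append(node)
    node.owner = owner
    return old_node
```

A collection exposed through an attribute such as elementTypes could then be derived with sorted(node_set, key=lambda n: n.node_name), since Python orders strings by code point.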

Element types

The ElementTypeDefinition objects are simply known as element types. An element type represents the definition of an element, identified by its name.

Although an element type corresponds to the ELEMENT declaration in the XML DTD, it does not directly represent the ELEMENT declaration.

interface ElementTypeDefinition : Node {
  readonly attribute DocumentType? ownerDocumentTypeDefinition;

  readonly attribute NamedNodeMap attributeDefinitions;
  AttributeDefinition? getAttributeDefinitionNode(DOMString name);
  AttributeDefinition? setAttributeDefinitionNode(AttributeDefinition node);
  AttributeDefinition removeAttributeDefinitionNode(AttributeDefinition node);
};

Each element type has an associated unordered set of nodes, its attribute definitions. Unless otherwise specified, it MUST be empty when the element type is created. The attribute definitions set can only contain attribute definitions.

The attributeDefinitions attribute of the ElementTypeDefinition interface MUST return a NamedNodeMap collection containing the nodes in the attribute definitions set of the context object, sorted by their nodeName in code point order.

Even if there is more than one element type declaration for an element type in the DTD, the resulting DOM will contain only one ElementTypeDefinition node for that element type. In addition, if there are attribute definition declarations for an element type, the DOM will contain an ElementTypeDefinition node for that element type even when there is no element type declaration for it.

If the ElementTypeDefinition node is created during the process to create a DOM from an XML document, the following requirements are applied: The NamedNodeMap object in the attributeDefinitions attribute MUST be so transformed that the object contains the AttributeDefinition nodes corresponding to the attribute definitions in the attribute definition list declarations processed [XML, XML11] by the XML processor and associated with the element type represented by the node. If there is more than one attribute definition for an attribute, then the definitions other than the first one MUST be ignored for the purpose of constructing the NamedNodeMap object.

getAttributeDefinitionNode, method

Returns the AttributeDefinition node with the specified name.

The name parameter is the name of the attribute.

When invoked, the method MUST return the AttributeDefinition node, whose nodeName attribute value is equal to name, in the NamedNodeMap in the attributeDefinitions attribute of the node. If there is no such node, it MUST return null.

The setAttributeDefinitionNode(node) method of the ElementTypeDefinition object MUST return the result of setting node to the context object's attribute definitions.

The removeAttributeDefinitionNode(node) method of the ElementTypeDefinition object MUST return the result of removing node from the context object's attribute definitions.

A future version of the specification might define a set of attributes and methods for representing and accessing the content model of the element type.

partial interface ElementTypeDefinition {
  attribute DOMString? contentModelText;
};

Attribute definitions

A node of type ATTRIBUTE_DEFINITION_NODE represents an attribute definition. Such a node MUST implement the AttributeDefinition interface, which extends the Node interface.

Each attribute definition has an associated name.

Each attribute definition has associated declared type, allowed tokens, default type, and default value. Unless otherwise specified, they are initialized to their default values when the attribute definition is created. Their default values are NO_TYPE_ATTR, the empty list, UNKNOWN_DEFAULT, and the empty string, respectively.

An attribute definition represents a definition of an attribute associated with an element type. It corresponds to an attribute definition in an attribute list declaration in the DTD. However, an AttributeDefinition node does not represent the attribute definition in the DTD itself. Even if there is more than one attribute definition for an attribute of an element type in the DTD, the resulting DOM will contain only one AttributeDefinition node for that attribute.

interface AttributeDefinition : Node {
  // DefaultValueType
  const unsigned short UNKNOWN_DEFAULT = 0;
  const unsigned short FIXED_DEFAULT = 1;
  const unsigned short REQUIRED_DEFAULT = 2;
  const unsigned short IMPLIED_DEFAULT = 3;
  const unsigned short EXPLICIT_DEFAULT = 4;

  readonly attribute ElementTypeDefinition? ownerElementTypeDefinition;
  attribute unsigned short declaredType;
  attribute DOMString[] allowedTokens;
  attribute unsigned short defaultType;
};

[NoInterfaceObject]
interface AttrDeclaredValueType {
  // DeclaredValueType
  const unsigned short NO_TYPE_ATTR = 0;
  const unsigned short CDATA_ATTR = 1;
  const unsigned short ID_ATTR = 2;
  const unsigned short IDREF_ATTR = 3;
  const unsigned short IDREFS_ATTR = 4;
  const unsigned short ENTITY_ATTR = 5;
  const unsigned short ENTITIES_ATTR = 6;
  const unsigned short NMTOKEN_ATTR = 7;
  const unsigned short NMTOKENS_ATTR = 8;
  const unsigned short NOTATION_ATTR = 9;
  const unsigned short ENUMERATION_ATTR = 10;
  const unsigned short UNKNOWN_ATTR = 11;
};
AttributeDefinition implements AttrDeclaredValueType;
Attr implements AttrDeclaredValueType;

declaredType of type unsigned short

The declared type [XML, XML11] of the attribute. It is expected that this attribute contains a value from the definition group DeclaredValueType.

On getting, the attribute MUST return the value associated to this attribute.

On setting, it MUST set the specified value as the value associated to this attribute.

If the AttributeDefinition node is created during the process to create a DOM from an XML document, an appropriate value from the DeclaredValueType constant group MUST be set to the attribute.

The definition group DeclaredValueType contains integers indicating the declared type of attributes. The definition group contains the following constants:

Name	Value	Description
`NO_TYPE_ATTR`	`0`	No value [INFOSET].
`CDATA_ATTR`	`1`	`CDATA` [XML, XML11].
`ID_ATTR`	`2`	`ID` [XML, XML11].
`IDREF_ATTR`	`3`	`IDREF` [XML, XML11].
`IDREFS_ATTR`	`4`	`IDREFS` [XML, XML11].
`ENTITY_ATTR`	`5`	`ENTITY` [XML, XML11].
`ENTITIES_ATTR`	`6`	`ENTITIES` [XML, XML11].
`NMTOKEN_ATTR`	`7`	`NMTOKEN` [XML, XML11].
`NMTOKENS_ATTR`	`8`	`NMTOKENS` [XML, XML11].
`NOTATION_ATTR`	`9`	`NOTATION` [XML, XML11].
`ENUMERATION_ATTR`	`10`	Enumeration [XML, XML11].
`UNKNOWN_ATTR`	`11`	Unknown, because no declaration for the attribute has been read but the [all declarations processed] property [INFOSET] would be false.

If no attribute type information is available, or if the source of the information does not distinguish no value and unknown [INFOSET], then the value NO_TYPE_ATTR MUST be used.

An AttributeDefinition node created by the createAttributeDefinition method has its declaredType attribute set to NO_TYPE_ATTR.

If the source of the information does not distinguish no value and/or unknown [INFOSET] and CDATA [XML, XML11], then the value CDATA_ATTR MUST be used.

In Perl binding [DOMPERL], the Attr nodes MUST implement the DeclaredValueType definition group.

allowedTokens of type DOMString[]

The list of allowed attribute values.

On getting, the attribute MUST return the DOMStringList object associated to this attribute. The object MAY contain zero or more ordered strings, each consisting of zero or more characters, possibly with duplications.

If the AttributeDefinition node is created during the process to create a DOM from an XML document, the object MUST contain the names or name tokens allowed for the attribute defined by the node. If the document is well-formed, the object will be empty unless the declaredType is ENUMERATION_ATTR or NOTATION_ATTR.

If the declaredType is different from ENUMERATION_ATTR or NOTATION_ATTR, this attribute MUST be ignored for the purpose of serializing into (part of) XML document.

When serializing the node it should be noted that the object might be empty, might contain duplications, and might contain strings that are not names or name tokens.

defaultType of type unsigned short

The type of the default for the attribute. It is expected that this attribute contains a value from the definition group DefaultValueType.

On getting, the attribute MUST return the value associated to this attribute.

On setting, it MUST set the specified value as the value associated to this attribute.

If the AttributeDefinition node is created during the process to create a DOM from an XML document, an appropriate value from the DefaultValueType definition group MUST be set to the attribute.

The definition group DefaultValueType contains integers indicating the type of the default for the attribute. The definition group contains the following constants:

Name	Value	Description
`UNKNOWN_DEFAULT`	`0`	Unknown.
`FIXED_DEFAULT`	`1`	Provided explicitly, and only that value is allowed [XML, XML11].
`REQUIRED_DEFAULT`	`2`	No default value, and the attribute has to be explicitly specified.
`IMPLIED_DEFAULT`	`3`	Implied [XML, XML11].
`EXPLICIT_DEFAULT`	`4`	Provided explicitly.

If the source of the default type does not distinguish implied and unknown default types, then the value IMPLIED_DEFAULT MUST be used.

An AttributeDefinition node created by the createAttributeDefinition method has its defaultType attribute set to UNKNOWN_DEFAULT.

General entities

The Entity node is known as general entity, or when not ambiguous, simply entity.

Each entity has an associated public ID and system ID. Unless otherwise specified, their values are the empty string when the entity is created.

partial interface Entity {
  readonly attribute DocumentType? ownerDocumentTypeDefinition;

  [TreatNullAs=EmptyString] attribute DOMString publicId;
  [TreatNullAs=EmptyString] attribute DOMString systemId;
  attribute DOMString? declarationBaseURI;
  attribute DOMString? manakaiDeclarationBaseURI;
  attribute DOMString? notationName;
  readonly attribute boolean hasReplacementTree;
  attribute DOMString? manakaiEntityURI;
  attribute DOMString? manakaiEntityBaseURI;
  attribute boolean isExternallyDeclared;
};

Each entity has an associated notation name. Unless otherwise specified, it is null when the entity is created.

The notationName attribute returns the notation name of the context object.

On setting, the notationName attribute of the Entity interface MUST set the notation name of the context object to the new value.

In DOM3 Core specification, this attribute was read-only.

Setting an invalid name to this attribute would make the node unserializable in the XML syntax.

Each entity has an associated replacement text. Unless otherwise specified, the replacement text is the empty string when an entity is created.

isExternallyDeclared of type boolean

Whether the entity is declared by an external entity declaration or not. If the value is true, the entity is declared in an entity declaration in the external subset entity or in an external parameter entity. If the value is false, the entity is declared in an entity declaration in the internal subset, or the node is created in memory.

On getting, the attribute MUST return the value associated to this attribute.

On setting, it MUST set the specified value as the value associated to this attribute.

If the Entity node is created during the process to create a DOM from an XML document, the following requirements are applied: If the entity is an unparsed entity, then the attribute MUST be set to false. Otherwise, i.e. if the entity is a parsed entity, the attribute MUST be set to true if the entity is declared by an external markup declaration, or to false otherwise.

An entity has an associated entity URL and entity base URL. Unless otherwise specified, their values are null.

On getting, the manakaiEntityURI attribute of the Entity interface MUST run these steps:

If the entity URL of the context object is not null, return it and abort these steps.
Otherwise, if the system ID of the context object is not the empty string, resolve it relative to the effective declaration base URL of the context object. If it succeeded, return the result and abort these steps.
Return null.

On setting, the attribute MUST run these steps:

If the new value is null, set the entity URL of the context object to null and abort these steps.
Resolve the new value relative to the effective declaration base URL of the context object. Set the entity URL of the context object to the result if succeeded, or null otherwise.

It is expected that an XML parser supporting this specification sets the entity URL when an external entity is read.
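The getter and setter steps for manakaiEntityURI might be sketched as follows. This is a rough Python sketch, not the manakai implementation: EntitySketch and its attribute names are hypothetical, the effective declaration base URL is modeled as a plain attribute, and resolution uses urllib.parse.urljoin, treating a result without a scheme as a failed resolution.

```python
from typing import Optional
from urllib.parse import urljoin, urlsplit


class EntitySketch:
    def __init__(self, system_id: str = "", base_url: Optional[str] = None):
        self.system_id = system_id
        self.effective_declaration_base_url = base_url  # assumed to be given
        self.entity_url: Optional[str] = None

    def _resolve(self, reference: str) -> Optional[str]:
        """Resolve reference against the base URL; None stands for failure."""
        try:
            resolved = urljoin(self.effective_declaration_base_url or "", reference)
        except ValueError:
            return None
        return resolved if urlsplit(resolved).scheme else None

    @property
    def manakai_entity_uri(self) -> Optional[str]:
        if self.entity_url is not None:
            return self.entity_url
        if self.system_id != "":
            resolved = self._resolve(self.system_id)
            if resolved is not None:
                return resolved
        return None

    @manakai_entity_uri.setter
    def manakai_entity_uri(self, value: Optional[str]) -> None:
        self.entity_url = None if value is None else self._resolve(value)
```

For instance, an entity with system ID chap1.xml declared relative to http://example.com/dtd/main.dtd would report http://example.com/dtd/chap1.xml until an entity URL is set explicitly.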

Notations

The Notation node is simply known as notation.

Each notation has an associated public ID and system ID. Unless otherwise specified, their values are the empty string when the notation is created.

partial interface Notation {
  readonly attribute DocumentType? ownerDocumentTypeDefinition;
  [TreatNullAs=EmptyString] attribute DOMString publicId;
  [TreatNullAs=EmptyString] attribute DOMString systemId;
  attribute DOMString? declarationBaseURI;
  attribute DOMString? manakaiDeclarationBaseURI;
};

Conformance checking of XML documents

If there is a parse error, the document is not well-formed.

...

Much of the parsing of invalid (well-formed or not) XML documents, and of XML document / XML DOM conformance, is left undefined, so this document only provides a guideline for conformance checkers.

Processing Model

Conceptually, validation of an XML document is split into two stages for the purpose of this specification: the XML document parsing stage and the DOM XML conformance checking stage.

The input to the XML document parsing stage is a byte sequence representing the parsed XML document (and any additional metadata), and the output is a DOM tree representing the XML document and zero or more errors. The processor that implements this stage is called parser. Requirements for a parser are defined in the section of Parsing an XML Document.

The input to the DOM XML conformance checking stage is a DOM tree, and the output is zero or more errors. The processor that implements this stage is called conformance checker. Requirements for a conformance checker are defined in the section of Checking an XML DOM Tree.

Error Classification

An error is ...

If a Document node has no xml-well-formedness-error, entity-error, or unknown-error, then it is well-formed. If a well-formed Document node has no xml-validity-error, it is valid.

A well‐formed Document can be safely serialized into a well‐formed XML document. A valid Document can be easily serialized into a valid XML document.
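The relationship between error categories and the two document states might be sketched as follows. This is an illustrative Python sketch only; the function name and the representation of errors as category strings are assumptions.

```python
from typing import Iterable, Tuple

# Categories that prevent a Document from being well-formed.
NOT_WELL_FORMED = {"xml-well-formedness-error", "entity-error", "unknown-error"}


def classify(error_categories: Iterable[str]) -> Tuple[bool, bool]:
    """Return (well_formed, valid) for the categories raised on a Document."""
    categories = set(error_categories)
    well_formed = not (categories & NOT_WELL_FORMED)
    valid = well_formed and "xml-validity-error" not in categories
    return well_formed, valid
```

Other categories, such as round-trip-warning or misc-info, do not affect either state in this sketch.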

To be a conforming validating XML processor, ...

Errors are classified into these error categories:

entity-error: @@

This algorithm does not support DOM trees with one or more EntityReference nodes. It is expected that any entity references are expanded at parse time and that any unexpandable entity reference raises a parse-time error, so that parsing never results in a DOM tree with EntityReference nodes.
round-trip-error: @@
round-trip-warning: A round-trip-warning will be raised when a construct, which might not be restored to the same construct when it is serialized and then re-parsed by a conforming processor, is encountered.

For a Comment node a round-trip-warning will be raised, since XML processors are not required to report the text of comments to applications.
unknown-error?: @@
xml-misc-error: An XML error (XML 1.0 [XML] error / XML 1.1 [XML11] error) that is not classified to any other error category.
xml-misc-fatal-error: An XML fatal error (XML 1.0 [XML] fatal error / XML 1.1 [XML11] fatal error) that is not classified to any other error category. @@ What errors fall into this category?
xml-misc-recommendation: An xml-misc-recommendation will be raised if a SHOULD‐level requirement in XML specification is not met.
xml-validity-error: A violation of validity constraint in XML document.
xml-well-formedness-error: If an xml-well-formedness-error is raised, it would not be possible to generate an XML serialization that would match the appropriate production rule and that would not violate any well‐formedness constraint in the XML specification [XML, XML11].
misc-info: A misc-info is raised when some status information on the parsing or checking process that is considered useful for debugging and so on is available. It by no means implies the non-conformance of the document.

@@ TODO: #dt-atuseroption at user option (MAY or MUST), #dt-compat for compatibility, #dt-interop for interoperability

TODO: XML 1.1, XML Namespace 1.0/1.1, xml:base, xml:id

TODO: XML "error"/"fatal error" is not always non-conforming (only when MUST or SHOULD).

Checking an XML DOM Tree

The following algorithms and definitions are applied to XML documents; especially, they are not applied to HTML documents.

Definitions

The XML version of a node is the XML version of the document to which the node belongs. For a Document node, the XML version of the document is the value of the xmlVersion attribute of the node. For a DocumentType node whose ownerDocument attribute is set to null, the XML version of the document is 1.0. For any other node, the XML version of the document is that of the Document node contained in the ownerDocument attribute of the node.
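The definition above might be sketched as follows. This is a minimal Python sketch with hypothetical DocumentSketch and NodeSketch classes; a node that is neither a document nor a document type and has no owner document is not covered by the definition, so the sketch raises an error for it.

```python
from typing import Optional


class DocumentSketch:
    def __init__(self, xml_version: str = "1.0"):
        self.xml_version = xml_version


class NodeSketch:
    def __init__(self, owner_document: Optional[DocumentSketch] = None,
                 is_document_type: bool = False):
        self.owner_document = owner_document
        self.is_document_type = is_document_type


def xml_version_of(node) -> str:
    if isinstance(node, DocumentSketch):
        return node.xml_version
    if node.owner_document is not None:
        return node.owner_document.xml_version
    if node.is_document_type:
        # A DocumentType node without an owner document defaults to 1.0.
        return "1.0"
    raise ValueError("case not covered by the definition")
```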

Conformance Checking Algorithms for Components

To validate an XML string (s), the following algorithm MUST be used:

If s contains a character that is not in the character class Char10, then raise an xml-well-formedness-error.
If s contains a character that is in the character class CompatChar10, then raise an xml-misc-warning.
If s contains a character that is in the character class ControlChar10, then raise an xml-misc-warning.
@@ XML 1.1 support
If s contains a U+000D CARRIAGE RETURN character, then raise a round-trip-error. @@ We should not raise duplicate errors for U+000D in attribute values. In addition, we should support a mode where U+000D will be serialized as a character reference (so that no round-trip-error will be raised).
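Part of the XML string check might be sketched as follows. This is a partial Python sketch under stated assumptions: Char10 is assumed to be the Char production of XML 1.0, the CompatChar10 and ControlChar10 steps are omitted because those classes are not defined in this section, and errors are returned as a list of category names rather than raised.

```python
import re
from typing import List

# Assumed to mirror the XML 1.0 Char production:
# #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
_NON_CHAR10 = re.compile(
    r"[^\t\n\r\x20-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]")


def validate_xml_string(s: str) -> List[str]:
    errors = []
    if _NON_CHAR10.search(s):
        errors.append("xml-well-formedness-error")
    # The CompatChar10 and ControlChar10 warnings are not sketched here.
    if "\r" in s:
        errors.append("round-trip-error")
    return errors
```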

To validate a Name (s), the following algorithm MUST be used:

If s is an empty string, then raise an xml-well-formedness-error. Abort these steps.
Validate s as an XML string.
If the first character in s is a character that is not in the character class NameStartChar10, then raise an xml-well-formedness-error.
If a character other than the first character in s is a character that is not in the character class NameChar10, then raise an xml-well-formedness-error.
If s begins with the string xml (in any case combination), then raise an xml-misc-warning. @@ except for attribute names xml:lang, xml:space.
@@ XML 1.1 support
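The Name check might be sketched as follows. This is a Python sketch under stated assumptions: NameStartChar10 and NameChar10 are assumed to match the NameStartChar and NameChar productions of XML 1.0 Fifth Edition (earlier editions define Name differently), step 2 (validating s as an XML string) is omitted for brevity, and errors are returned as a list of category names.

```python
import re
from typing import List

# Assumption: character classes follow XML 1.0 Fifth Edition.
_START = (r":A-Z_a-z\xC0-\xD6\xD8-\xF6\xF8-\u02FF\u0370-\u037D\u037F-\u1FFF"
          r"\u200C-\u200D\u2070-\u218F\u2C00-\u2FEF\u3001-\uD7FF"
          r"\uF900-\uFDCF\uFDF0-\uFFFD\U00010000-\U000EFFFF")
_REST = _START + r"\-.0-9\xB7\u0300-\u036F\u203F-\u2040"
_NAME_START_CHAR = re.compile("[" + _START + "]")
_NAME_CHAR = re.compile("[" + _REST + "]")


def validate_name(s: str) -> List[str]:
    if s == "":
        return ["xml-well-formedness-error"]
    errors = []
    # Step 2 (validate s as an XML string) is omitted in this sketch.
    if not _NAME_START_CHAR.fullmatch(s[0]):
        errors.append("xml-well-formedness-error")
    if any(not _NAME_CHAR.fullmatch(c) for c in s[1:]):
        errors.append("xml-well-formedness-error")
    if s[:3].lower() == "xml":
        errors.append("xml-misc-warning")
    return errors
```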

To validate an NCName (s), the following algorithm MUST be used:

Validate s as a Name.
@@

To validate a public identifier (pid), the following algorithm MUST be used:

If pid is null, abort these steps.
If pid contains a character that is not in the character class PubidChar, then raise an xml-well-formedness-error.
If pid contains one of U+0009 CHARACTER TABULATION, U+000A LINE FEED, and U+000D CARRIAGE RETURN characters, if the first character of pid is U+0020 SPACE character, if the last character of pid is U+0020 SPACE character, or if there is a U+0020 SPACE character immediately followed by another U+0020 SPACE character in pid, then it is a round-trip-error. Is this really a roundtripness problem? In fact, the XML specification only defines how to match public identifiers, not a canonical form.
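The public identifier check might be sketched as follows. This is a Python sketch, assuming that PubidChar is the production of that name in XML 1.0 and that errors are returned as a list of category names; the function name is hypothetical.

```python
import re
from typing import List, Optional

# PubidChar ::= #x20 | #xD | #xA | [a-zA-Z0-9] | [-'()+,./:=?;!*#@$_%]
_NON_PUBID_CHAR = re.compile(r"[^\x20\x0D\x0Aa-zA-Z0-9\-'()+,./:=?;!*#@$_%]")


def validate_public_id(pid: Optional[str]) -> List[str]:
    if pid is None:
        return []
    errors = []
    if _NON_PUBID_CHAR.search(pid):
        errors.append("xml-well-formedness-error")
    if (re.search(r"[\t\n\r]", pid)      # U+0009, U+000A, or U+000D
            or pid.startswith(" ")
            or pid.endswith(" ")
            or "  " in pid):             # two consecutive U+0020 characters
        errors.append("round-trip-error")
    return errors
```

Note that U+0009 is not a PubidChar, so in this sketch a tab yields both error categories.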

To validate a system identifier (sid), the following algorithm MUST be used:

If sid is null, abort these steps.
Validate sid as an XML string.
If sid contains both U+0022 QUOTATION MARK (") and U+0027 APOSTROPHE (') characters, raise an xml-well-formedness-error.
If sid contains at least one U+0023 NUMBER SIGN (#) character, then raise an xml-misc-error.
@@ If sid cannot be converted to a URI reference, then raise a fact-level error (xml-misc-warning?).
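
As a rough sketch of the system identifier steps (the function name and return style are hypothetical, and the "validate sid as an XML string" step and the URI reference check are omitted):

```python
def validate_system_id(sid):
    """Collect the error types raised for a system identifier (or None)."""
    errors = []
    if sid is None:
        return errors  # abort these steps
    if '"' in sid and "'" in sid:
        # No quoting style can delimit a literal that contains both quotes.
        errors.append("xml-well-formedness-error")
    if "#" in sid:
        # A fragment identifier is present.
        errors.append("xml-misc-error")
    return errors
```

A system literal is delimited by either quotation marks or apostrophes, so an identifier containing both cannot be written in a document at all, whereas one containing only a single kind can be delimited with the other.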

Checking `Node`

The algorithm to check a node (n) is defined as follows:

If n is an Attr node

Validate the localName attribute value as an NCName.
If the prefix attribute value is different from null, then validate the prefix attribute value as an NCName.
For each node n_c in the childNodes list of n,
1. If n_c is not a Text or EntityReference node, then it is an xml-well-formedness-error.
2. Otherwise, if n_c is an EntityReference node, then it is an entity-error.
3. Otherwise, check n_c recursively.
If nodeName attribute of n is xml:space @@ or {xml namespace}:space ? and value attribute of n is neither default nor preserve, then it is an xml-misc-error.
@@ xml:lang value is not a language tag [RFC 3066 or its successor] or an empty string, then xml-misc-warning (a "fact"-level error; not an XML error).
@@ specified, manakaiAttributeType (#ValueType Validity constraint: Attribute Value Type)
Let v be the value of the value attribute of n.
Validate n against its declared type as follows:
ID_ATTR
1. Validate v as a Name. If it fails, then raise an xml-validity-error.
2. If ID v is defined, then raise an xml-validity-error.
IDREF_ATTR
1. Validate v as a Name. If it fails, then raise an xml-validity-error.
2. If ID v is NOT defined, then raise an xml-validity-error.
IDREFS_ATTR

@@

ENTITY_ATTR
1. Validate v as a Name. If it fails, then raise an xml-validity-error.
2. If Entity v is NOT defined, then raise an xml-validity-error.
ENTITIES_ATTR

@@

NMTOKEN_ATTR
1. Validate v as an Nmtoken. If it fails, then raise an xml-validity-error.
NMTOKENS_ATTR

@@

NOTATION_ATTR

v must be one of the enumerated values. If not, then raise an xml-validity-error.

ENUMERATED_ATTR

v must be one of the enumerated values. If not, then raise an xml-validity-error.

@@
If the declared type is ID and the default is neither #IMPLIED nor #REQUIRED, then raise an xml-validity-error.
@@ #FixedAttr Validity constraint: Fixed Attribute Default
@@ strict serialization error for U+000D, U+000A, and U+0009 characters, leading/trailing U+0020, and U+0020{2,} string?

If n is an AttributeDefinition node

If nodeName attribute of n is xml:space @@ or {xml namespace}:space ? and its declared type is different from (default|preserve), (preserve|default), (default), or (preserve), then raise an xml-misc-error.
For each node n_c in the childNodes list of n,
1. If n_c is not a Text or EntityReference node, then it is an xml-well-formedness-error.
2. Otherwise, if n_c is an EntityReference node, then it is an entity-error.
3. Otherwise, check n_c recursively.
If the declared type is NOTATION_ATTR, the enumerated values MUST be declared. If not, then raise an xml-validity-error.
If the declared type is NOTATION_ATTR or ENUMERATED_ATTR, the enumerated values MUST all be distinct. If not, then raise an xml-validity-error.
If the declared type is NOTATION_ATTR and the element type is declared EMPTY, then raise an xml-validity-error.
@@ #defattrvalid Validity constraint: Attribute Default Value Syntactically Correct

If n is a CDATASection node

Validate the data attribute value as an XML character data.
If the data attribute value contains a string ]]>, then raise an xml-well-formedness-error.
If the childNodes list of n contains any nodes, then raise an xml-well-formedness-error.

If n is a Comment node

Raise a round-trip-warning.
Validate the data attribute value as an XML character data.
If the data attribute value contains a string --, or if it ends with a character -, then raise an xml-well-formedness-error.
If the childNodes list of n contains any nodes, then raise an xml-well-formedness-error.
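
The data checks for a Comment node might look like the following sketch. The function name and the list-of-errors return style are assumptions; character data validation and the childNodes check are left out.

```python
def check_comment_data(data):
    """Collect the error types raised for the data of a Comment node."""
    errors = ["round-trip-warning"]  # raised for every Comment node
    if "--" in data or data.endswith("-"):
        errors.append("xml-well-formedness-error")
    return errors
```

The trailing hyphen case matters because data such as a- would serialize as <!--a--->, which the Comment production of XML 1.0 does not match.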

If n is a Document node

If XML version of n is different from 1.0 or 1.1, then it is an unknown-error?.
If the xmlEncoding attribute value does not match [A-Za-z] ([A-Za-z0-9._] | '-')* @@ formal def, then it is an xml-well-formedness-error.
The childNodes list of n has to consist of zero or more Comment and/or ProcessingInstruction nodes, followed by an optional DocumentType node, followed by zero or more Comment and/or ProcessingInstruction nodes, followed by an Element node, followed by zero or more Comment and/or ProcessingInstruction nodes. Any violation of this is an xml-well-formedness-error.
For each node n_c in the childNodes list of n,
1. If n_c is not an EntityReference node, then check n_c recursively.
@@ allDeclarationsProcessed
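
One way to test the required order of a Document node's children is to map each child to a one-letter code and match the resulting string against a regular expression. The sketch below assumes the children are given as a list of node type names; the names CODES, ORDER, and document_children_ok are hypothetical.

```python
import re

# One-letter codes for the node types permitted as children of a Document.
CODES = {
    "Comment": "c",
    "ProcessingInstruction": "p",
    "DocumentType": "d",
    "Element": "e",
}
# misc*, optional doctype, misc*, exactly one element, misc*
ORDER = re.compile(r"[cp]*d?[cp]*e[cp]*")


def document_children_ok(child_types):
    """Return True if the sequence of child node types is permitted."""
    codes = "".join(CODES.get(t, "x") for t in child_types)  # "x" never matches
    return ORDER.fullmatch(codes) is not None
```

For example, document_children_ok(["Comment", "DocumentType", "Element"]) holds, while a DocumentType node placed after the Element node, a second Element node, or an empty child list is rejected.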

If n is a DocumentFragment node

For each node n_c in the childNodes list of n,
1. If n_c is not an Element, Text, CDATASection, Comment, ProcessingInstruction, or EntityReference node, then it is an xml-well-formedness-error.
2. Otherwise, if n_c is an EntityReference node, then it is an entity-error.
3. Otherwise, check n_c recursively.

If n is a DocumentType node

Validate the nodeName attribute value as an NCName.
Run the following substeps:
1. If the ownerDocument attribute of n is null, then abort these substeps.
2. If the documentElement attribute of the ownerDocument of n is null, then abort these substeps.
3. If the nodeName attribute of the documentElement of the ownerDocument of n is different from the nodeName attribute of n, then raise an xml-validity-error.
Validate the publicId attribute value as a public identifier.
Validate the systemId attribute value as a system identifier.
If the publicId attribute value of n is not null and the systemId attribute value of n is null, then raise an xml-well-formedness-error. @@ publicId == null? Or, publicId == ""
For each node n_c in the childNodes list of n,
1. If n_c is not a ProcessingInstruction node, then it is an xml-well-formedness-error. @@ ref to manakai's extensions
2. Otherwise, check n_c recursively.
For each node in the entities, notations, and elementTypes lists of n, check the node recursively.
@@ externally declared?
If the NamedNodeMap object in the entities attribute of n does not contain Entity nodes whose nodeName attribute values are amp, lt, gt, apos, and quot, then raise xml-misc-recommendation(s).

If n is an Element node

Validate the localName attribute value as an NCName.
If the prefix attribute value is different from null, then validate the prefix attribute value as an NCName.
For each node n_c in the childNodes list of n,
1. If n_c is not an Element, Text, CDATASection, Comment, ProcessingInstruction, or EntityReference node, then it is an xml-well-formedness-error.
2. Otherwise, if n_c is an EntityReference node, then it is an entity-error.
3. Otherwise, check n_c recursively.
@@ #elementvalid Validity constraint: Element Valid
Let attrs be the value of the attributes attribute of n. Check conformance of attrs as follows:
1. If attrs contains an Attr node whose nodeName attribute value is equal to that of another Attr node in attrs, then raise an xml-well-formedness-error.
2. @@ #RequiredAttr Validity constraint: Required Attribute
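
The duplicate check on attrs can be sketched by counting nodeName values. The input here is assumed to be a plain list of nodeName strings, and the function name is hypothetical.

```python
from collections import Counter


def duplicate_attr_names(node_names):
    """Return the nodeName values that occur more than once, sorted."""
    counts = Counter(node_names)
    return sorted(name for name, count in counts.items() if count > 1)
```

A non-empty result corresponds to an xml-well-formedness-error. Such duplicates can occur in a DOM built with namespace-aware methods, where two attributes in different namespaces may share the same qualified name.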

If n is an ElementTypeDefinition node

If the childNodes list of n contains any nodes, then raise an xml-well-formedness-error.
@@ At user option, an XML processor MAY issue a warning when a declaration mentions an element type for which no declaration is provided, but this is not an error.
@@ For compatibility, it is an error if the content model allows an element to match more than one occurrence of an element type in the content model.
@@ #vc-MixedChildrenUnique Validity constraint: No Duplicate Types
@@ At user option, an XML processor MAY issue a warning if attributes are declared for an element type not itself declared, but this is not an error.
If there is more than one AttributeDefinition node with attribute type ID in the NamedNodeMap list contained in the attributeDefinitions attribute of n, then raise an xml-validity-error.
If there is more than one AttributeDefinition node with attribute type NOTATION in the NamedNodeMap list contained in the attributeDefinitions attribute of n, then raise an xml-validity-error.
"For interoperability, the same Nmtoken SHOULD NOT occur more than once in the enumerated attribute types of a single element type."

If n is an Entity node whose notationName attribute value is null (i.e. a parsed entity)

Raise an entity-error.
Validate the nodeName attribute value as an NCName.
Validate the publicId attribute value as a public identifier.
Validate the systemId attribute value as a system identifier.
If the publicId attribute value of n is not null and the systemId attribute value of n is null, then raise an xml-well-formedness-error.
For each node n_c in the childNodes list of n,
1. If n_c is not an Element, Text, CDATASection, Comment, ProcessingInstruction, or EntityReference node, then it is an xml-well-formedness-error.
2. Otherwise, if n_c is an EntityReference node, then it is an entity-error.
3. Otherwise, check n_c recursively.

If n is an Entity node whose notationName attribute value is not null (i.e. an unparsed entity)

Validate the nodeName attribute value as an NCName.
Validate the publicId attribute value as a public identifier.
Validate the systemId attribute value as a system identifier.
If the systemId attribute value of n is null, then raise an xml-well-formedness-error.
Validate the notationName attribute value of n as an NCName.
@@ #not-declared Validity constraint: Notation Declared
If the childNodes list of n contains any nodes, then raise an xml-well-formedness-error.

If n is an EntityReference node

Raise an entity-error.
Validate the nodeName attribute value as an NCName.
For each node n_c in the childNodes list of n,
1. If n_c is not an Element, Text, CDATASection, Comment, ProcessingInstruction, or EntityReference node, then it is an xml-well-formedness-error.
2. Otherwise, if n_c is an EntityReference node, then it is an entity-error.
3. Otherwise, check n_c recursively.

If n is a Notation node

Validate the nodeName attribute value as an NCName.
Validate the publicId attribute value as a public identifier.
Validate the systemId attribute value as a system identifier.
If the childNodes list of n contains any nodes, then raise an xml-well-formedness-error.

If n is a ProcessingInstruction node

If the target attribute value matches the string xml in any case combination, then raise an xml-well-formedness-error.
Otherwise, validate the target attribute value as an NCName.
Then, validate the data attribute value as an XML character data.
If the data attribute value contains a string ?>, then raise an xml-well-formedness-error.
If the data attribute value starts with either U+0009 CHARACTER TABULATION, U+000A LINE FEED, U+000D CARRIAGE RETURN, or U+0020 SPACE character, then raise a round-trip-error.
If the childNodes list of n contains any nodes, then raise an xml-well-formedness-error.
@@ Warn if not declared
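
The ProcessingInstruction steps could be sketched as follows. The function signature and the returned list of error type names are assumptions; NCName validation of the target and character data validation of the data are omitted.

```python
def check_processing_instruction(target, data, has_children=False):
    """Collect the error types raised for a ProcessingInstruction node."""
    errors = []
    if target.lower() == "xml":
        # Reserved target, in any case combination.
        errors.append("xml-well-formedness-error")
    if "?>" in data:
        errors.append("xml-well-formedness-error")
    if data[:1] in ("\t", "\n", "\r", " "):
        # Leading white space would be consumed as the separator on reparse.
        errors.append("round-trip-error")
    if has_children:
        errors.append("xml-well-formedness-error")
    return errors
```

Slicing with data[:1] keeps the leading white space test safe for empty data.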

If n is a Text node

Validate the data attribute value as an XML character data.
If the childNodes list of n contains any nodes, then raise an xml-well-formedness-error.

Otherwise

xml-well-formedness-error? unknown-error?

Character Classes

This section defines a couple of character classes. These classes are referred to by algorithms specified above.

Character class Char10 contains the following characters:

U+0009 CHARACTER TABULATION
U+000A LINE FEED
U+000D CARRIAGE RETURN
U+0020 SPACE .. U+D7FF
U+E000 .. U+FFFD REPLACEMENT CHARACTER
U+10000 .. U+10FFFF

This character class contains all characters allowed in the production rule Char of XML 1.0 [XML].

Character class CompatChar10 contains the following characters:

@@ Document authors are encouraged to avoid "compatibility characters", as defined in section 6.8 of [Unicode @@ Unicode 2.0 @@] (see also D21 in section 3.6 of [Unicode3]).

Character class ControlChar10 contains the following characters:

U+007F DELETE .. U+0084 INDEX
U+0086 START OF SELECTED AREA .. U+009F APPLICATION PROGRAM COMMAND
U+FDD0 .. U+FDEF
U+1FFFE .. U+1FFFF
U+2FFFE .. U+2FFFF
U+3FFFE .. U+3FFFF
U+4FFFE .. U+4FFFF
U+5FFFE .. U+5FFFF
U+6FFFE .. U+6FFFF
U+7FFFE .. U+7FFFF
U+8FFFE .. U+8FFFF
U+9FFFE .. U+9FFFF
U+AFFFE .. U+AFFFF
U+BFFFE .. U+BFFFF
U+CFFFE .. U+CFFFF
U+DFFFE .. U+DFFFF
U+EFFFE .. U+EFFFF
U+FFFFE .. U+FFFFF
U+10FFFE .. U+10FFFF

This character class contains the characters listed in the Note in Section 2.2 of XML 1.0 [XML], as amended by errata.
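
Because the supplementary-plane entries of ControlChar10 follow a regular pattern (the last two code points of each of planes 1 to 16), the list above can be generated rather than typed out. The following is a sketch with hypothetical function names.

```python
def control_char10_ranges():
    """Return the ControlChar10 ranges as (first, last) code point pairs."""
    ranges = [(0x7F, 0x84), (0x86, 0x9F), (0xFDD0, 0xFDEF)]
    for plane in range(1, 0x11):  # planes 1 through 16
        base = plane << 16
        ranges.append((base | 0xFFFE, base | 0xFFFF))
    return ranges


def in_control_char10(cp):
    """Return True if code point cp belongs to ControlChar10."""
    return any(first <= cp <= last for first, last in control_char10_ranges())
```

The generated list has 19 ranges, matching the 19 entries above. U+0085 is deliberately absent, which is why the C1 controls are split into two ranges.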

The character class NameStartChar10 contains the following characters:

This character class contains all characters allowed as the first character of a string matching to the production rule Name of XML 1.0 [XML].

The character class NameChar10 contains the following characters:

The characters in the character class NameStartChar10.

This character class contains all characters allowed as the second or any later character of a string matching the production rule Name of XML 1.0 [XML].

The character class PubidChar contains the following characters:

U+0009 CHARACTER TABULATION
U+000A LINE FEED
U+000D CARRIAGE RETURN
U+0020 SPACE
U+0021 EXCLAMATION MARK (!)
U+0023 NUMBER SIGN (#)
U+0024 DOLLAR SIGN ($)
U+0025 PERCENT SIGN (%)
U+0027 APOSTROPHE (')
U+0028 LEFT PARENTHESIS (()
U+0029 RIGHT PARENTHESIS ())
U+002A ASTERISK (*)
U+002B PLUS SIGN (+)
U+002C COMMA (,)
U+002D HYPHEN-MINUS (-)
U+002E FULL STOP (.)
U+002F SOLIDUS (/)
U+0030 DIGIT ZERO (0) .. U+0039 DIGIT NINE (9)
U+003A COLON (:)
U+003B SEMICOLON (;)
U+003D EQUAL SIGN (=)
U+003F QUESTION MARK (?)
U+0040 COMMERCIAL AT (@)
U+0041 LATIN CAPITAL LETTER A (A) .. U+005A LATIN CAPITAL LETTER Z (Z)
U+005F LOW LINE (_)
U+0061 LATIN SMALL LETTER A (a) .. U+007A LATIN SMALL LETTER Z (z)

This character class contains all characters allowed in the production rule PubidChar of XML 1.0 [XML].

XML processing and DOM Document Type Definitions

Manakai Project Specification [DATE]
