This document is no longer maintained. See XML processing and DOM Document Type Definitions specification.
<http://suika.fam.cx/www/markup/xml/xmlcc/xmlcc-work>
<http://suika.fam.cx/www/markup/xml/xmlcc/xmlcc>
<http://suika.fam.cx/www/markup/xml/xmlcc/xmlcc-work>
<http://suika.fam.cx/gate/cvs/markup/xml/xmlcc/xmlcc-work.en.html>
© ‐ Wakaba.
Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.2 or any later version published by the Free Software Foundation; with no Invariant Sections, no Front‐Cover Texts, and no Back‐Cover Texts. A copy of the license is available at <http://www.gnu.org/copyleft/fdl.html>.
This section describes the status of this document at the time of its publication. Other documents might supersede this document.
This document is a working draft, produced as part of the Whatpm subproject of the manakai project. It might be updated, replaced, or obsoleted by other documents at any time. It is inappropriate to cite this document as other than work in progress.
The scope of this specification is explicitly limited to the Whatpm implementation. It is not the purpose of this specification to define a general guideline to parse or to check XML documents. This specification does not try to define a new version of XML at all.
This version of the specification supports the fourth edition of XML 1.0 and the second edition of XML 1.1. The fifth edition of XML 1.0 might be supported in a later version. The XML Namespaces specifications are expected to be supported in a later version of this specification.
Comments on this document are welcome and may be sent to the author.
Translations of this document might be available. The English version of the document is the only normative version.
This section is non‐normative.
This specification defines how the parsing and the conformance checking of XML documents should be implemented in the Whatpm XML parser and conformance checker.
It is not the purpose of this specification to define, e.g., how to parse XML documents in general; its scope is explicitly limited to the Whatpm implementation.
Much about the parsing of invalid (well-formed or not) XML documents and about XML document / XML DOM conformance is otherwise left undefined, so this document provides a guideline for conformance checkers.
The key words MUST, MUST NOT, SHOULD, SHOULD NOT, and MAY in this document are to be interpreted as described in RFC 2119 [KEYWORDS].
All examples and notes in this specification are non‐normative, as are all sections explicitly marked non‐normative. Everything else in this specification is normative.
Algorithms are normative in their results, but the individual steps are non-normative. In addition, the order in which errors are raised is undefined.
This document sometimes cites parts of XML 1.0 specification by hyperlinks. When the document being processed is an XML 1.1 document, however, corresponding parts of the XML 1.1 specification should be consulted instead.
Conceptually, validation of an XML document is split into two stages for the purpose of this specification: the XML document parsing stage and the DOM XML conformance checking stage.
The input to the XML document parsing stage is a byte sequence representing the parsed XML document (and any additional metadata), and the outputs are a DOM tree representing the XML document and zero or more errors. The processor that implements this stage is called a parser. Requirements for a parser are defined in the section Parsing an XML Document.
The input to the DOM XML conformance checking stage is a DOM tree, and the outputs are zero or more errors. The processor that implements this stage is called a conformance checker. Requirements for a conformance checker are defined in the section Checking an XML DOM Tree.
An error is ...
If a Document node has no xml-well-formedness-error, entity-error, or unknown-error, then it is well-formed. If a well-formed Document node has no xml-validity-error, it is valid.
A well‐formed Document can be safely serialized into a well‐formed XML document. A valid Document can be easily serialized into a valid XML document.
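The two definitions above can be sketched as predicates over the error categories raised for a Document node. This is only an illustration: the function names and the representation of the raised errors as a list of category strings are assumptions made for this example, not part of the Whatpm interface.

```python
# Categories whose presence means the Document is not well-formed.
# "unknown-error" is written here without the "?" that the category
# list attaches to it (an assumption of this sketch).
WELL_FORMEDNESS_BLOCKERS = {
    "xml-well-formedness-error",
    "entity-error",
    "unknown-error",
}


def is_well_formed(categories):
    """True if none of the blocking categories was raised."""
    return not any(c in WELL_FORMEDNESS_BLOCKERS for c in categories)


def is_valid(categories):
    """True if the Document is well-formed and no validity error was raised."""
    return is_well_formed(categories) and "xml-validity-error" not in categories
```

Under this reading, categories such as round-trip-warning or misc-info affect neither well-formedness nor validity.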
To be a conforming validating XML processor, ...
Errors are classified into these error categories:
entity-error
@@
This algorithm does not support DOM trees with one or more EntityReference nodes. It is expected that all entity references are expanded at parse time and that any unexpandable entity reference raises a parse-time error, so that the result is never a DOM tree with EntityReference nodes.
round-trip-error
round-trip-warning
A round-trip-warning will be raised when a construct is encountered that might not be restored to the same construct when it is serialized and then re-parsed by a conforming processor.
For a Comment node a round-trip-warning will be raised, since XML processors are not required to report the text of comments to applications.
unknown-error?
xml-misc-error
xml-misc-fatal-error
xml-misc-recommendation
An xml-misc-recommendation will be raised if a SHOULD‐level requirement in the XML specification is not met.
xml-validity-error
xml-well-formedness-error
If an xml-well-formedness-error is raised, it would not be possible to generate an XML serialization that matches the appropriate production rule and does not violate any well‐formedness constraint in the XML specification [XML10, XML11].
misc-info
A misc-info is raised when some status information on the parsing or checking process that is considered useful for debugging and so on is available. It by no means implies the non-conformance of the document.
@@ TODO: #dt-atuseroption at user option (MAY or MUST), #dt-compat for compatibility, #dt-interop for interoperability
TODO: XML 1.1, XML Namespace 1.0/1.1, xml:base, xml:id
TODO: XML "error"/"fatal error" is not always non-conforming (only when MUST or SHOULD).
When a byte stream that represents an XML document is given to a parser, it MUST create a DOM tree according to relevant specifications [XML10, XML11, XMLNAMES10, XMLNAMES11, DOM3CORE, WEBDOMCORE, DOMDTDEF, MANAKAIDOMEXT].
The parser MAY continue the parsing of the document even after a fatal error (as defined by the relevant specifications) is encountered. How the parsing ought to be continued is not defined by this specification.
A future version of this specification might define the entire parser in terms of input stream preprocessor, tokenizer, and tree constructor.
In addition, the following requirements are applied to the parser:
xml-misc-error.
Document node
for the document entity or an Entity node for a general
entity) to true.
@@ flag must be checked later
UTF-16 but the input byte stream for the entity does not begin with the BOM, then the parser MUST raise an xml-misc-error.
xml-misc-recommendation.
amp, lt, gt, apos, or quot, then the parser MUST raise xml-misc-recommendation(s).
entities attribute of the DocumentType node MUST contain a NamedNodeMap object whose first five items are as follows:
Entity node whose nodeName attribute is amp. It contains a Text node whose data attribute is set to &.
Entity node whose nodeName attribute is lt. It contains a Text node whose data attribute is set to <.
Entity node whose nodeName attribute is gt. It contains a Text node whose data attribute is set to >.
Entity node whose nodeName attribute is quot. It contains a Text node whose data attribute is set to ".
Entity node whose nodeName attribute is apos. It contains a Text node whose data attribute is set to '.
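As a rough illustration of the five items listed above, the expected leading entries can be written as an ordered table. Plain (nodeName, data) tuples stand in for Entity nodes and their Text children, and the helper name `starts_with_predefined` is hypothetical.

```python
# Expected first five entries, in the order given above.
PREDEFINED_ENTITIES = (
    ("amp", "&"),
    ("lt", "<"),
    ("gt", ">"),
    ("quot", '"'),
    ("apos", "'"),
)


def starts_with_predefined(entities):
    """entities: ordered sequence of (nodeName, text data) tuples."""
    return tuple(entities[:5]) == PREDEFINED_ENTITIES
```

A real implementation would inspect the NamedNodeMap in the entities attribute of the DocumentType node instead of a list of tuples.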
EntityValue part of the general entity declaration contains a bare U+003C LESS-THAN SIGN (<) character, then the parser MUST raise an xml-misc-warning.
Name is equal to the Name of the element type declaration, then the parser MUST raise an xml-validity-error.
Name is equal to the Name of the attribute definition list declaration, then the parser MUST raise an xml-misc-warning.
Name is equal to the Name of the attribute definition (whether or not in the same attribute definition list declaration), then the parser MUST raise an xml-misc-warning.
If the entity declaration declares a general entity, the following is applied:
Name is lt or amp
If the entity declaration does not declare an internal entity, or if the replacement text of the entity is not the escaped form of < (if lt) or & (if amp), then the parser MUST raise an xml-misc-error.
In other words, the character in the EntityValue has to be double-escaped.
Name is gt, quot, or apos
If the entity declaration does not declare an internal entity, or if the replacement text of the entity is neither equal to nor the escaped form of > (if gt), " (if quot), or ' (if apos), then the parser MUST raise an xml-misc-error.
In other words, the character in the EntityValue has to be single- or double-escaped.
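The rules above for redeclaring a predefined entity can be sketched as follows. The sketch assumes that "the escaped form" means a decimal or hexadecimal character reference to the character, as in the corresponding requirement of the XML specification. The function names and the `is_internal` flag are hypothetical.

```python
import re

# A single decimal or hexadecimal character reference, e.g. "&#60;" or "&#x3C;".
_CHAR_REF = re.compile(r"&#(?:x([0-9A-Fa-f]+)|([0-9]+));")

PREDEFINED = {"lt": "<", "amp": "&", "gt": ">", "quot": '"', "apos": "'"}
MUST_BE_ESCAPED = {"lt", "amp"}  # replacement text may not be the bare character


def is_char_ref_to(text, char):
    """True if text is exactly one character reference that denotes char."""
    m = _CHAR_REF.fullmatch(text)
    if m is None:
        return False
    code = int(m.group(1), 16) if m.group(1) is not None else int(m.group(2))
    return code == ord(char)


def predefined_declaration_ok(name, replacement_text, is_internal=True):
    """False means the parser would raise an xml-misc-error for this declaration."""
    char = PREDEFINED[name]
    if not is_internal:
        return False
    if is_char_ref_to(replacement_text, char):
        return True
    # gt, quot and apos may also have the bare character as replacement text.
    return name not in MUST_BE_ESCAPED and replacement_text == char
```

For example, a replacement text of `&#60;` is accepted for lt, while a bare `<` is not; for gt both `>` and `&#62;` are accepted.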
If the entity declaration has to be ignored because an entity with the same Name has already been declared, then the parser MUST raise a misc-info and abort these steps.
The five predefined entities, i.e. amp, lt, gt, quot, and apos, are always declared implicitly, and therefore any declaration for such an entity always raises a misc-info.
If the entity declaration declares a parameter entity and the Name of the entity begins with the string xml (in any combination of upper- and lowercase letters), then the parser MUST raise an xml-misc-warning.
If the entity declaration contains the EntityValue, then for each occurrence of any reference to an unparsed entity in the EntityValue, the parser MUST raise an xml-misc-error.
If the entity declaration declares a general entity, then an Entity node MUST be created and appended to the NamedNodeMap object in the entities attribute of the DocumentType node.
Read the external entity
If the replacement text of the entity is read, then parse the replacement text as if it were referenced from the content of an element (with no namespace bindings). If no @@ parse error is raised by the parsing process, then the nodes generated by the parsing MUST be appended to the Entity node. The parse error MUST NOT be propagated to the entire parsing process. Other kinds of errors MUST be propagated. The first parse error MUST abort the internal parsing process.
@@ better wording
@@ prop
Then, the Entity node and its descendants MUST be marked as read-only.
Name is equal to the Name of the notation declaration, then the parser MUST raise an xml-validity-error.
Name of the tag is not declared by a processed element type declaration as EMPTY content, then the parser MUST raise an xml-misc-recommendation.
Name of the tag is declared by a processed element type declaration as EMPTY content, then the parser MUST raise an xml-misc-recommendation.
The parser MUST set the normalized value of the attribute to the value attribute of the Attr node created for the attribute.
That is, any entity reference has to be expanded. Unexpanded entity references in attribute values are discarded.
xml:space attribute
xml:space attribute to the value attribute of the Attr node created for the attribute even if the normalized value is different from default or preserve.
Process as follows:
standalone pseudo-attribute set to
yes
xml-well-formedness-error.
xml-validity-error.
In either of the two cases above, process as follows:
IGNOREd section.
standalone pseudo-attribute of the
XML declaration (if any) is set to yes.
entity-error.
allDeclarationsProcessed @@ ref
attribute of the Document node MUST be
set to false.
Process as follows:
Name of the entity reference is either
amp, lt, gt, quot,
or apos, then abort these steps.
standalone pseudo-attribute set to
yes
xml-well-formedness-error.
xml-validity-error.
In either of the two cases above, process as follows:
entity-error.
@@ entity declared WFC?
Comment node MUST be created
and inserted appropriately.
The parser MUST try to read any entity referenced by general or parameter entity references and the external subset entity, if any in the document type definition.
Well-formedness constraints. When the parser detects a violation of one of certain well-formedness constraints, it MUST raise an xml-well-formedness-error. The list of such well-formedness constraints is as follows:
Validity constraints. When the parser detects a violation of one of certain validity constraints, it MUST raise an xml-validity-error. The list of such validity constraints is as follows:
Other criteria. If the parser detects a violation of one of certain additional constraints, it MUST raise an xml-misc-recommendation. The list of such constraints is as follows:
For interoperability, if a parameter-entity reference appears in a choice, seq, or Mixed construct, its replacement text SHOULD contain at least one non-blank character, and neither the first nor last non-blank character of the replacement text SHOULD be a connector (| or ,).
External parsed entities SHOULD each begin with a text declaration.
The parser MUST act as if it is a validating XML processor for the purpose of informing of white space characters appearing in element content (See Section 2.10 of the XML specification).
In other words, the isElementContentWhitespace attribute of Text nodes has to be set appropriately. Note that the value of the attribute will be set to false for any Text node in the content of an element whose declaration is not processed.
The parser MUST raise at least one xml-well-formedness-error if the entity it parses does not match the appropriate production rule in the XML specification. As an exception to this requirement, it MAY choose not to raise such an error if the error will be raised by the conformance checker when the conformance checker checks the Document object produced by the parser.
The following algorithms and definitions are applied to XML documents; especially, they are not applied to HTML documents.
The XML version of a node is the XML version of the document to which the node belongs.
For a Document node, the XML version of the document is the value of the xmlVersion attribute of the node. For a DocumentType node whose ownerDocument attribute is set to null, the XML version of the document is 1.0. For any other node, the XML version of the document is that of the Document node contained in the ownerDocument attribute of the node.
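The three cases of this definition can be sketched with a minimal stand-in for DOM nodes. The `Node` class and its field names below are assumptions made for the example; they are not the DOM or manakai interfaces.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Node:
    node_type: str                           # e.g. "Document", "DocumentType", "Element"
    owner_document: Optional["Node"] = None  # stands in for ownerDocument
    xml_version: Optional[str] = None        # stands in for xmlVersion (Document only)


def xml_version_of(node: Node) -> Optional[str]:
    if node.node_type == "Document":
        return node.xml_version
    if node.node_type == "DocumentType" and node.owner_document is None:
        return "1.0"
    # Any other node is assumed to have an owner Document.
    return node.owner_document.xml_version
```

For instance, an Element whose owner Document has xmlVersion 1.1 yields 1.1, while a DocumentType that is not yet attached to any Document yields 1.0.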
To validate an XML string (s), the following algorithm MUST be used:
Char10, then raise an xml-well-formedness-error.
CompatChar10, then raise an xml-misc-warning.
ControlChar10, then raise an xml-misc-warning.
U+000D CARRIAGE RETURN character, then raise a round-trip-error. @@ We should not raise duplicate errors for U+000D in attribute values. In addition, we should support a mode where U+000D will be serialized as (so that no round-trip-error will be raised).
To validate a Name (s), the following algorithm MUST be used:
xml-well-formedness-error. Abort these steps.
NameStartChar10, then raise an xml-well-formedness-error.
NameChar10, then raise an xml-well-formedness-error.
xml (in any case combination), then raise an xml-misc-warning. @@ except for attribute names xml:lang, xml:space.
To validate an NCName (s), the following algorithm MUST be used:
To validate a public identifier (pid), the following algorithm MUST be used:
null, abort these steps.
PubidChar, then raise an xml-well-formedness-error.
U+0009 CHARACTER TABULATION, U+000A LINE FEED, and U+000D CARRIAGE RETURN characters, if the first character of pid is U+0020 SPACE character, if the last character of pid is U+0020 SPACE character, or if there is a U+0020 SPACE character immediately followed by another U+0020 SPACE character in pid, then it is a round-trip-error.
Is this really a roundtripness problem? In fact, the XML spec only defines how to match public identifiers, not a canonical form.
To validate a system identifier (sid), the following algorithm MUST be used:
null, abort these steps.
U+0022 QUOTATION MARK (") and U+0027 APOSTROPHE (') characters, raise an xml-well-formedness-error.
U+0023 NUMBER SIGN (#) character, then raise an xml-misc-error.
Node
The algorithm to check a node (n) is defined as follows:
Attr node
localName attribute value as an NCName.
prefix attribute value is different from null, then validate the prefix attribute value as an NCName.
childNodes list of n, Text or EntityReference node, then it is an xml-well-formedness-error.
EntityReference node, then it is an entity-error.
nodeName attribute of n is xml:space @@ or {xml namespace}:space ? and value attribute of n is neither default nor preserve, then it is an xml-misc-error.
specified, manakaiAttributeType (#ValueType Validity constraint: Attribute Value Type)
value of n.
ID_ATTR
Name. If it fails, then raise an xml-validity-error.
ID v is defined, then raise an xml-validity-error.
Name. If it fails, then raise an xml-validity-error.
ID v is NOT defined, then raise an xml-validity-error.
Name. If it fails, then raise an xml-validity-error.
Entity v is NOT defined, then raise an xml-validity-error.
Nmtoken. If it fails, then raise an xml-validity-error.
xml-validity-error.
xml-validity-error.
xml-validity-error.
AttributeDefinition node
nodeName attribute of n is
xml:space @@ or {xml namespace}:space ? and its declared type is different from (default|preserve), (preserve|default), (default), or (preserve), then raise an xml-misc-error.
childNodes list of n, Text or EntityReference node, then it is an xml-well-formedness-error.
EntityReference node, then it is an entity-error.
NOTATION_ATTR, enumerated values MUST be declared. If not, then raise an xml-validity-error.
NOTATION_ATTR or ENUMERATED_ATTR, values MUST all be distinct. If not, then raise an xml-validity-error.
NOTATION_ATTR on an EMPTY element, then raise an xml-validity-error.
CDATASection node
data attribute value as an XML character
data.
data attribute value contains a string ]]>, then raise an xml-well-formedness-error.
childNodes list of n contains any nodes, they are in xml-well-formedness-error.
Comment node
round-trip-warning.
data attribute value as an XML character data.
data attribute value contains a string --, or if it ends with a character -, then raise an xml-well-formedness-error.
childNodes list of n contains any nodes, they are in xml-well-formedness-error.
Document node
1.0 or 1.1,
then it is an unknown-error?.
xmlEncoding attribute value does not match to [A-Za-z] ([A-Za-z0-9._] | '-')* @@ formal def, then it is an xml-well-formedness-error.
childNodes list of n have to consist of zero or more Comment and/or ProcessingInstruction nodes, followed by an optional DocumentType node, followed by zero or more Comment and/or ProcessingInstruction nodes, followed by an Element node, followed by zero or more Comment and/or ProcessingInstruction nodes. Any violation to this is an xml-well-formedness-error.
childNodes list of n, EntityReference node, then check nc recursively.
allDeclarationsProcessed
DocumentFragment node
childNodes list of n,
Element, Text, CDATASection, Comment, ProcessingInstruction, or EntityReference node, then it is an xml-well-formedness-error.
EntityReference node, then it is an entity-error.
DocumentType node
nodeName attribute value as an NCName.
ownerDocument attribute of n is null, then abort these substeps.
documentElement attribute of the node set to ownerDocument attribute of n is null, then abort these substeps.
nodeName attribute of the node set to documentElement attribute of the node set to ownerDocument attribute of n is different from nodeName of n, then raise an xml-validity-error.
publicId attribute value as a public identifier.
systemId attribute value as a system identifier.
publicId attribute value of n is not null and the systemId attribute value of n is null, then raise an xml-well-formedness-error. @@ publicId == null? Or, publicId == ""
childNodes list of n, ProcessingInstruction node, then it is an xml-well-formedness-error. @@ ref to manakai's extensions
entities, notations, and elementTypes lists of n, check the node recursively.
NamedNodeMap object in the entities attribute of n does not contain Entity nodes whose nodeName attribute are amp, lt, gt, apos, and quot then raise xml-misc-recommendation(s).
Element node
localName attribute value as an NCName.
prefix attribute value is different from null, then validate the prefix attribute value as an NCName.
childNodes list of n, Element, Text, CDATASection, Comment, ProcessingInstruction, or EntityReference node, then it is an xml-well-formedness-error.
EntityReference node, then it is an entity-error.
attribute attribute of n. Check conformance of attrs as follows:
Attr node whose nodeName attribute value is equal to that of another Attr node in attrs, then raise an xml-well-formedness-error.
ElementTypeDefinition node
childNodes list of n contains any nodes, they are in xml-well-formedness-error.
At user option, an XML processor MAY issue a warning when a declaration mentions an element type for which no declaration is provided, but this is not an error.
For compatibility, it is an error if the content model allows an element to match more than one occurrence of an element type in the content model.
At user option, an XML processor MAY issue a warning if attributes are declared for an element type not itself declared, but this is not an error.
AttributeDefinition node with attribute type ID in the NamedNodeMap list contained in the attributeDefinitions attribute of n, then raise an xml-validity-error.
AttributeDefinition node with attribute type NOTATION in the NamedNodeMap list contained in the attributeDefinitions attribute of n, then raise an xml-validity-error.
Entity node whose notationName attribute value is null (i.e. a parsed entity)
entity-error.
nodeName attribute value as an NCName.
publicId attribute value as a public identifier.
systemId attribute value as a system identifier.
publicId attribute value of n is not null and the systemId attribute value of n is null, then raise an xml-well-formedness-error.
childNodes list of n, Element, Text, CDATASection, Comment, ProcessingInstruction, or EntityReference node, then it is an xml-well-formedness-error.
EntityReference node, then it is an entity-error.
Entity node whose notationName attribute value is not null (i.e. an unparsed entity)
nodeName attribute value as an NCName.
publicId attribute value as a public identifier.
systemId attribute value as a system identifier.
systemId attribute value of n is null, then raise an xml-well-formedness-error.
notationName attribute value of n as an NCName.
childNodes list of n contains any nodes, they are in xml-well-formedness-error.
EntityReference node
entity-error.
nodeName attribute value as an NCName.
childNodes list of n,
Element, Text, CDATASection, Comment, ProcessingInstruction, or EntityReference node, then it is an xml-well-formedness-error.
EntityReference node, then it is an entity-error.
Notation node
nodeName attribute value as an NCName.
publicId attribute value as a public identifier.
systemId attribute value as a system identifier.
childNodes list of n contains any nodes, they are in xml-well-formedness-error.
ProcessingInstruction node
target attribute value matches to the string xml in any case combination, then raise an xml-well-formedness-error.
target attribute value as an NCName.
data attribute value as an XML character data.
data attribute value contains a string ?>, then raise an xml-well-formedness-error.
data attribute value starts with either U+0009 CHARACTER TABULATION, U+000A LINE FEED, U+000D CARRIAGE RETURN, or U+0020 SPACE character, then raise a round-trip-error.
childNodes list of n contains any nodes, then raise an xml-well-formedness-error.
Text node
data attribute value as an XML character data.
childNodes list of n contains any nodes, they are in xml-well-formedness-error.
This section defines a couple of character classes. These classes are referred to by algorithms specified above.
Character class Char10 contains the following characters:
U+0009 CHARACTER TABULATION
U+000A LINE FEED
U+000D CARRIAGE RETURN
U+0020 SPACE .. U+D7FF
U+E000 .. U+FFFD REPLACEMENT CHARACTER
U+10000 .. U+10FFFF
This character class contains all characters allowed in the production rule Char of XML 1.0 [XML10].
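The Char10 class can be expressed as a small range table, as in the sketch below. The names `CHAR10_RANGES`, `in_char10`, and `first_non_char10` are hypothetical and used only for illustration.

```python
# Inclusive code point ranges of the Char10 class, as listed above.
CHAR10_RANGES = (
    (0x0009, 0x0009),
    (0x000A, 0x000A),
    (0x000D, 0x000D),
    (0x0020, 0xD7FF),
    (0xE000, 0xFFFD),
    (0x10000, 0x10FFFF),
)


def in_char10(ch: str) -> bool:
    """True if the single character ch belongs to Char10."""
    cp = ord(ch)
    return any(lo <= cp <= hi for lo, hi in CHAR10_RANGES)


def first_non_char10(s: str):
    """Index of the first character outside Char10, or None if there is none."""
    for i, ch in enumerate(s):
        if not in_char10(ch):
            return i
    return None
```

A checker could call `first_non_char10` on a data attribute value and report an xml-well-formedness-error when the result is not None.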
Character class CompatChar10
contains the following characters:
Document authors are encouraged to avoid "compatibility characters", as defined in section 6.8 of [Unicode @@ Unicode 2.0 @@] (see also D21 in section 3.6 of [Unicode3]).
Character class ControlChar10 contains the following characters:
U+007F DELETE .. U+0084 INDEX
U+0086 START OF SELECTED AREA .. U+009F APPLICATION PROGRAM COMMAND
U+FDD0 .. U+FDEF
U+1FFFE .. U+1FFFF
U+2FFFE .. U+2FFFF
U+3FFFE .. U+3FFFF
U+4FFFE .. U+4FFFF
U+5FFFE .. U+5FFFF
U+6FFFE .. U+6FFFF
U+7FFFE .. U+7FFFF
U+8FFFE .. U+8FFFF
U+9FFFE .. U+9FFFF
U+AFFFE .. U+AFFFF
U+BFFFE .. U+BFFFF
U+CFFFE .. U+CFFFF
U+DFFFE .. U+DFFFF
U+EFFFE .. U+EFFFF
U+FFFFE .. U+FFFFF
U+10FFFE .. U+10FFFF
This character class contains the characters listed in the Note in Section 2.2 of XML 1.0 [XML10], as amended by errata.
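Because the last two code points of planes 1 through 16 follow a regular pattern, the ControlChar10 ranges can be generated rather than listed by hand. The following is a sketch under that observation; the function and constant names are hypothetical.

```python
def control_char10_ranges():
    """Build the inclusive code point ranges of ControlChar10."""
    ranges = [(0x007F, 0x0084), (0x0086, 0x009F), (0xFDD0, 0xFDEF)]
    # U+1FFFE..U+1FFFF through U+10FFFE..U+10FFFF: last two code points
    # of each of the planes 1 to 16.
    for plane in range(0x1, 0x11):
        base = plane << 16
        ranges.append((base | 0xFFFE, base | 0xFFFF))
    return ranges


CONTROL_CHAR10_RANGES = control_char10_ranges()


def in_control_char10(ch: str) -> bool:
    """True if the single character ch belongs to ControlChar10."""
    cp = ord(ch)
    return any(lo <= cp <= hi for lo, hi in CONTROL_CHAR10_RANGES)
```

Note that U+0085 is not part of the class, so the first two ranges are kept separate.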
The character class NameStartChar10 contains the following characters:
This character class contains all characters allowed as the first character of a string matching the production rule Name of XML 1.0 [XML10].
The character class NameChar10 contains the following characters:
This character class contains all characters allowed as the second or any later character of a string matching the production rule Name of XML 1.0 [XML10].
The character class PubidChar contains the following characters:
U+0009 CHARACTER TABULATION
U+000A LINE FEED
U+000D CARRIAGE RETURN
U+0020 SPACE
U+0021 EXCLAMATION MARK (!)
U+0023 NUMBER SIGN (#)
U+0024 DOLLAR SIGN ($)
U+0025 PERCENT SIGN (%)
U+0027 APOSTROPHE (')
U+0028 LEFT PARENTHESIS (()
U+0029 RIGHT PARENTHESIS ())
U+002A ASTERISK (*)
U+002B PLUS SIGN (+)
U+002C COMMA (,)
U+002D HYPHEN-MINUS (-)
U+002E FULL STOP (.)
U+002F SOLIDUS (/)
U+0030 DIGIT ZERO (0) .. U+0039 DIGIT NINE (9)
U+003A COLON (:)
U+003B SEMICOLON (;)
U+003D EQUAL SIGN (=)
U+003F QUESTION MARK (?)
U+0040 COMMERCIAL AT (@)
U+0041 LATIN CAPITAL LETTER A (A) .. U+005A LATIN CAPITAL LETTER Z (Z)
U+005F LOW LINE (_)
U+0061 LATIN SMALL LETTER A (a) .. U+007A LATIN SMALL LETTER Z (z)
<http://www.w3.org/TR/xml>. This version of the specification is referenced.
<http://www.w3.org/TR/CSS21>.
<http://dev.w3.org/csswg/cssom/Overview.html>.
<http://dev.w3.org/2006/webapi/selectors-api/Overview.html>. The latest published version of the specification is available at <http://www.w3.org/TR/selectors-api/>.
<http://www.w3.org/TR/xbl/>.