[1] In [[Unicode]], a number of [[code points]] are designated as [DFN[noncharacters]].

* Definition and description

[10] 
>
:C2:A process shall not interpret a noncharacter code point as an abstract character.
-The noncharacter code points may be used internally, such as for sentinel values
or delimiters, but should not be exchanged publicly.

;; [[Unicode 5.0]] 3.2 <http://www.unicode.org/versions/Unicode5.0.0/ch03.pdf#page=8>

[11] 
>
:C7 :When a process purports not to modify the interpretation of a valid coded character
sequence, it shall make no change to that coded character sequence other than the possible
replacement of character sequences by their canonical-equivalent sequences or the
deletion of noncharacter code points.
- [INS[(omitted)]]
-If a noncharacter that does not have a specific internal use is unexpectedly
encountered in processing, an implementation may signal an error or delete or
ignore the noncharacter. If these options are not taken, the noncharacter
should be treated as an unassigned code point. For example, an API that
returned a character property value for a noncharacter would return the same
value as the default value for an unassigned code point.
-[INS[(remainder omitted)]]

;; [[Unicode 5.0]] 3.2 <http://www.unicode.org/versions/Unicode5.0.0/ch03.pdf#page=10> (excerpt)

>
:D12 Coded character sequence: An ordered sequence of one or more code points.
- A coded character sequence is also known as a coded character representation.
- Normally a coded character sequence consists of a sequence of encoded characters,
but it may also include noncharacters or reserved code points.
- Internally, a process may choose to make use of noncharacter code points in its
coded character sequences. However, such noncharacter code points may not
be interpreted as abstract characters (see conformance clause C2), and their
removal by a conformant process does not constitute modification of interpretation
of the coded character sequence (see conformance clause C7).
- [INS[(remainder omitted)]]

;; [[Unicode 5.0]] 3.4 <http://www.unicode.org/versions/Unicode5.0.0/ch03.pdf#page=17> (excerpt)

[12]
>
: D14 Noncharacter: A code point that is permanently reserved for internal use and that
should never be interchanged. Noncharacters consist of the values U+nFFFE and
U+nFFFF (where n is from 0 to 10₁₆) and the values U+FDD0..U+FDEF.
- For more information, see Section 16.7, Noncharacters.
- These code points are permanently reserved as noncharacters.
:D15 Reserved code point: Any code point of the Unicode Standard that is reserved for
future assignment. Also known as an unassigned code point.
- Surrogate code points and noncharacters are considered assigned code points,
but not assigned characters.
- [INS[(remainder omitted)]]

;; [[Unicode 5.0]] 3.4 <http://www.unicode.org/versions/Unicode5.0.0/ch03.pdf#page=18>
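Definition D14 translates fairly directly into a predicate. The following is a minimal sketch, assuming code points are handled as plain integers; the function name `is_noncharacter` is chosen here for illustration only.

```python
def is_noncharacter(cp: int) -> bool:
    """Return True if cp is one of the noncharacter code points of D14."""
    if not 0 <= cp <= 0x10FFFF:
        raise ValueError(f"not a Unicode code point: {cp:#x}")
    # U+FDD0..U+FDEF: the contiguous run inside the BMP.
    if 0xFDD0 <= cp <= 0xFDEF:
        return True
    # U+nFFFE and U+nFFFF: the low 16 bits are FFFE or FFFF on any plane.
    return (cp & 0xFFFE) == 0xFFFE
```

Counting the code points for which this predicate holds over the whole range U+0000..U+10FFFF gives 66.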

[13] 
>16.7 Noncharacters
>Noncharacters: U+FFFE, U+FFFF, and Others
>Noncharacters are code points that are permanently reserved in the Unicode Standard for
internal use. They are forbidden for use in open interchange of Unicode text data. See
Section 3.4, Characters and Encoding, for the formal definition of noncharacters and conformance
requirements related to their use.
>The Unicode Standard sets aside 66 noncharacter code points. The last two code points of
each plane are noncharacters: U+FFFE and U+FFFF on the BMP, U+1FFFE and U+1FFFF
on Plane 1, and so on, up to U+10FFFE and U+10FFFF on Plane 16, for a total of 34 code
points. In addition, there is a contiguous range of another 32 noncharacter code points in
the BMP: U+FDD0..U+FDEF. For historical reasons, the range U+FDD0..U+FDEF is contained
within the Arabic Presentation Forms-A block, but those noncharacters are not
“Arabic noncharacters” or “right-to-left noncharacters,” and are not distinguished in any
other way from the other noncharacters, except in their code point values.
>Applications are free to use any of these noncharacter code points internally but should
never attempt to exchange them. If a noncharacter is received in open interchange, an
application is not required to interpret it in any way. It is good practice, however, to recognize
it as a noncharacter and to take appropriate action, such as removing it from the text.
Note that Unicode conformance freely allows the removal of these characters. (See conformance
clause C7 in Section 3.2, Conformance Requirements.)
>In effect, noncharacters can be thought of as application-internal private-use code points.
Unlike the private-use characters discussed in Section 16.5, Private-Use Characters, which
are assigned characters and which are intended for use in open interchange, subject to
interpretation by private agreement, noncharacters are permanently reserved (unassigned)
and have no interpretation whatsoever outside of their possible application-internal private
uses.
>U+FFFF and U+10FFFF. These two noncharacter code points have the attribute of being
associated with the largest code unit values for particular Unicode encoding forms. In
UTF-16, U+FFFF is associated with the largest 16-bit code unit value, FFFF₁₆. U+10FFFF is
associated with the largest legal UTF-32 32-bit code unit value, 10FFFF₁₆. This attribute
renders these two noncharacter code points useful for internal purposes as sentinels. For
example, they might be used to indicate the end of a list, to represent a value in an index
guaranteed to be higher than any valid character value, and so on.
>U+FFFE. This noncharacter has the intended peculiarity that, when represented in UTF-16
and then serialized, it has the opposite byte sequence of U+FEFF, the byte order mark. This
means that applications should reserve U+FFFE as an internal signal that a UTF-16 text
stream is in a reversed byte format. Detection of U+FFFE at the start of an input stream
should be taken as a strong indication that the input stream should be byte-swapped before
interpretation. For more on the use of the byte order mark and its interaction with the noncharacter
U+FFFE, see Section 16.8, Specials.

;; [[Unicode 5.0]] 16.7
<http://www.unicode.org/versions/Unicode5.0.0/ch16.pdf#page=21>
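As one possible reading of the sentinel use mentioned in the quotation above, the sketch below merges two ascending lists of code points and uses U+10FFFF as an internal end marker that compares higher than any valid character value. The function `merge_code_points` and the constant `END` are hypothetical names, and the sketch assumes that neither input list contains U+10FFFF.

```python
END = 0x10FFFF  # noncharacter used purely as an internal end marker


def merge_code_points(a: list[int], b: list[int]) -> list[int]:
    """Merge two ascending lists of code points, neither containing END."""
    xs = a + [END]
    ys = b + [END]
    i = j = 0
    merged = []
    # Because END is greater than every valid character value, the loop
    # needs no separate bounds checks: an exhausted list simply loses
    # every comparison until the other list is exhausted as well.
    while xs[i] != END or ys[j] != END:
        if xs[i] <= ys[j]:
            merged.append(xs[i])
            i += 1
        else:
            merged.append(ys[j])
            j += 1
    return merged
```

The sentinel never appears in the result, so it is not exchanged with any other process.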

[9] 
> These codes are intended for process-internal uses, but are not permitted for interchange.

;; <http://www.unicode.org/charts/PDF/Unicode-4.0/U40-FB50.pdf>,
<http://www.unicode.org/charts/PDF/UFB50.pdf> "04-Apr-2008 09:52  342K" (as of February 2009)

* Handling in various applications

** HTML

[4] In [[HTML5]], [[authors]] [['''must not''']] include [[noncharacters]] in a [[document]].
An [[HTML parser]] [['''must''']] treat a [[noncharacter]] as a [[parse error]] and replace it
with [[U+FFFD]]. [SRC@en[[[HTML5]]]]
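The replacement step described in this paragraph could be sketched roughly as follows. This only illustrates the behaviour as stated here and is not taken from the HTML5 specification text; the function name `replace_noncharacters` and the idea of returning an error count are assumptions made for the example.

```python
import re

# Character class matching U+FDD0..U+FDEF and the last two code points
# of each of the 17 planes.
_NONCHARACTER_RE = re.compile(
    "[\uFDD0-\uFDEF"
    + "".join(
        chr((plane << 16) | low)
        for plane in range(17)
        for low in (0xFFFE, 0xFFFF)
    )
    + "]"
)


def replace_noncharacters(text: str) -> tuple[str, int]:
    """Return (text with noncharacters replaced by U+FFFD, number replaced)."""
    return _NONCHARACTER_RE.subn("\uFFFD", text)
```

Each replacement would correspond to one parse error reported by the parser.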

** XML

[6] In [[XML]], a [[document]] that contains [[U+FFFE]] or [[U+FFFF]] is not [[well-formed]].
The other [[noncharacters]] are permitted, but a Note marks them as [[discouraged]].

;;
-[CITE@en[Extensible Markup Language (XML) 1.0 (Fifth Edition)]] ([TIME[2008-11-21 21:41:46 +09:00]] 版) <http://www.w3.org/TR/2008/REC-xml-20081126/#charsets>
-[CITE@en[Extensible Markup Language (XML) 1.1 (Second Edition)]] ([TIME[2006-09-30 04:02:09 +09:00]] 版) <http://www.w3.org/TR/2006/REC-xml11-20060816/#charsets>
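The difference between U+FFFE and U+FFFF and the other noncharacters follows from the `Char` production of XML 1.0, which ends its BMP range at U+FFFD. A minimal sketch of that production as a Python predicate (the function name `is_xml10_char` is hypothetical):

```python
def is_xml10_char(cp: int) -> bool:
    """True if cp matches the Char production of XML 1.0:
    #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
    """
    return (
        cp in (0x9, 0xA, 0xD)
        or 0x20 <= cp <= 0xD7FF
        or 0xE000 <= cp <= 0xFFFD
        or 0x10000 <= cp <= 0x10FFFF
    )
```

Under this production U+FFFE and U+FFFF are rejected outright, while U+FDD0..U+FDEF and the supplementary-plane noncharacters such as U+1FFFE still match and are covered only by the Note.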

* U+FDD0〜U+FDEF

** Errors in the range

[2] Several [[applications]] mistakenly gave the range of [[noncharacters]] as
"U+FDD0..U+FD''D''F" instead of "U+FDD0..U+FD''E''F".

[8] Even the Code Chart PDF for [[Unicode 5.1]] contains the following erroneous statement:
>This block also contains 32 noncharacters in the range FDD0‐FDDF.

;; <http://www.unicode.org/charts/PDF/UFB50.pdf> "04-Apr-2008 09:52  342K" (as of February 2009)

;; The [[Unicode 4.0]] [[PDF]] <http://www.unicode.org/charts/PDF/Unicode-4.0/U40-FB50.pdf>
appears not to have contained the corresponding statement at all.
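The statement in the chart is inconsistent with itself, which a short calculation makes visible: a range ending at FDDF cannot hold 32 code points. The helper name `inclusive_size` is used here only for illustration.

```python
def inclusive_size(first: int, last: int) -> int:
    """Number of code points in the inclusive range first..last."""
    return last - first + 1


claimed = inclusive_size(0xFDD0, 0xFDDF)  # range as printed in the chart
actual = inclusive_size(0xFDD0, 0xFDEF)   # range given in definition D14
print(claimed, actual)  # prints "16 32"
```

Only the range ending at U+FDEF yields the 32 noncharacters that the same sentence mentions.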

[5] [[XML]] corrected this error in [[XML 1.0 4e]] E02 and [[XML 1.1 2e]] E02 (15 August 2007).

;;
-[CITE@en[Errata in REC-xml-20060816]] ([TIME[2008-11-19 06:33:50 +09:00]] 版) <http://www.w3.org/XML/xml-V10-4e-errata#E02>
-[CITE@en[Errata in REC-xml11-20060816]] ([TIME[2008-01-19 03:24:55 +09:00]] 版) <http://www.w3.org/XML/xml-V11-2e-errata#E02>

[3] [[HTML5]] corrected this error in r2708 (January 2009).

;; [CITE@en[(X)HTML5 Tracking]] ([TIME[2009-02-22 09:57:31 +09:00]] 版) <http://html5.org/tools/web-apps-tracker?from=2707&to=2708&context=10>

* U+FFFE

[7] [CODE(char)[[[U+FFFE]]]] is designated a [[noncharacter]], a [[code point]] at which no
[[character]] is [[encoded]], so that it can be distinguished from the [[BOM]] [CODE(char)[[[U+FEFF]]]].
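Because no character is encoded at U+FFFE, reading it as the first 16-bit code unit of a stream can be taken as a sign that the byte order was guessed wrongly. The following is a minimal sketch of that check; the function names `needs_byte_swap` and `decode_utf16` are hypothetical, and the input is assumed to be UTF-16 of even length.

```python
def needs_byte_swap(data: bytes, assumed: str) -> bool:
    """True if the first code unit, read in the assumed byte order
    ("big" or "little"), comes out as U+FFFE, the reversed BOM."""
    return len(data) >= 2 and int.from_bytes(data[:2], assumed) == 0xFFFE


def decode_utf16(data: bytes, assumed: str = "big") -> str:
    """Decode UTF-16 bytes, flipping the assumed byte order if the
    stream starts with what looks like U+FFFE."""
    order = assumed
    if needs_byte_swap(data, assumed):
        order = "little" if assumed == "big" else "big"
    codec = "utf-16-be" if order == "big" else "utf-16-le"
    text = data.decode(codec)
    # Drop a single leading byte order mark, if present.
    return text[1:] if text.startswith("\ufeff") else text
```

For example, the bytes FF FE read as big-endian give the code unit FFFE, so a stream that starts with them is decoded as little-endian instead.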

* U+FFFF

* U+1FFFE, U+1FFFF, U+2FFFE, U+2FFFF, ..., U+FFFFE, U+FFFFF, U+10FFFE, U+10FFFF