[1] In [[Unicode]], certain [[code points]] are designated as [DFN[noncharacters]].

* Definition and explanation

[10] > :C2:A process shall not interpret a noncharacter code point as an abstract character.
-The noncharacter code points may be used internally, such as for sentinel values or delimiters, but should not be exchanged publicly.
;; [[Unicode 5.0]] 3.2

[11] > :C7:When a process purports not to modify the interpretation of a valid coded character sequence, it shall make no change to that coded character sequence other than the possible replacement of character sequences by their canonical-equivalent sequences or the deletion of noncharacter code points.
- [INS[(omitted)]]
-If a noncharacter that does not have a specific internal use is unexpectedly encountered in processing, an implementation may signal an error or delete or ignore the noncharacter. If these options are not taken, the noncharacter should be treated as an unassigned code point. For example, an API that returned a character property value for a noncharacter would return the same value as the default value for an unassigned code point.
-[INS[(remainder omitted)]]
;; [[Unicode 5.0]] 3.2, excerpt

> :D12 Coded character sequence: An ordered sequence of one or more code points.
- A coded character sequence is also known as a coded character representation.
- Normally a coded character sequence consists of a sequence of encoded characters, but it may also include noncharacters or reserved code points.
- Internally, a process may choose to make use of noncharacter code points in its coded character sequences. However, such noncharacter code points may not be interpreted as abstract characters (see conformance clause C2), and their removal by a conformant process does not constitute modification of interpretation of the coded character sequence (see conformance clause C7).
- [INS[(remainder omitted)]]
;; [[Unicode 5.0]] 3.4, excerpt

[12] > :D14 Noncharacter: A code point that is permanently reserved for internal use and that should never be interchanged.
Noncharacters consist of the values U+nFFFE and U+nFFFF (where n is from 0 to 10₁₆) and the values U+FDD0..U+FDEF.
- For more information, see Section 16.7, Noncharacters.
- These code points are permanently reserved as noncharacters.
:D15 Reserved code point: Any code point of the Unicode Standard that is reserved for future assignment. Also known as an unassigned code point.
- Surrogate code points and noncharacters are considered assigned code points, but not assigned characters.
- [INS[(remainder omitted)]]
;; [[Unicode 5.0]] 3.4

[13] >16.7 Noncharacters
>Noncharacters: U+FFFE, U+FFFF, and Others
>Noncharacters are code points that are permanently reserved in the Unicode Standard for internal use. They are forbidden for use in open interchange of Unicode text data. See Section 3.4, Characters and Encoding, for the formal definition of noncharacters and conformance requirements related to their use.
>The Unicode Standard sets aside 66 noncharacter code points. The last two code points of each plane are noncharacters: U+FFFE and U+FFFF on the BMP, U+1FFFE and U+1FFFF on Plane 1, and so on, up to U+10FFFE and U+10FFFF on Plane 16, for a total of 34 code points. In addition, there is a contiguous range of another 32 noncharacter code points in the BMP: U+FDD0..U+FDEF. For historical reasons, the range U+FDD0..U+FDEF is contained within the Arabic Presentation Forms-A block, but those noncharacters are not “Arabic noncharacters” or “right-to-left noncharacters,” and are not distinguished in any other way from the other noncharacters, except in their code point values.
>Applications are free to use any of these noncharacter code points internally but should never attempt to exchange them. If a noncharacter is received in open interchange, an application is not required to interpret it in any way. It is good practice, however, to recognize it as a noncharacter and to take appropriate action, such as removing it from the text.
Note that Unicode conformance freely allows the removal of these characters. (See conformance clause C7 in Section 3.2, Conformance Requirements.)
>In effect, noncharacters can be thought of as application-internal private-use code points. Unlike the private-use characters discussed in Section 16.5, Private-Use Characters, which are assigned characters and which are intended for use in open interchange, subject to interpretation by private agreement, noncharacters are permanently reserved (unassigned) and have no interpretation whatsoever outside of their possible application-internal private uses.
>U+FFFF and U+10FFFF. These two noncharacter code points have the attribute of being associated with the largest code unit values for particular Unicode encoding forms. In UTF-16, U+FFFF is associated with the largest 16-bit code unit value, FFFF₁₆. U+10FFFF is associated with the largest legal UTF-32 32-bit code unit value, 10FFFF₁₆. This attribute renders these two noncharacter code points useful for internal purposes as sentinels. For example, they might be used to indicate the end of a list, to represent a value in an index guaranteed to be higher than any valid character value, and so on.
>U+FFFE. This noncharacter has the intended peculiarity that, when represented in UTF-16 and then serialized, it has the opposite byte sequence of U+FEFF, the byte order mark. This means that applications should reserve U+FFFE as an internal signal that a UTF-16 text stream is in a reversed byte format. Detection of U+FFFE at the start of an input stream should be taken as a strong indication that the input stream should be byte-swapped before interpretation. For more on the use of the byte order mark and its interaction with the noncharacter U+FFFE, see Section 16.8, Specials.
;; [[Unicode 5.0]] 16.7

[9] > These codes are intended for process-internal uses, but are not permitted for interchange.
;; "04-Apr-2008 09:52 342K" (as of February 2009)

* Handling in various applications

** HTML

[4] In [[HTML5]], [[authors]] [['''must not''']] include [[noncharacters]] in a [[document]]. An [[HTML parser]] [['''must''']] treat a [[noncharacter]] as a [[parse error]] and replace it with [[U+FFFD]]. [SRC@en[[[HTML5]]]]

** XML

[6] In [[XML]], a [[document]] that contains [[U+FFFE]] or [[U+FFFF]] is not [[well-formed]]. The other [[noncharacters]] may be included, but a Note describes them as [[discouraged]].
;;
-[CITE@en[Extensible Markup Language (XML) 1.0 (Fifth Edition)]] (version of [TIME[2008-11-21 21:41:46 +09:00]])
-[CITE@en[Extensible Markup Language (XML) 1.1 (Second Edition)]] (version of [TIME[2006-09-30 04:02:09 +09:00]])

* U+FDD0..U+FDEF

** Range errors

[2] Several [[applications]] mistakenly gave the [[noncharacter]] range as "U+FDD0..U+FD''D''F" instead of "U+FDD0..U+FD''E''F".

[8] Even the [[Unicode 5.1]] Code Chart PDF contains an incorrect statement:
>This block also contains 32 noncharacters in the range FDD0‐FDDF.
The range FDD0..FDDF covers only 16 code points, so it cannot hold the 32 noncharacters that the same sentence mentions.
;; "04-Apr-2008 09:52 342K" (as of February 2009)
;; The [[Unicode 4.0]] [[PDF]] does not appear to have contained this statement at all.

[5] [[XML]] corrected this error in [[XML 1.0 4e]] E02 and [[XML 1.1 2e]] E02 (15 August 2007).
;;
-[CITE@en[Errata in REC-xml-20060816]] (version of [TIME[2008-11-19 06:33:50 +09:00]])
-[CITE@en[Errata in REC-xml11-20060816]] (version of [TIME[2008-01-19 03:24:55 +09:00]])

[3] [[HTML5]] corrected this error in r2708 (January 2009).
;; [CITE@en[(X)HTML5 Tracking]] (version of [TIME[2009-02-22 09:57:31 +09:00]])

* U+FFFE

[7] So that it can be distinguished from the [[BOM]] [CODE(char)[[[U+FEFF]]]], [CODE(char)[[[U+FFFE]]]] is designated a [[noncharacter]], that is, a [[code point]] at which no [[character]] is [[encoded]].

* U+FFFF

* U+1FFFE, U+1FFFF, U+2FFFE, U+2FFFF, ..., U+FFFFE, U+FFFFF, U+10FFFE, U+10FFFF
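
The plane-final code points listed in this heading, together with U+FFFE, U+FFFF and the block U+FDD0..U+FDEF, make up the 66 noncharacters, and they can be recognized with a short test. The Python below is only a minimal sketch of that test. The function names is_noncharacter, all_noncharacters and replace_noncharacters are hypothetical and come from none of the specifications cited on this page. The default replacement U+FFFD is an assumption that follows the HTML5 behaviour described on this page.

```python
MAX_CODE_POINT = 0x10FFFF


def is_noncharacter(cp):
    """Return True if cp is one of the 66 Unicode noncharacter code points."""
    if not 0 <= cp <= MAX_CODE_POINT:
        raise ValueError("not a Unicode code point: %r" % (cp,))
    # Contiguous block of 32 in the BMP. The upper bound is U+FDEF;
    # using U+FDDF here would reproduce the range error noted on this page.
    if 0xFDD0 <= cp <= 0xFDEF:
        return True
    # Last two code points of each of the 17 planes: U+nFFFE and U+nFFFF.
    return (cp & 0xFFFE) == 0xFFFE


def all_noncharacters():
    """List every noncharacter code point in ascending order."""
    return [cp for cp in range(MAX_CODE_POINT + 1) if is_noncharacter(cp)]


def replace_noncharacters(text, replacement="\uFFFD"):
    """Replace each noncharacter in text (hypothetical helper).

    Passing replacement="" deletes them instead, which conformance
    clause C7 permits.
    """
    return "".join(
        replacement if is_noncharacter(ord(ch)) else ch for ch in text
    )
```

For example, len(all_noncharacters()) is 66, matching the count given in Section 16.7, and is_noncharacter(0xFDE0) is true even though a check bounded at U+FDDF would miss it. The mask test works because U+nFFFE and U+nFFFF are the only code points whose low 16 bits are FFFE or FFFF.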