ids/0/124.txt

#?SuikaWiki/0.9 default-name="US-ASCII"


* charset 系パラメーター

,charset        ,token  ,[[MIME]]
,charset-edition        ,4DIGIT ,RFC 1922
,charset-extension      ,token  ,RFC 1922


* RFC 2046 から

A critical parameter that may be specified in the Content-Type field
for "text/plain" data is the character set.  This is specified with a
"charset" parameter, as in:

"text/plain" データの Content-Type 領域に指定しても良いパラメーターは
文字集合です、これは "charset" パラメーターで次のように指定します。

[PRE[
     Content-type: text/plain; charset=iso-8859-1
]PRE]

Unlike some other parameter values, the values of the charset
parameter are NOT case sensitive.  The default character set, which
must be assumed in the absence of a charset parameter, is US-ASCII.

他の幾つかのパラメーター値とは違って、 charset パラメーターの値は
大文字・小文字を区別'''しません'''。 charset パラメーターが無い場合に
仮定しなければならない既定の文字集合は、 US-ASCII です。

[PRE[
   The specification for any future subtypes of "text" must specify
   whether or not they will also utilize a "charset" parameter, and may
   possibly restrict its values as well.  For other subtypes of "text"
   than "text/plain", the semantics of the "charset" parameter should be
   defined to be identical to those specified here for "text/plain",
   i.e., the body consists entirely of characters in the given charset.
   In particular, definers of future "text" subtypes should pay close
   attention to the implications of multioctet character sets for their
   subtype definitions.
]PRE]

将来の "text" の亜型の仕様は "charset" パラメーターを利用するかどうか
規定しなければなりません。また、その値を制限しても構いません。
"text/plain" 以外の "text" 亜型には、 "charset" パラメーターの意味は
ここで "text/plain" 用に既定するのと同じ様に定義するべきです。
つまり、本文は完全に与えた charset で構成されます。特に、
将来の "text" 亜型の定義者は多オクテット文字集合とその亜型定義
との関係についてしっかり注意するべきです。

[PRE[
   The charset parameter for subtypes of "text" gives a name of a
   character set, as "character set" is defined in RFC 2045.  The rules
   regarding line breaks detailed in the previous section must also be
   observed -- a character set whose definition does not conform to
   these rules cannot be used in a MIME "text" subtype.
]PRE]

"text" 亜型の charset パラメーターは文字集合の名前を与えます。
ここで「文字集合 character set」は RFC 2045 で定義したものです。
前の節で詳しく述べた改行に関する規則にも注意して下さい。
この規則に適合しない文字集合は MIME "text" 亜型で使うことは出来ません。

An initial list of predefined character set names can be found at the
end of this section.  Additional character sets may be registered with IANA.

予め定義した文字集合名はこの節の終わりにあります。追加の文字集合を
IANA で登録しても構いません。

Other media types than subtypes of "text" might choose to employ the
charset parameter as defined here, but with the CRLF/line break
restriction removed.  Therefore, all character sets that conform to
the general definition of "character set" in RFC 2045 can be
registered for MIME use.

"text" の亜型以外の媒体型もここで定義した charset パラメーターを使う
ことにしても構いませんが、 CRLF/改行制限は削除されます。
ですから、 RFC 2045 の「文字集合 character set」の定義に適合する
全ての文字集合を MIME で使用するのに登録出来ます。

[PRE[
   Note that if the specified character set includes 8-bit characters
   and such characters are used in the body, a Content-Transfer-Encoding
   header field and a corresponding encoding on the data are required in
   order to transmit the body via some mail transfer protocols, such as
   SMTP [RFC-821].
]PRE]

なお、指定文字集合が8ビット文字を含んでいてそのような文字が本文で
使われている場合、 Content-Transfer-Encoding 頭領域と対応するデータの符号化
が本文を SMTP のような幾つかのメイル転送プロトコルで転送するために
施す必要があります。

[PRE[
   The default character set, US-ASCII, has been the subject of some
   confusion and ambiguity in the past.  Not only were there some
   ambiguities in the definition, there have been wide variations in
   practice.  In order to eliminate such ambiguity and variations in the
   future, it is strongly recommended that new user agents explicitly
   specify a character set as a media type parameter in the Content-Type
   header field. "US-ASCII" does not indicate an arbitrary 7-bit
   character set, but specifies that all octets in the body must be
   interpreted as characters according to the US-ASCII character set.
   National and application-oriented versions of ISO 646 [ISO-646] are
   usually NOT identical to US-ASCII, and in that case their use in
   Internet mail is explicitly discouraged.  The omission of the ISO 646
   character set from this document is deliberate in this regard.  The
   character set name of "US-ASCII" explicitly refers to the character
   set defined in ANSI X3.4-1986 [US- ASCII].  The new international
   reference version (IRV) of the 1991 edition of ISO 646 is identical
   to US-ASCII.  The character set name "ASCII" is reserved and must not
   be used for any purpose.
]PRE]

既定の文字集合 US-ASCII は過去より混乱と曖昧がありました。定義に曖昧性がある
だけではなく、慣習上多様な変種があります。将来このような曖昧性と変種を
取り除くため、新しい利用者代理者は明示的に文字集合を Content-Type
頭領域の媒体型パラメーターとして指定することを強く推奨します。 "US-ASCII"
は任意の7ビット文字集合を示すのではなく、本文の
全てのオクテットを US-ASCII 文字集合によって文字として解釈しなければならない
と指定します。国家・応用指向の ISO 646 の版は一般に US-ASCII
と同一では'''なく'''、この場合 Internet メイルでの利用は明白に非推奨です。
ISO 646 文字集合をこの文書から省いたのは、このためわざとそうしたのです。
"US-ASCII" の名前の文字集合は明白に、 ANSI X3.4-1986 で定義された文字集合を
参照します。 ISO 646 の 1991 年版の新しい国際基準版 (IRV) は US-ASCII
と同一です。文字集合名 "ASCII" は保留され、どんな目的にも使ってはいけません。

[PRE[
   NOTE: RFC 821 explicitly specifies "ASCII", and references an earlier
   version of the American Standard.  Insofar as one of the purposes of
   specifying a media type and character set is to permit the receiver
   to unambiguously determine how the sender intended the coded message
   to be interpreted, assuming anything other than "strict ASCII" as the
   default would risk unintentional and incompatible changes to the
   semantics of messages now being transmitted.  This also implies that
   messages containing characters coded according to other versions of
   ISO 646 than US-ASCII and the 1991 IRV, or using code-switching
   procedures (e.g., those of ISO 2022), as well as 8bit or multiple
   octet character encodings MUST use an appropriate character set
   specification to be consistent with MIME.
]PRE]

[PRE[
   The complete US-ASCII character set is listed in ANSI X3.4- 1986.
   Note that the control characters including DEL (0-31, 127) have no
   defined meaning in apart from the combination CRLF (US-ASCII values
   13 and 10) indicating a new line.  Two of the characters have de
   facto meanings in wide use: FF (12) often means "start subsequent
   text on the beginning of a new page"; and TAB or HT (9) often (though
   not always) means "move the cursor to the next available column after
   the current position where the column number is a multiple of 8
   (counting the first column as column 0)."  Aside from these
   conventions, any use of the control characters or DEL in a body must
   either occur
]PRE]

[PRE[
    (1)   because a subtype of text other than "plain"
          specifically assigns some additional meaning, or
]PRE]

[PRE[
    (2)   within the context of a private agreement between the
          sender and recipient. Such private agreements are
          discouraged and should be replaced by the other
          capabilities of this document.
]PRE]

[PRE[
   NOTE: An enormous proliferation of character sets exist beyond US-
   ASCII.  A large number of partially or totally overlapping character
   sets is NOT a good thing.  A SINGLE character set that can be used
   universally for representing all of the world's languages in Internet
   mail would be preferrable.  Unfortunately, existing practice in
   several communities seems to point to the continued use of multiple
   character sets in the near future.  A small number of standard
   character sets are, therefore, defined for Internet use in this
   document.
]PRE]

The defined charset values are:

定義されている charset 値は次の通りです。

[PRE[
    (1)   US-ASCII -- as defined in ANSI X3.4-1986 [US-ASCII].
]PRE]

(1) US-ASCII ANSI X3.4-1986 で定義されたもの。

[PRE[
    (2)   ISO-8859-X -- where "X" is to be replaced, as
          necessary, for the parts of ISO-8859 [ISO-8859].  Note
          that the ISO 646 character sets have deliberately been
          omitted in favor of their 8859 replacements, which are
          the designated character sets for Internet mail.  As of
          the publication of this document, the legitimate values
          for "X" are the digits 1 through 10.
]PRE]

(2) ISO-8859-X ここで "X" は ISO-8859 の部分で置き換えたもの。
なお、 ISO 646 の文字集合達は代わりの Internet メイルの指示文字集合
である 8859 があるので故意に省いています。この文書の出版の時点では、
"X" の適当な値は数字 1 から 10 です。

Characters in the range 128-159 has no assigned meaning in ISO-8859-X. 
Characters with values below 128 in ISO-8859-X have the same
assigned meaning as they do in US-ASCII.

範囲 128〜159 の文字は ISO-8859-X で割り当てられた意味はありません。
ISO-8859-X で128以下の値は US-ASCII で割り当てられたのと同じ意味を持ちます。

[PRE[
   Part 6 of ISO 8859 (Latin/Arabic alphabet) and part 8 (Latin/Hebrew
   alphabet) includes both characters for which the normal writing
   direction is right to left and characters for which it is left to
   right, but do not define a canonical ordering method for representing
   bi-directional text.  The charset values "ISO-8859-6" and "ISO-8859-
   8", however, specify that the visual method is used [RFC-1556].
]PRE]

[PRE[
   All of these character sets are used as pure 7bit or 8bit sets
   without any shift or escape functions.  The meaning of shift and
   escape sequences in these character sets is not defined.
]PRE]

[PRE[
   The character sets specified above are the ones that were relatively
   uncontroversial during the drafting of MIME.  This document does not
   endorse the use of any particular character set other than US-ASCII,
   and recognizes that the future evolution of world character sets
   remains unclear.
]PRE]

Note that the character set used, if anything other than US- ASCII,
must always be explicitly specified in the Content-Type field.

なお、 US-ASCII 以外の文字集合が使われている時は、必ず
Content-Type 領域に明示しなければなりません。

[PRE[
   No character set name other than those defined above may be used in
   Internet mail without the publication of a formal specification and
   its registration with IANA, or by private agreement, in which case
   the character set name must begin with "X-".
]PRE]

Implementors are discouraged from defining new character sets unless
absolutely necessary.

実装者が新しい文字集合を定義するのは完全に必要でない限り非推奨です。

[PRE[
   The "charset" parameter has been defined primarily for the purpose of
   textual data, and is described in this section for that reason.
   However, it is conceivable that non-textual data might also wish to
   specify a charset value for some purpose, in which case the same
   syntax and values should be used.
]PRE]

[PRE[
   In general, composition software should always use the "lowest common
   denominator" character set possible.  For example, if a body contains
   only US-ASCII characters, it SHOULD be marked as being in the US-
   ASCII character set, not ISO-8859-1, which, like all the ISO-8859
   family of character sets, is a superset of US-ASCII.  More generally,
   if a widely-used character set is a subset of another character set,
   and a body contains only characters in the widely-used subset, it
   should be labelled as being in that subset.  This will increase the
   chances that the recipient will be able to view the resulting entity
   correctly.
]PRE]


* RFC 1922 4.   Two New MIME parameters

Here we define two new MIME parameters to be used with "charset" parameters.

ここに、 "charset" パラメーターと共に使う、
2つの新しい MIME パラメーターを定義します。


** 4.1.  "charset-edition"

[PRE[
   This parameter is used after the MIME "charset" parameter, using four
   digits (AD) to indicate what the year of edition is for the character
   set standard shown in "charset".  Its use is optional.
   Implementations should ignore this parameter unless the
   implementation has specific support for that particular character set
   edition.
]PRE]

[PRE[
   The reason for defining this parameter is that there are often
   differences in the defined characters between editions of a character
   set standard.  Sometimes, the difference can not be ignored,
   otherwise implementations would have problems when processing it.
   There are only two ways to indicate this difference, in the current
   MIME syntax.  One way is to indicate the edition in the charset name,
   such as CN-GB-1988-80 (the 1980's edition of GB 1988).  The other way
   is to define a new optional parameter such as "charset-edition".  The
   latter way is better because receiving applications that can only
   process an older edition can still recognize the character set and
   offer to display the text in the older edition.  This display may
   have a few mistakes, but it is better than refusing to display any
   text at all or defaulting to an inappropriate character set such as
   US-ASCII or ISO-8859-1.
]PRE]


** 4.2.  "charset-extension"

[PRE[
   This parameter is also used after the MIME "charset" parameter.  It
   is case-insensitive and optional, and any value of this parameter
   should be registered in IANA.  Unregistered value should start with
   "x-" as with any MIME extension-token.  Implementations should ignore
   this parameter unless the implementation has specific support for
   that particular character set extension.
]PRE]

[PRE[
   A character set extension has displayed glyphs for code points that
   are not assigned in the character set, for example, vendor-specific
   extensions of standard character sets.  This parameter provides the
   option of using these extensions.  Although character set extensions
   may cause interoperability problems, we recognize the existence of
   such extensions.
]PRE]

[PRE[
   For example:
      Content-Type: text/plain; charset=CN-Big5; charset-edition=1984;
       charset-extension=ETen-2.00.03-DOS
]PRE]

[PRE[
   This may indicate Eten company's extension of Big5: ETen 2.00.03 for
   DOS, assuming that "ETen-2.00.03-DOS" is registered with the IANA..
]PRE]


** 4.3.  Formal Syntax:

The following changes and additions are made to the MIME syntax:

MIME 構文に対して、次の通り変更・追加します。

[PRE[
   charset-edition   := "charset-edition" "=" 4DIGIT
                         ; year of edition in four digits
]PRE]

[PRE[
   charset-extension := "charset-extension" "=" extension-token
]PRE]


* メモ

RFC 1922 部分を訳し終えて送ろうとしたら、糞 IE がいかれて
消えちまった。和訳文を返せ。

charset-edition の定義は 4DIGIT ですが、[[10000年問題]]
を考慮して、4*DIGIT とするのがよさげ。[[2000年問題]]対策
と称して 2DIGIT に1900足したりする必要はないと思います。
そういうのを受け取っても、
西暦1世紀に制定された charset だということにしてはどーですか?

charset-extension の登録簿は IANA には無いみたい。
[[#comment]]


* 古い Netscape Navigator の charset パラメーター認識問題

[1] 古い [[NetscapeNavigator]] (2 以前?) には、 [CODE[charset]]
パラメーターを特定の値 (hard coding で実装されている名前) 
以外の場合、手動で[[符号化方法]]を指定することさえ出来なくなり、結果
(多くの場合) 文字化けするという問題があります。

[2] 特に日本語系[[文字コード]]では、 [CODE[x-euc-jp]] 及び [CODE[x-sjis]]
という私用名にしか対応しておらず、 [[NN]]2 より後に登録された [[IANA]]
名 [CODE[euc-jp]] 及び [CODE[shift_jis]] や [CODE[windows-31j]]
が指定されていると文字化けします。

[3] なお、この指定は [[HTTP]] [[頭]]内の [[Content-Type:欄]]に指定しても無視されるようで、
[[HTML]] の [[meta要素]]で [[http-equiv属性]]を使って指定する必要があります。
- [4] >>2 なお、 [CODE[iso-2022-jp]] については問題は起こりません。
- [5] この指定とは関係なしに文書の一部が[[文字化け]]する現象がありますが、原因はよくわかりません。
- [6] [WEAK[2002-12-01 (日) 10:38]] ''[[US-ASCII]]'': >>1-5 問題を Netscape Navigator 2.01 で確認しますた。 (っていうかこんな古い版はやく捨てましょう。)
[[#comment]]


* quoted-string

[15] [[MIME]]/[[HTTP]] で使われる[[媒体型]]の [CODE[charset]]
引数の指定方法では、引数値は他の引数同様に
[CODE(ABNF)[[[value]]]]
です。つまり、 [CODE(ABNF)[[[token]]]]
または [CODE(ABNF)[[[quoted-string]]]]
を1つ使って指定できます。

例:
- [CODE(MIME)[charset=US-ASCII]]
- [CODE(MIME)[charset="US-ASCII"]]
- [CODE(MIME)[charset="US\-ASCII"]]

(すべて等価)

[16] [[WinIE6]] は引用符があるものに対応していないみたいです。

簡単な傍証:
= [CODE(HTTP)[charset=iso-8859-1]] の実体を表示
= 次のどちらか:
-- [CODE(HTTP)[charset=iso-2022-jp]] の実体を表示
-- [CODE(HTTP)[charset="iso-2022-jp"]] の実体を表示

引用符をつけていると見事に化けます。

もちろんこのような杜撰な実装は規格不適合です。


[17]
<u style="display: none;">... no changes ... no changes ... no changes ... no changes ... no changes ... no changes ... no changes ... no changes ... no changes ... no changes ... no changes ...  </u>

(
([[Mr.Anonymous]])


[[#comment]]


* RFC の部分の License

See [[RFCのライセンス]]


* メモ

- [7] [WEAK[2002-12-01 (日) 10:38]] ''[[US-ASCII]]'': age
- [8] [[RFC3023]] は、 [[XML]] 系の媒体型のための [CODE(MIME)[charset]] 引数を規定しています。
- [9] >>8 [CODE[text/xml]] などは、省略時の既定値が MIME でも ''HTTP でも'' [CODE(charset)[us-ascii]] です。また、 [CODE[application/xml]] などは、省略時には既定値なしで、 [[xml宣言]]などを参照します。
- [10] >>8-9 他の [CODE[+xml]] 系媒体型がよく定義にこの RFC を参照していますから、影響力は大きいです。
- [11] >>10 [INS[というわけで詳しい開設を各。]]
- [12] ''Bug 1697 - multipart/form-data 送出時に Content-Type が送られない'' <http://bugzilla.mozilla.gr.jp/show_bug.cgi?id=1697>, ''Bug 116346 - Content-Type should be supplied for form data of 'enctype="multipart/form-data"'[from sub]'' <http://bugzilla.mozilla.org/show_bug.cgi?id=116346> : [CODE(MIME)[[[multipart/form-data]]]] を送るときに [[Mozilla]] が [CODE(MIME)[charset]] 引数をつけなかったという問題。だけどこれは特定の媒体型や特定の実装に固有じゃない、実は大きな問題をはらんでいます。「ファイル添付」のような、 MIME 実体を生成する応用を単に通過するだけのデータのメタ情報をどうやって得るのかとか、多文字化された実装で charset 情報とどう向き合うのかとか。 MIME は10年も前の規格で、こんなことなんて考えてもいなかったわけですけど、この先どうなるでしょう。やっぱりこれまで通り騙し騙し無理しながら現状維持し続けるしか方法はないのかな。
- [13] [CODE(MIME)[charset=136]] ってのみかけたけど、中身は [CODE(charset)[[[Big5]]]] だった。どうしてこんな値になったんだ?
- [14] >>13 はちなみに [[spam]]。