HTML Working Group                                           D. Connolly
INTERNET-DRAFT                                                   MIT/W3C
draft-ietf-html-charset-harmful-00.txt                       May 2, 1995
                                                  Expires November, 1995


                    Character Set Considered Harmful


Status of this Document

This document is an Internet-Draft. Internet-Drafts are working
documents of the Internet Engineering Task Force (IETF), its areas, and
its working groups. Note that other groups may also distribute working
documents as Internet-Drafts.

Internet-Drafts are draft documents valid for a maximum of six months
and may be updated, replaced, or obsoleted by other documents at any
time. It is inappropriate to use Internet-Drafts as reference material
or to cite them other than as "work in progress."

To learn the current status of any Internet-Draft, please check the
"1id-abstracts.txt" listing contained in the Internet-Drafts Shadow
Directories on ftp.is.co.za (Africa), nic.nordu.net (Europe),
munnari.oz.au (Pacific Rim), ds.internic.net (US East Coast), or
ftp.isi.edu (US West Coast).

Distribution of this document is unlimited. Please send comments to the
HTML working group (HTML-WG) of the Internet Engineering Task Force
(IETF) at <html-wg@oclc.org>. Discussions of the group are archived at
http://www.acl.lanl.gov/HTML_WG/archives.html .

Abstract

The term character set is often used to describe a digital
representation of text. ASCII is perhaps the most widely deployed
representation of text, and in the interest of interoperability,
information systems on the Internet traditionally rely on it
exclusively.

The Multipurpose Internet Mail Extensions (MIME) introduces Internet
Media Types, including text representations besides ASCII. The Hypertext
Markup Language (HTML) used in the World-Wide Web is a proposed Internet
Media Type. But HTML is also an application of the Standard Generalized
Markup Language (SGML).

Connolly                                                        [Page 1]

Internet Draft              Character Terminology              May, 1995

In the MIME and SGML specifications, the discussion of character
representation is notoriously complex, and apparently subtly
inconsistent or incompatible. This document presents a collection of
terms intended to reconcile the two specifications and serve as a basis
for rigorous discussion of characters and their digital representations.

Introduction

The term character set is often used to describe a digital
representation of text. The specification of such a representation
typically involves identifying a sufficiently expressive collection of
characters, and giving each of them a number.

In conventional mathematical terminology, then, a "character set" is
not just a set of characters, but a function whose domain is a set of
integers, and whose range is a set of characters.

Some standards documents, including the SGML standard, make little or no
use of such conventional mathematical terms as function, domain and
range. Perhaps the authors of those documents intend the documents to be
comprehensible without a prior understanding of mathematics. But the
specification of notions such as the conformance of an SGML document or
SGML system is much more complex than the basics of logic and
mathematics.

In his text on Calculus [Spivak], Michael Spivak writes:

    Every aspect of this book was influenced by the desire to
    present calculus not merely as a prelude to but as the first
    real encounter with mathematics. Since the foundation of
    analysis provided the arena in which modern modes of
    mathematical thinking developed, calculus ought to be the
    place in which to expect, rather than avoid, the strengthening
    of insight with logic. In addition to developing the students'
    intuition about the beautiful concepts of analysis, it is
    surely equally important to persuade them that precision and
    rigor are neither deterrents to intuition, nor ends in
    themselves, but the natural medium in which to formulate and
    think about mathematical questions.

This document is not intended as the first real encounter with
mathematics. But neither will we make any effort to avoid or apologize
for mathematical terminology. The reader is referred to the large body
of literature on logic and set theory, including a history of writings
on math and logic [SET] and Douglas Hofstadter's fascinating book [GEB].

Coded Character Sets

Using "character set" rather than something such as character table or
even character sequence to denote the function that maps integers to
characters is unfortunate, but it is water under the bridge, and a lot
of it by now. Rather than attempting to divert all that water at this
point, we introduce the primitive notion of character and use it to
define the term coded character set from [ISO10646] and other standards:

    character
        An atom of information.
    coded character set
        A function whose domain is a subset of the integers, and whose
        range is a set of characters.

Note that by the term character, we do not mean a glyph, a name, a
phoneme, or a bit combination. A character is simply an atomic unit of
communication. It is typically a symbol whose various representations
are understood to mean the same thing by a community of people.

It might seem more intuitive to map from characters to integers, rather
than the way it is defined here. But in practice there are some coded
character sets that assign two different numbers to the same character
[Lee], and so the inverse is not a function in the general case.

There are two other terms used in standards such as [ISO10646] that we
define in relation to the first two:

    code position
        An integer. A coded character set and a code position from its
        domain determine a character.
    character repertoire
        A set of characters; that is, the range of a coded character
        set.

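The four definitions above can be sketched in Python. This is only an
illustration: the coded character set TOY_CCS and the helper names are
hypothetical and are not taken from any standard. A coded character set
is modeled as a dict from code positions to characters, and one
character is deliberately given two code positions to show why the
inverse is not a function.

```python
# A toy coded character set: a function from code positions (integers)
# to characters, modeled as a dict. TOY_CCS is hypothetical.
TOY_CCS = {
    0: "A",
    1: "B",
    2: "C",
    3: "A",  # a second code position for the same character
}


def character_at(ccs, code_position):
    """Apply the coded character set to a code position in its domain."""
    return ccs[code_position]


def repertoire(ccs):
    """The character repertoire is the range of the coded character set."""
    return set(ccs.values())


def code_positions_of(ccs, character):
    """Every code position that maps to the character.

    There may be more than one, so the inverse of a coded character
    set is a relation, not necessarily a function.
    """
    return sorted(n for n, c in ccs.items() if c == character)
```

Here code_positions_of(TOY_CCS, "A") yields two code positions, while
character_at always yields exactly one character for a position in the
domain.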

Character Encoding Schemes

The only practical means for exchanging information on the Internet is
to represent it as a sequence of octets (bytes).

One way to transmit a sequence of characters is to agree on a coded
character set and transmit the code position of each character.

But in practice, characters are encoded using a variety of optimizations
of this brute-force approach: code switching techniques, escape
sequences, etc. The encoding of a sequence of characters is not, in
general, the result of encoding each character independently and then
concatenating the results. But it is sufficiently general to note that
a sequence of characters is encoded as a sequence of octets. So we
define:

    octet
        An element of the set {0, 1, 2, ..., 255}.
    character encoding scheme
        A function whose domain is the set of sequences of octets, and
        whose range is the set of sequences of characters over some
        character repertoire.

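As a hedged illustration of why an encoding is not simply the
concatenation of per-character encodings, the following Python sketch
uses the standard library's iso2022_jp codec as a stand-in for a
character encoding scheme. The function name decode and the sample text
are assumptions made for this example only.

```python
def decode(octets: bytes, scheme: str = "iso2022_jp") -> str:
    """A character encoding scheme in the sense defined above:
    it maps a sequence of octets to a sequence of characters."""
    return octets.decode(scheme)


text = "\u3042\u3044"  # two hiragana characters

# Encode the whole sequence at once: the escape sequences that switch
# character sets are shared by both characters.
whole = text.encode("iso2022_jp")

# Encode each character independently, then concatenate: every
# character carries its own escape sequences.
piecewise = b"".join(ch.encode("iso2022_jp") for ch in text)

# The two octet sequences differ, yet both map to the same characters.
same_characters = decode(whole) == decode(piecewise) == text
different_octets = whole != piecewise
```

Note that several octet sequences map to the same character sequence,
which is consistent with defining the scheme as a function from octets
to characters rather than the reverse.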

Representation of SGML Text Entities

An SGML document is made up of entities: a text entity called the
document entity, and possibly some other text entities and data
entities.

A text entity is a sequence of characters. The representation of a text
entity is not specified by the SGML standard. For the purpose of
MIME-based interchange of SGML text entities, we define the following:

    text entity
        A sequence of characters.
    message entity
        A pair (T, OS) where T is an Internet Media Type and OS is a
        sequence of octets.

Note that each text/* media type has an associated charset parameter,
which designates a character encoding scheme. The character encoding
scheme maps the body -- a sequence of octets -- to a text entity -- a
sequence of characters. Hence any message entity of type text/* is
equivalent to a text entity.

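A minimal sketch of this mapping, assuming Python's standard email
package is used to read the charset parameter: the function name
text_entity is hypothetical, and the fallback to us-ascii when no
charset parameter is present is an assumption based on the MIME default
for text/plain.

```python
from email.message import EmailMessage


def text_entity(media_type: str, octets: bytes) -> str:
    """Map a message entity (T, OS) of type text/* to a text entity."""
    header = EmailMessage()
    header["Content-Type"] = media_type
    if header.get_content_maintype() != "text":
        raise ValueError("not a text/* message entity")
    # The charset parameter designates the character encoding scheme.
    # Assumption: fall back to us-ascii when the parameter is absent.
    charset = header.get_content_charset("us-ascii")
    return octets.decode(charset)


body = "Hello \u65e5\u672c".encode("iso2022_jp")
entity = text_entity("text/html; charset=iso-2022-jp", body)
```

The pair passed to text_entity corresponds to (T, OS) in the definition
above, and the returned string corresponds to the text entity.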

Numeric Character References

Numeric character references are a great source of confusion. The key
insights are that:

  * Every SGML document has exactly one document character set, which
    is a coded character set
  * Numeric character references give code positions in the document
    character set

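The two points above can be sketched as follows. The function name
resolve_numeric_references is hypothetical, and Python's built-in chr
is used here as a stand-in for ISO 10646 as the document character set;
any other callable from code positions to characters could be passed
instead.

```python
import re

# Matches decimal numeric character references such as &#233;
_NCR = re.compile(r"&#([0-9]+);")


def resolve_numeric_references(text, document_character_set=chr):
    """Replace each &#N; with the character at code position N in the
    document character set. The charset used to transmit the text
    plays no part in this step."""
    return _NCR.sub(
        lambda m: document_character_set(int(m.group(1))),
        text,
    )
```

For example, with the default document character set,
resolve_numeric_references("caf&#233;") resolves the reference to the
character at code position 233.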

Example: ISO2022 Encoding with ISO10646 Coded Character Set

Consider the following message entity:

    Date: Saturday, 29-Apr-95 03:53:33 GMT
    MIME-version: 1.0
    Content-Type: text/html; charset=iso-2022-jp

    <TITLE>...</TITLE>
    <BODY>
    Here is some normal text.
    Here is a 10646 numeric character reference &#2432;.
    Here is some ISO-2022-JP text: ...
    </BODY>

To interpret the message entity, we notice that the Content-Type is
text/html, so this represents a text entity. The charset parameter
iso-2022-jp, along with the octet sequence of the body, determines a
sequence of characters. The octets denoted above by '...' represent
characters, as per iso-2022-jp.

To parse the resulting text entity as per SGML, the sender and receiver
must agree on an SGML declaration, since none is present in the document
entity. For this example, we assume that the SGML declaration specifies
ISO10646 as the document character set. So the numeric character
reference &#2432; is resolved with respect to ISO10646.

It may seem contradictory that the ISO-2022-JP character encoding scheme
is defined in terms of a collection of coded character sets, none of
which is ISO10646. But there is no contradiction. Each character encoded
by ISO-2022-JP is in the repertoire of one of those coded character
sets, and each of those repertoires is a subset of the repertoire of
ISO10646.

So while ISO-2022-JP is not sufficient for every ISO10646 document, it
is the case that ISO10646 is a sufficient document character set for any
entity encoded with ISO-2022-JP.

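The two-step interpretation in this example can be sketched in Python
under some assumptions: the function name interpret and the sample body
are hypothetical, the standard iso-2022-jp codec plays the role of the
character encoding scheme, and chr stands in for ISO 10646 as the
document character set.

```python
import re


def interpret(octets: bytes, charset: str, document_character_set=chr) -> str:
    # Step 1: the character encoding scheme named by the charset
    # parameter maps the body octets to a sequence of characters.
    text_entity = octets.decode(charset)
    # Step 2: numeric character references are resolved against the
    # document character set, independently of the charset parameter.
    return re.sub(
        r"&#([0-9]+);",
        lambda m: document_character_set(int(m.group(1))),
        text_entity,
    )


# A hypothetical body mixing a numeric reference with encoded text.
body = "Reference: &#233;. Text: \u65e5\u672c\u8a9e".encode("iso-2022-jp")
result = interpret(body, "iso-2022-jp")
```

The numeric reference is resolved by code position in the document
character set, while the remaining characters come from decoding the
octets; the two mechanisms do not interfere.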

Example: Reducing the Repertoire of an Entity

Suppose we have an SGML document D whose document character set is the
coded character set ISO10646. We find the document entity DE in the form
of a sequence of octets OS in a disk file, encoded using the
Unicode-UCS-2 character encoding scheme.

    Unicode-UCS-2(OS) = DE

We can reduce the character repertoire necessary to represent the
document entity by replacing characters outside the ISO-646-IRV
character repertoire with numeric character references:

    DE' = reduce(DE, ISO10646, ISO-646-IRV)

where

    reduce : SEQ(char) X Coded Character Set X Character Repertoire
             -> SEQ(char)

and

    reduce(empty, CCS, R)    = empty
    reduce(c . rest, CCS, R) = if c in R, c . reduce(rest, CCS, R)
                               else &#N; . reduce(rest, CCS, R)
                               where CCS(N) = c

The resulting entity, DE', can then be encoded using US-ASCII:

    US-ASCII(OS') = DE' = reduce(DE, ISO10646, ISO-646-IRV)

Hence, we can represent the document D as a message entity whose content
type is "text/plain; charset=US-ASCII" and whose body is OS'.

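A runnable sketch of reduce, written iteratively rather than
recursively, follows. Several names are assumptions for illustration:
ISO_646_IRV approximates the ISO-646-IRV repertoire by the 128 ASCII
characters, ord supplies a code position N with CCS(N) = c for ISO 10646
(so the function takes an inverse lookup instead of the coded character
set itself), and Python's utf-16-be codec stands in for Unicode-UCS-2.

```python
# Assumption: the ISO-646-IRV repertoire is taken to be the 128
# characters of ASCII.
ISO_646_IRV = frozenset(chr(n) for n in range(128))


def reduce(entity, code_position_of, repertoire):
    """Replace every character outside the repertoire with a numeric
    character reference to one of its code positions."""
    return "".join(
        c if c in repertoire else "&#%d;" % code_position_of(c)
        for c in entity
    )


# OS: octets found on disk; utf-16-be stands in for Unicode-UCS-2.
octets = "na\u00efve \u65e5\u672c".encode("utf-16-be")
de = octets.decode("utf-16-be")            # DE
de_reduced = reduce(de, ord, ISO_646_IRV)  # DE'
os_reduced = de_reduced.encode("us-ascii") # OS'
```

Like the definition it follows, this sketch does not treat an "&"
already present in the entity specially.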

Conclusion

It is critical to keep the notion of a simple table of characters and
their numbers, i.e. a coded character set, separate from the various
algorithms for encoding sequences of characters, i.e. character
encoding schemes. This separation allows a representation of a text
entity which is consistent with both the MIME and SGML specifications.

Acknowledgements

The idea for the title of this document actually came from John Klensin.
The notion of character encoding scheme was inspired by the MIME
specification by Ned Freed. James Clark, Ed Levinson, and several other
members of the MIMESGML working group collaborated in discussions
leading up to this draft. Liam Quin from SoftQuad and Gavin Nicol from
EBT have provided guidance on these issues in the past. Erik Naggum has
provided invaluable aid in understanding the SGML standard.

References

[MIME]
    N. Borenstein and N. Freed. "MIME (Multipurpose Internet Mail
    Extensions) Part One: Mechanisms for Specifying and Describing the
    Format of Internet Message Bodies." RFC 1521, Bellcore, Innosoft,
    September 1993.
[ASCII]
    US-ASCII. Coded Character Set - 7-Bit American Standard Code for
    Information Interchange. Standard ANSI X3.4-1986, ANSI, 1986.
[ISO-8859]
    ISO 8859. International Standard -- Information Processing -- 8-bit
    Single-Byte Coded Graphic Character Sets -- Part 1: Latin Alphabet
    No. 1, ISO 8859-1:1987. Part 2: Latin alphabet No. 2, ISO 8859-2,
    1987. Part 3: Latin alphabet No. 3, ISO 8859-3, 1988. Part 4: Latin
    alphabet No. 4, ISO 8859-4, 1988. Part 5: Latin/Cyrillic alphabet,
    ISO 8859-5, 1988. Part 6: Latin/Arabic alphabet, ISO 8859-6, 1987.
    Part 7: Latin/Greek alphabet, ISO 8859-7, 1987. Part 8:
    Latin/Hebrew alphabet, ISO 8859-8, 1988. Part 9: Latin alphabet No.
    5, ISO 8859-9, 1990.
[SGML]
    ISO 8879. Information Processing -- Text and Office Systems --
    Standard Generalized Markup Language (SGML), 1986.
[Nicol]
    The Multilingual World Wide Web, Gavin T. Nicol, Electronic Book
    Technologies, Japan. gtn@ebt.com
[Lee]
    Private communication with Liam Quin, from SoftQuad.
[Spivak]
    Spivak, Michael. Calculus. 2nd Ed. 1967 ISBN 0-914098-77-2
[GEB]
    Hofstadter, Douglas R. Gödel, Escher, Bach: An Eternal Golden
    Braid, 1979 ISBN 0-394-75682-7
[SET]
    "Investigations in the foundations of set theory I", in Jean van
    Heijenoort (ed.) _From Frege to Gödel: A Source Book in
    Mathematical Logic, 1879-1931_ (Harvard U.P., 1967)


Author:

    Dan Connolly
    545 Technology Square
    Cambridge, MA 02139
    617-258-8143
    connolly@w3.org