HTML Working Group                                           D. Connolly
INTERNET-DRAFT                                                   MIT/W3C
draft-ietf-html-charset-harmful-00.txt                       May 2, 1995
Expires November, 1995


                    Character Set Considered Harmful

Status of this Document

This document is an Internet-Draft. Internet-Drafts are working
documents of the Internet Engineering Task Force (IETF), its areas, and
its working groups. Note that other groups may also distribute working
documents as Internet-Drafts.

Internet-Drafts are draft documents valid for a maximum of six months
and may be updated, replaced, or obsoleted by other documents at any
time. It is inappropriate to use Internet-Drafts as reference material
or to cite them other than as "work in progress."

To learn the current status of any Internet-Draft, please check the
"1id-abstracts.txt" listing contained in the Internet-Drafts Shadow
Directories on ftp.is.co.za (Africa), nic.nordu.net (Europe),
munnari.oz.au (Pacific Rim), ds.internic.net (US East Coast), or
ftp.isi.edu (US West Coast).

Distribution of this document is unlimited. Please send comments to the
HTML working group (HTML-WG) of the Internet Engineering Task Force
(IETF) at <html-wg@oclc.org>. Discussions of the group are archived at
<http://www.acl.lanl.gov/HTML_WG/archives.html>.

Abstract

The term character set is often used to describe a digital
representation of text. ASCII is perhaps the most widely deployed
representation of text, and in the interest of interoperability,
information systems on the Internet traditionally rely on it
exclusively.

The Multipurpose Internet Mail Extensions (MIME) introduces Internet
Media Types, including text representations besides ASCII. The Hypertext
Markup Language (HTML) used in the World-Wide Web is a proposed Internet
Media Type. But HTML is also an application of Standard Generalized
Markup Language (SGML).

Connolly                                                        [Page 1]

Internet Draft           Character Terminology                 May, 1995

In the MIME and SGML specifications, the discussion of character
representation is notoriously complex, and the two treatments appear to
be subtly inconsistent or incompatible. This document presents a
collection of terms intended to reconcile the two specifications and
serve as a basis for rigorous discussion of characters and their digital
representations.

Introduction

The term character set is often used to describe a digital
representation of text. The specification of such a representation
typically involves identifying a sufficiently expressive collection of
characters, and giving each of them a number.

In conventional mathematical terminology, then, a "character set" is not
just a set of characters, but a function whose domain is a set of
integers, and whose range is a set of characters.

Some standards documents, including the SGML standard, make little or no
use of such conventional mathematical terms as function, domain and
range. Perhaps the authors of those documents intend the documents to be
comprehensible without a prior understanding of mathematics. But the
specification of notions such as the conformance of an SGML document or
SGML system is much more complex than the basics of logic and
mathematics.

In his text on Calculus [Spivak], Michael Spivak writes:

      Every aspect of this book was influenced by the desire to
      present calculus not merely as a prelude to but as the first
      real encounter with mathematics. Since the foundation of
      analysis provided the arena in which modern modes of
      mathematical thinking developed, calculus ought to be the
      place in which to expect, rather than avoid, the strengthening
      of insight with logic. In addition to developing the students'
      intuition about the beautiful concepts of analysis, it is
      surely equally important to persuade them that precision and
      rigor are neither deterrents to intuition, nor ends in
      themselves, but the natural medium in which to formulate and
      think about mathematical questions.

This document is not intended as the first real encounter with
mathematics. But neither will we make any effort to avoid or apologize
for mathematical terminology. The reader is referred to the large body
of literature on logic and set theory, including a history of writings
on math and logic [SET] and Douglas Hofstadter's fascinating book [GEB].

Coded Character Sets

Using "character set" rather than something such as character table or
even character sequence to denote the functions that map integers to
characters is unfortunate, but it is water under the bridge, and a lot
of it by now. Rather than attempting to divert all that water at this
point, we introduce the primitive notion of character and use it to
define the term coded character set from [ISO10646] and other standards:

character
      An atom of information.
coded character set
      A function whose domain is a subset of the integers, and whose
      range is a set of characters.

Note that by the term character, we do not mean a glyph, a name, a
phoneme, nor a bit combination. A character is simply an atomic unit of
communication. It is typically a symbol whose various representations
are understood to mean the same thing by a community of people.

It might seem more intuitive to map from characters to integers, rather
than the way it is defined here. But in practice there are some coded
character sets that assign two different numbers to the same character
[Lee], and so the inverse is not a function in the general case.

There are two other terms used in standards such as [ISO10646] that we
define in relation to the first two:

code position
      An integer. A coded character set and a code position from its
      domain determine a character.
character repertoire
      A set of characters; that is, the range of a coded character set.

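As a rough illustration of these four terms, the following Python sketch
models a coded character set as a mapping from code positions to
characters. The tables TOY_CCS and TOY_CCS_WITH_DUPLICATE and both helper
functions are hypothetical, invented for this illustration and not taken
from any standard. The second table deliberately assigns two code
positions to one character, to show why the inverse need not be a
function.

```python
# Hypothetical toy tables; not taken from any real standard.
# A coded character set is a function from code positions (integers)
# to characters, modelled here as a dict.
TOY_CCS = {65: "A", 66: "B", 67: "C"}

# A table that assigns two different numbers to the same character.
TOY_CCS_WITH_DUPLICATE = {1: "A", 2: "B", 3: "A"}


def repertoire(ccs):
    """Character repertoire: the range of the coded character set."""
    return set(ccs.values())


def code_positions(ccs, character):
    """All code positions that the coded character set maps to `character`."""
    return sorted(n for n, c in ccs.items() if c == character)


if __name__ == "__main__":
    # A coded character set and a code position determine a character.
    print(TOY_CCS[66])
    print(sorted(repertoire(TOY_CCS)))
    # Two positions for one character, so there is no inverse function.
    print(code_positions(TOY_CCS_WITH_DUPLICATE, "A"))
```

Looking up a code position always yields exactly one character, while
looking up a character may yield several code positions.
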
Character Encoding Schemes

The only practical means for exchanging information on the Internet is
to represent it as a sequence of octets (bytes).

One way to transmit a sequence of characters is to agree on a coded
character set and transmit the number (code position) of each
character.

But in practice, characters are encoded using a variety of optimizations
of this brute-force approach: code switching techniques, escape
sequences, etc. The encoding of a sequence of characters is not, in
general, the result of encoding each character independently and then
concatenating the results. But it is sufficiently general to note that
a sequence of characters is encoded as a sequence of octets. So we
define:

octet
      An element of the set {0, 1, 2, ..., 255}.
character encoding scheme
      A function whose domain is the set of sequences of octets, and
      whose range is the set of sequences of characters over some
      character repertoire.

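The point that an encoding is not, in general, a per-character
concatenation can be checked with a small Python sketch. It assumes that
Python's built-in "iso2022_jp" codec is an acceptable stand-in for the
ISO-2022-JP character encoding scheme; the function name iso_2022_jp and
the sample character are choices made for this illustration only.

```python
def iso_2022_jp(octets: bytes) -> str:
    """Character encoding scheme: sequence of octets -> sequence of characters.

    Assumption: Python's "iso2022_jp" codec stands in for ISO-2022-JP.
    """
    return octets.decode("iso2022_jp")


HIRAGANA_A = "\u3042"  # sample character, chosen only for illustration

# Encode one character, and a sequence of two characters.
one = HIRAGANA_A.encode("iso2022_jp")
two = (HIRAGANA_A * 2).encode("iso2022_jp")

if __name__ == "__main__":
    # The joint encoding shares its escape sequences, so it is not the
    # concatenation of the two independent encodings.
    print(two != one + one)
    # Both octet sequences are nevertheless mapped to the same characters.
    print(iso_2022_jp(two) == iso_2022_jp(one + one) == HIRAGANA_A * 2)
```

Defining the scheme as a function from octets to characters, rather than
the reverse, accommodates the fact that several octet sequences may
represent the same sequence of characters.
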
Representation of SGML Text Entities

An SGML document is made up of entities: a text entity called the
document entity, and possibly some other text entities and data
entities.

A text entity is a sequence of characters. The representation of a text
entity is not specified by the SGML standard. For the purpose of
MIME-based interchange of SGML text entities, we define the following:

text entity
      A sequence of characters.
message entity
      A pair (T, OS) where T is an Internet Media Type and OS is a
      sequence of octets.

Note that each text/* media type has an associated charset parameter,
which designates a character encoding scheme. The character encoding
scheme maps the body -- a sequence of octets -- to a text entity -- a
sequence of characters. Hence any message entity of type text/* is
equivalent to a text entity.

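A minimal Python sketch of this mapping follows. The MessageEntity class
and the text_entity function are hypothetical names introduced here, the
fallback to us-ascii when no charset parameter is present is an
assumption of the sketch, and Python's codec registry is assumed to stand
in for the set of registered character encoding schemes.

```python
from email.message import Message
from typing import NamedTuple


class MessageEntity(NamedTuple):
    """Hypothetical model of the pair (T, OS)."""
    media_type: str  # T, for example "text/html; charset=iso-8859-1"
    octets: bytes    # OS, the body


def text_entity(entity: MessageEntity) -> str:
    """Map a text/* message entity to the text entity it represents."""
    header = Message()
    header["Content-Type"] = entity.media_type
    if header.get_content_maintype() != "text":
        raise ValueError("only text/* message entities determine a text entity")
    # Assumption: fall back to us-ascii when no charset parameter is given.
    charset = header.get_content_charset("us-ascii")
    # The charset names a character encoding scheme: octets -> characters.
    return entity.octets.decode(charset)


if __name__ == "__main__":
    entity = MessageEntity("text/html; charset=iso-8859-1", b"caf\xe9")
    print(text_entity(entity))
```

The media type selects the function, and the function is applied to the
body; nothing in the octets themselves identifies the characters.
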
Numeric Character References

Numeric character references are a great source of confusion. The key
insights are that:

   * Every SGML document has exactly one document character set, which
     is a coded character set
   * Numeric character references give code positions in the document
     character set

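The two insights above can be sketched in Python as follows. The sketch
assumes that the document character set is ISO 10646 and uses chr() as a
stand-in for its table; the names iso10646, NUMERIC_REFERENCE and
resolve_references are hypothetical, and the pattern handles only decimal
references of the form &#N;.

```python
import re


def iso10646(code_position: int) -> str:
    """Assumed document character set; chr() stands in for the ISO 10646 table."""
    return chr(code_position)


# Decimal numeric character references of the form &#N;
NUMERIC_REFERENCE = re.compile(r"&#([0-9]+);")


def resolve_references(text: str, document_character_set=iso10646) -> str:
    """Replace each &#N; with the character at code position N."""
    return NUMERIC_REFERENCE.sub(
        lambda match: document_character_set(int(match.group(1))), text
    )


if __name__ == "__main__":
    print(resolve_references("A&#66;C"))  # prints "ABC"
```

Note that resolve_references takes the coded character set as its only
parameter besides the text: the character encoding scheme used to
transmit the entity plays no part in resolving a reference.
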
Example: ISO2022 Encoding with ISO10646 Coded Character Set

Consider the following message entity:

      Date: Saturday, 29-Apr-95 03:53:33 GMT
      MIME-version: 1.0
      Content-Type: text/html; charset=iso-2022-jp

      <TITLE>...</TITLE>
      <BODY>
      Here is some normal text.
      Here is a 10646 numeric character reference &#2432;.
      Here is some ISO-2022-JP text: ...
      </BODY>

To interpret the message entity, we notice that the Content-Type is
text/html, so this represents a text entity. The charset parameter
iso-2022-jp, along with the octet sequence of the body, determines a
sequence of characters. The octets denoted above by '...' represent
characters, as per iso-2022-jp.

To parse the resulting text entity as per SGML, the sender and receiver
must agree on an SGML declaration, since none is present in the document
entity. For this example, we assume that the SGML declaration specifies
ISO10646 as the document character set. So the numeric character
reference &#2432; is resolved with respect to ISO10646.

It may seem contradictory that the ISO-2022-JP character encoding scheme
is defined in terms of a collection of coded character sets, none of
which is ISO10646. But there is no contradiction. Each character encoded
by ISO-2022-JP is in the repertoire of one of those coded character
sets, and each of those repertoires is a subset of the repertoire of
ISO10646.

So while ISO-2022-JP is not sufficient for every ISO10646 document, it
is the case that ISO10646 is a sufficient document character set for any
entity encoded with ISO-2022-JP.

Example: Reducing the Repertoire of an Entity

Suppose we have an SGML document D whose document character set is the
coded character set ISO10646. We find the document entity DE in the form
of a sequence of octets OS in a disk file, encoded using the
Unicode-UCS-2 character encoding scheme.

      Unicode-UCS-2(OS) = DE

We can reduce the character repertoire necessary to represent the
document entity by replacing characters outside the ISO-646-IRV
character repertoire with numeric character references:

      DE' = reduce(DE, ISO10646, ISO-646-IRV)

where

      reduce : SEQ(char) X Coded Character Set X Character Repertoire
               -> SEQ(char)

and

      reduce(empty, CCS, R)    = empty
      reduce(c . rest, CCS, R) = if c in R, c . reduce(rest, CCS, R)
                                 else &#N; . reduce(rest, CCS, R)
                                      where CCS(N) = c

The resulting entity, DE', can then be encoded using US-ASCII:

      US-ASCII(OS') = DE' = reduce(DE, ISO10646, ISO-646-IRV)

Hence, we can represent the document D as a message entity whose content
type is "text/plain; charset=US-ASCII" and whose body is OS'.

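The steps of this example can be sketched in Python. Several choices
below are assumptions of the sketch and not part of the definitions
above: the "utf-16-be" codec stands in for Unicode-UCS-2 (big-endian, no
byte order mark), ord() is used to find the code position N with
CCS(N) = c, and the function is named reduce_repertoire to avoid clashing
with Python's functools.reduce.

```python
def unicode_ucs_2(octets: bytes) -> str:
    """Assumed stand-in for Unicode-UCS-2: big-endian, no byte order mark."""
    return octets.decode("utf-16-be")


def iso10646_position(character: str) -> int:
    """Find N such that ISO10646(N) = character (assumption: ord() gives N)."""
    return ord(character)


# The ISO-646-IRV character repertoire, taken here as the 128 ASCII characters.
ISO_646_IRV = frozenset(chr(n) for n in range(128))


def reduce_repertoire(entity: str, position_of, repertoire) -> str:
    """Replace each character outside `repertoire` with a reference &#N;."""
    return "".join(
        c if c in repertoire else "&#%d;" % position_of(c) for c in entity
    )


if __name__ == "__main__":
    os_ = "caf\u00e9".encode("utf-16-be")   # OS, as found in the disk file
    de = unicode_ucs_2(os_)                 # Unicode-UCS-2(OS) = DE
    de_reduced = reduce_repertoire(de, iso10646_position, ISO_646_IRV)
    os_reduced = de_reduced.encode("us-ascii")  # OS' with US-ASCII(OS') = DE'
    print(os_reduced)                       # prints b'caf&#233;'
```

As in the definition of reduce, characters already in the repertoire are
copied unchanged, so the sketch does no escaping of an ampersand that is
already present in the entity.
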
Conclusion

It is critical to keep the notion of a simple table of characters and
their numbers, i.e. a coded character set, separate from the various
algorithms used to encode sequences of characters, i.e. character
encoding schemes. This separation allows a representation of a text
entity which is consistent with both the MIME and SGML specifications.

Acknowledgements

The idea for the title of this document actually came from John Klensin.
The notion of character encoding scheme was inspired by the MIME
specification by Ned Freed. James Clark, Ed Levinson, and several other
members of the MIMESGML working group collaborated in discussions
leading up to this draft. Liam Quin from SoftQuad and Gavin Nicol from
EBT have provided guidance on these issues in the past. Erik Naggum has
provided invaluable aid in understanding the SGML standard.

References

[MIME]
      N. Borenstein and N. Freed. "MIME (Multipurpose Internet Mail
      Extensions) Part One: Mechanisms for Specifying and Describing the
      Format of Internet Message Bodies." RFC 1521, Bellcore, Innosoft,
      September 1993.
[ASCII]
      US-ASCII. Coded Character Set - 7-Bit American Standard Code for
      Information Interchange. Standard ANSI X3.4-1986, ANSI, 1986.
[ISO-8859]
      ISO 8859. International Standard -- Information Processing -- 8-bit
      Single-Byte Coded Graphic Character Sets -- Part 1: Latin Alphabet
      No. 1, ISO 8859-1:1987. Part 2: Latin alphabet No. 2, ISO 8859-2,
      1987. Part 3: Latin alphabet No. 3, ISO 8859-3, 1988. Part 4: Latin
      alphabet No. 4, ISO 8859-4, 1988. Part 5: Latin/Cyrillic alphabet,
      ISO 8859-5, 1988. Part 6: Latin/Arabic alphabet, ISO 8859-6, 1987.
      Part 7: Latin/Greek alphabet, ISO 8859-7, 1987. Part 8:
      Latin/Hebrew alphabet, ISO 8859-8, 1988. Part 9: Latin alphabet No.
      5, ISO 8859-9, 1990.
[SGML]
      ISO 8879. Information Processing -- Text and Office Systems --
      Standard Generalized Markup Language (SGML), 1986.
[Nicol]
      Nicol, Gavin T. The Multilingual World Wide Web. Electronic Book
      Technologies, Japan. gtn@ebt.com
[Lee]
      Private communication with Liam Quin, from SoftQuad.
[Spivak]
      Spivak, Michael. Calculus. 2nd Ed. 1967 ISBN 0-914098-77-2
[GEB]
      Hofstadter, Douglas R. Gödel, Escher, Bach: An Eternal Golden
      Braid, 1979 ISBN 0-394-75682-7
[SET]
      "Investigations in the foundations of set theory I", in Jean van
      Heijenoort (ed.) _From Frege to Gödel: A Source Book in
      Mathematical Logic, 1879-1931_ (Harvard U.P., 1967)

Author:

Dan Connolly
545 Technology Square
Cambridge, MA 02139
617-258-8143
connolly@w3.org