HTML Working Group                                           D. Connolly
INTERNET-DRAFT                                                   MIT/W3C
draft-ietf-html-charset-harmful-00.txt                       May 2, 1995
Expires November, 1995


                    Character Set Considered Harmful

Status of this Document

This document is an Internet-Draft. Internet-Drafts are working
documents of the Internet Engineering Task Force (IETF), its areas, and
its working groups. Note that other groups may also distribute working
documents as Internet-Drafts.

Internet-Drafts are draft documents valid for a maximum of six months
and may be updated, replaced, or obsoleted by other documents at any
time. It is inappropriate to use Internet-Drafts as reference material
or to cite them other than as "work in progress."

To learn the current status of any Internet-Draft, please check the
"1id-abstracts.txt" listing contained in the Internet-Drafts Shadow
Directories on ftp.is.co.za (Africa), nic.nordu.net (Europe),
munnari.oz.au (Pacific Rim), ds.internic.net (US East Coast), or
ftp.isi.edu (US West Coast).

Distribution of this document is unlimited. Please send comments to the
HTML working group (HTML-WG) of the Internet Engineering Task Force
(IETF) at <html-wg@oclc.org>. Discussions of the group are archived at
http://www.acl.lanl.gov/HTML_WG/archives.html.

Abstract

The term character set is often used to describe a digital
representation of text. ASCII is perhaps the most widely deployed
representation of text, and in the interest of interoperability,
information systems on the Internet traditionally rely on it
exclusively.

The Multipurpose Internet Mail Extensions (MIME) introduces Internet
Media Types, including text representations besides ASCII. The Hypertext
Markup Language (HTML) used in the World-Wide Web is a proposed Internet
Media Type. But HTML is also an application of Standard Generalized
Markup Language (SGML).


Connolly                                                        [Page 1]

Internet Draft           Character Terminology                 May, 1995


In the MIME and SGML specifications, the discussion of character
representation is notoriously complex, and apparently subtly
inconsistent or incompatible. This document presents a collection of
terms intended to reconcile the two specifications and serve as a basis
for rigorous discussion of characters and their digital representations.

Introduction

The term character set is often used to describe a digital
representation of text. The specification of such a representation
typically involves identifying a sufficiently expressive collection of
characters, and giving each of them a number.

In conventional mathematical terminology, then, a "character set" is
not just a set of characters, but a function whose domain is a set of
integers, and whose range is a set of characters.

Some standards documents, including the SGML standard, make little or no
use of such conventional mathematical terms as function, domain and
range. Perhaps the authors of those documents intend the documents to be
comprehensible without a prior understanding of mathematics. But the
specification of notions such as the conformance of an SGML document or
SGML system is much more complex than the basics of logic and
mathematics.

In his text on Calculus [Spivak], Michael Spivak writes:

    Every aspect of this book was influenced by the desire to
    present calculus not merely as a prelude to but as the first
    real encounter with mathematics. Since the foundation of
    analysis provided the arena in which modern modes of
    mathematical thinking developed, calculus ought to be the
    place in which to expect, rather than avoid, the strengthening
    of insight with logic. In addition to developing the students'
    intuition about the beautiful concepts of analysis, it is
    surely equally important to persuade them that precision and
    rigor are neither deterrents to intuition, nor ends in
    themselves, but the natural medium in which to formulate and
    think about mathematical questions.

This document is not intended as the first real encounter with
mathematics. But neither will we make any effort to avoid or apologize
for mathematical terminology. The reader is referred to the large body
of literature on logic and set theory, including a history of writings
on math and logic [SET] and Douglas Hofstadter's fascinating book [GEB].

Coded Character Sets

Using "character set" rather than something such as character table or
even character sequence to denote the function that maps integers to
characters is unfortunate, but it is water under the bridge, and a lot
of it by now. Rather than attempting to divert all that water at this
point, we introduce the primitive notion of character and use it to
define the term coded character set from [ISO10646] and other standards:

character
    An atom of information
coded character set
    A function whose domain is a subset of the integers, and whose
    range is a set of characters.

Note that by the term character, we do not mean a glyph, a name, a
phoneme, nor a bit combination. A character is simply an atomic unit of
communication. It is typically a symbol whose various representations
are understood to mean the same thing by a community of people.

It might seem more intuitive to map from characters to integers, rather
than the way it is defined here. But in practice there are some coded
character sets that assign two different numbers to the same character
[Lee], and so the inverse is not a function in the general case.

There are two other terms used in standards such as [ISO10646] that we
define in relation to the first two:

code position
    An integer. A coded character set and a code position from its
    domain determine a character.
character repertoire
    A set of characters; that is, the range of a coded character set.
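
The four definitions above can be sketched in a few lines of Python.
This is only an illustration: the table TOY_CCS and the helper names
are invented for the example and do not correspond to any real
standard. A dict stands in for the function from code positions to
characters, and one character is deliberately given two code positions
to show why the inverse is a relation rather than a function.

```python
# A toy coded character set: a function from code positions (integers)
# to characters, modeled as a dict. The table itself is hypothetical.
TOY_CCS = {
    0x41: "A",
    0x42: "B",
    0x80: "A",  # a second code position for the same character
}


def character_at(ccs, code_position):
    """Apply the coded character set to a code position in its domain."""
    return ccs[code_position]


def repertoire(ccs):
    """The character repertoire is the range of the function."""
    return set(ccs.values())


def code_positions_of(ccs, character):
    """The inverse relation: all code positions mapped to `character`."""
    return sorted(n for n, c in ccs.items() if c == character)
```

Here code_positions_of(TOY_CCS, "A") yields two integers, so no single
"number of the character A" exists in this toy table.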


Character Encoding Schemes

The only practical means for exchanging information on the Internet is
to represent it as a sequence of octets (bytes).

One way to transmit a sequence of characters is to agree on a coded
character set and transmit the code position of each of the
characters.

But in practice, characters are encoded using a variety of optimizations
of this brute-force approach: code switching techniques, escape
sequences, etc. The encoding of a sequence of characters is not, in
general, the result of encoding each character independently and then
concatenating the results. But it is sufficiently general to note that
a sequence of characters is encoded as a sequence of octets. So we
define:

octet
    an element of the set {0, 1, 2, ..., 255}
character encoding scheme
    a function whose domain is the set of sequences of octets, and
    whose range is the set of sequences of characters over some
    character repertoire.
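
As a rough illustration of the definition above, Python's built-in
codecs can play the role of character encoding schemes: decoding maps
an octet sequence to a character sequence. The sketch below assumes
Python's "iso2022_jp" codec as the scheme; the function names decode
and encode are invented for the example. It also checks the claim that
encoding a sequence is not the same as encoding each character
independently and concatenating.

```python
SCHEME = "iso2022_jp"  # Python's codec name for ISO-2022-JP


def decode(octets: bytes, scheme: str = SCHEME) -> str:
    """The scheme itself: a sequence of octets -> a sequence of characters."""
    return octets.decode(scheme)


def encode(characters: str, scheme: str = SCHEME) -> bytes:
    """One octet sequence that the scheme maps back to `characters`."""
    return characters.encode(scheme)


one = encode("\u3042")        # HIRAGANA LETTER A, alone
two = encode("\u3042\u3042")  # the same character, twice

# Escape sequences that switch coded character sets are emitted once per
# run of characters, so the encoding of the pair is shorter than two
# independently encoded characters placed end to end.
print(len(one), len(two), len(one + one))
```

Both two and one + one decode to the same pair of characters, which is
consistent with defining the scheme in the octets-to-characters
direction: several octet sequences may represent one text.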


Representation of SGML Text Entities

An SGML document is made up of entities: a text entity called the
document entity, and possibly some other text entities and data
entities.

A text entity is a sequence of characters. The representation of a text
entity is not specified by the SGML standard. For the purpose of
MIME-based interchange of SGML text entities, we define the following:

text entity
    a sequence of characters
message entity
    a pair (T, OS) where T is an Internet Media Type and OS is a
    sequence of octets.

Note that each text/* media type has an associated charset parameter,
which designates a character encoding scheme. The character encoding
scheme maps the body -- a sequence of octets -- to a text entity -- a
sequence of characters. Hence any message entity of type text/* is
equivalent to a text entity.
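
The mapping from a message entity (T, OS) to a text entity might be
sketched as follows. The function name text_entity is hypothetical;
the standard-library email.message.Message class is used only to parse
the media type and its charset parameter. Falling back to US-ASCII
when no charset parameter is present is an assumption of this sketch.

```python
from email.message import Message


def text_entity(media_type: str, octets: bytes) -> str:
    """Map a message entity (T, OS) of type text/* to a text entity."""
    header = Message()
    header["Content-Type"] = media_type
    if header.get_content_maintype() != "text":
        raise ValueError("not a text/* message entity")
    # Assumed default when the charset parameter is absent: US-ASCII.
    charset = header.get_content_charset("us-ascii")
    # The charset names a character encoding scheme; apply it to the body.
    return octets.decode(charset)
```

For example, text_entity("text/plain; charset=iso-8859-1", b"caf\xe9")
yields a four-character text entity, while a media type such as
image/gif is rejected because no character encoding scheme applies.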


Numeric Character References

Numeric character references are a great source of confusion. The key
insights are that:
  * Every SGML document has exactly one document character set, which
    is a coded character set
  * Numeric character references give code positions in the document
    character set


Example: ISO-2022-JP Encoding with ISO10646 Coded Character Set

Consider the following message entity:

    Date: Saturday, 29-Apr-95 03:53:33 GMT
    MIME-version: 1.0
    Content-Type: text/html; charset=iso-2022-jp

    <TITLE>...</TITLE>
    <BODY>
    Here is some normal text.
    Here is a 10646 numeric character reference &#2432;.
    Here is some ISO-2022-JP text: ...
    </BODY>

To interpret the message entity, we notice that the Content-Type is
text/html, so this represents a text entity. The charset parameter
iso-2022-jp, along with the octet sequence of the body, determines a
sequence of characters. The octets denoted above by '...' represent
characters, as per iso-2022-jp.

To parse the resulting text entity as per SGML, the sender and receiver
must agree on an SGML declaration, since none is present in the document
entity. For this example, we assume that the SGML declaration specifies
ISO10646 as the document character set. So the numeric character
reference &#2432; is resolved with respect to ISO10646.
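
The resolution step can be sketched as below, under two simplifying
assumptions: numeric character references always take the decimal
form "&#N;" with a closing semicolon, and the document character set
is passed in as a Python callable from code positions to characters.
Python's chr() is used as a stand-in for ISO10646. All names here are
invented for the illustration.

```python
import re

# Simplification: decimal references terminated by a semicolon only.
NUMERIC_REFERENCE = re.compile(r"&#([0-9]+);")


def resolve_numeric_references(text, document_character_set):
    """Replace each &#N; by the character at code position N.

    `document_character_set` is a callable from code positions to
    characters. It is independent of the charset parameter that was
    used to turn the octets of the body into `text`.
    """
    def replace(match):
        return document_character_set(int(match.group(1)))

    return NUMERIC_REFERENCE.sub(replace, text)


# Stand-in for ISO10646 as the document character set.
iso10646 = chr
```

With iso10646 as the document character set, code position 2432
resolves to the character U+0980 no matter which character encoding
scheme carried the text. A different document character set would
resolve the same reference to a different character.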

It may seem contradictory that the ISO-2022-JP character encoding scheme
is defined in terms of a collection of coded character sets, none of
which is ISO10646. But there is no contradiction. Each character encoded
by ISO-2022-JP is in the repertoire of one of those coded character
sets, and each of those repertoires is a subset of the repertoire of
ISO10646.

So while ISO-2022-JP is not sufficient for every ISO10646 document, it
is the case that ISO10646 is a sufficient document character set for any
entity encoded with ISO-2022-JP.


Example: Reducing the Repertoire of an Entity

Suppose we have an SGML document D whose document character set is the
coded character set ISO10646. We find the document entity DE in the form
of a sequence of octets OS in a disk file, encoded using the
Unicode-UCS-2 character encoding scheme:

    Unicode-UCS-2(OS) = DE

We can reduce the character repertoire necessary to represent the
document entity by replacing characters outside the ISO-646-IRV
character repertoire with numeric character references:

    DE' = reduce(DE, ISO10646, ISO-646-IRV)

where

    reduce : SEQ(char) X Coded Character Set X Character Repertoire
             -> SEQ(char)

and

    reduce(c . rest, CCS, R) = if c in R, c . reduce(rest, CCS, R)
                               else &#N; . reduce(rest, CCS, R)
                                    where CCS(N) = c

The resulting entity, DE', can then be encoded using US-ASCII:

    US-ASCII(OS') = DE' = reduce(DE, ISO10646, ISO-646-IRV)

Hence, we can represent the document D as a message entity whose content
type is "text/plain; charset=US-ASCII" and whose body is OS'.
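
A minimal Python sketch of reduce follows. It departs from the
notation above in three labeled ways: the coded character set is
supplied as a callable code_position_of that picks one N with
CCS(N) = c (Python's ord() stands in for ISO10646), the ISO-646-IRV
repertoire is approximated by printable ASCII plus tab and newline,
and the empty sequence is mapped to itself, a base case the recursive
definition leaves implicit. The sample text is invented.

```python
def reduce_repertoire(chars, code_position_of, repertoire):
    """Replace characters outside `repertoire` by references &#N;.

    `code_position_of(c)` must return an N such that CCS(N) = c in the
    document character set.
    """
    out = []
    for c in chars:
        if c in repertoire:
            out.append(c)
        else:
            out.append("&#%d;" % code_position_of(c))
    return "".join(out)


# Approximation of the ISO-646-IRV repertoire (an assumption).
ISO_646_IRV = {chr(n) for n in range(0x20, 0x7F)} | {"\t", "\n"}

de = "na\u00efve \u0980"                        # DE, a text entity
de_prime = reduce_repertoire(de, ord, ISO_646_IRV)  # DE'
os_prime = de_prime.encode("ascii")             # OS', the US-ASCII body
```

For this particular case Python offers the same transformation
directly: de.encode("ascii", "xmlcharrefreplace") produces the same
octets as os_prime. Neither version escapes characters that are
already markup, so a literal "&#" in DE would need separate handling.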


Conclusion

It is critical to keep the notion of a simple table of characters and
their numbers, i.e. a coded character set, separate from the various
algorithms for encoding sequences of characters, i.e. character
encoding schemes. This separation allows a representation of a text
entity which is consistent with both the MIME and SGML specifications.


Acknowledgements

The idea for the title of this document actually came from John Klensin.
The notion of character encoding scheme was inspired by the MIME
specification by Ned Freed. James Clark, Ed Levinson, and several other
members of the MIMESGML working group collaborated in discussions
leading up to this draft. Liam Quin from SoftQuad and Gavin Nicol from
EBT have provided guidance on these issues in the past. Erik Naggum has
provided invaluable aid in understanding the SGML standard.


References

[MIME]
    N. Borenstein and N. Freed. "MIME (Multipurpose Internet Mail
    Extensions) Part One: Mechanisms for Specifying and Describing the
    Format of Internet Message Bodies." RFC 1521, Bellcore, Innosoft,
    September 1993.
[ASCII]
    US-ASCII. Coded Character Set - 7-Bit American Standard Code for
    Information Interchange. Standard ANSI X3.4-1986, ANSI, 1986.
[ISO-8859]
    ISO 8859. International Standard -- Information Processing -- 8-bit
    Single-Byte Coded Graphic Character Sets -- Part 1: Latin Alphabet
    No. 1, ISO 8859-1:1987. Part 2: Latin alphabet No. 2, ISO 8859-2,
    1987. Part 3: Latin alphabet No. 3, ISO 8859-3, 1988. Part 4: Latin
    alphabet No. 4, ISO 8859-4, 1988. Part 5: Latin/Cyrillic alphabet,
    ISO 8859-5, 1988. Part 6: Latin/Arabic alphabet, ISO 8859-6, 1987.
    Part 7: Latin/Greek alphabet, ISO 8859-7, 1987. Part 8:
    Latin/Hebrew alphabet, ISO 8859-8, 1988. Part 9: Latin alphabet No.
    5, ISO 8859-9, 1990.
[SGML]
    ISO 8879. Information Processing -- Text and Office Systems --
    Standard Generalized Markup Language (SGML), 1986.
[Nicol]
    Gavin T. Nicol. The Multilingual World Wide Web. Electronic Book
    Technologies, Japan. gtn@ebt.com
[Lee]
    Private communication with Liam Quin, from SoftQuad.
[Spivak]
    Spivak, Michael. Calculus. 2nd Ed. 1967. ISBN 0-914098-77-2
[GEB]
    Hofstadter, Douglas R. Gödel, Escher, Bach: An Eternal Golden
    Braid. 1979. ISBN 0-394-75682-7
[SET]
    "Investigations in the foundations of set theory I", in Jean van
    Heijenoort (ed.), From Frege to Gödel: A Source Book in
    Mathematical Logic, 1879-1931 (Harvard U.P., 1967).


Author:

Dan Connolly
545 Technology Square
Cambridge, MA 02139
617-258-8143
connolly@w3.org