In this chapter, we discuss how HTML documents are represented on a computer and over the Internet.
The section on the document character set addresses the issue of what abstract characters may be part of an HTML document. Characters include the Latin letter "A", the Cyrillic letter "I", the Chinese character meaning "water", etc.
The section on character encodings addresses the issue of how those characters may be represented in a file or when transferred over the Internet. As some character encodings cannot directly represent all characters an author may want to include in a document, HTML offers other mechanisms, called character references, for referring to any character.
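As a brief illustration of character references, the Python sketch below resolves a decimal and a hexadecimal numeric reference with the standard library's `html.unescape`. The choice of the "water" character (code position 27700, hexadecimal 6C34) and the variable names are assumptions made for the example.

```python
import html

# A numeric character reference names a code position in the document
# character set, written in decimal (&#N;) or hexadecimal (&#xN;).
decimal_ref = html.unescape("&#27700;")
hex_ref = html.unescape("&#x6C34;")

# Both references resolve to the same single character.
print(decimal_ref == hex_ref, hex(ord(decimal_ref)))
```

Because the reference is plain ASCII text, it survives in any encoding that can represent ASCII.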
Since there are a great number of characters throughout human languages, and a great variety of ways to represent those characters, proper care must be taken so that documents may be understood by user agents around the world.
To promote interoperability, SGML requires that each application (including HTML) specify its document character set. A document character set consists of a repertoire (a set of abstract characters, such as the Latin letter "A") and code positions (a set of integer references to characters in the repertoire).
Each SGML document (including each HTML document) is a sequence of characters from the repertoire. Computer systems identify each character by its code position; for example, in the ASCII character set, code positions 65, 66, and 67 refer to the characters 'A', 'B', and 'C', respectively.
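The relationship between characters and code positions can be tried out directly. The following is a minimal Python sketch using the built-ins `chr` and `ord`; the variable names are illustrative only.

```python
# Map code positions to characters and back again.
positions = [65, 66, 67]

letters = [chr(p) for p in positions]    # code position -> character
round_trip = [ord(c) for c in letters]   # character -> code position

print(letters)
print(round_trip)
```

The same two built-ins work for code positions well beyond the ASCII range.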
The ASCII character set is not sufficient for a global information system such as the Web, so HTML uses the much more complete character set called the Universal Character Set (UCS), defined in [ISO10646]. This standard defines a repertoire of thousands of characters used by communities all over the world.
The character set defined in [ISO10646] is character-by-character equivalent to Unicode ([UNICODE]). Both of these standards are updated from time to time with new characters, and the amendments should be consulted at the respective Web sites. In the current specification, "[ISO10646]" is used to refer to the document character set while "[UNICODE]" is reserved for references to the Unicode bidirectional text algorithm.
The document character set, however, does not suffice to allow user agents to correctly interpret HTML documents as they are typically exchanged -- encoded as a sequence of bytes in a file or during a network transmission. User agents must also know the specific character encoding that was used to transform the document character stream into a byte stream.
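To see why the encoding label matters, the Python sketch below decodes one byte sequence under two different labels. The sample character (U+00E9) and the variable names are assumptions chosen for the illustration.

```python
# One byte sequence, two charset labels, two different character sequences.
payload = "\u00e9".encode("utf-8")         # U+00E9 (e with acute) as UTF-8 bytes

as_utf8 = payload.decode("utf-8")          # decoded with the correct label
as_latin1 = payload.decode("iso-8859-1")   # decoded with the wrong label

print(len(payload), repr(as_utf8), repr(as_latin1))
```

With the wrong label, each of the two bytes is read as a separate character, so the reader sees two unrelated characters instead of one.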
What this specification calls a character encoding is known by different names in other specifications (which may cause some confusion). However, the concept is largely the same across the Internet. Also, protocol headers, attributes, and parameters referring to character encodings share the same name -- "charset" -- and use the same values from the [IANA] registry (see [CHARSETS] for a complete list).
The "charset" parameter identifies a character encoding, which is a method of converting a sequence of bytes into a sequence of characters. This conversion fits naturally with the scheme of Web activity: servers send HTML documents to user agents as a stream of bytes; user agents interpret them as a sequence of characters. The conversion method can range from simple one-to-one correspondence to complex switching schemes or algorithms.
A simple one-byte-per-character encoding technique is not sufficient for text strings over a character repertoire as large as [ISO10646]. There are several different encodings of parts of [ISO10646] in addition to encodings of the entire character set (such as UCS-4).
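The difference between encodings shows up in how many bytes each character occupies. The Python sketch below compares UTF-8 with a fixed four-byte form; it assumes Python's `utf-32-be` codec as a stand-in for UCS-4, and the helper name `byte_lengths` is hypothetical.

```python
def byte_lengths(ch):
    """Return (UTF-8 length, UTF-32 length) in bytes for one character."""
    return len(ch.encode("utf-8")), len(ch.encode("utf-32-be"))

samples = {
    "Latin A": "\u0041",
    "Cyrillic I": "\u0418",
    "Chinese water": "\u6c34",
}
for name, ch in samples.items():
    print(name, byte_lengths(ch))
```

The fixed-width form spends four bytes on every character, while the variable-width form uses between one and three bytes for these three samples.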
Authoring tools (e.g., text editors) may encode HTML documents in the character encoding of their choice, and the choice largely depends on the conventions used by the system software. These tools may employ any convenient encoding that covers most of the characters contained in the document, provided the encoding is correctly labeled. Occasional characters that fall outside this encoding may still be represented by character references. These always refer to the document character set, not the character encoding.
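An authoring tool could handle such occasional characters automatically. The Python sketch below is one possible approach, not something the specification prescribes: it encodes text as ISO-8859-1 and uses the standard `xmlcharrefreplace` error handler so that anything outside that encoding becomes a numeric character reference. The sample text is an assumption.

```python
import html

# "cafe" with an accented e (inside ISO-8859-1) plus the Chinese
# character for water (outside ISO-8859-1).
text = "caf\u00e9 \u6c34"

# Characters the encoding cannot represent become numeric references.
encoded = text.encode("iso-8859-1", errors="xmlcharrefreplace")

# A reader decodes the bytes, then resolves the references.
restored = html.unescape(encoded.decode("iso-8859-1"))

print(encoded)
print(restored == text)
```

The reference carries the code position 27700 from the document character set, which is why it stays valid whatever encoding the file is stored in.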
Servers and proxies may change a character encoding (called transcoding) on the fly to meet the requests of user agents (see section 14.2 of [RFC2616], the "Accept-Charset" HTTP request header). Servers and proxies do not have to serve a document in a character encoding that covers the entire document character set.
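Transcoding can be sketched in a few lines of Python. The functions `pick_charset` and `transcode` below are hypothetical helpers written for this illustration, and they simplify heavily: q-values in the header are ignored, entries are tried in the order given, and characters the target encoding cannot represent are written as numeric character references.

```python
import codecs

def pick_charset(accept_charset, default="utf-8"):
    """Return the first listed charset that Python has a codec for.

    Simplification: q-values are ignored and "*" is skipped.
    """
    for item in accept_charset.split(","):
        name = item.split(";")[0].strip()
        if not name or name == "*":
            continue
        try:
            codecs.lookup(name)
        except LookupError:
            continue  # unknown to this Python build, try the next entry
        return name
    return default

def transcode(body, source_charset, accept_charset):
    """Re-encode body for the client and return (bytes, chosen charset)."""
    target = pick_charset(accept_charset)
    text = body.decode(source_charset)
    # Unencodable characters become numeric character references.
    return text.encode(target, errors="xmlcharrefreplace"), target

body = "\u6c34".encode("utf-8")
print(transcode(body, "utf-8", "iso-8859-1;q=0.9, utf-8;q=0.5"))
```

A real server would also honor the q-values and label the response with the charset it chose; both are left out here to keep the sketch short.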
Commonly used character encodings on the Web include ISO-8859-1 (also referred to as "Latin-1"; usable for most Western European languages), ISO-8859-5 (which supports Cyrillic), SHIFT_JIS (a Japanese encoding), EUC-JP (another Japanese encoding), and UTF-8 (an encoding of ISO 10646 using a different number of bytes for different characters). Names for character encodings are case-insensitive, so that for example "SHIFT_JIS", "Shift_JIS", and "shift_jis" are equivalent.
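Case-insensitive matching of encoding names can be checked with Python's `codecs.lookup`, which resolves a label to a canonical codec name. The helper `same_encoding` is a hypothetical name for this sketch. Note that Python's alias handling goes further than case folding (it also treats some punctuation variants as equal), so it illustrates the rule rather than implementing it exactly.

```python
import codecs

def same_encoding(label_a, label_b):
    """Return True when both labels resolve to the same Python codec."""
    return codecs.lookup(label_a).name == codecs.lookup(label_b).name

print(same_encoding("SHIFT_JIS", "Shift_JIS"))
print(same_encoding("Shift_JIS", "shift_jis"))
print(same_encoding("UTF-8", "EUC-JP"))
```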
This specification does not mandate which character encodings a user agent must support.
Conforming user agents must correctly map to ISO 10646 all characters in any character encodings that they recognize (or they must behave as if they did).
When HTML text is transmitted in UTF-16 (charset=UTF-16), text data should be transmitted in network byte order ("big-endian", high-order byte first) in accordance with [ISO10646], Section 6.3 and [UNICODE], clause C3, page 3-1.
Furthermore, to maximize chances of proper interpretation, it is recommended that documents transmitted as UTF-16 always begin with a ZERO-WIDTH NON-BREAKING SPACE character (hexadecimal FEFF, also called Byte Order Mark (BOM)) which, when byte-reversed, becomes hexadecimal FFFE, a character guaranteed never to be assigned. Thus, a user-agent receiving a hexadecimal FFFE as the first bytes of a text would know that bytes have to be reversed for the remainder of the text.
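The byte order check described above can be sketched in Python as follows. The function names `utf16_byte_order` and `decode_utf16` are hypothetical; the BOM constants come from the standard `codecs` module. When no mark is present, the sketch assumes network byte order (big-endian), in line with the recommendation for UTF-16 transmission.

```python
import codecs

def utf16_byte_order(data):
    """Return 'big', 'little', or None from the first two bytes."""
    if data[:2] == codecs.BOM_UTF16_BE:   # bytes FE FF: mark read as FEFF
        return "big"
    if data[:2] == codecs.BOM_UTF16_LE:   # bytes FF FE: mark is byte-reversed
        return "little"
    return None

def decode_utf16(data):
    """Decode UTF-16 bytes, stripping a leading byte order mark if present."""
    order = utf16_byte_order(data)
    if order == "little":
        return data[2:].decode("utf-16-le")
    if order == "big":
        return data[2:].decode("utf-16-be")
    # No mark: assume network byte order (big-endian).
    return data.decode("utf-16-be")

big = codecs.BOM_UTF16_BE + "A".encode("utf-16-be")
little = codecs.BOM_UTF16_LE + "A".encode("utf-16-le")
print(utf16_byte_order(big), utf16_byte_order(little))
print(decode_utf16(big) == decode_utf16(little))
```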
The UTF-1 transformation format of [ISO10646] (registered by IANA as ISO-10646-UTF-1) should not be used. For information about ISO 8859-8 and the bidirectional algorithm, please consult the section on bidirectionality and character encoding.