
本地化技术一览

=Unicode=

As a universal character set intended to include all the characters of the world, Unicode originally assigned code points as 16-bit integers, which means that at most 65,536 characters could be encoded. However, largely because of the huge set of CJK characters, this proved insufficient, and since Unicode 2.0 the code space has been extended to 21 bits (the range 0 to 10FFFF), which supports up to 1,114,112 code points.

Unicode 是一个旨在包括世界上所有字符的字符集,最初用16位整数来编码字符指针,也就是最多可以编码65,536个字符。但是,主要由于 CJK 字符集的庞大规模,这个容量不够使用,因此自 Unicode 2.0 起,编码空间被扩展到21位(范围为0到10FFFF),支持多达1,114,112个编码指针。

Planes

=平面=

A Unicode code point is a numeric value between 0 and 10FFFF (hexadecimal). The code space is divided into 17 planes of 64K (65,536) code points each. In Unicode 4.0, the planes with allocated characters are Planes 0, 1, 2 and 14.

Unicode 编码指针是一个在0和10FFFF之间的数值,分成64K个字符组成的平面。在 Unicode 4.0 里,分配的平面是平面0,1,2和14。

Plane 0, ranging from 0000 to FFFF, is called the Basic Multilingual Plane (BMP). It is the set of characters assigned by the previous 16-bit scheme.

平面0,从0000到FFFF,叫做基本多语言平面(Basic Multilingual Plane, BMP),由过去的16位编码系统下的字符集组成。

Plane 1, ranging from 10000 to 1FFFF and called Supplementary Multilingual Plane (SMP), is dedicated to lesser used historic scripts, special-purpose invented scripts and special notations. These include Gothic, Shavian and musical symbols. Many more historic scripts may be encoded in this plane in the future.

平面1,从10000到1FFFF,叫做辅助多语言平面(Supplementary Multilingual Plane, SMP),用于较少使用的古文字,特殊用途的文字和特殊符号。这些文字包括哥特文字,Shavian 文字和乐谱符号。今后可能会有更多的古文字被编码到这个平面中。

Plane 2, ranging from 20000 to 2FFFF and called Supplementary Ideographic Plane (SIP), is the spillover allocation area for those CJK characters that cannot fit into the blocks for common CJK characters in the BMP. Plane 14, ranging from E0000 to EFFFF and called Supplementary Special-purpose Plane (SSP), is for some control characters that do not fit into the small areas allocated in the BMP.

平面2,从20000到2FFFF,称为辅助表意文字平面(Supplementary Ideographic Plane, SIP),用于容纳 BMP 中一般 CJK 字符容纳不下的字符的区域。平面14,从E0000到EFFFF,称为辅助特殊用途平面(Supplementary Special-purpose Plane, SSP),是为 BMP 中有限的小区域无法容纳的控制字符准备的。

Two more planes, Plane 15 and Plane 16, are reserved for private use; no characters are assigned in them.

还有两个保留平面,平面15和平面16,用于个别用途,没有分配编码指针。
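Because each plane holds 64K code points, the plane number is simply the code point shifted right by 16 bits. The following is a minimal sketch of that arithmetic; the function names and the `PLANE_NAMES` table are illustrative only, and the table lists just the planes mentioned above.

```python
# Planes named in the text above; other planes had no allocations in Unicode 4.0.
PLANE_NAMES = {
    0: "Basic Multilingual Plane (BMP)",
    1: "Supplementary Multilingual Plane (SMP)",
    2: "Supplementary Ideographic Plane (SIP)",
    14: "Supplementary Special-purpose Plane (SSP)",
    15: "Private use (Plane 15)",
    16: "Private use (Plane 16)",
}


def plane_of(code_point: int) -> int:
    """Return the plane number (0 to 16) that contains the code point."""
    if not 0 <= code_point <= 0x10FFFF:
        raise ValueError(f"not a Unicode code point: {code_point:#x}")
    # Each plane spans 0x10000 code points, so the plane is the top bits.
    return code_point >> 16


def describe(code_point: int) -> str:
    """Return a short human-readable description of the code point's plane."""
    plane = plane_of(code_point)
    name = PLANE_NAMES.get(plane, "no characters allocated in Unicode 4.0")
    return f"U+{code_point:04X} is in plane {plane}: {name}"
```

For example, `describe(0x20000)` reports plane 2, the Supplementary Ideographic Plane.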

Basic Multilingual Plane

==基本多语言平面==

The Basic Multilingual Plane (BMP), or Plane 0, is the plane most commonly used in general documents. Code points are allocated for common characters in contemporary scripts with exactly the same set as ISO/IEC 10646-1, as summarized in Figure 2. Note that the code points between E000 and F900 are reserved for the vendors' private use. No character is assigned in this area.

基本多语言平面(Basic Multilingual Plane, BMP),或平面0,是一般文本中使用最多的平面。现代文字中常用字符的编码指针被按照与 ISO/IEC 10646-1 完全相同的方式分配,如图2所示。注意E000和F900之间的编码指针为软件提供商的特别用途被保留,该区域中没有分配字符。

Character Encoding

==字符编码==

There are several ways of encoding Unicode strings for information interchange. One may simply represent each character using a fixed-size integer (called a wide char), which is defined by ISO/IEC 10646 as UCS-2 and UCS-4, where 2-byte and 4-byte integers are used, respectively (6), and where UCS-2 covers the BMP only. But the common practice is to encode the characters as sequences of 8-bit, 16-bit or 32-bit code units, called UTF-8, UTF-16 and UTF-32, respectively (7). UTF-8 and UTF-16 are variable-length, while UTF-32 always uses a single code unit per character. There is also UTF-7 for e-mail transmissions that are strictly 7-bit, but UTF-8 is safe in most cases.

用于信息交换的 Unicode 字符串有几种编码方式。每个字符可以简单地用固定长度的整数表示(称为宽字符),这种方式在 ISO/IEC 10646 中定义为 UCS-2 和 UCS-4,分别使用2字节和4字节长度的整数(6),而且 UCS-2 只用于基本多语言平面。但一般的做法是用8位,16位或32位的编码单位序列表示,分别称为 UTF-8,UTF-16,和 UTF-32(7)。其中 UTF-8 和 UTF-16 是可变长度的,而 UTF-32 每个字符固定使用一个编码单位。还有7位的专用于电子邮件传输的 UTF-7 编码,但多数情况下 UTF-8 都被支持。

UTF-32

===UTF-32===

UTF-32 is the simplest Unicode encoding form. Each Unicode code point is represented directly by a single 32-bit unsigned integer. It is, therefore, a fixed-width character encoding form. This makes UTF-32 an ideal form for APIs that pass single character values. However, it is inefficient in terms of storage for Unicode strings.

UTF-32 是最简单的 Unicode 编码形式。每个 Unicode 编码指针都由一个单个32位无符号整数直接表示,因此它是一种固定宽度的编码形式。这使得 UTF-32 适合用于传递单个字符值的应用程序接口。但是,它不能有效满足 Unicode 字符串的存储需要。

UTF-16

===UTF-16===

UTF-16 encodes code points in the range 0000 to FFFF (i.e. the BMP) as a single 16-bit unsigned integer. Code points in the supplementary planes are instead represented as pairs of 16-bit unsigned integers. These pairs of code units are called surrogate pairs. The values used for the surrogate pairs are in the range D800 to DFFF, which is not assigned to any character. So, UTF-16 readers can easily distinguish between single code units and surrogate pairs. The Unicode Standard(8) provides more details of surrogates.

UTF-16 在0000到FFFF范围(即基本多语言平面)内以单个16位无符号整数编码指针。辅助平面内的编码指针由两个16位无符号整数代表。这些编码单位被称为代用对。代用对的值在D800到DFFF间,没有分配给任何字符。这样,UTF-16 程序容易分辨单个编码单位和代用对。Unicode 标准(8)给出了代用对的详情。
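The surrogate mechanism can be sketched in a few lines of arithmetic. The sketch below relies on the standard split of the D800 to DFFF range into high surrogates (D800 to DBFF) and low surrogates (DC00 to DFFF), a detail not spelled out above; the function names are illustrative.

```python
def to_surrogate_pair(code_point: int) -> tuple:
    """Split a supplementary code point into (high, low) UTF-16 code units."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only supplementary code points need a surrogate pair")
    offset = code_point - 0x10000          # 20 significant bits remain
    high = 0xD800 + (offset >> 10)         # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)        # bottom 10 bits
    return high, low


def from_surrogate_pair(high: int, low: int) -> int:
    """Combine a (high, low) surrogate pair back into a code point."""
    if not (0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF):
        raise ValueError("not a valid surrogate pair")
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)
```

For the musical symbol U+1D11E, `to_surrogate_pair(0x1D11E)` gives the pair (D834, DD1E), which matches the bytes produced by Python's own `utf-16-be` codec.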

UTF-16 is a good choice for keeping general Unicode strings, as it is optimized for characters in BMP, which is used in 99 percent of Unicode texts. It consumes about half of the storage required by UTF-32.

UTF-16 是保存一般 Unicode 字符串的好方法,因为它对在99%的 Unicode 文本中使用的基本多语言平面内的字符进行了优化。它只需要相当于 UTF-32 所需一半的存储空间。
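The storage trade-off is easy to check with Python's built-in codecs. This is a small sketch rather than a benchmark; the helper name is illustrative, and the big-endian codec variants are used only so that no byte order mark is counted.

```python
def encoded_sizes(text: str) -> dict:
    """Return the number of bytes the text occupies in each encoding form."""
    codecs = (("UTF-8", "utf-8"), ("UTF-16", "utf-16-be"), ("UTF-32", "utf-32-be"))
    return {name: len(text.encode(codec)) for name, codec in codecs}
```

For BMP-only text such as the three Thai letters in "ไทย", UTF-16 needs 6 bytes against 12 for UTF-32, while a single supplementary character takes 4 bytes in all three forms.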

UTF-8

===UTF-8===

To meet the requirements of legacy byte-oriented ASCII-based systems, UTF-8 is defined as a variable-width encoding form that preserves ASCII compatibility. It uses one to four 8-bit code units to represent a Unicode character, depending on the code point value. The code points between 0000 and 007F are encoded in a single byte, making any ASCII string a valid UTF-8 string. Beyond the ASCII range of Unicode, some non-ideographic characters between 0080 and 07FF are encoded with two bytes. Then, Indic scripts and CJK ideographs between 0800 and FFFF are encoded with three bytes. Supplementary characters beyond the BMP require four bytes. The Unicode Standard(9) provides more detail of UTF-8.

为满足旧式的基于 ASCII 的,面向字节处理的系统的要求,UTF-8 被定义为一种保留了 ASCII 兼容性的可变宽度编码形式。根据编码指针数值的不同,它使用一个到四个8位的编码单位来表示一个 Unicode 字符。在0000到007F范围内的编码指针用一个字节编码,这样任何 ASCII 字符串在 UTF-8 下都同样有效。在 Unicode 的 ASCII 范围外,一些在0080到07FF之间的非表意字符用两个字节编码。在其后的位于0800和FFFF范围内的印地语和 CJK 表意文字用三个字节编码。基本多语言平面之外的辅助字符需要四个字节。Unicode 标准(9)提供了 UTF-8 的详细介绍。
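The four length classes described above map directly onto a small encoder. The following is a sketch of the standard UTF-8 bit layout for a single code point; in practice Python's built-in `str.encode("utf-8")` does the same job, and the function name here is illustrative.

```python
def utf8_encode(code_point: int) -> bytes:
    """Encode one Unicode code point as UTF-8 (one to four bytes)."""
    if not 0 <= code_point <= 0x10FFFF or 0xD800 <= code_point <= 0xDFFF:
        raise ValueError("not an encodable Unicode code point")
    if code_point <= 0x7F:                      # ASCII: one byte, unchanged
        return bytes([code_point])
    if code_point <= 0x7FF:                     # two bytes: 110xxxxx 10xxxxxx
        return bytes([0xC0 | (code_point >> 6),
                      0x80 | (code_point & 0x3F)])
    if code_point <= 0xFFFF:                    # three bytes: rest of the BMP
        return bytes([0xE0 | (code_point >> 12),
                      0x80 | ((code_point >> 6) & 0x3F),
                      0x80 | (code_point & 0x3F)])
    return bytes([0xF0 | (code_point >> 18),    # four bytes: supplementary planes
                  0x80 | ((code_point >> 12) & 0x3F),
                  0x80 | ((code_point >> 6) & 0x3F),
                  0x80 | (code_point & 0x3F)])
```

Lead bytes and continuation bytes use disjoint value ranges, which is what lets byte-oriented code find character boundaries without decoding the whole string.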

UTF-8 is typically the preferred encoding form for the Internet. The ASCII compatibility helps a lot in migration from old systems. UTF-8 also has the advantage of being byte-serialized and friendly to the APIs of C and other programming languages. For example, traditional string collation using byte-wise comparison still works with UTF-8, and it orders strings by code point.

UTF-8 是因特网上典型的理想编码形式。ASCII 兼容性对从旧系统迁移帮助很大。UTF-8 还有字节串行化和对 C 或其他语言编程接口友好的优点。例如,传统的逐字节比较方式的字符排序表在 UTF-8 下也能工作。
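The claim about byte-wise comparison can be checked directly: sorting strings by their UTF-8 bytes gives the same order as sorting by code point, which is not true of UTF-16. The sketch below uses an arbitrary set of sample strings chosen for illustration.

```python
def sort_bytewise(strings, codec):
    """Sort strings by comparing their encoded bytes, as C's strcmp/memcmp would."""
    return sorted(strings, key=lambda s: s.encode(codec))


# Sample strings: ASCII, Latin-1, Thai, a high BMP character, a supplementary character.
samples = ["\U00010000", "\uff5e", "z", "\u0e01", "\u00e9"]

# Python compares str objects by code point, so sorted() gives code point order.
by_code_point = sorted(samples)
by_utf8_bytes = sort_bytewise(samples, "utf-8")
by_utf16_bytes = sort_bytewise(samples, "utf-16-be")
```

Here `by_utf8_bytes` equals `by_code_point`, while `by_utf16_bytes` places U+10000 before U+FF5E because its surrogate pair starts with D800.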

In short, UTF-8 is the most widely adopted encoding form of Unicode.

一句话,UTF-8 是 Unicode 最普及的编码形式。

Character Properties

==字符属性==

In addition to code points, Unicode also provides a database of character properties called the Unicode Character Database (UCD), which consists of a set of files describing the following properties:

除了编码指针外,Unicode 还提供了一个称为 Unicode 字符数据库(Unicode Character Database, UCD)(10)的字符属性数据库,包括一系列文件用来描述以下的属性:

Name.

General category (classification as letters, numbers, symbols, punctuation, etc.).

Other important general characteristics (white space, dash, ideographic, alphabetic, noncharacter, deprecated, etc.).

Character shaping (bidi category, shaping, mirroring, width, etc.).

Case (upper, lower, title, folding; both simple and full).

Numeric values and types (for digits).

Script and block.

Normalization properties (decompositions, decomposition type, canonical combining class, composition exclusions, etc.).

Age (version of the standard in which the code point was first designated).

Boundaries (grapheme cluster, word, line and sentence).

Standardized variants.

名字

一般类别(分类为字母、数字、符号、标点,等等)。

其他重要一般性质(空白,连字符,表意,字母顺序,非字符,已过时,等等)

字符外形(bidi 分类,外形,镜像,宽度,等等)。

形式(大写,小写,标题,折叠;简写和全写)。

数值和类型(用于数字)。

字符和字符块。

标准化属性(分解,分解类型,最简组成类,不合法组合,等等)。

历史(编码指针最初被指定的标准版本)。

边界(字,词,分行和断句)。

标准变形。

The database is useful for Unicode implementation in general. It is available at the Unicode.org Web site. The Unicode Standard(11) provides more details of the database.

这个数据库可用于一般的 Unicode 实现。在 Unicode.org 网站上可以找到它。Unicode 标准(11)提供了这个数据库的详情。
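Many programming environments ship a copy of parts of the UCD. As an illustration, the sketch below assumes Python's standard `unicodedata` module, which exposes a subset of the properties listed above for the Unicode version bundled with the interpreter; the helper name and dictionary keys are illustrative.

```python
import unicodedata


def describe_character(ch: str) -> dict:
    """Collect a few UCD properties for a single character."""
    return {
        "name": unicodedata.name(ch, "<unnamed>"),
        "general_category": unicodedata.category(ch),        # e.g. Lu, Nd, Mn
        "bidi_category": unicodedata.bidirectional(ch),      # e.g. L, R, AL
        "mirrored": bool(unicodedata.mirrored(ch)),
        "east_asian_width": unicodedata.east_asian_width(ch),
        "numeric_value": unicodedata.numeric(ch, None),      # None if not numeric
        "decomposition": unicodedata.decomposition(ch),      # "" if none
        "combining_class": unicodedata.combining(ch),        # 0 for non-combining
    }
```

For instance, `describe_character("\u0e53")` (THAI DIGIT THREE) reports a numeric value of 3.0 and the general category `Nd`.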

Technical Reports

==技术报告==

In addition to the code points, encoding forms and character properties, Unicode also provides some technical reports that can serve as implementation guidelines. Some of these reports have been included as annexes to the Unicode standard, and some are published individually as Technical Standards.

除了编码指针,编码形式和字符属性外,Unicode 还提供了一些技术报告,可以作为实现的指导。其中一些报告作为 Unicode 标准的附录提供,另一些则单独作为技术标准发布。

In Unicode 4.0, the standard annexes are:

在 Unicode 4.0 中,标准附录包括:

UAX 9: The Bidirectional Algorithm

Specifications for the positioning of characters flowing from right to left, such as Arabic or Hebrew.

UAX 11: East-Asian Width

Specifications of an informative property of Unicode characters that is useful when interoperating with East-Asian Legacy character sets.

UAX 14: Line Breaking Properties

Specification of line breaking properties for Unicode characters as well as a model algorithm for determining line break opportunities.

UAX 15: Unicode Normalization Forms

Specifications for four normalized forms of Unicode text. With these forms, equivalent text (canonical or compatibility) will have identical binary representations. When implementations keep strings in a normalized form, they can be assured that equivalent strings have a unique binary representation.

UAX 24: Script Names

Assignment of script names to all Unicode code points. This information is useful in mechanisms such as regular expressions, where it produces much better results than simple matches on block names.

UAX 29: Text Boundaries

Guidelines for determining default boundaries between certain significant text elements: grapheme clusters (“user characters”), words and sentences.

UAX 9:双向算法:对于从右向左书写的文字,如阿拉伯文和希伯来文字符位置的规定。

UAX 11:东亚字符宽度:当操作旧式东亚字符集时对 Unicode 字符属性的规定。

UAX 14:断行属性:对 Unicode 字符断行属性的规定,以及决定断行时机的模型算法。

UAX 15:Unicode 标准化形式:规定了 Unicode 字符的四种标准形式。通过这些形式,等价(相同或兼容)的文本将具有同样的二进制值。当实现的程序用标准形式保存字符串时,可以确保等价的字符串有唯一的二进制值。

UAX 24:语言名称:为所有 Unicode 编码指针分配了语言的名称。这种信息在正则表达式这样的机制中能产生比仅仅匹配字符块名称更好的效果,因而非常有用。

UAX 29:字符边界:定义某些重要文本元素,如字符组合(“用户字符”),词和句子缺省边界的指导。
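The effect of the normalization forms in UAX 15 can be seen with a short example. This sketch assumes Python's standard `unicodedata.normalize` function; the helper name is illustrative.

```python
import unicodedata


def equivalent(a: str, b: str, form: str = "NFC") -> bool:
    """Return True if the two strings are identical after normalization."""
    return unicodedata.normalize(form, a) == unicodedata.normalize(form, b)


precomposed = "\u00e9"      # LATIN SMALL LETTER E WITH ACUTE
decomposed = "e\u0301"      # "e" followed by COMBINING ACUTE ACCENT
ligature = "\ufb01"         # LATIN SMALL LIGATURE FI
```

The two spellings of "é" differ as raw strings but compare equal under NFC or NFD (canonical equivalence). The ligature only matches the two letters "fi" under the compatibility forms NFKC and NFKD.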

The individual technical standards are:

单独的技术标准包括:

UTS 6: A Standard Compression Scheme for Unicode

Specifications of a compression scheme for Unicode and sample implementation.

UTS 10: Unicode Collation Algorithm

Specifications for how to compare two Unicode strings while conforming to the requirements of the Unicode Standard. The UCA also supplies the Default Unicode Collation Element Table (DUCET) as the data specifying the default collation order for all Unicode characters.

UTS 18: Unicode Regular Expression Guidelines

Guidelines on how to adapt regular expression engines to use Unicode.

UTS 6:Unicode 标准压缩方式:对 Unicode 和样本实现的压缩方式的规定。

UTS 10:Unicode 排序算法(UCA):在与 Unicode 标准兼容的前提下比较两个 Unicode 字符串的规定。UCA 也提供了缺省 Unicode 排序元素表(Default Unicode Collation Element Table, DUCET)用来指定所有 Unicode 字符的缺省排序顺序。

UTS 18:Unicode 正则表达式导则:关于如何让正则表达式引擎使用 Unicode 的导则。

All Unicode Technical Reports are accessible from the Unicode.org web site (12).

所有 Unicode 技术报告都可以从 Unicode.org 网站(12)上得到。

Fonts

=字体=

Font Development Tools

==字体开发工具==

Some FOSS tools for developing fonts are available. Although not as many as their proprietary counterparts, they are adequate to get the job done, and are continuously being improved. Some interesting examples are:

有一些用于开发字体的自由/开源软件工具。虽然这类工具不像私有的开发工具那样丰富,但它们足以胜任工作,而且在不断地改进。一些有趣的例子包括:

1. XmBDFEd(13). Developed by Mark Leisher, XmBDFEd is a Motif-based tool for developing BDF fonts. It allows one to edit bit-map glyphs of a font, do some simple transformations on the glyphs, transfer information between different fonts, and so on.

2. FontForge(14) (formerly PfaEdit(15) ). Developed by George Williams, FontForge is a tool for developing outline fonts, including PostScript Type1, TrueType, and OpenType. Scanned images of letters can be imported and their outline vectors automatically traced. The splines can be edited, and transformations such as skewing, scaling, rotating and thickening may be applied, among many others. It provides sufficient functionality for editing the properties of Type1 and TrueType fonts. OpenType tables can also be edited in its recent versions. One weak point, however, is hinting: it guarantees the quality of Type1 hints, but not of TrueType hints.

3. TTX/FontTools(16). Just van Rossum’s TTX/FontTools is a tool to convert OpenType and TrueType fonts to and from XML. FontTools is a library for manipulating fonts, written in Python. It supports TrueType, OpenType, AFM and, to a certain extent, Type 1 and some Mac-specific formats. It allows one to dump OpenType tables, examine and edit them with XML or plain text editor, and merge them back to the font.

XmBDFEd(13)。它由 Mark Leisher 开发,是一个基于 Motif 的开发 BDF 字体的工具。它允许编辑字体的点阵形式,对符号进行简单的变换,在不同字体间传递信息,等等。

FontForge(14)(过去的 PfaEdit(15))。George Williams 开发的 FontForge 是用于开发 Postscript Type1,TrueType 和 OpenType 等轮廓字体的工具。它可以导入字母的扫描图像,并自动追踪其轮廓向量。它还能对样条曲线进行编辑,并进行倾斜、缩放、旋转、加粗以及其他许多种变换。其功能足够用于编辑 Type1 和 TrueType 字体。新版本还能编辑 OpenType 字体表。但 hinting 是它的一个弱项,它只能保证 Type1 hinting 的质量,但 TrueType 的则不理想。

TTX/FontTools(16)。Just van Rossum 的 TTX/FontTools 是一种用于 OpenType 和 TrueType 字体与 XML 文件相互转换的工具。FontTools 是用 Python 写成的处理字体的函数库。它支持 TrueType,OpenType,AFM 并提供了 Type1 和一些 Mac 专用字体的有限支持。它允许导出 OpenType 字体表,并用 XML 和纯文本编辑器检验和编辑,再合并回字体文件。
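To give an idea of what the "tables" in a TrueType or OpenType font are, the following sketch reads the table directory at the start of a font file using only the Python standard library. It is not how TTX/FontTools is implemented and it does not use the FontTools API; it only relies on the published sfnt layout (a 12-byte header followed by 16-byte table records), and the function name is illustrative.

```python
import struct


def list_sfnt_tables(data: bytes) -> list:
    """Return (tag, offset, length) for each table listed in an sfnt font."""
    # Header: 4-byte version tag, then numTables as a big-endian uint16.
    _version, num_tables = struct.unpack(">4sH", data[:6])
    tables = []
    for index in range(num_tables):
        start = 12 + 16 * index            # records follow the 12-byte header
        tag, _checksum, offset, length = struct.unpack(
            ">4sLLL", data[start:start + 16])
        tables.append((tag.decode("latin-1"), offset, length))
    return tables
```

Given the bytes of a font file, the result lists tags such as `cmap`, `glyf` or `GSUB`, which are the units that TTX dumps to XML and merges back.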

Font Configuration

==字体配置==

There have been several font configuration systems available in GNU/Linux desktops. The most fundamental one is the X Window font system itself. But, due to some recent developments, another font configuration called fontconfig has been developed to serve some specific requirements of modern desktops. These two font configurations will be discussed briefly.

在 GNU/Linux 桌面上有几种字体配置系统。最基本的是 X Window 字体系统本身。但是,在近期的开发中,另一种称为 fontconfig 的字体配置被开发出来以满足现代桌面的一些特定需要。以下简单讨论这两种字体系统。

First, however, let us briefly discuss the X Window architecture, to understand font systems. X Window(17) is a client-server system. X servers are the agents that control hardware devices, such as video cards, monitors, keyboards, mice or tablets, and pass user input events from the devices to the clients. X clients are GUI application programs that request the X server to draw graphical objects on the screen, and accept user input via the events fed by the X server. Note that with this architecture, the X client and server can be on different machines in the network. In that case, the X server runs on the machine that the user operates, while the X client can be a process running on the same machine or on a remote machine in the network.

不过首先,我们简要讨论一下 X Window 架构,以便理解字体系统。X Window(17) 是一种客户端-服务器系统。X 服务器是提供显卡、显示器、键盘、鼠标或触摸板等硬件设备控制服务的主体,也负责把用户输入事件从设备传送到客户。X 客户端是请求 X 服务器在屏幕上描绘图形对象,并通过 X 服务器的事件传送接受用户输入的图形界面程序。注意在这种架构中,X 客户端和服务器可以处在网络中不同的机器上。这种情况下,X 服务器是用户操作的机器,而 X 客户端可以是同一台机器上运行的进程,或网络中的远程机器。

In this client-server architecture, fonts are provided on the server side. Thus, installing fonts means configuring X server by installing fonts and registering them to its font path.

在这个客户端-服务器架构中,字体是服务器端提供的。因此,安装字体意味着在 X 服务器上加入字体并注册其字体路径。

However, since X server is sometimes used to provide thin-client access in some deployments, where X server may run on cheap PCs booted by floppy or across network, or even from ROM, font installation on each X server is not always appropriate. Thus, font service has been delegated to a separate service called X Font Server (XFS). Another machine in the network can be dedicated for font service so that all X servers can request font information. Therefore, with this structure, an X server may be configured to manage fonts by itself or to use fonts from the font server, or both.

但是,由于 X 服务在一些配置中有时被用来提供瘦客户机访问,而这些 X 服务器可能是运行在用软盘或网络方式启动的廉价机器上,甚至是从固化的 ROM 启动,在每台 X 服务器上安装字体不一定合适。因此,字体服务被分离成一个单独的服务,称为 X 字体服务器(X Font Server, XFS)。网络中另一台机器可以专门提供字体服务,这样所有的 X 服务器都可以请求字体信息。这样,在这个构架下,X 服务器可以配置成自我管理字体,或者使用来自字体服务器的字体,或者两者并存。

Nevertheless, recent changes in XFree86 have addressed some requirements to manage fonts on the client side. The Xft extension renders anti-aliased glyph images using font information provided by the X client. In its first version, the Xft extension also provided font management functionality to X clients. This functionality was later split from Xft2 into a separate library called fontconfig. fontconfig is a font management system independent of X, which means it can also apply to non-GUI applications such as printing services. Modern desktops, including KDE 3 and GNOME 2, have adopted fontconfig as their font management system, and have benefited from closer integration in providing an easy font installation process. Moreover, client-side fonts also allow applications to do all glyph manipulations, such as making special effects, while enjoying consistent appearance on the screen and in printed output.

不过,在 XFree86 中最近的改变注意到了一些在客户端管理字体的需求。Xft 扩展通过 X 客户端提供的字体信息实现了抗锯齿的符号图像。这个功能也使 Xft 在其第一版中提供了 X 客户端的字体管理能力。后来这个功能从 Xft2 中分离成一个单独的库,称为 fontconfig。fontconfig 是独立于 X 的一个字体管理系统,因此它也支持像打印服务这样的非图形界面应用。包括 KDE 3 和 GNOME 2 在内的现代桌面都采用了 fontconfig 作为字体管理系统,并且得益于紧密的整合,提供了简单的字体安装过程。而且,客户端的字体也允许应用程序进行特效等各种符号操作,同时在屏幕上和打印输出中都可以得到一致的效果。

The splitting of the X client-server architecture is not standard practice on stand-alone desktops. However, it is important to always keep the split in mind, to enable particular features.

X 客户端-服务器的分离式架构并不是独立桌面的标准形式。但是,要使用某些特别的功能,必须记住这个特点。

Output Methods

=输出方法=

Since the usefulness of XOM is still being questioned, we shall discuss only the output methods already implemented in the two major toolkits: Pango of GTK+ 2 and Qt 3.

由于 XOM 的有用程度还有疑问,我们将只讨论在两个主要的工具包中已经实现的输出方法:GTK+ 2的 Pango 和 Qt 3。

Pango Text Layout Engines

==Pango 文本外观引擎==

Pango ['Pan' means 'all' in Greek and 'go' means 'language' in Japanese](18) is a multilingual text layout engine designed for quality text typesetting. Although it is the text drawing engine of GTK+, it can also be used outside GTK+ for other purposes, such as printing(19). This section will provide localizers with a bird's-eye view of Pango. The Pango reference manual(20) should be consulted for more detail.

Pango(“Pan”在希腊语里意思是“全部”,而“go”是日语中“语言”的意思)(18) 是一个用于高质量文本排版的多语言文本外观引擎。虽然它是 GTK+ 的文本描绘引擎,它也可以用于 GTK+ 之外的其他用途,例如打印(19)。这一节将为本地化工作者提供 Pango 的概览。如需要更多详情,应阅读 Pango 参考手册(20)。

PangoLayout

===PangoLayout===

At a high level, Pango provides the PangoLayout class that takes care of typesetting text in a column of given width, as well as other information necessary for editing, such as cursor positions. Its features may be summarized as follows:

在较高的层级,Pango 提供了 PangoLayout 类,处理给定宽度内的一列文本的排版,以及光标位置等其他编辑时必要的信息。其功能可以概括如下:

Paragraph Properties

indent

spacing

alignment

justification

word/character wrapping modes

tabs

段落属性

缩进

间距

段落对齐

两端对齐

字/词换行模式

制表位

Text Elements

get lines and their extents

get runs and their extents

character search at (x, y) position

character logical attributes (is line break, is cursor position, etc.)

cursor movements

文本元素

行及其范围

语流及其范围

在 (x,y) 位置的字符搜索

字符逻辑属性(是换行符,是光标位置控制符,等等)

光标移动

Text Contents

plain text

markup text

文本内容

纯文本

标记文本

Middle-level Processing

==中级处理==

Pango also provides access to some middle-level text processing functions, although most clients in general do not use them directly. To gain a brief understanding of Pango internals, some highlights are discussed here.

Pango 还提供了一些中级的文本处理功能,虽然大部分客户端都不直接使用这些功能。为了简单了解 Pango 的能力,这里讨论一些重要特性。

There are three major steps for text processing in Pango(21):

Pango 中的文本处理有三个主要步骤(21):

Itemize. Breaks input text into chunks (items) of consistent direction and shaping engine. This usually means chunks of text of the same language with the same font. Corresponding shaping and language engines are also associated with the items.

分项:将文本打散成具有相同方向和形状引擎的文本块(项目)。这通常是指同一种语言和同一种字体的文本块。相应的形状和语言引擎也和项目相关联。

Break. Determines possible line, word and character breaks within the given text item. It calls the language engine of the item (or the default engine based on Unicode data if no language engine exists) to analyze the logical attributes of the characters (is-line-break, is-char-break, etc.).

分解:确定给定的文本项中可能的行、词和字符分割。它调用项目的语言引擎(如语言引擎不存在则调用基于 Unicode 数据的缺省引擎)来分析字符的逻辑属性(是断行,是断字,等等)。

Shape. Converts the text item into glyphs, with proper positioning. It calls the shaping engine of the item (or the default shaping engine that is currently suitable for European languages) to obtain a glyph string that provides the information required to render the glyphs (code point, width, offsets, etc.).

造型:把文本项转化成具有正确位置的符号。它调用项目的造型引擎(或者适用于欧洲语言的缺省造型引擎)生成提供渲染符号所需信息(编码指针,宽度,偏移量等)的符号串。
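To make the itemize step more concrete, here is a deliberately simplified sketch that splits text into runs of consistent direction. It is not Pango's algorithm and does not use the Pango API: real itemization also takes script, font and language into account and follows the full Unicode bidirectional algorithm. The rule used here (neutral characters stay with the run that precedes them) is an assumption made for illustration, as are all the names.

```python
import unicodedata

# Bidi categories treated as strongly right-to-left in this toy model.
RTL_CATEGORIES = {"R", "AL"}


def itemize(text: str) -> list:
    """Split text into (direction, substring) runs of consistent direction."""
    items = []
    current = None      # direction of the run being built: "ltr", "rtl" or None
    start = 0
    for index, ch in enumerate(text):
        bidi = unicodedata.bidirectional(ch)
        if bidi in RTL_CATEGORIES:
            direction = "rtl"
        elif bidi == "L":
            direction = "ltr"
        else:
            direction = current     # neutral: stay in the current run
        if direction is None:
            continue                # leading neutrals wait for a strong character
        if current is None:
            current = direction
        elif direction != current:
            items.append((current, text[start:index]))
            start = index
            current = direction
    if text:
        items.append((current or "ltr", text[start:]))
    return items
```

With mixed English and Hebrew input, each returned item could then be handed to its own language and shaping engine, as the Break and Shape steps describe.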

Pango Engines

==Pango 引擎==

Pango engines are implemented in loadable modules that provide entry functions for querying and creating the desired engine. During initialization, Pango queries the list of all engines installed in the memory. Then, when it itemizes input text, it also searches the list for the language and shaping engines available for the script of each item and creates them for association to the relevant text item.

Pango 引擎以可加载的模块形式实现,提供查询和建立所需引擎的函数。在初始化时,Pango 查询内存中所有引擎的列表。然后,在对输入文字分项后,它为每个项目中的文字搜索可用的语言和造型引擎并建立与相关的文本项目关联的引擎。

Pango Language Engines

==Pango 语言引擎==

As discussed above, the Pango language engine is called to determine possible break positions in a text item of a certain language. It provides a method to analyze the logical attributes of every character in the text as listed in Table 3.

如上所述,调用 Pango 语言引擎是为了确定某种语言中文本项的可能的分解位置。它提供了分析文本中每个字符逻辑属性的方法,如表3所示:

Table 3 Pango Logical Attributes

Flag Description

is_line_break can break line in front of the character

is_mandatory_break must break line in front of the character

is_char_break can break here when doing character wrap

is_white is white space character

is_cursor_position cursor can appear in front of character

is_word_start is first character in a word

is_word_end is first non-word character after a word

is_sentence_boundary is inter-sentence space

is_sentence_start is first character in a sentence

is_sentence_end is first non-sentence character after a sentence

backspace_deletes_character backspace deletes one character, not entire cluster (new in Pango 1.3.x)

表3 Pango 逻辑属性

标志(Flag) 描述

is_line_break 可以在字符前断行

is_mandatory_break 必须在字符前断行

is_char_break 字符分行时可以在这里断行

is_white 是空格字符

is_cursor_position 光标可以在字符前出现

is_word_start 是单词的第一个字符

is_word_end 是单词后的第一个非单词字符

is_sentence_boundary 是句子间的空格

is_sentence_start 是句子的第一个字符

is_sentence_end 是句子后的第一个非句子的字符

backspace_deletes_character 退格删除一个字符而不是整个字符簇
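The flags in Table 3 can be pictured as one small record per position in the text. The sketch below models a few of them with naive rules: a word is a run of alphanumeric characters, and the cursor may not stop in front of a combining mark. These rules are assumptions for illustration only; a real language engine uses the Unicode boundary data and script-specific knowledge, and the class and function names are not Pango's API.

```python
import unicodedata
from dataclasses import dataclass


@dataclass
class LogAttr:
    is_white: bool = False
    is_word_start: bool = False
    is_word_end: bool = False
    is_cursor_position: bool = True


def analyze(text: str) -> list:
    """Return one LogAttr per character, plus one for the end of the text."""
    attrs = []
    for index in range(len(text) + 1):
        current = text[index] if index < len(text) else ""
        previous_is_word = index > 0 and text[index - 1].isalnum()
        current_is_word = current.isalnum()
        is_combining = bool(current) and unicodedata.combining(current) != 0
        attrs.append(LogAttr(
            is_white=current.isspace(),
            is_word_start=current_is_word and not previous_is_word,
            is_word_end=previous_is_word and not current_is_word,
            is_cursor_position=not is_combining,
        ))
    return attrs
```

The extra record at the end lets `is_word_end` be reported for a word that finishes the text, matching the description "first non-word character after a word".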

Pango Shaping Engines

==Pango 造型引擎==

As discussed above, the Pango shaping engine converts characters in a text item in a certain language into glyphs, and positions them according to the script constraints. It provides a method to convert a given text string into a sequence of glyphs information (glyph code, width and positioning) and a logical map that maps the glyphs back to character positions in the original text. With all the information provided, the text can be properly rendered on output devices, as well as accessed by the cursor despite the difference between logical and rendering order in some scripts like Indic, Hebrew and Arabic.

如上所述,Pango 造型引擎把一个特定语言的文本项中的字符转换成符号,并且按照文字的规则放置这些符号。它提供了一种将给定的文本串转化为符号信息序列(符号编码、宽度和位置)的方法以及按原文本中字符位置将符号映射回字符的规则。利用这些信息,文本可以在输出设备上正确地显示,也可以正确地处理光标位置,而不用管像印地语、希伯来语和阿拉伯语这样的语言中不同的逻辑和显示顺序。

Qt Text Layout

==Qt 文本外观==

Qt 3 text rendering is different from that of GTK+/Pango. Instead of modularizing, it handles all complex text rendering in a single class, called QComplexText, which is mostly based on the Unicode character database. This is equivalent to the default routines provided by Pango. Due to the incompleteness of the Unicode database, this class sometimes needs extra workarounds to override some values. Developers should examine this class if a script is not rendered properly.

Qt 3 的文本渲染与 GTK+/Pango 的不同。它不是模块化的,而是在一个称为 QComplexText 的基于 Unicode 字符数据库的类中处理所有复杂文本渲染。它与 Pango 提供的缺省处理方法是一样的。由于 Unicode 数据库的不完整,这个类需要更多的修改来处理某些数值。如果一种语言渲染不正确,开发者需要检查这个类。

Although relying on the Unicode database appears to be a straightforward method for rendering Unicode texts, this makes the class rigid and error prone. Checking the Qt Web site regularly to find out whether there are bugs in latest versions is advisable. However, a big change has been planned for Qt 4, which is the Scribe text layout engine, similar to Pango for GTK+.

虽然依赖于 Unicode 数据库看起来是一种直接的渲染 Unicode 文本的办法,但这样的类不灵活而且容易出错。建议经常查看 Qt 的网站了解最新版本中是否存在问题。不过,Qt 4 当中计划引入一个大的变化,即 Scribe 文本布局引擎,与 GTK+ 的 Pango 类似。

Keyboard Layouts

==键盘布局==

The first step to providing text input for a particular language is to prepare the keyboard map. X Window handles the keyboard map using the X Keyboard (XKB) extension. When you start an X server on GNU/Linux, a virtual terminal is attached to it in raw mode, so that keyboard events are sent from the kernel without any translation.

为一种特定语言提供文本输入功能的第一步是定义键盘布局。X Window 用 X 键盘扩展处理(X Keyboard extension, XKB)键盘布局。当你在 GNU/Linux 上启动 X 服务器时,它附带一个简单的虚拟控制台,这样内核可以发送键盘事件而不需要任何转换。

The raw scan code of the key is then translated into keycode according to the keyboard model. For XFree86 on PC, the keycode map is usually “xfree86” as kept under /etc/X11/xkb/keycodes directory. The keycodes just represent the key positions in symbolic form, for further referencing.

击键的原始扫描码按照键盘型号被转换成键位代码。对于 PC 上的 XFree86 ,键位映射通常是保存在 /etc/X11/xkb/keycodes 目录下的“xfree86”。键位代码只以符号形式表示键位,以供查询。

The keycode is then translated into a keyboard symbol (keysym) according to the specified layout, such as qwerty, dvorak, or a layout for a specific language, chosen from the data under /etc/X11/xkb/symbols directory. A keysym does not represent a character yet. It requires an input method to translate sequences of key events into characters, which will be described later. For XFree86, all of the above setup is done via the setxkbmap command. (Setting up values in /etc/X11/XF86Config means setting parameters for setxkbmap at initial X server startup.) There are many ways of describing the configuration, as explained in Ivan Pascal’s XKB explanation(22). The default method for XFree86 4.x is the “xfree86” rule (XKB rules are kept under /etc/X11/xkb/rules), with additional parameters:

之后,键位代码根据指定的键盘布局,如 qwerty, dvorak 或者从 /etc/X11/xkb/symbols 目录下的文件中指定的语言的布局,被翻译成键盘符号(keysym)。一个键盘符号还不是一个字符。它需要输入方法来把键盘事件序列转换成字符,后面会提到这个转换过程。对于 XFree86,所有上述的设置都是通过 setxkbmap 命令来完成(在 /etc/X11/XF86Config 中的设置可以在 X 服务器启动时为 setxkbmap 设定参数)。有许多描述配置的方法,在 Ivan Pascal 的 XKB 文档(22)中有说明。XFree86 4.x 的缺省方法是“xfree86”规则(XKB 规则保存在 /etc/X11/xkb/rules),有以下一些参数:

model – pc104, pc105, microsoft, microsoftplus, …

layout – us, dk, ja, lo, th, …

(For XFree86 4.0+, up to 64 groups can be provided as part of layout definition)

variant – (mostly for Latins) nodeadkeys

option – group switching key, swap caps, LED indicator, etc.

(See /etc/X11/xkb/rules/xfree86 for all available options.)

型号-pc104,pc105,microsoft,microsoftplus,……

布局-us,dk,ja,lo,th,……(对 XFree86 4.0 以上版本,布局定义可以提供最多64个分组)

变形-(主要用于拉丁语系的)语音辅助键

可选项-切换键,大小写交换,LED 指示灯,等等(其他可选项见 /etc/X11/xkb/rules/xfree86)

For example:

例如:

$ setxkbmap us,th -option grp:alt_shift_toggle,grp_led:scroll

Sets layout using US symbols as the first group, and Thai symbols as the second group. The Alt-Shift combination is used to toggle between the two groups. Scroll Lock LED will be the group indicator, which will be on when the current group is not the first group, that is, on for Thai, off for US.

把美国英语符号作为第一组,泰国语符号作为第二组。Alt-Shift 组合键用来在两组之间切换。Scroll Lock LED 指示灯将作为分组状况显示,在当前组不是第一组时将点亮,也就是亮表示泰国语,灭表示美国英语。

You can even mix more than two languages:

你甚至可以混合更多的语言:

$ setxkbmap us,th,lo -option grp:alt_shift_toggle,grp_led:scroll

This loads trilingual layout. Alt-Shift is used to rotate among the three groups; that is, Alt-RightShift chooses the next group and Alt-LeftShift chooses the previous group. Scroll Lock LED will be on when the Thai or Lao group is active.

这个命令装入三种语言的布局。Alt-Shift 用来在三个组之间轮换;Alt-右Shift 选择下一组,而 Alt-左Shift 选择上一组。Scroll Lock LED 指示灯在启用泰国语和老挝语组时点亮。
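The group behaviour described in these two examples can be modelled in a few lines. This is only a sketch of the behaviour as described above (rotation among groups, LED on whenever the current group is not the first one), not of how XKB implements it; the class and method names are hypothetical.

```python
class GroupState:
    """Toy model of XKB group switching with an LED group indicator."""

    def __init__(self, layouts):
        if not layouts:
            raise ValueError("at least one layout is required")
        self.layouts = list(layouts)
        self.current = 0                     # the first group is active at startup

    @property
    def layout(self) -> str:
        return self.layouts[self.current]

    @property
    def led_on(self) -> bool:
        # grp_led:scroll lights the LED when the current group is not the first.
        return self.current != 0

    def next_group(self) -> str:             # Alt-RightShift in the example
        self.current = (self.current + 1) % len(self.layouts)
        return self.layout

    def previous_group(self) -> str:         # Alt-LeftShift in the example
        self.current = (self.current - 1) % len(self.layouts)
        return self.layout
```

With `GroupState(["us", "th", "lo"])`, three calls to `next_group()` visit Thai, Lao and then return to US, at which point the LED goes off again.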

The arguments for setxkbmap can be specified in /etc/X11/XF86Config for initialization on X server startup by describing the “InputDevice” section for keyboard, for example:

setxkbmap 的参数可以在 /etc/X11/XF86Config 的“InputDevice”小节中为键盘指定,在 X 服务器启动时初始化,例如:

Section "InputDevice"
    Identifier "Generic Keyboard"
    Driver "keyboard"
    Option "CoreKeyboard"
    Option "XkbRules" "xfree86"
    Option "XkbModel" "microsoftplus"
    Option "XkbLayout" "us,th_tis"
    Option "XkbOptions" "grp:alt_shift_toggle,lv3:switch,grp_led:scroll"
EndSection

Notice the last four option lines. They tell setxkbmap to use “xfree86” rule, with “microsoftplus” model (with Internet keys), mixed layout of US and Thai TIS-820.2538, and some more options for group toggle key and LED indicator. The “lv3:switch” option is only for keyboard layouts that require a 3rd level of shift (that is, one more than the normal shift keys). In this case for “th_tis” in XFree86 4.4.0, this option sets RightCtrl as 3rd level of shift.

注意最后四行选项。它们告诉 setxkbmap 使用“xfree86”规则,“microsoftplus”型号(带有因特网功能键),混合美国英语和泰国语 TIS-820.2538 布局,以及组切换键和 LED 指示灯的选项。“lv3:switch”这个选项只用于需要第三级上档键(比一般的上档键多一级)的键盘布局。这里在 XFree86 4.4.0 中的“th_tis”布局设定右 Ctrl 键为第三级上档键。

Providing a Keyboard Map

==提供键盘布局==

If the keyboard map for a language is not available, one needs to prepare a new one. In XKB terms, one needs to prepare a symbols map, associating keysyms to the available keycodes.

如果一种语言没有可用的键盘布局,就需要制作一个新的。对于 XKB,需要提供一个符号映射,并将键盘符号和可用的击键代码联系起来。

The quickest way to start is to read the available symbols files under the /etc/X11/xkb/symbols directory. In particular, the files used by default rules of XFree86 4.3.0 are under the pc/ subdirectory. Here, only one group is defined per file, unlike the old files in its parent directory, in which groups are pre-combined. This is because XFree86 4.3.0 provides a flexible method for mixing keyboard layouts.

开始的最快方法是读取 /etc/X11/xkb/symbols 目录下已有的符号文件。要注意的是 XFree86 4.3.0 使用的缺省规则文件在 pc/ 子目录下。这里每个文件只定义了一个组,而不像上一级目录中那些老式的文件,把不同的组合并在一起。这是因为 XFree86 4.3.0 为混合的键盘布局提供了一种灵活的方法。

Therefore, unless you need to support the old versions of XFree86, all you need to do is to prepare a single-group symbols file under the pc/ subdirectory.

因此,除非你需要支持旧版本的 XFree86,否则只要在 pc/ 目录下提供一个单独的组符号文件就可以了。

Here is an excerpt from the th_tis symbols file:

以下是 th_tis 符号文件的片段:

partial default alphanumeric_keys
xkb_symbols "basic" {
    name[Group1]= "Thai (TIS-820.2538)";

    // The Thai layout defines a second keyboard group and changes
    // the behavior of a few modifier keys.
    key { [ 0x1000e4f, 0x1000e5b ] };
    key { [ Thai_baht, Thai_lakkhangyao ] };
    key { [ slash, Thai_leknung ] };
    key { [ minus, Thai_leksong ] };
    key { [ Thai_phosamphao, Thai_leksam ] };

};

Each element in the xkb_symbols data, except the first one, associates keysyms with a keycode for the unshifted and shifted versions, respectively. Some keysyms are predefined in Xlib; you can find the complete list in X11/keysymdef.h. If the keysyms for a language are not defined there, Unicode keysyms can be used, as shown in the first key entry. (In fact, this may be a more effective way of adding new keysyms.) The Unicode value must be prefixed with "0x100" to describe the keysym for a single character; for example, U+0E4F becomes 0x1000e4f.

xkb_symbols 列表中,除了第一个元素外,每个元素都表示了未切换和切换状态下不同键位代码代表的键盘字符的关联。在这里,有些键盘字符是在 Xlib 中预先定义的。完整的列表可以在 X11/keysymdef.h 文件中找到。如果一种语言的键盘字符没有定义,则可以像第一个 key 元素所示,使用 Unicode 键盘字符。(实际上,这可能是更有效地加入新的键盘字符的方法。)描述单个字符的键盘符号时 Unicode 值前面要加“0x100”。
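The "0x100" prefix rule amounts to adding 0x01000000 to the Unicode code point. The sketch below shows that conversion in both directions; the function names are illustrative, and the lower bound of U+0100 reflects the range that the X11 keysym definitions reserve for Unicode keysyms (lower code points have their own legacy keysyms).

```python
UNICODE_KEYSYM_BASE = 0x01000000


def unicode_keysym(code_point: int) -> int:
    """Return the X11 Unicode keysym for a code point (U+0100 to U+10FFFF)."""
    if not 0x100 <= code_point <= 0x10FFFF:
        raise ValueError("Unicode keysyms cover U+0100 to U+10FFFF only")
    return UNICODE_KEYSYM_BASE + code_point


def keysym_to_char(keysym: int) -> str:
    """Return the character that a Unicode keysym stands for."""
    code_point = keysym - UNICODE_KEYSYM_BASE
    if not 0x100 <= code_point <= 0x10FFFF:
        raise ValueError("not a Unicode keysym")
    return chr(code_point)
```

For the first key entry of the th_tis excerpt, `hex(unicode_keysym(0x0E4F))` is `0x1000e4f`, the value that appears in the symbols file.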

For more details of the file format, see Ivan Pascal’s XKB explanation(23). When finished, the symbols.dir file should be regenerated so that the symbols file is listed:

关于文件格式的更多细节,参看 Ivan Pascal 的 XKB 解释(23)。完成文件编写后,需要重新生成 symbols.dir 文件以便列出新添加的文件。

# cd /etc/X11/xkb/symbols
# xkbcomp -lhlpR '*' -o ../symbols.dir

Then, the new layout may be tested as described in the previous section.

然后,可以按照上一节描述的方法测试新的布局。

Additionally, entries may be added to /etc/X11/xkb/rules/xfree86.lst so that GUI keyboard configuration tools can see the layout.

此外,还可以在 /etc/X11/xkb/rules/xfree86.lst 中加入条目以便一些图形界面的键盘配置工具能够发现这些布局。

Once the new keyboard map is completed, it may also be contributed to the XFree86 source, where the XKB data are kept under the xc/programs/xkbcomp subdirectory.

新的键盘布局完成后,也可以加入到 XFree86 源代码中,关于 XKB 的数据放在 xc/programs/xkbcomp 目录下。

XIM – X Input Method

==XIM ― X 输入方法==

For some languages, such as English, text input is a straightforward one-to-one mapping from keysyms to characters. For European languages, it is a little more complicated because of accents. For Chinese, Japanese and Korean (CJK), however, a one-to-one mapping is impossible: each character requires a series of keystrokes to be interpreted.

对于一些像英语这样的语言,文本输入只是简单的击键代码与字符的一一对应。对于欧洲语言,注音符号只是使得输入稍显复杂。但对于中、日、韩文字(CJK),一一对应的映射是不可能的。这些文字需要一系列的击键来表示每个字符。

X Input Method (XIM) is a locale-based framework designed to address the text input requirements of any language. It is a separate service that handles input events on behalf of X clients. Each text entry in an X client is represented by an X Input Context (XIC). All keyboard events are propagated to the XIM, which determines the appropriate action for each event based on the current state of the XIC and passes back the resulting characters.

X 输入方法是一种基于区域设置的框架,用来满足任何语言的文本输入需要。它是一个处理 X 客户程序输入事件请求的单独服务。X 客户程序中的任何文字输入都用 X 输入语境 (X Input Context,XIC)来表示。所有键盘事件都被传递到 XIM,它根据 XIC 的当前状态决定事件对应的正确动作,然后传送回相应的字符。
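The role of the input context's state can be illustrated with a toy model in Python. This is a sketch under stated assumptions, not the Xlib or XIM API: the class `ToyInputContext`, its `handle_key` method and the two-entry lookup table are all hypothetical, and a real input method is far more elaborate. It only shows how the same key event can yield nothing, a composed character, or the key itself, depending on what the context has already seen.

```python
from typing import Dict, Optional


class ToyInputContext:
    """Hypothetical stand-in for an XIC: per-text-field state for one input method."""

    def __init__(self, table: Dict[str, str]) -> None:
        self.table = table   # keystroke sequence -> resulting character
        self.pending = ""    # keystrokes received but not yet resolved

    def handle_key(self, key: str) -> Optional[str]:
        """Feed one key event; return committed text, or None while composing."""
        self.pending += key
        if self.pending in self.table:
            # A complete sequence: commit the character and reset the state.
            result = self.table[self.pending]
            self.pending = ""
            return result
        if any(seq.startswith(self.pending) for seq in self.table):
            # Still a prefix of some known sequence: keep waiting.
            return None
        # No sequence can match: pass the keystrokes through unchanged.
        result, self.pending = self.pending, ""
        return result


if __name__ == "__main__":
    # Illustrative table only; not taken from any real input method.
    context = ToyInputContext({"ni": "你", "hao": "好"})
    for key in "nihao":
        print(repr(key), "->", context.handle_key(key))
```

Keeping the pending keystrokes inside the context object, rather than in the client, mirrors the division of labour described above: the client forwards events and receives characters, while the input method owns the interpretation state.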

Internally, every XIM begins with the same step: it calls XKB to translate the keyboard scan code into a keycode and then into a keysym, as described in the previous sections. The subsequent conversion of keysyms into characters differs from locale to locale.

在内部,各种 XIM 的共同机理是调用 XKB,通过上面叙述的过程把键盘扫描码转换成击键代码,再转换成键盘符号。之后的把键盘符号转换成字符的过程则随不同的区域设置而变化。

XIM is usually implemented using the client-server model. A more detailed discussion of XIM implementation is beyond the scope of this document. See Section 13.5 of the Xlib document [18] and the XIM protocol [19] for more information.

一般情况下,XIM 是使用客户端-服务器模型实现的。关于 XIM 实现的更多详细讨论超出了本册的范围。请参考 Xlib 文档[18]的 13.5 节和 XIM 协议[19]获取更多信息。(脚注编号需要确认)

In general, users can choose their preferred XIM server by setting the XMODIFIERS environment variable, like this:

通常,用户可以通过设置系统环境变量 XMODIFIERS 选择他们想要的 XIM 服务器,例如:

$ export LANG=th_TH.TIS-620
$ export XMODIFIERS="@im=Strict"

This selects the Strict input method for the Thai locale.

以上命令设置了泰国区域设置中的 Strict 输入法。
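As a small illustration of how the XMODIFIERS value is structured, the following Python sketch extracts the input method name from it. The function name `input_method_name` is invented for this example, and the sketch assumes the Xlib locale-modifier syntax in which the string is a concatenation of "@category=value" pairs; it does not contact any XIM server.

```python
import os
import re
from typing import Mapping, Optional


def input_method_name(environ: Mapping[str, str] = os.environ) -> Optional[str]:
    """Return the value of the "im" modifier in XMODIFIERS, or None if absent."""
    modifiers = environ.get("XMODIFIERS", "")
    # Each modifier has the form @category=value; several may be concatenated.
    for category, value in re.findall(r"@([^=@]+)=([^@]*)", modifiers):
        if category == "im":
            return value
    return None


if __name__ == "__main__":
    # With the settings shown above, this reports the Strict input method.
    print(input_method_name({"XMODIFIERS": "@im=Strict"}))
    # Reports whatever the current session has configured, if anything.
    print(input_method_name())
```

Passing the environment as a parameter keeps the function easy to try out with sample values without changing the real session settings.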

——————————————————————————–

6 UCS is the acronym for Universal multi-octet coded Character Set

6 UCS 是通用多字节编码字符集(Universal multi-octet coded Character Set)的缩写形式。

7 UTF is the acronym for Unicode (UCS) Transformation Format

7 UTF 是 Unicode 变换格式(Unicode (UCS) Transformation Format)的缩写。

8 The Unicode Consortium. The Unicode Standard, Version 4.0, pp. 76-77.

9 The Unicode Consortium. The Unicode Standard, Version 4.0, pp. 77-78.

10 Ibid., pp. 95-104.

11 Unicode.org, 'Unicode Technical Reports'; available from www.unicode.org/reports/index.html.

12 Unicode.org, 'Unicode Technical Reports'; available from www.unicode.org/reports/index.html.

13 Leisher, M., 'The XmBDFEd Font Editor'; available from crl.nmsu.edu/~mleisher/xmbdfed.html.

14 Williams, G., 'PfaEdit'; available from pfaedit.sourceforge.net.

15 van Rossum, J., 'TTX/FontTools'; available from fonttools.sourceforge.net/.

16 Note the difference from Microsoft’s “Windows” trademark: X Window has no 's'.

16 注意与微软的“Windows”的差别。X Window 不带“s”。

(这里原书的脚注编号不正确――译注)

18 Taylor, O., 'Pango – Design'; available from www.pango.org/design.shtml.

19 GNOME Development Site, 'Pango Reference Manual'; available from developer.gnome.org/doc/API/2.0/pango/.

20 This is a very rough classification. Obviously, there are further steps, such as line breaking, alignment and justification. They need not be discussed here, as they go beyond localization.

20 这是一种粗糙的分类。显然,还有更多的步骤,例如断行、对齐。由于它们超出了本地化的范围,所以不在这里讨论。

21 Pascal, I., 'X Keyboard Extension'; available from pascal.tsu.ru/en/xkb/.

22 Pascal, I., 'X Keyboard Extension'; available from pascal.tsu.ru/en/xkb/.

23 Gettys, J., Scheifler, R.W., 'Xlib – C Language X Interface', X Consortium Standard, X Version 11 Release 6.4.

(文章来源:洛基开放文化实验室)
