The most important issues for multilingual text processing systems are (1) a unified mechanism for multiple languages and (2) adjustability and extensibility of the facilities provided by the system. In this paper, we explain how these issues are solved in a system called Mule (MULtilingual Enhancement to GNU Emacs). Mule is a plain-text-oriented screen editor running on various systems (UNIX, DOS, Windows, and OS/2 on many platforms). It provides users with a flexible multilingual text processing environment that covers not only simple text editing but also reading and writing Internet mail and news, browsing Web pages, etc. All these facilities are controlled by a unified mechanism and are fully adjustable and extensible by users.
With the remarkable progress of computer technology, we now have sufficient computing power to handle not only English but also many other languages. Many countries have already developed ways to handle their national languages on computers. Most of them are based on a locale mechanism, which causes no problem as long as each computer environment is limited to its own locale. But as the Internet connects the entire world more tightly, the need for smooth communication among various locales, nations, and languages grows, and we now face the difficulty of handling multiple languages together. The difficulty arises because each locale system has been developed independently, without regard for other locale systems.
To solve the problem of dealing with multiple languages, we have developed Mule, a multilingual plain-text editor based on GNU Emacs. GNU Emacs is one of the most famous and widely used editors in the UNIX world. The most important feature of Emacs is that it is equipped with an Emacs Lisp interpreter and most of its facilities are implemented in Emacs Lisp, which means that GNU Emacs can be enhanced just by modifying or adding Emacs Lisp programs. Emacs can be used not only for editing but for almost all kinds of text processing, including reading and writing Internet mail and news, browsing Web pages, consulting manuals, looking up on-line dictionaries, etc. All these are realized by Emacs Lisp programs and hence are fully customizable.
Mule inherits all these facilities and can use them in a multilingual environment. For instance, Mule users can exchange mail and read Web pages in any language, as long as that language is supported by Mule.
In the design of Mule, we have kept the following two points in mind:
Other important issues are adjustability and extensibility, that is, how easy it is to customize or augment facilities. It is almost impossible to design and provide in advance all the functions necessary for all supported languages. The number of supported languages may change in the future. In developing countries, even the standard environment for national languages may change. Any system that cannot be flexibly modified or augmented will soon become useless, or will make adding new languages too difficult.
Therefore, even when we find that a new function is necessary for some language, we do not add it immediately. First, we examine whether other languages may also need a similar function, which part of the function should be generalized for all languages, and which part should be kept modifiable. Once these points are decided, we design the most versatile function.
Text processing is a complex task. At a minimum, it involves inputting, storing, restoring, and displaying text. When we say Mule supports some language, it means that Mule can at least:
Here, we use the word ``standard'' not only for ``official standard'' but also for ``de facto standard''. Moreover, while processing a text we have to treat characters, words, and lines according to the standard writing rules of each language. All these facilities are controlled by a unified mechanism for handling character sets, coding systems, input methods, and display routines.
The following sections are organized as:
A character set is a group of characters used together for some regional text (e.g. English text, Japanese text). Although a single character set for all languages is a system developer's sweetest dream, no such thing exists at the moment. ISO 10646 (or Unicode) is of no use for the moment, especially for Chinese characters, because of its inconsistent handling of CJK unification. That is why we decided to handle multiple character sets in Mule.
Most of Mule's character sets have a one-to-one correspondence to character sets registered in ISO (e.g. ISO 8859-1, JIS X0208). Although each ISO character set is identified by its size-type (how many characters it contains) and final-character (a one-byte code that distinguishes character sets of the same size-type), Mule identifies each character set by a unique identification number called charset-id. Thus, to define a character set in Mule is to associate a unique charset-id with the corresponding ISO character set and to inform Mule of several parameters of the character set. These parameters, which are used for editing, include display width, writing direction, etc. Table 1 shows examples of character sets.
| ISO character set | | | parameters used in Mule | | | |
| name | size-type | final-character | charset-id | bytes | width | direction |
| ASCII | 94 | 'B' | 0 | 1 | 1 | left-to-right |
| ISO8859-1 (Latin1) | 96 | 'A' | 129 | 2 | 1 | left-to-right |
| ISO8859-8 (Hebrew) | 96 | 'H' | 136 | 2 | 1 | right-to-left |
| TIS620 | 96 | 'T' | 133 | 2 | 1 | left-to-right |
| GB2312 | 94 X 94 | 'A' | 145 | 3 | 1 | left-to-right |
| JISX0208 | 94 X 94 | 'B' | 146 | 3 | 1 | left-to-right |
| CNS11643-1 | 94 X 94 | 'G' | 149 | 4 | 1 | left-to-right |
| CNS11643-3 | 94 X 94 | 'I' | 246 | 4 | 1 | left-to-right |
It is also possible to define a character set that is not registered in ISO. In that case, Mule uses a final-character reserved for private use by ISO. If a character set does not originally conform to the technical requirements of ISO 2022, it should be rearranged to meet them, either by dividing it into smaller character sets or by changing character code points. A typical example is the Vietnamese character set, the details of which are described in the next section.
The most common and simplest way to hold a text in computer memory is as an array of fixed-length elements (one to four bytes), one element per character. For multilingual text, however, one or two bytes per character are clearly insufficient to cover characters from all over the world. Four-byte elements would be sufficient but waste memory for English-only text. Hence, instead of a fixed-length representation, we adopted a multi-byte variable-length form (hereafter, multi-byte form) to represent characters in Mule's buffer, both for efficient memory usage and for extensibility. (This idea originates from a brief note by Stallman.)
In multi-byte form, each character is represented by one or two bytes of leading codes for the charset-id, followed by one or two bytes for the character code. The only exception is ASCII characters: they are represented as is, and their charset-id is 0. Table 2 shows a more formal definition of Mule's internal character representation.
CHARACTER := ASCII_CHAR | MULTIBYTE_CHAR
MULTIBYTE_CHAR := PRIMARY_CHAR_1 | PRIMARY_CHAR_2
| SECONDARY_CHAR_1 | SECONDARY_CHAR_2
PRIMARY_CHAR_1 := LEADING_CODE_PRI C1
PRIMARY_CHAR_2 := LEADING_CODE_PRI C1 C2
SECONDARY_CHAR_1 := LEADING_CODE_SEC LEADING_CODE_EXT C1
SECONDARY_CHAR_2 := LEADING_CODE_SEC LEADING_CODE_EXT C1 C2
ASCII_CHAR := 0 | 1 | ... | 127
LEADING_CODE_PRI := 129 | 130 | ... | 153
LEADING_CODE_SEC := 154 | 155 | 156 | 157
C1, C2, LEADING_CODE_EXT := 160 | 161 | ... | 255
In the table, PRIMARY_CHAR and SECONDARY_CHAR differ only in the memory required per character; in editing there is no difference between them. A charset-id is represented by a single LEADING_CODE_PRI (in this case, the charset-id is from 129 to 153) or by a sequence of LEADING_CODE_SEC and LEADING_CODE_EXT (in this case, the charset-id is more than 160). A single-byte character set can contain at most 96 characters and is represented by a two- or three-byte sequence; a double-byte character set can contain at most 9216 (96 x 96) characters and is represented by a three- or four-byte sequence. We have selected frequently used character sets and defined them as PRIMARY. All character sets added by users are defined as SECONDARY. For instance, in the series of Chinese (Taiwanese) character sets CNS11643, the first two planes are PRIMARY but the remaining planes are SECONDARY (see Table 1). Figure 1 shows the usage of the one-byte code area.
| 0x00--0x7f | character code of ASCII_CHAR |
| 0x80--0x99 | LEADING_CODE_PRI |
| 0x9a--0x9f | LEADING_CODE_SEC |
| 0xa0--0xff | 1st and 2nd character codes of MULTIBYTE_CHAR or LEADING_CODE_EXT |
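Because the byte ranges in Table 2 do not overlap, a buffer in multi-byte form can be split into characters without any per-character-set table. The following Python sketch is only an illustration of that property, not Mule's implementation. The function name `parse_multibyte` is hypothetical, the leading-code ranges are taken from Table 2, and the sketch assumes that for a SECONDARY character the LEADING_CODE_EXT byte is itself the charset-id.

```python
from typing import List, Tuple


def parse_multibyte(data: bytes) -> List[Tuple[int, Tuple[int, ...]]]:
    """Split a buffer in multi-byte form into (charset_id, code_bytes) pairs."""
    chars = []
    i, n = 0, len(data)
    while i < n:
        b = data[i]
        if b <= 127:
            # ASCII_CHAR: stored as is, charset-id 0.
            chars.append((0, (b,)))
            i += 1
            continue
        if 129 <= b <= 153:
            # LEADING_CODE_PRI: the leading code is the charset-id.
            charset_id = b
            i += 1
        elif 154 <= b <= 157:
            # LEADING_CODE_SEC followed by LEADING_CODE_EXT.
            # Assumption: the EXT byte carries the charset-id.
            if i + 1 >= n or data[i + 1] < 160:
                raise ValueError(f"missing LEADING_CODE_EXT at offset {i}")
            charset_id = data[i + 1]
            i += 2
        else:
            raise ValueError(f"unexpected byte 0x{b:02x} at offset {i}")
        # C1 and optional C2: one or two bytes in the range 160..255.
        start = i
        while i < n and data[i] >= 160 and i - start < 2:
            i += 1
        if i == start:
            raise ValueError(f"leading code without character code at offset {start}")
        chars.append((charset_id, tuple(data[start:i])))
    return chars
```

For example, the byte sequence `41 81 E9 92 B0 A1` (hexadecimal) splits into an ASCII character, a one-byte character of charset-id 129, and a two-byte character of charset-id 146.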
A coding system, or encoding mechanism, is a way to encode text. Many different coding systems are in use on computers, and different countries use different ones. Therefore, Mule automatically converts between the various external representations of text and the internal multi-byte form whenever it interacts with the outside world: reading or writing files, communicating with another process, communicating via a network, accepting data from a terminal, or outputting data to a terminal.
To make the process of code conversion adjustable and extensible, we avoided equipping Mule with hard-coded conversion routines. Instead, we made a generic model of a coding system with several parameters to be filled in. Fortunately, most coding systems now in use in the world fit the framework of ISO 2022. Therefore, we categorized coding systems into an ISO-2022 type and a non-ISO-2022 type. For the former, we created a generic ISO 2022 interpreter. For the latter, we designed a simple programming language, CCL (Code Conversion Language), and created an interpreter for it.
When users specify some coding system for some processing (e.g. reading a file, or sending mail), Mule automatically invokes the ISO 2022 interpreter or the CCL interpreter according to the specified coding system.
Although ISO 2022 allows many variations for encoding the same text, only a few of them are actually used. Thus the small number of parameters listed in Table 3 is enough to specify an encoding. For instance, the Chinese, Japanese, and Korean variants of EUC (Extended UNIX Code) and the whole ISO 8859 series (parts 1 through 10) differ only in the parameters initial designations and reserved designations. Other examples are the 7-bit ISO-2022 series, such as ISO-2022-JP, ISO-2022-JP-2, ISO-2022-KR, and ISO-2022-CN. They all use a 7-bit environment and differ only in reserved designations and locking shift. The first two do not use the locking shift function, whereas the others do.
| parameter | value | meaning |
| initial designations | list of charset-id | For each graphic register, the character set initially designated to it. |
| reserved designations | list of charset-id | For each graphic register, the character set designated exclusively to it on encoding. |
| 7-bit environment | true/false | Use only the lower 7 bits, or the full 8 bits, on encoding. |
| locking shift | true/false | Whether to use the locking shift function. |
| single shift | true/false | Whether to use the single shift function. |
| direction indication | true/false | Whether to use an ISO 6429 escape sequence to indicate writing direction on encoding. |
Table 4 shows how easily a coding system of ISO-2022 type can be defined in Mule. Once defined, a coding system can be specified in any situation where code conversion is required. For instance, after the definition in Table 4, users can read and write GB files, display Chinese characters on a GB terminal, and exchange mail in GB.
'*euc-china* ;; Name of the coding system
2 ;; Type, '2' means ISO-2022 type
?C ;; Mnemonic character of the coding system
t ;; auto-detect end-of-line type (CR, CRLF, LF)
(list lc-ascii lc-cn ;; G0 is for ASCII, G1 is for Chinese GB2312
nil nil ;; G2 and G3 are never used.
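To make the effect of the initial designations concrete, the following Python sketch converts between Chinese EUC and the multi-byte form described earlier. It is a deliberately narrow illustration, not the generic ISO 2022 interpreter: it hard-codes G0 = ASCII and G1 = GB2312 (charset-id 145 in Table 1) and ignores escape sequences, shifts, and the G2/G3 registers. The function names are hypothetical.

```python
LC_CN = 145  # charset-id of GB2312, taken from Table 1


def euc_to_internal(data: bytes, g1_charset: int = LC_CN) -> bytes:
    """Decode 8-bit EUC (G0 = ASCII, G1 = a 94 x 94 set) into multi-byte form."""
    out = bytearray()
    i = 0
    while i < len(data):
        b = data[i]
        if b < 0x80:
            out.append(b)  # G0 (ASCII) is stored as is
            i += 1
            continue
        pair = data[i:i + 2]
        if len(pair) != 2 or min(pair) < 0xA1:
            raise ValueError(f"invalid EUC sequence at offset {i}")
        out.append(g1_charset)  # leading code for the G1 character set
        out += pair             # the two code bytes are kept unchanged
        i += 2
    return bytes(out)


def internal_to_euc(data: bytes, g1_charset: int = LC_CN) -> bytes:
    """Encode multi-byte form back into 8-bit EUC; other charsets are rejected."""
    out = bytearray()
    i = 0
    while i < len(data):
        b = data[i]
        code = data[i + 1:i + 3]
        if b < 0x80:
            out.append(b)
            i += 1
        elif b == g1_charset and len(code) == 2 and min(code) >= 0xA1:
            out += code  # drop the leading code
            i += 3
        else:
            raise ValueError(f"cannot encode byte 0x{b:02x} at offset {i}")
    return bytes(out)
```

In this sketch the conversion only adds or removes the leading code, because the EUC code bytes already lie in the range that the multi-byte form uses for C1 and C2.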
Examples of non-ISO-2022 coding systems are Russian KOI-8 and Vietnamese VISCII. Although KOI-8 conforms to the technical requirements of ISO 2022, its character code points differ from those of ISO 8859-5 (Latin/Cyrillic alphabet), which is the default character set for Cyrillic characters in Mule.
VISCII uses full 8-bit codes for 134 Vietnamese-specific characters and therefore does not conform to the technical requirements of ISO 2022. Hence, in Mule, the Vietnamese character set is divided into two, lowercase letters and uppercase letters, and each is assigned its own charset-id.
Since the generic ISO 2022 interpreter cannot be used for decoding and encoding these character sets, we equipped Mule with CCL programs to do the conversion. CCL is a simple but powerful programming language suited to writing code conversion algorithms, which means that, in theory, Mule can handle any kind of coding system given an appropriate CCL program. Table 5 shows the source of a CCL program to encode KOI-8.
(define-ccl-program ccl-write-koi8
'(1
((read r0)
(loop (if (r0 != 140) (write-read-repeat r0)
((read r0) (r0 -= 160)
(write-read-repeat r0
[ 32 179 32 32 32 32 32 32 32 32 32 32 32 32 32 32
225 226 247 231 228 229 246 250 233 234 235 236 237 238 239 240
242 243 244 245 230 232 227 254 251 253 255 249 248 252 224 241
193 194 215 199 196 197 214 218 201 202 203 204 205 206 207 208
210 211 212 213 198 200 195 222 219 221 223 217 216 220 192 209
32 163 32 32 32 32 32 32 32 32 32 32 32 32 32 32])
)))))
"CCL program to write KOI8.")
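For readers unfamiliar with CCL, the control flow of the program in Table 5 can be paraphrased in Python. This is a sketch under assumptions, not a CCL interpreter: the names `write_koi8` and `KOI8_TABLE` are hypothetical, and the value 140 is read as the leading code of the Cyrillic character set because the CCL program tests `r0 != 140`. The table is copied from Table 5.

```python
LC_CRL = 140  # leading code tested by the CCL program (r0 != 140)

# Indexed by (ISO 8859-5 code - 160); values are KOI-8 codes. 32 marks "no mapping".
KOI8_TABLE = [
     32, 179,  32,  32,  32,  32,  32,  32,  32,  32,  32,  32,  32,  32,  32,  32,
    225, 226, 247, 231, 228, 229, 246, 250, 233, 234, 235, 236, 237, 238, 239, 240,
    242, 243, 244, 245, 230, 232, 227, 254, 251, 253, 255, 249, 248, 252, 224, 241,
    193, 194, 215, 199, 196, 197, 214, 218, 201, 202, 203, 204, 205, 206, 207, 208,
    210, 211, 212, 213, 198, 200, 195, 222, 219, 221, 223, 217, 216, 220, 192, 209,
     32, 163,  32,  32,  32,  32,  32,  32,  32,  32,  32,  32,  32,  32,  32,  32,
]


def write_koi8(internal: bytes) -> bytes:
    """Encode multi-byte form as KOI-8, mirroring the loop of ccl-write-koi8."""
    out = bytearray()
    stream = iter(internal)
    for byte in stream:
        if byte != LC_CRL:
            # Anything that is not the Cyrillic leading code is written unchanged.
            out.append(byte)
            continue
        code = next(stream, None)  # the character code that follows the leading code
        if code is None or code < 160:
            raise ValueError("Cyrillic leading code without a character code")
        out.append(KOI8_TABLE[code - 160])
    return bytes(out)
```

As in the CCL original, bytes other than the Cyrillic leading code pass through untouched, so this sketch is meaningful only for text consisting of ASCII and Cyrillic characters.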
We have investigated existing input methods for multilingual characters and categorized them into the following four types.
For instance, in a Japanese input method, a Hiragana (Japanese phonetic alphabet) sequence is typed first, and the sequence is then converted by a conversion program into an appropriate mixture of Kanji (Chinese characters) and Hiragana. Several such conversion programs exist, such as Wnn, Canna, SJ3 (all for Japanese), and cWnn (for Chinese), and they can be used from Mule. They usually use a very large dictionary and knowledge of the grammar of each language to generate an appropriate character sequence.
In Mule, input methods of the first three types are realized as a keyboard input translation system named Quail. Quail is given one set of translation rules (called a `Quail package') at a time and translates user input accordingly. Each translation rule consists of a key sequence and the corresponding translated text. One rule can have multiple candidates for the translated text, in which case users are prompted to select one interactively.
It is quite easy to customize a Quail package. Users can simply add new translation rules or modify existing ones. Because Quail packages are modular, adding rules for a new language is also easy. It can be done just by making a new package with an appropriate name and defining any number of translation rules in the package. Table 6 shows how to make a new Quail package that simulates Caps-Lock (i.e. all lowercase letters are translated to uppercase letters).
;; At first, define new Quail package.
(quail-define-package "caps-lock"
"Caps-Lock" nil "Simulate Caps-Lock")
;; Then, define translation rules of the package.
(quail-defrule "a" "A")
...
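The rule-based translation that Quail performs can be modeled in a few lines of Python. The sketch below is a hypothetical model, not Quail's code: the class `QuailPackage` and its methods are invented for illustration, matching the longest key sequence first is an assumption, and the interactive selection among candidates is replaced by a `choose` callback.

```python
import string


class QuailPackage:
    """A named set of translation rules: key sequence -> candidate texts."""

    def __init__(self, name, title, docstring=""):
        self.name = name
        self.title = title
        self.docstring = docstring
        self.rules = {}

    def defrule(self, keys, *candidates):
        # One rule may carry several candidate translations.
        self.rules[keys] = list(candidates)

    def translate(self, keys, choose=lambda candidates: candidates[0]):
        """Translate typed keys; `choose` stands in for interactive selection."""
        longest = max(map(len, self.rules), default=0)
        out = []
        i = 0
        while i < len(keys):
            # Assumption: prefer the longest key sequence that has a rule.
            for length in range(min(longest, len(keys) - i), 0, -1):
                candidates = self.rules.get(keys[i:i + length])
                if candidates:
                    out.append(choose(candidates))
                    i += length
                    break
            else:
                out.append(keys[i])  # no rule applies: pass the key through
                i += 1
        return "".join(out)


# The Caps-Lock package of Table 6, with one rule per lowercase letter.
caps_lock = QuailPackage("caps-lock", "Caps-Lock", "Simulate Caps-Lock")
for letter in string.ascii_lowercase:
    caps_lock.defrule(letter, letter.upper())
```

With this model, `caps_lock.translate("hello, World")` yields `"HELLO, WORLD"`, and a rule defined with several candidates returns whichever one the `choose` callback selects.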
Input methods of the last type are realized as a system named Tamago. It first translates keyboard input into phonetic codes (Hiragana in Japanese and PinYin or ZhuYin in Chinese) and sends them to an external conversion program through the network. It would be possible to use Quail for the first stage. However, Tamago was developed independently of Quail and, for the moment, has its own system, named its, to generate these phonetic codes. Since conversion cannot be fully automated, users can select alternative conversions interactively. For the moment, Tamago can use Wnn and cWnn as conversion servers.
Text processing is much more than character input. To facilitate text processing, Mule provides various tools. Examples are character categories and a powerful regular expression compiler that enables fast regular expression search.
While editing text, it is convenient to group some characters and use the group in editing commands. For example, a user may want to search for any Cyrillic character without specifying every Cyrillic character and combining them with OR operators. For this purpose, the original GNU Emacs assigns a character syntax code to each character. Character syntax, however, has some restrictions: each character can have at most one character syntax, and users cannot define new syntax codes.
Mule offers an additional way to group characters, called character category. Users can define new character categories and assign as many character categories to one character as they like. Table 7 shows the default character categories of Mule. Editing often involves word-by-word processing, but different languages may define words differently. With Mule, users can define a word in terms of these character categories, which makes editing commands customizable.
| 'b' | Arabic characters |
| 'c' | Chinese 2-byte characters |
| 'g' | Greek characters |
| 'h' | Korean 2-byte characters |
| 'j' | Japanese 2-byte characters |
| 'k' | Japanese 1-byte Katakana characters |
| 'r' | Japanese 1-byte Roman characters |
| 'l' | Latin characters |
| 'w' | Hebrew characters |
| 'y' | Cyrillic characters |
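The idea of searching and splitting words by character category can be sketched in Python. This is a rough model rather than Mule's mechanism: it derives a few of the Table 7 mnemonics from Unicode character names (an assumption made only so that the example is self-contained), and the names `CategoryTable`, `search_category`, and `split_words` are hypothetical.

```python
import unicodedata
from itertools import groupby

# Script name (first word of the Unicode character name) -> Table 7 mnemonic.
SCRIPT_CATEGORIES = {
    "ARABIC": "b", "GREEK": "g", "LATIN": "l", "HEBREW": "w", "CYRILLIC": "y",
}


class CategoryTable:
    """Maps each character to a set of category mnemonics."""

    def __init__(self):
        self._user = {}  # character -> user-assigned mnemonics

    def add(self, ch, mnemonic):
        # A character may belong to any number of categories.
        self._user.setdefault(ch, set()).add(mnemonic)

    def categories(self, ch):
        cats = set(self._user.get(ch, ()))
        script = unicodedata.name(ch, "").split(" ", 1)[0]
        if script in SCRIPT_CATEGORIES:
            cats.add(SCRIPT_CATEGORIES[script])
        return frozenset(cats)


def search_category(table, text, mnemonic, start=0):
    """Return the index of the first character in the category, or -1."""
    for i in range(start, len(text)):
        if mnemonic in table.categories(text[i]):
            return i
    return -1


def split_words(table, text):
    """Treat each run of characters with the same non-empty category set as a word."""
    return ["".join(run) for cats, run in groupby(text, key=table.categories) if cats]
```

Under this crude word definition, a change of script ends a word even without intervening whitespace, which illustrates how a category-based definition can differ from a whitespace-based one.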
For displaying multilingual text, we must consider two cases: running Mule from some terminal (or terminal emulator such as `xterm', `kterm', `cxterm', etc.), and running Mule under a window system.
In the former case, Mule just sends correctly encoded text to the terminal and leaves the task of rendering multilingual text to it. The code conversion is done in accordance with the coding system specified for terminal output. For instance, if a user runs Mule from `cxterm', there is no way to display any text other than English and Chinese.
When Mule is running under a window system, Mule takes responsibility for displaying multilingual text. In Mule, each character set is assigned a corresponding font. A collection of mappings from all character sets to the corresponding fonts is named a fontset and is the basis for displaying each character. Mule may use different fontsets in different contexts. For instance, when a mail message is read and its subject field is displayed in bold face, a fontset consisting of bold fonts is used only for the subject field.
On the X Window System, Mule's internal character codes usually match the code points of the corresponding font. For instance, the Japanese character set JISX0208 is displayed correctly by a font whose character code points follow JISX0208. But even if this matching does not hold for a certain combination of character set and font, we can convert internal code points to those of the font with a CCL program of the kind described in the previous section. For instance, Mule's character set for Cyrillic characters is based on ISO 8859-5, whereas a user may have only a KOI-8 font. Even in such a case, all the user has to do is change the mapping of a fontset so that the KOI-8 font is used for the Cyrillic character set, and associate an appropriate CCL program with the Cyrillic character set (see Table 8).
When a user adds support for a new language, the user can simply add a mapping between the new character set for the language and an appropriate font to the existing fontsets.
;; Change font mapping in the fontset DEFAULT-FONTSET.
(set-fontset-font default-fontset lc-crl "HERE_COMES_KOI8_FONT_NAME")
(define-ccl-program ccl-x-koi8
'(0
((r1 -= 160)
(r1 = r1
[ 32 179 32 32 32 32 32 32 32 32 32 32 32 32 32 32
225 226 247 231 228 229 246 250 233 234 235 236 237 238 239 240
242 243 244 245 230 232 227 254 251 253 255 249 248 252 224 241
193 194 215 199 196 197 214 218 201 202 203 204 205 206 207 208
210 211 212 213 198 200 195 222 219 221 223 217 216 220 192 209
32 163 32 32 32 32 32 32 32 32 32 32 32 32 32 32])))
"CCL program to convert chars of lc-crl (ISO8859-5) to KOI8 font.")
;; Associate the CCL program to Cyrillic character set LC-CRL.
(x-set-ccl lc-crl ccl-x-koi8)
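The relationship between a fontset and a per-character-set code conversion can be modeled as follows. This Python sketch is an illustration under assumptions, not Mule's display code: the class `Fontset`, its methods, and the font names are hypothetical, the Cyrillic character set is identified by the value 140 used in Table 5, and Python's own codecs stand in for the CCL table, so unmapped characters may come out differently than in Table 8.

```python
LC_ASCII = 0
LC_CRL = 140  # Cyrillic (ISO 8859-5) character set, as tested in Table 5


def iso8859_5_to_koi8(code: int) -> int:
    """Stand-in for ccl-x-koi8: map an ISO 8859-5 code to a KOI-8 font index."""
    char = bytes([code]).decode("iso8859_5")
    return char.encode("koi8_r", errors="replace")[0]


class Fontset:
    """Maps each character set to a font and, optionally, a code converter."""

    def __init__(self, name, fonts=None):
        self.name = name
        self.fonts = dict(fonts or {})  # charset-id -> font name
        self.converters = {}            # charset-id -> function(code) -> font index

    def set_font(self, charset_id, font_name, converter=None):
        self.fonts[charset_id] = font_name
        if converter is None:
            self.converters.pop(charset_id, None)  # codes match the font directly
        else:
            self.converters[charset_id] = converter

    def glyph(self, charset_id, code):
        """Return (font name, index into that font) for one character."""
        converter = self.converters.get(charset_id)
        index = converter(code) if converter else code
        return self.fonts[charset_id], index


# Hypothetical font names, for illustration only.
default_fontset = Fontset("default", {LC_ASCII: "ascii-font", LC_CRL: "iso8859-5-font"})
default_fontset.set_font(LC_CRL, "koi8-font", iso8859_5_to_koi8)
```

After the last line, a Cyrillic character is looked up in the KOI-8 font at the converted index, while ASCII characters continue to use their font with unconverted codes.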
We have described the multilingual text editor Mule, stressing its adjustability and extensibility. Mule is equipped with an Emacs Lisp interpreter, which makes the system open. All customization can be done with simple Emacs Lisp programs.
Mule was released in 1993. Since then, people around the world have contributed support for their own languages. It can now handle most European languages (including Russian and Greek) and East Asian languages (Chinese, Japanese, Korean), in addition to Thai, Vietnamese, Hebrew, Arabic, Turkish, and others. We are now working hard to support Indian languages (Devanagari scripts).
Aside from new language support, there are also many contributed applications running on Mule, such as on-line dictionary look-up tools and a MIME encoder and decoder. The existence of these tools shows that Mule can be a multilingual workbench or environment rather than a mere editor.
We are now integrating Mule into the original GNU Emacs in cooperation with the Free Software Foundation, the organization that distributes GNU Emacs. A future release of GNU Emacs will contain Mule's multilingual facilities.
Mule is distributed free of charge under the terms of the GNU General Public License. Mule Version 2.3 is available through anonymous ftp from the following sites and many other mirror sites.