Appendix B: Email and International Character Sets

Computers store all information in the form of “bits” or their 8-bit conglomerations “bytes”. Bits are also the entities that are transferred from the sender’s computer to the recipient’s computer whenever an email message is sent. Email programs take the message and convert it to bits. The message is sent and the receiving email client program translates these bits back into a readable message for the recipient. This process takes place seamlessly for the sender and the recipient. The sender first creates a text message and the recipient receives a text message – all the converting remains behind the scenes.

In order for characters from an alphabet to be converted into bits for transmission and then converted back into the message, the bits have to be arranged into sequences representing each character in the alphabet. Matching the bit sequences to alphabetical characters is called “mapping”. Mapping bit sequences to alphabets has resulted in several different so-called “character sets” (short: “charsets”) that have been defined and standardized by the international community.

In the English-speaking world, probably the most widely used charset is ASCII (sometimes also called US-ASCII), which maps 7-bit sequences to the 26 letters of the Latin alphabet. Because 7 bits have enough room for 128 characters (0-127), the ASCII charset contains more than the 26 Latin letters: each letter appears twice (as upper case and lower case), and there are the ten digits 0-9 and various punctuation marks such as comma, dot, semicolon, colon, dash, slash, backslash, exclamation mark, and question mark. There are also symbols such as “#” and “&”, which have special meaning to certain protocols, and a set of non-printing control characters, such as tab and line feed.
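As a small illustration of this mapping, the following Python sketch prints the decimal value and the 7-bit sequence of a few ASCII characters. The helper name ascii_bits is made up for this example and is not part of LISTSERV Maestro.

```python
def ascii_bits(ch):
    """Return the 7-bit sequence that ASCII assigns to a single character."""
    value = ord(ch)
    if value > 127:
        # Anything above 127 does not fit into 7 bits and is not ASCII.
        raise ValueError(f"{ch!r} is not an ASCII character")
    return format(value, "07b")


# Print character, decimal value, and bit sequence for a few samples.
for ch in "Aa0#&":
    print(ch, ord(ch), ascii_bits(ch))
```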

Used almost as frequently, at least in the western world, are the charsets from the ISO-8859 family. These charsets map 8-bit sequences to letters, digits, and characters from various European languages, Hebrew, and Arabic. Since the ISO-8859 charsets use 8 bits, they have twice the range of ASCII – enough room for 256 characters (0-255). For convenience, all ISO-8859 charsets contain the full range of ASCII in their “lower” 128 characters; the bytes 0-127 from any ISO-8859 charset map directly to the corresponding ASCII character, making ISO-8859 a superset of ASCII. The ISO-8859 charsets differ from each other in the “upper” 128 characters, the bytes 128-255.

For example, ISO-8859-1, mapping an alphabet suitable for West-European languages, has the umlauts Ä, Ö, and Ü at the positions 196, 214, and 220. In comparison, ISO-8859-7, mapping the Greek alphabet, has the Greek letters Δ, Φ, and ά at the same positions.
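The difference can be checked with a few lines of Python, which ships codecs for the ISO-8859 family. This is only a sketch; the helper name decode_byte is an assumption made for the example.

```python
def decode_byte(value, charset):
    """Interpret a single byte value (0-255) according to the given charset."""
    return bytes([value]).decode(charset)


# The same three byte values, read once as West European and once as Greek.
for value in (196, 214, 220):
    print(value, decode_byte(value, "iso-8859-1"), decode_byte(value, "iso-8859-7"))
```

The byte values are identical in both columns of the output; only the charset used for decoding changes which character appears.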

In addition to the ISO-8859 charsets, there are of course a multitude of other charsets, including the “Unicode” charset (which aims to include all characters from all languages), and, for example, charsets for the east Asian languages, such as Chinese, Japanese, and Korean.

The following charsets are currently supported by LISTSERV Maestro:

· ISO-8859-1 Latin 1 (West European)

· ISO-8859-2 Latin 2 (East European)

· ISO-8859-3 Latin 3 (South European)

· ISO-8859-4 Latin 4 (North European)

· ISO-8859-5 Cyrillic

· ISO-8859-6 Arabic

· ISO-8859-7 Greek

· ISO-8859-8 Hebrew

· ISO-8859-9 Latin 5 (Turkish)

· ISO-8859-15 Latin 9 (West European, update of Latin 1 with some French and Finnish letters that were omitted in Latin 1, plus the Euro currency symbol € instead of the international currency symbol ¤.)

· BIG5 Traditional Chinese

· GB-2312 Simplified Chinese

· ISO-2022-JP Japanese

· EUC-JP Japanese

· Shift-JIS Japanese

· KSC-5601 Korean

· EUC-KR Korean

· UTF-8 International Unicode (encoded in UTF-8 format, Unicode is a very large charset with room for almost all characters of many different languages of the world, even the many Asian characters).

The 8-bit range of 0-255 is not enough to accommodate all letters from even the European languages at once (which is why there are more than a dozen different members of the ISO-8859 family). Also, 8-bit charsets do not take into account the other major language groups of the world, such as Asian languages.

To address the limitations of 8-bit charsets, the Unicode charset has become more and more widespread. Unicode was originally designed as a 16-bit charset with a range of 65536 characters and has since been extended beyond that range. It contains more or less all letters and characters from the most widely used languages, as well as a set of symbols and other useful characters. LISTSERV Maestro offers Unicode in the form of its UTF-8 variant. UTF-8 is an encoding of the Unicode charset that maps each Unicode character to between one and four bytes, in such a way that more common characters (like ASCII characters) need fewer bytes than uncommon characters.

Again, for convenience, the first 128 characters of Unicode (0-127) are the same as in the ASCII charset, while the first 256 characters (0-255) are the same as in ISO-8859-1 (West European). A large percentage of all other letters of world languages are assigned values from 256 to 65535 (although, not even the large range of Unicode is enough to accommodate all letters from all languages).
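To make the variable-length nature of UTF-8 and the overlap with ISO-8859-1 concrete, here is a minimal Python sketch. The helper name utf8_length is hypothetical and chosen only for this illustration.

```python
def utf8_length(ch):
    """Return the number of bytes UTF-8 needs for one character."""
    return len(ch.encode("utf-8"))


# ASCII characters need one byte, other characters need more.
for ch in ("A", "Ä", "Δ", "€", "日"):
    print(ch, ord(ch), utf8_length(ch))

# The Unicode values 0-255 correspond to the byte values of ISO-8859-1.
latin1_text = bytes(range(256)).decode("iso-8859-1")
same_as_unicode = all(ord(c) == i for i, c in enumerate(latin1_text))
print(same_as_unicode)
```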

LISTSERV Maestro and International Character Sets

What happens when international characters are used in email messages written and delivered in LISTSERV Maestro?

Internally, LISTSERV Maestro uses pure Unicode, allowing for the mixture of any characters in email, including the subject line and any data merged from uploaded files or selected from a database – as long as there is a way of inputting them. For some languages, this simply requires the installation of a special keyboard and display driver for that language.

For sending, LISTSERV Maestro needs to decide on a charset that it can use to encode the message. Specify the charset to use while defining the content (there is a special item for this on the content definition page), or tell LISTSERV Maestro that it should attempt to automatically determine which charset is the optimal one for the text contained in the message.

In the latter case, LISTSERV Maestro scans the written text to determine the optimal charset. If the message contains only characters that can be displayed with the ASCII charset, then LISTSERV Maestro will choose the ASCII charset. If the message contains characters outside of the ASCII range, but can still be displayed with one of the supported ISO-8859 charsets, then LISTSERV Maestro will choose the corresponding ISO-8859 charset. Optionally (only if LISTSERV Maestro is set to allow Unicode), if the message has characters that cannot be displayed with one of the ISO-8859 charsets (for example Asian characters) or there are mixed characters from several ISO-8859 charsets, then LISTSERV Maestro will choose Unicode as the charset.

Similarly, if you have used Chinese, Japanese, or Korean characters that can be displayed with one of the supported Asian charsets, then LISTSERV Maestro will choose such a charset. And, optionally (only if you have told LISTSERV Maestro that using Unicode is OK), if you have used characters that cannot be displayed with one of the supported ISO-8859 or Asian charsets, or if you have mixed characters from several ISO-8859 charsets and/or from other languages, then LISTSERV Maestro will choose Unicode as the charset.
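The selection logic described above can be sketched in Python as follows. This is not LISTSERV Maestro's actual implementation: the function names, the order in which candidate charsets are tried, and the use of Python's codec names (for example "shift_jis" and "euc-kr") are all assumptions made for illustration.

```python
# Candidate charsets, using Python codec names (an assumption of this sketch).
ISO_8859 = ["iso-8859-%d" % n for n in (1, 2, 3, 4, 5, 6, 7, 8, 9, 15)]
ASIAN = ["big5", "gb2312", "iso-2022-jp", "euc-jp", "shift_jis", "euc-kr"]


def can_encode(text, charset):
    """Return True if every character of text exists in the charset."""
    try:
        text.encode(charset)
        return True
    except UnicodeEncodeError:
        return False


def choose_charset(text, allow_unicode=False):
    """Pick the first candidate charset that can represent the whole text."""
    for charset in ["ascii"] + ISO_8859 + ASIAN:
        if can_encode(text, charset):
            return charset
    # Nothing else fits: fall back to Unicode only if that is permitted.
    return "utf-8" if allow_unicode else None


print(choose_charset("Hello"))
print(choose_charset("Grüße"))
print(choose_charset("ΔΦά"))
```

Note that some texts fit more than one charset (many Chinese characters, for example, also exist in Japanese charsets), so the result of such a scan depends on the order of the candidates.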

Once a charset is chosen, LISTSERV Maestro encodes each character as a bit sequence according to that charset. The email that is sent is then augmented by the information of which charset was used to encode it. This information is then used by the receiving mail client to decode the bit sequence into characters that can be displayed to the recipient.

For example, with the ASCII charset (where each 7-bit sequence denotes one character), the sequence “1000001” would mean the character with the decimal value 65, which is the Latin ‘A’. With the ISO-8859-1 charset (where each 8-bit sequence denotes one character), the sequence “11000100” would mean the character with the decimal value 196, which is the umlaut ‘Ä’. However, with the ISO-8859-7 charset (also 8-bit), the same value 196 would mean the Greek letter ‘Δ’ instead. Consequently, knowing the charset that was used for encoding is essential for making the message readable to the recipient. LISTSERV Maestro takes care to include this information in the email, so that it is not lost during the transfer.
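In a standard Internet mail message, the charset information travels in the “Content-Type” header. The following sketch uses Python's standard email package to show what such a label looks like and how a receiver can use it; it illustrates the general mechanism only and makes no claim about how LISTSERV Maestro builds its messages. The helper name build_message is hypothetical.

```python
from email.message import EmailMessage


def build_message(text, charset):
    """Create a mail message whose body is encoded with the given charset."""
    msg = EmailMessage()
    msg["Subject"] = "Charset example"
    # set_content() encodes the text and records the charset in Content-Type.
    msg.set_content(text, charset=charset)
    return msg


msg = build_message("Ärger", "iso-8859-1")
print(msg["Content-Type"])          # the header that carries the charset name
print(msg.get_content_charset())    # the charset a receiving client would use
print(msg.get_content().strip())    # body decoded back with that charset
```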

Merging Fields with International Character Sets

The issue of international character sets in combination with merging fields needs to be considered very carefully to make sure that the results of the merging appear to the recipient as intended. The main problem when merging fields containing text using international charsets is to decide which charset to use. Potentially, the characters in the body of the message require a certain charset, while some of the merge values may require a different charset. For example, a message may have English text as the body of the message but a recipient list with recipients from all over the world, with names that contain letters from various languages. It is likely that these international names would be encoded using a different charset than the text of the message. It is important to consider what happens when merging these names into the English body text.

The effect that the chosen charset has on the merge values depends on the kind of recipients definition selected for a particular job. If recipients are uploaded as a text file, based on the reaction of a previous job, selected from a database by the Maestro User Interface, or come from a target group based on a hosted recipient list, then all recipients and their merge values are already known to the Maestro User Interface before the job is submitted to LISTSERV for delivery. LISTSERV Maestro can therefore encode each merge value with the same charset that is used for the email text. Consequently, if the values are later merged into the text, their charset will match that of the text. However, if a merge value contains a character that cannot be displayed in the charset chosen for the text, then this character will be replaced with a question mark “?” during the encoding, and this question mark will appear in the mail that reaches the recipient to which the merge value belongs.

In the example described above, where the message body was in plain English and the recipient list was composed of recipient names from all over the world, a problem could occur because LISTSERV Maestro chooses the charset based on the message text, not on the recipient values. If the mail text itself is plain English, then LISTSERV Maestro will determine ASCII as the correct encoding for the message and the recipient data. If the names of the international recipients are then encoded as ASCII, all non-ASCII international characters will be replaced with question marks. To avoid this problem, use the same charset for the message body as was used for the merge data. If the recipients’ information was uploaded as a text file, then simply use the same charset for sending as was used during the initial upload. If the recipients’ information was selected from a database, then use the same charset as was used by the database (ask the database administrator for this information if it is unclear).

In summary, for recipients that are uploaded as a text file or selected from a database by the Maestro User Interface, merge value characters that have no representation in the charset that was chosen for the mail text will be displayed as “?”. To avoid this problem, make sure the message body is encoded with the same charset as the recipient list.
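The question mark replacement can be reproduced with Python's built-in string encoding, which offers a comparable “replace” mode. This is a sketch of the effect only; the function name encode_merge_value is an assumption and does not refer to anything in LISTSERV Maestro.

```python
def encode_merge_value(value, mail_charset):
    """Encode a merge value with the mail's charset, using '?' for gaps."""
    # errors="replace" substitutes '?' for every character that the
    # target charset cannot represent.
    return value.encode(mail_charset, errors="replace")


name = "Jürgen Müller"
print(encode_merge_value(name, "ascii"))        # umlauts are lost
print(encode_merge_value(name, "iso-8859-1"))   # umlauts survive
```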

If recipients are defined by sending to an existing LISTSERV list, a hosted LISTSERV list, or by letting LISTSERV select from a database, then the Maestro User Interface will not see the actual recipients or their merge values, and cannot do any special charset encoding on them. Instead, LISTSERV will simply merge the bytes from the recipients source (from the LISTSERV list or from the database LISTSERV connects to) into the mail text. Consequently, make sure that the merge values in the original recipients source (LISTSERV list or LISTSERV DBMS) already have the correct charset for the mail they are merged into.

For example, in emails sent with ISO-8859-1 (West-European), all appearances of the byte 196 in the merge values will be interpreted as the umlaut ‘Ä’ (even if the merge value is actually a Greek word where the byte 196 should have been interpreted as a ‘Δ’).

While mixing characters from different ISO-8859 charsets will simply display the wrong character to the recipient, mixing ASCII and ISO-8859 or ISO-8859 and Unicode may even result in characters that cannot be displayed at all. Most importantly, if the mail text uses the Unicode encoding UTF-8, then it is necessary to make sure that the merge value texts in the recipients source are also UTF-8 encoded (the byte sequence that stands for each merge value must be a valid UTF-8 encoded sequence representing a string of characters from the Unicode charset).

Then again, it is usually not possible to define a charset for the mail and then in some way make sure that the merge values in the list or in the LISTSERV database match this charset, since those merge values have usually been stored long before the mail was created. Therefore, the best way to proceed is to check which encoding was used when the data was stored in the list or LISTSERV database (again, you might need to ask your administrator for that information) and then use the same charset for the mail.

In summary, for the recipient types of an existing LISTSERV list or LISTSERV selecting from a database, the merge value characters that have no representation in the charset that was chosen for the mail text will be displayed as a different character. The character displayed will be from the actual charset that has the same byte value (like ‘Ä’ from ISO-8859-1 and ‘Δ’ from ISO-8859-7). If there is no corresponding byte value in the charset, they may not be displayed at all.
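The following Python sketch simulates this situation: bytes stored in one charset are merged unchanged into a mail that is labeled with another charset. The function name merge_raw is hypothetical, and the use of a replacement character for undecodable bytes is an assumption about how a mail client might behave.

```python
def merge_raw(value, source_charset, mail_charset):
    """Simulate merging stored bytes unchanged and decoding them as the mail's charset."""
    raw = value.encode(source_charset)   # bytes as stored in the list or database
    # The receiving client decodes with the mail's charset; bytes that are
    # invalid there show up as the replacement character U+FFFD.
    return raw.decode(mail_charset, errors="replace")


print(merge_raw("Δ", "iso-8859-7", "iso-8859-1"))      # wrong but displayable
print(merge_raw("Δάφνη", "iso-8859-7", "utf-8"))       # not valid UTF-8
print(merge_raw("Δάφνη", "utf-8", "utf-8"))            # matching charsets
```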

International Character Set Recipient Names in the Mail-TO-Header

The previous section outlined the problems of mixing a mail text in one language with merge values from a different language. As an example, an English text mail was described, with an international recipient list where the recipient names contain characters from many languages, with the languages possibly differing between recipients from different countries. The recipient’s name as a merge value is probably one of the most common uses for merging fields – to be able to merge the recipient’s name into the text of the message, to personalize the mail. If this is done, the problems described earlier need to be considered.

However, the recipient’s name is also often used in the “To:” header field of the mail, so that the mail appears to the recipient with the recipient’s own name visible in the “To:” field (which is usually displayed by the email client in some fashion), personalizing the email one step further.

When recipients are uploaded as text files, selected from the database by the Maestro User Interface, or come from a target group based on a hosted recipient list, the use of the name in the “To:”-header field does not fall under the constraints regarding charsets and text-merging. The name in the “To:”-header field will always be encoded with the charset that is optimal for exactly this name. Users may safely write an email message in English and send it to international recipients. Each recipient will see his or her name with the correct characters in the “To:”-header. This means that a German recipient will correctly see umlauts, a Russian will see Cyrillic, and a Greek will see Greek letters (under the condition that the original recipient list was in Unicode format and contained the names of the recipients with their respective international characters).

Just remember that with such a mixed-language list of recipient merge values, you should not also merge the name into the text body itself, unless the text is encoded as Unicode (UTF-8) as well, because of the problems described earlier.

When recipients are defined by sending to an existing LISTSERV list, a hosted LISTSERV list, or by letting LISTSERV select from a database, the bytes from the name-merge value will again be merged into the “To:”-header directly by LISTSERV, without the Maestro User Interface having a chance to encode them. Because it is very improbable that the names (the byte sequences representing them) already contain the special MIME-header encoding necessary for non-ASCII “To:”-header fields, you will have to make sure that only ASCII characters are allowed in recipient names when creating the list or database data for these recipient types.
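For reference, the special MIME-header encoding mentioned above wraps a non-ASCII name into an ASCII-only “encoded word” that names its charset. The sketch below uses Python's standard email.header module to show the general shape of such a value; the helper names are made up for this example, and the sketch is not a description of LISTSERV or LISTSERV Maestro internals.

```python
from email.header import Header, decode_header


def encode_display_name(name, charset):
    """Return the name as an ASCII-only MIME encoded word."""
    return Header(name, charset).encode()


def decode_display_name(encoded):
    """Turn a MIME encoded header value back into readable text."""
    parts = []
    for raw, charset in decode_header(encoded):
        if isinstance(raw, bytes):
            parts.append(raw.decode(charset or "ascii"))
        else:
            parts.append(raw)
    return "".join(parts)


encoded = encode_display_name("Jürgen Müller", "iso-8859-1")
print(encoded)                        # ASCII-only form that is safe in a header
print(decode_display_name(encoded))   # what a mail client would display
```

A name stored as raw non-ASCII bytes in a list or database lacks this wrapper, which is why it cannot simply be copied into the header.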

LISTSERV Maestro and Bi-Directional Character Sets

Of the ISO-8859 charset family, there are two charsets that contain letters from languages that have a standard reading direction of right-to-left. These are the charsets ISO-8859-6 (Arabic) and ISO-8859-8 (Hebrew), both of which are supported by LISTSERV Maestro.

Actually, LISTSERV Maestro will not use the charsets with the names ISO-8859-6 and ISO-8859-8, but will instead use the special bi-directional versions ISO-8859-6-i and ISO-8859-8-i. These charsets contain the same characters as their non-i-suffix counterparts, but the “-i” suffix tells the receiving mail client that the text should be displayed with right-to-left reading direction. Without the “-i” suffix in the charset name, many email clients would probably display the correct characters, but in the (for that language) incorrect left-to-right reading direction.

Even with the “-i” suffix, the recipient might require a special mail client version (or even a special mail client) that is prepared to display text with right-to-left reading direction properly and is also able to properly display bi-directional text (text that mixes characters with left-to-right and characters with right-to-left reading direction, such as a Hebrew text that contains English names). Some clients may display the characters in the correct direction, but still left-align each line of text instead of right-aligning it. Behavior such as this depends on the mail client itself and is outside the scope of LISTSERV Maestro.