Character data, Unicode and Collations
While modeled on the ASCII character set, the Unicode Standard goes far beyond ASCIIs limited ability to encode only the upper- and lowercase letters A through Z. It provides the capacity to encode all characters used for the written languages of the world more than 1 million characters can be encoded. The Unicode character encoding treats alphabetic characters, ideographic characters, and symbols equivalently, which means they can be used in any mixture and with equal facility.
In addition to the Unicode standard, there is a technical standard called Unicode Collation Algorithm (UCA). UCA provides a specification for how to compare two Unicode strings while remaining conformant to the requirements of The Unicode Standard. Comparisons like this are done by using so called Collations.
This article presents the Mimer SQL perspective of character data, Unicode and collations. The functionality described herein is fully available in Mimer SQL from version 9.2.
(The Unicode Standard is fully compatible with the international standard ISO/IEC 10646, Information Technology Universal Multiple-Octet Coded Character Set (UCS). The Unicode Collation Algorithm (UCA) standard is kept synchronized with the ISO/IEC 14651 standard for International String Ordering).
Unicode Data (NCHAR)
Mimer SQL implements Unicode 4.0 using the data type NCHAR. You may use any of the three encoding forms UTF-8, UTF-16, or UTF-32 when you store NCHAR data in the database. (UTF stands for Unicode Transformation Format, which is standard character encoding schemes in accordance with ISO 10646). The encoding forms are fully transparent; for example, you may use UTF-16 to store data, and you can use UTF-8 for fetching data. The NCHAR data type is logically UTF-32, however, it is stored in a compressed form.
The CHAR data type is based on ISO 8859-1 (Latin-1), which is a true subset of Unicode, and therefore CHAR are NCHAR are fully compatible. Of course, storing NCHAR data in a CHAR column have certain restrictions.
Unicode string literal
A unicode-string-literal facilitates the specification of Unicode characters in an ASCII environment. It consists of a sequence of Unicode characters enclosed in quotationmarks and preceded by the letter U and an ampersand (&). Two consecutive single quotation marks within a string are interpreted as a single quotation mark. Unicode characters are given by four hexadecimal digits preceeded by a backslash character or, by six hexadecimal digits preceeded with a backslash character and a plus character. Two consecutive backslash characters within a string are interpreted as a single backslash character. The case of the preceding U is irrelevant.
The Unicode string literal will be part of the new SQL:2003 standard.
U&'\0141ód\017A' -- 'Łódź' (Polish town)
A Unicode character can have several equivalent representations. There are precomposed characters and there are combining characters that can be used together with base characters to form a specific character.
Consider the letter E with circumflex and dot below, a character that occurs in Vietnamese, Ệ. This character has five possible representations in Unicode. (About the notation below; In running text, individual Unicode code points usually are expressed as U+n, where n is from four to six hexadecimal digits):
1) U+0045 LATIN CAPITAL LETTER E
U+0302 COMBINING CIRCUMFLEX ACCENT
U+0323 COMBINING DOT BELOW
2) U+0045 LATIN CAPITAL LETTER E
U+0323 COMBINING DOT BELOW
U+0302 COMBINING CIRCUMFLEX ACCENT
3) U+00CA LATIN CAPITAL LETTER E WITH CIRCUMFLEX
U+0323 COMBINING DOT BELOW
4) U+1EB8 LATIN CAPITAL LETTER E WITH DOT BELOW
U+0302 COMBINING CIRCUMFLEX ACCENT
5) U+1EC6 LATIN CAPITAL LETTER E WITH CIRCUMFLEX AND DOT BELOW
Any two of these sequences should compare equal. The Normalization Form C (NFC) of all five sequences is U+1EC6.
In Mimer SQL, Unicode data (NCHAR) is automatically transformed to NFC. When needed, literals and variables are implicitly normalized. The result of a concatenation will always be normalized, and string functions, like UPPER and LOWER, will always return a normalized result string. This will assert that all Unicode data will be in NFC, thus giving the expected result in search operations.
SQL>create table t(c nchar(1));
SQL>insert into t values(u&'E\0302\0323');
SQL>insert into t values(u&'E\0323\0302');
SQL>insert into t values(u&'\00CA\0323');
SQL>insert into t values(u&'\1EB8\0302');
SQL>insert into t values(u&'\1EC6');
SQL>select count(c) as equal from t where c = u&'\1EC6';
The normalization forms are fully described in the Unicode standard annex #15. Mimer SQL have passed the Unicode normalization conformance test.
Collations and Linguistic Sorting
Character and national character data is sorted according to specific collations.
A collation, also known as a collating sequence, is a database object containing a set of rules that determines how character strings are compared, searched and alphabetically sorted. The rules in the collation determine whether one character string is less than, equal to or greater than another. A collation also determines how case-sensitivity and accents are handled, using different comparison levels.
You can specify a different collation for ordering characters when you create or alter a table or create a domain.
If you have specified a collation for a column, the collation is used implicitly in SQL statements.
You only need to explicitly use a collation clause in SQL statements if you want to override the default collation or the collation you specified when creating or altering the table or creating the domain. The following example will explicitely give a swedish sort order (where swedish_3 is a predefined collation in Mimer SQL):
ORDER BY name COLLATE swedish_3
Since Unicode is a universal character set the Unicode sorting order can be employed on any arbitrary character set that is a subset of Unicode.
The default Unicode sorting order is provided in allkeys.txt (Unicode 3.1.1 mapping). This table (the Default Unicode Collation Element Table) provides a mapping from characters to collation elements, i.e. sorting sequences, for all the explicitly weighted characters.
Mimer SQL has passed the UCA conformance test. The UCA footprint in Mimer SQL is 61K bytes.
As mentioned above there are different levels of comparisons to pay attention to, such as case and accent sensitivity. There are four levels of comparison keys; the Primary level, the Secondary level, the Tertiary level and the Quaternary level:
- Primary level
Typically, this is used to denote differences between base characters (for example, "a" < "b"). It is the strongest difference. For example, dictionaries are divided into different sections by base character. This is also called the level-1 strength.
- Secondary level
Accents in the characters are usually considered secondary differences (for example, "ab" < "áb" < "ac"). A secondary difference is ignored when there is a primary difference anywhere in the strings. This is also called the level-2 strength.
Note: In some languages (such as Icelandic), certain accented letters are considered to be separate base characters.
- Tertiary level
Upper and lower case differences in characters are distinguished at the tertiary level (for example, "ab" < "Ab" < "áb"). In addition, a variant of a letter differs from the base form on the tertiary level (such as "a" and "ª"). A tertiary difference is ignored when there is a primary or secondary difference anywhere in the strings. This is also called level-3 strength.
- Quaternary level
When punctuation is ignored (such as space and hyphen) at level 1-3, an additional level can be used to distinguish words with and without punctuation (for example, "ab" < "a c" < "a-c" < "ac"). A quaternary difference is ignored when there is a primary, secondary, or tertiary difference. This is also called the level-4 strength.
Multilevel comparison means the following; Two strings are compared on the primary level. If the comparison for this level fails to establish a unique and determined sequence for the strings, the second level are taken into consideration. If this likewise fails to produce a unique sequence, the tertiary level is invoked, and after this the quaternary level is used. If still a unique sequence cannot be established, the two strings are regarded as equivalent.
How far to go in this comparison chain is decided by the definition of the collation used on the data. See "Tailorings" below.
Concerning sorting of punctuation characters (such as SPACE, HYPHEN, COMMA, etc.) alternate weighting can be used.
The collating keys for these characters are known as alternate collation elements. These elements can be treated differently depending on the weighting method used:
The Non-ignorable method means that alternate collation elements are treated as normal collation elements. This is the default option.
The Shifted method sets the quaternary level (4:th) equal to the value of the primary level for alternate collation elements, and then the primary, secondary, and tertiary levels are set to zero. The collation elements for all other characters with a non-zero primary level weight, will receive a fourth-level weight of 0xFFFF.
Note: If the primary weight is zero, the fourth-level weight is also zero.
The Shift-trimmed method is equal to the Shifted method concerning the alternate collation elements, but all other collation elements are set to zero. This will emulate POSIX behavior.
The following gives an example of the alternate weighting differences.
|de Luge||de luge||deluge|
Contracting Character Sequences
Many languages have digraphs, i.e. two characters that should be considered as one, which actually counts as separate letters. In traditional Spanish, "ch" sorts between "c" and "d", and "ll" sorts between "l" and "m". Two characters are mapped into a single collation element that cause the combination to be ordered differently from either character individually.
Another example of contractions are "lj" and "nj" in Bosnian and Croatian, which sorts after "l" and "n" respectively.
|Order without contraction||Order with contraction "nj" sorting after "n"|
A single character can map to a sequence of collation elements. For instance, "ß" is equivalent to "ss". In traditional German "ä", "ö", and "ü" sort as though they were "ae", "oe", and "ue" respectively.
Mimer SQL includes predefined collations for many languages. Usually there are two predefined collations per language, one for the primary comparison level and one for the tertiary comparison level. These are named <language>_<level>, e.g. SWEDISH_1.
If a new collation should be created, this is done by customizing an existing collation using a tailoring. A tailoring is a set of rules and attributes that forms a so called collation delta string, which is used as the basis when creating a collation. The tailoring describes how to modify an existing collation definition to get the new one. A collation is created by the CREATE COLLATION statement.
When creating a collation, the tailoring string can include attribute settings for comparison level, accent order, which case that should be first in order and alternate weighting.
Attributes are optional, the boldedvalues in the table below are the default values.
|Sort level for the collation.|
|Accent order||[AccentOrder Forward]|
|Secondary level ordering direction.|
|Case first||[CaseFirst Lower]|
|Tertiary level case ordering.|
|Alternatives for punctuation characters.|
The rules in a tailoring string defines how to change the underlying collation. Each rule contains a string of ordered characters that starts with a reset value.
|&||&Z||Reset at this letter. Rules will be relative to this letter from here on.|
|<||a < b||Identifies a primary difference between "a" and "b".|
|<<||e << ê||Identifies a secondary difference between "e" and "ê".|
|<<<||s <<< S||Identifies a tertiary difference between "s" and "S".|
|=||i = y||Signifies no difference between "i" and "y".|
|"||","||The quoted character "," (comma).|
|#||#0141#||Hexadecimal representation of "Ł" (L with stroke).|
Note: ";" can be used to represent secondary relations and "," to represent tertiary relations, instead of "<<" and "<<<" respectively.
A Danish tailoring example:
& TH << þ <<< Þ
& Y << ü <<< Ü
& Z < æ <<< Æ << ä <<< Ä < ø <<< Ø << ö <<< Ö < å <<< Å << aa <<< Aa <<< AA
A Create Collation example:
In the following example the my_collation is created. The new collation is based upon the existing SWEDISH_3 collation and tailored so that "aa" is sorted as a variant of "å" and that upper case letters are sorted before lower case letters:
CREATE COLLATION my_collation
USING '[casefirst upper] &Å << aa <<< Aa <<< AA'
A sorting example:
Here is an example of how to sort two fields properly; in this case 'last name', 'first name'. The example data is from the excellent book Unicode Demystified by Richard Gillam.
SQL>create table name(last varchar(32),first varchar(32));
SQL>insert into name values('van Dusen','Paul');
SQL>insert into name values('van Dusen','Tom');
SQL>insert into name values('vanDusen','Paul');
SQL>insert into name values('vanDusen','Tom');
SQL>insert into name values('Van Dusen','Paul');
SQL>insert into name values('Van Dusen','Tom');
SQL>insert into name values('Van','Steve');
SQL>insert into name values('Van','Bill');
SQL>create collation names from EOR
SQL&using '[level 4][alternate shifted]&9<","'; -- "," before "A"
SQL>select last || ', ' || first as fullname
SQL&from name order by fullname collate names;
van Dusen, Paul
Van Dusen, Paul
van Dusen, Tom
Van Dusen, Tom
Backward Accent Ordering
Some languages, particularly French, require words to be ordered on the secondary level by comparing backwards from right to left. This behaviour is also called "French accent sorting" or "French secondary order". See the following standard example:
|English ordering||French ordering|
The Thai and Lao scripts contain left-joining vowels, that precede the consonants. To get the right sort order, the collation element for the vowel is exchanged with the collation element for the consonant that follows the vowel.
EOR European Ordering Rules
The Unicode Default Order (and ISO/IEC 14651) have defined the default Latin alphabet to contain not only the base letters "A" through "Z", but also a number of more or less language specific base letters. One example, the Polish "Ł" (L with stroke) is not a variant of "L", it is a separate base letter between "L" and "M".
The EOR, European Ordering Rules, ENV 13710 (and ISO 12199 Alphabetical ordering of multilingual terminilogical and lexicogrphical data represented in the Latin alphabet) have taken a more natural approach: The alphabet is "A" through "Z", and the other language specific letters are secondary variants of the corresponding base letter.
In Mimer SQL, we have used the EOR tailoring as the basis for all specific language tailorings, see the Mimer SQL Unicode Collation Tailorings for details.
Using the Unicode Default Collation ordering, Mimer SQL supports sorting for in principle all languages by default. Still, some languages need a specific tailoring to achieve a correct sort order. Mimer SQL contains a large number of such predefined tailorings for specific languages, see the Mimer SQL Collation page for further details on languages and collations.
The Mimer SQL Unicode knowledge base, stored in a number of fast index structures with a total space requirement of only 33K bytes, can be seen in the unidata.txt file.
The Unicode Home Page
The Mimer SQL Multilingual page
The Mimer SQL Collation charts page
Last updated: 2005-11-18