Oracle9i Database Globalization Support Guide Release 2 (9.2) Part Number A96529-01 |
|
This chapter explains how characters are sorted in an Oracle environment. It contains the following topics:
Different languages have different sort orders. In addition, different cultures or countries that use the same alphabets may sort words differently. For example, in Danish, Æ is after Z
, while Y
and Ü
are considered to be variants of the same letter.
Sort order can be case-sensitive or case-insensitive. Case refers to the condition of being uppercase or lowercase. For example, in a Latin alphabet, A
is the uppercase glyph for a
, the lowercase glyph.
Sort order can ignore or consider diacritics. A diacritic is a mark near or through a character or combination of characters that indicates a different sound than the sound of the character without the diacritic. For example, the cedilla (,
) in façade
is a diacritic. It changes the sound of c
.
Sort order can be phonetic or it can be based on the appearance of the character. For example, sort order can be based on the number of strokes in East Asian ideographs. Another common sorting issue is combining letters into a single character. For example, in traditional Spanish, ch
is a distinct character that comes after c
, which means that the correct order is: cerveza, colorado, cheremoya. This means that the letter c
cannot be sorted until Oracle has checked whether the next letter is an h
.
Oracle provides the following types of sorts:
It can achieve a linguistically correct sort for a single language as well as a sort based on the multilingual ISO standard (ISO-14651), which is designed to handle many languages at the same time.
One way to sort character data is based on the numeric values of the characters defined by the character encoding scheme. This is called a binary sort. Binary sorts are the fastest type of sort. They produce reasonable results for the English alphabet because the ASCII and EBCDIC standards define the letters A to Z in ascending numeric value.
Note: In the ASCII standard, all uppercase letters appear before any lowercase letters. In the EBCDIC standard, the opposite is true: all lowercase letters appear before any uppercase letters. |
When characters used in other languages are present, a binary sort usually does not produce reasonable results. For example, an ascending ORDER
BY
query returns the character strings ABC
, ABZ
, BCD
, ÄBC
, when Ä
has a higher numeric value than B
in the character encoding scheme. A binary sort is not usually linguistically meaningful for Asian languages that use ideographic characters.
To produce a sort sequence that matches the alphabetic sequence of characters, another sort technique must be used that sorts characters independently of their numeric values in the character encoding scheme. This technique is called a linguistic sort. A linguistic sort operates by replacing characters with numeric values that reflect each character's proper linguistic order.
Oracle offers two kinds of linguistic sorts: monolingual and multilingual.
This section includes the following topics:
Oracle compares character strings in two steps for monolingual sorts. The first step compares the major value of the entire string from a table of major values. Usually, letters with the same appearance have the same major value. The second step compares the minor value from a table of minor values. The major and minor values are defined by Oracle. Oracle defines letters with diacritic and case differences as having the same major value but different minor values.
Each major table entry contains the Unicode code point and major value for a character. The Unicode code point is a 16-bit binary value that represents a character.
Table 4-1 illustrates sample values for sorting a
, A
, ä
, Ä
, and b
.
Glyph | Major Value | Minor Value |
---|---|---|
a |
15 |
5 |
A |
15 |
10 |
ä |
15 |
15 |
Ä |
15 |
20 |
b |
20 |
5 |
Oracle9i provides multilingual linguistic sorts so that you can sort data in more than one language in one sort. This is useful for regions or languages that have complex sorting rules and for multilingual databases. Oracle9i supports all of the sort orders defined by previous releases.
For Asian language data or multilingual data, Oracle provides a sorting mechanism based on the ISO 14651 standard and the Unicode 3.1 standard. Chinese characters are ordered by the number of strokes, PinYin, or radicals.
In addition, multilingual sorts can handle canonical equivalence and supplementary characters. Canonical equivalence is a basic equivalence between characters or sequences of characters. For example, ç
is equivalent to the combination of c
and ,
. Supplementary characters are user-defined characters or predefined characters in Unicode 3.1 that require two code points within a specific code range. You can define up to 1.1 million code points in one multilingual sort.
For example, Oracle9i supports a monolingual French sort (FRENCH
), but you can specify a multilingual French sort (FRENCH_M
). _M
represents the ISO 14651 standard for multilingual sorting. The sorting order is based on the GENERIC_M
sorting order and can sort diacritical marks from right to left. Oracle Corporation recommends using a multilingual linguistic sort if the tables contain multilingual data. If the tables contain only French, then a monolingual French sort may have better performance because it uses less memory. It uses less memory because fewer characters are defined in a monolingual French sort than in a multilingual French sort. There is a tradeoff between the scope and the performance of a sort.
Oracle evaluates multilingual sorts at three levels of precision:
A primary level sort distinguishes between base characters, such as the difference between characters a
and b
. It is up to individual locales to define if a
is before b
, b
is before a
, or they are equal. The binary representation of the characters is completely irrelevant. If a character is an ignorable character, then it is assigned a primary level order (or weight) of zero, which means it is ignored at the primary level. Characters that are ignorable on other levels are given an order of zero at those levels.
For example, at the primary level, all variations of bat
come before all variations of bet
. The variations of bat
can appear in any order, and the variations of bet
can appear in any order:
Bat bat BAT BET Bet bet
A secondary level sort distinguishes between base characters (the primary level sort) before distinguishing between diacritics on a given base character. For example, the character Ä
differs from the character A
only because it has a diacritic. Thus, Ä
and A
are the same on the primary level because they have the same base character (A
) but differ on the secondary level.
The following list has been sorted on the primary level (resume
comes before resumes
) and on the secondary level (strings without diacritics come before strings with diacritics):
resume résumé Résumé Resumes resumes résumés
A tertiary level sort distinguishes between base characters (primary level sort), diacritics (secondary level sort), and case (upper case and lower case). It can also include special characters such as +
, -
, and *
.
The following are examples of tertiary level sorts:
a
and A
are equal on the primary and secondary levels but different on the tertiary level because they have different cases.ä
and A
are equal on the primary level and different on the secondary and tertiary levels.-
is 0. That is, it is ignored on the primary and secondary levels. If a dash is compared with another character whose primary level order is nonzero, for example, u
, then no result for the primary level is available because u
is not compared with anything. In this case, Oracle finds a difference between -
and u
only at the tertiary level.The following list has been sorted on the primary level (resume
comes before resumes
) and on the secondary level (strings without diacritics come before strings with diacritics) and on the tertiary level (lower case comes before upper case):
resume Resume résumé Résumé resumes résumés Resumes Résumés
The examples in this section demonstrate a binary sort, a monolingual sort, and a multilingual sort. To prepare for the examples, create and populate a table called test
. Enter the following statements:
SQL> CREATE TABLE test (name VARCHAR2(20)); SQL> INSERT INTO test VALUES('Diet'); SQL> INSERT INTO test VALUES('À voir'); SQL> INSERT INTO test VALUES('Freizeit');
The ORDER BY
clause uses a binary sort.
SQL> SELECT * FROM test ORDER BY name;
You should see the following output:
Diet Freizeit À voir
Note that a binary sort results in À voir
being at the end of the list.
Use the NLSSORT
function with the NLS_SORT
parameter set to german
to obtain a German sort.
SQL> SELECT * FROM test ORDER BY NLSSORT(name, 'NLS_SORT=german');
You should see the following output:
À voir Diet Freizeit
Note that À voir
is at the beginning of the list in a German sort.
Insert the character string shown in Figure 4-1 into test
. It is a D
with a crossbar followed by ñ
.
Perform a monolingual German sort by using the NLSSORT
function with the NLS_SORT
parameter set to german
.
SQL> SELECT * FROM test ORDER BY NLSSORT(name, 'NLS_SORT=german');
The output from the German sort shows the new character string last in the list of entries because the characters are not recognized in a German sort.
Perform a multilingual sort by entering the following statement:
SQL> SELECT * FROM test ORDER BY NLSSORT(name, 'NLS_SORT=generic_m');
The output shows the new character string after Diet
, following ISO sorting rules.
See Also:
|
This section contains information about different features that a linguistic sort may have:
You can customize linguistic sorts to include the desired characteristics.
Base letters are defined in a base letter table, which maps each letter to its base letter. For example, a
, A
, ä
, and Ä
all map to a
, which is the base letter. This concept is particularly relevant for working with Oracle Text.
Some characters can be ignored in a linguistic sort. These characters are called ignorable characters. There are two kinds of ignorable characters: diacritics and punctuation.
Examples of ignorable diacritics are:
And example of an ignorable punctuation character is the dash character -
. If it is ignored, then multi-lingual
can be treated that same as multilingual
and e-mail
can be treated the same as email.
Sorting elements usually consist of a single character, but in some locales, two or more characters in a character string must be considered as a single sorting element during sorting. For example, in traditional Spanish, the string ch
is composed of two characters. These characters are called contracting characters in multilingual linguistic sorting and special combination letters in monolingual linguistic sorting.
Do not confuse a composed character with a contracting character. A composed character like á
can be decomposed into a
and '
, each with their own encoding. The difference between a composed character and a contracting character is that a composed character can be displayed as a single character on a terminal, while a contracting character is used only for sorting, and its component characters must be rendered separately.
In some locales, certain characters must be sorted as if they were character strings. An example is the German character ß
(sharp s). It is sorted exactly the same as the string SS
. Another example is that ö
sorts as if it were oe
, after od
and before of
. These characters are known as expanding characters in multilingual linguistic sorting and special letters in monolingual linguistic sorting. Just as with contracting characters, the replacement string for an expanding character is meaningful only for sorting.
In Japanese, a prolonged sound mark that resembles an em dash --
represents a length mark that lengthens the vowel of the preceding character. The sort order depends on the vowel that precedes the length mark. This is called context-sensitive sorting. For example, after the character ka
, the --
length mark indicates a long a
and is treated the same as a
, while after the character ki
, the --
length mark indicates a long i
and is treated the same as i
. Transliterating this to Latin characters, a sort might look like this:
kaa ka-- -- kaa and ka-- are the same kai -- kai follows ka- because i is after a kia -- kia follows kai because i is after a kii -- kii follows kia because i is after a ki-- -- kii and ki-- are the same
One Unicode code point may be equivalent to a sequence of base character code points plus diacritic code points, regardless of the locale. This is called the Unicode canonical equivalence. For example, ä
equals its base letter a
and an umlaut. A linguistic flag, CANONICAL_EQUIVALENCE=TRUE
, indicates that all canonical equivalence rules defined in Unicode 3.1 need to be applied. You can change this flag to FALSE
to speed up the comparison and ordering functions if all the data is in its composed form.
See Also:
"Creating a New Linguistic Sort with the Oracle Locale Builder" for more information about setting the canonical equivalence flag |
In French, sorting strings of characters with diacritics first compares base characters from left to right, but compares characters with diacritics from right to left. For example, by default, a character with a diacritic is placed after its unmarked variant. Thus Èdit
comes before Edít
in a French sort. They are equal on the primary level, and the secondary order is determined by examining characters with diacritics from right to left. Individual locales can request that the characters with diacritics be sorted with the right-to-left rule. Set the REVERSE_SECONDARY
linguistic flag to TRUE
to enable reverse secondary sorting.
See Also:
"Creating a New Linguistic Sort with the Oracle Locale Builder" for more information about setting the reverse secondary flag |
In Thai and Lao, some characters must first change places with the following character before sorting. Normally, these types of character are symbols representing vowel sounds, and the next character is a consonant. Consonants and vowels must change places before sorting. Set the SWAP_WITH_NEXT
linguistic flag for all characters that must change places before sorting.
See Also:
"Creating a New Linguistic Sort with the Oracle Locale Builder" for more information about setting the |
Special letters is a term used in monolingual sorts. They are called expanding characters in multilingual sorts.
Special combination letters is the term used in monolingual sorts. They are called contracting letters in multilingual sorts.
One lowercase letter may map to multiple uppercase letters. For example, in traditional German, the uppercase letters for ß
are SS
.
These case conversions are handled by the NLS_UPPER
, NLS_LOWER
, and NLS_INITCAP
SQL functions, according to the conventions established by the linguistic sort sequence. The UPPER
, LOWER
, and INITCAP
SQL functions cannot handle these special characters.
The NLS_UPPER
SQL function returns all uppercase characters from the same character set as the lowercase string. The following example shows the result of the NLS_UPPER
function when NLS_SORT
is set to XGERMAN
:
SELECT NLS_UPPER ('große') "Uppercase" FROM DUAL; Upper ----- GROSSE
Oracle supports special lowercase letters. One uppercase letter may map to multiple lowercase letters. An example is the Turkish uppercase I
becoming a small, dotless i: .
Linguistic sorting is language-specific and requires more data processing than binary sorting. Using a binary sort for ASCII is accurate and fast because the binary codes for ASCII characters reflect their linguistic order. When data in multiple languages is stored in the database, you may want applications to sort the data returned from a SELECT...ORDER BY
statement according to different sort sequences depending on the language. You can accomplish this without sacrificing performance by using linguistic indexes. Although a linguistic index for a column slows down inserts and updates, it greatly improves the performance of linguistic sorting with the ORDER
BY
clause.
You can create a function-based index that uses languages other than English. The index does not change the linguistic sort order determined by NLS_SORT
. The index simply improves the performance. The following statement creates an index based on a German sort:
CREATE TABLE my_table(name VARCHAR(20) NOT NULL) /*NOT NULL ensures that the index will be used */ CREATE INDEX nls_index ON my_table (NLSSORT(name, 'NLS_SORT = German'));
After the index has been created, enter a SELECT
statement similar to the following:
SELECT * FROM my_table ORDER BY name;
It returns the result much faster than the same SELECT
statement without an index.
The rest of this section contains the following topics:
See Also:
|
There are three ways to build linguistic indexes for data in multiple languages:
CREATE INDEX french_index ON employees (NLSSORT(employee_id, 'NLS_ SORT=FRENCH')); CREATE INDEX german_index ON employees (NLSSORT(employee_id, 'NLS_ SORT=GERMAN'));
Oracle chooses the index based on the NLS_SORT
session parameter or the arguments of the NLSSORT
function specified in the ORDER
BY
clause. For example, if the NLS_SORT
session parameter is set to FRENCH
, Oracle uses french_index
. When it is set to GERMAN
, Oracle uses german_index
.
LANG_COL
in "Example: Setting Up a French Linguistic Index") to be used as a parameter of the NLSSORT
function. The language column contains NLS_LANGUAGE
values for the data in the column on which the index is built. The following example builds a single linguistic index for multiple languages. With this index, the rows with the same values for NLS_LANGUAGE
are sorted together.
CREATE INDEX i ON t (NLSSORT(col, 'NLS_SORT=' || LANG_COL));
Queries choose an index based on the argument of the NLSSORT
function specified in the ORDER BY
clause.
GENERIC_M
or FRENCH_M
. These indexes sort characters according to the rules defined in ISO 14651. For example:
CREATE INDEX i on t (NLSSORT(col, 'NLS_SORT=GENERIC_M');
See Also:
"Multilingual Linguistic Sorts" for more information about Unicode sorts |
The following are requirements for using linguistic indexes:
This section also includes:
The QUERY_REWRITE_ENABLED
initialization parameter must be set to TRUE
. This is required for all function-based indexes. You can use an ALTER SESSION
statement to set QUERY_REWRITE_ENABLED
to TRUE
. For example:
ALTER SESSION SET QUERY_REWRITE_ENABLED=TRUE;
See Also:
Oracle9i Database Reference for more information about the |
The NLS_COMP
parameter should be set to ANSI
. There are several ways to set NLS_COMP
. For example:
ALTER SESSION SET NLS_COMP = ANSI;
The NLS_SORT
parameter should indicate the linguistic definition you want to use for the linguistic sort. If you want a French linguistic sort order, NLS_SORT
should be set to FRENCH
. If you want a German linguistic sort order, NLS_SORT
should be set to GERMAN
.
There are several ways to set NLS_SORT
. You should set NLS_SORT
as a client environment variable so that you can use the same SQL statements for all languages. Different linguistic indexes can be used when NLS_SORT
is set in the client environment.
Use the cost-based optimizer with the optimizer mode set to FIRST_ROWS
, because linguistic indexes are not recognized by the rule-based optimizer. The following is an example of setting the optimizer mode:
ALTER SESSION SET OPTIMIZER_MODE = FIRST_ROWS;
See Also:
Oracle9i Database Performance Guide and Reference for more information about the cost-based optimizer |
The following example shows how to set up a French linguistic index. You may want to set NLS_SORT
as a client environment variable instead of using the ALTER
SESSION
statement.
ALTER SESSION SET QUERY_REWRITE_ENABLED=TRUE; ALTER SESSION SET NLS_COMP = ANSI; ALTER SESSION SET NLS_SORT='FRENCH'; ALTER SESSION SET OPTIMIZER_MODE = FIRST_ROWS; CREATE INDEX test_idx ON test(NLSSORT(col, 'NLS_SORT=FRENCH')); SELECT * FROM test ORDER BY col; SELECT * FROM test WHERE col > 'JJJ';
You can create a function-based index that improves the performance of case-insensitive searches. For example:
CREATE INDEX case_insensitive_ind ON employees(NLS_UPPER(first_name)); SELECT * FROM employees WHERE NLS_UPPER(first_name) = 'KARL';
You can perform a search that ignores case and diacritics. Enter the following statements:
ALTER SESSION SET NLS_COMP=ANSI; ALTER SESSION SET NLS_SORT=GENERIC_BASELETTER;
Then enter a statement similar to the following:
SELECT * FROM emp WHERE ename='miller';
This statement can now return names that include the following:
Miller MILLER Millér
Note that this is not a linguistic search; that is, it is not based on a specific language. It uses the base letters only.
|
Copyright © 1996, 2002 Oracle Corporation. All Rights Reserved. |
|