Oracle® Database Globalization Support Guide 10g Release 1 (10.1) Part Number B10749-02 |
|
|
View PDF |
This chapter illustrates how to use Unicode in an Oracle database environment. It includes the following topics:
Dealing with many different languages in the same application or database has been complicated and difficult for a long time. To overcome the limitations of existing character encodings, several organizations began working on the creation of a global character set in the late 1980s. The need for this became even greater with the development of the World Wide Web in the mid-1990s. The Internet has changed how companies do business, with an emphasis on the global market that has made a universal character set a major requirement. A global character set needs to fulfill the following conditions:
Contain all major living scripts
Support legacy data and implementations
Be simple enough that a single implementation of an application is sufficient for worldwide use
A global character set should also have the following capabilities:
Support multilingual users and organizations
Conform to international standards
Enable worldwide interchange of data
This global character set exists, is in wide use, and is called Unicode.
Unicode is a universal encoded character set that enables information from any language to be stored using a single character set. Unicode provides a unique code value for every character, regardless of the platform, program, or language.
The Unicode standard has been adopted by many software and hardware vendors. Many operating systems and browsers now support Unicode. Unicode is required by standards such as XML, Java, JavaScript, LDAP, and WML. It is also synchronized with the ISO/IEC 10646 standard.
Oracle Corporation started supporting Unicode as a database character set in version 7. In Oracle Database 10g, Unicode support has been expanded. Oracle Database 10g supports Unicode 3.2.
See Also: http://www.unicode.org for more information about the Unicode standard |
This section contains the following topics:
The first version of Unicode was a 16-bit, fixed-width encoding that used two bytes to encode each character. This allowed 65,536 characters to be represented. However, more characters need to be supported, especially additional CJK ideographs that are important for the Chinese, Japanese, and Korean markets.
Unicode 3.2 defines supplementary characters to meet this need. It uses two 16-bit code points (also known as supplementary characters) to represent a single character. This enables an additional 1,048,576 characters to be defined. The Unicode 3.2 standard defines 45,960 supplementary characters.
Adding supplementary characters increases the complexity of Unicode, but it is less complex than managing several different encodings in the same configuration.
Unicode 3.2 encodes characters in different ways: UTF-8, UCS-2, and UTF-16. Conversion between different Unicode encodings is a simple bit-wise operation that is defined in the Unicode standard.
This section contains the following topics:
UTF-8 is the 8-bit encoding of Unicode. It is a variable-width encoding and a strict superset of ASCII. This means that each and every character in the ASCII character set is available in UTF-8 with the same code point values. One Unicode character can be 1 byte, 2 bytes, 3 bytes, or 4 bytes in UTF-8 encoding. Characters from the European scripts are represented in either 1 or 2 bytes. Characters from most Asian scripts are represented in 3 bytes. Supplementary characters are represented in 4 bytes.
UTF-8 is the Unicode encoding supported on UNIX platforms and used for HTML and most Internet browsers. Other environments such as Windows and Java use UCS-2 encoding.
The benefits of UTF-8 are as follows:
Compact storage requirement for European scripts because it is a strict superset of ASCII
Ease of migration between ASCII-based characters sets and UTF-8
UCS-2 is a fixed-width, 16-bit encoding. Each character is 2 bytes. UCS-2 is the Unicode encoding used by Java and Microsoft Windows NT 4.0. UCS-2 supports characters defined for Unicode 3.0, so there is no support for supplementary characters.
The benefits of UCS-2 over UTF-8 are as follows:
More compact storage for Asian scripts because all characters are two bytes
Faster string processing because characters are fixed-width
Better compatibility with Java and Microsoft clients
See Also: "Supplementary Characters" |
UTF-16 encoding is the 16-bit encoding of Unicode. UTF-16 is an extension of UCS-2 because it supports the supplementary characters that are defined in Unicode 3.2 by using two UCS-2 code points for each supplementary character. UTF-16 is a strict superset of UCS-2.
One character can be either 2 bytes or 4 bytes in UTF-16. Characters from European and most Asian scripts are represented in 2 bytes. Supplementary characters are represented in 4 bytes. UTF-16 is the main Unicode encoding used by Microsoft Windows 2000.
The benefits of UTF-16 over UTF-8 are as follows:
More compact storage for Asian scripts because most of the commonly used Asian characters are represented in two bytes.
Better compatibility with Java and Microsoft clients
Figure 6-1 shows some characters and their character codes in UTF-16, UTF-8, and UCS-2 encoding. The last character is a treble clef (a music symbol), a supplementary character that has been added to the Unicode 3.2 standard.
Oracle Corporation started supporting Unicode as a database character set in version 7. Table 6-1 summarizes the Unicode character sets supported by the Oracle database server.
Table 6-1 Unicode Character Sets Supported by the Oracle Database Server
You can store Unicode characters in an Oracle database in two ways.
You can create a Unicode database that enables you to store UTF-8 encoded characters as SQL CHAR
datatypes (CHAR
, VARCHAR2
, CLOB
, and LONG)
.
If you prefer to implement Unicode support incrementally or if you need to support multilingual data only in certain columns, then you can store Unicode data in either the UTF-16 or UTF-8 encoding form in SQL NCHAR
datatypes (NCHAR
, NVARCHAR2
, and NCLOB
). The SQL NCHAR
datatypes are called Unicode datatypes because they are used only for storing Unicode data.
Note: You can combine a Unicode database solution with a Unicode datatype solution. |
The following sections explain how to use the two Unicode solutions and how to choose between them:
How to Choose Between a Unicode Database and a Unicode Datatype Solution
Comparing Unicode Character Sets for Database and Datatype Solutions
The database character set specifies the encoding to be used in the SQL CHAR
datatypes as well as the metadata such as table names, column names, and SQL statements. A Unicode database is a database with a UTF-8 character set as the database character set. There are three Oracle character sets that implement the UTF-8 encoding. The first two are designed for ASCII-based platforms while the third one should be used on EBCDIC platforms.
AL32UTF8
The AL32UTF8 character set supports the latest version of the Unicode standard. It encodes characters in one, two, or three bytes. Supplementary characters require four bytes. It is for ASCII-based platforms.
UTF8
The UTF8 character set encodes characters in one, two, or three bytes. It is for ASCII-based platforms.
The UTF8 character set has supported Unicode 3.0 since Oracle8i release 8.1.7 and will continue to support Unicode 3.0 in future releases of the Oracle database server. Although specific supplementary characters were not assigned code points in Unicode until version 3.1, the code point range was allocated for supplementary characters in Unicode 3.0. If supplementary characters are inserted into a UTF8 database, then it does not corrupt the data in the database. The supplementary characters are treated as two separate, user-defined characters that occupy 6 bytes. Oracle Corporation recommends that you switch to AL32UTF8 for full support of supplementary characters in the database character set.
UTFE
The UTFE character set is for EBCDIC platforms. It is similar to UTF8 on ASCII platforms, but it encodes characters in one, two, three, and four bytes. Supplementary characters are converted as two 4-byte characters.
Example 6-1 Creating a Database with a Unicode Character Set
To create a database with the AL32UTF8 character set, use the CREATE
DATABASE
statement and include the CHARACTER SET AL32UTF8
clause. For example:
CREATE DATABASE sample
CONTROLFILE REUSE LOGFILE
GROUP 1 ('diskx:log1.log', 'disky:log1.log') SIZE 50K, GROUP 2 ('diskx:log2.log', 'disky:log2.log') SIZE 50K
MAXLOGFILES 5 MAXLOGHISTORY 100 MAXDATAFILES 10 MAXINSTANCES 2 ARCHIVELOG CHARACTER SET AL32UTF8 NATIONAL CHARACTER SET AL16UTF16 DATAFILE
'disk1:df1.dbf' AUTOEXTEND ON, 'disk2:df2.dbf' AUTOEXTEND ON NEXT 10M MAXSIZE UNLIMITED
DEFAULT TEMPORARY TABLESPACE temp_ts UNDO TABLESPACE undo_ts SET TIME_ZONE = '+02:00';
Note: Specify the database character set when you create the database. |
An alternative to storing Unicode data in the database is to use the SQL NCHAR
datatypes (NCHAR
, NVARCHAR
, NCLOB
). You can store Unicode characters into columns of these datatypes regardless of how the database character set has been defined. The NCHAR
datatype is a Unicode datatype exclusively. In other words, it stores data encoded as Unicode.
In releases before Oracle9i, the NCHAR
datatype supported fixed-width Asian character sets that were designed to provide higher performance. Examples of fixed-width character sets are JA16SJISFIXED and ZHT32EUCFIXED. No Unicode character set was supported as the national character set before Oracle9i.
You can create a table using the NVARCHAR2
and NCHAR
datatypes. The column length specified for the NCHAR
and NVARCHAR2
columns is always the number of characters instead of the number of bytes:
CREATE TABLE product_information ( product_id NUMBER(6) , product_name NVARCHAR2(100) , product_description VARCHAR2(1000));
The encoding used in the SQL NCHAR
datatypes is the national character set specified for the database. You can specify one of the following Oracle character sets as the national character set:
AL16UTF16
This is the default character set for SQL NCHAR
datatypes. The character set encodes Unicode data in the UTF-16 encoding. It supports supplementary characters, which are stored as four bytes.
UTF8
When UTF8 is specified for SQL NCHAR
datatypes, the data stored in the SQL datatypes is in UTF-8 encoding.
You can specify the national character set for the SQL NCHAR
datatypes when you create a database using the CREATE
DATABASE
statement with the NATIONAL CHARACTER SET
clause. The following statement creates a database with WE8ISO8859P1 as the database character set and AL16UTF16 as the national character set.
Example 6-2 Creating a Database with a National Character Set
CREATE DATABASE sample
CONTROLFILE REUSE LOGFILE
GROUP 1 ('diskx:log1.log', 'disky:log1.log') SIZE 50K, GROUP 2 ('diskx:log2.log', 'disky:log2.log') SIZE 50K
MAXLOGFILES 5 MAXLOGHISTORY 100 MAXDATAFILES 10 MAXINSTANCES 2 ARCHIVELOG CHARACTER SET WE8ISO8859P1 NATIONAL CHARACTER SET AL16UTF16 DATAFILE
'disk1:df1.dbf' AUTOEXTEND ON, 'disk2:df2.dbf' AUTOEXTEND ON NEXT 10M MAXSIZE UNLIMITED
DEFAULT TEMPORARY TABLESPACE temp_ts UNDO TABLESPACE undo_ts SET TIME_ZONE = '+02:00';
To choose the right Unicode solution for your database, consider the following questions:
Programming environment: What are the main programming languages used in your applications? How do they support Unicode?
Ease of migration: How easily can your data and applications be migrated to take advantage of the Unicode solution?
Performance: How much performance overhead are you willing to accept in order to use Unicode in the database?
Type of data: Is your data mostly Asian or European? Do you need to store multilingual documents into LOB
columns?
Type of applications: What type of applications are you implementing: a packaged application or a customized end-user application?
This section describes some general guidelines for choosing a Unicode database or a Unicode datatype solution. The final decision largely depends on your exact environment and requirements. This section contains the following topics:
Use a Unicode database in the situations described in Table 6-2.
Table 6-2 Using a Unicode Database
Situation | Explanation |
---|---|
You need easy code migration for Java or PL/SQL. | If your existing application is mainly written in Java and PL/SQL and your main concern is to minimize the code changes required to support multiple languages, then you may want to use a Unicode database solution. If the datatypes used to stored data remain as SQL CHAR datatypes, then the Java and PL/SQL code that accesses these columns does not need to change. |
You have evenly distributed multilingual data. | If the multilingual data is evenly distributed in existing schema tables and you are not sure which tables contain multilingual data, then you should use a Unicode database because it does not require you to identify the kind of data that is stored in each column. |
Your SQL statements and PL/SQL code contain Unicode data. | You must use a Unicode database. SQL statements and PL/SQL code are converted into the database character set before being processed. If the SQL statements and PL/SQL code contain characters that cannot be converted to the database character set, then those characters are lost. A common place to use Unicode data in a SQL statement is in a string literal. |
You want to store multilingual documents in BLOB format and use Oracle Text for content searching. |
You must use a Unicode database. The BLOB data is converted to the database character set before being indexed by Oracle Text. If your database character set is not UTF8, then data are lost when the documents contain characters that cannot be converted to the database character set. |
Use Unicode datatypes in the situations described in Table 6-3.
Table 6-3 Using Unicode Datatypes
Situation | Explanation |
---|---|
You want to add multilingual support incrementally. | If you want to add Unicode support to the existing database without migrating the character set, then consider using Unicode datatypes to store Unicode data. You can add columns of the SQL NCHAR datatypes to existing tables or new tables to support multiple languages incrementally. |
You want to build a packaged application. | If you are building a packaged application to sell to customers, then you may want to build the application using SQL NCHAR datatypes. The SQL NCHAR datatype is a reliable Unicode datatype in which the data is always stored in Unicode, and the length of the data is always specified in UTF-16 code units. As a result, you need to test the application only once. The application will run on customer databases with any database character set. |
You want better performance with single-byte database character sets. | If performance is your main concern, then consider using a single-byte database character set and storing Unicode data in the SQL NCHAR datatypes. Databases that use a multibyte database character set such as UTF8 have a performance overhead. |
You require UTF-16 support in Windows clients. | If your applications are written in Visual C/C++ or Visual Basic running on Windows, then you may want to use the SQL NCHAR datatypes. You can store UTF-16 data in SQL NCHAR datatypes in the same way that you store it in the wchar_t buffer in Visual C/C++ and string buffer in Visual Basic. You can avoid buffer overflow in client applications because the length of the wchar_t and string datatypes match the length of the SQL NCHAR datatypes in the database. |
Note: You can use a Unicode database with Unicode datatypes. |
Oracle provides two solutions to store Unicode characters in the database: a Unicode database solution and a Unicode datatype solution. After you select the Unicode database solution, the Unicode datatype solution or a combination of both, determine the character set to be used in the Unicode database or the Unicode datatype.
Table 6-4 contains advantages and disadvantages of different character sets for a Unicode database solution. The Oracle character sets that can be Unicode database character sets are AL32UTF8, UTF8, and UTFE.
Table 6-4 Character Set Advantages and Disadvantages for a Unicode Database Solution
Database Character Set | Advantages | Disadvantages |
---|---|---|
AL32UTF8 |
|
|
UTF8 |
|
|
UTFE |
|
|
Table 6-5 contains advantages and disadvantages of different character sets for a Unicode datatype solution. The Oracle character sets that can be national character sets are AL16UTF16 and UTF8. The default is AL16UTF16.
Table 6-5 Character Set Advantages and Disadvantages for a Unicode Datatype Solution
National Character Set | Advantages | Disadvantages |
---|---|---|
AL16UTF16 |
|
|
UTF8 |
|
|
This section describes typical scenarios for storing Unicode characters in an Oracle database:
Example 6-3 Unicode Solution with a Unicode Database
An American company running a Java application would like to add German and French support in the next release of the application. They would like to add Japanese support at a later time. The company currently has the following system configuration:
The existing database has a database character set of US7ASCII.
All character data in the existing database is composed of ASCII characters.
PL/SQL stored procedures are used in the database.
The database is around 300 GB.
There is a nightly downtime of 4 hours.
In this case, a typical solution is to choose UTF8 for the database character set because of the following reasons:
The database is very large and the scheduled downtime is short. Fast migration of the database to Unicode is vital. Because the database is in US7ASCII, the easiest and fastest way of enabling the database to support Unicode is to switch the database character set to UTF8 by issuing the ALTER
DATABASE
statement. No data conversion is required because US7ASCII is a subset of UTF8.
Because most of the code is written in Java and PL/SQL, changing the database character set to UTF8 is unlikely to break existing code. Unicode support is automatically enabled in the application.
Because the application supports French, German, and Japanese, there are few supplementary characters. Both AL32UTF8 and UTF8 are suitable.
Example 6-4 Unicode Solution with Unicode Datatypes
A European company that runs its applications mainly on Windows platforms wants to add new Windows applications written in Visual C/C++. The new applications will use the existing database to support Japanese and Chinese customer names. The company currently has the following system configuration:
The existing database has a database character set of WE8ISO8859P1.
All character data in the existing database is composed of Western European characters.
The database is around 50 GB.
A typical solution is take the following actions:
Use NCHAR
and NVARCHAR2
datatypes to store Unicode characters
Keep WE8ISO8859P1 as the database character set
Use AL16UTF16 as the national character set
The reasons for this solution are:
Migrating the existing database to a Unicode database required data conversion because the database character set is WE8ISO8859P1 (a Latin-1 character set), which is not a subset of UTF8. As a result, there would be some overhead in converting the data to UTF8.
The additional languages are supported in new applications only. They do not depend on the existing applications or schemas. It is simpler to use the Unicode datatype in the new schema and keep the existing schemas unchanged.
Only customer name columns require Unicode support. Using a single NCHAR
column meets the customer's requirements without migrating the entire database.
Because the languages to be supported are mostly Asian languages, AL16UTF16 should be used as the national character set so that disk space is used more efficiently.
The lengths of the SQL NCHAR
datatypes are defined as number of characters. This is the same as the way they are treated when using wchar_t
strings in Windows C/C++ programs. This reduces programming complexity.
Existing applications using the existing schemas are unaffected.
Example 6-5 Unicode Solution with a Unicode Database and Unicode Datatypes
A Japanese company wants to develop a new Java application. The company expects that the application will support as many languages as possible in the long run.
In order to store documents as is, the company decided to use the BLOB
datatype to store documents of multiple languages.
The company may also want to generate UTF-8 XML documents from the relational data for business-to-business data exchange.
The back-end has Windows applications written in C/C++ using ODBC to access the Oracle database.
In this case, the typical solution is to create a Unicode database using AL32UTF8 as the database character set and use the SQL NCHAR
datatypes to store multilingual data. The national character set should be set to AL16UTF16. The reasons for this solution are as follows:
When documents of different languages are stored BLOB
format, Oracle Text requires the database character set to be one of the UTF-8 character sets. Because the applications may retrieve relational data as UTF-8 XML format (where supplementary characters are stored as four bytes), AL32UTF8 should be used as the database character set to avoid data conversion when UTF-8 data is retrieved or inserted.
Because applications are new and written in both Java and Windows C/C++, the company should use the SQL NCHAR
datatype for its relational data. Both Java and Windows support the UTF-16 character datatype, and the length of a character string is always measured in the number of characters.
If most of the data is for Asian languages, then AL16UTF16 should be used with the SQL NCHAR
datatypes because AL16UTF16 offers better performance and storage efficiency.
In addition to choosing a Unicode solution, the following issues should be taken into consideration when the database schema is designed to support multiple languages:
When you use NCHAR
and NVARCHAR2
datatypes for storing multilingual data, the column size specified for a column is defined in number of characters. (The number of characters means the number of Unicode code units.) Table 6-6 shows the maximum size of the NCHAR
and NVARCHAR2
datatypes for the AL16UTF16 and UTF8 national character sets.
Table 6-6 Maximum Datatype Size
National Character Set | Maximum Column Size of NCHAR Datatype | Maximum Column Size of NVARCHAR2 Datatype |
---|---|---|
AL16UTF16 | 1000 characters | 2000 characters |
UTF8 | 2000 bytes | 4000 bytes |
When you use CHAR
and VARCHAR2
datatypes for storing multilingual data, the maximum length specified for each column is, by default, in number of bytes. If the database needs to support Thai, Arabic, or multibyte languages such as Chinese and Japanese, then the maximum lengths of the CHAR
, VARCHAR
, and VARCHAR2
columns may need to be extended. This is because the number of bytes required to encode these languages in UTF8 or AL32UTF8 may be significantly larger than the number of bytes for encoding English and Western European languages. For example, one Thai character in the Thai character set requires 3 bytes in UTF8 or AL32UTF8. In addition, the maximum column lengths for CHAR
, VARCHAR
, and VARCHAR2
datatypes are 2000 bytes, 4000 bytes, and 4000 bytes respectively. If applications need to store more than 4000 bytes, then they should use the CLOB
datatype.
The Unicode character set includes characters of most written languages around the world, but it does not contain information about the language to which a given character belongs. In other words, a character such as ä
does not contain information about whether it is a French or German character. In order to provide information in the language a user desires, data stored in a Unicode database should accompany the language information to which the data belongs.
There are many ways for a database schema to relate data to a language. The following sections provide different approaches:
For data such as product descriptions or product names, you can add a language column (language_id
) of CHAR
or VARCHAR2
datatype to the product table to identify the language of the corresponding product information. This enables applications to retrieve the information in the desired language. The possible values for this language column are the 3-letter abbreviations of the valid NLS_LANGUAGE
values of the database.
See Also: Appendix A, "Locale Data" for a list ofNLS_LANGUAGE values and their abbreviations |
You can also create a view to select the data of the current language. For example:
ALTER TABLE scott.product_information add (language_id VARCHAR2(50)): CREATE OR REPLACE VIEW product AS SELECT product_id, product_name FROM product_information WHERE language_id = sys_context('USERENV','LANG');
Fine-grained access control enables you to limit the degree to which a user can view information in a table or view. Typically, this is done by appending a WHERE
clause. when you add a WHERE
clause as a fine-grained access policy to a table or view, Oracle automatically appends the WHERE
clause to any SQL statements on the table at run time so that only those rows satisfying the WHERE
clause can be accessed.
You can use this feature to avoid specifying the desired language of an user in the WHERE
clause in every SELECT
statement in your applications. The following WHERE
clause limits the view of a table to the rows corresponding to the desired language of a user:
WHERE language_id = sys_context('userenv', 'LANG')
Specify this WHERE
clause as a fine-grained access policy for product_information
as follows:
create function func1 ( sch varchar2 , obj varchar2 ) return varchar2(100); begin return 'language_id = sys_context(''userenv'', ''LANG'')'; end / DBMS_RLS.ADD_POLICY ('scott', 'product_information', 'lang_policy', 'scott', 'func1', 'select');
Then any SELECT
statement on the product_information
table automatically appends the WHERE
clause.
See Also: Oracle Database Application Developer's Guide - Fundamentals for more information about fine-grained access control |
You can store documents in multiple languages in CLOB
, NCLOB
, or BLOB
datatypes and set up Oracle Text to enable content search for the documents.
Data in CLOB
columns is stored in a format that is compatible with UCS-2 when the database character set is multibyte, such as UTF8 or AL32UTF8. This means that the storage space required for an English document doubles when the data is converted. Storage for an Asian language document in a CLOB
column requires less storage space than the same document in a LONG
column using UTF8, typically around 30% less, depending on the contents of the document.
Documents in NCLOB
format are also stored in a proprietary format that is compatible with UCS-2 regardless of the database character set or national character set. The storage space requirement is the same as for CLOB
data. Document contents are converted to UTF-16 when they are inserted into a NCLOB
column. If you want to store multilingual documents in a non-Unicode database, then choose NCLOB
. However, content search on NCLOB
is not yet supported.
Documents in BLOB
format are stored as they are. No data conversion occurs during insertion and retrieval. However, SQL string manipulation functions (such as LENGTH
or SUBSTR
) and collation functions (such as NLS_SORT
and ORDER BY
) cannot be applied to the BLOB
datatype.
Table 6-7 lists the advantages and disadvantages of the CLOB
, NCLOB
, and BLOB
datatypes when storing documents:
Table 6-7 Comparison of LOB Datatypes for Document Storage
Datatypes | Advantages | Disadvantages |
---|---|---|
CLOB |
|
|
NCLOB |
|
|
BLOB |
|
|
Oracle Text enables you to build indexes for content search on multilingual documents stored in CLOB
format and BLOB
format. It uses a language-specific lexer to parse the CLOB
or BLOB
data and produces a list of searchable keywords.
Create a multilexer to search multilingual documents. The multilexer chooses a language-specific lexer for each row, based on a language column. This section describe the high level steps to create indexes for documents in multiple languages. It contains the following topics:
The first step in creating the multilexer is the creation of language-specific lexer preferences for each language supported. The following example creates English, German, and Japanese lexers with PL/SQL procedures:
ctx_ddl.create_preference('english_lexer', 'basic_lexer'); ctx_ddl.set_attribute('english_lexer','index_themes','yes'); ctx_ddl.create_preference('german_lexer', 'basic_lexer'); ctx_ddl.set_attribute('german_lexer','composite','german'); ctx_ddl.set_attribute('german_lexer','alternate_spelling','german'); ctx_ddl.set_attribute('german_lexer','mixed_case','yes'); ctx_ddl.create_preference('japanese_lexer', 'JAPANESE_VGRAM_LEXER');
After the language-specific lexer preferences are created, they need to be gathered together under a single multilexer preference. First, create the multilexer preference, using the MULTI_LEXER
object:
ctx_ddl.create_preference('global_lexer','multi_lexer');
Now add the language-specific lexers to the multilexer preference using the add_sub_lexer
call:
ctx_ddl.add_sub_lexer('global_lexer', 'german', 'german_lexer'); ctx_ddl.add_sub_lexer('global_lexer', 'japanese', 'japanese_lexer'); ctx_ddl.add_sub_lexer('global_lexer', 'default','english_lexer');
This nominates the german_lexer
preference to handle German documents, the japanese_lexer
preference to handle Japanese documents, and the english_lexer
preference to handle everything else, using DEFAULT
as the language.
The multilexer decides which lexer to use for each row based on a language column in the table. This is a character column that stores the language of the document in a text column. Use the Oracle language name to identify the language of a document in this column. For example, if you use the CLOB
datatype to store your documents, then add the language column to the table where the documents are stored:
CREATE TABLE globaldoc (doc_id NUMBER PRIMARY KEY, language VARCHAR2(30), text CLOB);
To create an index for this table, use the multilexer preference and specify the name of the language column:
CREATE INDEX globalx ON globaldoc(text) indextype IS ctxsys.context parameters ('lexer global_lexer language column language');
In addition to the language column, the character set and format columns must be added in the table where the documents are stored. The character set column stores the character set of the documents using the Oracle character set names. The format column specifies whether a document is a text or binary document. For example, the CREATE TABLE
statement can specify columns called characterset
and format
:
CREATE TABLE globaldoc ( doc_id NUMBER PRIMARY KEY, language VARCHAR2(30), characterset VARCHAR2(30), format VARCHAR2(10), text BLOB );
You can put word-processing or spreadsheet documents into the table and specify binary
in the format
column. For documents in HTML, XML and text format, you can put them into the table and specify text
in the format
column.
Because there is a column in which to specify the character set, you can store text documents in different character sets.
When you create the index, specify the names of the format and character set columns:
CREATE INDEX globalx ON globaldoc(text) indextype is ctxsys.context parameters ('filter inso_filter lexer global_lexer language column language format column format charset column characterset');
You can use the charset_filter
if all documents are in text format. The charset_filter
converts data from the character set specified in the charset
column to the database character set.