Oracle® Text Reference 11g Release 1 (11.1) Part Number B28304-01
This appendix contains a list of the document formats supported by the automatic (AUTO_FILTER) filtering technology.
The automatic filtering technology in Oracle Text, which is licensed from Autonomy, Inc., enables you to index most document formats. This technology also enables you to convert documents to HTML for document presentation with the CTX_DOC package.
To use automatic filtering for indexing and DML processing, you must specify the AUTO_FILTER object in your filter preference.
To use automatic filtering technology for converting documents to HTML with the CTX_DOC package, you need not use the AUTO_FILTER indexing preference, but you must still set up your environment to use this filtering technology, as described in this appendix.
The supported platforms and formats listed in this appendix apply to this release. The list of supported formats is updated in patch releases. To view the latest formats, refer to the Oracle Technology Network:
http://www.oracle.com/technology/products/text
Password-protected documents and documents with password-protected content are not supported by the AUTO_FILTER filter.
For other limitations, refer to the sections of this appendix that cover specific document types.
Several platforms can take advantage of AUTO_FILTER filter technology.
AUTO_FILTER filter technology is supported on the following platforms:
Microsoft Windows Server 2003 (x86 and IA-64)
Microsoft Windows XP (Service Packs 1 and 2)
Microsoft Windows 2000 x86 (Service Pack 2)
Sun Solaris 8.0, 9.0, and 10 SPARC (built on Solaris 8.0)
Sun Solaris on x86
HP-UX 11.0 and 11i, PA-RISC
HP-UX 11i v2, IA-64
IBM AIX L5.1, L5.2, and L5.3 (Power)
Red Hat Enterprise Linux AS 3.0 and 4.0 x86 and IA-64 (built on AS 3.0)
SuSE Linux Enterprise Server 8 and 9 x86 (built on Red Hat). For version 8.0 support, applications must use a GCC 3.2.3 runtime library.
Linux on Power
The tables in this section list the document formats that Oracle Text supports for filtering. Oracle Text licenses its filtering technology from Autonomy, Inc.
Document filtering is used for indexing, DML, and for converting documents to HTML with the CTX_DOC package.
Note:
These lists do not represent the complete list of formats that Oracle Text is able to process. The USER_FILTER and PROCEDURE_FILTER enable Oracle Text to process any document format, provided an external filter exists that can filter to some textual format such as plain text, HTML, or XML.
Plain-text, HTML, XHTML, XML, and SGML formats pass through the filter without any conversion.
Format | Version | Single-byte | Asian (and Most Multi-byte) | Bi-directional? |
---|---|---|---|---|
ANSI (TXT) | All versions | Y | Y | Y |
ASCII (TXT) | All versions | Y | Y | Y |
HTML | 2.0, 3.2, 4.0 | Y | Y | Y |
IBM DCA/RFT (Revisable Form Text) (DC) | SC23-0758-1 | Character sets 500 and 1026 only | N | N |
Rich Text Format (RTF) | 1 through 1.7 | Y | Y | Y |
Unicode Text | 3, 4 | Y | Y | Y |
XHTML | 1.0 | Y | Y | Y |
Generic XML | 1.0 | Y | Y | Y |
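The Unicode Text row above depends on the byte-order mark (BOM) requirement described in the note later in this appendix. As a rough pre-check before indexing, the following Python sketch reports whether a byte string starts with a BOM. It is an illustration only: `detect_bom` is a hypothetical helper, not part of Oracle Text, the set of BOMs the filter itself accepts is not stated in this appendix, and the Basic Latin exception from the note is not modeled.

```python
import codecs

# Byte-order marks, with UTF-32 LE listed before UTF-16 LE because both
# start with the bytes FF FE and the longer match must win.
_BOMS = [
    (codecs.BOM_UTF32_LE, "UTF-32 LE"),
    (codecs.BOM_UTF32_BE, "UTF-32 BE"),
    (codecs.BOM_UTF8, "UTF-8"),
    (codecs.BOM_UTF16_LE, "UTF-16 LE"),
    (codecs.BOM_UTF16_BE, "UTF-16 BE"),
]


def detect_bom(data: bytes):
    """Return the encoding signalled by a leading BOM, or None if absent."""
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name
    return None
```

A result of `None` means only that no BOM is present; the document may still be recognized if it falls under the Basic Latin exception.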
Format | Version | Single-byte | Asian (and Most Multi-byte) | Bi-directional? |
---|---|---|---|---|
Adobe Maker Interchange Format (MIF) | 5, 5.5, 6, 7 | Character set 1252 only | N | N |
Applix Words (AW) | 3.11, 4.0, 4.1, 4.2, 4.3, 4.4 | Character set 1252 only | N | N |
DisplayWrite (IP) | 4 | Character sets 500 and 1026 only | N | N |
Folio Flat File (FFF) | 3.1 | Character set 1252 only | N | N |
Fujitsu Oasys (OA2) | 7 | Y | Y | N |
JustSystems Ichitaro (JTD) | 8 through 2005 | Y | Y | N |
Lotus AMI Pro (SAM) | 2, 3 | Y | Simplified Chinese, Traditional Chinese, Japanese, and Thai only | Y |
Lotus Word Pro (LWP) | 96, 97, Millennium Edition R9, 9.8 (supported on Windows 32-bit platform only) | Y | Y | Y |
Lotus Master (MWP) | 96, 97 (supported on Windows 32-bit platform only) | Y | Y | N |
Lotus AMI Professional Write Plus (AMI) | 2.1 | Y | Simplified Chinese, Traditional Chinese, Japanese, and Thai only | N |
Microsoft Word for PC (DOC) | 4, 5, 5.5, 6 | Character set 1252 only | N | N |
Microsoft Word for Windows (DOC) | 1 through 2003 | Y | N: versions 1-2; Y: versions 6, 7, 8, 95, 97, 2000, XP, 2002, 2003 | N: versions 1-2; Hebrew only: versions 6, 7, 8, 95; Y: versions 97, 2000, XP, 2002, 2003 |
Microsoft Word for Windows XML format | 2003 (No formatting extracted) | Y | Y | Y |
Microsoft Word for Macintosh (DOC) | 4, 5, 6, 98 | Y | N | Y |
Microsoft Works (WPS) | 1 through 2000 | Y | Japanese only | N |
Microsoft Windows Write (WRI) | 1, 2, 3 | Y | Japanese only | N |
OpenOffice (SXW) | 1, 1.1 (No formatting extracted) | Y | Y | Y |
StarOffice (SXW) | 6, 7 (No formatting extracted) | Y | Y | Y |
WordPad (RTF) | Through 2003 | Y | Y | Y |
WordPerfect for Windows (WO) | 5, 5.1 | Y | N | Y |
WordPerfect for Windows (WPD) | 6, 7, 8, 9, 10, 2000, 2002, 11 | Y | N | N |
WordPerfect for Macintosh (WPS) | 1.02, 2, 2.1, 2.2, 3, 3.1 | Y | N | N |
WordPerfect for Linux | 6.0, 8.1 | Y | N | N |
XYWrite (XY4) | 4.12 | Character set 1252 only | N | N |
The following limitations apply to filtering of word processing documents:
If a graphic or table appears in a word processing text box, the filter cannot position it correctly in the HTML output.
Nested tables (a table inside another table) in word processing documents are not supported.
Line numbers in Microsoft Word documents are not supported.
Columns in word processing documents are not supported. Text and graphics in multiple columns in the source document appear in a single flow in the HTML output.
WordArt is converted to text. Display enhancements such as curves, angles, 3-D effects, and shadows are not shown.
Because the concept of a "page" does not exist in HTML, page borders are not supported.
Because the concept of a "page" does not exist in HTML, all page headers appear at the beginning of the HTML output file, and all page footers appear at the end of the HTML output file.
Because the concept of a "page" does not exist in HTML, page orientation (landscape and portrait) is not supported.
For XML-based formats, only the following character sets are supported:
Table B-1 Character Sets Supported in XML-based Formats
Character Set | Description |
---|---|
AL32UTF8 | Unicode 4.0 UTF-8 Universal character set |
AL16UTF16 | Unicode 4.0 UTF-16 Universal 32-bit characters character set |
JA16EUC | EUC 24-bit Japanese |
KO16MSWIN949 | MS Windows Code Page 949 Korean |
WE8ISO8859P1 | ISO 8859-1 West European |
US7ASCII | ASCII 7-bit American |
ZHS16GBK | GBK 16-bit Simplified Chinese |
ZHT16HKSCS | MS Windows Code Page 950 with Hong Kong Supplementary Character Set HKSCS-2001 (character set conversion to and from Unicode is based on Unicode 3.0) |
JA16SJISTILDE | Shift-JIS 16-bit Japanese, except the wave dash and tilde are mapped differently to and from Unicode. |
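To check whether the encoding declared by an XML-based document corresponds to one of the character sets in Table B-1, a lookup along the following lines may help. This is a minimal sketch under stated assumptions: the pairing of Oracle character set names with Python codec names is chosen here for illustration and does not come from Oracle Text, and `oracle_charset_for` is a hypothetical helper.

```python
import codecs

# Python codec name -> Oracle character set name from Table B-1.
# ASSUMPTION: these pairings are illustrative, not an Oracle-defined mapping.
_PYTHON_TO_ORACLE = [
    ("utf-8", "AL32UTF8"),
    ("utf-16", "AL16UTF16"),
    ("utf-16-be", "AL16UTF16"),
    ("euc_jp", "JA16EUC"),
    ("cp949", "KO16MSWIN949"),
    ("iso-8859-1", "WE8ISO8859P1"),
    ("ascii", "US7ASCII"),
    ("gbk", "ZHS16GBK"),
    ("big5hkscs", "ZHT16HKSCS"),
    # Approximate: JA16SJISTILDE maps the wave dash and tilde differently.
    ("shift_jis", "JA16SJISTILDE"),
]

# Normalize codec names once so aliases such as "UTF8" or "latin-1" match.
_BY_CODEC = {codecs.lookup(py).name: ora for py, ora in _PYTHON_TO_ORACLE}


def oracle_charset_for(encoding_label: str):
    """Return the Table B-1 character set for an encoding label, or None."""
    try:
        codec_name = codecs.lookup(encoding_label).name
    except LookupError:
        return None  # label unknown to Python
    return _BY_CODEC.get(codec_name)
```

For example, `oracle_charset_for("latin-1")` resolves to `WE8ISO8859P1`, while a label such as `koi8-r` yields `None` because it has no counterpart in the table.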
Note:
A Unicode text document is not recognized if it lacks a byte-order mark. An exception is when the first 1024 bytes are Basic Latin Unicode. In this case, a byte-order mark is not required.
Format | Version | Single-byte | Asian (and Most Multi-byte) | Bi-directional? |
---|---|---|---|---|
Applix Spreadsheets (AS) | 4.2, 4.3, 4.4 | Character set 1252 only | N | N |
Corel Quattro Pro (QPW, WB3) | 5, 6, 7, 8 (Later versions not supported) | Y | N | N |
Lotus 1-2-3 (123) | 96, 97, Millennium Edition R9, 9.8 | Y | Y | Y |
Lotus 1-2-3 (WK4) | 2, 3, 4, 5 | Y | Y | N |
Lotus 1-2-3 Charts (123) | 2, 3, 4, 5 | Y | Y | N |
Microsoft Excel for Windows (XLS) | 2.2 through 2003 | Y | Y | Y |
Microsoft Excel for Windows XML format | 2003 (No formatting extracted) | Y | Y | Y |
Microsoft Excel for Macintosh (XLS) | 98 | Y | N | N |
Microsoft Excel Charts (XLS) | 2, 3, 4, 5, 6, 7 | Y | Y | N |
Microsoft Works Spreadsheet (S30,S40) | 1, 2, 3, 4 | Y | N | N |
OpenOffice (SXC) | 1, 1.1 (No formatting extracted) | Y | Y | Y |
StarOffice (SXC) | 6, 7 (No formatting extracted) | Y | Y | Y |
Comma Separated Values (CSV) | | Character set 1252 only | N | N |
Format | Version | Single-byte | Asian (and Most Multi-byte) | Bi-directional? |
---|---|---|---|---|
Applix Presents (AG) | 4.0, 4.2, 4.3, 4.4 | Character set 1252 only | N | N |
Corel Presentations (SHW) | 6, 7, 8, 9, 10, 11, 2000, 2002 | Character set 1252 only | N | N |
Lotus Freelance Graphics (PRE) | 2, 96, 97, 98, Millennium Edition R9, 9.8 | Character set 850 only | N | N |
Lotus Freelance Graphics 2 (PRZ) | 2 | Y | Japanese, Simplified Chinese, Traditional Chinese, and Thai only | N |
Microsoft PowerPoint for Windows (PPT) | 95 through 2003 | Y | Japanese, Simplified Chinese, Traditional Chinese, and Korean only | N |
Microsoft PowerPoint for PC (PPT) | 4 | Character set 1252 only | Traditional Chinese only | N |
Microsoft PowerPoint for Macintosh (PPT) | 98 | Y | N | Y |
Microsoft Project (MPP) | 98, 2000, 2002 (XP) | Character set 1252 only | N | N |
Microsoft Visio (VSD) | 5, 6, 2000, 2002, 2003 | Y | Y | Y |
Microsoft Visio XML format | 2003 (No formatting extracted) | Y | Y | Y |
OpenOffice (SXI, SXP) | 1, 1.1 (No formatting extracted) | Y | Y | Y |
StarOffice (SXI, SXP) | 6, 7 (No formatting extracted) | Y | Y | Y |
The following limitations apply to the formatting of spreadsheets:
Hyperlinks are not supported. Hyperlinks within a document are not preserved.
Right-aligned and center-aligned tabs are displayed as left-aligned.
WordArt is converted to text. Display enhancements such as curves, angles, 3-D effects, and shadows are not shown.
Format | Version | Single-byte | Asian (and Most Multi-byte) | Bi-directional? |
---|---|---|---|---|
Adobe Portable Document Format (PDF) | 1.1 (Acrobat 2.0) to 1.6 (Acrobat 7.0) | Y | Y | Y |
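A PDF file normally announces its version in a `%PDF-x.y` header at the start of the file, so the range in the table above can be screened before indexing. The following Python sketch shows one way to do that. The function names are hypothetical, and the check is only a heuristic: from PDF 1.4 onward a document can also declare a later version in its catalog, which this sketch does not read.

```python
import re

# Range taken from the table above: PDF 1.1 through PDF 1.6.
MIN_VERSION = (1, 1)
MAX_VERSION = (1, 6)

_HEADER = re.compile(rb"%PDF-(\d+)\.(\d+)")


def pdf_version(data: bytes):
    """Return (major, minor) from the %PDF-x.y header, or None if absent."""
    # The header is normally at byte 0; scanning the first 1024 bytes
    # tolerates files that carry a little leading junk.
    match = _HEADER.search(data[:1024])
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


def in_listed_range(data: bytes) -> bool:
    """True if the header version falls within the range listed above."""
    version = pdf_version(data)
    return version is not None and MIN_VERSION <= version <= MAX_VERSION
```

A file that passes this check can still hit the limitations described below, for example when it uses embedded fonts.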
Multi-byte PDFs are supported, provided the PDF document is created using Character ID-keyed (CID) fonts, predefined CJK CMap files, or ToUnicode font encodings, and the document does not contain embedded fonts. See the Adobe website and the Adobe Acrobat documentation for more information.
To determine the type of font encodings that are used in a PDF, open the PDF document in Adobe Acrobat, and select File->Document Info->Fonts. If the Encodings column lists Custom or Embedded encodings, then you may encounter problems filtering the PDF document.
Limitations apply to PDF documents as described in this section.
Embedded fonts in a PDF document are not filtered correctly. They are usually displayed using the question mark (?) replacement character.
The following color spaces are supported:
DeviceRGB
DeviceGray
DeviceCMYK
CalGray
CalRGB
Index color spaces are supported as long as they are used with a supported basic color space.
Hyperlinks in a PDF are not active when displayed in a browser or a viewing window.
All pre-defined CMaps in PDF 1.3 specification are supported. CMaps added in PDF 1.4 and PDF 1.5 specifications are not supported. A CMap specifies a mapping from a character code to the Adobe Character Identifier Number (CID). Characters with unsupported CMaps are not translated correctly. They are usually displayed using the question mark (?) replacement character.
Annotations, such as notes, sound, or movie, are not supported.
The following features of PDF 1.5 for Acrobat 6.0 are not supported:
Tagged PDFs. When processing a "tagged" PDF, the structure defined by the PDF tags is ignored and is not used to determine the paragraph flow of the output.
Images compressed in JPEG2000
Hidden content in a PDF document, such as Optional Content and OCG-State Actions
Interactive forms
Embedded multimedia presentations
Digital signatures and signature fields
Interactive presentations, that is, navigation between pages and transition actions.
Vector images are not supported. Because background colors are defined in PDF as vector images, background colors are also not supported. Raster images are supported.
Typeface styles that are rendered in a PDF by printing the character multiple times in the same space (such as shadow fonts) are not consolidated into a single character. For example, the shadow character "B" in a PDF is extracted as "BBBB."
Table B-2 lists the graphic formats that the AUTO_FILTER filter recognizes. This means that indexing a text column that contains any of these formats produces no error. As such, it is safe for the column to contain any of these formats.
Formats are categorized as either embedded graphics or standalone graphics. Embedded graphics are inserted or referenced within a document.
Note:
The AUTO_FILTER filter cannot extract textual information from graphics.
Table B-2 Supported Graphics Formats for AUTO_FILTER Filter
Graphics Format | Version |
---|---|
AutoCAD Drawing format (DWG) | R13, R14, 2000, and 2004 (standalone only) |
AutoCAD Drawing format (DXF) | R13, R14, 2000, and 2004 (standalone only) |
Encapsulated PostScript (EPS) (raster only) | TIFF header only |
Enhanced Metafile (EMF) | no specific version |
Graphics Interchange Format (GIF) | 87, 89 |
JPEG File Interchange Format | no specific version |
Lotus AMIDraw Graphics (SDW) | no specific version |
Lotus Pic (PIC) | no specific version |
Macintosh Raster (PICT/PCT) | 2 |
MacPaint (PNTG) | no specific version |
Microsoft Windows Bitmap (BMP) | no specific version |
PC Paintbrush (PCX) | 3 |
Portable Network Graphics (PNG) | no specific version |
SGI RGB Image (RGB) | no specific version |
Sun Raster Image (RS) | no specific version |
Tagged Image File (TIFF) | through 6.0 (see Footnote 1) |
Truevision TARGA (TGA) | 2 |
Windows Animated Cursor (ANI) | no specific version |
Windows Metafile (WMF) | 3 |
WordPerfect Graphics 1 (WPG) | 1 |
WordPerfect Graphics 2 (WPG) | 2, 7 |
Computer Graphics Metafile (CGM) | no specific version |
Corel DRAW (CDR) | through 9.0 |
DCX Fax System (DCX) | no specific version |
Microsoft Office Drawing (MSO) | no specific version |
Windows Icon Cursor (ICO) | no specific version |
Footnote 1: For Tagged Image File (TIFF), the following compression types are supported: no compression, CCITT Group 3 1-Dimensional Modified Huffman, CCITT Group 3 T4 1-Dimensional, CCITT Group 4 T6, LZW, JPEG (only Gray, RGB, and CMYK color spaces are supported), and PackBits.
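The compression types in Footnote 1 correspond to values of the TIFF Compression tag (tag 259), so a file can be screened by reading that tag from its first image file directory. The Python sketch below is one possible way to do this. The helper name and the choice of tag values are assumptions made for illustration: value 7 is used for JPEG and the older value 6 is left out, and neither the 1-dimensional restriction for T4 nor the JPEG color space restriction is checked.

```python
import struct

# TIFF Compression tag values matching the types named in Footnote 1.
# ASSUMPTION: JPEG is taken to mean value 7; old-style JPEG (6) is omitted.
SUPPORTED_COMPRESSION = {
    1: "no compression",
    2: "CCITT Group 3 1-Dimensional Modified Huffman",
    3: "CCITT Group 3 T4",
    4: "CCITT Group 4 T6",
    5: "LZW",
    7: "JPEG",
    32773: "PackBits",
}

COMPRESSION_TAG = 259


def tiff_compression(data: bytes):
    """Return the Compression value from the first IFD, or None if not TIFF."""
    if len(data) < 8:
        return None
    if data[:2] == b"II":
        endian = "<"  # little-endian ("Intel") byte order
    elif data[:2] == b"MM":
        endian = ">"  # big-endian ("Motorola") byte order
    else:
        return None
    magic, ifd_offset = struct.unpack(endian + "HI", data[2:8])
    if magic != 42 or ifd_offset + 2 > len(data):
        return None
    (count,) = struct.unpack(endian + "H", data[ifd_offset:ifd_offset + 2])
    for i in range(count):
        start = ifd_offset + 2 + 12 * i
        entry = data[start:start + 12]
        if len(entry) < 12:
            return None  # truncated directory
        tag, _type, _count, value = struct.unpack(endian + "HHI4s", entry)
        if tag == COMPRESSION_TAG:
            # Compression is a single SHORT in the first two value bytes.
            return struct.unpack(endian + "H", value[:2])[0]
    # TIFF 6.0 defaults to 1 (no compression) when the tag is absent.
    return 1
```

A caller would then test `tiff_compression(data) in SUPPORTED_COMPRESSION` to decide whether the file uses one of the listed compression types.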