Oracle Text Application Developer's Guide Release 9.2 Part Number A96517-01 |
|
This chapter describes Oracle Text querying and associated features. The following topics are covered:
The basic Oracle Text query takes a query expression, usually a word with or without operators, as input. Oracle returns all documents (previously indexed) that satisfy the expression along with a relevance score for each document. Scores can be used to order the documents in the result set.
To issue an Oracle Text query, use the SQL SELECT
statement. Depending on the type of index you create, you use either the CONTAINS
or CATSEARCH
operator in the WHERE
clause. You can use these operators programatically wherever you can use the SELECT
statement, such as in PL/SQL cursors.
Use the MATCHES
operator to classify documents with a CTXRULE
index.
When you create an index of type CONTEXT
, you must use the CONTAINS
operator to issue your query. An index of type CONTEXT
is suited for indexing collections of large coherent documents.
With the CONTAINS
operator, you can use a number of operators to define your search criteria. These operators enable you to issue logical, proximity, fuzzy, stemming, thesaurus and wildcard searches. With a correctly configured index, you can also issue section searches on documents that have internal structure such as HTML and XML.
With CONTAINS
, you can also use the ABOUT
operator to search on document themes.
In the SELECT statement, specify the query in the WHERE
clause with the CONTAINS
operator. Also specify the SCORE
operator to return the score of each hit in the hitlist. The following example shows how to issue a query:
SELECT SCORE(1) title from news WHERE CONTAINS(text, 'oracle', 1) > 0;
You can order the results from the highest scoring documents to the lowest scoring documents using the ORDER
BY
clause as follows:
SELECT SCORE(1), title from news WHERE CONTAINS(text, 'oracle', 1) > 0 ORDER BY SCORE(1) DESC;
In a PL/SQL application, you can use a cursor to fetch the results of the query.
The following example issues a CONTAINS
query against the NEWS
table to find all articles that contain the word oracle. The titles and scores of the first ten hits are output.
declare rowno number := 0; begin for c1 in (SELECT SCORE(1) score, title FROM news WHERE CONTAINS(text, 'oracle', 1) > 0 ORDER BY SCORE(1) DESC) loop rowno := rowno + 1; dbms_output.put_line(c1.title||': '||c1.score); exit when rowno = 10; end loop; end;
This example uses a cursor FOR
loop to retrieve the first ten hits. An alias score is declared for the return value of the SCORE
operator. The score and title are output to standard out using cursor dot notation.
A structured query, also called a mixed query, is a query that has a CONTAINS
predicate to query a text column and has another predicate to query a structured data column.
To issue a structured query, you specify the structured clause in the WHERE
condition of the SELECT
statement.
For example, the following SELECT
statement returns all articles that contain the word oracle that were written on or after October 1, 1997:
SELECT SCORE(1), title, issue_date from news WHERE CONTAINS(text, 'oracle', 1) > 0 AND issue_date >= ('01-OCT-97') ORDER BY SCORE(1) DESC;
Note: Even though you can issue structured queries with |
When you create an index of type CTXCAT
, you must use the CATSEARCH
operator to issue your query. An index of type CTXCAT
is best suited when your application stores short text fragments in the text column and other associated information in related columns.
For example, an application serving an online auction site might have a table that stores item description in a text column and associated information such as date and price in other columns. With a CTXCAT
index, you can create b-tree indexes on one or more of these columns. The result is that when you use the CATSEARCH
operator to search a CTXCAT
index, query performance is generally faster for mixed queries.
The operators available for CATSEARCH
queries are limited to logical operations such as AND
or OR
. The operators you can use to define your structured criteria are greater than, less than, equality, BETWEEN
, and IN
.
A typical query with CATSEARCH
might include a structured clause as follows to find all rows that contain the word camera ordered by the bid_close
date:
SELECT FROM auction WHERE CATSEARCH(title, 'camera', 'order by bid_close desc')> 0;
The type of structured query you can issue depends on how you create your sub-indexes.
You specify the structured part of a CATSEARCH
query with the structured_query
parameter. The columns you name in the structured expression must have a corresponding sub-index.
For example, assuming that category_id
and bid_close
have a sub-index in the ctxcat
index for the AUCTION
table, you can issue the following structured query:
SELECT FROM auction WHERE CATSEARCH(title, 'camera', 'category_id=99 order by bid_close desc')> 0;
You can use a cursor to process the output of a CATSEARCH
query as you do for CONTAINS
.
When you create an index of type CTXRULE
, you must use the MATCHES
operator to classify your documents. The CTXRULE
index is essentially an index on the set of queries that define your classifications.
For example, if you have an incoming stream of documents that need to be routed according to content, you can create a set of queries that define your categories. You create the queries as rows in a text column. It is possible to create this type of table with the CTX_CLS.TRAIN procedure.
You then index the table to create a CTXRULE
index. When documents arrive, you use the MATCHES
operator to classify each document.
See Also:
Oracle Text Reference for more information on CTX_CLS.TRAIN. |
A MATCHES
query finds all rows in a query table that match a given document. Assuming that a table querytable
has a CTXRULE
index associated with it, you can issue the following query:
SELECT classification FROM querytable WHERE MATCHES(text, 'Smith is a common name in the United States') > 0;
The following example assumes that the table of queries profiles
has a CTXRULE
index associated with it. It also assumes that the table newsfeed
contains a set of news articles to be categorized.
This example loops through the newsfeed table, categorizing each article using the MATCHES
operator. The results are stored in the results
table.
PROMPT Populate the category table based on newsfeed articles PROMPT set serveroutput on; declare mypk number; mytitle varchar2(1000); myarticles clob; mycategory varchar2(100); cursor doccur is select pk,title,articles from newsfeed; cursor mycur is select category from profiles where matches(rule, myarticles)>0; cursor rescur is select category, pk, title from results order by category,pk; begin dbms_output.enable(1000000); open doccur; loop fetch doccur into mypk, mytitle, myarticles; exit when doccur%notfound; open mycur; loop fetch mycur into mycategory; exit when mycur%notfound; insert into results values(mycategory, mypk, mytitle); end loop; close mycur; commit; end loop; close doccur; commit; end; /
The following example displays the categorized articles by category.
PROMPT display the list of articles for every category PROMPT declare mypk number; mytitle varchar2(1000); mycategory varchar2(100); cursor catcur is select category from profiles order by category; cursor rescur is select pk, title from results where category=mycategory order by pk; begin dbms_output.enable(1000000); open catcur; loop fetch catcur into mycategory; exit when catcur%notfound; dbms_output.put_line('********** CATEGORY: '||mycategory||' *************'); open rescur; loop fetch rescur into mypk, mytitle; exit when rescur%notfound; dbms_output.put_line('** ('||mypk||'). '||mytitle); end loop; close rescur; dbms_output.put_line('**'); dbms_output.put_line('*******************************************************'); end loop; close catcur; end; /
A word query is a query on a word or phrase. For example, to find all the rows in your text table that contain the word dog, you issue a query specifying dog as your query term.
You can issue word queries with both CONTAINS
and CATSEARCH
SQL operators.
If multiple words are contained in a query expression, separated only by blank spaces (no operators), the string of words is considered a phrase and Oracle searches for the entire string during a query.
For example, to find all documents that contain the phrase international law, you issue your query with the phrase international law.
Stopwords are words for which Oracle does not create an index entry. They are usually common words in your language that are unlikely to be searched on by themselves.
Oracle Text includes a default list of stopwords for your language. This list is called a stoplist. For example, in English, the words this and that are defined as stopwords in the default stoplist. You can modify the default stoplist or create new stoplists with the CTX_DDL
package. You can also add stopwords after indexing with the ALTERINDEX
statement.
You cannot query on a stopword by itself or on a phrase composed of only stopwords. For example, a query on the word this returns no hits when this is defined as a stopword.
You can query on phrases that contain stopwords as well as non-stopwords such as this boy talks to that girl. This is possible because the Oracle Text index records the position of stopwords even though it does not create an index entry for them.
When you include a stopword within your query phrase, the stopword matches any word. For example, the query:
'Jack was big'
matches phrases such as Jack is big and Jack grew big assuming was is a stopword.
An ABOUT
query is a query on a document theme. A document theme is a concept that is sufficiently developed in the text. For example, an ABOUT
query on US politics might return documents containing information about US presidential elections and US foreign policy. Documents need not contain the exact phrase US politics to be returned.
During indexing, document themes are derived from the knowledge base, which is a hierarchical list of categories and concepts that represents a view of the world. Some examples of themes in the knowledge catalog are concrete concepts such as jazz music, football, or Nelson Mandela. Themes can also be abstract concepts such as happiness or honesty.
During indexing, the system can also identify and index document themes that are sufficiently developed in the document, but do not exist in the knowledge base.
You can augment the knowledge base to define concepts and terms specific to your industry or query application. When you do so, ABOUT
queries are more precise for the added concepts.
ABOUT
queries perform best when you create a theme component in your index. Theme components are created by default for English and French.
Oracle enables you to query on themes with the ABOUT
operator. A stoptheme is a theme that is not to be indexed. You can add and remove stopthemes with the CTX_DLL package. You can add stopthemes after indexing with the ALTER INDEX
statement.
A query expression is everything in between the single quotes in the text_query
argument of the CONTAINS
or CATSEARCH
operator. What you can include in a query expression in a CONTAINS
query is different from what you can include in a CATSEARCH
operator.
A CONTAINS
query expression can contain query operators that enable logical, proximity, thesaural, fuzzy, and wildcard searching. Querying with stored expressions is also possible. Within the query expression, you can use grouping characters to alter operator precedence. This book refers to these operators as the CONTEXT
grammar.
With CONTAINS
, you can also use the ABOUT
query to query document themes.
See Also:
"The CONTEXT Grammar" in this chapter. |
With the CATSEARCH
operator, you specify your query expression with the text_query
operator and your optional structured criteria with the structured_query
argument. The text_query
argument is limited to querying words and phrases. You can use logical operations, such as logical and, or, and not. This book refers to these operators as the CTXCAT
grammar.
If you want to use the much richer set of operators supported by the CONTEXT grammar, you can use the query template feature with CATSEARCH.
With structured_query
argument, you specify your structured criteria. You can use the following SQL operations:
You can also use ORDER BY
clause to order your output.
See Also:
"The CTXCAT Grammar" in this chapter. |
The MATCHES
operator takes a document as input and finds all rows in a query table that match it. You do not specify query expressions in the MATCHES
operator.
Oracle Text supports case-sensitivity for word and ABOUT
queries.
Word queries are case-insensitive by default. This means that a query on the term dog returns the rows in your text table that contain the word dog, Dog, or DOG.
You can enable case-sensitive searching by enabling the mixed_case
attribute in your BASIC_LEXER
index preference. With a case-sensitive index, your queries must be issued in exact case. This means that a query on Dog matches only documents with Dog. Documents with dog or DOG are not returned as hits.
If you have case-sensitivity enabled for word queries and you issue a query on a phrase containing stopwords and non-stopwords, you must specify the correct case for the stopwords. For example, a query on this boy talks to that girl does not return text that contains the phrase This boy talks to that girl, assuming this is a stopword.
ABOUT
queries give the best results when your query is formulated with proper case. This is because the normalization of your query is based on the knowledge catalog which is case-sensitive. Attention to case is required especially for words that have different meanings depending on case, such as turkey the bird and Turkey the country.
However, you need not enter your query in exact case to obtain relevant results from an ABOUT
query. The system does its best to interpret your query. For example, if you enter a query of ORACLE
and the system does not find this concept in the knowledge catalog, the system might use Oracle as a related concept for look-up.
Feedback information provides broader term, narrower term, and related term information for a specified query with a context index. You obtain this information programatically with the CTX_QUERY.HFEEDBACK
procedure.
Broader term, narrower term, and related term information is useful for suggesting other query terms to the user in your query application.
The feedback information returned is obtained from the knowledge base and contains only those terms that are also in the index. This increases the chances that terms returned from HFEEDBACK
produce hits over the currently indexed document set.
See Also:
Oracle Text Reference for more information about using |
Explain plan information provides a graphical representation of the parse tree for a CONTAINS
query expression. You can obtain this information programatically with the CTX_QUERY.EXPLAIN
procedure.
Explain plan information tells you how a query is expanded and parsed without having the system execute the query. Obtaining explain information is useful for knowing the expansion for a particular stem, wildcard, thesaurus, fuzzy, soundex, or ABOUT
query. Parse trees also show the following information:
ABOUT
query normalizationSee Also:
Oracle Text Reference for more information about using |
The CONTEXT
grammar is the default grammar for CONTAINS
. With this grammar, you can add complexity to your searches with operators. You use the query operators in your query expression. For example, the logical operator AND
allows you to search for all documents that contain two different words. The ABOUT
operator allows you to search on concepts.
You can also use the WITHIN
operator for section searching, the NEAR
operator for proximity searches, the stem, fuzzy, and thesaural operators for expanding a query expression.
With CONTAINS
, you can also use the CTXCAT
grammar with the query template feature.
The following sections describe some of the Oracle Text operators.
See Also:
Oracle Text Reference for complete information about using query operators. |
Use the ABOUT
operator in English or French to query on a concept. The query string is usually a concept or theme that represents the idea to be searched on. Oracle returns the documents that contain the theme.
Word information and theme information are combined into a single index. To issue a theme query, your index must have a theme component which is created by default in English and French.
You issue a theme query using the ABOUT
operator inside the query expression. For example, to retrieve all documents that are about politics, write your query as follows:
SELECT SCORE(1), title FROM news WHERE CONTAINS(text, 'about(politics)', 1) > 0 ORDER BY SCORE(1) DESC;
See Also:
Oracle Text Reference for more information about using the ABOUT operator. |
Logical operators such as AND
or OR
allow you to limit your search criteria in a number of ways. The following table describes some of these operators.
Section searching is useful for when your document set is HTML or XML. For HTML, you can define sections using embedded tags and then use the WITHIN
operator to search these sections.
For XML, you can have the system automatically create sections for you. You can query with the WITHIN
operator or with the INPATH
operator for path searching.
You can search for terms that are near to one another in a document with the NEAR
operator.
For example, to find all documents where dog is within 6 words of cat, issue the following query:
'near((dog, cat), 6)'
See Also:
Oracle Text Reference for more information about using the |
You can expand your queries into longer word lists with operators such as wildcard, fuzzy, stem, soundex, and thesaurus.
See Also:
Oracle Text Reference for more information about using these operators. "Is it OK to have many expansions in a query?" in Chapter 5, "Performance Tuning" |
You can use the CTXCAT grammar in CONTAINS queries. To do so, use a query template specification in the text_query
parameter of CONTAINS.
You might take advantage of the CTXCAT grammar when you need an alternative and simpler query grammar.
See Also:
Oracle Text Reference for more information about using these operators. |
You can use the procedure CTX_QUERY.STORE_SQE
to store the definition of a query without storing any results. Referencing the query with the CONTAINS
SQE operator references the definition of the query. In this way, stored query expressions make it easy for defining long or frequently used query expressions.
Stored query expressions are not attached to an index. When you call CTX_QUERY.STORE_SQE
, you specify only the name of the stored query expression and the query expression.
The query definitions are stored in the Text data dictionary. Any user can reference a stored query expression.
See Also:
Oracle Text Reference to learn more about the syntax of |
You define and use a stored query expression as follows:
CTX_QUERY.STORE_SQE
to store the results for the text column. With STORE_SQE
, you specify a name for the stored query expression and a query expression.SQE
operator. Oracle returns the results of the stored query expression in the same way it returns the results of a regular query. The query is evaluated at the time the stored query expression is called.
You can delete a stored query expression using REMOVE_SQE
.
The following example creates a stored query expression called disaster that searches for documents containing the words tornado, hurricane, or earthquake:
begin ctx_query.store_sqe('disaster', 'tornado | hurricane | earthquake'); end;
To execute this query in an expression, write your query as follows:
SELECT SCORE(1), title from news WHERE CONTAINS(text, 'SQE(disaster)', 1) > 0 ORDER BY SCORE(1);
See Also:
Oracle Text Reference to learn more about the syntax of |
You can call user-defined functions directly in the CONTAINS
clause as long as the function satisfies the requirements for being named in a SQL statement. The caller must also have EXECUTE
privilege on the function.
For example, assuming the function french returns the French equivalent of an English word, you can search on the French word for cat by writing:
SELECT SCORE(1), title from news WHERE CONTAINS(text, french('cat'), 1) > 0 ORDER BY SCORE(1);
See Also:
Oracle9i SQL Reference for more information about creating user functions and calling user functions from |
The CTXCAT
grammar is the default grammar for CATSEARCH
. This grammar supports logical operations such as AND
and OR
as well as phrase queries.
The CATSEARCH
query operators have the following syntax:
In addition, you can use the CONTEXT grammar in CATSEARCH queries. To do so, use a query template specification in the text_query parameter.
You might use the CONTAINS grammar as such when you need to issue proximity, thesaurus, or ABOUT queries with a CTXCAT index.
See Also:
Oracle Text Reference for more information about using these operators. |
A CONTAINS
query optimized for response time provides a fast solution for when you need the highest scoring documents from a hitlist.
The example below returns the first twenty hits to standard out. This example uses the FIRST_ROWS
(n) hint and a cursor.
declare cursor c is select /*+ FIRST_ROWS(20) */ title, score(1) score from news where contains(txt_col, 'dog', 1) > 0 order by score(1) desc; begin for c1 in c loop dbms_output.put_line(c1.score||':'||substr(c1.title,1,50)); exit when c%rowcount = 21; end loop; end; /
Besides using query hints, there are other factors that can influence query response time such as:
To count the number of hits returned from a query with only a CONTAINS
predicate, you can use CTX_QUERY.COUNT_HITS
in PL/SQL or COUNT(*)
in a SQL SELECT
statement.
If you want a rough hit count, you can use CTX_QUERY.COUNT_HITS
in estimate mode (EXACT
parameter set to FALSE
). With respect to response time, this is the fastest count you can get.
To count the number of hits returned from a query that contains a structured predicate, use the COUNT(*)
function in a SELECT
statement.
To find the number of documents that contain the word oracle, issue the query with the SQL COUNT
function as follows:
SELECT count(*) FROM news WHERE CONTAINS(text, 'oracle', 1) > 0;
To find the number of documents returned by a query with a structured predicate, use COUNT(*)
as follows:
SELECT COUNT(*) FROM news WHERE CONTAINS(text, 'oracle', 1) > 0 and author = 'jones';
To find the number of documents that contain the word oracle, use COUNT_HITS
as follows:
declare count number; begin count := ctx_query.count_hits(index_name => my_index, text_query => 'oracle', exact => TRUE); dbms_output.put_line('Number of docs with oracle:'); dbms_output.put_line(count); end;
See Also:
Oracle Text Reference to learn more about the syntax of |
|
Copyright © 2000, 2002 Oracle Corporation. All Rights Reserved. |
|