Oracle ConText Option Administrator's Guide

CHAPTER 3. Text Concepts

This chapter introduces the concepts necessary for understanding how text is set up and managed by ConText Option.

The following topics are discussed in this chapter:

Text Operations

Text Columns

Text Loading

Text Storage

External Text

Text Filtering

Text Indexes

Theme Indexes

Base-letter Conversion

Thesauri

Text Operations

ConText Option supports five types of operations that are processed by ConText servers:

loading

DDL

DML

text/theme queries

Linguistic Services

The personality mask for a ConText server determines which operations the server can process.

For more information about personality masks, see "Personality Masks" in "Administration Concepts (Chapter 1)."

Loading

Text loading is an ongoing operation performed by ConText servers running with the Loader personality. It differs from the other text operations in that a request is not made to the Text Request Queue for handling by the appropriate ConText server.

Instead, ConText servers with the Loader personality regularly scan a document repository (i.e., an operating system directory) for documents to be loaded into text columns for indexing.

If a file is found in the directory, the contents of the file are automatically loaded by the ConText server into the appropriate table and column.

For more information about text loading using ConText servers, see "Text Loading" in this chapter.

DDL

A ConText Option DDL operation is a request for the creation, deletion, or optimization of a text/theme index on a text column. DDL requests are sent to the DDL pipe in the Text Request Queue, where available ConText servers with the DDL personality pick up the requests and perform the operation.

DDL operations are requested through the administration tool or the CTX_DDL package.

DML

A text DML operation is a request for the index (text or theme) of a column to be updated. An index update is necessary for a text column in a table when the following modifications have been made to the table:

insertion of a new row

deletion of an existing row

update of the primary key or text columns for an existing row

Requests for index updates are stored in the DML Queue where they are picked up and processed by available ConText servers. The requests can be placed on the queue automatically by ConText Option or they can be placed on the queue manually.

In addition, the system can be configured so DML requests in the queue are processed immediately or in batch mode.

Automatic and Manual DML Queue Notification

DML requests are automatically placed in the queue via a trigger that is created the first time an index is created for a text column.

In some cases, an application developer may not want DML Queue notification to happen automatically, in which case the trigger can be deleted for the table.

DML operations may be called manually using the CTX_DML.REINDEX procedure, which places a request in the DML Queue for a specified document.

Immediate DML Processing

In immediate mode, one or more ConText servers are running with the DML personality. The ConText servers regularly poll the DML queue for requests, pick up any pending requests (up to 10,000 at a time) and update the indexes in real-time.

In this mode, an index is only briefly out of synchronization with the last insert, delete, or update that was performed on the table; however, immediate DML processing can use considerable system resources and create index fragmentation.

Batch DML Processing

In batch mode, no ConText servers are running with the DML personality. DML requests are still placed on the queue via the database triggers; however, the requests are not processed because no DML servers are available.

To start DML processing, the CTX_DML.SYNC procedure is called. This procedure batches all the pending requests in the queue and sends them to the next available ConText server with a DDL personality. Any DML requests that are placed in the queue after SYNC is called are not included in the batch. They are included in the batch that is created the next time SYNC is called.

SYNC can be called with a level of parallelism. The level of parallelism determines the number of batches into which the pending requests are grouped. For example, if SYNC is called with a parallelism level of two, the pending requests are grouped into two batches and the next two available DDL ConText servers process the batches.

Calling SYNC in parallel speeds up the updating of the indexes, but can increase the degree of index fragmentation.
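
As a rough illustration of how a parallelism level maps to batches, the following Python sketch splits a snapshot of pending requests into the requested number of groups. It models the concept only and is not ConText Option code: the function name make_batches is hypothetical, and the round-robin assignment is an assumption, because this guide does not say how requests are distributed among batches.

```python
def make_batches(pending, parallelism=1):
    """Split a snapshot of pending DML requests into `parallelism` batches.

    Hypothetical model: round-robin assignment is an assumption.
    """
    if parallelism < 1:
        raise ValueError("parallelism must be at least 1")
    snapshot = list(pending)  # requests queued later wait for the next call
    batches = [snapshot[i::parallelism] for i in range(parallelism)]
    return [batch for batch in batches if batch]  # drop empty batches


queue = ["doc1", "doc2", "doc3", "doc4", "doc5"]
batches = make_batches(queue, parallelism=2)
queue.append("doc6")  # arrives after the snapshot; not part of these batches
print(batches)
```

Because the batches are built from a copy of the queue, a request that arrives afterward (doc6 here) is left for the next call, which mirrors the behavior described for SYNC.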

Concurrent Index Creation

A text column within a table can be updated while a ConText server is creating an index on the same text column. Any changes to the table being indexed by a ConText server are stored as entries in the DML Queue, pending the completion of the index creation.

After index creation completes, the entries are picked up by the next available DML ConText server and the index is updated to reflect the changes. This avoids a race condition in which the DML Queue request might be processed, but then overwritten by index creation, even though the index creation was processing an older version of the document.

Text/Theme Queries

A text query is any query that selects rows from a table based on the contents of the text stored in the text column(s) of the table.

A theme query is any query that selects rows from a table based on the themes generated for the text stored in the text column(s) of the table.

Note: Theme queries are only supported for English-language text.

ConText Option supports three methods for text and theme queries:

two-step queries

one-step queries

in-memory queries

Before a user can perform a query using any of the methods, the column to be queried must be defined as a text column in the ConText data dictionary and a text and/or theme index must be generated for the column.

For more information about text columns, see "Text Columns" in this chapter.

For more information about text and theme queries, see Oracle ConText Option Application Developer's Guide.

Two-step Queries

In a two-step query, the user performs two distinct operations. First, the user calls the ConText Option CONTAINS PL/SQL procedure for a column. The CONTAINS procedure performs a query of the text stored in a text column and stores the results in a user-defined table.

The user then executes a SQL statement on the result table to return the list of documents (hitlist) or some subset of the documents.

One-step Queries

In a one-step query, the ConText Option SQL function, CONTAINS, is called directly in the WHERE clause of a SQL statement. The CONTAINS function accepts a column name and query expression as arguments and generates a list of the textkeys that match the query expression and a relevance score for each document.

The results generated by CONTAINS are returned through the SELECT clause of the SQL statement.

In-memory Queries

In an in-memory query, PL/SQL stored procedures and functions are used to query a text column and store the results in a query buffer, rather than in the result tables used in two-step queries.

The user opens a CONTAINS cursor to the query buffer in memory, executes a text query, then fetches the hits from the buffer, one at a time.

Stored Query Expressions

In a stored query expression (SQE), the results of a query expression for a text column, as well as the definition of the SQE, are stored in database tables. The results of a SQE can be accessed within a query (one-step, two-step, or in-memory) for performing iterative queries and improving query response.

The results of an SQE are stored in an internal table in the index (text or theme) for the text column. The SQE definition is stored in a system-wide, internal table owned by CTXSYS. The SQE definitions can be accessed through the views, CTX_SQES and CTX_USER_SQES.

For more information about the SQE result table, see "SQR Table" in "ConText Index Tables and Indexes (Appendix C)."

For more information about the SQE views, see "ConText Data Dictionary Views (Appendix B)."

Linguistic Services

The Linguistic Services are used to analyze the content of English-language documents. The application developer uses the Linguistic Services to create different views of the contents of documents.

The Linguistic Services currently provide two services for English-language documents stored in an Oracle database:

theme generation

Gist generation

Note: Once the theme and Gist information has been generated, the ConText servers that have been designated for processing Linguistic Services operations can be shut down or redesignated.

For more information about themes and Gists, as well as using the Linguistic Services in applications, see Oracle ConText Option Application Developer's Guide.

Text Columns

A text column is any column used to store either text or text references (pointers) in a database table or view. ConText Option recognizes a column as a text column if one or more policies are defined for the column.

Text columns can be any of the supported Oracle datatypes; however, text columns are usually one of the following datatypes:

CHAR

VARCHAR2

LONG

LONG RAW

A table can contain more than one text column; however, each text column requires a separate policy.

For more information about policies and text columns, see "Policies" in "Understanding the ConText Data Dictionary (Chapter 4)."

Textkeys

ConText Option uses textkeys to uniquely identify a document in a text column. The textkey for a text column usually corresponds to the primary key for the table or view in which the column is located; however, the textkey for a column can also reference unique keys (columns) that have been defined for the table.

When a policy is defined for a column, the textkey for the column is specified.

Composite Textkeys

A textkey for a text column can consist of up to sixteen primary or unique key columns.

During policy definition, the primary/unique key columns are specified, using a comma to separate each column name.

In two-step queries, the columns in a composite textkey are returned in the order in which the columns were specified in the policy.

In in-memory queries, the columns in a composite textkey are returned in encoded form (e.g. 'P1,P2,P3'). This encoded textkey must be decoded to access the individual columns in the textkey.

For more information about encoding and decoding composite textkeys, see Oracle ConText Option Application Developer's Guide.

Note: There are some limits to composite textkeys that must be considered when setting up your tables and columns, and when creating policies for the columns.

Column Name Limitations

There is a 256 character limit, including the comma separators, on the string of column names that can be specified for a composite textkey.

Because the comma separators count toward this limit, the combined length of the column names themselves can be at most 256 minus (number of columns minus 1). For a textkey with the maximum of sixteen columns, this leaves 241 characters (256 - 15).

This limit is enforced during policy creation.
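
The check described above can be sketched in a few lines of Python. This is an illustrative model under stated assumptions, not the validation ConText Option performs: the function name check_textkey_columns and the sample column names are hypothetical.

```python
MAX_COLUMNS = 16        # stated limit on key columns in a composite textkey
MAX_NAME_STRING = 256   # stated limit, comma separators included


def check_textkey_columns(column_names):
    """Return the comma-separated column string, or raise ValueError."""
    if not 1 <= len(column_names) <= MAX_COLUMNS:
        raise ValueError("a textkey needs between 1 and 16 columns")
    joined = ",".join(column_names)
    if len(joined) > MAX_NAME_STRING:
        raise ValueError(
            f"column name string is {len(joined)} characters; "
            f"the limit is {MAX_NAME_STRING}"
        )
    return joined


# Hypothetical column names, for illustration only.
print(check_textkey_columns(["REGION", "DOC_ID", "REVISION"]))
```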

Column Length Limitations

There is a 256 character limit on the combined lengths of the columns in a composite textkey. This is due to the way the textkey values for composite textkeys are stored in the index.

For a given row, ConText Option concatenates all of the values from the columns that constitute the composite textkey into a single value, using commas to separate the values from each column.

As such, the combined length of the textkey column values can be at most 256 minus (number of columns minus 1). For a textkey with the maximum of sixteen columns, this leaves 241 characters (256 - 15).

Note: If you allow values that contain commas (e.g., dates) in your textkey columns, the commas are escaped automatically by ConText Option during indexing. The escape character is the backslash character.

In addition, if you allow values that contain backslashes (e.g., dates or directory structures in Windows), ConText Option uses the backslash character to escape the backslashes.

As a result, when calculating the limit for the length of columns in a composite textkey, the overall limit of 256 (241) characters must include the backslash characters used to escape commas and backslashes contained in the data.
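
To make the effect of escaping on the length limit concrete, here is a minimal Python sketch of the concatenation and escaping rules described above. It is a model for estimating lengths, not the internal ConText Option routine: the function names encode_textkey and decode_textkey are hypothetical, and the decoder is an assumption that simply reverses the escaping shown here.

```python
MAX_TEXTKEY_LENGTH = 256  # stated limit, separators and escapes included


def encode_textkey(values):
    """Join key values with commas, escaping backslashes and commas."""
    escaped = []
    for value in values:
        text = str(value).replace("\\", "\\\\")   # escape backslashes first
        escaped.append(text.replace(",", "\\,"))  # then escape commas
    encoded = ",".join(escaped)
    if len(encoded) > MAX_TEXTKEY_LENGTH:
        raise ValueError("encoded textkey exceeds 256 characters")
    return encoded


def decode_textkey(encoded):
    """Split an encoded textkey back into its column values."""
    values, current, chars = [], [], iter(encoded)
    for ch in chars:
        if ch == "\\":
            current.append(next(chars, ""))  # keep the escaped character as-is
        elif ch == ",":
            values.append("".join(current))
            current = []
        else:
            current.append(ch)
    values.append("".join(current))
    return values


key = encode_textkey(["Jan 5, 1997", "C:\\docs", 42])
print(key, len(key))
print(decode_textkey(key))
```

Each escaped comma or backslash adds one character to the stored value, which is why data containing many of them reaches the limit sooner than the raw column lengths suggest.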

Text Loading

The initial loading of text into tables in the database is required for using ConText Option to perform queries and generate linguistic output. This task can be performed from within an application; however, if you have a large set of documents, you may want to perform the loading as a batch process.

For more information about building text loading capabilities into your applications, see Oracle ConText Option Application Developer's Guide.

Loading Text Strings

For loading strings of plain (ASCII) text into individual rows (documents), you can use the INSERT command in SQL.

For more information about INSERT and SQL, see "Oracle7 Server SQL Reference."

Batch Loading

Either SQL*Loader or ctxload can be used to perform batch loading of text into a database column.

SQL*Loader

To perform batch loading of plain (ASCII) text into a table, you can use SQL*Loader, a data loading utility provided by Oracle.

For more information about SQL*Loader, see "Oracle7 Server Utilities."

ctxload Utility

For batch text loading of plain or formatted text, you can use the ctxload command-line utility provided by ConText Option.

The ctxload utility loads text from a load file into a specified database table. The load file can contain multiple documents, but must use a defined structure and syntax.

In addition, the load file can contain ASCII text or it can contain pointers to separate files containing ASCII text or formatted text.

Note: ctxload is best suited for loading text that you want to store in a table using direct data store. If you want to use the external data store (i.e. the OSFILE or URL Tile) to store file pointers in the database, it is possible to use ctxload; however, you should use another loading method, such as SQL*Loader.

For more information, see "Using ctxload with External Data Store Columns" in "Executables and Utilities (Chapter 8)."

Automated Batch Loading

If you set up sources for your columns, you can use ConText servers running with the Loader personality to automate batch loading of text from load files.

If a ConText server is running with the Loader personality, it regularly checks all the sources that have been defined for columns in the database, then scans specified directories for new files. When a new file appears, it calls ctxload to load the contents of the file into the appropriate column.

When loading of the file contents is successful, the server deletes the file to prevent the contents from being loaded again.

User-Defined Translators

If the contents of the file to be loaded are not in the load file format required by ctxload, the file needs to be formatted before loading.

To ensure that the files are in the correct format, a user-defined translator can be specified as one of the preferences in the source for the column.

A user-defined translator is any program that accepts a plain ASCII text file as input and generates a formatted, ASCII text load file for ctxload as its output. The user-defined translator could also be used to perform pre-loading cleanup and spell-checking of your text.

After the contents of the load file have been successfully loaded into the column, the load file generated by the translator is deleted along with the original input file to prevent the contents from being loaded again.

Error Handling

If an error occurs while loading, the error is written to the error log, which can be viewed using CTX_INDEX_ERRORS. In addition, the original file is not deleted.
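
The scan, translate, load, and delete cycle described in this section can be sketched as follows. This Python model is illustrative only: load_new_files is a hypothetical name, and the load and translate arguments are stand-in callables for ctxload and a user-defined translator, not real ConText Option interfaces.

```python
import os
import tempfile


def load_new_files(directory, load, translate=None):
    """Scan `directory` once; load each file and delete it on success.

    `load` and `translate` are hypothetical callables standing in for
    ctxload and a user-defined translator.
    """
    errors = []
    for name in sorted(os.listdir(directory)):
        path = os.path.join(directory, name)
        if not os.path.isfile(path):
            continue
        load_file = path
        try:
            if translate is not None:
                load_file = translate(path)  # path of the generated load file
            load(load_file)
        except Exception as exc:
            errors.append((name, str(exc)))  # log it; keep the original file
            continue
        if load_file != path:
            os.remove(load_file)             # remove the translator output
        os.remove(path)                      # prevents loading the file twice
    return errors


with tempfile.TemporaryDirectory() as repo:
    for name, body in [("good.txt", "hello"), ("bad.txt", "")]:
        with open(os.path.join(repo, name), "w") as handle:
            handle.write(body)

    def fake_load(path):
        # Stand-in loader that rejects empty files.
        if os.path.getsize(path) == 0:
            raise ValueError("empty load file")

    errors = load_new_files(repo, fake_load)
    remaining = os.listdir(repo)

print(errors, remaining)
```

A real Loader server repeats this scan at regular intervals; the sketch performs a single pass so that the success and error paths are easy to follow.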

Text Storage

ConText Option supports three methods of storing text in a column:

direct

external

master/detail

Note: The tables illustrated in the following sections are examples only. The column names and definitions for actual tables used to store text will vary depending on the needs of your application.

Direct Storage

With direct storage, text for documents is stored directly in a database column.

The following example illustrates a table in which text is stored directly in a column:

Table: DIRECT_TEXT
Columns: TEXTKEY   NUMBER (primary or unique key)
         TEXTDATE  DATE
         AUTHOR    VARCHAR2(50)
         NOTES     VARCHAR2(2000) (text column with
                   direct storage)
         TEXT      LONG (text column with direct storage)

The requirements for storing text directly in a column are relatively straightforward. The text is physically stored in a text column and the policy for the text column contains a Data Store preference that utilizes the DIRECT Tile.

External Storage

With external storage, the text column does not contain the actual text of the document, but rather stores a pointer to the file that contains the text of the document.

Suggestion: If text is stored as external text in a column, the column should be either a CHAR or VARCHAR2 column. LONG and LONG RAW columns are best suited for documents stored internally in the database.

The pointer can be either:

a file name for accessing text stored in local operating system files

a Uniform Resource Locator (URL) for accessing text stored in HTML files on the World Wide Web or locally

The following example illustrates a table that uses external data storage:

Table:  EXTERNAL_TEXT
Columns:  TEXTKEY   NUMBER (primary or unique key)
          TEXTDATE  DATE
          AUTHOR    VARCHAR2(50)
          NOTES     VARCHAR2(2000) (text column with
                    direct text storage)
          TEXT      VARCHAR2(100) (text column storing
                    OS file name)

The only difference between a table used to store text internally and externally would be the datatype of the text column. In an external table, the text column would typically be assigned a datatype of VARCHAR2, rather than LONG, because the column contains a pointer to a file rather than the contents of the file (which would require more space to store).

However, there are additional requirements for storing text externally due to the different methods (file names and URLs) of accessing text stored in flat files.

For more information about the requirements for storing text externally, see "External Text" in this chapter.

Master/Detail Storage

Master/detail storage is for documents stored directly in a text column; however, each document consists of one or more rows which must be indexed as a single row.

The text column used for storing text in a master/detail relationship can be in a single table or in master/detail tables. In a single table configuration, the table contains a textkey column to identify the document and a line number column to identify each segment of the document.

In a two table configuration, the master table contains the textkey column and the detail table contains the line number column and a foreign key to the textkey column in the master table.

In either configuration, the textkey and the line number columns comprise the primary key for the table used to store the text.

The following example illustrates a two table configuration that could be used for storing text in a master-detail relationship:

Table:  MD_HEADER
Columns:  TEXTKEY   NUMBER (primary or unique key)
          TEXTDATE  DATE
          AUTHOR    VARCHAR2(50)

Table:  MD_TEXT
Columns:  TEXTKEY   NUMBER (foreign key to
                    MD_HEADER.TEXTKEY)
          LINE_NUM  NUMBER (unique identifier for text
                    column -- TEXTKEY and LINE_NUM are
                    primary key)
          TEXT      VARCHAR2(80) (text column
                    with direct text storage)
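
To show what indexing several rows as a single document amounts to, the following Python sketch groups detail rows by textkey and joins them in line-number order. It is a conceptual model only: the function name assemble_documents and the sample rows are hypothetical, and joining segments with a newline is an assumption.

```python
from collections import defaultdict


def assemble_documents(detail_rows):
    """Group (textkey, line_num, text) rows into one document per textkey."""
    lines = defaultdict(list)
    for textkey, line_num, text in detail_rows:
        lines[textkey].append((line_num, text))
    return {
        textkey: "\n".join(text for _, text in sorted(parts))
        for textkey, parts in lines.items()
    }


# Rows shaped like the MD_TEXT example: (TEXTKEY, LINE_NUM, TEXT).
md_text = [
    (1, 2, "second line of document 1"),
    (1, 1, "first line of document 1"),
    (2, 1, "only line of document 2"),
]
print(assemble_documents(md_text))
```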

External Text

The requirements for storing text externally are more complicated than storing text directly in a column due to the two different methods of accessing text stored in external files:

file names using OSFILE Data Store Tile

URLs using the URL Data Store Tile

Text Stored as File Names

For text stored as file names pointing to external files, the name and location of the file must be stored.

Directory Path Names

For external files accessed through the file system, the directory path where the files are located must be specified. The path can be stored as part of the file name either in the text column or in the Data Store preference that you create for the OSFILE Tile.

Note: If the preference does not contain the directory path for the files, ConText Option requires the directory path to be included as part of the file name stored in the text column.
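
The rule in the preceding note can be modeled with a short Python sketch. It is an assumption-laden illustration, not the OSFILE Tile implementation: resolve_external_file is a hypothetical name, the sample paths are invented, and POSIX-style paths are used so the result is the same on any platform.

```python
import posixpath


def resolve_external_file(column_value, preference_path=None):
    """Return the path to open for a file name stored in a text column."""
    if preference_path:
        # The directory path comes from the Data Store preference.
        return posixpath.join(preference_path, column_value)
    if not posixpath.dirname(column_value):
        raise ValueError("no directory path in the preference or the file name")
    return column_value  # the column value already carries the path


print(resolve_external_file("intro.txt", "/docs/text"))
print(resolve_external_file("/docs/text/intro.txt"))
```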

File Access

All the external files referenced in the column must be accessible from the server machine on which the ConText server is running. This can be accomplished by storing the files locally in the file system for the server machine or by mounting the remote file system to the server machine.

File Permissions

File permissions for external files in which text is stored must be set accordingly to allow ConText Option to access the files. If the file permissions are not set properly for a file and ConText Option cannot access the file, the file cannot be indexed or retrieved by ConText Option.

Text Stored as URLs

For external Web files, the complete address for each file must be stored as a URL in the text column and the URL Tile utilized in the policy for the column.

Note: Text that contains HTML tags and is stored directly in a text column is considered internal text. As such, the Data Store preference for the text column policy would use the DIRECT or MASTER DETAIL Tiles.

In addition, Web files can be any format supported by the World Wide Web, including HTML files, plain ASCII files, and proprietary formats such as PDF and Word. The filter for the column must be able to recognize and process any of the possible document formats that may be encountered on the Web.

A URL consists of the protocol for accessing the Web file and the address of the file, in the following format:

	protocol://file_address

The ConText Option URL data store supports two protocols:

hypertext transfer protocol (HTTP)

file protocol

Hypertext Transfer Protocol

If a URL uses HTTP, the file address contains the name of the Web server where the file is located and the location of the file on the Web server.

For example:

	http://my_server.com/welcome.html

	http://www.oracle.com

Note: The file address may also (optionally) contain the port on which the Web server is listening.
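
The parts of a URL named above (protocol, Web server, optional port, and file location) can be separated with Python's standard urllib.parse module, as in this sketch. The function name describe_url is hypothetical, and the second sample URL, with port 8080, is invented for illustration.

```python
from urllib.parse import urlsplit

SUPPORTED_PROTOCOLS = {"http", "file"}  # the two protocols named in this chapter


def describe_url(url):
    """Split a URL into protocol, host, port, and path."""
    parts = urlsplit(url)
    if parts.scheme not in SUPPORTED_PROTOCOLS:
        raise ValueError(f"unsupported protocol: {parts.scheme!r}")
    return {
        "protocol": parts.scheme,
        "host": parts.hostname,
        "port": parts.port,   # None when the URL does not name a port
        "path": parts.path,
    }


print(describe_url("http://my_server.com/welcome.html"))
print(describe_url("http://www.oracle.com:8080/index.html"))
```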

A Web server is any machine that uses HTTP to accept requests for files and transfer the files to the requestor.

With HTTP, the URL data store can be used to index files in an intranet, as well as files on any publicly-accessible Web servers on the World Wide Web.

Intranets are private, company-wide networks that use the Internet to link machines in the network, but are protected from public access on the Internet via a gateway proxy server which acts as a firewall.

For security reasons, access to an intranet is generally restricted to machines within the firewall; however, machines in an intranet can access the World Wide Web through the gateway server if they have the appropriate permission and security clearance.

File Protocol

If a URL uses the file protocol, the address for the file contains the directory path for the location of the file on the local file system.

For example:

	file://private/docs/html/intro.html

The file referenced by a URL using the file protocol must reside locally on a file system that is accessible to the machine running ConText Option.

Because the file is accessed through the operating system, the machine on which the file is located does not need to be configured as a Web server. However, the same requirements that apply to text stored as file names apply to text stored as URLs which use the file protocol.

If the requirements are not met, ConText Option returns one or more error messages.

For more information, see "Text Stored as File Names" in this chapter.

For a complete list of the error messages returned by the URL data store, see Oracle ConText Option Messages.

Document Access Using HTTP

When HTTP is used to retrieve a URL from the data store, ConText Option acts as a client, submitting a request to a Web server for the file (document) referenced by the URL. If the request is successful, the Web server returns the file to ConText Option where it can be indexed.

Proxy Servers

If the document to be accessed is located on the World Wide Web outside a firewall and the machine on which ConText Option is installed is inside a firewall, the host machine that serves as the proxy (gateway) for the machine must be specified as one of the attributes for the URL Tile.

In addition, a sub-string of host or domain names can be specified to identify machines internal to a firewall. Access to these machines does not require a proxy.
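
The proxy decision can be pictured as a simple sub-string test, sketched here in Python. The function name needs_proxy and the internal domain are hypothetical, and the case-insensitive comparison is an assumption; the actual matching rules belong to the URL Tile attributes.

```python
def needs_proxy(host, no_proxy_substrings):
    """Return True when a request to `host` should go through the proxy."""
    host = host.lower()
    return not any(s.lower() in host for s in no_proxy_substrings)


internal = [".example-corp.com"]  # hypothetical internal domain
print(needs_proxy("docs.example-corp.com", internal))  # False: inside the firewall
print(needs_proxy("www.oracle.com", internal))         # True: outside the firewall
```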

Multi-threading

In a single-threaded environment, a request for a URL blocks all other requests until a response to the request is returned. Because a response may not be returned for a long time, a single-threaded environment in any text system using HTTP to access files could create a bottleneck.

To prevent this type of bottleneck, the URL data store supports multi-threading. With multi-threading, while one thread is blocked, waiting to communicate with a Web server, another thread can retrieve a document from another Web server.

The URL Tile supports specifying the number of simultaneous threads allowed for a text column.
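
The benefit of multi-threading can be seen in a small Python model that uses the standard concurrent.futures module. The fetch function is a stand-in that sleeps instead of contacting a Web server, the URLs are invented, and max_threads is a hypothetical parameter that plays the role of the thread-count setting.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fetch(url):
    """Stand-in for an HTTP request; sleeps to simulate a slow Web server."""
    time.sleep(0.1)
    return url, "contents of " + url


def fetch_all(urls, max_threads):
    """Fetch all URLs using at most `max_threads` simultaneous threads."""
    with ThreadPoolExecutor(max_workers=max_threads) as pool:
        return dict(pool.map(fetch, urls))


urls = [
    "http://a.example/1.html",
    "http://b.example/2.html",
    "http://c.example/3.html",
]
start = time.perf_counter()
documents = fetch_all(urls, max_threads=3)
elapsed = time.perf_counter() - start
print(len(documents), "documents in", round(elapsed, 2), "seconds")
```

With three threads the waits overlap, so the total time is close to one simulated request rather than the sum of all three.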

Redirection

The response to a request to retrieve a URL may be a new (redirected) document to retrieve. The URL data store supports this type of redirection by automatically processing the redirection to retrieve the new document. However, to avoid infinite loops, the URL data store limits the number of redirections that it attempts to process.
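
A redirection limit of this kind can be sketched in Python as follows. The names resolve and TooManyRedirects are hypothetical, the redirect table stands in for responses from Web servers, and the default limit of 5 is an assumption, because this guide does not state the actual limit.

```python
class TooManyRedirects(Exception):
    """Raised when the redirection limit is reached."""


def resolve(url, redirects, max_redirects=5):
    """Follow `redirects` (a dict of url -> new url) for a limited number of hops."""
    hops = 0
    while url in redirects:
        if hops == max_redirects:
            raise TooManyRedirects(f"gave up after {max_redirects} redirections")
        url = redirects[url]
        hops += 1
    return url


moved = {"http://old.example/a": "http://new.example/a"}
print(resolve("http://old.example/a", moved))

loop = {"http://x.example/": "http://y.example/", "http://y.example/": "http://x.example/"}
try:
    resolve("http://x.example/", loop)
except TooManyRedirects as exc:
    print("error:", exc)
```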

Timeouts

The time necessary to retrieve a URL using HTTP may vary widely, depending on where the Web server is geographically located. The Web server may even be temporarily unreachable.

To allow control over the length of time an application waits for a response to an HTTP request for a URL, the URL Tile supports specifying a maximum timeout.

Exception Handling

When using HTTP to access files stored as URLs in the database, a number of exceptions can occur. These exceptions are written as errors to the CTX_INDEX_ERRORS view.

The URL data store returns error messages for the following exceptions:

the document referenced in the URL has been permanently moved or cannot be found

access to the document referenced in the URL requires authentication which the user does not have or requires payment which the user must provide

access to the document referenced in the URL is denied by the Web server

the Web server referenced in the URL does not comply with HTTP standards

the specified URL is incorrectly formatted

connection to the Web server is denied (this may occur when the incorrect port is referenced in the URL or the Web server is outside the firewall of an intranet)

the wait for a response to a request to retrieve a URL from a Web server exceeds the maximum timeout specified for the URL preference in the text column policy

the maximum number of supported redirections was reached while attempting to retrieve the document referenced in the URL

the length of the URL exceeds the maximum specified for the URL preference in the text column policy

the size of the document referenced in the URL exceeds the maximum specified for the URL preference in the text column policy

For a complete list of the error messages returned by the URL data store, see Oracle ConText Option Messages.

Text Filtering

ConText Option supports both plain text and formatted text (e.g., Microsoft Word, WordPerfect). In addition, ConText Option supports text that contains hypertext markup language (HTML) tags.

Regardless of the format, ConText Option requires text to be filtered for the purposes of:

indexing (DDL and DML) or processing through the Linguistic Services

producing CTX_QUERY.HIGHLIGHT output

ConText Option provides a number of internal filters for filtering plain and formatted text. In addition, ConText Option provides the ability for users to define their own external filters.

ConText Option also provides the ability to store multiple document formats in a column.

Internal Filters

ConText Option provides internal filters for:

plain text

plain text containing HTML tags (used in the World Wide Web)

formatted text

Plain Text Filtering

Plain text requires little or no filtering because the text is already in the format that ConText Option requires for identifying tokens.

HTML Filtering

ConText Option provides an internal filter that supports English and Japanese text containing HTML tags from HTML versions 1, 2, and 3.

Note: For non-English and non-Japanese documents that contain HTML tags, an external filter must be used.

For English and Japanese text with HTML tags, the HTML filter processes all text that is delimited by the standard HTML tag characters (angle brackets).

All HTML tags are either ignored or converted to their representative characters in the ASCII character set. This ensures that only the text of the document is processed during indexing or by the Linguistic Services.
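
The general idea of this kind of filtering can be illustrated with Python's standard html.parser module. This sketch is not the ConText Option HTML filter: the names TagStripper and filter_html are hypothetical, and the handling shown (drop every tag, decode character references, collapse whitespace) is a simplified assumption about what such a filter does.

```python
from html.parser import HTMLParser


class TagStripper(HTMLParser):
    """Collect document text, dropping tags and decoding character references."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def filter_html(markup):
    """Return the text of `markup` with tags removed and whitespace collapsed."""
    parser = TagStripper()
    parser.feed(markup)
    parser.close()
    return " ".join("".join(parser.chunks).split())


sample = "<h1>Text &amp; Themes</h1>\n<p>Indexing <b>plain</b> text.</p>"
print(filter_html(sample))
```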

Formatted Text Filtering

ConText Option provides internal filters for filtering English and Western European text in a number of proprietary word processing formats.

Note: For Japanese, Korean, and Chinese formatted text, external filters must be used.

The filters extract plain, ASCII text from a document, then pass the text to ConText Option, where the text is indexed or processed through the Linguistic services.

The following document formats are supported by the internal filters:

Format                        Version
AMIPRO for Windows            1, 2, 3
Lotus 1-2-3 for DOS           4, 5
Lotus 1-2-3 for Windows       2, 3, 4, 5
Microsoft Word for Windows    2, 6.x, 7.0
Microsoft Word for DOS        5.0, 5.5
Microsoft Word for MAC        3, 4, 5.x
Word Perfect for Windows      5.x, 6.x
Word Perfect for DOS          5.0, 5.1, 6.0
Xerox XIF for UNIX            5, 6

For formats not supported by ConText Option, users can define their own external filters.

Note: Of the document formats listed above, only the following formats support WYSIWYG viewing in the Windows 32-bit viewer (OCX):

Microsoft Word for Windows 2 and 6.x
Word Perfect for DOS 5.0, 5.1, 6.0
Word Perfect for Windows 5.x, 6.x

However, plain (ASCII) text also can be viewed in the 32-bit viewer.

For more information about the 32-bit viewer, see Oracle ConText Option Application Developer's Guide.

External Filters

External filters can be used for a number of purposes, including:

indexing text stored in a format, such as PDF, for which an internal filter does not exist

removing unnecessary text or markup in a document prior to processing

For example, the Linguistic Services rely on text that is grouped into logical paragraphs. If the text stored in the database does not contain clearly-identified paragraphs, the Linguistic Services may generate erroneous or incomplete output for the text.

An external filter that outlines the paragraph boundaries according to ConText Option standards could be created to ensure that the Linguistic Services are provided with an ordered, logical text feed.

Note: External filters do not support WYSIWYG viewing in the Windows 32-bit viewer (OCX).

For more information about the 32-bit viewer, see Oracle ConText Option Application Developer's Guide.

External Filter Requirements

An external filter can be any program (e.g. shell script, C program, perl script) that processes a document and produces ASCII output that can be indexed or processed through the Linguistic Services.

If the document is in a proprietary format, the program must recognize the format tags for the document and be able to convert the formatted text into ASCII text.

In addition, the program must be an executable that can be run from the command-line and accepts two arguments:

the name of an input file (stores the document to be filtered)

the name of an output file (stores the ASCII text of the filtered document)
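As a minimal sketch of a program that meets these requirements, the following Python script accepts the name of an input file and the name of an output file and writes ASCII text. The markup it removes (anything between angle brackets) and the function name filter_document are assumptions made for illustration; a real filter would have to understand the actual format of the documents being filtered.

```python
import re
import sys

# Hypothetical markup: anything between angle brackets is treated as a tag.
TAG = re.compile(r"<[^>]*>")


def filter_document(in_path, out_path):
    """Read a document, drop tags, and write the remaining text as ASCII."""
    # latin-1 maps every byte to a character, so reading never fails.
    with open(in_path, encoding="latin-1") as src:
        text = src.read()
    text = TAG.sub("", text)
    # Characters outside ASCII are replaced with "?" so the output is pure ASCII.
    with open(out_path, "w", encoding="ascii", errors="replace") as dst:
        dst.write(text)


# Command-line entry point: <script> <input file> <output file>
if __name__ == "__main__" and len(sys.argv) == 3:
    filter_document(sys.argv[1], sys.argv[2])
```

The script does nothing unless it is given exactly two arguments, which mirrors the two-argument calling convention described above.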

Using External Filters

The process model for using external filters is:

1. Create a filter in the form of a command-line executable.

2. Store the executable on the same server machine as ConText Option. Note: The filter executable must be located in the bin subdirectory in the ConText Option directory on the server machine where ConText Option is installed and running.

3. Create a Filter preference that calls the filter executable.

The Tile you use to create the preference depends on whether the column stores documents in a single format or in multiple formats.

4. Create a policy that includes the Filter preference for the external filter. For more information about creating Filter preferences, see "Managing Preferences" in "Setting Up and Managing Text (Chapter 5)."

Supplied External Filters

ConText Option provides a number of external filters which can be used for filtering documents in a variety of formats.

Note: These external filters are not shipped on the Oracle7 Server/ConText Option product CD. They are shipped on a separate CD that is included in your distribution.

Filtering for Single-Format Columns

For columns that store documents in only one format, a single filter is specified in the Filter preference for the column policy. The filtering method that you specify for the column is determined by whether the format is supported by the internal or external filters.

If the format is supported by the internal filters, the appropriate internal filter can be used

If the format is not one of the supported internal filter formats, an external filter must be used

For more information about creating Filter preferences, see "Managing Preferences" in "Setting Up and Managing Text (Chapter 5)."

Filtering for Multiple-Format Columns

For columns that store multiple formats, the filtering method is determined by whether the formats are supported by the internal filters, external filters, or both:

If the formats are all supported by the internal filters, the internal Autorecognize filter can be used in the Filter preference

If none of the formats are supported by the internal filters, an external filter must be specified for each of the formats in the column

If some, but not all, of the formats are supported by the internal filters, an external filter must be specified for each of the unsupported formats in the column

Note: In columns that use external filters, only those external filter formats supported by ConText Option for multiple-format columns can be used.

For a complete list of supported formats, see "Supported Formats for Multiple-Format Columns" in "ConText Data Dictionary (Chapter 9)."

For more information about creating Filter preferences, see "Managing Preferences" in "Setting Up and Managing Text (Chapter 5)."

Autorecognize Filter (Internal)

Autorecognize is an internal filter that automatically recognizes any of the following document formats and extracts the text from the document using the appropriate internal filters:

Format                          Version
AMIPRO for Windows              1, 2, 3
ASCII                           N/A
HTML                            1, 2, 3
Lotus 1-2-3 for DOS             4, 5
Lotus 1-2-3 for Windows         2, 3, 4, 5
Microsoft Word for Windows      2, 6.x
Microsoft Word for DOS          5.0, 5.5
Microsoft Word for MAC          3, 4, 5.x
Word Perfect for Windows        5.x, 6.x
Word Perfect for DOS            5.0, 5.1, 6.0
Xerox XIF for UNIX              5, 6

Note: Microsoft Word for Windows 7.0 is not one of the supported formats for Autorecognize. As a result, ConText Option only supports storing Microsoft Word for Windows 7.0 documents in single-format columns.

External-Only Columns

For multiple-format columns that use only external filters, each filter executable for the formats in the column must be explicitly named in the Filter preference for the column policy.

Mixed Format Columns

If the column uses both internal and external filters, each external filter executable must be explicitly named in the Filter preference for the column policy. The internal filters do not have to be specified.

During filtering, ConText Option recognizes whether a format uses the internal or external filters and calls the appropriate filter.

Note: If required, internal filters can be overridden in a Filter preference by explicitly calling an external filter for the format. This can be useful if you have an external filter that provides additional filtering not provided by the internal filters.

For example, you may have MS Word documents that you want spellchecked before indexing. You could create an external MS Word filter that performs the spellchecking and specify the external filter in the Filter preference for the column policy.
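The selection rule for mixed format columns can be sketched in Python as follows. This is an illustration of the logic only, not the ConText Option implementation: the function choose_filter, the abbreviated format list, and the executable names pdf_filter and word_spellcheck are all hypothetical.

```python
# Formats handled by the internal Autorecognize filter (abbreviated list).
INTERNAL_FORMATS = {"ASCII", "HTML", "Microsoft Word for Windows"}


def choose_filter(fmt, external_filters):
    """Return ("external", executable) or ("internal", None) for a format.

    external_filters maps a format name to the filter executable named in
    the Filter preference; an entry here overrides any internal filter.
    """
    if fmt in external_filters:
        return ("external", external_filters[fmt])
    if fmt in INTERNAL_FORMATS:
        return ("internal", None)
    raise ValueError(f"no filter available for format: {fmt}")


# Hypothetical preference: PDF has no internal filter, and Word is overridden.
preference = {"PDF": "pdf_filter", "Microsoft Word for Windows": "word_spellcheck"}
print(choose_filter("HTML", preference))
print(choose_filter("PDF", preference))
print(choose_filter("Microsoft Word for Windows", preference))
```

Checking the explicitly named external filters first is what allows an external filter to override an internal filter for the same format.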

Text Indexes

A text index is the ConText Option construct which allows ConText servers to process text queries and return information based on the content of the text stored in an Oracle database.

A text index is basically an inverted index, consisting of:

a list of every unique token (word) in the collected documents in a text column

for each word, a string that identifies each document in which the word occurs and the location of each occurrence within each document
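The structure of an inverted index can be illustrated with a short Python sketch. The tokenization rule and the use of word offsets as locations are assumptions made for the example; the actual format of the location string stored by ConText Option is not described here.

```python
import re
from collections import defaultdict


def build_inverted_index(documents):
    """Map each token to {document ID: [word positions]}."""
    index = defaultdict(dict)
    for doc_id, text in documents.items():
        for position, token in enumerate(re.findall(r"\w+", text.lower())):
            index[token].setdefault(doc_id, []).append(position)
    return index


docs = {1: "the cat sat", 2: "the dog sat on the mat"}
index = build_inverted_index(docs)
print(index["the"])   # {1: [0], 2: [0, 4]}
print(index["sat"])   # {1: [2], 2: [2]}
```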

There is a one-to-one relationship between a text index and the policy for which it was created.

Stages of Text Indexing

Text indexing takes place in three stages:

initialization

population

termination

Text Index Initialization

During text index initialization, the tables used to store the text index are created.

For more information about the tables used to store the text index, see "Text Index Tables" in this chapter.

Text Index Population

During text index population, the text index is created in memory, then transferred to the text index tables.

Each document in the text column is retrieved and filtered by ConText Option. Then, the tokens (words) are identified and extracted from the filtered text and stored in memory, along with the document ID and locations for each word, until all of the documents in the column have been processed or the memory buffer is full.

The index entries, consisting of each indexed word and its location string, are then written to the text index tables as individual rows and the buffer is flushed.

If the buffer fills up before all of the documents in the text column have been processed, ConText Option writes the index entries to the text index tables and retrieves the next document from the text column to continue text indexing.

The amount of memory allocated for text indexing for a text column determines the size of the memory buffer and, consequently, how often the text index entries are written to the text index tables.

For more information about the effects of frequently writing to the text index tables, see "Text Index Fragmentation" and "Memory Allocation" in this chapter.

Text Index Termination

During text index termination, the Oracle indexes are created for the text index tables. Each text index table has one or more Oracle indexes that are created automatically by ConText Option.

The termination stage only starts when the population stage has completed for all of the documents in the text column.

Text Index Tables

The text index for a text column consists of the following internal tables:

DR_nnnnn_I1Tn

DR_nnnnn_KTB

DR_nnnnn_LST

DR_nnnnn_I1W (only if Soundex is enabled)

DR_nnnnn_SQR (stored query expression table)

The nnnnn string is an identifier (from 1000 to 99999) which indicates the policy of the text column for which the text index is created.

In addition, ConText Option automatically creates one or more Oracle indexes for each text index table.

For a description of the text index tables, see "ConText Index Tables" and "SQR Table" in "ConText Index Tables and Indexes (Appendix C)."

For more information about stored query expressions, see Oracle ConText Option Application Developer's Guide.

Table Creation

The text index tables for a text column, as well as the Oracle indexes for the tables, are created automatically by ConText Option during text indexing of the column.

The tablespace, storage clause, and other parameters used to create the text index tables and Oracle indexes are specified by the attributes set for the Engine preference in the policy for the text column.

For more information about the Engine attributes, see "Tiles, Tile Attributes, and Attribute Values" in "ConText Data Dictionary (Chapter 9)."

Text Index Fragmentation

As ConText Option builds a text index entry for each word that appears in the text in a column, it caches the index entries in memory. When the memory buffer is full, the text index entries are written to the text index tables as individual rows.

If not all of the documents (rows) in a text column have been indexed when the text index entries are written to the text index tables, the text index entry for a word may not include all of the documents in the column. If the same word is encountered again as text indexing continues, a new text index entry for the word is stored in memory and written to the text index table when the buffer is full.

As a result, a word may have multiple rows in the text index table, with each row representing a text index fragment. The aggregate of all the rows for a word represents the complete text index for the word.
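The effect of the buffer size on fragmentation can be seen in a small Python simulation. This is a sketch under simplifying assumptions: the buffer limit is counted in index entries rather than bytes, the buffer is checked after each document, and the function name index_with_buffer is hypothetical.

```python
def index_with_buffer(documents, buffer_limit):
    """Index documents with a bounded in-memory buffer.

    buffer_limit is the number of (word, document) entries the buffer may
    hold; it stands in for the memory allocated for text indexing.
    Returns the rows written to the index table as (word, [doc IDs]).
    """
    rows = []
    buffer = {}
    entries = 0
    for doc_id, text in documents:
        for word in sorted(set(text.lower().split())):
            buffer.setdefault(word, []).append(doc_id)
            entries += 1
        if entries >= buffer_limit:
            # Buffer full: write one row per buffered word, then flush.
            rows.extend(sorted(buffer.items()))
            buffer, entries = {}, 0
    rows.extend(sorted(buffer.items()))  # final write
    return rows


docs = [(1, "red car"), (2, "blue car"), (3, "green car"), (4, "old car")]
small = index_with_buffer(docs, buffer_limit=4)
large = index_with_buffer(docs, buffer_limit=100)
print(sum(1 for word, _ in small if word == "car"))  # 2 fragments
print(sum(1 for word, _ in large if word == "car"))  # 1 row
```

With the small buffer, the word car is written twice, once per flush, so its index is split across two rows. With the large buffer, a single row holds all four documents.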

Memory Allocation

A machine performing text indexing should have enough memory allocated for text indexing to prevent excessive text index fragmentation. The amount of memory allocated depends on the capacity of the host machine doing the text indexing and the amount of text being indexed.

If a large amount of text is being indexed, the text index can be very large, resulting in more frequent inserts of the index text strings to the tables. By allocating more memory, fewer inserts of text index strings to the tables are required, resulting in faster text indexing and fewer text index fragments.

For more information about allocating memory for text indexing, see "Managing Preferences" in "Setting Up and Managing Text (Chapter 5)."

Text Indexing in Parallel

Parallel text indexing is the process of dividing text indexing between two or more ConText servers. Dividing indexing between servers can help reduce the time it takes to index large amounts of text.

To perform text indexing in parallel, you must start two or more ConText servers (each with the DDL personality) and you must correctly allocate text indexing memory.

The amount of allocated index memory should not exceed the total memory available on the host machine(s) divided by the number of ConText servers performing the parallel text indexing.

For example, say you allocate 10 Mb of memory in the policy for the text column for which you want to create a text index. If you want to use two servers to perform parallel text indexing on your machine, you should have at least 20 Mb of memory available during indexing.
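The arithmetic in this example can be written out as a short Python sketch. The function names are hypothetical and the calculation covers only index memory, not any other memory used by the servers or the database.

```python
def max_index_memory_per_server(total_memory_mb, server_count):
    """Upper bound on index memory per server (in Mb)."""
    if server_count < 1:
        raise ValueError("at least one ConText server is required")
    return total_memory_mb / server_count


def memory_required(index_memory_mb, server_count):
    """Total memory needed when each server uses the allocated amount."""
    return index_memory_mb * server_count


print(memory_required(10, 2))              # 20
print(max_index_memory_per_server(20, 2))  # 10.0
```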

Note: When using multiple ConText servers to perform parallel text indexing, the servers can run on different host machines if the machines are able to connect to the database where the text index is stored.

Text Index Updates

When an existing document in a text column is deleted or modified such that the text index is no longer up-to-date, the text index must be updated.

However, updating the text index for modified/deleted documents affects every row that contains references to the document in the text index. Because this can take considerable time, ConText Option utilizes a deferred delete mechanism for updating the text index for modified/deleted documents.

In a deferred delete, the document references in the text index table (DR_nnnnn_I1Tn) for the modified/deleted document are not actually removed. Instead, the status of the document is recorded in the text index control table (DR_nnnnn_LST), so that the ID for the document is not returned in subsequent text queries that would normally return the document.

Actual deletion of the document references from the I1T table takes place only during optimization of a text index.
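The deferred delete mechanism can be modeled with a small Python class. This is a toy model only: the class TextIndex and its in-memory dictionaries stand in for the I1T and LST tables, and they do not reflect how those tables are actually laid out.

```python
class TextIndex:
    """Toy model of a text index with deferred deletes."""

    def __init__(self):
        self.rows = {}        # word -> list of document IDs (index table)
        self.deleted = set()  # document IDs marked deleted (control table)

    def add(self, doc_id, text):
        for word in set(text.lower().split()):
            self.rows.setdefault(word, []).append(doc_id)

    def delete(self, doc_id):
        # Deferred delete: record the status only; index rows are untouched.
        self.deleted.add(doc_id)

    def query(self, word):
        # Marked documents are filtered out of the results.
        return [d for d in self.rows.get(word, []) if d not in self.deleted]

    def optimize(self):
        # Actual deletion: remove the references and clear the status list.
        for word in list(self.rows):
            kept = [d for d in self.rows[word] if d not in self.deleted]
            if kept:
                self.rows[word] = kept
            else:
                del self.rows[word]
        self.deleted.clear()


index = TextIndex()
index.add(1, "oracle database")
index.add(2, "oracle text")
index.delete(1)
print(index.query("oracle"))   # [2]
print(index.rows["oracle"])    # [1, 2], reference still stored
index.optimize()
print(index.rows["oracle"])    # [2]
```

The delete itself touches only the status set, which is why it is fast; the cost of rewriting every affected row is postponed until optimization.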

Text Index Log

The text index log records all the indexing operations performed on a policy for a text column. Each time a text index is created, optimized, or deleted for a text column, an entry is created in the text index log.

Log Details

Each entry in the log provides detailed information about the specified indexing operation, including:

the policy for the text column on which the indexing operation was performed

the indexing operation that was performed (creation, optimization, deletion)

if the indexing operation was performed in parallel, the ID of the server that processed the operation

whether the operation failed and, if it did, the stage at which it failed

the number of documents selected for processing and the number of documents actually processed during the indexing operation

the textkeys of the first and last documents processed

Accessing the Log

The text index log is stored in an internal table (DR$TEXT_INDEX_LOG) and can be viewed using the CTX_INDEX_LOG and CTX_USER_LOG views. The text index log can also be viewed in the administration tool by all users with the CTXAPP role.

Optimization

Optimization performs two functions for an index:

compaction of text index fragments

removal of document references and related information for modified/deleted documents (also known as actual deletion or garbage collection)

Compaction of index fragments results in fewer rows in the index tables, which results in faster and more efficient text queries. It also allows for more efficient use of tablespace.

Garbage collection updates the text index strings to accurately reflect the status of deleted and modified documents.

Compaction of Index Fragments

Compaction of index fragments (multiple rows in the text index tables for the same indexed word) combines the index fragments for a word into longer, more complete strings, up to a maximum of 64 Kb for any individual string.

ConText Option provides two methods of text index compaction:

in-place compaction

two-table compaction (default)

In-place compaction uses available memory to compact index fragments, then writes the compacted strings back into the original (existing) text index table.

Two-table compaction creates a second text index table into which the compacted text fragments are written. When compaction is complete, the original text index table is deleted.

Two-table compaction is faster than in-place compaction; however, it requires enough tablespace to be available during compaction to accommodate the creation and population of the second text index table.
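A two-table style compaction can be sketched in Python as follows. The function name compact is hypothetical, and the max_entries parameter is an assumption that stands in for the 64 Kb limit on an individual string; the sketch builds a new list of rows instead of modifying the original, which is the essence of the two-table method.

```python
def compact(rows, max_entries):
    """Two-table style compaction of index fragments.

    rows is a list of (word, [doc IDs]) fragments. max_entries caps the
    length of a compacted row; it stands in for the 64 Kb string limit.
    Returns a new list of rows and leaves the original untouched.
    """
    merged = {}
    for word, doc_ids in rows:
        merged.setdefault(word, []).extend(doc_ids)
    new_rows = []
    for word in sorted(merged):
        ids = merged[word]
        # Split into chunks when the combined string would exceed the limit.
        for start in range(0, len(ids), max_entries):
            new_rows.append((word, ids[start:start + max_entries]))
    return new_rows


fragments = [("car", [1, 2]), ("red", [1]), ("car", [3, 4]), ("car", [5])]
print(compact(fragments, max_entries=4))
# [('car', [1, 2, 3, 4]), ('car', [5]), ('red', [1])]
```

Because the original rows and the new rows exist at the same time, the sketch also shows why this method needs extra space while it runs.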

Removal of Document References

ConText Option provides optimization methods which can be used to perform the actual deletion of all references to modified/deleted documents in a text index.

During an actual delete, the text index references for all modified/deleted documents are removed from the text index table, leaving only references to existing, unchanged documents. In addition, in an actual delete, the text index control table is cleared of the information which records the status of documents.

Similar to index fragment compaction, ConText Option supports in-place or two-table actual deletion.

When to Optimize

Index optimization should be performed regularly, as the indexing process can create many rows in the database depending on the amount of memory allocated for text indexing and the amount of text being indexed.

In general, optimize an index after:

large amounts of text are indexed

parallel text indexing has been utilized

large numbers of documents in a table have been modified or deleted (text DML)

Theme Indexes

Theme indexes are functionally identical to text indexes and are created in the same way as text indexes:

a policy is created for a column

a DDL request for index creation is submitted for the column

once the theme index has been generated, the column is enabled for all three query methods

The key to generating a theme index is the lexer that you specify for the column policy. Instead of specifying the basic (default) lexer, the theme lexer is specified.

Note: Theme indexing is only supported for English-language text.

Theme Lexer

The theme lexer is a special Lexer Tile that bypasses the standard text parsing routines and, instead, accesses the linguistic core in ConText Option to generate themes for documents.

The theme lexer analyzes text at the sentence, paragraph, and document level to create a context in which the document can be understood. It uses a mixture of statistical methods and heuristics to determine the main topics that are developed over the breadth of the document.

It also uses the ConText Option Knowledge Catalog, a collection of over 200,000 words and phrases, organized into a conceptual hierarchy with over 2,000 categories, to generate its theme information.

Linguistic Settings

The linguistic core uses settings that can affect the themes that are generated for a document. These settings are collected into setting configurations, which can be specified at the session level before the linguistic core performs any operations.

A number of predefined setting configurations are provided by ConText Option to allow users to tailor the output of the linguistic core to the style and content of their documents.

In addition, custom setting configurations can be created using the ConText Option administration tool.

Note: Since the settings can affect the themes that are generated for a document, once a theme index has been created for a column, the settings should not be altered.

If the settings are altered, the results generated for incremental changes to existing documents, as well as new documents, may be inconsistent with the results generated for the initial index creation. In this event, the theme index for the column should be dropped and the entire column reindexed to account for the new settings.

For more information about creating custom setting configurations, see the ConText Option administration tool.

For more information about setting the linguistic settings, see Oracle ConText Option Application Developer's Guide.

What's in a Theme Index

A theme index contains a list of all the themes for the documents in a column and the documents in which each theme is found. Each document can have up to sixteen main themes.

Note: Offset and frequency information are not relevant in a theme query, so this type of information is not stored in theme indexes.

Theme Signatures

A maximum of sixteen themes are generated for each document; however, each theme is expanded during indexing to include higher level concepts and related themes from the ConText Option Knowledge Catalog. The collection of themes and related themes is known as the theme signature for the document.

ConText Option uses the theme signature for a document to find documents that match the themes in a theme query.

Tokens in Theme Indexes

Unlike the single tokens found in a text index, the entries in a theme index often consist of phrases.

In addition, these phrases may be common terms or they may be the names of companies, products, and fields of study as defined in the Knowledge Catalog.

As such, theme indexes may contain words and phrases in uppercase, lowercase, and mixed-case.

For example, a document about Oracle contains the phrase Oracle Corp.. In the text index for the document, this phrase would have two entries, all in lowercase. In the theme index for the document, the entry would be Oracle Corporation, which is the form stored in the Knowledge Catalog.

Theme Indexing Policies

By specifying the theme lexer in the Lexer preference used in a column policy, you designate the policy as a theme indexing policy.

Once a theme index is created for the policy, any text requests, including queries, on the policy will result in the theme index being accessed.

For more information about creating a theme indexing policy, see "Creating a Theme Indexing Policy" in "Setting Up and Managing Text (Chapter 5)."

For more information about theme queries, see Oracle ConText Option Application Developer's Guide.

DDL and DML

In contrast to the Linguistic Services, which use Linguistic servers for all processing, operations such as index creation, optimization, and updating for theme indexes do not require Linguistic servers.

Theme indexes are processed identically to text indexes, meaning that DDL requests for index creation and optimization are processed by any currently available DDL servers.

Similarly, theme indexes do not have to be manually updated. All DML requests are processed automatically by any DML servers that are running at the time.

Columns with Theme and Text Indexes

Text and theme indexes can exist for the same column, by simply creating a text indexing policy and a theme indexing policy for the column, then requesting index creation once for each policy.

When two indexes exist for the same column, one-step queries (theme or text) require the policy name to be specified as part of the CONTAINS function. In this way, the correct index is accessed for the query.

This requirement is not enforced for two-step and in-memory queries, because they use policy name, rather than column name, to identify the column to be queried.

ConText Servers for Theme Indexing and Theme Queries

If theme indexing and theme querying are going to be performed, all ConText server processes must be started using the ctxsrv executable. The ctxsrv executable automatically initializes the ConText Option linguistics during startup of the ConText server process.

If any of the ConText server processes are started using the ctxsrvx executable, which does not initialize the ConText Option linguistics, theme indexing and theme querying may fail.

For more information about starting ConText servers and specifying personalities, see "Managing ConText Servers" in "Administering ConText Option (Chapter 2)."

For more information about ctxsrv/ctxsrvx, see "ctxsrv/ctxsrvx Executable" in "Utilities and Executables (Chapter 8)."

Base-letter Conversion

For each text column in a table, you can specify whether characters used in single-byte, non-English languages are to be converted to their base-letter representation. This means that words with diacritical marks (accents, umlauts, etc.) are converted to their base form before their tokens are placed in the text index for the column.

Text Indexing

Base-letter conversion is an attribute that you can set when creating a Lexer preference.

If base-letter conversion is enabled for the Lexer preference in a policy, during text indexing of the column for the policy, all characters containing diacritical marks are converted to their base form in the text index. The original text is not affected.

Base-letter conversion requires that the database character set is a subset of the NLS_LANG character set. For example, suppose the NLS_LANG parameter is set to French_France.WE8ISO8859P1 and the following piece of text is to be converted to its base-letter representation:

La référence de session doit être égale à 'name'.

The words of this sentence are indexed under the entries:

la
reference
de
session
doit
etre
egale
a
name
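The conversion in this example can be reproduced with a short Python sketch. It uses Unicode decomposition from the standard unicodedata module, which is an assumption made for illustration and not the mechanism ConText Option uses; the function names are hypothetical.

```python
import re
import unicodedata


def to_base_letters(text):
    """Replace accented characters with their base letters."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(c for c in decomposed if not unicodedata.combining(c))


def index_tokens(text):
    """Lowercase, convert to base letters, and split into word tokens."""
    return re.findall(r"[a-z]+", to_base_letters(text).lower())


print(index_tokens("La référence de session doit être égale à 'name'."))
# ['la', 'reference', 'de', 'session', 'doit', 'etre', 'egale', 'a', 'name']
```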

Note: Base-letter conversion requires that the language component for NLS_LANG is set to a language (e.g. French, German) that supports an extended (8-bit) character set. In addition, the charset component must be set to one of the 8-bit character sets (e.g. WE8ISO8859P1).

Text Queries

In a text query on a column with base-letter conversion enabled, the query terms are automatically converted to match the base-letter conversion that was performed during text indexing.

Note: Base-letter conversion works with all of the query operators (logical, control, expansion, thesaurus, etc.), except the STEM expansion operator.

For more information about text queries and the query operators, see Oracle ConText Option Application Developer's Guide.

Thesauri

Users looking for information on a given topic may not know which words have been used in documents that refer to that topic.

Oracle ConText Option enables users to create ISO-2788 compliant thesauri which define relationships between lexically equivalent words and phrases. Users can then retrieve documents that contain relevant text by expanding queries to include similar or related terms as defined in a thesaurus.

Note: ConText Option supports creating multiple thesauri; however, only one thesaurus can be used at a time in a query.

Three types of relationships can be defined for terms (words and phrases) in a thesaurus:

synonyms

hierarchical relationships

related terms

In addition, each entry in a thesaurus can have scope notes associated with it.

Synonyms

Support for synonyms is implemented through synonym entries in a thesaurus. The collection of all of the synonym entries for a term and its associated terms is known as a synonym ring.

Synonym Rings

Synonym rings are transitive. If term A is synonymous with term B and term B is synonymous with term C, term A and term C are synonymous. Similarly, if term A is synonymous with both terms B and C, terms B and C are synonymous. In either case, the three terms together form a synonym ring.

For example, in the synonym rings shown in Figure 3 - 1, the terms car, auto, and automobile are all synonymous. Similarly, the terms main, principal, major, and predominant are all synonymous.

Figure 3 - 1. Synonym Rings in a Thesaurus

Note: A thesaurus can contain multiple synonym rings; however, synonym rings are not named. A synonym ring is created implicitly by the transitive association of the terms in the ring.

As such, a term cannot exist twice within the same synonym ring or within more than one synonym ring in a thesaurus.
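The implicit formation of synonym rings from pairwise synonym entries can be sketched in Python with a union-find structure. This illustrates transitivity only; the function build_rings and the pair-based input are assumptions, not the way ConText Option stores synonym entries.

```python
def build_rings(pairs):
    """Group terms into synonym rings from pairwise synonym entries."""
    parent = {}

    def find(term):
        parent.setdefault(term, term)
        while parent[term] != term:
            parent[term] = parent[parent[term]]  # path halving
            term = parent[term]
        return term

    for a, b in pairs:
        parent[find(a)] = find(b)

    rings = {}
    for term in parent:
        rings.setdefault(find(term), set()).add(term)
    return sorted(sorted(ring) for ring in rings.values())


entries = [("car", "auto"), ("auto", "automobile"),
           ("main", "principal"), ("main", "major"), ("major", "predominant")]
print(build_rings(entries))
# [['auto', 'automobile', 'car'], ['main', 'major', 'predominant', 'principal']]
```

Each term ends up in exactly one ring, which matches the rule that a term cannot belong to more than one synonym ring in a thesaurus.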

Preferred Terms

Synonym rings are not named, but they have an ID associated with them. The ID is assigned when the synonym entry is first created.

Each synonym ring can have one, and only one, term that is designated as the preferred term. A preferred term is used in place of the other terms in a synonym ring when one of the terms in the ring is specified with the PT operator in a query.

Note: A term in a preferred term (PT) query is replaced by, rather than expanded to include, the preferred term in the synonym ring.

Hierarchical Relationships

Hierarchical relationships consist of broader and narrower terms represented as an inverted tree. Each entry in the hierarchy is a narrower term for the entry immediately above it and to which it is linked. The term at the root of each tree is known as the top term.

For example, in the tree structure shown in Figure 3 - 2, the term elephant is a narrower term for the term mammal. Conversely, mammal is a broader term for elephant. The top term is animal.

Figure 3 - 2. Narrower and Broader Thesaurus Term Hierarchy

ConText Option also supports the following hierarchical relationships in thesauri:

generic

partitive

Each of the three hierarchical relationships supported by ConText Option represents a separate branch of the hierarchy and is accessed in a query using different thesaurus operators.

Note: The three types of hierarchical relationships are optional. Any of the three hierarchical relationships can be specified for a term.

Generic Hierarchy

The generic hierarchy represents relationships between terms in which one term is a generic name for the other.

For example, the terms rat and rabbit can be specified as generic narrower terms for rodent.

Partitive Hierarchy

The partitive hierarchy represents relationships between terms in which one term is part of another.

For example, the provinces of british columbia and quebec can be specified as partitive narrower terms for canada.

Multiple Occurrences of the Same Term

Because the branches of the hierarchy are treated as separate relationships, the same term can exist in more than one branch of the hierarchy. In addition, a term can exist more than once in a single branch; however, each occurrence of the term must be accompanied by a qualifier.

If a term exists more than once as a narrower term in a branch, broader term queries for the term are expanded to include all of the broader terms for the term.

If a term exists more than once as a broader term in a branch, narrower term queries for the term are expanded to include the narrower terms for each occurrence of the broader term.

For example, C is a narrower generic term for both A and B. D and E are narrower generic terms for C. In queries for terms A, B, or C, the following expansions take place:

NTG(A) expands to {C}, {A}
NTG(B) expands to {C}, {B}
NTG(C) expands to {C}, {D}, {E}
BTG(C) expands to {C}, {A}, {B}

Note: The same expansions hold true for standard and partitive hierarchical relationships.
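The expansions listed above can be reproduced with a short Python sketch. The dictionary BROADER encodes the example hierarchy, and the functions ntg and btg are hypothetical stand-ins for the NTG and BTG operators; they return the term itself plus its immediate narrower or broader terms, in no particular significance of order.

```python
# Generic hierarchy from the example: narrower term -> its broader terms.
BROADER = {"C": ["A", "B"], "D": ["C"], "E": ["C"]}


def btg(term):
    """Expand a term to itself plus its immediate broader generic terms."""
    return [term] + BROADER.get(term, [])


def ntg(term):
    """Expand a term to itself plus its immediate narrower generic terms."""
    narrower = sorted(n for n, broader in BROADER.items() if term in broader)
    return [term] + narrower


for term in ("A", "B", "C"):
    print(f"NTG({term}) ->", ntg(term))
print("BTG(C) ->", btg("C"))
```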

Qualifiers

For homographs (terms that are spelled the same way, but have different meanings) in a hierarchical branch, a qualifier must be specified as part of the entry for the word. In a sense, each qualified occurrence of the term creates a separate entry in the hierarchy.

For example, the term spring has different meanings relating to seasons of the year and mechanisms/machines. The term could be qualified in the hierarchy by the terms season and machinery.

To differentiate between the terms during a query, the qualifier can be specified. Then, only the terms that are broader terms, narrower terms, or related terms for the term and its qualifier are returned. If no qualifier is specified, all of the related, narrower, and broader terms for the terms are returned.

Note: In thesaural queries that include a term and its qualifier, the qualifier must be escaped, because the parentheses required to identify the qualifier for a term will cause the query to fail.

Related Terms

Each entry in a thesaurus can have one or more related terms associated with it. Related terms are terms that are close in meaning to, but not synonymous with, their related term. Similar to synonyms, related terms are reflexive; however, related terms are not transitive.

If a term that has one or more related terms defined for it is specified in a related term query, the query is expanded to include all of the related terms.

For example, B and C are related terms for A. In queries for A, B, and C, the following expansions take place:

RT(A) expands to {A}, {B}, {C}
RT(B) expands to {A}, {B}
RT(C) expands to {C}, {A}

Note: Terms B and C are not related terms and, as such, are not returned in the expansions performed by ConText Option.
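The behavior of related terms in this example can be sketched in Python. Each entry is stored in both directions, but no further links are inferred; the functions build_related and rt are hypothetical and stand in for the thesaurus tables and the RT operator.

```python
def build_related(pairs):
    """Store each related-term entry in both directions (no transitivity)."""
    related = {}
    for a, b in pairs:
        related.setdefault(a, set()).add(b)
        related.setdefault(b, set()).add(a)
    return related


def rt(term, related):
    """Expand a term to itself plus its directly related terms."""
    return {term} | related.get(term, set())


related = build_related([("A", "B"), ("A", "C")])
print(sorted(rt("A", related)))  # ['A', 'B', 'C']
print(sorted(rt("B", related)))  # ['A', 'B']
print(sorted(rt("C", related)))  # ['A', 'C']
```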

Scope Notes

Each entry in the hierarchy, whether it is a main entry or one of the synonymous, hierarchical, or related entries for a main entry, can have scope notes associated with it.

Scope notes can be used to provide descriptions or comments for the entry.

Thesaural Maintenance

Thesauri are stored in internal tables (DR$THS, DR$THS_BT, and DR$THS_PHRASE) owned by CTXSYS. Each thesaurus is uniquely identified by a name that is specified when the thesaurus is created.

Thesaurus Creation and Modification

Thesauri can be created and modified by all ConText Option users with the CTXAPP role.

ConText Option provides both a PL/SQL interface (CTX_THES) and a GUI administration tool for viewing, creating, updating, and deleting thesauri.

Note: Thesauri can be created, updated, and deleted by all users with the CTXAPP role.

In addition, the ctxload utility can be used for loading (creating) thesauri from a load file into the thesaurus tables, as well as dumping thesauri from the tables into output (dump) files.

The thesaurus dump files created by ctxload can be printed or used as input for other applications. The dump files can also be used to load a thesaurus into the thesaurus tables, which is useful when an existing thesaurus serves as the basis for a new thesaurus.

Default Thesaurus

Before the thesaurus operators can be used in a query expression, a thesaurus named 'DEFAULT' must be created either through the administration tool or through ctxload.

The thesaurus used by the thesaurus operators is DEFAULT, unless a different thesaurus is explicitly called by name in the query expression.

Query Expansion

The expansions returned by the thesaurus operators are combined using the ACCUMULATE operator ( , ) in the query expression.
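
As a rough illustration, the following Python sketch joins a list of expanded terms with the ACCUMULATE operator, using the {term} notation shown in this chapter. The function name `accumulate` is hypothetical, and the exact expression that ConText Option builds internally is an assumption.

```python
def accumulate(terms):
    """Join expanded terms with the ACCUMULATE operator (,)."""
    # Each term is wrapped in braces, following the notation in this chapter.
    return ", ".join("{" + term + "}" for term in terms)

print(accumulate(["A", "B", "C"]))  # {A}, {B}, {C}
```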

For more information about query expressions and the thesaurus operators, see Oracle ConText Option Application Developer's Guide.

Text and Theme Queries

Thesauri are primarily used for expanding text queries, but can be used for expanding theme queries, provided a thesaurus has been created for the themes that can be generated by ConText Option.

Like the queries they expand, thesauri for text queries are case-insensitive and thesauri for theme queries are case-sensitive.

Limitations

In a query, the expansions generated by the thesaurus operators do not follow nested thesaural relationships. In other words, only one thesaural relationship at a time is used to expand a query.

For example, B is a narrower term for A. B is also in a synonym ring with terms C and D, and has two related terms, E and F. In a narrower term query for A, the following expansion occurs:

NT(A) query is expanded to {A}, {B}

Note: The query expression is not expanded to include C and D (as synonyms of B) or E and F (as related terms for B).
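
The limitation can be illustrated with a minimal Python sketch that holds the three relationships from the example in separate tables. The table and function names are hypothetical and are not part of any ConText Option interface. Because the narrower term expansion consults only the narrower-term table, the synonyms and related terms of B never enter the result.

```python
# Hypothetical thesaurus data matching the example above; not a ConText Option API.
narrower = {"A": ["B"]}
synonyms = {"B": ["C", "D"]}
related = {"B": ["E", "F"]}

def expand_nt(term):
    """Expand a term using the narrower-term relationship only."""
    # synonyms and related are deliberately not consulted:
    # only one thesaural relationship is applied per expansion.
    return [term] + narrower.get(term, [])

print(expand_nt("A"))  # ['A', 'B']
```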

Format	Version
AMIPRO for Windows	1, 2, 3
Lotus 1-2-3 for DOS	4, 5
Lotus 1-2-3 for Windows	2, 3, 4, 5
Microsoft Word for Windows	2, 6.x, 7.0
Microsoft Word for DOS	5.0, 5.5
Microsoft Word for MAC	3, 4, 5.x
Word Perfect for Windows	5.x, 6.x
Word Perfect for DOS	5.0, 5.1, 6.0
Xerox XIF for UNIX	5, 6