Oracle® Ultra Search Administrator's Guide 10g Release 2 (10.2) Part Number B14222-01
This chapter contains the following topics:
Oracle Ultra Search is built on the Oracle Database and Oracle Text technology. It provides uniform search-and-location capabilities over multiple repositories, such as Oracle databases, other ODBC-compliant databases, IMAP mail servers, HTML documents served up by a Web server, files on disk, and more.
Oracle Ultra Search uses a crawler to collect documents. You can schedule the crawler to visit the Web sites that you want to search. The documents stay in their repositories, and the collected information is used to build an index that stays within the firewall in a designated Oracle Database. Oracle Ultra Search also provides APIs for building content management solutions.
In addition, Oracle Ultra Search offers the following:
A complete text query language for text search inside the database
Full integration with the Oracle Database and the SQL query language
Advanced features like concept searching and theme analysis
Attribute mapping to facilitate attribute search across disparate repositories
Indexing of more than 150 file formats
Full globalization, including support for Chinese, Japanese, and Korean (CJK), and Unicode
Oracle Ultra Search is made up of the following components:
The Oracle Ultra Search crawler is a Java process activated by your Oracle server according to a set schedule. When activated, the crawler spawns a configurable number of processor threads that fetch documents from various data sources and index them using Oracle Text. This index is used for querying. Data sources can be Web sites, database tables, files, mailing lists, Oracle Application Server Portal page groups, or user-defined data sources.
The crawler maps links and analyzes relationships. The crawler schedule is integrated with and driven by the DBMS_JOB queue mechanism. Whenever the crawler encounters embedded, non-HTML documents during crawling, it uses Oracle Text filters to automatically detect the document type and to filter and index the document.
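The idea of a configurable number of fetch threads can be illustrated with a minimal Python sketch. This is not the crawler's actual implementation (the crawler is a Java process); the `fetch` and `crawl` functions and the example URLs are hypothetical stand-ins.

```python
from concurrent.futures import ThreadPoolExecutor


def fetch(url):
    """Hypothetical stand-in for a real fetch; returns fake document text."""
    return f"contents of {url}"


def crawl(urls, num_threads=4):
    """Fetch every URL with a fixed pool of worker threads and return a
    toy 'index' that maps each URL to its document text."""
    with ThreadPoolExecutor(max_workers=num_threads) as pool:
        documents = list(pool.map(fetch, urls))  # results keep input order
    return dict(zip(urls, documents))


index = crawl(["http://example.com/a", "http://example.com/b"], num_threads=2)
```

Raising `num_threads` lets more documents be fetched concurrently, which is the trade-off the configurable thread count exposes.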
The Oracle Ultra Search back end consists of an Oracle Ultra Search repository and Oracle Text. Oracle Text provides text indexing and search capabilities required to index and query data retrieved from the data sources. The back end indexes information from the crawler and serves up the query results.
The Oracle Ultra Search middle tier components are Web applications. The middle tier includes the Oracle Ultra Search administration tool, the APIs and the query applications.
In the Oracle Database release, the Oracle Ultra Search middle tier and back end can reside in the same Oracle Home. However, in the OracleAS and Oracle Collaboration Suite releases, the middle tier is located in a different Oracle Home.
The administration tool is a J2EE-compliant Web application. You can use it to manage Oracle Ultra Search instances and access it from your intranet. The administration tool is independent from the Oracle Ultra Search query application. Therefore, the administration tool and query application can be hosted on different computers to enhance security and scalability.
Oracle Ultra Search provides the following APIs:
The query API works with indexed data. The Java API does not impose any HTML rendering elements. The application can completely customize the HTML interface.
The crawler agent API crawls and indexes proprietary document repositories.
The e-mail Java API accesses archived e-mails and is used by the query application to display e-mails. It can also be used to build your own custom query application.
The URL rewriter API is used by the crawler to filter and rewrite extracted URL links before they are inserted into the URL queue.
The Document Service crawler agent API enables generation of attribute data based on the document contents. It accepts robot metatag instructions from the agent for the target document, and it transforms the original document contents for indexing control.
Oracle Ultra Search includes highly functional query applications to query and display search results. The query applications are based on JSP and work with any JSP 1.1-compliant engine.
The Oracle Ultra Search administration tool and the Oracle Ultra Search query applications are J2EE-compliant Web applications. They are three tier architecture applications. Figure 1-1 shows the relationship between the browser (the first tier), the Web server and the servlet engine (the middle tier), and the Oracle Database (the third tier).
The Web server accepts requests from the browser and forwards them to the servlet engine for processing. The Oracle Ultra Search middle tier then communicates with the Oracle Database through JDBC, as shown in Figure 1-1.
You can use any browser to access the Oracle Ultra Search administration tool or Oracle Ultra Search query application. The URLs are described in the following section.
This section explains the features in Oracle Ultra Search. It includes the following topics:
An Ultra Search instance can be created to provide isolation for the data collections that have been crawled.
You can create a read-only snapshot of a master Oracle Ultra Search instance. This is useful for query processing or for backup. You can also make a snapshot instance updatable. This is useful when the master instance is corrupted and you want to use a snapshot as a new master instance.
See Also:
"Instances Page"
Document attributes, or metadata, describe the properties of a document. Each data source has its own set of document attributes. Attribute values are retrieved during the crawling process, mapped to search attributes, and then stored and indexed in the database. This enables you to query documents based on their attributes. Document attributes in different data sources can be mapped to the same search attribute. Therefore, you can query documents from multiple data sources based on the same search attribute.
Oracle Ultra Search has the following default search attributes: Title, Author, Description, Subject, Mimetype, Language, Host, and LastModifiedDate. The default search attributes can be incorporated in search applications for a more detailed search and richer presentation. The list of values (LOV) for a search attribute can help you specify a search query. If an attribute LOV is available, then the crawler registers the LOV definition, which includes the attribute value, the attribute value display name, and its translation.
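As a rough illustration of attribute mapping, the following Python sketch maps source-specific attribute names onto shared search attributes and then queries across both sources. The mapping table, source names, and documents are invented for the example and do not reflect how Oracle Ultra Search stores its mappings.

```python
# Hypothetical per-source mappings: source attribute name -> search attribute.
ATTRIBUTE_MAP = {
    "web": {"dc.creator": "Author", "title": "Title"},
    "email": {"from": "Author", "subject": "Subject"},
}


def to_search_attributes(source, doc_attrs):
    """Translate a document's source attributes into search attributes,
    dropping any attribute that has no mapping."""
    mapping = ATTRIBUTE_MAP[source]
    return {mapping[name]: value
            for name, value in doc_attrs.items() if name in mapping}


def find_by_attribute(docs, attribute, value):
    """Return the URLs of documents whose search attribute equals value."""
    return [d["url"] for d in docs if d["attrs"].get(attribute) == value]


docs = [
    {"url": "http://example.com/report.html",
     "attrs": to_search_attributes("web", {"dc.creator": "Ana", "title": "Report"})},
    {"url": "imap://mail.example.com/42",
     "attrs": to_search_attributes("email", {"from": "Ana", "subject": "Hello"})},
]
hits = find_by_attribute(docs, "Author", "Ana")
```

Because both `dc.creator` and `from` map to Author, a single Author query finds documents from both sources.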
See Also:
"Synchronizing Data Sources"
Oracle Ultra Search provides a command-line tool to load metadata into an Oracle Ultra Search database. If you have to load a large amount of data, using the command-line tool is faster than using the HTML-based administration tool.
The metadata loader is a Java application. To use the tool, you must put the metadata in an XML file. It supports the following types of metadata:
Translators can enter the following translation strings:
Search attribute names
Attribute LOVs
Data group names
Federated data source names
During query time, they can be displayed according to the language preference.
You can define, edit, or delete your data sources and types in addition to the ones provided. You can implement your crawler agent to crawl and index a proprietary document repository, such as Lotus Notes or Documentum, which contain their own databases and interfaces. The proprietary repository is called a user-defined data source. The module that enables the crawler to access the data source is called a crawler agent.
You can decide which parts of your site can be visited by robots. If robots exclusion is enabled (the default), then the Web crawler traverses the pages based on the access policy specified in the Web server robots.txt file. For example, when a robot visits http://www.oracle.com/, it checks for http://www.oracle.com/robots.txt. If the robot finds it, then the crawler analyzes its contents to check whether it is permitted to retrieve the document. If you own the Web site, then you can disable robots exclusion. However, when crawling other Web sites, you should always comply with robots.txt by enabling robots exclusion.
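The robots.txt check can be sketched with Python's standard `urllib.robotparser` module. This only illustrates the protocol, not the Oracle Ultra Search crawler itself; the robots.txt content, the `example-crawler` user agent, the `is_allowed` helper, and its `robots_exclusion` flag are assumptions made for the example.

```python
from urllib.robotparser import RobotFileParser

# Example policy a Web server might publish at /robots.txt (hypothetical).
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def is_allowed(robots_txt, user_agent, url, robots_exclusion=True):
    """Return True if the crawler may fetch url under the given policy.
    With robots exclusion disabled, every URL is permitted."""
    if not robots_exclusion:
        return True
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


allowed = is_allowed(ROBOTS_TXT, "example-crawler",
                     "http://www.example.com/index.html")
blocked = is_allowed(ROBOTS_TXT, "example-crawler",
                     "http://www.example.com/private/a.html")
```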
See Also:
"Web Sources"
Initially, you might want the crawler to collect the URLs without indexing. After crawling is done, you can examine the document URLs and status, remove the unwanted documents, and then start indexing. You can set the crawling mode to one of the following:
Automatically accept all URLs for indexing
Examine URLs before indexing
Index only
See Also:
"Schedules Page"
The URL rewriter is a user-supplied Java module that implements the Oracle Ultra Search UrlRewriter interface. It is used by the crawler to filter or rewrite extracted URL links before they are put into the URL queue. URL filtering removes unwanted links, and URL rewriting transforms the URL. This transformation is necessary when access URLs are used.
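The filter-or-rewrite decision can be sketched as follows. A real rewriter would be a Java class implementing the UrlRewriter interface, whose method signatures are not given in this chapter; this Python function, its rules, and the host names are purely hypothetical and show only the logic.

```python
from urllib.parse import urlsplit, urlunsplit


def rewrite_url(url):
    """Return the URL to queue, or None to drop the link (hypothetical rules)."""
    parts = urlsplit(url)
    # Filtering: drop links the crawl should never follow.
    if parts.path.endswith("/logout"):
        return None
    # Rewriting: replace one host name with another before queuing.
    if parts.netloc == "internal.example.com":
        parts = parts._replace(netloc="www.example.com")
    return urlunsplit(parts)


extracted = ["http://internal.example.com/docs/a.html",
             "http://www.example.com/logout"]
queue = [u for u in (rewrite_url(link) for link in extracted) if u is not None]
```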
Oracle Ultra Search offers a flexible query API for incorporating search functionality into your sites. The query API includes the following functionality:
Three attribute types: string, number, and date
Multivalued attributes
Display name support for attributes, attribute list of values (LOV), and data groups
Document relevancy boosting
Arbitrary grouping of attribute query operators using the AND and OR operators, with control over attribute operator evaluation order
Selection of metadata returned in query result
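To make the grouping of attribute conditions concrete, here is a small Python sketch that evaluates a nested AND/OR query against a document's attributes, including a multivalued attribute. The tuple-based query representation and the `matches` function are invented for illustration and are not the Oracle Ultra Search query API.

```python
def matches(doc, query):
    """Evaluate a nested query against a document's attributes.
    A query is ("AND", q1, q2, ...), ("OR", q1, q2, ...),
    or a leaf ("EQ", attribute, value)."""
    kind = query[0]
    if kind == "AND":
        return all(matches(doc, q) for q in query[1:])
    if kind == "OR":
        return any(matches(doc, q) for q in query[1:])
    if kind == "EQ":
        _, attribute, value = query
        stored = doc.get(attribute, [])
        # Treat single values and multivalued attributes uniformly.
        values = stored if isinstance(stored, list) else [stored]
        return value in values
    raise ValueError(f"unknown operator: {kind}")


doc = {"Author": ["Ana", "Raj"], "Language": "en", "Mimetype": "text/html"}
query = ("AND",
         ("EQ", "Language", "en"),
         ("OR", ("EQ", "Author", "Raj"), ("EQ", "Author", "Lee")))
found = matches(doc, query)
```

The nesting of the tuples fixes the evaluation order: the OR group is resolved before it is combined with the Language condition.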
Oracle Ultra Search supports secure searches. Ultra Search returns only the documents that satisfy the search criteria specified by you. For secure searches, each indexed document is protected by an Access Control List (ACL), which is evaluated during the search. The API query returns the documents only if you have the permission to read a protected document.
There are two ways to secure a data source:
Specify a single ACL for protecting all documents of a data source.
The administrator specifies the permissions of the single ACL in the Oracle Ultra Search administration tool. The resulting ACL is used to protect all documents belonging to that data source.
Crawl ACLs from the data source.
The data source is expected to provide the ACL along with the document. This enables each document to be protected by its own unique ACL.
Oracle Ultra Search supports this mode only for user-defined data source types, where the crawler agent retrieves the ACL from the data source along with other document attributes. You cannot get an ACL from a data source if it is a Web, table, portal, e-mail, or file type. With the agent APIs, the URL property "UrlData.ACL" lets the agent set the ACL of the URL that is submitted. The AclHelper class in the agent APIs generates the ACL string to make sure that the ACL string format is correct. Only Distinguished Name (DN) and Global User Id (GUID) can be used as the principal of an ACL.
Oracle Ultra Search performs ACL duplicate detection. This means that if a crawled document's ACL already exists in the Oracle Ultra Search system, then the existing ACL is used to protect the document, instead of creating a new ACL within Oracle Ultra Search. This policy reduces storage space and increases performance.
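Duplicate detection of this kind can be pictured as a lookup keyed on the ACL's content. The sketch below is an assumption about the general technique, not a description of Oracle Ultra Search internals; the `AclStore` class and the (principal, permission) representation are hypothetical.

```python
class AclStore:
    """Toy ACL store that reuses an existing ACL when the grants are identical."""

    def __init__(self):
        self._ids = {}  # canonical grant set -> ACL id

    def acl_id_for(self, grants):
        """Return the id of the ACL with these (principal, permission) grants,
        creating a new ACL only if no identical one exists."""
        key = frozenset(grants)  # order-independent canonical form
        if key not in self._ids:
            self._ids[key] = len(self._ids) + 1
        return self._ids[key]

    def count(self):
        return len(self._ids)


store = AclStore()
first = store.acl_id_for([("cn=ana", "read"), ("cn=raj", "read")])
second = store.acl_id_for([("cn=raj", "read"), ("cn=ana", "read")])  # same grants
third = store.acl_id_for([("cn=lee", "read")])
```

Two documents with the same grants share one stored ACL, which is where the storage saving comes from.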
Oracle Ultra Search supports only a single LDAP domain. The LDAP users and groups specified in the ACL must belong to the same LDAP domain.
Caution:
If ACLs are crawled from data sources, then it is the responsibility of the administrator to ensure that the data sources being crawled belong to the same LDAP domain. Otherwise, it is possible that search users can inadvertently be granted permissions to documents that they should not be able to access.
Searches run against a secure-search enabled Oracle Ultra Search instance are slower than those run against a non-secure-search enabled instance. This is because each candidate result could require an ACL evaluation. ACLs are evaluated natively by the Oracle server for optimum performance. Nevertheless, the time taken to return hits in a secure search varies depending on the number of ACL evaluations that must be made.
Oracle Ultra Search stores ACLs in the Oracle XML DB repository. Oracle Ultra Search also uses Oracle XML DB functionality to evaluate ACLs. This dependency exists only for the users who use secure searching.
The ACLs are managed by Oracle Ultra Search. ACLs are uniquely referenced by documents from a single Oracle Ultra Search instance and ACLs are not shared by multiple Oracle Ultra Search instances. For acceptable performance, the ACL cache size must be large enough to contain all ACLs evaluated at run time.
ACLs in the XML DB repository are protected by other ACLs, known as protector ACLs. Oracle Ultra Search ensures that the protector ACLs grant appropriate privileges to Oracle Ultra Search to call the XML DB ACL evaluation mechanism. The evaluation performance is primarily affected by the total number of ACLs used by all the XML DB client applications that also utilize its ACL evaluation mechanism. This set of applications includes Oracle Ultra Search.
An Oracle Ultra Search data source can be protected by a single administrator-specified ACL. This ACL specifies the users and groups who are permitted to view the documents belonging to a data source.
Oracle Ultra Search uses the Oracle Server's ACL evaluation engine to evaluate permissions when queries are performed by search users. This ACL evaluation engine is a feature of Oracle XML DB. If an Oracle Ultra Search query attempts to retrieve a document that is protected by an administrator-specified ACL, then the ACL is evaluated and subsequently cached.
The length of time that an ACL remains cached is controlled by an XML DB configuration parameter. To change it, modify the acl-max-age parameter. The value is a number of seconds that determines how long ACLs stay in the cache.
As ACLs are cached, it is important to remember that changes to an administrator-specified ACL may not propagate immediately. This only applies to database sessions that existed before the change was made.
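The effect of an age-limited cache on change propagation can be illustrated with a small sketch. It does not model how XML DB implements its ACL cache; the `AclCache` class, its `max_age_seconds` argument (standing in for the role of acl-max-age), and the injected clock are assumptions made for the example.

```python
import time


class AclCache:
    """Toy cache whose entries expire after max_age_seconds."""

    def __init__(self, max_age_seconds, clock=time.monotonic):
        self._max_age = max_age_seconds
        self._clock = clock
        self._entries = {}  # acl_id -> (time cached, ACL value)

    def get(self, acl_id, load):
        """Return the cached ACL if it is still fresh; otherwise reload it."""
        now = self._clock()
        entry = self._entries.get(acl_id)
        if entry is not None and now - entry[0] < self._max_age:
            return entry[1]
        value = load(acl_id)
        self._entries[acl_id] = (now, value)
        return value


now = [0.0]                                   # fake clock for the demonstration
source = {"acl-1": {"ana"}}
cache = AclCache(max_age_seconds=60, clock=lambda: now[0])
first = cache.get("acl-1", source.get)
source["acl-1"] = {"ana", "raj"}              # the administrator edits the ACL
stale = cache.get("acl-1", source.get)        # still served from the cache
now[0] = 61.0                                 # the entry is now older than 60 s
fresh = cache.get("acl-1", source.get)        # reloaded with the new grants
```

A smaller maximum age makes changes visible sooner at the cost of more frequent reloads.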
With document relevancy boosting, you can override the search results and influence the order in which documents are ranked in the query result list. This can promote important documents to higher scores and make them easier to find during searches.
Relevancy boosting assigns a score to a document for specific queries entered by you.
Note:
The document still has a score computed by Oracle Text if you enter a query that is not one of the boosted queries.
Relevancy boosting has the following limitations:
Comparison of the user's query against the boosted queries uses exact string matching. This means that the comparison is case-sensitive and space-aware. Therefore, a document with a boosted score for "Ultra Search" is not boosted when you enter "ultrasearch".
Relevancy boosting requires that the query application pass in the search term using the query API getResult method call. The applications are designed to pass the basic search terms as the boost term. Advanced search criteria based on search attributes are ignored.
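The exact-match limitation is easy to see in a short sketch. The boost table, URL, scores, and `score` function below are hypothetical; the sketch shows only that an exact, case-sensitive, space-aware lookup misses near-identical queries.

```python
# Hypothetical boost table: boosted query -> {document URL: boosted score}.
BOOSTS = {"Ultra Search": {"http://example.com/ultrasearch.html": 100}}


def score(query, url, text_score):
    """Return the boosted score when the query string matches a boosted
    query exactly; otherwise fall back to the text engine's score."""
    return BOOSTS.get(query, {}).get(url, text_score)


url = "http://example.com/ultrasearch.html"
boosted = score("Ultra Search", url, 37)
not_boosted = score("ultrasearch", url, 37)
```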
See Also:
"Queries Page"
Oracle Ultra Search translates each user query into a database query. This process is called query syntax expansion. The Oracle Ultra Search default expansion logic boosts the relevancy of the documents that match the user's query. The query syntax expansion can be customized with the query API.
See Also:
"Customizing the Query Syntax Expansion"
While gathering information from a database Web application, Oracle Ultra Search lets you specify a URL to display the retrieved data in a browser. The URL points to a screen in the Web application corresponding to the data in the database. This facility is available for table data sources, file data sources, and user-defined data sources.
See Also:
"Using Crawler Agents"
Originally, Oracle Ultra Search used centralized search to gather data on a regular basis and to update an index that cataloged all searchable data. This provided fast searching, but it required the data source to be crawlable before it could be searched. Oracle Ultra Search also provides federated search, which enables a single search to be performed across multiple indexes. Each index can be maintained separately. Because the data source is queried at search time, search results are always the latest results. User credentials can be passed to the data source and authenticated by the data source itself. Queries can be processed efficiently using the data's native format.
To use federated search, you must deploy an Oracle Ultra Search search adapter, or searchlet, and create an Oracle source. A searchlet is a Java module deployed in the middle tier (inside OC4J) that searches the data in an enterprise information system. When a user's query is delegated to the searchlet, the searchlet runs the query on behalf of the user. Every searchlet is a JCA 1.0 compliant resource adapter.
See Also:
"Federated Sources"
The Oracle Ultra Search administration tool supports the following modes of logging on, depending on the type of user. You can log on as:
A single sign-on user managed in the Oracle Internet Directory and authenticated with Oracle Application Server Single Sign-On
A local database schema user in the Oracle Ultra Search database (who is not using single sign-on)
A Portal user
An Enterprise Manager user
Note:
Single sign-on is available only with the Oracle Identity Management infrastructure.
See Also:
"Logging On to Oracle Ultra Search"
Oracle Internet Directory is Oracle's native LDAP v3-compliant directory service, built as an application on the Oracle Database. Oracle Internet Directory hosts the Oracle common identity. All Oracle Web-based products integrate with Oracle Application Server Single Sign-On.
An Oracle Ultra Search administration group contains a set of users. Each user can belong to one or more groups. All groups are created using the groupOfUniqueNames and orclGroup object classes.
The only way to grant a user administration privileges is to assign them to an administration group. Oracle Ultra Search authorizes the user administration privileges based on the administration groups to which the user belongs. The following groups are created for each Oracle Ultra Search instance:
Super-users: Users in this group can create or drop Oracle Ultra Search instances and can administer Oracle Ultra Search instances within the installation. Super-users must obey the rules for document relevancy boosting and ACLs defined for each of the documents associated with the Oracle Ultra Search instance. For example, if a document ACL does not grant access to the super-user or group, then the super-user cannot search and browse the document.
Instance administrators: Users in this group can administer the Oracle Ultra Search instance. Only the instance database schema user and members in the super-users group can drop the instance.
The authorization of the administration user is performed in the following steps:
After the administration user is successfully authenticated by Oracle Application Server Single Sign-On or the Oracle Ultra Search database, the Oracle Ultra Search GUI brings up a screen for the user to choose an Oracle Ultra Search instance.
The Oracle Ultra Search GUI looks up the Oracle Internet Directory server or Oracle Ultra Search repository to find all Oracle Ultra Search instances that the administration user has privileges to administer.
The administration user chooses the Oracle Ultra Search instance from the list.
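The lookup in the second step can be sketched as a group-membership check. The data structures, user names, instance names, and the `administrable_instances` function are hypothetical; the real lookup goes against Oracle Internet Directory or the Oracle Ultra Search repository.

```python
def administrable_instances(user, super_users, instance_admins):
    """Return the instance names that user may administer, sorted for display."""
    if user in super_users:
        return sorted(instance_admins)  # super-users can administer every instance
    return sorted(name for name, admins in instance_admins.items()
                  if user in admins)


SUPER_USERS = {"root_admin"}
INSTANCE_ADMINS = {"hr_search": {"ana"}, "eng_search": {"ana", "raj"}}
choices = administrable_instances("raj", SUPER_USERS, INSTANCE_ADMINS)
```

The resulting list corresponds to the set of instances offered to the user in the final step.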
Oracle Ultra Search includes fully functional query applications to query and display search results. The query applications include a search portlet.
The Oracle Ultra Search portlet demonstrates how to write a search portlet for Oracle Application Server Portal. It is implemented as a JavaServer Page application. The same portlet is installed as a feature of the Oracle Application Server Portal product.
See Also:
The Oracle Application Server Portal documentation for more information about portlets
Oracle Ultra Search Query Applications Readme for more information about the query API application
Although Oracle Ultra Search in the Oracle Application Server is the same product as Oracle Ultra Search in Oracle Collaboration Suite and Oracle Ultra Search in the Oracle Database, there are a few functional differences:
The Oracle Database is not integrated with OracleAS Portal. However, with OracleAS and Oracle Collaboration Suite installations, Oracle Ultra Search lets Portal users add powerful multi-repository search to their Portal pages. OracleAS and Oracle Collaboration Suite also have the capability to crawl and make searchable Portal's own repository. The Portal crawler recognizes Portal page groups as data sources.
OracleAS Single Sign-On users can log on once for all components of the Oracle Application Server product, and the Oracle Ultra Search administrative interface enables user management operations on either database users or single sign-on users. Authenticated single sign-on users never see the Oracle Ultra Search logon screen. Instead, they can immediately choose an instance. If the single sign-on user does not have permissions to manage Oracle Ultra Search (set in the Users page), then the single sign-on user receives an error. Single sign-on is available only with the Oracle Identity Management infrastructure.
See Also:
http://portalstudio.oracle.com
Oracle Ultra Search runs as a client program to the Oracle server. It can be deployed in the back end or in the middle tier of a server configuration.
The Oracle Ultra Search query interface and the administration tool can be accessed from any HTML browser client. The administration tool relies on certain Java classes in the middle tier. This logical middle tier can be on the same physical computer as the one that runs the database server, or on a different one running Oracle Application Server. The Oracle Ultra Search database back end consists of the Oracle Ultra Search data dictionary, which stores metadata on all the different repositories, as well as the schedules and Java classes needed to drive the crawler. The crawler itself can run either on the database server computer or remotely on another computer.
See Also:
Chapter 4, "Installing Oracle Ultra Search" for more information about the components
Figure 1-2 illustrates the Oracle Ultra Search system configuration.