Oracle® Ultra Search Administrator's Guide 10g Release 2 (10.2) Part Number B14222-01
The Oracle Ultra Search administration tool lets you manage Oracle Ultra Search instances. This chapter guides you through the screens of the administration tool.
The Oracle Ultra Search administration tool is a J2EE-compliant Web application. To use it, log on through any browser as a database user, an Enterprise Manager super-user, a Portal user, or a single sign-on user.
Note:
The Oracle Ultra Search administration tool and the Oracle Ultra Search query applications are part of the Oracle Ultra Search middle tier. However, the Oracle Ultra Search administration tool is independent from the Oracle Ultra Search query application. Therefore, they can be hosted on different computers to enhance security or scalability.
With the administration tool, you can do the following:
Log on to Oracle Ultra Search
Create Oracle Ultra Search instances
Manage administrative users
Define data sources and assign them to data groups
Configure and schedule the Oracle Ultra Search crawler
Set query options
Translate search attributes and LOV and data group display names to different languages
To configure the Oracle Ultra Search crawler, you must do the following:
Set crawler parameters, such as the crawler log file directory. To do so, use the Crawler Page.
Set Web access parameters, such as authentication and the proxy server. To do so, use the Web Access Page.
Define data sources. Data sources can be Web pages, database tables, files, e-mail mailing lists, Oracle Sources (for example, Oracle Application Server Portals or federated sources), or user-defined data sources. You can assign one or more data sources to a crawler schedule. To define data sources, use the Sources Page. You can also set parameters for the source, such as domain inclusions or exclusions for Web sources or the display URL template or column for table sources.
Define synchronization schedules. The crawler uses the synchronization schedule to reconcile the Oracle Ultra Search index with current data source content. To define crawling schedules, use the Schedules Page.
Use query options to let users limit their searches. Searches can be limited by document attributes and data groups.
Search attributes can be mapped to table columns, document attributes, and e-mail headers. Some attributes, such as author and description, are predefined and need no configuration. However, you can customize your own attributes. To set custom search attributes to expose to the query user, use the Attributes Page.
Data source groups are logical entities exposed to the search engine user. When entering a query, the search engine user is asked to select one or more data groups to search from. A data group consists of one or more data sources. To define data groups, use the Queries Page.
Oracle Ultra Search provides context-sensitive online help, which can be viewed in different languages. You can change the language preferences on the Users Page.
The following users can log on to the Oracle Ultra Search administration tool:
Single Sign-on users: These users are managed by the Oracle Internet Directory and are authenticated by OracleAS Single Sign-On. The Oracle Ultra Search administration tool identifies all Oracle Ultra Search instances to which the single sign-on user has access. This is available only if you have the Oracle Identity Management infrastructure installed.
Database users (non-single sign-on): These users exist in the database on which Oracle Ultra Search runs.
Portal single sign-on users
To log on to the administration tool, point your Web browser to one of the following URLs:
For non-single sign-on mode:
http://hostname:port/ultrasearch/admin/index.jsp
For single sign-on mode:
http://hostname:port/ultrasearch/admin_sso/index.jsp
Immediately after installation, the only users able to create and manage instances are the following:
The Enterprise Manager user
The PORTAL single sign-on user belonging to the default company [not supported in the Oracle database release]
The ORCLADMIN single sign-on user belonging to the default company [this is available only if the Oracle Identity Management infrastructure is installed]
After you are logged on as one of these special users, you can grant permission to other users, enabling them to create and manage Oracle Ultra Search instances. Using the Oracle Ultra Search administration tool, you can only grant and revoke Oracle Ultra Search related permissions to and from existing users. To add or delete users, use the Oracle Internet Directory for single sign-on users or Oracle Enterprise Manager for local database users.
Note:
The Oracle Ultra Search product database dictionary is installed in the WKSYS schema.
See Also:
"Changing Oracle Ultra Search Schema Passwords" for information about changing the WKSYS password
"Instances Page" for more information about creating Oracle Ultra Search instances
"Users Page" for more information about granting permission to other users
"Logging On and Managing Instances as Single Sign-On Users" for more information about how Oracle Ultra Search handles single sign-on users
Note:
Single Sign-On is available only if the Oracle Identity Management infrastructure is installed.
When a single sign-on user logs on to the Oracle Ultra Search administration tool, the user is first prompted with the single sign-on login screen. Enter the single sign-on user name and password. After OracleAS Single Sign-On authenticates the user, the user sees a list of Oracle Ultra Search instances that they have the privileges to manage.
There are different URLs for different users. For example:
Single sign-on users:
http://host:http_port/ultrasearch_admin_sso/index.jsp
Portal users:
http://host:http_port/pls/portal
Enterprise Manager users:
http://host:em_port/
You might need to grant super-user privileges, or privileges for managing an Oracle Ultra Search instance, to a single sign-on user. This process is slightly different, depending on whether Oracle Application Server Portal is running in hosted mode or non-hosted mode, as described in the following list:
Note:
A single sign-on user is uniquely identified by Oracle Ultra Search with a single sign-on nickname and subscriber nickname combination.
In non-hosted mode, the subscriber nickname is not required when granting privileges to a single sign-on user. This is because there is exactly one subscriber in Oracle Application Server Portal in non-hosted mode.
In hosted mode, the subscriber nickname is required when granting privileges to a single sign-on user. This is because there can be more than one subscriber in Oracle Application Server Portal, and two or more users with the same single sign-on nickname (for example, PORTAL) could be distinct single sign-on users distinguished by their subscriber nickname. When running Portal in hosted mode, also note the following:
When granting permissions for the default subscriber user, always specify DEFAULT COMPANY for the subscriber nickname, even though the actual nickname could be different; for example, ORACLE. The actual nickname is not recognized by Oracle Ultra Search.
When logging in to OracleAS Single Sign-On as the default subscriber user, leave the subscriber nickname blank. Alternatively, enter DEFAULT COMPANY instead of the actual subscriber nickname (for example, ORACLE), so that it is recognized by Oracle Ultra Search.
Note:
At any point after installation, you can run an Oracle Application Server Portal script to alter the running mode from non-hosted to hosted. Whenever this is done, the Oracle Application Server Portal script invokes an Oracle Ultra Search script to inform Oracle Ultra Search of the change from non-hosted to hosted modes.
After successfully logging on to the Oracle Ultra Search administration tool, you find yourself on the Instances Page. This page manages all Oracle Ultra Search instances in the local database. In the top left corner of the page, there are tabs for creating, selecting, editing, and deleting instances.
Before you can use the administration tool to configure crawling and indexing, you must create an Oracle Ultra Search instance. An Oracle Ultra Search instance is identified with a name and has its own crawling schedules and index. Only users granted super-user privileges can create Oracle Ultra Search instances.
To create an instance, click Create. You can create a regular instance or a read-only snapshot instance. Only users with super-user privileges can create new instances.
Note:
If you define the same data source within different Oracle Ultra Search instances, then there could be crawling conflicts for table data sources with logging enabled, e-mail data sources, and some user-defined data sources.
To create an instance, do the following:
Prepare the database user.
Every Oracle Ultra Search instance is based on a database user and schema with the WKUSER role.
The database user you create to house the Oracle Ultra Search instance should be assigned a dedicated self-contained tablespace. This is important if you plan to ever create snapshot instances of this instance. To do this, create a new tablespace. Then, create a new database user whose default tablespace is the one you just created.
See Also:
"Configuring the Oracle Server for Oracle Ultra Search" for information and instructions on configuring database users for Oracle Ultra Search
Follow the instance creation steps in the Oracle Ultra Search administration tool.
From the main instance creation page, click Create Instance, and provide the following information:
Instance name
Database schema: this is the user name from Step 1.
Schema password
You can also enter the following optional index preferences:
Lexer
Specify the name of the lexer you want to use for indexing. The lexer breaks text into tokens according to your language. These tokens are usually words. The default lexer is wksys.wk_lexer, as defined in the wk0pref.sql file. After the instance is created, the lexer can no longer be changed.
Stoplist
Specify the name of a stoplist you want to use during indexing. The default stoplist is wksys.wk_stoplist, as defined in the wk0pref.sql file. Avoid modifying the stoplist after the instance has been created.
Storage
Specify the name of the storage preference for the index of your instance. The default storage preference is wksys.wk_storage, as defined in the wk0pref.sql file. After the instance is created, the storage preference cannot be changed.
See Also:
Oracle Text Reference for more information on creating and modifying lexers, stoplists, and storage preferences
A snapshot instance is a copy of another instance. Unlike a regular instance, a snapshot instance is read only; it does not synchronize its index to the search domain. After the master instance re-synchronizes to the search domain, the snapshot instance becomes out of date. At that point, you should delete the snapshot and create a new one.
Note:
The snapshot and its master instance cannot reside on the same database.
A snapshot instance is useful for the following purposes:
Query Processing
Two Oracle Ultra Search instances can answer queries about the same search domain. Therefore, in a set amount of time, two instances can answer more queries about that domain than one instance. Because snapshot instances do not involve crawling and indexing, snapshot instance creation is fast and inexpensive. Thus, snapshot instances can improve scalability.
Backups
If the master instance becomes corrupted, its snapshot can be transformed into a regular instance by editing the instance mode to updatable. Because the snapshot and its master instance cannot reside on the same database, a snapshot instance should be made updatable only to replace a corrupted master instance.
A snapshot instance does not inherit authentication from the master instance. Therefore, if you make a snapshot instance updatable, you must re-enter any authentication information needed to crawl the search domain.
To create a snapshot instance, do the following:
Prepare the database user.
As with regular instances, snapshot instances require a database user. This user must have been granted the WKUSER role.
Copy the data from the master instance.
This is done with the transportable tablespace mechanism, which does not permit renaming of tablespaces. Therefore, a snapshot instance cannot be created on the same database as its master.
Identify the tablespace or the set of tablespaces that contain all the master instance data. Then, copy it, and plug it into the database user from Step 1.
Follow snapshot instance creation in the Oracle Ultra Search administration tool.
From the main instance creation page, click Create Read-Only Snapshot Instance, and provide the following information:
Snapshot instance name
Snapshot schema name: this is the database user from Step 1.
Snapshot schema password
Database link: this is the name of the database link to the database where the master instance lives.
Master instance name
Enable the snapshot for secure searches.
If the master instance for the snapshot is secure-search enabled, and if the destination database in which you are creating the snapshot supports secure-search enabled instances, then you must also run a PL/SQL procedure in the destination database.
Running this procedure translates the IDs of the access control lists (ACLs) in the destination database, rendering them usable. Log on to the database as the WKSYS user. Invoke the procedure as follows:
exec WK_ADM.USE_INSTANCE('instance_name');
exec WK_ADM.TRANSLATE_ACL_IDS();
where instance_name is the name of the snapshot instance.
Make sure that both statements complete successfully without error.
See Also:
Chapter 5, "Oracle Ultra Search Postinstallation Information" for information on changing the WKSYS password and for instructions on configuring database users for Oracle Ultra Search
Oracle Database Administrator's Guide for details on using transportable tablespaces
You can have multiple Oracle Ultra Search instances. For example, an organization can have separate Oracle Ultra Search instances for its marketing, human resources, and development portals. The administration tool requires you to specify an instance before it lets you make any instance-specific changes.
To select an instance, do the following:
Click Select on the Instances Page.
Select an instance from the pull-down menu.
Click Apply.
Note:
Instances do not share data. Data sources, schedules, and indexes are specific to each instance.
To delete an instance, do the following:
Click Delete on the Instances Page.
Select an instance from the pull-down menu.
Click Apply.
Note:
To delete an Oracle Ultra Search instance, the user must be granted super-user privileges.
To edit an instance, click Edit on the Instances Page.
You can change the instance mode (make the instance updatable) or change the instance password.
You can change the instance mode to updatable or read only. Updatable instances synchronize themselves to the search domain on a set schedule, whereas read-only instances (snapshot instances) do not do any synchronization. To set the instance mode, select the box corresponding to the mode you want, and click Apply.
An Oracle Ultra Search instance must know the password of the database user in which it resides. The instance cannot get this information directly from the database. During instance creation, Oracle provides the database user password, and the instance caches this information.
If this database user password changes, then the password that the instance has cached must be updated. To do this, enter the new password and click Apply. After the new password is verified against the database, it replaces the cached password.
The Oracle Ultra Search crawler is a Java application that spawns threads to crawl defined data sources, such as Web sites, database tables, or e-mail archives. Crawling occurs at regularly scheduled intervals, as defined in the Schedules Page.
On the Crawler page, you can configure various crawler settings.
Crawler Threads
Specify the number of crawler threads to be spawned at run time.
Number of Processors
Specify the number of central processing units (CPUs) that exist on the server where the Oracle Ultra Search crawler will run. This setting determines the optimal number of document conversion threads used by the system. A document conversion thread converts multiformat documents into HTML documents for proper indexing.
Automatic Language Detection
Not all documents retrieved by the Oracle Ultra Search crawler specify the language. For documents with no language specification, the Oracle Ultra Search crawler attempts to automatically detect language. Click Yes to turn on this feature.
The language recognizer is trained statistically using trigram data from documents in various languages (for instance, Danish, Dutch, English, French, German, Italian, Portuguese, and Spanish). It starts with the hypothesis that the given document does not belong to any language and ultimately refutes this hypothesis for a particular language where possible. It operates on the Latin-1 alphabet and on any language with a deterministic Unicode range of characters (such as Chinese, Japanese, and Korean).
The crawler determines the language code by checking the HTTP header content-language or the LANGUAGE column, if it is a table data source. If it cannot determine the language, then it takes the following steps:
If the language recognizer is not available or if it is unable to determine a language code, then the default language code is used.
If the language recognizer is available, then the output from the recognizer is used.
This language code is populated in the LANG column of the wk$url and wk$doc tables. Multilexer is the only lexer used for Oracle Ultra Search. All document URLs are stored in wk$doc for indexing and wk$url for crawling.
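The decision order described above can be sketched as follows. This is a minimal illustration in Python, not the crawler's actual code: the function name resolve_language_code, its parameters, and the toy recognizer are all hypothetical, and the recognizer is modeled as an optional callable that returns a language code or None.

```python
from typing import Callable, Optional


def resolve_language_code(
    declared: Optional[str],
    text: str,
    default_code: str,
    recognizer: Optional[Callable[[str], Optional[str]]] = None,
) -> str:
    """Pick a language code in the order the section describes (hypothetical helper)."""
    # 1. A declared language (content-language header or LANGUAGE column) wins.
    if declared:
        return declared
    # 2. Otherwise use the recognizer's output, if a recognizer is available
    #    and it reaches a conclusion.
    if recognizer is not None:
        detected = recognizer(text)
        if detected:
            return detected
    # 3. Fall back to the configured default language.
    return default_code


def toy_recognizer(text: str) -> Optional[str]:
    """Stand-in for the statistical recognizer; assumption for illustration only."""
    return "de" if "und" in text.split() else None


print(resolve_language_code("fr", "bonjour", "en", toy_recognizer))
print(resolve_language_code(None, "Hund und Katze", "en", toy_recognizer))
print(resolve_language_code(None, "hello world", "en", toy_recognizer))
```

The three calls exercise the three branches in turn: a declared language, a recognizer result, and the default language.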
Default Language
If automatic language detection is disabled, or if a Web document does not have a specified language, then the crawler assumes that the Web page is written in this default language. This setting is important, because language directly determines how a document is indexed.
Note:
This default language is used only if the crawler cannot determine the document language during crawling. Set language preference in the Users Page.
You can select a default language for the crawler or for data sources. Default language support for indexing and querying is available for the following languages:
Polish
Chinese
Hungarian
Norwegian
Romanian
Finnish
Japanese
Spanish
Slovak
English
Turkish
Danish
Swedish
Russian
German
Korean
Dutch
Italian
Greek
Portuguese
Czech
Hebrew
French
Arabic
Crawling Depth
A Web document can contain links to other Web documents, which can contain more links. This setting lets you specify the maximum number of nested links the crawler will follow.
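To illustrate what a depth limit means, here is a minimal sketch of a depth-limited, breadth-first traversal over an in-memory link graph. It is not the Oracle Ultra Search crawler: the function crawl, the sample site, and the choice to count the starting page as depth 0 are assumptions made for illustration.

```python
from collections import deque
from typing import Dict, List


def crawl(start: str, links: Dict[str, List[str]], max_depth: int) -> Dict[str, int]:
    """Return {url: depth} for every page reachable within max_depth nested links."""
    seen = {start: 0}          # the starting page is treated as depth 0 (assumption)
    queue = deque([start])
    while queue:
        url = queue.popleft()
        depth = seen[url]
        if depth == max_depth:
            continue           # do not follow links beyond the configured depth
        for target in links.get(url, []):
            if target not in seen:
                seen[target] = depth + 1
                queue.append(target)
    return seen


# Hypothetical site: "/" links to "/a" and "/b", "/a" links to "/a/1", and so on.
site = {"/": ["/a", "/b"], "/a": ["/a/1"], "/a/1": ["/a/1/x"], "/b": []}
print(crawl("/", site, 2))
```

With a depth of 2, the page /a/1/x is never visited, because reaching it requires following three nested links from the starting page.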
See Also:
"Tuning the Web Crawling Process" for more information on the importance of the crawling depth
Crawler Timeout Threshold
Specify, in seconds, a crawler timeout threshold. The crawler timeout threshold is used to force a timeout when the crawler cannot access a Web page.
Default Character Set
Specify the default character set. The crawler uses this setting when an HTML document does not have its character set specified.
Cache Directory
Specify the absolute path of the cache directory. During crawling, documents are stored in the cache directory. Every time the preset size is reached, crawling stops and indexing starts.
If you are crawling sensitive information, then make sure that you set the appropriate file system read permission to the cache directory.
You can choose whether or not to have the cache cleared after indexing.
Specify the following:
Level of detail: everything or only a summary
Crawler logfile directory
Crawler logfile language
The log file directory stores the crawler log files. The log file records all crawler activity, warnings, and error messages for a particular schedule. It includes messages logged at startup, runtime, and shutdown. Logging everything can create very large log files when crawling a large number of documents. However, in certain situations, it can be beneficial to configure the crawler to print detailed activity to each schedule log file. The crawler logfile language is the language the crawler uses to generate the log file.
The crawler maintains multiple versions of its log file. The format of the log file name is:
i[instance_id]ds[data_source_id].MMDDhhmm.log
where MM is the month, DD is the date, hh is the launching hour in 24-hour format, and mm is the minutes. For example, if a schedule for data source 23 of instance 3 is launched at 10 pm, July 8th, then the log file name is i3ds23.07082200.log. Each successive schedule launching will have a unique log file name. If the total number of log files for a data source reaches the system-specified limit, then the oldest log file will be deleted. The number of log files is a scheduler property and applies to all of the data sources assigned to the scheduler.
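The naming rule can be reproduced with a short Python sketch, which may help when locating the log file for a given schedule run. The helper name crawler_log_name is hypothetical, and the year in the example is arbitrary because the file name does not include it.

```python
from datetime import datetime


def crawler_log_name(instance_id: int, data_source_id: int, launched: datetime) -> str:
    """Build a log file name of the form i<instance>ds<source>.MMDDhhmm.log."""
    # %m = month, %d = day of month, %H = hour (24-hour clock), %M = minutes
    return f"i{instance_id}ds{data_source_id}.{launched:%m%d%H%M}.log"


# Data source 23 of instance 3, launched at 10 pm on July 8th (year is arbitrary).
print(crawler_log_name(3, 23, datetime(2005, 7, 8, 22, 0)))  # i3ds23.07082200.log
```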
Database Connect String
The database connect string is a standard JDBC connect string used by the remote crawler when it connects to the database. The connect string can be provided in the form of [hostname]:[port]:[sid] or in the form of a TNS keyword-value syntax; for example:
"(DESCRIPTION=(ADDRESS=(PROTOCOL=tcp)(HOST=...)(PORT=1521)...))"
You can update the JDBC connect string to a different format; for example, an LDAP format. However, you cannot change the JDBC connect string to point to a different database. The JDBC connect string must be set to the database where the middle tier points; that is, the middle tier and the JDBC should point to the same database.
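The two connect string forms described above can be sketched in Python as follows. The helper names, host names, SID, and service name are hypothetical; the sketch only assembles strings and checks that the parentheses of the keyword-value form are balanced, which is an easy mistake to make when editing such strings by hand.

```python
from typing import Sequence


def simple_connect_string(host: str, port: int, sid: str) -> str:
    """Return the [hostname]:[port]:[sid] form."""
    return f"{host}:{port}:{sid}"


def tns_descriptor(hosts: Sequence[str], port: int, service_name: str) -> str:
    """Return a TNS keyword-value descriptor with one ADDRESS entry per host."""
    addresses = "".join(
        f"(ADDRESS=(PROTOCOL=TCP)(HOST={host})(PORT={port}))" for host in hosts
    )
    return f"(DESCRIPTION={addresses}(CONNECT_DATA=(SERVICE_NAME={service_name})))"


def parens_balanced(text: str) -> bool:
    """Check that every opening parenthesis has a matching close."""
    depth = 0
    for ch in text:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:
                return False
    return depth == 0


print(simple_connect_string("dbhost1", 1521, "orcl"))
descriptor = tns_descriptor(["dbhost1", "dbhost2"], 1521, "search.example.com")
print(descriptor, parens_balanced(descriptor))
```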
In a Real Application Clusters environment, the TNS keyword-value syntax should be used, because it enables connection to any node of the system. For example,
"(DESCRIPTION=(LOAD_BALANCE=yes)(ADDRESS=(PROTOCOL=TCP)(HOST=cls02a)(PORT=3001))(ADDRESS=(PROTOCOL=TCP)(HOST=cls02b)(PORT=3001))(CONNECT_DATA=(SERVICE_NAME=sales.us.acme.com)))"
Use this page to view and edit remote crawler profiles.
A remote crawler profile consists of all parameters needed to run the Oracle Ultra Search crawler on a remote computer other than the Oracle Ultra Search database. To register a remote crawler, you need to use the PL/SQL API wk_crw.register_remote_crawler. You can choose either RMI-based or JDBC-based remote crawling.
To configure the remote crawler, click Edit. Here is a list of configuration parameters that you can change for the remote crawler:
Cache file access mode. You have two options for the remote crawler to handle cache files:
Through a JDBC connection.
If you choose this option, the remote crawler sends cache files over the crawler's JDBC connection to the server's cache directory.
Through a mounted file system.
If you choose this option, the cache file will be saved in the remote crawler cache directory. The remote crawler cache directory must be mounted to the server side crawler cache directory (specified under Crawler -> Settings tab); otherwise, the documents cannot be indexed.
See Also:
"Using the Remote Crawler" for more on crawling with JDBC connections
Cache directory location (absolute path)
Crawler log file directory
Mail archive path
Number of crawler threads
Number of processors
Initial Java heap size (in megabytes)
Maximum Java heap size (in megabytes)
Java classpath
Use this page to view the following crawler statistics:
This provides a general summary of crawler activity:
Aggregate crawler statistics
Total number of documents indexed
Crawler statistics by data source type
This includes the following:
List of hosts crawled and indexed
Document distribution by depth
Document distribution by document type
Document distribution by data source type
For the crawler to contact Web pages that reside outside your firewall, you must register a proxy on the Proxies page. If the proxy requires authentication, then you must enter the proxy authentication information on the Authentication page.
Specify a proxy server if the search space includes Web pages that reside outside your organization's firewall.
To set the proxy, enter the proxy server name and port. For example, myproxy.mydomain, 8080. Because internal Web sites should not go through the proxy server, specify proxy domain exceptions if the proxy server is set. Enter the host name suffix that should not go through the proxy in the exception field. Use the suffix of the host name without "http". For example, us.oracle.com, oracle.com, uk.oracle.com, and oraclecorp.com. The no-proxy check is strictly a suffix match on the host name. An IP address can be used only when the crawled URL also specifies the host as an IP address. In other words, they must be consistent.
If the proxy requires authentication, then specify the proxy login user name and password in the Authentication page.
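The suffix matching used for proxy exceptions can be illustrated with a short Python sketch. The function bypasses_proxy is hypothetical and is not part of Oracle Ultra Search; it only shows what a strict suffix match on the host name implies, including why an IP address exception matches only URLs that are themselves written with that IP address.

```python
from typing import Sequence
from urllib.parse import urlsplit


def bypasses_proxy(url: str, exceptions: Sequence[str]) -> bool:
    """Return True if the URL's host name ends with one of the exception suffixes."""
    host = (urlsplit(url).hostname or "").lower()
    # Plain string suffix comparison: no DNS lookup, no IP-to-name translation.
    return any(host.endswith(suffix.lower()) for suffix in exceptions)


exceptions = ["us.oracle.com", "oraclecorp.com", "10.1.2.3"]
print(bypasses_proxy("http://www.us.oracle.com/index.html", exceptions))
print(bypasses_proxy("http://www.example.com/", exceptions))
print(bypasses_proxy("http://10.1.2.3/docs/", exceptions))
```

Because the comparison is a plain suffix test, an exception should be specific enough that it does not accidentally match unrelated external hosts whose names end with the same characters.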
Use this page to enter authentication information that applies to all data sources.
Note:
The data source specific authentication takes precedence over this global authentication.
Oracle Ultra Search supports both basic and digest authentication.
Specify the user name and password for the host and realm for which HTTP authentication is required. For example, enter "myproxy.mydomain", "LDAP", "myname", and "mypassword" as the host, realm, user name, and password.
The realm is a name associated with the protected area of a Web site. It is a string that you provide to log on to such a protected page.
For proxy authentication, enter the user name, password, and realm of the proxy server.
HTML forms are used on the World Wide Web to collect information from a user. One common usage is to collect authentication information. Text boxes for entering your username and password on an HTML page use HTML form authentication.
HTML forms vary in complexity. They can be simple, or they can be very complex and incorporate a lot of Javascript. The following example shows a simple HTML form with two text fields.
<FORM action="http://somesite.com/prog/adduser" method="post" name="MyForm">
<LABEL for="i1">Username: </LABEL> <INPUT name="username" type="text" id="i1"><BR>
<LABEL for="i2">Password: </LABEL> <INPUT name="passwd" type="password" id="i2"><BR>
</FORM>
In a browser this will look like Figure 8-1.
Where:
name = form name
method = whether an HTTP POST or GET should be used for form data submission
action = the URL where the data should be submitted (this is usually a servlet of some sort)
INPUT elements = these are called form controls. In this example, there is a "text" type control, and a "password" type control, which is also a text box but hides the text typed into it. Although there are many different types of controls, when the form is submitted, the controls translate to simple name-value pairs and are sent as such to the action URL. The name is the name of the INPUT element, whereas the value is the value entered by the user (using a text box, a dropdown, and so on).
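The translation from form controls to name-value pairs can be sketched with the Python standard library. The control names come from the example form; the values jsmith and secret are made-up placeholders, and the helper encode_form is hypothetical. The sketch shows only the encoding step, not an actual HTTP request.

```python
from typing import List, Tuple
from urllib.parse import parse_qsl, urlencode


def encode_form(controls: List[Tuple[str, str]]) -> str:
    """Encode form controls as the name-value pairs sent to the action URL."""
    return urlencode(controls)


# One (name, value) pair per INPUT element in the example form.
controls = [("username", "jsmith"), ("passwd", "secret")]
body = encode_form(controls)
print(body)             # username=jsmith&passwd=secret
print(parse_qsl(body))  # decoding recovers the original pairs
```

For a POST submission, a string like this is what travels in the request body; for a GET submission, it is appended to the action URL as the query string.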
See Also:
HTML form documentation available from W3C (www.w3c.org)
Ultra Search lets you register HTML forms. During crawling, if the crawler finds one of the registered forms, it automatically fills out the form using the data you submitted during the form's registration.
You can register HTML forms using the Ultra Search administration tool. You can create Ultra Search instance-level form registration entries or data source-level entries that are only visible to the particular data source.
The preferred way to register a form is using the HTML form registration wizard. The wizard lets you fill out the form as usual while it tries to capture the submitted data. You specify the page with the form; the wizard fetches it and displays it; you fill out and submit the form, as when accessing it directly through a Web browser; the wizard shows the Web site's response to the submitted form; and then you confirm whether the form submission was a success.
This whole process is quick and simple. However, the HTML form registration wizard cannot handle forms that use Javascript. It might handle some depending on what is being done with Javascript, but there are no guarantees. The alternative is to do a manual form registration.
To do manual registration, you must be familiar with the anatomy of the form you are registering. You can see the form by browsing to it and asking the Web browser to display the page source.
As with the wizard based registration, you start by specifying the URL of the page containing the HTML form. Next you provide the following bits of information:
name = Form name
action URL = URL where the form is submitted (see preceding section on the form anatomy)
success URL = URL where one gets redirected upon successful submission of the form
controls = All the form controls and their desired values
When Javascript is used, some of these values might not be obvious, because the Javascript might be manipulating them or adding new controls dynamically. So, you need to understand the role that Javascript is playing.
Note:
The Oracle Ultra Search crawler will choose the form to use based on the form's URL and the form name. URL parameters are not included during matching; thus, they are truncated during form registration.
When your indexed documents contain metadata, such as author and date information, you can let users refine their searches based on this information. For example, users can search for all documents where the author attribute has a certain value.
The list of values (LOV) for a document attribute can help specify a search query. An attribute value can have a display name for it. For example, the attribute country might use country code as the attribute value, but show the name of the country to the user. There can be multiple translations of the attribute display name.
To define a search attribute, use the Search Attributes subtab. Oracle Ultra Search provides the following default search attributes: Title, Author, Description, Subject, Mimetype, Language, Host, and LastModifiedDate. They can be incorporated in search applications for a more detailed search and richer presentation. You can also define your own.
After defining search attributes, you must map between document attributes and global search attributes for data sources. To do so, use the Mappings subtab.
Note:
Oracle Ultra Search provides a command-line tool to load metadata, such as search attribute LOVs and display names, into an Oracle Ultra Search database. If you have a large amount of data, this is probably faster than using the HTML-based administration tool. For more information, see Appendix A, "Loading Metadata into Oracle Ultra Search".
Search attributes are attributes exposed to the query user. Oracle Ultra Search provides system-defined attributes, such as author and description. Oracle Ultra Search maintains a global list of search attributes. You can add, edit, or delete search attributes. You can also click Manage LOV to change the list of values (LOV) for the search attribute. There are two categories of attribute LOVs: one is global across all data sources, the other is data source-specific.
To define your own attribute, enter the name of the attribute in the text box; select string, date, or number; and click Add.
You can add or delete LOV entries and display names for search attributes. The display name is optional. If the display name is absent, then the LOV entry is used in the query screen.
Note:
LOV is only represented as string type. If LOV is in date format, then you must use "DD-MM-YYYY" to enter the LOV.
To update the policy value, click Manage LOV for any attribute.
A data source-specific LOV can be updated in three ways:
Update the LOV manually.
The crawler agent can automatically update the LOV during the crawling process.
New LOV entries can be automatically added by inspecting attribute values of incoming documents.
Caution:
If the update policy is agent-controlled, then the LOV and all translated values are erased in the next crawling.
This section displays mapping information for all data sources. For user-defined data sources, mapping is done at the agent level, and document attributes are automatically mapped to search attributes with the same name initially. Document attributes and search attributes are mapped one-to-one. For each user-defined data source, you can edit the global search attribute to which the document attribute is mapped.
For Web, file, or table data sources, mappings are created manually when you create the data source. For user-defined data sources, mappings are automatically created on subsequent crawls.
Click Edit Mappings to change this mapping.
Editing the existing mapping is costly, because the crawler must recrawl all documents for this data source. Avoid this step, unless necessary.
Note:
There are no user-managed mappings for e-mail sources. The mappings for e-mails are predefined. The "From" field of an e-mail is intrinsically mapped to the Oracle Ultra Search author attribute. Likewise, the "Subject" field of an e-mail is mapped to the Oracle Ultra Search subject attribute. The abstract of the e-mail message is mapped to the description attribute.
A collection of documents is called a source. The data source is characterized by the properties of its location, such as a Web site or an e-mail inbox. The Oracle Ultra Search crawler retrieves data from one or more data sources.
The different types of sources are:
Web Sources
Table Sources
E-mail Sources
File Sources
Oracle Sources
User-Defined Sources (requires a crawler agent)
See Also:
"Schedules Page" to assign one or more data sources to a synchronization schedule
"Queries Page" to assign data sources to data groups to enable restrictive querying
You can create as many data sources as you want. The following section explains how to create and edit data sources.
A Web source represents the content on a specific Web site. Web sources facilitate maintenance crawling of specific Web sites.
To create a new Web source, do the following:
Specify a name for the Web source and a starting address. This is the URL for the crawler to begin crawling. The starting address can be HTTP or HTTPS.
Set URL boundary rules to refine the crawling space. You can include or exclude hosts or domains beginning with, ending with, or equal to a specific name.
For example, an inclusion domain ending with oracle.com limits the Oracle Ultra Search crawler to hosts belonging to Oracle worldwide. Anything ending with oracle.com is crawled, but http://www.oracle.com.tw is not crawled. If you change the inclusion domain to yahoo.com with a new seed http://www.yahoo.com, then all oracle.com URLs are dropped by the crawler.
An exclusion domain uk.oracle.com prevents the crawler from crawling Oracle hosts in the United Kingdom. You can also include or exclude Web sites with a specific port. (By default, all ports are crawled.) You can have port inclusion or port exclusion rules for a specific host, but not both. Exclusion rules always override inclusion rules.
Specify the types of documents the Oracle Ultra Search crawler should process for this source. HTML and plain text are default document types that the crawler always processes.
Specify the authentication settings. This step is optional. Under HTTP Authentication, specify the user name and password for host and realm for which authentication is required. The realm is a name associated with the protected area of a Web site. Under HTML Forms, you can register HTML forms that you want the Oracle Ultra Search crawler to automatically fill out during Web crawling. HTML form support requires that HTTP cookie functionality is enabled. Cookies remember context between HTTP requests. For example, the server can send a cookie such that it knows if a user has already logged on and does not need to log on again. Cookie support is enabled by default. Click Register HTML Form to register authentication forms protecting the data source. Note: For the form URL to be crawled, you must verify that the URL is not excluded in the robots.txt file. If it is excluded, then you must disable robot exclusion for this data source. (By default, Oracle Ultra Search enables robot exclusion.)
Choose either No ACL or Ultra Search ACL for the data source. When a user performs a search, the ACL (access control list) controls which documents the user can access. The default is no ACL, with all documents considered searchable and visible. You can add more than one group and user to the ACL for the data source. The option to choose is only available if the instance is security-enabled.
Define, edit, or delete metatag mappings for your Web source. Metatags are descriptive tags in the HTML document header. One metatag can map to only one search attribute.
Override the default crawler settings for each Web source. This step is optional. The parameters you can override are the crawling depth, the number of crawler threads, the language, the crawler timeout threshold, the character set, the maximum cookie size, the maximum number of cookies, and the maximum number of cookies for each host. You can also enable or disable robots exclusion, language detection, the UrlRewriter, indexing dynamic Web pages, HTTP cookies, and whether content of the cookie log file is shown. (You can also edit those in Edit Web Sources.)
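The URL boundary rules from the steps above (host or domain rules that begin with, end with, or equal a name, with exclusion overriding inclusion) can be sketched in Python as follows. This is a simplified illustration, not the crawler's implementation: the function names and rule tuples are hypothetical, matching is plain string comparison on the host name, and treating an empty inclusion list as "everything allowed" is an assumption.

```python
from urllib.parse import urlsplit


def rule_matches(host, mode, name):
    """Check one boundary rule against a host name (case-insensitive)."""
    host, name = host.lower(), name.lower()
    if mode == "begins_with":
        return host.startswith(name)
    if mode == "ends_with":
        return host.endswith(name)
    if mode == "equals":
        return host == name
    raise ValueError(f"unknown rule mode: {mode}")


def in_crawl_space(url, inclusions, exclusions):
    """Return True if the URL's host passes the boundary rules."""
    host = urlsplit(url).hostname or ""
    # Exclusion rules always override inclusion rules.
    if any(rule_matches(host, mode, name) for mode, name in exclusions):
        return False
    # Assumption: with no inclusion rules, every non-excluded host is allowed.
    if not inclusions:
        return True
    return any(rule_matches(host, mode, name) for mode, name in inclusions)


inclusions = [("ends_with", "oracle.com")]
exclusions = [("ends_with", "uk.oracle.com")]

for url in ("http://www.oracle.com/index.html",
            "http://www.oracle.com.tw/",
            "http://www.uk.oracle.com/"):
    print(url, in_crawl_space(url, inclusions, exclusions))
```

With these rules, www.oracle.com is inside the crawling space, while www.oracle.com.tw (no inclusion match) and www.uk.oracle.com (excluded) are not.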
Robots exclusion lets you control which parts of your sites can be visited by robots. If robots exclusion is enabled (default), then the Web crawler traverses the pages based on the access policy specified in the Web server robots.txt file. For example, when a robot visits http://www.foobar.com/, it checks for http://www.foobar.com/robots.txt. If the robot finds it, the crawler analyzes its contents to see if it is permitted to retrieve the document. If you own the Web sites, then you can disable robots exclusions. However, when crawling other Web sites, you should always comply with robots.txt by enabling robots exclusion.
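To see what a robots.txt check amounts to, the following Python sketch uses the standard library's `urllib.robotparser`. It is an illustration of the general robots exclusion protocol, not of the Oracle Ultra Search crawler itself; the `may_fetch` helper, its `robots_exclusion` flag, and the sample rules are hypothetical.

```python
from urllib.robotparser import RobotFileParser


def may_fetch(url, robots_txt_lines, robots_exclusion=True, user_agent="*"):
    """Decide whether a robot may retrieve a URL.

    robots_txt_lines is the content of the site's robots.txt, one rule per
    list element. When robots exclusion is disabled, every URL is allowed.
    """
    if not robots_exclusion:
        return True
    parser = RobotFileParser()
    parser.parse(robots_txt_lines)
    return parser.can_fetch(user_agent, url)


# Hypothetical robots.txt content for http://www.foobar.com/
robots_txt = ["User-agent: *", "Disallow: /private/"]

print(may_fetch("http://www.foobar.com/index.html", robots_txt))
print(may_fetch("http://www.foobar.com/private/a.html", robots_txt))
print(may_fetch("http://www.foobar.com/private/a.html", robots_txt,
                robots_exclusion=False))
```

The second URL is refused only while robots exclusion is enabled, which mirrors the choice offered for each Web source.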
The URL Rewriter is a user-supplied Java module that implements the Oracle Ultra Search UrlRewriter interface. It is used by the crawler to filter or rewrite extracted URL links before they are put into the URL queue. URL filtering removes unwanted links, and URL rewriting transforms the URL link. This transformation is necessary when access URLs are used.
The UrlRewriter provides the following possible outcomes for links:
There is no change to the link. The crawler inserts it as it is.
Discard the link. There is no insertion.
A new display URL is returned, replacing the URL link for insertion.
A display URL and an access URL are returned. The display URL may or may not be identical to the URL link.
The generated new URL link is subject to all existing host, path, and mimetype inclusion and exclusion rules.
You must put the implemented rewriter class in a jar file and provide the class name and jar file name here.
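The four possible outcomes listed above can be illustrated with a small Python sketch. This is a conceptual model only: it is not the Java UrlRewriter interface, and the function name, the return convention, the example.com hosts, and the rewriting rules themselves are all hypothetical.

```python
from urllib.parse import urlsplit, urlunsplit


def rewrite_link(link):
    """Model the four rewriter outcomes for one extracted link.

    Returns None to discard the link, or a (display_url, access_url) pair.
    access_url is None when the crawler should fetch the display URL itself.
    """
    parts = urlsplit(link)
    if parts.path.endswith("/logout"):
        # Outcome: discard the link, nothing is inserted.
        return None
    if parts.hostname == "public.example.com":
        # Outcome: display URL plus a separate access URL for crawling.
        access = urlunsplit(parts._replace(netloc="internal.example.com"))
        return (link, access)
    if parts.fragment:
        # Outcome: a new display URL replaces the extracted link.
        return (urlunsplit(parts._replace(fragment="")), None)
    # Outcome: no change, the link is inserted as it is.
    return (link, None)


for link in ("http://www.example.com/a.html",
             "http://www.example.com/logout",
             "http://www.example.com/a.html#top",
             "http://public.example.com/doc"):
    print(link, "->", rewrite_link(link))
```

Whatever a rewriter returns, the resulting link would still have to pass the host, path, and mimetype rules of the data source.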
If Index Dynamic Page is set to Yes, then dynamic URLs are crawled and indexed. For data sources already crawled with this option, setting Index Dynamic Page to No and recrawling the data source removes all dynamic URLs from the index.
Note:
There is a restriction that the crawler cannot crawl and index dynamic Web pages written in JavaScript.
Some dynamic pages appear as multiple search hits for the same page, and you may not want them all indexed. Other dynamic pages are each different and need to be indexed. You must distinguish between these two kinds of dynamic pages. In general, dynamic pages that only change in menu expansion without affecting their contents should not be indexed. Consider the following three URLs:
http://itweb.oraclecorp.com/aboutit/network/npe/standards/naming_convention.html
http://itweb.oraclecorp.com/aboutit/network/npe/standards/naming_convention.html?nsdnv=14z1
http://itweb.oraclecorp.com/aboutit/network/npe/standards/naming_convention.html?nsdnv=14
The question mark ('?') in the URL indicates that the rest of the strings are input parameters. The duplicate hits are essentially the same page with different side menu expansion. Ideally, the query should yield only one result:
http://itweb.oraclecorp.com/aboutit/network/npe/standards/naming_convention.html
Dynamic page index control applies to the whole data source. So, if a Web site has both kinds of dynamic pages, you need to define them separately as two data sources in order to control the indexing of those dynamic pages.
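The following Python sketch shows why the three URLs above count as one page once their input parameters are ignored. It is only an illustration: treating any URL with a query string as "dynamic" is an assumption based on the question-mark rule described above, and the helper names are hypothetical.

```python
from urllib.parse import urlsplit, urlunsplit


def is_dynamic(url):
    """Assumption: a URL is dynamic if it carries input parameters after '?'."""
    return bool(urlsplit(url).query)


def static_part(url):
    """Return the URL without its input parameters or fragment."""
    parts = urlsplit(url)
    return urlunsplit(parts._replace(query="", fragment=""))


base = "http://itweb.oraclecorp.com/aboutit/network/npe/standards/naming_convention.html"
urls = [base, base + "?nsdnv=14z1", base + "?nsdnv=14"]

dynamic_flags = [is_dynamic(u) for u in urls]
distinct_pages = {static_part(u) for u in urls}

print(dynamic_flags)
print(distinct_pages)
```

All three URLs reduce to the same static address, which is the single result the query should ideally return.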
See Also:
"Crawler Page" for information on default languages
A table source represents content in a database table or view. The database table or view can reside in the Oracle Ultra Search database instance or in a remote database. Oracle Ultra Search accesses remote databases using database links.
See Also:
"Limitations With Database Links"
To create a table source, click Create Table Source, and follow these steps:
Specify a table source name, and the name of the database link, schema, and table. (Database links are configured manually using SQL CREATE DATABASE LINK against the Oracle Ultra Search instance in question. After you create the database link, it shows up in the drop down list.) Click Locate Table.
Specify settings for your table source, such as the default language and the primary key column. You can also specify the column where final content should be delivered, and the type of data stored in that column; for example, HTML, plain text, or binary. For information on default languages, see "Crawler Page".
Verify the information about your table source.
Decide whether or not to use the Oracle Ultra Search logging mechanism to optimize the crawling of table data sources. When logging is enabled, only newly updated documents are revisited during the crawling process. You can enable logging for Oracle tables, enable logging for non-Oracle tables, or disable the logging mechanism. If you enable logging, then you are prompted to create a log table and log triggers. Oracle SQL statements are provided for Oracle tables. If you are using non-Oracle tables, then you must manually create a log table and log triggers. Follow the examples provided to create the log table and log triggers. After you have created the table, enter the table name in Log Table Name.
Map table columns to search attributes. Each table column can be mapped to exactly one search attribute. This lets the search engine seamlessly search data from the table source.
Specify the display URL template or column for the table source. This step is optional. Oracle Ultra Search uses a default text viewer for table data sources. If you specify Display URL, then Oracle Ultra Search uses the Web URL defined to display the table data retrieved. If Display URL column is available, then Oracle Ultra Search uses the column to get the URL to display the table data source content. You can also specify display URL templates in the following format: http://hostname:port/path?parameter_name=$(key1) where key1 is the corresponding table's primary key column. For example, assume that you can use the following URL to query the bug number 1234567, and the bug number is the primary key of the table: http://bug:7777/pls/bug?rptno=1234567. You can set the table source display URL template to http://bug:7777/pls/bug?rptno=$(key1).
The Table Column to Key Mappings section provides mapping information. Oracle Ultra Search supports table keys in STRING, NUMBER, or DATE type. If key1 is of NUMBER or DATE type, then you must specify the format model used by the Web site so that Oracle knows how to interpret the string. For example, the date format model for the string '11-Nov-1999' is 'DD-Mon-YYYY'. You can also map other table columns to Oracle Ultra Search attributes. Do not map the text column.
Specify the ACL (access control list) policy for the data source. When a user performs a search, the ACL controls which documents the user can access. The default is no ACL, with all documents considered public and visible. Alternatively, you can specify to use Oracle Ultra Search ACL. You can add more than one group and user to the ACL for the data source. The option to choose is available only if the instance is security-enabled.
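The display URL template and the key format model described in the steps above can be illustrated with a short Python sketch. It is not how Oracle Ultra Search performs the substitution: `render_display_url` is a hypothetical helper, URL-encoding the key value is an assumption, and the `%d-%b-%Y` pattern is only Python's rough analogue of the Oracle format model 'DD-Mon-YYYY'.

```python
from datetime import datetime
from urllib.parse import quote


def render_display_url(template, keys):
    """Replace each $(name) placeholder with the matching key value."""
    url = template
    for name, value in keys.items():
        # Assumption: key values are URL-encoded before substitution.
        url = url.replace(f"$({name})", quote(str(value), safe=""))
    return url


# The bug-number example: key1 is the table's primary key column.
template = "http://bug:7777/pls/bug?rptno=$(key1)"
bug_url = render_display_url(template, {"key1": 1234567})
print(bug_url)

# A DATE key arrives as a string and needs a format to be interpreted.
parsed = datetime.strptime("11-Nov-1999", "%d-%b-%Y")
print(parsed.date())
```

The second part shows why a format model is required for NUMBER and DATE keys: the same string is ambiguous until a pattern says how to read it.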
See Also:
Oracle Database SQL Reference for more on format models
On the main Table Sources page, click Edit to change the name of the table source. You can change, add, or delete table column and search attribute mappings; change the display URL template or column; and view values of the table source settings.
If a table source has more than one table, then a view joining the relevant tables must be created. Oracle Ultra Search then uses this view as the table source. For example, two tables with a master-detail relationship can be joined through a SELECT statement on the master table and a user-implemented PL/SQL function that concatenates the detail table rows.
The following restrictions apply to base tables or views on a remote database that are accessed over a database link by the crawler.
If the text column of the base table or view is of type BLOB or CLOB, then the table must have a ROWID column. A table or view might not have a ROWID column for various reasons, including the following:
A view is comprised of a join of one or more tables.
A view is based on a single table using a GROUP BY clause.
The best way to know if a remote table or view can be safely crawled by Oracle Ultra Search is to check for the existence of the ROWID column. To do so, run the following SQL statement against that table or view using SQL*Plus:
SELECT MIN(ROWID) FROM table_name/view_name;
The base table or view cannot have text columns of type BFILE or RAW.
An e-mail source derives its content from e-mails sent to a specific e-mail address. When the Oracle Ultra Search crawler searches an e-mail source, it collects all e-mails that have the specific e-mail address in any of the "To:" or "Cc:" e-mail header fields.
The most popular application of an e-mail source is where an e-mail source represents all e-mails sent to a mailing list. In such a scenario, multiple e-mail sources are defined where each e-mail source represents an e-mail list.
To crawl e-mail sources, you need an IMAP account. At present, the Oracle Ultra Search crawler can only crawl one IMAP account. Therefore, all e-mails to be crawled must be found in the inbox of that IMAP account. For example, in the case of mailing lists, the IMAP account should be subscribed to all desired mailing lists. All new postings to the mailing lists are sent to the IMAP e-mail account and subsequently crawled. The Oracle Ultra Search crawler is IMAP4 compliant.
When the Oracle Ultra Search crawler retrieves an e-mail message, it deletes the e-mail message from the IMAP server. Then, it converts the e-mail message content to HTML and temporarily stores that HTML in the cache directory for indexing. Next, the Oracle Ultra Search crawler stores all retrieved messages in a directory known as the archive directory. The e-mail files stored in this directory are displayed to the search end user when referenced by a query result.
To crawl e-mail sources, you must specify the user name and password of the e-mail account on the IMAP server. Also specify the IMAP server host name and the archive directory.
To create e-mail sources, you must enter an e-mail address and a description. Optionally, you can specify e-mail aliases and ACL policy. The description can be viewed by all search end users, so you should specify a short but meaningful name. When you create (register) an e-mail source, the name you use is the e-mail address of the mailing list. If the e-mails are not sent to one of the registered mailing lists, then those e-mails are not crawled.
You can specify e-mail address aliases for an e-mail source. Specifying an alias for an e-mail source causes all e-mails sent to the main e-mail address, as well as the alias address, to be gathered by the crawler. An alias is useful when two or more e-mail addresses are logically the same. For example, an e-mail source representing the distribution list list@company.com can have the alternate address list@my.company.com. If list@my.company.com is added to the alias list, then e-mail sent to that address is treated as if it were sent to list@company.com.
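The way an e-mail is attributed to an e-mail source through its "To:" and "Cc:" headers, including aliases, can be sketched in Python with the standard `email` package. This is a conceptual illustration rather than the crawler's code: the `matching_sources` helper, the `sources` dictionary layout, and the sample messages are hypothetical.

```python
from email import message_from_string
from email.utils import getaddresses

# Hypothetical registry: main address of each e-mail source -> its aliases.
sources = {"list@company.com": {"list@my.company.com"}}


def matching_sources(raw_message, sources):
    """Return the main addresses of all sources the message was sent to."""
    msg = message_from_string(raw_message)
    headers = msg.get_all("To", []) + msg.get_all("Cc", [])
    recipients = {addr.lower() for _, addr in getaddresses(headers)}
    matched = []
    for main, aliases in sources.items():
        accepted = {main.lower()} | {alias.lower() for alias in aliases}
        if recipients & accepted:
            matched.append(main)
    return matched


to_alias = ("From: a@example.com\n"
            "To: Some List <list@my.company.com>\n"
            "Subject: hello\n\nbody\n")
unrelated = ("From: a@example.com\n"
             "To: other@example.com\n"
             "Subject: hello\n\nbody\n")

print(matching_sources(to_alias, sources))
print(matching_sources(unrelated, sources))
```

The message sent to the alias is attributed to list@company.com, while the unrelated message matches no registered source and would not be crawled.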
Specify the ACL (access control list) policy for the data source. When a user performs a search, the ACL controls which documents the user can access. The default is no ACL, with all documents considered searchable and visible. You can add more than one group and user to the ACL for the data source.
A file source is the set of documents that can be accessed through the file protocol on the local computer.
To edit the name of a file source, click Edit.
To create a new file source, do the following:
Specify a name for the file source and the default language.
Designate files or directories to be crawled. If a URL represents a single file, then the Oracle Ultra Search crawler searches only that file. If a URL represents a directory, then the crawler recursively crawls all files and subdirectories in that directory.
Specify inclusion and exclusion paths to modify the crawling space associated with this file source. This step is optional. An inclusion path limits the crawling space. An exclusion path lets you further define the crawling space. If neither path is specified, then crawling is limited to the underlying file system access privileges. Path rules are host-specific, but you can specify more than one path rule for each host. For example, on the same host, you can include path files://host/doc and exclude path files://host/doc/unwanted.
Specify the types of documents the Oracle Ultra Search crawler should process for this file source. HTML and plain text are default document types that the crawler always processes.
Oracle Ultra Search displays file data sources in text format. However, if you specify display URL for the file data source, then Oracle Ultra Search uses the URL to display the file data source.
With display URL for file data sources, the URL uses network protocols, such as HTTP or HTTPS, to access the file data source. To generate display URL for the file data source, specify the prefix of the original file URL and the prefix of the display URL. Oracle Ultra Search replaces the prefix of the file URL with the prefix of the display URL.
For example, if your file URL is file:///home/operation/doc/file.doc and the display URL is https://webhost/client/doc/file.doc, then you can set the file URL prefix to file:///home/operation and the display URL prefix to https://webhost/client.
Specify the ACL (access control list) policy for the data source. When a user performs a search, the ACL controls which documents the user can access. The default is no ACL, with all documents considered searchable and visible. Alternatively, you can specify using the Oracle Ultra Search ACL. You can add more than one group and user to the ACL for the data source. The option to choose is only available if the instance is security-enabled.
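Two of the steps above, the inclusion and exclusion path rules and the display URL prefix replacement, can be sketched together in Python. This is an illustration under stated assumptions, not Oracle Ultra Search code: the helper names are hypothetical, exclusion is assumed to take precedence over inclusion (as the doc/unwanted example suggests), and a file URL that does not start with the file URL prefix is assumed to be left unchanged.

```python
def under(url, prefix):
    """True if url is the prefix path itself or lies beneath it."""
    prefix = prefix.rstrip("/")
    return url == prefix or url.startswith(prefix + "/")


def in_file_crawl_space(url, include_paths, exclude_paths):
    """Apply path rules; exclusion is assumed to win over inclusion."""
    if any(under(url, path) for path in exclude_paths):
        return False
    return not include_paths or any(under(url, path) for path in include_paths)


def display_url(file_url, file_prefix, display_prefix):
    """Swap the file URL prefix for the display URL prefix."""
    if not file_url.startswith(file_prefix):
        return file_url  # assumption: unmatched URLs are left as they are
    return display_prefix + file_url[len(file_prefix):]


include_paths = ["files://host/doc"]
exclude_paths = ["files://host/doc/unwanted"]
print(in_file_crawl_space("files://host/doc/a.txt", include_paths, exclude_paths))
print(in_file_crawl_space("files://host/doc/unwanted/b.txt", include_paths, exclude_paths))

shown = display_url("file:///home/operation/doc/file.doc",
                    "file:///home/operation", "https://webhost/client")
print(shown)
```

The last call reproduces the prefix example from the text: only the leading part of the file URL changes, and the rest of the path is carried over.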
You can create, edit, or delete Oracle sources. You can choose a federated source or a crawlable source from Oracle Application Server Portal. A federated source is a repository that maintains its own index. Oracle Ultra Search can issue a query, and the repository can return query results. Oracle Ultra Search also supports the crawling and indexing of Oracle Application Server Portal installations. This enables searching across multiple portal installations.
Note:
When Oracle Ultra Search crawls content from Oracle Portal, it gathers all metadata (that is, attribute values) with the actual indexable content. This is then text-indexed. When you search for string "xxx", if that string occurs in either the attributes or in the content, then the document is returned.
This is different from how Oracle Portal behaves alone. With Oracle Portal, when you search for string "xxx", only the content of a document (page/item in Portal terminology) is searched. Attributes are treated separately.
Oracle Ultra Search can only crawl public Oracle AS Portal sources. See the Oracle Application Server Portal Configuration Guide for how to set up public pages.
To create a Portal source, you must first register your portal with Oracle Ultra Search. To register your portal:
Provide a name and portal URL base. The portal name is used to identify this portal entry in the Oracle Portal List page. The URL base is the beginning portion of the portal homepage. This includes the host name, port number, and DAD. After it is created, the portal URL base cannot be updated. Click Register Portal. Oracle Ultra Search attempts to contact the Oracle Application Server Portal instance and retrieve information about it.
Choose one or more page groups for indexing. A portal data source is created for each page group. Click Delete to remove existing portal data sources.
You can edit the types of documents the Oracle Ultra Search crawler should process for a portal source. HTML and plain text are default document types that the crawler always processes. To edit document types, click Edit for the portal source after it has been created.
See Also:
The Oracle Application Server Portal documentation
To create a federated source, specify the name and JNDI for the new data source. By default, no resource adapter is available.
To create a federated source, you must manually deploy the Oracle Ultra Search resource adapter, or searchlet. A searchlet is a Java module deployed in the middle tier (inside OC4J) that searches the data in an enterprise information system on behalf of a user. When a user's query is delegated to the searchlet, the searchlet runs the query on behalf of the user. Every searchlet is a JCA 1.0 compliant resource adapter.
See Also:
The JCA 1.0 specification from Javasoft for detailed information on resource adapters and Java Connector Architecture
The Oracle Ultra Search searchlet enables queries against one Oracle Ultra Search instance. The Oracle Ultra Search searchlet is packaged as ultrasearch_searchlet.rar and is shipped under the $ORACLE_HOME/ultrasearch/adapter/ directory.
To deploy the Oracle Ultra Search searchlet in OC4J standalone mode, use admin.jar as follows:
java -jar admin.jar ormi://<hostname> <admin> <welcome> -deployconnector -file ultrasearch_searchlet.rar -name UltraSearchSearchlet
At this point, ultrasearch_searchlet.rar has been deployed in OC4J. However, it has not been instantiated to connect to any Oracle Ultra Search instance. The Oracle Ultra Search searchlet can be instantiated multiple times, to connect to several Oracle Ultra Search instances, by repeating the following steps. To instantiate the searchlet, you must specify values for its configuration parameters and a JNDI location to which the searchlet instance is bound. To do this, you must manually edit oc4j-ra.xml. This file is typically located under the $J2EE_HOME/application-deployments/default/UltraSearchSearchlet/ directory. The Oracle Ultra Search searchlet requires four configuration properties: connectionURL, userName, password, and instanceName. For example, to bind a searchlet under eis/UltraSearch to connect to the default instance WK_TEST on computer dbhost, the following entry can be used:
<connector-factory location="eis/UltraSearch" connector-name="Ultra Search Adapter">
  <config-property name="connectionURL" value="jdbc:oracle:thin:@dbhost:1521:sid"/>
  <config-property name="userName" value="wk_test"/>
  <config-property name="password" value="wk_test"/>
  <config-property name="instanceName" value="wk_test"/>
</connector-factory>
After editing oc4j-ra.xml, restart the OC4J instance. If you do not see errors upon restart, then the searchlet has been successfully instantiated and bound to JNDI.
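Because a single misspelled property name in oc4j-ra.xml is easy to overlook, it can help to check an entry before restarting OC4J. The following Python sketch is one possible check, not an Oracle-supplied tool: the `missing_properties` helper is hypothetical, it inspects only a single connector-factory element passed in as text, and it assumes property names are matched case-sensitively.

```python
import xml.etree.ElementTree as ET

# The four configuration properties the searchlet requires.
REQUIRED = {"connectionURL", "userName", "password", "instanceName"}


def missing_properties(connector_factory_xml):
    """Return the required property names absent from one entry.

    Raises xml.etree.ElementTree.ParseError if the entry is not
    well-formed XML (for example, an unbalanced quote).
    """
    root = ET.fromstring(connector_factory_xml)
    present = {prop.get("name") for prop in root.iter("config-property")}
    return sorted(REQUIRED - present)


good_entry = """
<connector-factory location="eis/UltraSearch" connector-name="Ultra Search Adapter">
  <config-property name="connectionURL" value="jdbc:oracle:thin:@dbhost:1521:sid"/>
  <config-property name="userName" value="wk_test"/>
  <config-property name="password" value="wk_test"/>
  <config-property name="instanceName" value="wk_test"/>
</connector-factory>
"""
# Simulate a typing mistake in one property name.
bad_entry = good_entry.replace('name="password"', 'name="pasword"')

print(missing_properties(good_entry))
print(missing_properties(bad_entry))
```

An empty list means all four properties are present; anything else names the properties that still need to be supplied or corrected.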
The federator searchlet interacts with other searchlets to provide a single point of search against multiple repositories. For example, the federator searchlet can invoke multiple Oracle Ultra Search searchlets to simultaneously query against multiple Oracle Ultra Search instances. In the same manner, the federator searchlet can invoke searchlets for Oracle Files, Email, and so on.
The federator searchlet is configured and managed with the Oracle Ultra Search administration tool, under the Federated Sources tab. The federator searchlet is packaged as federator_searchlet.rar and is shipped under the $ORACLE_HOME/ultrasearch/adapter/ directory. The deployment procedure for federator_searchlet.rar is similar to the deployment of the Oracle Ultra Search searchlet. To deploy the federator searchlet in OC4J standalone mode, use admin.jar as follows:
java -jar admin.jar ormi://<hostname> <admin> <welcome> -deployconnector -file federator_searchlet.rar -name FederatorSearchlet
To instantiate the searchlet, the federator searchlet requires four configuration properties: connectionURL, userName, password, and instanceName in the oc4j-ra.xml file. This file is typically located under the $J2EE_HOME/application-deployments/default/FederatorSearchlet/ directory. For example:
<connector-factory location="eis/Federator" connector-name="Federator Adapter">
  <config-property name="connectionURL" value="jdbc:oracle:thin:@dbhost:1521:sid"/>
  <config-property name="userName" value="wk_test"/>
  <config-property name="password" value="wk_test"/>
  <config-property name="instanceName" value="wk_test"/>
</connector-factory>
After editing oc4j-ra.xml, restart the OC4J instance. If you do not see errors upon restart, then the searchlet has been successfully instantiated and bound to JNDI.
Oracle Ultra Search lets you define, edit, or delete your own data sources and types in addition to the ones provided. You might implement your own crawler agent to crawl and index a proprietary document repository or management system, such as Lotus Notes or Documentum, which contain their own databases and interfaces.
For each new data source type, you must implement a crawler agent as a Java class. The agent collects document URLs and associated metadata from the proprietary document source and returns the information to the Oracle Ultra Search crawler, which enqueues it for later crawling.
See Also:
"Oracle Ultra Search Crawler Agent API"
To define a new data source, you first define a data source type to represent it.
To create, edit, or delete data source types, click Manage Source Types. To create a new type, click Create New Type.
Specify data source type name, description, and crawler agent Java class file or jar file name. The crawler agent Java classpath is predefined at installation time. The agent collects the list of document URLs and associated metadata from the proprietary document source and returns it to the Oracle Ultra Search crawler, which enqueues the information for later crawling. The agent class file or jar file must be located under $ORACLE_HOME/ultrasearch/lib/agent/.
Specify parameters for this data source type. If you add parameters, you must enter the parameter name and a description. Also, you must decide whether to encrypt the parameter value.
Edit data source type information by changing the data source type name, description, crawler agent Java class/jar file name, or parameters.
To create a user-defined data source, select the type and click Go.
Specify a name, default language, and parameter values for the data source. For information on default languages, see the Crawler Page.
Specify the authentication settings. This step is optional. Under HTTP Authentication, specify the user name and password for host and realm for which authentication is required. The realm is a name associated with the protected area of a Web site. Under HTML Forms, you can register HTML forms that you want the Oracle Ultra Search crawler to automatically fill out during Web crawling. HTML form support requires that HTTP cookie functionality is enabled. Cookies remember context between HTTP requests. For example, the server can send a cookie such that it knows if a user has already logged on and does not need to log on again. Cookie support is enabled by default. Click Register HTML Form to register authentication forms protecting the data source.
Specify the ACL policy for the data source: no ACL, repository-generated ACL, or Oracle Ultra Search ACL. When a user performs a search, the ACL controls which documents the user can access. The default is no ACL, with all documents considered searchable and visible. For the Oracle Ultra Search ACL, you can add more than one group and user to the ACL for the data source.
Specify mappings. This step is optional. Document attributes are automatically mapped directly to the search attribute with the same name during crawling. If you want document attributes to map to another search attribute, then you can specify it here. The crawler picks up attributes that have been returned by the crawler agent or specified here.
Edit crawling parameters.
Specify the document types that the crawler should process for this data source. By default, HTML and plain text are always processed.
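The default mapping behavior in the steps above (each document attribute maps to the search attribute of the same name unless you specify another) can be pictured with a small Python sketch. The `resolve_mappings` helper and the attribute names are hypothetical, and the one-to-one check is a simplification of the rule that document attributes and search attributes are mapped one-to-one.

```python
def resolve_mappings(document_attributes, overrides=None):
    """Map each document attribute to a search attribute.

    An attribute maps to the search attribute with the same name unless
    overrides names a different target. Two document attributes may not
    share one search attribute.
    """
    overrides = overrides or {}
    mapping = {attr: overrides.get(attr, attr) for attr in document_attributes}
    targets = list(mapping.values())
    if len(set(targets)) != len(targets):
        raise ValueError("mappings must be one-to-one")
    return mapping


# "DocOwner" is a hypothetical attribute returned by a crawler agent.
mappings = resolve_mappings(["Title", "Subject", "DocOwner"],
                            {"DocOwner": "Author"})
print(mappings)
```

Only the overridden attribute changes its target; the others keep their same-name mapping.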
You can edit user-defined data sources by changing the name, type, default language, or starting address.
Use this page to schedule data synchronization and index optimization. Data synchronization means keeping the Oracle Ultra Search index up to date with all data sources. Index optimization means keeping the updated index optimized for best query performance.
See Also:
"Synchronizing Data Sources"
The tables on this page display information about synchronization schedules. A synchronization schedule has one or more data sources assigned to it. The synchronization schedule frequency specifies when the assigned data sources should be synchronized. Schedules are sorted first by name. Within a synchronization schedule, individual data sources are listed and can be sorted by source name or source type.
To create a new schedule, click Create New Schedule and follow these steps:
Name the schedule.
Pick a schedule frequency and determine whether the schedule should automatically accept all URLs for indexing or examine URLs before indexing. For initial planning purposes, you might want the crawler to collect URLs without indexing. After crawling is done, you can examine document URLs and status, remove unwanted documents, and start indexing. You can also associate the schedule with a remote crawler profile.
You can set the frequency to Manual Launch. In this case, the interval remains in SCHEDULED status until you explicitly invoke data synchronization by clicking Execute Immediately (see "Launching Synchronization Schedules").
Assign data sources to the schedule. After a data source has been assigned to a group, it cannot be assigned to other groups.
After a synchronization schedule has been defined, you can do the following in the Synchronization Schedules List:
To assign the schedule to either a crawler that runs on the database host or a remote crawler that runs on a separate host, click Hostname.
To change its frequency, click the schedule interval text.
To alter its status, click Status.
To delete it, click Delete.
To edit its name, data source assignments, recrawl policy, or crawling mode, click Edit. When the crawler retrieves a document, it checks to see if it has changed. By default, if the document has not changed, the crawler does not process it. In certain situations, you might want to force the crawler to reprocess all documents. Click Edit to edit schedules in the following ways:
Update schedule name. This step is optional. To change the schedule name, specify a name for the schedule, and click Update Schedule Name.
Assign data sources to schedule. To assign a data source, select one or more available sources and click >>. After a data source has been assigned to a group, it cannot be assigned to any other group. To undo assignments of a data source, select one or more scheduled sources and click <<.
Update crawler recrawl policy. You can update the recrawl policy to the following:
Process Documents That Have Changed: This is maintenance crawling. Only documents that have changed are recrawled and indexed. For Web data sources, if there are new links in the updated document, then they are followed. For file data sources, new files are collected if their parent directory has changed.
Process All Documents: The crawler recrawls the data source. For example, suppose you want to crawl only text and HTML on a Web site. Later, you also want to crawl Microsoft Word and Adobe PDF documents. You must modify the document types for the source, edit the schedule to select Process All Documents, then rerun the schedule so that the crawler picks up PDF and doc document types for this data source. The crawler treats every document as if it has been changed, which means each document is fetched and processed again.
Upon relaunching the schedule, the following rules determine which URLs will be recrawled:
If the previous crawl did not finish (for example, you stopped the crawling or the database tablespace was full), then the crawler only crawls URLs left in the URL queue. URLs already crawled are not touched on recrawl.
If the URL queue is empty but there is a new seed added since the last crawl, then the crawler only crawls the new seed.
If the URL queue is empty and there is no new seed URL, then the crawler recrawls all crawled URLs.
Therefore, if you stop the crawler and set Index Dynamic Pages to No, this only affects the URLs in the queue yet to be crawled. The already crawled dynamic pages are removed from the index on the third recrawl when the queue is empty.
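The three relaunch rules can be read as a simple decision order. The following is a minimal sketch of that order only, not the crawler's actual implementation; the function name and the list-based arguments are assumptions made for illustration.

```python
def urls_to_crawl(url_queue, new_seeds, crawled_urls):
    """Return the URLs a relaunched schedule would crawl (illustrative only).

    url_queue    -- URLs left in the queue by an unfinished crawl (hypothetical model)
    new_seeds    -- seed URLs added since the last crawl
    crawled_urls -- every URL crawled so far
    """
    if url_queue:
        # Rule 1: the previous crawl did not finish, so only the leftover queue is crawled.
        return list(url_queue)
    if new_seeds:
        # Rule 2: the queue is empty, but a seed was added since the last crawl.
        return list(new_seeds)
    # Rule 3: the queue is empty and there is no new seed, so all crawled URLs are recrawled.
    return list(crawled_urls)
```

Under this reading, a changed crawler setting reaches the already crawled URLs only when the third rule applies.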
Note:
All crawled URLs are subject to crawler setting enforcement, not just newly crawled URLs.
Update crawling mode. You can update the crawling mode to the following:
Automatically accept all URLs for indexing: This mode crawls and indexes.
Examine URLs before indexing: This mode crawls only. For initial planning purposes, you might want the crawler to collect URLs without indexing. After crawling is done, you can examine document URLs and status, remove unwanted documents, and start indexing.
Index only: This mode indexes only.
The crawler behaves differently for the documents collected.
Crawling mode and recrawl policy can be combined for six different combinations. For example, Process All Documents and Index Only forces reindexing existing documents in this data source, while Process Documents That Have Changed and Index Only re-indexes only changed documents.
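The count of six follows from pairing each of the two recrawl policies with each of the three crawling modes. The short sketch below simply enumerates the pairs; the constant names are illustrative and are not identifiers used by Oracle Ultra Search.

```python
from itertools import product

# The two recrawl policies and three crawling modes named in this section.
RECRAWL_POLICIES = [
    "Process Documents That Have Changed",
    "Process All Documents",
]
CRAWLING_MODES = [
    "Automatically accept all URLs for indexing",
    "Examine URLs before indexing",
    "Index only",
]

# Every (policy, mode) pair: 2 x 3 = 6 combinations.
combinations = list(product(RECRAWL_POLICIES, CRAWLING_MODES))

for policy, mode in combinations:
    print(f"{policy} + {mode}")
```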
A schedule's synchronization frequency can be identical to another schedule's synchronization frequency. This gives you maximum flexibility in managing data source synchronization.
You can launch a synchronization schedule in the following ways:
Set a schedule frequency and wait for the predetermined launch time.
Run it immediately. To do so, click Status, then Execute Immediately.
Manually start the schedule.
Note:
Launching a synchronization schedule can take a very long time. If a schedule has been launched before, then the next time it is launched, all URLs that belong to the data source to be crawled by the schedule are updated and put into a queue. Depending on the number of URLs associated with that data source, the enqueue operation may take a long time. The administration tool displays the schedule state as 'Launching' the entire time.
The launch of a schedule does not perform any enqueue if the URL queue is not empty or if there is a new seed added since the last crawl. For example, if the user stopped the crawler earlier or if the crawler terminated because of insufficient Oracle tablespace, then the URL queue is not empty. So, on the next launch the crawler does not try to enqueue; instead it works on the existing URL queue until it is empty. In other words, enqueue is only performed when the queue is empty at launch time.
Click the link in the status column to see the synchronization schedule status. To see the crawling progress for any data source associated with this schedule, click Statistics.
If you decide to examine URLs before indexing for the schedule, then after you run the schedule, the schedule status is shown as "Indexing Pending".
In data harvesting mode, you should begin crawling first. After crawling is done, click Examine URL to examine document URLs and status, remove unwanted documents, and start indexing. After you click Begin Index, the schedule status changes through states such as launching, running, and scheduled.
You can see the following statistics:
Data source type
Data source name
Start time
Finish time
Elapsed time
Total indexing time
Total size of document data collected
Average document size
Average fetch throughput
It also contains the following statistics:
Documents to fetch
Documents fetched: This includes all types of documents fetched.
Document fetch failures: This can be an Oracle HTTP Server timeout or another HTTP server error.
Documents rejected: The document is not within the URL boundary rule.
Documents discovered: This includes all types of documents discovered.
Documents indexed
Documents non-indexable: This can be a file directory, a portal page that is a discovery node, or a robot metatag that specifies no index.
Document conversion failures: The binary file filter failed.
Index Optimization
To ensure fast query results, the Oracle Ultra Search crawler maintains an active index of all documents crawled over all data sources. This section lets you schedule when the index is optimized. The index should be optimized during hours of low usage.
Note:
Increasing the crawler cache directory size can reduce index fragmentation.
Index Optimization Schedule
You can specify the index optimization schedule frequency. Be sure to specify all required data for the option that you select. You can optimize the index immediately, or you can enable the schedule.
Optimization Process Duration
Specify a maximum duration for the index optimization process. The actual time taken for optimization does not exceed this limit, but it could be shorter. Specifying a longer optimization time results in a more optimized index. Alternatively, you can specify that the optimization continue until it is finished.
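The effect of a maximum duration can be pictured as a time-boxed loop. The sketch below is a generic illustration under stated assumptions, not how Oracle Ultra Search optimizes its index: the work is modeled as a list of hypothetical callables, and the time budget is checked only before each unit of work starts, so a single long unit could still overrun the limit.

```python
import time

def optimize_index(work_units, max_seconds=None, clock=time.monotonic):
    """Run optimization work units until all are done or the time budget is used up.

    work_units  -- hypothetical callables, each optimizing one part of the index
    max_seconds -- maximum duration; None means continue until finished
    clock       -- time source, injectable so the loop can be tested
    Returns the number of work units completed.
    """
    start = clock()
    completed = 0
    for unit in work_units:
        # Stop before starting new work once the budget is exhausted.
        if max_seconds is not None and clock() - start >= max_seconds:
            break
        unit()
        completed += 1
    return completed
```

A longer budget lets more units complete, which matches the statement that a longer optimization time results in a more optimized index.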
If your Oracle Ultra Search instance is secure-search enabled, then the index optimization process also triggers garbage collection of unused access control lists (ACLs).
This section lets you specify query-related settings, such as data source groups, URL submission, relevancy boosting, and query statistics.
Data groups are logical entities exposed to the search engine user. When entering a query, the user is asked to select one or more data groups from which to search.
A data group consists of one or more data sources. A data source can be assigned to multiple data groups. Data groups are sorted first by name. Within each data group, individual data sources are listed and can be sorted by source name or source type.
To create a new data source group, do the following:
Specify a name for the group.
Assign data sources to the group. To assign a Web or table data source to this data group, select one or more available Web sources or table sources and click >>. After a data source has been assigned to a group, it cannot be assigned to any other group. To unassign a Web or table data source, select one or more scheduled sources and click <<.
Click Finish.
URL Submission Methods
URL submission lets query users submit URLs. These URLs are added to the seed URL list and included in the Oracle Ultra Search crawler search space. You can permit or disallow query users to submit URLs.
URL Boundary Rules Checking
URLs are submitted to a specific Web data source. URL boundary rules checking ensures that submitted URLs comply with the URL boundary rules of the Web data source. You can permit or disallow URL boundary rules checking.
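As a rough illustration of boundary checking, the sketch below models boundary rules as lists of included and excluded domains and tests a submitted URL against them. The rule model, the function names, and the `check_boundary` flag are assumptions made for this example; they are not the actual rule syntax or API of Oracle Ultra Search.

```python
from urllib.parse import urlparse

def within_boundary(url, include_domains, exclude_domains):
    """Return True if the URL's host falls inside the (hypothetical) boundary rules."""
    host = (urlparse(url).hostname or "").lower()

    def matches(domain):
        domain = domain.lower()
        # Match the domain itself or any of its subdomains.
        return host == domain or host.endswith("." + domain)

    # Exclusions take precedence over inclusions in this sketch.
    if any(matches(d) for d in exclude_domains):
        return False
    return any(matches(d) for d in include_domains)

def accept_submission(url, include_domains, exclude_domains, check_boundary=True):
    """Accept a user-submitted URL, optionally enforcing the boundary rules."""
    if not check_boundary:
        return True
    return within_boundary(url, include_domains, exclude_domains)
```

With checking disabled, every submitted URL is accepted into the seed list in this model; with checking enabled, only URLs that satisfy the rules of the target Web data source are accepted.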
Relevancy boosting lets administrators override the search results and influence the order that documents are ranked in the query result list. This can be used to promote important documents to higher scores, which makes these documents easier to find.
See Also:
"Document Relevancy Boosting"
There are two methods for locating URLs for relevancy boosting: locate by search or manual URL entry.
Locate by Search
To boost a URL, first locate a URL by performing a search. You can specify a host name to narrow the search. After you have located the URL, click Information to edit the query string and score for the document.
Manual URL Entry
If a document has not been crawled or indexed, then it cannot be found in a search. However, you can provide a URL and enter the relevancy boosting information with it. To do so, click Create, and enter the following:
Specify the document URL. You must assign the URL to a data source. This document is indexed the next time it is crawled.
Enter scores in the range of 1 to 100 for one or more query strings. When a user performs a search using the exact query string, the score applies for this URL.
The document is searchable after the document is loaded for the term. The document is also indexed the next time the schedule is run.
With manual URL entry, you can only assign URLs for Web data sources. Users get an error message on this page if no Web data source is defined.
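The boosting behavior described above, a score from 1 to 100 that applies when a user searches with the exact query string, can be sketched as a lookup table keyed by URL and query string. The class and method names are hypothetical, and treating "exact" as plain string equality is an assumption; the sketch does not reflect how Oracle Ultra Search stores or applies boosts internally.

```python
class BoostTable:
    """Illustrative store of relevancy boosts keyed by (URL, exact query string)."""

    def __init__(self):
        self._scores = {}

    def add(self, url, query_string, score):
        # Scores are restricted to the range 1 to 100, as stated in this section.
        if not 1 <= score <= 100:
            raise ValueError("score must be in the range 1 to 100")
        self._scores[(url, query_string)] = score

    def score_for(self, url, query_string, engine_score):
        # The boost applies only when the query string matches exactly;
        # otherwise the score computed by the search engine is kept.
        return self._scores.get((url, query_string), engine_score)
```

For example, after `add("http://www.example.com/benefits.html", "employee benefits", 95)`, a search for "employee benefits" would rank that URL with score 95 in this model, while a search for "benefits" would leave its engine score unchanged.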
Note:
Oracle Ultra Search provides a command-line tool to load metadata, such as document relevance boosting, into an Oracle Ultra Search database. If you have a large amount of data, this is probably faster than using the HTML-based administration tool. For more information, see Appendix A, "Loading Metadata into Oracle Ultra Search".
Enabling Query Statistics
This section lets you enable or disable the collection of query statistics. The logging of query statistics reduces query performance. Therefore, Oracle recommends that you disable the collection of query statistics during regular operation.
Note:
After you enable query statistics, the table that stores statistics data is truncated every Sunday at 1:00 A.M.
Viewing Statistics
If query statistics is enabled, you can click one of the following categories:
Daily Summary of Query Statistics: This summarizes all query activity on a daily basis. The statistics gathered are:
Average query time: the average time taken over all queries
Number of queries: the total number of queries made in the day
Number of hits: the average number of results returned by each query
Top 50 Queries: This summarizes the 50 most frequent queries in the past 24 hours.
Query string: the query string
Average query time: the average time to return a result
Number of queries: the total number of queries in the past 24 hours
Number of hits: the average number of results returned by each query
Frequency: the number of queries divided by total number of queries over all query strings
Percentage of ineffective queries: the number of ineffective queries divided by total number of queries over all query strings
Top 50 Ineffective Queries: This summarizes the 50 most frequent queries in the past 24 hours. Each row in the table describes statistics for a particular query string.
Query string: the query string
Number of queries: the total number of queries made in the past 24 hours
Percentage of ineffective queries: the number of ineffective queries divided by total number of queries for that string
Top 50 Failed Queries: This summarizes the top 50 queries that failed over the past 24 hours. A failed query is one where the search engine user did not locate any query results.
The columns are:
Query string: the query string
Number of queries: the total number of queries made in the past 24 hours
Frequency: the percentage occurrence of a failed query
Cumulative frequency: the cumulative percentage occurrence of all failed queries
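The frequency and cumulative frequency columns can be worked through with a small calculation. The sketch below assumes that each percentage is taken over the total number of failed queries in the report and that rows are ordered from most to least frequent; the function name and input format are illustrative.

```python
def failed_query_report(counts):
    """Build (query string, count, frequency %, cumulative frequency %) rows.

    counts -- dict mapping a query string to its number of failed queries
    """
    total = sum(counts.values())
    if total == 0:
        return []
    rows = []
    cumulative = 0.0
    # Most frequent failed queries first, so the cumulative column grows toward 100.
    for query, n in sorted(counts.items(), key=lambda item: item[1], reverse=True):
        frequency = 100.0 * n / total
        cumulative += frequency
        rows.append((query, n, frequency, cumulative))
    return rows
```

For instance, counts of 5, 3, and 2 for three query strings give frequencies of 50, 30, and 20 percent and cumulative frequencies of 50, 80, and 100 percent.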
See Also:
"Tuning Query Performance"
You can configure the query application and the federation engine with several parameters, including the maximum number of hits and whether to enable relevancy boosting.
You can configure the following federator parameters:
Timeout threshold: the maximum waiting time, in milliseconds, for getting search results from each of the repositories.
Maximum number of results: the federator retrieves at most this number of results. If it is set to a large number, then the search response is slower. If it is set to a small number, then the number of search results is limited.
Parallel query mode: parallel query mode makes the query more efficient, but also consumes more memory.
Min/max thread pool size: this parameter is also used for performance tuning. If more users run the search application concurrently, then use a larger pool size.
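To see how these parameters interact, consider the following minimal sketch of a parallel federated query. It is a generic illustration built on Python's standard thread pool, not the Oracle Ultra Search federator; the function `federated_search`, its parameter names, and the two sample repositories are all hypothetical.

```python
import time
from concurrent.futures import ThreadPoolExecutor, wait

def federated_search(repositories, query, timeout_ms=5000, max_results=100, pool_size=4):
    """Query every repository in parallel and merge the hits (illustrative only).

    repositories -- dict mapping a name to a callable(query) that returns a list of hits
    timeout_ms   -- maximum waiting time for results, in milliseconds
    max_results  -- upper bound on the number of merged hits returned
    pool_size    -- number of worker threads in the pool
    """
    executor = ThreadPoolExecutor(max_workers=pool_size)
    futures = [executor.submit(search, query) for search in repositories.values()]
    # Wait up to the timeout; repositories that have not answered are ignored.
    done, _ = wait(futures, timeout=timeout_ms / 1000.0)
    hits = []
    for future in futures:  # iterate in submission order to keep results stable
        if future in done and future.exception() is None:
            hits.extend(future.result())
    executor.shutdown(wait=False)  # do not block on repositories that are still running
    return hits[:max_results]

# Hypothetical repositories used to exercise the sketch.
def fast_repository(query):
    return [f"{query}-hit-1", f"{query}-hit-2"]

def slow_repository(query):
    time.sleep(0.5)  # simulates a repository slower than the timeout threshold
    return [f"{query}-slow-hit"]
```

In this model, a short timeout drops the results of slow repositories, a small maximum truncates the merged list, and a larger pool lets more repositories be queried at once at the cost of more threads and memory.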
Note:
The Table Display URL, the File Display URL, and the E-mail Display URL are relative URLs. For Oracle Portal to work, you must replace these URLs with full URLs here, including hostname, port, and path.
Use this page to manage Oracle Ultra Search administrative users. You can assign a user to manage an Oracle Ultra Search instance. You can also select a language preference.
This section lets you set preference options for the Oracle Ultra Search administrator.
You can specify the date and time format. The pull-down menu lists the following languages:
English
Brazilian Portuguese
French
German
Italian
Japanese
Korean
Simplified Chinese
Spanish
Traditional Chinese
You can also select the number of rows to display on each page.
A user with super-user privileges can perform all administrative functions on all instances, including creating instances, dropping instances, and granting privileges. Only super-users can access this page.
Single sign-on users can use a delegated administrative service (DAS) list of values to add another single sign-on user as a super-user. These users are authenticated by OracleAS Single Sign-On before enabling access. Database users can add another database user as a super-user.
To grant super-user administrative privileges to another user, enter the user name of the user. Specify also whether the user should be permitted to grant super-user privileges to other users. Then click Add.
Only instance owners, users that have been granted general administrative privileges on this instance, or super-users are permitted to access the Privileges page. Instance owners must have been granted the WKUSER role.
Single sign-on users can use a delegated administrative service (DAS) list of values to add privileges to another single sign-on user. These users are authenticated by OracleAS Single Sign-On before enabling access. Database users can add privileges to another database user.
Note:
Database users cannot grant privileges to single sign-on users, and single sign-on users cannot grant privileges to database users. The DAS list of values only shows single sign-on users.
Granting general administrative privileges to a user enables that user to modify general settings for this instance. To do this, enter the user name and specify whether the user should be permitted to grant administrative privileges to other users. Then click Add.
To remove one or more users from the list of administrators for this instance, select one or more user names from the list of current administrators and click Remove.
Note:
General administrative privileges do not include the ability to create or delete an instance. These privileges belong to super-users.
Oracle Ultra Search lets you translate names to different languages. This page lets you enter multiple values for search attributes, list of values (LOV) display names, and data groups.
This section lets you translate attribute display names to different languages.
The pull-down menu lists the following languages:
English
Arabic
Brazilian Portuguese
Canadian French
Czech
Danish
Dutch
Finnish
French
German
Greek
Hebrew
Hungarian
Italian
Japanese
Korean
Latin American Spanish
Norwegian
Polish
Portuguese
Romanian
Russian
Simplified Chinese
Slovak
Spanish
Swedish
Thai
Traditional Chinese
Turkish