Oracle® Secure Enterprise Search Administrator's Guide
10g Release 1 (10.1.8)

Part Number B32259-01

3 Understanding Crawling and Searching

This chapter contains the following topics:

  • Overview of the Oracle Secure Enterprise Search Crawler

  • Overview of Crawler Settings

  • Overview of Attributes

  • Understanding the Crawling Process

  • Monitoring the Crawling Process

  • Overview of Searching in Oracle Secure Enterprise Search

Overview of the Oracle Secure Enterprise Search Crawler

The Oracle Secure Enterprise Search (SES) crawler is a Java process activated by a set schedule. When activated, the crawler spawns processor threads that fetch documents from sources. These documents are cached in the local file system. When the cache is full, the crawler indexes the cached files. This index is used for searching.

In the administration tool, you can create schedules with one or more sources attached to them. Schedules define the frequency at which the Oracle SES index is kept up to date with existing information in the associated sources.

Crawler URL Queue

During crawling, the crawler maintains an internal URL queue: a list of the URLs of discovered documents that are waiting to be fetched and indexed. The queue is stored persistently, so that crawls can resume after the Oracle SES instance is restarted.

Understanding Access URLs and Display URLs

A display URL is the URL string shown in search results; it is the link users click. An access URL is the URL string the crawler uses for crawling and indexing. An access URL is optional: if none exists, then the crawler uses the display URL for crawling and indexing; if one exists, then the crawler uses it instead of the display URL for crawling. For regular Web crawling, only display URLs are available. In some situations, however, the crawler needs an access URL to crawl an internal site while keeping a display URL for external use. For every internal URL, there is an external mirrored one.

For example, with file sources, defining display URLs lets end users open the original document with the HTTP or HTTPS protocols. These protocols provide the appropriate authentication and personalization and result in a better user experience.

Display URLs can be provided using the URL Rewriter API. Or, they can be generated by specifying the mapping between the prefix of the original file URL and the prefix of the display URL. Oracle SES replaces the prefix of the file URL with the prefix of the display URL. For example, if the file URL is file://localhost/home/operation/doc/file.doc and the display URL is https://webhost/client/doc/file.doc, then set the file URL prefix to file://localhost/home/operation and the display URL prefix to https://webhost/client.
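
The substitution itself is simple prefix replacement. The following minimal sketch illustrates the mapping; the class name and structure are illustrative only and are not part of the Oracle SES API.

// Illustrative sketch of the prefix mapping described above; the class and
// method names are hypothetical, not part of the Oracle SES API.
public class DisplayUrlMapper {
    private final String fileUrlPrefix;     // e.g. file://localhost/home/operation
    private final String displayUrlPrefix;  // e.g. https://webhost/client

    public DisplayUrlMapper(String fileUrlPrefix, String displayUrlPrefix) {
        this.fileUrlPrefix = fileUrlPrefix;
        this.displayUrlPrefix = displayUrlPrefix;
    }

    /** Replaces the file URL prefix with the display URL prefix. */
    public String toDisplayUrl(String fileUrl) {
        if (fileUrl.startsWith(fileUrlPrefix)) {
            return displayUrlPrefix + fileUrl.substring(fileUrlPrefix.length());
        }
        return fileUrl;  // no mapping configured for this URL
    }

    public static void main(String[] args) {
        DisplayUrlMapper mapper = new DisplayUrlMapper(
            "file://localhost/home/operation", "https://webhost/client");
        // Prints https://webhost/client/doc/file.doc
        System.out.println(mapper.toDisplayUrl(
            "file://localhost/home/operation/doc/file.doc"));
    }
}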

Using Crawler Plug-ins

In addition to the default source types that Oracle SES provides (such as Web, file, and OracleAS Portal), you can also crawl proprietary sources. This is accomplished by implementing a crawler plug-in as a Java class. The plug-in collects document URLs, associated metadata (including access privileges), and contents from the proprietary source and returns the information to the Oracle SES crawler. The crawler starts processing each document as it is collected.
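
At its core, a plug-in hands the crawler one document at a time, each carrying a URL, content, and metadata. The sketch below illustrates only that shape; the class and method names are hypothetical, and the real interfaces are defined in the Oracle SES Crawler Plug-in API.

import java.util.Iterator;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of a crawler plug-in for a proprietary source. The
// class shape here is illustrative only; the real Oracle SES Crawler
// Plug-in API defines its own interfaces.
public class InventoryCrawlerPlugin {
    /** One document harvested from the proprietary source. */
    public static final class Doc {
        final String url;                    // URL reported to the crawler
        final String content;                // body to be indexed
        final Map<String, String> metadata;  // attributes, incl. access privileges
        Doc(String url, String content, Map<String, String> metadata) {
            this.url = url; this.content = content; this.metadata = metadata;
        }
    }

    private final Iterator<Doc> docs = List.of(
        new Doc("inv://item/1001", "Widget assembly guide",
                Map.of("author", "ops", "acl", "group:engineering"))
    ).iterator();

    /** The crawler would call this repeatedly; null signals end of source. */
    public Doc fetchNext() {
        return docs.hasNext() ? docs.next() : null;
    }

    public static void main(String[] args) {
        InventoryCrawlerPlugin plugin = new InventoryCrawlerPlugin();
        for (Doc d; (d = plugin.fetchNext()) != null; ) {
            System.out.println(d.url + " -> " + d.metadata);
        }
    }
}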

Overview of Crawler Settings

You can alter the crawler's operating parameters, such as the crawler timeout threshold and the default character set, on the Global Settings - Crawler Configuration page in the administration tool.

This section describes crawler settings, as well as other mechanisms to control the scope of Web crawling:

  • Crawling Mode

  • URL Boundary Rules

  • Crawling Depth

  • Robots Exclusion

  • Index Dynamic Pages

  • URL Rewriter API

  • Title Fallback

See Also:

"Tuning Crawl Performance" for more detailed information on these settings and other issues affecting crawl performance

Crawling Mode

For initial planning purposes, you might want the crawler to collect URLs without indexing. After crawling is finished, examine the document URLs and status, remove unwanted documents, and start indexing. The crawling mode is set on the Home - Schedules - Edit Schedules page.

Note:

If you are using a custom crawler created with the Crawler Plug-in API, then the crawling mode set here will not apply. The implemented plug-in controls the crawling mode.

These are the crawling mode options:

  • Automatically Accept All URLs for Indexing: This crawls and indexes all URLs in the source. For Web sources, it also extracts and indexes any links found in those URLs. If the URL has been crawled before, then it will be reindexed only if it has changed.

  • Examine URLs Before Indexing: This crawls but does not index any URLs in the source. It also crawls any links found in those URLs.

  • Index Only: This crawls and indexes all URLs in the source. It does not extract any links from those URLs. In general, select this option for a source that has been crawled previously under "Examine URLs Before Indexing".

URL Boundary Rules

URL boundary rules limit the crawling space. When boundary rules are added, the crawler is restricted to URLs that match the indicated rules. The order in which rules are specified has no impact, but exclusion rules always override inclusion rules.

This is set on the Home - Sources - Boundary Rules page.

Inclusion Rules

Specify an inclusion rule that a URL contain, start with, or end with a term. Use an asterisk (*) to represent a wildcard; for example, www.*.example.com. Simple inclusion rules are case-insensitive. For case-sensitivity, use regular expression rules.

An inclusion rule ending with example.com limits the search space to URLs ending with that string. For example, http://www.example.com is crawled, but http://www.example.com.tw is not.

If the URL Submission functionality is enabled on the Global Settings - Query Configuration page, then URLs that are submitted by end users are added to the inclusion rules list. You can delete URLs that you do not want to index.

Oracle SES supports the regular expression syntax used in the Java JDK 1.4.2 Pattern class (java.util.regex.Pattern). Regular expression rules use special characters. The following is a summary of some basic regular expression constructs; a short test program follows the list.

  • Use a caret (^) to denote the beginning of a URL and a dollar sign ($) to denote the end of a URL.

  • Use a period (.) to match any one character.

  • Use a question mark (?) to match zero or one occurrence of the character that it follows.

  • Use an asterisk (*) to match zero or more occurrences of the pattern that it follows. An asterisk can be used in the starts with, ends with, and contains rules.

  • Use a backslash (\) to escape any special characters, such as periods (\.), question marks (\?), or asterisks (\*).
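
These constructs can be tested outside Oracle SES with java.util.regex.Pattern, the same engine the boundary rules use. A quick check (the pattern and URLs here are illustrative):

import java.util.regex.Pattern;

// Demonstrates the constructs above with java.util.regex.Pattern,
// the regular expression class Oracle SES uses for boundary rules.
public class BoundaryRuleDemo {
    public static void main(String[] args) {
        // Matches any URL under the example.com domain over HTTP.
        Pattern rule = Pattern.compile("^http://.*\\.example\\.com/");
        System.out.println(rule.matcher("http://www.example.com/index.html").find()); // true
        System.out.println(rule.matcher("http://www.example.com.tw/").find());        // false
    }
}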

See Also:

See the Sun Microsystems Java documentation at http://java.sun.com for a complete description.

Exclusion Rules

You can specify an exclusion rule that a URL contain, start with, or end with a term.

An exclusion rule of uk.example.com prevents the crawling of Example hosts in the United Kingdom.

Default Exclusion Rules

The crawler contains a default exclusion rule to exclude non-textual files. The following file extensions are included in the default exclusion rule.

  • Image: jpg, gif, tif, bmp, png

  • Audio: wav, mp3, wma

  • Video: avi, mpg, mpeg, wmv

  • Binary: bin, exe, so, dll, iso, jar, war, ear, tar, scm, cab, dmp

Example Using Regular Expression

The following example uses several regular expression constructs that are not described earlier, including range quantifiers, non-grouping parentheses, and mode switches. For a complete description, see the Sun Microsystems Java documentation.

Suppose you want to crawl only HTTPS URLs in the example.com and examplecorp.com domains. Also, you want to exclude files ending in .doc and .ppt. (A small test program after the rules verifies them.)

  • Inclusion: URL regular expression ^https://.*\.example(?:corp){0,1}\.com

  • Exclusion: URL regular expression (?i:\.doc|\.ppt)$
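
You can verify such rules against sample URLs before deploying them, again using java.util.regex directly (the sample URLs are illustrative):

import java.util.regex.Pattern;

// Checks the inclusion and exclusion rules above against sample URLs.
public class CrawlRuleCheck {
    public static void main(String[] args) {
        Pattern include = Pattern.compile("^https://.*\\.example(?:corp){0,1}\\.com");
        Pattern exclude = Pattern.compile("(?i:\\.doc|\\.ppt)$");

        System.out.println(include.matcher("https://www.examplecorp.com/index.html").find()); // true
        System.out.println(include.matcher("http://www.example.com/index.html").find());      // false: not HTTPS
        System.out.println(exclude.matcher("https://www.example.com/plan.PPT").find());       // true: excluded
    }
}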

Crawling Depth

Crawling depth is the maximum number of nested links the crawler will follow. (A Web document could contain links to other Web documents, which could contain more links.)

This is set on the Home - Sources - Crawling Parameters page.

Robots Exclusion

You can control which parts of your sites robots can visit. If robots exclusion is enabled (the default), then the Web crawler traverses pages based on the access policy specified in the Web server robots.txt file. The crawler also respects the page-level robots exclusion specified in HTML metatags.

For example, when a robot visits http://www.example.com/, it checks for http://www.example.com/robots.txt. If the file exists, then the crawler checks whether it is allowed to retrieve the document. If you own the Web sites, then you can disable robots exclusion. However, when crawling other Web sites, always comply with robots.txt by leaving robots exclusion enabled.
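
For reference, a robots.txt file that blocks all robots from two directories looks like the following (standard robots.txt syntax, not specific to Oracle SES; the paths are illustrative):

User-agent: *
Disallow: /cgi-bin/
Disallow: /private/

The page-level equivalent inside an HTML document is a robots metatag:

<meta name="robots" content="noindex,nofollow">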

This is set on the Home - Sources - Crawling Parameters page.

Index Dynamic Pages

By default, Oracle SES processes dynamic pages. Dynamic pages are generally served from a database application and have a URL that contains a question mark (?); Oracle SES identifies URLs with question marks as dynamic pages.

Some dynamic pages appear as multiple search results for the same page, and you might not want them all indexed. Other dynamic pages are all different and must be indexed individually. You must distinguish between these two kinds of dynamic pages. In general, dynamic pages that differ only in menu expansion, without any change in content, should not be indexed. Consider the following three URLs:

http://itweb.oraclecorp.com/aboutit/network/npe/standards/naming_convention.html
 
http://itweb.oraclecorp.com/aboutit/network/npe/standards/naming_convention.html?nsdnv=14z1
 
http://itweb.oraclecorp.com/aboutit/network/npe/standards/naming_convention.html?nsdnv=14

The question mark ('?') in the URL indicates that the rest of the string consists of input parameters. The duplicate results are essentially the same page with different side menu expansions. Ideally, the search should yield only one result:

http://itweb.oraclecorp.com/aboutit/network/npe/standards/naming_convention.html

Note:

The crawler cannot crawl and index dynamic Web pages written in JavaScript.

This is set on the Home - Sources - Crawling Parameters page.

URL Rewriter API

The URL Rewriter is a user-supplied Java module that implements the Oracle SES UrlRewriter interface. The crawler uses it to filter or rewrite extracted URL links before they are put into the URL queue. The API gives you complete control over which links extracted from a Web page are kept and which are discarded.

URL filtering removes unwanted links, and URL rewriting transforms the URL link. This transformation is necessary when access URLs are used and alternate display URLs need to be presented to the user in the search results.
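
The general shape of such a module is sketched below. The class and method shown are illustrative only and do not reproduce the actual UrlRewriter interface; consult the Oracle SES API documentation for the real signatures.

// Hypothetical sketch of a URL rewriter; the method shown here is
// illustrative and NOT the actual Oracle SES UrlRewriter signature.
public class InternalToExternalRewriter {
    private static final String ACCESS_PREFIX  = "http://internal.example.com";
    private static final String DISPLAY_PREFIX = "https://www.example.com";

    /** Returns null to discard a link, or the (possibly rewritten) URL to keep it. */
    public String rewrite(String url) {
        if (url.contains("/tmp/")) {
            return null;                      // filter: drop unwanted links
        }
        if (url.startsWith(ACCESS_PREFIX)) {  // rewrite: internal -> external form
            return DISPLAY_PREFIX + url.substring(ACCESS_PREFIX.length());
        }
        return url;                           // pass through unchanged
    }

    public static void main(String[] args) {
        InternalToExternalRewriter r = new InternalToExternalRewriter();
        // Prints https://www.example.com/docs/guide.html
        System.out.println(r.rewrite("http://internal.example.com/docs/guide.html"));
        // Prints null (link discarded)
        System.out.println(r.rewrite("http://internal.example.com/tmp/scratch.html"));
    }
}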

This is set on the Home - Sources - Crawling Parameters page.

Title Fallback

You can override a default document title with a meaningful title if the default title is irrelevant. For example, suppose that the result list shows numerous documents with the title "Daily Memo". The documents had been created with the same template file, but the document properties had not been changed. Overriding this title in Oracle SES can help users better understand their search results.

Title fallback can be used for any source type. Oracle SES uses different logic for each document type to determine which fallback title to use. For example, for HTML documents, Oracle SES looks for the first heading, such as <h1>. For Microsoft Word documents, Oracle SES looks for text with the largest font.

If the default title was collected in the initial crawl, then the fallback title is used only after the document is reindexed during a re-crawl. Consequently, if the document has not changed, then you must force reindexing by setting the re-crawl policy to Process All Documents on the Home - Schedules - Edit Schedule page.

This feature is not currently supported in the Oracle SES administration tool. Configure title fallback in the crawler configuration file: $ORACLE_HOME/search/data/config/crawler.dat.

Notes:

  • When a title is null, Oracle SES automatically indexes the fallback title for all binary documents (for example, .doc, .ppt, .pdf). For HTML and text documents, Oracle SES does not automatically index the fallback title. This means that the replaced title on HTML or text documents cannot be searched with the title attribute on the Advanced Search page. You can turn on indexing for HTML and text documents in the crawler.dat file. (For example, set NULL_TITLE_FALLBACK_INDEX ALL.)

  • The crawler.dat file is not included in the backup available on the Global Settings - Configuration Data Backup and Recovery page. Make sure you manually back up the crawler.dat file.

Overview of Attributes

Each source has its own set of document attributes. Document attributes, like metadata, describe the properties of a document. The crawler retrieves values and maps them to one of the search attributes. This mapping lets users search documents based on their attributes. Document attributes in different sources can be mapped to the same search attribute. Therefore, users can search documents from multiple sources based on the same search attribute.

Document attribute information is obtained differently depending on the source type. For example, with Web sources, document attributes are extracted from HTML META tags. With table sources, any column in the source table can be chosen as a document attribute. With user-defined sources, document attributes and values can be returned by the crawler plug-in module.

Document attributes can be used for many purposes, including document management, access control, and version control. Different sources can use different attribute names for the same concept; for example, "version" and "revision". Sources can also use the same attribute name for different concepts; for example, "language" might mean natural language in one source but programming language in another.

Oracle SES has several default search attributes. They can be incorporated in search applications for a more detailed search and richer presentation.

Search attributes are defined in the following ways:

  • System-defined search attributes

  • Search attributes created by the Oracle SES administrator

  • Search attributes created by a crawler plug-in

The list of values (LOV) for a search attribute can help you specify a search. Global search attributes can be specified on the Global Settings - Search Attributes page. For user-defined sources where LOV information is supplied through a crawler plug-in, the crawler registers the LOV definition. Use the administration tool or the crawler plug-in to specify attribute LOVs, attribute values, attribute value display names, and their translations.

Note:

When multiple sources define the LOV for a common attribute, such as title, the user sees all the possible values for the attribute. When the user restricts search within a particular source group, only LOVs provided by the corresponding sources in the source group will be shown.

Understanding the Crawling Process

The first time the crawler runs, it must fetch data (Web pages, table rows, files, and so on) based on the source type. It then adds each document to the Oracle SES index.

The Initial Crawl

This section describes a Web source crawling process for a schedule. It is broken into two phases:

  1. Queuing and Caching Documents

  2. Indexing Documents

Queuing and Caching Documents

The steps in the crawling cycle are the following:

  1. Oracle spawns the crawler according to the schedule you specify with the administration tool. When crawling is initiated for the first time, the URL queue is populated with the seed URLs.

  2. The crawler initiates multiple crawling threads.

  3. The crawler thread removes the next URL from the queue.

  4. The crawler thread fetches the document from the Web. The document is usually an HTML file containing text and hypertext links.

  5. The crawler thread scans the HTML file for hypertext links and inserts new links into the URL queue. Duplicate links already in the document table are discarded.

  6. The crawler caches the HTML file in the local file system.

  7. The crawler registers the URL in the URL table.

  8. The crawler thread starts over by repeating Step 3.

Fetching a document, as described in Step 4, can be time-consuming because of network traffic or slow Web sites. For maximum throughput, multiple threads fetch pages at any given time.
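
The following is a greatly simplified sketch of one crawler thread's cycle (Steps 3 through 8), assuming a thread-safe queue and a cache directory. It illustrates the loop described above only; it is not the Oracle SES implementation.

import java.io.IOException;
import java.io.InputStream;
import java.net.URL;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.concurrent.BlockingQueue;

// Greatly simplified sketch of one crawler thread's fetch cycle.
public class CrawlerThread implements Runnable {
    private final BlockingQueue<String> urlQueue;  // the persistent URL queue
    private final Path cacheDir;                   // local file system cache

    CrawlerThread(BlockingQueue<String> urlQueue, Path cacheDir) {
        this.urlQueue = urlQueue;
        this.cacheDir = cacheDir;
    }

    public void run() {
        try {
            while (true) {
                String url = urlQueue.take();                       // Step 3: next URL
                try (InputStream in = new URL(url).openStream()) {  // Step 4: fetch
                    Path cached = cacheDir.resolve(Integer.toHexString(url.hashCode()));
                    Files.copy(in, cached, StandardCopyOption.REPLACE_EXISTING); // Step 6: cache
                    // Steps 5 and 7 (link extraction, URL registration) omitted here
                } catch (IOException e) {
                    System.err.println("Fetch failed: " + url);     // a fetch failure
                }
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();                     // shut down cleanly
        }
    }
}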

Indexing Documents

When the file system cache is full (the default maximum size is 250 MB), the indexing process begins. At this point, the document contents and any searchable attributes are pushed into the index. After the documents in the batch have been indexed, the crawler switches back to queuing and caching mode.

Maintenance Crawls

After the initial crawl, a page is crawled and indexed again only if it has changed since the last crawl. The crawler determines whether a page has changed by using the HTTP If-Modified-Since header field or the checksum of the page. URLs that no longer exist are marked and removed from the index.

To update changed documents, the crawler uses an internal checksum to compare new Web pages with cached Web pages. Changed Web pages are cached and marked for reindexing.
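
The comparison works along these lines. This sketch uses MD5 purely for illustration; the actual checksum algorithm is internal to Oracle SES.

import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Arrays;

// Illustrative checksum comparison; the actual algorithm is internal to SES.
public class PageChangeDetector {
    static byte[] checksum(String pageContent) throws NoSuchAlgorithmException {
        return MessageDigest.getInstance("MD5")
                .digest(pageContent.getBytes(StandardCharsets.UTF_8));
    }

    public static void main(String[] args) throws NoSuchAlgorithmException {
        String cached = "<html><body>old content</body></html>";
        String fetched = "<html><body>new content</body></html>";
        // If the checksums match, the page is unchanged and is discarded;
        // otherwise it is re-cached and marked for reindexing.
        boolean changed = !Arrays.equals(checksum(cached), checksum(fetched));
        System.out.println(changed ? "reindex" : "discard");
    }
}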

The steps involved in data synchronization are the following:

  1. Oracle spawns the crawler according to the schedule you specify with the administration tool. The URL queue is populated with the seed URLs of the source assigned to the schedule.

  2. The crawler initiates multiple crawling threads.

  3. Each crawler thread removes the next URL in the queue.

  4. Each crawler thread fetches the document from the Web. The page is usually an HTML file containing text and hypertext links. When the document is not in HTML format, the crawler tries to convert the document into HTML before caching.

  5. Each crawler thread calculates a checksum for the newly retrieved page and compares it with the checksum of the cached page. If the checksum is the same, then the page is discarded and the crawler goes to Step 3. Otherwise, the crawler moves to the next step.

  6. Each crawler thread scans the document for hypertext links and inserts new links into the URL queue. Links that are already in the document table are discarded. (Oracle SES does not follow links from filtered binary documents.)

  7. The crawler marks the URL as "accepted". The URL will be crawled in future maintenance crawls.

  8. The crawler registers the URL in the document table.

  9. If the file system cache is full or if the URL queue is empty, then Web page caching stops and indexing begins. Otherwise, the crawler thread starts over at Step 3.

Monitoring the Crawling Process

Monitor the crawling process in the administration tool by using a combination of the following:

  • Crawler statistics

  • The crawler log file

  • The crawler configuration file

Crawler Statistics

The following crawler statistics are shown on the Home - Schedules - Crawler Progress Summary page. Some of these statistics are also shown in the log file, under "Crawling results".

  • Documents to Fetch: Number of URLs in the queue waiting to be crawled. The log file uses the term "Documents to Process".

  • Documents Fetched: Number of documents retrieved by the crawler.

  • Document Fetch Failures: Number of documents whose contents cannot be retrieved by the crawler. This could be due to an inability to connect to the Web site, slow server response time causing timeouts, or authorization requirements. Problems encountered after successfully fetching the document are not considered here; for example, documents that are too big or documents ignored due to duplicates.

  • Documents Rejected: Number of URL links encountered but not considered for crawling. The rejection could be due to boundary rules, the robots exclusion rule, the MIME type inclusion rule, the crawling depth limit, or the URL rewriter discard directive.

  • Documents Discovered: All documents discovered during crawling. This is roughly equal to (documents to fetch) + (documents fetched) + (document fetch failures) + (documents rejected).

  • Documents Indexed: Number of documents that have been indexed or are pending indexing.

  • Documents non-indexable: Number of documents that cannot be indexed; for example, a file source directory or a document with robots NOINDEX metatag.

  • Document Conversion Failures: Number of document filtering errors. This is counted whenever a document cannot be converted to HTML format.

Crawler Log File

The log file records all crawler activity, warnings, and error messages for a particular schedule. It includes messages logged at startup, runtime, and shutdown. Logging everything can create very large log files when crawling a large number of documents. However, in certain situations, it can be beneficial to configure the crawler to print detailed activity to each schedule log file.

A new log file is created when you restart the crawler. The crawler maintains the past seven versions of its log file, but only the most recent log file is shown in the administration tool. You can view the other log files in the file system. The location of the crawler log file can be found on the Home - Schedules - Crawler Progress Summary page.

Log file names follow the convention ids.MMDDhhmm.log, where ids is a system-generated ID that uniquely identifies the source, MM is the month, DD is the date, hh is the launching hour in 24-hour format, and mm is the minutes.

For example, if a schedule for a source identified as i3ds23 is launched at 10 pm, July 8th, then the log file name is i3ds23.07082200.log. Each successive schedule launching will have a unique log file name. If the total number of log files for a source reaches seven, then the oldest log file is deleted.

Each logging message in the log file is one line containing the following six tab-delimited columns, in order:

  1. Timestamp

  2. Message level

  3. Crawler thread name

  4. Component name. This is generally the name of the executing Java class.

  5. Module name. This is typically an internal Java class method name.

  6. Message

Crawler Configuration File

The crawler configuration file is $ORACLE_HOME/search/data/config/crawler.dat. All crawler configuration tasks except title fallback are controlled in the Oracle SES administration tool. The only reason to configure this file is to replace default document titles using the title fallback feature.

Note:

The crawler.dat file is not backed up with Oracle SES backup and recovery. If you edit this file, make sure to back it up manually.

Setting the Logging Level

Specify the crawler logging level with the Java system property -Doracle.search.logLevel. The defined levels are DEBUG (2), INFO (4), WARN (6), ERROR (8), and FATAL (10). The default value is 4, which means that messages of level 4 and higher are logged; DEBUG (level 2) messages are not logged by default.

For example, the following "info" message is logged at 23:10:39330. It is from thread name crawler_2, and the message is Processing file://localhost/net/stawg02/. The component and module names are not specified.

23:10:39:330 INFO    crawler_2      Processing file://localhost/net/stawg02/
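
Because each message is tab-delimited, schedule logs are easy to post-process. The following is a minimal sketch, assuming the six-column format above, that counts messages per level:

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Map;
import java.util.TreeMap;

// Sketch: count messages per level in a schedule log, assuming the
// six tab-delimited columns described above.
public class LogLevelCount {
    public static void main(String[] args) throws IOException {
        Map<String, Integer> counts = new TreeMap<>();
        for (String line : Files.readAllLines(Paths.get(args[0]))) {
            String[] cols = line.split("\t", 6);
            if (cols.length >= 2) {
                counts.merge(cols[1], 1, Integer::sum);  // column 2 is the message level
            }
        }
        counts.forEach((level, n) -> System.out.println(level + "\t" + n));
    }
}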

The crawler uses a set of codes to indicate the crawling result of each crawled URL. Besides the standard HTTP status codes, it uses its own codes for situations not related to HTTP.

Replacing Default Document Titles Using Title Fallback

Override a default document title with a meaningful title by adding the keyword BAD_TITLE to the crawler.dat file. For example:

BAD_TITLE Daily Memo

where Daily Memo is the title string to override. The title string is case-insensitive and can use multibyte characters in the UTF8 character set.

Multiple bad titles can be specified, each one on a separate line.
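
For example (the title strings here are illustrative):

BAD_TITLE Daily Memo
BAD_TITLE Untitled Document
BAD_TITLE Slide 1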

See Also:

"Title Fallback" for more information on this feature

Overview of Searching in Oracle Secure Enterprise Search

To get to the end user search page from any page in the administration tool, click the Search link in the top right corner. This brings up the Basic Search page in a new window, with a text box to enter a search string. This section contains the following topics:

  • Basic Search

  • Advanced Search

  • Browse Source Groups

  • Submit URL

Basic Search

The search string can consist of one or more words. Clicking the search button returns all matches for that search string. The results can include the following links:

  • Cached: The cached HTML version of the document.

  • Links: Pages that link to and from this document.

  • Source Group: This link leads to Browse Source Groups.

Any links above the search text box are source groups. Clicking a source group restricts the search to that group.

The following table describes rules that apply to the search string. Text in square brackets represents characters entered into the search.

Table 3-1 Search String Rules

Single word search

Entering one word finds documents that contain that word.

For example, searching for [Oracle] finds all documents that contain the word Oracle anywhere in that document.

Compulsory inclusion [+]

Attaching a [+] in front of a word requires that the word be found in all matching documents.

For example, searching for [Oracle +Applications] only finds documents that contain both the words Oracle and Applications. Note: in a multiple-word search, you can attach a [+] in front of every token, including the very first token. A token is a single word or a phrase enclosed in double quotation marks ("); there must be no space between the [+] and the token.

Compulsory exclusion [-]

Attaching a [-] in front of a word requires that the word be absent from all matching documents. For example, searching for [Oracle -Applications] only finds documents that contain the word Oracle and not the word Applications. Note: in a multiple-word search, you can attach a [-] in front of every token except the very first token. A token is a single word or a phrase enclosed in double quotation marks ("); there must be no space between the [-] and the token.

Phrase matching ["..."]

Putting quotes around a set of words only finds documents that contain that precise phrase. For example, searching for ["Oracle Applications"] only finds documents that contain the string Oracle Applications.

Wildcard matching [*]

Attaching a [*] to the right side of a word matches words that begin with that prefix. For example, searching for [Ora*] finds documents that contain words beginning with Ora, such as Oracle and Orator. You can also insert an asterisk in the middle of a word. For example, searching for [A*e] finds documents that contain words such as Apple or Ape.

Wildcard matching cannot be used with Chinese or Japanese native characters.

Site search

Attaching [site:host] after the search term limits results to that particular site. For example, "documentation site:www.oracle.com". Oracle SES supports exact host matching (that is, site:*.oracle.com does not work) and one "site:" for each search.

File type filtering

Attaching [filetype:filetype] after the search term limits results to that particular file type. For example, [documentation filetype:pdf] returns PDF documents that contain the term documentation. A search can have only one filetype term. The following file types are supported, with their corresponding MIME types:

  • ps: application/postscript

  • ppt: application/vnd.ms-powerpoint, application/x-mspowerpoint

  • doc: application/msword

  • xls: application/vnd.ms-excel, application/x-msexcel, application/ms-excel

  • txt: text/plain

  • html: text/html

  • htm: text/html

  • pdf: application/pdf

  • xml: text/xml

  • rtf: application/rtf


Oracle SES supports the STRING, NUMBER, and DATE (mm/dd/yyyy) attributes with the following operators:

  • The CONTAINS operator applies only to STRING attributes; Oracle SES returns documents with an attribute value that contains the query terms.

  • The EQUALS operator applies to all three attribute types; Oracle SES returns documents with an attribute value that equals the query value, ignoring case.

  • The GREATERTHAN operator applies to NUMBER and DATE attributes; Oracle SES returns documents with an attribute value greater than (for NUMBER) or later than (for DATE) the query value.

  • The LESSTHAN operator applies to NUMBER and DATE attributes.

  • The GREATERTHANEQUALS operator applies to NUMBER and DATE attributes.

  • The LESSTHANEQUALS operator applies to NUMBER and DATE attributes.

Advanced Search

The Advanced Search page lets you refine searches in the following ways:

Narrowing Searches by Search Attributes

With the Advanced Search page, you can require that documents matching your search have specific attribute values. To specify a search attribute value, use the list boxes to select a search attribute, and enter the value in the text box immediately to the right of the list box. Dates must be entered in MM/DD/YYYY format.

Limiting Searches to Certain Sources

If one or more source groups are defined, then corresponding check boxes appear when you select specific categories. You can limit your search to source groups by selecting those check boxes. If no source group is selected, then all documents are searched. If you select all source groups, then documents that are not in any selected group (that is, documents in the default group) are not searched.

A source group represents a collection of documents. Source groups are created by the Oracle SES administrator.

Limiting Searches to Documents Written in a Specific Language

Oracle SES can search documents in different languages. Specifying a language restricts searches to documents that are written in that language. Use the language list box to specify a language.

Browse Source Groups

Source groups are groups of sources that can be searched together. A source group consists of one or more sources, and a source can be assigned to multiple source groups. Source groups are defined on the Search - Source Groups page. Groups, or folders, are only generated for Web, e-mail, and OracleAS Portal source types.

On the Search page, users can browse source groups that the administrator created. Click a source group name to see the subgroups under it, or drill down further into the hierarchy by clicking a subgroup name. To view all the documents under a particular group, click the number next to the source group name. You can also perform a restricted search in the source group from this page.

The source hierarchy lets end users limit search results based on document source type. The hierarchy is generated automatically during crawl time.

Submit URL

The URL submission feature lets users submit URLs to be crawled and indexed. These URLs are added to the seed URL list for a particular source and included in the crawler search space. If you allow URL submission (on the Global Settings - Query Configuration page), then you must select the Web source to which submitted URLs will be added.

Note:

This feature is disabled on the Search page if no sources have been created.