7 Oracle Secure Enterprise Search APIs

This chapter explains the Oracle Secure Enterprise Search (SES) APIs and related information. This chapter contains the following topics:

Overview of Oracle Secure Enterprise Search APIs
Oracle Secure Enterprise Search Web Services APIs
Oracle Secure Enterprise Search Java SDK

Overview of Oracle Secure Enterprise Search APIs

Oracle Secure Enterprise Search provides the following APIs:

Web Services APIs

The Web Services APIs are used to integrate Oracle SES search capabilities into your search application. Oracle SES provides Java proxy libraries. You can either use the Java libraries or create your own proxies, based on the published Web Services Description Language (WSDL) files, to access Oracle SES Web Services.

The Query Web Service API lets you perform search queries; for example, search for "oracle benefits" and return all matching documents.

The Admin Web Service API lets you perform a subset of administrative actions, such as starting and stopping a crawler schedule or getting the index fragmentation level.

See Also:

The "Web Services Interface" section in the Oracle SES administration tutorial:

http://st-curriculum.oracle.com/tutorial/SESAdminTutorial/index.htm

Crawler Plug-in API

The Crawler Plug-in API is used to crawl and index proprietary document repositories. This is included in the SDK.

Query-time Authorization API

The Query-time Authorization API filters search results and access to document information at search time. Query-time filtering can be used in addition to, or in place of, ACLs. This is included in the SDK.

URL Rewriter API

The URL Rewriter API is used by the crawler to filter and rewrite extracted URL links before they are inserted into the URL queue. This is included in the SDK.

Oracle Secure Enterprise Search Web Services APIs

Oracle Secure Enterprise Search Web Services APIs let you write your own application to search and administer Oracle SES over the network. The APIs provide the following benefits:

Applications can be deployed on any machine that connects to the Oracle SES server through a standard Internet protocol.
The Web Services protocol is XML-based, which makes for easy application integration.

Oracle SES also provides the client-side Java proxies for marshalling and parsing Web Services SOAP messages. Client applications can use the library instead of creating SOAP requests and parsing SOAP responses by themselves to access Oracle SES Web Services.

This section contains the following topics:

Web Services Concepts
Oracle Secure Enterprise Search Web Services Architecture
Oracle Secure Enterprise Search Web Services Common Data Types
Oracle Secure Enterprise Search Query Web Service Operations
Oracle Secure Enterprise Search Query Web Service Query Syntax
Oracle Secure Enterprise Search Query Web Service Example
Oracle Secure Enterprise Search Query Web Service Installation
Client-Side Query Java Proxy Library
Internally Used Query Web Service Messages
Oracle Secure Enterprise Search Admin Web Service Endpoint Location
Client-Side Admin Java Proxy Library
Oracle Secure Enterprise Search Admin Web Service SOAP Fault Error Codes

Web Services Concepts

Oracle SES Web Services consists of a remote procedure call (RPC) interface to Oracle SES that enables the client application to invoke operations on Oracle SES over the network. The client application uses the Web Services Description Language (WSDL) specification published at the Oracle SES Web Services URL to send a request message using Simple Object Access Protocol (SOAP). The server then responds to the client application with a SOAP response message.

This section explains the following concepts:

Web Services
Simple Object Access Protocol
Web Services Description Language

Web Services

A Web Service is a software application identified by a URI whose interfaces and binding are capable of being defined, described, and discovered by XML artifacts. A Web Service supports direct interactions with other software applications using XML-based messages and internet-based protocols.

A Web Service does the following:

Exposes and describes itself: A Web Service defines its functionality and attributes so that other applications can understand it. By providing a WSDL file, a Web Service makes its functionality available to other applications.
Allows other services to locate it on the Web: A Web Service can be registered in a UDDI registry so that applications can locate it.
Can be invoked: After a Web Service has been located and examined, the remote application can invoke the service using an Internet standard protocol.
Web Services are of either request and response or one-way style, and they can use either synchronous or asynchronous communication. However, the fundamental unit of exchange between Web Services clients and Web Services, of either style or type of communication, is a message.

Simple Object Access Protocol

The Simple Object Access Protocol (SOAP) is a lightweight XML-based protocol for exchanging information in a decentralized distributed environment. SOAP supports different styles of information exchange, including RPC-oriented and message-oriented exchange. RPC style information exchange allows for request-response processing, where an endpoint receives a procedure-oriented message and replies with a correlated response message. Message-oriented information exchange supports organizations and applications that need to exchange messages or other types of documents where a message is sent, but the sender might not expect or wait for an immediate response. Message-oriented information exchange is also called document style exchange.

SOAP has the following features:

Protocol independence
Language independence
Platform and operating system independence
Support for SOAP XML messages incorporating attachments (using the multipart MIME structure)

Web Services Description Language

The Web Services Description Language (WSDL) is an XML format for describing network services containing RPC-oriented and message-oriented information. Programmers or automated development tools can create WSDL files to describe a service and can make the description available over the Internet. Client-side programmers and development tools can use published WSDL specifications to obtain information about available Web Services and to build and create proxies or program templates that access available services.

Oracle Secure Enterprise Search Web Services Architecture

Oracle Secure Enterprise Search Web Services is powered by the Oracle SES middle tier OC4J server. The implementation, configuration, and deployment of Oracle SES Web Services follow the procedures and standards provided by OC4J server.

Oracle SES WSDL defines the operations and messages for Oracle SES Web Services. The message exchange of Oracle SES Web Services is RPC style, in which the contents of the SOAP message body conform to a structure that specifies a procedure and includes a set of parameters, or a response with a result and any additional parameters.

Oracle SES SOAP messages use HTTP binding, where a SOAP message is embedded in the body of an HTTP request and a SOAP message is returned in the HTTP response.

The following diagram illustrates the architecture of Oracle SES Web Services:

Description of the illustration benri001.gif

Development Platforms

You can implement client applications using platforms that support Simple Object Access Protocol (SOAP), such as Oracle JDeveloper, Microsoft .NET, or Apache Axis. These platforms allow you to automatically create code using the Oracle SES WSDL interface. Include the generated code along with the application logic to create a request, invoke the Web Services, and interpret the response.

Oracle Secure Enterprise Search Web Services Operations

Oracle Secure Enterprise Search provides the following categories of Web Services operations:

Authentication: Authenticate a user's access to Oracle SES. The operation is only required if the user performs secure search.
Search: Run a search on Oracle SES and obtain a hitlist along with information such as estimated hit count, near duplicate documents in the hitlist, suggested links, and alternate keywords for the executed search. Get suggested content from external providers for the given query.
Metadata: Obtain the search metadata, such as the list of source groups, the list of supported languages, or the list of search attributes.
Search Hit: Obtain the search result details, such as the cached version of a search result and the in-links and out-links of the search hit.
User Feedback: Send user feedback to Oracle SES, such as a user-submitted URL.

Oracle Secure Enterprise Search Web Services Common Data Types

This section contains the following topics:

Base Data Types
XML-to-Java Data Type Mappings
Complex Types
Array Types

Base Data Types

Oracle Secure Enterprise Search Web Services uses the following base data types:

Table 7-1 Base Data Types

Base Type	Description	Example
xsd:boolean	Boolean	true, false
xsd:date	Date	2005-12-31
xsd:int	Integer	256
xsd:long	Long integer	12345678900
xsd:string	String	Oracle Secure Enterprise Search

XML-to-Java Data Type Mappings

The mapping between XML schema data types and Java data types depends on the SOAP development environment. The following table shows mappings for the Oracle JDeveloper environment:

Table 7-2 XML-to-Java Type Mappings

XML Schema	Oracle JDeveloper
xsd:boolean	java.lang.Boolean
xsd:date	java.util.Date
xsd:int	java.lang.Integer
xsd:long	java.lang.Long
xsd:string	java.lang.String
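As an illustration of the mappings in Table 7-2, the following minimal sketch converts XSD lexical values into the listed Java types by hand. The class name XsdValues and its methods are hypothetical and are not part of the Oracle SES libraries; generated proxies normally do this conversion for you. The sketch assumes that an xsd:date value carries no time zone suffix and interprets it as midnight UTC.

```java
import java.time.LocalDate;
import java.time.ZoneOffset;
import java.util.Date;

/** Hypothetical helper: converts XSD lexical values into the Java types of Table 7-2. */
class XsdValues {

    /** xsd:boolean allows the lexical forms "true", "false", "1", and "0". */
    static Boolean toBoolean(String lexical) {
        switch (lexical.trim()) {
            case "true":
            case "1":
                return Boolean.TRUE;
            case "false":
            case "0":
                return Boolean.FALSE;
            default:
                throw new IllegalArgumentException("Not an xsd:boolean: " + lexical);
        }
    }

    /** xsd:date in the form yyyy-MM-dd; midnight UTC is an assumption of this sketch. */
    static Date toDate(String lexical) {
        LocalDate day = LocalDate.parse(lexical.trim());
        return Date.from(day.atStartOfDay(ZoneOffset.UTC).toInstant());
    }

    static Integer toInt(String lexical) {
        return Integer.valueOf(lexical.trim());
    }

    static Long toLong(String lexical) {
        return Long.valueOf(lexical.trim());
    }

    public static void main(String[] args) {
        // Example values taken from Table 7-1.
        System.out.println(toBoolean("true"));
        System.out.println(toDate("2005-12-31").toInstant());
        System.out.println(toInt("256"));
        System.out.println(toLong("12345678900"));
    }
}
```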

Complex Types

Oracle Secure Enterprise Search Web Services uses the following complex data types:

OracleSearchResult

The search result container. It has the following elements:

returnCount: A Boolean value indicating whether the result returns a hit count estimate for the hitlist
estimatedHitCount: The estimated count of the search result, -1 means the search result does not return estimated hit count
dupRemoved: A Boolean value indicating whether duplicate documents have been removed from search result
dupMarked: A Boolean value indicating whether duplicate documents have been marked in search result. If dupRemoved is true, then dupMarked is always false
resultElements: An array of resultElement, which represents the actual hitlist
suggestedLinks: An array of suggestedLink for the given search
query: The actual search string. The search string should follow Oracle SES query syntax
altKeywords: Alternate keywords (suggestions) for the given search
startIndex: The start index of search results
docsReturned: The number of search hits returned

ResultElement

This is the data type for search result element. It has the following elements:

author: Primary author of the document
description: Description of the document
url: URL of the document
snippet: Keywords in context (KWIC) of the document
title: Title of the document
lastModified: Last modified date of the document
mimetype: Mime type of the document
score: Oracle Text score of the document
docID: Document ID
language: Language of the document
contentLength: Content length of the document
signature: Signature of the document
infoSourceID: InfoSource ID of the document
infoSourcePath: InfoSource path of the document
groups: Array of groups to which the document belongs
isDuplicate: Boolean value indicating whether this document is a duplicate of another document in the hitlist
hasDuplicate: Boolean value indicating whether this document has one or more duplicates in the hitlist
fedID: Federated instance ID, used to track which federated instance the document is fetched from
customAttributes: Array of custom nondefault attributes extracted from/for the document during crawling that should be fetched with the results

SCElement

Suggested content from a provider. It has the following elements:

name: name of the suggested content provider
content: suggested content from the provider. The content is a byte array of the XML or HTML content

DataGroup

The source group. It has the following elements:

groupID: Source group ID
groupName: Source group name
groupDisplayName: Display name for the source group

Attribute

The data type for search attribute. It has the following elements:

id: Search attribute ID
name: Internal name of search attribute
displayName: Display name of search attribute
type: The search attribute type. Value is either number, string, or date.

Filter

The data type for filter condition (predicate). It has the following elements:

attributeId: Search attribute ID
attributeType: Search attribute type. Value is either number, string, or date.
operator: Operator of the filter condition
- If attributeType is string, then it should be either equals or contains.
- If attributeType is number or date, then it should be either greaterthan, greaterthanequals, lessthan, lessthanequals, or equals.
attributeValue: Value of the filter condition (predicate)
- For string type attribute, the value is simply the string itself.
- For number type attribute, the value should be represented by a string consisting of an optional sign, (+) or (-), followed by a sequence of zero or more decimal digits ("the integer"), optionally followed by a fraction. The fraction consists of a decimal point followed by zero or more decimal digits. The string must contain at least one digit in either the integer or the fraction.
- For date type attribute, the value should be in the format mm/dd/yyyy, where mm is the month (01~12), dd is the day (01~31), and yyyy is the year (for example, 2005).

Examples:

If the filter condition is Title contains 'Oracle Secure Enterprise Search', then the client application needs to look up the attribute ID of search attribute 'Title' and include the following (element, value) pairs:
- attributeID = 1 (assuming the search attribute id of 'Title' is 1)
- operator = contains
- attributeValue = Oracle Secure Enterprise Search
If the filter condition is Price greater than 1000, then the client application needs to look up the attribute ID of search attribute 'Price' and include the following (element, value) pairs:
- attributeID = 2 (assuming the search attribute id of 'Price' is 2)
- operator = greaterthan
- attributeValue = 1000
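The operator and value rules for a Filter can be checked on the client before a request is sent. The following is a minimal sketch of such a check, not part of the Oracle SES libraries: the class FilterValidator and the method isValid are hypothetical names. It rejects calendar dates that do not exist (such as 02/30/2005), which is stricter than the stated day range of 01 to 31.

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;
import java.time.format.DateTimeParseException;
import java.time.format.ResolverStyle;
import java.util.Set;
import java.util.regex.Pattern;

/** Hypothetical client-side check of the Filter operator and value rules. */
class FilterValidator {
    private static final Set<String> STRING_OPS = Set.of("equals", "contains");
    private static final Set<String> ORDERED_OPS = Set.of(
            "greaterthan", "greaterthanequals", "lessthan", "lessthanequals", "equals");

    // Optional sign, then digits with an optional fraction, or a fraction alone.
    // At least one digit is required in the integer or the fraction.
    private static final Pattern NUMBER = Pattern.compile("[+-]?(\\d+(\\.\\d*)?|\\.\\d+)");

    // mm/dd/yyyy, resolved strictly so that impossible dates are rejected.
    private static final DateTimeFormatter DATE =
            DateTimeFormatter.ofPattern("MM/dd/uuuu").withResolverStyle(ResolverStyle.STRICT);

    static boolean isValid(String attributeType, String operator, String attributeValue) {
        if (attributeType == null || operator == null || attributeValue == null) {
            return false;
        }
        switch (attributeType) {
            case "string":
                return STRING_OPS.contains(operator);
            case "number":
                return ORDERED_OPS.contains(operator) && NUMBER.matcher(attributeValue).matches();
            case "date":
                return ORDERED_OPS.contains(operator) && isDate(attributeValue);
            default:
                return false;
        }
    }

    private static boolean isDate(String value) {
        try {
            LocalDate.parse(value, DATE);
            return true;
        } catch (DateTimeParseException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(isValid("string", "contains", "Oracle Secure Enterprise Search"));
        System.out.println(isValid("number", "greaterthan", "1000"));
        System.out.println(isValid("number", "contains", "1000"));
    }
}
```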

Node

This is the data type for the infosource node. It has the following elements:

id: Infosource node ID
fedId: Federated instance ID, used to track which federated instance the node belongs to
name: Name of the node
docCount: Number of documents under the node. If the value is -1, then documents exist under the node but the count cannot be shown.
hasChildren: Indicates if the node has any children
fullpath: Full path of the category node
fullpathIds: The IDs of each node in the full path

AttributeLOVElement

This is the element of AttributeLOV, the list of search attribute values. It has the following elements:

value: Attribute value (internal value)
displayValue: Display value

SessionContextElement

This data structure is used to store authentication information for the search user in the form of a name-value pair, which can be used during query-time authorization filtering of the results. It has the following elements:

name: Name of the authentication attribute
value: Value of the authentication attribute

Status

This is the status of the request. It has the following elements:

status: Status code. Value is either successful or failed
message: Status message. Value is null, or an error message if the status is failed
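A client typically inspects the Status returned by an operation before continuing. The sketch below shows one way to do that, treating any status code other than successful as an error. The Status class here is a hypothetical local mirror of the WSDL type, not the class generated by a proxy tool, and requireSuccess is an illustrative name.

```java
/** Hypothetical handling of the Status type returned by Oracle SES operations. */
class StatusCheck {

    /** Local mirror of the Status type: a status code plus an optional message. */
    static final class Status {
        final String status;
        final String message;

        Status(String status, String message) {
            this.status = status;
            this.message = message;
        }
    }

    /** Throws if the status code is anything other than "successful". */
    static void requireSuccess(Status result) {
        if (!"successful".equals(result.status)) {
            String detail = (result.message == null) ? "no message returned" : result.message;
            throw new IllegalStateException("Oracle SES request failed: " + detail);
        }
    }

    public static void main(String[] args) {
        requireSuccess(new Status("successful", null));
        try {
            // The message text is made up for this example.
            requireSuccess(new Status("failed", "Invalid credentials"));
        } catch (IllegalStateException e) {
            System.out.println(e.getMessage());
        }
    }
}
```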

Language

This is the language data type. It has the following elements:

languageName: Name of the language
languageDisplayName: Display name (translated name) of the language

Array Types

Oracle Secure Enterprise Search Web Services uses the following complex array types:

AttributeArray: Array of Attribute
AttributeLOVElementArray: Array of AttributeLOVElement
CustomAttributeArray: Array of CustomAttribute
SCElementArray: Array of SCElement
DataGroupArray: Array of DataGroup
FilterArray: Array of Filter
IntArray: Array of int
LanguageArray: Array of Language
NodeArray: Array of Node
ResultElementArray: Array of ResultElement
SessionContextElementArray: Array of SessionContextElement
StringArray: Array of String

See Also:

Appendix D, "WSDL Specifications"

Oracle Secure Enterprise Search Query Web Service Operations

This section contains the following topics:

Authentication Operations
Search Operations
Browse Operations
Metadata Operations
Search Hit Operations
User Feedback Operations

Authentication Operations

This section describes the following authentication operations:

loginRequest Message
loginResponse Message
logoutRequest Message
logoutResponse Message
setSessionContextRequest Message
setSessionContextResponse Message
proxyLoginRequest Message
proxyLoginResponse Message

loginRequest Message

This message requests Oracle SES to authenticate the search user. It consists of the following parameters:

username: User name for the search user
password: Password for the search user

<message name="loginRequest">
   <part name="username"      type="xsd:string"/>
   <part name="password"      type="xsd:string"/>
</message>

Note:

User name is not case-sensitive.
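For clients that do not use a generated proxy, the request can be assembled as a SOAP envelope and sent in the body of an HTTP request. The following sketch builds such an envelope for the two loginRequest parts. The operation element name (login) and its namespace (urn:example:OracleSearchService) are placeholders chosen for this example; take the real values from the published WSDL. Only the SOAP 1.1 envelope namespace is standard.

```java
/** Hypothetical builder for an RPC-style SOAP body carrying the loginRequest parts. */
class LoginEnvelope {
    // Standard SOAP 1.1 envelope namespace.
    static final String SOAP_NS = "http://schemas.xmlsoap.org/soap/envelope/";
    // Placeholder namespace: replace with the target namespace from the Oracle SES WSDL.
    static final String SERVICE_NS = "urn:example:OracleSearchService";

    /** Escapes the five XML special characters so user input cannot break the envelope. */
    static String escape(String text) {
        StringBuilder out = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            switch (c) {
                case '&': out.append("&amp;"); break;
                case '<': out.append("&lt;"); break;
                case '>': out.append("&gt;"); break;
                case '"': out.append("&quot;"); break;
                case '\'': out.append("&apos;"); break;
                default: out.append(c);
            }
        }
        return out.toString();
    }

    /** Builds the envelope; the element name "login" is an assumption of this sketch. */
    static String build(String username, String password) {
        return "<soapenv:Envelope xmlns:soapenv=\"" + SOAP_NS + "\">"
                + "<soapenv:Body>"
                + "<ns:login xmlns:ns=\"" + SERVICE_NS + "\">"
                + "<username>" + escape(username) + "</username>"
                + "<password>" + escape(password) + "</password>"
                + "</ns:login>"
                + "</soapenv:Body>"
                + "</soapenv:Envelope>";
    }

    public static void main(String[] args) {
        System.out.println(build("jdoe", "s3cr&t"));
    }
}
```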

loginResponse Message

This message contains the return status for the loginRequest message.

<message name="loginResponse">
   <part name="return"      type="typens:Status"/>   
</message>

logoutRequest Message

This message is used when the user logs out from the search application.

<message name="logoutRequest">
</message>

logoutResponse Message

This message contains the return status for the logoutRequest message.

<message name="logoutResponse">
   <part name="return"      type="typens:Status"/>   
</message>

setSessionContextRequest Message

This message is used to pass authentication information for the search user, which can be used during query-time filtering.

Note:

Login and logout Web Services calls cause Oracle SES to automatically set or reset the AUTH_USER value in the session context that is passed to the query-time filter. This session context attribute cannot be overwritten explicitly through the setSessionContext call.

It consists of the following parameter:

sessionContext: An array of SessionContextElement. This array stores the authentication information needed for the query-time authentication filtering in the form of name-value pairs.

<message name="setSessionContextRequest">
    <part name="sessionContext"     type="typens:SessionContextElementArray"/>
</message>
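Because AUTH_USER is managed by the login and logout calls, a client may want to reject that name before building the sessionContext array. The following sketch shows one way to do so. SessionContextBuilder, its nested Element class, and the sample attribute names are hypothetical. Comparing the name without regard to case is a cautious assumption of this sketch, not documented server behavior.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical builder for the name-value pairs passed in setSessionContextRequest. */
class SessionContextBuilder {
    // Set by login and logout; it cannot be overwritten through setSessionContext.
    static final String RESERVED = "AUTH_USER";

    /** Local mirror of SessionContextElement. */
    static final class Element {
        final String name;
        final String value;

        Element(String name, String value) {
            this.name = name;
            this.value = value;
        }
    }

    /** Converts a map to elements, refusing the reserved AUTH_USER name. */
    static List<Element> fromMap(Map<String, String> attributes) {
        List<Element> elements = new ArrayList<>();
        for (Map.Entry<String, String> entry : attributes.entrySet()) {
            if (RESERVED.equalsIgnoreCase(entry.getKey())) {
                throw new IllegalArgumentException(RESERVED + " is set by login and logout only");
            }
            elements.add(new Element(entry.getKey(), entry.getValue()));
        }
        return elements;
    }

    public static void main(String[] args) {
        Map<String, String> attributes = new LinkedHashMap<>();
        attributes.put("DEPARTMENT", "HR");     // example attribute names, made up here
        attributes.put("CLEARANCE", "internal");
        for (Element e : fromMap(attributes)) {
            System.out.println(e.name + "=" + e.value);
        }
    }
}
```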

setSessionContextResponse Message

This message contains the return status for the setSessionContextRequest message.

<message name="setSessionContextResponse">
  <part name="return"             type="typens:Status"/>
</message>

proxyLoginRequest Message

This message logs in the end user to Oracle SES using proxy authentication. It consists of the following parameters:

username: User name of the proxy user
password: Password of the proxy user
searchUser: User name of the end user

<message name="proxyLoginRequest">
  <part name="username"            type="xsd:string"/>
  <part name="password"            type="xsd:string"/>
  <part name="searchUser"          type="xsd:string"/>
</message>

The proxy user must be one of the federation trusted entities created on the Oracle SES instance.

See Also:

"Federation Trusted Entities"

proxyLoginResponse Message

This message contains the return status for the proxyLoginRequest message.

<message name="proxyLoginResponse">
  <part name="return"         type="typens:Status"/>
</message>

Search Operations

This section describes the following search operations:

doOracleSearch Message
doOracleSearchResponse Message
doOracleBrowseSearch Message
doOracleBrowseSearchResponse Message
doOracleSimpleSearch Message
doOracleSimpleSearchResponse Message
getSuggestedContent Message
getSuggestedContentResponse Message

doOracleSearch Message

This is the main message for the search application. It consists of the following parameters:

query: A search string. It must be a valid string and it cannot be null. The search string should follow Oracle SES query syntax. See "Oracle Secure Enterprise Search Query Web Service Query Syntax" for details.
startIndex: The index of the first result to be returned. For example, if there are 67 results, you might want to start at 20. The default is 1 if not set explicitly.
docsRequested: The maximum number of results to be returned. The default is 10 if not set explicitly.
dupRemoved: Enable or disable duplicate removal. If turned on, the search result will eliminate all duplicate and near duplicate documents from the result list. The dupMarked switch will have no effect when dupRemoved is turned on. The default is false if not set explicitly.
dupMarked: Enable or disable duplicate detection. If dupRemoved is turned off and dupMarked is turned on, then the search result will keep all duplicate and near duplicate documents in the result list and mark them as duplicates. If dupRemoved is turned on, then the dupMarked switch will have no effect. The default is false if not set explicitly.
groups: Limit the search result to the documents from specified source groups. The default is for all groups if not set explicitly.
queryLang: Set the language of the query. This is equivalent to locale. The default is English ("en") if not set explicitly. This is used for relevancy boosting.
docLang: Set the language of the documents to limit the search. If the value is not set explicitly, then search is performed against documents of all the languages.
returnCount: Set to true to return total hit count with the result. The default is false if not set explicitly.
filterConnector: The connector between all filters: "and" indicates the search result must satisfy all filters, "or" indicates the search result just needs to satisfy at least one filter. The default is "and" if not set explicitly.
filters: An array of filters. Each filter is a restriction on search results. Filters are connected by filterConnector. The default is null (no filter applies to the search result) if not set explicitly.
fetchAttributes: Array of integers representing the nondefault attribute IDs to be fetched in the resultElements. The default is null (or set one int value '0'), so no attributes other than default-attributes are fetched in the resultElements.

<message name="doOracleSearch">
   <part name="query"            type="xsd:string"/>
   <part name="startIndex"       type="xsd:int"/>
   <part name="docsRequested"    type="xsd:int"/>
   <part name="dupRemoved"       type="xsd:boolean"/>
   <part name="dupMarked"        type="xsd:boolean"/>
   <part name="groups"           type="typens:DataGroupArray"/>
   <part name="queryLang"        type="xsd:string"/>
   <part name="docLang"          type="xsd:string"/>
   <part name="returnCount"      type="xsd:boolean"/>
   <part name="filterConnector"  type="xsd:string"/>
   <part name="filters"          type="typens:FilterArray"/>
   <part name="fetchAttributes"  type="typens:IntArray"/>
</message>
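The defaults described for doOracleSearch can be captured in a small request holder, so that callers set only what they need. The following is a sketch under that assumption: SearchRequest is a hypothetical class, not the generated proxy type, and it models only the scalar parameters. The effectiveDupMarked method reflects the rule that dupMarked has no effect when dupRemoved is turned on.

```java
/** Hypothetical holder for the scalar doOracleSearch parameters and their defaults. */
class SearchRequest {
    final String query;
    int startIndex = 1;              // index of the first result to return
    int docsRequested = 10;          // maximum number of results
    boolean dupRemoved = false;      // duplicate removal off
    boolean dupMarked = false;       // duplicate marking off
    String queryLang = "en";         // English
    String docLang = null;           // null: search documents of all languages
    boolean returnCount = false;     // no total hit count
    String filterConnector = "and";  // results must satisfy all filters

    SearchRequest(String query) {
        if (query == null) {
            throw new IllegalArgumentException("query cannot be null");
        }
        this.query = query;
    }

    /** dupMarked only takes effect when dupRemoved is turned off. */
    boolean effectiveDupMarked() {
        return dupMarked && !dupRemoved;
    }

    public static void main(String[] args) {
        SearchRequest request = new SearchRequest("oracle benefits");
        request.dupRemoved = true;
        request.dupMarked = true;
        System.out.println(request.startIndex + " " + request.docsRequested
                + " " + request.effectiveDupMarked());
    }
}
```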

doOracleSearchResponse Message

This message returns the search result in OracleSearchResult data type.

<message name="doOracleSearchResponse">
    <part name="return"  type="typens:OracleSearchResult"/>
</message>

doOracleBrowseSearch Message

This message restricts a search to a particular node. It consists of the following parameters:

query: A search string. It must be a valid string, and it cannot be null. The search string should follow Oracle SES query syntax. See "Oracle Secure Enterprise Search Query Web Service Query Syntax" for more details.
nodeID: The ID of the node to restrict the search to.
fedID: The ID of the federated instance the parent node belongs to ("-1" for local node).
startIndex: The index of the first result to be returned. For example, if there are 67 results, then you might want to start at 20. The default is 1 if not set explicitly.
docsRequested: The maximum number of results to be returned. The default is 10 if not set explicitly.
dupRemoved: Enable or disable duplicate removal. If turned on, then the search result will eliminate all duplicate and near duplicate documents from the result list, and the dupMarked switch will have no effect when dupRemoved is turned on. The default is false if not set explicitly.
dupMarked: Enable or disable duplicate detection. If dupRemoved is turned off and dupMarked is turned on, then the search result will keep all duplicate and near duplicate documents in the result list and mark them as duplicates. If dupRemoved is turned on, then the dupMarked switch will have no effect. The default is false if not set explicitly.
queryLang: Set the language of the query. This is equivalent to locale. The default is English ("en") if not set explicitly. This is used for relevancy boosting.
docLang: Set the language of the documents to limit the search. If the value is not set explicitly, then search is performed against documents of all the languages.
returnCount: Set to true to return total hit count with the result. The default is false if not set explicitly.
fetchAttributes: Array of integers representing the nondefault attribute IDs to be fetched in the resultElements. The default is null (or set one int value '0'), so no attributes other than default-attributes are fetched in the resultElements.

<message name="doOracleBrowseSearch">
    <part name="query"              type="xsd:string"/>
    <part name="nodeID"             type="xsd:string"/>
    <part name="fedID"              type="xsd:string"/>
    <part name="startIndex"         type="xsd:int"/>
    <part name="docsRequested"      type="xsd:int"/>
    <part name="dupRemoved"         type="xsd:boolean"/>
    <part name="dupMarked"          type="xsd:boolean"/>
    <part name="queryLang"          type="xsd:string"/>
    <part name="docLang"            type="xsd:string"/>
    <part name="returnCount"        type="xsd:boolean"/>
    <part name="fetchAttributes"    type="typens:IntArray"/>
</message>

doOracleBrowseSearchResponse Message

This message returns the search result in OracleSearchResult data type.

<message name="doOracleBrowseSearchResponse">
    <part name="return"  type="typens:OracleSearchResult"/>
</message>

doOracleSimpleSearch Message

This is a simplified form of the doOracleSearch message. In this message you do not need to specify the advanced search parameters that are specified in the doOracleSearch message. It consists of the following parameters:

query: A search string. It must be a valid string and it cannot be null. The search string should follow Oracle SES query syntax. See "Oracle Secure Enterprise Search Query Web Service Query Syntax" for details.
startIndex: The index of the first result to be returned. For example, if there are 67 results, you might want to start at 20. The default is 1, if not set explicitly.
docsRequested: The maximum number of results to be returned. The default is 10, if not set explicitly.
dupRemoved: Enable or disable duplicate removal. If turned on, then the search result will eliminate all duplicate and near duplicate documents from the result list. The dupMarked switch will have no effect when dupRemoved is turned on. The default is false if not set explicitly.
dupMarked: Enable or disable duplicate detection. If dupRemoved is turned off and dupMarked is turned on, then the search result will keep all duplicate and near duplicate documents in the result list and mark them as duplicates. If dupRemoved is turned on, then the dupMarked switch will have no effect. The default is false if not set explicitly.
returnCount: Set to true to return total hit count with the result. The default is false if not set explicitly.

<message name="doOracleSimpleSearch">
  <part name="query"              type="xsd:string"/>
  <part name="startIndex"         type="xsd:int"/>
  <part name="docsRequested"      type="xsd:int"/>
  <part name="dupRemoved"         type="xsd:boolean"/>
  <part name="dupMarked"          type="xsd:boolean"/>
  <part name="returnCount"        type="xsd:boolean"/>
</message>

doOracleSimpleSearchResponse Message

This message returns the search result in OracleSearchResult data type.

<message name="doOracleSimpleSearchResponse">
  <part name="return"         type="typens:OracleSearchResult"/>
</message>

getSuggestedContent Message

This message returns the suggested content for the given query. It consists of the following parameters:

query: Query string
returnType: Format in which the content is to be returned, either "html" or "xml". If no style sheet is configured for a given provider, then the return type is the return type of the content returned by the provider, regardless of whether "html" or "xml" is specified.

<message name="getSuggestedContent">
  <part name="query"                  type="xsd:string"/>
  <part name="returnType"             type="xsd:string"/>
</message>

getSuggestedContentResponse Message

This message returns the suggested content for the query.

<message name="getSuggestedContentResponse">
  <part name="return"                 type="typens:SCElementArray"/>
</message>

Browse Operations

This section describes the following browse operations:

getInfoSourceNodesRequest Message
getInfoSourceNodesResponse Message
getInfoSourceAncestorNodesRequest Message
getInfoSourceAncestorNodesResponse Message
getInfoSourceNodeRequest Message
getInfoSourceNodeResponse Message

getInfoSourceNodesRequest Message

This message gets the list of info source nodes given the parent node ID. It consists of the following parameters:

parentNodeID: The node ID for which all children nodes will be returned. If it is not set, then the message will return all the root nodes.
fedID: The ID of the federated instance the parent node belongs to ("-1" for local node).
locale: A two letter representation of locale. The default is English ("en") if not set explicitly.

<message name="getInfoSourceNodesRequest">     
    <part name="parentNodeID"      type="xsd:string"/>
    <part name="fedID"             type="xsd:string"/>
    <part name="locale"            type="xsd:string"/>
</message>

getInfoSourceNodesResponse Message

This message returns an array of info source nodes.

<message name="getInfoSourceNodesResponse">
    <part name="nodes"    type="typens:NodeArray"/>
</message>
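Repeated getInfoSourceNodesRequest calls can be used to walk the info source tree: start with no parent node ID to get the root nodes, then request the children of each node that reports hasChildren. The sketch below shows the traversal against an in-memory stand-in. NodeSource, InfoSourceBrowser, and the sample data are hypothetical; a real client would implement NodeSource by calling the Web Service. A docCount of -1 is shown as "count unavailable".

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical depth-first walk over info source nodes. */
class InfoSourceBrowser {

    /** Local mirror of the Node elements used by this sketch. */
    static final class Node {
        final String id;
        final String name;
        final int docCount;
        final boolean hasChildren;

        Node(String id, String name, int docCount, boolean hasChildren) {
            this.id = id;
            this.name = name;
            this.docCount = docCount;
            this.hasChildren = hasChildren;
        }
    }

    /** Stand-in for a call that sends getInfoSourceNodesRequest; null means root nodes. */
    interface NodeSource {
        List<Node> children(String parentNodeId);
    }

    static List<String> walk(NodeSource source) {
        List<String> lines = new ArrayList<>();
        walk(source, null, "", lines);
        return lines;
    }

    private static void walk(NodeSource source, String parentId, String prefix, List<String> lines) {
        for (Node node : source.children(parentId)) {
            String count = node.docCount < 0 ? "count unavailable" : String.valueOf(node.docCount);
            String path = prefix + "/" + node.name;
            lines.add(path + " (" + count + ")");
            if (node.hasChildren) {
                walk(source, node.id, path, lines); // fetch children only when they exist
            }
        }
    }

    /** In-memory sample tree, made up for this example. */
    static NodeSource sample() {
        Map<String, List<Node>> tree = new HashMap<>();
        tree.put("", List.of(new Node("1", "Intranet", 12, true), new Node("2", "Mail", 3, false)));
        tree.put("1", List.of(new Node("3", "HR", -1, false)));
        return parentId -> tree.getOrDefault(parentId == null ? "" : parentId, List.of());
    }

    public static void main(String[] args) {
        walk(sample()).forEach(System.out::println);
    }
}
```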

getInfoSourceAncestorNodesRequest Message

This message gets the full path of a node, from root to node, given an info source node. It consists of the following parameters:

nodeID: The node ID for which all the nodes in the path from root to node will be returned. nodeID must be set and cannot be null.
locale: A two letter representation of locale. The default is English ("en") if not set explicitly.

<message name="getInfoSourceAncestorNodesRequest">     
    <part name="nodeID"        type="xsd:string"/>     
    <part name="locale"        type="xsd:string"/>   
</message>

getInfoSourceAncestorNodesResponse Message

This message returns an array of info source ancestor nodes.

<message name="getInfoSourceAncestorNodesResponse">     
    <part name="nodes"    type="typens:NodeArray"/>   
</message>

getInfoSourceNodeRequest Message

This message retrieves a particular node. It consists of the following parameters:

nodeID: The node ID of the node to get. nodeID must be set and cannot be null.
fedID: The ID of the federated instance the parent node belongs to ("-1" for local node).
locale: A two letter representation of locale. The default is English ("en") if not set explicitly.

Message format:

<message name="getInfoSourceNodeRequest">
   <part name="nodeID"        type="xsd:string"/>
   <part name="fedID"         type="xsd:string"/>
   <part name="locale"        type="xsd:string"/>
</message>

getInfoSourceNodeResponse Message

This message returns the node requested.

<message name="getInfoSourceNodeResponse">    
    <part name="node"    type="typens:Node"/>
</message>

Metadata Operations

This section describes the following metadata operations:

getLanguageRequest Message
getLanguageResponse Message
getDataGroupsRequest Message
getDataGroupsResponse Message
getAttributesRequest Message
getAttributesResponse Message
getAllAttributesRequest Message
getAllAttributesResponse Message
getAttributeLOVRequest Message
getAttributeLOVResponse Message

getLanguageRequest Message

This message gets all the languages supported by Oracle SES. It is used by the client application to display the list of languages. It consists of the following parameter:

locale: A two letter representation of locale. The default is English ("en") if not set explicitly.

<message name="getLanguagesRequest">     
    <part name="locale"        type="xsd:string"/>   
</message>

getLanguageResponse Message

This message returns all supported languages.

<message name="getLanguagesResponse">     
    <part name="return"        type="typens:LanguageArray"/>   
</message>

getDataGroupsRequest Message

This message requests all source groups defined in Oracle SES. It is used by the client application to show all source groups in the search page, so that end users can restrict their search results to one or more source groups. It consists of the following parameter:

locale: A two letter representation of locale. The default is English ("en") if not set explicitly.

<message name="getDataGroupsRequest">
    <part name="locale"        type="xsd:string"/>             
</message>

getDataGroupsResponse Message

This message returns all source groups defined in Oracle SES.

<message name="getDataGroupsResponse">     
    <part name="groups"        type="typens:DataGroupArray"/>              
</message>

getAttributesRequest Message

This message gets a list of search attributes that apply to the given source groups. It consists of the following parameters:

locale: A two letter representation of locale. The default is English ("en") if not set explicitly.
groups: Limit the request to the attributes from specified source groups. The default is all groups if not set explicitly.
groupConnector: The connector between all groups. "and" returns the intersection of each group's attributes (only the attributes available in every specified source group). "or" returns the union of each group's attributes (the attributes available in any specified source group). The default is "or" if not set explicitly.

<message name="getAttributesRequest">   
    <part name="locale"          type="xsd:string"/>          
    <part name="groups"          type="typens:DataGroupArray"/>
    <part name="groupConnector"  type="xsd:string"/>   
</message>
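
The effect of groupConnector can be illustrated with plain set operations. The following is a minimal local sketch, not part of the Oracle SES proxy library: the class name GroupConnectorSketch, the combine method, and the attribute names are hypothetical, and it assumes the connector value is compared exactly as "and" or "or".

```java
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Illustrative only: models the "and"/"or" groupConnector semantics locally.
// It does not call Oracle SES.
final class GroupConnectorSketch {

    // Combines the attribute sets of several source groups.
    // "and" keeps attributes present in every group (intersection);
    // anything else, including null, is treated as "or" (union), the documented default.
    static Set<String> combine(List<Set<String>> groupAttributes, String groupConnector) {
        boolean intersect = "and".equals(groupConnector);
        Set<String> result = new LinkedHashSet<>();
        boolean first = true;
        for (Set<String> attrs : groupAttributes) {
            if (first || !intersect) {
                result.addAll(attrs);      // union, or seed the intersection
            } else {
                result.retainAll(attrs);   // intersection with the next group
            }
            first = false;
        }
        return result;
    }

    public static void main(String[] args) {
        // Hypothetical attribute names for two source groups.
        List<Set<String>> groups = List.of(
            Set.of("title", "author"),
            Set.of("author", "lastmodified"));
        System.out.println("and: " + combine(groups, "and"));
        System.out.println("or:  " + combine(groups, "or"));
    }
}
```

With these two hypothetical groups, "and" leaves only the shared attribute, while "or" yields all three.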

getAttributesResponse Message

This message returns an array of search attributes.

<message name="getAttributesResponse">     
    <part name="return"       type="typens:AttributeArray"/>   
</message>

getAllAttributesRequest Message

This message gets all search attributes defined in Oracle SES. It consists of the following parameter:

locale: A two letter representation of locale. The default is English ("en") if not set explicitly.

<message name="getAllAttributesRequest">
  <part name="locale" type="xsd:string"/>
</message>

getAllAttributesResponse Message

This message returns all search attributes defined in Oracle SES.

<message name="getAllAttributesResponse">     
    <part name="return"       type="typens:AttributeArray"/>   
</message>

getAttributeLOVRequest Message

This message gets the LOV items given a search attribute. It consists of the following parameters:

attribute: A search attribute for the LOV (list of values) requested.
locale: A two letter representation of locale. The default is English ("en") if not set explicitly.

<message name="getAttributeLOVRequest">
  <part name="attribute"     type="typens:Attribute"/>        
  <part name="locale"        type="xsd:string"/>   
</message>

getAttributeLOVResponse Message

This message returns an array of search attribute LOV elements.

<message name="getAttributeLOVResponse">     
  <part name="return"    type="typens:AttributeLOVElementArray"/>   
</message>

Search Hit Operations

This section describes the following search hit operations:

getCachedPageRequest Message
getCachedPageResponse Message
getInLinksRequest Message
getInLinksResponse Message
getOutLinksRequest Message
getOutLinksResponse Message
logUserClickRequest Message
logUserClickResponse Message

getCachedPageRequest Message

This message gets the cached version of a document given the document ID and the search string. The search string will be highlighted in the output. It consists of the following parameters:

query: The search string
docID: The document ID to be fetched
fedID: The federated instance ID, used to track which federated instance the document is fetched from

<message name="getCachedPageRequest">
    <part name="query"          type="xsd:string"/>
    <part name="docID"          type="xsd:int"/>
    <part name="fedID"          type="xsd:string"/>
</message>

getCachedPageResponse Message

This message returns the byte array of the cached HTML page.

<message name="getCachedPageResponse">
    <part name="return"         type="xsd:base64Binary"/>
</message>

getInLinksRequest Message

This message gets all the incoming links for a given search hit (document). It consists of the following parameters:

docID: The document ID for which the incoming links are to be fetched. It must be a valid document ID and it cannot be null.
maxNum: The maximum number of incoming links requested. The default is 25 if not set explicitly.
fedID: The federated instance ID, used to track which federated instance the document is fetched from

<message name="getInLinksRequest">     
    <part name="docID"                 type="xsd:int"/>   
    <part name="maxNum"                type="xsd:int"/>   
    <part name="fedID"                 type="xsd:string"/>
</message>

getInLinksResponse Message

This message returns an array of incoming link URL strings.

<message name="getInLinksResponse">     
    <part name="return"      type="typens:StringArray"/>     
</message>

getOutLinksRequest Message

This message gets all the outgoing links for a given search hit (document). It consists of the following parameters:

docID: The document ID for which the outgoing links are to be fetched. It must be a valid document ID and it cannot be null.
maxNum: The maximum number of outgoing links requested. The default is 25 if not set explicitly.
fedID: The federated instance ID, used to track which federated instance the document is fetched from

<message name="getOutLinksRequest">     
    <part name="docID"                type="xsd:int"/>   
    <part name="maxNum"               type="xsd:int"/>   
    <part name="fedID"                type="xsd:string"/>
</message>

getOutLinksResponse Message

This message returns an array of outgoing link URL strings.

<message name="getOutLinksResponse">     
    <part name="return"      type="typens:StringArray"/>     
</message>

logUserClickRequest Message

This message logs the user's click. It consists of the following parameters:

queryID: ID of the submitted search
urlID: ID of the document that the user clicked on
infoSourceID: The info source ID. If none, then -1 is used as the default value
position: The position of the document in the result list (for example, first hit on the page or 9th hit on the page)
fedID: Federation ID. Specifies the federated instance on which the document resides.

<message name="logUserClickRequest">
    <part name="queryID"        type="xsd:int"/>
    <part name="urlID"          type="xsd:int"/>
    <part name="infoSourceID"   type="xsd:int"/>
    <part name="position"       type="xsd:int"/>
    <part name="fedID"          type="xsd:string"/>
</message>

logUserClickResponse Message

This message returns the URL of the clicked-on document.

<message name="logUserClickResponse">
  <part name="url"              type="xsd:string"/>
</message>

User Feedback Operations

This section describes the following user feedback operations:

submitUrlRequest Message
submitUrlResponse Message

submitUrlRequest Message

This message submits a URL to Oracle SES so that Oracle SES crawls and indexes it. It consists of the following parameter:

url: The URL to be submitted to the crawler so it can be crawled next time. It must be a valid URL and it cannot be null.

<message name="submitUrlRequest">     
     <part name="url"        type="xsd:string"/>   
</message>

submitUrlResponse Message

This message returns the status, which consists of two strings. The first string is the submission status, which is either "successful" or "failed". The second string is the error message, which is set when the submission status is "failed".

<message name="submitUrlResponse">     
     <part name="return"      type="typens:Status"/>
</message>

Oracle Secure Enterprise Search Query Web Service Query Syntax

This section describes the query syntax used by the Oracle Secure Enterprise Search Query Web Service.

Search Term

A search term can be a single word, a phrase, or a special search term. For example, if the search string is oracle secure enterprise search, then there are four search terms in the search string: oracle, secure, enterprise, and search. If the search string is oracle "secure enterprise search", then there are two search terms in the search string: oracle and "secure enterprise search".

Search terms are case insensitive. For example, searching for oracle, Oracle, or ORACLE returns the same search result.

Phrase

A phrase is a string enclosed in double-quotes ("). It can contain one or multiple words.
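
The way a search string breaks down into terms can be sketched as follows. This is an illustrative client-side tokenizer, not the parser Oracle SES uses internally. The class name SearchTermSplitter is hypothetical, and it assumes that terms are separated by spaces and that a double-quoted phrase counts as one term.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative only: splits a search string into words and quoted phrases.
final class SearchTermSplitter {

    static List<String> split(String searchString) {
        List<String> terms = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        boolean inPhrase = false;
        for (char c : searchString.toCharArray()) {
            if (c == '"') {
                inPhrase = !inPhrase;          // entering or leaving a phrase
                current.append(c);
            } else if (c == ' ' && !inPhrase) {
                if (current.length() > 0) {    // a space outside a phrase ends the term
                    terms.add(current.toString());
                    current.setLength(0);
                }
            } else {
                current.append(c);
            }
        }
        if (current.length() > 0) {
            terms.add(current.toString());
        }
        return terms;
    }

    public static void main(String[] args) {
        System.out.println(split("oracle secure enterprise search").size());
        System.out.println(split("oracle \"secure enterprise search\"").size());
    }
}
```

For the two search strings discussed above, the sketch yields four terms and two terms respectively.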

Operators

The following operators are defined in the query syntax:

Plus [+]: The plus operator specifies that the search term immediately following it must be found in all matching documents. For example, searching for [Oracle +Applications] only finds documents that contain the word "Applications". In a multiple word search, you can attach a [+] in front of every token, including the very first token. A token can be a single word or a phrase enclosed in double-quotes ("), but there should be no space between the [+] and the token.
Minus [-]: The minus operator specifies that the search term immediately following it cannot appear in any document included in the search result. For example, searching for [Oracle -Applications] only finds documents that do not contain the word "Applications". In a multiple word search, you can attach a [-] in front of every token except the very first token. A token can be a single word or a phrase, but there should be no space between the [-] and the token.
Asterisk [*]: The asterisk specifies a wildcard search. For example, searching for the string [Ora*] finds documents that contain words beginning with "Ora", such as "Oracle" and "Orator". You can also insert an asterisk in the middle of a word. For example, searching for the string [A*e] finds documents that contain words such as "Apple" or "Ape".

Default Search - Implicit AND Search

By default, Oracle SES searches for all of your search terms, as well as relevant variations of the terms you have entered. There is no need to include any operators (like 'AND') between terms. The order of the terms in the search affects the search results.

Word Separator

Use one or more space characters ' ' to separate each of the search terms.

Filter Conditions (Advanced Conditions)

Oracle SES query syntax only supports 'Site' and 'File type' filter conditions. It does not support any other filter conditions (advanced conditions), such as title, author, or last modified date. To restrict your search with other filter conditions, you can specify them in the Web Services API message doOracleSearch.

Special Search Terms

Oracle SES supports several special search terms that give the user or search administrator access to additional Oracle SES capabilities. Following is the list of special search terms:

'Exclude' Search Term

You can exclude a word from your search by putting a minus sign [-] immediately in front of the term you want to exclude from the search results. Exclusion does not work with stop words.

Example: oracle -search

Negative search is not allowed unless there is another positive search term. For example:

-search is an invalid search.

oracle -search is a valid search.
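
A client application could check this rule before sending the search string. The following is a minimal sketch under stated assumptions, not the validation Oracle SES performs: the class name NegativeSearchCheck and its method are hypothetical, and a token is taken to be either a word or a double-quoted phrase, optionally prefixed by [+] or [-].

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative only: a search string needs at least one term
// that is not prefixed by the minus operator.
final class NegativeSearchCheck {

    // A token is an optionally prefixed quoted phrase, or a run of non-space characters.
    private static final Pattern TOKEN = Pattern.compile("[+-]?\"[^\"]*\"|\\S+");

    static boolean hasPositiveTerm(String searchString) {
        Matcher m = TOKEN.matcher(searchString);
        while (m.find()) {
            if (!m.group().startsWith("-")) {
                return true;   // found a term that is not excluded
            }
        }
        return false;          // only excluded terms, or no terms at all
    }

    public static void main(String[] args) {
        System.out.println(hasPositiveTerm("-search"));
        System.out.println(hasPositiveTerm("oracle -search"));
    }
}
```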

Wildcard Search

Use an asterisk to search for words that match a pattern; for example, words starting with "ora". The asterisk can only be specified at the end (right side) or in the middle of a search term, so you cannot search for something like *earch.

Example: Ora*

Phrase Search

Search for complete phrases by enclosing them in quotation marks. Words marked in this way will appear together in all results exactly as entered.

Example: "oracle secure enterprise search"

Site Restricted Search

If you know the specific Web site you want to search, but are not sure where the information is located within that site, then search only within the specific Web site. Enter the search followed by the string "site:" followed by the host name.

Example: oracle site:text.us.oracle.com

Notes:

Domain restriction is not supported, because Oracle SES does not support left-truncated wildcard search (such as *.oracle.com)
The exclusion operator (-) can be applied to this search term to remove a Web site from consideration in the search.
A site restricted search term is implicitly ANDed with the other search terms.
Only one site restriction is allowed. Also, you cannot have both site inclusion and exclusion in the search string. For example, the following search string is invalid:
```
oracle search  site:www.oracle.com  -site:otn.oracle.com
```

File Type Restricted Search

The search prefix "filetype:" filters the results returned to include only documents with the extension specified immediately after. There can be no space between "filetype:" and the specified extension.

Example: oracle filetype:doc

Notes:

The exclusion operator (-) can be applied to this search term to remove a file type from consideration in the search.
Only one file type can be included. The following extensions are supported: doc, htm, html, xml, ps, pdf, txt, rtf, ppt, and xls.
A file type restricted search term is implicitly ANDed with the other search terms.
Only one file type restriction is allowed. Also, you cannot have both file type inclusion and exclusion in the search string. For example, the following search string is invalid:
```
oracle search  filetype:doc -filetype:pdf
```
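
The "only one restriction" rules for site and file type can be checked on the client side before a search is submitted. The following is a rough sketch, not Oracle SES code: the class name RestrictionCheck is hypothetical, tokens are split on whitespace only (so a restriction that appears inside a quoted phrase would be miscounted), and the prefixes are assumed to be written in lowercase.

```java
// Illustrative only: counts "site:" and "filetype:" restrictions,
// whether included or excluded with a leading minus.
final class RestrictionCheck {

    static int count(String searchString, String prefix) {
        int n = 0;
        for (String token : searchString.trim().split("\\s+")) {
            // Strip the exclusion operator so that -site: counts like site:
            String t = token.startsWith("-") ? token.substring(1) : token;
            if (t.startsWith(prefix)) {
                n++;
            }
        }
        return n;
    }

    // Valid when there is at most one site restriction and at most one file type restriction.
    static boolean isValid(String searchString) {
        return count(searchString, "site:") <= 1
            && count(searchString, "filetype:") <= 1;
    }

    public static void main(String[] args) {
        System.out.println(isValid("oracle site:text.us.oracle.com"));
        System.out.println(isValid("oracle search  site:www.oracle.com  -site:otn.oracle.com"));
        System.out.println(isValid("oracle search  filetype:doc -filetype:pdf"));
    }
}
```

Because an inclusion and an exclusion together always make two restrictions of the same kind, the single count check also covers the rule against mixing them.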

Oracle Secure Enterprise Search Query Web Service Example

Following is a simple JSP application using Oracle Secure Enterprise Search proxy Java library to provide the basic search functionality:

<%@page contentType="text/html; charset=utf-8" %>
<%@page import = "java.util.Vector" %>
<%@page import = "java.net.URL" %>
<%@page import = "java.util.Properties" %>
<%@page import = "java.util.HashMap" %>
<%@page import = "org.apache.soap.Header" %>
<%@page import = "org.apache.soap.rpc.Call" %>
<%@page import = "org.apache.soap.rpc.Parameter" %>
<%@page import = "org.apache.soap.rpc.Response" %>
<%@page import = "org.apache.soap.Fault" %>
<%@page import = "org.apache.soap.SOAPException" %>
<%@page import = "org.apache.soap.Constants" %>
<%@page import = "org.apache.soap.encoding.SOAPMappingRegistry" %>
<%@page import = "org.apache.soap.encoding.soapenc.BeanSerializer" %>
<%@page import = "org.apache.soap.util.xml.QName" %>
<%@page import = "oracle.soap.transport.http.OracleSOAPHTTPConnection" %>
<%@page import = "oracle.soap.encoding.soapenc.EncUtils" %>
<%@page import = "oracle.search.query.webservice.client.*" %>
 
<%
  //
  // Get the search term entered by the user
  //
  String searchTerm = request.getParameter("searchTerm");
  if (searchTerm == null)  searchTerm = "";
 
  //
  // Define the result element array.
  //
  ResultElement[] resElemArray = null; // ResultElement is one of the proxy Java classes
  int estimatedHitCount = 0;
 
  if (searchTerm != null && !"".equals(searchTerm))
  {
    //
    // Create the Oracle SES Web Services client stub
    //
    OracleSearchService stub = new OracleSearchService();
 
    //
    // Set the Oracle SES Web Services URL.
    // The URL is http://<host>:<port>/search/query/OracleSearch  
    //
    stub.setSoapURL("http://staca19:7777/search/query/OracleSearch");
 
    //
    // Get the search result by calling OracleSearchService.doOracleSearch()
    //
    OracleSearchResult result = stub.doOracleSearch(searchTerm,
                            new Integer(1),
                            new Integer(10),
                            Boolean.TRUE,
                            Boolean.TRUE,
                            null,
                            "en",
                            "en",
                            Boolean.TRUE,
                            null,
                            null,
                            null);
    //
    // Get the estimated hit count by calling
    // OracleSearchResult.getEstimatedHitCount()
    //
    estimatedHitCount = result.getEstimatedHitCount().intValue();
 
    // Get the search results
    resElemArray = result.getResultElements();
  }
%>
 
<HTML>
<HEAD>
    <TITLE>Oracle SES Web Services Demo </TITLE>
</HEAD>
<BODY>
<FORM name="searchBox" method="post" action="./DemoWS.jsp">
 <INPUT id="inputMain" type="text" size="40" name="searchTerm" value="<%=searchTerm%>">
 <INPUT type="hidden"  name="searchTerm" value="<%= searchTerm %>">
 <INPUT type="submit" name="action" value="Search">
</FORM>
<BR><BR><BR>
 
<%
  //
  // Render the search results
  //
  if (resElemArray == null || resElemArray.length == 0)
  {
%>
  <H3> There are no matches for the search term </H3>
<%
  }
  else
  {
%>
  <H3> There are about <%=estimatedHitCount%> matches </H3>
<%
    for (int i=0; i<resElemArray.length; i++)
    {
      String title = resElemArray[i].getTitle();
      if (title == null) title = "Untitled Document";
%>
  <P>  
    <B><A HREF="<%=resElemArray[i].getUrl()%>"><%=title%></A> </B>
    <BR>
     <%=resElemArray[i].getSnippet()%>
    <BR>
  </P>
<%
    }
  }
%>
</BODY>
</HTML>

Oracle Secure Enterprise Search Query Web Service Installation

Oracle SES Web Services run on top of the Oracle SES middle tier standalone OC4J server. They are installed and configured as part of the default install option, so you can use them out of the box. No specific steps are required to administer Oracle SES Web Services; follow the same middle tier administration steps to start and stop them.

Your search application needs to access the following Oracle SES Web Services URL:

http://<host>:<port>/search/query/OracleSearch

For example, if your Oracle SES middle tier is running on host 'myhost' and the port number is 8888, then the Web Services URL is the following:

http://myhost:8888/search/query/OracleSearch
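
A client application might build this URL from configuration values rather than hard-coding it. The helper below is a small sketch: the class name QueryEndpoint is hypothetical, and only the URL pattern itself comes from this section.

```java
// Illustrative only: builds the Query Web Service URL from host and port.
final class QueryEndpoint {

    static String url(String host, int port) {
        return "http://" + host + ":" + port + "/search/query/OracleSearch";
    }

    public static void main(String[] args) {
        System.out.println(url("myhost", 8888));
    }
}
```

The resulting string is the kind of value passed to the stub's setSoapURL method in the JSP example.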

There is a default Oracle SES Web Services administrator console provided by OC4J. The administrator console URL is the same as the Oracle SES Web Services URL. You can obtain the following information from the administrator console:

Oracle SES WSDL description
List of Web Services messages and operations
Client-side Java proxies and source code

Client-Side Query Java Proxy Library

Oracle SES also provides client-side Java proxies for marshalling and parsing Web Services SOAP messages. Client applications can use the library to access Oracle SES Web Services.

The proxy library includes the following Java classes, which are mapped to the corresponding Web Services data types and messages:

oracle.search.query.webservice.client.Attribute
oracle.search.query.webservice.client.AttributeLOVElement
oracle.search.query.webservice.client.CustomAttribute
oracle.search.query.webservice.client.DataGroup
oracle.search.query.webservice.client.Filter
oracle.search.query.webservice.client.Language
oracle.search.query.webservice.client.Node
oracle.search.query.webservice.client.OracleSearchResult
oracle.search.query.webservice.client.OracleSearchService
oracle.search.query.webservice.client.ResultElement
oracle.search.query.webservice.client.SessionContextElement
oracle.search.query.webservice.client.Status
oracle.search.query.webservice.client.SuggestedLink
oracle.search.query.webservice.client.SCElement

To compile and run your client application using the Oracle SES client-side Java proxy library, include the following files in the Java CLASSPATH. You can obtain these files from the Oracle SES server installation directory.

$ORACLE_HOME/search/lib/search_client.jar (The proxy Java library)
$ORACLE_HOME/oc4j/webservices/lib/soap.jar
$ORACLE_HOME/oc4j/j2ee/home/lib/http_client.jar
$ORACLE_HOME/lib/xmlparserv2.jar
$ORACLE_HOME/lib/mail.jar
$ORACLE_HOME/lib/activation.jar

Internally Used Query Web Service Messages

The following Web Services messages and operations are intended for Oracle SES internal use only. They are subject to change or removal in future releases.

setSearchUserRequest, setSearchUserResponse, setSearchUser

Oracle Secure Enterprise Search Admin Web Service Endpoint Location

The Admin Web service is located at the following address for an Oracle SES installation: http://<host>:<port>/search/ws/admin/SearchAdmin.

There is a default Oracle SES Web Services administrator console provided by OC4J. The administrator console URL is the same as the Oracle SES Admin Web Service URL. You can obtain the following information from the administrator console:

Oracle SES Admin WSDL description
List of Web Service messages and operations
Client-side JavaScript stub

Client-Side Admin Java Proxy Library

Oracle SES provides client-side Java proxies for marshalling and parsing Web Services SOAP messages. Client applications can use the library to access the Oracle SES Admin Web Service.

The proxy library includes the following Java classes, which are mapped to the corresponding Web Services data types and messages:

oracle.search.admin.ws.client.Schedule
oracle.search.admin.ws.client.ScheduleStatus
oracle.search.admin.ws.client.SearchAdminClient

To compile and run your client application using the Oracle SES client-side Java proxy stub, include the following files in the Java CLASSPATH:

$ORACLE_HOME/search/lib/search_admin_wsclient.jar
wsclient_extended.jar

The wsclient_extended.jar file is available as a separate download from the Oracle Technology Network: http://download.oracle.com/otn/java/oc4j/10131/wsclient_extended.zip

See Also:

Oracle Secure Enterprise Search Java API Reference
"Setting the Classpath for a Web Service Proxy" in the Oracle Application Server Web Services Developer's Guide, 10g Release 3 (10.1.3.1.0)

Oracle Secure Enterprise Search Admin Web Service SOAP Fault Error Codes

If an error occurs as a result of an Admin Web Service request, a SOAP fault is returned. When using the provided Java proxy client, a javax.xml.rpc.soap.SOAPFaultException is thrown. To access the machine-parseable error code, call the SOAPFaultException.getFaultCode() method.

The following table lists the Admin Web Service error codes:

Table 7-3 Admin Web Service Error Codes

Error Code	Description	SOAP Fault Code Prefix
`Authentication`	The provided security credentials are not valid	Client
`InternalError`	An internal error occurred. Please try again	Server
`InvalidSchedule`	The specified schedule is invalid for the operation performed.	Client
`InvalidScheduleName`	The specified schedule name does not exist.	Client
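
Client code may find it convenient to model the error codes in Table 7-3 as an enumeration. The following is a hedged sketch: the enum AdminErrorCode and its methods are hypothetical and are not part of the Oracle SES proxy library. How the error code is embedded in the value returned by getFaultCode() is not specified here, so the caller is assumed to have already extracted the error code string.

```java
import java.util.Optional;

// Illustrative only: the Admin Web Service error codes and their
// SOAP fault code prefixes, as listed in Table 7-3.
enum AdminErrorCode {
    AUTHENTICATION("Authentication", "Client"),
    INTERNAL_ERROR("InternalError", "Server"),
    INVALID_SCHEDULE("InvalidSchedule", "Client"),
    INVALID_SCHEDULE_NAME("InvalidScheduleName", "Client");

    final String code;
    final String faultPrefix;

    AdminErrorCode(String code, String faultPrefix) {
        this.code = code;
        this.faultPrefix = faultPrefix;
    }

    // Looks up an error code string; empty if the code is not in the table.
    static Optional<AdminErrorCode> fromCode(String code) {
        for (AdminErrorCode c : values()) {
            if (c.code.equals(code)) {
                return Optional.of(c);
            }
        }
        return Optional.empty();
    }

    // True when the fault is attributed to the request (prefix "Client").
    boolean isClientFault() {
        return "Client".equals(faultPrefix);
    }
}
```

A caller could use isClientFault() to decide whether to correct the request, for example a wrong schedule name, or to retry later after an InternalError.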

Oracle Secure Enterprise Search Java SDK

The Oracle Secure Enterprise Search Java SDK contains the following APIs:

Crawler Plug-in API
URL Rewriter API
Query-time Authorization API

Crawler Plug-in API

You can implement a crawler plug-in to crawl and index a proprietary document repository. In Oracle SES, the proprietary repository is called a user-defined source. The module that enables the crawler to access the source is called a crawler plug-in (or connector).

The plug-in collects document URLs and associated metadata from the user-defined source and returns the information to the Oracle SES crawler. The crawler starts processing each URL as it is collected. The crawler plug-in must be implemented in Java using the Oracle SES Crawler Plug-in API. Crawler plug-ins go in the $ORACLE_HOME/search/lib/plugins directory.

This section includes the following topics:

Crawler Plug-in Overview
Crawler Plug-in Functionality

See Also:

Oracle SES developer tutorial for a guide to using the Crawler Plug-in API:

http://st-curriculum.oracle.com/tutorial/SESDevTutorial/index.htm

Crawler Plug-in Overview

The following diagram illustrates the crawler plug-in architecture.

Description of the illustration benri010.gif

Two interfaces in the Crawler Plug-in API (CrawlerPluginManager and CrawlerPlugin) need to be implemented to create a crawler plug-in. A crawler plug-in does the following:

Provides the metadata of the document in the form of document attributes
Provides access control list (ACL) information if the document is protected
Maps each document attribute to a common attribute name used by end users
Optionally provides the list of URLs that have changed since a given time stamp
Optionally provides an access URL in addition to the display URL for the processing of the document
Provides the document contents in the form of a Java Reader. In other words, the plug-in is responsible for fetching the document.
Can submit "attribute-only" documents to the crawler; that is, a document that has metadata but no document contents.

Document Attributes and Properties

Document attributes, or metadata, describe document properties. Some attributes can be irrelevant to your application. The crawler plug-in creator must decide which document attributes should be extracted and saved. The plug-in also can be created such that the list of collected attributes is configurable. Oracle SES automatically registers attributes returned by the plug-in. The plug-in can decide which attributes to return for a document.

Library Path and Java Class Path

Any other Java class needed by the plug-in should be included in the plug-in jar file. (You could add the paths for the additional jar files needed by the plug-in into the Class-Path of the MANIFEST.MF file in the plug-in jar file.) This is because Oracle SES automatically adds the plug-in jar file to the crawler Java class path, and Oracle SES does not let you add other class paths from the administration interface.

If the plug-in code also relies on a particular library file (for example, a .dll file on Windows or a .so file on UNIX), then the library must be put under the $ORACLE_HOME/lib directory or the $ORACLE_HOME/search/lib/plugins directory. The Java library path is set explicitly by the crawler to those locations.

Crawler Plug-in Restrictions

The plug-in must handle mimetype rejection and large document rejection itself. For example, the plug-in should reject files it does not want to index based on their type or size, such as zip files. Also, plain text files, such as log files, can grow very large. Because the crawler reads HTML and plain text files into memory, it could run out of memory with very large files.

Crawler Plug-in Functionality

This section describes aspects of the crawler plug-in.

Source Registration

Source registration is automated. After a source type is defined, any instance of that source type can be defined with the following information:

Source name
Description of the source; limit to 4000 bytes
Source type ID
Default language; default is 'en' (English)

Parameter values; for example:

seed - http://www.oracle.com
depth - 8

Source Attribute Registration

You can add new attributes to Oracle SES by providing the attribute name and the attribute data type. The data type can be string, number, or date. Attributes returned by a plug-in are automatically registered if they have not been defined.

User-Implemented Crawler Plug-in

The crawler plug-in has the following requirements:

The plug-in must be implemented in Java.
The plug-in must support the Java plug-in APIs defined by Oracle SES.
The plug-in must return the URL attributes and properties.
The plug-in must decide which document attributes Oracle SES should keep. Any attribute not defined in Oracle SES is registered automatically.
The plug-in can map attributes to source properties. For example, if an attribute "ID" is the unique ID of a document, then the plug-in should return (document_key, 4) where "ID" has been mapped to the property "document_key" and its value is 4 for this particular document.
If the attribute LOV is available, then the plug-in returns it upon request.

Crawler Plug-in APIs and Classes

The Crawler Plug-in API is a collection of classes and interfaces used to implement a crawler plug-in.

Table 7-4 Crawler Plug-in Interfaces and Classes

Interface/Class	Description
`CrawlerPluginManager`	This interface is used to generate the crawler plug-in instances. It provides general plug-in information for automatic plug-in registration on the administration page for defining user-defined source types. It controls which plug-in object (if more than one implementation is available) to return in the `getCrawlerPlugin` call and how many instances of the plug-in to return. If only one instance is returned, then the plug-in implementation must handle multi-threaded execution. The `CrawlingThreadService` object passed in is thread-specific, because each `getCrawlerPlugin` call is initiated by a separate thread.
`CrawlerPlugin`	This interface is used by the crawler plug-in to integrate with the Oracle SES crawler. The Oracle SES crawler loads the plug-in manager class and invokes the plug-in manager API to obtain the crawler plug-in instance. Each plug-in instance is run in the context of a thread execution.
`QueueService`	This interface is implemented by the Oracle SES crawler and made available to the plug-in through the `GeneralService` object. This interface is used by the crawler plug-in to submit URL-related data to the crawler.
`DataSourceService`	This interface is implemented by the Oracle SES crawler and made available to the plug-in through the `GeneralService` object. This interface is used by a crawler plug-in to manage the current crawled document set.
`GeneralService`	This interface provides Oracle SES service and implemented interface objects to the plug-in. It is implemented by the Oracle SES crawler and made available through plug-in manager initialization. This interface is used by a crawler plug-in to obtain Oracle SES interface objects.
`CrawlingThreadService`	This interface is used by a crawler plug-in to perform crawler-related tasks. It has execution context specific to the crawling thread that invokes the plug-in `crawl()` method.
`DocumentMetadata`	This interface holds a document's attributes and properties for processing and indexing. This interface is used by a crawler plug-in to submit URL-related data to the crawler.
`DocumentContainer`	This interface is used by a crawler plug-in to submit or retrieve document information.
`DocumentAcl`	This interface is used by a crawler plug-in to submit access control list (ACL) information for the document.
`ProcessingException`	This class encapsulates information about errors from processing plug-in requests.

URL Rewriter API

A URL rewriter is a user-supplied Java module that implements the Oracle SES UrlRewriter Java interface. When activated, it is used by the crawler to filter and rewrite extracted URL links before they are inserted into the URL queue.

Note:

The URL Rewriter API is included as part of the Crawler Plug-in SDK. The URL Rewriter API is used for Web sources.

Web crawling generally consists of the following steps:

Get the next URL from the URL queue. (Web crawling stops when the queue is empty.)
Fetch the contents of the URL.
Extract URL links from the contents.
Insert the links into the URL queue.

The generated new URL link is subject to all existing boundary rules.

There are two possible operations that can be done on the extracted URL link:

Filtering: removes the unwanted URL link
Rewriting: transforms the URL link

URL Link Filtering

Users control what type of URL links are allowed to be inserted into the queue with the following mechanisms supported by the Oracle SES crawler:

robots.txt file on the target Web site; for example, disallow URLs from the /cgi directory
Hosts inclusion and exclusion rules; for example, only allow URLs from www.example.com
File path inclusion and exclusion rules; for example, only allow URLs under the /archive directory
Mimetype inclusion rules; for example, only allow HTML and PDF files
Robots metatag NOFOLLOW; for example, do not extract any link from that page
URL blacklist; for example, URLs explicitly singled out not to be crawled

With these mechanisms, only URL links that meet the filtering criteria are processed. However, there are other criteria that users might want to use to filter URL links. For example:

Allow URLs with certain file name extensions
Allow URLs only from a particular port number
Disallow any PDF file if it is from a particular directory

The set of possible criteria could be very large, so this evaluation is delegated to a user-implemented module that the crawler calls when it evaluates an extracted URL link.
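The three example criteria above can be sketched in plain Java. This is only an illustration of the decision logic: the class name LinkFilterSketch, the accept method, the allowed port, the blocked directory, and the extension list are all assumptions made for this example and are not part of the Oracle SES SDK. In a real rewriter, logic like this would sit inside the rewrite method of a UrlRewriter implementation.

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.util.Locale;

/** Hypothetical helper for illustration; not an Oracle SES class. */
final class LinkFilterSketch {
    // Assumed policy values, chosen only for this example.
    private static final int ALLOWED_PORT = 80;
    private static final String BLOCKED_PDF_DIR = "/drafts/";
    private static final String[] ALLOWED_EXTENSIONS = { ".html", ".htm", ".pdf" };

    /** Returns true if the extracted link should be kept for the URL queue. */
    static boolean accept(String link) {
        URI uri;
        try {
            uri = new URI(link);
        } catch (URISyntaxException e) {
            return false; // discard links that cannot be parsed
        }
        String path = uri.getPath() == null ? "" : uri.getPath().toLowerCase(Locale.ROOT);

        // Allow URLs only from a particular port. A missing port (-1) is
        // treated as 80 here, which is an assumption that fits http links.
        int port = uri.getPort() == -1 ? 80 : uri.getPort();
        if (port != ALLOWED_PORT) {
            return false;
        }

        // Disallow any PDF file if it is from a particular directory.
        if (path.endsWith(".pdf") && path.startsWith(BLOCKED_PDF_DIR)) {
            return false;
        }

        // Allow URLs with certain file name extensions only.
        for (String ext : ALLOWED_EXTENSIONS) {
            if (path.endsWith(ext)) {
                return true;
            }
        }
        return false;
    }
}
```

Links that cannot be parsed are discarded in this sketch; a production rewriter might prefer to log them and pass them through unchanged.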

URL Link Rewriting

For some applications, for security reasons, the URL that is crawled differs from the one seen by the end user. For example, crawling is done on an internal Web site behind a firewall without security checking, but when an end user queries, a corresponding mirror URL outside the firewall must be used.

A display URL is a URL string used for search result display. This is the URL used when users click the search result link. An access URL is a URL string used by the crawler for crawling and indexing. An access URL is optional. If it does not exist, then the crawler uses the display URL for crawling and indexing. If it does exist, then it is used by the crawler instead of the display URL for crawling.

For regular Web crawling, there are only display URLs available. But in some situations, the crawler needs an access URL for crawling the internal site while keeping a display URL for the external use. For every internal URL, there is an external mirrored one.

For example:

http://www.example-qa.us.com:9393/index.html
http://www.example.com/index.html

When the URL link http://www.example-qa.us.com:9393/index.html is extracted and before it is inserted into the queue, the crawler generates a new display URL and a new access URL for it:

Access URL:

http://www.example-qa.us.com:9393/index.html

Display URL:

http://www.example.com/index.html

The extracted URL link is rewritten, and the crawler crawls the internal Web site without exposing it to the end user.
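The mapping from the internal link to its external mirror can be sketched as follows. The class MirrorRewriteSketch and its rewrite method are hypothetical names used for this example only, and the host names and port are the ones from the example above; how the two URLs are actually returned to the crawler is defined by the UrlRewriter interface, which this sketch does not model.

```java
import java.net.URI;
import java.net.URISyntaxException;

/** Hypothetical helper for illustration; not an Oracle SES class. */
final class MirrorRewriteSketch {
    private static final String INTERNAL_HOST = "www.example-qa.us.com";
    private static final int INTERNAL_PORT = 9393;
    private static final String EXTERNAL_HOST = "www.example.com";

    /**
     * Returns a two-element array: index 0 is the access URL (what is crawled),
     * index 1 is the display URL (what the end user sees).
     */
    static String[] rewrite(String link) {
        try {
            URI uri = new URI(link);
            boolean internal = INTERNAL_HOST.equalsIgnoreCase(uri.getHost())
                    && uri.getPort() == INTERNAL_PORT;
            if (!internal) {
                // Not an internal link: crawl and display the same URL.
                return new String[] { link, link };
            }
            // Same scheme, path, query, and fragment; external host, default port.
            URI display = new URI(uri.getScheme(), null, EXTERNAL_HOST, -1,
                    uri.getPath(), uri.getQuery(), uri.getFragment());
            return new String[] { link, display.toString() };
        } catch (URISyntaxException e) {
            // Leave links that cannot be parsed unchanged.
            return new String[] { link, link };
        }
    }
}
```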

Another example is when the links that the crawler picks up are generated dynamically and can be different (depending on the referencing page or other factors) even though they all point to the same page. For example:

http://compete3.example.com/rt/rt.wwv_media.show?p_type=text&p_id=4424&p_currcornerid=281&p_textid=4423&p_language=us

http://compete3.example.com/rt/rt.wwv_media.show?p_type=text&p_id=4424&p_currcornerid=498&p_textid=4423&p_language=us

Because the crawler detects different URLs with the same contents only after a sufficient number of duplicates have accumulated, the URL queue could grow to a huge number of URLs, causing excessive URL link generation. In this situation, a URL rewriter can normalize the extracted links so that all URLs pointing to the same page are rewritten to a single URL. The algorithm for rewriting these URLs is application dependent and cannot be handled by the crawler in a generic way.
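As one possible normalization for the two example links above, the following sketch drops the query parameter that differs between them. It assumes that p_currcornerid only records the referring page and never changes the page contents; that assumption, like the class name LinkNormalizerSketch, is made for illustration and would need to be verified for a real application. The sketch also assumes the link carries no fragment.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper for illustration; not an Oracle SES class. */
final class LinkNormalizerSketch {
    // Assumption: this parameter depends on the referring page only.
    private static final String VOLATILE_PARAM = "p_currcornerid";

    /** Removes the volatile parameter and keeps the other parameters in order. */
    static String normalize(String link) {
        int q = link.indexOf('?');
        if (q < 0) {
            return link; // no query string, nothing to normalize
        }
        String base = link.substring(0, q);
        List<String> kept = new ArrayList<>();
        for (String pair : link.substring(q + 1).split("&")) {
            int eq = pair.indexOf('=');
            String name = eq >= 0 ? pair.substring(0, eq) : pair;
            if (!pair.isEmpty() && !name.equals(VOLATILE_PARAM)) {
                kept.add(pair);
            }
        }
        return kept.isEmpty() ? base : base + "?" + String.join("&", kept);
    }
}
```

With this rule, both example links normalize to the same string, so only one entry reaches the URL queue.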

When a URL link goes through a rewriter, there are the following possible outcomes:

The link is inserted with no changes made to it.
The link is discarded; it is not inserted.
A new display URL is returned, replacing the URL link for insertion.
A display URL and an access URL are returned. The display URL might or might not be identical to the URL link.
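The four outcomes can be modeled with a small value type. The class below is a hypothetical stand-in written for this explanation; the real return type of the rewrite method is defined in the oracle.search.sdk.crawler package and may look different. The urlToCrawl method encodes the rule stated earlier: the crawler uses the access URL when one exists, and the display URL otherwise.

```java
/** Hypothetical model of a rewrite result; not an Oracle SES class. */
final class RewriteOutcomeSketch {
    private final String displayUrl; // null means the link is discarded
    private final String accessUrl;  // null means the display URL is crawled

    private RewriteOutcomeSketch(String displayUrl, String accessUrl) {
        this.displayUrl = displayUrl;
        this.accessUrl = accessUrl;
    }

    /** Outcome 1: the link is inserted with no changes. */
    static RewriteOutcomeSketch unchanged(String link) {
        return new RewriteOutcomeSketch(link, null);
    }

    /** Outcome 2: the link is discarded. */
    static RewriteOutcomeSketch discarded() {
        return new RewriteOutcomeSketch(null, null);
    }

    /** Outcome 3: a new display URL replaces the link. */
    static RewriteOutcomeSketch newDisplay(String displayUrl) {
        return new RewriteOutcomeSketch(displayUrl, null);
    }

    /** Outcome 4: both a display URL and an access URL are returned. */
    static RewriteOutcomeSketch displayAndAccess(String displayUrl, String accessUrl) {
        return new RewriteOutcomeSketch(displayUrl, accessUrl);
    }

    boolean isDiscarded() {
        return displayUrl == null;
    }

    String displayUrl() {
        return displayUrl;
    }

    /** The access URL if present, otherwise the display URL. */
    String urlToCrawl() {
        return accessUrl != null ? accessUrl : displayUrl;
    }
}
```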

Creating and Using a URL Rewriter

Follow these steps to create and use a URL rewriter:

Create a new Java file implementing the UrlRewriter interface open, close, and rewrite methods.

Compile the rewriter Java file into a class file. For example:

$ORACLE_HOME/jdk/bin/javac -classpath $ORACLE_HOME/search/lib/search.jar SampleRewriter.java

Package the rewriter class file into a jar file under the $ORACLE_HOME/search/lib/plugins/ directory. For example:
```
$ORACLE_HOME/jdk/bin/jar cv0f $ORACLE_HOME/search/lib/plugins/sample.jar SampleRewriter.class 
```
Enable the UrlRewriter option and specify the rewriter class name and jar file name (for example, SampleRewriter and sample.jar) in the administration tool Home - Sources - Crawling Parameters page of an existing Web source.
Crawl the target Web source by launching the corresponding schedule. The crawler log file confirms the use of the URL rewriter with the message Loading URL rewriter "SampleRewriter"...

Note:

URL rewriting is available for Web sources only.

See Also:

Oracle Secure Enterprise Search Java API Reference for the API (oracle.search.sdk.crawler package)

Query-time Authorization API

Query-time authorization allows an Oracle SES administrator to associate a Java class with a source that will, at search time, validate every document fetched out of the Oracle SES repository belonging to the protected source. This result filter class can dynamically check access rights to make sure that the current search user has the credentials to view each document.

This authorization model can be applied to any source other than self service or federated sources. Besides acting as the sole provider of access control for a source, it can also be used as a post-filter. For example, a source can be stamped with a more generic ACL, while query-time authorization can be used to fine-tune the results.

Overview of Query-time Authorization

Query-time authorization has the following characteristics:

It allows dynamic access control at search time compared to more static ACL stamping.
It filters documents returned to a search user.
It controls the Browse functionality to determine whether a folder is visible to a search user.
Optionally, it allows pruning of an entire source from the results to reduce performance costs of filtering each document individually.
It is applicable to all source types except self service and federated sources.

Query-time filtering is handled by class implementations of the QueryTimeFilter interface.

Filtering Document Access

Filtering document access is handled by the filterDocuments method of the QueryTimeFilter interface. The most common situation for filtering will occur with a search request, in which this method will be invoked with batches of documents from the result list. Based on the values returned by this method, all, some, or none of the documents might be removed from the results returned to the search user.

Access of individual documents is also controlled. For example, viewing a cached copy of a document or accessing the in-links and out-links will require a call into filterDocuments to determine the authorization for the search user.
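The batch-oriented shape of this check can be sketched without the SDK types. In the example below, documents are identified by URL strings and the access data is a plain map; both are simplifications, and the class DocumentBatchFilterSketch and its filterBatch method are hypothetical. The actual filterDocuments signature, argument types, and return values are defined in the oracle.search.query.qta package.

```java
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Hypothetical helper for illustration; not an Oracle SES class. */
final class DocumentBatchFilterSketch {
    /**
     * Returns one decision per document, in the same order as the batch:
     * true keeps the document in the results, false removes it.
     */
    static boolean[] filterBatch(String user,
                                 List<String> docUrls,
                                 Map<String, Set<String>> readersByUrl) {
        boolean[] keep = new boolean[docUrls.size()];
        for (int i = 0; i < docUrls.size(); i++) {
            Set<String> readers = readersByUrl.get(docUrls.get(i));
            // Fail closed: a document with no known reader list is removed.
            keep[i] = readers != null && readers.contains(user);
        }
        return keep;
    }
}
```

Because the same check also guards cached copies and link views, the decision for a document should depend only on the user and the document, not on which page triggered the call.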

Filtering Folder Browsing

The QueryTimeFilter implementation is also responsible for controlling the access to, and visibility of folders in, the Browse application. If a folder belongs to a source protected by a query-time filter, then the folder name in the Browse page will not have a document count listed next to it. Instead, the folder will show a view_all link.

For performance reasons, it could be costly to determine the exact number of documents visible to the current search user for every query-time filtered folder displayed on a Browse page. This task would require that every document in every folder be processed by the filter in order to calculate the total number of documents available for each folder. To prevent this comprehensive and potentially time-consuming operation, document counts are not used. Instead, folder visibility is explicitly determined by the query-time filter.

Based on the results from the filterBrowseFolders method, a folder might be hidden or shown in the Browse page. This result also controls access to the single folder browsing page, which displays the documents contained in a folder.

If the security of folder names is not a concern for a particular source, then the filterBrowseFolders method can blindly authorize all folders to be visible in the Browse application. After a folder is selected, the document list is still filtered through the filterDocuments method. This strategy should not be employed if folder names could reveal sensitive information.

If security is very critical, then it might be easiest to hide all folders for browsing. The documents from the source will still be available for search queries from the Basic and Advanced Search boxes, but a user will not be able to browse the source in the Browse pages of the search application.

Limitations of folder filtering:

The filterBrowseFolders method does not implicitly restrict access to subfolders. For example, if folder /Miscellaneous/www.example.com/private is hidden for a search user, then it is still possible for that user to view any subfolder, such as /Miscellaneous/www.example.com/private/a/b, if that subfolder is not also explicitly filtered out by this method. It would be possible to view this subfolder if the user followed a bookmark or outside link directly to the authorized subfolder in the Browse application.
This method does not affect functionality outside of the Browse application. This is not a generic folder pruning method. Search queries and document retrieval outside of the Browse application are only affected by the filterDocuments and pruneSource methods.
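Because subfolders are not restricted implicitly, a filter that wants to hide a whole folder tree has to test each folder path against the hidden roots itself. A minimal sketch of that path test follows; the class FolderVisibilitySketch and the isHidden method are hypothetical names, and the sketch assumes folder paths are slash-separated strings as in the example above.

```java
import java.util.Set;

/** Hypothetical helper for illustration; not an Oracle SES class. */
final class FolderVisibilitySketch {
    /** Returns true if the folder is a hidden root or lies below one. */
    static boolean isHidden(String folderPath, Set<String> hiddenRoots) {
        for (String root : hiddenRoots) {
            // The "/" suffix keeps /private from matching /privately.
            if (folderPath.equals(root) || folderPath.startsWith(root + "/")) {
                return true;
            }
        }
        return false;
    }
}
```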

Pruning Access to an Entire Source

The QueryTimeFilter interface provides the ability to determine access privileges at the source level. This is achieved through calls to the pruneSource method. This method can be called in situations where there is a large number of documents or folders to be filtered. Authorizing or denying the entire source for a given user could provide a large performance gain over filtering each document individually.

The implementation of QueryTimeFilter must not rely on this method to secure access to documents or folders. This method is strictly an optimization feature. There is no guarantee that this will ever be invoked for any particular search request or document access. For example, when performing authorization for a single document, Oracle SES may call the filterDocuments method directly without invoking this method at all. Therefore, the filterDocuments and filterBrowseFolders methods must be implemented to provide full security in the absence of pruning.

Determining the Authenticated User

A query-time filter is free to define a search user's access privileges to sources and documents based on any criteria available. For example, a filter could be written to deny access to a source depending on the time of day.

In most cases, however, a filter will impose restrictions based on the authenticated user for that search request. The Oracle SES authenticated user name for a request is contained in the RequestInfo object. The steps for accessing this user name value depend on whether the request originated from the JSP search application or the Oracle SES Query Web Services interface. For either type of request, the key used to access the authenticated user name is the string value AUTH_USER.

Note:

User name is not case-sensitive.

This sample implementation of the QueryTimeFilter.getCurrentUserName method illustrates how to retrieve the current authenticated user from either a JSP or Web Services request:

public String getCurrentUserName( RequestInfo req )
    throws QueryTimeFilterException
  {
    HttpServletRequest servReq = req.getHttpRequest();
    Map sessCtx = req.getSessionContext();
    String user = null;
 
    if( servReq != null )
    {
      // JSP search application request
      HttpSession session = servReq.getSession();
      if( session != null )
        user = ( String ) session.getAttribute( "AUTH_USER" );
    }
    else if( sessCtx != null )
    {
      // Web Service request
      user = ( String ) sessCtx.get( "AUTH_USER" );
    }
 
    // Fall back to a placeholder name when no authenticated user is found
    if( user == null )
      user = "unknown";
 
    return user;
  }
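Because the user name is not case-sensitive, a filter that compares it against its own access data should normalize case first. The following sketch shows one way to do that; the class UserNameSketch and its methods are hypothetical, and the choice of lowercase with Locale.ROOT is an assumption made so that the result does not depend on the server's default locale.

```java
import java.util.Locale;
import java.util.Set;

/** Hypothetical helper for illustration; not an Oracle SES class. */
final class UserNameSketch {
    /** Maps a user name to a canonical lowercase form; null becomes "unknown". */
    static String canonical(String userName) {
        return userName == null ? "unknown" : userName.toLowerCase(Locale.ROOT);
    }

    /** Checks membership against a set that already holds canonical names. */
    static boolean isAllowed(String userName, Set<String> allowedCanonicalNames) {
        return allowedCanonicalNames.contains(canonical(userName));
    }
}
```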

See Also:

"Authentication Methods"

Query-time Authorization Interfaces and Exceptions

The oracle.search.query.qta package contains all interfaces and exceptions in the Query-time Authorization API.

To write a query-time authorization filter, implement the QueryTimeFilter interface. The methods in this interface may throw instances of the QueryTimeFilterException exception.

Objects that implement the RequestInfo, DocumentInfo, and FolderInfo interfaces are passed in as arguments for filtering, but these interfaces do not need to be implemented by the filter writer.

The API contains the following interfaces and exceptions:

Table 7-5 Query-time Authorization Interfaces and Exceptions

Interface/Exception	Description
`QueryTimeFilter`	This interface filters search results and access to document information at search time. If an object implementing this interface has been assigned to a source, then any search results or other retrieval of documents belonging to the source are passed through this filter before being presented to the end user.
`QueryTimeFilterException`	This exception is thrown by methods in the `QueryTimeFilter` interface to indicate that a failure has occurred.
`RequestInfo`	This interface represents information about a request that can be passed to a `QueryTimeFilter` for filtering out documents, folders, or entire sources.
`DocumentInfo`	This interface represents information about a document that can be passed to a `QueryTimeFilter` for filtering out documents.
`FolderInfo`	This interface represents information about a folder that can be passed to a `QueryTimeFilter` to control folder browsing.

See Also:

Oracle Secure Enterprise Search Java API Reference for the oracle.search.query.qta package

Thread-safety of the Filter Implementation

Classes that implement the QueryTimeFilter interface should be designed to persist for the lifetime of a running Oracle SES search application. A single instance of QueryTimeFilter will generally handle multiple concurrent requests from different search end users. Therefore, the filterDocuments, pruneSource, filterBrowseFolders, and getCurrentUserName methods in this class must be both reentrant and thread-safe.
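One common way to meet this requirement is to keep per-request data in local variables, make shared configuration immutable, and use atomic types for any shared counters. The sketch below illustrates that pattern with a hypothetical class, ThreadSafeFilterSketch, that does not implement the real QueryTimeFilter interface; the blocked-user list is an assumed example policy.

```java
import java.util.Collections;
import java.util.HashSet;
import java.util.Set;
import java.util.concurrent.atomic.AtomicLong;

/** Hypothetical helper for illustration; not an Oracle SES class. */
final class ThreadSafeFilterSketch {
    // Shared state is either immutable or an atomic type.
    private final Set<String> blockedUsers;
    private final AtomicLong checks = new AtomicLong();

    ThreadSafeFilterSketch(Set<String> blockedUsers) {
        // Defensive copy, so later changes by the caller cannot leak in.
        this.blockedUsers = Collections.unmodifiableSet(new HashSet<>(blockedUsers));
    }

    /** Safe to call from many request threads at once. */
    boolean isAuthorized(String user) {
        checks.incrementAndGet(); // atomic, so no updates are lost
        return !blockedUsers.contains(user);
    }

    long checksPerformed() {
        return checks.get();
    }
}
```

A plain long field incremented with ++ in place of the AtomicLong could lose updates under concurrent requests, which is the kind of defect this design avoids.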

Compiling and Packaging the Query-time Filter

To compile your query-time filter class, you must include at least the following two files in the Java CLASSPATH. These files can be found in the Oracle SES server directory.

$ORACLE_HOME/search/lib/search_query.jar
$ORACLE_HOME/jlib/servlet.jar

It is recommended to build a jar file containing your QueryTimeFilter class (or classes) and any supporting Java classes. This jar file should be placed in a secure location for access by the Oracle SES server. If this jar file is compromised, then the security of document access in the search server can be compromised.

Your query-time filter might require other class or jar files that are not included in the jar file you build and are not located in the Oracle SES class path. If so, these files should be added to the Class-Path attribute of the JAR file manifest. This manifest file should be included in the jar file you build.
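The Class-Path attribute holds a space-separated list of relative URLs. As a sketch, the standard java.util.jar.Manifest class can be used to build such a manifest programmatically; the class ManifestSketch is a hypothetical name, and the jar file names passed to it in the usage note are placeholders, not files shipped with Oracle SES.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.jar.Attributes;
import java.util.jar.Manifest;

/** Hypothetical helper for illustration; not an Oracle SES class. */
final class ManifestSketch {
    /** Builds a manifest whose main section carries the given Class-Path. */
    static Manifest withClassPath(String classPath) {
        Manifest manifest = new Manifest();
        Attributes main = manifest.getMainAttributes();
        // Manifest-Version must be set, or the main section is not written.
        main.put(Attributes.Name.MANIFEST_VERSION, "1.0");
        main.put(Attributes.Name.CLASS_PATH, classPath);
        return manifest;
    }

    /** Renders the manifest as text, as it would appear in META-INF/MANIFEST.MF. */
    static String render(Manifest manifest) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try {
            manifest.write(out);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return new String(out.toByteArray(), StandardCharsets.UTF_8);
    }
}
```

For example, ManifestSketch.withClassPath("lib/helper.jar lib/ldap-client.jar") produces a manifest that lists two supporting jar files relative to the location of the filter jar file.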

If Oracle SES cannot locate a class used by a QueryTimeFilter at run time, then an error message is written to the log file and all documents from that source are filtered out for the search request being processed.

See Also:

http://java.sun.com/j2se/1.4.2/docs/guide/jar/jar.html for more information about jar file manifests