Oracle® Data Mining Concepts
10g Release 2 (10.2)

Part Number B14339-01

6 Text Mining Using Oracle Data Mining

This chapter describes support for text mining. Oracle provides support for text mining with two products:

  • Oracle Data Mining (ODM)

  • Oracle Text

The support for text data in ODM is different from that provided by Oracle Text. Oracle Text is dedicated to text document processing. ODM allows the combination of text (unstructured) columns and non-text (categorical and numerical) columns of data as input for clustering, classification, and feature extraction.

Oracle Text is described in the Oracle Text Reference and the Oracle Text Application Developer's Guide.

Text data is the only unstructured data type supported by ODM.

Table 6-1 summarizes how DBMS_DATA_MINING, the ODM Java interface, and Oracle Text support text mining.

Oracle Data Mining Application Developer's Guide provides information that helps you develop text mining applications using the PL/SQL and Java ODM interfaces. Oracle Data Mining Administrator's Guide contains descriptions of the sample text mining programs included with ODM.

This chapter discusses the following topics:

  • What is Text Mining?

  • ODM Support for Text Mining

  • Oracle Support for Text Mining

6.1 What is Text Mining?

Text mining is conventional data mining done using "text features." Text features are usually keywords, frequencies of words, or other document-derived features. Once you derive text features, you mine them just as you would any other data.
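The idea of deriving text features and then mining them like any other data can be illustrated with a minimal sketch. It is not ODM's or Oracle Text's extraction process; the tokenizer and the function names (`tokenize`, `term_frequencies`, `to_feature_rows`) are assumptions made for illustration only. It turns each document into a row of word counts over a shared vocabulary.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())

def term_frequencies(documents):
    """Return one {term: count} mapping per document."""
    return [Counter(tokenize(doc)) for doc in documents]

def to_feature_rows(documents):
    """Turn documents into fixed-width numeric rows over a shared vocabulary."""
    counts = term_frequencies(documents)
    vocabulary = sorted(set().union(*counts))
    rows = [[c[term] for term in vocabulary] for c in counts]
    return vocabulary, rows

# Hypothetical two-document collection.
docs = ["The farmer sheared the sheep.", "Sheep sleep in the meadow."]
vocabulary, rows = to_feature_rows(docs)
```

Each row is now an ordinary numeric record, so it could be combined with non-text columns and passed to a conventional mining algorithm.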

Some of the applications for text mining include:

6.1.1 Document Classification

Document classification, also known as document categorization, is the process of assigning documents to categories (for example, themes or subjects). A particular document may fit into two or more different categories. This type of classification can often be represented as a multi-target classification problem in which a supervised model is built for each category.
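The "one supervised model per category" formulation can be sketched as follows. This is a generic illustration, not an ODM interface: it assumes a simple perceptron as the per-category binary model, and all names (`train_binary_perceptron`, `train_per_category`, `predict_categories`) and the training data are hypothetical.

```python
import re
from collections import defaultdict

def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())

def train_binary_perceptron(samples, epochs=20):
    """samples: list of (tokens, is_member). Returns (weights, bias)."""
    weights = defaultdict(float)
    bias = 0.0
    for _ in range(epochs):
        for tokens, is_member in samples:
            target = 1 if is_member else -1
            score = bias + sum(weights[t] for t in tokens)
            if target * score <= 0:  # misclassified: nudge toward the target
                for t in tokens:
                    weights[t] += target
                bias += target
    return weights, bias

def train_per_category(labeled_docs):
    """labeled_docs: list of (text, set_of_categories). One model per category."""
    categories = sorted(set().union(*(cats for _, cats in labeled_docs)))
    tokenized = [(tokenize(text), cats) for text, cats in labeled_docs]
    return {
        cat: train_binary_perceptron([(toks, cat in cats) for toks, cats in tokenized])
        for cat in categories
    }

def predict_categories(models, text):
    """Return every category whose binary model scores the document above zero."""
    tokens = tokenize(text)
    return {
        cat
        for cat, (weights, bias) in models.items()
        if bias + sum(weights.get(t, 0.0) for t in tokens) > 0
    }

# Hypothetical training set; the third document belongs to two categories.
training = [
    ("stock prices rise as markets rally", {"finance"}),
    ("team wins the championship game", {"sports"}),
    ("club stock soars after championship win", {"finance", "sports"}),
    ("bank raises interest rates", {"finance"}),
    ("player scores in final game", {"sports"}),
]
models = train_per_category(training)
both = predict_categories(models, "stock markets rally after championship game")
```

Because each category has its own independent model, a document can be assigned to zero, one, or several categories.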

6.2 ODM Support for Text Mining

ODM provides infrastructure for developing data mining applications suitable for addressing a variety of business problems involving text. Among these, the following specific technologies provide key elements for addressing problems that require text mining:

The technologies that are most used in text mining are classification, clustering, and feature extraction.

6.2.1 Classification and Text Mining

A large number of document classification applications fall into one of the following:

  • Assigning multiple labels to a document. ODM does not support this case.

  • Assigning a document to one of many labels. Examples include automatically assigning a mail message to a folder and filtering spam. This application requires multi-class classification.

The Support Vector Machine (SVM) algorithm provides powerful classifiers that have been used successfully in document classification applications. SVM can deal with thousands of features and is easy to train with small or large amounts of data. SVM is known to work well with text data. For more information about SVM, see Chapter 3.

6.2.2 Clustering and Text Mining

Clustering is used frequently in text mining. Its main applications are:

  • Taxonomy generation

  • Topic extraction

  • Grouping the hits returned by a search engine

Clustering can also be used to group textual information with other indications from business databases to provide novel insights.

The current release of ODM supports clustering text features using both the PL/SQL and Java interfaces.

The k-Means clustering algorithm, described in "Enhanced k-Means Algorithm", supports mining text columns.
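As a rough illustration of clustering documents by their text features, the sketch below runs plain k-means on word-count vectors. It is not ODM's enhanced k-Means algorithm; the function names, the choice of initial centroids, and the sample documents are assumptions for illustration.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())

def vectorize(documents):
    """Represent each document as word counts over a shared vocabulary."""
    counts = [Counter(tokenize(d)) for d in documents]
    vocabulary = sorted(set().union(*counts))
    return [[float(c[t]) for t in vocabulary] for c in counts]

def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def k_means(points, initial_centroids, max_iterations=100):
    """Plain k-means: assign each point to its nearest centroid, then re-average."""
    centroids = [list(c) for c in initial_centroids]
    assignments = None
    for _ in range(max_iterations):
        new_assignments = [
            min(range(len(centroids)), key=lambda j: squared_distance(p, centroids[j]))
            for p in points
        ]
        if new_assignments == assignments:
            break  # assignments are stable, so the centroids will not move
        assignments = new_assignments
        for j in range(len(centroids)):
            members = [p for p, a in zip(points, assignments) if a == j]
            if members:  # keep the old centroid if a cluster becomes empty
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    return assignments, centroids

# Hypothetical documents; the first two serve as initial centroids.
docs = [
    "sheep graze in the meadow",
    "stock prices rise",
    "the sheep sleep in the meadow",
    "stock prices fall",
]
points = vectorize(docs)
assignments, centroids = k_means(points, initial_centroids=points[:2])
```

Seeding the centroids from the first two documents keeps the example deterministic; a practical implementation would choose starting points more carefully.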

6.2.3 Feature Extraction and Text Mining

There are two kinds of text mining problems for which feature extraction is useful:

  • Extract features from actual text. Oracle Text is designed to solve this kind of problem. ODM also supports feature extraction from text. Most text mining is focused on this problem.

  • Extract semantic features or higher-level features from the basic features uncovered when features are extracted from actual text. Statistical techniques, such as singular value decomposition (SVD) and non-negative matrix factorization (NMF), are important in solving this kind of problem. Higher-order features can greatly improve the quality of information retrieval, classification, and clustering tasks.

NMF has been found to provide superior text retrieval when compared to SVD and other traditional decomposition methods. NMF takes as input a term-document matrix and generates a set of topics that represent weighted sets of co-occurring terms. The discovered topics form a basis that provides an efficient representation of the original documents. For more information about NMF, see "Algorithm for Feature Extraction".
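The factorization described above can be sketched in a few lines. This example assumes the widely used multiplicative update rules for minimizing squared reconstruction error, which may differ from ODM's NMF implementation. The term-document matrix and all function names are hypothetical. The matrix V (terms by documents) is approximated by W (terms by topics) times H (topics by documents).

```python
import random

def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(m):
    return [list(col) for col in zip(*m)]

def frobenius_error(v, w, h):
    """Sum of squared differences between V and the product W H."""
    wh = matmul(w, h)
    return sum(
        (v[i][j] - wh[i][j]) ** 2 for i in range(len(v)) for j in range(len(v[0]))
    )

def nmf(v, k, iterations=200, seed=0, eps=1e-9):
    """Factor V into non-negative W and H using multiplicative updates."""
    rng = random.Random(seed)
    n_terms, n_docs = len(v), len(v[0])
    w = [[rng.random() + 0.1 for _ in range(k)] for _ in range(n_terms)]
    h = [[rng.random() + 0.1 for _ in range(n_docs)] for _ in range(k)]
    for _ in range(iterations):
        # H <- H * (W^T V) / (W^T W H), element-wise
        wt = transpose(w)
        numer = matmul(wt, v)
        denom = matmul(matmul(wt, w), h)
        h = [
            [h[a][j] * numer[a][j] / (denom[a][j] + eps) for j in range(n_docs)]
            for a in range(k)
        ]
        # W <- W * (V H^T) / (W H H^T), element-wise
        ht = transpose(h)
        numer = matmul(v, ht)
        denom = matmul(w, matmul(h, ht))
        w = [
            [w[i][a] * numer[i][a] / (denom[i][a] + eps) for a in range(k)]
            for i in range(n_terms)
        ]
    return w, h

# Hypothetical term-document matrix: rows are terms, columns are documents.
terms = ["sheep", "meadow", "stock", "prices"]
term_document = [
    [2, 1, 0, 0],
    [1, 2, 0, 0],
    [0, 0, 2, 1],
    [0, 0, 1, 2],
]
w, h = nmf(term_document, k=2)
```

Each column of `w` is a topic expressed as non-negative term weights, and each column of `h` gives the topic weights that represent one document.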

6.2.4 Association and Text Mining

Association models can be used to uncover the semantic meaning of words. For example, suppose that the word sheep co-occurs with words like sleep, fence, chew, grass, meadow, farmer, and shear. An association model would include rules connecting sheep with these concepts. Inspection of the rules would provide context for sheep in the document collection. Such associations can improve information retrieval engines. For more information about association models, see "Association".
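A minimal sketch of how co-occurrence rules of this kind might be derived is shown below. It is a simplified pairwise illustration, not ODM's association algorithm; the thresholds, function names, and sample documents are assumptions. Support is the fraction of documents containing both words, and confidence is the fraction of documents containing the first word that also contain the second.

```python
import re
from collections import Counter
from itertools import permutations

def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())

def cooccurrence_rules(documents, min_support=0.5, min_confidence=0.8):
    """Return sorted (antecedent, consequent, support, confidence) word-pair rules."""
    doc_terms = [set(tokenize(d)) for d in documents]
    n = len(doc_terms)
    term_count = Counter(t for terms in doc_terms for t in terms)
    # Count every ordered pair of distinct words that share a document.
    pair_count = Counter(
        pair for terms in doc_terms for pair in permutations(sorted(terms), 2)
    )
    rules = []
    for (a, b), count in pair_count.items():
        support = count / n
        confidence = count / term_count[a]
        if support >= min_support and confidence >= min_confidence:
            rules.append((a, b, support, confidence))
    return sorted(rules)

# Hypothetical document collection.
docs = [
    "sheep grass meadow",
    "sheep grass farmer",
    "sheep meadow fence",
    "farmer tractor field",
]
rules = cooccurrence_rules(docs)
```

With these sample documents, the surviving rules connect grass and meadow to sheep, which is the kind of context the rules are meant to expose.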

6.2.5 Regression and Text Mining

Regression is most often used in problems that combine text with other types of data. For more information about regression, see "Regression".

6.2.6 Anomaly Detection and Text Mining

One-Class Support Vector Machine can be used to detect or identify novel or anomalous patterns. For more information, see "Anomaly Detection".

6.3 Oracle Support for Text Mining

Table 6-1 summarizes how ODM (both the Java and PL/SQL interfaces) and Oracle Text support text mining functions.

Table 6-1 Text Mining Comparison

| Feature | ODM | Oracle Text |
|---|---|---|
| Association | Text data only or text and non-text data | No support |
| Clustering | k-Means algorithm supports text only or text and non-text data | k-Means algorithm supports text only |
| Attribute importance | No support for text data | No support |
| Regression | Support Vector Machine (SVM) algorithm supports text data only or text and non-text data | No support |
| Classification | SVM supports text only or text and non-text data. Support for assigning documents to one of many labels | SVM and decision trees support text only. Support for assigning documents to one of many labels and also for assigning documents to multiple labels at the same time |
| One-Class SVM | One-Class SVM supports text only or text and non-text data | No support |
| Feature extraction (basic features) | The Java API supports the feature extraction process that transforms a text column to a nested table. The PL/SQL API requires the use of Oracle Text procedures to perform extraction. ODM allows the same degree of control as Oracle Text | Feature extraction is done internally; the results are not exposed |
| Feature extraction (higher order features) | Non-negative matrix factorization (NMF) supports either text or text and non-text data | No support |
| Record apply | No support for record apply | Supports record apply for classification |
| Support for text columns | Features extracted from a column of type CLOB, BLOB, BFILE, LONG, VARCHAR2, XMLType, CHAR, RAW, LONG RAW using an appropriate transformation | Supports table columns of type CLOB, BLOB, BFILE, LONG, VARCHAR2, XMLType, CHAR, RAW, LONG RAW |