5 Data Mining Process

This chapter describes the data mining process in general and how it is supported by Oracle Data Mining. Data mining requires data preparation, model building, model testing and computing lift for a model, model applying (scoring), and model deployment. The Oracle database and Oracle Data Mining provide facilities for performing all of the data mining steps. This chapter discusses the following topics:

How Is Data Mining Done?
How Does Oracle Data Mining Support Data Mining?

5.1 How Is Data Mining Done?

CRISP-DM is a widely accepted methodology for data mining projects. For details, see htttp://www.crisp-dm.org. The steps in the process are:

Business Understanding: Understand the project objectives and requirements from a business perspective, and then convert this knowledge into a data mining problem definition and a preliminary plan designed to achieve the objectives.
Data Understanding: Start by collecting data, then get familiar with the data, to identify data quality problems, to discover first insights into the data, or to detect interesting subsets to form hypotheses about hidden information.
Data Preparation: Includes all activities required to construct the final data set (data that will be fed into the modeling tool) from the initial raw data. Tasks include table, case, and attribute selection as well as transformation and cleaning of data for modeling tools.
Modeling: Select and apply a variety of modelling techniques, and calibrate tool parameters to optimal values. Typically, there are several techniques for the same data mining problem type. Some techniques have specific requirements on the form of data. Therefore, stepping back to the data preparation phase is often needed.
Evaluation: Thoroughly evaluate the model, and review the steps executed to construct the model, to be certain it properly achieves the business objectives. Determine if there is some important business issue that has not been sufficiently considered. At the end of this phase, a decision on the use of the data mining results is reached.
Deployment: Organize and present the results of data mining. Deployment can be as simple as generating a report or as complex as implementing a repeatable data mining process.

Data mining is iterative. A data mining process continues after a solution is deployed. The lessons learned during the process can trigger new business questions. Changing data can require new models. Subsequent data mining processes benefit from the experiences of previous ones.

Oracle Data Mining (ODM) supports the last three steps of CRISP-DM process. The first step, Business Understanding, is unique to your business. The remaining steps are supported by a combination of ODM and the Oracle database, especially in the context of an Oracle data warehouse. The facilities of the Oracle database can be very useful during data understanding and data preparation.

5.2 How Does Oracle Data Mining Support Data Mining?

ODM integrates data mining with the Oracle database and exposes data mining through the following interfaces:

Java interface: Java Data Mining (JSR-73) compliant interface that allows users to embed data mining in Java applications.
PL/SQL interface: The packages DBMS_DATA_MINING and DBMS_DATA_MINING_TRANSFORM allow users to embed data mining in PL/SQL applications.
Automated data mining: The DBMS_PREDICTIVE_ANALYTICS PL/SQL package, described briefly in "Automated Data Mining", automates the entire data mining process from data preprocessing through model building to scoring data.
Data mining SQL functions: The SQL Data Mining functions (CLUSTER_ID, CLUSTER_PROBABILITY, CLUSTER_SET, FEATURE_ID, FEATURE_SET, FEATURE_VALUE, PREDICTION, PREDICTION_COST, PREDICTION_DETAILS, PREDICTION_PROBABILITY, and PREDICTION_SET) support deployment of models within the context of existing applications, improve scoring performance, and enable pipelining of results involving data mining predictions. For more information, see "Data Mining Functions".
Graphical interfaces: Oracle Data Miner and Oracle Spreadsheet Add-In for Predictive Analytics are graphical interfaces that solve data mining problems. See "Graphical Interfaces" for a brief overview.

The end result of data mining is a model. Often this model is deployed so that its results can be embedded in an application. ODM provides facilities for deployment described in "Model Deployment".

5.2.1 Java and PL/SQL Interfaces

The Java and PL/SQL programmatic interfaces provide the facilities to do basic data preparation (binning, normalization, winsorizing, clipping, and missing values treatment.) The two interfaces also provide calls that build, test, and apply the models described in Chapter 3 and Chapter 4.

The ODM Java interface and the ODM PL/SQL interface have the same capabilities. Models produced by either interface are interoperable, for example, a model can be built using one interface and applied using the other interface.

5.2.2 Automated Data Mining

The PL/SQL package DBMS_PREDICTIVE_ANALYTICS automates the data mining process from data preprocessing through model building to scoring new data. This automation provides a simple and intuitive interface. The package provides an important tool that simplifies data mining for users who are not data mining experts.

DBMS_PREDICTIVE_ANALYTICS provides the following functionality:

EXPLAIN - Rank attributes in order of influence in explaining a target column
PREDICT - Predict the value of an attribute

For detailed information about DBMS_PREDICTIVE_ANALYTICS, see the Oracle Database PL/SQL Packages and Types Reference.

The Oracle Spreadsheet Add-In for Predictive Analytics provides a graphical user interface to DBMS_PREDICTIVE_ANALYTICS; the Add-In is briefly described in "Graphical Interfaces".

5.2.3 Data Mining Functions

The data mining functions are SQL functions that apply existing ODM models; they also return information about existing ODM models. The functions are as follows:

CLUSTER_ID: Returns the cluster identifier of the predicted cluster with the highest probability for a specified set of predictors.
CLUSTER_PROBABILITY: Returns a measure of the degree of confidence of membership of an input row in a cluster associated with the specified model.
CLUSTER_SET: Returns a varray of objects containing all possible clusters and the probabilities for the returned clusters that a given row belongs to subject to certain filtering criteria.
FEATURE_ID: Returns the identifier of the feature (in a feature extraction model) with the highest coefficient value.
FEATURE_SET: Returns a varray of objects containing all possible features and the feature values in a feature extraction model subject to certain filtering criteria.
FEATURE_VALUE: Returns the value of a given feature in a feature extraction model.
PREDICTION: Returns the best prediction for a classification or regression model given a set of predictors.
PREDICTION_COST: Returns the cost for a prediction made using an ODM classification model. Applies to decision tree models only if a cost matrix was specified during build.
PREDICTION_DETAILS: Returns an XML string containing model-specific information about the scoring by an ODM decision tree or NB model of an input row.
PREDICTION_PROBABLILITY: Returns the probability for a prediction made using an ODM classification model.
PREDICTION_SET: Returns a varray of objects that contains all classes, the probability for each class, the costs for a decision tree model in a classification prediction.

The data mining functions have many benefits, the most important of which are the following:

The functions make deployment of models within the context of existing applications straightforward since existing SQL statements can be easily enhanced with them.
The functions greatly improve scoring (model apply) performance.
The functions enable pipelining of results involving data mining predictions; this permits, among other things, the ability to return a few results quickly to an end user.

For more information about the SQL data mining functions, see the Oracle Database SQL Reference.

5.2.4 Graphical Interfaces

ODM has two graphical interfaces, both of which are available as downloads from Oracle Technology Network:

Oracle Data Miner is a user interface to ODM that helps data analysts and application developers build advanced business intelligence applications based on ODM. ODM Java Code Generator is an extension to Oracle JDeveloper that exports models created using Oracle Data Miner to Java code.
Oracle Spreadsheet Add-In for Predictive Analytics enables Microsoft Excel users to mine data in Oracle tables or Excel spreadsheets using the features of the DBMS_PREDICTIVE_ANALYTICS PL/SQL package.

5.2.5 Model Deployment

It is co mmon to build models on one system and then deploy the models to a production system. The ODM Scoring Engine, described in Chapter 7, supports common deployment scenarios.

ODM supports data mining model export and import in native format between Oracle databases or schemas to provide a way to move models.

Model export/import is supported at different levels, as follows:

Database export/import. When a DBA exports a full database using the expdp utility, all the existing data mining models in the database will be exported. When a DBA imports a database dump using the impdp utility, all the data mining models in the dump will be restored.
Schema export/import. When a user or DBA exports a schema using expdp, all the data mining models in the schema will be exported. When the user or DBA imports the schema dump using impdp, all the models in the dump will be imported.
Selected model export/import. Both ODM interfaces include calls that export or import specific models, for example, the PL/SQL interface includes DBMS_DATA_MINING.export_model() and DBMS_DATA_MINING.import_model().

ODM model export and model import are based on the Oracle DBMS Data Pump. When you export a model, the tables that constitute the model and the associated metadata are written to a dump file set that consists of one or more files. When you import a model, the tables and metadata are retrieved from the file and restored in the new database.

For detailed information about model export/import, see Oracle Data Mining Administrator's Guide.