3 Managing Models

Models created by ODM APIs or by Oracle Data Miner are stored in the Database. This chapter provides information about viewing, accessing, configuring, exporting and importing models.

This chapter contains the following topics:

Models in the Database
Import/Export
Model Settings

3.1 Models in the Database

A model is identified by its name. Like tables in the database, a model has storage associated with it. But unlike a table, the form, shape, and content of this storage is opaque to the user. A model is not a database schema object.

You can view the contents of a model — that is, the patterns and rules that constitute a mining model — using algorithm-specific GET_MODEL_DETAILS functions in the DBMS_DATA_MINING PL/SQL package. These functions are documented in Oracle Database PL/SQL Packages and Types Reference. See "Exploring Model Details" for information on model details in the Java API.

You can view a list of the models in your schema by querying the DM_USER_MODELS view. The columns of the DM_USER_MODELS view are described in Table 3-1.

Table 3-1 DM_USER_MODELS View

Column	Data Type	Description
`name`	`VARCHAR2(25)`	Name of the model.
`function_name`	`VARCHAR2(30)`	The model function. See Chapter 1 for an overview of mining functions.
`algorithm_name`	`VARCHAR2(30)`	The algorithm used by the model. See Chapter 1 for algorithms used by the mining functions.
`ctime_creation_date`	`DATE`	The date on which the model was created.
`build_duration`	`NUMBER`	The duration of the model build process.
`target_attribute`	`VARCHAR2(30)`	The attribute designated as the target of a classification model.
`model_size`	`NUMBER`	The size of the model in megabytes.

Note:

Metadata about models is stored in system tables whose names have the prefix DM$ or DM. You should not attempt to query or modify these system tables, and you should not use DM$ or DM_ prefixes in the names of any tables used by ODM applications.

The following query lists the demo programs in the DM_USER schema.

SQL> select 'NAME', 'FUNCTION_NAME', 'ALGORITHM_NAME' from DM_USER_MODELS;

NAME                  FUNCTION_NAME         ALGORITHM_NAME          
--------------------------------------------------------------------
ABN_SH_CLAS_SAMPLE    CLASSIFICATION        ADAPTIVE_BAYES_NETWORK  
AI_SH_SAMPLE          ATTRIBUTE_IMPORTANCE  MINIMUM_DESCRIPTION_LENGTH
AR_SH_SAMPLE          ASSOCIATION_RULES     APRIORI_ASSOCIATION_RULES 
DT_SH_CLAS_SAMPLE     CLASSIFICATION        DECISION_TREE             
KM_SH_CLUS_SAMPLE     CLUSTERING            KMEANS                    
NB_SH_CLAS_SAMPLE     CLASSIFICATION        NAIVE_BAYES               
NMF_SH_SAMPLE         FEATURE_EXTRACTION    NONNEGATIVE_MATRIX_FACTOR  
OC_SH_CLUS_SAMPLE     CLUSTERING            O_CLUSTER    
SVMC_SH_CLAS_SAMPLE   CLASSIFICATION        SUPPORT_VECTOR_MACHINES    
SVMO_SH_CLAS_SAMPLE   CLASSIFICATION        SUPPORT_VECTOR_MACHINES  
SVMR_SH_REGR_SAMPLE   REGRESSION            SUPPORT_VECTOR_MACHINES    
T_SVM_CLAS_SAMPLE     CLASSIFICATION        SUPPORT_VECTOR_MACHINES

See Also:

Oracle Data Mining Administrator's Guide for information about installing, running, and viewing the demo programs.

3.1.1 Model Names

Although ODM models are not stored as Oracle schema objects, their names must conform to Database requirements for nonquoted identifiers. Additionally, model names must be less than 25 bytes long.

Oracle requires that nonquoted identifiers contain only alphanumeric characters, the underscore (_), dollar sign ($), and pound sign (#); the initial character must be alphabetic. Oracle strongly discourages the use of the dollar sign and pound sign in nonquoted literals.

See Also:

Oracle Database SQL Reference for information on schema object naming requirements.

3.1.2 Model Access

Oracle Data Mining does not support a general privilege model that spans multiple users. GRANT and REVOKE of read and update privileges on a mining model across user schemas are not yet supported.

You can only read and update models in your own schema. If you want to modify the settings of a model or view its details, you must be logged in with the identity of the schema that owns the model. Results of all mining operations are generated in the schema that owns the model.

Models in one schema can be exported to other schemas. You can import a model into your own schema once it has been exported to an accessible location.

3.2 Import/Export

Mining models are included when a database or schema is exported or imported with the Oracle Data Pump utility. You can export and import individual models or groups of models using the ODM SQL and Java APIs.

You can use the EXPORT_MODEL procedure in the DBMS_DATA_MINING package to export a model or a group of models to a dump file. Models can be imported from the dump file using IMPORT_MODEL.

The Java API uses the ExportTask and ImportTask standard JDM interfaces to provide the same export/import functionality.

See Also:

Oracle Data Mining Administrator's Guide for more information on model export/import.
Oracle Database Utilities for information on Oracle Data Pump.

3.3 Model Settings

A settings table is a relational table that provides configuration information for a specific model. You must create a settings table if you want a model to have any nondefault characteristics. You will supply the name of the settings table when you create the model.

You must create the settings table in the schema of the model. You can choose the name of the settings table, but the column names and their types must be defined as shown.

Column Name	Data Type
`setting_name`	`VARCHAR2(30)`
`setting_value`	`VARCHAR2(128)`

The values inserted into the setting_name column are one or more of several constants defined in the DBMS_DATA_MINING package. Depending on what the setting name denotes, the value for the setting_value column can be a predefined constant or the actual numerical value corresponding to the setting itself. The setting_value column is defined to be VARCHAR2. You can explicitly cast numerical inputs to string using the TO_CHAR() function, or you can rely on the implicit type conversion provided by the Database.

The settings described in Table 3-2 apply to a mining function. Use these settings to specify the algorithm that the model will use, the location of cost matrix and prior probabilities tables, and other function-specific characteristics. See Table 1-1, "Predictive Data Mining Functions" and Table 1-2, "Descriptive Data Mining Functions" for information about mining functions.

Table 3-2 Data Mining Function Settings

Algorithm Settings	Setting Value (with Permissible Value Ranges)
`algo_name`	Classification: One of: `algo_naive_bayes` (Default) `algo_support_vector_machines` (Use this setting for both SVM and One-Class SVM `algo_adaptive_bayes_network` `algo_decision_tree` Regression: `algo_support_vector_machines` Association Rules: `algo_apriori_association_rules` Clustering: `algo_kmeans` (Default) `algo_o_cluster` Feature Extraction: `algo_nonnegative_matrix_factor` Attribute Importance: `algo_ai_mdl`
`clas_cost_table_name`	The name of a relational table that specifies a cost matrix. The column requirements for this table are described in "Costs". This input is applicable only for Decision Tree algorithms, since this is the only algorithm that supports a cost matrix at build time. The cost matrix table must be present in the current user's schema.
`clas_priors_table_name`	The name of a relational table that specifies prior probabilities. The column requirements for this table are described in "Priors". This input is applicable only for classification algorithms. Decision Tree is the only classification algorithm that does not use priors. The prior probabilities table must be present in the current user's schema. For SVM classification, this setting identifies a table of weights.
`clus_num_clusters`	`TO_CHAR(numeric_expr` `>= 1`) Number of clusters generated by a clustering algorithm. Default is 10.
`feat_num_features`	`TO_CHAR(numeric_expr` `> = 1)` Number of features to be extracted. Default value estimated from the data by the algorithm.
`asso_max_rule_length`	`TO_CHAR(2 <=` `numeric_expr` `<= 20)` Maximum rule length for AR algorithm. Default is 4.
`asso_min_confidence`	`TO_CHAR(0 <=` `numeric_expr` `<= 1)` Minimum confidence value for AR algorithm Default is 0.1.
`asso_min_support`	`TO_CHAR(0 <=` `numeric_expr` `<= 1)` Minimum support value for AR algorithm Default is 0.1.

Table 3-3 through Table 3-9 provide algorithm-specific settings. You can use these settings to tune the behavior of the algorithm.

Table 3-3 Algorithm Settings for Adaptive Bayes Network

Setting Name	Setting Value (with Permissible Value Ranges)
`abns_model_type`	Model type for Adaptive Bayes Network: `abns_single_feature` `abns_multi_feature` (Default) `abns_naive_bayes)`
`abns_max_build_minutes`	`TO_CHAR(numeric_expr` `>= 0)` The maximum time threshold for completion of model build. Default is 0, which implies no time limit.
`abns_max_nb_predictors`	`TO_CHAR(numeric_expr` `> 0)` Maximum number of predictors, measured by their MDL ranking, to be considered for building an ABN model of type `abns_naive_bayes`. Default is 10.
`abns_max_predictors`	`TO_CHAR(numeric_expr` `> 0)` Maximum number of predictors, measured by their MDL ranking, to be considered for building an ABN model of type `abns_single_feature` or `abns_multi_feature`. Default is 25.

Table 3-4 Algorithm Settings for Naive Bayes

Setting Name Setting Value (with Permissible Value Ranges)

Setting Name	Setting Value (with Permissible Value Ranges)
`nabs_singleton_threshold`	`TO_CHAR(0 <=` `numeric_expr` `<=1)` Value of singleton threshold for NB algorithm Default value is 0.01
`nabs_pairwise_threshold`	`TO_CHAR(0 <=` `numeric_expr` `<=1)` Value of pairwise threshold for NB algorithm Default is 0.01.

nabs_singleton_threshold

TO_CHAR(0 <= numeric_expr <=1)

Value of singleton threshold for NB algorithm

Default value is 0.01

nabs_pairwise_threshold

TO_CHAR(0 <= numeric_expr <=1)

Value of pairwise threshold for NB algorithm

Default is 0.01.

Table 3-5 Algorithm Settings for Decision Tree

Setting Name	Setting Value (with Permissible Value Ranges)
`tree_impurity_metric`	Tree impurity metric for Decision Tree. Tree algorithms seek the best test question for splitting data at each node. The best splitter and split value are those that result in the largest increase in target value homogeneity for the entities in the node. Homogeneity is measured in accordance with a metric. For classification (Binary or multi-class targets), the supported metrics are gini and entropy. `tree_impurity_entropy` `tree_impurity_gini` (Default)
`tree_term_max_depth`	`TO_CHAR(2 <=` `numeric_expr` `<= 20)` Criteria for splits: maximum tree depth (the maximum number of nodes between the root and any leaf node, including the leaf node). Default is 7.
`tree_term_minpct_node`	`TO_CHAR(0 <=` `numeric_expr` `<= 10)` No child shall have fewer records than this number, which is expressed as a percentage of the training rows. Default is 0.05, indicating 0.05%.
`tree_term_minpct_split`	`TO_CHAR(0 <=` `numeric_expr` `<= 20)` Criteria for splits: minimum number of records in a parent node expressed as a percent of the total number of records used to train the model. No split is attempted if number of records is below this value. Default is 0.1, indicating 0.1%.
`tree_term_minrec_node`	`TO_CHAR(numeric_expr` `>= 0)` No child shall have fewer records than this number. Default is 10.
`tree_term_minrec_split`	`TO_CHAR(numeric_expr` `>= 0)` Criteria for splits: minimum number of records in a parent node expressed as a value. No split is attempted if number of records is below this value. Default is 20.

Table 3-6 Algorithm Settings for Support Vector Machines

Setting Name	Setting Value (with Permissible Value Ranges)
`svms_active_learning`	Whether active learning is enabled or disabled: `svms_al_disable` `svms_al_enable` (Default) When active learning is enabled, the SVM algorithm uses active learning to build a reduced size model. When active learning is disabled, the SVM algorithm builds a standard model.
`svms_kernel_function`	Kernel for Support Vector Machine: `svms_linear` (Default `svms_gaussian`
`svms_kernel_cache_size`	`TO_CHAR(numeric_expr` `> 0)` Value of kernel cache size for SVM algorithm. Applies to Gaussian kernel only. Default is 50000000 bytes.
`svms_conv_tolerance`	`TO_CHAR(numeric_expr` `> 0)` Convergence tolerance for SVM algorithm Default is 0.001.
`svms_std_dev`	`TO_CHAR(numeric_expr` `> 0)` Value of standard deviation for SVM algorithm This is applicable only for Gaussian kernel Default value estimated from the data by the algorithm
`svms_complexity_factor`	`TO_CHAR(numeric_expr` `> 0)` Value of complexity factor for SVM algorithm (both classification and regression) Default value estimated from the data by the algorithm
`svms_epsilon`	`TO_CHAR(numeric_expr` `> 0)` Value of epsilon factor for SVM Regression Default value estimated from the data by the algorithm
`svms_outlier_rate`	`TO_CHAR(0 <numeric_expr` `< 1)` The desired rate of outliers in the training data. Valid for One-Class SVM models only. Cannot be used with `svms_complexity_factor`. Default is 0.1.

Table 3-7 Algorithm Settings for Non-Negative Matrix Factorization

Setting Name Setting Value (with Permissible Value Ranges)

Setting Name	Setting Value (with Permissible Value Ranges)
`nmfs_random_seed`	`TO_CHAR(numeric_expr)` Random seed for NMF algorithm. Default is –1.
`nmfs_num_iterations`	`TO_CHAR(1 <=` `numeric_expr` `<= 500)` Number of iterations for NMF algorithm Default is 50
`nmfs_conv_tolerance`	`TO_CHAR(0 <` `numeric_expr` `<= 0.5)` Convergence tolerance for NMF algorithm Default is 0.05

nmfs_random_seed

TO_CHAR(numeric_expr)

Random seed for NMF algorithm.

Default is –1.

nmfs_num_iterations

TO_CHAR(1 <= numeric_expr <= 500)

Number of iterations for NMF algorithm

Default is 50

nmfs_conv_tolerance

TO_CHAR(0 < numeric_expr <= 0.5)

Convergence tolerance for NMF algorithm

Default is 0.05

Table 3-8 Algorithm Settings for O-Cluster

Setting Name Setting Value (with Permissible Value Ranges)

Setting Name	Setting Value (with Permissible Value Ranges)
`oclt_max_buffer`	`TO_CHAR(numeric_expr` `> 0)` Buffer size for O-Cluster. Default is 50,000.
`oclt_sensitivity`	`TO_CHAR(0 <=numeric_expr` `<= 1)` A fraction that specifies the peak density required for separating a new cluster. The fraction is related to the global uniform density. Default is 0.5.

oclt_max_buffer

TO_CHAR(numeric_expr > 0)

Buffer size for O-Cluster.

Default is 50,000.

oclt_sensitivity

TO_CHAR(0 <=numeric_expr <= 1)

A fraction that specifies the peak density required for separating a new cluster. The fraction is related to the global uniform density.

Default is 0.5.

Table 3-9 Algorithm Settings for k-Means

Setting Name	Setting Value (with Permissible Value Ranges)
`kmns_distance`	Distance Function for k-Means Clustering: `kmns_euclidean` (Default) `kmns_cosine` `kmns_fast_cosine`
`kmns_iterations`	`TO_CHAR(0 <` `numeric_expr` `<= 20)` Number of iterations for k-Means algorithm Default is 3
`kmns_conv_tolerance`	`TO_CHAR(0<` `numeric_expr` `<= 0.5)` Convergence tolerance for k-Means algorithm Default is 0.01
`kmns_split_criterion`	Split criterion for k-Means Clustering: `kmns_variance` (Default) `kmns_size`
`kmns_num_bins`	`TO_CHAR(numeric_expr` `> 0)` Number of histogram bins. Specifies the number of bins in the attribute histogram produced by k-Means. The bin boundaries for each attribute are computed globally on the entire training data set. The binning method is equi-width. All attributes have the same number of bins with the exception of attributes with a single value that have only one bin. Default is 10.
`kmns_block_growth`	`TO_CHAR(1 <` `numeric_expr` `<= 5)` Growth factor for memory allocated to hold cluster data Default value is 2
`kmns_min_pct_attr_support`	`TO_CHAR(0<=` `numeric_expr<=1)` The fraction of attribute values that must be non-null in order for the attribute to be included in the rule description for the cluster. Setting the parameter value too high in data with missing values can result in very short or even empty rules. Default is 0.1.

This example creates a settings table for an SVM classification model and edits the individual values using SQL DML.

CREATE TABLE drugstore_settings (
  setting_name VARCHAR2(30),
  setting_value VARCHAR2(128))

BEGIN
-- override the default for convergence tolerance for SVM Classification
INSERT INTO drugstore_model_settings (setting_name, setting_value)
VALUES (dbms_data_mining.svms_conv_tolerance, TO_CHAR(0.081));
COMMIT;
END;

The table function GET_DEFAULT_SETTINGS provides you all the default settings for mining functions and algorithms. If you intend to override all the default settings, you can create a seed settings table and edit them using SQL DML.

BEGIN
CREATE TABLE drug_store_settings AS
SELECT setting_name, setting_value
  FROM TABLE (DBMS_DATA_MINING.GET_DEFAULT_SETTINGS
 WHERE setting_name LIKE 'SVMS_%';
-- update the values using appropriate DML
END;

You can also create a settings table based on another model's settings using GET_MODEL_SETTINGS.

BEGIN
CREATE TABLE my_new_model_settings AS
SELECT setting_name, setting_value 
  FROM TABLE (DBMS_DATA_MINING.GET_MODEL_SETTINGS('my_other_model'));
END;

3.3.1 Costs

In classification models, you can specify a cost matrix to represent the costs associated with false positive and false negative predictions. A cost matrix can be used in testing and scoring most classification models.

The Decision Tree algorithm can use a cost matrix at build time.To specify the cost matrix, you must create a cost matrix table and provide its name in the clas_cost_table_name setting for the Decision Tree model. See "Build Settings" for an example.

If you are using the Java API, instantiate a CostMatrix object and specify the name of the table as a parameter to the dmeConn.saveObject method for the object.

The cost matrix table must have these columns.

Column Name	Data Type
`actual_target_value`	`VARCHAR2(4000)` for categorical targets `NUMBER` for numeric targets
`predicted_target_value`	`VARCHAR2(4000)` `NUMBER` for numeric targets
`cost`	`NUMBER`

ODM enables you to evaluate the cost of predictions from classification models in an iterative manner during the experimental phase of mining, and to eventually apply the optimal cost matrix to predictions on the actual scoring data in a production environment.

The data input to each test computation (a COMPUTE procedure in PL/SQL, or a TestMetrics object in Java) is the result generated from applying the model on test data. In addition, if you also provide a cost matrix as an input, the computation generates test results taking the cost matrix into account. This enables you to experiment with various costs for a given prediction against the same APPLY results, without rebuilding the model and applying it against the same test data for every iteration.

Once you arrive at an optimal cost matrix, you can then input this cost matrix to the RANK_APPLY operation along with the results of APPLY on your scoring data. RANK_APPLY will provide your new data ranked by cost.

See Also:

Oracle Data Mining Concepts for more information on cost matrix.

3.3.2 Priors

In most classification models, you can specify prior probabilities to offset differences in distribution between the build data and the real population (scoring data). Priors can be used in building any classification model that uses a Bayesian algorithm. Priors are not used by the Decision Tree algorithm.

To specify prior probabilities, you must create a priors table and provide its name in the clas_priors_table_name setting for the model. If you are using the Java API, use a setPriorProbabilitiesMap object in the classification function settings for the model.

SVM Classification uses weights to correct for differences in target distribution. Use the priors table to specify weights for SVM Classification models.

The priors table must have these columns.

Column Name	Data Type
`target_value`	`VARCHAR2` for categorical targets `NUMBER` for numeric targets
`prior_probability`	`NUMBER`

See Also:

Oracle Data Mining Concepts for more information on prior probabilities.