Oracle® Data Mining Concepts
10g Release 2 (10.2)

Part Number B14339-01

3 Supervised Data Mining

This chapter describes supervised models, which are sometimes referred to as predictive models. These models predict a target value. The Java and PL/SQL Oracle Data Mining interfaces support the following supervised functions:

  • Classification

  • Regression

  • Attribute Importance

  • Anomaly Detection

This chapter also describes testing supervised models (see "Testing Supervised Models").

3.1 Classification

Classification of a collection consists of dividing the items that make up the collection into categories or classes. In the context of data mining, classification is done using a model that is built on historical data. The goal of predictive classification is to accurately predict the target class for each record in new data, that is, data that is not in the historical data.

A classification task begins with build data (also known as training data) for which the target values (or class assignments) are known. Different classification algorithms use different techniques for finding relations between the predictor attributes' values and the target attribute's values in the build data. These relations are summarized in a model; the model can then be applied to new cases with unknown target values to predict target values. A classification model can also be applied to data that was held aside from the training data to compare the predictions to the known target values; such data is also known as test data or evaluation data. The comparison technique is called testing a model, which measures the model's predictive accuracy. The application of a classification model to new data is called applying the model, and the data is called apply data or scoring data. Applying a model to data is often called scoring the data.

Classification is used in customer segmentation, business modeling, credit analysis, and many other applications. For example, a credit card company may wish to predict which customers are likely to default on their payments. Customers are divided into two classes: those who default and those who do not default. Each customer corresponds to a case; data for each case might consist of a number of attributes that describe the customer's spending habits, income, demographic attributes, etc. These are the predictor attributes. The target attribute indicates whether or not the customer has defaulted. The build data is used to build a model that predicts whether new customers are likely to default.

Classification problems can have either binary or multiclass targets. Binary targets are those that take on only two values, for example, good credit risk and poor credit risk. Multiclass targets have more than two values, for example, the product purchased (comb or hair brush or hair pin). Multiclass target values are not assumed to exist in an ordered relation to each other, for example, hair brush is not assumed to be greater or less than comb.

Classification problems may require the specification of costs (see "Costs") and priors (see "Priors").

3.1.1 Algorithms for Classification

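ODM provides the following algorithms for classification:

  • Decision Tree Algorithm

  • Naive Bayes Algorithm

  • Adaptive Bayes Network Algorithm

  • Support Vector Machine Algorithm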

Table 3-1 compares several important features of the classification algorithms.

Table 3-1 Classification Algorithm Comparison

Feature                      | Naive Bayes          | Adaptive Bayes Network              | Support Vector Machine    | Decision Tree
-----------------------------+----------------------+-------------------------------------+---------------------------+---------------------
Speed                        | Very fast            | Fast                                | Fast with active learning | Fast
Accuracy                     | Good in many domains | Good in many domains                | Significant               | Good in many domains
Transparency                 | No rules (black box) | Rules for Single Feature Build only | No rules (black box)      | Rules
Missing value interpretation | Missing value        | Missing value                       | Sparse data               | Missing value


3.1.1.1 Decision Tree Algorithm

Decision tree rules provide model transparency so that a business user, marketing analyst, or business analyst can understand the basis of the model's predictions, and therefore, be comfortable acting on them and explaining them to others.

In addition to transparency, the Decision Tree algorithm provides speed and scalability. The build algorithm scales linearly with the number of predictor attributes and on the order of n log(n) with the number of rows, n. Scoring is very fast. Both build and apply are parallelized. The Decision Tree algorithm builds models for binary and multiclass targets. It produces accurate and interpretable models with relatively little user intervention. The algorithm is implemented to handle data in the typical data table formats, to have reasonable defaults for splitting and termination criteria, to perform automatic pruning, and to handle missing values automatically. However, it does not distinguish sparse data from missing data. (See "Sparse Data" for more information.) Users can specify costs and priors.

Decision Tree does not support nested tables.

Decision Tree models can be converted to XML.

3.1.1.1.1 Decision Tree Rules

A Decision Tree model always produces rules. Decision tree rules are in the form "IF predictive information THEN target," as in "IF income is greater than $70K and household size is greater than 3 THEN the probability of Churn is 0.075."

3.1.1.1.2 XML for Decision Tree Models

You can generate XML representing a decision tree model; the generated XML satisfies the definition specified in the Data Mining Group Predictive Model Markup Language (PMML) version 2.1 specification. The specification is available at http://www.dmg.org.

3.1.1.2 Naive Bayes Algorithm

The Naive Bayes algorithm (NB) can be used for both binary and multiclass classification problems.

NB builds and scores models extremely rapidly; it scales linearly in the number of predictors and rows.

NB makes predictions using Bayes' Theorem, which derives the probability of a prediction from the underlying evidence. Bayes' Theorem states that the probability of event A occurring given that event B has occurred, P(A|B), is proportional to the probability of event B occurring given that event A has occurred multiplied by the probability of event A occurring, P(B|A)P(A).

Naive Bayes makes the assumption that each attribute is conditionally independent of the others, that is, given a particular value of the target, the distribution of each predictor is independent of the other predictors. In practice, this assumption of independence, even when violated, does not degrade the model's predictive accuracy significantly, and makes the difference between a fast, computationally feasible algorithm and an intractable one.
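The following Python fragment is a minimal sketch of how Bayes' Theorem and the independence assumption combine to score one case. The function name, the attribute names, and all probabilities are illustrative assumptions; this is not the ODM implementation.

```python
def naive_bayes_posterior(priors, likelihoods, case):
    """Return P(class | case) for each class under the Naive Bayes assumption.

    priors:      {class: P(class)}
    likelihoods: {class: {attribute: {value: P(value | class)}}}
    case:        {attribute: value}
    """
    scores = {}
    for cls, prior in priors.items():
        score = prior
        # Conditional independence: multiply the per-attribute likelihoods.
        for attr, value in case.items():
            score *= likelihoods[cls][attr][value]
        scores[cls] = score
    total = sum(scores.values())  # normalizing constant, P(case)
    return {cls: s / total for cls, s in scores.items()}


# Illustrative numbers only (assumed, not taken from ODM).
priors = {"default": 0.1, "no_default": 0.9}
likelihoods = {
    "default": {
        "income": {"low": 0.7, "high": 0.3},
        "owns_home": {"yes": 0.2, "no": 0.8},
    },
    "no_default": {
        "income": {"low": 0.4, "high": 0.6},
        "owns_home": {"yes": 0.5, "no": 0.5},
    },
}
posterior = naive_bayes_posterior(
    priors, likelihoods, {"income": "low", "owns_home": "no"}
)
```

Because each attribute contributes one multiplication per class, the work grows linearly with the number of predictors, which is consistent with the scaling described above.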

3.1.1.3 Adaptive Bayes Network Algorithm

Adaptive Bayes Network (ABN) is an Oracle proprietary algorithm that provides a fast, scalable, non-parametric means of extracting predictive information from data with respect to a target attribute. (Non-parametric statistical techniques avoid assuming that the population is characterized by a family of simple distributional models, such as standard linear regression, where different members of the family are differentiated by a small set of parameters.)

ABN, in Single Feature Build mode, can describe the model in the form of human-understandable rules. The rules produced by ABN are one of its main advantages over Naive Bayes. ABN rules provide model transparency so that a business user, marketer, or business analyst can understand the basis of the model's predictions and therefore, be comfortable acting on them and explaining them to others.

In addition to rules, ABN provides performance and scalability, which are derived via various user parameters controlling the trade-off of accuracy and build time.

ABN predicts binary as well as multiclass targets.

ABN can use costs and priors for both building and scoring (see "Costs" and "Priors").

3.1.1.3.1 ABN Model Types

An ABN model is an adaptive conditional independence model that uses the minimum description length principle to construct and prune an array of conditionally independent network features. Each network feature consists of one or more conditional probability expressions. The collection of network features forms a product model that provides estimates of the target class probabilities. There can be one or more network features. The number and depth of the network features in the model determine the model mode. There are three model modes for ABN:

  • Pruned Naive Bayes (Naive Bayes Build)

  • Simplified decision tree (Single Feature Build)

  • Boosted (Multi Feature Build)

Users can select the ABN model type. Rules are available only for Single Feature Build.

Each network feature consists of one or more attributes included in a conditional probability expression. An array of single attribute network features is an MDL-pruned Naive Bayes model. A single multi-attribute network feature model is equivalent to a simplified C4.5 decision tree; such a model is simplified in the sense that numerical attributes are binned and treated as categorical. Furthermore, a single predictor is used to split all nodes at a given tree depth. The splits are k-way, where k is the number of unique (binned) values of the splitting predictor. Finally, a collection of multi-attribute network features forms a product model (boosted mode). All three types provide estimates of the target class probabilities.

3.1.1.3.2 ABN Rules

Rules can be extracted from an Adaptive Bayes Network model as compound predicates. Rules form a human-interpretable depiction of the model and include statistics indicating the number of the relevant training data instances in support of the rule. A record apply instance specifies a pathway in a network feature taking the form of a compound predicate.

Note:

Rules are generated for the single feature build model type only.

For example, suppose the feature consists of two training attributes: Age {20-40, 40-60, 60-80} and Income {<=50K, >50K}. A record instance consisting of a person age 25 and income $42K is expressed as

IF AGE IN (20-40) and INCOME IN (<=50K)

Suppose that the associated target (for example, response to a promotion) probabilities are {0.8 (no), 0.2 (yes)}. Then we have a detailed rule of the form

IF AGE IN (20-40) and INCOME IN (<=50K) THEN Probability = {0.8, 0.2}

In addition to the probability distribution, there are the associated training data counts, e.g. {400, 100}.

Suppose there is a cost matrix specifying that it is 6 times more costly to predict a no incorrectly than it is to predict a yes incorrectly. Then the cost of predicting yes for this instance is 0.8 * 1 = 0.8 (because the model is wrong in this prediction 80% of the time) and the cost of predicting no is 0.2 * 6 = 1.2. Thus, the minimum cost (best) prediction is yes. Without the cost matrix, the decision is reversed. Implicitly, all errors are equal and we have: 0.8 * 1 = 0.8 for yes and 0.2 * 1 = 0.2 for no.
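The arithmetic in this example can be checked with a short Python sketch. The function name and the dictionary layout are hypothetical; the probabilities and the 6-to-1 cost ratio are the ones given above.

```python
def expected_costs(prob, cost):
    """Expected cost of each possible prediction.

    prob: {actual_class: probability}
    cost: {(actual_class, predicted_class): cost}
    """
    classes = list(prob)
    return {
        predicted: sum(prob[actual] * cost[(actual, predicted)] for actual in classes)
        for predicted in classes
    }


prob = {"no": 0.8, "yes": 0.2}

# Predicting "no" when the actual value is "yes" costs 6;
# predicting "yes" when the actual value is "no" costs 1.
with_cost_matrix = {
    ("no", "no"): 0, ("no", "yes"): 1,
    ("yes", "no"): 6, ("yes", "yes"): 0,
}
# Without a cost matrix, every error implicitly costs the same.
equal_costs = {
    ("no", "no"): 0, ("no", "yes"): 1,
    ("yes", "no"): 1, ("yes", "yes"): 0,
}

costs = expected_costs(prob, with_cost_matrix)
best_with_costs = min(costs, key=costs.get)

plain = expected_costs(prob, equal_costs)
best_without_costs = min(plain, key=plain.get)
```

With the cost matrix the expected costs are 0.8 for yes and 1.2 for no, so yes is chosen; with equal costs they are 0.8 for yes and 0.2 for no, so no is chosen.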

The order of the predicates in the generated rules implies relative importance.

When you apply an ABN model for which rules were generated, with a single feature, you get the same result that you would get if you wrote an external program that applied the rules.

3.1.1.4 Support Vector Machine Algorithm

Support Vector Machine (SVM) is a state-of-the-art classification and regression algorithm. SVM is an algorithm with strong regularization properties, that is, the optimization procedure maximizes predictive accuracy while automatically avoiding over-fitting of the training data. Neural networks and radial basis functions, both popular data mining techniques, have the same functional form as SVM models; however, neither of these algorithms has the well-founded theoretical approach to regularization that forms the basis of SVM.

SVM projects the input data into a kernel space. Then it builds a linear model in this kernel space. A classification SVM model attempts to separate the target classes with the widest possible margin. A regression SVM model tries to find a continuous function such that maximum number of data points lie within an epsilon-wide tube around it. Different types of kernels and different kernel parameter choices can produce a variety of decision boundaries (classification) or function approximators (regression). The ODM SVM implementation supports two types of kernels: linear and Gaussian. ODM also provides automatic parameter estimation on the basis of the characteristics of the data.
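As an illustration of the two kernel types, the following Python sketch writes each kernel as a similarity function between two numeric vectors. The parameter name sigma and the exact form of the Gaussian exponent are assumptions made for this example; the ODM parameterization may differ.

```python
import math


def linear_kernel(x, z):
    """Dot product of two equal-length numeric vectors."""
    return sum(a * b for a, b in zip(x, z))


def gaussian_kernel(x, z, sigma=1.0):
    """Gaussian kernel: similarity decays with squared Euclidean distance."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))


k_lin = linear_kernel([1.0, 2.0], [3.0, 4.0])
k_same = gaussian_kernel([1.0, 2.0], [1.0, 2.0])
k_far = gaussian_kernel([0.0, 0.0], [1.0, 1.0])
```

A linear kernel yields a linear decision boundary in the original attribute space, while a Gaussian kernel allows curved boundaries whose smoothness depends on the kernel parameter.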

SVM performs well with real-world applications such as classifying text, recognizing hand-written characters, classifying images, as well as bioinformatics and biosequence analysis. The introduction of SVM in the early 1990s led to an explosion of applications and deepening theoretical analysis that established SVM along with neural networks as one of the standard tools for machine learning and data mining.

There is no upper limit on the number of attributes and target cardinality for SVMs; the only constraints are those imposed by hardware.

SVM is the preferred algorithm for sparse data.

The following new features have been added to the SVM algorithm in ODM 10g Release 2:

3.1.1.4.1 Active Learning

SVM models grow as the size of the training data set increases. This property limits SVM models to small and medium size training sets (less than 100,000 cases). Active learning provides a way to deal with large training sets.

The termination criterion for active learning is usually an upper bound on the number of support vectors; when the upper bound is attained, the build stops. Alternatively, the stopping criterion can be qualitative, such as no significant improvement in model accuracy on a held-aside sample.

Active learning forces the SVM algorithm to restrict learning to the most informative training examples and not to attempt to use the entire body of data. In most cases, the resulting models have predictive accuracy comparable to that of the standard (exact) SVM model.

Active learning can be applied to all SVM models (classification, regression, and one-class).

Active learning is on by default. It can be turned off.

3.1.1.4.2 Sampling for Classification

For classification, SVM automatically performs stratified sampling during model build. The algorithm scans the entire build data set and selects a sample that is balanced across target values.
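The idea of a sample that is balanced across target values can be sketched in Python as follows. The function name, the per_class parameter, and the use of a seeded random generator are assumptions for illustration; the sampling procedure that ODM uses internally is not described here.

```python
import random
from collections import defaultdict


def balanced_sample(rows, target_of, per_class, seed=0):
    """Draw up to per_class rows for each distinct target value."""
    by_class = defaultdict(list)
    for row in rows:
        by_class[target_of(row)].append(row)
    rng = random.Random(seed)  # seeded so the sketch is repeatable
    sample = []
    for members in by_class.values():
        sample.extend(rng.sample(members, min(per_class, len(members))))
    return sample


# 5 positive and 95 negative cases; the sample keeps 5 of each.
rows = [("case", "pos")] * 5 + [("case", "neg")] * 95
sample = balanced_sample(rows, target_of=lambda r: r[1], per_class=5)
```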

3.1.1.4.3 Automatic Kernel Selection

SVM automatically determines the appropriate kernel type based on build data characteristics. This selection can be overridden by explicitly specifying a kernel type.

3.1.1.4.4 Data Preparation and Settings Choice for Support Vector Machines

You can influence both the Support Vector Machine (SVM) model quality (accuracy) and performance (build time) through two basic mechanisms: data preparation and model settings. Significant performance degradation can be caused by a poor choice of settings or inappropriate data preparation. Poor settings choices can also lead to inaccurate models.

For detailed information about data preparation for SVM models, see the Oracle Data Mining Application Developer's Guide.

SVM has built-in mechanisms that attempt to choose appropriate settings automatically based on the data provided. You may need to override the system-determined settings for some domains.

3.1.2 Data Preparation for Classification

This section summarizes data preparation that may be required by classification algorithms.

3.1.2.1 Outliers

Outliers affect classification algorithms as follows:

  • Naive Bayes and Adaptive Bayes Network: The presence of outliers, when external equal-width binning is used, makes most of the data concentrate in a few bins (a single bin in extreme cases). As a result, the discriminating power of these algorithms may be significantly reduced. In this case, quantile binning helps to overcome these problems.

  • Support Vector Machine: The presence of outliers can significantly impact models. Use a clipping transformation to avoid the problems caused by outliers.

  • Decision Tree: The presence of outliers does not impact decision tree models.
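The effect of an outlier on equal-width binning, and the way quantile binning avoids it, can be seen in a small Python sketch. Both functions are simplified illustrations with hypothetical names, not the ODM binning routines; the equal-width version assumes the values are not all identical, and the quantile version assigns bins by rank, so tied values may fall into different bins.

```python
def equal_width_bins(values, n_bins):
    """Assign each value to one of n_bins intervals of equal width."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    # The maximum value is placed in the last bin.
    return [min(int((v - lo) / width), n_bins - 1) for v in values]


def quantile_bins(values, n_bins):
    """Assign each value to a bin by rank, so bins hold similar counts."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    bins = [0] * len(values)
    for rank, i in enumerate(order):
        bins[i] = rank * n_bins // len(values)
    return bins


incomes = [20, 25, 30, 35, 40, 45, 50, 1000]  # one extreme outlier
ew = equal_width_bins(incomes, 4)
qb = quantile_bins(incomes, 4)
```

With equal-width binning, every value except the outlier lands in the first bin, so the attribute carries almost no information; with quantile binning, the eight values are spread two per bin.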

3.1.2.2 NULL Values

The meaning of NULL values and how to treat them depends on the algorithm as follows:

  • Support Vector Machine: NULL values indicate sparse data. Missing values are not automatically handled. If the data is not sparse and the values are indeed missing at random, it is necessary to perform missing data imputation (that is, perform some kind of missing values treatment) and substitute a non-NULL value for the NULL value. One simple approach is to use the mean for numerical attributes and the mode for categorical attributes. If you do not treat missing values, the algorithm will not handle the data correctly.

  • For all other classification algorithms, NULL values indicate missing values:

    • Decision Tree, Naive Bayes, and Adaptive Bayes Network: Missing values are handled automatically.
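The simple imputation approach mentioned for Support Vector Machine (mean for numerical attributes, mode for categorical attributes) might look like the following Python sketch. The function name is hypothetical, and None stands in for a NULL value.

```python
from statistics import mean, mode


def impute(column, categorical=False):
    """Replace None with the mode (categorical) or mean (numerical)
    of the values that are present."""
    known = [v for v in column if v is not None]
    fill = mode(known) if categorical else mean(known)
    return [fill if v is None else v for v in column]


ages = impute([30, None, 50, 40])
colors = impute(["red", None, "blue", "red"], categorical=True)
```

This treatment is only appropriate when NULL really means "value unknown"; for sparse data, where NULL means "not present", substituting the mean or mode would distort the data.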

3.1.2.3 Normalization

Support Vector Machine may benefit from normalization.

3.1.3 Costs

In a classification problem, it may be important to specify the costs involved in making an incorrect decision. Doing so can be useful when the costs of different misclassifications vary significantly.

For example, suppose the problem is to predict whether a customer will respond to a promotional mailing. The target has two categories: YES (the customer responds) and NO (the customer does not respond). Suppose a positive response to the promotion generates $500 and that it costs $5 to do the mailing. If the model predicts YES and the actual value is YES, the cost of misclassification is $0. If the model predicts YES and the actual value is NO, the cost of misclassification is $5. If the model predicts NO and the actual value is YES, the cost of misclassification is $500. If the model predicts NO and the actual value is NO, the cost is $0. In this case, you would probably want to avoid cases where the model predicts NO and the actual value is YES.

Exactly how costs are specified depends on the classification algorithm used:

  • NB and ABN use a cost matrix

  • SVM uses weights

The cost of misclassification is summarized in a cost matrix. The rows of the matrix represent actual values and the columns, predicted values. A cell in the matrix represents the misclassification cost that occurs when the model predicts the class indicated by the column when the class is really the one specified by the row.

Weights for an SVM model are automatically initialized to achieve the best average prediction across all target values. If you change a weight value, the percentage of correct predictions changes in the same way; for example, if you increase a weight value, the percent of correct predictions increases for the associated class.

Classification algorithms apply the cost information to the predicted probabilities during scoring to estimate the least expensive prediction. If a cost matrix is specified for scoring, the output of the scoring is the minimum cost for the prediction. If no cost matrix is supplied, the output is the most likely prediction.
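For the promotional mailing example, the matrix layout (rows are actual values, columns are predicted values) and the least expensive prediction can be sketched in Python as follows. The function names and the sample probabilities are assumptions for illustration; the costs are the $500 and $5 figures from the example.

```python
CLASSES = ["YES", "NO"]
# Rows are actual values, columns are predicted values.
COST = [
    [0, 500],  # actual YES: predicted YES costs 0, predicted NO costs 500
    [5, 0],    # actual NO:  predicted YES costs 5, predicted NO costs 0
]


def least_cost_prediction(probabilities):
    """probabilities: [P(YES), P(NO)] predicted by the model for one case."""
    n = len(CLASSES)
    expected = [
        sum(probabilities[row] * COST[row][col] for row in range(n))
        for col in range(n)
    ]
    best = min(range(n), key=lambda c: expected[c])
    return CLASSES[best], expected


def most_likely_prediction(probabilities):
    """Prediction when no cost matrix is supplied."""
    return CLASSES[max(range(len(CLASSES)), key=lambda c: probabilities[c])]


def break_even_probability(cost_false_positive, cost_false_negative):
    """P(YES) above which predicting YES has the lower expected cost."""
    return cost_false_positive / (cost_false_positive + cost_false_negative)


label, expected = least_cost_prediction([0.05, 0.95])
likely = most_likely_prediction([0.05, 0.95])
threshold = break_even_probability(5, 500)
```

A customer with only a 5% chance of responding is still worth mailing: predicting YES has an expected cost of $4.75, whereas predicting NO has an expected cost of $25. With these costs, YES becomes the cheaper prediction once the probability of response exceeds roughly 1%.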

You must be careful how you assign costs. In fraud detection, for example, you are making a trade-off between false positives (falsely accusing someone of fraud) and false negatives (letting a crime go unpunished). Your costs should reflect this trade-off. Perhaps you are willing to let some crimes go unpunished so that you do not falsely accuse millions of people of committing fraud; in that case, you must be sure that you are right before you accuse someone (say 99% sure, rather than just 50% sure). Predicting on probability alone means you are indifferent to the type of error you make. If you are concerned about the type of error, a cost matrix or carefully adjusted weights are warranted.

3.1.4 Priors

In building a classification model, describing the distribution in the real population can be useful when the training data does not accurately reflect the real population. The real population is described by providing the prior distribution, often referred to as the priors, to the build operation.

In many problems with a binary target, one target value dominates in frequency. For example, the positive responses for a telephone marketing campaign may be 2% or less, and the occurrence of fraud in credit card transactions may be less than 1%. A classification model built on historic data of this type may not observe enough positive cases to be able to distinguish the characteristics of the two classes; the result could be a model that when applied to new data predicts the negative class for every case. While such a model may be highly accurate, it may not be very useful. This illustrates that it is not a good idea to rely solely on accuracy when judging a model.

One solution to this problem involves creating a source table for the build operation that contains approximately equal numbers of each target value. However, the algorithm will take the observed distribution as realistic, and will build a model that will predict each of the target values in equal numbers unless it is instructed otherwise. Supplying the actual distribution of target values, the priors, to the build process can result in a more effective model.

Note that the model should be tested against data that has the actual distribution of target values, for example, 98% negative and 2% positive for the marketing campaign.
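One common way to account for priors, shown here as a hedged Python sketch, is to re-weight the probabilities produced by a model built on re-balanced data by the ratio of the true prior to the training proportion and then re-normalize. This is a textbook correction offered for illustration; it is an assumption, not a description of how ODM applies priors internally, and the function name and numbers are hypothetical.

```python
def adjust_for_priors(model_probs, train_dist, true_priors):
    """Re-weight class probabilities from a model built on re-balanced data.

    Each probability is scaled by true_prior / training_proportion and
    the results are re-normalized to sum to 1.
    """
    weighted = {
        c: model_probs[c] * true_priors[c] / train_dist[c] for c in model_probs
    }
    total = sum(weighted.values())
    return {c: w / total for c, w in weighted.items()}


# Model built on a 50/50 table; the real population is 2% positive.
adjusted = adjust_for_priors(
    {"pos": 0.6, "neg": 0.4},
    train_dist={"pos": 0.5, "neg": 0.5},
    true_priors={"pos": 0.02, "neg": 0.98},
)
```

A case that looks 60% positive on the balanced data corresponds to only about a 3% chance of being positive in the real population, which is still above the 2% base rate.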

3.2 Regression

Regression models are similar to classification models. The difference between regression and classification is that regression deals with numerical or continuous target attributes, whereas classification deals with discrete or categorical target attributes. In other words, if the target attribute contains continuous (floating-point) values or integer values that have inherent order, a regression technique can be used. If the target attribute contains categorical values, that is, string or integer values where order has no significance, a classification technique is called for. Note that a continuous target can be turned into a discrete target by binning; this turns a regression problem into a problem that can be solved using classification algorithms.

3.2.1 Algorithm for Regression

Support Vector Machine (SVM) builds both classification and regression models. For more information about SVM, see "Support Vector Machine Algorithm".

ODM SVM provides improved data-driven estimation of epsilon and the complexity factor for SVM regression models.

Active learning (see "Active Learning") can be used for regression models.

One-class SVM (see "Anomaly Detection") cannot be used for regression problems.

3.3 Attribute Importance

Attribute Importance (AI) provides an automated solution for improving the speed and possibly the accuracy of classification models built on data tables with a large number of attributes.

The time required to build ODM classification models increases with the number of attributes. Attribute Importance identifies a proper subset of the attributes that are most relevant to predicting the target. Model building can proceed using the selected attributes only.

Using fewer attributes does not necessarily result in lost predictive accuracy. Using too many attributes (especially those that are "noise") can affect the model and degrade its performance and accuracy. Mining using the smallest number of attributes can save significant computing time and may build better models.

The programming interfaces for Attribute Importance permit the user to specify a number or percentage of attributes to use; alternatively the user can specify a cutoff point.

3.3.1 Data Preparation for Attribute Importance

The presence of outliers, when external equal-width binning is used, makes most of the data concentrate in a few bins (a single bin in extreme cases). As a result, the discriminating power of an attribute importance model may be significantly reduced. In this case, quantile binning helps to overcome these problems.

NULL values are treated as missing values and not as indicators of sparse data.

3.3.2 Algorithm for Attribute Importance

ODM uses the Minimum Description Length algorithm for attribute importance.

3.3.2.1 Minimum Description Length Algorithm

Minimum Description Length (MDL) is an information theoretic model selection principle. MDL assumes that the simplest, most compact representation of data is the best and most probable explanation of the data. The MDL principle is used to build ODM Attribute Importance models.

MDL considers each attribute as a simple predictive model of the target class. These single predictor models are compared and ranked with respect to the MDL metric (compression in bits). MDL penalizes model complexity to avoid over-fit. It is a principled approach that takes into account the complexity of the predictors (as models) to make the comparisons fair.

With MDL, the model selection problem is treated as a communication problem. There is a sender, a receiver, and data to be transmitted. For classification models, the data to be transmitted is a model and the sequence of target class values in the training data.

Attribute importance uses a two-part code to transmit the data. The first part (preamble) transmits the model. The parameters of the model are the target probabilities associated with each value of the predictor. For a target with j values and a predictor with k values, with n_i (i = 1,..., k) rows per value, there are C_i possible conditional probabilities, where C_i is the combination of j-1 things taken n_i-1 at a time. The size of the preamble in bits can be shown to be Sum(log2(C_i)), where the sum is taken over the k predictor values. Computations like this represent the penalties associated with each single prediction model. The second part of the code transmits the target values using the model.

It is well known that the most compact encoding of a sequence is the encoding that best matches the probability of the symbols (target class values). Thus, the model that assigns the highest probability to the sequence has the smallest target class value transmission cost. In bits this is -Sum(log2(p_i)), where p_i is the probability that the model predicts for the target value observed in row i.

The predictor rank is the position in the list of associated description lengths, smallest first.
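The two-part code can be sketched in Python to show how single-predictor models are compared and ranked. This is a simplified illustration with hypothetical function names, not the ODM algorithm. In particular, the preamble cost below counts the ways to divide the n_i rows of a predictor value among the j target classes, which is an assumption made for this example and may not match the exact expression ODM uses.

```python
import math
from collections import Counter, defaultdict


def description_length(predictor, target):
    """Two-part code length in bits for a single-predictor model."""
    n_classes = len(set(target))
    groups = defaultdict(list)
    for x, y in zip(predictor, target):
        groups[x].append(y)

    preamble = 0.0
    data_bits = 0.0
    for ys in groups.values():
        n_i = len(ys)
        # Assumed model cost: ways to split n_i rows among the classes.
        preamble += math.log2(math.comb(n_i + n_classes - 1, n_classes - 1))
        # Cost of sending the targets using the conditional probabilities.
        for count in Counter(ys).values():
            data_bits -= count * math.log2(count / n_i)
    return preamble + data_bits


def rank_predictors(columns, target):
    """Return predictor names ordered by description length, smallest first."""
    lengths = {name: description_length(col, target) for name, col in columns.items()}
    return sorted(lengths, key=lengths.get)


target = ["y", "y", "y", "y", "n", "n", "n", "n"]
columns = {
    "informative": ["a", "a", "a", "a", "b", "b", "b", "b"],
    "noise": ["a", "b", "a", "b", "a", "b", "a", "b"],
}
ranking = rank_predictors(columns, target)
```

The informative predictor determines the target exactly, so its second part costs 0 bits; the noise predictor needs 1 bit per row. Both pay the same preamble, so the informative predictor has the shorter description and ranks first.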

3.4 Anomaly Detection

Anomaly detection consists of identifying novel or anomalous patterns. Identifying such patterns can be useful in problems of fraud detection (insurance, tax, credit card, etc.) and computer network intrusion detection. An anomaly detection model predicts whether a data point is typical for a given distribution or not. An atypical data point can be either an outlier or an example of a previously unseen class.

An anomaly detection model discriminates between the known examples of the positive class and the unknown negative set of counterexamples. An anomaly detection model identifies items that do not fit in the distribution.

Anomaly detection is a mining function in the Oracle Data Miner interface. In the ODM Java and PL/SQL interfaces, an anomaly detection model is a classification model. See "Specify the One-Class SVM Algorithm" for more information.

3.4.1 Algorithm for Anomaly Detection

Standard binary supervised classification algorithms, such as Naive Bayes, require the presence of both positive examples and negative examples (counterexamples) of a target class. One-class SVM classification requires only the presence of examples of a single target class. The model learns to discriminate between the known examples of the positive class and the unknown negative set of counterexamples. In other words, one-class SVM detects anomalies. One-class SVM was initially used to estimate the support of a distribution. The goal is to estimate a function that will be positive if an example belongs to a set and negative if the example belongs to the complement of the set. The model computes a binary function that identifies regions in the input space where the majority of the positive data lives.

One-class SVM models are useful in situations such as:

  • Outlier detection

  • Cases where it is difficult to provide counterexamples

In outlier detection, you separate the typical examples in a distribution from the atypical (outlier) examples. The distance from the separating plane indicates how typical a given point is with respect to the distribution of the training data. Outliers can be ranked on the basis of the probability of them being typical or atypical cases. In a similar way, you can use one-class SVM to detect other kinds of anomalies.

In some classes of problems, it is difficult or impossible to provide a useful and representative set of counterexamples. Examples may be easy to identify, but counterexamples are either hard to specify or expensive to collect. For example, in text document classification, it is easy to classify a document under a given topic. However, the universe of documents not belonging to this topic can be very large and it may not be feasible to provide counterexamples.

The accuracy of one-class SVM models cannot usually match the accuracy of standard SVM classifiers built with meaningful counterexamples.

SVM is the only supervised classification algorithm in ODM that can operate in one-class mode.

One-class SVM cannot be used for regression problems.

3.4.1.1 Specify the One-Class SVM Algorithm

To build a one-class SVM model in either of the ODM programmatic interfaces, select classification as the mining function, SVM as the algorithm, and pass a NULL or empty string as the target column name.

To build a one-class SVM model in Oracle Data Miner, select anomaly detection.

3.5 Testing Supervised Models

Supervised models are tested to evaluate the accuracy of their predictions. A model is tested by applying it to data with known values of the target and comparing the model's predicted values with the known values. The test data must be compatible with the data used to build the model and must be prepared in the same way that the build data was.

Model testing results in the calculation of test metrics. The exact test metrics calculated depend on the type of model. Testing a classification model results in a confusion matrix; testing a regression model results in error estimates. Optionally, lift and receiver operating characteristics (ROC) can be calculated for classification models.

The rest of this section describes these test metrics, as follows:

3.5.1 Confusion Matrix

ODM supports the calculation of a confusion matrix to assess the accuracy of a classification model. The simplest example of a confusion matrix is one for a binary classification problem. For a binary problem, the confusion matrix is a 2-by-2 square matrix. (In general, the confusion matrix is an n-by-n square matrix, where n is the number of distinct target values.) The row indexes of a confusion matrix correspond to actual values observed and used for model testing; the column indexes correspond to predicted values produced by applying the model to the test data. For any pair of actual/predicted indexes, the value indicates the number of records classified in that pairing. Figure 3-1 contains an example of a confusion matrix. For example, a value of 25 for an actual value index of "buyer" and a predicted value index of "nonbuyer" indicates that the model incorrectly classified a "buyer" as a "nonbuyer" 25 times. A value of 516 for an actual/predicted value index of "buyer" indicates that the model correctly classified a "buyer" 516 times.

The predictions were correct 516 + 725 = 1241 times, and incorrect 25 + 10 = 35 times. The sum of the values in the matrix is equal to the number of scored records in the input data table. The number of scored records is the sum of correct and incorrect predictions, which is 1241 + 35 = 1276. The error rate is 35/1276 = 0.0274; the accuracy rate is 1241/1276 = 0.9726.
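These figures can be reproduced with a few lines of Python. The placement of the values 10 and 725 in the "nonbuyer" row is inferred from the totals quoted above, and the variable names are illustrative.

```python
# Rows are actual values, columns are predicted values.
labels = ["buyer", "nonbuyer"]
confusion = [
    [516, 25],  # actual buyer:    516 predicted buyer, 25 predicted nonbuyer
    [10, 725],  # actual nonbuyer: 10 predicted buyer, 725 predicted nonbuyer
]

# Correct predictions lie on the diagonal of the matrix.
correct = sum(confusion[i][i] for i in range(len(labels)))
total = sum(sum(row) for row in confusion)
incorrect = total - correct

accuracy = correct / total
error_rate = incorrect / total
```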

A confusion matrix provides a quick understanding of model accuracy and the types of errors the model makes when scoring records. It is the result of a test task for classification models.

Figure 3-1 Confusion Matrix


3.5.2 Lift

ODM supports computing lift for a classification model. Lift can be computed for binary (two-valued) targets. Lift can also be computed for multiclass targets by designating a preferred positive class and combining all other target class values, effectively turning a multiclass target into a binary target. Given a designated positive target value (that is, the value of most interest for prediction, such as "buyer" or "has disease"), test cases are sorted according to how confidently they are predicted to be positive cases. Positive cases with highest confidence come first, followed by positive cases with lower confidence. Negative cases with lowest confidence come next, followed by negative cases with highest confidence. Based on that ordering, the test cases are partitioned into quantiles, and the following statistics are calculated by the programming interfaces:

  • Probability threshold for a quantile n is the minimum probability for the positive target to be included in this quantile or any preceding quantiles (quantiles n-1, n-2,..., 1). If a cost matrix is used, a cost threshold is reported instead. The cost threshold is the maximum cost for the positive target to be included in this quantile or any of the preceding quantiles.

  • Cumulative gain for a given quantile is the ratio of the cumulative number of positive targets to the total number of positive targets.

  • Target density of a quantile is the number of true positive instances in that quantile divided by the total number of instances in the quantile.

  • Cumulative target density for quantile n is the target density computed over the first n quantiles.

  • Quantile lift is the ratio of target density for the quantile to the target density over all the test data.

  • Cumulative percentage of records for a given quantile is the percentage of all test cases represented by the first n quantiles, starting at the end that is most confidently positive, up to and including the given quantile.

  • Cumulative number of targets for quantile n is the number of true positive instances in the first n quantiles (defined as above).

  • Cumulative number of nontargets is the number of actually negative instances in the first n quantiles (defined as above).

  • Cumulative lift for a given quantile is the ratio of the cumulative target density to the target density over all the test data.

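Cumulative targets can be computed from the quantities that are available in the LiftResultElement using the following formula:

targets_cumulative = lift_cumulative * percentage_records_cumulative

By the definitions above, with the cumulative percentage of records expressed as a fraction, this product is the fraction of all positive targets captured through the given quantile (the cumulative gain); multiplying it by the total number of positive targets gives a count.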

Oracle Data Miner calculates different statistics for lift. See the online help for Oracle Data Miner for more information.
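The lift statistics defined in this section can be illustrated with a minimal Python sketch. It is not the ODM implementation: the function `lift_table`, the data set `SAMPLE`, and the equal-size binning are assumptions made for illustration, and ODM's own quantile boundaries and tie handling may differ. For a binary target, sorting by descending predicted probability of the positive class is assumed to reproduce the ordering described above.

```python
def lift_table(scored, num_quantiles):
    """Compute per-quantile lift statistics.

    scored: list of (probability_of_positive, is_positive) pairs.
    """
    if num_quantiles < 1 or len(scored) < num_quantiles:
        raise ValueError("need at least one test case per quantile")
    # Most confidently positive cases first.
    ordered = sorted(scored, key=lambda pair: pair[0], reverse=True)
    total = len(ordered)
    total_targets = sum(1 for _, is_pos in ordered if is_pos)
    if total_targets == 0:
        raise ValueError("test data contains no positive cases")
    overall_density = total_targets / total

    rows = []
    cum_records = 0
    cum_targets = 0
    for q in range(num_quantiles):
        # Integer arithmetic yields bins of (nearly) equal size.
        chunk = ordered[q * total // num_quantiles:(q + 1) * total // num_quantiles]
        targets = sum(1 for _, is_pos in chunk if is_pos)
        cum_records += len(chunk)
        cum_targets += targets
        density = targets / len(chunk)
        cum_density = cum_targets / cum_records
        rows.append({
            "quantile": q + 1,
            "probability_threshold": chunk[-1][0],  # lowest probability so far
            "target_density": density,
            "quantile_lift": density / overall_density,
            "cumulative_targets": cum_targets,
            "cumulative_nontargets": cum_records - cum_targets,
            "cumulative_gain": cum_targets / total_targets,
            "cumulative_fraction_records": cum_records / total,
            "cumulative_target_density": cum_density,
            "cumulative_lift": cum_density / overall_density,
        })
    return rows


# Hypothetical test data: (predicted probability of positive, actually positive).
SAMPLE = [
    (0.95, True), (0.90, True), (0.80, False), (0.70, True), (0.60, False),
    (0.50, True), (0.40, False), (0.30, False), (0.20, False), (0.10, False),
]

rows = lift_table(SAMPLE, 5)
for row in rows:
    print(row["quantile"], row["quantile_lift"], row["cumulative_lift"])
```

In this sketch, the cumulative lift multiplied by the cumulative fraction of records equals the cumulative gain for every quantile, which follows directly from the definitions.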

3.5.3 Receiver Operating Characteristics

Another useful method for evaluating classification models is Receiver Operating Characteristics (ROC) analysis. ROC curves are similar to lift charts in that they provide a means of comparing individual models and of determining thresholds that yield a high proportion of positive hits. ROC was originally used in signal detection theory to gauge the true hit versus false alarm ratio when sending signals over a noisy channel.

The horizontal axis of an ROC graph measures the false positive rate as a percentage. The vertical axis shows the true positive rate. The top left-hand corner is the optimal location in an ROC graph, indicating a high true positive (TP) rate and a low false positive (FP) rate. The area under the ROC curve (AUC) measures the discriminating ability of a binary classification model. The larger the AUC, the higher the likelihood that an actual positive case will be assigned a higher probability of being positive than an actual negative case. The AUC measure is especially useful for data sets with unbalanced target distribution (one target class dominates the other).
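The interpretation of the AUC as a likelihood can be made concrete with a small Python sketch. It estimates the AUC by comparing every positive case with every negative case and counting how often the positive case receives the higher predicted probability (ties count as one half). The function `auc_by_pairs` and the data set `SAMPLE` are hypothetical; this pairwise method is a standard equivalent of the area under the curve and is not necessarily how ODM computes the value.

```python
def auc_by_pairs(scored):
    """Estimate AUC as the fraction of (positive, negative) pairs in which
    the positive case has the higher predicted probability.

    scored: list of (probability_of_positive, is_positive) pairs.
    """
    positives = [p for p, is_pos in scored if is_pos]
    negatives = [p for p, is_pos in scored if not is_pos]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative case")
    wins = 0.0
    for pos in positives:
        for neg in negatives:
            if pos > neg:
                wins += 1.0
            elif pos == neg:
                wins += 0.5  # a tie counts as half a win
    return wins / (len(positives) * len(negatives))


# Hypothetical test data: (predicted probability of positive, actually positive).
SAMPLE = [
    (0.95, True), (0.90, True), (0.80, False), (0.70, True), (0.60, False),
    (0.50, True), (0.40, False), (0.30, False), (0.20, False), (0.10, False),
]

auc = auc_by_pairs(SAMPLE)
print(f"AUC = {auc:.3f}")
```

The pairwise count is quadratic in the number of test cases, so it is suitable only for small illustrations.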

In the example graph in Figure 3-2, Model A clearly has a higher AUC for the entire data set. However, if the user decides that a false positive rate of 40% is acceptable, Model B is better suited, since it achieves a better true positive rate at that false positive rate.

Figure 3-2 Receiver Operating Characteristics Curves


Besides model selection, the ROC also helps to determine a threshold value that achieves an acceptable trade-off between the hit (true positive) rate and the false alarm (false positive) rate. Selecting a point on the curve for a given model fixes a particular trade-off and the probability threshold that produces it. This threshold can then be used as a post-processing parameter for achieving the desired performance with respect to the error rates. ODM models by default use a threshold of 0.5.

The Oracle Data Mining ROC computation calculates the following statistics:

  • Probability threshold: The minimum predicted positive class probability resulting in a positive class prediction. Different threshold values result in different hit rates and false alarm rates.

  • True negatives: Negative cases in the test data with predicted probabilities strictly less than the probability threshold (correctly predicted).

  • True positives: Positive cases in the test data with predicted probabilities greater than or equal to the probability threshold (correctly predicted).

  • False negatives: Positive cases in the test data with predicted probabilities strictly less than the probability threshold (incorrectly predicted).

  • False positives: Negative cases in the test data with predicted probabilities greater than or equal to the probability threshold (incorrectly predicted).

  • Hit rate ("True Positives" in Oracle Data Miner): (true positives/(true positives + false negatives))

  • False alarm rate ("False Positives" in Oracle Data Miner): (false positives/(false positives + true negatives))
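The statistics in this list can be sketched in Python for a single probability threshold. The function `roc_point` and the data set `SAMPLE` are hypothetical names used for illustration only; the comparisons follow the definitions above (greater than or equal to the threshold predicts positive, strictly less predicts negative).

```python
def roc_point(scored, threshold):
    """Compute the ROC statistics for one probability threshold.

    scored: list of (probability_of_positive, is_positive) pairs.
    """
    tp = sum(1 for p, is_pos in scored if is_pos and p >= threshold)
    fn = sum(1 for p, is_pos in scored if is_pos and p < threshold)
    fp = sum(1 for p, is_pos in scored if not is_pos and p >= threshold)
    tn = sum(1 for p, is_pos in scored if not is_pos and p < threshold)
    if tp + fn == 0 or fp + tn == 0:
        raise ValueError("need at least one positive and one negative case")
    return {
        "true_positives": tp,
        "false_negatives": fn,
        "false_positives": fp,
        "true_negatives": tn,
        "hit_rate": tp / (tp + fn),
        "false_alarm_rate": fp / (fp + tn),
    }


# Hypothetical test data: (predicted probability of positive, actually positive).
SAMPLE = [
    (0.95, True), (0.90, True), (0.80, False), (0.70, True), (0.60, False),
    (0.50, True), (0.40, False), (0.30, False), (0.20, False), (0.10, False),
]

default_point = roc_point(SAMPLE, 0.5)    # the default threshold
stricter_point = roc_point(SAMPLE, 0.75)  # fewer false alarms, fewer hits
print(default_point)
print(stricter_point)
```

Evaluating `roc_point` over a range of thresholds and plotting the false alarm rate against the hit rate traces out the ROC curve for the model.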

3.5.4 Test Metrics for Regression Models

Regression test results provide the following measures of model accuracy:

  • Root mean square error

  • Mean absolute error

These two statistics are the metrics most commonly used to test regression models. For more information about these metrics, see the Oracle Data Mining Application Developer's Guide.
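As an illustration, both measures can be computed in a few lines of Python using their standard textbook definitions: the root mean square error is the square root of the mean of the squared differences between actual and predicted values, and the mean absolute error is the mean of the absolute differences. The function names and sample values below are hypothetical; the Application Developer's Guide remains the authority on how ODM reports these metrics.

```python
import math


def root_mean_square_error(actual, predicted):
    """Square root of the mean squared difference between actual and predicted."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared) / len(squared))


def mean_absolute_error(actual, predicted):
    """Mean of the absolute differences between actual and predicted."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


# Hypothetical test data: known target values and the model's predictions.
ACTUAL = [3.0, 5.0, 2.0, 7.0]
PREDICTED = [2.0, 5.0, 4.0, 8.0]

rmse = root_mean_square_error(ACTUAL, PREDICTED)
mae = mean_absolute_error(ACTUAL, PREDICTED)
print(f"RMSE = {rmse:.4f}, MAE = {mae:.4f}")
```

Because the errors are squared before averaging, the root mean square error penalizes large individual errors more heavily than the mean absolute error does.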