Oracle® Data Mining Application Developer's Guide 11g Release 1 (11.1) Part Number B28131-01 |
|
|
View PDF |
This chapter provides several examples that show how you can perform complex data mining operations by using simple SQL queries and PL/SQL procedures. The examples show techniques for obtaining valuable information about your customers. This kind of information is needed to answer business questions, such as: How should we position a product? When should we introduce a new product? What pricing and sales promotion strategies should we use?
Note:
The base interfaces to Oracle Data Mining are PL/SQL packages and SQL functions. Most of the examples throughout this manual use PL/SQL and SQL code.Oracle Data Mining also supports a standards-based Java API, which is layered on the SQL API. The Java API is described in "Data Mining Java API" and in Chapter 7.
See Also:
Oracle Database SQL Language Reference for syntax of the SQL Data Mining functions.
Oracle Database PL/SQL Packages and Types Reference for syntax of the PL/SQL API.
Oracle Data Mining Java API Reference (javadoc) for syntax of the Java API.
This chapter contains the following topics:
The code fragments in this section show how you can apply data mining models to learn more about your customers.
These examples use models that were previously created. They exist in Oracle Database as database objects. For the sake of simplicity, we can assume that you created the models, that they exist within your schema, and that you are executing the SQL statements that apply them. You could have used the PL/SQL API or the Java API to create the models.
See Also:
Chapter 3, "Creating a Model"The primary method for applying data mining models is by executing specialized SQL functions. There exists a separate family of SQL functions for prediction, clustering, and feature extraction.
See Also:
Chapter 4, "Scoring and Deployment"Note:
The examples in the first two sections of this chapter illustrate SQL syntax for different use cases, but the code uses hypothetical object names. For example, a table might be namedmy_table
.The query in Example 1-1 divides customers into segments and predicts customer attrition.
The segments are created by a clustering model named clus_model
. The model uses heuristics to divide the customer database into groups that have similar characteristics.The prediction is generated by a classification model called svmC_model
.
The model predicts the attrition probability for each customer in each segment. The data is returned by segment, ordered by the overall attrition propensity in the segment.
Example 1-1 Predict Attrition for Customer Segments
SELECT count(*) as cnt, AVG(PREDICTION_PROBABILITY(svmC_model, 'attrite' USING *)) as avg_attrite, AVG(cust_value_score) FROM customers GROUP BY CLUSTER_ID(clus_model USING *) ORDER BY avg_attrite DESC;
The sophisticated analytics within this seemingly simple query let you see your customer base within natural groupings based on similarities. The SQL data mining functions show you where your most loyal and least loyal customers are. This information can help you make decisions about how to sell your product. For example, if most of the customers in one segment have a high probability to attrite, you might take a closer look at the demographics of that segment when considering your marketing strategy.
The CLUSTER_ID
function in Example 1-1 applies the mining model clus_model
to the customers
table. It returns the segment for each customer. The number of segments is determined by the clustering algorithm. Oracle Data Mining supports two clustering algorithms: enhanced k-Means and O-Cluster.
The PREDICTION_PROBABILITY
function in Example 1-1 applies the mining model svmC_model
and returns the probability to attrite for each customer. Oracle Data Mining supports a number of classification algorithms. In this case, the model name implies that Support Vector Machine was chosen as the classifier.
The sample query in Example 1-2 returns the ten customers who are most likely to attrite, based on age
, gender
, annual_income
, and zipcode
.
In addition, since annual_income
is often missing, the PREDICTION
function is used to perform missing value imputation for the annual_income
attribute. The PREDICTION
function applies the regression model, svmR_model
, to predict the most likely annual income. The NVL
function replaces missing annual income values with the resulting prediction.
Example 1-2 Predict Customer Attrition and Transform Missing Income Values
SELECT * FROM ( SELECT cust_name, cust_contact_info FROM customers ORDER BY PREDICTION_PROBABILITY(tree_model, 'attrite' USING age, gender, zipcode, NVL(annual_income, PREDICTION(svmR_model USING *)) as annual_income) DESC) WHERE rownum < 11;
These examples use a one-class SVM model to discover atypical customers (outliers), find common demographic characteristics of the most typical customers, and compute the probability that a new or hypothetical customer will be a typical affinity card holder.
Find the top 10 outliers -- customers that differ the most from the rest of the population. Depending on the application, such atypical customers can be removed from the data.
SELECT cust_id FROM ( SELECT cust_id FROM svmo_sh_sample_prepared ORDER BY prediction_probability(SVMO_SH_Clas_sample, 0 using *) DESC, 1) WHERE rownum < 11;
Compute the probability of a new or hypothetical customer being a typical affinity card holder. Normalization of the numerical attributes is performed on-the-fly.
WITH age_norm AS ( SELECT shift, scale FROM SVMO_SH_sample_norm WHERE col = 'AGE'), yrs_residence_norm AS ( SELECT shift, scale FROM svmo_sh_sample_norm WHERE col = 'YRS_RESIDENCE') SELECT prediction_probability(SVMO_SH_Clas_sample, 1 using (44 - a.shift)/a.scale AS age, (6 - b.shift)/b.scale AS yrs_residence, 'Bach.' AS education, 'Married' AS cust_marital_status, 'Exec.' AS occupation, 'United States of America' AS country_name, 'M' AS cust_gender, 'L: 300,000 and above' AS cust_income_level, '3' AS houshold_size ) prob_typical FROM age_norm a, yrs_residence_norm b;
Find demographic characteristics of the typical affinity card members. These statistics will not be influenced by outliers and are likely to provide a more truthful picture of the population of interest than statistics computed on the entire group of affinity members.
SELECT a.cust_gender, round(avg(a.age)) age, round(avg(a.yrs_residence)) yrs_residence, count(*) cnt FROM mining_data_one_class_v a WHERE PREDICTION(SVMO_SH_Clas_sample using *) = 1 GROUP BY a.cust_gender ORDER BY a.cust_gender;
This example uses a classification model to predict who will respond to a marketing campaign for DVDs and why.
First predict the responders.
This statement uses the PREDICTION
and PREDICTION_DETAILS
functions to apply the model campaign_model
.
SELECT cust_name, PREDICTION(campaign_model USING *) AS responder, PREDICTION_DETAILS(campaign_model USING *) AS reason FROM customers;
Combine the predictions with relational data.
This statement combines the predicted responders with additional information from the sales
table. In addition to predicting the responders, it shows how much each customer has spent for a period of three months before and after the start of the campaign.
SELECT cust_name, PREDICTION(campaign_model USING *) AS responder, SUM(CASE WHEN purchase_date < 15-Apr-2005 THEN purchase_amt ELSE 0 END) AS pre_purch, SUM(CASE WHEN purchase_date >= 15-Apr-2005 THEN purchase_amt ELSE 0 END) AS post_purch FROM customers, sales WHERE sales.cust_id = customers.cust_id AND purchase_date BETWEEN 15-Jan-2005 AND 14-Jul-2005 GROUP BY cust_id, PREDICTION(campaign_model USING *);
Combine the predictions and relational data with multi-domain, multi-database data.
In addition to predicting responders, find out how much each customer has spent on DVDs for a period of three months before and after the start of the campaign.
SELECT cust_name, PREDICTION(campaign_model USING *) as responder, SUM(CASE WHEN purchase_date < 15-Apr-2005 THEN purchase_amt ELSE 0 END) AS pre_purch, SUM(CASE WHEN purchase_date >= 15-Apr-2005 THEN purchase_amt ELSE 0 END) AS post_purch FROM customers, sales, products@PRODDB WHERE sales.cust_id = customers.cust_id AND purchase_date BETWEEN 15-Jan-2005 AND 14-Jul-2005 AND sales.prod_id = products.prod_id AND CONTAINS(prod_description, 'DVD') > 0 GROUP BY cust_id, PREDICTION(campaign_model USING *);
Evaluate the effectiveness and significance of the information you have obtained.
Compare the success rate of predicted responders and non-responders within different regions and across the company. Is the success statistically significant?
SELECT responder, cust_region, COUNT(*) AS cnt, SUM(post_purch – pre_purch) AS tot_increase, AVG(post_purch – pre_purch) AS avg_increase, STATS_T_TEST_PAIRED(pre_purch, post_purch) AS significance FROM ( SELECT cust_name, cust_region PREDICTION(campaign_model USING *) AS responder, SUM(CASE WHEN purchase_date < 15-Apr-2005 THEN purchase_amt ELSE 0 END) AS pre_purch, SUM(CASE WHEN purchase_date >= 15-Apr-2005 THEN purchase_amt ELSE 0 END) AS post_purch FROM customers, sales, products@PRODDB WHERE sales.cust_id = customers.cust_id AND purchase_date BETWEEN 15-Jan-2005 AND 14-Jul-2005 AND sales.prod_id = products.prod_id AND CONTAINS(prod_description, 'DVD') > 0 GROUP BY cust_id, PREDICTION(campaign_model USING *) ) GROUP BY ROLLUP responder, cust_region ORDER BY 4 DESC;
Predictive analytics, implemented in the DBMS_PREDICTIVE_ANALYTICS
package, support routines for making predictions, assessing attribute importance, and creating profiles. Example 1-3 shows how you could use predictive analytics to generate a customer profile.
Note:
With predictive analytics, you do not need to create a model. The routine dynamically creates and applies a model, which does not persist upon completion.The PROFILE
statement in Example 1-3 returns rules that suggest whether or not a customer is likely to use an affinity card. The rules are generated based on two predictors: customer gender and customer occupation. The rules are written as XML to a results table with these columns.
Name Type ------------------------- ------------ PROFILE_ID NUMBER RECORD_COUNT NUMBER DESCRIPTION XMLTYPE
The rule identifier is stored in PROFILE_ID
. The number of cases described by the rule is stored in RECORD_COUNT
. The XML that describes the rule is stored in the DESCRIPTION
column.
Note:
This example uses sample data based on theSH
schema. This data is used with the Data Mining sample programs. For information on the sample programs, see Oracle Data Mining Administrator's Guide.Example 1-3 Generate a Customer Profile
--create a source view 070307 CREATE VIEW cust_gend_occ_view AS SELECT cust_gender, occupation, affinity_card FROM mining_data_apply; -- describe the source data DESCRIBE cust_gend_occ_view Name Null? Type ---------------------------------------------- CUST_GENDER VARCHAR2(1) OCCUPATION VARCHAR2(21) AFFINITY_CARD NUMBER(10) -- find the rules BEGIN DBMS_PREDICTIVE_ANALYTICS.PROFILE( DATA_TABLE_NAME => 'cust_gend_occ_view', TARGET_COLUMN_NAME => 'affinity_card', RESULT_TABLE_NAME => 'profile_result'); END; / -- PROFILE has created 5 rules SELECT profile_id from cust_gend_occ_profile_results; PROFILE_ID ---------- 1 2 3 4 5 -- display the rules <SimpleRule id="1" score="1" recordCount="275"> <CompoundPredicate booleanOperator="and"> <SimpleSetPredicate field="OCCUPATION" booleanOperator="isIn"> <Array type="string">"Exec." "Prof." "Protec." </Array> </SimpleSetPredicate> <SimpleSetPredicate field="CUST_GENDER" booleanOperator="isIn"> <Array type="string">"M"</Array> </SimpleSetPredicate> </CompoundPredicate> <ScoreDistribution value="1" recordCount="146"/> <ScoreDistribution value="0" recordCount="129"/> </SimpleRule> <SimpleRule id="2" score="0" recordCount="124"> <CompoundPredicate booleanOperator="and"> <SimpleSetPredicate field="OCCUPATION" booleanOperator="isIn"> <Array type="string">"Exec." "Prof." "Protec." </Array> </SimpleSetPredicate> <SimpleSetPredicate field="CUST_GENDER" booleanOperator="isIn"> <Array type="string">"F" </Array> </SimpleSetPredicate> </CompoundPredicate> <ScoreDistribution value="0" recordCount="96"/> <ScoreDistribution value="1" recordCount="28"/> </SimpleRule> <SimpleRule id="3" score="0" recordCount="397"> <CompoundPredicate booleanOperator="and"> <SimpleSetPredicate field="CUST_GENDER" booleanOperator="isIn"> <Array type="string">"M" </Array> </SimpleSetPredicate> <SimpleSetPredicate field="OCCUPATION" booleanOperator="isIn"> <Array type="string">"Crafts" "Sales" "TechSup" "Transp." </Array> </SimpleSetPredicate> </CompoundPredicate> <ScoreDistribution value="0" recordCount="289"/> <ScoreDistribution value="1" recordCount="108"/> </SimpleRule> <SimpleRule id="4" score="0" recordCount="316"> <CompoundPredicate booleanOperator="and"> <SimpleSetPredicate field="CUST_GENDER" booleanOperator="isIn"> <Array type="string">"M" </Array> </SimpleSetPredicate> <SimpleSetPredicate field="OCCUPATION" booleanOperator="isIn"> <Array type="string"> "?" "Cleric." "Farming" "Handler" "House-s" "Machine" "Other" </Array> </SimpleSetPredicate> </CompoundPredicate> <ScoreDistribution value="0" recordCount="277"/> <ScoreDistribution value="1" recordCount="39"/> </SimpleRule <SimpleRule id="5" score="0" recordCount="388"> <CompoundPredicate booleanOperator="and"> <SimpleSetPredicate field="OCCUPATION" booleanOperator="isIn"> <Array type="string"> "?" "Cleric." "Crafts" "Farming" "Handler" "House-s" "Machine" "Other" "Sales" "TechSup" "Transp." </Array> </SimpleSetPredicate> <SimpleSetPredicate field="CUST_GENDER" booleanOperator="isIn"> <Array type="string">"F" </Array> </SimpleSetPredicate> </CompoundPredicate> <ScoreDistribution value="0" recordCount="363"/> <ScoreDistribution value="1" recordCount="25"/> </SimpleRule>