Defining Data Objects, 8 of 11

Defining Variables That Handle Sparse Data Efficiently

A variable with sparse data is one in which a relatively high percentage of the cells of the variable do not contain actual data. Such "empty," or NA, values take up storage space in the analytic workspace.

There are two types of sparsity:

Controlled sparsity occurs when a range of values of one or more dimensions has no data; for example, a new variable dimensioned by month for which you do not have data for past months. The cells exist because you have defined past months in the month dimension, but the cells are empty.
Random sparsity occurs when NA values are scattered throughout the data variable, usually because some combinations of dimension values never have any data. For example, a district might only sell certain products and never have data for other products. Other districts might sell a different selection of products.

Definition: Composite

A composite is an internal object that is used to store sparse data compactly in a variable. A composite is a list of dimension-value combinations in which one value is taken from each of the dimensions on which the composite is based.

Composites can be named or unnamed:

An unnamed composite is not a workspace object; it is merely an internal structure. When you define a variable, you use the SPARSE keyword to request that an unnamed composite is automatically created.
A named composite is an object that is you define using the DEFINE COMPOSITE command. Later, when you are defining or accessing a variable, you can specify this composite by name along with the names of other dimensions.

Because the values in composites are maintained automatically, using composites is the recommended way of handling sparsity in your analytic workspace.

Using composites is one of the most important steps you can take to manage sparsity, which contributes to keeping analytic workspace size to a minimum and promoting good performance.

Why You Should Use Named Composites

Using named composites makes it easier to track which variables share the same composite. A named composite in the dimension list of a variable tells Oracle OLAP that the dimensions in the named composite are sparse in this variable, and that this composite is shared only with other variables that have the same sparsity pattern.

In contrast, all variables defined with an unnamed composite that have exactly the same dimensions in the same order will automatically share that unnamed composite. If these variables have different sparsity patterns, performance will suffer.

You can also manage sparsity by using a conjoint dimension to hold dimension-value combinations for which a given variable has data. However, because the values in composites are automatically maintained, using composites is the recommended way of handling sparsity in your analytic workspace.

How to Use Composites

When you define a multidimensional variable, you can specify a composite in the list of dimensions.

First, define a named composite by using the DEFINE COMPOSITE command. Then, define the variables by using the following syntax to include a named composite in the dimension list of each variable.

composite-name <dims>

For example, suppose you define a composite named proddist, whose dimensions include product and district, as shown in the following command.

DEFINE proddist COMPOSITE <product district>

Now, suppose you want to define a sales variable in which time will be the fastest-varying dimension and the proddist composite will be the slowest-varying dimension, as shown in the following command.

DEFINE sales <time proddist<product district>>

Note that you should never use the SPARSE keyword with a composite. Essentially, you use the name of the composite instead of the SPARSE keyword.

See Also:

"Using Composites in Expressions" for more information about using composites.
"Working with NA Values" for more information about working with sparse data.

Naming, Renaming, and Unnaming Composites

You can use the RENAME command to:

Name an unnamed composite.
Change the name of a named composite.
Change a named composite to an unnamed composite.

Adding Data to a Variable That Uses a Composite

When you define a multidimensional variable, you can specify that a composite is used instead of its base dimensions to dimension the data. Later, as you add values to the dimensions of the variable for which you defined a composite, the following actions are taken:

The composite is filled with dimension-value combinations.
The data for the variable is stored using the composite structure rather than the structure of the base dimensions.

For a variable that uses a composite, cells are created for only those dimension values that are used in the composite dimension-value combinations; it does not create a variable cell for every value in the base dimensions. Data for a variable is stored in order, cell by cell, for each combination of dimension values. From the perspective of data storage, each combination of base dimension values in a composite is treated like the value of a regular dimension. This means that if you define a variable with one regular dimension and one composite, then it is stored like a two-dimensional variable.

Example 3-1 Defining a Variable That Uses a Named Composite

If your company does promotional marketing for certain products in some but not all districts, then your variable data will be sparse along the product and district dimensions. Therefore, suppose you define a composite named proddist, whose base dimensions are product and district. There are dimension-value combinations in the composite only for those values that have data. For example, if you run a promotion for tents but not canoes, then the composite includes the tents and city combinations, but not the canoes and city combinations.

The following command creates a variable called promo that is dimensioned by month and a composite named proddist, whose base dimensions are product and district.

DEFINE promo INTEGER <month proddist<product district>>

The following conceptual figure illustrates the promo variable that is created by this command, the month, product and district base dimensions, a named composite (proddist) created from the product and district base dimensions, and the internal relation that is created between the product and district base dimensions and the proddist composite.

Text description of the illustration compvar.gif

The following table is an example of the sequence in which the data for the promo variable might be stored.

TENTS BOSTON JAN95	TENTS BOSTON FEB95	TENTS BOSTON MAR95	. . . . . . . . .	RACQUETS CHICAGO JAN95	RACQUETS CHICAGO FEB95	. . . . . . . . .
257	379	428	. . .	635	192	. . .

Defining a Variable with a Single-Dimension Composite

When you specify a composite for just one dimension in a variable definition, a single-dimension composite is created. The values of this composite will be a subset of the values in its base dimension.

It is a good idea to use single dimension composites when a variable will share the same dimensions as some other variables, but for a particular single dimension, the variable will only have data for some of the values of the dimension.

Suppose you have already defined a variable called actual with the dimensions line, division, and month. The actual variable does not contain any NA values. You need to define a variable called budget, which requires much less detail than actual. For example, budget only needs 10 percent of the line dimension values, while actual needs all of them.

If you define budget without setting sparsity, then all of the line dimension values are present for every month and division, but 90 percent of the line dimension cells will have NA values.

To handle sparse data in this case, you define budget with an unnamed composite for only the line dimension as shown below.

DEFINE budget DECIMAL <SPARSE <line> division month>