Sunday, 30 October 2011

CLM Warehouse Concepts

Rational Reporting for Development Intelligence (RRDI) is a very complicated name for what is essentially a reporting engine for the CLM warehouse.  Looking at the entire solution at first glance can be quite intimidating especially if one has not been exposed to warehouse concepts before. In this blog I will cover the CLM warehouse in preparation for subsequent blogs about RRDI, the schemas and reporting. So, I will cover warehousing concepts and general operation of the CLM warehouse first in preparation for creating a report in RRDI in the next blog.

CLM Basics for Data Warehousing

Lets start with a very simple scenario: I have a database and I want to report off of it.  In our case, the database is one of the CLM applications, lets say CCM. What are the possibilities?  Previous to the release of 3.0, using BIRT was one, and it still is, though now it is being succeeded by RRDI which is probably the better choice.

You may wonder at this point, why not point BIRT or some other report writer directly at the CCM database and be done with it? Here are some answers:

a) Much of the extensibility of the Jazz products comes from the fact that XML is used as the underlying technology. So, in many cases, the operational data store of CCM, for example, has a lot of XML data in it that is hard to work with in a report writer.

b) Performance. Usually you don't want to have an intensive reports running against a live repository serving operational requests (or an OLTP, online transaction processing database).

c) Trending. Sometimes you want to collect data for reporting on later.

d) Reporting directly from the OLTP is not supported, probably for the three reasons above ( and perhaps others ).

The way to get around these issues is to create another database from which to report from that has a schema that is friendly to reporting. 
This is a good point to introduce the CLM warehouse component which is a separate database for reporting that you create during CLM setup.  Let's look at the setup.

There are various sections ask for database details for a warehouse like below:
This is basically database configuration parameters. The last user, RPTUSER, is a database user which is used by RDDI to log in to the database when executing a report.

At this point I should also mention the section below:

The user specified here is the user that is used by the process that transfers data from CCM (or other CLM product) into the new database we created above, the data warehouse. The process uses these user credentials to log into CCM to extract the data, so it is governed by this user's access rights.

So in our scenario, now we have two databases: CCM and the data warehouse (and also a process, called an Extract, Transfer and Load (ETL) process)and we know that the CCM database is a proprietary database and the data warehouse is a report friendly database. 

Now I will cover these databases and process in a little more detail...

CLM Data Warehouse: A report-friendly database

What makes a database report friendly? Its, all in the schema. There are two primary schemas in the CLM warehouse: a normalized form (NF) schema (or operational data store) and a star schema or dimensional model. I won't go into too much detail about these topic as there are several good books about the topic but i will cover them briefly here.

A Normalized form (NF) database
This is a very common form database and comes in many degrees of purity. There are even standards of normal form purity, 1NF, 2NF, 3NF, 3.5NF, 4NF, 5NF and 6NF. For people who are comfortable with object oriented type of paradigms, this is the database design that it closest to that. For example, the 3NF form has the following rules for tables:
  • No columns with duplicate data
  • All columns must be dependent on the primary key
  • Columns which have data subsets that are identical across many rows are moved to a separate table and use foreign keys to establish the relationship. (An example would be a table with people, and the brand of the care they own. Since many people own the same cars, cars should be put into a separate table. )
In the case where are tables in a database are 3NF then the database is considered 3NF.

As an example, here is an entity relationship diagram of our car/people example to give a simple example of a normalized form:

A Dimensional form database (star schema)
A star schema look like a, you guessed it, a star. The points of the star, take the form of dimension tables and the center of the star is another table, called the fact table. The dimension tables are connect through a foreign key relationship to the fact table. When you represent the foreign key relationships by lines you get an n-point stars for n dimensions. The dimension table contains an entity or objects (with keys) that are related together through the fact table with the foreign keys. The fact table qualifies this  relationship with facts or numbers. For people who are more comfortable with functional notation, it can be described as follows:

F(D1,D2,Dx) = f1,f2,fy

In other words, given a set of 1 to x dimensions, we will get will 1..n facts. The great thing about the dimensional form is that it is quite a familiar concept.  If we happen only to specify a subset of all available dimensions then facts can be "rolled up" or summarized using a function. Functions can be average, summations, maximum, minimum, etc.

Here is our car/people example as a star schema, in this case we are counting the car by color:
Using "roll ups".  I can get the number of cars, irrespective of color using a aggregate function in a query. Our functions would be,

F_CAR_OWNERSHIP(Gender, Person, City, Car) = #Red, #Blue, #White, #Black, #Green

For example:

F_CAR_OWNERSHIP("Male","Pat Smith","Albany","BMW") = 1,0,0,0,0

This tell use that John Smith has a red BMW. We could leave out Gender and aggregate and we would probably get the same result unless there is a female "Pat Smith". A better aggregate would be to get the number of BMW cars in Albany:

F_CAR_OWNERSHIP(City, Car) = SUM(#Red+#Blue+#White+#Black+#Green)

F_CAR_OWNERSHIP("Albany","BMW") = 10

You could also model another star that would give the answer to this function without rolling up.

Dimensional modeling provides a easy and familiar concepts to calculate values efficiently using simple queries. Providing these types of activity in a report would be quite inefficient as you would need to calculate these values using inefficient and sometimes complicated queries with many joins. This is why this paradigm is popular in data warehouses.

The CLM Data warehouse contains both the normalized form (RIODS schema) and also contains a star schema (RIDW schema).   The star schema is populated from the normalized schema using a another ETL process that performs all the required calculations and roll ups. When to use normalized vs. dimensional really depends on the report you wish to create. Generally speaking, if your report is about facts (numbers) you should use a star schema as it has data already calculated with minimal table joins and is much more efficient. If it is a list of objects of some sort then the normalized form.

Documentation about these CLM warehouse schemas can be found here.

ETL Processes

As mentioned above there are ETL processes that transfer data from CCM, RM and QM to the CLM data warehouse, or more specifically, the normalized form (RIODS).  These processes are available to be run in the Report section of the respective application administrative areas in CLM 3.x. For example, for CCM it looks like this:

Note that "Data Warehouse" is selected. The other setting is for the previous "Data Mart" that was used in previous versions. Also Note the four ETL process. You generally run all of them if you want all the CCM data to show up in reports but you can also define better fidelity if you only want work items to be transferred, for example.

You can also run all the jobs for all the CLM applications on a scheduled basis from the report section in JTS admin. From here, there is also the job that transfers the data from the normalized form (RIODS) to the star schema (RIDW).

So this gives a round up of the reasoning for having three databases for reporting (actually two, but one having two schemas) and a description of the processes which transfer data between them and how to activate them in CLM. Here is a diagram of the overall CLM picture:

In the next blog we will use RRDI to examine these schemas from the report designer and create a report!