Wednesday, February 5, 2014

Practical Steps for Dimensional Modeling


In my previous blog post, I highlighted some of the key differences between a star schema and an OLAP cube. Although OLAP cubes deliver superior query performance thanks to aggregations and pre-calculations, it is recommended to first build a star schema containing detailed, atomic data, which then serves as the foundation for optional OLAP cubes.

Some of the key reasons for this:
  • An OLAP cube has to be rebuilt and reprocessed every time new data arrives, so it is poorly suited to analyzing real-time data.
  • It is difficult to design a single OLAP cube that anticipates most business questions.
  • An OLAP cube has limited ad-hoc capability; calculations have to be designed into the cube up front.
  • An OLAP cube can grow very large because of its pre-computed aggregations, which introduces storage constraints.
Now that the importance of a star schema (or, more generally, a dimensional model) is clear, how do we go about building one?

Four-step process for Dimensional Modeling:


1. Select the Business Process - In this first step, the business process (usually expressed as a verb) is identified by understanding the business requirements and the underlying data sources. In a retail store, typical business processes include purchasing, orders, shipments and inventory.

2. Declare the Grain - This critical step specifies exactly what a single row in the fact table represents. The grain is usually the lowest level of detail captured in the warehouse; a highly granular model can handle unexpected user requests to drill down to the finest detail. Depending on business needs, example grains in a retail store include individual product transactions at the day level or order details aggregated to the month level (one such grain is illustrated in the schema sketch after this list).


3. Identify the Dimensions - Dimensions can be identified easily once the grain of the fact table has been established. They provide the descriptive context for each row in the fact table. For example, dimensions such as date, product and store are key to answering questions about a particular transaction in a retail store (see the dimension tables in the sketch after this list).

4. Identify the Facts - The final step decides which facts appear in the fact table. Facts can be additive, semi-additive or non-additive, and they should always be true to the grain defined in Step 2. Sales amount, quantity sold, etc. should be captured in the fact table at the previously declared grain. Facts that belong to a different grain should go into a separate fact table (the query sketch after this list shows how additive facts roll up across dimensions).
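
To make steps 2 and 3 concrete, here is a minimal star schema sketch for a retail sales process, assuming a grain of one row per product, per store, per day. The table and column names are purely illustrative assumptions, not taken from any particular source system.

  -- Dimension tables (step 3): descriptive context, one surrogate key each
  CREATE TABLE dim_date (
      date_key         INTEGER PRIMARY KEY,   -- surrogate key, e.g. 20140205
      full_date        DATE NOT NULL,
      day_of_week      VARCHAR(10),
      month_name       VARCHAR(10),
      calendar_quarter SMALLINT,
      calendar_year    SMALLINT
  );

  CREATE TABLE dim_product (
      product_key      INTEGER PRIMARY KEY,   -- surrogate key
      product_code     VARCHAR(20),           -- natural (source-system) key
      product_name     VARCHAR(100),
      category         VARCHAR(50),
      brand            VARCHAR(50)
  );

  CREATE TABLE dim_store (
      store_key        INTEGER PRIMARY KEY,   -- surrogate key
      store_name       VARCHAR(100),
      city             VARCHAR(50),
      region           VARCHAR(50)
  );

  -- Fact table (step 2): the declared grain is one row per day, product and store
  CREATE TABLE fact_sales (
      date_key         INTEGER NOT NULL REFERENCES dim_date (date_key),
      product_key      INTEGER NOT NULL REFERENCES dim_product (product_key),
      store_key        INTEGER NOT NULL REFERENCES dim_store (store_key),
      quantity_sold    INTEGER,               -- additive fact
      sales_amount     DECIMAL(12,2),         -- additive fact
      PRIMARY KEY (date_key, product_key, store_key)
  );

Surrogate keys on the dimension tables keep the fact table narrow and insulate it from changes to source-system identifiers.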
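
To illustrate step 4, the following query against the hypothetical tables above rolls the additive facts up from the daily grain to month and product category; because the facts are additive, they can be summed safely across any combination of dimensions.

  -- Roll additive facts up from the daily grain to month and category
  SELECT
      d.calendar_year,
      d.month_name,
      p.category,
      SUM(f.sales_amount)  AS total_sales,
      SUM(f.quantity_sold) AS total_units
  FROM fact_sales f
  JOIN dim_date    d ON d.date_key    = f.date_key
  JOIN dim_product p ON p.product_key = f.product_key
  GROUP BY d.calendar_year, d.month_name, p.category;

  -- A semi-additive fact (e.g. an inventory balance) can be summed across
  -- products or stores but not across dates; a non-additive fact (e.g. a
  -- unit price or a margin percentage) should not be summed at all.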

Although the process seems fairly straightforward in theory, a number of best practices need to be followed to design an effective dimensional model:

  • A good understanding of the business requirements, combined with knowledge of the underlying source systems, improves the chances of addressing most user requests.
  • Data modeling should be a collective team effort, led by a data modeler and involving business users, analysts and BI developers.
  • Once there is consensus on the high-level diagram, the team should delve deeper into the model, discussing each table, each column and the data rules governing them.
  • The last phase involves validating the data model with team members, IT, business users and other concerned parties. Once the model is approved by the steering group, implementation of the data warehouse can proceed.

To summarize, many factors have to be considered when designing a dimensional model, irrespective of the nature of the business and the complexity of its requirements.

Are you now ready to design your first dimensional model? Well, I am! 

References:

The Data Warehouse Toolkit, Ralph Kimball and Margy Ross
http://www.seemoredata.com/en/entry.php?10-Differences-between-CUBES-and-Star-Schema
Wikipedia


4 comments:

  1. Terrific, Pradeep! Compact and practical. As you have probably discovered, DBMSs are being presented today as capable of handling most reporting requirements without going through either the Star Schema or the OLAP Cube design. These DBMSs carry out 'in-memory' processing of raw data and have displayed some degree of success. Of course, the Relational and Star Schema exercises are invaluable for the custodians of data, for knowledge acquisition and transfer, whatever the claims of the DBMS vendors.

  2. Thanks for your comments, SC! I am quite fascinated by the power of in-memory computing - vastly shortened response times, ability to present real-time (almost!) data. I have worked quite a bit with SAP Business Objects, but never had a chance to actually use SAP HANA (arguably one of the leaders in in-memory computing today). This is one of the topics which I will continue to follow.

    Replies
    1. You're welcome, Pradeep, I am enjoying the simplicity you bring to your communication of good ideas.
      The in-memory approach has a higher overhead in computing capacity but does offer a quicker ROI when done well. My own experience is that we need to retain a hybrid data structure even as we adopt in-memory techniques, where we consolidate the summarized data incrementally.
      One of my Clients is beginning to use SAP HANA and has assured me they will keep me updated on their experience. I shall share that with you.

  3. Consumer goods giant Unilever on how the in-memory computing platform SAP HANA will boost its bottom line.
    http://www.techrepublic.com/blog/cio-insights/how-in-memory-computing-is-helping-unilever-to-maxmise-profits/
