Thursday, March 13, 2014

Extract, Transform, Load

Extract, transform and load (ETL) refers to the process of collecting data from one or more sources, integrating it, applying transformations, and loading it in the right format into a target such as an operational data store, data mart, or data warehouse. This structure provides a platform for further reporting and analysis.

ETL Process


Extract - data is extracted from the source system and made available for cleansing and transformation. During extraction, care should be taken to avoid any negative impact on the source system in terms of performance and response time.

Data can be extracted in several ways, the two most common being incremental load and full load. An incremental load brings in only new and changed data; unchanged data is not reloaded. A full load (or destructive load), on the other hand, deletes the existing data and reloads everything from scratch.
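Below is a minimal sketch, in Python, of how the two extraction modes might differ in practice. It assumes a source table named orders with a last_updated timestamp column; both names are hypothetical and serve only to illustrate the idea.

```python
# Sketch of full vs. incremental extraction using Python's built-in sqlite3
# module. The "orders" table and its "last_updated" column are assumptions
# made for illustration only.
import sqlite3

def extract(conn, mode="incremental", last_run=None):
    """Return rows from the source: either everything or only recent changes."""
    if mode == "full":
        # Full (destructive) load: pull everything and rebuild the target.
        query, params = "SELECT * FROM orders", ()
    else:
        # Incremental load: pull only rows added or changed since the last run.
        query, params = "SELECT * FROM orders WHERE last_updated > ?", (last_run,)
    return conn.execute(query, params).fetchall()

# Example usage against a local SQLite source:
# conn = sqlite3.connect("source.db")
# changed_rows = extract(conn, mode="incremental", last_run="2014-03-01 00:00:00")
```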



Clean - the process of cleansing the extracted data to handle null values, inconsistent data formats, incompatible data types, missing values, duplicates and other anomalies.
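As a rough illustration, the cleansing step might look like the following pandas sketch; the column names (order_date, amount, customer_id) are assumptions chosen purely for the example.

```python
# Minimal data-cleansing sketch with pandas; column names are illustrative.
import pandas as pd

def clean(df: pd.DataFrame) -> pd.DataFrame:
    df = df.drop_duplicates()                                             # remove duplicate records
    df["order_date"] = pd.to_datetime(df["order_date"], errors="coerce")  # normalize inconsistent date formats
    df["amount"] = pd.to_numeric(df["amount"], errors="coerce")           # coerce incompatible data types
    df["amount"] = df["amount"].fillna(0)                                 # handle null / missing values
    df = df.dropna(subset=["customer_id"])                                # drop rows missing a required key
    return df
```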

Transform - in the transform stage, business rules are applied to the data as it moves from source to target. Important transformations include generating surrogate keys, performing aggregations, joining sources, sorting, and creating derived (calculated) fields. In some cases, the target system requires very little or no manipulation of the data at all. Slowly changing dimensions can also be handled in the transformation phase.
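The sketch below shows what a few of these transformations could look like in pandas: joining two sources, deriving a calculated field, aggregating to the target grain and generating a simple surrogate key. The DataFrames and column names are assumptions, not taken from any particular system.

```python
# Illustrative transformation step: join, derive, aggregate, surrogate key.
import pandas as pd

def transform(orders: pd.DataFrame, products: pd.DataFrame) -> pd.DataFrame:
    # Join the two sources on the natural product key.
    df = orders.merge(products, on="product_id", how="left")

    # Derived, calculated field.
    df["line_total"] = df["quantity"] * df["unit_price"]

    # Aggregate to the grain required by the target (product per day).
    daily = (df.groupby(["product_id", "order_date"], as_index=False)
               .agg({"quantity": "sum", "line_total": "sum"})
               .rename(columns={"line_total": "sales_amount"}))

    # Generate a simple surrogate key for the target rows.
    daily["sales_key"] = range(1, len(daily) + 1)
    return daily
```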

Load - the transformed data is finally loaded into a target, usually the data warehouse, where it takes the form of fact and dimension tables. The load process can insert each record as a new row using individual SQL INSERT statements, or it can bulk insert records into the table. Bulk insert is a good option when there is a large amount of data, but individual records are not integrity-checked as they are loaded.
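To make the contrast concrete, here is a small sketch of the two loading styles using Python's sqlite3 module; the fact_sales table and its columns are hypothetical.

```python
# Row-by-row insert vs. bulk insert into an assumed "fact_sales" table.
import sqlite3

INSERT_SQL = ("INSERT INTO fact_sales "
              "(sales_key, product_id, order_date, quantity, sales_amount) "
              "VALUES (?, ?, ?, ?, ?)")

def load_row_by_row(conn, rows):
    # Each record is inserted individually, so problems can be caught per row.
    for row in rows:
        conn.execute(INSERT_SQL, row)
    conn.commit()

def load_bulk(conn, rows):
    # Bulk insert: much faster for large volumes, but a failure is harder to
    # trace back to an individual record.
    conn.executemany(INSERT_SQL, rows)
    conn.commit()
```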

Need for an ETL tool 


If there is a need to manage the integration code as things evolve in an organization and to integrate data from various sources, a centralized ETL tool makes the work more manageable and scalable. GUI-based, centralized, multi-user ETL tools enable teams to transform data and document the whole process during development.

Some of the other important factors include data connectivity, performance, transformation flexibility and data quality. It is therefore pragmatic to use an ETL tool for a reasonably complex business that is evolving over time.

In order to facilitate a future business model for my client, I am currently exploring the capabilities of Talend. Talend Open Studio is a powerful set of open source products for data management and application integration projects.

Exciting times ahead!

References:

http://www.dataintegration.info/etl
http://bi-insider.com/data-warehousing/three-steps-in-etl-processing/
Wikipedia



Thursday, February 27, 2014

BI Requirements Gathering

Requirements gathering is a critical phase in any IT project, whether technical or non-technical. It is understood that requirements always come before design: requirements focus on "building the right system"; design emphasizes "building the system right". Given the vast scope that BI offers, gathering requirements for a BI system is a challenging proposition. A business intelligence system should deliver an environment that is business focused rather than an application that performs specific functions, and BI should be embedded into business processes so that it can be used to improve them.

Gathering requirements is a three-step process commonly known as elicit, specify, and test (EST). Requirements elicitation is the practice of collecting requirements from stakeholders through interviews, questionnaires, brainstorming and prototyping. The most important part of this phase is to identify what the users 'need' (and what can be successfully built) by asking the right questions.


The complete description of each requirement, along with the non-functional requirements of the system, is captured in the requirement specification. Each requirement in the specification describes a property or characteristic that the BI system should possess in order to add value to the business users.


A requirement specification includes:

  • WHAT - Describes the requirement along with the features of its individual attributes. A requirements traceability matrix is also included, which enables tracking of requirements.
  • WHY - Provides the rationale for the BI system, describing the value to be achieved.
  • WHO - Describes the source of the requirement and the business users who will benefit from the system.

No matter how well a BI system is developed, it will only do what the requirements ask it to do. Requirement testing is a very important step in the process. As part of requirement testing, each requirement must be validated to ensure that it is clear, unambiguous and feasible.

On a lighter note, a comic strip to summarize the importance of requirement gathering in a project life cycle:


More to follow in my next blog!

References:

http://www.miproconsulting.com/blog/2009/06/bi-requirements-gathering
http://www.kimballgroup.com/data-warehouse-business-intelligence-consulting/business-intelligence-requirements/
https://cours.etsmtl.ca/mti820/public_docs/lectures/TenMistakesToAvoidWhenGatheringBIRequirements.pdf
http://tdwi.org/research/2008/09/ten-mistakes-to-avoid-when-gathering-bi-requirements.aspx

Thursday, February 13, 2014

Business Intelligence on Cloud

In the last few years, cloud computing has grown exponentially and is expected to become a billion dollar industry by 2016. Business Intelligence is all about delivering meaningful information to the right people at the right time. Cloud computing provides an agile and lightweight platform for BI applications. Combining business intelligence and cloud computing could unleash greater benefits to a firm.

There are a few key factors which work in favor of Cloud Business Intelligence:

  • Speed of Implementation and Deployment: Very easy to set up and deploy, since no additional hardware or software installations are required.
  • Elasticity: The immense computing power available in the cloud can be leveraged and scaled up or down based on changing needs.
  • Accessibility: These applications allow mobile and browser users to access and control everything in the cloud.
  • Cost Benefits: Data center and IT management costs are greatly reduced, and flexible pricing models offer cost-effective options.
A cloud model is typically categorized into Infrastructure-as-a-Service (IaaS), Platform-as-a-Service (PaaS) and Software-as-a-Service (SaaS):



- IaaS provides data storage, managed networking, virtual servers, load balancers and operating system environments.

- PaaS, which sits on the IaaS layer, includes managed middleware, such as databases, application servers, data integration services and BI platforms.

- SaaS extends this stack to include the application layer. Most vendors provide cloud business intelligence as an online service that can be accessed through web-based interfaces.

Cloud BI Vendors - Currently, several startups such as Birst, GoodData and Indicee are niche players in the cloud BI market whose principal delivery model is the cloud. Big players such as MicroStrategy, SAP and Oracle are also expanding their cloud capabilities and are steadily moving into the cloud space.


Though companies are still wary of the capabilities of the cloud, many enterprises are migrating from their on-premise applications to the cloud in order to realize the potential benefits - reduced costs, improved flexibility and faster deployment. Cloud solutions offer self-service BI and on-demand BI through private or public clouds, or even a hybrid model, based on business needs.

Cloud BI is here to stay!

References:

http://www.tcs.com/SiteCollectionDocuments/White%20Papers/HighTech_Whitepaper_Business_Intelligence_Cloud_0412-1.pdf
http://www.klipfolio.com/resources/articles/what-is-cloud-business-intelligence
http://www.informationweek.com/software/information-management/5-big-business-intelligence-trends-for-2014/d/d-id/1113468

Wednesday, February 5, 2014

Practical Steps for Dimensional Modeling


In my previous blog, I highlighted some of the key differences between a star schema and an OLAP cube. Although OLAP cubes deliver superior performance thanks to aggregations and pre-calculations, it is recommended to first build a star schema containing detailed, atomic information, which then serves as the foundation for optional OLAP cubes.

Some of the key reasons for this:
  • An OLAP cube has to be rebuilt and reprocessed each time it needs to reflect new data.
  • It is difficult to design a single OLAP cube that answers most of the business questions.
  • An OLAP cube has limited ad-hoc capabilities – calculations have to be designed into the cube.
  • OLAP cubes can grow very large and therefore present storage constraints.
Now that the importance of a star schema (or a dimensional model in general) is recognized, how do we go about building it? 

Four-step process for Dimensional Modeling:


1. Select the Business Process - The first step, in which the business process (usually a verb) is identified by understanding the business requirements and the underlying data sources. In a retail store, typical business processes include purchasing, orders, shipments and inventory.

2. Declare the Grain - A critical step in dimensional modeling that specifies exactly what a single row in the fact table represents. The grain is usually the lowest level of detail in the warehouse. A highly granular model will be able to handle unexpected user requests to drill down to the lowest level of detail. Depending on the business needs, product transaction details at the day level or order details aggregated at the month level are examples of grains in a retail store.


3. Identify the Dimensions - Dimensions can be easily identified once the grain of the fact table has been established. They provide the descriptive context for each row of the fact table. For example, dimensions such as date, product and store will be key to answering various questions about a particular transaction in a retail store.

4. Identify the Facts - The final step where the facts which appear in the fact table are decided. These facts can be additive, semi-additive or non-additive and should always be true to the grain which is defined in Step 2 of the dimensional modeling process. Sales amount, quantity etc. should be captured in the fact table at the grain defined previously (product and day level). If there are facts which are at a different grain, they should be part of a different fact table. 
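Putting the four steps together, a minimal star schema for the retail example might look like the sketch below, expressed as DDL run through Python's sqlite3 module. All table and column names are illustrative assumptions (process: retail sales; grain: one row per product, per store, per day).

```python
# Minimal star schema sketch: three dimensions and one fact table at the
# declared grain. Names are hypothetical and chosen for illustration.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_date    (date_key INTEGER PRIMARY KEY, full_date TEXT, month TEXT, quarter TEXT, year INTEGER);
CREATE TABLE dim_product (product_key INTEGER PRIMARY KEY, product_name TEXT, brand TEXT, category TEXT);
CREATE TABLE dim_store   (store_key INTEGER PRIMARY KEY, store_name TEXT, city TEXT, state TEXT);

-- Fact table: one row per product, per store, per day; facts are additive.
CREATE TABLE fact_sales (
    date_key     INTEGER REFERENCES dim_date(date_key),
    product_key  INTEGER REFERENCES dim_product(product_key),
    store_key    INTEGER REFERENCES dim_store(store_key),
    quantity     INTEGER,
    sales_amount REAL
);
""")
```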

Although the process seems to be fairly straight-forward in theory, a number of best practices need to be followed to design an effective dimensional model:

  • Good understanding of business requirements along with knowledge of underlying source systems improves the chances of addressing most of the user requests.
  • Data modeling should be a collective team effort. A data modeler should lead the team consisting of business users, analysts and BI developers. 
  • Once there is consensus on the high-level diagram, the team should delve deeper into the model by getting into discussions about each of the tables, columns and the data rules governing them. 
  • The last phase involves validation of the data model with the team members, IT, business users and other concerned parties. Once the model is approved by the steering group, further steps to implement the data warehouse can be taken.

To summarize, a lot of factors have to be considered while designing a dimensional model irrespective of the nature of business and the complexity of requirements.

Are you now ready to design your first dimensional model? Well, I am! 

References:

The Data Warehouse Toolkit, Ralph Kimball and Margy Ross
http://www.seemoredata.com/en/entry.php?10-Differences-between-CUBES-and-Star-Schema
Wikipedia


Monday, January 27, 2014

Star Schema v/s OLAP Cube

In the first couple of weeks, we have recognized the importance of dimensional modeling in the design of a data warehouse. A dimensional model simplifies the database and allows the BI software to present data to end users in a meaningful and efficient way. A dimensional model can be implemented using star schemas or OLAP cubes.

Star Schema:

A star schema is the simplest form of dimensional model. It consists of one or more fact tables in the center surrounded by dimension tables (which provide context to the facts). The structure resembles a star, hence the name star schema. Star schemas are implemented on a relational database management system.


OLAP Cube:

An OLAP (Online Analytical Processing) cube is multi-dimensional and consists of aggregated and pre-calculated summary data. As a result, an OLAP cube provides superior performance and enables business users to slice and dice along the various dimensions and drill up and drill down through hierarchies.


Slice-and-Dice operation

By slicing across the product dimension of the cube, the product manager can analyze the sales of a particular product, while the store manager can observe the sales of a particular store by slicing across the location dimension.

A “slice” is a cut across the cube, and a “dice” is a smaller cube produced by the intersection of several slices.
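A tiny pandas example may help make the two operations concrete; the sales data below is made up for illustration.

```python
# Slice and dice illustrated on a small, made-up sales table.
import pandas as pd

sales = pd.DataFrame({
    "product": ["Laptop", "Laptop", "Phone", "Phone"],
    "store":   ["Boston", "Austin", "Boston", "Austin"],
    "month":   ["Jan",    "Jan",    "Feb",    "Feb"],
    "amount":  [1200,     900,      600,      650],
})

# Slice: a cut across one dimension (all sales of the product "Laptop").
laptop_slice = sales[sales["product"] == "Laptop"]

# Dice: the intersection of slices on several dimensions
# (Laptop sales in the Boston store during January).
laptop_boston_jan = sales[(sales["product"] == "Laptop") &
                          (sales["store"] == "Boston") &
                          (sales["month"] == "Jan")]
```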


Drill-up and Drill-down operation

To analyze the data at different levels, the user is provided with three hierarchies:
  • Time Hierarchy
  • Product Hierarchy
  • Location Hierarchy
 

In this scenario, the manager will be able to examine the data for different product types in various cities of a particular state, for all the quarters of a particular year. This is possible through the drill-up and drill-down functionality of multidimensional cubes.
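As a rough sketch, drill-up and drill-down can be mimicked with pandas groupby over the hierarchy columns; the hierarchy levels and figures below are assumptions used only for illustration.

```python
# Drill-up / drill-down illustrated as aggregation at different hierarchy levels.
import pandas as pd

sales = pd.DataFrame({
    "year":    [2013, 2013, 2013, 2013],
    "quarter": ["Q1", "Q1", "Q2", "Q2"],
    "state":   ["TX", "TX", "TX", "MA"],
    "city":    ["Austin", "Dallas", "Austin", "Boston"],
    "amount":  [500, 300, 450, 700],
})

# Drill-up: roll the detail up to a coarser level (sales by state per year).
by_state = sales.groupby(["year", "state"])["amount"].sum()

# Drill-down: descend to a finer level (sales by city per quarter).
by_city = sales.groupby(["year", "quarter", "state", "city"])["amount"].sum()
```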

Both star schemas and OLAP cubes offer their own benefits and shortcomings. Therefore, based on a good understanding of requirements, an efficient star schema or/and an OLAP cube can be designed to answer difficult business questions!

More to follow in my next blog. Stay tuned!

References:
  • The Data Warehouse Toolkit, Ralph Kimball and Margy Ross
  • Database Systems – Design, Implementation and Management, Peter Rob and Carlos Coronel
  • Wikipedia