Thursday, March 13, 2014

Extract, Transform, Load

Extract, transform and load (ETL) refers to the process of collecting data from one or more sources, integrating it, applying transformations, and loading the data in the required format into a target such as an operational data store, data mart, or data warehouse. The resulting structure provides a platform for further reporting and analysis.
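
As a rough end-to-end sketch in Python (the CSV source, its column names, and the SQLite target are all hypothetical), the three stages can be written as plain functions chained together:

```python
import csv
import sqlite3

def extract(path):
    # Pull raw rows from the source (a CSV file in this sketch).
    with open(path, newline="") as f:
        return list(csv.DictReader(f))

def transform(rows):
    # Shape the data for the target; real rules vary by project.
    return [
        {"name": r["name"].strip().title(),
         "total": float(r["price"]) * int(r["quantity"])}
        for r in rows
    ]

def load(rows, conn):
    # Write the shaped rows into the target table.
    conn.executemany(
        "INSERT INTO sales (name, total) VALUES (:name, :total)", rows
    )
    conn.commit()

conn = sqlite3.connect("warehouse.db")
conn.execute("CREATE TABLE IF NOT EXISTS sales (name TEXT, total REAL)")
load(transform(extract("sales.csv")), conn)
```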

ETL Process


Extract - data is extracted from the source system and made available for cleansing and transformation. During extraction, care should be taken to avoid any negative impact on the source system in terms of performance and response time.

Data can be extracted in several ways, the two most important being incremental load and full load. An incremental load brings in only new and changed data; unchanged data is not reloaded. A full load (or destructive load), on the other hand, deletes the existing data and reloads everything from scratch.
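
A minimal sketch of both approaches in Python with sqlite3, assuming a hypothetical orders table that carries a last_updated timestamp; the high-water mark recorded after the previous run drives the incremental query:

```python
import sqlite3

def extract_full(conn):
    # Full load: pull every row; the target is wiped and rebuilt from this.
    return conn.execute("SELECT * FROM orders").fetchall()

def extract_incremental(conn, last_run):
    # Incremental load: pull only rows changed since the previous run,
    # using a high-water mark (here, a last_updated timestamp).
    return conn.execute(
        "SELECT * FROM orders WHERE last_updated > ?", (last_run,)
    ).fetchall()
```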



Clean - the process of cleansing a data set to handle null values, inconsistent data formats, incompatible data types, missing values, duplicates and other anomalies.
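
As an illustration using pandas (one common choice; the file and column names are invented for this sketch), a cleaning pass might handle duplicates, nulls and inconsistent formats like so:

```python
import pandas as pd

df = pd.read_csv("customers.csv")

# Drop exact duplicate records.
df = df.drop_duplicates()

# Handle nulls: fill a default where sensible, drop rows that are unusable.
df["country"] = df["country"].fillna("Unknown")
df = df.dropna(subset=["customer_id"])

# Fix inconsistent formats and incompatible types.
df["email"] = df["email"].str.strip().str.lower()
df["signup_date"] = pd.to_datetime(df["signup_date"], errors="coerce")
```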

Transform - in the transform stage, rules are applied to reshape the data as it moves from source to target. Important transformations include generating surrogate keys, performing aggregations, joining sources, sorting and creating derived calculated fields. In some cases, the target system requires very little manipulation of the data, or none at all. Slowly changing dimensions can also be handled in the transformation phase.
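
A short pandas sketch of the transformations named above (join, derived field, aggregation, surrogate key, sort), with table and column names invented for illustration:

```python
import pandas as pd

orders = pd.read_csv("orders.csv")
customers = pd.read_csv("customers.csv")

# Join two sources on their shared natural key.
merged = orders.merge(customers, on="customer_id", how="left")

# Derived calculated field.
merged["revenue"] = merged["unit_price"] * merged["quantity"]

# Aggregate up to the grain of the target table.
summary = merged.groupby("region", as_index=False)["revenue"].sum()

# Generate a surrogate key, independent of any source-system key.
summary["region_key"] = range(1, len(summary) + 1)

# Sort before loading.
summary = summary.sort_values("revenue", ascending=False)
```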

Load - the transformed data is finally loaded into a target, usually the data warehouse, where it takes the form of fact and dimension tables. A load process can insert each record as a new row using a SQL INSERT statement, or perform a bulk insert of many records at once. Bulk insert is a good option when there is a large amount of data, but individual records are not integrity-checked as they load.
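
A minimal sketch of both loading styles using Python's built-in sqlite3 module (the fact table and its columns are hypothetical); database-specific bulk utilities such as PostgreSQL's COPY go further still for very large volumes:

```python
import sqlite3

conn = sqlite3.connect("warehouse.db")
conn.execute(
    "CREATE TABLE IF NOT EXISTS fact_sales (date_key INTEGER, amount REAL)"
)
rows = [(20140301, 99.50), (20140302, 120.00), (20140303, 87.25)]

# Two alternative styles shown; a real job would pick one or the other.

# Style 1: row-by-row inserts, so each record can be checked or rejected
# individually as it loads.
for row in rows:
    conn.execute("INSERT INTO fact_sales (date_key, amount) VALUES (?, ?)", row)

# Style 2: bulk insert of the whole batch in one call; faster for large
# volumes, but one bad record aborts the entire batch.
conn.executemany("INSERT INTO fact_sales (date_key, amount) VALUES (?, ?)", rows)
conn.commit()
```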

Need for an ETL tool 


If an organization needs to keep its integration logic under source control as requirements evolve, and to integrate data from various sources, a centralized ETL tool makes the work more manageable and scalable. GUI-based, centralized, multi-user ETL tools let developers transform the data and document the whole process during development.

Other important factors include data connectivity, performance, transformation flexibility and data quality. It is therefore pragmatic to use an ETL tool for a reasonably complex business that is evolving over time.

In order to facilitate a future business model for my client, I am currently exploring the capabilities of Talend. Talend Open Studio is a powerful set of open source products for data management and application integration projects.

Exciting times ahead!
