
Extract, Transform and Load / Data Transformation

ETL process: Extract, Transform and Load is the process of transferring data from the source databases to a target database after performing the following set of operations:

·         Extract: this process entails extracting data from single or multiple data sources. The data sources could be unstructured (such as Excel spreadsheets) or structured, as in relational databases. For example, your input data sources could be SQL Server inputs, Oracle inputs, Excel files, flat files, etc.

·         Transform: This process entails the following set of operations (illustrated in the Python sketch after this list):

o   Cleansing the data: this entails removing test and redundant data from the input data sources.

o   Transforming the data: this entails performing operations on the input data set so that the output is in the required format. We could perform the following set of operations to achieve this:

§  Choosing a subset of columns from the input source.

§  Joining

§  Aggregation

§  Union

§  String operations on the character data set

§  Generate surrogate keys.

§  Lookups: this entails finding the required values from standalone tables.

§  Validation

§  Case transform: applying multiple ‘if’ conditions.

·         Load: The data is normally loaded onto the target databases. The target databases could be operational data stores (in 3rd normal form), data staging areas (keeping a copy of the data from the input data sources, after data cleansing, on a single platform) or a data warehouse (in 3rd normal form or star schema format). When we load the data onto the target databases, we need to take the following into consideration:

o   Update strategy: whether to truncate the target tables before loading them or to append to the targets. Bulk insert is also an option.

o   Parallel processing: how many parallel threads an ETL job would be executed in.

o   Job dependencies: this entails deciding on the order in which the ETL jobs need to be executed. This is normally taken care of by the script in the ETL tool.
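
The following Python/pandas sketch shows one way the extract, transform and load steps above could be strung together against a SQLite staging database. The file, table and column names (sales.csv, warehouse.db, dim_customer, customer_code, amount and so on) are illustrative assumptions, not part of any particular project:

import sqlite3

import numpy as np
import pandas as pd

# Extract: a flat-file source and a structured (SQL) source.
sales = pd.read_csv("sales.csv")                              # assumed flat-file layout
conn = sqlite3.connect("warehouse.db")                        # assumed staging/target database
customers = pd.read_sql("SELECT * FROM dim_customer", conn)   # assumed existing dimension

# Transform.
# Choose a subset of columns from the input source.
sales = sales[["order_id", "customer_code", "order_date", "amount"]]
# Cleanse: drop test records and duplicates.
sales = sales[~sales["customer_code"].str.startswith("TEST")].drop_duplicates()
# Character operations: trim and upper-case the business key before joining.
sales["customer_code"] = sales["customer_code"].str.strip().str.upper()
# Lookup/join: fetch the customer surrogate key from the dimension table.
sales = sales.merge(customers[["customer_sk", "customer_code"]],
                    on="customer_code", how="left")
# Case transform: several 'if' conditions collapsed into one derived column.
sales["order_band"] = np.select(
    [sales["amount"] >= 10000, sales["amount"] >= 1000],
    ["large", "medium"], default="small")
# Generate a surrogate key for the fact rows themselves.
sales["sales_sk"] = range(1, len(sales) + 1)
# Aggregation: total amount per customer surrogate key.
per_customer = sales.groupby("customer_sk", as_index=False)["amount"].sum()

# Load: truncate-and-load ('replace') versus append, per the update strategy.
sales.to_sql("fact_sales", conn, if_exists="replace", index=False)
per_customer.to_sql("agg_sales_by_customer", conn, if_exists="append", index=False)
conn.close()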

Data loading into the data warehouse or operational reporting stores: when we load the data into the target databases, we need to consider the following intricacies (see the sketch after this list):

·         Handling type-1 entities (attributes overwritten in place): initial and incremental load.

·         Handling type-2 entities (history preserved): initial and incremental load.

·         Working with sources with time stamps.

·         Working with sources without time stamps.
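
As a rough illustration of these points, a Python/pandas sketch of type-1 versus type-2 handling during an incremental load could look as follows. The column names (customer_code, city, modified_date, effective_from, effective_to, is_current) are assumptions for the example, not a prescribed schema:

import pandas as pd

def incremental_changes(source, last_load_time):
    # Sources with time stamps: pick up only rows changed since the previous load.
    # Sources without time stamps would instead need a full extract compared
    # against the target to detect changed rows.
    return source[source["modified_date"] > last_load_time]

def apply_type1(dim, changes):
    # Type-1: overwrite the attribute in place; no history is kept.
    dim = dim.set_index("customer_code")
    updates = changes.set_index("customer_code")["city"]
    dim.loc[updates.index.intersection(dim.index), "city"] = updates
    return dim.reset_index()

def apply_type2(dim, changes, load_date):
    # Type-2: expire the current row and insert a new row with a fresh validity window.
    expiring = dim["customer_code"].isin(changes["customer_code"]) & dim["is_current"]
    dim.loc[expiring, "effective_to"] = load_date
    dim.loc[expiring, "is_current"] = False
    new_rows = changes.assign(effective_from=load_date,
                              effective_to=pd.NaT,
                              is_current=True)
    return pd.concat([dim, new_rows[dim.columns]], ignore_index=True)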

Our firm has written a paper on handling the above four issues during our ETL projects. It can be downloaded via the following link:

Handling of fact table load: a fact table should contain the surrogate keys from the dimension tables and the measures to be analyzed. In order to load the fact tables, we extensively use lookup or outer join functions.
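
A brief pandas sketch of such a fact load, using assumed table and column names (dim_customer, dim_product, customer_code, product_code, quantity, amount), might look like this:

import pandas as pd

def load_fact(transactions, dim_customer, dim_product):
    fact = (
        transactions
        # Lookup: swap the customer business key for its surrogate key.
        .merge(dim_customer[["customer_sk", "customer_code"]],
               on="customer_code", how="left")
        # Lookup: swap the product business key for its surrogate key.
        .merge(dim_product[["product_sk", "product_code"]],
               on="product_code", how="left")
    )
    # Keep only the surrogate keys and the measures to be analyzed.
    return fact[["customer_sk", "product_sk", "quantity", "amount"]]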

Handling of aggregate table loads: when we need to aggregate the fact table data to a higher level of granularity (say the fact table is loaded on a daily basis but the aggregate needs to be at a monthly level), we generate aggregate tables. This operation normally entails performing sum and average operations on the measures in the fact table, grouped by dimension entities.
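
For example, rolling a daily-grain fact table up to a monthly aggregate could be sketched in pandas as below (order_date, customer_sk and amount are assumed column names):

import pandas as pd

def build_monthly_aggregate(fact_daily):
    # Derive the month from the daily order date, then group by it and the
    # dimension key, summing and averaging the measure.
    fact_daily = fact_daily.assign(
        order_month=fact_daily["order_date"].dt.to_period("M"))
    return (fact_daily
            .groupby(["order_month", "customer_sk"], as_index=False)
            .agg(total_amount=("amount", "sum"),
                 average_amount=("amount", "mean")))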

Factless fact: a factless fact table is a table with no measures. It is used in cases where we need to perform aggregation operations on entities in dimension tables. Say you have a person dimension and you want to find the count of persons in a particular district; one then generates a factless fact table that is a copy of the dimension table.
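
A minimal pandas sketch of this count (dim_person, person_sk and district are assumed names) is given below:

import pandas as pd

def persons_per_district(dim_person):
    # The factless fact carries only keys and attributes, no measures.
    factless_person = dim_person[["person_sk", "district"]]
    return (factless_person
            .groupby("district", as_index=False)
            .agg(person_count=("person_sk", "count")))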

 

Data Services: Incremental and Initial Load