· Incomplete: say the completion date for a project has not been filled in.
· Noisy: say a field holds a numeric value instead of a date entry.
· Inconsistent: say the calculations do not match up with the other input data sets.
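The three kinds of dirty data above can be sketched with a small detection pass. This is an illustrative Python sketch, not the actual Turbodata pipeline; the record fields (`completion_date`, `budget`, `spent`, `remaining`) are hypothetical names chosen for the example.

```python
from datetime import datetime

# Hypothetical project records; field names are illustrative only.
projects = [
    {"name": "Alpha", "completion_date": "2024-03-15", "budget": 100, "spent": 60, "remaining": 40},
    {"name": "Beta",  "completion_date": None,         "budget": 100, "spent": 70, "remaining": 30},  # incomplete
    {"name": "Gamma", "completion_date": "42",         "budget": 100, "spent": 50, "remaining": 20},  # noisy + inconsistent
]

def is_date(value):
    """Return True if value parses as an ISO date string."""
    try:
        datetime.strptime(value, "%Y-%m-%d")
        return True
    except (TypeError, ValueError):
        return False

# Incomplete: required field missing entirely.
incomplete = [p["name"] for p in projects if p["completion_date"] is None]

# Noisy: a value is present but is not a valid date.
noisy = [p["name"] for p in projects
         if p["completion_date"] is not None and not is_date(p["completion_date"])]

# Inconsistent: the derived figures do not add up.
inconsistent = [p["name"] for p in projects
                if p["spent"] + p["remaining"] != p["budget"]]
```

Each list then names the records that need attention before the data moves downstream.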
Why is it important to handle the issue of dirty data?
Handling dirty data is important because a poor data set could lead senior managers to wrong decisions.
How do we cleanse the data sets?
We adopt multiple methods for cleansing the data sets, including the following:
· Handling missing data: to solve this problem, we could adopt the following approaches:
o Put in a default value. This default value could be the mean, the most probable value or a constant.
o Handle each entry manually.
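The default-value approaches above can be sketched in Python using the standard `statistics` module. The sample durations are made-up values for illustration.

```python
from statistics import mean, mode

# Hypothetical column with gaps (None marks a missing entry).
durations = [12, None, 8, 10, None, 12]

observed = [d for d in durations if d is not None]

# Option 1: fill gaps with the mean of the observed values.
filled_with_mean = [d if d is not None else mean(observed) for d in durations]

# Option 2: fill gaps with the most probable (most frequent) value.
filled_with_mode = [d if d is not None else mode(observed) for d in durations]

# Option 3: fill gaps with a chosen constant.
filled_with_const = [d if d is not None else 0 for d in durations]
```

The mean preserves the column average, the mode preserves the most common category, and a constant makes imputed entries easy to spot later.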
· Handling noisy data: we resolve such issues using data binning and data clustering (to remove outliers).
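A minimal sketch of the two noisy-data techniques just mentioned: binning, which smooths values by replacing each with its bin mean, and a simple outlier filter based on standard deviations (a stand-in for full clustering-based detection). The sample data and the 2-sigma threshold are illustrative assumptions.

```python
from statistics import mean, stdev

values = [10, 12, 11, 13, 12, 95]  # 95 is an obvious outlier

def bin_means(data, bin_size):
    """Smooth by equal-frequency binning: sort, split into bins,
    and replace every value with its bin's mean."""
    ordered = sorted(data)
    smoothed = []
    for i in range(0, len(ordered), bin_size):
        bucket = ordered[i:i + bin_size]
        smoothed.extend([mean(bucket)] * len(bucket))
    return smoothed

def drop_outliers(data, k=2.0):
    """Drop points more than k standard deviations from the mean."""
    mu, sigma = mean(data), stdev(data)
    return [x for x in data if abs(x - mu) <= k * sigma]
```

Binning dampens small fluctuations, while the filter removes points that clustering would likewise flag as far from any dense group.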
Data cleansing methodology adopted by Turbodata: Turbodata performs data cleansing at the staging layer, as shown in the diagram on the 'Data Consolidation' page. The ETL team uses a SQL/C#/.Net methodology for data cleansing, whereby the process removes unwanted characters before normalization and data transformation begin.
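The character-stripping step could look like the following sketch. This is an illustrative Python version only; the actual Turbodata process runs in SQL/C#/.Net, and the whitelist of allowed characters here is an assumption for the example.

```python
import re

def clean_field(raw):
    """Strip unwanted characters from a staging-layer field:
    keep word characters, whitespace and a few safe punctuation
    marks, then collapse runs of whitespace."""
    cleaned = re.sub(r"[^\w\s.,@-]", "", raw)
    return re.sub(r"\s+", " ", cleaned).strip()

# Control characters and stray symbols are removed before the
# value moves on to normalization and transformation.
clean_field("  Acme\tCorp.\x00 #42 ")
```

Running the cleanup before normalization means every downstream step sees uniform, printable values.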