· Incomplete: say the completion date for a project has not been filled in.
· Noisy: say a field holds a numeric value instead of a date entry.
· Inconsistent: say the calculations do not match up with the other input data sets.
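The three kinds of dirty data above can be sketched with a small detection pass. This is an illustrative Python sketch, not the actual Turbodata pipeline; the record fields (`completion_date`, `budget`, `spent`, `remaining`) are hypothetical names chosen for the example.

```python
from datetime import datetime

# Hypothetical project records; field names are illustrative only.
projects = [
    {"name": "Alpha", "completion_date": "2024-03-15", "budget": 100, "spent": 60, "remaining": 40},
    {"name": "Beta",  "completion_date": None,         "budget": 100, "spent": 70, "remaining": 30},  # incomplete
    {"name": "Gamma", "completion_date": "42",         "budget": 100, "spent": 50, "remaining": 20},  # noisy + inconsistent
]

def is_date(value):
    """Return True if value parses as an ISO date string."""
    try:
        datetime.strptime(value, "%Y-%m-%d")
        return True
    except (TypeError, ValueError):
        return False

# Incomplete: required field missing entirely.
incomplete = [p["name"] for p in projects if p["completion_date"] is None]

# Noisy: a value is present but is not a valid date.
noisy = [p["name"] for p in projects
         if p["completion_date"] is not None and not is_date(p["completion_date"])]

# Inconsistent: the derived figures do not add up.
inconsistent = [p["name"] for p in projects
                if p["spent"] + p["remaining"] != p["budget"]]
```

Each list then names the records that need attention before the data moves downstream.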
Why is it important to handle the issue of dirty data?
Handling dirty data is important because a poor data set could lead senior managers to wrong decisions.
How do we cleanse the data sets?
We adopt multiple methods for cleansing the data sets, including the following:
· Handling missing data: to solve this problem, we could adopt the following approaches:
o Put in a default value. This default value could be the mean, the most probable value or a constant.
o Handle each entry manually.
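The default-value approaches above can be sketched in Python using the standard `statistics` module. The sample durations are made-up values for illustration.

```python
from statistics import mean, mode

# Hypothetical column with gaps (None marks a missing entry).
durations = [12, None, 8, 10, None, 12]

observed = [d for d in durations if d is not None]

# Option 1: fill gaps with the mean of the observed values.
filled_with_mean = [d if d is not None else mean(observed) for d in durations]

# Option 2: fill gaps with the most probable (most frequent) value.
filled_with_mode = [d if d is not None else mode(observed) for d in durations]

# Option 3: fill gaps with a chosen constant.
filled_with_const = [d if d is not None else 0 for d in durations]
```

The mean preserves the column average, the mode preserves the most common category, and a constant makes imputed entries easy to spot later.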
· Handling noisy data: we resolve such issues using data binning and data clustering (to remove outliers).
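A minimal sketch of the two noisy-data techniques just mentioned: binning, which smooths values by replacing each with its bin mean, and a simple outlier filter based on standard deviations (a stand-in for full clustering-based detection). The sample data and the 2-sigma threshold are illustrative assumptions.

```python
from statistics import mean, stdev

values = [10, 12, 11, 13, 12, 95]  # 95 is an obvious outlier

def bin_means(data, bin_size):
    """Smooth by equal-frequency binning: sort, split into bins,
    and replace every value with its bin's mean."""
    ordered = sorted(data)
    smoothed = []
    for i in range(0, len(ordered), bin_size):
        bucket = ordered[i:i + bin_size]
        smoothed.extend([mean(bucket)] * len(bucket))
    return smoothed

def drop_outliers(data, k=2.0):
    """Drop points more than k standard deviations from the mean."""
    mu, sigma = mean(data), stdev(data)
    return [x for x in data if abs(x - mu) <= k * sigma]
```

Binning dampens small fluctuations, while the filter removes points that clustering would likewise flag as far from any dense group.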
Data cleansing methodology adopted by Turbodata: Turbodata performs data cleansing at the staging layer, as shown in the diagram on the 'Data Consolidation' page. The ETL team uses a SQL/C#/.Net methodology for data cleansing, whereby the process removes unwanted characters before normalization and data transformation begin.
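The character-stripping step could look like the following sketch. This is an illustrative Python version only; the actual Turbodata process runs in SQL/C#/.Net, and the whitelist of allowed characters here is an assumption for the example.

```python
import re

def clean_field(raw):
    """Strip unwanted characters from a staging-layer field:
    keep word characters, whitespace and a few safe punctuation
    marks, then collapse runs of whitespace."""
    cleaned = re.sub(r"[^\w\s.,@-]", "", raw)
    return re.sub(r"\s+", " ", cleaned).strip()

# Control characters and stray symbols are removed before the
# value moves on to normalization and transformation.
clean_field("  Acme\tCorp.\x00 #42 ")
```

Running the cleanup before normalization means every downstream step sees uniform, printable values.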