Star Schema
A star schema is the simplest style of datawarehousing schema. It consists
of dimension tables, fact tables and aggregate tables.
What is a dimension table: a dimension table is the list of attributes by which we desire to look at various measures such
as sales, COGS. The dimension table could be a hierarchy(such as region, date) or could be a list of attributes(say the details
of the person. There are 2 types of dimensions:
· Type-1
dimension: in this we capture
only the latest record.
· Type-2 dimension: in this we capture the history of the records.
This is done by using the effective from and effective to time stamps and the current flags.
What
is a fact table: a
fact table contains the surrogate keys from the dimension tables as foreign keys and the measures to be reported.
What is an aggregate table: when we need to aggregate the fact table data for a higher level of granularity(say fact table is loaded on a daily
basis but the aggregate needs to be on a monthly basis), we then generate aggregate tables. This operation normally entails
performing sum, average operations on the measures in the fact table and grouping by dimension entities.
Complications with star schemas: the following are the complications
that could be possible due to star schema.
- Dimension tables could become very large.
This could result in slow speed of reporting.
- The need of the end client is to do cross
domain reporting. In such cases reports from multiple star schemas shall need to be generated resulting in complications
for report development. The load on reporting tool shall be more resulting in slower speed of analysis with large
data set.
- Measures from multiple fact tables need to be analyzed by common dimension entities. This results in ‘chasm
trap’. In such a scenario, the path of the querying needs to be determined. This results in reduction in adhoc
capabilities for reporting analysis.
Snowflake
Schema
Snowflake schemas: many a times the end client would like the database to be reported from to be in the 3rd normal form itself. This could be desired in the following
cases:
- The reporting layer to be generated should be a reflection of the transaction tables which
are already normalized.
- The dimension tables could become very large.
- The
dimension tables could be sparsely populated.
- The end client is using OLAP tool in which
sql queries are to be used to populate the cubes. In such cases snowflake schema might be a good option.
- The
end client desires to have a look at the data sets from multiple angles in a single query or view. This is facilitated
by usage of OLAP tools. In such a scenario snowflake schema should be a good option.
Complications
with snowflake schema: when we
use the snowflake schema we could have the following issues in analysis.
- Fan trap: say I have 3 tables: A,B and C. A has
one to many relationship with B. B has one to many relationship with C. For a particular analysis, we pick up a
dimension entities from A and B and measures from B and C. In the consolidated report, the sum or the count of measures
from B shall be inflated. In order to solve this problem, we will have to generate a copy of B(alias), do the analysis
of A and B in one query and A,B,C in second query(this query shall have dimension objects from A and B and measures
from C only).
- The speed of querying could be slow if a large number of tables are
involved.