Star Schema and Snowflake Schema

Star Schema

A star schema is the simplest style of datawarehousing schema. It consists of dimension tables, fact tables and aggregate tables.

What is a dimension table: a dimension table is the list of attributes by which we desire to look at various measures such as sales, COGS. The dimension table could be a hierarchy(such as region, date) or could be a list of attributes(say the details of the person. There are 2 types of dimensions:

· Type-1 dimension: in this we capture only the latest record.

· Type-2 dimension: in this we capture the history of the records. This is done by using the effective from and effective to time stamps and the current flags.

What is a fact table: a fact table contains the surrogate keys from the dimension tables as foreign keys and the measures to be reported.

What is an aggregate table: when we need to aggregate the fact table data for a higher level of granularity(say fact table is loaded on a daily basis but the aggregate needs to be on a monthly basis), we then generate aggregate tables. This operation normally entails performing sum, average operations on the measures in the fact table and grouping by dimension entities.

Complications with star schemas: the following are the complications that could be possible due to star schema.

Dimension tables could become very large. This could result in slow speed of reporting.
The need of the end client is to do cross domain reporting. In such cases reports from multiple star schemas shall need to be generated resulting in complications for report development. The load on reporting tool shall be more resulting in slower speed of analysis with large data set.
Measures from multiple fact tables need to be analyzed by common dimension entities. This results in ‘chasm trap’. In such a scenario, the path of the querying needs to be determined. This results in reduction in adhoc capabilities for reporting analysis.

Snowflake Schema

Snowflake schemas: many a times the end client would like the database to be reported from to be in the 3^rd normal form itself. This could be desired in the following cases:

The reporting layer to be generated should be a reflection of the transaction tables which are already normalized.
The dimension tables could become very large.
The dimension tables could be sparsely populated.
The end client is using OLAP tool in which sql queries are to be used to populate the cubes. In such cases snowflake schema might be a good option.
The end client desires to have a look at the data sets from multiple angles in a single query or view. This is facilitated by usage of OLAP tools. In such a scenario snowflake schema should be a good option.

Complications with snowflake schema: when we use the snowflake schema we could have the following issues in analysis.

Fan trap: say I have 3 tables: A,B and C. A has one to many relationship with B. B has one to many relationship with C. For a particular analysis, we pick up a dimension entities from A and B and measures from B and C. In the consolidated report, the sum or the count of measures from B shall be inflated. In order to solve this problem, we will have to generate a copy of B(alias), do the analysis of A and B in one query and A,B,C in second query(this query shall have dimension objects from A and B and measures from C only).
The speed of querying could be slow if a large number of tables are involved.