A surrogate key is a mechanism that exists in our books for years and I hate for bringing into discussion again. Everyone is talking about the benefits of using a surrogate key instead of a business key. Even Microsoft Analysis Services Tabular and Microsoft PowerBI Tabular Models are working with the surrogate key. Both platforms mentioned give you the ability to connect a dimension and a fact using one column, and therefore is a surrogate key, as is very difficult to have one single business key in real life.
Working as BI Architect in the latest years I worked with Analysis Services Multidimensional and Tabular, I had projects in Multidimensional, which were managed up to 500GB in the DataWarehouse each night. I faced facts contracted from 5-6 unions and 8-10 joins among tables with millions of records.
Here comes the question, using Surrogate Key, in order the fact to be able to know the dimensions Key we need to make an extra Join. As a result, if we want to be able to “Relate” N dimensions (which are not already connected with a fact in construction expression) with a single Fact we need N additional Joins in the DataWarehouse.
Let’s take the previous example, so for this particular fact, we need 5-6 unions + (8-10 + N) joins which increases the complexity, image of what will happen once we have the requirement to relate this fact with 10-15 dimensions to get the surrogate key.
All these years I was trying to read my facts expressions using my early coffee like reading a newspaper and remove unused columns, unions, joins, and make everything to reduce the complexity ta save ETL process time.
Its fully understand that we will save time for querying DataWarehouse and Semantic Layer, but what about ETL, I am missing something?
Advertisement
Answer
a couple of thoughts about your question…
- If you didn’t use SKs then how would you handle SCD2 dimensions where the natural/business keys from the source system (even if they were a single column) wouldn’t be unique?
- The purpose of a DW is to make it easier and quicker to query your data. If you consider that any problem takes a certain amount of effort to resolve then you have a choice where you apply that effort in the chain of activities required to produce the solution. If you want to reduce the effort of querying then you need to increase the effort in data preparation i.e. your ETL