This text discusses the semantics of slowly changing dimension type 2 (SCD2) techniques in dimensional modeling. It covers the importance of choosing appropriate reference dates and the impact of different row-versioning methods on access patterns. Three options for reference dates are discussed: extract timestamps, source system timestamps, and business timestamps. Additionally, the format of valid_to and valid_from dates is explored, along with the potential use of dimensional snapshots as an alternative to SCD2. The importance of making conscious decisions in SCD2 design is emphasized.
The Semantics of Differing SCD2 Techniques
How small differences can have a big impact
Recently, I’ve been thinking a lot about dimensional modeling, specifically how we represent different kinds of history in the warehouse / lakehouse. There are many articles that describe how to build an SCD2 table across many languages and platforms. Instead, I want to focus on something more nuanced and less commonly discussed: the semantics of SCD2 and how various design choices have meaningful consequences on use cases.
The dates you choose to row-version your dimensions matter quite a bit.
The choice should never be arbitrary, and your most common use cases should be top-of-mind in your design.
How you row-version records will determine the access patterns against your tables.
To some extent this is strictly ergonomic, but I would argue that ergonomics are an important aspect of data quality; making it easy for users to do the right thing should be our goal as data modelers.
Choosing reference dates
The most common pattern for creating an SCD2 table is utilizing some date or timestamp in your data. Once you’ve established that a row has changed meaningfully, either via direct comparison of columns or comparison of hash values, you will have to establish dates to “retire” existing records and insert new records.
But which dates do we use? For many types of data, we’ll be able to choose from one of three options:
Extract timestamps
This method takes the perspective of, “What the raw data looked like when we captured it.” The source of truth is your warehouse and the processes that load it, as opposed to any essential attributes of the data itself.
Source system timestamps
This method takes the perspective of, “What the raw data looked like when the source system created or updated it.”
Business timestamps
This approach takes the perspective of, “What the business entity looked like in relation to a business date.”
Choosing the format of valid_to and valid_from
In our examples, we used a popular strategy for picking our record effective dates based off of some update columns. dbt snapshots provide this functionality out of the box via their timestamp strategy.
The subtle note on usage is that when the valid_to of the “old” record and the valid_from of the record that’s replacing it are equal, our query patterns require a strict inequality, as seen above.
Bonus Round: SCD2 vs. dimensional snapshots
As you can see, SCD2 introduces a lot of complexity to your data models, and there’s an open question whether this modeling exercise is always worth it. In one of data engineering’s most seminal works, Maxime Beauchemin discusses this idea in some depth.
Wrapping Up
Dimensional modeling is a powerful tool in any data engineer’s or analytics engineer’s toolbox. Being able to track history is crucial to certain analytics use cases, and history can provide you with valuable insights into operational workflows. While there are many different ways you can approach SCD2, you need to be conscious of the decisions you make. These small changes can seem abstract and inconsequential, but in actual usage, these distinctions will become crystal clear. The first time you have to explain why a “missing” record isn’t actually missing, just not valid when a user expects it to be, you’ll know exactly how important these choices are.