SCD2 — Semantics and Styles

This text discusses the semantics of slowly changing dimension type 2 (SCD2) techniques in dimensional modeling. It covers the importance of choosing appropriate reference dates and the impact of different row-versioning methods on access patterns. Three options for reference dates are discussed: extract timestamps, source system timestamps, and business timestamps. Additionally, the format of valid_to and valid_from dates is explored, along with the potential use of dimensional snapshots as an alternative to SCD2. The importance of making conscious decisions in SCD2 design is emphasized.

 SCD2 — Semantics and Styles

The Semantics of Differing SCD2 Techniques

How small differences can have a big impact

Recently, I’ve been thinking a lot about dimensional modeling, specifically how we represent different kinds of history in the warehouse / lakehouse. There are many articles that describe how to build an SCD2 table across many languages and platforms. Instead, I want to focus on something more nuanced and less commonly discussed: the semantics of SCD2 and how various design choices have meaningful consequences on use cases.

The dates you choose to row-version your dimensions matter quite a bit.

The choice should never be arbitrary, and your most common use cases should be top-of-mind in your design.

How you row-version records will determine the access patterns against your tables.

To some extent this is strictly ergonomic, but I would argue that ergonomics are an important aspect of data quality; making it easy for users to do the right thing should be our goal as data modelers.

Choosing reference dates

The most common pattern for creating an SCD2 table is utilizing some date or timestamp in your data. Once you’ve established that a row has changed meaningfully, either via direct comparison of columns or comparison of hash values, you will have to establish dates to “retire” existing records and insert new records.

But which dates do we use? For many types of data, we’ll be able to choose from one of three options:

Extract timestamps

This method takes the perspective of, “What the raw data looked like when we captured it.” The source of truth is your warehouse and the processes that load it, as opposed to any essential attributes of the data itself.

Source system timestamps

This method takes the perspective of, “What the raw data looked like when the source system created or updated it.”

Business timestamps

This approach takes the perspective of, “What the business entity looked like in relation to a business date.”

Choosing the format of valid_to and valid_from

In our examples, we used a popular strategy for picking our record effective dates based off of some update columns. dbt snapshots provide this functionality out of the box via their timestamp strategy.

The subtle note on usage is that when the valid_to of the “old” record and the valid_from of the record that’s replacing it are equal, our query patterns require a strict inequality, as seen above.

Bonus Round: SCD2 vs. dimensional snapshots

As you can see, SCD2 introduces a lot of complexity to your data models, and there’s an open question whether this modeling exercise is always worth it. In one of data engineering’s most seminal works, Maxime Beauchemin discusses this idea in some depth.

Wrapping Up

Dimensional modeling is a powerful tool in any data engineer’s or analytics engineer’s toolbox. Being able to track history is crucial to certain analytics use cases, and history can provide you with valuable insights into operational workflows. While there are many different ways you can approach SCD2, you need to be conscious of the decisions you make. These small changes can seem abstract and inconsequential, but in actual usage, these distinctions will become crystal clear. The first time you have to explain why a “missing” record isn’t actually missing, just not valid when a user expects it to be, you’ll know exactly how important these choices are.

List of Useful Links:

AI Products for Business or Try Custom Development

AI Sales Bot

Welcome AI Sales Bot, your 24/7 teammate! Engaging customers in natural language across all channels and learning from your materials, it’s a step towards efficient, enriched customer interactions and sales

AI Document Assistant

Unlock insights and drive decisions with our AI Insights Suite. Indexing your documents and data, it provides smart, AI-driven decision support, enhancing your productivity and decision-making.

AI Customer Support

Upgrade your support with our AI Assistant, reducing response times and personalizing interactions by analyzing documents and past engagements. Boost your team and customer satisfaction

AI Scrum Bot

Enhance agile management with our AI Scrum Bot, it helps to organize retrospectives. It answers queries and boosts collaboration and efficiency in your scrum processes.