Itinai.com futuristic ui icon design 3d sci fi computer scree 53325f5e 8707 4993 866c f93d7a06d6eb 3
Itinai.com futuristic ui icon design 3d sci fi computer scree 53325f5e 8707 4993 866c f93d7a06d6eb 3

SCD2 — Semantics and Styles

This text discusses the semantics of slowly changing dimension type 2 (SCD2) techniques in dimensional modeling. It covers the importance of choosing appropriate reference dates and the impact of different row-versioning methods on access patterns. Three options for reference dates are discussed: extract timestamps, source system timestamps, and business timestamps. Additionally, the format of valid_to and valid_from dates is explored, along with the potential use of dimensional snapshots as an alternative to SCD2. The importance of making conscious decisions in SCD2 design is emphasized.

 SCD2 — Semantics and Styles

The Semantics of Differing SCD2 Techniques

How small differences can have a big impact

Recently, I’ve been thinking a lot about dimensional modeling, specifically how we represent different kinds of history in the warehouse / lakehouse. There are many articles that describe how to build an SCD2 table across many languages and platforms. Instead, I want to focus on something more nuanced and less commonly discussed: the semantics of SCD2 and how various design choices have meaningful consequences on use cases.

The dates you choose to row-version your dimensions matter quite a bit.

The choice should never be arbitrary, and your most common use cases should be top-of-mind in your design.

How you row-version records will determine the access patterns against your tables.

To some extent this is strictly ergonomic, but I would argue that ergonomics are an important aspect of data quality; making it easy for users to do the right thing should be our goal as data modelers.

Choosing reference dates

The most common pattern for creating an SCD2 table is utilizing some date or timestamp in your data. Once you’ve established that a row has changed meaningfully, either via direct comparison of columns or comparison of hash values, you will have to establish dates to “retire” existing records and insert new records.

But which dates do we use? For many types of data, we’ll be able to choose from one of three options:

Extract timestamps

This method takes the perspective of, “What the raw data looked like when we captured it.” The source of truth is your warehouse and the processes that load it, as opposed to any essential attributes of the data itself.

Source system timestamps

This method takes the perspective of, “What the raw data looked like when the source system created or updated it.”

Business timestamps

This approach takes the perspective of, “What the business entity looked like in relation to a business date.”

Choosing the format of valid_to and valid_from

In our examples, we used a popular strategy for picking our record effective dates based off of some update columns. dbt snapshots provide this functionality out of the box via their timestamp strategy.

The subtle note on usage is that when the valid_to of the “old” record and the valid_from of the record that’s replacing it are equal, our query patterns require a strict inequality, as seen above.

Bonus Round: SCD2 vs. dimensional snapshots

As you can see, SCD2 introduces a lot of complexity to your data models, and there’s an open question whether this modeling exercise is always worth it. In one of data engineering’s most seminal works, Maxime Beauchemin discusses this idea in some depth.

Wrapping Up

Dimensional modeling is a powerful tool in any data engineer’s or analytics engineer’s toolbox. Being able to track history is crucial to certain analytics use cases, and history can provide you with valuable insights into operational workflows. While there are many different ways you can approach SCD2, you need to be conscious of the decisions you make. These small changes can seem abstract and inconsequential, but in actual usage, these distinctions will become crystal clear. The first time you have to explain why a “missing” record isn’t actually missing, just not valid when a user expects it to be, you’ll know exactly how important these choices are.

List of Useful Links:

Itinai.com office ai background high tech quantum computing 0002ba7c e3d6 4fd7 abd6 cfe4e5f08aeb 0

Vladimir Dyachkov, Ph.D
Editor-in-Chief itinai.com

I believe that AI is only as powerful as the human insight guiding it.

Unleash Your Creative Potential with AI Agents

Competitors are already using AI Agents

Business Problems We Solve

  • Automation of internal processes.
  • Optimizing AI costs without huge budgets.
  • Training staff, developing custom courses for business needs
  • Integrating AI into client work, automating first lines of contact

Large and Medium Businesses

Startups

Offline Business

100% of clients report increased productivity and reduced operati

AI news and solutions