Dimension Table Design

SCDs, Hierarchies, and the Date Dimension

9 min read

Part of the Data Modeling for Analytics guide. Previous: Fact Table Design · Next: Semantic Modeling →

Dimensions are where analytics gets its vocabulary. Every filter a user applies, every header on a pivot table, every GROUP BY in a generated query comes from a dimension attribute. Kimball called dimensions "the soul of the data warehouse" — facts are just numbers until dimensions make them mean something. This chapter covers building great dimensions: structure, keys, history tracking, and the specialty patterns that handle the awkward cases.

Wide, flat, and verbose

A good dimension table is unapologetically denormalized:

CREATE TABLE core.dim_product (
    product_key          INTEGER NOT NULL,   -- surrogate key (PK)
    product_id           VARCHAR NOT NULL,   -- natural key from source
    product_name         VARCHAR NOT NULL,
    brand_name           VARCHAR NOT NULL,
    category_name        VARCHAR NOT NULL,
    department_name      VARCHAR NOT NULL,
    package_type         VARCHAR NOT NULL,
    package_size         VARCHAR NOT NULL,
    is_private_label     BOOLEAN NOT NULL,
    launch_date          DATE,
    -- SCD bookkeeping (explained below)
    valid_from           TIMESTAMP NOT NULL,
    valid_to             TIMESTAMP,
    is_current           BOOLEAN NOT NULL
);

Brand, category, and department form a hierarchy, and a normalizer's instinct is to split them into separate tables. Don't. Flattening the hierarchy into the dimension costs trivial storage (columnar compression eats repeated values for breakfast) and buys single-join simplicity for every query, every user, forever.

Attribute craft guidelines:

  • Verbose, real words, not codes: 'Frozen Foods', not 'FF'. If the source provides codes the business actually uses, store both the code and the description.
  • No nulls in attributes users filter on — use 'Unknown' or 'Not Applicable' so values never silently vanish from filter lists and group-bys.
  • Flags as readable text or booleans with is_/has_ names.
  • Precompute useful derivations: age bands, tenure buckets, full name. Computed once in the dimension, consistent everywhere — and in Coginiti, equally definable as calculated dimensions in the semantic layer (CONCAT(first_name, ' ', last_name)) when you'd rather not materialize them.

The wider and richer the dimension, the more analyses become possible without anyone touching the pipeline. When in doubt, add the attribute.

Snowflakes and outriggers

A snowflake schema re-normalizes dimension hierarchies into chains of sub-tables (dim_productdim_categorydim_department). It saves negligible space and taxes every query with extra joins; avoid it as a default. The limited exception is the outrigger: a secondary dimension referenced from a dimension — e.g., a first_purchase_date attribute in dim_customer pointing at dim_date so customers can be analyzed by the fiscal attributes of their first purchase. Use sparingly, deliberately.

The date dimension

Every warehouse needs one, and it's the easiest dimension to build well because it can be generated entirely in advance — one row per calendar day, spanning your data's history and a few years forward:

date_key (20260612) · full_date · day_name · day_of_week ·
day_of_month · week_of_year · month_number · month_name ·
quarter · year · fiscal_month · fiscal_quarter · fiscal_year ·
is_weekend · is_company_holiday · holiday_name · selling_season

Why bother, when SQL has date functions? Because the business calendar isn't in the timestamp. Fiscal periods, holiday flags, selling seasons, week-numbering conventions — these are organizational facts that date functions can't compute. Encoding them once in dim_date means "compare holiday-season weekends year over year" is a filter, not a project. The date dimension conventionally uses a readable smart key (YYYYMMDD integer) rather than an opaque surrogate, an accepted exception to the meaningless-key rule.

Surrogate, natural, and durable keys

As introduced in Physical Data Modeling, every dimension carries three identities:

  • Surrogate key (product_key) — warehouse-assigned, meaningless, one per row version; the only key facts join on.
  • Natural key (product_id) — the source system's identifier, kept as an attribute.
  • Durable key — a stable identifier for the entity across all its versions; equal to the natural key when sources are stable, warehouse-assigned when they aren't.

The reason dimension rows need version-level identity is the central problem of dimension management: attributes change.

Slowly changing dimensions

Customers move. Products get reclassified. Sales reps change territories. Each change forces a choice: overwrite (and rewrite history), or preserve (and manage versions). The industry-standard taxonomy of responses — slowly changing dimension (SCD) types — is one of Kimball's most lasting contributions.

Type 0 — Retain original. Never update the attribute. For values that are definitionally fixed: original credit score at signup, first acquisition channel.

Type 1 — Overwrite. Update the attribute in place. Simple, but rewrites history: re-run last year's report and the numbers move to the new value. Appropriate for corrections (fixing a misspelled name) and attributes where history genuinely doesn't matter. The hidden cost: any aggregate tables built on the changed attribute must be recomputed.

Type 2 — Add a new row. The gold standard for real history. When a tracked attribute changes, expire the current row (valid_to, is_current = FALSE) and insert a new row with a new surrogate key:

product_key product_id category valid_from valid_to is_current
12345 INTK-10 Education 2024-01-01 2026-01-31 false
25984 INTK-10 Strategy 2026-02-01 true

Facts loaded before the change keep pointing at key 12345; facts after point at 25984. Historical reports reproduce exactly; current-state queries filter is_current. This is why surrogate keys exist: the natural key INTK-10 now legitimately maps to two rows. Type 2 requires discipline in the load (expire-and-insert, plus correct key resolution during fact loads) but delivers the property analytics needs most: facts permanently associated with the dimensional context that was true when they occurred.

Type 3 — Add a column. Keep current_category and prior_category side by side. Supports "alternate reality" comparisons across exactly one reorganization; useless beyond one change. Niche.

Type 4 — Mini-dimension. When a large dimension has a cluster of rapidly changing attributes (customer demographics, risk scores), Type 2 would explode row counts. Split the volatile attributes into a small "mini-dimension" of banded value combinations, keyed directly from the fact table. The big dimension stays stable; the fast-moving state is captured per fact row.

Types 5, 6, 7 — Hybrids. Combinations layered on Type 2 for the dual need "report history as it was AND restate everything as it is now": Type 5 (mini-dimension plus a Type 1 current pointer), Type 6 (current-value columns maintained Type-1-style inside Type 2 rows), Type 7 (facts carry both the version surrogate key and the durable key, joining to versioned and current views of the dimension respectively). Reach for these only when that dual requirement is real; most warehouses live happily on a vocabulary of Types 0, 1, and 2.

The modern complement: dimension snapshots. Cheap storage enables a blunter instrument — snapshot the entire dimension daily into a date-partitioned table. Point-in-time questions become "join facts to the dimension partition matching the fact date"; current-state questions use the latest partition. The mental model is simpler than SCD machinery, at the cost of storing many copies (cheap, for dimension-sized data) and of point-in-time logic moving into queries. Snapshots and Type 2 aren't mutually exclusive: many teams run Type 2 for the core conformed dimensions that feed the semantic layer, and snapshots for everything else.

Whatever the mechanism, the policy is per-attribute and is a business decision: which changes must preserve history (Type 2), which should overwrite (Type 1), which are frozen (Type 0). Write the policy down in the model documentation.

Specialty dimension patterns

Role-playing dimensions. One physical dimension playing multiple roles in a fact: an order fact with order_date_key, ship_date_key, and delivery_date_key all referencing dim_date. Build the table once; expose each role distinctly. In Coginiti SMDL, each role is simply a separately named relationship from the fact entity to the date entity — "order_date", "ship_date", "delivery_date" — so users see three clearly labeled date contexts instead of one ambiguous join.

Junk dimensions. Every transactional source carries stray low-cardinality flags: payment type, channel, gift-wrap indicator. Rather than five tiny dimension tables (or five flag columns cluttering the fact), combine them into one junk dimension whose rows are the observed combinations. The fact carries one key; the flags stay filterable.

Conformed dimensions — covered in the star schema chapter — are less a pattern than a discipline: one customer dimension, one product dimension, one date dimension, shared by every star. Conformance is what makes cross-process analysis truthful, and it's worth organizational effort to achieve.

Multivalued dimensions and bridge tables. Sometimes the relationship genuinely isn't many-to-one: a hospital visit with multiple diagnoses, an account with multiple owners. A bridge table (group key → member rows, optionally with weighting factors so allocated facts still sum to 100%) handles this honestly. Bridges complicate queries and risk double counting — use them when the business reality demands it, and consider whether a "primary diagnosis" simplification serves analysts better.

Hierarchies. Fixed-depth hierarchies (product → category → department; city → region → country) flatten beautifully into dimension columns, and should. Ragged or variable-depth hierarchies (org charts, account rollups) are genuinely hard in any paradigm; standard treatments include flattened level columns with repeated values for short branches, or a bridge-style traversal table when arbitrary-depth rollup is unavoidable.

Dimensions in the semantic layer

Everything this chapter builds surfaces directly in the semantic model. A dimension table becomes an SMDL entity; its columns become typed semantic dimensions with business labels and descriptions; helper columns stay hidden = true; and the fact-to-dimension foreign keys become declared many_to_one relationships that let the query engine join automatically:

entity "dim_product" {
  label       = "Product"
  table_name  = "core.dim_product"
  description = "One row per product version (Type 2 history tracked on category)."

  dimension "product_name" {
    label = "Product Name"
    type  = "text"
    expr  = "product_name"
  }

  dimension "category_name" {
    label = "Category"
    type  = "text"
    expr  = "category_name"
  }

  dimension "is_current" {
    label  = "Is Current Version"
    type   = "bool"
    expr   = "is_current"
  }
}

The richer and cleaner your dimensions, the more powerful this layer becomes — every well-named attribute is one more thing analysts can drag into a query and AI assistants can reason about. Which brings us to the semantic layer itself.

Next: Semantic Modeling: Giving Your Data Business Meaning →

See Semantic Intelligence in Action

Coginiti operationalizes business meaning across your entire data estate.