Fact Table Design

Transactions, Snapshots, and Additivity

9 min read

Part of the Data Modeling for Analytics guide. Previous: Dimensional Modeling and the Star Schema · Next: Dimension Table Design →

Fact tables hold the numbers the business runs on. They are the largest tables in the warehouse, the source of every metric, and the place where a subtle design error becomes a systematically wrong KPI. This chapter covers the anatomy of a well-built fact table, the three (plus one) fundamental fact table types, additivity, and the traps experienced modelers have learned to avoid.

Anatomy of a fact table

A disciplined fact table contains only:

  • Foreign keys to dimensions (surrogate keys: date_key, customer_key, product_key, …)
  • Degenerate dimensions (explained below)
  • Numeric facts, each true to the table's declared grain
  • Optionally, metadata columns (load timestamp, source batch ID) for auditability
CREATE TABLE core.fct_sales_detail (
    date_key            INTEGER NOT NULL,   -- FK → dim_date
    customer_key        INTEGER NOT NULL,   -- FK → dim_customer
    product_key         INTEGER NOT NULL,   -- FK → dim_product
    store_key           INTEGER NOT NULL,   -- FK → dim_store
    promotion_key       INTEGER NOT NULL,   -- FK → dim_promotion
    order_number        VARCHAR NOT NULL,   -- degenerate dimension
    order_line_number   INTEGER NOT NULL,
    unit_quantity       INTEGER NOT NULL,
    unit_price_amount   DECIMAL(12,2) NOT NULL,
    discount_amount     DECIMAL(12,2) NOT NULL,
    sales_amount        DECIMAL(12,2) NOT NULL
);
-- Grain: one row per order line

What's absent matters as much as what's present. No verbose text descriptions (those live in dimensions, stated once per product instead of once per sale). No mixed-grain rows. No columns that "might be handy" without a defined meaning. Fact tables stay narrow because they're enormous; dimensions stay wide because they're small.

Degenerate dimensions are dimension-like identifiers with no attributes of their own: the order number, ticket number, or invoice number. There's nothing descriptive to put in an dim_order_number table, so the value sits directly in the fact table. Degenerate dimensions are essential for grouping line items back into their parent transaction and for tracing facts to source documents.

Nulls deserve care. A null fact is acceptable (aggregations skip it correctly). A null foreign key is not — it breaks joins silently. When a dimension doesn't apply (a sale with no promotion), point the foreign key at an explicit row in the dimension: promotion_key = 0 → "No Promotion in Effect". Every dimension should carry such a default row, plus rows for "Unknown" and "Not Yet Available" where late-arriving data is possible.

The three fact table types

Every fact table you will ever build is one of three types — distinguished by what a row represents in time.

Transaction fact tables

Grain: one row per event, at the moment it occurred. One order line, one payment, one click, one inventory movement. This is the workhorse type: maximally expressive, fully additive (usually), and the foundation from which the other types can be derived. When in doubt, build the transaction table first at the most atomic grain available.

Characteristics: rows are immutable (events happened), the table is sparse (rows exist only when activity occurred), and it grows linearly with business activity. Partition by transaction date; load incrementally.

Periodic snapshot fact tables

Grain: one row per entity per regular interval. Account balance per account per day. Inventory level per product per store per week. Headcount per department per month.

Periodic snapshots answer level questions that transaction tables answer only awkwardly: "what was on the shelf at month-end?" requires replaying every movement since the beginning of time against a transaction table, but is a simple filter against a snapshot. The two types complement each other — transactions explain why levels changed; snapshots state what the level was.

Snapshots are dense (a row exists every period whether or not anything happened) and predictable in size: rows = entities × periods. On modern warehouses, daily snapshots of even large entity sets are routinely affordable, and the simple mental model they give analysts ("filter to the date, read the level") is worth a great deal.

Accumulating snapshot fact tables

Grain: one row per instance of a process with a defined beginning and end, updated as it progresses. One row per order moving through placed → paid → packed → shipped → delivered; one row per insurance claim, mortgage application, or support ticket.

The row carries a date key for each milestone (null or pointed at "Not Yet Happened" until reached) plus lag facts (days from placed to shipped, etc.). Unlike the other types, accumulating snapshot rows are revisited and updated as the pipeline progresses.

Accumulating snapshots make pipeline analytics — bottleneck analysis, cycle-time trends, work-in-progress aging — almost trivial, at the cost of a more involved load process. Use them when the business thinks in terms of items flowing through stages.

Factless fact tables

The "plus one": fact tables with no numeric facts at all — just foreign keys recording that an event or condition occurred. Student attended class (date, student, class). Product was on promotion at a store (date, product, store, promotion). Each row simply asserts "this combination happened"; counting rows is the measurement.

Coverage-style factless tables answer the questions events can't: what was promoted but didn't sell? requires knowing what was promoted — an event that generates no sales row. Compare the coverage table against the sales table and the non-sellers fall out.

Additivity: the property that governs everything

Every fact has an additivity classification, and knowing it is non-negotiable because it determines what aggregations are legitimate.

Additive facts sum meaningfully across all dimensions. Sales amount, quantity, discount amount. Sum them by day, product, region, anything — the result is meaningful. Design for additive facts wherever possible.

Semi-additive facts sum across some dimensions but not across time. Balances and levels — account balance, inventory on hand, headcount — are the canonical case. Adding Monday's balance to Tuesday's balance is meaningless; the correct time aggregation is average, min/max, or last value in period. Semi-additive facts live naturally in periodic snapshots.

Non-additive facts can't be summed across anything. Ratios, percentages, unit prices, temperatures. The rule: store the additive components, compute the ratio after aggregation. Don't store margin_percent; store revenue_amount and cost_amount and compute margin at query time. The average of pre-computed ratios is not the ratio of the sums — this single confusion produces more silently wrong dashboards than any other modeling error.

This classification flows directly into the semantic layer. In Coginiti SMDL, additivity is expressed in how measures are defined — additive facts as simple sum measures, ratios as derived measures computed post-aggregation:

measure "revenue" {
  type             = "decimal"
  aggregation_type = "sum"
  expr             = "revenue_amount"
}

measure "cost" {
  type             = "decimal"
  aggregation_type = "sum"
  expr             = "cost_amount"
}

// Derived measure: aggregated independently, THEN divided —
// the ratio-of-sums, computed correctly for any grouping.
measure "margin_pct" {
  label = "Margin %"
  type  = "number"
  expr  = "(revenue - cost) / NULLIF(revenue, 0)"
}

Because margin_pct is defined once, post-aggregation, every analyst and every dashboard that asks for Margin % gets the ratio of sums — never the sum (or average) of ratios. Encoding additivity into the semantic model turns a modeling convention into an enforced guarantee. The distinction also matters for performance: additive measures (sum, count) can be safely rolled up from pre-aggregated results, while non-additive ones (count_distinct, avg, median) cannot — which is exactly the information a query engine needs for pre-aggregation optimization.

Design traps and how to avoid them

Mixed grain. The cardinal sin, worth repeating: order-level facts (shipping fees, order discounts) do not belong in a line-level table. Either allocate them down to lines explicitly (a real business decision about how to allocate) or give them their own fact table at order grain.

Header/line confusion. Operational systems deliver order headers and order lines as separate tables, and it's tempting to mirror that structure. Resist building a fact table at header grain that repeats line information, and resist a line table that omits header context. The standard pattern: build at line grain, and bring header-level dimensional context (order date, customer, channel) down onto each line row. The order number remains as a degenerate dimension for regrouping.

Storing only ratios. Covered above — store components.

Year-to-date columns. A ytd_sales fact column is wrong tomorrow and ambiguous forever (whose year? as of when?). YTD is a query-time calculation over an additive fact, or a defined measure in the semantic layer — never a stored fact.

Fact-to-fact joins. Joining two fact tables directly almost always produces fan-out double counting, because the tables rarely share a one-to-one grain. Correct cross-process analysis aggregates each fact table separately to a shared dimensional level, then combines the results ("drilling across" on conformed dimensions). A semantic layer with declared relationships does this safely by construction; hand-written SQL must do it deliberately.

Premature aggregation. Building only the daily-summary fact because "we'll never need line detail" is a bet you'll lose. Atomic grain first; aggregates as measured optimizations later. On columnar MPP platforms, the atomic table is usually fast enough far longer than intuition suggests.

Loading and maintaining fact tables

Practical notes for the pipeline (in Coginiti, typically incremental CoginitiScript builds):

  • Incremental loads keyed on a reliable watermark; late-arriving facts (events that show up days after they occurred) need the load to look back a window, not just "since yesterday."
  • Surrogate key lookups: each incoming fact row's natural keys are resolved to dimension surrogate keys during load — pointing at the current dimension row, or the historically correct row when Type 2 history is in play (see next chapter).
  • Idempotency: a re-run of the same load window must not duplicate rows. Delete-and-reinsert by window, or merge on a deterministic row identity.
  • Audit columns: load timestamp and batch ID per row cost nothing and turn production incident triage from archaeology into a query.

Facts are half the star. The other half — the tables that make facts describable — have their own deep craft, including the famous problem of attributes that change over time.

Next: Dimension Table Design →

See Semantic Intelligence in Action

Coginiti operationalizes business meaning across your entire data estate.