Dimensional Modeling and the Star Schema
Facts, Dimensions, the Grain, and Conformance
9 min read
Part of the Data Modeling for Analytics guide. Previous: Physical Data Modeling · Next: Fact Table Design →
Dimensional modeling is a way of structuring data so that it directly mirrors how the business measures and describes itself. It was formalized by Ralph Kimball and Margy Ross in The Data Warehouse Toolkit, and after nearly thirty years it remains the standard approach for the analytics-facing layer of a warehouse — not out of tradition, but because business questions are dimensional by nature, and a schema that shares that shape is easy to query, easy to extend, and easy to trust.
Facts and dimensions
Dimensional modeling divides the world into two kinds of tables.
Fact tables record the measurements of a business process. Each row is one measurement event: one product scanned at a register, one invoice line, one account balance at month-end, one support ticket state change. Fact table columns are of exactly two kinds: numeric facts (quantities, amounts, durations) and foreign keys to dimensions. Fact tables are long and narrow — they grow with the business, often into billions of rows.
Dimension tables carry the descriptive context of those measurements — the who, what, where, when, why, and how. A product dimension carries name, brand, category, package size; a customer dimension carries name, segment, region, signup date; a date dimension carries fiscal periods, holiday flags, weekday names. Dimensions are short and wide — thousands or millions of rows, dozens of verbose, human-readable attributes.
The split corresponds exactly to how questions are phrased. In "total sales amount by product category for California stores during the holiday season", sales amount is a fact; everything else is dimensional context. Dimension attributes become the GROUP BY and WHERE clauses of every query; facts become the aggregations.
The star schema
Arrange one fact table at the center with its dimension tables around it, joined by surrogate keys, and you have a star schema:
SELECT
p.category,
d.fiscal_quarter,
SUM(f.sales_amount) AS total_sales
FROM fct_sales_detail f
JOIN dim_product p ON f.product_key = p.product_key
JOIN dim_date d ON f.date_key = d.date_key
JOIN dim_store s ON f.store_key = s.store_key
WHERE s.state = 'CA'
GROUP BY p.category, d.fiscal_quarter;
Every analytical query against a star has this same shape: join facts to the dimensions you need, filter on dimension attributes, aggregate facts, group by dimension attributes. That predictability is the point. Analysts learn the pattern once; BI tools generate it reliably; query optimizers execute it efficiently; and — as we'll see in the semantic modeling chapter — a semantic layer can automate it entirely, because the joins are knowable in advance.
The star schema is deliberately denormalized. Product category lives directly in dim_product, even though strict normalization would put categories in their own table. Resisting the urge to re-normalize dimensions is one of Kimball's most repeated warnings: the storage saved is trivial (especially with columnar compression), and the cost — every query and every user now navigating extra joins — is permanent. The normalized variant, with dimensions split into chains of sub-tables, is called a snowflake schema; we cover it (and the rare cases where pieces of it are justified) in Dimension Table Design.
The four-step design process
Kimball's enduring contribution is not just the star schema but a repeatable process for arriving at one. Four decisions, in order:
Step 1: Select the business process
Model a process — an activity the organization performs that generates measurements — not a department or a report. Order taking, shipping, inventory movement, claim processing, subscription billing. Processes are the stable units: departments reorganize and reports churn, but the business will be taking orders for as long as it exists.
Pick the process by combining business priority with data feasibility, and start with the one that matters most. Each business process will typically become one or more fact tables.
Step 2: Declare the grain
The grain is the answer to: what does exactly one row of this fact table represent? One order line. One register scan. One account-day. One click. This is the most important decision in dimensional modeling, and it must be stated in business terms before any keys or columns are listed.
Two rules:
Choose the most atomic grain available — the level at which the data cannot be subdivided further. Atomic data answers every question that aggregated data can answer, plus all the questions you haven't thought of yet. If you store order-level totals and someone asks which products sold best on weekends, you're rebuilding your pipeline; if you stored order lines, it's a GROUP BY. On modern columnar warehouses the old performance excuse for pre-aggregating is mostly gone — store atomic, aggregate on demand, and add summary tables only when measured workloads justify them.
Never mix grains in one fact table. A table where some rows are order lines and others are order-level shipping totals will double-count one or the other in every careless query. Different grains → different fact tables, no exceptions.
Step 3: Identify the dimensions
With the grain declared, the dimensions almost name themselves: ask "how would businesspeople describe one of these rows?" For an order-line grain: the date it happened, the product sold, the customer who bought, the store or channel, the promotion in effect, the employee who rang it up. Each becomes a foreign key to a dimension table.
Be generous. Every attribute you attach is a new way to slice every measure, and wide descriptive dimensions are what make self-service analytics feel effortless. A robust date dimension alone — fiscal periods, holiday flags, selling seasons — unlocks analyses that raw timestamps make painful.
Step 4: Identify the facts
Finally, the numeric measurements. The test: every fact must be true to the grain. At order-line grain, unit_quantity, unit_price_amount, discount_amount, and extended_sales_amount all describe one line — they belong. An order-level shipping fee does not describe one line; including it would repeat the fee across lines and sum it wrongly. It belongs in a separate fact table at order grain (or allocated down to lines explicitly, as a deliberate modeling decision).
Store facts that are additive wherever possible — values that sum meaningfully across all dimensions. Store the components, not just the ratios: discount_amount and gross_amount rather than discount_percent, because percentages can be computed correctly after aggregation but not summed. (Much more on additivity in Fact Table Design.)
Conformed dimensions and the bus matrix
A single star answers questions about a single process. Businesses need to ask questions across processes: did the marketing campaign (promotions process) lift sales (orders process) without wrecking delivery times (shipping process)?
The mechanism that makes cross-process analysis possible is the conformed dimension: one dimension table — one dim_customer, one dim_product, one dim_date — shared by every fact table that touches that concept. When two fact tables join to the same customer dimension, you can query each by customer and lay the results side by side ("drilling across") with confidence that "customer" means the same thing in both.
The planning tool for conformance is the enterprise data warehouse bus matrix: business processes as rows, dimensions as columns, an X where each process uses each dimension.
| Business process | Date | Customer | Product | Store | Promotion | Warehouse |
|---|---|---|---|---|---|---|
| Retail sales | X | X | X | X | X | |
| Inventory | X | X | X | X | ||
| Purchase orders | X | X | X | |||
| Returns | X | X | X | X |
The matrix is an architecture on one page. Columns with many X's are your most valuable dimensions — build them once, build them well, and reuse them everywhere. The matrix also enables incremental delivery: build one star at a time, in priority order, and because each new star reuses conformed dimensions, the warehouse integrates as it grows instead of fragmenting into silos.
Conformance is as much governance as engineering: it requires the organization to agree on one definition of customer, one product hierarchy, one fiscal calendar. That agreement — politically hard, technically simple — is precisely what makes the downstream semantic layer trustworthy, because the semantic model inherits whatever conformance the dimensional layer achieved.
From star schema to semantic model
Notice what a finished star schema gives you: a set of entities (the fact and dimension tables), a set of descriptive attributes (dimension columns), a set of measurements (fact columns) with known aggregation behavior, and a set of relationships (the foreign keys) with known cardinality — facts join many-to-one to dimensions.
Those four things — entities, dimensions, measures, relationships — are exactly the vocabulary of a semantic model. In Coginiti's SMDL, a star schema translates almost mechanically:
entity "sales_detail" {
label = "Sales Detail"
table_name = "core.fct_sales_detail"
measure "sales_amount" {
label = "Sales Amount"
type = "decimal"
aggregation_type = "sum"
expr = "sales_amount"
}
relationship "product" {
type = "many_to_one"
expr = "sales_detail.product_key = dim_product.product_key"
}
}
With that definition in place, the hand-written four-table join earlier in this chapter becomes a query in semantic SQL:
SELECT category, fiscal_quarter, MEASURE(sales_amount)
FROM sales_detail
GROUP BY category, fiscal_quarter
— and the semantic layer supplies the joins from the declared relationships. This is why dimensional modeling rewards the effort twice: once in human queryability, and again when the same structure becomes a machine-readable semantic model. The full mapping is the subject of Semantic Modeling and the end-to-end walkthrough.
First, though, the two table types deserve deeper treatment — starting with facts.
Next: Fact Table Design →
See Semantic Intelligence in Action
Coginiti operationalizes business meaning across your entire data estate.