From Star Schema to Semantic Model
A Complete Retail Walkthrough
9 min read
Part of the Data Modeling for Analytics guide. Previous: Semantic Modeling · Back to the guide overview
The previous chapters built the theory: physical layering, the star schema, fact and dimension design, and semantic modeling. This chapter puts it all together on one worked example, end to end: business requirements → dimensional design → physical tables → semantic model → business questions answered in semantic SQL. For a hands-on version of this build, see Coginiti's retail project workflow tutorial.
The scenario
You're the analytics engineer for a specialty retailer selling through stores and a web shop. Source systems land in your warehouse: a point-of-sale system (order headers and order lines), a product catalog, and a customer database. Leadership's questions are the usual suspects: revenue and margin by product, category, channel, and region; promotion effectiveness; customer purchasing patterns; everything trended over fiscal periods.
Your stack: raw data lands in a raw schema; CoginitiScript transformations build staging and core; the semantic model is defined over core in .smdl files in the semantic/ folder of the project.
Step 1–4: The dimensional design
Applying Kimball's four-step process:
1. Business process: retail sales (order taking). The highest-priority process; others (inventory, fulfillment, returns) will follow later, sharing conformed dimensions via the bus matrix.
2. Grain: one row per order line — the most atomic level the POS provides. Order-level amounts (shipping fees) are excluded from this table; they'll get a header-grain fact when needed.
3. Dimensions: date (order date), customer, product, store/channel, promotion.
4. Facts: unit_quantity, unit_price_amount, discount_amount, unit_cost_amount — all additive, all true to the line grain. Extended amounts are derivable; we'll store gross_amount and net_amount for convenience and define margin post-aggregation in the semantic layer, never as a stored ratio.
The star:
The physical model
Staging models (stg_pos_order_headers, stg_pos_order_lines, stg_catalog_products, stg_crm_customers) rename, cast, deduplicate. The core layer then builds the star. Abbreviated DDL for the key tables:
CREATE TABLE core.dim_customer (
customer_key INTEGER NOT NULL, -- surrogate (version) key
customer_id VARCHAR NOT NULL, -- natural key (CRM)
full_name VARCHAR NOT NULL,
customer_segment VARCHAR NOT NULL, -- Type 2 tracked
region VARCHAR NOT NULL, -- Type 2 tracked
signup_date DATE NOT NULL, -- Type 0
is_current BOOLEAN NOT NULL,
valid_from TIMESTAMP NOT NULL,
valid_to TIMESTAMP
);
CREATE TABLE core.fct_sales_detail (
sales_detail_key BIGINT NOT NULL,
order_date_key INTEGER NOT NULL, -- → dim_date
customer_key INTEGER NOT NULL, -- → dim_customer (version row)
product_key INTEGER NOT NULL, -- → dim_product
store_key INTEGER NOT NULL, -- → dim_store
promotion_key INTEGER NOT NULL, -- → dim_promotion (0 = none)
order_number VARCHAR NOT NULL, -- degenerate dimension
unit_quantity INTEGER NOT NULL,
unit_price_amount DECIMAL(12,2) NOT NULL,
discount_amount DECIMAL(12,2) NOT NULL,
unit_cost_amount DECIMAL(12,2) NOT NULL,
gross_amount DECIMAL(12,2) NOT NULL, -- qty × price
net_amount DECIMAL(12,2) NOT NULL -- gross − discount
);
-- Grain: one row per order line
-- Partitioned/clustered by order_date_key; loaded incrementally
Design notes, connecting back to earlier chapters: customer segment and region are Type 2 (so historical sales stay attributed to the segment/region in effect at order time); promotion_key = 0 points at an explicit "No Promotion" row rather than null; the order number rides along as a degenerate dimension; money is DECIMAL; and the fact is date-partitioned and incrementally loaded by CoginitiScript with idempotent merge logic.
The semantic model
Now the layer that consumers actually touch. In the project: semantic/sales/fct_sales_detail.smdl, semantic/customer/dim_customer.smdl, etc., registered in project.toml. The model, abbreviated to the interesting parts:
// semantic/customer/dim_customer.smdl
entity "customer" {
label = "Customer"
description = "One row per customer version; segment and region track history."
table_name = "core.dim_customer"
dimension "customer_id" {
label = "Customer ID"
type = "text"
expr = "customer_id"
}
dimension "full_name" {
label = "Customer Name"
type = "text"
expr = "full_name"
}
dimension "customer_segment" {
label = "Segment"
type = "text"
description = "Marketing segment in effect at the time of each sale."
expr = "customer_segment"
}
dimension "region" {
label = "Region"
type = "text"
expr = "region"
}
dimension "is_current" {
label = "Is Current Version"
type = "bool"
expr = "is_current"
hidden = true
}
}
// semantic/sales/fct_sales_detail.smdl
entity "sales" {
label = "Sales"
description = "Retail sales at order-line grain, all channels."
table_name = "core.fct_sales_detail"
// ---- relationships: the star's joins, declared once ----
relationship "order_date" {
label = "Order Date"
type = "many_to_one"
expr = "sales.order_date_key = order_date.date_key"
}
relationship "customer" {
label = "Customer"
type = "many_to_one"
expr = "sales.customer_key = customer.customer_key"
}
relationship "product" {
label = "Product"
type = "many_to_one"
expr = "sales.product_key = product.product_key"
}
relationship "store" {
label = "Store"
type = "many_to_one"
expr = "sales.store_key = store.store_key"
}
relationship "promotion" {
label = "Promotion"
type = "many_to_one"
expr = "sales.promotion_key = promotion.promotion_key"
}
// ---- dimensions ----
dimension "order_number" {
label = "Order Number"
type = "text"
expr = "order_number"
}
// hidden operands so the system can type the measure expressions
dimension "net_amount" {
type = "decimal"
expr = "net_amount"
hidden = true
}
dimension "unit_cost_amount" {
type = "decimal"
expr = "unit_cost_amount"
hidden = true
}
dimension "unit_quantity" {
type = "integer"
expr = "unit_quantity"
hidden = true
}
// cross-entity convenience dimensions: the declared relationships
// let these reference related entities; the join is added automatically
// whenever a query uses them
dimension "product_category" {
label = "Product Category"
type = "text"
expr = "product.category_name"
}
dimension "product_name" {
label = "Product Name"
type = "text"
expr = "product.product_name"
}
dimension "customer_segment" {
label = "Customer Segment"
type = "text"
expr = "customer.customer_segment"
}
dimension "region" {
label = "Customer Region"
type = "text"
expr = "customer.region"
}
dimension "fiscal_quarter" {
label = "Fiscal Quarter"
type = "text"
expr = "order_date.fiscal_quarter"
}
// ---- measures: the metric definitions, once ----
measure "units_sold" {
label = "Units Sold"
type = "number"
aggregation_type = "sum"
expr = "unit_quantity"
}
measure "revenue" {
label = "Revenue"
type = "decimal"
description = "Net sales: gross minus line discounts. Excludes shipping."
aggregation_type = "sum"
expr = "net_amount"
}
measure "cost" {
label = "Cost of Goods"
type = "decimal"
aggregation_type = "sum"
expr = "unit_cost_amount * unit_quantity"
}
measure "promoted_revenue" {
label = "Promoted Revenue"
type = "decimal"
description = "Revenue from lines sold under any promotion."
aggregation_type = "sum"
expr = "net_amount"
filter = "promotion_key != 0"
}
measure "order_count" {
label = "Order Count"
type = "number"
aggregation_type = "count_distinct"
expr = "order_number"
}
// derived measures: post-aggregation arithmetic
measure "profit" {
label = "Profit"
type = "decimal"
expr = "revenue - cost"
}
measure "margin_pct" {
label = "Margin %"
type = "number"
expr = "profit / NULLIF(revenue, 0)"
}
measure "promoted_revenue_share" {
label = "Promoted Revenue Share"
type = "number"
expr = "promoted_revenue / NULLIF(revenue, 0)"
}
measure "average_order_value" {
label = "Average Order Value"
type = "decimal"
expr = "revenue / NULLIF(order_count, 0)"
}
}
The product, store, promotion, and order_date entities follow the same pattern as customer and are elided for brevity — each one a wide dimension exposed with business labels, with bookkeeping columns hidden.
Every design decision from the earlier chapters has surfaced here: the star's foreign keys became declared relationships; additivity discipline became simple vs. derived measures (margin_pct is a ratio of sums by construction); the business rule "promoted" became a filtered measure instead of forty hand-written CASE expressions; and count_distinct on the degenerate order number gives a correct order count at line grain — a calculation analysts routinely get wrong by hand.
Answering the business questions
All consumers — analysts, dashboards, external tools over JDBC/ODBC, and AI assistance — now query one set of definitions.
Revenue, profit, and margin by category and fiscal quarter:
SELECT
product_category,
fiscal_quarter,
MEASURE(revenue) AS revenue,
MEASURE(profit) AS profit,
MEASURE(margin_pct) AS margin
FROM sales
GROUP BY product_category, fiscal_quarter
ORDER BY fiscal_quarter, revenue DESC
No joins written; product_category and fiscal_quarter resolve through the declared relationships to the product and date entities.
Did promotions work, by region?
SELECT
region,
MEASURE(revenue) AS total_revenue,
MEASURE(promoted_revenue) AS promoted_revenue,
MEASURE(promoted_revenue_share) AS promoted_share,
MEASURE(margin_pct) AS margin
FROM sales
GROUP BY region
ORDER BY promoted_share DESC
One query, four governed metrics, and the region attribute is the historically accurate Type 2 value — sales count toward the region the customer was in when they bought.
Top ten products among high-value customers, with rank:
WITH product_perf AS (
SELECT
product_category,
product_name,
MEASURE(revenue) AS revenue,
MEASURE(average_order_value) AS aov
FROM sales
WHERE customer_segment = 'High Value'
GROUP BY product_category, product_name
)
SELECT
product_category,
product_name,
revenue,
aov,
RANK() OVER (PARTITION BY product_category ORDER BY revenue DESC) AS category_rank
FROM product_perf
WHERE revenue > 0
ORDER BY product_category, category_rank
LIMIT 10
CTEs and window functions compose over governed measures — sophisticated analysis without re-deriving the basics.
And in natural language. With the semantic model selected as AI context, "show me margin by product category for high-value customers this fiscal year, promoted sales only" has everything it needs: margin is margin_pct (ratio of sums), high-value is a customer_segment value, fiscal year is a date-entity attribute, promoted is a defined filter. The generated query uses your definitions — the difference between AI that accelerates analysts and AI that fabricates joins.
The payoff, and the path
Look at the architecture you've assembled across this guide:
- Raw → staging: sources landed, cleaned at the boundary, quality-checked.
- Staging → core: a star schema — atomic-grain facts, conformed wide dimensions, history tracked deliberately — built as version-controlled, modular transformations.
- Core → semantic: entities, dimensions, measures, and relationships defined once in SMDL, in the same project, reviewed and promoted like code.
- Semantic → everyone: analysts writing semantic SQL, tools connecting over standard drivers, and AI generating queries against governed definitions.
Each layer does one job. The physical layer made data correct and efficient. The dimensional layer made it understandable and gave every number a declared grain. The semantic layer made it meaningful — and made that meaning enforceable.
That's data modeling done end to end: hard work, done once, in the right place — so that every question afterward is just a query.
Back to the guide overview →
See Semantic Intelligence in Action
Coginiti operationalizes business meaning across your entire data estate.