From Star Schema to Semantic Model

A Complete Retail Walkthrough

9 min read

Part of the Data Modeling for Analytics guide. Previous: Semantic Modeling · Back to the guide overview

The previous chapters built the theory: physical layering, the star schema, fact and dimension design, and semantic modeling. This chapter puts it all together on one worked example, end to end: business requirements → dimensional design → physical tables → semantic model → business questions answered in semantic SQL. For a hands-on version of this build, see Coginiti's retail project workflow tutorial.

The scenario

You're the analytics engineer for a specialty retailer selling through stores and a web shop. Source systems land in your warehouse: a point-of-sale system (order headers and order lines), a product catalog, and a customer database. Leadership's questions are the usual suspects: revenue and margin by product, category, channel, and region; promotion effectiveness; customer purchasing patterns; everything trended over fiscal periods.

Your stack: raw data lands in a raw schema; CoginitiScript transformations build staging and core; the semantic model is defined over core in .smdl files in the semantic/ folder of the project.

Step 1–4: The dimensional design

Applying Kimball's four-step process:

1. Business process: retail sales (order taking). The highest-priority process; others (inventory, fulfillment, returns) will follow later, sharing conformed dimensions via the bus matrix.

2. Grain: one row per order line — the most atomic level the POS provides. Order-level amounts (shipping fees) are excluded from this table; they'll get a header-grain fact when needed.

3. Dimensions: date (order date), customer, product, store/channel, promotion.

4. Facts: unit_quantity, unit_price_amount, discount_amount, unit_cost_amount — all additive, all true to the line grain. Extended amounts are derivable; we'll store gross_amount and net_amount for convenience and define margin post-aggregation in the semantic layer, never as a stored ratio.

The star:

The worked star schema: fct_sales_detail at the center, surrounded by five conformed dimensions — dim_date, dim_customer, dim_product, dim_store, and dim_promotion — each joined by a surrogate key.

The physical model

Staging models (stg_pos_order_headers, stg_pos_order_lines, stg_catalog_products, stg_crm_customers) rename, cast, deduplicate. The core layer then builds the star. Abbreviated DDL for the key tables:

CREATE TABLE core.dim_customer (
    customer_key     INTEGER       NOT NULL,  -- surrogate (version) key
    customer_id      VARCHAR       NOT NULL,  -- natural key (CRM)
    full_name        VARCHAR       NOT NULL,
    customer_segment VARCHAR       NOT NULL,  -- Type 2 tracked
    region           VARCHAR       NOT NULL,  -- Type 2 tracked
    signup_date      DATE          NOT NULL,  -- Type 0
    is_current       BOOLEAN       NOT NULL,
    valid_from       TIMESTAMP     NOT NULL,
    valid_to         TIMESTAMP
);

CREATE TABLE core.fct_sales_detail (
    sales_detail_key  BIGINT        NOT NULL,
    order_date_key    INTEGER       NOT NULL,  -- → dim_date
    customer_key      INTEGER       NOT NULL,  -- → dim_customer (version row)
    product_key       INTEGER       NOT NULL,  -- → dim_product
    store_key         INTEGER       NOT NULL,  -- → dim_store
    promotion_key     INTEGER       NOT NULL,  -- → dim_promotion (0 = none)
    order_number      VARCHAR       NOT NULL,  -- degenerate dimension
    unit_quantity     INTEGER       NOT NULL,
    unit_price_amount DECIMAL(12,2) NOT NULL,
    discount_amount   DECIMAL(12,2) NOT NULL,
    unit_cost_amount  DECIMAL(12,2) NOT NULL,
    gross_amount      DECIMAL(12,2) NOT NULL,  -- qty × price
    net_amount        DECIMAL(12,2) NOT NULL   -- gross − discount
);
-- Grain: one row per order line
-- Partitioned/clustered by order_date_key; loaded incrementally

Design notes, connecting back to earlier chapters: customer segment and region are Type 2 (so historical sales stay attributed to the segment/region in effect at order time); promotion_key = 0 points at an explicit "No Promotion" row rather than null; the order number rides along as a degenerate dimension; money is DECIMAL; and the fact is date-partitioned and incrementally loaded by CoginitiScript with idempotent merge logic.

The semantic model

Now the layer that consumers actually touch. In the project: semantic/sales/fct_sales_detail.smdl, semantic/customer/dim_customer.smdl, etc., registered in project.toml. The model, abbreviated to the interesting parts:

// semantic/customer/dim_customer.smdl
entity "customer" {
  label       = "Customer"
  description = "One row per customer version; segment and region track history."
  table_name  = "core.dim_customer"

  dimension "customer_id" {
    label = "Customer ID"
    type  = "text"
    expr  = "customer_id"
  }

  dimension "full_name" {
    label = "Customer Name"
    type  = "text"
    expr  = "full_name"
  }

  dimension "customer_segment" {
    label       = "Segment"
    type        = "text"
    description = "Marketing segment in effect at the time of each sale."
    expr        = "customer_segment"
  }

  dimension "region" {
    label = "Region"
    type  = "text"
    expr  = "region"
  }

  dimension "is_current" {
    label  = "Is Current Version"
    type   = "bool"
    expr   = "is_current"
    hidden = true
  }
}
// semantic/sales/fct_sales_detail.smdl
entity "sales" {
  label       = "Sales"
  description = "Retail sales at order-line grain, all channels."
  table_name  = "core.fct_sales_detail"

  // ---- relationships: the star's joins, declared once ----
  relationship "order_date" {
    label = "Order Date"
    type  = "many_to_one"
    expr  = "sales.order_date_key = order_date.date_key"
  }

  relationship "customer" {
    label = "Customer"
    type  = "many_to_one"
    expr  = "sales.customer_key = customer.customer_key"
  }

  relationship "product" {
    label = "Product"
    type  = "many_to_one"
    expr  = "sales.product_key = product.product_key"
  }

  relationship "store" {
    label = "Store"
    type  = "many_to_one"
    expr  = "sales.store_key = store.store_key"
  }

  relationship "promotion" {
    label = "Promotion"
    type  = "many_to_one"
    expr  = "sales.promotion_key = promotion.promotion_key"
  }

  // ---- dimensions ----
  dimension "order_number" {
    label = "Order Number"
    type  = "text"
    expr  = "order_number"
  }

  // hidden operands so the system can type the measure expressions
  dimension "net_amount" {
    type   = "decimal"
    expr   = "net_amount"
    hidden = true
  }

  dimension "unit_cost_amount" {
    type   = "decimal"
    expr   = "unit_cost_amount"
    hidden = true
  }

  dimension "unit_quantity" {
    type   = "integer"
    expr   = "unit_quantity"
    hidden = true
  }

  // cross-entity convenience dimensions: the declared relationships
  // let these reference related entities; the join is added automatically
  // whenever a query uses them
  dimension "product_category" {
    label = "Product Category"
    type  = "text"
    expr  = "product.category_name"
  }

  dimension "product_name" {
    label = "Product Name"
    type  = "text"
    expr  = "product.product_name"
  }

  dimension "customer_segment" {
    label = "Customer Segment"
    type  = "text"
    expr  = "customer.customer_segment"
  }

  dimension "region" {
    label = "Customer Region"
    type  = "text"
    expr  = "customer.region"
  }

  dimension "fiscal_quarter" {
    label = "Fiscal Quarter"
    type  = "text"
    expr  = "order_date.fiscal_quarter"
  }

  // ---- measures: the metric definitions, once ----
  measure "units_sold" {
    label            = "Units Sold"
    type             = "number"
    aggregation_type = "sum"
    expr             = "unit_quantity"
  }

  measure "revenue" {
    label            = "Revenue"
    type             = "decimal"
    description      = "Net sales: gross minus line discounts. Excludes shipping."
    aggregation_type = "sum"
    expr             = "net_amount"
  }

  measure "cost" {
    label            = "Cost of Goods"
    type             = "decimal"
    aggregation_type = "sum"
    expr             = "unit_cost_amount * unit_quantity"
  }

  measure "promoted_revenue" {
    label            = "Promoted Revenue"
    type             = "decimal"
    description      = "Revenue from lines sold under any promotion."
    aggregation_type = "sum"
    expr             = "net_amount"
    filter           = "promotion_key != 0"
  }

  measure "order_count" {
    label            = "Order Count"
    type             = "number"
    aggregation_type = "count_distinct"
    expr             = "order_number"
  }

  // derived measures: post-aggregation arithmetic
  measure "profit" {
    label = "Profit"
    type  = "decimal"
    expr  = "revenue - cost"
  }

  measure "margin_pct" {
    label = "Margin %"
    type  = "number"
    expr  = "profit / NULLIF(revenue, 0)"
  }

  measure "promoted_revenue_share" {
    label = "Promoted Revenue Share"
    type  = "number"
    expr  = "promoted_revenue / NULLIF(revenue, 0)"
  }

  measure "average_order_value" {
    label = "Average Order Value"
    type  = "decimal"
    expr  = "revenue / NULLIF(order_count, 0)"
  }
}

The product, store, promotion, and order_date entities follow the same pattern as customer and are elided for brevity — each one a wide dimension exposed with business labels, with bookkeeping columns hidden.

Every design decision from the earlier chapters has surfaced here: the star's foreign keys became declared relationships; additivity discipline became simple vs. derived measures (margin_pct is a ratio of sums by construction); the business rule "promoted" became a filtered measure instead of forty hand-written CASE expressions; and count_distinct on the degenerate order number gives a correct order count at line grain — a calculation analysts routinely get wrong by hand.

Answering the business questions

All consumers — analysts, dashboards, external tools over JDBC/ODBC, and AI assistance — now query one set of definitions.

Revenue, profit, and margin by category and fiscal quarter:

SELECT
    product_category,
    fiscal_quarter,
    MEASURE(revenue)    AS revenue,
    MEASURE(profit)     AS profit,
    MEASURE(margin_pct) AS margin
FROM sales
GROUP BY product_category, fiscal_quarter
ORDER BY fiscal_quarter, revenue DESC

No joins written; product_category and fiscal_quarter resolve through the declared relationships to the product and date entities.

Did promotions work, by region?

SELECT
    region,
    MEASURE(revenue)                AS total_revenue,
    MEASURE(promoted_revenue)       AS promoted_revenue,
    MEASURE(promoted_revenue_share) AS promoted_share,
    MEASURE(margin_pct)             AS margin
FROM sales
GROUP BY region
ORDER BY promoted_share DESC

One query, four governed metrics, and the region attribute is the historically accurate Type 2 value — sales count toward the region the customer was in when they bought.

Top ten products among high-value customers, with rank:

WITH product_perf AS (
    SELECT
        product_category,
        product_name,
        MEASURE(revenue)             AS revenue,
        MEASURE(average_order_value) AS aov
    FROM sales
    WHERE customer_segment = 'High Value'
    GROUP BY product_category, product_name
)
SELECT
    product_category,
    product_name,
    revenue,
    aov,
    RANK() OVER (PARTITION BY product_category ORDER BY revenue DESC) AS category_rank
FROM product_perf
WHERE revenue > 0
ORDER BY product_category, category_rank
LIMIT 10

CTEs and window functions compose over governed measures — sophisticated analysis without re-deriving the basics.

And in natural language. With the semantic model selected as AI context, "show me margin by product category for high-value customers this fiscal year, promoted sales only" has everything it needs: margin is margin_pct (ratio of sums), high-value is a customer_segment value, fiscal year is a date-entity attribute, promoted is a defined filter. The generated query uses your definitions — the difference between AI that accelerates analysts and AI that fabricates joins.

The payoff, and the path

Look at the architecture you've assembled across this guide:

  1. Raw → staging: sources landed, cleaned at the boundary, quality-checked.
  2. Staging → core: a star schema — atomic-grain facts, conformed wide dimensions, history tracked deliberately — built as version-controlled, modular transformations.
  3. Core → semantic: entities, dimensions, measures, and relationships defined once in SMDL, in the same project, reviewed and promoted like code.
  4. Semantic → everyone: analysts writing semantic SQL, tools connecting over standard drivers, and AI generating queries against governed definitions.

Each layer does one job. The physical layer made data correct and efficient. The dimensional layer made it understandable and gave every number a declared grain. The semantic layer made it meaningful — and made that meaning enforceable.

That's data modeling done end to end: hard work, done once, in the right place — so that every question afterward is just a query.

Back to the guide overview →

See Semantic Intelligence in Action

Coginiti operationalizes business meaning across your entire data estate.