Semantic Modeling

Giving Your Data Business Meaning

9 min read

Part of the Data Modeling for Analytics guide. Previous: Dimension Table Design · Next: From Star Schema to Semantic Model →

A physical data model tells the database how to store data. A semantic model tells people and machines what the data means. It is the layer where fct_sls_dtl.unt_qty * fct_sls_dtl.unt_prc_amt becomes simply Sales Amount — defined once, computed the same way in every query, by every analyst, every dashboard, and every AI assistant.

This chapter explains what semantic modeling is, why it has moved from nice-to-have to essential, and how to build a semantic model — using Coginiti's Semantic Model Definition Language (SMDL) for concrete examples. If you've worked through the dimensional modeling chapters, you'll find the concepts familiar: a semantic model is, in large part, dimensional modeling expressed as machine-readable metadata.

The problem the semantic layer solves

Three chronic failures motivate the semantic layer:

Metric drift. "Revenue" is computed in forty queries across the organization, with subtle differences: some include refunds, some don't; some filter test accounts, some don't. Two dashboards disagree, an executive notices, and the data team spends a week reconciling instead of analyzing. Without a single place where revenue is defined, every query is an opportunity to redefine it.

The translation burden. Physical models are optimized for storage and loading, and even well-designed warehouses carry physical artifacts — surrogate keys, SCD bookkeeping columns, staging conventions — that consumers shouldn't have to understand. In less fortunate warehouses, the physical layer is actively hostile: tables named TBL_987_XREF, columns named FLD_23_AMT, no declared foreign keys. Every analyst pays a translation tax on every query; most business users simply can't pay it and queue up requests instead.

AI needs context. Natural-language analytics — asking "what were last quarter's sales by region?" and getting correct SQL (text-to-SQL) — works only as well as the context the language model receives. Given a raw schema of cryptic names and undeclared relationships, an LLM guesses, and guessed joins produce confidently wrong answers (a hallucination). Given a semantic model — business names, descriptions, typed attributes, explicit relationships, defined metrics — the same model generates correct queries reliably. The semantic layer is, among other things, the context layer for AI.

The common root: meaning that lives in people's heads and scattered SQL instead of in the system. The semantic layer moves it into the system.

The four building blocks

Semantic models everywhere share the same conceptual vocabulary — and it maps one-to-one onto the dimensional concepts from earlier chapters:

Semantic concept What it is Dimensional ancestor
Entity A queryable business dataset Fact or dimension table
Dimension A descriptive attribute for grouping/filtering Dimension attribute
Measure A quantitative value with a defined aggregation Fact + its additivity
Relationship A declared join with cardinality Fact→dimension foreign key

This is why dimensional modeling earns its emphasis in this guide: if you've built a star schema, your semantic model is mostly transcription. Teams without dimensional discipline discover the hard way that a semantic layer can't conjure clarity from a tangled schema — it can only express clarity the data model already has. (It can paper over a moderately awkward physical model, as we'll see, but the cleaner the foundation, the thinner the paper.)

In Coginiti, semantic models are defined in .smdl files (HCL-like syntax) stored in your project alongside your CoginitiScript transformation code — version-controlled, code-reviewed, and environment-aware like everything else. Let's build up the pieces.

Entities

An entity is a named dataset — typically mapped directly to a physical table:

entity "sales_detail" {
  label       = "Sales Detail"
  description = "One row per order line at the retail point of sale."
  table_name  = "core.fct_sales_detail"
}

When the physical table isn't consumption-ready, an entity can instead be defined over a query — a useful escape hatch for renaming, filtering soft deletes, or reshaping a legacy table without waiting for a pipeline change:

entity "sales_detail" {
  label = "Sales Detail"
  query = <<-EOF
    SELECT
      id          AS sales_detail_key,
      header_key  AS sales_header_key,
      product_key,
      unit_quantity,
      unit_price_amount
    FROM legacy.FCT_SLS_DTL
    WHERE is_deleted = FALSE
  EOF
}

Prefer table_name (query-based entities add overhead), and treat query-based entities as a signal of transformation work the physical layer still owes you.

Dimensions

Dimensions expose attributes with business names, types, and descriptions:

dimension "category_name" {
  label       = "Product Category"
  type        = "text"
  description = "Merchandising category, e.g. 'Frozen Foods'."
  expr        = "category_name"
}

dimension "order_month" {
  label = "Order Month"
  type  = "date"
  expr  = "DATE_TRUNC('month', order_date)"
}

dimension "full_name" {
  label = "Customer Name"
  type  = "text"
  expr  = "CONCAT(first_name, ' ', last_name)"
}

Two practical SMDL details: when a dimension's expression combines several physical columns, each operand column needs its own dimension defined in the entity (so the system knows operand types) — mark helpers hidden = true to keep them out of the UI. And descriptions aren't decoration: they're documentation for humans and context for AI assistants, so write them as if explaining the field to a new analyst.

Measures

Measures are where metric drift goes to die. A simple measure binds an expression to an aggregation:

measure "sales_amount" {
  label            = "Sales Amount"
  type             = "decimal"
  description      = "Extended sales: quantity × unit price."
  aggregation_type = "sum"
  expr             = "unit_quantity * unit_price_amount"
}

measure "order_count" {
  label            = "Order Count"
  type             = "number"
  aggregation_type = "count_distinct"
  expr             = "order_number"
}

A filtered measure restricts which rows feed the aggregation — defining business-rule metrics declaratively instead of in scattered CASE WHEN SQL:

measure "completed_revenue" {
  label            = "Completed Revenue"
  type             = "decimal"
  aggregation_type = "sum"
  expr             = "sales_amount"
  filter           = "order_status = 'completed'"
}

The filter compiles to SUM(CASE WHEN order_status = 'completed' THEN ... END) — every consumer gets the same rule.

A derived measure combines other measures post-aggregation — the semantic encoding of the additivity lessons from Fact Table Design:

measure "profit" {
  label = "Profit"
  type  = "decimal"
  expr  = "revenue - cost"          // → SUM(amount) - SUM(cost_amount)
}

measure "margin" {
  label = "Profit Margin"
  type  = "number"
  expr  = "profit / NULLIF(revenue, 0)"   // chains resolve transitively
}

Each referenced measure aggregates independently, then the arithmetic applies — ratio of sums, never sum of ratios, for any grouping a user chooses. Derived measures can reference other derived measures (the dependency chain resolves automatically; circular references are rejected), so a metrics tree — revenue → profit → margin — lives in one auditable place.

Relationships

Relationships declare how entities join, with explicit cardinality:

entity "sales_detail" {
  table_name = "core.fct_sales_detail"

  relationship "product" {
    label = "Product"
    type  = "many_to_one"
    expr  = "sales_detail.product_key = dim_product.product_key"
  }

  relationship "customer" {
    label = "Customer"
    type  = "many_to_one"
    expr  = "sales_detail.customer_key = dim_customer.customer_key"
  }
}

Declared relationships are what free consumers from join-writing. A dimension on one entity can reference a related entity's column (expr = "dim_product.product_name"), and the join is added automatically whenever it's used — including transitively across multi-hop paths (line → header → customer). The cardinality declarations are also exactly what prevents the fan-out double counting that hand-written fact-to-fact joins invite.

Querying the semantic model

Coginiti's semantic layer exposes standard SQL (with JDBC/ODBC connectivity for external tools), with one extension: measures are invoked through MEASURE(), which applies each measure's defined aggregation:

SELECT
    category_name,
    region,
    MEASURE(sales_amount)      AS sales,
    MEASURE(completed_revenue) AS completed_sales,
    MEASURE(margin)            AS margin
FROM sales_detail
GROUP BY category_name, region
ORDER BY sales DESC

Notice what's absent: no join syntax (relationships supply it — category_name and region live on related dimension entities), no aggregation functions to choose (definitions supply them), no re-statement of the completed-orders business rule, and no opportunity to compute margin as an average of ratios. The query says what in business terms; the semantic layer compiles how — translating to the native SQL of the underlying platform. CTEs, window functions, and HAVING MEASURE(...) are all available for sophisticated analysis on top of governed definitions.

The same definitions power natural-language analytics: select a semantic model as AI context in Coginiti, and "show me profit margin by product category for completed orders this quarter" generates against your definitions of margin, completed, and fiscal quarter — not a language model's guesses.

Scoping and organizing the semantic layer

Organize by subject area. One entity per .smdl file, grouped into folders (semantic/sales/, semantic/customer/), registered in the project configuration. Multiple semantic layers can coexist per project (e.g., a full model and a narrower sales-domain model), letting you expose different scopes to different audiences.

Expose business concepts, not plumbing. Surrogate keys, SCD bookkeeping, helper operands: hidden = true. If an attribute would confuse a business user, it shouldn't be visible — even though it must exist for joins and typing.

Name for the business, describe for the newcomer. Entity and field names should match the words used in business conversations; descriptions should resolve the ambiguities (does Sales Amount include tax?). This is simultaneously your data dictionary and your AI prompt context.

Keep logic in the right layer. Heavy computation — joins of raw sources, history tracking, allocation rules — belongs in the transformation layer (CoginitiScript), materialized into well-shaped tables. The semantic layer holds definitions: labels, relationships, aggregation semantics, metric formulas. When you find yourself writing elaborate SQL inside an entity query, that's transformation work asking to move down a layer.

Version and review like code. Because .smdl files live in the project, metric definition changes ride through the same review and promotion workflow as transformation changes — which is precisely how a definition change ("revenue now excludes marketplace fees") rolls out everywhere at once, with an audit trail.

The concepts are all in hand. What remains is to see them work together — a complete model, from source tables to star schema to semantic SQL.

Next: From Star Schema to Semantic Model: A Complete Walkthrough →

See Semantic Intelligence in Action

Coginiti operationalizes business meaning across your entire data estate.