Data Modeling for Analytics

From Physical Tables to the Semantic Layer

9 min read

Every analytics question your business asks — What were sales by region last quarter? Which customers are churning? How is margin trending? — gets answered by a query against data that someone, at some point, decided how to structure. That decision is data modeling, and it determines whether your team answers questions in minutes with numbers everyone trusts, or spends days reconciling three different versions of "revenue."

This guide teaches data modeling end to end, from the physical tables in your warehouse up through the semantic layer that exposes business concepts to analysts, BI tools, and AI. It places a strong emphasis on dimensional modeling — the approach developed by Ralph Kimball and refined over three decades — because dimensional thinking remains the most effective foundation for analytics, and because it maps almost one-to-one onto modern semantic models.

The guide is written for analysts and analytics engineers: people who write SQL every day and want a rigorous, practical framework for structuring data. Examples throughout use Coginiti's transformation layer (CoginitiScript) and Semantic Model Definition Language (SMDL), but the concepts apply to any modern cloud data warehouse.

What is data modeling?

Data modeling is the practice of deciding how data is structured, named, related, and exposed so that it can answer questions reliably. In an analytics context, modeling happens at several distinct levels:

Conceptual modeling identifies the things the business cares about — customers, products, orders, shipments — and how they relate. No tables, no columns; just business concepts.

Logical modeling translates concepts into structures: which entities exist, what attributes they carry, what keys identify them, and how they relate. A logical model is independent of any particular database platform.

Physical modeling implements the logical model on a real platform: actual tables, columns, data types, keys, partitioning, clustering, and materialization choices tuned for how the warehouse stores and scans data.

Semantic modeling sits above the physical layer. It defines business entities, dimensions, measures, and relationships in business language, so that anyone — or any AI — querying the data works with Sales Amount and Customer Region rather than FCT_SLS_DTL.AMT_23 joined to DIM_GEO_RGN.

Most data teams under-invest in the top and bottom of this stack. They write transformation SQL (physical) without a clear logical design, and they expose raw tables to BI tools without a semantic layer. The result is the familiar pathology: every dashboard computes metrics slightly differently, tribal knowledge lives in someone's head, and new analysts take months to become productive.

The analytics modeling stack

A well-modeled warehouse is organized into layers, each with a distinct job:

Data flows upward through these layers via transformations — in Coginiti, written as modular, reusable CoginitiScript packages — and is consumed from the top via the semantic layer, which translates business-friendly queries into SQL against the physical tables beneath.

The middle of this stack is where dimensional modeling does its work, and it's worth understanding why that approach won.

The major modeling approaches

Four schools of thought dominate analytics data modeling. You should know all four, even though this guide recommends (and focuses on) the dimensional approach for the analytics-facing layers.

Dimensional modeling (Kimball). Organize data around business processes. Each process gets a fact table holding its numeric measurements at a declared grain, surrounded by denormalized dimension tables holding descriptive context — the classic star schema. Kimball's approach, introduced in The Data Warehouse Toolkit (1996, third edition 2013), optimizes for understandability and query performance. It remains the dominant approach for analytics because business questions are naturally dimensional: a measure, sliced by attributes, filtered by conditions.

Normalized / relational modeling (Inmon). Build an enterprise data warehouse in third normal form (3NF), modeling business entities and their relationships with minimal redundancy, then derive downstream marts from it. Claudia Imhoff, Nicholas Galemmo, and Jonathan Geiger's Mastering Data Warehouse Design presents the mature version of this school: a normalized, subject-oriented core that feeds dimensional marts. Normalization excels at integration and consistency in the core layers; it is a poor shape to hand directly to analysts, because answering even simple questions requires many joins.

Data Vault. A hub-link-satellite pattern designed for auditability and resilience to source-system change. Hubs hold business keys, links hold relationships, satellites hold attributes with full history. Data Vault shines in heavily regulated environments with many volatile sources, at the cost of significant query complexity — Data Vault implementations almost always project dimensional marts on top for consumption.

Wide tables (One Big Table). Join everything into a single denormalized table per analysis area. Modern columnar warehouses handle wide tables well, and for small teams getting started this is a legitimately fast path to insight. The weakness appears at scale: when the same customer attributes are baked into a dozen wide tables, a definition change must be hunted down in a dozen places, and metric drift creeps in.

These approaches are not mutually exclusive. A common and effective pattern — and the one this guide teaches — is: light normalization in staging and integration where it aids consistency, dimensional models (star schemas) as the core analytical layer, optional wide tables materialized from the star for specific high-traffic workloads, and a semantic layer defined over the dimensional model as the single place where business logic and metric definitions live.

Why dimensional modeling still wins

Kimball's techniques were invented when storage was expensive and databases were slow, and some of his machinery existed purely to cope with those constraints. Cloud warehouses with columnar storage and massively parallel compute have made parts of the old playbook optional: you can often skip aggregate snapshot tables until query patterns demand them, and cheap storage opens simpler options for tracking history. Modeling can also be done just in time — transformations within the warehouse are cheap to change, so you don't need to design every star perfectly up front.

But the core of dimensional modeling was never about hardware. It survives because it solves problems that faster machines don't touch:

It matches how people ask questions. "Show me revenue by product category for EMEA over the last six quarters" is a measure sliced by three dimensions. A star schema is that sentence, expressed as tables.

It forces the grain conversation. Declaring exactly what one row of a fact table represents — one order line, one daily account balance, one click — eliminates the largest class of analytical errors: double counting and misaggregation.

It creates conformance. When every fact table shares the same customer dimension and date dimension, results from different processes can be compared and combined safely. This is the foundation of an integrated view of the business.

It is the natural substrate for a semantic layer. Semantic models are built from entities, dimensions, measures, and relationships. A star schema already is that structure. Teams with clean dimensional models find that defining their semantic layer is nearly mechanical; teams with tangled schemas find it nearly impossible.

That last point matters more every year. As AI assistants generate more of the SQL run against warehouses, the quality of generated queries depends directly on how much business meaning the system can see. A semantic model over a dimensional warehouse gives an LLM descriptive names, explicit relationships, and unambiguous metric definitions — the difference between AI that answers correctly and AI that hallucinates joins.

How this guide is organized

Each chapter below is a standalone deep dive. Read them in order for a complete course, or jump to what you need.

Physical Data Modeling for the Cloud Warehouse — The foundation: warehouse layers, normalization and when to use it, natural vs. surrogate vs. durable keys, naming conventions, data types, and platform-level physical design (partitioning, clustering, distribution, materialization) for columnar MPP warehouses.

Dimensional Modeling and the Star Schema — The heart of the guide: facts and dimensions, the star schema, Kimball's four-step design process, declaring the grain, the enterprise bus matrix, and conformed dimensions.

Fact Table Design — Transaction, periodic snapshot, accumulating snapshot, and factless fact tables; additive, semi-additive, and non-additive facts; degenerate dimensions; and the design mistakes that quietly corrupt metrics.

Dimension Table Design — Wide denormalized dimensions, the date dimension, surrogate and durable keys, slowly changing dimensions (Types 0–7), role-playing and junk dimensions, hierarchies, and bridge tables.

Semantic Modeling: Giving Your Data Business Meaning — What a semantic layer is and why it has become essential; entities, dimensions, measures, and relationships; defining metrics once with simple, filtered, and derived measures; and how dimensional concepts map directly onto semantic models, illustrated with Coginiti SMDL.

From Star Schema to Semantic Model: A Complete Walkthrough — An end-to-end retail example: design the star schema, build it, define the full semantic model in SMDL, and query it with semantic SQL — including how AI assistance benefits at each step.

A note on doing the hard work once

Kimball's philosophy can be compressed into one sentence: do the hard work now to make it easy to query later. The modern restatement adds a second clause: and define each business concept exactly once.

The physical model does the hard work of integration and structure. The dimensional model does the hard work of declaring grains and conforming dimensions. The semantic layer does the hard work of pinning down what "revenue," "active customer," and "margin" actually mean — so that every analyst, every dashboard, and every AI assistant computes them the same way.

Get those three layers right and everything downstream — self-service analytics, trustworthy dashboards, accurate AI-generated SQL — stops being a struggle and starts being a property of the system.

Start with Physical Data Modeling for the Cloud Warehouse →

See Semantic Intelligence in Action

Coginiti operationalizes business meaning across your entire data estate.

Request a Demo