Physical Data Modeling for the Cloud Warehouse
Layers, Normalization, Keys, and Columnar MPP Design
9 min read
Part of the Data Modeling for Analytics guide. Next: Dimensional Modeling and the Star Schema →
Physical data modeling is where intention meets the database: real tables, real columns, real keys, on a real platform with real performance characteristics. Get the physical layer right and every layer above it — dimensional models, semantic models, dashboards — becomes easier to build and cheaper to run. Get it wrong and you'll spend years compensating with clever SQL.
This chapter covers the physical architecture of an analytics warehouse: how to organize layers, when and how much to normalize, how to handle keys, and how to design for the columnar, massively parallel platforms that dominate modern analytics.
The layered warehouse
The single most important physical design decision is to organize your warehouse into layers with clear contracts between them. A proven structure:
Raw / landing
Exact, unmodified copies of source data, loaded by your ingestion tooling. Tables here mirror the source: same columns, same names, same mess. The raw layer exists so you can always re-derive everything downstream without re-extracting from sources, and so you can audit what the source actually said.
Rules for the raw layer: load it, never edit it, and never let anything except staging read from it.
Staging
Staging gives each raw table "a bath and a shave" — light, source-shaped transformations that make data usable without applying business logic:
- Rename cryptic or inconsistent columns (
insertion→inserted_at,name_first→first_name) - Cast to correct data types (string dates →
DATE,0/1flags →BOOLEAN) - Deduplicate rows the source delivered twice
- Flatten semi-structured payloads (JSON event properties → typed columns)
- Standardize values (trim whitespace, fix casing, map code values to readable ones)
Each raw table gets exactly one staging model, named predictably: stg_<source>_<entity> (e.g., stg_pos_orders, stg_crm_customers). Staging models are the only readers of raw data, and everything downstream reads staging, never raw. This single rule means that when you discover a new data quality problem, you fix it in one place.
In Coginiti, staging transformations are typically CoginitiScript blocks organized into a staging package — modular, version-controlled SQL with explicit dependencies, so the lineage from raw to staged is visible and testable. Data quality assertions (uniqueness of keys, non-null constraints, accepted values) belong here too: catch bad data at the boundary, not in a dashboard.
Integration / core
The core layer unifies entities across sources and applies real business logic. This is where a customer who exists in your CRM, your billing system, and your product database becomes one customer row; where currency conversions, fiscal calendars, and status derivations live; and — most importantly — where your dimensional models (fact and dimension tables, covered in the next chapter) are built.
Presentation / marts
Subject-area structures shaped for specific consumers: a finance mart, a marketing mart, perhaps wide denormalized tables materialized from the dimensional core for a high-traffic dashboard. Marts contain no new business logic — they reshape and subset what the core already defines.
Semantic layer
Above all the physical layers sits the semantic model: business names, relationships, and metric definitions, defined in Coginiti as .smdl files alongside your transformation code. The semantic layer is covered in depth in Semantic Modeling; for now, the physical takeaway is that the cleaner your core layer, the thinner and simpler your semantic layer can be.
Normalization: what it is and where it belongs
Normalization organizes data to eliminate redundancy. The practical ladder:
- First normal form (1NF): every column holds atomic values; no repeating groups or embedded lists.
- Second normal form (2NF): every non-key column depends on the whole primary key (relevant for composite keys).
- Third normal form (3NF): non-key columns depend only on the key — no transitive dependencies. If
customer_citydeterminescustomer_region, region belongs in its own table.
Normalization is the right tool when data is being written and integrated: it prevents update anomalies and gives each fact exactly one home. This is why the relational school of warehouse design (Inmoff-style architectures, articulated thoroughly in Imhoff, Galemmo, and Geiger's Mastering Data Warehouse Design) builds the integration core in or near 3NF: when you're reconciling five source systems' views of "customer," a normalized structure keeps the reconciliation logic sane.
Normalization is the wrong shape for analytics consumption. A fully normalized schema scatters the answer to "sales by product category by region" across a dozen tables, demanding joins that analysts must each rediscover and re-derive correctly. The dimensional layer deliberately denormalizes descriptive data back into wide dimension tables, trading some redundancy for queryability — a trade that columnar storage and cheap compute have made overwhelmingly favorable.
So the practical guidance is: normalize toward the sources (staging, integration of entities), denormalize toward the consumers (dimensions, marts). Know the rules of normalization so you can break them deliberately rather than accidentally.
Keys: natural, surrogate, and durable
Key handling separates robust warehouses from fragile ones. Three kinds of keys matter:
Natural keys (also called business keys) come from source systems: order numbers, SKUs, customer account IDs, email addresses. They carry meaning, but they're outside your control — sources recycle them, change their format, or assign different IDs to the same real-world entity.
Surrogate keys are meaningless integers (or hashes) assigned by the warehouse, one per row of a dimension table. Surrogate keys insulate the warehouse from source-system key chaos, enable the integration of multiple sources (two source customer IDs → one surrogate), keep fact tables narrow and join-efficient, and — critically — make it possible to track dimension history, because one natural key can map to several surrogate-keyed rows over time (see slowly changing dimensions).
Durable keys (Kimball's "durable supernatural keys") are stable identifiers for the entity itself, independent of versions. When customer 1047 has three historical rows (three surrogate keys) because their address changed twice, the durable key is the one value shared by all three, letting you count distinct customers or link to the current profile regardless of version.
Convention: fact tables contain surrogate keys to dimensions, dimensions carry surrogate key + natural key + durable key, and natural keys never appear as join keys in fact tables.
Naming conventions and data types
Boring, consistent names are a force multiplier. A workable standard:
- Layer prefixes for objects:
stg_,int_,fct_,dim_(e.g.,fct_sales_detail,dim_customer) - Snake_case everywhere; no spaces, no quoting-dependent identifiers
- Suffixes that declare semantics:
_key(surrogate keys),_id(natural keys),_at(timestamps),_date(dates),is_/has_(booleans),_amount(monetary),_quantity(counts of things) - Spell words out.
unit_price_amount, notunt_prc_amt. Storage is cheap; deciphering abbreviations is not. This pays double when an AI assistant reads your schema — descriptive names are context.
Data type discipline matters more in analytics than in OLTP:
- Money is
DECIMAL/NUMERIC, never float. Floating-point arithmetic produces rounding errors that sum into real discrepancies. - Timestamps in UTC, with timezone conversion at presentation time. Store the event's wall-clock timezone separately if it matters to the business.
- Dates as
DATE, not strings, not timestamps-at-midnight. - Low-cardinality codes as text, human-readable (
'completed', not4). Columnar compression makes the storage cost negligible, and readability accrues to every query and every semantic definition downstream.
Designing for columnar, MPP platforms
Modern warehouses (Snowflake, BigQuery, Redshift, PostgreSQL-family analytics engines, Yellowbrick) share two architectural traits that should shape your physical design:
Columnar storage. Data is stored by column, not by row. A query reading 5 columns of a 60-column table touches only those 5 columns' data. Consequences: wide tables are cheap to store (similar values compress extremely well) and cheap to query if queries select few columns; SELECT * is the enemy; and adding descriptive columns to dimensions costs almost nothing.
Massively parallel processing (MPP). Queries are distributed across many nodes. Consequences: scans are fast and joins are the expensive operation to think about — particularly joins that redistribute large tables across nodes.
Platform-level physical design follows:
Partitioning / clustering. Organize large fact tables by the column queries filter on most — almost always the transaction date. Date-partitioned (or date-clustered) fact tables let the engine prune to the requested time range and skip everything else. Dimensions are usually too small to bother.
Distribution / co-location (on platforms that expose it). Distribute large fact tables by a high-cardinality foreign key that joins use; replicate small dimension tables to all nodes. The goal is that fact-to-dimension joins happen locally on each node without shuffling data across the network.
Sort order. Within partitions, sorting by commonly filtered columns improves data skipping (min/max metadata per block lets the engine skip blocks that can't match).
Materialization strategy. Each model in your transformation DAG is materialized as a view (always fresh, computed per query), a table (fast to read, refreshed on schedule), or an incrementally maintained table (only new/changed rows processed per run). Practical defaults: staging models as views or transient tables; core dimensional models as tables; very large fact tables as incremental builds keyed on a reliable watermark (an updated_at or load timestamp). Promote a view to a table when query frequency makes the repeated computation more expensive than storage and a refresh schedule.
A liberating consequence of cheap storage and fast scans: many techniques that were once mandatory — aggregate tables for every report, elaborate pre-summarization pipelines — are now optimizations you apply when measurement says you need them, not defaults. Build atomic-grain models first; add aggregates only when a real workload justifies them.
Physical design checklist
Before promoting a model to the core layer, check:
- One staging model per raw source, and nothing downstream reads raw directly.
- Every table has a documented grain — what does one row represent? (This becomes the central question of the next chapter.)
- Keys are explicit: surrogate keys for dimensions, natural keys preserved as attributes, durable keys where history will be tracked.
- Names follow the convention and would make sense to a new analyst — or an AI assistant — with no tribal knowledge.
- Types are strict: decimals for money, UTC timestamps, real dates, readable codes.
- Large facts are partitioned/clustered by date and materialized incrementally if volume warrants.
- Quality assertions run at the staging boundary: key uniqueness, non-nulls, accepted values, referential checks.
With a disciplined physical foundation in place, you're ready for the layer that gives analytics its shape: the dimensional model.
See Semantic Intelligence in Action
Coginiti operationalizes business meaning across your entire data estate.