Table Metadata Layer

A table metadata layer is a structured system that tracks file references, transaction history, schema definitions, and data statistics for tables, enabling consistent access and governance.

Traditional data lakes organize data as collections of Parquet or ORC files with no centralized metadata management, leading to consistency issues when concurrent processes modify files. A table metadata layer solves this by maintaining authoritative records of which files belong to a table, their statistics, and modification history.

The metadata layer typically includes a manifest of active data files, a transaction log recording all table changes, schema definitions with versioning, and data statistics for optimization. This enables several critical features: atomic updates prevent partial writes from becoming visible, rollback capabilities allow reverting bad changes, and consistent snapshots ensure read operations see coherent data.
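
How these pieces interact can be sketched in a few lines of Python. This is a minimal in-memory illustration, not how Iceberg or Delta Lake are implemented; the `Snapshot` and `TableMetadata` classes and their methods are hypothetical names chosen for this example.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass(frozen=True)
class Snapshot:
    """Immutable view of a table: the exact set of data files at one point in time."""
    snapshot_id: int
    files: Tuple[str, ...]
    operation: str


@dataclass
class TableMetadata:
    """Hypothetical metadata layer: a log of snapshots plus a pointer to the current one."""
    log: List[Snapshot] = field(default_factory=lambda: [Snapshot(0, (), "create")])
    current: int = 0  # index into log

    def read(self) -> Snapshot:
        # Readers pin one snapshot; later commits never mutate it.
        return self.log[self.current]

    def commit(self, add=(), remove=(), operation="append") -> Snapshot:
        # Build the complete new file list first, then publish it in one step,
        # so a half-finished write is never visible to readers.
        base = self.read()
        removed = set(remove)
        files = tuple(f for f in base.files if f not in removed) + tuple(add)
        snap = Snapshot(len(self.log), files, operation)
        self.log.append(snap)
        self.current = snap.snapshot_id
        return snap

    def rollback(self, snapshot_id: int) -> None:
        # Reverting only moves the pointer; history is kept for auditing.
        if not 0 <= snapshot_id < len(self.log):
            raise ValueError(f"unknown snapshot {snapshot_id}")
        self.current = snapshot_id


table = TableMetadata()
table.commit(add=["a.parquet", "b.parquet"])
pinned = table.read()  # a reader holding snapshot 1
table.commit(add=["c.parquet"], remove=["a.parquet"], operation="overwrite")
print(pinned.files)        # ('a.parquet', 'b.parquet')
print(table.read().files)  # ('b.parquet', 'c.parquet')
table.rollback(1)
print(table.read().files)  # ('a.parquet', 'b.parquet')
```

The reader that pinned snapshot 1 keeps seeing the same file list even after the overwrite, and the rollback restores that list without deleting the later snapshot from the log.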

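Implemented in open table formats such as Apache Iceberg and Delta Lake, the metadata layer is usually stored alongside the data files in cloud object storage. This avoids introducing a separate database or metadata server that could become a bottleneck or single point of failure, although some formats still rely on a lightweight catalog to record which metadata file is current. The design is tailored to object storage: because the metadata lists every data file explicitly, readers can plan queries without slow directory listings, and each commit writes only a few small metadata files.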
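
One way to get atomic commits without a metadata server is to rely on a "create only if absent" write for each numbered commit file. The sketch below assumes that approach and uses the local filesystem's exclusive-create mode as a stand-in for an object store's conditional write; `try_commit` and `commit_with_retry` are hypothetical helpers, and the file naming is illustrative rather than any format's exact layout.

```python
import json
import os
import tempfile


def try_commit(log_dir: str, version: int, actions: list) -> bool:
    """Attempt to publish commit `version`; return False if another writer won."""
    path = os.path.join(log_dir, f"{version:020d}.json")
    try:
        # Mode "x" fails if the file already exists, standing in for an
        # object store's conditional "create only if absent" write.
        with open(path, "x") as f:
            json.dump(actions, f)
        return True
    except FileExistsError:
        return False


def commit_with_retry(log_dir: str, expected_version: int, actions: list,
                      max_attempts: int = 5) -> int:
    """Retry on the next version number until the commit file is created."""
    version = expected_version
    for _ in range(max_attempts):
        if try_commit(log_dir, version, actions):
            return version
        # Another writer took this version. A real writer would re-read the
        # log and check for logical conflicts before trying again.
        version += 1
    raise RuntimeError("too many concurrent commits")


log_dir = tempfile.mkdtemp()
first = try_commit(log_dir, 0, [{"add": "part-0.parquet"}])
conflict = try_commit(log_dir, 0, [{"add": "part-1.parquet"}])  # loses the race
retry_version = commit_with_retry(log_dir, 0, [{"add": "part-1.parquet"}])
print(first, conflict, retry_version)  # True False 1
```

Because exactly one writer can create a given commit file, every reader agrees on the order of commits without any coordinating service.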

Key Characteristics

  • Maintains authoritative inventory of data files comprising a table
  • Records transaction history with timestamps and operation details
  • Enforces schema consistency and versions the schema as the table evolves
  • Tracks data statistics (min, max, null counts) for query optimization
  • Enables atomic, all-or-nothing updates across file collections
  • Designed for cloud object storage with minimal metadata overhead
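
The statistics-based pruning mentioned above can be illustrated with a short Python sketch. The `FileStats` record and `prune` function are hypothetical and assume a single numeric column, `amount`, with a filter of the form `amount >= lower`.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class FileStats:
    """Per-file statistics for one column, as a metadata layer might record them."""
    path: str
    min_amount: float
    max_amount: float
    null_count: int
    row_count: int


def prune(files: List[FileStats], lower: float) -> List[str]:
    """Return paths of files that could contain rows with amount >= lower."""
    keep = []
    for f in files:
        if f.null_count == f.row_count:
            continue  # every value is null, so no row can satisfy the filter
        if f.max_amount < lower:
            continue  # the largest value in the file is still too small
        keep.append(f.path)
    return keep


manifest = [
    FileStats("part-0.parquet", 1.0, 49.99, 0, 1000),
    FileStats("part-1.parquet", 50.0, 500.0, 3, 1000),
    FileStats("part-2.parquet", 0.0, 0.0, 800, 800),
]
to_scan = prune(manifest, lower=100.0)
print(to_scan)  # ['part-1.parquet']
```

Only one of the three files has to be opened; the other two are ruled out from metadata alone.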

Why It Matters

  • Prevents data corruption and consistency violations in concurrent analytical environments
  • Enables efficient query optimization through accurate statistics and file pruning
  • Supports governance requirements through complete audit trails of changes
  • Simplifies disaster recovery by providing snapshots of table state
  • Reduces query costs by avoiding full table scans when possible
  • Facilitates schema management and backward compatibility during data evolution

Example

A simplified, Iceberg-style view of the structure the metadata layer tracks internally (field names are illustrative; the actual Iceberg specification uses hyphenated keys such as `format-version`):

```json
{
  "format_version": 1,
  "table_uuid": "e1234567",
  "snapshots": [
    {
      "snapshot_id": 123,
      "timestamp": 1704067200000,
      "manifest_list": "s3://bucket/metadata/snap-123.avro"
    }
  ],
  "schema": [
    {"id": 1, "name": "order_id", "type": "int", "required": true},
    {"id": 2, "name": "amount", "type": "decimal(10,2)", "required": false}
  ]
}
```
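
As a rough illustration of how a reader might use such a document, the Python sketch below parses the simplified structure shown above and resolves the latest snapshot and the required columns. It assumes the illustrative field names from this example, not the keys of any real format, and treats a missing `required` flag as optional.

```python
import json
from datetime import datetime, timezone

# Same simplified structure as the example above (illustrative field names).
metadata_json = """
{
  "format_version": 1,
  "table_uuid": "e1234567",
  "snapshots": [
    {
      "snapshot_id": 123,
      "timestamp": 1704067200000,
      "manifest_list": "s3://bucket/metadata/snap-123.avro"
    }
  ],
  "schema": [
    {"id": 1, "name": "order_id", "type": "int", "required": true},
    {"id": 2, "name": "amount", "type": "decimal(10,2)"}
  ]
}
"""

metadata = json.loads(metadata_json)

# The most recent snapshot determines which manifest list a reader starts from.
latest = max(metadata["snapshots"], key=lambda s: s["timestamp"])

# Timestamps in this example are milliseconds since the Unix epoch.
committed_at = datetime.fromtimestamp(latest["timestamp"] / 1000, tz=timezone.utc)

required_columns = [c["name"] for c in metadata["schema"] if c.get("required", False)]

print(latest["manifest_list"])   # s3://bucket/metadata/snap-123.avro
print(committed_at.isoformat())  # 2024-01-01T00:00:00+00:00
print(required_columns)          # ['order_id']
```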

Coginiti Perspective

Coginiti adds a complementary metadata layer through its analytics catalog and semantic model (SMDL). While table metadata layers like Iceberg's track file-level statistics and transaction history, Coginiti's SMDL defines business-level metadata: entity descriptions, dimension types, measure aggregations, and relationship cardinalities. Together, the table metadata layer handles physical governance and the semantic model handles business governance, covering both levels of the stack.
