Data Lakehouse
Data Lakehouse is an architecture that combines the storage advantages of a data lake (cheap, flexible, scalable) with the query capabilities of a data warehouse (schema, performance, governance).
Data lakehouses address a key tension: data lakes are cheap and flexible, but querying raw data is complex; data warehouses enable efficient queries, but are expensive and rigid. Lakehouses resolve this by layering query engines and metadata systems over inexpensive cloud object storage. Technically, lakehouses rely on open table formats (Apache Iceberg, Delta Lake) that add structured metadata on top of object storage, enabling ACID transactions, schema enforcement, and warehouse-style query optimization while retaining the lake's flexibility and cost profile.
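The core mechanism can be sketched in a few lines. The toy class below (illustrative only, not the actual Delta Lake or Iceberg implementation) models a table as immutable data files plus a versioned list of snapshots: each commit writes a new file and publishes a new snapshot, so a reader pinned to one snapshot never sees an in-flight write. Real table formats persist this metadata log alongside the data in object storage; here it is kept in memory for brevity.

```python
import json
import os
import tempfile

class ToyTable:
    """Illustrative sketch of a table format: immutable files + versioned snapshots."""

    def __init__(self, root):
        self.root = root
        # Version N -> list of data files visible at N. In-memory stand-in for
        # the metadata log that real formats store next to the data.
        self.snapshots = [[]]

    def commit(self, rows):
        """Write a new immutable data file, then publish a new snapshot."""
        version = len(self.snapshots)
        path = os.path.join(self.root, f"part-{version}.json")
        with open(path, "w") as f:
            json.dump(rows, f)
        # New snapshot = previous file list plus the new file (append-only).
        self.snapshots.append(self.snapshots[-1] + [path])
        return version

    def read(self, version=None):
        """Read every row visible at the given snapshot (default: latest)."""
        files = self.snapshots[version if version is not None else -1]
        rows = []
        for path in files:
            with open(path) as f:
                rows.extend(json.load(f))
        return rows

root = tempfile.mkdtemp()
table = ToyTable(root)
v1 = table.commit([{"txn_id": 1, "amount": 100}])
v2 = table.commit([{"txn_id": 2, "amount": 250}])
print(len(table.read(v1)))  # reader pinned to v1 sees only the first file
print(len(table.read()))    # latest snapshot sees both files
```

The append-only snapshot list is what gives readers a consistent view without locking the underlying files, and it also enables "time travel" reads of older versions.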
Lakehouses emerged from organizations recognizing that raw data and curated data both have value: raw data enables exploration and model training, while curated data supports governed analytics. Rather than choosing one, lakehouses provide both: raw data lives in the lake, curated tables are organized in warehouse-like schemas, and the same query engine accesses both. Platforms like Databricks were built around lakehouse architecture, treating the system as unified storage with multiple access patterns.
In practice, lakehouses enable organizations to consolidate systems: instead of a separate lake (for raw data and ML) and warehouse (for analytics), a single lakehouse serves both. This reduces duplication, simplifies data movement, and lowers costs. The tradeoff is complexity: lakehouses are a newer technology with less operational maturity than established warehouses.
Key Characteristics
- Combines lake flexibility with warehouse structure
- Uses object storage for cost efficiency
- Implements ACID transactions and schema enforcement
- Supports both raw and curated data access
- Provides query optimization like warehouses
- Enables multiple access patterns on same data
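Schema enforcement, one of the characteristics above, can be illustrated with a minimal sketch (assumed, simplified logic, not a real table format's API): the table's declared schema is checked before a write is accepted, so non-conforming records are rejected instead of landing silently in the lake.

```python
# Hypothetical schema for a curated transactions table.
SCHEMA = {"txn_id": int, "amount": float}

def validate(row, schema=SCHEMA):
    """Reject rows with missing/extra columns or wrong types before they are written."""
    if set(row) != set(schema):
        raise ValueError(f"columns {sorted(row)} do not match schema {sorted(schema)}")
    for col, typ in schema.items():
        if not isinstance(row[col], typ):
            raise ValueError(
                f"{col}: expected {typ.__name__}, got {type(row[col]).__name__}"
            )
    return row

validate({"txn_id": 1, "amount": 99.5})          # conforms, accepted
try:
    validate({"txn_id": "abc", "amount": 99.5})  # wrong type, rejected
except ValueError as e:
    print("rejected:", e)
```

This write-time check is what distinguishes a governed lakehouse table from a raw lake file, where anything can be written and problems surface only at query time.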
Why It Matters
- Reduces total cost versus separate lake and warehouse
- Enables unified analytics and ML on same platform
- Supports governance on raw data without separate systems
- Eliminates data movement between lake and warehouse
- Reduces complexity by consolidating storage systems
- Enables new use cases by providing both raw and curated access
Example
A financial services firm uses a Databricks lakehouse: raw transaction data lands in object storage as Parquet files, and the Delta Lake format adds ACID transactions and schema enforcement. The finance team uses SQL to query curated revenue tables; the ML team uses Python/Spark to train risk models on raw transaction logs; data scientists explore raw data to discover new features. Same underlying storage, same infrastructure, different access patterns. Previously, this required a separate S3 lake and Snowflake warehouse with complex data movement between them.
Coginiti Perspective
Coginiti supports lakehouse architectures directly through CoginitiScript's ability to publish Iceberg tables on Snowflake, Databricks, BigQuery, Trino, and Athena. This means governed transformations can produce open table format outputs that any lakehouse-compatible engine can read. The semantic layer provides consistent metric definitions whether the underlying data is accessed through a warehouse SQL interface or a lakehouse query engine, preventing definitional drift across access patterns.
Related Concepts
Cloud Data Warehouse
Cloud Data Warehouse is a managed analytics database service hosted in cloud infrastructure, providing elastic scaling, separated compute and storage, and usage-based pricing.
Columnar Storage
Columnar Storage is a data storage format that organizes data by column rather than by row, enabling efficient compression and fast analytical queries that access subsets of columns.
Compute Warehouse (e.g., Snowflake Virtual Warehouse)
Compute Warehouse is an elastic compute resource in a cloud data warehouse that allocates processing power for query execution, scaling up and down based on workload demands.
Data Caching
Data Caching is the storage of frequently accessed data in fast, temporary memory to reduce latency and computational cost by serving requests from cache rather than recomputing or refetching.
Data Lake
Data Lake is a large-scale storage system that retains data in its raw, original format from multiple sources, serving as a central repository for historical data and enabling diverse analytics and data science use cases.
Data Mart
Data Mart is a specialized analytics database serving a specific department or function, containing curated data optimized for particular analytical questions and consumer groups.