
Data Lake

A data lake is a large-scale storage system that retains data in its raw, original format from multiple sources, serving as a central repository for historical data and enabling diverse analytics and data science use cases.

A data lake stores vast amounts of raw data without transformation: complete transaction logs, sensor readings, web server logs, unstructured documents. Unlike data warehouses that curate data into structured schemas, lakes store everything in native format (JSON, Parquet, raw images), enabling flexible exploration. Data lakes typically use cheap storage (cloud object storage like S3) optimized for high throughput rather than low latency. The philosophy is to capture everything and decide what to do with it later, preserving flexibility for future analysis that may not be anticipated.
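Because lakes keep files in native formats on object storage, teams commonly organize them with a date-partitioned key layout so query engines can skip irrelevant files. The layout below (`raw/<source>/year=.../month=.../day=.../<file>`) is a hypothetical but typical convention, sketched in Python:

```python
from datetime import date

def raw_event_key(source: str, event_day: date, filename: str) -> str:
    """Build a date-partitioned object key for raw data.

    Hypothetical layout: raw/<source>/year=YYYY/month=MM/day=DD/<file>.
    Hive-style partition folders let engines prune by date at query time.
    """
    return (
        f"raw/{source}/"
        f"year={event_day.year}/month={event_day.month:02d}/day={event_day.day:02d}/"
        f"{filename}"
    )

key = raw_event_key("web-logs", date(2024, 5, 17), "server-01.json.gz")
print(key)  # raw/web-logs/year=2024/month=05/day=17/server-01.json.gz
```

Engines such as Athena, Trino, and Spark recognize this `key=value` folder style and translate date filters into file pruning, which is what keeps scans over a multi-terabyte lake affordable.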

Data lakes emerged as organizations realized they were discarding data because transformation was expensive; lakes make it practical to keep everything and query only what's needed. Data lakes use schema-on-read (structure is imposed at analysis time) rather than schema-on-write (structure is defined at load time). This flexibility supports diverse consumers: data scientists explore raw data, analysts query curated subsets, ML systems process raw logs.
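The schema-on-read idea can be sketched in a few lines of Python: raw records land in the lake untouched, and each consumer imposes its own structure and types only when reading. The field names and defaults here are illustrative, not a prescribed schema:

```python
import json

# Raw events land in the lake as-is; no schema is enforced at write time,
# so records may differ in shape (note the missing/extra fields below).
raw_lines = [
    '{"order_id": 1, "amount": "19.99", "country": "DE", "extra": {"coupon": "X1"}}',
    '{"order_id": 2, "amount": "5.00", "country": "US"}',
]

def read_orders(lines):
    """Schema-on-read: structure and types are imposed at query time.

    Missing fields fall back to defaults; a different consumer can define
    a different projection over the same raw files without rewriting them.
    """
    for line in lines:
        rec = json.loads(line)
        yield {
            "order_id": int(rec["order_id"]),
            "amount": float(rec["amount"]),
            "country": rec.get("country", "unknown"),
        }

orders = list(read_orders(raw_lines))
print(orders[1])  # {'order_id': 2, 'amount': 5.0, 'country': 'US'}
```

A schema-on-write system would have rejected or truncated the first record's `extra` field at load time; here it is simply retained in storage for any future consumer that wants it.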

The challenge with data lakes is management: without governance, they become "data swamps" where no one knows what data exists, quality is inconsistent, and finding relevant data is impossible. Mature data lakes include catalogs (metadata about what exists), governance (policies for retention and access), and quality management (monitoring data quality).
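A data catalog is, at its core, a registry of metadata about each dataset: where it lives, what format it is in, who owns it, and how long it is retained. Real catalogs (AWS Glue, Hive Metastore, and similar) are far richer, but a minimal sketch conveys the idea; the dataset names, paths, and retention periods below are hypothetical:

```python
from dataclasses import dataclass

@dataclass
class DatasetEntry:
    path: str            # location prefix in object storage
    fmt: str             # file format (json, parquet, ...)
    owner: str           # team accountable for data quality
    retention_days: int  # governance policy: how long files are kept

# Toy catalog: without something like this, a lake drifts toward a "swamp".
catalog = {
    "web-logs": DatasetEntry("raw/web-logs/", "json", "platform-team", 365),
    "orders": DatasetEntry("raw/orders/", "parquet", "commerce-team", 2555),
}

def describe(name: str) -> str:
    """Answer the basic discoverability question: what is this dataset?"""
    e = catalog[name]
    return f"{name}: {e.fmt} at {e.path}, owned by {e.owner}, kept {e.retention_days}d"

print(describe("orders"))
```

Even this toy version answers the questions a swamp cannot: what data exists, who is responsible for it, and what policy governs it.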

Key Characteristics

  • Stores data in raw, original formats
  • Uses cheap, scalable object storage
  • Retains historical data for long periods
  • Supports schema-on-read (structure defined during analysis)
  • Enables diverse analytics and data science
  • Requires governance to prevent becoming "data swamp"

Why It Matters

  • Reduces cost of retaining historical data through cheap storage
  • Enables flexibility: store everything, decide later how to use it
  • Supports data science by providing raw data for model training
  • Reduces data loss from over-filtering during ingestion
  • Enables new analytics use cases not anticipated when data was captured
  • Preserves audit trail of raw data for compliance

Example

A logistics company maintains a data lake in S3: it stores raw GPS data from every delivery vehicle (terabytes daily), complete order data with full version history, carrier network logs, historical weather data, and customer surveys. The data warehouse contains curated delivery_metrics tables for dashboard reporting. Data scientists access the raw lake to build delivery-time prediction models using the complete GPS history; geospatial analysts query correlations between historical weather and deliveries. The lake enables use cases, such as the prediction models, that would not have justified a separate ingestion pipeline.

Coginiti Perspective

Coginiti integrates with data lakes through its object store browser for managing files, direct query capabilities against data files in object storage, and CoginitiScript publication that materializes results as Parquet or CSV directly to S3, Azure Blob, or GCS. For teams using lake-first architectures, Coginiti connects to lake query engines like Athena, Trino, Databricks, and Spark, providing the same analytics catalog and semantic layer governance regardless of whether data lives in a warehouse or lake.
