Data Compaction
Data compaction is a maintenance process that combines small data files into larger ones, improving query performance and reducing storage overhead without changing data or schema.
Data lakes accumulate small files from frequent incremental writes, ingestions, and streaming updates. Query engines incur overhead for each file accessed (network round trips, metadata operations), making many small files problematic. Compaction consolidates these scattered files into larger, more efficient units that reduce operation overhead and improve I/O throughput.
Compaction is non-destructive and transparent to queries. The process reads multiple files, writes consolidated output, and updates table metadata to reference new files while retiring old ones. The underlying data and schema remain unchanged. Scheduling compaction involves tradeoffs: frequent compaction reduces query overhead but consumes compute resources; infrequent compaction defers costs but allows query degradation.
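The metadata swap described above can be sketched in a few lines of Python. This is an illustrative model, not any real table format's API; the `Snapshot` type and `commit_compaction` function are invented for this example.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Snapshot:
    """Immutable table metadata: the set of data files a query should read."""
    files: frozenset

def commit_compaction(current: Snapshot, inputs: set, outputs: set) -> Snapshot:
    """Swap compacted-away input files for their consolidated replacements.

    Data files are never modified in place; only the metadata changes.
    Readers holding the old snapshot keep seeing a consistent file set,
    which is what makes compaction transparent to running queries.
    """
    return Snapshot(files=(current.files - frozenset(inputs)) | frozenset(outputs))

old = Snapshot(files=frozenset({"a.parquet", "b.parquet", "c.parquet"}))
new = commit_compaction(old, {"a.parquet", "b.parquet"}, {"ab-compact.parquet"})
# `old` is unchanged, so in-flight queries against it still resolve correctly
```

Because the old snapshot remains valid until its files pass the retention period, compaction can run concurrently with queries and even be rolled back by simply not committing the new snapshot.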
Many open table formats implement intelligent compaction strategies. Rather than compacting entire tables, they target partitions with excessive small files or use size-based heuristics. Some tools can run compaction incrementally, processing subsets at a time. Integration with cloud object storage lifecycle policies can automate cleanup of old file versions after compaction commits.
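A size-based heuristic like those mentioned above amounts to bin-packing: select files below a small-file threshold and group them into output bins near a target size. The sketch below is a simplified illustration; the thresholds and function name are assumptions, not taken from any specific engine.

```python
def plan_compaction(file_sizes_mb, small_threshold_mb=32, target_mb=512):
    """Group files below the small-file threshold into target-size bins.

    Each returned group represents a set of small files that would be
    rewritten as one larger file. Files already at or above the threshold
    are left alone. Thresholds are illustrative defaults.
    """
    small = sorted(s for s in file_sizes_mb if s < small_threshold_mb)
    groups, current, total = [], [], 0
    for size in small:
        # Flush the current bin once adding another file would exceed the target
        if current and total + size > target_mb:
            groups.append(current)
            current, total = [], 0
        current.append(size)
        total += size
    if len(current) > 1:  # a lone small file gains nothing from rewriting
        groups.append(current)
    return groups

# 100 files of 10 MB each pack into two bins capped near 512 MB
groups = plan_compaction([10] * 100)
```

Running the planner only on partitions whose small-file count exceeds a threshold is one way to make compaction incremental rather than a full-table rewrite.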
Key Characteristics
- Combine multiple small files into fewer, larger files for efficiency
- Non-destructive operation that doesn't alter data or structure
- Transparent to running queries; can happen concurrently in many systems
- Reduce metadata overhead and I/O costs associated with many small files
- Support incremental compaction of partitions or time ranges
- Enable cleanup of intermediate files from updates and deletes
Why It Matters
- Can substantially reduce query latency in heavily fragmented tables
- Lowers cloud storage costs through efficient file organization
- Reduces metadata operation overhead from managing thousands of small files
- Simplifies operations by automating cleanup after frequent updates or deletes
- Improves resource utilization for concurrent queries on the same table
- Counteracts the gradual performance degradation caused by incremental data ingestion patterns
Example
```sql
-- Table has grown fragmented from daily incremental loads:
-- 5,000 small files averaging 10 MB each (~50 GB of data, with per-file
-- metadata operations and open costs dominating query time)

-- Trigger compaction on recent partitions (syntax varies by engine)
ALTER TABLE transactions COMPACT PARTITION year=2024, month=4;

-- Internally, the compaction process:
-- 1. Reads 500 small files (~5 GB) from the partition
-- 2. Writes 10 larger files (~500 MB each)
-- 3. Updates metadata to reference the new files
-- 4. Marks old small files for deletion after the retention period

-- Query performance improves:
-- Before: 500 file opens + network round trips
-- After:  10 file opens + network round trips
```
Coginiti Perspective
CoginitiScript's incremental publication can produce small files over time as append and merge operations accumulate. Compaction is handled by the target platform (Snowflake's automatic clustering, Databricks' OPTIMIZE, Iceberg's rewrite_data_files). Coginiti's role is ensuring that the transformation logic producing these files is governed and that publication metadata clearly defines materialization targets, so platform-level compaction processes know which tables to maintain.
Related Concepts
More in Open Table Formats
Apache Hudi
Apache Hudi is an open-source data lake framework providing incremental processing, ACID transactions, and fast ingestion for analytical and operational workloads.
Apache Iceberg
Apache Iceberg is an open-source table format that organizes data files with a metadata layer enabling ACID transactions, schema evolution, and time travel capabilities for data lakes.
Delta Lake
Delta Lake is an open-source storage layer providing ACID transactions, schema governance, and data versioning to data lakes built on cloud object storage.
Hidden Partitioning
Hidden partitioning is a table format feature that partitions data logically for query optimization without encoding partition values in file paths or requiring file reorganization during partition scheme changes.
Open Table Format
An open table format is a vendor-neutral specification for organizing and managing data files and metadata in data lakes, enabling ACID transactions and multi-engine interoperability.
Partitioning
Partitioning is a data organization technique that divides tables into logical or physical segments based on column values, enabling query engines to scan only relevant data.