Change Data Capture (CDC)
Change Data Capture is a technique that identifies and captures new, updated, and deleted records from source systems, enabling efficient incremental data movement instead of full refreshes.
Change Data Capture tracks what changed in a source system since the last pipeline run and extracts only those changes, rather than reloading the entire dataset. CDC uses database logs (write-ahead logs), timestamps (updated_at columns), or query-based approaches (SELECT WHERE updated_at > last_run). CDC is especially valuable for large tables where full refresh is expensive and impractical. A table with 100 million customer records may only change 100,000 records daily; CDC captures only those changes, reducing extraction time and bandwidth.
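The query-based approach described above can be sketched in a few lines. This is a minimal illustration, assuming a hypothetical `customers` table with an `updated_at` column; note that this method cannot see hard deletes and depends on the source system keeping `updated_at` accurate.

```python
import sqlite3

def extract_changes(conn, last_run):
    """Query-based CDC: pull only rows modified since the last pipeline run
    (the 'watermark'), instead of re-reading the whole table."""
    cur = conn.execute(
        "SELECT id, name, updated_at FROM customers WHERE updated_at > ?",
        (last_run,),
    )
    return cur.fetchall()

# Demo against an in-memory table standing in for a source system.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER, name TEXT, updated_at TEXT)")
conn.executemany(
    "INSERT INTO customers VALUES (?, ?, ?)",
    [
        (1, "Ada", "2024-01-01T00:00:00"),
        (2, "Bob", "2024-06-01T12:00:00"),
    ],
)
# Only rows changed after the watermark come back.
changes = extract_changes(conn, "2024-03-01T00:00:00")
```

After each run, the pipeline persists the newest `updated_at` it saw and uses it as `last_run` next time.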
CDC technology matured with managed services and open-source tooling (Fivetran, Debezium) that handle the complexity of reading database logs. Organizations initially used timestamp-based CDC (simple but imperfect) and evolved toward log-based CDC (more reliable but more complex). Cloud platforms now provide native CDC through services like AWS Database Migration Service.
CDC enables real-time and near-real-time data movement by triggering downstream pipelines immediately when data changes occur. This supports operational analytics (dashboards updated seconds after transactions) and reduces freshness latency. The trade-off is complexity: CDC systems must handle out-of-order changes, deletions, and ensure no changes are missed.
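A small sketch of how a consumer might handle those three hazards: out-of-order arrival, deletes, and replayed events. The event shape (`key`, `version`, `op`, `row`) is an illustrative assumption, not a standard format.

```python
def apply_change(state, event):
    """Apply one CDC event to an in-memory snapshot. A per-key version
    number makes the apply idempotent and order-safe."""
    key, version, op = event["key"], event["version"], event["op"]
    current = state.get(key)
    # Ignore stale or replayed events: apply only strictly newer versions.
    if current is not None and version <= current["version"]:
        return state
    if op == "delete":
        # Keep a tombstone so a late, older update cannot resurrect the row.
        state[key] = {"version": version, "row": None, "deleted": True}
    else:  # insert or update
        state[key] = {"version": version, "row": event["row"], "deleted": False}
    return state

state = {}
events = [
    {"key": 1, "version": 1, "op": "insert", "row": {"balance": 100}},
    {"key": 1, "version": 3, "op": "delete"},
    {"key": 1, "version": 2, "op": "update", "row": {"balance": 90}},  # arrives late
]
for e in events:
    apply_change(state, e)
# The late version-2 update is ignored; the version-3 delete wins.
```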
Key Characteristics
- Identifies changes (inserts, updates, deletes) in source systems
- Uses database logs, timestamps, or query comparison methods
- Tracks changes since last extraction point
- Enables incremental movement of only changed data
- Reduces extraction time and bandwidth for large datasets
- Supports real-time or near-real-time latency
Why It Matters
- Reduces cost and latency of data movement by eliminating full refreshes
- Enables real-time data availability for operational analytics
- Reduces source system load by not continuously scanning for changes
- Improves freshness of downstream data through frequent incremental updates
- Enables compliance with data deletion by capturing deletes
- Scales efficiently to handle large data volumes
Example
A payment processor uses CDC to stream transactions. Database logs capture every new transaction and update (refund, reversal); Debezium reads the logs within milliseconds of commit and streams changes to Kafka, where multiple consumers subscribe: settlement_system updates account balances, analytics_warehouse increments transaction counts, and fraud_detector scores each change for suspicious patterns. Customer transaction data is current within seconds, eliminating batch latency.
Coginiti Perspective
CDC feeds naturally into Coginiti's ELT workflow. Once change data lands in a warehouse or lake, CoginitiScript's incremental publication strategies (append, merge, and merge_conditionally) handle the downstream transformation logic. The publication.Incremental() function lets blocks detect whether they are running in incremental or full-refresh mode, so the same CoginitiScript code handles both CDC-driven updates and initial loads without separate pipeline definitions.