Full Refresh
Full Refresh is a data pipeline pattern that reprocesses and reloads an entire dataset from scratch on each execution, discarding previous results and recomputing everything.
Full refresh loads all source data and recomputes all results from beginning to end: a customer table is completely reloaded, all transformations are rerun, all metrics are recalculated. Full refresh is conceptually simple: there is no state to track, no edge cases with late-arriving data, and no complexity from partial failures. A full refresh is idempotent: running it twice produces identical results. This simplicity makes full refresh the default choice for new pipelines.
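The three steps above (discard, reload, recompute) can be sketched in a few lines. This is a minimal illustration using an in-memory SQLite target and a hypothetical hard-coded source; real pipelines would extract from an actual source system, but the drop-reload-recompute shape and the idempotency property are the same.

```python
import sqlite3

# Hypothetical source rows; in practice these come from the source system.
SOURCE_ROWS = [
    ("alice", 120.0),
    ("bob", 80.0),
    ("carol", 200.0),
]

def full_refresh(conn: sqlite3.Connection) -> float:
    """Discard previous results, reload everything, recompute metrics."""
    cur = conn.cursor()
    # Step 1: discard all previous results.
    cur.execute("DROP TABLE IF EXISTS customers")
    cur.execute("CREATE TABLE customers (name TEXT, revenue REAL)")
    # Step 2: reload the entire source dataset.
    cur.executemany("INSERT INTO customers VALUES (?, ?)", SOURCE_ROWS)
    # Step 3: recompute all derived metrics from scratch.
    (total,) = cur.execute("SELECT SUM(revenue) FROM customers").fetchone()
    conn.commit()
    return total

conn = sqlite3.connect(":memory:")
first = full_refresh(conn)
second = full_refresh(conn)  # idempotent: rerunning changes nothing
```

Because every run starts from an empty table, `first` and `second` are identical no matter how many times the pipeline executes.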
Trade-offs are significant: full refresh is inefficient (processing terabytes when only gigabytes changed), expensive (scales costs with total data rather than change volume), and often infeasible (terabyte-scale data can't be fully refreshed daily). Full refresh is appropriate for small datasets, low-freshness requirements, or scenarios where simplicity is more valuable than efficiency. As datasets grow, organizations typically migrate to incremental processing.
In practice, organizations use hybrid approaches: a full refresh on a schedule (for example, weekly) for validation and accuracy, with incremental runs daily (or more frequently) for freshness. This combines incremental efficiency with periodic full-refresh validation that correctness hasn't drifted.
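A hybrid schedule often comes down to a small dispatch decision in the orchestrator. The sketch below assumes a hypothetical policy of a Sunday full refresh with incremental runs on all other days; the weekday and policy are illustrative, not prescribed.

```python
import datetime

def choose_mode(run_date: datetime.date, full_refresh_weekday: int = 6) -> str:
    """Pick the pipeline mode for a given run date.

    Hypothetical policy: full refresh on Sundays (weekday() == 6) to
    re-validate results; incremental every other day for freshness.
    """
    if run_date.weekday() == full_refresh_weekday:
        return "full"
    return "incremental"

# 2024-01-07 was a Sunday; 2024-01-08 a Monday.
sunday_mode = choose_mode(datetime.date(2024, 1, 7))
monday_mode = choose_mode(datetime.date(2024, 1, 8))
```

An orchestrator (Airflow, cron, etc.) would call something like this at the top of each run and branch into the appropriate path.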
Key Characteristics
- Reprocesses entire dataset on each execution
- Discards all previous results before recomputation
- Idempotent results regardless of execution count
- Simple to implement and understand
- No state tracking or watermark management required
- Inefficient for large datasets with few changes
Why It Matters
- Provides simplicity and ease of understanding for new pipelines
- Guarantees correctness by starting from scratch each time
- Useful for small datasets where efficiency isn't critical
- Enables clean recovery from corruption by starting fresh
- Appropriate for validation and periodic audits
- Reduces debugging complexity compared with stateful incremental processing
Example
An analytics startup uses full refresh initially: the data is small enough (1GB daily) that a complete refresh runs in 30 minutes on modest infrastructure, with no incremental-logic complexity and easy debugging. As data grows to 10GB, cost becomes significant, so they implement incremental processing for daily runs but keep a weekly full refresh for validation. After three years at 500GB, full refresh runs only as an annual audit; the daily incremental pipeline handles freshness.
Coginiti Perspective
CoginitiScript handles the full-refresh versus incremental decision within a single block definition. The publication.Incremental() function returns false when the target table does not exist or when fullRefresh=true is passed to publication.Run(), causing the block to execute its full-refresh path. This means teams write one block that handles both modes rather than maintaining separate full and incremental pipelines, reducing code duplication and the risk of logic divergence between the two paths.
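The single-block pattern can be illustrated in plain Python (this is not CoginitiScript syntax, just a sketch of the same decision logic; the function name, parameters, and generated SQL are all hypothetical):

```python
from typing import Optional

def build_query(target_exists: bool, full_refresh: bool,
                last_watermark: Optional[str]) -> str:
    """One code path serving both modes.

    Incremental applies only when the target table already exists and a
    full refresh was not explicitly requested -- otherwise the block
    falls back to its full-refresh path.
    """
    incremental = target_exists and not full_refresh
    if incremental:
        # Process only rows newer than the stored watermark.
        return f"SELECT * FROM source WHERE updated_at > '{last_watermark}'"
    # Full-refresh path: rebuild from the complete source.
    return "SELECT * FROM source"

inc_sql = build_query(target_exists=True, full_refresh=False,
                      last_watermark="2024-01-01")
full_sql = build_query(target_exists=False, full_refresh=False,
                       last_watermark=None)
```

Keeping both branches in one definition means a schema change or business-rule fix lands in both modes at once, which is the divergence risk the single-block approach removes.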
Related Concepts
More in Data Integration & Transformation
Change Data Capture (CDC)
Change Data Capture is a technique that identifies and captures new, updated, and deleted records from source systems, enabling efficient incremental data movement instead of full refreshes.
Data Cleansing
Data Cleansing is the process of identifying and correcting errors, inconsistencies, and anomalies in data to improve quality and reliability for analysis.
Data Deduplication
Data Deduplication is the process of identifying and eliminating duplicate records or data points that represent the same entity but appear multiple times in a dataset.
Data Dependency Graph
Data Dependency Graph is a directed representation of relationships between data entities, showing which tables, pipelines, or datasets depend on which other ones.
Data Enrichment
Data Enrichment is the process of enhancing data by adding valuable attributes, calculated fields, or external information that provides additional context and insight.
Data Ingestion
Data Ingestion is the process of capturing data from source systems and moving it into platforms for processing, storage, and analysis.