Full Refresh
Full Refresh is a data pipeline pattern that reprocesses and reloads an entire dataset from scratch on each execution, discarding previous results and recomputing everything.
Full refresh loads all source data and recomputes all results from beginning to end: a customer table is completely reloaded, all transformations are rerun, all metrics are recalculated. Full refresh is conceptually simple: there is no state to track, no edge cases with late-arriving data, and no complexity from partial failures. A full refresh is idempotent: running it twice produces identical results. This simplicity makes full refresh the default choice for new pipelines.
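The three steps above (discard, reload, recompute) can be sketched in a few lines. This is a minimal illustration using an in-memory SQLite target and a hypothetical hard-coded source; real pipelines would extract from an actual source system, but the drop-reload-recompute shape and the idempotency property are the same.

```python
import sqlite3

# Hypothetical source rows; in practice these come from the source system.
SOURCE_ROWS = [
    ("alice", 120.0),
    ("bob", 80.0),
    ("carol", 200.0),
]

def full_refresh(conn: sqlite3.Connection) -> float:
    """Discard previous results, reload everything, recompute metrics."""
    cur = conn.cursor()
    # Step 1: discard all previous results.
    cur.execute("DROP TABLE IF EXISTS customers")
    cur.execute("CREATE TABLE customers (name TEXT, revenue REAL)")
    # Step 2: reload the entire source dataset.
    cur.executemany("INSERT INTO customers VALUES (?, ?)", SOURCE_ROWS)
    # Step 3: recompute all derived metrics from scratch.
    (total,) = cur.execute("SELECT SUM(revenue) FROM customers").fetchone()
    conn.commit()
    return total

conn = sqlite3.connect(":memory:")
first = full_refresh(conn)
second = full_refresh(conn)  # idempotent: rerunning changes nothing
```

Because every run starts from an empty table, `first` and `second` are identical no matter how many times the pipeline executes.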
Trade-offs are significant: full refresh is inefficient (processing terabytes when only gigabytes changed), expensive (scales costs with total data rather than change volume), and often infeasible (terabyte-scale data can't be fully refreshed daily). Full refresh is appropriate for small datasets, low-freshness requirements, or scenarios where simplicity is more valuable than efficiency. As datasets grow, organizations typically migrate to incremental processing.
In practice, organizations use hybrid approaches: a full refresh on a schedule (for example, weekly) for validation and accuracy, with incremental runs daily (or more frequently) for freshness. This combines incremental efficiency with periodic full-refresh validation that correctness hasn't drifted.
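A hybrid schedule often comes down to a small dispatch decision in the orchestrator. The sketch below assumes a hypothetical policy of a Sunday full refresh with incremental runs on all other days; the weekday and policy are illustrative, not prescribed.

```python
import datetime

def choose_mode(run_date: datetime.date, full_refresh_weekday: int = 6) -> str:
    """Pick the pipeline mode for a given run date.

    Hypothetical policy: full refresh on Sundays (weekday() == 6) to
    re-validate results; incremental every other day for freshness.
    """
    if run_date.weekday() == full_refresh_weekday:
        return "full"
    return "incremental"

# 2024-01-07 was a Sunday; 2024-01-08 a Monday.
sunday_mode = choose_mode(datetime.date(2024, 1, 7))
monday_mode = choose_mode(datetime.date(2024, 1, 8))
```

An orchestrator (Airflow, cron, etc.) would call something like this at the top of each run and branch into the appropriate path.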
Key Characteristics
- Reprocesses entire dataset on each execution
- Discards all previous results before recomputation
- Idempotent results regardless of execution count
- Simple to implement and understand
- No state tracking or watermark management required
- Inefficient for large datasets with few changes
Why It Matters
- Provides simplicity and ease of understanding for new pipelines
- Guarantees correctness by starting from scratch each time
- Useful for small datasets where efficiency isn't critical
- Enables clean recovery from corruption by starting fresh
- Appropriate for validation and periodic audits
- Reduces debugging complexity compared with stateful incremental processing
Example
An analytics startup uses full refresh initially: the data is small enough (1GB daily) that a complete refresh runs in 30 minutes on modest infrastructure, with no incremental-logic complexity and easy debugging. As data grows to 10GB, cost becomes significant, so they implement incremental processing for daily runs but keep a weekly full refresh for validation. After three years at 500GB, full refresh runs only as an annual audit; the daily incremental pipeline handles freshness.
Coginiti Perspective
CoginitiScript handles the full-refresh versus incremental decision within a single block definition. The publication.Incremental() function returns false when the target table does not exist or when fullRefresh=true is passed to publication.Run(), causing the block to execute its full-refresh path. This means teams write one block that handles both modes rather than maintaining separate full and incremental pipelines, reducing code duplication and the risk of logic divergence between the two paths.
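The single-block pattern can be illustrated in plain Python (this is not CoginitiScript syntax, just a sketch of the same decision logic; the function name, parameters, and generated SQL are all hypothetical):

```python
from typing import Optional

def build_query(target_exists: bool, full_refresh: bool,
                last_watermark: Optional[str]) -> str:
    """One code path serving both modes.

    Incremental applies only when the target table already exists and a
    full refresh was not explicitly requested -- otherwise the block
    falls back to its full-refresh path.
    """
    incremental = target_exists and not full_refresh
    if incremental:
        # Process only rows newer than the stored watermark.
        return f"SELECT * FROM source WHERE updated_at > '{last_watermark}'"
    # Full-refresh path: rebuild from the complete source.
    return "SELECT * FROM source"

inc_sql = build_query(target_exists=True, full_refresh=False,
                      last_watermark="2024-01-01")
full_sql = build_query(target_exists=False, full_refresh=False,
                       last_watermark=None)
```

Keeping both branches in one definition means a schema change or business-rule fix lands in both modes at once, which is the divergence risk the single-block approach removes.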
Related Concepts
More in Data Integration & Transformation
Change Data Capture (CDC)
Change Data Capture is a technique that identifies and captures new, updated, and deleted records from source systems, enabling efficient incremental data movement instead of full refreshes.
Data Cleansing
Data Cleansing is the process of identifying and correcting errors, inconsistencies, and anomalies in data to improve quality and reliability for analysis.
Data Deduplication
Data Deduplication is the process of identifying and eliminating duplicate records or data points that represent the same entity but appear multiple times in a dataset.
Data Dependency Graph
Data Dependency Graph is a directed representation of relationships between data entities, showing which tables, pipelines, or datasets depend on which other ones.
Data Enrichment
Data Enrichment is the process of enhancing data by adding valuable attributes, calculated fields, or external information that provides additional context and insight.
Data Ingestion
Data Ingestion is the process of capturing data from source systems and moving it into platforms for processing, storage, and analysis.