Data Virtualization
Data Virtualization is a technology that provides unified query and access across heterogeneous data sources without copying data into a central location.
Data virtualization abstracts the physical location and format of data, allowing users to query it as if it all lived in a single system. Queries are routed to the appropriate sources, and results are combined and returned through a unified interface. This eliminates the need to copy data for analysis: an analyst can join Oracle tables, PostgreSQL tables, and S3 data in a single query without extracting and loading anything. The virtual layer acts as a schema mapping layer, translating business-friendly definitions into the schemas of the underlying systems.
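The routing-and-combining step can be sketched in a few lines. In this minimal Python sketch, two in-memory SQLite databases stand in for heterogeneous sources (the table names, sample rows, and the `federated_query` helper are all illustrative, not a real virtualization product):

```python
import sqlite3

# Two in-memory SQLite databases stand in for heterogeneous sources,
# e.g. an operational database and a separate prescriptions system.
patients_db = sqlite3.connect(":memory:")
patients_db.execute("CREATE TABLE patients (id INTEGER, name TEXT)")
patients_db.executemany("INSERT INTO patients VALUES (?, ?)",
                        [(1, "Ada"), (2, "Grace")])

orders_db = sqlite3.connect(":memory:")
orders_db.execute("CREATE TABLE orders (patient_id INTEGER, item TEXT)")
orders_db.executemany("INSERT INTO orders VALUES (?, ?)",
                      [(1, "metformin"), (2, "insulin")])

def federated_query():
    """Route a sub-query to each source, then join the results locally."""
    patients = dict(patients_db.execute("SELECT id, name FROM patients"))
    rows = orders_db.execute("SELECT patient_id, item FROM orders").fetchall()
    # Combine in the virtual layer: no data was copied ahead of time.
    return [(patients[pid], item) for pid, item in rows]

print(federated_query())  # [('Ada', 'metformin'), ('Grace', 'insulin')]
```

A real virtual layer does the same thing at a larger scale: it decomposes one logical query into per-source sub-queries and assembles the results behind a single interface.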
Data virtualization became practical through advances in query federation: systems that push computation down to the sources (predicate pushdown) to avoid moving massive datasets. Trade-offs exist: queries may be slower than directly querying a single warehouse, because data movement happens at query time rather than being pre-materialized. Virtual layers also add operational complexity because they depend on connectivity to every source system.
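Predicate pushdown is easiest to see by contrast. A hedged sketch, again using SQLite as a stand-in source (the `events` table and both helper functions are hypothetical): the naive path pulls every row across the wire and filters in the virtual layer, while pushdown sends the predicate to the source so only matching rows move.

```python
import sqlite3

# One in-memory SQLite database stands in for a remote source system.
source = sqlite3.connect(":memory:")
source.execute("CREATE TABLE events (id INTEGER, region TEXT)")
source.executemany("INSERT INTO events VALUES (?, ?)",
                   [(i, "us" if i % 2 else "eu") for i in range(1000)])

def query_without_pushdown(region):
    # Naive federation: fetch all 1000 rows, filter in the virtual layer.
    rows = source.execute("SELECT id, region FROM events").fetchall()
    return [r for r in rows if r[1] == region]

def query_with_pushdown(region):
    # Pushdown: the WHERE clause travels to the source, so only
    # matching rows ever leave it.
    return source.execute(
        "SELECT id, region FROM events WHERE region = ?", (region,)
    ).fetchall()

# Same answer either way; pushdown just moves far less data.
assert query_without_pushdown("us") == query_with_pushdown("us")
```

Over a network, the difference between shipping 1,000 rows and shipping 500 is modest; between billions of rows and thousands, it is what makes federation viable at all.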
Organizations use data virtualization for specific scenarios: accessing rarely-used data that doesn't justify ETL, connecting to source systems that can't be replicated, or providing users access to data without copying sensitive information. Some modern data platforms (Snowflake, BigQuery) include federated query capabilities, reducing need for separate virtualization tools.
Key Characteristics
- Provides a unified interface to heterogeneous data sources
- Routes queries to appropriate sources without copying data
- Includes a schema mapping layer translating logical to physical schemas
- Implements query pushdown to minimize data movement
- Supports caching to improve performance of repeated queries
- Simplifies data governance by managing access through the virtual layer
Why It Matters
- Reduces latency and cost of accessing data not in a central warehouse
- Enables access to sensitive data without copying it to a central location
- Supports real-time queries on operational systems without replication lag
- Reduces time-to-analytics for new data sources without ETL development
- Improves security by centralizing access control through the virtual layer
- Reduces total cost of ownership by avoiding unnecessary data movement
Example
A healthcare provider uses data virtualization to query across siloed systems: patient records in an on-premises legacy database, insurance claims in a SaaS platform, and genomic data in a research cloud environment. An analyst writes a single query for "patients with diabetes, their medications, and insurance coverage" against the virtual layer, which routes parts of the query to each source, retrieves the results, and combines them. The data never leaves the source systems, reducing compliance risk.
Coginiti Perspective
Coginiti's semantic layer and 21+ native connectors provide a form of practical virtualization grounded in governed definitions rather than just federated access. Analysts interact with consistent business concepts in the semantic layer while Coginiti routes queries to the appropriate underlying platform. This approach complements ELT patterns: data remains in the platforms best suited for its workload, while the semantic layer ensures that users experience a unified, governed view regardless of where the data physically resides.
More in Core Data Architecture
Batch Processing
Batch Processing is the execution of computational jobs on large volumes of data in scheduled intervals, processing complete datasets at once rather than responding to individual requests.
Data Architecture
Data Architecture is the structural design of systems, tools, and processes that capture, store, process, and deliver data across an organization to support analytics and business operations.
Data Ecosystem
Data Ecosystem is the complete collection of interconnected data systems, platforms, tools, people, and processes that organizations use to collect, manage, analyze, and act on data.
Data Fabric
Data Fabric is an integrated, interconnected architecture that unifies diverse data sources, platforms, and tools to provide seamless access and movement of data across the organization.
Data Integration
Data Integration is the process of combining data from multiple heterogeneous sources into a unified, consistent format suitable for analysis or operational use.
Data Lifecycle
Data Lifecycle is the complete journey of data from creation or ingestion through processing, usage, governance, and eventual deletion or archival.