Three Varieties of Hybrid Query Execution

Matthew Mullins
December 12, 2024

In the evolving landscape of data analytics, hybrid query execution is emerging as a pivotal innovation. This approach allows for flexible data processing by combining local and remote computational resources. Two factors underpin this shift. First, local computing environments are increasingly powerful: modern laptops and desktops have multiple cores, ample RAM, and high-speed SSD storage. Second, recent research and usage patterns reveal that most user queries are relatively small, often processing under 1GB of data. Because such data can comfortably fit into memory on a personal machine, a local-first approach to data analytics has become both practical and appealing.

Platforms like MotherDuck, GlareDB, and Coginiti exemplify this hybrid model. They start queries locally using small yet powerful in-process analytic query engines such as DuckDB and Apache DataFusion, and then scale out to the cloud as needed. While each platform takes a slightly different approach, their common thread is balancing the best of local and remote worlds to meet diverse analytical needs.

The Local-First Approach: Why It Matters

Historically, large-scale data processing was synonymous with massive, centralized data warehouses and cloud-based clusters. Yet, a substantial fraction of analytical workloads simply don’t require that scale. According to analyses of Snowflake and Redshift usage patterns, a significant majority of user queries scan less than 1GB of data, an amount easily processed in memory on a modern laptop. (For reference, see insights from Fivetran’s blog on Snowflake and Redshift usage and MotherDuck’s blog discussing Redshift query sizes.)

This reality challenges the assumption that “bigger is always better” for data processing infrastructure. Instead, many data professionals benefit from the performance and cost-efficiency of local execution for everyday analytic tasks. With robust local query engines, even complicated analytical queries can run interactively right on a developer’s machine.

Enabling Technologies: In-Process Analytic Query Engines

Central to the local-first paradigm is a new class of in-process analytic query engines. These engines provide powerful query execution capabilities directly within the host environment, with no need for a separate server or distributed cluster. Two prominent examples are:

DuckDB: An embeddable analytics database designed to run entirely within your process. It’s optimized for analytical queries on columnar data, delivering near in-memory speed for tasks on files like CSV and Parquet.

Apache DataFusion: Part of the Apache Arrow ecosystem, DataFusion is a query execution framework written in Rust. It’s designed for efficient, in-memory operations and can be embedded into applications or other systems.

Such engines eliminate the overhead of remote round-trips and allow users to interact with data quickly and iteratively. When working datasets are small (under 1GB), these local engines shine, enabling interactive data exploration, prototyping, and advanced analytics with minimal latency and overhead.
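To make "in-process" concrete, here is a minimal sketch of a query that runs entirely inside the calling program. It uses Python's standard-library sqlite3 module purely as a stand-in so that the example runs anywhere; SQLite is not an analytic engine, but DuckDB's Python package exposes a similar connect-and-execute flow. The table, column names, and sample rows are made up for illustration.

```python
import sqlite3

# Hypothetical sample data; names and values are illustrative only.
ORDERS = [
    ("east", 120.0),
    ("west", 80.0),
    ("east", 30.0),
    ("west", 20.0),
]


def total_by_region(rows):
    """Aggregate order amounts per region with an in-process SQL engine."""
    # ":memory:" keeps the whole database inside this Python process:
    # there is no server to start and no network round-trip per query.
    con = sqlite3.connect(":memory:")
    try:
        con.execute("CREATE TABLE orders (region TEXT, amount REAL)")
        con.executemany("INSERT INTO orders VALUES (?, ?)", rows)
        cur = con.execute(
            "SELECT region, SUM(amount) FROM orders "
            "GROUP BY region ORDER BY region"
        )
        return cur.fetchall()
    finally:
        con.close()


print(total_by_region(ORDERS))  # [('east', 150.0), ('west', 100.0)]
```

The point of the sketch is the shape of the interaction: open a connection, run SQL, get rows back, all without leaving the host process.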

The Platforms: MotherDuck, GlareDB, and Coginiti

MotherDuck: Scaling DuckDB to the Cloud

MotherDuck builds on DuckDB’s local execution strengths. Users start by running queries locally in the command line or within a notebook. This setup is ideal for the majority of everyday queries that fit comfortably in local memory. However, when a user encounters more significant workloads, such as larger datasets, more concurrent queries, or complex transformations, MotherDuck allows pushing query execution to its cloud infrastructure. By combining local interactivity with scalable remote compute, MotherDuck positions itself as a cloud data warehouse that meets users where they start, on their own machine.

GlareDB: Combining Local Execution with a Private VPC

GlareDB begins with local processing via its own database engine. As query sizes or complexity grow, users can seamlessly move the workload to a Virtual Private Cloud (VPC) they control, bringing their own cloud resources into play. While GlareDB identifies as a database, its ability to federate queries across different data sources, akin to data fabric tools like Trino or Dremio, expands its utility. This hybrid execution ensures that small queries remain snappy and local, while larger or distributed analytics workflows can scale out and leverage remote resources.

Coginiti: Hybrid Execution with Choice of Cloud Warehouse

Coginiti integrates DuckDB directly within the client application, again taking advantage of low-latency, in-memory queries on small datasets. When queries demand additional horsepower, Coginiti enables users to push workloads to the cloud data warehouse of their choice—be it BigQuery, Snowflake, Redshift, or another platform. This approach fosters a collaborative data operations environment, allowing teams to work with their local data interactively and only move to the cloud when beneficial. The result is a flexible, user-driven decision-making process around resource utilization and cost efficiency.
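The idea of letting the user pick where a query runs can be sketched as a small registry of execution targets. This is an illustrative assumption about how such a choice could be modeled, not a description of Coginiti's implementation; the target names and the `make_executor` and `run_on` helpers are hypothetical, and the executors only return a label instead of calling a real engine or warehouse client.

```python
from typing import Callable, Dict

# Hypothetical executor type: takes SQL text, returns a description of
# where it would run. Real executors would call an engine or client library.
Executor = Callable[[str], str]


def make_executor(name: str) -> Executor:
    """Build a placeholder executor that tags the SQL with its target."""
    def run(sql: str) -> str:
        return f"[{name}] {sql}"
    return run


# Registry keyed by a user-facing target name (names are illustrative only).
TARGETS: Dict[str, Executor] = {
    name: make_executor(name)
    for name in ("local-duckdb", "bigquery", "snowflake", "redshift")
}


def run_on(target: str, sql: str) -> str:
    """Run sql on the target the user picked; fail loudly on unknown names."""
    try:
        executor = TARGETS[target]
    except KeyError:
        raise ValueError(f"unknown target: {target!r}") from None
    return executor(sql)


print(run_on("local-duckdb", "SELECT 1"))  # [local-duckdb] SELECT 1
print(run_on("snowflake", "SELECT 1"))     # [snowflake] SELECT 1
```

Keeping the choice explicit, rather than automatic, matches the user-driven model described above: the same SQL text can be pointed at a different target without changing the calling code.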

Common Threads and Distinguishing Factors

All three platforms embrace the local-first methodology, starting with in-process query engines that excel on small, easily managed datasets. They also work with a range of open table formats (like Apache Iceberg and Delta Lake), in line with the broader trend toward more open, flexible data architectures.

– MotherDuck emphasizes a seamless transition from local DuckDB execution to a fully-managed, scalable cloud data warehouse environment.

– GlareDB doubles as a query federation layer, bridging local execution and diverse remote data sources. Its bring-your-own-cloud (BYOC) model lets enterprises retain control over their computing environment.

– Coginiti champions user choice and collaboration, enabling teams to work locally and push workloads to a variety of existing cloud warehouses.
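The pattern all three platforms share, run locally when the data is small and go remote otherwise, can be sketched as a simple size-based routing rule. This is a simplified illustration under stated assumptions, not how any of these products decides: the 1 GB limit mirrors the query-size figure cited earlier, and `QueryPlan`, `estimated_scan_bytes`, and `choose_target` are hypothetical names.

```python
from dataclasses import dataclass

# Assumption: 1 GB mirrors the "most queries scan under 1GB" observation;
# a real platform would likely use a richer cost model than a single limit.
LOCAL_LIMIT_BYTES = 1 * 1024**3


@dataclass(frozen=True)
class QueryPlan:
    sql: str
    estimated_scan_bytes: int  # hypothetical estimate, e.g. summed file sizes


def choose_target(plan: QueryPlan, local_limit: int = LOCAL_LIMIT_BYTES) -> str:
    """Return "local" when the estimated scan fits the budget, else "remote"."""
    return "local" if plan.estimated_scan_bytes <= local_limit else "remote"


small = QueryPlan("SELECT count(*) FROM events_sample", 200 * 1024**2)
large = QueryPlan("SELECT * FROM events_all", 50 * 1024**3)

print(choose_target(small))  # local
print(choose_target(large))  # remote
```

In practice the decision could also weigh concurrency, where the data lives, or an explicit user choice; scan size is just the easiest signal to show.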

The hybrid query execution model is grounded in a recognition that most queries are small, so small that they can efficiently run in memory on a laptop. When combined with powerful in-process analytic engines like DuckDB and Apache DataFusion, this realization paves the way for a local-first approach. Platforms like MotherDuck, GlareDB, and Coginiti illustrate how this strategy can elegantly scale: start local, move to the cloud as needed, and meet users where they are, with the datasets they have.

This hybrid architecture doesn’t just reflect technological advancement in query engines; it mirrors a shift in how we conceptualize and operationalize analytics. Rather than defaulting to expensive remote infrastructure, we leverage local resources for speed, interactivity, and cost efficiency, turning to the cloud only when the scope or scale of our tasks genuinely demands it.