Glossary/Analytics & Querying

Data Aggregation

Data aggregation is the process of combining multiple rows of data using aggregate functions to compute summary statistics, totals, averages, and other derived metrics.

Aggregation transforms raw transactional data into business metrics. Rather than examining individual transactions, aggregation computes totals (sum of revenue), counts (number of customers), averages (mean purchase value), or other summaries. These metrics drive business decisions: How much revenue did we generate? How many customers are active? What is average order value by region?

Common aggregate functions include SUM (total), COUNT (row count), AVG (average), MIN/MAX (extremes), and statistical functions (STDDEV, PERCENTILE). Queries often group aggregations by dimensions (region, product, time period) to slice metrics across business domains. Aggregation is computationally expensive for large datasets, requiring scanning all rows and combining results through hash tables or sorting.

Aggregation is so fundamental to analytics that entire optimization strategies focus on making it efficient. Materialized views pre-aggregate, dynamic tables incrementally update aggregates, and indexes enable aggregate computation without full table scans. Modern systems often separate OLTP (efficient single-row operations) from OLAP (efficient aggregation) using different data structures.

Key Characteristics

  • Combine rows using aggregate functions (SUM, COUNT, AVG, MIN, MAX)
  • Group results by dimensions for sliced metrics
  • Reduce data volume from millions to thousands of rows
  • Require scanning all relevant data before computing results
  • Support hierarchy (total company, by division, by region)
  • Enable efficient metrics computation through materialization

Why It Matters

  • Enable business metrics and KPIs from raw transaction data
  • Reduce data volume from billions to millions for dashboards
  • Support fast metric queries through pre-aggregation
  • Enable trend analysis and comparative metrics across dimensions
  • Drive decision-making through summarized, actionable insights
  • Reduce query costs by computing aggregates once instead of repeatedly

Example

`
-- Row-level transaction data (millions of rows)
SELECT * FROM orders LIMIT 5;
-- order_id | customer_id | order_date | product_id | order_value
-- 1        | 100         | 2024-01-01 | 50         | 99.99
-- 2        | 101         | 2024-01-01 | 51         | 149.99
-- 3        | 100         | 2024-01-02 | 52         | 59.99

-- Aggregation: total revenue by day
SELECT 
  DATE(order_date) as sale_date,
  SUM(order_value) as daily_revenue,
  COUNT(*) as order_count,
  COUNT(DISTINCT customer_id) as unique_customers,
  AVG(order_value) as avg_order_value
FROM orders
WHERE order_date >= '2024-01-01'
GROUP BY DATE(order_date)
ORDER BY sale_date;

-- Result (thousands of rows instead of millions):
-- sale_date  | daily_revenue | order_count | unique_customers | avg_order_value
-- 2024-01-01 | 249.98        | 2           | 2                | 124.99
-- 2024-01-02 | 59.99         | 1           | 1                | 59.99

-- Hierarchical aggregation
SELECT 
  product_category,
  region,
  SUM(order_value) as category_region_revenue,
  SUM(SUM(order_value)) OVER (PARTITION BY product_category) as category_total
FROM orders o
JOIN products p ON o.product_id = p.id
JOIN customers c ON o.customer_id = c.id
GROUP BY product_category, region;
`

Coginiti Perspective

Coginiti's semantic layer centralizes aggregation definitions. SMDL measures declare their aggregation type (sum, count, count_distinct, avg, min, max, median, stdev, stdevp, var, varp, or custom expressions), and the MEASURE() function in Semantic SQL applies the correct aggregation automatically. This eliminates the common problem of different teams using different aggregation logic for the same metric, because the aggregation is defined once in the semantic model and enforced for every query.

See Semantic Intelligence in Action

Coginiti operationalizes business meaning across your entire data estate.