Hierarchical Approximate Aggregation for Interactive Analytics on Compressed Columnar Storage in the Cloud
Abstract
Cloud data platforms increasingly rely on compressed columnar storage to reduce storage footprint and I/O, yet interactive analytics remains constrained by the latency of scanning, decompressing, and aggregating large datasets. This tension is most visible in exploratory workflows where analysts iteratively adjust filters, groupings, and time ranges while expecting near-immediate feedback. Approximate query processing can lower latency by trading exactness for bounded error, but many existing techniques either ignore the physical organization of columnar files or require heavy precomputation that limits query flexibility. This paper presents a hierarchical approximate aggregation approach designed specifically for compressed columnar storage in cloud object stores. The core idea is to construct a multi-resolution hierarchy aligned with row groups and pages, where each node stores mergeable synopses for common aggregates, including additive measures, distinct counts, and distributional summaries. By operating in a compression-aware manner, the synopsis construction and updates exploit encoding structure such as dictionary maps and run-length segments to reduce decode overhead. At query time, an adaptive planner selects a cut through the hierarchy to satisfy an interactive latency budget while controlling statistical error, then refines results progressively as users drill down. The design integrates with cloud-native execution by decoupling synopses into compact sidecar objects, supporting append-heavy ingestion and multi-tenant caching. Analytical models characterize error propagation and cost trade-offs, motivating a practical selection strategy for hierarchical refinement under heterogeneous cloud I/O and compute variability.