This doc is specifically about why we chose the cumulative counter pattern over writing deltas. The mechanism itself (how counters become chart values) lives in metrics-architecture. Read that first if you haven’t.
The question this doc answers
People look at the pipeline and reasonably ask: why don't we just subtract consecutive readings in the agent and write the result? One row per bucket saying "container X used 234 microseconds of CPU between 100s and 115s". Less math at query time. Simpler schema.

The short version: cumulative counter rows are idempotent under duplicate writes. Delta rows are not. In any pipeline with retries, rolling updates, or backfills, duplicate rows will happen. With cumulative counters they don't change the aggregate. With deltas they inflate it. The rest of this doc is why that difference shapes the entire architecture.
The invariant we protect
Never produce an inflated reading. Undercount is acceptable.

An undercount is a silent revenue leak (if anyone ever invoices from this data) or a minor chart gap (if they don't). An overcount can become a wrong invoice, a refund request, an audit trail problem, or a lawsuit. The asymmetry forces an asymmetric design: the pipeline must be structurally incapable of double-counting, even when individual components misbehave.

Every realistic metering pipeline contains at least the following unreliable steps:
- The agent crashes and its replacement starts up.
- A rolling update puts two agents on the same node for 30 seconds.
- ClickHouse rejects a batch and the agent retries.
- An `async_insert` buffer is flushed twice because of a network blip.
- A backfill job re-runs a day's ingestion after a schema fix.
- An engineer re-runs heimdall against a test pod to reproduce a bug.
Delta model: duplicates inflate the aggregate
Agent reads the counter, subtracts the previous reading, writes the delta: 500 for bucket_1, 600 for bucket_2. `sum(delta) = 1100`. Correct.
Now the agent retries the bucket_2 write after a network blip:
`sum(delta) = 1700`. The customer was billed roughly 55% more than they actually used.

There's no aggregation function over (500, 600, 600) that gives the right answer without also knowing "this row is the duplicate". Deduplication by `(pod, ts)` only works if timestamps are guaranteed identical, and under concurrent agents they rarely are.
The exactly-once requirement propagates all the way to the wire: every single write must be delivered exactly once, or the bill is wrong. That’s unachievable in any realistic distributed system.
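The inflation is easy to reproduce. A minimal sketch in plain Python, standing in for the ClickHouse aggregation, with the numbers from the example above:

```python
# Delta model: each row carries the per-bucket usage directly.
deltas = [500, 600]          # bucket_1, bucket_2 (CPU microseconds)
assert sum(deltas) == 1100   # correct

# The agent's write for bucket_2 times out and is retried; the first
# write actually landed, so the row now exists twice.
deltas_after_retry = [500, 600, 600]
assert sum(deltas_after_retry) == 1700  # inflated by ~55%
```

Nothing inside `sum()` can tell the legitimate 600 apart from the duplicated one.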
Cumulative counter model: duplicates don’t matter
Agent reads the counter, writes the raw value: 1500 at bucket_1, 2100 at bucket_2. `max(counter) - min(counter) = 2100 - 1500 = 600`. Now replay the same retry: the duplicate row has zero effect on the aggregate. `max` and `min` over the same set of values are identical whether you include a value once or a hundred times.
This is the load-bearing property. Every other design choice in the metering pipeline exists to preserve it.
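The same retry against cumulative rows, sketched with the readings from the example above:

```python
# Cumulative model: each row carries the raw counter value.
readings = [1500, 2100]      # bucket_1, bucket_2
assert max(readings) - min(readings) == 600

# Same network blip, same retry: the bucket_2 row is written twice.
readings_after_retry = [1500, 2100, 2100]
assert max(readings_after_retry) - min(readings_after_retry) == 600  # unchanged
```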
This is the standard pattern, not a local invention
Every serious monitoring system that cares about metering-grade correctness converges on cumulative counters:
- Prometheus defines `Counter` as monotonically non-decreasing; rates are always computed at query time via `rate()`/`increase()`, never sent over the wire. Metric types, rate() semantics.
- Prometheus remote_write is explicitly designed around cumulative counters so that retries and duplicates are idempotent. Remote write spec.
- OpenMetrics (the CNCF spec that grew out of Prometheus) mandates a `_total` suffix and monotonically non-decreasing values for counters. OpenMetrics spec.
- OpenTelemetry distinguishes Cumulative and Delta "temporality" in its spec, and is explicit that delta temporality requires exactly-once delivery to be correct while cumulative does not. Data model temporality.
- GCP Cloud Monitoring marks per-instance network counters (`instance/network/received_bytes_count` and friends) as `CUMULATIVE`. GCP metrics.
- RRDtool (the 1999 ancestor of all of this) has a `COUNTER` data source type that stores raw cumulative values and computes rates at read time. rrdcreate(1).
- The Linux kernel itself exposes every per-cgroup and per-device resource counter as cumulative (`cpu.stat` `usage_usec`, `memory.current`, `/proc/net/dev`). We're matching the grain of the source, not fighting it.
The one mainstream exception is AWS CloudWatch's EC2 NetworkIn / NetworkOut, which publish per-period byte sums (delta-like) from the hypervisor. That works because AWS owns the entire pipeline end-to-end: their ingestion is effectively exactly-once because they wrote both the publisher and the consumer. We don't have that guarantee between heimdall and ClickHouse, so we can't make the same bet.
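One more property of computing rates at query time is worth spelling out: the aggregation also has to tolerate counter resets, the "agent crashes and its replacement starts up" case where the raw value drops back toward zero. A hedged sketch of the reset-aware accumulation that `rate()`/`increase()`-style functions perform; the function name and samples here are illustrative, not our actual query:

```python
def increase(samples):
    """Total increase of a cumulative counter over (ts, value) samples,
    tolerating resets, in the spirit of Prometheus increase()."""
    total, prev = 0, None
    for _, value in sorted(samples):
        if prev is not None:
            if value >= prev:
                total += value - prev
            else:
                # Counter reset: the replacement process started near
                # zero, so the whole new value counts as fresh increase.
                total += value
        prev = value
    return total

# Normal case: matches max - min.
assert increase([(0, 1500), (15, 2100)]) == 600
# After a restart the counter drops; the result never goes negative
# and never inflates. Anything used between the last sample and the
# crash is lost -- an undercount, which the invariant accepts.
assert increase([(0, 1500), (15, 2100), (30, 100)]) == 700
```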
When I’d reconsider
The cumulative-counter design is strictly safer today. I'd re-evaluate only if both of these became true:
- We moved metering behind a single-writer log-structured pipeline with guaranteed exactly-once delivery to ClickHouse (Kafka with transactional writes, a reliable stream processor holding offsets). That would eliminate duplicate-write risk at the source.
- We had a separate, independent audit path that re-reads raw counters and cross-checks the aggregated number, so a delta-pipeline bug would be caught before any downstream consumer (dashboard, export, billing) acted on it.
Further reading
- Google SRE Book, Ch. 6 “Monitoring Distributed Systems”. Grounding on counter-based monitoring and the “four golden signals”.
- Prometheus: Counter vs Gauge. The standard writeup for why counters work the way they do.

