Overview

soonTM = I don't have a good name yet. soonTM is a Go DaemonSet that collects per-deployment CPU, memory, and network egress metrics from Kubernetes nodes and writes them to ClickHouse for billing. It also tracks deployment lifecycle events (start, stop, scale) with millisecond-precise timestamps. soonTM runs once per node, on every node that hosts customer deployments (the untrusted nodepool).

Architecture

1. Collect

soonTM scrapes two kubelet endpoints on the local node at a configurable interval (default 15s) and watches pod events via a K8s informer. It also queries Cilium Hubble for public vs internal egress classification.
Source          What                                  Endpoint
kubelet         CPU + memory                          /metrics/resource
kubelet         Network tx/rx bytes                   /stats/summary
K8s API         Lifecycle events (start/stop/scale)   Pod informer (watch)
Cilium Hubble   Public egress classification          gRPC via hubble.relay
2. Buffer to disk

Every sample and lifecycle event is written to a disk WAL (EBS volume) before anything else. Billing data is never held only in memory.
3. Drain to ClickHouse

A background loop reads completed WAL segments and batch-inserts them into ClickHouse. On success, the segment is deleted. On failure, it stays on disk for retry.
4. S3 overflow

If the disk fills up, the oldest segments are uploaded to S3. Once ClickHouse recovers, re-ingest from S3:
INSERT INTO default.container_resources_raw_v1
SELECT * FROM s3('s3://bucket/metering-wal/**/*.ndjson', 'JSONEachRow')

What soonTM Collects

Resource Usage Samples (configurable interval, default 15s)

For each krane-managed pod on the node, soonTM records:
Metric                     Source              Description
cpu_millicores             /metrics/resource   Actual CPU usage rate (computed from cumulative nanosecond counter delta)
memory_working_set_bytes   /metrics/resource   Actual memory working set
cpu_request_millicores     Pod spec            Requested CPU (scheduling guarantee)
cpu_limit_millicores       Pod spec            CPU limit (hard cap)
memory_request_bytes       Pod spec            Requested memory
memory_limit_bytes         Pod spec            Memory limit
network_tx_bytes           /stats/summary      Total egress bytes since last sample (delta)
network_tx_bytes_public    Cilium Hubble       Public-only egress (non-RFC1918 destinations)
Each sample is tagged with workspace_id, project_id, app_id, environment_id, deployment_id, instance_id (pod name), region, and platform.

Lifecycle Events (real-time)

soonTM watches for pod state changes via Kubernetes informers and emits events with millisecond-precise timestamps:
Event     When                               Why it matters
started   Pod appears and is running         Billing window begins
stopped   Pod is removed                     Billing window ends
scaled    ReplicaSet replica count changes   Allocated billing changes mid-period
Each lifecycle event records the deployment’s resource allocation at that moment (replicas, cpu_limit, memory_limit).

Why Two Data Streams

Usage samples and lifecycle events answer different questions:
  • Usage samples → “how much CPU/memory did this pod actually consume in this collection interval?”
  • Lifecycle events → “exactly when did this deployment start, stop, or change its allocation?”

Why usage samples alone aren’t enough

Usage samples arrive every collection interval. But deployments don’t start and stop on interval boundaries:
Timeline:
  :00.000  ─── nothing ───
  :00.100  ← pod starts (no sample knows about this)
  :15.000  ← first usage sample arrives
  :30.000  ← second sample
  :45.000  ← last sample
  :52.300  ← pod stops (no sample captures this either)
Without lifecycle events, those first 14.9 seconds and last 7.3 seconds are invisible. For both billing models that’s unacceptable — we’d be rounding to 15s boundaries when we have the data to be ms-precise.

What lifecycle events enable

Usage samples give us the actual consumption rate. Lifecycle events give us the exact billing window. We know the pod started at :00.100 and stopped at :52.300, so we prorate the first and last intervals to the millisecond instead of snapping to the nearest 15s sample.

How allocated billing works

Each lifecycle event marks a change in what’s reserved. Between two events, the allocation is constant:
Timeline for deployment X:

  14:00:00.100  started  │ 2 replicas × 500m CPU × 256Mi mem

  14:32:17.483  scaled   │ 4 replicas × 500m CPU × 256Mi mem

  15:07:44.917  stopped  │ 0

Billing:
  Interval 1: 14:00:00.100 → 14:32:17.483 = 1,937,383ms
    allocated_cpu = 2 × 500m × 1,937,383ms = 1,937,383,000 millicores·ms

  Interval 2: 14:32:17.483 → 15:07:44.917 = 2,127,434ms
    allocated_cpu = 4 × 500m × 2,127,434ms = 4,254,868,000 millicores·ms

  Total: 6,192,251,000 millicores·ms = ~1.72 CPU-hours
The billing service fetches lifecycle events from ClickHouse, walks them chronologically per deployment, and computes replicas × limit × duration for each interval. This is done in Go, not SQL — ClickHouse stores the events, the billing service does the math.
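
The chronological walk can be sketched in Go. This is a minimal illustration of the arithmetic, not soonTM's actual code; the lifecycleEvent struct and its field names are assumptions standing in for rows of deployment_lifecycle_events_v1:

```go
package main

import "fmt"

// lifecycleEvent mirrors one row of deployment_lifecycle_events_v1.
// Field names are illustrative, not soonTM's actual structs.
type lifecycleEvent struct {
	timeMs        int64 // unix milliseconds
	replicas      int64 // replica count from this event onward
	cpuLimitMilli int64 // per-replica CPU limit in millicores
}

// allocatedCPU walks events chronologically and sums
// replicas × cpu_limit × duration for each interval, in millicores·ms.
func allocatedCPU(events []lifecycleEvent) int64 {
	var total int64
	for i := 0; i+1 < len(events); i++ {
		durMs := events[i+1].timeMs - events[i].timeMs
		total += events[i].replicas * events[i].cpuLimitMilli * durMs
	}
	return total
}

func main() {
	// Deployment X's timeline, expressed as ms offsets from 14:00:00.000.
	events := []lifecycleEvent{
		{timeMs: 100, replicas: 2, cpuLimitMilli: 500},       // started 14:00:00.100
		{timeMs: 1_937_483, replicas: 4, cpuLimitMilli: 500}, // scaled  14:32:17.483
		{timeMs: 4_064_917, replicas: 0, cpuLimitMilli: 500}, // stopped 15:07:44.917
	}
	total := allocatedCPU(events)
	fmt.Printf("%d millicores·ms = ~%.2f CPU-hours\n", total, float64(total)/1000/3_600_000)
	// prints: 6192251000 millicores·ms = ~1.72 CPU-hours
}
```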

How CPU is Measured

The kernel tracks CPU as a cumulative nanosecond counter — it only ever goes up. The kubelet reads this via cAdvisor and exposes it through its API. We don’t read cgroups directly. Every collection interval, soonTM grabs the counter from the kubelet. With two consecutive readings we compute the actual usage:
Example:
  t0 (00:00:00): counter = 5,000,000,000 ns
  t1 (00:00:15): counter = 5,750,000,000 ns

  delta = 750,000,000 ns consumed in 15s
  rate  = 750,000,000 / 15,000,000,000 = 0.05 cores = 50 millicores
This is not an instantaneous snapshot — it’s exactly how much CPU was consumed between two readings. No spikes are missed, no idle time is overcounted. The kernel counted every nanosecond.
For billing, we store each of these computed rates as a sample. The total CPU consumed over a billing period is the sum of all samples — not an average. Each sample represents actual usage over its collection interval.

Edge Windows: Start and Stop

Computing CPU rate requires two consecutive readings. This creates a blind spot at pod start (no previous reading) and pod stop (no next reading). This is a physical limitation — every metrics tool has it.
We do an immediate kubelet read when the pod informer fires AddFunc:
:00.100  pod starts → informer AddFunc fires
:00.120  soonTM immediately reads kubelet → counter = X (reading #1)
:15.000  regular tick → counter = Y (reading #2)
         → cpu for :00.120 → :15.000 = (Y - X) / 14.88s ✓
The blind spot shrinks from ~15s to milliseconds (however fast we can hit the kubelet API after the informer event).
The worst-case gap is one collection interval billed at the allocated rate instead of actual usage. For a pod running hours or days, this is negligible. The allocated rate is also the ceiling — the customer is never charged more than what they reserved.
Even after a container dies, cAdvisor retains its stats in memory for up to 2 minutes (--storage_duration), so the “race to read on Terminating” has a decent safety margin. The kubelet’s container GC used to be tunable via --minimum-container-ttl-duration, but that flag has been removed.

How Network Egress is Split

Total egress comes from the kubelet Summary API (txBytes counter delta). To split internal vs public:
1. Query Hubble

soonTM queries Cilium Hubble (gRPC API via hubble.relay) for each pod’s outbound network flows.
2. Classify by destination

Flows to RFC1918 destinations (10.0.0.0/8, 172.16.0.0/12, 192.168.0.0/16) are internal. Everything else is public.
3. Store both

network_tx_bytes_public = sum of tx bytes to non-RFC1918 destinations. Internal egress = network_tx_bytes - network_tx_bytes_public (computed at query time).
Only public egress is typically billed. Internal egress (pod-to-pod, pod-to-service) is free.

Durability: Disk WAL + S3 Overflow

Billing data cannot be dropped. Unlike analytics (where losing a few events is acceptable), missing billing data means lost revenue or overcharging.
soonTM uses a write-ahead log (WAL) on a dedicated EBS volume:
1. Write to disk first

Every metric sample and lifecycle event is written to a segment file on disk before anything else.
2. Drain to ClickHouse

A background drain loop reads completed segments and batch-inserts them into ClickHouse. On success, the segment file is deleted.
3. Retry on failure

If ClickHouse is down, the segment stays on disk and is retried on the next loop. Data is never lost.
4. S3 overflow

If the disk fills up (EBS volume approaching capacity), the oldest segments are uploaded to S3 and deleted locally.
5. Re-ingest from S3

Once ClickHouse recovers, data from S3 can be re-ingested with zero custom code:
INSERT INTO default.container_resources_raw_v1
SELECT * FROM s3(
    's3://bucket/metering-wal/resources/**/*.ndjson',
    'JSONEachRow'
)
Segment files use NDJSON format (one JSON object per line) so ClickHouse can read them directly from S3 without any transformation.
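
A minimal Go sketch of the drain loop's retry contract, with a fake inserter standing in for the ClickHouse client. All names here are illustrative, not soonTM's actual types:

```go
package main

import (
	"errors"
	"fmt"
)

// segment is a completed WAL segment ready to drain.
type segment struct {
	name string
	rows []string // NDJSON lines
}

// inserter abstracts the ClickHouse batch insert.
type inserter interface {
	insertBatch(rows []string) error
}

// drainOnce walks completed segments in order, batch-inserting each.
// Segments that insert successfully are dropped (deleted on disk in the
// real loop); on the first failure, this and all later segments are kept
// for the next loop iteration. Nothing is ever acknowledged before insert.
func drainOnce(segments []segment, ins inserter) (remaining []segment) {
	for i, seg := range segments {
		if err := ins.insertBatch(seg.rows); err != nil {
			return segments[i:] // retained for retry
		}
	}
	return nil
}

// flakyInserter simulates ClickHouse going down after N successful batches.
type flakyInserter struct{ failAfter int }

func (f *flakyInserter) insertBatch(rows []string) error {
	if f.failAfter == 0 {
		return errors.New("clickhouse unavailable")
	}
	f.failAfter--
	return nil
}

func main() {
	segs := []segment{{name: "seg-001"}, {name: "seg-002"}, {name: "seg-003"}}
	left := drainOnce(segs, &flakyInserter{failAfter: 1})
	fmt.Println(len(left)) // prints 2: two segments stay on disk for retry
}
```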

ClickHouse Data Model

Raw Tables

Table                            Rows                                     TTL        Purpose
container_resources_raw_v1       1 per instance per collection interval   90 days    Raw usage samples
deployment_lifecycle_events_v1   1 per start/stop/scale                   365 days   Lifecycle events

Table Schemas

CREATE TABLE default.container_resources_raw_v1
(
    `time` Int64,                        -- unix milliseconds
    `workspace_id` String,
    `project_id` String,
    `app_id` String,
    `environment_id` String,
    `deployment_id` String,
    `instance_id` String,                -- pod name
    `region` LowCardinality(String),
    `platform` LowCardinality(String),   -- "aws", "gcp", etc.

    -- Actual usage (from kubelet)
    `cpu_millicores` Float64,            -- computed rate from counter delta
    `memory_working_set_bytes` Int64,

    -- Allocated resources (from pod spec)
    `cpu_request_millicores` Int32,
    `cpu_limit_millicores` Int32,
    `memory_request_bytes` Int64,
    `memory_limit_bytes` Int64,

    -- Network egress (deltas since last sample)
    `network_tx_bytes` Int64,            -- total egress
    `network_tx_bytes_public` Int64      -- public-only (non-RFC1918)
)
ENGINE = MergeTree()
ORDER BY (workspace_id, app_id, deployment_id, time)
PARTITION BY toDate(fromUnixTimestamp64Milli(time))
TTL toDateTime(fromUnixTimestamp64Milli(time)) + INTERVAL 90 DAY DELETE
CREATE TABLE default.deployment_lifecycle_events_v1
(
    `time` Int64,                        -- unix milliseconds (ms-precise)
    `workspace_id` String,
    `project_id` String,
    `app_id` String,
    `environment_id` String,
    `deployment_id` String,
    `region` LowCardinality(String),
    `platform` LowCardinality(String),

    `event` LowCardinality(String),      -- "started", "stopped", "scaled"
    `replicas` Int32,                    -- replica count at this moment
    `cpu_limit_millicores` Int32,        -- per-replica CPU limit
    `memory_limit_bytes` Int64           -- per-replica memory limit
)
ENGINE = MergeTree()
ORDER BY (workspace_id, app_id, deployment_id, time)
PARTITION BY toDate(fromUnixTimestamp64Milli(time))
TTL toDateTime(fromUnixTimestamp64Milli(time)) + INTERVAL 365 DAY DELETE
All four aggregation tables share the same column structure, differing only in time granularity and TTL:
-- Same structure for per_minute (30d TTL), per_hour (90d TTL), per_day (365d TTL), per_month (no TTL)
CREATE TABLE default.container_resources_per_{minute,hour,day,month}_v1
(
    `time` DateTime,                                            -- or Date for day/month
    `workspace_id` String,
    `project_id` String,
    `app_id` String,
    `environment_id` String,
    `deployment_id` String,

    `cpu_millicores_sum` SimpleAggregateFunction(sum, Float64),
    `memory_bytes_max` SimpleAggregateFunction(max, Int64),
    `memory_bytes_sum` SimpleAggregateFunction(sum, Float64),
    `cpu_limit_millicores_max` SimpleAggregateFunction(max, Int32),
    `memory_limit_bytes_max` SimpleAggregateFunction(max, Int64),
    `network_tx_bytes_sum` SimpleAggregateFunction(sum, Int64),
    `network_tx_bytes_public_sum` SimpleAggregateFunction(sum, Int64),
    `sample_count` SimpleAggregateFunction(sum, Int64)
)
ENGINE = AggregatingMergeTree()
ORDER BY (workspace_id, app_id, deployment_id, time)
Each level is populated by a materialized view that aggregates from the level below (raw → minute → hour → day → month).

Materialized View Aggregation Chain

container_resources_raw_v1
       │ (MV: group by minute, full hierarchy)

container_resources_per_minute_v1   ← 30 day TTL
       │ (MV: group by hour)

container_resources_per_hour_v1     ← 90 day TTL
       │ (MV: group by day)

container_resources_per_day_v1      ← 365 day TTL
       │ (MV: group by month)

container_resources_per_month_v1    ← no TTL
All aggregation tables preserve the full hierarchy: workspace_id, project_id, app_id, environment_id, deployment_id. This enables queries at any level — per-deployment, per-app, or per-workspace.
These MVs exist for dashboard performance — pre-aggregating so time-series graphs don’t scan millions of raw rows. Billing uses the raw tables + lifecycle events directly.

What Each Aggregation Level Stores

Column                        Meaning                     How to use
cpu_millicores_sum            Sum of all CPU samples      / sample_count = avg CPU
memory_bytes_max              Peak memory in the window   For peak-based billing
memory_bytes_sum              Sum of all memory samples   / sample_count = avg memory
cpu_limit_millicores_max      Max allocated CPU           For allocated billing
memory_limit_bytes_max        Max allocated memory        For allocated billing
network_tx_bytes_sum          Total egress bytes          For egress billing
network_tx_bytes_public_sum   Public egress bytes         For public egress billing
sample_count                  Number of samples           For computing averages

Billing Models Supported

Usage-based billing: bill for what was actually consumed. The billing service:
  1. Fetches lifecycle events to get the exact billing window (ms-precise start/stop)
  2. Queries raw samples within that window for actual CPU/memory/egress consumed
  3. Prorates the first and last intervals to the millisecond using the started/stopped event timestamps
  4. For edge windows where no CPU sample exists, bills at the allocated rate (cpu_limit)
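
Step 3's proration can be sketched as a pure function in Go. This is a hypothetical helper illustrating the overlap math; soonTM's billing service may structure it differently:

```go
package main

import "fmt"

// prorate returns the fraction of a sample's interval [sampleStart, sampleEnd)
// that overlaps the billing window [windowStart, windowEnd), all in unix ms.
// Edge samples are thereby billed to the millisecond instead of a full interval.
func prorate(sampleStart, sampleEnd, windowStart, windowEnd int64) float64 {
	lo, hi := sampleStart, sampleEnd
	if windowStart > lo {
		lo = windowStart
	}
	if windowEnd < hi {
		hi = windowEnd
	}
	if hi <= lo {
		return 0 // sample lies entirely outside the billing window
	}
	return float64(hi-lo) / float64(sampleEnd-sampleStart)
}

func main() {
	// The pod stopped at :52.300, so the sample interval :45.000 → :60.000
	// is only billed for 7.3s of its 15s.
	fmt.Printf("%.4f\n", prorate(45_000, 60_000, 100, 52_300)) // prints 0.4867
}
```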

Deployment

soonTM runs as a Kubernetes DaemonSet on all untrusted nodes (the same nodes that run customer deployments).
Resource              Value
CPU request           20m
CPU limit             50m
Memory request        32Mi
Memory limit          64Mi
Disk (EBS)            5-10Gi gp3
Collection interval   configurable (default 15s)

Known Considerations

Customer deployments run under gVisor (runtimeClassName: gvisor). gVisor sandboxes container processes from the host kernel, which can affect cAdvisor’s ability to read cgroup metrics. The kubelet’s CRI stats provider (used by /metrics/resource) works independently of cAdvisor and should report correctly for gVisor pods. This needs to be verified on a staging cluster before shipping.
  • /metrics/resource is the officially recommended endpoint for CPU/memory. Future-proof.
  • /stats/summary has been “planned for deprecation” since 2018 with no concrete action (kubernetes#106080). We use it only for network stats (which /metrics/resource doesn’t provide). If it’s ever actually deprecated, we can fall back to Cilium Hubble for all network data.