Overview

soonTM = I don't have a good name yet. soonTM is a Go DaemonSet that collects per-deployment CPU, memory, and network egress metrics from Kubernetes nodes and writes them to ClickHouse for billing. It also tracks deployment lifecycle events (start, stop, scale) with millisecond-precise timestamps. soonTM runs once per node, on every node that hosts customer deployments (the untrusted nodepool).

Architecture

1. Collect

soonTM scrapes two kubelet endpoints on the local node at a configurable interval (default 15s) and watches pod events via a K8s informer. It also queries Cilium Hubble for public vs internal egress classification.
Source          What                                  Endpoint
kubelet         CPU + memory                          /metrics/resource
kubelet         Network tx/rx bytes                   /stats/summary
K8s API         Lifecycle events (start/stop/scale)   Pod informer (watch)
Cilium Hubble   Public egress classification          gRPC via hubble.relay
2. Buffer to disk

Every sample and lifecycle event is written to a disk WAL (EBS volume) before anything else. Billing data is never held only in memory.
3. Drain to ClickHouse

A background loop reads completed WAL segments and batch-inserts them into ClickHouse. On success, the segment is deleted. On failure, it stays on disk for retry.
4. S3 overflow

If the disk fills up, the oldest segments are uploaded to S3. Once ClickHouse recovers, re-ingest from S3:
INSERT INTO default.container_resources_raw_v1
SELECT * FROM s3('s3://bucket/metering-wal/**/*.ndjson', 'JSONEachRow')

What soonTM Collects

Resource Usage Samples (configurable interval, default 15s)

For each krane-managed pod on the node, soonTM records:
Metric                     Source              Description
cpu_millicores             /metrics/resource   Actual CPU usage rate (computed from cumulative nanosecond counter delta)
memory_working_set_bytes   /metrics/resource   Actual memory working set
cpu_request_millicores     Pod spec            Requested CPU (scheduling guarantee)
cpu_limit_millicores       Pod spec            CPU limit (hard cap)
memory_request_bytes       Pod spec            Requested memory
memory_limit_bytes         Pod spec            Memory limit
network_tx_bytes           /stats/summary      Total egress bytes since last sample (delta)
network_tx_bytes_public    Cilium Hubble       Public-only egress (non-RFC1918 destinations)
Each sample is tagged with workspace_id, project_id, app_id, environment_id, deployment_id, instance_id (pod name), region, and platform.

Lifecycle Events (real-time)

soonTM watches for pod state changes via Kubernetes informers and emits events with millisecond-precise timestamps:
Event     When                               Why it matters
started   Pod appears and is running         Billing window begins
stopped   Pod is removed                     Billing window ends
scaled    ReplicaSet replica count changes   Allocated billing changes mid-period
Each lifecycle event records the deployment’s resource allocation at that moment (replicas, cpu_limit, memory_limit).

Why Two Data Streams

Usage samples and lifecycle events answer different questions:
  • Usage samples → “how much CPU/memory did this pod actually consume in this collection interval?”
  • Lifecycle events → “exactly when did this deployment start, stop, or change its allocation?”

Why usage samples alone aren’t enough

Usage samples arrive every collection interval. But deployments don’t start and stop on interval boundaries:
Timeline:
  :00.000  ─── nothing ───
  :00.100  ← pod starts (no sample knows about this)
  :15.000  ← first usage sample arrives
  :30.000  ← second sample
  :45.000  ← last sample
  :52.300  ← pod stops (no sample captures this either)
Without lifecycle events, those first 14.9 seconds and last 7.3 seconds are invisible. For both billing models that’s unacceptable — we’d be rounding to 15s boundaries when we have the data to be ms-precise.

What lifecycle events enable

Usage samples give us the actual consumption rate. Lifecycle events give us the exact billing window. We know the pod started at :00.100 and stopped at :52.300, so we prorate the first and last intervals to the millisecond instead of snapping to the nearest 15s sample.

How allocated billing works

Each lifecycle event marks a change in what’s reserved. Between two events, the allocation is constant:
Timeline for deployment X:

  14:00:00.100  started  │ 2 replicas × 500m CPU × 256Mi mem

  14:32:17.483  scaled   │ 4 replicas × 500m CPU × 256Mi mem

  15:07:44.917  stopped  │ 0

Billing:
  Interval 1: 14:00:00.100 → 14:32:17.483 = 1,937,383ms
    allocated_cpu = 2 × 500m × 1,937,383ms = 1,937,383,000 millicores·ms

  Interval 2: 14:32:17.483 → 15:07:44.917 = 2,127,434ms
    allocated_cpu = 4 × 500m × 2,127,434ms = 4,254,868,000 millicores·ms

  Total: 6,192,251,000 millicores·ms = ~1.72 CPU-hours
The billing service fetches lifecycle events from ClickHouse, walks them chronologically per deployment, and computes replicas × limit × duration for each interval. This is done in Go, not SQL — ClickHouse stores the events, the billing service does the math.
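
The chronological walk can be sketched in Go. This is a minimal illustration of the arithmetic, not soonTM's actual code; the lifecycleEvent struct and its field names are assumptions standing in for rows of deployment_lifecycle_events_v1:

```go
package main

import "fmt"

// lifecycleEvent mirrors one row of deployment_lifecycle_events_v1.
// Field names are illustrative, not soonTM's actual structs.
type lifecycleEvent struct {
	timeMs        int64 // unix milliseconds
	replicas      int64 // replica count from this event onward
	cpuLimitMilli int64 // per-replica CPU limit in millicores
}

// allocatedCPU walks events chronologically and sums
// replicas × cpu_limit × duration for each interval, in millicores·ms.
func allocatedCPU(events []lifecycleEvent) int64 {
	var total int64
	for i := 0; i+1 < len(events); i++ {
		durMs := events[i+1].timeMs - events[i].timeMs
		total += events[i].replicas * events[i].cpuLimitMilli * durMs
	}
	return total
}

func main() {
	// Deployment X's timeline, expressed as ms offsets from 14:00:00.000.
	events := []lifecycleEvent{
		{timeMs: 100, replicas: 2, cpuLimitMilli: 500},       // started 14:00:00.100
		{timeMs: 1_937_483, replicas: 4, cpuLimitMilli: 500}, // scaled  14:32:17.483
		{timeMs: 4_064_917, replicas: 0, cpuLimitMilli: 500}, // stopped 15:07:44.917
	}
	total := allocatedCPU(events)
	fmt.Printf("%d millicores·ms = ~%.2f CPU-hours\n", total, float64(total)/1000/3_600_000)
	// prints: 6192251000 millicores·ms = ~1.72 CPU-hours
}
```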

How CPU is Measured

The kernel tracks CPU as a cumulative nanosecond counter — it only ever goes up. The kubelet reads this via cAdvisor and exposes it through its API. We don’t read cgroups directly. Every collection interval, soonTM grabs the counter from the kubelet. With two consecutive readings we compute the actual usage:
Example:
  t0 (00:00:00): counter = 5,000,000,000 ns
  t1 (00:00:15): counter = 5,750,000,000 ns

  delta = 750,000,000 ns consumed in 15s
  rate  = 750,000,000 / 15,000,000,000 = 0.05 cores = 50 millicores
This is not an instantaneous snapshot — it’s exactly how much CPU was consumed between two readings. No spikes are missed, no idle time is overcounted. The kernel counted every nanosecond.
For billing, we store each of these computed rates as a sample. The total CPU consumed over a billing period is the sum of all samples — not an average. Each sample represents actual usage over its collection interval.

Edge Windows: Start and Stop

Computing CPU rate requires two consecutive readings. This creates a blind spot at pod start (no previous reading) and pod stop (no next reading). This is a physical limitation — every metrics tool has it.
We do an immediate kubelet read when the pod informer fires AddFunc:
:00.100  pod starts → informer AddFunc fires
:00.120  soonTM immediately reads kubelet → counter = X (reading #1)
:15.000  regular tick → counter = Y (reading #2)
         → cpu for :00.120 → :15.000 = (Y - X) / 14.88s ✓
The blind spot shrinks from ~15s to milliseconds (however fast we can hit the kubelet API after the informer event).
The worst-case gap is one collection interval billed at the allocated rate instead of actual usage. For a pod running hours or days, this is negligible. The allocated rate is also the ceiling — the customer is never charged more than what they reserved.
Even after a container dies, cAdvisor retains its stats in memory for up to 2 minutes (--storage_duration), so the “race to read on Terminating” has a decent safety margin. The kubelet’s container GC used to be tunable via --minimum-container-ttl-duration, but that flag has been removed.

How Network Egress is Split

Total egress comes from the kubelet Summary API (txBytes counter delta). To split internal vs public:
1. Query Hubble

soonTM queries Cilium Hubble (gRPC API via hubble.relay) for each pod’s outbound network flows.
2. Classify by destination

Flows to RFC1918 destinations (10.0.0.0/8, 172.16.0.0/12, 192.168.0.0/16) are internal. Everything else is public.
3. Store both

network_tx_bytes_public = sum of tx bytes to non-RFC1918 destinations. Internal egress = network_tx_bytes - network_tx_bytes_public (computed at query time).
Only public egress is typically billed. Internal egress (pod-to-pod, pod-to-service) is free.

Durability: Disk WAL + S3 Overflow

Billing data cannot be dropped. Unlike analytics (where losing a few events is acceptable), missing billing data means lost revenue or overcharging.
soonTM uses a write-ahead log (WAL) on a dedicated EBS volume:
1. Write to disk first

Every metric sample and lifecycle event is written to a segment file on disk before anything else.
2. Drain to ClickHouse

A background drain loop reads completed segments and batch-inserts them into ClickHouse. On success, the segment file is deleted.
3. Retry on failure

If ClickHouse is down, the segment stays on disk and is retried on the next loop. Data is never lost.
4. S3 overflow

If the disk fills up (EBS volume approaching capacity), the oldest segments are uploaded to S3 and deleted locally.
5. Re-ingest from S3

Once ClickHouse recovers, data from S3 can be re-ingested with zero custom code:
INSERT INTO default.container_resources_raw_v1
SELECT * FROM s3(
    's3://bucket/metering-wal/resources/**/*.ndjson',
    'JSONEachRow'
)
Segment files use NDJSON format (one JSON object per line) so ClickHouse can read them directly from S3 without any transformation.
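
A minimal Go sketch of the drain loop's retry contract, with a fake inserter standing in for the ClickHouse client. All names here are illustrative, not soonTM's actual types:

```go
package main

import (
	"errors"
	"fmt"
)

// segment is a completed WAL segment ready to drain.
type segment struct {
	name string
	rows []string // NDJSON lines
}

// inserter abstracts the ClickHouse batch insert.
type inserter interface {
	insertBatch(rows []string) error
}

// drainOnce walks completed segments in order, batch-inserting each.
// Segments that insert successfully are dropped (deleted on disk in the
// real loop); on the first failure, this and all later segments are kept
// for the next loop iteration. Nothing is ever acknowledged before insert.
func drainOnce(segments []segment, ins inserter) (remaining []segment) {
	for i, seg := range segments {
		if err := ins.insertBatch(seg.rows); err != nil {
			return segments[i:] // retained for retry
		}
	}
	return nil
}

// flakyInserter simulates ClickHouse going down after N successful batches.
type flakyInserter struct{ failAfter int }

func (f *flakyInserter) insertBatch(rows []string) error {
	if f.failAfter == 0 {
		return errors.New("clickhouse unavailable")
	}
	f.failAfter--
	return nil
}

func main() {
	segs := []segment{{name: "seg-001"}, {name: "seg-002"}, {name: "seg-003"}}
	left := drainOnce(segs, &flakyInserter{failAfter: 1})
	fmt.Println(len(left)) // prints 2: two segments stay on disk for retry
}
```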

ClickHouse Data Model

Raw Tables

Table                            Rows                                     TTL        Purpose
container_resources_raw_v1       1 per instance per collection interval   90 days    Raw usage samples
deployment_lifecycle_events_v1   1 per start/stop/scale                   365 days   Lifecycle events

Table Schemas

CREATE TABLE default.container_resources_raw_v1
(
    `time` Int64,                        -- unix milliseconds
    `workspace_id` String,
    `project_id` String,
    `app_id` String,
    `environment_id` String,
    `deployment_id` String,
    `instance_id` String,                -- pod name
    `region` LowCardinality(String),
    `platform` LowCardinality(String),   -- "aws", "gcp", etc.

    -- Actual usage (from kubelet)
    `cpu_millicores` Float64,            -- computed rate from counter delta
    `memory_working_set_bytes` Int64,

    -- Allocated resources (from pod spec)
    `cpu_request_millicores` Int32,
    `cpu_limit_millicores` Int32,
    `memory_request_bytes` Int64,
    `memory_limit_bytes` Int64,

    -- Network egress (deltas since last sample)
    `network_tx_bytes` Int64,            -- total egress
    `network_tx_bytes_public` Int64      -- public-only (non-RFC1918)
)
ENGINE = MergeTree()
ORDER BY (workspace_id, app_id, deployment_id, time)
PARTITION BY toDate(fromUnixTimestamp64Milli(time))
TTL toDateTime(fromUnixTimestamp64Milli(time)) + INTERVAL 90 DAY DELETE
CREATE TABLE default.deployment_lifecycle_events_v1
(
    `time` Int64,                        -- unix milliseconds (ms-precise)
    `workspace_id` String,
    `project_id` String,
    `app_id` String,
    `environment_id` String,
    `deployment_id` String,
    `region` LowCardinality(String),
    `platform` LowCardinality(String),

    `event` LowCardinality(String),      -- "started", "stopped", "scaled"
    `replicas` Int32,                    -- replica count at this moment
    `cpu_limit_millicores` Int32,        -- per-replica CPU limit
    `memory_limit_bytes` Int64           -- per-replica memory limit
)
ENGINE = MergeTree()
ORDER BY (workspace_id, app_id, deployment_id, time)
PARTITION BY toDate(fromUnixTimestamp64Milli(time))
TTL toDateTime(fromUnixTimestamp64Milli(time)) + INTERVAL 365 DAY DELETE
All four aggregation tables share the same column structure, differing only in time granularity and TTL:
-- Same structure for per_minute (30d TTL), per_hour (90d TTL), per_day (365d TTL), per_month (no TTL)
CREATE TABLE default.container_resources_per_{minute,hour,day,month}_v1
(
    `time` DateTime,                                            -- or Date for day/month
    `workspace_id` String,
    `project_id` String,
    `app_id` String,
    `environment_id` String,
    `deployment_id` String,

    `cpu_millicores_sum` SimpleAggregateFunction(sum, Float64),
    `memory_bytes_max` SimpleAggregateFunction(max, Int64),
    `memory_bytes_sum` SimpleAggregateFunction(sum, Float64),
    `cpu_limit_millicores_max` SimpleAggregateFunction(max, Int32),
    `memory_limit_bytes_max` SimpleAggregateFunction(max, Int64),
    `network_tx_bytes_sum` SimpleAggregateFunction(sum, Int64),
    `network_tx_bytes_public_sum` SimpleAggregateFunction(sum, Int64),
    `sample_count` SimpleAggregateFunction(sum, Int64)
)
ENGINE = AggregatingMergeTree()
ORDER BY (workspace_id, app_id, deployment_id, time)
Each level is populated by a materialized view that aggregates from the level below (raw → minute → hour → day → month).

Materialized View Aggregation Chain

container_resources_raw_v1
       │ (MV: group by minute, full hierarchy)

container_resources_per_minute_v1   ← 30 day TTL
       │ (MV: group by hour)

container_resources_per_hour_v1     ← 90 day TTL
       │ (MV: group by day)

container_resources_per_day_v1      ← 365 day TTL
       │ (MV: group by month)

container_resources_per_month_v1    ← no TTL
All aggregation tables preserve the full hierarchy: workspace_id, project_id, app_id, environment_id, deployment_id. This enables queries at any level — per-deployment, per-app, or per-workspace.
These MVs exist for dashboard performance — pre-aggregating so time-series graphs don’t scan millions of raw rows. Billing uses the raw tables + lifecycle events directly.

What Each Aggregation Level Stores

Column                        Meaning                     How to use
cpu_millicores_sum            Sum of all CPU samples      / sample_count = avg CPU
memory_bytes_max              Peak memory in the window   For peak-based billing
memory_bytes_sum              Sum of all memory samples   / sample_count = avg memory
cpu_limit_millicores_max      Max allocated CPU           For allocated billing
memory_limit_bytes_max        Max allocated memory        For allocated billing
network_tx_bytes_sum          Total egress bytes          For egress billing
network_tx_bytes_public_sum   Public egress bytes         For public egress billing
sample_count                  Number of samples           For computing averages

Billing Models Supported

Usage-based billing: bill for what was actually consumed. The billing service:
  1. Fetches lifecycle events to get the exact billing window (ms-precise start/stop)
  2. Queries raw samples within that window for actual CPU/memory/egress consumed
  3. Prorates the first and last intervals to the millisecond using the started/stopped event timestamps
  4. For edge windows where no CPU sample exists, bills at the allocated rate (cpu_limit)
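
Step 3's proration can be sketched as a pure function in Go. This is a hypothetical helper illustrating the overlap math; soonTM's billing service may structure it differently:

```go
package main

import "fmt"

// prorate returns the fraction of a sample's interval [sampleStart, sampleEnd)
// that overlaps the billing window [windowStart, windowEnd), all in unix ms.
// Edge samples are thereby billed to the millisecond instead of a full interval.
func prorate(sampleStart, sampleEnd, windowStart, windowEnd int64) float64 {
	lo, hi := sampleStart, sampleEnd
	if windowStart > lo {
		lo = windowStart
	}
	if windowEnd < hi {
		hi = windowEnd
	}
	if hi <= lo {
		return 0 // sample lies entirely outside the billing window
	}
	return float64(hi-lo) / float64(sampleEnd-sampleStart)
}

func main() {
	// The pod stopped at :52.300, so the sample interval :45.000 → :60.000
	// is only billed for 7.3s of its 15s.
	fmt.Printf("%.4f\n", prorate(45_000, 60_000, 100, 52_300)) // prints 0.4867
}
```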

Deployment

soonTM runs as a Kubernetes DaemonSet on all untrusted nodes (the same nodes that run customer deployments).
Resource              Value
CPU request           20m
CPU limit             50m
Memory request        32Mi
Memory limit          64Mi
Disk (EBS)            5-10Gi gp3
Collection interval   configurable (default 15s)

Known Considerations

Customer deployments run under gVisor (runtimeClassName: gvisor). gVisor sandboxes container processes from the host kernel, which can affect cAdvisor’s ability to read cgroup metrics. The kubelet’s CRI stats provider (used by /metrics/resource) works independently of cAdvisor and should report correctly for gVisor pods. This needs to be verified on a staging cluster before shipping.
  • /metrics/resource is the officially recommended endpoint for CPU/memory. Future-proof.
  • /stats/summary has been “planned for deprecation” since 2018 with no concrete action (kubernetes#106080). We use it only for network stats (which /metrics/resource doesn’t provide). If it’s ever actually deprecated, we can fall back to Cilium Hubble for all network data.