Overview
soonTM = I don't have a good name yet. soonTM is a Go DaemonSet that collects per-deployment CPU, memory, and network egress metrics from Kubernetes nodes and writes them to ClickHouse for billing. It also tracks deployment lifecycle events (start, stop, scale) with millisecond-precise timestamps. soonTM runs as one instance per node, on every node that hosts customer deployments (the untrusted nodepool).
Architecture
Collect
soonTM scrapes two kubelet endpoints on the local node at a configurable interval (default 15s) and watches pod events via a K8s informer. It also queries Cilium Hubble for public vs internal egress classification.
| Source | What | Endpoint |
|---|---|---|
| kubelet | CPU + memory | /metrics/resource |
| kubelet | Network tx/rx bytes | /stats/summary |
| K8s API | Lifecycle events (start/stop/scale) | Pod informer (watch) |
| Cilium Hubble | Public egress classification | gRPC via hubble.relay |
Buffer to disk
Every sample and lifecycle event is written to a disk WAL (EBS volume) before anything else. Billing data is never held only in memory.
Drain to ClickHouse
A background loop reads completed WAL segments and batch-inserts them into ClickHouse. On success, the segment is deleted. On failure, it stays on disk for retry.
What soonTM Collects
Resource Usage Samples (configurable interval, default 15s)
For each krane-managed pod on the node, soonTM records:
| Metric | Source | Description |
|---|---|---|
| cpu_millicores | /metrics/resource | Actual CPU usage rate (computed from cumulative nanosecond counter delta) |
| memory_working_set_bytes | /metrics/resource | Actual memory working set |
| cpu_request_millicores | Pod spec | Requested CPU (scheduling guarantee) |
| cpu_limit_millicores | Pod spec | CPU limit (hard cap) |
| memory_request_bytes | Pod spec | Requested memory |
| memory_limit_bytes | Pod spec | Memory limit |
| network_tx_bytes | /stats/summary | Total egress bytes since last sample (delta) |
| network_tx_bytes_public | Cilium Hubble | Public-only egress (non-RFC1918 destinations) |
Every sample is also tagged with workspace_id, project_id, app_id, environment_id, deployment_id, instance_id (pod name), region, and platform.
Lifecycle Events (real-time)
soonTM watches for pod state changes via Kubernetes informers and emits events with millisecond-precise timestamps:
| Event | When | Why it matters |
|---|---|---|
| started | Pod appears and is running | Billing window begins |
| stopped | Pod is removed | Billing window ends |
| scaled | ReplicaSet replica count changes | Allocated billing changes mid-period |
Why Two Data Streams
Usage samples and lifecycle events answer different questions:
- Usage samples → “how much CPU/memory did this pod actually consume in this collection interval?”
- Lifecycle events → “exactly when did this deployment start, stop, or change its allocation?”
Why usage samples alone aren’t enough
Usage samples arrive every collection interval. But deployments don’t start and stop on interval boundaries.
What lifecycle events enable
- Active billing (usage-based)
- Allocated billing (reservation-based)
Usage samples give us the actual consumption rate. Lifecycle events give us the exact billing window. We know the pod started at :00.100 and stopped at :52.300, so we prorate the first and last intervals to the millisecond instead of snapping to the nearest 15s sample.
How allocated billing works
Each lifecycle event marks a change in what’s reserved. Between two events, the allocation is constant. The billing service fetches lifecycle events from ClickHouse, walks them chronologically per deployment, and computes replicas × limit × duration for each interval. This is done in Go, not SQL — ClickHouse stores the events, the billing service does the math.
How CPU is Measured
The kernel tracks CPU as a cumulative nanosecond counter — it only ever goes up. The kubelet reads this via cAdvisor and exposes it through its API; we don’t read cgroups directly. Every collection interval, soonTM grabs the counter from the kubelet, and with two consecutive readings we compute the actual usage rate. This is not an instantaneous snapshot — it is exactly how much CPU was consumed between the two readings. No spikes are missed, no idle time is overcounted. The kernel counted every nanosecond.
Edge Windows: Start and Stop
Computing CPU rate requires two consecutive readings. This creates a blind spot at pod start (no previous reading) and pod stop (no next reading). This is a physical limitation — every metrics tool has it.
On start
We do an immediate kubelet read when the pod informer fires (AddFunc). The blind spot shrinks from ~15s to milliseconds (however fast we can hit the kubelet API after the informer event).
On stop
The worst-case gap is one collection interval billed at the allocated rate instead of actual usage. For a pod running hours or days, this is negligible. The allocated rate is also the ceiling — the customer is never charged more than what they reserved. Even after a container dies, cAdvisor retains its stats in memory for up to 2 minutes (--storage_duration), so the “race to read on Terminating” has a decent safety margin. The kubelet’s container GC used to be tunable via --minimum-container-ttl-duration but that flag has been removed.
How Network Egress is Split
Total egress comes from the kubelet Summary API (txBytes counter delta). To split internal vs public:
Query Hubble
soonTM queries Cilium Hubble (gRPC API via hubble.relay) for each pod’s outbound network flows.
Classify by destination
Flows to RFC1918 destinations (10.0.0.0/8, 172.16.0.0/12, 192.168.0.0/16) are internal. Everything else is public.
Durability: Disk WAL + S3 Overflow
soonTM uses a write-ahead log (WAL) on a dedicated EBS volume:
Write to disk first
Every metric sample and lifecycle event is written to a segment file on disk before anything else.
Drain to ClickHouse
A background drain loop reads completed segments and batch-inserts them into ClickHouse. On success, the segment file is deleted.
Retry on failure
If ClickHouse is down, the segment stays on disk and is retried on the next loop. Data is never lost.
S3 overflow
If the disk fills up (EBS volume approaching capacity), the oldest segments are uploaded to S3 and deleted locally.
ClickHouse Data Model
Raw Tables
| Table | Rows | TTL | Purpose |
|---|---|---|---|
container_resources_raw_v1 | 1 per instance per collection interval | 90 days | Raw usage samples |
deployment_lifecycle_events_v1 | 1 per start/stop/scale | 365 days | Lifecycle events |
Table Schemas
container_resources_raw_v1
deployment_lifecycle_events_v1
Aggregation tables (per_hour, per_day, per_month)
All three aggregation tables share the same column structure, differing only in time granularity and TTL. Each level is populated by a materialized view that aggregates from the level below (raw → minute → hour → day → month).
Materialized View Aggregation Chain
All aggregation tables preserve the full hierarchy:
workspace_id, project_id, app_id, environment_id, deployment_id. This enables queries at any level — per-deployment, per-app, or per-workspace. These MVs exist for dashboard performance — pre-aggregating so time-series graphs don’t scan millions of raw rows. Billing uses the raw tables + lifecycle events directly.
What Each Aggregation Level Stores
| Column | Meaning | How to use |
|---|---|---|
| cpu_millicores_sum | Sum of all CPU samples | / sample_count = avg CPU |
| memory_bytes_max | Peak memory in the window | For peak-based billing |
| memory_bytes_sum | Sum of all memory samples | / sample_count = avg memory |
| cpu_limit_millicores_max | Max allocated CPU | For allocated billing |
| memory_limit_bytes_max | Max allocated memory | For allocated billing |
| network_tx_bytes_sum | Total egress bytes | For egress billing |
| network_tx_bytes_public_sum | Public egress bytes | For public egress billing |
| sample_count | Number of samples | For computing averages |
Billing Models Supported
- Active (usage-based)
- Allocated (reservation-based)
Active billing charges for what was actually consumed. The billing service:
- Fetches lifecycle events to get the exact billing window (ms-precise start/stop)
- Queries raw samples within that window for actual CPU/memory/egress consumed
- Prorates the first and last intervals to the millisecond using the started/stopped event timestamps
- For edge windows where no CPU sample exists, bills at the allocated rate (cpu_limit)
Deployment
soonTM runs as a Kubernetes DaemonSet on all untrusted nodes (the same nodes that run customer deployments).
| Resource | Value |
|---|---|
| CPU request | 20m |
| CPU limit | 50m |
| Memory request | 32Mi |
| Memory limit | 64Mi |
| Disk (EBS) | 5-10Gi gp3 |
| Collection interval | configurable (default 15s) |
Known Considerations
gVisor Compatibility
Customer deployments run under gVisor (runtimeClassName: gvisor). gVisor sandboxes container processes from the host kernel, which can affect cAdvisor’s ability to read cgroup metrics. The kubelet’s CRI stats provider (used by /metrics/resource) works independently of cAdvisor and should report correctly for gVisor pods. This needs to be verified on a staging cluster before shipping.
Kubelet API Stability
/metrics/resource is the officially recommended endpoint for CPU/memory. Future-proof. /stats/summary has been “planned for deprecation” since 2018 with no concrete action (kubernetes#106080). We use it only for network stats (which /metrics/resource doesn’t provide). If it’s ever actually deprecated, we can fall back to Cilium Hubble for all network data.

