Sentinel runs as a Kubernetes Deployment managed by Krane. For details on how Krane manages sentinel state, see the Krane documentation.
When sentinels are created
Sentinels are created as part of the deployment workflow in the control plane worker (svc/ctrl/worker/deploy/deploy_handler.go). During the deploying phase, the ensureSentinels step checks whether each target region already has a sentinel for the environment. If a region has no sentinel, the workflow inserts one into the database and writes a deployment_changes outbox entry in the same transaction.
Once the outbox entry exists, Krane picks it up via the WatchDeploymentChanges stream and applies the corresponding Kubernetes resources.
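The region check in ensureSentinels boils down to set difference: target regions minus regions that already have a sentinel. A minimal sketch of that idea (the function name and types here are illustrative, not the actual control-plane code in deploy_handler.go):

```go
package main

import "fmt"

// missingSentinelRegions returns the target regions that do not yet have a
// sentinel for the environment; the deploy workflow creates one per result,
// together with its deployment_changes outbox entry, in one transaction.
// (Illustrative helper, not the real ensureSentinels implementation.)
func missingSentinelRegions(targets []string, existing map[string]bool) []string {
	var missing []string
	for _, region := range targets {
		if !existing[region] {
			missing = append(missing, region)
		}
	}
	return missing
}

func main() {
	existing := map[string]bool{"us-east-1": true}
	targets := []string{"us-east-1", "eu-west-1", "ap-south-1"}
	fmt.Println(missingSentinelRegions(targets, existing)) // [eu-west-1 ap-south-1]
}
```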
Convergence tracking
After creating a sentinel, the deploy workflow calls SentinelService.Deploy() (svc/ctrl/worker/sentinel/) and blocks until the sentinel has fully converged in Kubernetes. This ensures traffic can route to the sentinel before the deployment proceeds to domain assignment.
SentinelService is a Restate virtual object keyed by sentinel ID, so calls for the same sentinel serialise automatically. A Deploy call:
- Reads the current sentinel row and merges the request fields over it (zero values mean “keep current”). If nothing changed and the sentinel is already healthy on the desired image, the call returns READY immediately.
- Writes the new config plus a deployment_changes outbox entry in a single transaction and sets deploy_status = progressing. Krane picks up the outbox entry and applies the update to Kubernetes.
- Creates a Restate awakeable and persists its ID under the notify_ready_awakeable state key, then suspends until either the awakeable resolves or a 10-minute timeout fires.
- The awakeable is resolved by ReportSentinelStatus on the control-plane cluster service: when Krane reports a sentinel whose running_image matches image and whose health is healthy, the RPC calls SentinelService.NotifyReady (a shared handler on the same virtual object), which resolves the stored awakeable ID and unblocks Deploy.
- On resolve, deploy_status is set to ready and the call returns. On timeout, deploy_status is set to failed. The single-sentinel path does not self-rollback; failed sentinels stay on whatever state Kubernetes ended up in, and fleet-level rollback is the operator’s call via SentinelRolloutService.RollbackAll.
On every Deployment status change, Krane reports the following fields, which the control plane writes into the sentinels row so that anything reasoning about rollout progress can read them from the DB:
| Field | Purpose |
|---|---|
| available_replicas | Pods ready for MinReadySeconds |
| updated_replicas | Pods running the current pod template spec |
| ready_replicas | Pods passing readiness probes |
| observed_generation | Last generation processed by K8s |
The sentinel’s deploy_status column is an enum with four values: idle, progressing, ready, failed. Fleet rollouts have their own lifecycle state (including rolling_back) which is stored in Restate K/V, not in the sentinel row — see the next section.
Fleet-wide image rollouts
Changing the sentinel image for a single sentinel is a SentinelService.Deploy call. Rolling the image across the whole fleet is a separate service: SentinelRolloutService (svc/ctrl/worker/sentinel/rollout_*.go). The deploy workflow never does this implicitly — it only creates sentinels for regions that don’t have one yet and never auto-upgrades existing sentinels. Fleet rollouts are initiated explicitly (e.g. operator tooling).
SentinelRolloutService is a Restate virtual object keyed by the literal string singleton, which serialises all rollout operations globally. Its state lives in the Restate K/V store under the rollout key and tracks the full lifecycle so Resume, Cancel, and RollbackAll can pick up where the previous call left off.
Rollout lifecycle
Rollout state is one of:
| State | Meaning |
|---|---|
| idle | No rollout in flight (also the effective state before the first Rollout call). |
| in_progress | A rollout is executing waves. |
| paused | A wave had at least one failing sentinel. Waiting for operator to call Resume or RollbackAll. |
| rolling_back | RollbackAll is reverting succeeded sentinels to their previous image. |
| cancelled | Operator called Cancel, or a rollback finished. |
| completed | All waves finished with no failures. |
Waves
Rollout(image, wavePercentages?, slackWebhookUrl?) starts a rollout:
- Lists all running sentinels (paged) and their current image. Sentinels already on the target image are filtered out; the rest are captured into PreviousImages so a later rollback knows what to revert to.
- Splits the remaining sentinel IDs into waves by cumulative percentage. Defaults to [1, 5, 25, 50, 100]; e.g. with 100 sentinels, waves of [1, 4, 20, 25, 50]. Callers can override via wave_percentages (computeWaves in rollout_state.go).
- Persists the full rolloutState (image, waves, previous images, Slack webhook, counters) into Restate state and starts executing waves.
Each wave fans out SentinelService.Deploy calls via RequestFuture and then collects all responses:
- If every sentinel in the wave reports READY, the wave is recorded in SucceededIDs and the next wave starts.
- If any sentinel fails or errors, the failed IDs are recorded in FailedIDs, the rollout transitions to paused, and the call returns. Sentinels that succeeded within the paused wave stay in SucceededIDs.
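The fan-out/collect step can be modeled with goroutines and a WaitGroup. In the real service the fan-out is Restate's RequestFuture mechanism; deployOne below is a stand-in for the per-sentinel Deploy call:

```go
package main

import (
	"fmt"
	"sort"
	"sync"
)

// executeWave fans out one deploy call per sentinel ID and partitions the
// results: all READY means the wave advances, any failure pauses the
// rollout. Results are sorted only to make the output deterministic.
func executeWave(ids []string, deployOne func(id string) bool) (succeeded, failed []string) {
	var mu sync.Mutex
	var wg sync.WaitGroup
	for _, id := range ids {
		wg.Add(1)
		go func(id string) {
			defer wg.Done()
			ok := deployOne(id)
			mu.Lock()
			defer mu.Unlock()
			if ok {
				succeeded = append(succeeded, id)
			} else {
				failed = append(failed, id)
			}
		}(id)
	}
	wg.Wait()
	sort.Strings(succeeded)
	sort.Strings(failed)
	return succeeded, failed
}

func main() {
	deployOne := func(id string) bool { return id != "sent_b" } // sent_b fails
	ok, bad := executeWave([]string{"sent_a", "sent_b", "sent_c"}, deployOne)
	fmt.Println(ok, bad) // [sent_a sent_c] [sent_b]
}
```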
Resume, Cancel, RollbackAll
Operator handlers on the paused rollout:
Resume — only valid from paused. Advances CurrentWave by one (skipping the wave that failed) and re-enters executeWaves. Sentinels that failed in the skipped wave are not retried; they stay in FailedIDs.
Cancel — valid from in_progress or paused. Flips the state to cancelled. Succeeded sentinels keep the new image; failed ones stay where they are. This is the “live with it” exit.
RollbackAll — valid from paused or cancelled. Fans out SentinelService.Deploy back to the per-sentinel entry in PreviousImages for everything in SucceededIDs. Failed sentinels are not touched; they never made it to the new image in the first place. Returns the count of sentinels that reverted successfully, then transitions to cancelled.
Re-entrancy: Rollout is rejected while a rollout is in any non-terminal state (i.e. anything that isn’t idle, completed, or cancelled). This check is in Rollout itself rather than relying solely on the virtual-object lock, because clients can send multiple Rollout calls to the singleton object.
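The validity rules for the three handlers plus the re-entrancy check can be written as a single table of permitted states. This is an illustrative encoding of the rules stated above, using the state names from the lifecycle table, not the actual rollout code:

```go
package main

import "fmt"

// allowedStates maps each operator-facing handler to the rollout states it
// accepts; any other state is rejected. "Rollout" is only allowed from the
// terminal-or-idle states, which is the re-entrancy guard.
var allowedStates = map[string][]string{
	"Rollout":     {"idle", "completed", "cancelled"},
	"Resume":      {"paused"},
	"Cancel":      {"in_progress", "paused"},
	"RollbackAll": {"paused", "cancelled"},
}

func permitted(op, state string) bool {
	for _, s := range allowedStates[op] {
		if s == state {
			return true
		}
	}
	return false
}

func main() {
	fmt.Println(permitted("Rollout", "in_progress")) // false: a rollout is already in flight
	fmt.Println(permitted("RollbackAll", "paused"))  // true
}
```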
Slack notifications
If slack_webhook_url is provided on the initial Rollout call, the service posts progress updates at each phase transition (rollout started, wave started, wave completed, rollout paused, rollout resumed, rollout completed, rollback started, rollback completed). Notification failures are logged but do not fail the rollout.
Where sentinels run
All sentinel pods run in a dedicated sentinel Kubernetes namespace, separate from customer workloads and other Unkey services. This namespace contains:
- Sentinel Deployments (one per environment per region)
- Services for routing traffic to sentinel pods
- Gossip headless Services and CiliumNetworkPolicies for cache invalidation
- Secrets for database, ClickHouse, and Redis credentials
Sentinel pods are scheduled onto dedicated sentinel-class nodes using a toleration for the node-class=sentinel:NoSchedule taint. This keeps sentinel workloads isolated from customer instance pods at the node level, preventing resource contention between the proxy layer and the workloads it routes to.
Kubernetes resources
Each sentinel consists of five resources, all created via server-side apply:
| Resource | Scope | Purpose |
|---|---|---|
| Deployment | Per sentinel | Sentinel pods with rolling update strategy |
| ClusterIP Service | Per sentinel | Routes traffic to sentinel pods on port 8040 |
| PodDisruptionBudget | Per sentinel | Keeps at least one pod available during disruptions |
| Headless Service | Per environment | Gossip peer discovery (resolves to pod IPs on port 7946) |
| CiliumNetworkPolicy | Per environment | Allows gossip traffic between sentinel pods |
The environment-scoped resources (headless Service, CiliumNetworkPolicy) are shared across all sentinels in an environment and are not owned by any single Deployment.
Deployment spec
| Setting | Value |
|---|---|
| Strategy | RollingUpdate |
| Max Unavailable | 0 |
| Max Surge | 1 |
| Min Ready Seconds | 5 |
| Topology Spread | maxSkew=1 across availability zones, ScheduleAnyway |
| Ports | 8040 (HTTP), 7946 (gossip TCP+UDP) |
Probes
| Probe | Value |
|---|---|
| Liveness | GET /_unkey/internal/health on port 8040 |
| Readiness | GET /_unkey/internal/health on port 8040 |
Two consecutive failures remove the pod from the Service endpoints, stopping traffic from reaching it.
Both probes hit the same trivial endpoint that returns 200 unconditionally without checking dependencies. A sentinel with a dead database connection or unavailable Redis reports as healthy.
This needs to be migrated to proper liveness and readiness checks like our other services: liveness must verify the process is alive, readiness must verify the sentinel can actually serve traffic (database reachable, middleware engine initialized). Tracked in #5367.
Labels
All sentinel resources carry these labels:
| Label | Value |
|---|---|
| app.kubernetes.io/managed-by | krane |
| app.kubernetes.io/component | sentinel |
| unkey.com/workspace.id | Workspace ID |
| unkey.com/project.id | Project ID |
| unkey.com/app.id | App ID |
| unkey.com/environment.id | Environment ID |
| unkey.com/sentinel.id | Sentinel ID |
Cache prewarming
On startup, the router service loads all deployments with status READY in the environment and prefetches their instances. This avoids cold-start latency spikes on the first requests after a pod restart.
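The prewarm pass is a filter-and-fetch over deployments. A sketch of the shape (the deployment struct, prewarm function, and fetchInstances callback are illustrative; the real router loads this from the database on startup):

```go
package main

import "fmt"

type deployment struct {
	ID     string
	Status string
}

// prewarm builds the instance cache for every READY deployment in the
// environment, calling fetchInstances once per deployment so the first
// requests after a pod restart hit a warm cache.
func prewarm(deployments []deployment, fetchInstances func(id string) []string) map[string][]string {
	cache := make(map[string][]string)
	for _, d := range deployments {
		if d.Status != "READY" {
			continue // only READY deployments can receive traffic
		}
		cache[d.ID] = fetchInstances(d.ID)
	}
	return cache
}

func main() {
	deps := []deployment{{"dep_1", "READY"}, {"dep_2", "BUILDING"}, {"dep_3", "READY"}}
	cache := prewarm(deps, func(id string) []string { return []string{id + "-inst-0"} })
	fmt.Println(len(cache)) // 2
}
```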