When sentinels are created
Sentinels are created as part of the deployment workflow in the control plane worker (svc/ctrl/worker/deploy/deploy_handler.go). During the deploying phase, the ensureSentinels step checks whether each target region already has a sentinel for the environment. If a region has no sentinel, the workflow inserts one into the database and writes a deployment_changes outbox entry in the same transaction.
Once the outbox entry exists, Krane picks it up via the WatchDeploymentChanges stream and applies the corresponding Kubernetes resources.
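The create-if-missing step can be pictured roughly as follows. This is a minimal in-memory sketch, not the code in deploy_handler.go: the store type, its fields, the outbox entry format, and the region names are hypothetical stand-ins for the real database transaction.

```go
package main

import "fmt"

// sentinelKey identifies a sentinel by environment and region.
type sentinelKey struct{ EnvironmentID, Region string }

// store is a hypothetical in-memory stand-in for the database.
type store struct {
	sentinels map[sentinelKey]bool
	outbox    []string // stand-in for deployment_changes rows
}

// ensureSentinels inserts a sentinel for every target region that lacks one
// and appends the matching outbox entry in the same step, mirroring the
// "same transaction" guarantee. It returns the regions it created.
func (s *store) ensureSentinels(envID string, regions []string) []string {
	var created []string
	for _, region := range regions {
		key := sentinelKey{envID, region}
		if s.sentinels[key] {
			continue // region already has a sentinel; leave it untouched
		}
		s.sentinels[key] = true
		s.outbox = append(s.outbox, fmt.Sprintf("sentinel:%s:%s", envID, region))
		created = append(created, region)
	}
	return created
}

func main() {
	s := &store{sentinels: map[sentinelKey]bool{{"env_1", "us-east-1"}: true}}
	fmt.Println(s.ensureSentinels("env_1", []string{"us-east-1", "eu-west-1"}))
}
```

Running the step a second time with the same regions creates nothing, which is what makes it safe to execute on every deployment.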
Convergence tracking
After creating a sentinel, the deploy workflow calls SentinelService.Deploy() (svc/ctrl/worker/sentinel/) and blocks until the sentinel has fully converged in Kubernetes. This ensures traffic can route to the sentinel before the deployment proceeds to domain assignment.
SentinelService is a Restate virtual object keyed by sentinel ID, so calls for the same sentinel serialise automatically. A Deploy call:
- Reads the current sentinel row and merges the request fields over it (zero values mean “keep current”). If nothing changed and the sentinel is already healthy on the desired image, the call returns READY immediately.
- Writes the new config plus a deployment_changes outbox entry in a single transaction and sets deploy_status = progressing. Krane picks up the outbox entry and applies the update to Kubernetes.
- Creates a Restate awakeable and persists its ID under the notify_ready_awakeable state key, then suspends until either the awakeable resolves or a 10-minute timeout fires.
- The awakeable is resolved by ReportSentinelStatus on the control-plane cluster service: when Krane reports a sentinel whose running_image matches image and whose health is healthy, the RPC calls SentinelService.NotifyReady (a shared handler on the same virtual object), which resolves the stored awakeable ID and unblocks Deploy.
- On resolve, deploy_status is set to ready and the call returns. On timeout, deploy_status is set to failed. The single-sentinel path does not self-rollback; failed sentinels stay on whatever state Kubernetes ended up in, and fleet-level rollback is the operator’s call via SentinelRolloutService.RollbackAll.
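The merge and wait steps above can be sketched in plain Go. This is an illustration under stated assumptions, not the SentinelService code: the sentinelConfig fields are invented for the example, and a channel stands in for the Restate awakeable so the sketch runs without the Restate SDK.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// sentinelConfig holds fields a Deploy request may override.
// The field set is illustrative, not the real schema.
type sentinelConfig struct {
	Image    string
	Replicas int
}

// merge overlays non-zero request fields on the current config;
// zero values mean "keep current".
func merge(current, req sentinelConfig) sentinelConfig {
	out := current
	if req.Image != "" {
		out.Image = req.Image
	}
	if req.Replicas != 0 {
		out.Replicas = req.Replicas
	}
	return out
}

// isReady mirrors the condition under which ReportSentinelStatus notifies:
// the running image matches the desired image and health is healthy.
func isReady(image, runningImage, health string) bool {
	return runningImage == image && health == "healthy"
}

var errTimeout = errors.New("sentinel did not converge before the timeout")

// waitReady blocks until ready is signalled or the timeout fires.
// The channel is a stand-in for the Restate awakeable.
func waitReady(ready <-chan struct{}, timeout time.Duration) error {
	select {
	case <-ready:
		return nil
	case <-time.After(timeout):
		return errTimeout
	}
}

func main() {
	cur := sentinelConfig{Image: "sentinel:v1", Replicas: 3}
	next := merge(cur, sentinelConfig{Image: "sentinel:v2"})
	fmt.Println(next == cur) // false: the image changed, so a deploy is needed

	ready := make(chan struct{})
	go func() {
		if isReady(next.Image, "sentinel:v2", "healthy") {
			close(ready) // analogous to NotifyReady resolving the awakeable
		}
	}()
	fmt.Println(waitReady(ready, time.Second))
}
```

In the real service the wait is durable: the awakeable ID is persisted in Restate state, so the suspended call survives a worker restart, which a plain channel does not.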
The following fields are stored on the sentinels row so anything that needs to reason about rollout progress can do so from the DB:
| Field | Purpose |
|---|---|
| available_replicas | Pods ready for MinReadySeconds |
| updated_replicas | Pods running the current pod template spec |
| ready_replicas | Pods passing readiness probes |
| observed_generation | Last generation processed by K8s |
The deploy_status column is an enum with four values: idle, progressing, ready, failed. Fleet rollouts have their own lifecycle state (including rolling_back), which is stored in Restate K/V rather than in the sentinel row; see the next section.
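As a small illustration of keeping the two state spaces apart, the per-sentinel enum could be modelled like this. The Go type and function names are hypothetical; only the four string values come from the column definition.

```go
package main

import "fmt"

// DeployStatus mirrors the four values of the deploy_status column.
type DeployStatus string

const (
	DeployIdle        DeployStatus = "idle"
	DeployProgressing DeployStatus = "progressing"
	DeployReady       DeployStatus = "ready"
	DeployFailed      DeployStatus = "failed"
)

// ParseDeployStatus rejects anything outside the enum, including
// rollout-only states such as rolling_back.
func ParseDeployStatus(s string) (DeployStatus, error) {
	switch d := DeployStatus(s); d {
	case DeployIdle, DeployProgressing, DeployReady, DeployFailed:
		return d, nil
	default:
		return "", fmt.Errorf("unknown deploy_status %q", s)
	}
}

func main() {
	fmt.Println(ParseDeployStatus("progressing"))
	fmt.Println(ParseDeployStatus("rolling_back"))
}
```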
Fleet-wide image rollouts
Changing the sentinel image for a single sentinel is a SentinelService.Deploy call. Rolling the image across the whole fleet is a separate service: SentinelRolloutService (svc/ctrl/worker/sentinel/rollout_*.go). The deploy workflow never does this implicitly: it only creates sentinels for regions that don’t have one yet and never auto-upgrades existing sentinels. Fleet rollouts are initiated explicitly (e.g. by operator tooling).
SentinelRolloutService is a Restate virtual object keyed by the literal string singleton, which serialises all rollout operations globally. Its state lives in the Restate K/V store under the rollout key and tracks the full lifecycle so Resume, Cancel, and RollbackAll can pick up where the previous call left off.
Rollout lifecycle
Rollout state is one of:
| State | Meaning |
|---|---|
| idle | No rollout in flight (also the effective state before the first Rollout call). |
| in_progress | A rollout is executing waves. |
| paused | A wave had at least one failing sentinel. Waiting for operator to call Resume or RollbackAll. |
| rolling_back | RollbackAll is reverting succeeded sentinels to their previous image. |
| cancelled | Operator called Cancel, or a rollback finished. |
| completed | All waves finished with no failures. |
Waves
Rollout(image, wavePercentages?, slackWebhookUrl?) starts a rollout:
- Lists all running sentinels (paged) and their current image. Sentinels already on the target image are filtered out; the rest are captured into PreviousImages so a later rollback knows what to revert to.
- Splits the remaining sentinel IDs into waves by cumulative percentage. Defaults to [1, 5, 25, 50, 100]; e.g. with 100 sentinels, this yields waves of [1, 4, 20, 25, 50]. Callers can override via wave_percentages (computeWaves in rollout_state.go).
- Persists the full rolloutState (image, waves, previous images, Slack webhook, counters) into Restate state and starts executing waves.
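The filtering and cumulative split can be sketched as below. This is not the computeWaves from rollout_state.go: rounding up and skipping empty waves are assumptions of this sketch, and previousImages and waveSizes are helper names made up for the example.

```go
package main

import "fmt"

// defaultWavePercentages are the cumulative cut-offs quoted above.
var defaultWavePercentages = []int{1, 5, 25, 50, 100}

// previousImages keeps only sentinels that are not yet on the target image,
// mapped to the image a rollback would revert to.
func previousImages(current map[string]string, target string) map[string]string {
	prev := make(map[string]string)
	for id, image := range current {
		if image != target {
			prev[id] = image
		}
	}
	return prev
}

// computeWaves splits ids into waves by cumulative percentage.
// Assumptions: cut-offs round up, and empty waves are skipped.
func computeWaves(ids []string, percentages []int) [][]string {
	if len(percentages) == 0 {
		percentages = defaultWavePercentages
	}
	var waves [][]string
	start := 0
	for _, p := range percentages {
		end := (len(ids)*p + 99) / 100 // ceil(len * p / 100)
		if end > len(ids) {
			end = len(ids)
		}
		if end <= start {
			continue // this cut-off adds no new sentinels
		}
		waves = append(waves, ids[start:end])
		start = end
	}
	return waves
}

// waveSizes reports the size of each wave for a fleet of n sentinels.
func waveSizes(n int, percentages []int) []int {
	ids := make([]string, n)
	for i := range ids {
		ids[i] = fmt.Sprintf("sentinel_%d", i)
	}
	var sizes []int
	for _, w := range computeWaves(ids, percentages) {
		sizes = append(sizes, len(w))
	}
	return sizes
}

func main() {
	fmt.Println(waveSizes(100, nil)) // [1 4 20 25 50]
}
```

Because the percentages are cumulative, each wave contains only the sentinels between the previous cut-off and the current one, which is why the 5% step adds four sentinels rather than five.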
Each wave fans out SentinelService.Deploy calls via RequestFuture and then collects all responses:
- If every sentinel in the wave reports READY, the wave is recorded in SucceededIDs and the next wave starts.
- If any sentinel fails or errors, the failed IDs are recorded in FailedIDs, the rollout transitions to paused, and the call returns. Sentinels that succeeded within the paused wave stay in SucceededIDs.
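The fan-out and collect pattern for a single wave looks roughly like this. Goroutines and a WaitGroup stand in for Restate's RequestFuture here, and deployFunc and runWave are hypothetical names, so treat this as a sketch of the control flow rather than the service's implementation.

```go
package main

import (
	"fmt"
	"sort"
	"sync"
)

// deployFunc stands in for a SentinelService.Deploy call; it returns the
// resulting status ("READY" on success) or an error.
type deployFunc func(sentinelID string) (string, error)

// runWave deploys every sentinel in the wave concurrently, waits for all
// responses, and partitions the IDs into succeeded and failed.
func runWave(wave []string, deploy deployFunc) (succeeded, failed []string) {
	var (
		mu sync.Mutex
		wg sync.WaitGroup
	)
	for _, id := range wave {
		wg.Add(1)
		go func(id string) {
			defer wg.Done()
			status, err := deploy(id)
			mu.Lock()
			defer mu.Unlock()
			if err == nil && status == "READY" {
				succeeded = append(succeeded, id)
			} else {
				failed = append(failed, id)
			}
		}(id)
	}
	wg.Wait()
	sort.Strings(succeeded) // deterministic order for the example
	sort.Strings(failed)
	return succeeded, failed
}

func main() {
	deploy := func(id string) (string, error) {
		if id == "sentinel_b" {
			return "", fmt.Errorf("timed out waiting for %s", id)
		}
		return "READY", nil
	}
	ok, bad := runWave([]string{"sentinel_a", "sentinel_b", "sentinel_c"}, deploy)
	paused := len(bad) > 0 // any failure pauses the rollout
	fmt.Println(ok, bad, paused)
}
```

Note that the wave always waits for every response before deciding, so sentinels that succeeded alongside a failure are still recorded as succeeded.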
Resume, Cancel, RollbackAll
Operator handlers on the paused rollout:
- Resume: only valid from paused. Advances CurrentWave by one (skipping the wave that failed) and re-enters executeWaves. Sentinels that failed in the skipped wave are not retried; they stay in FailedIDs.
- Cancel: valid from in_progress or paused. Flips the state to cancelled. Succeeded sentinels keep the new image; failed ones stay where they are. This is the “live with it” exit.
- RollbackAll: valid from paused or cancelled. Fans out SentinelService.Deploy back to the per-sentinel entry in PreviousImages for everything in SucceededIDs. Failed sentinels are not touched, since they never made it to the new image. Returns the count of sentinels that reverted successfully, then transitions to cancelled.
Rollout is rejected while a rollout is in any non-terminal state (i.e. anything that isn’t idle, completed, or cancelled). This check is in Rollout itself rather than relying solely on the virtual-object lock, because clients can send multiple Rollout calls to the singleton object.
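The state guards described in this section can be summarised as a lookup table. The type, constant, and function names below are hypothetical, and treating an absent state as the empty string is an assumption; the allowed source states per handler are taken from the text above.

```go
package main

import "fmt"

type rolloutState string

const (
	stateIdle        rolloutState = "idle"
	stateInProgress  rolloutState = "in_progress"
	statePaused      rolloutState = "paused"
	stateRollingBack rolloutState = "rolling_back"
	stateCancelled   rolloutState = "cancelled"
	stateCompleted   rolloutState = "completed"
)

// allowedFrom lists, per handler, the states it may be called from.
var allowedFrom = map[string][]rolloutState{
	"Rollout":     {stateIdle, stateCompleted, stateCancelled},
	"Resume":      {statePaused},
	"Cancel":      {stateInProgress, statePaused},
	"RollbackAll": {statePaused, stateCancelled},
}

// canCall reports whether handler is valid in state s.
func canCall(handler string, s rolloutState) bool {
	if s == "" {
		s = stateIdle // assumption: no stored state yet means idle
	}
	for _, allowed := range allowedFrom[handler] {
		if allowed == s {
			return true
		}
	}
	return false
}

func main() {
	fmt.Println(canCall("Rollout", statePaused))        // false
	fmt.Println(canCall("RollbackAll", stateCancelled)) // true
}
```

The virtual-object lock only guarantees that two Rollout calls run one after the other; a guard like this is what makes the second call fail instead of starting a new rollout on top of an unfinished one.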
Slack notifications
If slack_webhook_url is provided on the initial Rollout call, the service posts progress updates at each phase transition (rollout started, wave started, wave completed, rollout paused, rollout resumed, rollout completed, rollback started, rollback completed). Notification failures are logged but do not fail the rollout.
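A best-effort notifier in this spirit might look like the following. The function names and message text are hypothetical; the only external fact relied on is that Slack incoming webhooks accept a JSON body with a text field.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"log"
	"net/http"
	"time"
)

// slackPayload builds the JSON body for a Slack incoming webhook.
func slackPayload(text string) ([]byte, error) {
	return json.Marshal(map[string]string{"text": text})
}

// notify posts text to the webhook and reports whether it was delivered.
// Failures are logged and swallowed so they can never fail a rollout.
func notify(client *http.Client, webhookURL, text string) bool {
	if webhookURL == "" {
		return false // notifications are optional
	}
	body, err := slackPayload(text)
	if err != nil {
		log.Printf("slack notification skipped: %v", err)
		return false
	}
	resp, err := client.Post(webhookURL, "application/json", bytes.NewReader(body))
	if err != nil {
		log.Printf("slack notification failed: %v", err)
		return false
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		log.Printf("slack notification rejected: %s", resp.Status)
		return false
	}
	return true
}

func main() {
	client := &http.Client{Timeout: 5 * time.Second}
	sent := notify(client, "", "rollout started") // no webhook configured
	fmt.Println(sent)
}
```

Returning a bool instead of an error keeps the call sites simple: the rollout code can ignore the result entirely.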
Where sentinels run
All sentinel pods run in a dedicated sentinel Kubernetes namespace, separate from customer workloads and other Unkey services. This namespace contains:
- Sentinel Deployments (one per environment per region)
- Services for routing traffic to sentinel pods
- Secrets for database, ClickHouse, and Redis credentials
Sentinel pods are scheduled onto sentinel node class nodes using a toleration for the node-class=sentinel:NoSchedule taint. This keeps sentinel workloads isolated from customer instance pods at the node level, preventing resource contention between the proxy layer and the workloads it routes to.
Kubernetes resources
Each sentinel consists of three resources, all created via server-side apply:
| Resource | Scope | Purpose |
|---|---|---|
| Deployment | Per sentinel | Sentinel pods with rolling update strategy |
| ClusterIP Service | Per sentinel | Routes traffic to sentinel pods on port 8040 |
| PodDisruptionBudget | Per sentinel | Keeps at least one pod available during disruptions |
Deployment spec
| Setting | Value |
|---|---|
| Strategy | RollingUpdate |
| Max Unavailable | 0 |
| Max Surge | 1 |
| Min Ready Seconds | 5 |
| Topology Spread | maxSkew=1 across availability zones, ScheduleAnyway |
| Ports | 8040 (HTTP) |
Probes
| Probe | Value |
|---|---|
| Liveness | GET /_unkey/internal/health on port 8040 |
| Readiness | GET /_unkey/internal/health on port 8040 |
Both probes hit the same trivial endpoint that returns 200 unconditionally without checking
dependencies. A sentinel with a dead database connection or unavailable Redis reports as healthy.
This needs to be migrated to proper liveness and readiness checks like our other services:
liveness must verify the process is alive, readiness must verify sentinel can actually serve
traffic (database reachable, middleware engine initialized). Tracked in
#5367.
Labels
All sentinel resources carry these labels:
| Label | Value |
|---|---|
| app.kubernetes.io/managed-by | krane |
| app.kubernetes.io/component | sentinel |
| unkey.com/workspace.id | Workspace ID |
| unkey.com/project.id | Project ID |
| unkey.com/app.id | App ID |
| unkey.com/environment.id | Environment ID |
| unkey.com/sentinel.id | Sentinel ID |
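For reference, building the label set and an equality-based selector from it could look like this. The sentinelIDs struct, the function names, and the example ID values are hypothetical; the label keys and the two fixed values come from the table above.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// sentinelIDs groups the identifiers that end up in labels.
type sentinelIDs struct {
	Workspace, Project, App, Environment, Sentinel string
}

// sentinelLabels returns the labels carried by every sentinel resource.
func sentinelLabels(ids sentinelIDs) map[string]string {
	return map[string]string{
		"app.kubernetes.io/managed-by": "krane",
		"app.kubernetes.io/component":  "sentinel",
		"unkey.com/workspace.id":       ids.Workspace,
		"unkey.com/project.id":         ids.Project,
		"unkey.com/app.id":             ids.App,
		"unkey.com/environment.id":     ids.Environment,
		"unkey.com/sentinel.id":        ids.Sentinel,
	}
}

// selector renders labels as a comma-separated key=value selector,
// with keys sorted so the output is stable.
func selector(labels map[string]string) string {
	keys := make([]string, 0, len(labels))
	for k := range labels {
		keys = append(keys, k)
	}
	sort.Strings(keys)
	parts := make([]string, 0, len(keys))
	for _, k := range keys {
		parts = append(parts, k+"="+labels[k])
	}
	return strings.Join(parts, ",")
}

func main() {
	labels := sentinelLabels(sentinelIDs{"ws_1", "proj_1", "app_1", "env_1", "sn_1"})
	fmt.Println(selector(map[string]string{
		"app.kubernetes.io/component": labels["app.kubernetes.io/component"],
		"unkey.com/sentinel.id":       labels["unkey.com/sentinel.id"],
	}))
}
```

A selector string like the one printed here can be passed to kubectl with the -l flag to find every resource that belongs to one sentinel.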
Cache prewarming
On startup, the router service loads all deployments with status READY in the environment and prefetches their instances. This avoids cold-start latency spikes on the first requests after a pod restart.
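The prewarming loop can be pictured as below. Everything here is a hypothetical stand-in: the source interface, the deployment struct, the map-based cache, and the example addresses are invented to show the shape of the loop, not the router service's actual types.

```go
package main

import "fmt"

// deployment is a trimmed-down, hypothetical view of a deployment row.
type deployment struct {
	ID     string
	Status string
}

// source is a hypothetical read interface over the database.
type source interface {
	Deployments(environmentID string) []deployment
	Instances(deploymentID string) []string
}

// prewarm fills cache with the instances of every READY deployment in the
// environment and returns how many deployments were cached.
func prewarm(src source, environmentID string, cache map[string][]string) int {
	warmed := 0
	for _, d := range src.Deployments(environmentID) {
		if d.Status != "READY" {
			continue // only deployments that can take traffic are worth caching
		}
		cache[d.ID] = src.Instances(d.ID)
		warmed++
	}
	return warmed
}

// fakeSource is an in-memory source used only for this example.
type fakeSource struct {
	deployments []deployment
	instances   map[string][]string
}

func (f fakeSource) Deployments(string) []deployment { return f.deployments }
func (f fakeSource) Instances(id string) []string    { return f.instances[id] }

func main() {
	src := fakeSource{
		deployments: []deployment{{"d_1", "READY"}, {"d_2", "FAILED"}},
		instances:   map[string][]string{"d_1": {"10.0.0.1:8080"}},
	}
	cache := map[string][]string{}
	fmt.Println(prewarm(src, "env_1", cache), cache)
}
```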
