The control plane deployment service creates deployment records and delegates execution to Restate workflows.

Virtual object keying

Each Restate virtual object (VO) uses the narrowest key that gives it the serialization it needs:
  • DeployService is keyed by deployment_id. Each deployment runs as its own isolated workflow, so multiple deployments in the same environment can build in parallel.
  • RoutingService is keyed by env_id. All routing changes for an environment (frontline route assignment + the live-deployment swap) serialize here, so concurrent deploys/rollbacks/promotes for the same env can never race on apps.current_deployment_id.
  • BuildSlotService is keyed by workspace_id. Caps how many deployments build at once across the workspace and prioritises production over preview waiters.
  • DeploymentService (delayed desired-state transitions) is keyed by deployment_id.
Rollback and Promote are themselves workflows, but they don’t need their own env-keyed gate: the actual mutation goes through RoutingService.SwapLiveDeployment, which is serialized per environment.

Key components

Flow: create deployment

Flow: cancel deployment

Flow: promote

Flow: rollback

BuildSlotService (workspace concurrency)

BuildSlotService caps concurrent builds per workspace. It is a Restate VO keyed by workspace_id, which makes Acquire and Release race-free across concurrent deploy handlers. State held in the VO:
  • active_slots — set of deployment IDs currently holding a slot
  • prod_wait_list — FIFO of production waiters
  • preview_wait_list — FIFO of non-production waiters
Acquire flow:
  1. Deploy handler creates a Restate awakeable and calls AcquireOrWait(deployment_id, awakeable_id, is_production).
  2. The workspace’s max_concurrent_builds quota is fetched. If len(active_slots) < limit, the deployment is added to active_slots and the awakeable is resolved immediately.
  3. Otherwise the deployment is appended to prod_wait_list (if production) or preview_wait_list (if not). The Deploy handler stays suspended on awakeable.Result().
Production deployments respect the same quota cap as preview deployments and do not bypass it, but they get priority by going onto a separate wait list that Release drains first.
Release flow:
  1. Deploy handler calls Release(deployment_id) — from the success path explicitly, or from the compensation stack on failure/cancel.
  2. If the deployment was in active_slots, it is removed and a waiter is promoted: prod_wait_list first, then preview_wait_list. The promoted waiter’s awakeable is resolved.
  3. If the deployment was in either wait list (cancelled before it ever got a slot), it is removed.
This gives push-based slot hand-off with priority — no polling. The concurrency cap is quota.max_concurrent_builds per workspace.
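
The acquire and release flows above can be sketched as a plain in-memory model. This is only an illustration of the hand-off logic, not the service's code: the type and method names are hypothetical, the limit stands in for quota.max_concurrent_builds, and resolving an awakeable is modelled by appending its ID to a slice. In the real service the Restate VO provides the per-workspace serialization that this sketch simply assumes.

```go
package main

// waiter is a deployment parked until a build slot frees up.
type waiter struct {
	deploymentID string
	awakeableID  string
}

// buildSlots is a hypothetical in-memory stand-in for the VO state.
type buildSlots struct {
	limit           int                 // stands in for quota.max_concurrent_builds
	activeSlots     map[string]struct{} // deployment IDs currently holding a slot
	prodWaitList    []waiter            // FIFO of production waiters
	previewWaitList []waiter            // FIFO of non-production waiters
	resolved        []string            // awakeable IDs "resolved" so far (illustration only)
}

func newBuildSlots(limit int) *buildSlots {
	return &buildSlots{limit: limit, activeSlots: map[string]struct{}{}}
}

// acquireOrWait grants a slot immediately when under the limit,
// otherwise it queues the caller on the matching wait list.
func (b *buildSlots) acquireOrWait(deploymentID, awakeableID string, isProduction bool) {
	if len(b.activeSlots) < b.limit {
		b.activeSlots[deploymentID] = struct{}{}
		b.resolved = append(b.resolved, awakeableID)
		return
	}
	w := waiter{deploymentID: deploymentID, awakeableID: awakeableID}
	if isProduction {
		b.prodWaitList = append(b.prodWaitList, w)
	} else {
		b.previewWaitList = append(b.previewWaitList, w)
	}
}

// release frees a held slot and promotes the next waiter, or removes a
// deployment that was cancelled while still waiting.
func (b *buildSlots) release(deploymentID string) {
	if _, held := b.activeSlots[deploymentID]; held {
		delete(b.activeSlots, deploymentID)
		b.promoteNext()
		return
	}
	b.prodWaitList = removeWaiter(b.prodWaitList, deploymentID)
	b.previewWaitList = removeWaiter(b.previewWaitList, deploymentID)
}

// promoteNext drains the production list before the preview list.
func (b *buildSlots) promoteNext() {
	var next waiter
	switch {
	case len(b.prodWaitList) > 0:
		next, b.prodWaitList = b.prodWaitList[0], b.prodWaitList[1:]
	case len(b.previewWaitList) > 0:
		next, b.previewWaitList = b.previewWaitList[0], b.previewWaitList[1:]
	default:
		return
	}
	b.activeSlots[next.deploymentID] = struct{}{}
	b.resolved = append(b.resolved, next.awakeableID)
}

// removeWaiter returns list without the entry for deploymentID.
func removeWaiter(list []waiter, deploymentID string) []waiter {
	kept := list[:0]
	for _, w := range list {
		if w.deploymentID != deploymentID {
			kept = append(kept, w)
		}
	}
	return kept
}
```

With a limit of 1, a preview deployment that arrives before a production deployment is still promoted after it: the production waiter takes the slot as soon as the current holder releases.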

Commit deduplication

When a new deployment is created, dedup.CancelOlderSiblings looks for older deployments on the same (app, environment, branch) that are still in the build queue (pending or awaiting_approval) and cancels them. Once a deployment acquires a build slot and transitions to starting, it is committed: newer commits will not supersede it. This avoids the pathological case where rapid pushes keep cancelling builds and nothing ever finishes.
Cancellation happens in four steps, with the database work batched:
  1. One SELECT: list older queued sibling deployments with their invocation IDs.
  2. One batch UPDATE: stamp every sibling’s in-flight steps with "Superseded by newer commit" (first-write-wins via WHERE ended_at IS NULL).
  3. One batch UPDATE: transition every sibling to status=superseded.
  4. N HTTP calls: restateAdmin.CancelInvocation for each sibling that has an invocation ID.
Only git-sourced deployments with a branch are deduplicated; Docker-image redeploys bypass this path.

Instance readiness (awakeable-based)

After createTopologies, ensureSentinelRows, and ensureCiliumNetworkPolicy, the Deploy handler fires off SentinelService.Deploy RPCs as RequestFutures (non-blocking) and then enters waitForDeployments. Both waits happen in parallel on krane’s side.
waitForDeployments flow:
  1. Load per-region min replicas via FindDeploymentTopologyMinReplicas.
  2. Count running instances per region.
  3. Require numRegions - 1 healthy regions (minimum 1, tolerating one regional outage).
  4. Repeat the DB check until the threshold is met or regionReadyTimeout elapses.
The health-check loop stays inside one journaled Restate step so routine polling does not create one journal entry per check while krane converges.
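
The threshold rule in step 3 can be written out as a small pure function. This is a sketch under assumptions: the function names are hypothetical, and it assumes a region counts as healthy once its running instance count reaches that region's minimum replica count, which the text above implies but does not state outright.

```go
package main

// requiredHealthyRegions tolerates one regional outage but never
// drops below a single healthy region.
func requiredHealthyRegions(numRegions int) int {
	if numRegions <= 1 {
		return 1
	}
	return numRegions - 1
}

// healthyRegions counts regions whose running instances meet their
// minimum replica count (assumed definition of "healthy").
func healthyRegions(minReplicas, running map[string]int) int {
	healthy := 0
	for region, minCount := range minReplicas {
		if running[region] >= minCount {
			healthy++
		}
	}
	return healthy
}

// thresholdMet reports whether enough regions are healthy to proceed.
func thresholdMet(minReplicas, running map[string]int) bool {
	return healthyRegions(minReplicas, running) >= requiredHealthyRegions(len(minReplicas))
}
```

For three regions the check passes with two healthy regions; for a single-region deployment that one region must be healthy.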

Self-skip (belt-and-suspenders dedup)

In addition to the proactive cancel above, the Deploy handler checks HasNewerActiveDeployment at the top of its workflow. If a newer sibling on the same (app, env, branch) is already pending, starting, building, deploying, network, finalizing, ready, or awaiting_approval, the current deployment self-skips. This catches races where the proactive cancel didn’t land (e.g. the newer deployment hadn’t persisted its invocation ID yet).
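
A minimal sketch of the self-skip decision, assuming the siblings have already been filtered to the same (app, env, branch) and that "newer" is decided by a creation timestamp. The struct, its fields, and the function shape are hypothetical; only the status list comes from the paragraph above.

```go
package main

// deployment is a hypothetical, trimmed-down view of a deployment row.
type deployment struct {
	id        string
	createdAt int64 // assumed ordering field, e.g. Unix milliseconds
	status    string
}

// activeStatuses lists the statuses that make a newer sibling count as active.
var activeStatuses = map[string]bool{
	"pending":           true,
	"starting":          true,
	"building":          true,
	"deploying":         true,
	"network":           true,
	"finalizing":        true,
	"ready":             true,
	"awaiting_approval": true,
}

// hasNewerActiveDeployment reports whether any sibling is both newer than
// current and in an active status, in which case current should self-skip.
func hasNewerActiveDeployment(current deployment, siblings []deployment) bool {
	for _, s := range siblings {
		if s.id != current.id && s.createdAt > current.createdAt && activeStatuses[s.status] {
			return true
		}
	}
	return false
}
```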

State serialization (desired state)

Scheduled state changes are serialized via a Restate virtual object keyed by deployment ID in svc/ctrl/worker/deployment. The object stores a nonce for the most recent transition so older delayed requests no-op.
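
The nonce check can be illustrated with a small in-memory model. This is a sketch, not the code in svc/ctrl/worker/deployment: the type and method names are hypothetical, and the nonce is modelled as an opaque string recorded whenever a transition is scheduled.

```go
package main

// desiredStateObject is a hypothetical stand-in for the per-deployment VO state.
type desiredStateObject struct {
	latestNonce string // nonce of the most recently scheduled transition
	state       string // currently applied desired state
}

// schedule records the nonce of the newest transition request.
func (o *desiredStateObject) schedule(nonce string) {
	o.latestNonce = nonce
}

// apply runs when a delayed request fires. It is a no-op unless the
// request carries the nonce of the most recent transition.
func (o *desiredStateObject) apply(nonce, state string) bool {
	if nonce != o.latestNonce {
		return false // superseded by a later schedule call
	}
	o.state = state
	return true
}
```

If two transitions are scheduled back to back, only the delayed request carrying the second nonce changes the state when it fires; the first one returns without doing anything.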

Retry policy

The DeployService is registered with an exponential-backoff retry policy: 30s → 1m → 2m → 4m → 5m (capped), 10 attempts total (~30 minutes). If a deploy can’t make progress after 10 attempts (persistent MySQL connection errors, Depot outage), Restate kills the invocation, the compensation stack runs, and the deployment is marked failed. This replaces an older 150-attempt policy that could leave a deploy stuck retrying for ~24 hours.

Compensation stack

The Deploy handler maintains a LIFO compensation stack registered via Compensation.Add (for side-effects wrapped in restate.RunVoid) and Compensation.AddCtx (for raw ObjectContext operations like BuildSlotService.Release().Send). The stack fires on any error or cancellation:
  • Release the build slot
  • Mark the deployment as failed (only if still in an active status — the conditional UpdateDeploymentStatusIfActive query prevents overwriting superseded or ready)
  • Undo topology inserts, route assignments, etc.
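
The LIFO behaviour can be sketched with a plain slice of undo functions. This is an illustration only: the type below is hypothetical and much simpler than the handler's Compensation helper, which also has to run each step through Restate (restate.RunVoid or the raw ObjectContext).

```go
package main

// compensations is a hypothetical minimal LIFO stack of undo steps.
type compensations struct {
	steps []func() error
}

// add registers an undo step right after its side effect succeeds.
func (c *compensations) add(step func() error) {
	c.steps = append(c.steps, step)
}

// run executes the registered steps in reverse order of registration.
// It keeps going when a step fails and returns every error it collected.
func (c *compensations) run() []error {
	var errs []error
	for i := len(c.steps) - 1; i >= 0; i-- {
		if err := c.steps[i](); err != nil {
			errs = append(errs, err)
		}
	}
	c.steps = nil // each step runs at most once
	return errs
}
```

Registering "release build slot" first and "undo topology inserts" later means the topology is undone before the slot is released, mirroring the reverse of the order in which the side effects happened.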