Durable Workflows with Restate
How we use Restate for durable execution in the control plane
Unkey uses Restate for durable workflow execution in the control plane. All workflow services live in svc/ctrl/worker/ with protobuf definitions in svc/ctrl/proto/hydra/v1/.
Restate gives us:
- Durable Execution: Operations resume from the last successful step after crashes
- Automatic Retries: Transient failures are retried without manual intervention
- Concurrency Control: Virtual objects serialize access per key, eliminating distributed locking
- Observability: Built-in UI to inspect running workflows, step history, and failures
Core Concepts
Workflows vs Virtual Objects
Restate offers two service types, both used in the worker:
- Workflow (WORKFLOW): Runs once per workflow ID with exactly-once semantics. Used for multi-step pipelines like deployments where each step is durable and the entire operation should not re-execute on retry.
- Virtual Object (VIRTUAL_OBJECT): Keyed by an arbitrary string. All calls to the same key are serialized, preventing concurrent mutations. Used for services like routing, certificates, and deployment state management.
Durable Steps
Each restate.Run() call executes once and stores its result. After failures, workflows resume from stored results without re-executing completed steps. Use restate.WithName("step name") on every step for observability in the Restate UI.
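The replay behavior described above can be modeled without the Restate SDK. The following is a minimal sketch, assuming an in-memory map as the step journal; `journal`, `run`, and `deploy` are hypothetical names for illustration and are not part of the worker code or the Restate API.

```go
package main

import "fmt"

// journal is a hypothetical stand-in for Restate's durable step log:
// it maps a step name to the result stored when that step first succeeded.
type journal map[string]string

// run executes fn only if no result is stored under name, then records the
// result. On a replay the stored result is returned and fn is not called.
func run(j journal, name string, fn func() (string, error)) (string, error) {
	if v, ok := j[name]; ok {
		return v, nil
	}
	v, err := fn()
	if err != nil {
		return "", err
	}
	j[name] = v
	return v, nil
}

// deploy is a two-step pipeline. builds counts how often the build step body
// actually runs; failRoute simulates a failure in the second step.
func deploy(j journal, builds *int, failRoute bool) error {
	image, err := run(j, "build image", func() (string, error) {
		*builds++
		return "image-1", nil
	})
	if err != nil {
		return err
	}
	_, err = run(j, "assign routes", func() (string, error) {
		if failRoute {
			return "", fmt.Errorf("routing unavailable")
		}
		return "routed " + image, nil
	})
	return err
}

func main() {
	j, builds := journal{}, 0
	fmt.Println(deploy(j, &builds, true))  // first attempt fails in step two
	fmt.Println(deploy(j, &builds, false)) // retry reuses the stored build result
	fmt.Println("build executions:", builds)
}
```

The second attempt skips the build step because its result is already in the journal, which is the property that makes small, named steps cheap to retry.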
Service Communication
Services call each other using:
- Blocking: Object.Request() waits for the result
- Fire-and-forget: Object.Send() enqueues the call and returns immediately
- Delayed: Object.Send() with a delay enqueues a call to fire after a duration
Workflow Services
DeployService
Location: svc/ctrl/worker/deploy/
Proto: svc/ctrl/proto/hydra/v1/deploy.proto
Type: Workflow
Key: caller-supplied workflow ID
Operations: Deploy, Rollback, Promote, ScaleDownIdlePreviewDeployments
Orchestrates the full deployment lifecycle. Deploy validates the deployment record, builds a container image (from Git via Depot or a pre-built Docker image), provisions containers across regions with per-region versioning, polls for instance health in parallel, generates frontline routes (per-commit, per-branch, per-environment), reassigns sticky routes through RoutingService, and updates the project's live deployment pointer. The previous live deployment is scheduled for standby after 30 minutes via DeploymentService.
Rollback switches sticky frontline routes from the current live deployment to a previous one and sets the project's isRolledBack flag to prevent future deploys from automatically claiming live routes. Promote reverses a rollback by reassigning routes and clearing the flag.
ScaleDownIdlePreviewDeployments is called by a cron to archive preview deployments with zero traffic in the last 6 hours.
See: Deployment Service
DeploymentService
Location: svc/ctrl/worker/deployment/
Proto: svc/ctrl/proto/hydra/v1/deployment.proto
Type: Virtual Object
Key: deployment_id
Operations: ScheduleDesiredStateChange, ChangeDesiredState, ClearScheduledStateChanges
Serializes all desired-state mutations for a single deployment. Multiple actors (deploy workflow, idle scaler, operators) may need to change a deployment's state concurrently — the virtual object key guarantees sequential processing per deployment.
Uses a nonce-based last-writer-wins mechanism for scheduled transitions: ScheduleDesiredStateChange generates a unique nonce, stores it, and sends a delayed ChangeDesiredState call. If a newer schedule arrives before the delay elapses, it overwrites the nonce, causing the stale delayed call to no-op. Target states are RUNNING, STANDBY, and ARCHIVED.
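The nonce check can be illustrated with a small in-memory model. This is a sketch under stated assumptions: `deployment`, `delayedCall`, `schedule`, and `fire` are hypothetical names, the nonce is a simple counter rather than whatever unique value the service generates, and the delay is represented by calling `fire` later by hand.

```go
package main

import "fmt"

// deployment is a hypothetical in-memory stand-in for one virtual object's state.
type deployment struct {
	state   string // current desired state
	nonce   int    // nonce of the most recent schedule
	counter int    // source of unique nonces for this sketch
}

// delayedCall models the delayed ChangeDesiredState message.
type delayedCall struct {
	target string
	nonce  int
}

// schedule stores a fresh nonce and returns the call that would fire after the delay.
func (d *deployment) schedule(target string) delayedCall {
	d.counter++
	d.nonce = d.counter
	return delayedCall{target: target, nonce: d.nonce}
}

// fire applies the call only if its nonce still matches the stored one.
// It reports whether the state change was applied.
func (d *deployment) fire(c delayedCall) bool {
	if c.nonce != d.nonce {
		return false // superseded by a newer schedule: no-op
	}
	d.state = c.target
	return true
}

func main() {
	d := &deployment{state: "RUNNING"}
	standby := d.schedule("STANDBY")
	archive := d.schedule("ARCHIVED") // overwrites the stored nonce
	fmt.Println(d.fire(standby), d.state)
	fmt.Println(d.fire(archive), d.state)
}
```

Because the virtual object serializes calls per deployment_id, the compare-and-apply in `fire` needs no additional locking in this model.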
RoutingService
Location: svc/ctrl/worker/routing/
Proto: svc/ctrl/proto/hydra/v1/routing.proto
Type: Virtual Object (ingress private)
Key: project_id
Operations: AssignFrontlineRoutes
Reassigns frontline routes to point at a target deployment by updating the deployment_id column in the frontline_routes table. Called by DeployService during deploy, rollback, and promote operations. Marked as ingress-private so it cannot be invoked directly from outside Restate.
See: Routing Service
VersioningService
Location: svc/ctrl/worker/versioning/
Proto: svc/ctrl/proto/hydra/v1/versioning.proto
Type: Virtual Object (ingress private)
Key: region name
Operations: NextVersion, GetVersion
Generates monotonically increasing version numbers per region for state synchronization between the control plane and edge agents (krane). The per-region key design allows parallel version generation across regions while maintaining strict ordering within each.
Before mutating a deployment or sentinel, callers request a new version and stamp it on the resource row. Edge agents track their last-seen version and query for changes: WHERE region = ? AND version > ?.
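As an illustration of the stamping and polling pattern, here is a minimal in-memory sketch. The `versions` map, the `row` type, and the region names are assumptions made for the example; the real service keeps its counter in Restate state and the filter runs as SQL.

```go
package main

import "fmt"

// versions is a hypothetical in-memory model: one independent counter per region.
type versions map[string]uint64

// next returns the next version for a region, strictly increasing within it.
func (v versions) next(region string) uint64 {
	v[region]++
	return v[region]
}

// row models a resource row stamped with its region and version.
type row struct {
	id      string
	region  string
	version uint64
}

// changedSince mirrors the agent query: WHERE region = ? AND version > ?
func changedSince(rows []row, region string, lastSeen uint64) []row {
	var out []row
	for _, r := range rows {
		if r.region == region && r.version > lastSeen {
			out = append(out, r)
		}
	}
	return out
}

func main() {
	v := versions{}
	rows := []row{
		{"d_1", "us-east-1", v.next("us-east-1")},
		{"d_2", "eu-west-1", v.next("eu-west-1")},
		{"d_3", "us-east-1", v.next("us-east-1")},
	}
	// An agent in us-east-1 that last saw version 1 only receives d_3.
	fmt.Println(changedSince(rows, "us-east-1", 1))
}
```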
CertificateService
Location: svc/ctrl/worker/certificate/
Proto: svc/ctrl/proto/hydra/v1/certificate.proto
Type: Virtual Object
Key: domain name
Operations: ProcessChallenge, RenewExpiringCertificates
Handles ACME certificate issuance and renewal. ProcessChallenge runs the full ACME flow — automatically selecting HTTP-01 for regular domains or DNS-01 (via Route53) for wildcards. Rate limit responses from Let's Encrypt trigger durable sleeps rather than consuming retry budget. Private keys are encrypted via Vault before database storage.
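The challenge selection rule can be sketched in a few lines. This assumes wildcard domains are identified by a leading `*.` label; `challengeType` is a hypothetical helper, and the returned strings are the ACME challenge type identifiers from RFC 8555.

```go
package main

import (
	"fmt"
	"strings"
)

// challengeType picks the ACME challenge for a domain: DNS-01 for wildcard
// names (HTTP-01 cannot validate wildcards), HTTP-01 otherwise.
func challengeType(domain string) string {
	if strings.HasPrefix(domain, "*.") {
		return "dns-01"
	}
	return "http-01"
}

func main() {
	for _, d := range []string{"api.example.com", "*.example.com"} {
		fmt.Println(d, challengeType(d))
	}
}
```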
RenewExpiringCertificates is called periodically to find and renew certificates approaching expiry. Configured with a 15-minute inactivity timeout to accommodate DNS propagation delays.
CustomDomainService
Location: svc/ctrl/worker/customdomain/
Proto: svc/ctrl/proto/hydra/v1/custom_domain.proto
Type: Virtual Object
Key: domain name
Operations: VerifyDomain, RetryVerification
Verifies custom domain ownership through a two-step DNS validation: first a TXT record at _unkey.<domain> to prove ownership, then a CNAME pointing to a unique subdomain under the platform's DNS apex (e.g., <random>.unkey-dns.com). Both checks must pass before the domain is marked verified.
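The two checks might look roughly like the sketch below. It assumes the TXT record must contain a per-domain token (the source does not say what the record holds) and uses an in-memory `resolver` so it runs without network access; `resolver`, `staticDNS`, and `verify` are hypothetical names.

```go
package main

import (
	"fmt"
	"strings"
)

// resolver is a hypothetical lookup interface so the sketch runs offline.
type resolver interface {
	TXT(name string) []string
	CNAME(name string) string
}

// staticDNS is an in-memory resolver used for the example.
type staticDNS struct {
	txt   map[string][]string
	cname map[string]string
}

func (s staticDNS) TXT(name string) []string { return s.txt[name] }
func (s staticDNS) CNAME(name string) string { return s.cname[name] }

// verify reports whether both checks pass: a TXT record at _unkey.<domain>
// equal to token (assumed content), and a CNAME on domain pointing at target.
func verify(r resolver, domain, token, target string) bool {
	ownership := false
	for _, v := range r.TXT("_unkey." + domain) {
		if v == token {
			ownership = true
		}
	}
	// DNS names are case-insensitive and may carry a trailing dot.
	got := strings.TrimSuffix(strings.ToLower(r.CNAME(domain)), ".")
	return ownership && got == strings.ToLower(target)
}

func main() {
	dns := staticDNS{
		txt:   map[string][]string{"_unkey.api.example.com": {"token-123"}},
		cname: map[string]string{"api.example.com": "abc123.unkey-dns.com."},
	}
	fmt.Println(verify(dns, "api.example.com", "token-123", "abc123.unkey-dns.com"))
}
```

In a handler, returning a regular error when `verify` is false would let the retry policy re-run the check until DNS has propagated.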
Configured with a fixed 1-minute retry interval for up to 24 hours (1440 attempts) to accommodate DNS propagation. After verification, triggers certificate issuance via CertificateService and creates frontline routes for traffic routing.
ClickhouseUserService
Location: svc/ctrl/worker/clickhouseuser/
Proto: svc/ctrl/proto/hydra/v1/clickhouse_user.proto
Type: Virtual Object
Key: workspace_id
Operations: ConfigureUser
Provisions ClickHouse users for workspace analytics access. Creates users with SHA256 authentication, SELECT permissions on analytics tables, row-level security policies restricting data to the owning workspace, time-based retention filters, and per-query quotas (execution time, memory, result rows).
Passwords are generated with crypto/rand, encrypted via Vault, and stored in MySQL. The handler is idempotent — repeated calls preserve existing passwords while updating quotas and reapplying permissions.
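Password generation with crypto/rand can be sketched as follows. The 32-byte length, the hex encoding, and hashing on the client side are assumptions for illustration; Vault encryption and the MySQL write are omitted, and `newPassword` and `sha256Hex` are hypothetical helpers.

```go
package main

import (
	"crypto/rand"
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// newPassword returns n random bytes from crypto/rand, hex-encoded.
func newPassword(n int) (string, error) {
	buf := make([]byte, n)
	if _, err := rand.Read(buf); err != nil {
		return "", err
	}
	return hex.EncodeToString(buf), nil
}

// sha256Hex returns the hex-encoded SHA-256 digest of the password.
func sha256Hex(password string) string {
	sum := sha256.Sum256([]byte(password))
	return hex.EncodeToString(sum[:])
}

func main() {
	pw, err := newPassword(32)
	if err != nil {
		panic(err)
	}
	// Print only the length and digest, never the password itself.
	fmt.Println(len(pw), sha256Hex(pw))
}
```

To keep the handler idempotent, a generator like this would only be called when no stored password exists for the workspace.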
This service is optional: it is only enabled when CLICKHOUSE_ADMIN_URL and Vault are both configured.
QuotaCheckService
Location: svc/ctrl/worker/quotacheck/
Proto: svc/ctrl/proto/hydra/v1/quota_check.proto
Type: Virtual Object
Key: billing period (YYYY-MM)
Operations: RunCheck
Monitors workspace quota usage and sends Slack notifications for newly exceeded quotas. Uses Restate state to deduplicate — each workspace is notified at most once per billing period. Self-schedules the next run 24 hours later with idempotency keys to prevent duplicate runs. When the month rolls over, a new virtual object with fresh state is used automatically.
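The deduplication and month rollover can be modeled in memory. This is a sketch, assuming one state bag per billing period key; `periodKey`, `notified`, and the workspace ID are hypothetical, and the real service keeps this state in Restate and sends to Slack.

```go
package main

import (
	"fmt"
	"time"
)

// periodKey formats a billing period as YYYY-MM, the virtual object key.
func periodKey(year int, month time.Month) string {
	return fmt.Sprintf("%04d-%02d", year, int(month))
}

// notified models per-object state: workspaces already notified this period.
type notified map[string]bool

// newlyExceeded returns the workspaces not yet notified and marks them.
func (n notified) newlyExceeded(exceeded []string) []string {
	var out []string
	for _, ws := range exceeded {
		if !n[ws] {
			n[ws] = true
			out = append(out, ws)
		}
	}
	return out
}

func main() {
	state := map[string]notified{} // one state bag per billing period key
	start := time.Date(2025, time.January, 30, 0, 0, 0, 0, time.UTC)
	for i := 0; i < 3; i++ { // three daily runs crossing a month boundary
		at := start.AddDate(0, 0, i)
		key := periodKey(at.Year(), at.Month())
		if state[key] == nil {
			state[key] = notified{}
		}
		fmt.Println(key, state[key].newlyExceeded([]string{"ws_1"}))
	}
}
```

The third run lands on a new key, so the workspace is reported again without any explicit reset of the previous month's state.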
Configuration
Services are bound and registered in svc/ctrl/worker/run.go. When Restate.RegisterAs is configured, the worker self-registers with the Restate admin API on startup. In Kubernetes environments, registration is handled externally.
Config fields (see svc/ctrl/worker/config.go):
- Restate.AdminURL: Restate admin endpoint for service registration
- Restate.APIKey: API key for authenticating with Restate admin API
- Restate.HttpPort: Port where the worker listens for Restate HTTP requests
- Restate.RegisterAs: Public URL of this service for self-registration (optional in k8s)
Error Handling
- Terminal Errors: Use restate.TerminalError(err, statusCode) for business logic failures that should not retry (invalid input, not found, unauthorized)
- Transient Errors: Return regular errors for automatic retry (network timeouts, temporary failures)
- Rate Limits: Use restate.Sleep() for durable waits when rate-limited (e.g., ACME)
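The difference between terminal and transient errors can be shown with a plain retry loop. This sketch does not use the Restate SDK: `terminalError` is a hypothetical marker type standing in for what restate.TerminalError produces, and `invoke`, `flaky`, and `notFound` are illustrative helpers.

```go
package main

import (
	"errors"
	"fmt"
)

// terminalError is a hypothetical marker for errors that must not be retried.
type terminalError struct {
	err  error
	code int
}

func (e *terminalError) Error() string { return e.err.Error() }
func (e *terminalError) Unwrap() error { return e.err }

// invoke calls handler up to maxAttempts times. It stops on success or on a
// terminal error; any other error is treated as transient and retried.
// It returns the number of attempts made and the final error.
func invoke(handler func(attempt int) error, maxAttempts int) (int, error) {
	var err error
	for attempt := 1; attempt <= maxAttempts; attempt++ {
		err = handler(attempt)
		var term *terminalError
		if err == nil || errors.As(err, &term) {
			return attempt, err
		}
	}
	return maxAttempts, err
}

// flaky returns a handler that fails transiently for the first n attempts.
func flaky(failures int) func(int) error {
	return func(attempt int) error {
		if attempt <= failures {
			return errors.New("connection reset")
		}
		return nil
	}
}

// notFound always fails with a terminal error.
func notFound(int) error {
	return &terminalError{err: errors.New("deployment not found"), code: 404}
}

func main() {
	fmt.Println(invoke(flaky(2), 5)) // retried until the third attempt succeeds
	fmt.Println(invoke(notFound, 5)) // stops after the first attempt
}
```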
Best Practices
- Idempotent Steps: Use UPSERT instead of INSERT for database operations
- Named Steps: Always use restate.WithName("step name") for observability
- Small Steps: Break operations into focused, single-purpose steps
- Virtual Objects: Use for automatic serialization instead of manual locking
- Ingress Privacy: Mark internal-only services with restate.WithIngressPrivate(true)