Krane Sync Engine Architecture
A deep dive into Krane's version-based synchronization, which reads versions directly from resource tables and ensures eventual consistency.
The Krane Sync Engine implements a Kubernetes-style List+Watch pattern for synchronizing desired infrastructure state from the control plane. It uses version numbers embedded in resource tables to track state changes, enabling efficient incremental synchronization and reliable recovery after disconnections.
Architecture
Sync Protocol
The sync engine uses WatchDeployments and WatchSentinels RPCs to receive state changes from the control plane. These RPCs establish server-streaming connections where the control plane sends DeploymentState and SentinelState messages containing deployment or sentinel operations.
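As a rough sketch, the client-side surface of this protocol can be modeled as two server-streaming calls. Everything below is an illustrative assumption in Go; the actual generated gRPC client and message definitions may differ.

```go
package krane

import "context"

// WatchRequest carries the highest version this Krane instance has committed;
// zero requests a full bootstrap. (Illustrative shape, not the generated type.)
type WatchRequest struct {
	LastSeenVersion uint64
}

// DeploymentState and SentinelState are the streamed messages; their payloads
// are sketched in the Message Types section below.
type DeploymentState struct{ Version uint64 }
type SentinelState struct{ Version uint64 }

// DeploymentWatch and SentinelWatch model server-streaming responses: Recv
// blocks until the next message arrives or the stream ends.
type DeploymentWatch interface {
	Recv() (*DeploymentState, error)
}
type SentinelWatch interface {
	Recv() (*SentinelState, error)
}

// ControlPlaneClient is the subset of the control-plane API the sync engine
// uses to receive desired state.
type ControlPlaneClient interface {
	WatchDeployments(ctx context.Context, req *WatchRequest) (DeploymentWatch, error)
	WatchSentinels(ctx context.Context, req *WatchRequest) (SentinelWatch, error)
}
```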
Version Tracking
Each resource (deployment_topology, sentinel) has a version column updated on every mutation via the Restate VersioningService singleton. This provides a globally unique, monotonically increasing version across all resources.
Each controller maintains a versionLastSeen field that tracks the highest version successfully processed; on startup it is zero. While processing a stream, the controller tracks the highest version it has seen but commits it only after the stream closes cleanly. When reconnecting after a failure, Krane sends its last committed version in the watch request, allowing the control plane to resume from the correct position.
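A minimal sketch of that bookkeeping, where openStream and processStream are hypothetical stand-ins for the gRPC watch call and the message-handling loop (both sketched in later sections):

```go
package krane

import "context"

// watchStream is left opaque here; see the State Handling sketch for the
// message-processing side.
type watchStream interface{}

// deploymentController keeps the highest version it has committed. Zero means
// the next watch request asks for a full bootstrap.
type deploymentController struct {
	versionLastSeen uint64

	// Stand-ins for the gRPC watch call and the message-handling loop.
	openStream    func(ctx context.Context, lastSeenVersion uint64) (watchStream, error)
	processStream func(watchStream) (maxVersion uint64, err error)
}

// watchOnce runs a single watch iteration. versionLastSeen advances only when
// processStream reports a clean stream close, so a broken stream is retried
// from the previous committed position.
func (c *deploymentController) watchOnce(ctx context.Context) error {
	stream, err := c.openStream(ctx, c.versionLastSeen)
	if err != nil {
		return err
	}
	maxVersion, err := c.processStream(stream)
	if err != nil {
		return err
	}
	if maxVersion > c.versionLastSeen {
		c.versionLastSeen = maxVersion // commit only after a clean close
	}
	return nil
}
```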
Message Types
Each state message contains a version number and one of two payloads:
DeploymentState contains either an ApplyDeployment (create or update a StatefulSet with the specified image, replicas, and resource limits) or DeleteDeployment (remove the StatefulSet and its associated Service).
SentinelState contains either an ApplySentinel (create or update a sentinel deployment) or DeleteSentinel (remove the sentinel).
Stream close signals that the current batch (or bootstrap) is complete. The client tracks the highest version from received messages and uses it for the next watch request.
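In Go terms, each message can be modeled as a version plus a oneof-style payload where exactly one field is set. The struct and handler below are an illustrative sketch with assumed field names, not the generated protobuf types:

```go
package krane

// ApplyDeployment describes the desired StatefulSet (resource limits omitted);
// DeleteDeployment names the StatefulSet and Service to remove.
type ApplyDeployment struct {
	Name     string
	Image    string
	Replicas int32
}

type DeleteDeployment struct {
	Name string
}

// DeploymentState carries a version and exactly one payload, mirroring a
// protobuf oneof.
type DeploymentState struct {
	Version uint64
	Apply   *ApplyDeployment
	Delete  *DeleteDeployment
}

// handleDeploymentState applies the payload to Kubernetes via the supplied
// callbacks.
func handleDeploymentState(msg *DeploymentState,
	apply func(*ApplyDeployment) error,
	remove func(*DeleteDeployment) error) error {
	switch {
	case msg.Apply != nil:
		return apply(msg.Apply) // create or update the StatefulSet
	case msg.Delete != nil:
		return remove(msg.Delete) // remove the StatefulSet and its Service
	default:
		return nil // no payload set; nothing to do
	}
}
```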
Controller Loops
Each controller (deployment and sentinel) runs a continuous loop with jittered reconnection timing (1-5 seconds between attempts). Each iteration establishes a watch stream and processes messages until the stream closes or an error occurs.
This design prioritizes simplicity and reliability over latency. The jittered timing prevents thundering herd problems when multiple Krane instances reconnect simultaneously after a control plane restart.
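A sketch of that outer loop, with the jittered 1-5 second delay between attempts; watchOnce stands in for a single watch-and-process iteration:

```go
package krane

import (
	"context"
	"log"
	"math/rand"
	"time"
)

// runControllerLoop runs watch iterations until the context is cancelled,
// sleeping a jittered 1-5 seconds between attempts so that many Krane
// instances do not reconnect in lockstep after a control-plane restart.
func runControllerLoop(ctx context.Context, watchOnce func(context.Context) error) {
	for ctx.Err() == nil {
		if err := watchOnce(ctx); err != nil {
			log.Printf("watch iteration failed: %v", err)
		}
		delay := time.Second + time.Duration(rand.Int63n(int64(4*time.Second)))
		select {
		case <-ctx.Done():
			return
		case <-time.After(delay):
		}
	}
}
```

In practice each controller would pass its own watchOnce method (as in the Version Tracking sketch) into this loop.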
State Handling
Each controller handles its own state messages. The handler returns each message's version, but the controller does not commit it until the stream closes cleanly. This makes bootstrap atomic: if the stream breaks mid-bootstrap, the client retries from version 0 rather than skipping resources that were never received.
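A sketch of that inner loop, assuming a stream whose Recv returns io.EOF on a clean close (the usual convention for gRPC client-side streams):

```go
package krane

import (
	"errors"
	"io"
)

// stateMessage is the minimal shape this sketch needs: every streamed message
// carries the version assigned by the control plane.
type stateMessage struct {
	Version uint64
	// Apply/Delete payload omitted; see the Message Types sketch.
}

// stateStream models the server-streaming watch; Recv returns io.EOF when the
// control plane closes the stream cleanly.
type stateStream interface {
	Recv() (*stateMessage, error)
}

// processStream handles every message and returns the highest version seen.
// The caller commits that version only when err is nil (clean close); if the
// stream breaks mid-bootstrap, nothing is committed and the next attempt
// retries from the previous committed version, so resources that were never
// received cannot be skipped.
func processStream(s stateStream, handle func(*stateMessage) error) (maxVersion uint64, err error) {
	for {
		msg, recvErr := s.Recv()
		if errors.Is(recvErr, io.EOF) {
			return maxVersion, nil // clean close: safe to commit
		}
		if recvErr != nil {
			return 0, recvErr // broken stream: do not commit
		}
		if err := handle(msg); err != nil {
			return 0, err
		}
		if msg.Version > maxVersion {
			maxVersion = msg.Version
		}
	}
}
```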
Soft Deletes
"Deletes" are implemented as soft deletes: setting desired_replicas=0 or desired_state='archived'. The row remains in the table with its version updated, so clients receive the change and can delete the corresponding Kubernetes resource.
This eliminates the need for a separate changelog table. The resource tables themselves are the source of truth, and each row carries its version for efficient incremental sync.
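For illustration only, one plausible mapping from a changed row to the message Krane receives might look like the following; the column names mirror the description above, and DeploymentState is the illustrative struct from the Message Types sketch:

```go
package krane

// topologyRow mirrors the assumed columns of deployment_topology that matter
// for sync: the global version plus the soft-delete markers.
type topologyRow struct {
	Name            string
	Version         uint64
	DesiredReplicas int32
	DesiredState    string // e.g. "active" or "archived"
	Image           string
}

// isSoftDeleted reports whether the row represents a delete from the client's
// point of view.
func isSoftDeleted(r topologyRow) bool {
	return r.DesiredReplicas == 0 || r.DesiredState == "archived"
}

// toDeploymentState translates a changed row into the message Krane receives:
// soft-deleted rows become DeleteDeployment, everything else ApplyDeployment.
func toDeploymentState(r topologyRow) *DeploymentState {
	if isSoftDeleted(r) {
		return &DeploymentState{Version: r.Version, Delete: &DeleteDeployment{Name: r.Name}}
	}
	return &DeploymentState{
		Version: r.Version,
		Apply:   &ApplyDeployment{Name: r.Name, Image: r.Image, Replicas: r.DesiredReplicas},
	}
}
```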
Kubernetes Watchers
In addition to receiving desired state from the control plane, Krane watches Kubernetes for actual state changes. Pod and StatefulSet watchers notify the controllers when resources change (pod becomes ready, pod fails, etc.). The controllers then report these changes back to the control plane through ReportDeploymentStatus and ReportSentinelStatus RPCs.
This bidirectional flow ensures the control plane always knows the actual state of resources, enabling the UI to show accurate deployment status and the workflow to detect when deployments are ready.
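A minimal sketch of the pod-watching side using client-go shared informers; reportStatus stands in for whatever buffers and sends ReportDeploymentStatus, and the namespace and resync interval are arbitrary:

```go
package krane

import (
	"context"
	"time"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/client-go/informers"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/cache"
)

// podReady reports whether the pod's Ready condition is true.
func podReady(p *corev1.Pod) bool {
	for _, c := range p.Status.Conditions {
		if c.Type == corev1.PodReady {
			return c.Status == corev1.ConditionTrue
		}
	}
	return false
}

// watchPods notifies reportStatus whenever a pod changes, so the controller
// can push the actual state back to the control plane.
func watchPods(ctx context.Context, client kubernetes.Interface, namespace string,
	reportStatus func(pod *corev1.Pod, ready bool)) {

	factory := informers.NewSharedInformerFactoryWithOptions(
		client, 30*time.Second, informers.WithNamespace(namespace))
	podInformer := factory.Core().V1().Pods().Informer()

	podInformer.AddEventHandler(cache.ResourceEventHandlerFuncs{
		UpdateFunc: func(_, newObj interface{}) {
			if pod, ok := newObj.(*corev1.Pod); ok {
				reportStatus(pod, podReady(pod))
			}
		},
		DeleteFunc: func(obj interface{}) {
			if pod, ok := obj.(*corev1.Pod); ok {
				reportStatus(pod, false)
			}
		},
	})

	factory.Start(ctx.Done())
	factory.WaitForCacheSync(ctx.Done())
}
```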
Buffered Updates
Status updates to the control plane are buffered in memory before sending. This smooths over traffic spikes and reduces load on the control plane during high-churn scenarios (like rolling updates affecting many pods). The buffers use retries with exponential backoff and circuit breakers to handle transient failures without overwhelming a recovering control plane.
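One way to sketch such a buffer in Go; sendBatch stands in for the ReportDeploymentStatus / ReportSentinelStatus call, and the circuit breaker is elided:

```go
package krane

import (
	"context"
	"sync"
	"time"
)

// statusBuffer accumulates status updates in memory and flushes them in
// batches, so a burst of per-pod events does not become a burst of RPCs.
type statusBuffer[T any] struct {
	mu      sync.Mutex
	pending []T
}

func (b *statusBuffer[T]) Add(update T) {
	b.mu.Lock()
	defer b.mu.Unlock()
	b.pending = append(b.pending, update)
}

func (b *statusBuffer[T]) drain() []T {
	b.mu.Lock()
	defer b.mu.Unlock()
	batch := b.pending
	b.pending = nil
	return batch
}

// Flush periodically sends the buffered updates. On failure the batch is
// re-queued and retried with exponential backoff (capped at one minute); a
// real implementation would also wrap sendBatch in a circuit breaker.
func (b *statusBuffer[T]) Flush(ctx context.Context, interval time.Duration,
	sendBatch func(context.Context, []T) error) {

	backoff := interval
	for {
		select {
		case <-ctx.Done():
			return
		case <-time.After(backoff):
		}

		batch := b.drain()
		if len(batch) == 0 {
			backoff = interval
			continue
		}
		if err := sendBatch(ctx, batch); err != nil {
			b.mu.Lock()
			b.pending = append(batch, b.pending...)
			b.mu.Unlock()
			if backoff *= 2; backoff > time.Minute {
				backoff = time.Minute
			}
			continue
		}
		backoff = interval
	}
}
```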
Failure Modes
Stream disconnection: The watcher reconnects with jittered backoff. Since the version is not committed until a clean close, reconnection resumes from the last committed version.
Control plane unavailable: The circuit breaker opens after repeated failures, preventing Krane from overwhelming a struggling control plane. Local Kubernetes state continues to function; only sync with the control plane is paused.
Bootstrap + GC: After a full bootstrap (version=0), Krane garbage-collects any Kubernetes resources not mentioned in the bootstrap stream. This ensures stale resources are cleaned up even if they were hard-deleted from the database.
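A sketch of that garbage-collection pass with client-go, assuming Krane labels the resources it manages (the label selector here is an assumption, and associated Services are omitted):

```go
package krane

import (
	"context"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// gcAfterBootstrap deletes Krane-managed StatefulSets that were not mentioned
// in the bootstrap stream, so resources hard-deleted from the database are
// still cleaned up. seen is the set of StatefulSet names received during the
// version=0 bootstrap.
func gcAfterBootstrap(ctx context.Context, client kubernetes.Interface,
	namespace string, seen map[string]bool) error {

	// Assumed label identifying resources owned by Krane.
	list, err := client.AppsV1().StatefulSets(namespace).List(ctx, metav1.ListOptions{
		LabelSelector: "app.kubernetes.io/managed-by=krane",
	})
	if err != nil {
		return err
	}
	for _, sts := range list.Items {
		if seen[sts.Name] {
			continue
		}
		// Not in the bootstrap: the control plane no longer knows about it.
		if err := client.AppsV1().StatefulSets(namespace).Delete(ctx, sts.Name, metav1.DeleteOptions{}); err != nil {
			return err
		}
	}
	return nil
}
```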