Vault stores encrypted data encryption keys (DEKs) in object storage. These DEKs are required to encrypt and decrypt recoverable customer key material. Vault caches DEKs to reduce object-storage reads, but a cache miss still requires the backing bucket. Vault-backed routes started failing after cache entries expired. The affected routes were /v2/apis.listKeys, /v2/keys.getKey, and /v2/keys.createKey.
Increasing the cache TTLs to a fresh TTL of 1h and a stale TTL of 24h helps warm reads and many createKey requests, but it does not help cold reads, evicted entries, first-time workspace encryption, key rotation, or outages longer than the cache window.
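The limits of the cache can be illustrated with a minimal sketch. It assumes a stale-if-error policy (a stale entry is served only when the bucket read fails) and measures both TTLs from the time the entry was stored; vault's real cache may behave differently. `DekCache`, `Entry`, and the `load` callback are hypothetical names used only for illustration.

```python
import time
from dataclasses import dataclass

FRESH_TTL_S = 60 * 60       # 1h fresh TTL, from the text
STALE_TTL_S = 24 * 60 * 60  # 24h stale TTL, from the text


@dataclass
class Entry:
    value: bytes
    stored_at: float


class DekCache:
    """Hypothetical fresh/stale cache in front of the DEK bucket."""

    def __init__(self, load, now=time.monotonic):
        self._load = load  # callable(key) -> bytes; reads object storage, may raise
        self._now = now
        self._entries = {}

    def get(self, key):
        entry = self._entries.get(key)
        age = None if entry is None else self._now() - entry.stored_at

        # Warm read: a fresh entry never touches the bucket.
        if entry is not None and age < FRESH_TTL_S:
            return entry.value

        try:
            value = self._load(key)
        except Exception:
            # Bucket is unavailable: fall back to a stale entry if one exists.
            if entry is not None and age < STALE_TTL_S:
                return entry.value
            # Cold read, evicted entry, or outage longer than the stale TTL.
            raise

        self._entries[key] = Entry(value, self._now())
        return value
```

In this sketch a key that was never cached fails as soon as the bucket is down, and a cached key fails once its age exceeds the stale TTL, which matches the cases listed above that longer TTLs cannot cover.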
Why
Vault relies on Cloudflare R2 for durable storage, and recent R2 incidents impaired our API’s ability to read and create encrypted keys. We do not control R2 or any of its upstream dependencies. This is a single point of failure that we must address. The goal is to add a regional recovery path while keeping vault’s architecture simple. Operators can promote the replica when the primary region is unavailable. Regional failures are rare, but they happen, so let’s prepare for them. AWS has strong regional isolation: a single regional S3 failure does not affect S3 in another region. Therefore, if we used two S3 regions, our services could tolerate total loss of availability in one region. AWS also provides (async and slow) cross-region replication.
Design
Vault’s code does not change. The design is exactly the same; we only change the durable data source. Instead of a single R2 bucket, we will create an S3 bucket in region A and one in region B. We will also configure replication from A to B. All vault instances read from and write to the primary S3 bucket in region A during normal operation. S3 Cross-Region Replication (CRR) copies objects to the replica bucket in region B. Vault does not read from the replica unless operators promote it during an incident.
Consistency and RPO
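Replication from A to B is configured as a rule on the primary bucket, and Replication Time Control is switched on per rule. The sketch below builds a rule in the shape accepted by the S3 PutBucketReplication API (for example through boto3's `put_bucket_replication`). The function name, rule ID, role ARN, and bucket ARN are placeholders, not values from this RFC. S3 replication also requires versioning to be enabled on both buckets.

```python
def replication_config(role_arn: str, replica_bucket_arn: str) -> dict:
    """Build a hypothetical A -> B replication rule with RTC enabled."""
    return {
        # IAM role that S3 assumes to copy objects (placeholder ARN supplied by caller).
        "Role": role_arn,
        "Rules": [
            {
                "ID": "vault-primary-to-replica",  # hypothetical rule name
                "Priority": 1,
                "Status": "Enabled",
                # An empty prefix matches every object in the bucket.
                "Filter": {"Prefix": ""},
                # Required whenever a Filter is present.
                "DeleteMarkerReplication": {"Status": "Disabled"},
                "Destination": {
                    "Bucket": replica_bucket_arn,
                    # Replication Time Control: 15 minutes is the threshold S3 supports.
                    "ReplicationTime": {"Status": "Enabled", "Time": {"Minutes": 15}},
                    # Replication metrics must be enabled together with RTC.
                    "Metrics": {"Status": "Enabled", "EventThreshold": {"Minutes": 15}},
                },
            }
        ],
    }
```

With metrics enabled, S3 publishes replication latency and pending-operation metrics, which gives operators a way to see how far the replica is behind before deciding to promote it.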
S3 Replication Time Control is still asynchronous with an SLA of replicating 99.9% of objects within 15 minutes. If the primary region fails immediately after vault writes new DEK material, the replica may not have that object yet. Recently encrypted recoverable material may be unavailable until the primary recovers or the missing object is restored. This RFC accepts a non-zero RPO in favour of keeping the architecture and migration simple.
Failover
If the primary region becomes unavailable, vault keeps using the unavailable primary bucket and requests fail until operators promote the replica. To promote, we manually change the S3 secrets in AWS Secrets Manager (unkey/vault) so that the S3 URL points to the replica region’s bucket, then sync all ExternalSecrets and restart vault pods.
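The secret change in the promotion step can be sketched as a pure function over the secret's JSON payload. The key names `S3_URL` and `S3_BUCKET` are assumptions for illustration; the actual layout of the unkey/vault secret is not specified here. Writing the result back (for example with the Secrets Manager PutSecretValue API), syncing ExternalSecrets, and restarting pods remain separate manual steps.

```python
import json

# Hypothetical key names; the real secret layout may differ.
REQUIRED_KEYS = ("S3_URL", "S3_BUCKET")


def promote_replica(secret_string: str, replica_url: str, replica_bucket: str) -> str:
    """Return a new secret payload that points vault at the replica bucket."""
    secret = json.loads(secret_string)

    # Refuse to edit a payload that does not look like the expected secret,
    # so a typo cannot silently add new keys that vault never reads.
    for key in REQUIRED_KEYS:
        if key not in secret:
            raise KeyError(f"secret is missing expected key {key!r}")

    secret["S3_URL"] = replica_url
    secret["S3_BUCKET"] = replica_bucket
    # All other fields (credentials, etc.) are left untouched.
    return json.dumps(secret, sort_keys=True)
```

Keeping the rewrite as a small, checked function makes the promotion reviewable and repeatable, which matters for a step that is only exercised during an incident.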
Migration
We’ll do a classical dual-write migration.
- Vault will perform dual writes to both the old R2 and new primary S3.
- We will copy all objects from R2 to S3.
- Vault switches reads from R2 to S3.
- We remove the dual write setup.
- We remove the R2 buckets.
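The first three steps can be sketched with in-memory stand-ins for the two buckets. `MemoryStore`, `DualWriteStore`, and `backfill` are hypothetical names, and the sketch assumes that a write fails if either bucket rejects it; it is not vault's actual storage layer.

```python
class MemoryStore:
    """In-memory stand-in for an object-storage bucket."""

    def __init__(self):
        self.objects = {}

    def put(self, key, data):
        self.objects[key] = data

    def get(self, key):
        return self.objects[key]


class DualWriteStore:
    """Writes to both stores; reads from one, selected by a flag."""

    def __init__(self, old, new, read_from_new=False):
        self.old = old
        self.new = new
        self.read_from_new = read_from_new

    def put(self, key, data):
        # Both writes must succeed, otherwise the exception reaches the caller.
        self.old.put(key, data)
        self.new.put(key, data)

    def get(self, key):
        source = self.new if self.read_from_new else self.old
        return source.get(key)


def backfill(old, new):
    """Copy objects that predate dual writes; returns the number copied."""
    copied = 0
    for key, data in list(old.objects.items()):
        # Keys already present were written by the dual-write path.
        if key not in new.objects:
            new.put(key, data)
            copied += 1
    return copied


# Usage: follow the migration steps in order.
r2, s3 = MemoryStore(), MemoryStore()
r2.put("dek/ws_old", b"existing")         # object written before the migration
store = DualWriteStore(r2, s3)            # step 1: dual writes
store.put("dek/ws_new", b"fresh")
backfill(r2, s3)                          # step 2: copy R2 -> S3
store.read_from_new = True                # step 3: switch reads to S3
```

Starting dual writes before the backfill means no object written during the copy is missed, so S3 holds a complete set by the time reads switch over.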

