0008 Dataplane - engineering

We need to design a globally distributed architecture for Unkey where the dataplane can operate independently of the primary database for improved availability.

Goals

Achieve 100% dataplane availability independent of primary database
Provide fast access to dynamic data across global regions
Propagate data across the system quickly
Minimize load on expensive storage
Enable efficient cache invalidation
Can run on any cloud or on premise

Options

1. Direct S3 + In-Memory Cache with SWR

┌─────────────────────┐     ┌─────────┐
│ Sentinel             │     │         │
│ ┌───────────────┐   │────►│   S3    │
│ │ Memory Cache  │   │     │         │
│ └───────────────┘   │     │         │
└─────────────────────┘     └─────────┘

Pros

Simple, straightforward design
Low architectural complexity

Cons

Cache invalidation requires communication with all machines or really low TTLs (<10s)
Inefficient cache refresh patterns
High load on S3 due to concurrent SWR requests from multiple machines

2. S3 + In-Memory Cache with Gossip Protocol

┌─────────────────────┐
│ Sentinel             │
│ ┌───────────────┐   │──┐
│ │ Memory Cache  │   │  │
│ └───────────────┘   │  │
└─────────────────────┘  │
         ▲               │
         │ Gossip        │    ┌─────────┐
         ▼               ├───►│         │
┌─────────────────────┐  │    │   S3    │
│ Sentinel             │  │    │         │
│ ┌───────────────┐   │──┘    └─────────┘
│ │ Memory Cache  │   │
│ └───────────────┘   │
└─────────────────────┘

If we had global fast and efficient eviction, we could set much higher TTLs.

Pros

Efficient cache invalidation through gossip, allowing higher TTLs
Reduced load on primary storage
Only need to notify one node for changes

Cons

Need to implement ordering mechanism (timestamps/Lamport clocks)
More complex system architecture
Global gossip cluster management overhead

3. S3 + Dedicated Cache Layer

┌─────────────────┐
│ Sentinel 1       │───┐
└─────────────────┘   │
                      │
┌─────────────────┐   │    ┌────────────┐
│ Sentinel 2       │───┼───►│   Load     │    ┌────────────┐
└─────────────────┘   │    │  Balancer  │───►│ Cache      │──┐
                      │    │            │    │ Node 1     │  │
┌─────────────────┐   │    │            │    └────────────┘  │
│ Sentinel 3       │───┤    │            │                    │    ┌─────────┐
└─────────────────┘   │    │            │                    ├───►│   S3    │
                      ├───►│            │    ┌────────────┐  │    └─────────┘
┌─────────────────┐   │    │            │───►│ Cache      │──┘
│ Sentinel 4       │───┤    │            │    │ Node 2     │
└─────────────────┘   │    └────────────┘    └────────────┘
                      │
┌─────────────────┐   │
│ Sentinel n       │───┘
└─────────────────┘

Sentinels still use an in-memory SWR cache. The cache nodes help reduce the cost and latency of S3, but will likely have the same freshness/staleness as the sentinels. TBD.. Everything here, except the S3 bucket, would be duplicated per region.

Pros

Better cache retention due to less frequent reboots
Optional global eviction via gossip/kafka later
Maybe we only need 1 S3 region now instead of replicating it

Cons

Additional infrastructure to manage

4. DynamoDB Global Tables + Caching

Option 4A: Direct DynamoDB + Sentinel Memory Cache

Each sentinel maintains a local memory SWR cache with a TTL of 10s
DynamoDB serves as source of truth
Automatic multi-region replication handled by AWS

┌─────────────────────┐     ┌──────────────────┐
│ Sentinel (US)        │     │  DynamoDB        │
│ ┌───────────────┐   │────►│  (US-WEST-1)     │
│ │ Memory Cache  │   │     │                  │
│ └───────────────┘   │     └──────────────────┘
└─────────────────────┘            ▲
                                   │
                                   │ Replication
                                   │
┌─────────────────────┐            ▼
│ Sentinel (EU)        │     ┌──────────────────┐
│ ┌───────────────┐   │────►│  DynamoDB        │
│ │ Memory Cache  │   │     │  (EU-WEST-1)     │
│ └───────────────┘   │     │                  │
└─────────────────────┘     └──────────────────┘

Option 4B: With Dedicated Cache Layer A dedicated cache layer could be added to reduce the load on DynamoDB and improve read performance as well as cost. Whether this actually saves money is debatable, we’ll have to try. These cache nodes would be dumb, they only cache reads for 10s and don’t have any manual eviction possibilities.

┌─────────────────┐
│ Sentinel 1       │───┐
└─────────────────┘   │
                      │    ┌────────────┐
┌─────────────────┐   │    │   Load     │    ┌────────────┐
│ Sentinel 2       │───┼───►│  Balancer  │───►│ Cache      │──┐
└─────────────────┘   │    │            │    │ Node 1     │  │    ┌──────────────┐
                      │    │            │    └────────────┘  ├───►│  DynamoDB    │
                      │    │            │                    │    │  Global      │
                      │    │            │    ┌────────────┐  │    │  Tables      │
┌─────────────────┐   │    │            │───►│ Cache      │──┘    └──────────────┘
│ Sentinel n       │───┘    └────────────┘    │ Node 2     │
└─────────────────┘                          └────────────┘

Pros

Built-in multi-region replication with strong consistency
No need to manage complex replication logic
Lower latency reads from local region
Automatic conflict resolution
Serverless and fully managed by AWS
Cheaper for small reads than S3
99.999% availability (s3 only has 99.99%)

Cons

Vendor lock-in to AWS -> we need to have an abstraction
Higher storage cost compared to S3 due to replication
Cost of replication
Replication lag is controlled by AWS, not us
More expensive for large >13kb reads than S3

​Goals

​Options

​1. Direct S3 + In-Memory Cache with SWR

​Pros

​Cons

​2. S3 + In-Memory Cache with Gossip Protocol

​Pros

​Cons

​3. S3 + Dedicated Cache Layer

​Pros

​Cons

​4. DynamoDB Global Tables + Caching

​Pros

​Cons

Goals

Options

1. Direct S3 + In-Memory Cache with SWR

Pros

Cons

2. S3 + In-Memory Cache with Gossip Protocol

Pros

Cons

3. S3 + Dedicated Cache Layer

Pros

Cons

4. DynamoDB Global Tables + Caching

Pros

Cons