Unkey
RunbooksServicesAgent Service

Agent Deployment Issues

Troubleshooting agent service deployment failures and GHCR image issues

GHCR Image Missing - Service Returns 503 Errors

Symptoms

  • Incident.io pages on-call engineer for 5XX alerts
  • API returns 500 errors for specific endpoints (e.g., listing keys)
  • Underlying cause: Agent service returns 503 Service Temporarily Unavailable
  • Agent service appears to be down

Investigation Steps

  1. Check Axiom logs for encryption errors:

    unable to decrypt, fetch error: <html>
    <head><title>503 Service Temporarily Unavailable</title></head>
    <body>
    <center><h1>503 Service Temporarily Unavailable</h1></center>
    </body>
    </html>

    Note: The API receives a 503 from the agent service, which causes the API to return 500 to clients. Your 5XX alerts trigger on the API's 500 response, but the root cause is the agent's 503.

  2. Log into AWS Console

  3. Check ECS Cluster

    • Navigate to ECS → Clusters → agent-cluster-ce813cc
    • Look for running tasks count (should show 0 if image is missing)
  4. Check ECS Tasks

    • Click on the cluster
    • Review task definitions and recent stopped tasks
    • Look for error messages about image pull failures
  5. Verify GHCR Image

Resolution

Option 1: Rebuild and Push New Image

Build Process: Images are built automatically via GitHub Actions when you push a git tag with format agent/v*.

  1. Clone the repository and create new version tag:

    # Clone the main repository
    git clone https://github.com/unkeyed/unkey.git
    cd unkey
     
    # Increment version number appropriately
    git tag agent/v1.2.4
    git push origin agent/v1.2.4
  2. Monitor the build:

  3. Verify image was pushed:

Option 2: Deploy Existing Image

If a good image already exists in GHCR but AWS isn't using it:

  1. Clone infrastructure repository and update image tag:

    # Clone the infrastructure repository
    git clone https://github.com/unkeyed/infra.git
    cd infra/pulumi/projects/agent
     
    # Edit main.go to update image tag:
    # Image: pulumi.String("ghcr.io/unkeyed/agent:v1.2.4")
  2. Deploy infrastructure update:

    # Login to AWS SSO first
    aws sso login --sso-session unkey
     
    # Deploy to production (US East 1)
    AWS_PROFILE=unkey-production001-admin pulumi up --stack production001-use1
  3. Wait for deployment to complete

  4. Verify service health:

    curl https://agent.us-east-1.production001.aws.unkey.com/v1/liveness
    • Should return a healthy response
    • Check that API endpoints are working again

On this page