Agent Deployment Issues
Troubleshooting agent service deployment failures and GHCR image issues
GHCR Image Missing - Service Returns 503 Errors
Symptoms
- Incident.io pages the on-call engineer for 5XX alerts
- API returns 500 errors for specific endpoints (e.g., listing keys)
- Underlying cause: Agent service returns 503 Service Temporarily Unavailable
- Agent service appears to be down
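The gap between the status that triggers the alert and the status that points at the root cause can be illustrated with a minimal Python sketch. It assumes the API turns any failed agent call into a generic 500. The function name and the exact mapping are hypothetical and are not taken from the API's code.

```python
def api_status_for(agent_status: int) -> int:
    """Status a client would see for a given agent response (assumed mapping)."""
    # Assumption: a successful agent call is passed through unchanged.
    if 200 <= agent_status < 300:
        return agent_status
    # Assumption: any agent failure, including 503, surfaces as a generic 500.
    return 500


# The alert fires on the API's 500, while the agent itself answered 503.
print(api_status_for(503))
```

When the alert shows a 500 on an endpoint that depends on the agent, it is therefore worth checking the agent's own status before debugging the API.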
Investigation Steps
- Check Axiom logs for encryption errors.
  Note: The API receives a 503 from the agent service and returns a 500 to clients. The 5XX alerts trigger on the API's 500 response, but the root cause is the agent's 503.
- Log in to the AWS Console
  - AWS SSO Login
  - Account: unkey-production001
  - Region: us-east-1
- Check ECS Cluster
  - Navigate to ECS → Clusters → agent-cluster-ce813cc
  - Look for the running tasks count (should show 0 if the image is missing)
- Check ECS Tasks
  - Click on the cluster
  - Review task definitions and recently stopped tasks
  - Look for error messages about image pull failures
- Verify GHCR Image
  - Check agent packages on GitHub
  - Look for the required image tag in ghcr.io/unkeyed/agent
  - Note the missing tag version
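If the stopped tasks are exported as JSON rather than read in the console, the image pull check can be sketched in Python. This is a minimal sketch: it assumes the data has the shape of an ECS DescribeTasks response (`stoppedReason` on the task, `reason` on each container) and that pull failures carry the `CannotPullContainerError` name. The sample data and the function name are hypothetical.

```python
from typing import Any

# Assumption: ECS reports image pull failures with this error name.
PULL_ERROR = "CannotPullContainerError"


def image_pull_failures(tasks: list[dict[str, Any]]) -> list[str]:
    """Return the failure reasons of stopped tasks that could not pull their image."""
    reasons: list[str] = []
    for task in tasks:
        # A reason may appear on the task itself or on any of its containers.
        candidates = [task.get("stoppedReason", "")]
        candidates += [c.get("reason", "") for c in task.get("containers", [])]
        reasons += [r for r in candidates if PULL_ERROR in r]
    return reasons


# Hypothetical sample shaped like a DescribeTasks response.
sample = [
    {
        "stoppedReason": "Task failed to start",
        "containers": [{"reason": "CannotPullContainerError: manifest unknown"}],
    },
    {"stoppedReason": "Scaling activity initiated by deployment", "containers": []},
]
print(image_pull_failures(sample))
```

A non-empty result suggests the task definition references an image tag that is not available in GHCR.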
Resolution
Option 1: Rebuild and Push New Image
Build Process: Images are built automatically via GitHub Actions when you push a git tag with the format agent/v*.
- Clone the repository and create a new version tag.
- Monitor the build:
  - Check GitHub Actions
  - Monitor agent build workflows
  - Wait for completion (the workflow builds the image and pushes it to GHCR)
- Verify the image was pushed:
  - Check GitHub Packages for the new tag
  - Image should be available at ghcr.io/unkeyed/agent:v1.2.4
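The relationship between the git tag and the image location in this option can be sketched in Python. This is a sketch under two assumptions that the steps above imply but do not state: the build workflow drops the `agent/` prefix to form the image tag, and the version after `v` is a dotted number such as 1.2.4. The function name is hypothetical.

```python
import re

REGISTRY_IMAGE = "ghcr.io/unkeyed/agent"
# Tag format from the build process note: agent/v*
# Assumption: the part after "v" is a dotted numeric version such as 1.2.4.
TAG_PATTERN = re.compile(r"^agent/(v\d+(?:\.\d+)*)$")


def image_ref_for_tag(git_tag: str) -> str:
    """Map a git tag like agent/v1.2.4 to the image reference expected in GHCR."""
    match = TAG_PATTERN.match(git_tag)
    if match is None:
        raise ValueError(f"tag {git_tag!r} does not match agent/v*")
    # Assumption: the image tag is the git tag without the "agent/" prefix.
    return f"{REGISTRY_IMAGE}:{match.group(1)}"


print(image_ref_for_tag("agent/v1.2.4"))
```

Checking the tag name before pushing avoids a tag that never triggers the build, for example `v1.2.4` without the `agent/` prefix.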
Option 2: Deploy Existing Image
If a good image already exists in GHCR but AWS isn't using it:
- Clone the infrastructure repository and update the image tag.
- Deploy the infrastructure update.
- Wait for the deployment to complete
- Verify service health:
  - The agent service should return a healthy response
  - Check that API endpoints are working again
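Waiting for the deployment and verifying health can be combined into a small polling loop. The Python sketch below uses only the standard library. The health URL in the usage note is a placeholder, since the agent's real health endpoint is not given here, and the retry count and delay are arbitrary defaults.

```python
import time
import urllib.error
import urllib.request
from typing import Callable


def http_ok(url: str, timeout: float = 5.0) -> bool:
    """Return True if a GET request to url is answered with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status == 200
    except (urllib.error.URLError, OSError):
        # HTTP errors such as 503 raise HTTPError, a URLError subclass.
        return False


def wait_until_healthy(check: Callable[[], bool], attempts: int = 30,
                       delay_seconds: float = 10.0) -> bool:
    """Call check until it returns True or the attempts are used up."""
    for attempt in range(attempts):
        if check():
            return True
        # Sleep between attempts, but not after the last one.
        if attempt < attempts - 1:
            time.sleep(delay_seconds)
    return False
```

With a placeholder URL, the call would look like `wait_until_healthy(lambda: http_ok("https://agent.example.invalid/health"))`. Passing the check in as a callable keeps the loop independent of how health is actually measured.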