Planning work
Not everything in infra is a quick fix. Rolling out to a new region, changing how networking works, adjusting capacity, rethinking cost: these are changes that ripple. They affect multiple clusters, multiple services, and sometimes multiple AWS accounts. They can’t be developed and shipped in a day, and they shouldn’t be.
Before starting significant work, write up what you’re planning and get it in front of the rest of the team. This doesn’t need to be formal (a GitHub issue, a doc, or even a detailed Slack thread in #infrastructure will do), but it needs to exist somewhere the others can read it async and weigh in. The goal is to make sure we’re aligned before anyone starts writing code, not after it’s already in a PR.
Things worth planning together:
- New region rollouts: cluster provisioning, secret replication, DNS, gossip functionality, environment files for every chart. There’s a checklist, but it still touches a lot of surface area.
- Capacity changes: node group sizing, scaling limits, instance types. These have cost implications and we must be deliberate about them.
- Cost-related changes: anything that changes what we’re paying for or how much. Reserved instances, savings plans, storage classes, new managed services.
- Architectural changes: swapping out a component, changing how traffic routes, adding or removing a dependency. These are the ones where getting a second opinion early saves the most time.
- Cross-cutting changes: anything that touches the promotion files, ArgoCD ApplicationSets, or shared Helm chart structure. If it affects how we deploy, we all need to know.
Promotions
This is how we get code from a PR into staging and production. The process is lightweight on purpose, and most of the time it stays out of your way. The few rules that do exist are there because most of us have been on the wrong end of a production incident with no context, and none of us want to be there again.
Why we do it this way
A common alternative is a long-running staging branch where merging staging into main means “deploy to production.” That works until it doesn’t. If five changes land on staging and one of them is broken, we’re stuck: we can’t promote the four good ones without also shipping the bad one or doing a lot of git cherry-pickin’. Everything queues up behind the fix to the one bad change. The branches drift apart, merge conflicts accumulate, and rolling back means untangling which of a dozen changes in a staging→main merge caused the problem.
By pinning production to a specific commit SHA, staging and production are decoupled. A broken change on staging doesn’t touch production at all — production is still sitting on the last known-good SHA while we sort things out. We pick exactly which commit to promote and when, per component if we need to. Rolling back is one revert commit. And there’s no second branch to maintain — everything lives on main.
How it works
Every environment has a promotion file per service, promotions/<env>/<service>.yaml, that controls which Git revision ArgoCD deploys. ArgoCD ApplicationSets read these files and set targetRevision on each Application.
- Staging tracks main: every merge to main auto-deploys within minutes.
- Production is pinned to a specific commit SHA. Deploying to production requires explicitly updating this SHA and getting the change reviewed.
Merging to main
Merges to main do not require a review unless the change modifies a file under eks-cluster/promotions/. Everything merged to main deploys to staging automatically. We plan to add some kind of gating mechanism once the infrastructure stabilizes.
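The review rule can be written as a small predicate. This is only a sketch to make the rule concrete: it assumes changed paths are given relative to the repository root, and the name requires_review is hypothetical, not part of our tooling.

```python
from pathlib import PurePosixPath

# Directory named in the rule above. Paths are assumed to be relative to the repo root.
PROMOTIONS_DIR = PurePosixPath("eks-cluster/promotions")


def requires_review(changed_paths):
    """Return True if any changed file lives under eks-cluster/promotions/."""
    return any(
        PROMOTIONS_DIR in PurePosixPath(path).parents
        for path in changed_paths
    )
```

For example, a change touching eks-cluster/promotions/production/api.yaml (an illustrative file name) would need a review, while a change that only touches chart files would not.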
Why good PRs and commit messages matter
A merge to main is the unit of work that eventually gets promoted to production. The PR title, description, and commit messages become the primary audit trail for everything that runs in production.
When something goes wrong at 2am, the first thing we do is check what changed recently. The promotion file points to a SHA, that SHA points to a merge commit, and that merge commit points back to a PR. If the PR says “fix stuff” with no description, whoever is triaging has to read every line of the diff to figure out what changed and whether it could be related. If the PR clearly explains what was changed, why, and what it affects, we can make that call in seconds.
This matters beyond incidents too. When reviewing a promotion PR, we’re deciding whether this change is safe for production. We follow the SHA back to the original PR to understand what we’re approving. A well-documented PR lets us review with confidence. A poorly documented one forces us to either rubber-stamp it or spend time reverse-engineering the intent from code.
Write PRs into main as if one of us will need to triage a production issue using only the PR title and description. Include:
- What changed and why
- What services or components are affected
- Any risks or things to watch after deployment
- Links to related issues, docs, or prior discussion
Deploying to production
Production promotions require review. It’s on whoever is promoting to make the reviewer’s job easy: link back to the original PR, use the right SHA, and make sure the context is all there. If the reviewer has to go digging to understand what they’re approving, we’ve already failed at the process.
Step by step
- Merge your change to main. This really should be a PR, but you can push directly to main as long as the change is well documented: a clear title, a description of what changed and why, and any relevant context. The production promotion itself always requires a PR. Violators will be prosecuted to the fullest extent of the law.
- Verify it works in staging. Wait for ArgoCD to deploy the change to staging and confirm it behaves correctly.
- Get the commit SHA from the merged PR. This is the merge commit on main that you validated in staging.
- Run the promote script.
- Commit, push, and open a promotion PR. The promotion PR should:
  - Set the revision to the SHA of the change being promoted
  - Link to the original PR that introduced the change (the one merged to main), or write up an explanation
  - Be scoped narrowly: one promotion per PR when possible
- Get a review and merge. You’ve done the work to make this easy to review, so the reviewer just needs to confirm the SHA matches what was tested in staging and that the linked PR tells the full story.
- ArgoCD picks up the change and deploys.
What the reviewer should check
- The revision value is a real commit SHA (not a branch name)
- The promotion PR links to the original PR that introduced the change
- That change was already deployed and validated in staging
- The scope makes sense (all components vs. single component)
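The first check in the list above is easy to get wrong by eye, so here is a rough sketch of what “a real commit SHA, not a branch name” could look like as a shape check. It assumes promotions use full 40-character lowercase SHA-1 hashes, which is an assumption on our part; the helper name looks_like_commit_sha is hypothetical. It only checks the shape of the string and does not confirm that the commit exists on main.

```python
import re

# A full SHA-1 commit ID is 40 hexadecimal characters (assumed lowercase here).
FULL_SHA = re.compile(r"[0-9a-f]{40}")


def looks_like_commit_sha(revision):
    """Return True if revision has the shape of a full commit SHA, not a branch or tag name."""
    return FULL_SHA.fullmatch(revision) is not None
```

A value like main or an abbreviated hash fails this check, which is usually the mistake a reviewer wants to catch.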
Rolling back
Revert the promotion commit and push.
The promote script
The script updates the environment file (promotions/<env>.yaml) and regenerates the per-component files that ArgoCD reads. You still need to commit and push the result.
Component overrides
You can pin a single component to a different revision than the environment’s default revision.
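Conceptually, an override is a lookup with a fallback: a component uses its own pinned revision if it has one, and the environment default otherwise. The sketch below models only that rule. The dictionary shape and the function name effective_revision are assumptions for illustration and do not reflect the actual promotion file schema or the promote script.

```python
def effective_revision(default_revision, overrides, component):
    """Return the component's pinned revision, or the environment default if it has none."""
    return overrides.get(component, default_revision)


# Hypothetical component names and placeholder SHAs, for illustration only.
default = "1" * 40
overrides = {"gateway": "2" * 40}

gateway_revision = effective_revision(default, overrides, "gateway")  # the override
worker_revision = effective_revision(default, overrides, "worker")    # the default
```

Removing a component’s entry from the overrides puts it back on the environment default.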
Currently valid components
Hopefully, none of this is heavy process. It’s a few small habits (plan before building, write a good PR description, link the SHA, get a review for production) that help create a codebase where any of us can pick up the thread at any point and understand what’s running and why.
When you need a promotion reviewed and nobody is online yet, or you want a second opinion on an approach before you start, use #infrastructure on Slack. It’s a lot easier to coordinate async when the plans and PRs are well documented. When they’re not, things fall through the cracks and we make life harder for each other.

