## How alerts get to people
Prometheus evaluates alerting rules in each cluster. When something fires, Alertmanager sends it to incident.io, which decides what to do based on two things: the source (which cluster sent it) and the severity label on the alert.

| Severity | What happens | Route |
|---|---|---|
| critical | Pages on-call via Engineering On-Call escalation path | Production Alerts |
| warning | Posts to #alerts in Slack, no page | Production Warnings |
| (missing) | Posts to #alerts with no page — safety net so you know the label is missing | Unrouted Alerts |
The Unrouted Alerts route is the safety net for alerts that fire without a severity label. It posts to #alerts so someone notices and fixes the missing label. It doesn’t page anyone. If your alert has a severity value that isn’t critical or warning (like `severity=info`), it won’t match any of the three routes and will be silently dropped. Don’t do that.
Staging alerts never page anyone regardless of severity. They go to #alerts and that’s it.
Always include a `severity` label set to `critical` or `warning`. Anything else and the alert is effectively invisible.
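In a PrometheusRule, that means every rule carries the label explicitly. A minimal fragment (the alert name and expression here are made up for illustration):

```yaml
groups:
  - name: example
    rules:
      - alert: ExampleServiceDown          # hypothetical alert name
        expr: up{job="example-service"} == 0  # hypothetical expression
        for: 5m
        labels:
          severity: critical   # must be exactly "critical" or "warning"
        annotations:
          summary: "example-service target is down"
```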
## When to use which severity
**critical** — would you wake someone up for this? Then it’s critical.
- Active customer impact (elevated error rates, full outage)
- Data loss or risk of data loss
- Pods that should be running but aren’t
- Anything where waiting until morning makes it worse
**warning** — is this something we should know about but can wait?
- Elevated latency that hasn’t crossed into “customers are mad” territory
- Resource pressure (high memory, goroutine counts, connection pools filling up)
- Indicators that could become critical if left alone
- Anything you’d look at during business hours but wouldn’t lose sleep over
When in doubt, pick warning. It’s easy to promote to critical later. Going the other direction means someone already got woken up for nothing.
## Current alerts
### Custom (defined in this repo)
| Alert | Severity | What it catches | File |
|---|---|---|---|
| PodNotRunning | critical | Pod stuck in CrashLoopBackOff, ImagePullBackOff, etc. for 5m | observability/templates/prometheus-alerts.yaml |
| FrontlineErrorRateHigh | critical | 5xx error rate > 5% for 2m | observability/templates/prometheus-alerts-frontline.yaml |
| FrontlineP99LatencyHigh | critical | P99 latency > 2.5s for 5m | observability/templates/prometheus-alerts-frontline.yaml |
| FrontlineP95LatencyHigh | warning | P95 latency > 1s for 5m | observability/templates/prometheus-alerts-frontline.yaml |
| FrontlineHighActiveRequests | warning | Active requests > 100 for 2m | observability/templates/prometheus-alerts-frontline.yaml |
| FrontlineGoroutineLeak | warning | Goroutines > 1000 for 10m | observability/templates/prometheus-alerts-frontline.yaml |
### kube-prometheus-stack built-ins
We get a bunch of default alerting rules from kube-prometheus-stack. We’ve disabled the ones that don’t apply to EKS (etcd, apiserver, scheduler, controller-manager, cert rotation, RAID, bonding — all managed by AWS) and a couple we replaced with custom rules. The full list of disabled rules is in `values.yaml` under `defaultRules.disabled`.
Disabled for EKS (AWS manages these):
- All etcd alerts
- KubeAPIDown, KubeAPIErrorBudgetBurn, KubeAPITerminatedRequests
- KubeAggregatedAPIDown, KubeAggregatedAPIErrors
- KubeControllerManagerDown, KubeSchedulerDown, KubeProxyDown
- KubeClientCertificateExpiration, KubeVersionMismatch
- KubeletClient/ServerCertificateExpiration and RenewalErrors
- NodeRAIDDegraded, NodeRAIDDiskFailure, NodeBondingDegraded
- KubePodCrashLooping — replaced by PodNotRunning which excludes krane-managed customer workloads
- KubePodNotReady — was firing for customer deployment containers
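In kube-prometheus-stack’s values, built-in rules are switched off by name under `defaultRules.disabled`. A sketch of the shape with a few of the entries above — the authoritative list is in this repo’s `values.yaml`:

```yaml
defaultRules:
  disabled:
    # AWS manages the control plane on EKS, so these can never fire usefully
    KubeControllerManagerDown: true
    KubeSchedulerDown: true
    KubeProxyDown: true
    # replaced by the custom PodNotRunning rule
    KubePodCrashLooping: true
    KubePodNotReady: true
```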
#### Kubernetes — critical
#### Kubernetes — warning
| Alert | What it means | Runbook |
|---|---|---|
| KubeCPUOvercommit | Cluster CPU requests exceed capacity | link |
| KubeCPUQuotaOvercommit | CPU quota overcommitted | link |
| KubeClientErrors | API server client is seeing errors | link |
| KubeContainerWaiting | Container stuck waiting for > 1 hour. Check events, logs, and resource availability (configmaps, secrets, volumes) | link |
| KubeDaemonSetMisScheduled | DaemonSet pods landed on wrong nodes | link |
| KubeDaemonSetNotScheduled | DaemonSet pods not getting scheduled | link |
| KubeDaemonSetRolloutStuck | DaemonSet rollout stalled | link |
| KubeDeploymentGenerationMismatch | Deployment generation mismatch — possible failed rollback | link |
| KubeDeploymentReplicasMismatch | Deployment doesn’t have expected replica count | link |
| KubeDeploymentRolloutStuck | Deployment rollout not progressing | link |
| KubeHpaMaxedOut | HPA running at max replicas | link |
| KubeHpaReplicasMismatch | HPA hasn’t reached desired replicas | link |
| KubeJobFailed | Job failed. Check kubectl describe job and pod logs | link |
| KubeJobNotCompleted | Job didn’t finish in time | link |
| KubeMemoryOvercommit | Cluster memory requests exceed capacity | link |
| KubeMemoryQuotaOvercommit | Memory quota overcommitted | link |
| KubeNodeNotReady | Node not ready. Check kubectl get node $NODE -o yaml, fix or terminate the instance | link |
| KubeNodeReadinessFlapping | Node keeps flipping between ready and not ready | link |
| KubeNodeUnreachable | Node is unreachable | link |
| KubePdbNotEnoughHealthyPods | PDB doesn’t have enough healthy pods — blocks voluntary disruptions | link |
| KubePersistentVolumeFillingUp | PV filling up (warning threshold, predicted to fill in 4 days) | link |
| KubePersistentVolumeInodesFillingUp | PV inodes filling up (warning threshold) | link |
| KubeQuotaExceeded | Namespace quota exceeded | link |
| KubeStatefulSetGenerationMismatch | StatefulSet generation mismatch | link |
| KubeStatefulSetReplicasMismatch | StatefulSet doesn’t have expected replicas | link |
| KubeStatefulSetUpdateNotRolledOut | StatefulSet update hasn’t rolled out | link |
#### Kubelet — critical
| Alert | What it means | Runbook |
|---|---|---|
| KubeletDown | Kubelet target gone | link |
#### Kubelet — warning
#### Node — critical
| Alert | What it means | Runbook |
|---|---|---|
| NodeFileDescriptorLimit | Kernel predicted to run out of file descriptors soon | link |
| NodeFilesystemAlmostOutOfFiles | < 3% inodes remaining | link |
| NodeFilesystemAlmostOutOfSpace | < 3% disk space remaining | link |
| NodeFilesystemFilesFillingUp | Predicted to run out of inodes in 4 hours | link |
| NodeFilesystemSpaceFillingUp | Predicted to run out of space in 4 hours | link |
#### Node — warning
| Alert | What it means | Runbook |
|---|---|---|
| NodeClockNotSynchronising | NTP not syncing | link |
| NodeClockSkewDetected | Clock skew on node | link |
| NodeDiskIOSaturation | Disk IO queue is high | link |
| NodeFileDescriptorLimit | FD limit approaching (warning threshold) | link |
| NodeFilesystemAlmostOutOfFiles | < 5% inodes remaining | link |
| NodeFilesystemAlmostOutOfSpace | < 5% disk space remaining | link |
| NodeFilesystemFilesFillingUp | Predicted to run out of inodes in 24 hours | link |
| NodeFilesystemSpaceFillingUp | Predicted to run out of space in 24 hours | link |
| NodeHighNumberConntrackEntriesUsed | Conntrack table getting full | link |
| NodeMemoryHighUtilization | Node running low on memory | link |
| NodeMemoryMajorPagesFaults | Heavy major page faults — something is swapping hard | link |
| NodeNetworkInterfaceFlapping | NIC keeps going up and down | link |
| NodeNetworkReceiveErrs | Lots of receive errors on a NIC | link |
| NodeNetworkTransmitErrs | Lots of transmit errors on a NIC | link |
| NodeSystemSaturation | Load per core is very high | link |
| NodeSystemdServiceCrashlooping | A systemd service keeps restarting | link |
| NodeSystemdServiceFailed | A systemd service has failed | link |
| NodeTextFileCollectorScrapeError | Node exporter text file collector failed | link |
#### Alertmanager — critical
| Alert | What it means | Runbook |
|---|---|---|
| AlertmanagerClusterCrashlooping | Half or more Alertmanager instances crashlooping | link |
| AlertmanagerClusterDown | Half or more Alertmanager instances down | link |
| AlertmanagerClusterFailedToSendAlerts | Failed to send to a critical integration | link |
| AlertmanagerConfigInconsistent | Alertmanager instances have different configs | link |
| AlertmanagerFailedReload | Config reload failed | link |
| AlertmanagerMembersInconsistent | Cluster member can’t find other members | link |
#### Alertmanager — warning
#### Prometheus — critical
| Alert | What it means | Runbook |
|---|---|---|
| PrometheusBadConfig | Config reload failed. Check kubectl logs on the prometheus pod | link |
| PrometheusErrorSendingAlertsToAnyAlertmanager | > 3% errors sending alerts to all Alertmanagers | link |
| PrometheusRemoteStorageFailures | Failing to send samples to remote storage | link |
| PrometheusRemoteWriteBehind | Remote write is falling behind | link |
| PrometheusRuleFailures | Rule evaluations failing | link |
| PrometheusTargetSyncFailure | Target sync failed | link |
#### Prometheus — warning
| Alert | What it means | Runbook |
|---|---|---|
| PrometheusNotConnectedToAlertmanagers | Can’t reach any Alertmanagers | link |
| PrometheusNotIngestingSamples | Not ingesting samples | link |
| PrometheusHighQueryLoad | Hitting max concurrent query capacity | link |
| PrometheusDuplicateTimestamps | Dropping samples with duplicate timestamps | link |
| PrometheusOutOfOrderTimestamps | Dropping out-of-order samples | link |
| PrometheusNotificationQueueRunningFull | Alert queue predicted to fill up within 30m | link |
| PrometheusErrorSendingAlertsToSomeAlertmanagers | Errors sending to some (not all) Alertmanagers | link |
| PrometheusRemoteWriteDesiredShards | Remote write wants more shards than configured | link |
| PrometheusTSDBCompactionsFailing | Block compaction failing | link |
| PrometheusTSDBReloadsFailing | Block reload failing | link |
| PrometheusKubernetesListWatchFailures | SD list/watch requests failing | link |
| PrometheusMissingRuleEvaluations | Rule group evaluation too slow, skipping evaluations | link |
| PrometheusSDRefreshFailure | Service discovery refresh failing | link |
| PrometheusLabelLimitHit | Dropping targets that exceed label limits | link |
| PrometheusTargetLimitHit | Dropping targets that exceed target limits | link |
| PrometheusScrapeBodySizeLimitHit | Dropping targets that exceed body size limit | link |
| PrometheusScrapeSampleLimitHit | Dropping scrapes that exceed sample limit | link |
#### Prometheus Operator — warning
| Alert | What it means | Runbook |
|---|---|---|
| ConfigReloaderSidecarErrors | Config reloader sidecar failing for 10m | link |
| PrometheusOperatorListErrors | List operation errors | link |
| PrometheusOperatorWatchErrors | Watch operation errors | link |
| PrometheusOperatorSyncFailed | Last reconciliation failed | link |
| PrometheusOperatorReconcileErrors | Reconciliation errors | link |
| PrometheusOperatorNodeLookupErrors | Node lookup errors during reconciliation | link |
| PrometheusOperatorNotReady | Operator not ready | link |
| PrometheusOperatorRejectedResources | Resources rejected by operator | link |
| PrometheusOperatorStatusUpdateErrors | Status update errors | link |
#### kube-state-metrics — critical
#### General
| Alert | Severity | What it means | Runbook |
|---|---|---|---|
| TargetDown | warning | A Prometheus scrape target is unreachable. Check /targets in Prometheus UI, verify ServiceMonitor config and network policies | link |
| Watchdog | none | Always-firing deadman switch — if this stops, Alertmanager is broken | link |
| InfoInhibitor | none | Suppresses info-level alerts when higher-severity alerts are already firing for the same target | link |
#### Info
These don’t page or post to Slack. They exist for dashboards and as context when other alerts are firing.

| Alert | What it means | Runbook |
|---|---|---|
| CPUThrottlingHigh | Processes getting CPU-throttled | link |
| KubeNodeEviction | Node is evicting pods | link |
| KubeNodePressure | Node has an active pressure condition (memory, disk, PID) | link |
| KubeQuotaAlmostFull | Namespace quota approaching limit | link |
| KubeQuotaFullyUsed | Namespace quota fully consumed | link |
| KubeletTooManyPods | Kubelet running at pod capacity | link |
| NodeCPUHighUsage | High CPU usage on node | link |
## Adding a new alert
Alerts live in PrometheusRule manifests under `eks-cluster/helm-chart/observability/templates/`. Add yours to an existing file if it fits, or create a new `prometheus-alerts-<service>.yaml` if you’re adding a group for a new service.
### Things to keep in mind
- Always set `severity` to `critical` or `warning`. No severity = no routing.
- Exclude customer workloads if your alert is pod-level. Add `container!="deployment"` to the metric selector and use `unless on(namespace, pod) kube_pod_labels{label_app_kubernetes_io_managed_by="krane"}` to skip krane-managed pods. We don’t want to page ourselves for a customer’s broken container.
- Include region in the summary when the data has it. Getting paged with “SomethingBad (us-east-1 / api)” is a lot more useful than just “SomethingBad.”
- Set a reasonable `for` duration. Too short and you get flapping alerts. Too long and you find out late. 2-5 minutes is a good starting point for most things.
- Think about staging vs. production. The `environment` external label is available on all metrics (`production001` or `staging`). If you want different severity per environment, write two rules with different `environment` filters — one critical for production, one warning for staging.
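Putting the checklist together, here is a sketch of a new pod-level alert. The alert name, metric, and threshold are illustrative only; the krane exclusion is the selector described above, and the `region` label in the summary assumes the metric carries one:

```yaml
# Sketch only: MyServiceRestartingOften and its threshold are hypothetical.
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: prometheus-alerts-myservice   # hypothetical file/manifest name
spec:
  groups:
    - name: myservice
      rules:
        - alert: MyServiceRestartingOften
          expr: |
            increase(kube_pod_container_status_restarts_total{container!="deployment"}[10m]) > 3
            unless on(namespace, pod)
              kube_pod_labels{label_app_kubernetes_io_managed_by="krane"}
          for: 5m                  # long enough to avoid flapping
          labels:
            severity: warning      # start at warning; promote to critical later if warranted
          annotations:
            summary: "MyService pods restarting ({{ $labels.region }} / {{ $labels.pod }})"
```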

