# Alerting

> Alert routing, severities, and adding rules.

How our alerts work, what severity means, and how to add new ones without making the on-call person's life worse.

## How alerts get to people

Prometheus evaluates alerting rules in each cluster. When something fires, Alertmanager sends it to incident.io, which decides what to do based on two things: the **source** (which cluster sent it) and the **severity label** on the alert.

| Severity    | What happens                                                                | Route               |
| ----------- | --------------------------------------------------------------------------- | ------------------- |
| `critical`  | Pages on-call via Engineering On-Call escalation path                       | Production Alerts   |
| `warning`   | Posts to #alerts in Slack, no page                                          | Production Warnings |
| *(missing)* | Posts to #alerts with no page — safety net so you know the label is missing | Unrouted Alerts     |

There's a catch-all route ("Unrouted Alerts") that picks up Alertmanager alerts with no `severity` label. It posts to #alerts so someone notices and fixes the missing label. It doesn't page anyone. If your alert has a severity value that isn't `critical` or `warning` (like `severity=info`), it won't match any of the three routes and will be silently dropped. Don't do that.

Staging alerts never page anyone regardless of severity. They go to #alerts and that's it.
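The routing rules above can be read as a small decision function. The following is only a sketch of the behavior described here, not the actual incident.io configuration: `production_route`, `pages_on_call`, and the `"production"` / `"staging"` source names are hypothetical names chosen for illustration.

```python
from typing import Optional, Tuple

# Route name and whether it pages, keyed by severity (from the table above).
PRODUCTION_ROUTES = {
    "critical": ("Production Alerts", True),
    "warning": ("Production Warnings", False),
}


def production_route(labels: dict) -> Optional[Tuple[str, bool]]:
    """Return (route name, pages on-call), or None if the alert is dropped."""
    severity = labels.get("severity")
    if not severity:
        # Missing label (an empty value counts as missing in Prometheus):
        # the catch-all posts to #alerts without paging.
        return ("Unrouted Alerts", False)
    # Any value other than critical/warning matches no route and is dropped.
    return PRODUCTION_ROUTES.get(severity)


def pages_on_call(source: str, labels: dict) -> bool:
    """Only production alerts with severity=critical page anyone."""
    if source != "production":
        return False  # staging never pages, regardless of severity
    matched = production_route(labels)
    return matched is not None and matched[1]


if __name__ == "__main__":
    for labels in ({"severity": "critical"}, {"severity": "info"}, {}):
        print(labels, "->", production_route(labels))
```

Note that `severity=info` returns `None`: nothing matches, so nobody sees the alert.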

**Always include a `severity` label set to `critical` or `warning`.** A missing label only reaches #alerts through the catch-all route, and any other value means the alert is silently dropped.
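One way to catch a bad label before it ships is a check over the parsed rule files. This is a minimal sketch, assuming the rules have already been rendered and loaded into the usual Prometheus `groups` / `rules` structure (for example with a YAML parser). `find_bad_severities` and the sample rule names are hypothetical and do not exist in this repo.

```python
ALLOWED_SEVERITIES = {"critical", "warning"}


def find_bad_severities(rule_file: dict) -> list:
    """List alerting rules whose severity label is missing or not routable."""
    problems = []
    for group in rule_file.get("groups", []):
        for rule in group.get("rules", []):
            name = rule.get("alert")
            if name is None:
                continue  # recording rules use "record" and need no severity
            severity = (rule.get("labels") or {}).get("severity")
            if severity not in ALLOWED_SEVERITIES:
                problems.append(f"{name}: severity={severity!r}")
    return problems


if __name__ == "__main__":
    # Hypothetical rules, only to show the shape of the input.
    sample = {
        "groups": [
            {
                "name": "example",
                "rules": [
                    {"alert": "ExampleOk", "labels": {"severity": "warning"}},
                    {"alert": "ExampleInfo", "labels": {"severity": "info"}},
                    {"alert": "ExampleMissing"},
                ],
            }
        ]
    }
    for problem in find_bad_severities(sample):
        print(problem)
```

A check like this could run in CI so that a mislabeled rule fails the build instead of disappearing at routing time.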

## When to use which severity

**`critical`** — would you wake someone up for this? Then it's critical.

* Active customer impact (elevated error rates, full outage)
* Data loss or risk of data loss
* Pods that should be running but aren't
* Anything where waiting until morning makes it worse

**`warning`** — is this something we should know about but can wait?

* Elevated latency that hasn't crossed into "customers are mad" territory
* Resource pressure (high memory, goroutine counts, connection pools filling up)
* Indicators that *could* become critical if left alone
* Anything you'd look at during business hours but wouldn't lose sleep over

If you're not sure, start with `warning`. It's easy to promote to `critical` later. Going the other direction means someone already got woken up for nothing.

## Current alerts

### Custom (defined in this repo)

| Alert                               | Severity | What it catches                                                       | File                                                       |
| ----------------------------------- | -------- | --------------------------------------------------------------------- | ---------------------------------------------------------- |
| PodNotRunning                       | critical | Pod stuck in `CrashLoopBackOff`, `ImagePullBackOff`, etc. for 5m      | `observability/templates/prometheus-alerts.yaml`           |
| **Frontline**                       |          |                                                                       |                                                            |
| FrontlinePlatformErrorRateHigh      | warning  | Platform (our) error rate > 2% for 2m — excludes customer/user errors | `observability/templates/prometheus-alerts-frontline.yaml` |
| FrontlineRoutingErrorsHigh          | warning  | Sustained routing errors > 0.5/s for 2m                               | `observability/templates/prometheus-alerts-frontline.yaml` |
| FrontlineProxyConnectionFailures    | warning  | Connection failures (timeout, refused, reset, DNS) > 5% for 2m        | `observability/templates/prometheus-alerts-frontline.yaml` |
| FrontlineSentinel5xxRate            | warning  | Sentinel-sourced 5xx > 5% for 2m (sentinel itself is broken)          | `observability/templates/prometheus-alerts-frontline.yaml` |
| FrontlineP99LatencyHigh             | warning  | P99 latency > 2.5s for 5m                                             | `observability/templates/prometheus-alerts-frontline.yaml` |
| FrontlineP95LatencyHigh             | warning  | P95 latency > 1s for 5m                                               | `observability/templates/prometheus-alerts-frontline.yaml` |
| FrontlineRoutingLatencyHigh         | warning  | Routing P99 > 1.5s for 5m                                             | `observability/templates/prometheus-alerts-frontline.yaml` |
| FrontlineBackendLatencyHigh         | warning  | Backend P99 > 5s for 5m                                               | `observability/templates/prometheus-alerts-frontline.yaml` |
| FrontlineHighActiveRequests         | warning  | Active requests > 100 for 2m                                          | `observability/templates/prometheus-alerts-frontline.yaml` |
| FrontlineExcessiveHops              | warning  | P99 cross-region hops >= 2 for 5m                                     | `observability/templates/prometheus-alerts-frontline.yaml` |
| FrontlineGoroutineLeak              | warning  | Goroutines > 1000 for 10m                                             | `observability/templates/prometheus-alerts-frontline.yaml` |
| **Sentinel**                        |          |                                                                       |                                                            |
| SentinelPlatformErrorRateHigh       | warning  | Platform (our) error rate > 2% for 2m — excludes customer/user errors | `observability/templates/prometheus-alerts-sentinel.yaml`  |
| SentinelProxyErrorsHigh             | warning  | Proxy errors > 1/s for 2m (excl. client cancellations)                | `observability/templates/prometheus-alerts-sentinel.yaml`  |
| SentinelEngineEvaluationErrors      | warning  | Policy engine error rate > 1% for 3m                                  | `observability/templates/prometheus-alerts-sentinel.yaml`  |
| SentinelRoutingFailures             | warning  | Deployment/instance routing failures > 0.5/s for 3m                   | `observability/templates/prometheus-alerts-sentinel.yaml`  |
| SentinelP99LatencyHigh              | warning  | P99 latency > 2.5s for 5m                                             | `observability/templates/prometheus-alerts-sentinel.yaml`  |
| SentinelP95LatencyHigh              | warning  | P95 latency > 1s for 5m                                               | `observability/templates/prometheus-alerts-sentinel.yaml`  |
| SentinelEngineEvaluationLatencyHigh | warning  | Policy eval P99 > 1.5s for 5m                                         | `observability/templates/prometheus-alerts-sentinel.yaml`  |
| SentinelRoutingLatencyHigh          | warning  | Routing P99 > 1.5s for 5m                                             | `observability/templates/prometheus-alerts-sentinel.yaml`  |
| SentinelUpstreamLatencyHigh         | warning  | Upstream P99 > 10s for 5m                                             | `observability/templates/prometheus-alerts-sentinel.yaml`  |
| SentinelHighActiveRequests          | warning  | Active requests > 100 for 2m                                          | `observability/templates/prometheus-alerts-sentinel.yaml`  |
| SentinelNoRunningInstancesSustained | warning  | "No running instances" errors > 0.5/s for 5m                          | `observability/templates/prometheus-alerts-sentinel.yaml`  |
| SentinelGoroutineLeak               | warning  | Goroutines > 1000 for 10m                                             | `observability/templates/prometheus-alerts-sentinel.yaml`  |

### kube-prometheus-stack built-ins

We get a bunch of default alerting rules from kube-prometheus-stack. We've disabled the ones that don't apply to EKS (etcd, apiserver, scheduler, controller-manager, cert rotation, RAID, bonding — all managed by AWS) and a couple we replaced with custom rules. The full list of disabled rules is in `values.yaml` under `defaultRules.disabled`.

Disabled for EKS (AWS manages these):

* All etcd alerts
* KubeAPIDown, KubeAPIErrorBudgetBurn, KubeAPITerminatedRequests
* KubeAggregatedAPIDown, KubeAggregatedAPIErrors
* KubeControllerManagerDown, KubeSchedulerDown, KubeProxyDown
* KubeClientCertificateExpiration, KubeVersionMismatch
* KubeletClientCertificateExpiration, KubeletServerCertificateExpiration
* KubeletClientCertificateRenewalErrors, KubeletServerCertificateRenewalErrors
* NodeRAIDDegraded, NodeRAIDDiskFailure, NodeBondingDegraded

Disabled because we replaced them:

* **KubePodCrashLooping** — replaced by PodNotRunning which excludes krane-managed customer workloads
* **KubePodNotReady** — was firing for customer deployment containers

Everything below is what's still active. These come from upstream [kubernetes-mixin](https://github.com/kubernetes-monitoring/kubernetes-mixin). We didn't write them, but they fire in our clusters and you should know what they are. Runbook links are included where available.

#### Kubernetes — critical

| Alert                               | What it means                                                                                 | Runbook                                                                                                  |
| ----------------------------------- | --------------------------------------------------------------------------------------------- | -------------------------------------------------------------------------------------------------------- |
| KubePersistentVolumeErrors          | PV provisioning is broken                                                                     | [link](https://runbooks.prometheus-operator.dev/runbooks/kubernetes/kubepersistentvolumeerrors)          |
| KubePersistentVolumeFillingUp       | PV has \< 3% space left. Check if storageclass allows expansion, resize PVC, or clean up data | [link](https://runbooks.prometheus-operator.dev/runbooks/kubernetes/kubepersistentvolumefillingup)       |
| KubePersistentVolumeInodesFillingUp | PV has \< 3% inodes left                                                                      | [link](https://runbooks.prometheus-operator.dev/runbooks/kubernetes/kubepersistentvolumeinodesfillingup) |

#### Kubernetes — warning

| Alert                               | What it means                                                                                                      | Runbook                                                                                                  |
| ----------------------------------- | ------------------------------------------------------------------------------------------------------------------ | -------------------------------------------------------------------------------------------------------- |
| KubeCPUOvercommit                   | Cluster CPU requests exceed capacity                                                                               | [link](https://runbooks.prometheus-operator.dev/runbooks/kubernetes/kubecpuovercommit)                   |
| KubeCPUQuotaOvercommit              | CPU quota overcommitted                                                                                            | [link](https://runbooks.prometheus-operator.dev/runbooks/kubernetes/kubecpuquotaovercommit)              |
| KubeClientErrors                    | API server client is seeing errors                                                                                 | [link](https://runbooks.prometheus-operator.dev/runbooks/kubernetes/kubeclienterrors)                    |
| KubeContainerWaiting                | Container stuck waiting for > 1 hour. Check events, logs, and resource availability (configmaps, secrets, volumes) | [link](https://runbooks.prometheus-operator.dev/runbooks/kubernetes/kubecontainerwaiting)                |
| KubeDaemonSetMisScheduled           | DaemonSet pods landed on wrong nodes                                                                               | [link](https://runbooks.prometheus-operator.dev/runbooks/kubernetes/kubedaemonsetmisscheduled)           |
| KubeDaemonSetNotScheduled           | DaemonSet pods not getting scheduled                                                                               | [link](https://runbooks.prometheus-operator.dev/runbooks/kubernetes/kubedaemonsetnotscheduled)           |
| KubeDaemonSetRolloutStuck           | DaemonSet rollout stalled                                                                                          | [link](https://runbooks.prometheus-operator.dev/runbooks/kubernetes/kubedaemonsetrolloutstuck)           |
| KubeDeploymentGenerationMismatch    | Deployment generation mismatch — possible failed rollback                                                          | [link](https://runbooks.prometheus-operator.dev/runbooks/kubernetes/kubedeploymentgenerationmismatch)    |
| KubeDeploymentReplicasMismatch      | Deployment doesn't have expected replica count                                                                     | [link](https://runbooks.prometheus-operator.dev/runbooks/kubernetes/kubedeploymentreplicasmismatch)      |
| KubeDeploymentRolloutStuck          | Deployment rollout not progressing                                                                                 | [link](https://runbooks.prometheus-operator.dev/runbooks/kubernetes/kubedeploymentrolloutstuck)          |
| KubeHpaMaxedOut                     | HPA running at max replicas                                                                                        | [link](https://runbooks.prometheus-operator.dev/runbooks/kubernetes/kubehpamaxedout)                     |
| KubeHpaReplicasMismatch             | HPA hasn't reached desired replicas                                                                                | [link](https://runbooks.prometheus-operator.dev/runbooks/kubernetes/kubehpareplicasmismatch)             |
| KubeJobFailed                       | Job failed. Check `kubectl describe job` and pod logs                                                              | [link](https://runbooks.prometheus-operator.dev/runbooks/kubernetes/kubejobfailed)                       |
| KubeJobNotCompleted                 | Job didn't finish in time                                                                                          | [link](https://runbooks.prometheus-operator.dev/runbooks/kubernetes/kubejobnotcompleted)                 |
| KubeMemoryOvercommit                | Cluster memory requests exceed capacity                                                                            | [link](https://runbooks.prometheus-operator.dev/runbooks/kubernetes/kubememoryovercommit)                |
| KubeMemoryQuotaOvercommit           | Memory quota overcommitted                                                                                         | [link](https://runbooks.prometheus-operator.dev/runbooks/kubernetes/kubememoryquotaovercommit)           |
| KubeNodeNotReady                    | Node not ready. Check `kubectl get node $NODE -o yaml`, fix or terminate the instance                              | [link](https://runbooks.prometheus-operator.dev/runbooks/kubernetes/kubenodenotready)                    |
| KubeNodeReadinessFlapping           | Node keeps flipping between ready and not ready                                                                    | [link](https://runbooks.prometheus-operator.dev/runbooks/kubernetes/kubenodereadinessflapping)           |
| KubeNodeUnreachable                 | Node is unreachable                                                                                                | [link](https://runbooks.prometheus-operator.dev/runbooks/kubernetes/kubenodeunreachable)                 |
| KubePdbNotEnoughHealthyPods         | PDB doesn't have enough healthy pods — blocks voluntary disruptions                                                | [link](https://runbooks.prometheus-operator.dev/runbooks/kubernetes/kubepdbnotenoughhealthypods)         |
| KubePersistentVolumeFillingUp       | PV filling up (warning threshold, predicted to fill in 4 days)                                                     | [link](https://runbooks.prometheus-operator.dev/runbooks/kubernetes/kubepersistentvolumefillingup)       |
| KubePersistentVolumeInodesFillingUp | PV inodes filling up (warning threshold)                                                                           | [link](https://runbooks.prometheus-operator.dev/runbooks/kubernetes/kubepersistentvolumeinodesfillingup) |
| KubeQuotaExceeded                   | Namespace quota exceeded                                                                                           | [link](https://runbooks.prometheus-operator.dev/runbooks/kubernetes/kubequotaexceeded)                   |
| KubeStatefulSetGenerationMismatch   | StatefulSet generation mismatch                                                                                    | [link](https://runbooks.prometheus-operator.dev/runbooks/kubernetes/kubestatefulsetgenerationmismatch)   |
| KubeStatefulSetReplicasMismatch     | StatefulSet doesn't have expected replicas                                                                         | [link](https://runbooks.prometheus-operator.dev/runbooks/kubernetes/kubestatefulsetreplicasmismatch)     |
| KubeStatefulSetUpdateNotRolledOut   | StatefulSet update hasn't rolled out                                                                               | [link](https://runbooks.prometheus-operator.dev/runbooks/kubernetes/kubestatefulsetupdatenotrolledout)   |

#### Kubelet — critical

| Alert       | What it means       | Runbook                                                                          |
| ----------- | ------------------- | -------------------------------------------------------------------------------- |
| KubeletDown | Kubelet target gone | [link](https://runbooks.prometheus-operator.dev/runbooks/kubernetes/kubeletdown) |

#### Kubelet — warning

| Alert                        | What it means                                 | Runbook                                                                                           |
| ---------------------------- | --------------------------------------------- | ------------------------------------------------------------------------------------------------- |
| KubeletPlegDurationHigh      | Pod lifecycle event generator taking too long | [link](https://runbooks.prometheus-operator.dev/runbooks/kubernetes/kubeletplegdurationhigh)      |
| KubeletPodStartUpLatencyHigh | Pods taking too long to start                 | [link](https://runbooks.prometheus-operator.dev/runbooks/kubernetes/kubeletpodstartuplatencyhigh) |

#### Node — critical

| Alert                          | What it means                                        | Runbook                                                                                       |
| ------------------------------ | ---------------------------------------------------- | --------------------------------------------------------------------------------------------- |
| NodeFileDescriptorLimit        | Kernel predicted to run out of file descriptors soon | [link](https://runbooks.prometheus-operator.dev/runbooks/node/nodefiledescriptorlimit)        |
| NodeFilesystemAlmostOutOfFiles | \< 3% inodes remaining                               | [link](https://runbooks.prometheus-operator.dev/runbooks/node/nodefilesystemalmostoutoffiles) |
| NodeFilesystemAlmostOutOfSpace | \< 3% disk space remaining                           | [link](https://runbooks.prometheus-operator.dev/runbooks/node/nodefilesystemalmostoutofspace) |
| NodeFilesystemFilesFillingUp   | Predicted to run out of inodes in 4 hours            | [link](https://runbooks.prometheus-operator.dev/runbooks/node/nodefilesystemfilesfillingup)   |
| NodeFilesystemSpaceFillingUp   | Predicted to run out of space in 4 hours             | [link](https://runbooks.prometheus-operator.dev/runbooks/node/nodefilesystemspacefillingup)   |

#### Node — warning

| Alert                              | What it means                                        | Runbook                                                                                           |
| ---------------------------------- | ---------------------------------------------------- | ------------------------------------------------------------------------------------------------- |
| NodeClockNotSynchronising          | NTP not syncing                                      | [link](https://runbooks.prometheus-operator.dev/runbooks/node/nodeclocknotsynchronising)          |
| NodeClockSkewDetected              | Clock skew on node                                   | [link](https://runbooks.prometheus-operator.dev/runbooks/node/nodeclockskewdetected)              |
| NodeDiskIOSaturation               | Disk IO queue is high                                | [link](https://runbooks.prometheus-operator.dev/runbooks/node/nodediskiosaturation)               |
| NodeFileDescriptorLimit            | FD limit approaching (warning threshold)             | [link](https://runbooks.prometheus-operator.dev/runbooks/node/nodefiledescriptorlimit)            |
| NodeFilesystemAlmostOutOfFiles     | \< 5% inodes remaining                               | [link](https://runbooks.prometheus-operator.dev/runbooks/node/nodefilesystemalmostoutoffiles)     |
| NodeFilesystemAlmostOutOfSpace     | \< 5% disk space remaining                           | [link](https://runbooks.prometheus-operator.dev/runbooks/node/nodefilesystemalmostoutofspace)     |
| NodeFilesystemFilesFillingUp       | Predicted to run out of inodes in 24 hours           | [link](https://runbooks.prometheus-operator.dev/runbooks/node/nodefilesystemfilesfillingup)       |
| NodeFilesystemSpaceFillingUp       | Predicted to run out of space in 24 hours            | [link](https://runbooks.prometheus-operator.dev/runbooks/node/nodefilesystemspacefillingup)       |
| NodeHighNumberConntrackEntriesUsed | Conntrack table getting full                         | [link](https://runbooks.prometheus-operator.dev/runbooks/node/nodehighnumberconntrackentriesused) |
| NodeMemoryHighUtilization          | Node running low on memory                           | [link](https://runbooks.prometheus-operator.dev/runbooks/node/nodememoryhighutilization)          |
| NodeMemoryMajorPagesFaults         | Heavy major page faults — something is swapping hard | [link](https://runbooks.prometheus-operator.dev/runbooks/node/nodememorymajorpagesfaults)         |
| NodeNetworkInterfaceFlapping       | NIC keeps going up and down                          | [link](https://runbooks.prometheus-operator.dev/runbooks/general/nodenetworkinterfaceflapping)    |
| NodeNetworkReceiveErrs             | Lots of receive errors on a NIC                      | [link](https://runbooks.prometheus-operator.dev/runbooks/node/nodenetworkreceiveerrs)             |
| NodeNetworkTransmitErrs            | Lots of transmit errors on a NIC                     | [link](https://runbooks.prometheus-operator.dev/runbooks/node/nodenetworktransmiterrs)            |
| NodeSystemSaturation               | Load per core is very high                           | [link](https://runbooks.prometheus-operator.dev/runbooks/node/nodesystemsaturation)               |
| NodeSystemdServiceCrashlooping     | A systemd service keeps restarting                   | [link](https://runbooks.prometheus-operator.dev/runbooks/node/nodesystemdservicecrashlooping)     |
| NodeSystemdServiceFailed           | A systemd service has failed                         | [link](https://runbooks.prometheus-operator.dev/runbooks/node/nodesystemdservicefailed)           |
| NodeTextFileCollectorScrapeError   | Node exporter text file collector failed             | [link](https://runbooks.prometheus-operator.dev/runbooks/node/nodetextfilecollectorscrapeerror)   |

#### Alertmanager — critical

| Alert                                 | What it means                                    | Runbook                                                                                                      |
| ------------------------------------- | ------------------------------------------------ | ------------------------------------------------------------------------------------------------------------ |
| AlertmanagerClusterCrashlooping       | Half or more Alertmanager instances crashlooping | [link](https://runbooks.prometheus-operator.dev/runbooks/alertmanager/alertmanagerclustercrashlooping)       |
| AlertmanagerClusterDown               | Half or more Alertmanager instances down         | [link](https://runbooks.prometheus-operator.dev/runbooks/alertmanager/alertmanagerclusterdown)               |
| AlertmanagerClusterFailedToSendAlerts | Failed to send to a critical integration         | [link](https://runbooks.prometheus-operator.dev/runbooks/alertmanager/alertmanagerclusterfailedtosendalerts) |
| AlertmanagerConfigInconsistent        | Alertmanager instances have different configs    | [link](https://runbooks.prometheus-operator.dev/runbooks/alertmanager/alertmanagerconfiginconsistent)        |
| AlertmanagerFailedReload              | Config reload failed                             | [link](https://runbooks.prometheus-operator.dev/runbooks/alertmanager/alertmanagerfailedreload)              |
| AlertmanagerMembersInconsistent       | Cluster member can't find other members          | [link](https://runbooks.prometheus-operator.dev/runbooks/alertmanager/alertmanagermembersinconsistent)       |

#### Alertmanager — warning

| Alert                                 | What it means                                | Runbook                                                                                                      |
| ------------------------------------- | -------------------------------------------- | ------------------------------------------------------------------------------------------------------------ |
| AlertmanagerClusterFailedToSendAlerts | Failed to send to a non-critical integration | [link](https://runbooks.prometheus-operator.dev/runbooks/alertmanager/alertmanagerclusterfailedtosendalerts) |
| AlertmanagerFailedToSendAlerts        | An instance failed to send notifications     | [link](https://runbooks.prometheus-operator.dev/runbooks/alertmanager/alertmanagerfailedtosendalerts)        |

#### Prometheus — critical

| Alert                                         | What it means                                                    | Runbook                                                                                                            |
| --------------------------------------------- | ---------------------------------------------------------------- | ------------------------------------------------------------------------------------------------------------------ |
| PrometheusBadConfig                           | Config reload failed. Check `kubectl logs` on the prometheus pod | [link](https://runbooks.prometheus-operator.dev/runbooks/prometheus/prometheusbadconfig)                           |
| PrometheusErrorSendingAlertsToAnyAlertmanager | > 3% errors sending alerts to all Alertmanagers                  | [link](https://runbooks.prometheus-operator.dev/runbooks/prometheus/prometheuserrorsendingalertstoanyalertmanager) |
| PrometheusRemoteStorageFailures               | Failing to send samples to remote storage                        | [link](https://runbooks.prometheus-operator.dev/runbooks/prometheus/prometheusremotestoragefailures)               |
| PrometheusRemoteWriteBehind                   | Remote write is falling behind                                   | [link](https://runbooks.prometheus-operator.dev/runbooks/prometheus/prometheusremotewritebehind)                   |
| PrometheusRuleFailures                        | Rule evaluations failing                                         | [link](https://runbooks.prometheus-operator.dev/runbooks/prometheus/prometheusrulefailures)                        |
| PrometheusTargetSyncFailure                   | Target sync failed                                               | [link](https://runbooks.prometheus-operator.dev/runbooks/prometheus/prometheustargetsyncfailure)                   |

#### Prometheus — warning

| Alert                                           | What it means                                        | Runbook                                                                                                              |
| ----------------------------------------------- | ---------------------------------------------------- | -------------------------------------------------------------------------------------------------------------------- |
| PrometheusNotConnectedToAlertmanagers           | Can't reach any Alertmanagers                        | [link](https://runbooks.prometheus-operator.dev/runbooks/prometheus/prometheusnotconnectedtoalertmanagers)           |
| PrometheusNotIngestingSamples                   | Not ingesting samples                                | [link](https://runbooks.prometheus-operator.dev/runbooks/prometheus/prometheusnotingestingsamples)                   |
| PrometheusHighQueryLoad                         | Hitting max concurrent query capacity                | [link](https://runbooks.prometheus-operator.dev/runbooks/prometheus/prometheushighqueryload)                         |
| PrometheusDuplicateTimestamps                   | Dropping samples with duplicate timestamps           | [link](https://runbooks.prometheus-operator.dev/runbooks/prometheus/prometheusduplicatetimestamps)                   |
| PrometheusOutOfOrderTimestamps                  | Dropping out-of-order samples                        | [link](https://runbooks.prometheus-operator.dev/runbooks/prometheus/prometheusoutofordertimestamps)                  |
| PrometheusNotificationQueueRunningFull          | Alert queue predicted to fill up within 30m          | [link](https://runbooks.prometheus-operator.dev/runbooks/prometheus/prometheusnotificationqueuerunningfull)          |
| PrometheusErrorSendingAlertsToSomeAlertmanagers | Errors sending to some (not all) Alertmanagers       | [link](https://runbooks.prometheus-operator.dev/runbooks/prometheus/prometheuserrorsendingalertstosomealertmanagers) |
| PrometheusRemoteWriteDesiredShards              | Remote write wants more shards than configured       | [link](https://runbooks.prometheus-operator.dev/runbooks/prometheus/prometheusremotewritedesiredshards)              |
| PrometheusTSDBCompactionsFailing                | Block compaction failing                             | [link](https://runbooks.prometheus-operator.dev/runbooks/prometheus/prometheustsdbcompactionsfailing)                |
| PrometheusTSDBReloadsFailing                    | Block reload failing                                 | [link](https://runbooks.prometheus-operator.dev/runbooks/prometheus/prometheustsdbreloadsfailing)                    |
| PrometheusKubernetesListWatchFailures           | SD list/watch requests failing                       | [link](https://runbooks.prometheus-operator.dev/runbooks/prometheus/prometheuskuberneteslistwatchfailures)           |
| PrometheusMissingRuleEvaluations                | Rule group evaluation too slow, skipping evaluations | [link](https://runbooks.prometheus-operator.dev/runbooks/prometheus/prometheusmissingruleevaluations)                |
| PrometheusSDRefreshFailure                      | Service discovery refresh failing                    | [link](https://runbooks.prometheus-operator.dev/runbooks/prometheus/prometheussdrefreshfailure)                      |
| PrometheusLabelLimitHit                         | Dropping targets that exceed label limits            | [link](https://runbooks.prometheus-operator.dev/runbooks/prometheus/prometheuslabellimithit)                         |
| PrometheusTargetLimitHit                        | Dropping targets that exceed target limits           | [link](https://runbooks.prometheus-operator.dev/runbooks/prometheus/prometheustargetlimithit)                        |
| PrometheusScrapeBodySizeLimitHit                | Dropping targets that exceed body size limit         | [link](https://runbooks.prometheus-operator.dev/runbooks/prometheus/prometheusscrapebodysizelimithit)                |
| PrometheusScrapeSampleLimitHit                  | Dropping scrapes that exceed sample limit            | [link](https://runbooks.prometheus-operator.dev/runbooks/prometheus/prometheusscrapesamplelimithit)                  |

#### Prometheus Operator — warning

| Alert                                | What it means                            | Runbook                                                                                                            |
| ------------------------------------ | ---------------------------------------- | ------------------------------------------------------------------------------------------------------------------ |
| ConfigReloaderSidecarErrors          | Config reloader sidecar failing for 10m  | [link](https://runbooks.prometheus-operator.dev/runbooks/prometheus-operator/configreloadersidecarerrors)          |
| PrometheusOperatorListErrors         | List operation errors                    | [link](https://runbooks.prometheus-operator.dev/runbooks/prometheus-operator/prometheusoperatorlisterrors)         |
| PrometheusOperatorWatchErrors        | Watch operation errors                   | [link](https://runbooks.prometheus-operator.dev/runbooks/prometheus-operator/prometheusoperatorwatcherrors)        |
| PrometheusOperatorSyncFailed         | Last reconciliation failed               | [link](https://runbooks.prometheus-operator.dev/runbooks/prometheus-operator/prometheusoperatorsyncfailed)         |
| PrometheusOperatorReconcileErrors    | Reconciliation errors                    | [link](https://runbooks.prometheus-operator.dev/runbooks/prometheus-operator/prometheusoperatorreconcileerrors)    |
| PrometheusOperatorNodeLookupErrors   | Node lookup errors during reconciliation | [link](https://runbooks.prometheus-operator.dev/runbooks/prometheus-operator/prometheusoperatornodelookuperrors)   |
| PrometheusOperatorNotReady           | Operator not ready                       | [link](https://runbooks.prometheus-operator.dev/runbooks/prometheus-operator/prometheusoperatornotready)           |
| PrometheusOperatorRejectedResources  | Resources rejected by operator           | [link](https://runbooks.prometheus-operator.dev/runbooks/prometheus-operator/prometheusoperatorrejectedresources)  |
| PrometheusOperatorStatusUpdateErrors | Status update errors                     | [link](https://runbooks.prometheus-operator.dev/runbooks/prometheus-operator/prometheusoperatorstatusupdateerrors) |

#### kube-state-metrics — critical

| Alert                            | What it means            | Runbook                                                                                                       |
| -------------------------------- | ------------------------ | ------------------------------------------------------------------------------------------------------------- |
| KubeStateMetricsListErrors       | List operations failing  | [link](https://runbooks.prometheus-operator.dev/runbooks/kube-state-metrics/kubestatemetricslisterrors)       |
| KubeStateMetricsWatchErrors      | Watch operations failing | [link](https://runbooks.prometheus-operator.dev/runbooks/kube-state-metrics/kubestatemetricswatcherrors)      |
| KubeStateMetricsShardingMismatch | Sharding misconfigured   | [link](https://runbooks.prometheus-operator.dev/runbooks/kube-state-metrics/kubestatemetricsshardingmismatch) |
| KubeStateMetricsShardsMissing    | Shards missing           | [link](https://runbooks.prometheus-operator.dev/runbooks/kube-state-metrics/kubestatemetricsshardsmissing)    |

#### General

| Alert         | Severity | What it means                                                                                                                   | Runbook                                                                         |
| ------------- | -------- | ------------------------------------------------------------------------------------------------------------------------------- | ------------------------------------------------------------------------------- |
| TargetDown    | warning  | A Prometheus scrape target is unreachable. Check `/targets` in Prometheus UI, verify ServiceMonitor config and network policies | [link](https://runbooks.prometheus-operator.dev/runbooks/general/targetdown)    |
| Watchdog      | none     | Always-firing deadman switch — if this stops, Alertmanager is broken                                                            | [link](https://runbooks.prometheus-operator.dev/runbooks/general/watchdog)      |
| InfoInhibitor | none     | Fires when info alerts exist without a warning or critical alert in the same namespace; used to suppress those info alerts      | [link](https://runbooks.prometheus-operator.dev/runbooks/general/infoinhibitor) |
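
For context on how the deadman switch works: the upstream kube-prometheus `Watchdog` rule is essentially a constant expression that always returns a result, so the alert keeps firing as long as rule evaluation and alert delivery work. A simplified sketch with annotations omitted; the exact definition in the chart version deployed here may differ:

```yaml
- alert: Watchdog
  # vector(1) always returns a sample, so this alert is always firing.
  expr: vector(1)
  labels:
    severity: none
```

Anything watching for this alert should treat its absence, not its presence, as the problem.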

#### Info

These don't page or post to Slack. They exist for dashboards and as context when other alerts are firing.

| Alert               | What it means                                             | Runbook                                                                                  |
| ------------------- | --------------------------------------------------------- | ---------------------------------------------------------------------------------------- |
| CPUThrottlingHigh   | Processes getting CPU-throttled                           | [link](https://runbooks.prometheus-operator.dev/runbooks/kubernetes/cputhrottlinghigh)   |
| KubeNodeEviction    | Node is evicting pods                                     | [link](https://runbooks.prometheus-operator.dev/runbooks/kubernetes/kubenodeeviction)    |
| KubeNodePressure    | Node has an active pressure condition (memory, disk, PID) | [link](https://runbooks.prometheus-operator.dev/runbooks/kubernetes/kubenodepressure)    |
| KubeQuotaAlmostFull | Namespace quota approaching limit                         | [link](https://runbooks.prometheus-operator.dev/runbooks/kubernetes/kubequotaalmostfull) |
| KubeQuotaFullyUsed  | Namespace quota fully consumed                            | [link](https://runbooks.prometheus-operator.dev/runbooks/kubernetes/kubequotafullyused)  |
| KubeletTooManyPods  | Kubelet running at pod capacity                           | [link](https://runbooks.prometheus-operator.dev/runbooks/kubernetes/kubelettoomanypods)  |
| NodeCPUHighUsage    | High CPU usage on node                                    | [link](https://runbooks.prometheus-operator.dev/runbooks/node/nodecpuhighusage)          |

## Adding a new alert

Alerts live in PrometheusRule manifests under `eks-cluster/helm-chart/observability/templates/`. Add yours to an existing file if it fits, or create a new `prometheus-alerts-<service>.yaml` if you're adding a group for a new service.

```yaml theme={"theme":"kanagawa-wave"}
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: <service>-alerts
  namespace: {{ .Release.Namespace }}
  labels:
    app.kubernetes.io/name: <service>-alerts
spec:
  groups:
    - name: <service>-health
      rules:
        - alert: SomethingBad
          expr: <your promql>
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: "Short description with {{ "{{" }} $labels.region {{ "}}" }}"
            description: "Longer explanation of what's happening."
```
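
The `{{ "{{" }}` and `{{ "}}" }}` pairs in `summary` are Helm escapes: Helm prints them as literal braces, so the Prometheus template survives chart rendering. After rendering, the annotations from the example above should look roughly like this, and Prometheus fills in `$labels.region` when the alert fires:

```yaml
annotations:
  summary: "Short description with {{ $labels.region }}"
  description: "Longer explanation of what's happening."
```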

### Things to keep in mind

* **Always set `severity`** to `critical` or `warning`. A missing label only reaches the Unrouted Alerts catch-all in #alerts, and any other value is silently dropped.
* **Exclude customer workloads** if your alert is pod-level. Add `container!="deployment"` to the metric selector and use `unless on(namespace, pod) kube_pod_labels{label_app_kubernetes_io_managed_by="krane"}` to skip krane-managed pods. We don't want to page ourselves for a customer's broken container.
* **Include region in the summary** when the data has it. Getting paged with "SomethingBad (us-east-1 / api)" is a lot more useful than just "SomethingBad."
* **Set a reasonable `for` duration.** Too short and you get flapping alerts. Too long and you find out late. 2-5 minutes is a good starting point for most things.
* **Think about staging vs. production.** The `environment` external label is available on all metrics (`production001` or `staging`). If you want different severity per environment, write two rules with different `environment` filters — one critical for production, one warning for staging.
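
A rough sketch of how these points can combine, using a hypothetical pod-restart alert. The alert names, the threshold of 3 restarts, and the 15m window are placeholders chosen for illustration, and `region` and `namespace` are assumed to be present on the series. `kube_pod_container_status_restarts_total` is the standard kube-state-metrics restart counter. This is the `groups` portion of a PrometheusRule inside a Helm template, so template braces are escaped the same way as in the example above.

```yaml
groups:
  - name: example-pod-health
    rules:
      # Production variant: critical, so it pages on-call.
      # container!="deployment" and the `unless` clause exclude customer workloads.
      - alert: ExamplePodRestartingProduction
        expr: |
          (
            increase(kube_pod_container_status_restarts_total{environment="production001", container!="deployment"}[15m]) > 3
          )
          unless on(namespace, pod)
            kube_pod_labels{label_app_kubernetes_io_managed_by="krane"}
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Pod restarting ({{ "{{" }} $labels.region {{ "}}" }} / {{ "{{" }} $labels.namespace {{ "}}" }})"
          description: "A container restarted more than 3 times in 15 minutes."

      # Staging variant: same expression with a different environment filter, warning only.
      - alert: ExamplePodRestartingStaging
        expr: |
          (
            increase(kube_pod_container_status_restarts_total{environment="staging", container!="deployment"}[15m]) > 3
          )
          unless on(namespace, pod)
            kube_pod_labels{label_app_kubernetes_io_managed_by="krane"}
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Pod restarting ({{ "{{" }} $labels.region {{ "}}" }} / {{ "{{" }} $labels.namespace {{ "}}" }})"
          description: "A container restarted more than 3 times in 15 minutes."
```

The two rules differ only in the `environment` filter, the alert name, and the `severity` label, which keeps them easy to compare when thresholds change.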

## incident.io

Alert routing config and API usage are documented in [incident.io](/infra/observability/incident-io).
