How our alerts work, what severity means, and how to add new ones without making the on-call person’s life worse.

How alerts get to people

Prometheus evaluates alerting rules in each cluster. When something fires, Alertmanager sends it to incident.io, which decides what to do based on two things: the source (which cluster sent it) and the severity label on the alert.
| Severity | What happens | Route |
| --- | --- | --- |
| critical | Pages on-call via the Engineering On-Call escalation path | Production Alerts |
| warning | Posts to #alerts in Slack, no page | Production Warnings |
| (missing) | Posts to #alerts with no page (safety net so you know the label is missing) | Unrouted Alerts |
The catch-all route ("Unrouted Alerts") picks up Alertmanager alerts that have no severity label at all. It posts to #alerts, without paging anyone, so that someone notices and fixes the missing label. An alert whose severity is set to anything other than critical or warning (like severity=info) is worse off: it matches none of the three routes and is silently dropped. Don't do that. Staging alerts never page anyone regardless of severity; they go to #alerts and that's it. In short: always include a severity label set to exactly critical or warning. Anything else and the alert is effectively invisible.
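Conceptually, the routing behaves like the Alertmanager-style sketch below. This is illustrative only: the real routing lives in incident.io and the receiver names here are paraphrased from the table above. It shows why an unknown severity value falls through every matcher:

```yaml
# Illustrative sketch of the routing logic, NOT the actual incident.io config.
route:
  routes:
    - matchers: ['severity="critical"']  # pages on-call -> "Production Alerts"
      receiver: production-alerts
    - matchers: ['severity="warning"']   # Slack #alerts, no page -> "Production Warnings"
      receiver: production-warnings
    - matchers: ['severity=""']          # label missing entirely -> "Unrouted Alerts" safety net
      receiver: unrouted-alerts
# severity="info" matches none of the above -> silently dropped.
```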

When to use which severity

critical — would you wake someone up for this? Then it’s critical.
  • Active customer impact (elevated error rates, full outage)
  • Data loss or risk of data loss
  • Pods that should be running but aren’t
  • Anything where waiting until morning makes it worse
warning — is this something we should know about but can wait?
  • Elevated latency that hasn’t crossed into “customers are mad” territory
  • Resource pressure (high memory, goroutine counts, connection pools filling up)
  • Indicators that could become critical if left alone
  • Anything you’d look at during business hours but wouldn’t lose sleep over
If you’re not sure, start with warning. It’s easy to promote to critical later. Going the other direction means someone already got woken up for nothing.

Current alerts

Custom (defined in this repo)

| Alert | Severity | What it catches | File |
| --- | --- | --- | --- |
| PodNotRunning | critical | Pod stuck in CrashLoopBackOff, ImagePullBackOff, etc. for 5m | observability/templates/prometheus-alerts.yaml |
| FrontlineErrorRateHigh | critical | 5xx error rate > 5% for 2m | observability/templates/prometheus-alerts-frontline.yaml |
| FrontlineP99LatencyHigh | critical | P99 latency > 2.5s for 5m | observability/templates/prometheus-alerts-frontline.yaml |
| FrontlineP95LatencyHigh | warning | P95 latency > 1s for 5m | observability/templates/prometheus-alerts-frontline.yaml |
| FrontlineHighActiveRequests | warning | Active requests > 100 for 2m | observability/templates/prometheus-alerts-frontline.yaml |
| FrontlineGoroutineLeak | warning | Goroutines > 1000 for 10m | observability/templates/prometheus-alerts-frontline.yaml |

kube-prometheus-stack built-ins

We get a bunch of default alerting rules from kube-prometheus-stack. We've disabled the ones that don't apply to EKS (etcd, apiserver, scheduler, controller-manager, cert rotation, RAID, bonding: all managed by AWS) and a couple we replaced with custom rules. The full list of disabled rules is in values.yaml under defaultRules.disabled.

Disabled for EKS (AWS manages these):
  • All etcd alerts
  • KubeAPIDown, KubeAPIErrorBudgetBurn, KubeAPITerminatedRequests
  • KubeAggregatedAPIDown, KubeAggregatedAPIErrors
  • KubeControllerManagerDown, KubeSchedulerDown, KubeProxyDown
  • KubeClientCertificateExpiration, KubeVersionMismatch
  • KubeletClient/ServerCertificateExpiration and RenewalErrors
  • NodeRAIDDegraded, NodeRAIDDiskFailure, NodeBondingDegraded
Disabled because we replaced them:
  • KubePodCrashLooping — replaced by PodNotRunning which excludes krane-managed customer workloads
  • KubePodNotReady — was firing for customer deployment containers
Everything below is what’s still active. These come from upstream kubernetes-mixin. We didn’t write them, but they fire in our clusters and you should know what they are. Runbook links are included where available.

Kubernetes — critical

| Alert | What it means | Runbook |
| --- | --- | --- |
| KubePersistentVolumeErrors | PV provisioning is broken | link |
| KubePersistentVolumeFillingUp | PV has < 3% space left. Check if the storage class allows expansion, resize the PVC, or clean up data | link |
| KubePersistentVolumeInodesFillingUp | PV has < 3% inodes left | link |

Kubernetes — warning

| Alert | What it means | Runbook |
| --- | --- | --- |
| KubeCPUOvercommit | Cluster CPU requests exceed capacity | link |
| KubeCPUQuotaOvercommit | CPU quota overcommitted | link |
| KubeClientErrors | API server client is seeing errors | link |
| KubeContainerWaiting | Container stuck waiting for > 1 hour. Check events, logs, and resource availability (configmaps, secrets, volumes) | link |
| KubeDaemonSetMisScheduled | DaemonSet pods landed on wrong nodes | link |
| KubeDaemonSetNotScheduled | DaemonSet pods not getting scheduled | link |
| KubeDaemonSetRolloutStuck | DaemonSet rollout stalled | link |
| KubeDeploymentGenerationMismatch | Deployment generation mismatch; possible failed rollback | link |
| KubeDeploymentReplicasMismatch | Deployment doesn't have expected replica count | link |
| KubeDeploymentRolloutStuck | Deployment rollout not progressing | link |
| KubeHpaMaxedOut | HPA running at max replicas | link |
| KubeHpaReplicasMismatch | HPA hasn't reached desired replicas | link |
| KubeJobFailed | Job failed. Check kubectl describe job and pod logs | link |
| KubeJobNotCompleted | Job didn't finish in time | link |
| KubeMemoryOvercommit | Cluster memory requests exceed capacity | link |
| KubeMemoryQuotaOvercommit | Memory quota overcommitted | link |
| KubeNodeNotReady | Node not ready. Check kubectl get node $NODE -o yaml, fix or terminate the instance | link |
| KubeNodeReadinessFlapping | Node keeps flipping between ready and not ready | link |
| KubeNodeUnreachable | Node is unreachable | link |
| KubePdbNotEnoughHealthyPods | PDB doesn't have enough healthy pods; blocks voluntary disruptions | link |
| KubePersistentVolumeFillingUp | PV filling up (warning threshold, predicted to fill in 4 days) | link |
| KubePersistentVolumeInodesFillingUp | PV inodes filling up (warning threshold) | link |
| KubeQuotaExceeded | Namespace quota exceeded | link |
| KubeStatefulSetGenerationMismatch | StatefulSet generation mismatch | link |
| KubeStatefulSetReplicasMismatch | StatefulSet doesn't have expected replicas | link |
| KubeStatefulSetUpdateNotRolledOut | StatefulSet update hasn't rolled out | link |

Kubelet — critical

| Alert | What it means | Runbook |
| --- | --- | --- |
| KubeletDown | Kubelet target gone | link |

Kubelet — warning

| Alert | What it means | Runbook |
| --- | --- | --- |
| KubeletPlegDurationHigh | Pod lifecycle event generator taking too long | link |
| KubeletPodStartUpLatencyHigh | Pods taking too long to start | link |

Node — critical

| Alert | What it means | Runbook |
| --- | --- | --- |
| NodeFileDescriptorLimit | Kernel predicted to run out of file descriptors soon | link |
| NodeFilesystemAlmostOutOfFiles | < 3% inodes remaining | link |
| NodeFilesystemAlmostOutOfSpace | < 3% disk space remaining | link |
| NodeFilesystemFilesFillingUp | Predicted to run out of inodes in 4 hours | link |
| NodeFilesystemSpaceFillingUp | Predicted to run out of space in 4 hours | link |

Node — warning

| Alert | What it means | Runbook |
| --- | --- | --- |
| NodeClockNotSynchronising | NTP not syncing | link |
| NodeClockSkewDetected | Clock skew on node | link |
| NodeDiskIOSaturation | Disk IO queue is high | link |
| NodeFileDescriptorLimit | FD limit approaching (warning threshold) | link |
| NodeFilesystemAlmostOutOfFiles | < 5% inodes remaining | link |
| NodeFilesystemAlmostOutOfSpace | < 5% disk space remaining | link |
| NodeFilesystemFilesFillingUp | Predicted to run out of inodes in 24 hours | link |
| NodeFilesystemSpaceFillingUp | Predicted to run out of space in 24 hours | link |
| NodeHighNumberConntrackEntriesUsed | Conntrack table getting full | link |
| NodeMemoryHighUtilization | Node running low on memory | link |
| NodeMemoryMajorPagesFaults | Heavy major page faults; something is swapping hard | link |
| NodeNetworkInterfaceFlapping | NIC keeps going up and down | link |
| NodeNetworkReceiveErrs | Lots of receive errors on a NIC | link |
| NodeNetworkTransmitErrs | Lots of transmit errors on a NIC | link |
| NodeSystemSaturation | Load per core is very high | link |
| NodeSystemdServiceCrashlooping | A systemd service keeps restarting | link |
| NodeSystemdServiceFailed | A systemd service has failed | link |
| NodeTextFileCollectorScrapeError | Node exporter textfile collector failed | link |

Alertmanager — critical

| Alert | What it means | Runbook |
| --- | --- | --- |
| AlertmanagerClusterCrashlooping | Half or more Alertmanager instances crashlooping | link |
| AlertmanagerClusterDown | Half or more Alertmanager instances down | link |
| AlertmanagerClusterFailedToSendAlerts | Failed to send to a critical integration | link |
| AlertmanagerConfigInconsistent | Alertmanager instances have different configs | link |
| AlertmanagerFailedReload | Config reload failed | link |
| AlertmanagerMembersInconsistent | Cluster member can't find other members | link |

Alertmanager — warning

| Alert | What it means | Runbook |
| --- | --- | --- |
| AlertmanagerClusterFailedToSendAlerts | Failed to send to a non-critical integration | link |
| AlertmanagerFailedToSendAlerts | An instance failed to send notifications | link |

Prometheus — critical

| Alert | What it means | Runbook |
| --- | --- | --- |
| PrometheusBadConfig | Config reload failed. Check kubectl logs on the Prometheus pod | link |
| PrometheusErrorSendingAlertsToAnyAlertmanager | > 3% errors sending alerts to all Alertmanagers | link |
| PrometheusRemoteStorageFailures | Failing to send samples to remote storage | link |
| PrometheusRemoteWriteBehind | Remote write is falling behind | link |
| PrometheusRuleFailures | Rule evaluations failing | link |
| PrometheusTargetSyncFailure | Target sync failed | link |

Prometheus — warning

| Alert | What it means | Runbook |
| --- | --- | --- |
| PrometheusNotConnectedToAlertmanagers | Can't reach any Alertmanagers | link |
| PrometheusNotIngestingSamples | Not ingesting samples | link |
| PrometheusHighQueryLoad | Hitting max concurrent query capacity | link |
| PrometheusDuplicateTimestamps | Dropping samples with duplicate timestamps | link |
| PrometheusOutOfOrderTimestamps | Dropping out-of-order samples | link |
| PrometheusNotificationQueueRunningFull | Alert queue predicted to fill up within 30m | link |
| PrometheusErrorSendingAlertsToSomeAlertmanagers | Errors sending to some (not all) Alertmanagers | link |
| PrometheusRemoteWriteDesiredShards | Remote write wants more shards than configured | link |
| PrometheusTSDBCompactionsFailing | Block compaction failing | link |
| PrometheusTSDBReloadsFailing | Block reload failing | link |
| PrometheusKubernetesListWatchFailures | SD list/watch requests failing | link |
| PrometheusMissingRuleEvaluations | Rule group evaluation too slow, skipping evaluations | link |
| PrometheusSDRefreshFailure | Service discovery refresh failing | link |
| PrometheusLabelLimitHit | Dropping targets that exceed label limits | link |
| PrometheusTargetLimitHit | Dropping targets that exceed target limits | link |
| PrometheusScrapeBodySizeLimitHit | Dropping targets that exceed body size limit | link |
| PrometheusScrapeSampleLimitHit | Dropping scrapes that exceed sample limit | link |

Prometheus Operator — warning

| Alert | What it means | Runbook |
| --- | --- | --- |
| ConfigReloaderSidecarErrors | Config reloader sidecar failing for 10m | link |
| PrometheusOperatorListErrors | List operation errors | link |
| PrometheusOperatorWatchErrors | Watch operation errors | link |
| PrometheusOperatorSyncFailed | Last reconciliation failed | link |
| PrometheusOperatorReconcileErrors | Reconciliation errors | link |
| PrometheusOperatorNodeLookupErrors | Node lookup errors during reconciliation | link |
| PrometheusOperatorNotReady | Operator not ready | link |
| PrometheusOperatorRejectedResources | Resources rejected by operator | link |
| PrometheusOperatorStatusUpdateErrors | Status update errors | link |

kube-state-metrics — critical

| Alert | What it means | Runbook |
| --- | --- | --- |
| KubeStateMetricsListErrors | List operations failing | link |
| KubeStateMetricsWatchErrors | Watch operations failing | link |
| KubeStateMetricsShardingMismatch | Sharding misconfigured | link |
| KubeStateMetricsShardsMissing | Shards missing | link |

General

| Alert | Severity | What it means | Runbook |
| --- | --- | --- | --- |
| TargetDown | warning | A Prometheus scrape target is unreachable. Check /targets in the Prometheus UI, verify ServiceMonitor config and network policies | link |
| Watchdog | none | Always-firing deadman switch; if this stops, Alertmanager is broken | link |
| InfoInhibitor | none | Suppresses info-level alerts when higher-severity alerts are already firing for the same target | link |

Info

These don’t page or post to Slack. They exist for dashboards and as context when other alerts are firing.
| Alert | What it means | Runbook |
| --- | --- | --- |
| CPUThrottlingHigh | Processes getting CPU-throttled | link |
| KubeNodeEviction | Node is evicting pods | link |
| KubeNodePressure | Node has an active pressure condition (memory, disk, PID) | link |
| KubeQuotaAlmostFull | Namespace quota approaching limit | link |
| KubeQuotaFullyUsed | Namespace quota fully consumed | link |
| KubeletTooManyPods | Kubelet running at pod capacity | link |
| NodeCPUHighUsage | High CPU usage on node | link |

Adding a new alert

Alerts live in PrometheusRule manifests under eks-cluster/helm-chart/observability/templates/. Add yours to an existing file if it fits, or create a new prometheus-alerts-<service>.yaml if you’re adding a group for a new service.
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: <service>-alerts
  namespace: {{ .Release.Namespace }}
  labels:
    app.kubernetes.io/name: <service>-alerts
spec:
  groups:
    - name: <service>-health
      rules:
        - alert: SomethingBad
          expr: <your promql>
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: "Short description with {{ "{{" }} $labels.region {{ "}}" }}"
            description: "Longer explanation of what's happening."

Things to keep in mind

  • Always set severity to critical or warning. No severity = no routing.
  • Exclude customer workloads if your alert is pod-level. Add container!="deployment" to the metric selector and use unless on(namespace, pod) kube_pod_labels{label_app_kubernetes_io_managed_by="krane"} to skip krane-managed pods. We don’t want to page ourselves for a customer’s broken container.
  • Include region in the summary when the data has it. Getting paged with “SomethingBad (us-east-1 / api)” is a lot more useful than just “SomethingBad.”
  • Set a reasonable for duration. Too short and you get flapping alerts. Too long and you find out late. 2-5 minutes is a good starting point for most things.
  • Think about staging vs. production. The environment external label is available on all metrics (production001 or staging). If you want different severity per environment, write two rules with different environment filters — one critical for production, one warning for staging.
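To make the last two points concrete, here is a sketch of what an environment-split pair of rules and the customer-workload exclusion look like together. MyServiceDown and myservice_up are hypothetical names for illustration; the kube_pod_container_status_waiting_reason / kube_pod_labels selectors follow the pattern described in the bullets above. Inside the Helm chart, the alert annotations would need the {{ }} escaping shown in the template earlier on this page.

```yaml
groups:
  - name: myservice-health
    rules:
      # Production: same condition, critical severity -> pages on-call.
      - alert: MyServiceDown
        expr: myservice_up{environment="production001"} == 0
        for: 5m
        labels:
          severity: critical
      # Staging: warning severity -> #alerts only (staging never pages anyway).
      - alert: MyServiceDown
        expr: myservice_up{environment="staging"} == 0
        for: 5m
        labels:
          severity: warning
      # Pod-level rule with the krane exclusion pattern, so customer
      # workloads don't page us for their own broken containers.
      - alert: MyServicePodWaiting
        expr: |
          kube_pod_container_status_waiting_reason{container!="deployment"} == 1
            unless on(namespace, pod)
          kube_pod_labels{label_app_kubernetes_io_managed_by="krane"}
        for: 5m
        labels:
          severity: warning
```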

incident.io

Alert routing config and API usage are documented in incident.io.