How our alerts work, what severity means, and how to add new ones without making the on-call person’s life worse.

How alerts get to people

Prometheus evaluates alerting rules in each cluster. When something fires, Alertmanager sends it to incident.io, which decides what to do based on two things: the source (which cluster sent it) and the severity label on the alert.
| Severity | What happens | Route |
| --- | --- | --- |
| critical | Pages on-call via the Engineering On-Call escalation path | Production Alerts |
| warning | Posts to #alerts in Slack, no page | Production Warnings |
| (missing) | Posts to #alerts with no page (safety net so you know the label is missing) | Unrouted Alerts |
The catch-all route ("Unrouted Alerts") picks up Alertmanager alerts that have no severity label at all. It posts to #alerts, without paging anyone, so that someone notices and fixes the missing label. An alert whose severity is set to anything other than critical or warning (like severity=info) is worse off: it matches none of the three routes and is silently dropped. Don't do that. Staging alerts never page anyone regardless of severity; they go to #alerts and that's it. In short: always include a severity label set to exactly critical or warning. Anything else and the alert is effectively invisible.
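Conceptually, the routing behaves like the Alertmanager-style sketch below. This is illustrative only: the real routing lives in incident.io and the receiver names here are paraphrased from the table above. It shows why an unknown severity value falls through every matcher:

```yaml
# Illustrative sketch of the routing logic, NOT the actual incident.io config.
route:
  routes:
    - matchers: ['severity="critical"']  # pages on-call -> "Production Alerts"
      receiver: production-alerts
    - matchers: ['severity="warning"']   # Slack #alerts, no page -> "Production Warnings"
      receiver: production-warnings
    - matchers: ['severity=""']          # label missing entirely -> "Unrouted Alerts" safety net
      receiver: unrouted-alerts
# severity="info" matches none of the above -> silently dropped.
```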

When to use which severity

critical — would you wake someone up for this? Then it’s critical.
  • Active customer impact (elevated error rates, full outage)
  • Data loss or risk of data loss
  • Pods that should be running but aren’t
  • Anything where waiting until morning makes it worse
warning — is this something we should know about but can wait?
  • Elevated latency that hasn’t crossed into “customers are mad” territory
  • Resource pressure (high memory, goroutine counts, connection pools filling up)
  • Indicators that could become critical if left alone
  • Anything you’d look at during business hours but wouldn’t lose sleep over
If you’re not sure, start with warning. It’s easy to promote to critical later. Going the other direction means someone already got woken up for nothing.

Current alerts

Custom (defined in this repo)

| Alert | Severity | What it catches | File |
| --- | --- | --- | --- |
| PodNotRunning | critical | Pod stuck in CrashLoopBackOff, ImagePullBackOff, etc. for 5m | observability/templates/prometheus-alerts.yaml |
| FrontlineErrorRateHigh | critical | 5xx error rate > 5% for 2m | observability/templates/prometheus-alerts-frontline.yaml |
| FrontlineP99LatencyHigh | critical | P99 latency > 2.5s for 5m | observability/templates/prometheus-alerts-frontline.yaml |
| FrontlineP95LatencyHigh | warning | P95 latency > 1s for 5m | observability/templates/prometheus-alerts-frontline.yaml |
| FrontlineHighActiveRequests | warning | Active requests > 100 for 2m | observability/templates/prometheus-alerts-frontline.yaml |
| FrontlineGoroutineLeak | warning | Goroutines > 1000 for 10m | observability/templates/prometheus-alerts-frontline.yaml |

kube-prometheus-stack built-ins

We get a bunch of default alerting rules from kube-prometheus-stack. We've disabled the ones that don't apply to EKS (etcd, apiserver, scheduler, controller-manager, cert rotation, RAID, bonding: all managed by AWS) and a couple we replaced with custom rules. The full list of disabled rules is in values.yaml under defaultRules.disabled.

Disabled for EKS (AWS manages these):
  • All etcd alerts
  • KubeAPIDown, KubeAPIErrorBudgetBurn, KubeAPITerminatedRequests
  • KubeAggregatedAPIDown, KubeAggregatedAPIErrors
  • KubeControllerManagerDown, KubeSchedulerDown, KubeProxyDown
  • KubeClientCertificateExpiration, KubeVersionMismatch
  • KubeletClient/ServerCertificateExpiration and RenewalErrors
  • NodeRAIDDegraded, NodeRAIDDiskFailure, NodeBondingDegraded
Disabled because we replaced them:
  • KubePodCrashLooping — replaced by PodNotRunning which excludes krane-managed customer workloads
  • KubePodNotReady — was firing for customer deployment containers
Everything below is what’s still active. These come from upstream kubernetes-mixin. We didn’t write them, but they fire in our clusters and you should know what they are. Runbook links are included where available.

Kubernetes — critical

| Alert | What it means | Runbook |
| --- | --- | --- |
| KubePersistentVolumeErrors | PV provisioning is broken | link |
| KubePersistentVolumeFillingUp | PV has < 3% space left. Check if the storage class allows expansion, resize the PVC, or clean up data | link |
| KubePersistentVolumeInodesFillingUp | PV has < 3% inodes left | link |

Kubernetes — warning

| Alert | What it means | Runbook |
| --- | --- | --- |
| KubeCPUOvercommit | Cluster CPU requests exceed capacity | link |
| KubeCPUQuotaOvercommit | CPU quota overcommitted | link |
| KubeClientErrors | API server client is seeing errors | link |
| KubeContainerWaiting | Container stuck waiting for > 1 hour. Check events, logs, and resource availability (configmaps, secrets, volumes) | link |
| KubeDaemonSetMisScheduled | DaemonSet pods landed on wrong nodes | link |
| KubeDaemonSetNotScheduled | DaemonSet pods not getting scheduled | link |
| KubeDaemonSetRolloutStuck | DaemonSet rollout stalled | link |
| KubeDeploymentGenerationMismatch | Deployment generation mismatch; possible failed rollback | link |
| KubeDeploymentReplicasMismatch | Deployment doesn't have expected replica count | link |
| KubeDeploymentRolloutStuck | Deployment rollout not progressing | link |
| KubeHpaMaxedOut | HPA running at max replicas | link |
| KubeHpaReplicasMismatch | HPA hasn't reached desired replicas | link |
| KubeJobFailed | Job failed. Check kubectl describe job and pod logs | link |
| KubeJobNotCompleted | Job didn't finish in time | link |
| KubeMemoryOvercommit | Cluster memory requests exceed capacity | link |
| KubeMemoryQuotaOvercommit | Memory quota overcommitted | link |
| KubeNodeNotReady | Node not ready. Check kubectl get node $NODE -o yaml, fix or terminate the instance | link |
| KubeNodeReadinessFlapping | Node keeps flipping between ready and not ready | link |
| KubeNodeUnreachable | Node is unreachable | link |
| KubePdbNotEnoughHealthyPods | PDB doesn't have enough healthy pods; blocks voluntary disruptions | link |
| KubePersistentVolumeFillingUp | PV filling up (warning threshold, predicted to fill in 4 days) | link |
| KubePersistentVolumeInodesFillingUp | PV inodes filling up (warning threshold) | link |
| KubeQuotaExceeded | Namespace quota exceeded | link |
| KubeStatefulSetGenerationMismatch | StatefulSet generation mismatch | link |
| KubeStatefulSetReplicasMismatch | StatefulSet doesn't have expected replicas | link |
| KubeStatefulSetUpdateNotRolledOut | StatefulSet update hasn't rolled out | link |

Kubelet — critical

| Alert | What it means | Runbook |
| --- | --- | --- |
| KubeletDown | Kubelet target gone | link |

Kubelet — warning

| Alert | What it means | Runbook |
| --- | --- | --- |
| KubeletPlegDurationHigh | Pod lifecycle event generator taking too long | link |
| KubeletPodStartUpLatencyHigh | Pods taking too long to start | link |

Node — critical

| Alert | What it means | Runbook |
| --- | --- | --- |
| NodeFileDescriptorLimit | Kernel predicted to run out of file descriptors soon | link |
| NodeFilesystemAlmostOutOfFiles | < 3% inodes remaining | link |
| NodeFilesystemAlmostOutOfSpace | < 3% disk space remaining | link |
| NodeFilesystemFilesFillingUp | Predicted to run out of inodes in 4 hours | link |
| NodeFilesystemSpaceFillingUp | Predicted to run out of space in 4 hours | link |

Node — warning

| Alert | What it means | Runbook |
| --- | --- | --- |
| NodeClockNotSynchronising | NTP not syncing | link |
| NodeClockSkewDetected | Clock skew on node | link |
| NodeDiskIOSaturation | Disk IO queue is high | link |
| NodeFileDescriptorLimit | FD limit approaching (warning threshold) | link |
| NodeFilesystemAlmostOutOfFiles | < 5% inodes remaining | link |
| NodeFilesystemAlmostOutOfSpace | < 5% disk space remaining | link |
| NodeFilesystemFilesFillingUp | Predicted to run out of inodes in 24 hours | link |
| NodeFilesystemSpaceFillingUp | Predicted to run out of space in 24 hours | link |
| NodeHighNumberConntrackEntriesUsed | Conntrack table getting full | link |
| NodeMemoryHighUtilization | Node running low on memory | link |
| NodeMemoryMajorPagesFaults | Heavy major page faults; something is swapping hard | link |
| NodeNetworkInterfaceFlapping | NIC keeps going up and down | link |
| NodeNetworkReceiveErrs | Lots of receive errors on a NIC | link |
| NodeNetworkTransmitErrs | Lots of transmit errors on a NIC | link |
| NodeSystemSaturation | Load per core is very high | link |
| NodeSystemdServiceCrashlooping | A systemd service keeps restarting | link |
| NodeSystemdServiceFailed | A systemd service has failed | link |
| NodeTextFileCollectorScrapeError | Node exporter textfile collector failed | link |

Alertmanager — critical

| Alert | What it means | Runbook |
| --- | --- | --- |
| AlertmanagerClusterCrashlooping | Half or more Alertmanager instances crashlooping | link |
| AlertmanagerClusterDown | Half or more Alertmanager instances down | link |
| AlertmanagerClusterFailedToSendAlerts | Failed to send to a critical integration | link |
| AlertmanagerConfigInconsistent | Alertmanager instances have different configs | link |
| AlertmanagerFailedReload | Config reload failed | link |
| AlertmanagerMembersInconsistent | Cluster member can't find other members | link |

Alertmanager — warning

| Alert | What it means | Runbook |
| --- | --- | --- |
| AlertmanagerClusterFailedToSendAlerts | Failed to send to a non-critical integration | link |
| AlertmanagerFailedToSendAlerts | An instance failed to send notifications | link |

Prometheus — critical

| Alert | What it means | Runbook |
| --- | --- | --- |
| PrometheusBadConfig | Config reload failed. Check kubectl logs on the Prometheus pod | link |
| PrometheusErrorSendingAlertsToAnyAlertmanager | > 3% errors sending alerts to all Alertmanagers | link |
| PrometheusRemoteStorageFailures | Failing to send samples to remote storage | link |
| PrometheusRemoteWriteBehind | Remote write is falling behind | link |
| PrometheusRuleFailures | Rule evaluations failing | link |
| PrometheusTargetSyncFailure | Target sync failed | link |

Prometheus — warning

| Alert | What it means | Runbook |
| --- | --- | --- |
| PrometheusNotConnectedToAlertmanagers | Can't reach any Alertmanagers | link |
| PrometheusNotIngestingSamples | Not ingesting samples | link |
| PrometheusHighQueryLoad | Hitting max concurrent query capacity | link |
| PrometheusDuplicateTimestamps | Dropping samples with duplicate timestamps | link |
| PrometheusOutOfOrderTimestamps | Dropping out-of-order samples | link |
| PrometheusNotificationQueueRunningFull | Alert queue predicted to fill up within 30m | link |
| PrometheusErrorSendingAlertsToSomeAlertmanagers | Errors sending to some (not all) Alertmanagers | link |
| PrometheusRemoteWriteDesiredShards | Remote write wants more shards than configured | link |
| PrometheusTSDBCompactionsFailing | Block compaction failing | link |
| PrometheusTSDBReloadsFailing | Block reload failing | link |
| PrometheusKubernetesListWatchFailures | SD list/watch requests failing | link |
| PrometheusMissingRuleEvaluations | Rule group evaluation too slow, skipping evaluations | link |
| PrometheusSDRefreshFailure | Service discovery refresh failing | link |
| PrometheusLabelLimitHit | Dropping targets that exceed label limits | link |
| PrometheusTargetLimitHit | Dropping targets that exceed target limits | link |
| PrometheusScrapeBodySizeLimitHit | Dropping targets that exceed body size limit | link |
| PrometheusScrapeSampleLimitHit | Dropping scrapes that exceed sample limit | link |

Prometheus Operator — warning

| Alert | What it means | Runbook |
| --- | --- | --- |
| ConfigReloaderSidecarErrors | Config reloader sidecar failing for 10m | link |
| PrometheusOperatorListErrors | List operation errors | link |
| PrometheusOperatorWatchErrors | Watch operation errors | link |
| PrometheusOperatorSyncFailed | Last reconciliation failed | link |
| PrometheusOperatorReconcileErrors | Reconciliation errors | link |
| PrometheusOperatorNodeLookupErrors | Node lookup errors during reconciliation | link |
| PrometheusOperatorNotReady | Operator not ready | link |
| PrometheusOperatorRejectedResources | Resources rejected by operator | link |
| PrometheusOperatorStatusUpdateErrors | Status update errors | link |

kube-state-metrics — critical

| Alert | What it means | Runbook |
| --- | --- | --- |
| KubeStateMetricsListErrors | List operations failing | link |
| KubeStateMetricsWatchErrors | Watch operations failing | link |
| KubeStateMetricsShardingMismatch | Sharding misconfigured | link |
| KubeStateMetricsShardsMissing | Shards missing | link |

General

| Alert | Severity | What it means | Runbook |
| --- | --- | --- | --- |
| TargetDown | warning | A Prometheus scrape target is unreachable. Check /targets in the Prometheus UI, verify ServiceMonitor config and network policies | link |
| Watchdog | none | Always-firing deadman switch; if this stops, Alertmanager is broken | link |
| InfoInhibitor | none | Suppresses info-level alerts when higher-severity alerts are already firing for the same target | link |

Info

These don’t page or post to Slack. They exist for dashboards and as context when other alerts are firing.
| Alert | What it means | Runbook |
| --- | --- | --- |
| CPUThrottlingHigh | Processes getting CPU-throttled | link |
| KubeNodeEviction | Node is evicting pods | link |
| KubeNodePressure | Node has an active pressure condition (memory, disk, PID) | link |
| KubeQuotaAlmostFull | Namespace quota approaching limit | link |
| KubeQuotaFullyUsed | Namespace quota fully consumed | link |
| KubeletTooManyPods | Kubelet running at pod capacity | link |
| NodeCPUHighUsage | High CPU usage on node | link |

Adding a new alert

Alerts live in PrometheusRule manifests under eks-cluster/helm-chart/observability/templates/. Add yours to an existing file if it fits, or create a new prometheus-alerts-<service>.yaml if you’re adding a group for a new service.
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: <service>-alerts
  namespace: {{ .Release.Namespace }}
  labels:
    app.kubernetes.io/name: <service>-alerts
spec:
  groups:
    - name: <service>-health
      rules:
        - alert: SomethingBad
          expr: <your promql>
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: "Short description with {{ "{{" }} $labels.region {{ "}}" }}"
            description: "Longer explanation of what's happening."

Things to keep in mind

  • Always set severity to critical or warning. No severity = no routing.
  • Exclude customer workloads if your alert is pod-level. Add container!="deployment" to the metric selector and use unless on(namespace, pod) kube_pod_labels{label_app_kubernetes_io_managed_by="krane"} to skip krane-managed pods. We don’t want to page ourselves for a customer’s broken container.
  • Include region in the summary when the data has it. Getting paged with “SomethingBad (us-east-1 / api)” is a lot more useful than just “SomethingBad.”
  • Set a reasonable for duration. Too short and you get flapping alerts. Too long and you find out late. 2-5 minutes is a good starting point for most things.
  • Think about staging vs. production. The environment external label is available on all metrics (production001 or staging). If you want different severity per environment, write two rules with different environment filters — one critical for production, one warning for staging.
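To make the last two points concrete, here is a sketch of what an environment-split pair of rules and the customer-workload exclusion look like together. MyServiceDown and myservice_up are hypothetical names for illustration; the kube_pod_container_status_waiting_reason / kube_pod_labels selectors follow the pattern described in the bullets above. Inside the Helm chart, the alert annotations would need the {{ }} escaping shown in the template earlier on this page.

```yaml
groups:
  - name: myservice-health
    rules:
      # Production: same condition, critical severity -> pages on-call.
      - alert: MyServiceDown
        expr: myservice_up{environment="production001"} == 0
        for: 5m
        labels:
          severity: critical
      # Staging: warning severity -> #alerts only (staging never pages anyway).
      - alert: MyServiceDown
        expr: myservice_up{environment="staging"} == 0
        for: 5m
        labels:
          severity: warning
      # Pod-level rule with the krane exclusion pattern, so customer
      # workloads don't page us for their own broken containers.
      - alert: MyServicePodWaiting
        expr: |
          kube_pod_container_status_waiting_reason{container!="deployment"} == 1
            unless on(namespace, pod)
          kube_pod_labels{label_app_kubernetes_io_managed_by="krane"}
        for: 5m
        labels:
          severity: warning
```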

incident.io

Alert routing config and API usage are documented in incident.io.