---
title: 'Incidents'
description: 'How Yorker groups correlated alerts into incidents, tracks their lifecycle, and dispatches opinionated, investigator-grade notifications.'
section: 'Concepts'
canonical_url: 'https://yorkermonitoring.com/docs/concepts/incidents'
---

# Incidents

A Yorker incident is a correlated group of alerts treated as one investigable unit. Each incident has a fingerprint, a severity, a lifecycle, and a notification policy. Incidents reduce noise by collapsing many alerts into one ticket and by emitting structured, investigator-grade payloads to your channels.

## Why incidents exist

A single alert answers "is this check failing right now?" It does not answer the question an on-call engineer actually needs: **what is the blast radius, and is it related to something else that's breaking?**

Synthetic monitors often fire in bursts. An upstream DNS provider hiccups and ten HTTP checks page at once. A CDN edge degrades and browser checks across three regions turn red. Without correlation, you get ten pages for one problem.

Yorker groups those alerts into an incident, computes a scoped hypothesis from the observations (HTTP status codes, locations, shared failing domains, symptom timing), and dispatches **one** ticket per channel per incident — not one per alert.

## The incident lifecycle

Every incident moves through a small set of states. Each state transition is recorded as a first-class event and dispatched to subscribed channels.

| State           | Entered by                                                         |
| --------------- | ------------------------------------------------------------------ |
| `open`          | Correlated alerts above the score threshold                        |
| `acknowledged`  | A user clicks "Acknowledge" in the dashboard or API                |
| `auto_resolved` | All member alerts recovered and the 15-minute cool-down elapsed    |
| `closed`        | A user closes the incident explicitly                              |
| `reopened`      | A user reopens a previously closed/resolved incident               |

The transient `reopened` → `open` transition is preserved in the event log so downstream consumers can replay the exact sequence.
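For example, an incident that auto-resolved and later recurred might replay as a sequence like this (a sketch for illustration; `incidentId` and `stateSequence` are illustrative names, not documented export fields):

```json
{
  "incidentId": "inc_abc",
  "stateSequence": ["open", "acknowledged", "auto_resolved", "reopened", "open", "closed"]
}
```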

## Event types

Every lifecycle transition emits one of these events. Every event carries the full observations + hypothesis snapshot, so a consumer replaying a single event has complete context without querying Yorker again.

- `opened` — new incident created
- `alert_attached` — an additional alert joined an active incident
- `severity_changed` — severity escalated or de-escalated
- `acknowledged` — a user took ownership
- `note_added` — a user added a freeform note
- `auto_resolved` — all members recovered and cool-down elapsed
- `closed` — a user closed the incident explicitly
- `reopened` — a user reopened a previously resolved incident

Each event is persisted to `incident_events`, emitted as an OTel log record (if an OTLP endpoint is configured for the team), and dispatched to every channel subscribed to incidents for the team.
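As a concrete sketch, an `alert_attached` event might serialize along these lines (the `hypothesis` block matches the shape shown later on this page; every other field name here is an illustrative assumption, not the documented schema):

```json
{
  "type": "alert_attached",
  "incident": {
    "id": "inc_abc",
    "severity": "high",
    "state": "open",
    "title": "Stripe API is returning 503/504; checkout is blocked."
  },
  "observations": [
    { "alertId": "alr_001", "httpStatus": 503, "location": "us-east-1" },
    { "alertId": "alr_002", "httpStatus": 504, "location": "eu-west-1" }
  ],
  "hypothesis": {
    "summary": "Stripe API is returning 503/504; checkout is blocked.",
    "confidence": 0.75,
    "scope": "external_symptoms_only"
  }
}
```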

## Default notification routing

Different channel types have different sensible defaults. Yorker opts into the minimum-noise routing that matches each channel's audience:

| Channel      | Receives by default                                                                  |
| ------------ | ------------------------------------------------------------------------------------ |
| Slack        | Every lifecycle event (timeline-style thread)                                        |
| Email        | `opened`, `auto_resolved`, `closed` only (inboxes should not be a running timeline)  |
| Webhook      | Every lifecycle event                                                                |
| PagerDuty    | `opened`, `acknowledged`, `auto_resolved`, `closed`, `reopened`, `note_added`        |
| ServiceNow   | `opened`, `severity_changed`, `acknowledged`, `auto_resolved`, `closed`, `note_added` |

PagerDuty skips `severity_changed` because the Events API v2 has no matching action. ServiceNow skips `reopened` because Yorker's reopen semantics don't map cleanly to ServiceNow's reopen concept — a Yorker "reopen" after a recurrence creates a new external ticket rather than mutating the old one.

See the [Slack](/docs/integrations/slack), [PagerDuty](/docs/integrations/pagerduty), [ServiceNow](/docs/integrations/servicenow), [Email](/docs/integrations/email), and [Webhook](/docs/integrations/webhook) integration pages for the exact payload shapes.

## Scoped hypothesis

Every outbound incident payload carries a `hypothesis` block that tells the reader what Yorker thinks is going on — scoped to what an external synthetic sensor can prove:

```json
{
  "hypothesis": {
    "summary": "Stripe API is returning 503/504; checkout is blocked.",
    "confidence": 0.75,
    "ruledIn": ["shared_failing_domain=api.stripe.com"],
    "ruledOut": [
      "DNS resolution: NXDOMAIN not observed",
      "TLS: handshake completes"
    ],
    "correlationDimensionsMatched": ["shared_failing_domain", "error_pattern"],
    "scope": "external_symptoms_only"
  }
}
```

`scope: external_symptoms_only` is the honesty baseline. Yorker can prove the external symptom — users cannot reach checkout — and can rule out classes of causes it directly measured (DNS, TLS, shared failing domains). It cannot see your backend logs, so it never claims the backend is the culprit.

## Dedupe + rate limiting

- **30s dedupe window** — a retry firing the same event to the same channel within 30 seconds is recorded as `skipped_dedupe` in `incident_notification_dispatches`, not sent again.
- **1-per-minute note rate limit** — per (channel, incident), a second `note_added` within 60 seconds of a prior send attempt (successful **or** failed) is recorded as `skipped_rate_limit`. Failed attempts count because each one still hit the upstream endpoint; a flaky webhook returning 5xx must not leak a retry burst past the cap. This prevents an operator running a backfill script from spamming hundreds of notes (see the sketch below).

Both checks fail **open** on database errors — losing a notification is worse than double-sending one.
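To make the note rate limit concrete, here is how three rapid `note_added` attempts against one (channel, incident) pair might land in `incident_notification_dispatches` (timestamps and field names are illustrative; only the `status` values are exact):

```json
[
  { "event": "note_added", "attemptedAt": "12:00:00Z", "status": "failed" },
  { "event": "note_added", "attemptedAt": "12:00:20Z", "status": "skipped_rate_limit" },
  { "event": "note_added", "attemptedAt": "12:01:05Z", "status": "sent" }
]
```

The second attempt is capped because the first one, although it failed, still hit the endpoint and counts as a send attempt; the third falls outside the 60-second window and goes through.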

## User-editable templates

Every channel's default payload can be overridden with a Handlebars template attached to the notification channel. The rendering context matches `serializeIncidentEventForExport` plus a few helpers (`severityEmoji`, `eventEmoji`, `join`, `ifHasSource`, `jsonBody`).

On a render error or JSON-parse failure, the override **falls back to the default** and the error is logged; a bad template never fails dispatch.
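For instance, a webhook override could lean on the helpers like this (a sketch; `event.type` and the exact context paths are assumptions to verify against `serializeIncidentEventForExport`):

```json
{
  "text": "{{eventEmoji event.type}} {{severityEmoji incident.severity}} {{incident.title}}",
  "matchedDimensions": "{{join hypothesis.correlationDimensionsMatched ', '}}"
}
```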

### In the web UI

For Slack, email, and webhook channels, **Settings > Notification Channels > Templates** opens a full editor with:

- per-event tabs
- a live preview rendered against canonical fixtures
- a library of starter and example templates
- a diff view comparing the draft against the last saved version
- a **Send test** button that dispatches the current saved template to the real channel

The editor is the recommended authoring path for these three channel types. PagerDuty and ServiceNow overrides are currently API-only.

### Via the API

Template overrides are sent via the notification-channel API:

```bash
curl -X PUT https://yorkermonitoring.com/api/notification-channels/nch_abc \
  -H "Authorization: Bearer $YORKER_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "incidentTemplate": {
      "channelType": "slack",
      "overrides": {
        "opened": {
          "blocks": "{\"blocks\":[{\"type\":\"section\",\"text\":{\"type\":\"mrkdwn\",\"text\":\"{{severityEmoji incident.severity}} {{incident.title}}\"}}]}"
        }
      }
    }
  }'
```

To stop a channel from receiving incident events (falling back to legacy per-alert dispatch), set `incidentSubscribed: false` on the channel.
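Sent to the same `PUT` endpoint shown above, the body reduces to (assuming the endpoint accepts partial updates, as the template example implies):

```json
{ "incidentSubscribed": false }
```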

## Audit trail

Every dispatch writes one row to `incident_notification_dispatches` with status `sent`, `skipped_dedupe`, `skipped_rate_limit`, `skipped_not_routed`, or `failed`, plus any channel-specific response payload (PagerDuty `dedup_key`, ServiceNow `sys_id`). This is the source of truth for "did we actually notify?" — the UI will expose it in a later iteration.
