Incident post-mortem · INC-2417
INC-2417   SEV-2   CLOSED

Post-mortem: duplicate invoices in the overnight batch-billing run

ABC Corp is a fictional company. Every name, number and date is invented. This is a reference artifact generated with an LLM coding agent; the brief that produces it is at the bottom of this page.

Incident date: 3 Jun 2026, 02:14–05:26 CET
Services affected: batch-billing, order-gateway
Detected by: morning reconciliation check
Post-mortem owner: R. Okafor
Review held: 9 Jun 2026

Impact

3h 12m · duration (failover → all-clear)
1,842 · duplicate draft invoices created
0 · invoices sent to customers
2 · services affected

The duplicates were caught by the scheduled reconciliation check before the dispatch window opened: no customer ever saw a duplicate invoice and no customer communication was required.

Timeline

  1. DETECT

    Primary orders database fails over to the standby after a storage fault; application connections reset for 41 seconds.

  2. DETECT

    batch-billing retry wrapper re-runs the invoice-generation stage for segments 7–12 after connection errors. The retry carries no idempotency key.

  3. DETECT

    Run resumes against the new primary; re-inserted drafts for segments 7–12 are written as new rows. Duplicates begin accumulating silently.

  4. DETECT

    Batch run completes with status SUCCESS. Row-count delta vs the previous night is +4.1%, inside the alerting threshold, so no alert fires.

  5. DETECT

    Scheduled morning reconciliation check starts, comparing draft invoices against source orders.

  6. DETECT

    Reconciliation flags 1,842 drafts sharing an order_ref with an existing draft from the same run. Pager alert sent to billing on-call.

  7. ESCALATE

    On-call (M. Tanaka) acknowledges, confirms the pattern is run-wide, declares SEV-2 and opens INC-2417.

  8. ESCALATE

    billing-ops and order-gateway on-call (J. Lindqvist) join the bridge. Outbound invoice dispatch queue paused as a precaution; dispatch window opens at 07:00, so customer exposure is zero while paused.

  9. MITIGATE

    Duplicates traced to the 02:15 retry of segments 7–12 following the database failover. Duplicate set fully identified by (order_ref, run_id) pairs.

  10. MITIGATE

    Cleanup script dry-run against a staging copy of the run output: 1,842 drafts matched, 0 false positives. Script and match-list peer-reviewed on the bridge.

  11. MITIGATE

    Duplicates soft-deleted in production. Reconciliation re-run completes clean: draft count matches source orders exactly.

  12. RESOLVE

    Dispatch queue resumed, well before the 07:00 window. All-clear posted to the incident channel; INC-2417 moved to monitoring, closed after the 07:00 dispatch ran clean.
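The duplicate identification in steps 6 and 9 can be sketched as a grouping over (order_ref, run_id) pairs. This is a minimal illustration, not the cleanup script used on the bridge: the `Draft` type, the `draft_id` field, the sample order references, and the choice to keep the earliest draft in each group are all assumptions made for the example.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Draft:
    draft_id: int   # hypothetical surrogate key, ascending in insert order
    run_id: str
    order_ref: str


def find_duplicates(drafts):
    """Return every draft after the first for each (order_ref, run_id) pair."""
    groups = defaultdict(list)
    for draft in drafts:
        groups[(draft.order_ref, draft.run_id)].append(draft)

    extras = []
    for group in groups.values():
        group.sort(key=lambda d: d.draft_id)
        extras.extend(group[1:])  # assumption: keep the earliest, flag the rest
    return extras


# Hypothetical sample data; ORD-1002 was written twice in the same run.
RUN_ID = "2026-06-03T02:00"
drafts = [
    Draft(1, RUN_ID, "ORD-1001"),
    Draft(2, RUN_ID, "ORD-1002"),
    Draft(3, RUN_ID, "ORD-1002"),
    Draft(4, RUN_ID, "ORD-1003"),
]

to_soft_delete = find_duplicates(drafts)
print([d.draft_id for d in to_soft_delete])  # [3]
```

Running the same function against a staging copy first, as in step 10, gives a match list that can be peer-reviewed before anything is changed in production.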

Log excerpts (fictional)

# batch-billing · application log · 3 Jun 2026 (fictional excerpt)
02:14:52 ERROR db.pool        connection reset by peer (orders-primary); 38 conns dropped
02:14:55 WARN  retry.wrapper  stage=invoice-generation attempt=1 failed: ConnectionError
02:15:09 INFO  db.pool        reconnected to orders-primary (now: standby-2, post-failover)
02:15:10 WARN  retry.wrapper  stage=invoice-generation attempt=2 starting (segments 7-12)
02:15:10 WARN  retry.wrapper  no idempotency key configured for stage; re-running whole stage
02:58:41 INFO  run.controller run 2026-06-03T02:00 finished status=SUCCESS rows=46,217 (+4.1%)
# reconciliation-check · alert log · 3 Jun 2026 (fictional excerpt)
04:36:02 ALERT recon.invoice  1,842 draft invoices share order_ref with another draft
04:36:02 ALERT recon.invoice  affected segments: 7,8,9,10,11,12 · run_id=2026-06-03T02:00
04:36:03 INFO  recon.invoice  dispatch queue state: HELD until 07:00 window · no drafts sent
04:36:04 INFO  pager          paging billing on-call · policy=sev-undetermined
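A short worked example shows why the row-count check stayed quiet. The row count and the +4.1% delta come from the log above. The previous night's count is back-computed from that percentage, and the 5% threshold is a placeholder, since the real threshold is not stated here; treat both as assumptions.

```python
def relative_delta(current_rows, previous_rows):
    """Relative change in row count versus the previous run."""
    return (current_rows - previous_rows) / previous_rows


def delta_alert_fires(current_rows, previous_rows, threshold):
    """True when the absolute relative change exceeds the threshold."""
    return abs(relative_delta(current_rows, previous_rows)) > threshold


CURRENT_ROWS = 46_217    # from the run.controller log line
DUPLICATES = 1_842       # from the reconciliation alert
PREVIOUS_ROWS = 44_397   # assumption: back-computed from the logged +4.1%
THRESHOLD = 0.05         # assumption: the actual threshold is not given

delta = relative_delta(CURRENT_ROWS, PREVIOUS_ROWS)
print(f"delta vs previous night: {delta:+.1%}")
print(f"duplicates as share of rows: {DUPLICATES / CURRENT_ROWS:.1%}")
print("alert fires:", delta_alert_fires(CURRENT_ROWS, PREVIOUS_ROWS, THRESHOLD))
```

Under these assumptions the duplicates account for roughly four percent of the run, which a coarse volume threshold cannot distinguish from ordinary night-to-night variation. A check keyed on order_ref, like the reconciliation check, does not have that blind spot.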

Root cause · five whys

  1. Why were duplicate invoices created? The invoice-generation stage ran twice for segments 7–12, and both runs wrote draft invoices.
  2. Why did the stage run twice? The retry wrapper re-ran the whole stage after the database failover, and the drafts already written before the failover were not recognised.
  3. Why were the existing drafts not recognised? Invoice writes carry no idempotency key, so the re-inserted rows looked like new invoices rather than retries of the same work.
  4. Why was there no idempotency key? The retry wrapper was added in 2025 for transient network blips; idempotent writes were noted as a follow-up (backlog item LH-731) but never scheduled.
  5. Why was it never scheduled? There is no engineering standard requiring retried writes to be idempotent, and LH-731 had no owner; nothing forced the gap to surface before an incident did.

Root cause: a stage-level retry without an idempotency key, executed across a database failover. This is a process gap (no idempotency standard for retried writes), not an individual error.
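To make the mechanism concrete, here is a minimal sketch of an idempotent write using SQLite from the Python standard library. The table name, columns, and sample values are hypothetical, and the production fix may look quite different; the point is only that a uniqueness constraint on (run_id, order_ref) turns a retried insert into a no-op.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE draft_invoices (
        run_id       TEXT NOT NULL,
        order_ref    TEXT NOT NULL,
        amount_cents INTEGER NOT NULL,
        UNIQUE (run_id, order_ref)   -- the idempotency key
    )
    """
)


def write_draft(conn, run_id, order_ref, amount_cents):
    """Insert a draft; return True if a row was written, False if it already existed."""
    cur = conn.execute(
        "INSERT OR IGNORE INTO draft_invoices (run_id, order_ref, amount_cents) "
        "VALUES (?, ?, ?)",
        (run_id, order_ref, amount_cents),
    )
    return cur.rowcount == 1


# First attempt writes the draft; the retry after a failover is ignored.
first = write_draft(conn, "2026-06-03T02:00", "ORD-1002", 12_500)
retry = write_draft(conn, "2026-06-03T02:00", "ORD-1002", 12_500)
count = conn.execute("SELECT COUNT(*) FROM draft_invoices").fetchone()[0]
print(first, retry, count)  # True False 1
```

Without the `UNIQUE` constraint the second call would insert a second row, which is the behaviour described in whys 2 and 3.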

Remediation · 3 of 5 complete

Action | Owner | Due | Status
Add idempotency key (run_id, order_ref) to invoice writes; re-inserts become no-ops (batch-billing#498). | M. Tanaka | 12 Jun 2026 | DONE
Change the retry wrapper to refuse automatic retry of stages not marked idempotent; such stages now halt for operator decision. | J. Lindqvist | 10 Jun 2026 | DONE
Add mid-run duplicate-reference alerting so duplicates page on-call within minutes, not at the 04:30 reconciliation. | R. Okafor | 11 Jun 2026 | DONE
Run a failure drill: trigger a database failover during a full staging batch run and verify zero duplicates end-to-end. | S. de Wit | 03 Jul 2026 | OPEN
Add "retried writes must be idempotent" to the engineering standards and the design-review checklist; sweep existing batch jobs for the same pattern. | A. Kowalczyk | 10 Jul 2026 | OPEN
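The retry-wrapper change in the second action could look roughly like the sketch below. It is an illustration under assumptions, not the code that shipped: `Stage`, `run_stage`, `OperatorDecisionRequired`, and the `idempotent` flag are hypothetical names, and the attempt limit is arbitrary.

```python
from dataclasses import dataclass
from typing import Callable


class OperatorDecisionRequired(Exception):
    """Raised when a non-idempotent stage fails and must not be auto-retried."""


@dataclass
class Stage:
    name: str
    run: Callable[[], None]
    idempotent: bool = False  # stages must opt in to automatic retry


def run_stage(stage, max_attempts=3):
    """Run a stage; retry on ConnectionError only if it is marked idempotent.

    Returns the attempt number that succeeded.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            stage.run()
            return attempt
        except ConnectionError as exc:
            if not stage.idempotent:
                raise OperatorDecisionRequired(
                    f"stage={stage.name} failed and is not marked idempotent"
                ) from exc
            if attempt == max_attempts:
                raise


def make_flaky():
    """Build a stage body that fails once with ConnectionError, then succeeds."""
    calls = {"n": 0}

    def body():
        calls["n"] += 1
        if calls["n"] == 1:
            raise ConnectionError("connection reset by peer")

    return body


print(run_stage(Stage("invoice-generation", make_flaky(), idempotent=True)))  # 2

try:
    run_stage(Stage("invoice-generation", make_flaky(), idempotent=False))
except OperatorDecisionRequired as err:
    print("halted:", err)
```

Defaulting `idempotent` to `False` means a newly added stage halts on failure until someone has deliberately marked it safe to retry.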

Customer impact statement

No customer impact. All 1,842 duplicate drafts were detected and removed before the 07:00 dispatch window opened. Zero duplicate invoices were sent, no customer data was disclosed, and no customer notification or remediation was required. Draft invoices were reconciled to source orders before the dispatch queue was resumed.

Control classification

Control | Type | Outcome
Idempotency on retried writes | Preventive | FAILED: absent; root cause
Row-count delta alerting | Detective | MISSED: +4.1% sat inside threshold
Scheduled reconciliation check | Detective | WORKED: caught all 1,842
Dispatch-window hold | Containment | WORKED: zero customer exposure

Evidence pack · attachment checklist

Blameless note

This review examined process, not people. The retry wrapper behaved exactly as designed, the on-call response was fast and careful, and the reconciliation safety net did its job. The gap was a missing standard, an unowned backlog item, and duplicate detection that ran only once a day. Those are system properties, and the remediation list above fixes the system. No individual action contributed to this incident, and none of the follow-ups are directed at a person.

How this was made: the brief, how to reproduce it, and an honesty note

The brief

From this incident channel export and my timeline notes [paste], build a
single-file HTML post-mortem for INC-2417: impact summary cards, a
minute-by-minute timeline filterable by phase (detect/escalate/mitigate/
resolve), log excerpts, a five-whys root cause chain, and a remediation
checklist with owners and due dates. Blameless tone. One file; it
becomes the archived incident record.

How to reproduce

Paste the brief into any capable LLM: GPT, Claude, Gemini, Grok, DeepSeek, or the assistant your company provides. Iterate a few rounds on layout and content until it reads well. Save the final answer as a .html file and open it in any browser. Expect similar output, not identical: every model has its own taste, and that is fine.

Honesty note

This reference artifact was built with Claude Code, an LLM coding agent, over several iterations. Treat it as the bar to aim for, not as a guaranteed first answer. All data on this page is fictional.

vishalshah.app