Pattern guide

Checkpointing

Intent

Persist progress so long-running jobs can resume without reprocessing or losing state.

When to use

  • Automations modify external systems or production data.
  • You need safe retries without duplicate effects.
  • You must prove what changed for audit or compliance.
  • Runs are long or high volume and can fail midstream.

Core mechanics

  • Fetch current state and compute a desired state.
  • Generate a diff and an explicit action plan.
  • Apply only the delta, with safeguards such as rate limits, retries, and idempotency guards.
  • Record results, errors, and timing for each action.
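
The diff step above can be sketched as follows. This is a minimal illustration, assuming current and desired state both fit in memory as dictionaries keyed by a stable identifier; the Action type and the compute_plan function are hypothetical names, not part of any particular tool.

```python
from dataclasses import dataclass
from typing import Any


@dataclass(frozen=True)
class Action:
    """One planned change (hypothetical shape, for illustration only)."""
    op: str            # "create", "update", or "delete"
    key: str           # stable identifier of the record
    before: Any = None
    after: Any = None


def compute_plan(current: dict[str, Any], desired: dict[str, Any]) -> list[Action]:
    """Return the actions needed to turn `current` into `desired`."""
    plan = []
    # Sorting the keys makes the plan order deterministic across runs.
    for key in sorted(current.keys() | desired.keys()):
        if key not in current:
            plan.append(Action("create", key, None, desired[key]))
        elif key not in desired:
            plan.append(Action("delete", key, current[key], None))
        elif current[key] != desired[key]:
            plan.append(Action("update", key, current[key], desired[key]))
        # Unchanged records produce no action, so only the delta is applied.
    return plan


current = {"a": 1, "b": 2, "c": 3}
desired = {"a": 1, "b": 20, "d": 4}
plan = compute_plan(current, desired)
for action in plan:
    print(action.op, action.key, action.before, action.after)
```

Keeping before and after on each action means the same object can feed both the dry-run report and the audit log.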

Implementation checklist

  1. Define desired state inputs and validation rules.
  2. Capture current state with stable identifiers.
  3. Compute a deterministic diff and action plan.
  4. Provide a dry-run output for review.
  5. Apply actions with rate limits and retries.
  6. Write audit logs and summarize outcomes.
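
Steps 4 and 5 might look like the sketch below. It assumes a caller-supplied apply_one callable that raises on failure; apply_actions, its parameters, and the result dictionaries are hypothetical. Rate limiting is reduced here to exponential backoff between retries, and a real job would likely also throttle between actions.

```python
import time


def apply_actions(actions, apply_one, dry_run=True, max_attempts=3,
                  base_delay=0.5, sleep=time.sleep):
    """Apply each action with retries, or only report it when dry_run is set."""
    results = []
    for action in actions:
        if dry_run:
            # Dry run: nothing is sent to the external system.
            results.append({"action": action, "status": "planned", "attempts": 0})
            continue
        for attempt in range(1, max_attempts + 1):
            try:
                apply_one(action)
                results.append({"action": action, "status": "applied",
                                "attempts": attempt})
                break
            except Exception as exc:
                if attempt == max_attempts:
                    # Out of attempts: record the failure and move on.
                    results.append({"action": action, "status": "failed",
                                    "attempts": attempt, "error": str(exc)})
                else:
                    # Exponential backoff: base_delay, 2x, 4x, ...
                    sleep(base_delay * 2 ** (attempt - 1))
    return results
```

Defaulting dry_run to True is a deliberate choice in this sketch: a caller has to opt in before anything is modified.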

Failure modes and mitigations

  • Non-idempotent actions -> add guards or uniqueness checks.
  • Partial runs -> add checkpointing and resume support.
  • API rate limits -> throttle and back off.
  • Audit gaps -> log before and after state.
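
For the partial-run case, checkpointing and resume can be sketched as below. This assumes a single worker and a local JSON file as the checkpoint store; load_checkpoint, save_checkpoint, run_with_checkpoint, and the file layout are hypothetical. A production job might use a database row or object store instead.

```python
import json
import os
import tempfile


def load_checkpoint(path):
    """Return the set of action IDs already completed (empty on first run)."""
    try:
        with open(path, encoding="utf-8") as f:
            return set(json.load(f)["done"])
    except FileNotFoundError:
        return set()


def save_checkpoint(path, done):
    """Write the checkpoint atomically: temp file first, then rename over it."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w", encoding="utf-8") as f:
        json.dump({"done": sorted(done)}, f)
    os.replace(tmp, path)  # readers see either the old or the new file


def run_with_checkpoint(action_ids, apply_one, path):
    """Apply actions in order, skipping any already recorded as done."""
    done = load_checkpoint(path)
    for action_id in action_ids:
        if action_id in done:
            continue  # completed in an earlier run
        apply_one(action_id)  # may raise; earlier progress is already saved
        done.add(action_id)
        save_checkpoint(path, done)
    return done
```

If the process dies after apply_one succeeds but before the checkpoint is saved, that one action runs again on resume, which is why the idempotency guard in the first bullet still matters.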

Observability and validation

  • Counts: planned vs applied vs failed actions.
  • Duration per phase and per record.
  • Error rate and top failure reasons.
  • Links to audit reports and logs.
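
The counts and failure reasons above could be derived from per-action results roughly as follows. The result shape (a dict with "status" and, for failures, "error") and the summarize function are assumptions made for this sketch; per-phase durations are left out.

```python
from collections import Counter


def summarize(results, top_n=3):
    """Roll per-action results up into planned/applied/failed counts."""
    statuses = Counter(r["status"] for r in results)
    planned = len(results)
    failed = statuses.get("failed", 0)
    # Group failures by error message to surface the most common reasons.
    reasons = Counter(r["error"] for r in results if r["status"] == "failed")
    return {
        "planned": planned,
        "applied": statuses.get("applied", 0),
        "failed": failed,
        "error_rate": failed / planned if planned else 0.0,
        "top_failure_reasons": reasons.most_common(top_n),
    }


results = [
    {"status": "applied"},
    {"status": "failed", "error": "timeout"},
    {"status": "failed", "error": "timeout"},
    {"status": "failed", "error": "conflict"},
]
summary = summarize(results)
print(summary)
```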

Artifacts

  • Diff report or action plan.
  • Audit log with timestamps and outcomes.
  • Summary report for stakeholders.