Event Service Agent Kata

ADR-0002: Message Broker

Status: Accepted

Executive Summary

Non-goals

Problem

Choose a broker family that satisfies the evaluation criteria below.

Context

Evaluation Criteria

Weighted criteria (weights sum to 1.00, as applied in the per-option scoring tables below): Tenancy (0.25), Delayed delivery (0.25), Ordering (0.20), Dev ergonomics (0.15), Ops footprint (0.10), Retention/replay (0.05).

Additional Attributes: see Appendix for extended comparison dimensions.

Conceptual Primer

Broker Styles & Use Cases

Log/Stream (Kafka, JetStream Streams): append-only log; consumers track their own position, which enables replay and long retention. Ordering holds per partition/subject.

Subject routing (NATS): hierarchical, dot-separated subjects with wildcard subscriptions; routing lives in the address rather than in broker topology (see the example after this list).

Queues (RabbitMQ): the broker tracks per-message delivery state; competing consumers with acks, requeues, and dead-lettering.

Push vs Pull: see Conceptual Differences. Both work; the MVP leans pull so the orchestrator controls pacing.

Dumb vs Smart: see Conceptual Differences. Use native delay/DLQ/quota features where available; keep the ports broker-neutral.
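
For example, the subject taxonomy adopted later in this ADR composes naturally with NATS wildcards (the tenant name acme is illustrative):

svc.acme.serviceCall   # one tenant's service-call events
svc.*.serviceCall      # every tenant's service-call events (one-token wildcard)
svc.>                  # everything under svc (multi-token wildcard)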

Visual

Common Flow (baseline)

sequenceDiagram
    autonumber
    participant API as API
    participant Orc as Orchestration
    participant DB as DB
    participant Out as Outbox
    participant Bus as EventBusPort
    participant BRK as Broker
    participant Exe as Execution Worker
    participant Tmr as Timer

    API->>Orc: Command (tenantId, serviceCallId)
    Orc->>DB: Persist state (single writer)
    DB-->>Orc: Commit OK
    Orc->>Out: Enqueue event in outbox
    Out->>Bus: Publish event (messageId, tenantId.serviceCallId)
    Bus->>BRK: Send (route by tenant)
    Note over BRK: Retention, DLQ, redelivery policies
    par Immediate handling
      Exe->>BRK: Subscribe / Pull
      BRK-->>Exe: Deliver next message
      Exe->>Exe: Idempotent handle + side effects
      Exe-->>BRK: Ack / Commit
    and Scheduled handling
      Orc->>Tmr: Schedule dueAt (optional)
      Tmr->>BRK: Enqueue delayed message (if broker supports)
    end
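
To make the outbox → bus → broker hop concrete, here is a minimal publish sketch in Go against nats.go (using the NATS adapter chosen later in this ADR as the example); the Event shape and function names are illustrative:

package outbox

import (
	"encoding/json"
	"fmt"

	"github.com/nats-io/nats.go"
)

// Event mirrors the envelope fields from the flow above (illustrative shape).
type Event struct {
	MessageID     string `json:"messageId"`
	TenantID      string `json:"tenantId"`
	ServiceCallID string `json:"serviceCallId"`
	Type          string `json:"type"`
}

// Publish routes by tenant and sets Nats-Msg-Id so the broker's dedupe
// window drops re-publishes of the same outbox row.
func Publish(js nats.JetStreamContext, e Event) error {
	subject := fmt.Sprintf("svc.%s.serviceCall", e.TenantID)
	data, err := json.Marshal(e)
	if err != nil {
		return err
	}
	_, err = js.Publish(subject, data, nats.MsgId(e.MessageID))
	return err
}

Reusing the outbox messageId as the broker message ID means a dispatcher that crashes after publishing but before marking the row sent can safely re-publish.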

For tenancy-isolation and failure/redelivery diagrams, see the Appendix.

Options

1. Kafka / Redpanda

Characteristics

Trade-offs

Pros:

  - Strongest ordering guarantees (per partition) and best-in-class retention/replay.
  - Mature ecosystem; Redpanda offers a lighter, single-binary variant.

Cons:

  - No native per-message delayed delivery; the external Timer service is required.
  - Heaviest operational footprint of the three options for a local-first MVP.

Integration

Use the Common Flow in this ADR; per-broker visuals are available in the Appendix.

Latency/Throughput

See Appendix for detailed latency notes.

Implications for This Project

Variants

See Appendix for variants and alternative patterns.

Tenant separation and routing (MVP → scale)

For a diagrammed view, see Appendix (Kafka MVP→Scale).

Risks & mitigations

Scoring

Criterion         Weight  Score  Weighted
Tenancy           0.25    4      1.00
Delayed delivery  0.25    2      0.50
Ordering          0.20    5      1.00
Dev ergonomics    0.15    3      0.45
Ops footprint     0.10    2      0.20
Retention/replay  0.05    5      0.25
Total                            3.40

2. NATS JetStream

Characteristics

Trade-offs

Pros:

  - Best developer ergonomics and lightest ops footprint of the three: a single binary with fast local parity.
  - Subject-based routing gives simple per-tenant addressing; per-subject ordering covers per-aggregate needs.

Cons:

  - No native per-message delayed delivery (see the corrected evaluation below); the external Timer service is required.
  - Weaker retention/replay story than Kafka.

Integration

Use the Common Flow in this ADR; per-broker visuals are available in the Appendix.

Latency/Throughput

See Appendix for detailed latency notes.

Implications for This Project

Variants

See Appendix for variants and alternative patterns.

Tenant separation and routing (MVP → scale)

For a diagrammed view, see Appendix (NATS MVP→Scale).

Scoring

Criterion         Weight  Score  Weighted
Tenancy           0.25    4      1.00
Delayed delivery  0.25    2      0.50
Ordering          0.20    4      0.80
Dev ergonomics    0.15    5      0.75
Ops footprint     0.10    5      0.50
Retention/replay  0.05    3      0.15
Total                            3.70

3. RabbitMQ (+ delayed-exchange plugin)

Characteristics

Trade-offs

Pros:

  - Delayed-exchange plugin provides broker-side delayed delivery (highest score on that criterion).
  - Solid developer ergonomics and a moderate ops footprint.

Cons:

  - Plugin dependency, with ordering constraints on delayed messages.
  - Weakest retention/replay: consumed messages are gone; there is no log to re-read.

Integration

Use the Common Flow in this ADR; per-broker visuals are available in the Appendix.

Latency/Throughput

See Appendix for detailed latency notes.

Implications for This Project

Variants

See Appendix for variants and alternative patterns.

Tenant separation and routing (MVP → scale)

For a diagrammed view, see Appendix (RabbitMQ MVP→Scale).

Scoring

Criterion         Weight  Score  Weighted
Tenancy           0.25    4      1.00
Delayed delivery  0.25    4      1.00
Ordering          0.20    3      0.60
Dev ergonomics    0.15    4      0.60
Ops footprint     0.10    4      0.40
Retention/replay  0.05    2      0.10
Total                            3.70

4. Managed Cloud Offerings (condensed)

AWS/GCP options exist (MSK/Confluent, Amazon MQ, Pub/Sub, SNS/SQS/EventBridge). See Appendix for details.

Why not for MVP

For the MVP we prefer local development parity and minimal setup friction; managed offerings trade those away for operational convenience the project does not yet need. The adapter layer keeps a path to managed services open later.


Decision Framing

Scoring recap

Kafka/Redpanda 3.40; NATS JetStream 3.70; RabbitMQ 3.70 (see the per-option tables above).

Corrected understanding: all three broker options require an external Timer service for delayed delivery. NATS JetStream does NOT have native per-message delayed delivery. RabbitMQ's delayed-exchange plugin scores higher on that criterion (4/5) but introduces a plugin dependency and ordering constraints.

Acceptance checklist

Original criteria are retained here for audit; they are superseded by the post-decision validation summary above.

Decision

Adopt NATS JetStream for MVP and near-term evolution. Rationale:

Corrected evaluation: the initial analysis incorrectly assumed NATS JetStream had native per-message delayed delivery. Investigation revealed that this capability does NOT exist. All three broker options (Kafka, NATS, RabbitMQ) require an external Timer service for scheduling.

Decision with corrected scores (NATS 3.70, tied with RabbitMQ):

Since all brokers require the Timer service anyway (~300 lines), the choice prioritizes:

  1. Dev ergonomics (NATS 5 vs RabbitMQ 4): single binary, fast local startup, simple client API.
  2. Ops footprint (NATS 5 vs RabbitMQ 4): no plugin dependency and one process to run.
  3. Broker-neutral ports: with the Timer closing the delayed-delivery gap for every option, RabbitMQ's plugin advantage no longer differentiates.

Alternative considerations:

Decision constraints & guardrails:

Lifecycle Fit (business uncertainty)

Relevance to This Project

Notes on weights moved to Appendix; criteria weights above apply.

Sensitivity notes moved to Appendix.

Consequences

Why one shared stream first

One shared JetStream stream (svc) interleaves all tenants’ subjects but preserves each tenant’s and aggregate’s ordering (ordering is per subject; handlers enforce per-key single-flight). This minimizes early operational overhead (one retention policy, simple provisioning, fast local parity) and defers complexity until real isolation pressures appear. We split to per-tenant streams or accounts only when: (a) divergent retention/ACL needs, (b) a noisy tenant dominates backlog, (c) compliance or deletion isolation is required, or (d) subject/consumer cardinality approaches operational limits.
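
Provisioning follows directly from this choice: one stream, one call. A minimal sketch assuming nats.go; the stream name, storage, and retention values are placeholders to validate, not decided defaults:

package provision

import (
	"time"

	"github.com/nats-io/nats.go"
)

// EnsureStream creates the single shared stream. All tenants' subjects
// live under svc.>; ordering remains per subject.
func EnsureStream(js nats.JetStreamContext) error {
	_, err := js.AddStream(&nats.StreamConfig{
		Name:     "SVC",
		Subjects: []string{"svc.>"},
		Storage:  nats.FileStorage,
		MaxAge:   14 * 24 * time.Hour, // placeholder retention; validate against replay needs
	})
	return err
}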

Integration Diagrams (NATS JetStream Realization)

The following diagrams make the decision concrete. They do not restate design intent already in earlier ADRs; they show how NATS shapes implementation.

A. Runtime Topology (MVP)

flowchart LR

  subgraph Monolith[Monolith]
    UI[UI] -->|HTTP| API[API]
    API -->|SubmitServiceCall| Orchestr[Orchestration]

    subgraph OrchestrationLayer
      Orchestr -->|tx write+outbox| DB[(SQLite DB)]
      DB --> OutboxDisp[Outbox Dispatcher]
      OutboxDisp -->|publish| EventBus[[EventBusPort]]
    end

    subgraph TimerLayer
      Timer[Timer Service] -->|consume ScheduleTimer| EventBus
      Timer -->|publish StartExecution @ dueAt| EventBus
    end

    subgraph ExecutionLayer
      Exec[Execution Worker\npull single-flight] -->|pull| EventBus
      Exec -->|ack/nack| EventBus
    end
  end

  EventBus -->|NATS| NATS[(JetStream\nstream: svc)]
  NATS -->|deliver| EventBus
  NATS -.->|redelivery| EventBus
  NATS -->|DLQ svc.<tenant>.dlq| DLQTool[DLQ Tool]

  classDef ext fill:#eef,stroke:#555,color:#111;
  class NATS,DLQTool ext;

Key implications: single shared stream svc; no per-tenant streams; Timer service is mandatory for delayed delivery (broker-agnostic design). Timer implementation details in ADR-0003.

B. Submit → (Optional Delay) → Execute Sequence

sequenceDiagram
  autonumber
  participant UI
  participant API
  participant Orchestr as Orchestration
  participant DB as DB+Outbox
  participant Outbox as OutboxDisp
  participant Bus as EventBusPort
  participant Timer as Timer Service
  participant NATS as JetStream
  participant Exec as ExecWorker

  UI->>API: POST /service-calls
  API->>Orchestr: SubmitServiceCall(tenantId, serviceCallId, dueAt)
  Orchestr->>DB: txn: write state (Scheduled + dueAt) + outbox row(s)
  DB-->>Orchestr: commit ok
  Outbox->>Bus: publish svc.<tenant>.serviceCall (ServiceCallSubmitted|ServiceCallScheduled)
  alt dueAt > now
    Outbox->>Bus: publish ScheduleTimer(tenantId, serviceCallId, dueAt)
    Bus->>Timer: deliver ScheduleTimer
    Timer->>Timer: schedule wake-up
    Timer->>Bus: ack
    Note over Timer: At dueAt...
    Timer->>Bus: publish StartExecution
  else dueAt <= now
    Outbox->>Bus: publish StartExecution(immediate)
  end
  Exec->>Bus: pull(1..N)
  Bus->>NATS: fetch
  NATS-->>Bus: messages (events + StartExecution interleaved)
  Bus-->>Exec: envelope (filtered in handler when type=StartExecution)
  Exec->>Exec: guard state (still Scheduled & due?)
  alt valid + not already Running/terminal
    Exec->>Exec: transition→Running & perform call
  else stale/canceled/rescheduled
    Exec->>Exec: no-op (idempotent discard)
  end
  Exec->>Bus: ack
  Bus->>NATS: ack

Implications: SubmitServiceCall is never delayed; persistence and domain events are immediate. When dueAt > now, Orchestration publishes a ScheduleTimer command; the Timer service schedules a wake-up and publishes StartExecution at dueAt. StartExecution shares the same subject, so the execution worker inspects envelope.type and acts only on StartExecution (see the sketch below). Interleaving is acceptable at MVP; if pacing or clarity issues emerge we can split subjects (see evolution notes).
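
A sketch of that filter-and-guard step, assuming illustrative Envelope and repository shapes rather than the project's real types:

package exec

import "encoding/json"

// Envelope carries the fields the worker needs (illustrative shape).
type Envelope struct {
	Type          string `json:"type"`
	TenantID      string `json:"tenantId"`
	ServiceCallID string `json:"serviceCallId"`
}

// Repo abstracts the state checks used by the guard (illustrative interface).
type Repo interface {
	IsScheduledAndDue(tenantID, serviceCallID string) (bool, error)
	TransitionToRunning(tenantID, serviceCallID string) error
}

// Handle acks everything that is not a due StartExecution as a no-op,
// so duplicate and stale deliveries are discarded idempotently.
func Handle(repo Repo, payload []byte) error {
	var env Envelope
	if err := json.Unmarshal(payload, &env); err != nil {
		return err // unparseable: let the redelivery/DLQ policy deal with it
	}
	if env.Type != "StartExecution" {
		return nil // lifecycle event on the shared subject: ignore
	}
	due, err := repo.IsScheduledAndDue(env.TenantID, env.ServiceCallID)
	if err != nil {
		return err // transient read failure: nack upstream, retry later
	}
	if !due {
		return nil // canceled/rescheduled/already Running: idempotent discard
	}
	return repo.TransitionToRunning(env.TenantID, env.ServiceCallID)
}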

Timer implementation: Internal scheduling mechanism, persistence strategy, accuracy guarantees, and failure recovery are detailed in ADR-0003.

C. Redelivery & DLQ

Redelivery policy is handled natively by JetStream via consumer configuration; the application code only acks on success.

Flow (concise):

  1. Exec pulls message (attempt=1).
  2. Handler runs under single-flight (per tenantId.serviceCallId).
  3. On success → explicit ack → message complete.
  4. On transient failure → do not ack (or negative ack with backoff) → broker schedules redelivery (attempt incremented) until MaxDeliver.
  5. On exceeding MaxDeliver (or detecting a fatal, non-retriable error) → publish original envelope (plus failure metadata) to svc.<tenant>.dlq.
  6. DLQ tooling (out of scope here) supports inspection, replay (republish to original subject), or discard.
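
A sketch of steps 1-5 as a durable pull consumer, assuming nats.go; the MaxDeliver/AckWait values are placeholders to validate, and handle stands in for the idempotent domain handler:

package redelivery

import (
	"time"

	"github.com/nats-io/nats.go"
)

const maxDeliver = 5 // placeholder; tune per workload

// Consume pulls one message at a time (orchestrator-paced), acks on success,
// naks on transient failure, and dead-letters on the final attempt.
func Consume(js nats.JetStreamContext, tenant string, handle func([]byte) error) error {
	sub, err := js.PullSubscribe(
		"svc."+tenant+".serviceCall",
		"exec-worker", // durable consumer name
		nats.MaxDeliver(maxDeliver),
		nats.AckWait(30*time.Second), // redeliver if unacked; nats.BackOff(...) gives graduated delays
	)
	if err != nil {
		return err
	}
	for {
		msgs, err := sub.Fetch(1, nats.MaxWait(5*time.Second))
		if err == nats.ErrTimeout {
			continue // nothing ready; poll again
		}
		if err != nil {
			return err
		}
		m := msgs[0]
		if herr := handle(m.Data); herr != nil {
			meta, _ := m.Metadata()
			if meta != nil && meta.NumDelivered >= maxDeliver {
				// Final attempt: hand off to the DLQ subject, then stop redelivery.
				js.Publish("svc."+tenant+".dlq", m.Data)
				m.Term()
			} else {
				m.Nak() // transient: broker schedules redelivery
			}
			continue
		}
		m.Ack()
	}
}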

Key parameters (defaults to validate):

  - MaxDeliver: total delivery attempts before dead-lettering.
  - AckWait: how long the broker waits for an ack before redelivering.
  - BackOff: optional per-attempt redelivery delays (overrides a flat AckWait when set).
  - MaxAckPending: cap on in-flight, unacked messages per consumer.

Idempotency & safety:

  - Handlers are idempotent: redelivery of an already-processed messageId is a no-op.
  - Single-flight per tenantId.serviceCallId prevents concurrent handling of one aggregate.
  - Publish-side dedupe uses the outbox messageId as the broker message ID, guarding against double-publish.

Operational notes:

  - Monitor redelivery counts and DLQ depth; a non-empty DLQ is an operator signal, not a steady state.

Escalation path (future ADR if needed): introduce a parking subject ahead of the DLQ for manual triage if recurring transient failures would otherwise flood the DLQ.

D. Subject & Consumer Evolution (minimal pattern)

MVP adopts the most minimal workable taxonomy:

svc.<tenant>.serviceCall   # all domain events + StartExecution (immediate or scheduled)
svc.<tenant>.dlq           # dead letters (after MaxDeliver or fatal)

Rationale:

  - Fewest subjects and consumers to provision; one retention policy (see "Why one shared stream first").
  - Per-subject ordering covers per-tenant, per-aggregate needs without extra topology.

Trade-offs (accepted for MVP):

  - StartExecution interleaves with lifecycle events on one subject; the worker filters on envelope.type.
  - A single retry/backoff policy applies to both message kinds until subjects are split.

Evolution triggers to split subjects (defer until one is true):

  1. Different retry/backoff policy needed for StartExecution vs lifecycle events.
  2. High volume of lifecycle events causing StartExecution latency (measured queueing delay > SLA).
  3. Need to restrict a consumer to ONLY domain events without payload inspection for performance/security reasons.

First split path (if triggered):

svc.<tenant>.serviceCall.events
svc.<tenant>.serviceCall.start
svc.<tenant>.dlq

Explanation Summary

Design Implications

The EventBusPort is retained conceptually (publish, subscribePull/Push); the full interfaces live in the Appendix.
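
A minimal port sketch, assuming Go; the names and signatures here are illustrative, not the Appendix interfaces:

package ports

import "context"

// Envelope is the broker-neutral message shape (illustrative fields).
type Envelope struct {
	MessageID string
	Subject   string // e.g. svc.<tenant>.serviceCall
	Payload   []byte
}

// Handler returns nil to ack; a non-nil error defers to the redelivery policy.
type Handler func(ctx context.Context, env Envelope) error

// EventBusPort keeps the domain unaware of the broker behind it.
type EventBusPort interface {
	Publish(ctx context.Context, env Envelope) error
	// SubscribePull fetches up to max messages per poll (consumer-paced).
	SubscribePull(ctx context.Context, subject, durable string, max int, h Handler) error
	// SubscribePush lets the broker drive delivery.
	SubscribePush(ctx context.Context, subject string, h Handler) error
}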

Operational Considerations

Tenancy & Routing Strategy (adapter-level)

Glossary