Event Service Agent Kata

ADR-0003: Timer Strategy

Status: Accepted

Problem

The Service Agent domain requires scheduled execution of service calls: a user submits a service call with a dueAt timestamp, and the system must initiate execution at or shortly after that time.

Core Requirements:

  1. Schedule registration: Accept (tenantId, serviceCallId, dueTimeMs) and persist it durably
  2. Due detection: Identify when dueTimeMs <= now and emit a signal to trigger execution
  3. Durability: Survive process crashes/restarts without losing scheduled timers
  4. At-least-once delivery: Emit the due signal at least once; duplicates are acceptable (Orchestration guards against them)
  5. Idempotency: Multiple schedule requests for the same (tenantId, serviceCallId) don't create duplicates

Domain Constraints:

Key Question: How do we implement durable, at-least-once, single-shot timers that remain broker-agnostic (per ADR-0002)?


Context

Timer Module Responsibility

Timer is a supporting Infrastructure module that tracks future due times and emits [DueTimeReached] events when the due time arrives. It must remain broker-agnostic and stateless with respect to domain logic.

Timer Lifecycle (per serviceCallId):

stateDiagram-v2
    [*] --> Pending: ScheduleTimer(tenantId, serviceCallId, dueAt)
    Pending --> Firing: time passes (now >= dueAt)
    Firing --> [*]: emit DueTimeReached(tenantId, serviceCallId)

Message Contract:
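
The exact wire format is not pinned down here; a minimal sketch of the two messages, assuming the TenantId and ServiceCallId branded types used in the data model below (any field beyond tenantId, serviceCallId, and dueTimeMs is illustrative, not part of the contract):

```typescript ignore
// Sketch only -- not the canonical contract.
type ScheduleTimer = {
  readonly tenantId: TenantId
  readonly serviceCallId: ServiceCallId
  readonly dueTimeMs: number // epoch milliseconds; persisted by Timer as dueAt
}

type DueTimeReached = {
  readonly tenantId: TenantId
  readonly serviceCallId: ServiceCallId
  readonly reachedAtMs: number // illustrative: when Timer observed dueAt <= now
}
```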

Broker Landscape

ADR-0002: Broker Selection evaluated the broker options and concluded that all of them require an external Timer service. NATS JetStream was chosen for the MVP, but the Timer must remain broker-agnostic for architectural flexibility.

MVP Constraints

Out of Scope:


Approach: First Principles Analysis

How do we detect when now >= dueAt?

Option 1: External Notification (Push)

Option 2: Self-Checking (Poll)

Can we optimize with in-memory scheduling?

Idea: Load schedules into memory and use setTimeout-like timers for exact wake-ups.

Analysis:

Conclusion

Converged Design:

1. Subscribe to ScheduleTimer commands from broker
2. On command: Persist to storage (upsert by tenantId, serviceCallId)
3. Worker loop (every 5s):
   - Query: SELECT WHERE dueAt <= now AND state = 'Scheduled'
   - For each: Publish DueTimeReached, update state = 'Reached'
4. On restart: Resume worker loop (durable in DB)
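
A minimal sketch of steps 1–4, assuming hypothetical db helpers (upsertSchedule, findDueSchedules, markReached), a toDueTimeReachedEnvelope helper, and the broker-agnostic eventBus from ADR-0002; all names are illustrative:

```typescript ignore
async function startTimer(db: TimerStore, eventBus: EventBus): Promise<void> {
  // 1-2. Subscribe to ScheduleTimer and persist (upsert by tenantId, serviceCallId)
  await eventBus.subscribe('ScheduleTimer', async (cmd: ScheduleTimer) => {
    await db.upsertSchedule(cmd.tenantId, cmd.serviceCallId, cmd.dueTimeMs)
  })

  // 3. Worker loop every 5s: find due schedules, publish, then mark as Reached
  setInterval(async () => {
    const due = await db.findDueSchedules(Date.now(), 100) // dueAt <= now AND state = 'Scheduled'
    for (const schedule of due) {
      await eventBus.publish([toDueTimeReachedEnvelope(schedule)]) // publish first (at-least-once)
      await db.markReached(schedule.tenantId, schedule.serviceCallId)
    }
  }, 5_000)

  // 4. On restart, nothing special: the loop resumes and picks up anything already due
}
```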

Why this design:

Requirements satisfied:


Decision

Adopt periodic polling (5-second interval) as the Timer detection mechanism.

Rationale:

First-principles analysis shows this is the only viable path that satisfies all MVP constraints:

Why NOT alternatives:

Acceptable trade-offs:

Evolution path: Can optimize later with distributed coordination or in-memory scheduling when scale demands it.


Key Design Choices

1. Storage Model: Shared DB with Strong Boundaries

Choice: Single SQLite database (event_service.db) with Timer owning timer_schedules table.

Why shared database in MVP:

Why strong boundaries despite shared database:

Boundaries (enforced in code):

Migration Path:

Phase 1 (MVP): Single event_service.db + strong code boundaries
              ↓ (Preserves architectural discipline)
Phase N: Extract to separate timer.db (already event-driven, denormalized)
         ↓ (No code changes needed, just database split)
Final: timer.db + timer_service running independently

2. Data Model

Why this schema design:

Composite Primary Key (tenantId, serviceCallId):

State field (Scheduled | Reached):

Timestamps (dueAt, registeredAt, reachedAt):

CorrelationId (optional):

Entity-Relationship Diagram:

erDiagram
    TENANTS {
        string tenant_id PK "Tenant identifier"
    }

    SERVICE_CALLS {
        string tenant_id PK,FK "Tenant owner"
        string service_call_id PK "Aggregate identity"
    }

    TIMER_SCHEDULES {
        string tenant_id PK,FK "References TENANTS"
        string service_call_id PK,FK "References SERVICE_CALLS"
        timestamp due_at "When to fire (indexed)"
        string state "Scheduled | Reached"
        timestamp registered_at "When schedule was created"
        timestamp reached_at "When event was published (nullable)"
        string correlation_id "For distributed tracing (nullable)"
    }

    TENANTS ||--o{ SERVICE_CALLS : "owns"
    TENANTS ||--o{ TIMER_SCHEDULES : "scopes"
    SERVICE_CALLS ||--|| TIMER_SCHEDULES : "has exactly one"

Table: timer_schedules

```typescript ignore
type TimerSchedule = {
  // Composite Primary Key
  readonly tenantId: TenantId // Why first: Tenant-scoped queries
  readonly serviceCallId: ServiceCallId // Why: One timer per ServiceCall

  // Schedule
  readonly dueAt: timestamp // Why indexed: Polling query performance
  state: 'Scheduled' | 'Reached' // Why: Idempotent processing via state filter

  // Audit
  readonly registeredAt: timestamp // Why: Latency metrics
  reachedAt?: timestamp // Why nullable: Only set after firing

  // Observability
  readonly correlationId?: string // Why optional: System timers lack context
}
```

Indexes:
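
The concrete DDL is not spelled out in this ADR; a possible sketch, assuming SQLite accessed via better-sqlite3, with a composite index on (state, due_at) to serve the polling query below (index name and column types are assumptions):

```typescript
// Illustrative DDL, not the canonical migration.
import Database from 'better-sqlite3'

const db = new Database('./event_service.db')
db.exec(`
  CREATE TABLE IF NOT EXISTS timer_schedules (
    tenant_id       TEXT NOT NULL,
    service_call_id TEXT NOT NULL,
    due_at          TEXT NOT NULL,                      -- comparable with datetime('now')
    state           TEXT NOT NULL DEFAULT 'Scheduled',  -- 'Scheduled' | 'Reached'
    registered_at   TEXT NOT NULL,
    reached_at      TEXT,                               -- nullable: set when fired
    correlation_id  TEXT,                               -- nullable: tracing
    PRIMARY KEY (tenant_id, service_call_id)
  );
  CREATE INDEX IF NOT EXISTS idx_timer_schedules_state_due
    ON timer_schedules (state, due_at);
`)
```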

Critical Polling Query:

-- Why this query structure:
-- 1. state = 'Scheduled' filter leverages index (excludes Reached)
-- 2. due_at <= datetime('now') uses index range scan (efficient)
-- 3. ORDER BY due_at ASC fires oldest timers first (fairness)
-- 4. LIMIT 100 prevents memory issues on large batches (safety)

SELECT tenant_id, service_call_id, due_at, correlation_id
FROM timer_schedules
WHERE state = 'Scheduled' AND due_at <= datetime('now')
ORDER BY due_at ASC
LIMIT 100;
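
In code, the worker would typically run this as a prepared statement; a sketch assuming better-sqlite3 (synchronous API), with the batch size passed as a parameter:

```typescript
import Database from 'better-sqlite3'

const db = new Database('./event_service.db')

// Prepared once, reused on every polling tick.
const findDue = db.prepare(`
  SELECT tenant_id, service_call_id, due_at, correlation_id
  FROM timer_schedules
  WHERE state = 'Scheduled' AND due_at <= datetime('now')
  ORDER BY due_at ASC
  LIMIT ?
`)

// One poll: returns at most batchSize due schedules, oldest first.
export function pollDueSchedules(batchSize = 100) {
  return findDue.all(batchSize) as Array<{
    tenant_id: string
    service_call_id: string
    due_at: string
    correlation_id: string | null
  }>
}
```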

3. Command Handler vs Polling Worker

Observation: Timer has two distinct responsibilities with different characteristics.

Command Handler (Reactive, Lightweight):

Polling Worker (Autonomous, Heavy):

Benefits of Separation:

  1. Different performance profiles (low-latency writes vs batch processing)
  2. Isolated failures (command fails ≠ polling fails)
  3. Independent testing strategies
  4. Clear scaling paths (command handler scales horizontally, worker stays single instance)

MVP Deployment: Both run in the same process, on separate threads.


4. State Transition Pattern

Critical Decision: How to handle Publish → Update atomicity?

Pattern: Publish-then-update (accept at-least-once duplicates)

```typescript ignore
for (const schedule of dueSchedules) {
  // Envelope is self-contained with all routing metadata
  const envelope = {
    id: generateEnvelopeId(),
    type: 'DueTimeReached',
    tenantId: schedule.tenantId,
    correlationId: schedule.correlationId, // Optional
    timestampMs: now,
    payload: DueTimeReached(schedule),
  }

  await eventBus.publish([envelope]) // First
  await db.update("SET state = 'Reached' WHERE ...") // Second
}
```

Failure Scenarios:

| Scenario | Outcome | Acceptable? |
| --- | --- | --- |
| Publish succeeds, update fails | Duplicate event on the next poll | ✅ Yes |
| Update succeeds, publish fails (only possible with update-first ordering) | Event never sent | ❌ No |
| Crash after publish, before update | Duplicate event on restart | ✅ Yes |

Rationale:

Consequence: Orchestration MUST be idempotent on DueTimeReached.
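
What that guard might look like on the Orchestration side (hypothetical names; the actual check depends on the Orchestration state model):

```typescript ignore
// Hypothetical consumer-side guard: a duplicate DueTimeReached is a no-op.
async function onDueTimeReached(event: DueTimeReached): Promise<void> {
  const call = await orchestrationStore.get(event.tenantId, event.serviceCallId)
  if (call === undefined || call.status !== 'Scheduled') {
    return // already executing, executed, or unknown: ignore the duplicate
  }
  await startExecution(call)
}
```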


5. Idempotency Edge Case

Question: What if duplicate ScheduleTimer arrives after timer already fired?

Decision: Ignore the request if state = 'Reached'

INSERT INTO timer_schedules (...)
VALUES (?, ?, ?, 'Scheduled')
ON CONFLICT (tenant_id, service_call_id)
DO UPDATE SET due_at = EXCLUDED.due_at
WHERE timer_schedules.state = 'Scheduled';
-- WHERE clause prevents updating 'Reached' schedules

Rationale: Domain constraint “no rescheduling after execution starts.”
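
Wired into the command handler, the upsert might look like this (a sketch assuming better-sqlite3, with dueTimeMs converted to seconds so datetime(?, 'unixepoch') matches the datetime('now') text format; column list and parameter names are illustrative):

```typescript
import Database from 'better-sqlite3'

const db = new Database('./event_service.db')

// Upsert that refuses to touch schedules that already fired (state = 'Reached').
const upsert = db.prepare(`
  INSERT INTO timer_schedules
    (tenant_id, service_call_id, due_at, state, registered_at, correlation_id)
  VALUES
    (?, ?, datetime(?, 'unixepoch'), 'Scheduled', datetime('now'), ?)
  ON CONFLICT (tenant_id, service_call_id)
  DO UPDATE SET due_at = EXCLUDED.due_at
  WHERE timer_schedules.state = 'Scheduled'
`)

export function handleScheduleTimer(
  tenantId: string,
  serviceCallId: string,
  dueTimeMs: number,
  correlationId?: string,
): void {
  upsert.run(tenantId, serviceCallId, Math.floor(dueTimeMs / 1000), correlationId ?? null)
}
```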


Consequences

MVP Scope (What We Build)

Core Functionality:

Minimal Error Handling:

Deployment:

What This Gives Us: Satisfies all 5 core requirements for dozens to hundreds of tenants.


Deferred to Future

Why Defer: MVP constraints make simple solution sufficient. Build what we need, not what we might need.


Scale Limits (When MVP Breaks)

The design stops being sufficient when:

  1. Tenant count > ~500: Polling query becomes I/O bound → partition by tenant
  2. Schedule density > 1000/min: Can’t process batch in 5s → reduce interval or increase batch size
  3. Time horizon extends to days/weeks: Table grows large → tiered storage (near-term in memory)
  4. HA required (99.9% uptime): Single instance is SPOF → leader election
  5. Command rate > 100/sec: Write contention → horizontal scaling of command handler

Key Insight: These limits are well beyond MVP scope. Room to grow before needing complexity.


Integration Impact

Orchestration Module:

Execution Module:

API Module:


Operational Requirements

Deployment:

Monitoring:

Configuration:

TIMER_POLLING_INTERVAL=5000        # 5s default
TIMER_BATCH_SIZE=100               # Max per poll
TIMER_DB_PATH=./event_service.db
TIMER_BROKER_URL=nats://localhost:4222
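
A small sketch of reading these settings at startup, assuming Node's process.env and the defaults listed above (loader name and type are illustrative):

```typescript
export type TimerConfig = {
  pollingIntervalMs: number
  batchSize: number
  dbPath: string
  brokerUrl: string
}

// Hypothetical config loader; variable names match the table above.
export function loadTimerConfig(env: NodeJS.ProcessEnv = process.env): TimerConfig {
  return {
    pollingIntervalMs: Number(env.TIMER_POLLING_INTERVAL ?? 5000),
    batchSize: Number(env.TIMER_BATCH_SIZE ?? 100),
    dbPath: env.TIMER_DB_PATH ?? './event_service.db',
    brokerUrl: env.TIMER_BROKER_URL ?? 'nats://localhost:4222',
  }
}
```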

References