Status: Accepted
The Service Agent domain requires scheduled execution of service calls: a user submits a service call with a dueAt timestamp, and the system must initiate execution at or shortly after that time.
Core Requirements:
- Accept `(tenantId, serviceCallId, dueTimeMs)` and persist durably
- Detect `dueTimeMs <= now` and emit a signal to trigger execution
- Duplicate submissions for the same `(tenantId, serviceCallId)` don't create duplicates

Domain Constraints:
- All messages are scoped by `tenantId`
- Consume `ScheduleTimer` commands, publish `DueTimeReached` events via broker

Key Question: How do we implement durable, at-least-once, single-shot timers that remain broker-agnostic (per ADR-0002)?
Timer is an Infrastructure supporting module that tracks future due times and emits `DueTimeReached` events when the time arrives. It must remain broker-agnostic and stateless with respect to domain logic.
Timer Lifecycle (per serviceCallId):
stateDiagram-v2
[*] --> Pending: ScheduleTimer(tenantId, serviceCallId, dueAt)
Pending --> Firing: time passes (now >= dueAt)
Firing --> [*]: emit DueTimeReached(tenantId, serviceCallId)
Message Contract:
- `ScheduleTimer` command → `{ tenantId, serviceCallId, dueAt: ISO8601 }`
- `DueTimeReached` event → `{ tenantId, serviceCallId, reachedAt?: ISO8601 }`
- Keying on `(tenantId, serviceCallId)` ensures idempotency

Context: ADR-0002 (Broker Selection) evaluated broker options and concluded that every candidate broker requires an external Timer service. NATS JetStream was chosen for the MVP, but Timer must remain broker-agnostic for architectural flexibility.
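A minimal sketch of these contracts as TypeScript types (the `TenantId` / `ServiceCallId` aliases are assumptions; the contract above only fixes the field names):

```typescript ignore
// Hypothetical aliases; the real codebase may use branded types.
type TenantId = string
type ServiceCallId = string
type ISO8601 = string // e.g. '2025-01-15T10:30:00Z'

// Command consumed by Timer.
type ScheduleTimer = {
  tenantId: TenantId
  serviceCallId: ServiceCallId
  dueAt: ISO8601
}

// Event published by Timer when dueAt is reached.
type DueTimeReached = {
  tenantId: TenantId
  serviceCallId: ServiCeCallId extends never ? never : ServiceCallId
  reachedAt?: ISO8601
}
```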
Out of Scope:
How does the Timer detect that `now >= dueAt`?

Option 1: External Notification (Push)
- Examples: pg_cron, OS schedulers, cloud scheduler services

Option 2: Self-Checking (Poll)
- Periodically query storage: `WHERE dueAt <= now`

Option 3: In-Memory Timers

Idea: Load schedules into memory, use setTimeout-like timers for exact wake-ups.
Analysis:
Converged Design:
1. Subscribe to ScheduleTimer commands from broker
2. On command: Persist to storage (upsert by tenantId, serviceCallId)
3. Worker loop (every 5s):
- Query: SELECT WHERE dueAt <= now AND state = 'Scheduled'
- For each: Publish DueTimeReached, update state = 'Reached'
4. On restart: Resume worker loop (durable in DB)
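A minimal wiring sketch of this design (illustrative only: `broker` is an assumed abstraction, and `onScheduleTimer` / `pollOnce` are sketched in later sections):

```typescript ignore
const POLL_INTERVAL_MS = 5_000

async function startTimerModule(): Promise<void> {
  // Reactive side: persist a schedule whenever a command arrives (step 2).
  await broker.subscribe('ScheduleTimer', onScheduleTimer)

  // Autonomous side: check for due schedules every 5 seconds (step 3).
  // On restart this simply resumes (step 4); all state lives in the DB.
  setInterval(() => {
    pollOnce().catch((err) => console.error('timer poll failed', err))
  }, POLL_INTERVAL_MS)
}
```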
Why this design:
Requirements satisfied:
- Idempotency: upsert on `(tenantId, serviceCallId)` + state filtering

Decision: Adopt periodic polling (5-second interval) as the Timer detection mechanism.
Rationale:
First-principles analysis shows polling is the only viable path that satisfies all MVP constraints:
Why NOT alternatives:
- setTimeout: Lost on restart (durability violated)
- pg_cron: Couples to PostgreSQL (we use SQLite for MVP)

Acceptable trade-offs:
Evolution path: Can optimize later with distributed coordination or in-memory scheduling when scale demands it.
Choice: Single SQLite database (event_service.db) with Timer owning timer_schedules table.
Why shared database in MVP:
Why strong boundaries despite shared database:
Boundaries (enforced in code):
- Other modules schedule work only via the `ScheduleTimer` command; `timer_schedules` is written exclusively by Timer
Migration Path:
Phase 1 (MVP): Single event_service.db + strong code boundaries
↓ (Preserves architectural discipline)
Phase N: Extract to separate timer.db (already event-driven, denormalized)
↓ (No code changes needed, just database split)
Final: timer.db + timer_service running independently
Why this schema design:
Composite Primary Key (tenantId, serviceCallId):
State field (Scheduled | Reached):
Timestamps (dueAt, registeredAt, reachedAt):
CorrelationId (optional):
Entity-Relationship Diagram:
erDiagram
TENANTS {
string tenant_id PK "Tenant identifier"
}
SERVICE_CALLS {
string tenant_id PK,FK "Tenant owner"
string service_call_id PK "Aggregate identity"
}
TIMER_SCHEDULES {
string tenant_id PK,FK "References TENANTS"
string service_call_id PK,FK "References SERVICE_CALLS"
timestamp due_at "When to fire (indexed)"
string state "Scheduled | Reached"
timestamp registered_at "When schedule was created"
timestamp reached_at "When event was published (nullable)"
string correlation_id "For distributed tracing (nullable)"
}
TENANTS ||--o{ SERVICE_CALLS : "owns"
TENANTS ||--o{ TIMER_SCHEDULES : "scopes"
SERVICE_CALLS ||--|| TIMER_SCHEDULES : "has exactly one"
Table: timer_schedules
```typescript ignore
type TimerSchedule = {
  // Composite Primary Key
  readonly tenantId: TenantId           // Why first: Tenant-scoped queries
  readonly serviceCallId: ServiceCallId // Why: One timer per ServiceCall

  // Schedule
  readonly dueAt: timestamp        // Why indexed: Polling query performance
  state: 'Scheduled' | 'Reached'   // Why: Idempotent processing via state filter

  // Audit
  readonly registeredAt: timestamp // Why: Latency metrics
  reachedAt?: timestamp            // Why nullable: Only set after firing

  // Observability
  readonly correlationId?: string  // Why optional: System timers lack context
}
```
Indexes:
- `(state, dueAt)` — for the polling query (critical for performance)
- `(tenantId, serviceCallId)` — primary key (business invariant)
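For concreteness, a sketch of the corresponding DDL wrapped in better-sqlite3 (the actual migration tooling is an open choice; column types follow the schema above):

```typescript ignore
import Database from 'better-sqlite3'

const db = new Database(process.env.TIMER_DB_PATH ?? './event_service.db')

db.exec(`
  CREATE TABLE IF NOT EXISTS timer_schedules (
    tenant_id       TEXT NOT NULL,
    service_call_id TEXT NOT NULL,
    due_at          TEXT NOT NULL,            -- UTC, same format as datetime('now')
    state           TEXT NOT NULL DEFAULT 'Scheduled'
                    CHECK (state IN ('Scheduled', 'Reached')),
    registered_at   TEXT NOT NULL,
    reached_at      TEXT,                     -- set when event is published
    correlation_id  TEXT,                     -- optional tracing context
    PRIMARY KEY (tenant_id, service_call_id)  -- business invariant
  );

  -- Backs the polling query: state filter + due_at range scan.
  CREATE INDEX IF NOT EXISTS idx_timer_schedules_state_due
    ON timer_schedules (state, due_at);
`)
```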
Critical Polling Query:
-- Why this query structure:
-- 1. state = 'Scheduled' filter leverages index (excludes Reached)
-- 2. due_at <= datetime('now') uses index range scan (efficient)
-- 3. ORDER BY due_at ASC fires oldest timers first (fairness)
-- 4. LIMIT 100 prevents memory issues on large batches (safety)
SELECT tenant_id, service_call_id, due_at, correlation_id
FROM timer_schedules
WHERE state = 'Scheduled' AND due_at <= datetime('now')
ORDER BY due_at ASC
LIMIT 100;
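Executed from the worker, this query feeds the publish-then-update loop shown below (sketch: `db` is the better-sqlite3 handle from the migration sketch, `fire` is a hypothetical helper wrapping publish-then-update):

```typescript ignore
const selectDue = db.prepare(`
  SELECT tenant_id, service_call_id, due_at, correlation_id
  FROM timer_schedules
  WHERE state = 'Scheduled' AND due_at <= datetime('now')
  ORDER BY due_at ASC
  LIMIT 100
`)

async function pollOnce(): Promise<void> {
  for (const schedule of selectDue.all()) {
    await fire(schedule) // publish DueTimeReached, then mark 'Reached'
  }
}
```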
Observation: Timer has two distinct responsibilities with different characteristics.
Command Handler (Reactive, Lightweight):
Polling Worker (Autonomous, Heavy):
Benefits of Separation:
MVP Deployment: Both in same process, different threads.
Critical Decision: How to handle Publish → Update atomicity?
Pattern: Publish-then-update (accept at-least-once duplicates)
```typescript ignore
for (const schedule of dueSchedules) {
  // Envelope is self-contained with all routing metadata
  const envelope = {
    id: generateEnvelopeId(),
    type: 'DueTimeReached',
    tenantId: schedule.tenantId,
    correlationId: schedule.correlationId, // Optional
    timestampMs: now,
    payload: DueTimeReached(schedule),
  }

  await eventBus.publish([envelope]) // First
  await db.update("SET state = 'Reached' WHERE ...") // Second
}
```
Failure Scenarios:
| Scenario | Outcome | Acceptable? |
|---|---|---|
| Publish succeeds, update fails | Duplicate event next poll | ✅ Yes |
| Update succeeds, publish fails | Event never sent | ❌ No |
| Crash after publish | Duplicate event on restart | ✅ Yes |
Rationale:
Consequence: Orchestration MUST be idempotent on DueTimeReached.
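For example, Orchestration can guard on its own aggregate state so a duplicate event becomes a no-op (illustrative sketch; `serviceCalls`, `startExecution`, and the `AwaitingDueTime` status are hypothetical names):

```typescript ignore
async function onDueTimeReached(event: DueTimeReached): Promise<void> {
  const call = await serviceCalls.load(event.tenantId, event.serviceCallId)

  // Duplicate delivery: execution has already started; ignore.
  if (call.status !== 'AwaitingDueTime') return

  await startExecution(call)
}
```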
Question: What if duplicate ScheduleTimer arrives after timer already fired?
Decision: Ignore if `state = 'Reached'`
INSERT INTO timer_schedules (...)
VALUES (?, ?, ?, 'Scheduled')
ON CONFLICT (tenant_id, service_call_id)
DO UPDATE SET due_at = EXCLUDED.due_at
WHERE timer_schedules.state = 'Scheduled';
-- WHERE clause prevents updating 'Reached' schedules
Rationale: Domain constraint “no rescheduling after execution starts.”
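A sketch of the command handler applying this upsert (prepared statement via better-sqlite3; `nowIso` and the envelope-supplied `correlationId` are assumptions):

```typescript ignore
const upsert = db.prepare(`
  INSERT INTO timer_schedules
    (tenant_id, service_call_id, due_at, state, registered_at, correlation_id)
  VALUES (?, ?, ?, 'Scheduled', ?, ?)
  ON CONFLICT (tenant_id, service_call_id)
  DO UPDATE SET due_at = EXCLUDED.due_at
  WHERE timer_schedules.state = 'Scheduled'
`)

async function onScheduleTimer(cmd: ScheduleTimer, correlationId?: string) {
  // Idempotent by primary key; the WHERE clause silently ignores
  // reschedules once the timer has already fired.
  // nowIso(): hypothetical clock helper matching the DB's datetime format.
  upsert.run(cmd.tenantId, cmd.serviceCallId, cmd.dueAt, nowIso(), correlationId ?? null)
}
```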
Core Functionality:
- Consumes `ScheduleTimer` via broker
- Persists to the `timer_schedules` table in the shared DB
- Tracks state transition `Scheduled → Reached`
- Idempotent on `(tenantId, serviceCallId)`

Minimal Error Handling:
Deployment:
What This Gives Us: Satisfies all 5 core requirements for dozens to hundreds of tenants.
Why Defer: MVP constraints make simple solution sufficient. Build what we need, not what we might need.
The design stops being sufficient when:
Key Insight: These limits are well beyond MVP scope. Room to grow before needing complexity.
Orchestration Module:
- Consumes `DueTimeReached` events (idempotently, per the consequence above)

Execution Module:
API Module:
Deployment:
- Deployed as `timer-service`

Monitoring:
- Metrics: `timer.commands.received`, `timer.schedules.processed`, `timer.polling.duration`

Configuration:
TIMER_POLLING_INTERVAL=5000 # 5s default
TIMER_BATCH_SIZE=100 # Max per poll
TIMER_DB_PATH=./event_service.db
TIMER_BROKER_URL=nats://localhost:4222
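These could be read at startup with plain env parsing (sketch; falls back to the documented defaults):

```typescript ignore
const config = {
  pollingIntervalMs: Number(process.env.TIMER_POLLING_INTERVAL ?? 5000),
  batchSize: Number(process.env.TIMER_BATCH_SIZE ?? 100),
  dbPath: process.env.TIMER_DB_PATH ?? './event_service.db',
  brokerUrl: process.env.TIMER_BROKER_URL ?? 'nats://localhost:4222',
}
```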