The Reconnection Storm Problem

Mobile WebSocket clients face a hostile environment: cellular handoffs, app backgrounding, network switches between WiFi and LTE. A typical chat or collaboration app might see 15–30 reconnection attempts per user per hour in real-world conditions. Without careful design, each reconnect can trigger duplicate message deliveries, phantom notifications, or corrupted application state.

The naive approach of replaying all pending operations on reconnect fails spectacularly at scale. Consider a voice messaging app where a user records a 30-second clip, taps send, then immediately walks into an elevator. Each upload attempt reaches the server, but the connection drops before the acknowledgment comes back; the client retries three times over 45 seconds before an acknowledgment finally arrives. Without idempotency, the server processes four upload requests, charges the user four times for transcription API calls, and delivers four copies to the recipient.

Client-Side Token Generation

Every state-mutating operation needs a unique, client-generated idempotency key before transmission. The key must be:

  • Stable per operation: generated once and reused unchanged for every retry attempt
  • Collision-resistant: UUIDv4 or timestamp + counter + device ID
  • Persisted locally: survive app crashes and force-quits

In Flutter apps shipping real-time features, a typical implementation stores pending operations in a local SQLite queue with generated keys. When the WebSocket connects, the client drains the queue, attaching the stored key to each message payload. If the connection drops mid-send, the same key accompanies the retry; the server's job is to recognize it.

{
  "type": "message.send",
  "idempotency_key": "a3f2c891-4b2e-4d1f-9c3a-7e8f6d5c4b3a",
  "channel_id": "ch_prod_8x2k",
  "content": "Meeting confirmed for 3pm",
  "client_timestamp": 1704123456789
}

The client_timestamp provides a secondary signal for ordering and deduplication window logic, discussed below.
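
The queue behavior described above can be sketched in a few lines. This is a minimal illustration rather than a reference implementation: the PendingQueue class and its method names are hypothetical, and an in-memory Map stands in for the persistent SQLite table a real client would use.

```javascript
const { randomUUID } = require('node:crypto');

// Hypothetical in-memory stand-in for the persistent (e.g. SQLite) queue.
class PendingQueue {
  constructor() {
    this.ops = new Map(); // idempotency_key -> payload, in insertion order
  }

  // Assign the key once, at enqueue time; retries never generate a new one.
  enqueue(type, fields) {
    const key = randomUUID();
    this.ops.set(key, {
      ...fields,
      type,
      idempotency_key: key,
      client_timestamp: Date.now(),
    });
    return key;
  }

  // Called on every (re)connect: send everything still unacknowledged.
  drain(send) {
    for (const payload of this.ops.values()) {
      send(JSON.stringify(payload));
    }
  }

  // Remove an operation only once the server has acknowledged its key.
  ack(key) {
    this.ops.delete(key);
  }
}
```

Because the key is stored with the operation instead of being generated at send time, draining the queue twice produces byte-identical keys, which is what lets the server recognize the retry.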

Server-Side Deduplication Window

The server maintains a time-bounded cache of processed idempotency keys, typically implemented as Redis string keys written with SET NX and a TTL so that entries expire automatically. The window size represents a trade-off: too short and legitimate retries after network partitions get processed twice; too long and memory pressure grows with active user count.

For messaging apps, a 5-minute window covers 99.7% of reconnection scenarios based on production telemetry from apps handling medical consultations and customer support. High-value financial transactions might extend to 30 minutes. The calculation depends on p99 reconnection latency plus a safety margin.
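
As a rough sketch of that calculation, the window can be derived from observed reconnection latencies. The function names and the 60-second default margin are assumptions for illustration, not values from the telemetry mentioned above.

```javascript
// Nearest-rank percentile over observed reconnection latencies (in seconds).
function percentile(samples, p) {
  const sorted = [...samples].sort((a, b) => a - b);
  const rank = Math.ceil((p * sorted.length) / 100);
  return sorted[Math.max(rank, 1) - 1];
}

// Window = p99 reconnection latency plus a fixed safety margin.
// The 60-second default margin is an assumed value, not a recommendation.
function dedupWindowSeconds(latencies, marginSeconds = 60) {
  return Math.ceil(percentile(latencies, 99) + marginSeconds);
}
```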

When a message arrives, the server performs an atomic check-and-set:

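const { randomUUID } = require('node:crypto');

// Assumes `redis` is an ioredis-style client and `db` is the message store.
async function handleMessage(idempotencyKey, payload) {
  const cacheKey = `idem:${idempotencyKey}`;
  // Generate the ID up front so it can be stored together with the key.
  // Assumption: db.messages.insert accepts a caller-supplied id.
  const messageId = randomUUID();

  // Atomic check-and-set: resolves to 'OK' if the key was new, null otherwise.
  const claimed = await redis.set(
    cacheKey,
    messageId,
    'EX', 300, // expire after 5 minutes
    'NX'       // only set if not exists
  );

  if (claimed === null) {
    // Duplicate: return the ID recorded by the first attempt.
    const cachedId = await redis.get(cacheKey);
    return { duplicate: true, messageId: cachedId };
  }

  // First time this key is seen: process the message.
  await db.messages.insert({ ...payload, id: messageId });
  return { duplicate: false, messageId };
}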

The NX flag ensures atomicity even under concurrent reconnection attempts from the same client—critical when users rapidly toggle airplane mode or drive through tunnels.

Handling Partial Failures

The hardest case: the server successfully processes the operation and stores the idempotency key, but the acknowledgment never reaches the client because the connection dropped. The client retries with the same key. The server must return the original result, not an error.

This requires storing operation results alongside idempotency keys. For simple mutations, the generated resource ID suffices. For complex workflows—like initiating a multi-step payment or starting a video call—the cached value includes enough state for the client to resume.

{
  "idempotency_key": "a3f2c891...",
  "result": {
    "message_id": "msg_9x4k2p",
    "created_at": 1704123457123,
    "delivery_status": "sent"
  }
}

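When the retry arrives, the server returns this cached result in a normal acknowledgment frame, marked as a replay (for example with a "replayed": true field, since WebSocket frames carry no HTTP status or headers). On an HTTP fallback path the equivalent is a 200 status with a custom X-Idempotent-Replay: true header. Either way, the client can distinguish original responses from replays for metrics and debugging.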
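
A minimal sketch of result caching with replay detection follows. The IdempotentResults class is hypothetical, an in-memory Map stands in for Redis, and the replayed field is an assumed way of carrying the replay marker inside the response body.

```javascript
// Hypothetical in-memory result store; a real deployment would use Redis with a TTL.
class IdempotentResults {
  constructor(ttlMs, now = Date.now) {
    this.ttlMs = ttlMs;
    this.now = now; // injectable clock, useful for testing expiry
    this.entries = new Map(); // idempotency key -> { result, expiresAt }
  }

  // Runs `operation` at most once per key within the window and
  // returns the stored result, flagged as a replay, on later attempts.
  execute(key, operation) {
    const hit = this.entries.get(key);
    if (hit && hit.expiresAt > this.now()) {
      return { ...hit.result, replayed: true };
    }
    const result = operation();
    this.entries.set(key, { result, expiresAt: this.now() + this.ttlMs });
    return { ...result, replayed: false };
  }
}
```

The retry receives the same message_id as the original attempt, so the client can mark the operation as acknowledged without knowing whether its first send ever produced a response.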

Client State Reconciliation

The client must handle three reconnection scenarios gracefully:

  1. Clean reconnect: all pending operations acknowledged before disconnect, queue empty
  2. Partial send: some operations acknowledged, others still pending with stored keys
  3. Full replay: connection dropped before any acks, entire queue needs transmission

On reconnect, the client sends a sync frame containing the server-assigned timestamp of its last successfully acknowledged operation. The server responds with any messages the client missed during disconnection, plus the status of any operations in the deduplication window matching that client's device ID. This bidirectional reconciliation prevents both message loss and duplication.
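
The server side of that exchange might look roughly like the sketch below. The frame fields (last_ack_timestamp, pending_keys), the sync.result type, and the status strings are all assumed names for illustration; here the client lists its pending keys explicitly rather than having the server look them up by device ID.

```javascript
// Hypothetical server-side handler for a sync frame.
// `log` is the message list ordered by server timestamp (created_at);
// `dedup` is a Map from idempotency key to cached result for this device.
function buildSyncResponse(syncFrame, log, dedup) {
  // Everything the client has not seen yet.
  const missed = log.filter((m) => m.created_at > syncFrame.last_ack_timestamp);

  // Status of each operation the client still considers pending.
  const operations = {};
  for (const key of syncFrame.pending_keys) {
    operations[key] = dedup.has(key)
      ? { status: 'processed', result: dedup.get(key) }
      : { status: 'unknown' }; // client should resend with the same key
  }
  return { type: 'sync.result', missed, operations };
}
```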

In apps built with real-time collaboration features—think shared whiteboards or live document editing—this sync frame includes vector clocks or causal timestamps to preserve operation ordering across multiple concurrent editors.
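
For readers unfamiliar with vector clocks, the comparison at their core is small. The sketch below assumes clocks are plain objects mapping an editor ID to a counter; the function name and return labels are illustrative.

```javascript
// Compare two vector clocks (plain objects: editorId -> counter).
// Missing entries are treated as zero.
function compareClocks(a, b) {
  let aAhead = false;
  let bAhead = false;
  for (const id of new Set([...Object.keys(a), ...Object.keys(b)])) {
    const x = a[id] ?? 0;
    const y = b[id] ?? 0;
    if (x > y) aAhead = true;
    if (y > x) bAhead = true;
  }
  if (aAhead && bAhead) return 'concurrent'; // neither happened before the other
  if (aAhead) return 'after';
  if (bAhead) return 'before';
  return 'equal';
}
```

Operations whose clocks compare as concurrent are the ones that need a merge rule; the rest can simply be applied in causal order.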

Monitoring and Observability

Production systems need metrics on:

  • Duplicate rate: percentage of operations arriving with known idempotency keys, segmented by time-since-first-attempt
  • Window hit rate: how many retries arrive within the configured deduplication window versus after expiration
  • Replay latency: time between retry arrival and cached result return

A sudden spike in duplicate rate often signals infrastructure issues—load balancer health check failures, database connection pool exhaustion, or upstream API timeouts—before user-facing errors appear. In one healthcare messaging platform, duplicate rate jumped from 0.3% to 4.2% during a silent Redis cluster failover, providing 90 seconds of early warning before patient-facing latency degraded.
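
The first two metrics can be derived from simple counters, as in the sketch below. It assumes the client includes a hypothetical attempt counter in each payload (1 for the first send), which the original payload format does not define; replay latency would be measured separately with a timer around the cache lookup.

```javascript
class DedupMetrics {
  constructor() {
    this.operations = 0;   // all operations seen
    this.windowHits = 0;   // operations whose key was found in the dedup cache
    this.windowMisses = 0; // retries whose key was not in the cache
  }

  // `attempt` is a hypothetical client-supplied counter (1 = first send).
  // A retry with no cached key either arrived after the window expired
  // or follows a first attempt that never reached the server.
  record(attempt, foundInCache) {
    this.operations += 1;
    if (foundInCache) this.windowHits += 1;
    else if (attempt > 1) this.windowMisses += 1;
  }

  // Fraction of all operations that arrived with a known key.
  duplicateRate() {
    return this.operations === 0 ? 0 : this.windowHits / this.operations;
  }

  // Fraction of retries that were caught by the deduplication window.
  windowHitRate() {
    const retries = this.windowHits + this.windowMisses;
    return retries === 0 ? 1 : this.windowHits / retries;
  }
}
```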

Edge Cases and Failure Modes

Clock skew between client and server can cause issues with timestamp-based ordering. Always use server-assigned timestamps for authoritative ordering, treating client timestamps as hints for conflict resolution only.

Malicious clients can attempt cache poisoning by generating operations with future timestamps or flooding the deduplication window. Rate limiting at the WebSocket connection level, combined with per-user operation quotas, mitigates this. For high-security applications, sign idempotency keys with HMAC using a per-session secret established during authentication.
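
A minimal sketch of key signing with Node's built-in crypto module is shown below. The function names are illustrative, and how the per-session secret is established and stored is assumed to be handled elsewhere.

```javascript
const { createHmac, timingSafeEqual } = require('node:crypto');

// Client side: sign the idempotency key with the per-session secret.
function signKey(idempotencyKey, sessionSecret) {
  return createHmac('sha256', sessionSecret).update(idempotencyKey).digest('hex');
}

// Server side: recompute the signature and compare in constant time.
function verifyKey(idempotencyKey, signature, sessionSecret) {
  const expected = Buffer.from(signKey(idempotencyKey, sessionSecret), 'hex');
  const given = Buffer.from(signature, 'hex');
  // timingSafeEqual throws on length mismatch, so check lengths first.
  return given.length === expected.length && timingSafeEqual(given, expected);
}
```

The server would reject any operation whose signature fails verification before touching the deduplication cache, so unsigned or forged keys never occupy a slot in the window.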

Memory pressure from the deduplication cache grows with active user count and operation rate. At 100,000 concurrent users sending one message per minute with 5-minute windows, about 500,000 entries are live at any moment; at 80-byte keys plus 40-byte values that is roughly 60 MB of raw data, with Redis per-entry overhead on top. Cache sharding or vertical scaling becomes necessary only at much higher operation rates or with much longer windows.
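
As a sanity check on the sizing, the raw payload can be computed directly from the stated figures. The function and parameter names below are illustrative, and Redis per-entry overhead is deliberately not modeled.

```javascript
// Raw bytes held in the dedup cache at steady state (no Redis overhead).
function rawDedupBytes({ users, opsPerUserPerMinute, windowMinutes, bytesPerEntry }) {
  const liveEntries = users * opsPerUserPerMinute * windowMinutes;
  return liveEntries * bytesPerEntry;
}

const bytes = rawDedupBytes({
  users: 100_000,
  opsPerUserPerMinute: 1,
  windowMinutes: 5,
  bytesPerEntry: 80 + 40, // key + value
});
console.log(bytes / 1e6); // 60 (MB of raw key and value data)
```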

Alternative Approaches

Event sourcing systems can achieve idempotency through append-only logs where the idempotency key becomes part of the event identity. This eliminates the deduplication cache but requires more complex query patterns for read operations.
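
The idea can be illustrated with a toy append-only log in which the idempotency key doubles as the event ID. The EventLog class is a hypothetical in-memory sketch; a real event store would enforce the same rule with a uniqueness constraint on the event ID.

```javascript
// Toy append-only log where the idempotency key is the event identity.
class EventLog {
  constructor() {
    this.events = [];       // append-only
    this.byKey = new Map(); // idempotency key -> index into `events`
  }

  // Appending the same key twice returns the existing event unchanged.
  append(idempotencyKey, data) {
    if (this.byKey.has(idempotencyKey)) {
      return { appended: false, event: this.events[this.byKey.get(idempotencyKey)] };
    }
    const event = { id: idempotencyKey, seq: this.events.length + 1, data };
    this.byKey.set(idempotencyKey, this.events.length);
    this.events.push(event);
    return { appended: true, event };
  }
}
```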

CRDTs (Conflict-free Replicated Data Types) provide mathematical guarantees about convergence under concurrent updates, making explicit idempotency keys unnecessary for certain operation types. The trade-off is increased payload size and client-side merge complexity—acceptable for collaborative editing, prohibitive for high-throughput messaging.
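
A grow-only counter is one of the simplest CRDTs and shows why redelivery is harmless: merging the same state twice changes nothing. The sketch below is a generic illustration with assumed function names, not tied to any particular CRDT library.

```javascript
// Grow-only counter: one slot per replica; merge takes the per-replica maximum.
function merge(a, b) {
  const out = { ...a };
  for (const [replica, count] of Object.entries(b)) {
    out[replica] = Math.max(out[replica] ?? 0, count);
  }
  return out;
}

// The counter's value is the sum over all replicas.
function value(counter) {
  return Object.values(counter).reduce((sum, n) => sum + n, 0);
}
```

Because merge is idempotent and commutative, a state update that is delivered twice or out of order converges to the same result, which is the property that makes an explicit idempotency key unnecessary for this kind of operation.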

Production Impact

Implementing stateful reconnection with idempotency keys reduced duplicate message rates from 2.1% to 0.04% in a deployed telemedicine platform, eliminating user complaints about phantom notifications during commutes. Server-side processing costs dropped by 31% due to avoided duplicate transcription and translation API calls. The 5-minute deduplication window added 12MB of Redis memory per 10,000 active users—a negligible cost for the reliability gain.

The pattern extends beyond messaging to any stateful mobile application: collaborative tools, real-time dashboards, multiplayer games, IoT device control. Anywhere network unreliability meets user expectations of consistency, idempotency keys transform reconnection from a source of bugs into a solved infrastructure problem.