The Mobile WebSocket Paradox
WebSockets promise persistent, bidirectional communication—but mobile networks are anything but persistent. A user walks from WiFi to LTE, switches apps for thirty seconds, or drives through a tunnel. Each event tears down the TCP connection. Yet the application layer must present an illusion of continuity: no lost chat messages, no duplicate video call invitations, no out-of-order transaction confirmations.
This article dissects a production-grade reconnection strategy that handles cell tower handoffs, iOS/Android backgrounding policies, exponential backoff with jitter, and exactly-once message semantics. We'll examine sequence numbering, client-side queuing, and the tradeoffs between optimistic sends and strict ordering guarantees.
Why Naive Reconnection Fails
The simplest reconnection logic—socket.onclose = () => new WebSocket(url)—creates pathological behavior on mobile:
- Thundering herd: 10,000 clients lose connection simultaneously during a server deploy; all retry at T+0, overwhelming the load balancer.
- Battery drain: Exponential backoff without jitter causes synchronized retry waves every 2s, 4s, 8s. The radio never sleeps.
- Duplicate messages: Client sends message M, connection drops before ACK, reconnects, resends M. Server processes M twice.
- Message loss: Server writes message N to the socket just as the connection dies. With no application-level ACK, the server assumes N was delivered; the client never received it.
Production systems need sequence numbers, deduplication windows, and acknowledgment protocols—essentially rebuilding TCP reliability atop WebSockets.
Sequence Numbers and ACK Protocol
Every message gets a monotonically increasing ID. Client and server each maintain lastSentSeq and lastAckedSeq. When the client reconnects, it sends lastAckedSeq in the handshake. The server retransmits any messages with seq > lastAckedSeq.
{
  "type": "RECONNECT",
  "clientSeq": 1847,
  "serverSeq": 2103
}
The serverSeq value the server returns tells the client which of its messages the server received. The client discards any pending sends with seq ≤ serverSeq and retransmits the rest. Retransmission alone gives at-least-once delivery; it becomes exactly-once processing only if the server also deduplicates on clientSeq within a sliding window (typically 1000 messages or 5 minutes, whichever is larger).
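The server-side deduplication window described above could be sketched roughly as follows. This is an illustration under assumptions, not the article's implementation: the class name `DedupWindow`, the `accept` method, and the injected `now` clock are hypothetical, and the eviction rule (forget an entry only when it is both outside the newest 1000 and older than 5 minutes) is one reading of "whichever is larger".

```javascript
// Hypothetical per-session deduplication window (all names are illustrative).
class DedupWindow {
  constructor({ maxEntries = 1000, maxAgeMs = 5 * 60 * 1000, now = () => Date.now() } = {}) {
    this.maxEntries = maxEntries;
    this.maxAgeMs = maxAgeMs;
    this.now = now;
    this.seen = new Map(); // clientSeq -> time first seen; Map preserves insertion order
  }

  // Returns true if the message is new and should be processed,
  // false if it is a retransmission of something already processed.
  accept(clientSeq) {
    this.evict();
    if (this.seen.has(clientSeq)) return false;
    this.seen.set(clientSeq, this.now());
    return true;
  }

  // Forget an entry only when it is BOTH beyond the count limit and older
  // than the age limit, so the effective window is the larger of the two.
  evict() {
    const cutoff = this.now() - this.maxAgeMs;
    for (const [seq, seenAt] of this.seen) {
      if (this.seen.size <= this.maxEntries || seenAt >= cutoff) break;
      this.seen.delete(seq);
    }
  }
}
```

A server would keep one such window per client session and call `accept(message.clientSeq)` before applying a message, still sending an ACK for duplicates so the client can purge them.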
Deduplication Window Sizing
A 5-minute window handles the common short interruptions: brief app backgrounding on iOS (roughly 30 seconds of background execution before suspension), network handoffs, and subway tunnels. Longer gaps, such as Android Doze (maintenance windows every 15-30 minutes), will usually exceed it. Beyond 5 minutes, the client should treat the connection as "new" and request a full state sync rather than incremental catch-up.
Memory cost: each client session stores a hash set of recent message IDs. At 16 bytes per UUID × 1000 messages, that is 16KB per connection before hash-set overhead. For 100K concurrent users, that's 1.6GB of deduplication state, which is acceptable on modern servers. Clients that need a smaller footprint can use a Bloom filter for local deduplication (about 1.8KB for 1000 entries at a 0.1% false positive rate; a 512-byte filter holds that rate only up to roughly 280 entries) and rely on server-side authoritative checks.
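Bloom filter sizing for a given entry count and false positive rate can be checked with the standard formulas m = -n ln(p) / (ln 2)^2 bits and k = (m / n) ln 2 hash functions. The helper name `bloomSize` below is hypothetical; this is a small calculator sketch, not a filter implementation.

```javascript
// Standard Bloom filter sizing:
//   bits   m = -n * ln(p) / (ln 2)^2
//   hashes k = (m / n) * ln 2
function bloomSize(expectedItems, falsePositiveRate) {
  const bits = Math.ceil(
    (-expectedItems * Math.log(falsePositiveRate)) / (Math.LN2 * Math.LN2)
  );
  const hashes = Math.max(1, Math.round((bits / expectedItems) * Math.LN2));
  return { bits, bytes: Math.ceil(bits / 8), hashes };
}

// 1000 entries at a 0.1% false positive rate: roughly 1.8KB and 10 hash functions.
const sizing = bloomSize(1000, 0.001);
console.log(`${sizing.bytes} bytes, ${sizing.hashes} hash functions`);
```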
Exponential Backoff with Jitter
Standard exponential backoff: delay = min(maxDelay, baseDelay * 2^attempt). On mobile, add full jitter to prevent synchronized retries:
delay = random(0, min(maxDelay, baseDelay * 2^attempt))
Concrete values from a real-time chat app: baseDelay = 500ms, maxDelay = 32s. The delay ceiling doubles over the first six attempts (500ms, 1s, 2s, 4s, 8s, 16s) and stays at 32s from the seventh attempt onward; every attempt draws its actual delay uniformly between 0 and that ceiling. Once capped, this spreads reconnection attempts across a 32-second window, reducing load spikes by 97% compared to synchronized retries.
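The full-jitter formula above translates fairly directly into code. The sketch below assumes the article's values as defaults; the function names `backoffDelay` and `connectWithBackoff`, the injectable `random` and `sleep` parameters, and the shape of `connect` (a function returning a Promise) are all hypothetical.

```javascript
// Full-jitter backoff: a uniform random delay in [0, ceiling),
// where ceiling = min(maxDelay, baseDelay * 2^attempt). Delays are in milliseconds.
function backoffDelay(attempt, { baseDelay = 500, maxDelay = 32000, random = Math.random } = {}) {
  const ceiling = Math.min(maxDelay, baseDelay * 2 ** attempt);
  return Math.floor(random() * ceiling);
}

// Hypothetical retry loop. `connect` is any function returning a Promise that
// resolves with an open socket or rejects when the attempt fails.
async function connectWithBackoff(connect, options = {}) {
  const sleep = options.sleep || ((ms) => new Promise((resolve) => setTimeout(resolve, ms)));
  for (let attempt = 0; ; attempt++) {
    try {
      return await connect();
    } catch {
      await sleep(backoffDelay(attempt, options));
    }
  }
}
```

Injecting `random` makes the delay deterministic in tests; production code would leave it at `Math.random`.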
Network Change Detection
Mobile OSes fire network change events (iOS NWPathMonitor, Android ConnectivityManager.NetworkCallback). When the network interface changes—WiFi → LTE, or LTE → 5G—immediately close the old socket and reconnect with zero backoff. The cell tower handoff already cost 200-800ms; waiting another 2 seconds for exponential backoff is user-hostile.
However, spurious network events are common. iOS fires pathUpdateHandler when switching between WiFi access points in the same network. Debounce events: only reconnect if the interface remains stable for 300ms. This prevents reconnection loops in environments with flaky WiFi.
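One way to express the 300ms debounce is a small wrapper around whatever callback the platform bridge invokes. Everything here is a sketch under assumptions: `createNetworkDebouncer`, the string interface labels, and the injectable timer functions are hypothetical, and the native side (NWPathMonitor or NetworkCallback) is assumed to forward each change as a call to the returned function.

```javascript
// Hypothetical debouncer: call `onStable(iface)` only after the reported
// interface has stayed unchanged for `quietMs`, and only if it differs from
// the interface the socket is already using.
function createNetworkDebouncer(onStable, options = {}) {
  const { quietMs = 300, initial = null, setTimer = setTimeout, clearTimer = clearTimeout } = options;
  let timer = null;
  let current = initial; // interface the current socket was opened on

  return function onNetworkChange(iface) {
    if (timer !== null) clearTimer(timer); // a newer event restarts the quiet period
    timer = setTimer(() => {
      timer = null;
      if (iface !== current) {
        current = iface;
        onStable(iface); // e.g. close the old socket and reconnect with zero backoff
      }
    }, quietMs);
  };
}
```

A burst such as wifi, lte, wifi within the quiet period collapses into at most one reconnect, and none at all if the socket was already on wifi.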
Foreground/Background Transitions
iOS suspends network activity after 30 seconds in the background. Android's Doze mode restricts network access to brief maintenance windows. The client must:
- Detect appDidEnterBackground and immediately flush the send queue (optimistic sends with no ACK).
- Close the WebSocket to avoid iOS/Android force-killing the connection mid-flight, which corrupts TCP state and prevents graceful shutdown.
- On appWillEnterForeground, reconnect with zero backoff and request state sync.
Critical: iOS can terminate a suspended app at any time to reclaim memory, without notifying it. Any reconnection logic must assume the app was killed and in-memory state was lost. Use UserDefaults or SharedPreferences to persist lastAckedSeq across process restarts.
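Persisting the acknowledgment cursor might look like the sketch below. It is written against a generic key-value object with the Web Storage shape (`getItem`/`setItem`) as a stand-in; on native platforms that object would be backed by UserDefaults or SharedPreferences. The class name `SeqStore` and the key name are hypothetical.

```javascript
// Hypothetical persistence wrapper for lastAckedSeq.
class SeqStore {
  constructor(storage, key = "lastAckedSeq") {
    this.storage = storage;
    this.key = key;
  }

  // Returns the persisted sequence number, or null if nothing usable is stored
  // (first launch, cleared data). On null the caller should request a full sync.
  load() {
    const value = Number.parseInt(this.storage.getItem(this.key), 10);
    return Number.isNaN(value) ? null : value;
  }

  // Persist only forward progress so a late ACK cannot rewind the cursor.
  save(seq) {
    const previous = this.load();
    if (previous === null || seq > previous) {
      this.storage.setItem(this.key, String(seq));
    }
  }
}
```

On cold start the client would call `load()` and send the result in the reconnect handshake, falling back to a full state sync when it returns null.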
Background Fetch and Push Notifications
For latency-sensitive apps (chat, collaboration), use push notifications to wake the app and establish a short-lived WebSocket. iOS grants 30 seconds of background execution; fetch critical messages and update the UI. This keeps the chat history fresh without maintaining a persistent connection, reducing battery drain by 60-80%.
Send Queue and Optimistic Delivery
When the user sends a message, immediately append it to the UI with a "pending" indicator and push to a persistent queue (SQLite or IndexedDB). If the socket is connected, send immediately; otherwise, defer until reconnection. This provides instant feedback—critical for perceived responsiveness.
const seq = nextSeq++;
// Queue first so the message survives a disconnect or crash.
sendQueue.push({ seq, payload, timestamp: Date.now() });
if (socket.readyState === WebSocket.OPEN) {
  socket.send(JSON.stringify({ seq, payload }));
}
On reconnection, drain the queue in sequence order. The server's ACK protocol (described earlier) ensures exactly-once delivery. If the server ACKs seq=1847, purge all queue entries with seq ≤ 1847.
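The enqueue, ACK-purge, and drain steps can be grouped into one small structure. This is an in-memory sketch only; the class name `SendQueue` and its methods are hypothetical, and a production version would persist entries (SQLite or IndexedDB, as the text says) rather than hold them in an array.

```javascript
// Hypothetical in-memory send queue illustrating enqueue, ACK purge, and drain.
class SendQueue {
  constructor() {
    this.entries = []; // always in ascending seq order because seq only grows
    this.nextSeq = 1;
  }

  // Record a message before any send attempt and return its entry.
  enqueue(payload, timestamp = Date.now()) {
    const entry = { seq: this.nextSeq++, payload, timestamp };
    this.entries.push(entry);
    return entry;
  }

  // The server acknowledged everything up to and including ackedSeq.
  ack(ackedSeq) {
    this.entries = this.entries.filter((entry) => entry.seq > ackedSeq);
  }

  // After reconnecting, resend everything still unacknowledged, oldest first.
  // `send` is any function taking a string, e.g. (data) => socket.send(data).
  drain(send) {
    for (const { seq, payload } of this.entries) {
      send(JSON.stringify({ seq, payload }));
    }
  }
}
```

Entries stay in the queue after `drain` and are removed only by `ack`, so a second disconnect during the drain simply causes another retransmission that the server deduplicates.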
Queue Overflow Handling
If the queue grows beyond 1000 messages (indicating prolonged offline state), discard the oldest 50% and insert a "gap" marker. The UI shows "Some messages were not sent due to extended offline period." This prevents unbounded memory growth and avoids overwhelming the server with a 10-minute backlog when the user finally reconnects.
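The overflow rule could be implemented along these lines. The function name `trimQueue` and the shape of the gap marker (`type: "GAP"` with `fromSeq`/`toSeq`) are assumptions made for illustration; the article does not specify how the marker is represented.

```javascript
// Hypothetical overflow handler: once the queue exceeds maxEntries, drop the
// oldest half and leave a single gap marker describing what was discarded.
// Assumes entries are in ascending seq order and that a gap marker, if one
// already exists, sits at index 0.
function trimQueue(entries, maxEntries = 1000) {
  if (entries.length <= maxEntries) return entries;

  const dropCount = Math.floor(entries.length / 2);
  const first = entries[0];
  const lastDropped = entries[dropCount - 1];
  const gap = {
    type: "GAP",
    // Extend an earlier gap rather than stacking several markers.
    fromSeq: first.type === "GAP" ? first.fromSeq : first.seq,
    toSeq: lastDropped.seq,
  };
  return [gap, ...entries.slice(dropCount)];
}
```

The UI layer can then look for an entry with `type === "GAP"` to decide whether to show the "Some messages were not sent" notice.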
Heartbeat and Idle Timeout
Mobile carriers aggressively close idle TCP connections—often after 60 seconds. Send a heartbeat ping every 30 seconds to keep the connection alive. If the server doesn't respond within 10 seconds, assume the connection is dead (even if the OS hasn't fired onclose) and reconnect.
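const PING_INTERVAL_MS = 30000;
const PONG_TIMEOUT_MS = 10000;
setInterval(() => {
  if (socket.readyState !== WebSocket.OPEN) return;
  const pingSentAt = Date.now();
  socket.send(JSON.stringify({ type: "PING" }));
  // lastPong is set by the message handler whenever a PONG arrives.
  setTimeout(() => {
    if (lastPong < pingSentAt) {
      socket.close();
      reconnect();
    }
  }, PONG_TIMEOUT_MS);
}, PING_INTERVAL_MS);
The server echoes PONG immediately. This detects half-open connections, where the client thinks the socket is alive but the server has already closed it due to a silent network partition.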
Testing Reconnection Logic
Chaos engineering is essential. Automated tests should:
- Inject random disconnections every 5-30 seconds during a 10-minute test run.
- Simulate network switches (WiFi → LTE → WiFi) using iOS Network Link Conditioner or Android tc rules.
- Force-kill the app mid-message and verify the message is retransmitted on restart.
- Introduce 500ms-2s latency spikes and verify heartbeat logic doesn't false-positive.
A real-world test suite for a WebRTC signaling channel ran 50,000 reconnection cycles over 72 hours, achieving 99.97% exactly-once delivery (zero losses; 15 duplicates, traced to server-side deduplication bugs).
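Before running chaos tests against real devices, the retransmit-plus-deduplicate invariant can be checked in a deterministic in-process simulation. The sketch below is not the test suite cited above: `runChaos`, its parameters, and the link model (each message and each ACK is dropped independently with probability `dropRate`) are assumptions, and a seeded PRNG (mulberry32) is used so that any failure is reproducible.

```javascript
// Small seeded PRNG (mulberry32) returning floats in [0, 1).
function mulberry32(a) {
  return function () {
    let t = (a += 0x6d2b79f5);
    t = Math.imul(t ^ (t >>> 15), t | 1);
    t ^= t + Math.imul(t ^ (t >>> 7), t | 61);
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}

// Hypothetical simulation: the client resends each message until it sees an
// ACK; the server optionally deduplicates by sequence number.
function runChaos({ messages = 1000, dropRate = 0.3, seed = 42, dedupe = true } = {}) {
  const rand = mulberry32(seed);
  const processedCount = new Map(); // seq -> times the server processed it

  for (let seq = 1; seq <= messages; seq++) {
    let acked = false;
    while (!acked) {
      if (rand() < dropRate) continue; // message lost in transit: resend
      if (!dedupe || !processedCount.has(seq)) {
        processedCount.set(seq, (processedCount.get(seq) || 0) + 1);
      }
      acked = rand() >= dropRate; // ACK may be lost, which forces a resend
    }
  }

  let duplicates = 0;
  for (const count of processedCount.values()) duplicates += count - 1;
  return { delivered: processedCount.size, duplicates };
}
```

Running it with `dedupe: false` shows how many duplicates lost ACKs would otherwise produce, which is a useful baseline when chasing deduplication bugs like the ones mentioned above.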
Latency and Battery Tradeoffs
Aggressive reconnection (zero backoff, 10s heartbeat) provides