Offline-First State Sync: CRDTs in Production

Why Offline-First Demands More Than Optimistic UI

Most mobile apps today treat network connectivity as the default state. When the connection drops, users see spinners, error toasts, or frozen interfaces. This approach breaks down catastrophically in regions with intermittent connectivity or for applications where continuous availability is non-negotiable—think medical record apps in clinics with spotty Wi-Fi, field service tools in remote areas, or collaborative editing in subway tunnels.

Optimistic UI patterns mask latency but don't solve the fundamental problem: how do you merge conflicting edits from multiple devices when each has been working independently? A naive last-write-wins strategy discards user intent. Locking mechanisms require connectivity. The real solution lies in conflict-free replicated data types, or CRDTs, which mathematically guarantee eventual consistency without coordination.

CRDT Fundamentals: Commutativity and Convergence

A CRDT is a data structure designed so that concurrent operations commute—meaning the order of application doesn't affect the final state. There are two broad categories: operation-based CRDTs, which transmit operations, and state-based CRDTs, which transmit entire states and merge them. Both guarantee strong eventual consistency: all replicas that have seen the same set of updates converge to identical states.

Consider a simple counter CRDT. Instead of storing a single integer, each replica maintains a vector of integers, one per device. Incrementing adds to your device's slot. Merging takes the element-wise maximum. Because max is commutative, associative, and idempotent, replicas converge regardless of merge order or repeated delivery. Decrementing requires a two-vector approach with separate increment and decrement counts, yielding the difference as the logical value.
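A minimal sketch of the two-vector counter described above, often called a PN-Counter. The class name, field names, and replica IDs are illustrative assumptions, not taken from any particular library.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class PNCounter:
    """Hypothetical counter holding per-replica increment and decrement vectors."""
    replica_id: str
    p: Dict[str, int] = field(default_factory=dict)  # increments per replica
    n: Dict[str, int] = field(default_factory=dict)  # decrements per replica

    def increment(self, amount: int = 1) -> None:
        self.p[self.replica_id] = self.p.get(self.replica_id, 0) + amount

    def decrement(self, amount: int = 1) -> None:
        self.n[self.replica_id] = self.n.get(self.replica_id, 0) + amount

    def value(self) -> int:
        # Logical value is total increments minus total decrements.
        return sum(self.p.values()) - sum(self.n.values())

    def merge(self, other: "PNCounter") -> None:
        # Element-wise maximum per slot, applied to both vectors.
        for mine, theirs in ((self.p, other.p), (self.n, other.n)):
            for rid, count in theirs.items():
                mine[rid] = max(mine.get(rid, 0), count)


phone = PNCounter("phone")
tablet = PNCounter("tablet")
phone.increment(3)
tablet.increment(2)
tablet.decrement(1)

# Merge in both directions; repeating a merge changes nothing.
phone.merge(tablet)
tablet.merge(phone)
tablet.merge(phone)
assert phone.value() == tablet.value() == 4
```

Each replica only ever writes its own slot, which is what makes the element-wise maximum safe: a larger number in a slot always means a later state of that one device.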

For text editing, operational transformation has dominated for years in products like Google Docs, but it typically relies on a central server to serialize operations. Sequence CRDTs such as those in Yjs or Automerge instead give each character a unique, immutable identifier derived from a logical timestamp and replica ID. Insertions reference positions via these IDs, not indices. Deletions mark characters as tombstones. Concurrent inserts at the same position resolve deterministically by comparing IDs, so every replica picks the same order.
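To make the ID-based approach concrete, here is a simplified RGA-style sequence. It is a teaching sketch, not how Yjs or Automerge work internally. It assumes operations are delivered in causal order and that each new timestamp is larger than any timestamp its author has already seen. All names are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

ElemId = Tuple[int, str]  # (logical timestamp, replica ID)


@dataclass
class Elem:
    id: ElemId
    char: str
    deleted: bool = False  # tombstone flag; the element is never physically removed


class Sequence:
    def __init__(self) -> None:
        self.elems: List[Elem] = []

    def _index_of(self, elem_id: ElemId) -> int:
        return next(i for i, e in enumerate(self.elems) if e.id == elem_id)

    def insert(self, new_id: ElemId, char: str, after: Optional[ElemId]) -> None:
        # Position is expressed as "after element X", never as a numeric index.
        pos = 0 if after is None else self._index_of(after) + 1
        # Skip past elements with larger IDs so that concurrent inserts at the
        # same position end up in the same order on every replica.
        while pos < len(self.elems) and self.elems[pos].id > new_id:
            pos += 1
        self.elems.insert(pos, Elem(new_id, char))

    def delete(self, elem_id: ElemId) -> None:
        self.elems[self._index_of(elem_id)].deleted = True

    def text(self) -> str:
        return "".join(e.char for e in self.elems if not e.deleted)


# Two replicas start from "H", then receive two concurrent inserts in opposite orders.
insert_i = ((2, "A"), "i", (1, "A"))
insert_bang = ((2, "B"), "!", (1, "A"))

left, right = Sequence(), Sequence()
for doc in (left, right):
    doc.insert((1, "A"), "H", None)

left.insert(*insert_i)
left.insert(*insert_bang)
right.insert(*insert_bang)
right.insert(*insert_i)
assert left.text() == right.text() == "H!i"

# A delete only sets the tombstone, so later operations can still reference the ID.
left.delete((2, "B"))
right.delete((2, "B"))
assert left.text() == right.text() == "Hi"
```

The linear scan in `_index_of` keeps the sketch short; production libraries use indexed structures to avoid it.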

Vector Clocks and Causality

Many CRDT implementations rely on vector clocks to track causality. Each replica maintains a vector of logical timestamps, one per known replica. When generating an operation, increment your own entry. When receiving an operation, merge the remote vector into your local vector by taking the element-wise maximum. Unlike a scalar Lamport clock, which only yields an ordering consistent with causality, a vector clock lets you determine whether two operations are concurrent or causally ordered.
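The comparison rules can be sketched in a few lines. The function names are illustrative, and missing entries are treated as zero.

```python
from typing import Dict

VectorClock = Dict[str, int]


def tick(clock: VectorClock, replica_id: str) -> VectorClock:
    """Return a copy of the clock with this replica's own entry incremented."""
    return {**clock, replica_id: clock.get(replica_id, 0) + 1}


def merge(a: VectorClock, b: VectorClock) -> VectorClock:
    """Element-wise maximum of two clocks."""
    return {k: max(a.get(k, 0), b.get(k, 0)) for k in a.keys() | b.keys()}


def _leq(a: VectorClock, b: VectorClock) -> bool:
    return all(a.get(k, 0) <= b.get(k, 0) for k in a.keys() | b.keys())


def happened_before(a: VectorClock, b: VectorClock) -> bool:
    # a precedes b if every entry of a is <= b and at least one is strictly smaller.
    return _leq(a, b) and not _leq(b, a)


def concurrent(a: VectorClock, b: VectorClock) -> bool:
    # Neither clock dominates the other.
    return not _leq(a, b) and not _leq(b, a)


phone_edit = tick({}, "phone")
tablet_edit = tick({}, "tablet")
assert concurrent(phone_edit, tablet_edit)

# The phone receives the tablet's edit, merges clocks, then makes a new edit.
later_edit = tick(merge(phone_edit, tablet_edit), "phone")
assert happened_before(phone_edit, later_edit)
assert happened_before(tablet_edit, later_edit)
```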

In practice, vector clocks grow linearly with the number of devices that have ever edited the document. Pruning strategies like dotted version vectors or interval tree clocks keep metadata bounded, but at the cost of implementation complexity. For mobile apps with hundreds of thousands of users, you need a garbage collection strategy: archive inactive replicas after a threshold, accept a small risk of duplicate delivery, and implement idempotency at the application layer.

Implementation Tradeoffs in Mobile Environments

Shipping CRDTs on mobile means confronting constraints that don't exist in backend systems: limited RAM, battery-sensitive CPU budgets, and storage latency on flash. A naive CRDT implementation can balloon metadata size. For a 10,000-character document edited by 20 devices, a Yjs structure might carry 50KB of metadata: acceptable on desktop, problematic when syncing over cellular.

Delta-State CRDTs and Bandwidth

State-based CRDTs traditionally transmit the entire state on every sync. For a shopping cart with 50 items, that's redundant. Delta-state CRDTs transmit only a delta: a small fragment of state covering the changes made since the last acknowledged sync. The receiving replica merges the delta into its local state with the same join it would use for a full state. This can reduce bandwidth substantially, often by an order of magnitude when changes are small relative to the document.
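As a rough illustration, the grow-only counter from earlier can be turned into a delta-state variant. This sketch assumes a single sync peer and buffers only locally generated deltas. The class and method names are hypothetical.

```python
from typing import Dict

State = Dict[str, int]


def join(a: State, b: State) -> State:
    """Element-wise maximum; the same function merges full states and deltas."""
    return {k: max(a.get(k, 0), b.get(k, 0)) for k in a.keys() | b.keys()}


class DeltaCounter:
    def __init__(self, replica_id: str) -> None:
        self.replica_id = replica_id
        self.state: State = {}
        self.pending: State = {}  # join of local deltas not yet acknowledged

    def increment(self, amount: int = 1) -> None:
        # The delta carries only the slot that changed, not the whole vector.
        delta = {self.replica_id: self.state.get(self.replica_id, 0) + amount}
        self.state = join(self.state, delta)
        self.pending = join(self.pending, delta)

    def take_delta(self) -> State:
        return dict(self.pending)

    def ack(self, sent: State) -> None:
        # Drop only what the peer confirmed; newer local changes stay pending.
        self.pending = {k: v for k, v in self.pending.items() if v > sent.get(k, 0)}

    def apply(self, delta: State) -> None:
        self.state = join(self.state, delta)

    def value(self) -> int:
        return sum(self.state.values())


server = DeltaCounter("server")
phone = DeltaCounter("phone")
server.apply({"tablet": 7, "laptop": 2})
phone.apply(server.state)  # initial full-state sync

phone.increment()
delta = phone.take_delta()
assert delta == {"phone": 1}  # one slot on the wire instead of three

server.apply(delta)
server.apply(delta)  # redelivery is harmless because join is idempotent
phone.ack(delta)
assert server.value() == phone.value() == 10
```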

Implementing deltas requires tracking which operations have been sent to each peer. A bloom filter or a vector of last-sent timestamps works for small peer sets. For many-to-many sync topologies, a causal broadcast protocol or a gossip-based dissemination layer handles propagation. Libraries like Automerge and Yrs handle this internally, but understanding the mechanics is critical when debugging sync loops or data loss.

Persistence and Snapshotting

Mobile apps can't keep the entire CRDT history in memory. Snapshotting compresses the operation log into a checkpoint. When a user opens the app, load the latest snapshot and replay operations since then. Snapshots must preserve causality, so store the vector clock alongside the state. Without it, you can't tell which incoming operations the snapshot already reflects, so you risk applying them twice or skipping ones you still need.

In a Flutter app shipping an offline-first note-taking feature, we snapshot every 500 operations or after 60 seconds of inactivity, whichever comes first. Snapshots live in SQLite with a separate table for operation logs. On startup, read the snapshot, deserialize the CRDT, then apply logged operations. This keeps memory under 20MB and cold-start latency under 200ms, even for users with thousands of notes.
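The snapshot-plus-log layout might look roughly like the following. The example uses Python's built-in sqlite3 module rather than the Flutter stack, and the table names, JSON encoding, and operation shape are all assumptions made for illustration.

```python
import json
import sqlite3

SNAPSHOT_EVERY_OPS = 500    # thresholds taken from the text above
SNAPSHOT_IDLE_SECONDS = 60


def open_store(path: str = ":memory:") -> sqlite3.Connection:
    db = sqlite3.connect(path)
    db.execute("CREATE TABLE IF NOT EXISTS snapshot ("
               "id INTEGER PRIMARY KEY CHECK (id = 1), "
               "state TEXT NOT NULL, clock TEXT NOT NULL)")
    db.execute("CREATE TABLE IF NOT EXISTS op_log ("
               "seq INTEGER PRIMARY KEY AUTOINCREMENT, payload TEXT NOT NULL)")
    return db


def should_snapshot(ops_since_last: int, idle_seconds: float) -> bool:
    return ops_since_last >= SNAPSHOT_EVERY_OPS or idle_seconds >= SNAPSHOT_IDLE_SECONDS


def append_op(db: sqlite3.Connection, op: dict) -> None:
    with db:
        db.execute("INSERT INTO op_log (payload) VALUES (?)", (json.dumps(op),))


def write_snapshot(db: sqlite3.Connection, state: dict, clock: dict) -> None:
    # One transaction: store state and vector clock together, then truncate the log.
    with db:
        db.execute("INSERT OR REPLACE INTO snapshot (id, state, clock) VALUES (1, ?, ?)",
                   (json.dumps(state), json.dumps(clock)))
        db.execute("DELETE FROM op_log")


def load(db: sqlite3.Connection, apply_op):
    """Read the snapshot, then replay logged operations in order."""
    row = db.execute("SELECT state, clock FROM snapshot WHERE id = 1").fetchone()
    state, clock = (json.loads(row[0]), json.loads(row[1])) if row else ({}, {})
    for (payload,) in db.execute("SELECT payload FROM op_log ORDER BY seq"):
        state, clock = apply_op(state, clock, json.loads(payload))
    return state, clock


def apply_note_op(state: dict, clock: dict, op: dict):
    # Hypothetical operation: set a note's text and advance the author's clock entry.
    state = {**state, op["note"]: op["text"]}
    clock = {**clock, op["replica"]: clock.get(op["replica"], 0) + 1}
    return state, clock


db = open_store()
append_op(db, {"replica": "phone", "note": "n1", "text": "milk"})
state, clock = load(db, apply_note_op)
write_snapshot(db, state, clock)
append_op(db, {"replica": "phone", "note": "n2", "text": "eggs"})

state, clock = load(db, apply_note_op)  # simulated cold start
assert state == {"n1": "milk", "n2": "eggs"}
assert clock == {"phone": 2}
```

Writing the snapshot and truncating the log in one transaction matters: a crash between the two steps would otherwise replay operations the snapshot already contains.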

Conflict Resolution Semantics

CRDTs resolve conflicts automatically, but the semantics matter. For a last-write-wins register, the "last" is determined by wall-clock timestamps, which are notoriously unreliable on mobile devices. A hybrid logical clock—combining wall-clock time with a logical counter—provides better ordering while remaining human-readable for debugging.
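A hybrid logical clock paired with a last-write-wins register could be sketched as follows. The update rules follow the commonly published HLC algorithm. The class names, the millisecond resolution, and the injectable `now_ms` parameter are assumptions for the example.

```python
import time
from typing import Any, Callable, Tuple

Timestamp = Tuple[int, int, str]  # (wall-clock ms, logical counter, replica ID)


class HybridLogicalClock:
    def __init__(self, replica_id: str,
                 now_ms: Callable[[], int] = lambda: int(time.time() * 1000)) -> None:
        self.replica_id = replica_id
        self.now_ms = now_ms
        self.wall = 0
        self.counter = 0

    def now(self) -> Timestamp:
        """Timestamp for a local write."""
        physical = self.now_ms()
        if physical > self.wall:
            self.wall, self.counter = physical, 0
        else:
            self.counter += 1  # wall clock stalled or went backwards
        return (self.wall, self.counter, self.replica_id)

    def observe(self, remote: Timestamp) -> None:
        """Fold a received timestamp into the local clock."""
        r_wall, r_counter, _ = remote
        new_wall = max(self.wall, r_wall, self.now_ms())
        if new_wall == self.wall == r_wall:
            self.counter = max(self.counter, r_counter) + 1
        elif new_wall == self.wall:
            self.counter += 1
        elif new_wall == r_wall:
            self.counter = r_counter + 1
        else:
            self.counter = 0
        self.wall = new_wall


class LWWRegister:
    def __init__(self) -> None:
        self.value: Any = None
        self.stamp: Timestamp = (0, 0, "")

    def set(self, value: Any, stamp: Timestamp) -> None:
        # Tuple comparison: wall time, then counter, then replica ID as tie-breaker.
        if stamp > self.stamp:
            self.value, self.stamp = value, stamp


# The phone's wall clock is four seconds behind the tablet's.
tablet_clock = HybridLogicalClock("tablet", now_ms=lambda: 5000)
phone_clock = HybridLogicalClock("phone", now_ms=lambda: 1000)

register = LWWRegister()
first = tablet_clock.now()
register.set("from tablet", first)

phone_clock.observe(first)  # phone sees the tablet's write
second = phone_clock.now()  # then overwrites it
register.set("from phone", second)

# The causally later write wins despite the phone's lagging wall clock.
assert second > first
assert register.value == "from phone"
```

With raw wall-clock timestamps, the phone's write would have carried time 1000 and lost to the tablet's earlier write at 5000.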

For sets, add-wins semantics mean that if one replica adds an item and another removes it concurrently, the add wins. This is intuitive for shopping carts but wrong for access control lists, where a concurrent revoke should take precedence. Observed-remove sets implement add-wins by tagging every add uniquely: a remove cancels only the adds it has observed, so a concurrent add survives. When removal must dominate, use a remove-wins set instead. Choosing the right CRDT primitive for each data type is an architectural decision with UX consequences.
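An observed-remove set can be sketched with unique tags per add. In this illustrative version a remove tombstones only the tags the removing replica has seen, which is why an add it has not yet observed survives the merge. The class name and tag scheme are assumptions.

```python
import uuid
from typing import Dict, Set


class ORSet:
    def __init__(self) -> None:
        self.adds: Dict[str, Set[str]] = {}  # element -> unique tags, one per add
        self.removed: Set[str] = set()       # tags cancelled by observed removes

    def add(self, element: str) -> None:
        self.adds.setdefault(element, set()).add(uuid.uuid4().hex)

    def remove(self, element: str) -> None:
        # Cancel only the add-tags this replica has observed so far.
        self.removed |= self.adds.get(element, set())

    def contains(self, element: str) -> bool:
        return bool(self.adds.get(element, set()) - self.removed)

    def merge(self, other: "ORSet") -> None:
        # Both components are grow-only sets, so union is a valid join.
        for element, tags in other.adds.items():
            self.adds.setdefault(element, set()).update(tags)
        self.removed |= other.removed


phone, tablet = ORSet(), ORSet()
phone.add("milk")
phone.add("eggs")
tablet.merge(phone)

# Concurrent edits while offline.
phone.remove("milk")   # cancels the one "milk" tag the phone knows about
tablet.add("milk")     # creates a fresh tag the phone has not observed
tablet.remove("eggs")  # observed remove with no competing add

phone.merge(tablet)
tablet.merge(phone)
assert phone.contains("milk") and tablet.contains("milk")          # add wins
assert not phone.contains("eggs") and not tablet.contains("eggs")  # remove sticks
```

Tombstoned tags accumulate in this sketch; real implementations compact them using causal context.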

Application-Level Conflict Handlers

Sometimes mathematical convergence isn't enough. In a collaborative calendar app, if two users schedule conflicting meetings in the same time slot, the CRDT will preserve both. The application layer must detect the conflict and present a resolution UI—merge into a single meeting, split into alternatives, or prompt the user. This requires a conflict detection pass after every merge, comparing logical constraints that the CRDT layer can't understand.
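A post-merge conflict detection pass for the calendar case might look like this. The `Meeting` fields, including the `room` resource used to decide whether two meetings actually collide, are hypothetical.

```python
from dataclasses import dataclass
from itertools import combinations
from typing import Iterable, List, Tuple


@dataclass(frozen=True)
class Meeting:
    id: str
    room: str
    start: int  # minutes since midnight
    end: int


def find_conflicts(meetings: Iterable[Meeting]) -> List[Tuple[str, str]]:
    """Return ID pairs of meetings that overlap in the same room."""
    ordered = sorted(meetings, key=lambda m: (m.start, m.id))
    conflicts = []
    for a, b in combinations(ordered, 2):
        # Half-open intervals [start, end) overlap when each starts before the other ends.
        if a.room == b.room and a.start < b.end and b.start < a.end:
            conflicts.append((a.id, b.id))
    return conflicts


# State after a merge: the CRDT kept both meetings that two users booked offline.
merged = [
    Meeting("m1", "room-a", 540, 600),  # 09:00 to 10:00
    Meeting("m2", "room-a", 570, 630),  # 09:30 to 10:30, overlaps m1
    Meeting("m3", "room-b", 540, 600),  # same time, different room
]
assert find_conflicts(merged) == [("m1", "m2")]
```

The pairwise scan is quadratic, which is fine for one user's day. A larger calendar would want an interval index. The result feeds the resolution UI; the CRDT state itself is left untouched.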

Sync Topology and Backend Architecture

CRDTs enable peer-to-peer sync, but most production systems use a client-server topology for pragmatic reasons: authentication, authorization, backup, and cross-device sync without requiring devices to be online simultaneously. The server acts as a highly available replica that never goes offline.

In a hybrid architecture, clients sync with the server over HTTP or WebSocket. The server runs the same CRDT merge logic as clients. For scale, partition the CRDT space by user or document ID and route requests accordingly. Each partition is a single-threaded actor to avoid concurrency bugs in merge logic. We've run this pattern in production with millions of documents, achieving p99 merge latency under 15ms on commodity VMs.
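The routing and single-owner idea can be sketched as below. This is a simplified in-process model, not the production system described above: each `Partition` has a mailbox that is assumed to be drained by exactly one thread, and the merge is the element-wise maximum from the counter example. All names are hypothetical.

```python
import hashlib
from collections import deque
from typing import Deque, Dict, Tuple

State = Dict[str, int]


def partition_for(document_id: str, partition_count: int) -> int:
    # Use a stable hash: Python's built-in hash() for strings is salted per
    # process by default, so it would route differently after a restart.
    digest = hashlib.sha256(document_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % partition_count


class Partition:
    """Owns a subset of documents; merges run one at a time from the mailbox."""

    def __init__(self) -> None:
        self.mailbox: Deque[Tuple[str, State]] = deque()
        self.docs: Dict[str, State] = {}

    def submit(self, doc_id: str, incoming: State) -> None:
        self.mailbox.append((doc_id, incoming))

    def drain(self) -> None:
        # Assumed to be called only by the single thread that owns this partition.
        while self.mailbox:
            doc_id, incoming = self.mailbox.popleft()
            current = self.docs.get(doc_id, {})
            self.docs[doc_id] = {k: max(current.get(k, 0), incoming.get(k, 0))
                                 for k in current.keys() | incoming.keys()}


partitions = [Partition() for _ in range(4)]


def route(doc_id: str, incoming: State) -> None:
    partitions[partition_for(doc_id, len(partitions))].submit(doc_id, incoming)


route("doc-42", {"phone": 2})
route("doc-42", {"phone": 1, "tablet": 3})
for p in partitions:
    p.drain()

owner = partitions[partition_for("doc-42", len(partitions))]
assert owner.docs["doc-42"] == {"phone": 2, "tablet": 3}
```

Because every request for a given document lands in the same mailbox, the merge code never needs locks.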

Sync Protocols and Retry Logic

A robust sync protocol handles transient failures gracefully. Use exponential backoff with jitter for retries. Include the local vector clock in every sync request so the server can return only operations the client hasn't seen. Implement idempotency tokens to prevent duplicate application of operations if a retry succeeds after a timeout.
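Here is one way the retry and idempotency pieces could fit together. `SyncServer` is an in-memory stand-in for a real backend, and the function names, delay parameters, and return values are assumptions. The vector clock exchange is omitted to keep the sketch short.

```python
import random
import time
import uuid
from typing import Callable, Iterator, List


def backoff_delays(base: float = 0.5, cap: float = 30.0, attempts: int = 6,
                   rng: Callable[[], float] = random.random) -> Iterator[float]:
    """Exponential backoff with full jitter: uniform in [0, min(cap, base * 2^n))."""
    for attempt in range(attempts):
        yield rng() * min(cap, base * 2 ** attempt)


class SyncServer:
    """Hypothetical server that remembers which idempotency tokens it has applied."""

    def __init__(self) -> None:
        self.seen_tokens = set()
        self.log: List[str] = []

    def push(self, token: str, ops: List[str]) -> str:
        if token in self.seen_tokens:
            return "duplicate"  # already applied; acknowledge without re-applying
        self.seen_tokens.add(token)
        self.log.extend(ops)
        return "applied"


def sync_with_retry(send: Callable[[str, List[str]], str], ops: List[str],
                    sleep: Callable[[float], None] = time.sleep) -> str:
    token = uuid.uuid4().hex  # generated once and reused for every retry
    for delay in backoff_delays():
        try:
            return send(token, ops)
        except TimeoutError:
            sleep(delay)
    raise TimeoutError("sync failed after all retries")


# Simulate a request that reaches the server but whose response is lost.
server = SyncServer()
calls = {"count": 0}


def flaky_send(token: str, ops: List[str]) -> str:
    calls["count"] += 1
    result = server.push(token, ops)
    if calls["count"] == 1:
        raise TimeoutError("response lost")
    return result


result = sync_with_retry(flaky_send, ["op-1", "op-2"], sleep=lambda seconds: None)
assert result == "duplicate"
assert server.log == ["op-1", "op-2"]  # applied exactly once
```

Generating the token outside the retry loop is the important detail. A fresh token per attempt would defeat the deduplication.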

For large documents, chunked sync reduces memory pressure. Split the operation log into 1000-operation chunks and sync them in order. The server acknowledges each chunk, and the client advances its sync cursor. If a chunk fails, retry from the last acknowledged position. This pattern is essential for syncing multi-megabyte documents over unreliable mobile networks.
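The cursor-based loop can be sketched as follows. `send_chunk` is a hypothetical transport callback that returns True once the server has acknowledged the chunk.

```python
from typing import Callable, List

CHUNK_SIZE = 1000  # operations per chunk, as in the text above


def sync_log(ops: List[str], cursor: int,
             send_chunk: Callable[[int, List[str]], bool],
             chunk_size: int = CHUNK_SIZE) -> int:
    """Send ops[cursor:] in order and return the new cursor.

    The cursor advances only past acknowledged chunks, so a later call
    resumes from the last acknowledged position.
    """
    while cursor < len(ops):
        chunk = ops[cursor:cursor + chunk_size]
        if not send_chunk(cursor, chunk):
            break  # stop here; the caller retries from the same cursor
        cursor += len(chunk)
    return cursor


ops = [f"op-{i}" for i in range(2500)]
received: List[str] = []
fail_once = {"armed": True}


def send_chunk(offset: int, chunk: List[str]) -> bool:
    # Simulate a network failure on the second chunk, first attempt only.
    if offset == 1000 and fail_once["armed"]:
        fail_once["armed"] = False
        return False
    received.extend(chunk)
    return True


cursor = sync_log(ops, 0, send_chunk)
assert cursor == 1000  # first chunk acknowledged, second failed

cursor = sync_log(ops, cursor, send_chunk)
assert cursor == 2500
assert received == ops  # nothing lost, nothing duplicated
```

Persisting the cursor alongside the operation log lets a sync resume after the app is killed mid-transfer.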

Testing and Debugging CRDT Systems

CRDT bugs are subtle. A merge function that violates commutativity only manifests when operations arrive in a specific order, which is nondeterministic in distributed systems. Generative testing with property-based frameworks like Hypothesis or fast-check is invaluable. Generate random sequences of operations, apply them in many different delivery orders (all permutations for small cases, random samples for larger ones, respecting causal order where the CRDT requires it), and assert that the final state is identical.
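The idea can be demonstrated with the standard library alone. A framework such as Hypothesis would add smarter generation and shrinking on top of the same property. The helper names and the deliberately broken merge are illustrative.

```python
import itertools
import random
from functools import reduce
from typing import Callable, Dict, List, Optional

State = Dict[str, int]


def merge(a: State, b: State) -> State:
    """Element-wise maximum, the counter merge under test."""
    return {k: max(a.get(k, 0), b.get(k, 0)) for k in a.keys() | b.keys()}


def broken_merge(a: State, b: State) -> State:
    """Deliberately wrong: the right-hand state overwrites the left."""
    return {**a, **b}


def random_state(rng: random.Random) -> State:
    return {r: rng.randint(0, 5) for r in ("a", "b", "c") if rng.random() < 0.8}


def find_divergence(merge_fn: Callable[[State, State], State],
                    trials: int = 200, seed: int = 0) -> Optional[List[State]]:
    """Return a set of states whose final result depends on merge order, if any."""
    rng = random.Random(seed)
    for _ in range(trials):
        states = [random_state(rng) for _ in range(4)]
        # Four states keep the search at 24 permutations per trial.
        outcomes = {
            tuple(sorted(reduce(merge_fn, order, {}).items()))
            for order in itertools.permutations(states)
        }
        if len(outcomes) != 1:
            return states
    return None


assert find_divergence(merge) is None             # converges in every order tried
assert find_divergence(broken_merge) is not None  # order-dependent, so it is caught
```

Checking that the harness rejects a known-bad merge is worth keeping in the suite, because a property test that can never fail proves nothing.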

We use a simulation harness that models network partitions, message reordering, and device crashes. Each test runs hundreds of scenarios, logging operation traces. When a convergence failure occurs, the harness replays the minimal trace that triggers the bug. This caught a vector clock comparison bug in our production CRDT library that only appeared when three devices edited the same field within a 50ms window.

Observability for Sync Health

Instrument your CRDT system with metrics: operation log size, snapshot frequency, merge latency, and sync success rate. Expose the local vector clock in debug builds so engineers can manually verify causality. Log every merge conflict at the application layer with enough context to reproduce the user's actions. When users report data loss, these logs are the only way to reconstruct what happened across multiple devices and network hops.

When Not to Use CRDTs

CRDTs add complexity. For read-heavy apps with rare writes, a simpler eventual consistency model with server-authoritative state works fine. For apps where users rarely work offline, optimistic UI with server reconciliation is easier to implement and reason about. CRDTs shine when offline usage is the norm, when multi-device collaboration is a core feature, or when your user base operates in environments with unreliable connectivity.

The metadata overhead also matters, and it varies with edit history. A CRDT-based document editor might use 10-20% more storage than a plain text file for lightly edited content, but heavily edited documents can carry far more, as the 50KB example above shows. For a medical records app storing gigabytes of data per user, this overhead compounds. Evaluate whether the offline-first guarantee justifies the cost. In some domains, like field service or clinical workflows, the answer is an unequivocal yes. For a social media feed, probably not.