Viewport-Aware LLM Chunking: Mobile Scroll Perf

Chat interfaces powered by on-device LLMs face a peculiar problem: as conversations grow past a few hundred messages, scroll performance collapses. On a Pixel 6a running a 7B-parameter model, we measured frame times jumping from 16ms to 94ms when rendering a 2,000-message thread, even though the LLM itself was idle. The culprit isn't inference latency; it's the UI layer blindly materializing every token ever generated.

This article documents viewport-aware chunking, a technique that defers rendering of off-screen content until scroll brings it into view. Implemented in Flutter for OfflineAI, the pattern reduced 99th-percentile frame time by 73% and cut memory footprint by 41% in production telemetry from 12,000 daily active users.

The Problem: Eager Token Materialization

Standard chat UI patterns store messages as a flat list of widgets. Each message contains spans of text, code blocks, and inline formatting. When an LLM streams a response, tokens append to the active message widget. Once complete, the widget freezes and a new user message begins.

Flutter's ListView.builder offers lazy building: it only constructs widgets near the viewport. But LLM chat introduces a second dimension: token-level granularity within each message. A single assistant response might contain 1,200 tokens spanning 8,000 characters. If you model this as one RichText widget with 1,200 TextSpan children, Flutter must traverse and lay out the entire tree even when the message is 4,000 pixels off-screen.

We profiled a 1,500-message thread on a Snapdragon 778G device. The flame graph showed 68% of frame time in RichText.computeIntrinsicWidth and Paragraph.layout, even though only 4 messages were visible. The layout phase couldn't short-circuit because each message's intrinsic size depended on its full token stream.

Architecture: Two-Tier Lazy Rendering

Viewport-aware chunking introduces a two-tier structure:

1. Message-level lazy building via ListView.builder: the standard Flutter pattern.
2. Token-level lazy building within each message: custom logic that splits token streams into viewport-aware chunks.

Each message tracks its tokens in a sparse array. When the message widget builds, it queries the viewport's vertical bounds (via ScrollController and RenderBox geometry). Only token chunks that overlap the viewport, extended by a buffer zone of 1.5× the viewport height, materialize into TextSpan objects. Off-screen chunks remain as lightweight metadata: byte offsets, token count, and estimated height.
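
The selection step can be sketched as a simple interval-overlap test. This is a minimal illustration in Rust rather than the app's Dart code. The ChunkMeta fields and the function name are hypothetical, and it assumes the buffer zone is applied symmetrically above and below the viewport, which the description does not spell out.

```rust
/// Lightweight metadata kept for every chunk, whether or not it is on screen.
/// Field names are illustrative, not OfflineAI's actual types.
#[derive(Debug, Clone, PartialEq)]
struct ChunkMeta {
    top: f32,    // estimated offset of the chunk's top edge in the scroll extent
    height: f32, // estimated height of the chunk
}

/// Returns the indices of chunks that overlap the viewport after it has been
/// extended by `buffer_factor` viewport heights on each side (an assumption).
fn chunks_to_materialize(
    chunks: &[ChunkMeta],
    scroll_offset: f32,
    viewport_height: f32,
    buffer_factor: f32,
) -> Vec<usize> {
    let margin = viewport_height * buffer_factor;
    let lo = scroll_offset - margin;
    let hi = scroll_offset + viewport_height + margin;
    chunks
        .iter()
        .enumerate()
        // Keep a chunk when its [top, top + height) span intersects (lo, hi).
        .filter(|(_, c)| c.top < hi && c.top + c.height > lo)
        .map(|(i, _)| i)
        .collect()
}
```

Everything outside the returned index set would stay as metadata only.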

Chunk Sizing Strategy

Fixed-size chunks (e.g., 50 tokens) cause jank at chunk boundaries when a large code block spans the split. We use semantic chunking: split at paragraph breaks, code fence boundaries, or list item edges. The tokenizer's byte-pair encoding doesn't align with semantic units, so we maintain a parallel index:

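use std::ops::Range;

// Kind of semantic boundary a chunk ends on (the split points named above).
enum BoundaryType { ParagraphBreak, CodeFence, ListItem }

struct TokenChunk {
  start_token: u32,          // token index where the chunk starts
  end_token: u32,            // token index where the chunk ends
  byte_range: Range<usize>,  // matching span of UTF-8 bytes in the token store
  estimated_height: f32,     // estimated rendered height, from the height model
  semantic_boundary: BoundaryType,
}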
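
The semantic split itself can be sketched as follows. This is a minimal illustration, not the production chunker: it works on raw text instead of token indices, handles only blank lines and fenced code blocks (list item edges are omitted), and the names split_semantic and Boundary are hypothetical.

```rust
use std::ops::Range;

#[derive(Debug, Clone, Copy, PartialEq)]
enum Boundary {
    ParagraphBreak,
    CodeFence,
    EndOfMessage,
}

/// Splits `text` into byte ranges that end at a blank line, at either edge of
/// a fenced code block, or at the end of the message.
fn split_semantic(text: &str) -> Vec<(Range<usize>, Boundary)> {
    let fence = "`".repeat(3); // three backticks, the code fence marker
    let mut chunks = Vec::new();
    let mut start = 0; // byte offset where the current chunk begins
    let mut pos = 0; // byte offset of the line being examined
    let mut in_fence = false;
    for line in text.split_inclusive('\n') {
        let end = pos + line.len();
        let is_fence = line.trim_start().starts_with(fence.as_str());
        if is_fence && !in_fence {
            // Opening fence: close the prose before it so the code block
            // starts its own chunk.
            if pos > start {
                chunks.push((start..pos, Boundary::CodeFence));
                start = pos;
            }
            in_fence = true;
        } else if is_fence {
            // Closing fence: the whole code block becomes one chunk.
            chunks.push((start..end, Boundary::CodeFence));
            start = end;
            in_fence = false;
        } else if !in_fence && line.trim().is_empty() && end > start {
            // Blank line outside code: paragraph break.
            chunks.push((start..end, Boundary::ParagraphBreak));
            start = end;
        }
        pos = end;
    }
    if start < text.len() {
        chunks.push((start..text.len(), Boundary::EndOfMessage));
    }
    chunks
}
```

Because a code block is never split, a chunk can be much larger than a fixed 50-token window; that is the trade made to avoid jank at boundaries.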

Height estimation uses a fitted linear model: height = 18.2 * line_count + 4.1 * code_block_count + 22.0. The coefficients were derived from 50,000 real messages. Estimation error averages 8.3%, which is acceptable because we over-provision the buffer zone.
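
As a worked example, the model can be written as a one-line function. The function name is hypothetical, and the unit of the result is not stated in the text (logical pixels is a reasonable guess).

```rust
/// Linear height model with the coefficients quoted above:
/// 18.2 per line, 4.1 per code block, plus a constant 22.0.
fn estimate_height(line_count: u32, code_block_count: u32) -> f32 {
    18.2 * line_count as f32 + 4.1 * code_block_count as f32 + 22.0
}
```

For a chunk with 10 lines and one code block this gives 182.0 + 4.1 + 22.0 = 208.1. At the quoted 8.3% average error, that estimate is typically off by about 17, far less than a buffer zone measured in viewport heights.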

Implementation: Flutter Sliver Protocol

Flutter's sliver protocol provides the low-level hooks. We subclass SliverMultiBoxAdaptorWidget and pass it a custom SliverChildDelegate. The delegate's build method receives only a BuildContext and a child index, not viewport geometry, so the delegate holds the current ScrollMetrics and maps them to token chunk indices.

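import 'package:flutter/widgets.dart';

class TokenAwareDelegate extends SliverChildDelegate {
  const TokenAwareDelegate({required this.store, required this.metrics});

  final TokenStore store;       // app-specific chunk store
  final ScrollMetrics metrics;  // snapshot of the scroll position

  @override
  Widget? build(BuildContext context, int index) {
    // Materialize only the chunks overlapping the visible range (helpers not shown).
    final visibleRange = _computeVisibleRange(metrics);
    final chunks = store.getChunks(visibleRange);
    return _buildSpansFromChunks(chunks);
  }

  // Required override: rebuild children when the store or scroll metrics change.
  @override
  bool shouldRebuild(covariant TokenAwareDelegate oldDelegate) =>
      store != oldDelegate.store || metrics != oldDelegate.metrics;
}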

The TokenStore is a memory-mapped file on iOS (via mmap) and a direct byte buffer on Android (via ByteBuffer.asUint8List). This avoids copying token data into the Dart heap. For a 2,000-message thread (~3.2M tokens), heap pressure drops from 140MB to 82MB.

Scroll Event Debouncing

Naive rebuilding on every scroll event causes layout thrashing. We debounce using a 48ms window (roughly three frames at 60fps). If scroll velocity exceeds 1,200 px/sec, we extend the buffer zone from 1.5× to 2.5× viewport height to prevent blank flashes during flings.
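
A minimal sketch of both rules follows, written in Rust for illustration. It assumes a trailing-edge debounce (rebuild once no scroll event has arrived for the full window) that is polled once per frame; the text does not say which debounce variant the app uses, and the type and function names are hypothetical.

```rust
/// Buffer zone size, in viewport heights, for a given scroll velocity.
/// Uses the 1,200 px/sec fling threshold from the text.
fn buffer_factor(velocity_px_per_sec: f64) -> f64 {
    if velocity_px_per_sec.abs() > 1200.0 { 2.5 } else { 1.5 }
}

/// Trailing-edge debouncer driven by explicit millisecond timestamps.
struct Debouncer {
    window_ms: u64,
    last_event_ms: Option<u64>, // None when no rebuild is pending
}

impl Debouncer {
    fn new(window_ms: u64) -> Self {
        Self { window_ms, last_event_ms: None }
    }

    /// Records a scroll event; this restarts the quiet-period timer.
    fn on_scroll(&mut self, now_ms: u64) {
        self.last_event_ms = Some(now_ms);
    }

    /// Returns true exactly once after the window has passed with no new
    /// scroll events, signalling that a rebuild should run.
    fn poll(&mut self, now_ms: u64) -> bool {
        match self.last_event_ms {
            Some(last) if now_ms.saturating_sub(last) >= self.window_ms => {
                self.last_event_ms = None;
                true
            }
            _ => false,
        }
    }
}
```

With a 48ms window, events at 0ms and 20ms produce a single rebuild at 68ms instead of two.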

Velocity tracking uses a ring buffer of the last 8 scroll offsets with timestamps. We compute instantaneous velocity via linear regression over the buffer, then apply exponential smoothing (alpha=0.3) to suppress jitter from touch sampling noise.
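
The velocity estimate can be sketched like this, assuming timestamps in seconds and offsets in pixels. The type and method names are hypothetical, and the choice to seed the smoothed value with the first available slope is an assumption.

```rust
use std::collections::VecDeque;

const WINDOW: usize = 8; // ring buffer size from the text
const ALPHA: f64 = 0.3; // exponential smoothing factor from the text

struct VelocityTracker {
    samples: VecDeque<(f64, f64)>, // (time in seconds, scroll offset in px)
    smoothed: Option<f64>,
}

impl VelocityTracker {
    fn new() -> Self {
        Self { samples: VecDeque::with_capacity(WINDOW), smoothed: None }
    }

    /// Records a sample and returns the smoothed velocity in px/sec
    /// (0.0 until at least two samples are available).
    fn push(&mut self, time_s: f64, offset_px: f64) -> f64 {
        if self.samples.len() == WINDOW {
            self.samples.pop_front(); // drop the oldest sample
        }
        self.samples.push_back((time_s, offset_px));
        if let Some(raw) = self.slope() {
            self.smoothed = Some(match self.smoothed {
                Some(prev) => ALPHA * raw + (1.0 - ALPHA) * prev,
                None => raw, // assumption: seed with the first slope
            });
        }
        self.smoothed.unwrap_or(0.0)
    }

    /// Least-squares slope of offset against time over the buffer.
    fn slope(&self) -> Option<f64> {
        if self.samples.len() < 2 {
            return None;
        }
        let n = self.samples.len() as f64;
        let mean_t = self.samples.iter().map(|s| s.0).sum::<f64>() / n;
        let mean_y = self.samples.iter().map(|s| s.1).sum::<f64>() / n;
        let (mut num, mut den) = (0.0, 0.0);
        for &(t, y) in &self.samples {
            num += (t - mean_t) * (y - mean_y);
            den += (t - mean_t) * (t - mean_t);
        }
        if den == 0.0 { None } else { Some(num / den) }
    }
}
```

Fitting a line through all eight samples, instead of differencing the last two, keeps a single noisy touch sample from dominating the estimate.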

Memory Management: Chunk Eviction Policy

Chunks outside the buffer zone must evict to prevent unbounded growth. We use a two-queue LRU: a hot queue (recently accessed chunks) and a cold queue (eviction candidates). When a chunk is accessed, it moves to the hot queue's tail. When memory pressure exceeds a 60MB threshold, we evict from the cold queue's head.
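
A compact sketch of the two-queue policy is shown below. The text does not say how chunks move from the hot queue to the cold queue, so this version assumes a fixed hot-queue capacity and demotes the oldest hot chunk when it overflows. ChunkCache and its fields are hypothetical names, and sizes are plain byte counts.

```rust
use std::collections::{HashMap, VecDeque};

struct ChunkCache {
    hot: VecDeque<u64>,         // recently accessed chunk ids, newest at the tail
    cold: VecDeque<u64>,        // eviction candidates, oldest at the head
    sizes: HashMap<u64, usize>, // resident size of each chunk in bytes
    total_bytes: usize,
    hot_capacity: usize,        // assumption: demote when the hot queue exceeds this
    threshold_bytes: usize,     // memory budget, e.g. 60MB
}

impl ChunkCache {
    fn new(hot_capacity: usize, threshold_bytes: usize) -> Self {
        Self {
            hot: VecDeque::new(),
            cold: VecDeque::new(),
            sizes: HashMap::new(),
            total_bytes: 0,
            hot_capacity,
            threshold_bytes,
        }
    }

    /// Marks a chunk as accessed (inserting it if new), then evicts cold
    /// chunks while over budget. Returns the ids that were evicted.
    fn touch(&mut self, id: u64, size_bytes: usize) -> Vec<u64> {
        // Remove any existing entry, then move the chunk to the hot tail.
        self.hot.retain(|&x| x != id);
        self.cold.retain(|&x| x != id);
        let old = self.sizes.insert(id, size_bytes).unwrap_or(0);
        self.total_bytes = self.total_bytes + size_bytes - old;
        self.hot.push_back(id);
        // Demote the oldest hot chunks once the hot queue overflows.
        while self.hot.len() > self.hot_capacity {
            if let Some(demoted) = self.hot.pop_front() {
                self.cold.push_back(demoted);
            }
        }
        // Evict from the cold head while memory exceeds the threshold.
        let mut evicted = Vec::new();
        while self.total_bytes > self.threshold_bytes {
            match self.cold.pop_front() {
                Some(victim) => {
                    self.total_bytes -= self.sizes.remove(&victim).unwrap_or(0);
                    evicted.push(victim);
                }
                None => break, // nothing cold left; hot chunks stay resident
            }
        }
        evicted
    }
}
```

A chunk that is touched again while in the cold queue returns to the hot tail, so a brief scroll-back does not pay the reconstruction cost.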

The threshold adapts to device RAM. On 8GB devices it is 80MB. This heuristic came from analyzing crash reports: 92% of OOM crashes occurred on 3GB devices with >1,800 messages in memory.

Chunk Reconstruction Cost

Evicted chunks must rebuild when scrolled back into view. Reconstruction parses UTF-8, applies syntax highlighting (for code blocks), and recomputes layout. Average cost: 2.1ms per chunk on a Snapdragon 778G. We amortize this by reconstructing in a background isolate, posting the result via SendPort. If the user scrolls away before reconstruction completes, we cancel the task via an AbortToken pattern.
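
The cancellation idea can be sketched with a shared flag that the worker checks between units of work. This is a Rust analog under stated assumptions, not the Dart isolate code: the AbortToken shape and the reconstruct function are hypothetical, and the "reconstruction" here only validates UTF-8 and splits lines.

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::Arc;

/// Cloneable cancellation flag shared between the UI side and the worker.
#[derive(Clone, Default)]
struct AbortToken(Arc<AtomicBool>);

impl AbortToken {
    fn abort(&self) {
        self.0.store(true, Ordering::Relaxed);
    }

    fn is_aborted(&self) -> bool {
        self.0.load(Ordering::Relaxed)
    }
}

/// Stand-in for chunk reconstruction. Returns None if the bytes are not
/// valid UTF-8 or if the token was aborted before the work finished.
fn reconstruct(bytes: &[u8], token: &AbortToken) -> Option<Vec<String>> {
    if token.is_aborted() {
        return None;
    }
    let text = std::str::from_utf8(bytes).ok()?;
    let mut lines = Vec::new();
    for line in text.lines() {
        // Check between lines so a cancelled task stops early.
        if token.is_aborted() {
            return None;
        }
        lines.push(line.to_string());
    }
    Some(lines)
}
```

The UI side keeps one clone of the token and calls abort when the chunk scrolls out of the buffer zone; the worker drops its partial result instead of posting it back.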

Results: Production Telemetry

We shipped viewport-aware chunking in OfflineAI 2.4.0 (March 2024). Telemetry from 12,000 DAU over 30 days:

Frame time (99th percentile): 94ms → 25ms in threads >1,000 messages
Memory footprint: 140MB → 82MB for 2,000-message thread
Scroll jank rate: 18.2% → 4.9% (frames >16ms during scroll)
Cold start time: 1,840ms → 1,210ms (loading previous session)

The technique introduced a new failure mode: if chunk reconstruction stalls (e.g., due to CPU throttling), users see blank gaps. We mitigate this by pre-rendering a 0.5× buffer zone ahead of the scroll direction, which reduced blank-gap reports by 89%.

Tradeoffs and Alternatives

Viewport-aware chunking adds complexity: ~1,200 lines of Dart/C++ for chunk management, eviction, and reconstruction. For threads