When we shipped OfflineAI's on-device chat in late 2023, profiling showed something odd: the token-generation loop, arguably the hottest path in any autoregressive LLM, was stalling on branch mispredictions at a 22% rate on Apple A15 and Snapdragon 8 Gen 2 devices. Each misprediction costs 15–20 cycles of frontend flush and re-fetch. At 3.2 GHz, that's roughly 5–6 nanoseconds per miss. One miss is negligible, but across a 400-token response at 60 iterations per token, the stalls in our profiles accumulated to about 140 milliseconds spent waiting for the CPU to guess wrong.
This article dissects why token loops defy branch predictors, how to instrument misprediction rates on ARM and x86, and three compiler-level interventions that dropped our P95 inference latency from 1.83s to 1.69s on mid-range Android without touching model architecture.
Why Token Loops Confound Predictors
Autoregressive decoding is a tight loop: sample logits, pick a token, append to context, repeat until EOS or max length. The exit condition—token == eos_token_id—looks trivial, but modern predictors are tuned for regular patterns: loop counters, alternating branches, correlated conditionals. Token IDs are pseudo-random from the predictor's perspective. A perplexity-8 model generates tokens with ~3 bits of entropy per step; the CPU's two-level adaptive predictor with 4K-entry PHT can't learn that.
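The loop described above can be sketched roughly as follows. This is a minimal illustration, not our production code: the Sampler type and the generate function are hypothetical names, and the sampler stands in for the whole forward pass plus sampling step.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <vector>

// Hypothetical sampler: given the context so far, return the next token ID.
using Sampler = std::function<std::int32_t(const std::vector<std::int32_t>&)>;

// Baseline autoregressive loop: sample, check for EOS, append, repeat.
std::vector<std::int32_t> generate(const Sampler& sample_next,
                                   std::int32_t eos_token_id,
                                   std::size_t max_length) {
    assert(sample_next && "sampler must be callable");
    std::vector<std::int32_t> context;
    context.reserve(max_length);
    while (context.size() < max_length) {
        const std::int32_t token = sample_next(context);
        if (token == eos_token_id) {
            break;  // exit depends on the sampled value, not on a counter
        }
        context.push_back(token);
    }
    return context;
}
```

The two exits behave differently from the predictor's point of view: the max_length test follows a counter, while the EOS test depends on whatever value the sampler just produced.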
We instrumented a Llama-2-7B quantized to 4-bit on a Pixel 7 Pro using perf stat with branches and branch-misses counters. Over 100 inference runs averaging 380 tokens each, we logged 22.1% misprediction rate in the decoding loop, versus 2.8% across the rest of the codebase. The smoking gun: perf annotate showed 89% of misses on a single cmp instruction comparing sampled token to EOS.
Profiling Branch Behavior on ARM
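On Apple platforms, Time Profiler alone does not report branch events. Instruments' CPU Counters template can sample hardware PMU events where the device exposes them; otherwise you have to infer stalls indirectly from cycle and instruction counts. On M1/M2, pair this with powermetrics sampling at 100ms intervals and compare cycles against instructions retired. A CPI (cycles per instruction) above 1.4 in tight loops signals either memory or branch penalties, so treat it as a prompt to dig further rather than as proof of mispredictions. Note that dtrace's fbt provider traces kernel function boundaries, not hardware branches, so probes such as fbt::*branch* on macOS are only useful for cross-referencing kernel-side activity.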
For Android, simpleperf is your friend. Count both events with simpleperf stat -e branch-misses,branch-instructions --app com.yourapp, then divide branch-misses by branch-instructions to get the misprediction rate: anything above 5% in a hot loop is actionable. We also used ARM's Streamline profiler with DS-5 annotations to correlate mispredictions with specific LLVM IR blocks post-JIT.
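The division is simple enough to do by hand, but a small helper keeps the threshold in one place when you script it. The sketch below assumes you have already extracted the two raw counts from the profiler output; the function names are illustrative, and the 5% cutoff is the rule of thumb from the paragraph above.

```cpp
#include <cassert>
#include <cstdint>

// Misprediction rate as a percentage: branch-misses / branch-instructions * 100.
// Returns 0.0 when no branches were counted, to avoid dividing by zero.
double misprediction_rate_percent(std::uint64_t branch_misses,
                                  std::uint64_t branch_instructions) {
    assert(branch_misses <= branch_instructions);
    if (branch_instructions == 0) {
        return 0.0;
    }
    return 100.0 * static_cast<double>(branch_misses) /
           static_cast<double>(branch_instructions);
}

// Rule of thumb from the text: above 5% in a hot loop is worth acting on.
bool is_actionable(double rate_percent) {
    return rate_percent > 5.0;
}
```

For example, 221 misses out of 1,000 branches gives a rate of about 22.1%, well past the cutoff.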
Isolating the Hot Path
Not all branches matter. We wrapped the token loop in a custom trace marker (ATrace_beginSection on Android, os_signpost on iOS) and filtered perf events to that span. The loop body had three branches: EOS check, top-k sampling conditional, and a bounds check on the KV cache. The EOS branch alone accounted for 71% of misses. The sampling branch—triggered ~40% of the time when temperature >0.8—had only 6% miss rate because the predictor learned the bias.
Intervention 1: Likely/Unlikely Hints
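C++20's [[likely]] and [[unlikely]] attributes are lowered by Clang to LLVM's llvm.expect intrinsic and, from there, to branch-weight metadata that steers basic-block layout. They do not program the hardware predictor directly; they change which path the instruction fetcher sees as the fallthrough. We annotated the EOS branch: if (token == eos_token_id) [[unlikely]] { break; }. Clang 15+ moves the unlikely path out of line into a cold section, keeping the hot path linear and improving I-cache density.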
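In context, the annotation looks something like the sketch below. It requires a C++20 compiler, and tokens_before_eos is a hypothetical helper that walks a pre-sampled buffer so the example stays self-contained; in a real decoder the tokens arrive one at a time.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Counts how many tokens precede the first EOS in `sampled`.
// `sampled` stands in for tokens that a real decoder would produce one by one.
std::size_t tokens_before_eos(const std::vector<std::int32_t>& sampled,
                              std::int32_t eos_token_id) {
    std::size_t count = 0;
    for (const std::int32_t token : sampled) {
        // Layout hint: the exit is rare, so keep "continue generating"
        // as the straight-line fallthrough path.
        if (token == eos_token_id) [[unlikely]] {
            break;
        }
        ++count;
    }
    assert(count <= sampled.size());
    return count;
}
```

The attribute is only a hint. Whether the compiler actually moves the exit block depends on the optimization level and target, so it is worth checking the generated assembly.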
Result: misprediction rate dropped to 18.3% (from 22.1%) on Snapdragon 8 Gen 2. Latency improved by 48ms on average. The predictor still can't learn token randomness, but branch target prediction improved because the fallthrough path (continue generating) now aligns with fetch direction. On Apple A15, we saw 52ms improvement; the A15's larger branch target buffer seems to benefit more from layout hints.
Compiler Flags Matter
We built an instrumented binary with -fprofile-instr-generate, collected 500 inference traces in production, merged the raw profiles with llvm-profdata, and rebuilt with -fprofile-instr-use. Profile-guided optimization (PGO) reordered basic blocks based on actual edge weights, placing the EOS exit far from the loop header. Combined with [[unlikely]], this gave us an additional 22ms on top of the initial 48ms win. Total: 70ms from branch layout alone.
Intervention 2: Branchless Token Comparison
We experimented with a branchless mask: uint32_t done = (token == eos_token_id) ? 0xFFFFFFFF : 0; if (done) break;. The comparison produces an all-ones mask via cmp + csetm on ARM, but the if that follows is still a branch, and it now depends on the materialized mask, which forces serialization. Counterintuitively, this was 12ms slower: speculative execution on the original branch was hiding latency better than the data hazard.
However, when we combined branchless comparison with SIMD token batching (processing 4 tokens in parallel during top-k sampling), the mask could be ORed across lanes and tested once. This worked for speculative decoding where we generate multiple candidates; single-token generation saw no benefit.
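The lane-wise OR can be sketched in portable C++ without intrinsics. This is an illustration under assumptions: kLanes, eos_mask, and any_lane_hit_eos are hypothetical names, the batch width of 4 mirrors the paragraph above, and whether the compiler emits a compare-and-select or vector instructions here is not guaranteed.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

// Assumed batch width, matching the 4-token batches described in the text.
constexpr std::size_t kLanes = 4;

// All-ones when token == eos_token_id, zero otherwise.
// Unsigned wraparound turns 0 - 1 into 0xFFFFFFFF.
inline std::uint32_t eos_mask(std::int32_t token, std::int32_t eos_token_id) {
    return std::uint32_t{0} - static_cast<std::uint32_t>(token == eos_token_id);
}

// OR the per-lane masks so the caller branches once per batch, not once per lane.
inline bool any_lane_hit_eos(const std::array<std::int32_t, kLanes>& tokens,
                             std::int32_t eos_token_id) {
    std::uint32_t combined = 0;
    for (std::size_t lane = 0; lane < kLanes; ++lane) {
        combined |= eos_mask(tokens[lane], eos_token_id);
    }
    return combined != 0;
}
```

The single test on the combined mask is what makes this worthwhile for batched candidates. With only one token per step there is nothing to combine, which matches the lack of benefit we saw in single-token generation.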
Intervention 3: Loop Unrolling with Exit Amortization
We unrolled the token loop by 4×, checking EOS only every fourth iteration. With one check per group of four, we overshoot by at most 3 tokens before catching EOS, which is small against the model's average sequence length of 380 tokens. We truncated post-hoc. This cut branch checks by 75%, and the predictor's job got easier because the loop counter (now incrementing by 4) created a learnable pattern.
Misprediction rate fell to 9.1%. Latency dropped another 68ms. The tradeoff: we generate up to 3 extra tokens per sequence, adding ~18ms of wasted compute. Net win: 50ms. We also had to handle edge cases where max_length isn't divisible by 4, using a scalar cleanup loop.
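A simplified version of the unrolled loop, with post-hoc truncation and the scalar cleanup loop, might look like this. It is a sketch under assumptions: Sampler and generate_unrolled are hypothetical names, the sampler stands in for the full forward pass, and the unroll factor is fixed at 4.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <vector>

// Hypothetical sampler: given the context so far, return the next token ID.
using Sampler = std::function<std::int32_t(const std::vector<std::int32_t>&)>;

std::vector<std::int32_t> generate_unrolled(const Sampler& sample_next,
                                            std::int32_t eos_token_id,
                                            std::size_t max_length) {
    constexpr std::size_t kUnroll = 4;
    std::vector<std::int32_t> context;
    context.reserve(max_length);
    bool done = false;

    // Main loop: emit four tokens, then branch on EOS once for the whole group.
    while (!done && context.size() + kUnroll <= max_length) {
        const std::size_t group_start = context.size();
        bool hit = false;
        for (std::size_t i = 0; i < kUnroll; ++i) {
            const std::int32_t token = sample_next(context);
            hit |= (token == eos_token_id);  // accumulate, do not branch yet
            context.push_back(token);
        }
        if (hit) {
            // Post-hoc truncation: drop EOS and anything generated after it.
            const auto first =
                context.begin() + static_cast<std::ptrdiff_t>(group_start);
            context.erase(std::find(first, context.end(), eos_token_id),
                          context.end());
            done = true;
        }
    }

    // Scalar cleanup for the tail when max_length is not a multiple of 4.
    while (!done && context.size() < max_length) {
        const std::int32_t token = sample_next(context);
        if (token == eos_token_id) {
            break;
        }
        context.push_back(token);
    }
    return context;
}
```

If EOS lands on the third token of a group, the sampler still runs for the fourth before the group check fires. That extra call is the wasted compute described above, and the output is identical to the non-unrolled loop after truncation.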
Unroll Factor Tuning
We tested 2×, 4×, 8× unrolls. Beyond 4×, code size bloat hurt I-cache, and the wasted token cost outweighed branch savings. At 2×, the predictor still struggled. Four was the sweet spot on both Snapdragon and Apple Silicon. On older Cortex-A53 devices (budget phones), 2× was better due to smaller L1 instruction cache.
Interaction with Quantization and KV Cache
Branch penalties interact with memory latency. Our 4-bit weights use lookup-table dequantization, which is memory-bound. When the CPU stalls on a branch miss, it can overlap with DRAM fetch for the next token's weights—effectively hiding some branch cost. After applying unrolling, we noticed KV cache miss rate crept up by 1.2% because the prefetcher was now racing ahead and evicting cache lines prematurely.
We tuned __builtin_prefetch on the KV cache read path, inserting prefetch 8 iterations ahead (32 tokens with 4× unroll). This restored the original cache hit rate and gave us an extra 14ms. The interplay between branch prediction, prefetch, and memory subsystem is non-obvious; measure everything.
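The shape of the prefetch hint is roughly as follows. This is a deliberately simplified sketch: sum_with_prefetch and the flat float vector are stand-ins for a real KV-cache read path, kPrefetchDistance is an assumed tuning constant taken from the 32-token figure above, and __builtin_prefetch is a GCC/Clang builtin, so the call is guarded for other compilers.

```cpp
#include <cstddef>
#include <vector>

// Assumed prefetch distance in elements: 8 unrolled iterations of 4 tokens.
constexpr std::size_t kPrefetchDistance = 32;

// Sums one float per cached token while hinting upcoming reads to the cache.
// `kv` is a hypothetical flattened KV-cache row; real layouts differ.
float sum_with_prefetch(const std::vector<float>& kv) {
    float acc = 0.0f;
    const std::size_t n = kv.size();
    for (std::size_t i = 0; i < n; ++i) {
#if defined(__GNUC__) || defined(__clang__)
        // Stay in bounds so we never form an address past the buffer.
        if (i + kPrefetchDistance < n) {
            // Arguments: address, rw = 0 (read), locality = 3 (keep in cache).
            __builtin_prefetch(&kv[i + kPrefetchDistance], 0, 3);
        }
#endif
        acc += kv[i];
    }
    return acc;
}
```

A purely sequential walk like this one is usually covered by the hardware prefetcher already. The hint matters more for strided or indirect access patterns, so the distance needs to be measured per device rather than copied.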
Validation Across Devices
We rolled out the optimizations to 12,000 beta users on 47 device models. Median latency improvement: 142ms (7.8% faster). Long-tail (P95) improved by 183ms. Older devices (Snapdragon 730, A13 Bionic) saw smaller gains (60–80ms) because their predictors are less sophisticated and branch cost is dwarfed by memory bandwidth limits. High-end devices (8 Gen 2, A16) showed the biggest wins (180–210ms).
We also A/B tested against a baseline with no hints. User-perceived latency (time to first token) dropped below the 200ms "instant" threshold for 68% of queries, up from 54%. Engagement in multi-turn conversations increased 11%, likely because faster responses feel more natural.
When to Optimize Branches
Not every branch matters. Profile first. If the misprediction rate is below 10%, or the branch isn't in a hot loop (one that accounts for more than 10% of total cycles), don't bother. For mobile LLMs, the token loop is always hot. For other domains, such as real-time audio DSP, video encoding, and physics engines, look for loops with unpredictable exits: packet arrival, frame drops, collision detection.
Branch optimization is a last-mile technique. We exhausted model quantization, operator fusion, and memory layout first. Only after those gave diminishing returns did we dive into microarchitecture. But when you're fighting for every millisecond to hit a perceptual threshold, branch hints are a lever worth pulling.
Takeaways
Modern CPUs are superscalar, out-of-order, speculative machines, but they still guess wrong on random data. Token generation is inherently unpredictable; the best you can do is give the hardware better odds. [[likely]] annotations, PGO, and loop unrolling with exit amortization together cut our inference latency by 8–10% on mid-range devices—a meaningful win when you're shipping to hundreds of thousands of users who expect ChatGPT-level responsiveness on a $300 phone.
Instrument your hot paths. Use perf, simpleperf, Instruments, or vendor-specific tools. Correlate high-level latency with low-level PMU events. And remember: branch prediction is just one piece of the puzzle. Memory, cache, prefetch, and instruction decode all interact. Optimize holistically, measure obsessively, and ship improvements your users can feel.