The Thermal Wall in Mobile Inference

When you run a 3B-parameter LLM on a phone for more than thirty seconds, the device gets hot. When it gets hot enough, the operating system steps in: CPU cores downclock, GPU frequency drops, and, if you're unlucky, the OS terminates your app outright. This isn't a bug. It's thermal throttling, the system's main defense against overheating short of an emergency shutdown.

Most mobile ML tutorials gloss over this. They benchmark cold-start latency on a device fresh out of the box, report tokens-per-second, and call it done. But production apps—voice assistants, real-time translation, continuous health monitoring—run inference continuously. Thermal behavior becomes the dominant constraint, not theoretical TOPS.

This article dissects how modern SoCs manage thermal budgets during sustained inference, why naive approaches fail, and how power gating at the execution block level keeps your app responsive without melting the user's hand.

Thermal Throttling: What Actually Happens

When junction temperature crosses a threshold—typically 85–95°C on ARM SoCs—the kernel's thermal governor kicks in. On Apple silicon, this is managed by IOThermSensor and clpc (closed-loop performance controller). On Qualcomm, it's the thermal-engine daemon.

The throttling cascade looks like this:

  1. Phase 1 (light): Boost clocks disabled. Performance cores drop from 3.5 GHz to 3.0 GHz.
  2. Phase 2 (moderate): Efficiency cores take over more work. GPU frequency capped at 70%.
  3. Phase 3 (severe): All cores downclocked. Neural engine (ANE) or DSP throttled. Frame drops, jank, or process termination.

In a continuous LLM inference loop—say, a 1.5B model generating tokens at 15 tok/s—you hit Phase 2 within 45 seconds on an iPhone 14 Pro at room temperature. By 90 seconds, you're in Phase 3. Token throughput drops to 6 tok/s. The user notices.

Measuring Thermal State

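On iOS, you can poll ProcessInfo.processInfo.thermalState or observe ProcessInfo.thermalStateDidChangeNotification. The state is one of .nominal, .fair, .serious, or .critical. On Android, PowerManager.getCurrentThermalStatus() (API level 29 and later) reports a comparable coarse status. On devices where your app is allowed to read them, /sys/class/thermal/thermal_zone*/temp exposes raw zone temperatures, usually in millidegrees Celsius, which you compare against zone-specific thresholds. These APIs lag real hardware state by 200–500ms, so you need predictive logic.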
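As a minimal sketch of the iOS side, the snippet below observes thermal-state changes instead of polling. ProcessInfo.thermalState and ProcessInfo.thermalStateDidChangeNotification are Foundation APIs; the throttleLevel(for:) mapping and the observeThermalState helper are hypothetical names introduced here for illustration.

```swift
import Foundation

/// Maps the OS-reported thermal state to a hypothetical throttle level
/// (0 = run freely, 3 = back off as far as possible).
func throttleLevel(for state: ProcessInfo.ThermalState) -> Int {
    switch state {
    case .nominal: return 0
    case .fair: return 1
    case .serious: return 2
    case .critical: return 3
    @unknown default: return 3 // treat unknown future states conservatively
    }
}

/// Registers for thermal-state changes and reports the current level once up front.
/// Returns the observer token so the caller can remove it later.
func observeThermalState(_ onChange: @escaping (Int) -> Void) -> NSObjectProtocol {
    onChange(throttleLevel(for: ProcessInfo.processInfo.thermalState))
    return NotificationCenter.default.addObserver(
        forName: ProcessInfo.thermalStateDidChangeNotification,
        object: nil,
        queue: .main
    ) { _ in
        onChange(throttleLevel(for: ProcessInfo.processInfo.thermalState))
    }
}
```

Keep the returned token for as long as you want updates, and pass it to NotificationCenter.default.removeObserver(_:) when inference stops.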

Why Naive Approaches Fail

The obvious fix: just pause inference when thermal state goes critical. This works for batch jobs, but fails for interactive apps. A voice assistant that pauses mid-sentence is worse than one that never started.

Another common mistake: running inference at 100% duty cycle until throttling hits, then backing off. This creates a sawtooth power profile—high current draw, rapid heating, aggressive throttling, cooldown, repeat. The user experiences stutter. The battery takes a beating from voltage sag.

A third trap: offloading everything to the GPU or Neural Engine. These accelerators are thermally coupled to the CPU package. When you saturate the ANE, the CPU thermal budget shrinks. You've just traded one bottleneck for another.

Power Gating: Fine-Grained Thermal Control

Power gating is a hardware feature that selectively shuts down execution units (ALUs, SIMD lanes, cache slices) without stopping the entire core. Modern high-performance cores, such as Arm's Cortex-X3 and Apple's Firestorm P-cores, support per-block gating at sub-microsecond latency.

The idea: instead of running inference full-throttle until thermal shutdown, you modulate power by enabling only the execution resources you need, when you need them. For LLM inference, this means:

  • Matrix multiply: Enable NEON/AMX units, gate scalar ALUs.
  • Activation functions: Gate matrix units, enable scalar + vector.
  • Attention softmax: Enable vector divide, gate everything else.

Apps can't control this directly; no high-level API exposes it. You benefit indirectly when your runtime (ONNX Runtime, llama.cpp, Core ML) schedules ops so that unused units stay idle long enough to be gated, and when the kernel's DVFS (dynamic voltage and frequency scaling) reacts fast enough.

Measuring Power Gating Impact

On Apple silicon, run sudo powermetrics --samplers cpu_power to see per-cluster power draw (the tool requires root). On a MacBook with M1 Pro running a 3B LLM at 20 tok/s, you'll see P-cluster power oscillate between 2W (attention compute) and 6W (matmul). The E-cluster stays under 1W. Total package power: 8–10W sustained, vs. 15W+ without gating.

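On Android, cat /sys/class/power_supply/battery/current_now gives instantaneous current, in microamps on most devices. Multiply it by the reading from voltage_now in the same directory (in microvolts) and divide by 10^12 to get watts. This is whole-device battery draw, not SoC power alone, so compare against an idle baseline. A well-gated inference loop on Snapdragon 8 Gen 2 shows 3–4W average vs. 7W for naive full-throttle execution.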
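The unit conversion is easy to get wrong, so here is the arithmetic as a small sketch. It assumes current_now is in microamps and voltage_now in microvolts, which is common but not universal; batteryPowerWatts is a hypothetical helper, and the sample readings are illustrative rather than measured.

```swift
import Foundation

/// Converts raw battery readings to watts.
/// Assumes microamps and microvolts; check your device, since units can vary.
func batteryPowerWatts(currentMicroamps: Int, voltageMicrovolts: Int) -> Double {
    // Some devices report discharge current as negative, so use the magnitude.
    let amps = abs(Double(currentMicroamps)) / 1_000_000.0
    let volts = Double(voltageMicrovolts) / 1_000_000.0
    return amps * volts
}

// Illustrative readings: 900 mA of discharge at 3.85 V.
let watts = batteryPowerWatts(currentMicroamps: -900_000, voltageMicrovolts: 3_850_000)
print(String(format: "%.3f W", watts))
```

With these sample values the result is about 3.5 W, in the same range as the well-gated figure quoted above.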

Implementing Thermal-Aware Inference

Here's a practical strategy for a mobile LLM app:

1. Thermal Budget Predictor

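Track a rolling average of per-token power consumption. Use a Kalman filter to predict when you'll hit the next thermal threshold; the snippet below simplifies this to a fixed gain, which amounts to exponential smoothing of the measured power. This gives you 5–10 seconds of warning before the OS steps in.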

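var powerHistory: [Double] = []  // measured power samples, in watts
let kalmanGain = 0.3             // fixed gain; a full Kalman filter would recompute this each step
var predictedBudget = 10.0       // smoothed power estimate, in watts

// Blend each new measurement into the running estimate.
func updateBudget(measuredPower: Double) {
    powerHistory.append(measuredPower)  // powerHistory must be a var to allow this
    let innovation = measuredPower - predictedBudget  // gap between measurement and estimate
    predictedBudget += kalmanGain * innovation
}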
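The fixed-gain update above is a simplification. For comparison, here is a minimal sketch of a scalar Kalman filter that derives its gain from assumed noise levels. It models power draw as a random walk; the ScalarKalman type and the noise values (processNoise, measurementNoise) are illustrative assumptions, not tuned figures.

```swift
import Foundation

/// One-dimensional Kalman filter over a random-walk model of power draw.
struct ScalarKalman {
    var estimate: Double          // current power estimate, in watts
    var variance: Double          // uncertainty of that estimate
    let processNoise: Double      // q: how fast true power is assumed to drift
    let measurementNoise: Double  // r: how noisy each reading is assumed to be

    /// Folds in one measurement and returns the gain used for it.
    @discardableResult
    mutating func update(measurement: Double) -> Double {
        // Predict: under a random walk the estimate stays put, uncertainty grows.
        variance += processNoise
        // Update: the gain weighs prior uncertainty against measurement noise.
        let gain = variance / (variance + measurementNoise)
        estimate += gain * (measurement - estimate)
        variance *= (1.0 - gain)
        return gain
    }
}

// Start from a 10 W guess and feed a steady 5 W reading.
var filter = ScalarKalman(estimate: 10.0, variance: 1.0,
                          processNoise: 0.01, measurementNoise: 0.5)
for _ in 0..<50 {
    filter.update(measurement: 5.0)
}
print(filter.estimate)
```

The gain starts high while the estimate is uncertain and settles to a smaller steady value, so early readings move the estimate quickly and later noise is damped.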

2. Adaptive Batch Size

When thermal budget shrinks, reduce the number of tokens processed per inference call. Instead of generating 16 tokens at once, drop to 8 or 4. This spreads heat generation over time, giving the SoC time to dissipate between bursts.
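One way to sketch this is a lookup from thermal state to burst size. The 16, 8, and 4 values come from the paragraph above; using 1 for .critical, and the tokensPerBurst and runBurst names, are assumptions made for illustration. The generateToken closure stands in for whatever your runtime's decode step is.

```swift
import Foundation

/// Picks how many tokens to generate per burst for a given thermal state.
func tokensPerBurst(for state: ProcessInfo.ThermalState) -> Int {
    switch state {
    case .nominal: return 16
    case .fair: return 8
    case .serious: return 4
    case .critical: return 1   // assumption: keep trickling output instead of stopping
    @unknown default: return 1 // be conservative with states we don't recognize
    }
}

/// Runs one burst, calling the supplied decode step once per token.
func runBurst(state: ProcessInfo.ThermalState, generateToken: () -> String) -> [String] {
    return (0..<tokensPerBurst(for: state)).map { _ in generateToken() }
}
```

Between bursts the app re-reads the thermal state, so the burst size follows the device's condition without any change to the model itself.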

3. Hybrid CPU-GPU Scheduling

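Run matmul on the GPU, attention on the CPU. On a phone SoC the GPU shares the die and package with the CPU, so it has no separate cooling path; the likely benefit is that the work is spread across more silicon instead of being concentrated in a few CPU cores. The CPU handles branchy code (sampling, tokenization) more efficiently. This load-balancing keeps both units below throttle thresholds.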

4. Preemptive Downclocking

When you detect thermal state transitioning from .nominal to .fair, voluntarily drop your inference thread priority or insert 50ms sleeps between tokens. At 20 tok/s each token takes 50ms of compute, so a 50ms sleep halves the rate to 10 tok/s. Let the SoC cool down before the OS forces you to. The user gets a consistent 10 tok/s instead of 20 tok/s collapsing to 5.
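The sleep length can be derived from a target rate instead of being hard-coded. A minimal sketch, where pacingSleep is a hypothetical helper and the inputs are the figures from the paragraph above:

```swift
import Foundation

/// Returns how long to sleep after each token so the overall rate
/// does not exceed the target. Never returns a negative duration.
func pacingSleep(computeSeconds: Double, targetTokensPerSecond: Double) -> Double {
    guard targetTokensPerSecond > 0 else { return 0 }
    let period = 1.0 / targetTokensPerSecond  // time budget per token
    return max(0.0, period - computeSeconds)
}

// 50 ms of compute per token (20 tok/s unthrottled), paced down to 10 tok/s.
let sleepSeconds = pacingSleep(computeSeconds: 0.05, targetTokensPerSecond: 10)
print(sleepSeconds)
```

In the decode loop, the result can be passed to Thread.sleep(forTimeInterval:) after each token. If compute time already exceeds the budget, the function returns zero and the loop runs unpaced.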

Real-World Results

In a production offline translation app tested on iPhone 14 Pro, naive inference sustained 18 tok/s for 40 seconds, then dropped to 7 tok/s as throttling kicked in. With thermal-aware gating and adaptive batching, the same model sustained 13 tok/s for 5 minutes straight, never hitting severe throttle. Battery drain: 35% over 30 minutes vs. 42% for the naive approach.

On a Samsung Galaxy S23 Ultra running a 1.5B model with hybrid CPU-GPU scheduling, the app sustained 11 tok/s for 8 minutes before throttling to 9 tok/s, vs. 3 tok/s after 2 minutes without power management.

When Power Gating Isn't Enough

If your model is too large or your target device too constrained, power gating alone won't save you. You'll need:

  • Model pruning: Remove 20–30% of weights that contribute least to output quality.
  • Quantization: Drop from 8-bit to 4-bit weights. Halves the weight bytes read from memory per token, cuts power by 40%.
  • Speculative decoding: Use a tiny draft model to predict next tokens, verify with the full model. Reduces full-model invocations by 60%.
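To make the quantization bullet concrete, here is a rough sketch of the memory-traffic arithmetic. It assumes a dense model decoding one sequence at a time, where every weight is read once per token, and it ignores the KV cache and quantization metadata. The function names and the 10 tok/s rate are illustrative.

```swift
import Foundation

/// Total size of the weights in bytes for a given precision.
func weightBytes(parameters: Double, bitsPerWeight: Double) -> Double {
    return parameters * bitsPerWeight / 8.0
}

/// Approximate weight traffic in GB/s, assuming all weights are read once per token.
func weightTrafficGBps(parameters: Double, bitsPerWeight: Double,
                       tokensPerSecond: Double) -> Double {
    return weightBytes(parameters: parameters, bitsPerWeight: bitsPerWeight)
        * tokensPerSecond / 1e9
}

// A 3B-parameter model at 10 tok/s, 8-bit vs. 4-bit weights.
let traffic8 = weightTrafficGBps(parameters: 3e9, bitsPerWeight: 8, tokensPerSecond: 10)
let traffic4 = weightTrafficGBps(parameters: 3e9, bitsPerWeight: 4, tokensPerSecond: 10)
print(traffic8, traffic4)
```

Under these assumptions the 8-bit model moves about 30 GB/s of weights and the 4-bit model about 15 GB/s, which is where the halving comes from.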

These are complementary. Power gating buys you headroom; model optimization makes the headroom bigger.

Takeaways for Mobile ML Engineers

Thermal throttling is not an edge case. It's the steady-state reality of sustained on-device inference. Benchmarks that ignore it are measuring the wrong thing.

Power gating gives you fine-grained control over thermal budget without sacrificing too much throughput. But the gating itself happens in hardware; what you control is the workload you present to it, and that requires instrumentation: real-time power monitoring, predictive thermal modeling, adaptive scheduling. The iOS and Android thermal-state APIs exist; you just have to use them.

In production apps where inference runs continuously—voice, vision, health monitoring—thermal design is as important as model accuracy. A 3B model that throttles after one minute is worse than a 1B model that runs forever.

The goal isn't maximum tokens-per-second. It's consistent, thermally sustainable performance that doesn't burn the user's hand or drain the battery in twenty minutes. Power gating is how you get there.