Shipping on-device AI means confronting a hard truth: your model might run beautifully in Xcode or Android Studio, but after 45 seconds of continuous inference on a real device in a user's pocket, performance can collapse by 60% or more. Thermal throttling—the CPU and GPU's self-preservation mechanism—is the silent killer of mobile AI products.
This article explores thermal-aware inference design: detection strategies, graceful degradation patterns, and architectural choices that keep your app responsive when the device heats up. Examples draw from production systems like GlucoScan AI, which processes PPG signals continuously, and OfflineAI, where multi-turn LLM conversations can span minutes.
Why Thermal Throttling Destroys Inference
Modern mobile SoCs throttle frequency when junction temperature exceeds ~80–95°C. On an iPhone 13 Pro, sustained Core ML inference can trigger throttling in under 30 seconds. On mid-range Android devices with passive cooling, it happens faster. Once throttled, inference latency can spike from 120ms to 400ms per pass, frame rates drop, and battery drain accelerates as the scheduler fights thermals.
The worst part: throttling is non-linear. A model that runs at 8fps might suddenly drop to 3fps, then recover briefly, then crater again. Users perceive this as jank, crashes, or broken features.
Measurement: Knowing When You're Cooked
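iOS exposes ProcessInfo.processInfo.thermalState with four levels: nominal, fair, serious, critical. Android is harder. PowerManager.getCurrentThermalStatus() (API 29+) reports a coarse status, and PowerManager.getThermalHeadroom(forecastSeconds) (API 30+) returns a forecast of thermal headroom as a float, normalized so that 1.0 corresponds to severe throttling; it can return NaN when no forecast is available. Older devices require heuristics: CPU frequency polling via /sys/devices/system/cpu/cpu0/cpufreq/scaling_cur_freq, or observing inference latency drift over rolling windows.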
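The frequency-polling heuristic can be sketched in a few lines. This is a platform-neutral illustration in Python, not code from any of the apps mentioned. The helper names and the 0.7 threshold are assumptions, and some devices may restrict access to these sysfs files. The ratio is only meaningful while inference is actually running, since idle cores clock down on their own.

```python
from pathlib import Path
from typing import Sequence

# Standard Linux cpufreq location for core 0; values are reported in kHz.
CPUFREQ_DIR = Path("/sys/devices/system/cpu/cpu0/cpufreq")


def parse_khz(text: str) -> int:
    """Parse the contents of one cpufreq sysfs file (an integer in kHz)."""
    return int(text.strip())


def frequency_ratio(cpufreq_dir: Path = CPUFREQ_DIR) -> float:
    """Current frequency as a fraction of the hardware maximum."""
    current = parse_khz((cpufreq_dir / "scaling_cur_freq").read_text())
    maximum = parse_khz((cpufreq_dir / "cpuinfo_max_freq").read_text())
    return current / maximum


def looks_throttled(ratios: Sequence[float], threshold: float = 0.7) -> bool:
    """True when every recent sample is below the threshold.

    Requiring all samples to be low filters out brief dips.
    The 0.7 threshold is an assumed starting point, not a measured value.
    """
    return len(ratios) > 0 and all(r < threshold for r in ratios)
```

Collect a handful of ratios a second or two apart during inference and pass them to looks_throttled; a single low reading is not treated as throttling.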
In practice, you need both proactive and reactive signals. Proactive: check thermal state before starting a heavy workload (e.g., batch transcription). Reactive: monitor per-inference latency and detect when P95 latency exceeds baseline by 50%+. On iOS, subscribe to ProcessInfo.thermalStateDidChangeNotification; on Android, poll headroom every 2–5 seconds during inference.
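The reactive signal can be sketched as a rolling-window monitor. The class name, window size, and minimum sample count below are assumptions for illustration; the 1.5 ratio corresponds to the "50% above baseline" rule, and the baseline itself would come from measurements on a cool device.

```python
from collections import deque
from typing import Optional


class LatencyDriftMonitor:
    """Flags drift when the rolling P95 latency exceeds baseline * drift_ratio."""

    def __init__(self, baseline_p95_ms: float, window: int = 50,
                 min_samples: int = 10, drift_ratio: float = 1.5):
        self.baseline_p95_ms = baseline_p95_ms
        self.min_samples = min_samples
        self.drift_ratio = drift_ratio
        self.samples = deque(maxlen=window)  # oldest samples fall off automatically

    def record(self, latency_ms: float) -> None:
        self.samples.append(latency_ms)

    def p95(self) -> Optional[float]:
        """Nearest-rank 95th percentile of the current window, or None if empty."""
        if not self.samples:
            return None
        ordered = sorted(self.samples)
        rank = max(1, (95 * len(ordered) + 99) // 100)  # integer ceil(0.95 * n)
        return ordered[rank - 1]

    def is_drifting(self) -> bool:
        """False until enough samples exist to make the percentile meaningful."""
        if len(self.samples) < self.min_samples:
            return False
        return self.p95() > self.baseline_p95_ms * self.drift_ratio
```

Call record() after each inference pass and check is_drifting() alongside the platform thermal signal; either one firing is a reason to degrade.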
Degradation Strategies
1. Dynamic Batch Size Reduction
If you're processing frames or tokens in batches, shrink batch size when thermals rise. A vision model running 4 frames/batch at 10fps can drop to 1 frame/batch at 8fps—total throughput falls 20%, but latency stays predictable. This works well for real-time pipelines (OCR, object detection) where dropping frames is acceptable.
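Implementation: maintain a thermal budget counter. Start at batch=4. On thermalState == .serious, drop to 2. On .critical, go to 1. When the state recovers to fair, raise the batch size one step at a time, and only after the state has held steady for several consecutive checks; this hysteresis prevents flapping between batch sizes.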
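A minimal sketch of that controller, written as platform-neutral Python. The ThermalLevel enum mirrors the four iOS states, and the caller is assumed to translate the platform signal into it. The class name and the recovery_ticks default are assumptions.

```python
from enum import IntEnum


class ThermalLevel(IntEnum):
    NOMINAL = 0
    FAIR = 1
    SERIOUS = 2
    CRITICAL = 3


class BatchSizeController:
    """Shrinks batch size quickly under heat and grows it back slowly."""

    def __init__(self, max_batch: int = 4, serious_batch: int = 2,
                 min_batch: int = 1, recovery_ticks: int = 5):
        self.max_batch = max_batch
        self.serious_batch = serious_batch
        self.min_batch = min_batch
        self.recovery_ticks = recovery_ticks  # calm checks required per step up
        self.batch_size = max_batch
        self._calm_ticks = 0

    def update(self, level: ThermalLevel) -> int:
        """Feed the latest thermal level; returns the batch size to use next."""
        if level >= ThermalLevel.CRITICAL:
            self.batch_size = self.min_batch
            self._calm_ticks = 0
        elif level == ThermalLevel.SERIOUS:
            # Never grow while serious: cap at serious_batch, keep lower values.
            self.batch_size = max(self.min_batch,
                                  min(self.batch_size, self.serious_batch))
            self._calm_ticks = 0
        else:
            # FAIR or NOMINAL: step up by one only after a run of calm checks.
            self._calm_ticks += 1
            if (self._calm_ticks >= self.recovery_ticks
                    and self.batch_size < self.max_batch):
                self.batch_size += 1
                self._calm_ticks = 0
        return self.batch_size
```

The asymmetry is the point: reductions apply immediately, while each increase has to be earned with recovery_ticks consecutive calm readings.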
2. Model Swapping: Heavy → Lite
Ship multiple quantized variants of your model. For example, an LLM might have Q8, Q6, and Q4 versions. Start with Q8 for quality. When thermal headroom runs low (for example, when the state reaches serious, or when Android's 10-second headroom forecast nears the throttling threshold), swap to Q6 mid-session. At critical, drop to Q4. The user sees slightly less coherent output, but the app doesn't freeze.
In ONNX Runtime, this means either preloading multiple InferenceSession instances (a memory trade-off) or lazy-loading on demand (a latency spike at swap time). Preloading wins if you have […] 500ms persists for >30s.
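The preload-versus-lazy choice can be sketched with a small manager that takes the loader as a parameter. The class name, the tier keys, and the mapping from thermal level to tier are assumptions for illustration; in particular, swapping to Q6 at "serious" is one possible policy, not a rule.

```python
from typing import Callable, Dict, Generic, TypeVar

S = TypeVar("S")  # whatever session type the loader returns


class ModelTierManager(Generic[S]):
    """Holds one session per quantization tier, loaded eagerly or on first use."""

    def __init__(self, paths: Dict[str, str], load: Callable[[str], S],
                 preload: bool = False):
        self._paths = paths
        self._load = load
        self._sessions: Dict[str, S] = {}
        if preload:
            # Pay the memory cost up front so swaps are instant later.
            for tier in paths:
                self.session(tier)

    def session(self, tier: str) -> S:
        """Return the session for a tier, loading it on first request."""
        if tier not in self._sessions:
            self._sessions[tier] = self._load(self._paths[tier])
        return self._sessions[tier]


# Assumed policy: full quality until "serious", smallest model at "critical".
TIER_BY_LEVEL = {"nominal": "q8", "fair": "q8", "serious": "q6", "critical": "q4"}


def tier_for_level(level: str) -> str:
    return TIER_BY_LEVEL[level]
```

With the ONNX Runtime Python package, the loader could be onnxruntime.InferenceSession, which accepts a model path as its first argument; the model file paths themselves would be your own.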
When to Avoid On-Device Inference
If your app requires sustained high-throughput inference (e.g., live video analysis at 60fps, multi-hour audio transcription), on-device may not be viable. Thermal constraints are physics. Consider hybrid architectures: on-device for low-latency preview, cloud for final processing. Or restrict features to short bursts (30s max) with mandatory cooldowns.
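The burst-with-cooldown option can be sketched as a simple gate that the inference loop consults before each pass. The class name and the 60-second cooldown are assumptions; only the 30-second burst limit comes from the text above. In this sketch, idle time inside a burst still counts toward the limit.

```python
import time
from typing import Callable, Optional


class BurstGate:
    """Allows work for at most max_burst_s, then enforces cooldown_s of rest."""

    def __init__(self, max_burst_s: float = 30.0, cooldown_s: float = 60.0,
                 clock: Callable[[], float] = time.monotonic):
        self.max_burst_s = max_burst_s
        self.cooldown_s = cooldown_s
        self._clock = clock  # injectable so the gate can be tested without sleeping
        self._burst_started: Optional[float] = None
        self._cooldown_until: Optional[float] = None

    def may_run(self) -> bool:
        """Call before each inference pass; False means skip this pass."""
        now = self._clock()
        if self._cooldown_until is not None:
            if now < self._cooldown_until:
                return False
            self._cooldown_until = None  # cooldown finished
        if self._burst_started is None:
            self._burst_started = now  # first pass of a new burst
            return True
        if now - self._burst_started >= self.max_burst_s:
            # Burst budget spent: start the mandatory cooldown.
            self._burst_started = None
            self._cooldown_until = now + self.cooldown_s
            return False
        return True
```

When may_run() returns False, the app can show a paused state or hand the work to the cloud path instead of silently dropping it.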
Key Takeaways
Thermal throttling is not an edge case—it's the default state for sustained mobile AI workloads. Design for it from day one:
- Instrument thermal state and inference latency in production.
- Ship multiple model variants (quantization levels) and swap dynamically.
- Use batch size, frame skipping, and cooldown windows to spread thermal load.
- Test on mid-range devices in warm environments, not just flagship phones in air-conditioned offices.
The best mobile AI products aren't the ones with the biggest models—they're the ones that stay responsive when the device is hot, the battery is low, and the user is in a hurry.