ONNX Runtime ships with sensible defaults for matrix multiplication block sizes, but mobile devices span a 10× performance range—from flagship Snapdragon 8 Gen 3 to budget MediaTek chips. A fixed block size optimized for one SoC often leaves performance or power efficiency on the table for another. This article explores adaptive block size selection: measuring device capabilities at init time, adjusting tile dimensions per frame based on thermal state, and exposing a latency-power slider to the application layer.
Why Block Size Matters
Modern neural network inference is bottlenecked by matrix multiplications. ONNX Runtime's CPU execution provider tiles large GEMMs into smaller blocks to fit L1/L2 cache and maximize SIMD utilization. A 256×256 block size on a Cortex-A78 core might saturate the NEON pipeline, but on a Cortex-A55 efficiency core, the same block causes cache thrashing and stalls the pipeline. The result: identical model, 3× latency variance across devices.
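To make the cache argument concrete, here is a rough back-of-the-envelope sketch. It assumes square float32 tiles and that one tile each of A, B, and C must stay resident at once; real GEMM kernels use rectangular panels and packing, so treat the numbers as an approximation. The function names are hypothetical.

```kotlin
// Simplified model (assumption): one square float32 tile each of A, B and C
// must be resident in cache at the same time.
fun tileWorkingSetBytes(block: Int, bytesPerElement: Int = 4): Long =
    3L * block * block * bytesPerElement

// Largest block size (a multiple of `step`) whose working set fits in the
// given cache budget. Returns `step` even if that already exceeds the budget.
fun largestFittingBlock(cacheBytes: Long, step: Int = 32): Int {
    var block = step
    while (tileWorkingSetBytes(block + step) <= cacheBytes) block += step
    return block
}
```

Under this model a 256×256 block needs 768 KiB, which does not fit in a 512 KiB L2, while a 192×192 block does.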
On-device LLM inference in production apps—like the offline summarization engine shipped in OfflineAI—revealed that thermal throttling mid-session could shift optimal block size by 40%. A block size tuned for cold start would cause the SoC to hit 85°C within 30 seconds, triggering frequency scaling that negated any initial gains.
Runtime Device Fingerprinting
Before the first inference pass, run a 200ms microbenchmark suite:
- Measure L2 cache size via stride access patterns
- Profile SIMD throughput with known GEMM shapes (128², 256², 512²)
- Query thermal API for current junction temperature
- Read /sys/devices/system/cpu/cpu*/cpufreq/scaling_cur_freq to detect throttling
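The frequency check in the last step could look roughly like the sketch below. It compares scaling_cur_freq against cpuinfo_max_freq for each core; the 0.8 threshold and the function names are assumptions, and a low reading can also mean the governor has scaled down an idle core, so the check is only meaningful while the benchmark keeps the cores busy.

```kotlin
import java.io.File
import java.io.FileFilter

// Reads a sysfs frequency file (value in kHz); null if missing or malformed.
fun readKHz(file: File): Long? =
    runCatching { file.readText().trim().toLong() }.getOrNull()

// Ratio of current to maximum frequency for one core directory (e.g. cpu0).
fun coreFrequencyRatio(cpuDir: File): Double? {
    val cur = readKHz(File(cpuDir, "cpufreq/scaling_cur_freq")) ?: return null
    val max = readKHz(File(cpuDir, "cpufreq/cpuinfo_max_freq")) ?: return null
    return if (max > 0) cur.toDouble() / max else null
}

// Assumption: treat the device as throttled when any readable core runs
// below `threshold` of its maximum frequency.
fun isThrottled(
    cpuRoot: File = File("/sys/devices/system/cpu"),
    threshold: Double = 0.8
): Boolean {
    val coreDirs = cpuRoot.listFiles(FileFilter { f ->
        f.isDirectory && Regex("cpu\\d+").matches(f.name)
    }) ?: return false
    return coreDirs.mapNotNull { coreFrequencyRatio(it) }.any { it < threshold }
}
```

Passing the root directory as a parameter keeps the function testable off-device and lets it fail soft (returning false) where sysfs is not readable.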
Store results in a capability vector: {l2_kb: 512, simd_gflops: 48, thermal_headroom_c: 12, big_core_count: 4}. This fingerprint maps to a block size lookup table with entries like:
{
  "snapdragon_8xx": {"cold": 384, "warm": 256, "hot": 192},
  "mediatek_6xx": {"cold": 192, "warm": 128, "hot": 96}
}
The lookup table is empirically derived from profiling 15-20 representative devices under controlled thermal conditions. In practice, two profiles (flagship and budget) cover 90% of the Android install base.
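One way to wire the capability vector to the table is sketched below. The table values come from the example above; the classification rule and the headroom cutoffs are placeholders chosen for illustration, not measured thresholds, and all type and function names are hypothetical.

```kotlin
data class Capabilities(
    val l2Kb: Int,
    val simdGflops: Double,
    val thermalHeadroomC: Int,
    val bigCoreCount: Int
)

data class BlockSizes(val cold: Int, val warm: Int, val hot: Int)

val blockSizeTable = mapOf(
    "snapdragon_8xx" to BlockSizes(cold = 384, warm = 256, hot = 192),
    "mediatek_6xx" to BlockSizes(cold = 192, warm = 128, hot = 96)
)

// Placeholder rule (assumption): large L2 plus high SIMD throughput
// selects the flagship profile, everything else the budget profile.
fun profileKeyFor(caps: Capabilities): String =
    if (caps.l2Kb >= 512 && caps.simdGflops >= 40.0) "snapdragon_8xx" else "mediatek_6xx"

// Picks the starting block size from the thermal headroom measured at init.
// The 10 °C and 5 °C cutoffs are illustrative only.
fun initialBlockSize(caps: Capabilities): Int {
    val sizes = blockSizeTable.getValue(profileKeyFor(caps))
    return when {
        caps.thermalHeadroomC >= 10 -> sizes.cold
        caps.thermalHeadroomC >= 5 -> sizes.warm
        else -> sizes.hot
    }
}
```

With the example vector from the text (512 KB L2, 48 GFLOPS, 12 °C headroom), this selects the flagship profile and its cold block size of 384.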
Thermal-Aware Block Adjustment
Inference sessions rarely run in isolation. A speech-to-text pipeline might process 10-second audio chunks in a loop, or a computer vision model might run at 15fps for continuous OCR. Thermal state evolves, and the optimal block size must follow.
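Poll PowerManager.getCurrentThermalStatus() every 500ms, or register a callback with addThermalStatusListener() to avoid polling. On Android 10 (API level 29) and later, this reports one of a discrete set of integer constants, including THERMAL_STATUS_NONE, LIGHT, MODERATE, SEVERE, and CRITICAL. Map these to block size multipliers: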
- NONE/LIGHT: Use cold-start block size (maximize throughput)
- MODERATE: Reduce block size by 25% (trade 10% latency for 30% lower power)
- SEVERE: Reduce by 50%, offload to efficiency cores if available
- CRITICAL: Drop to minimum block size (64×64), gate inference to 5fps
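The mapping above can be sketched as a pure function. The enum here is a local stand-in for the Android status constants (converting from the platform's integer values is left to the caller), and the function name is hypothetical. Core offloading and frame-rate gating are out of scope for this snippet.

```kotlin
// Local stand-in for the platform thermal levels discussed in the text.
enum class ThermalStatus { NONE, LIGHT, MODERATE, SEVERE, CRITICAL }

// Minimum block size from the text (64x64).
val MIN_BLOCK = 64

// Scales the cold-start block size according to the thermal status:
// full size for NONE/LIGHT, -25% for MODERATE, -50% for SEVERE,
// and the minimum for CRITICAL. The result never drops below MIN_BLOCK.
fun blockSizeFor(coldBlock: Int, status: ThermalStatus): Int {
    val scaled = when (status) {
        ThermalStatus.NONE, ThermalStatus.LIGHT -> coldBlock
        ThermalStatus.MODERATE -> coldBlock * 3 / 4
        ThermalStatus.SEVERE -> coldBlock / 2
        ThermalStatus.CRITICAL -> MIN_BLOCK
    }
    return maxOf(scaled, MIN_BLOCK)
}
```

Clamping to the minimum matters on budget profiles, where halving an already small cold block would otherwise fall below 64.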
Transitions between thermal states introduce a 100-200ms reconfiguration cost as ONNX Runtime re-tiles the compute graph. Hysteresis prevents thrashing: require the new thermal state to persist for 2 seconds before adjusting, and apply a 10-second cooldown after any change.
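A minimal sketch of that hysteresis rule follows, assuming the caller feeds in each observed state together with a monotonic timestamp in milliseconds. The class name and its API are hypothetical; the 2-second and 10-second defaults come from the text.

```kotlin
// Debounces state changes: a new state is applied only after it has been
// observed continuously for `persistMs`, and only if at least `cooldownMs`
// has elapsed since the previous applied change.
class HysteresisGate<T>(
    initial: T,
    private val persistMs: Long = 2_000,
    private val cooldownMs: Long = 10_000
) {
    var applied: T = initial
        private set
    private var candidate: T = initial
    private var candidateSinceMs: Long = 0
    private var lastChangeMs: Long? = null

    // Feed one observation; returns the state that should be in effect now.
    fun observe(state: T, nowMs: Long): T {
        if (state != candidate) {
            // A different state restarts the persistence timer.
            candidate = state
            candidateSinceMs = nowMs
        }
        val persisted = nowMs - candidateSinceMs >= persistMs
        val last = lastChangeMs
        val cooledDown = last == null || nowMs - last >= cooldownMs
        if (candidate != applied && persisted && cooledDown) {
            applied = candidate
            lastChangeMs = nowMs
        }
        return applied
    }
}
```

Taking the timestamp as a parameter rather than reading a clock internally keeps the gate deterministic and easy to unit test.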
Exposing Latency-Power Sliders
Not all use cases demand minimum latency. A background document scanner can tolerate 500ms per page if it saves 40% battery. Expose three presets to the application layer:
- Performance: Largest feasible block size, ignore thermal state until SEVERE, pin to big cores
- Balanced: Thermal-aware adjustment as described, mixed core scheduling
- Efficiency: Small blocks (128×128), efficiency cores only, gate to 10fps max
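The three presets could be represented as plain data that the inference layer consults, as in the sketch below. All type and field names are hypothetical, and the choice of MODERATE as the reaction threshold for the Efficiency preset is an assumption (the text does not specify it).

```kotlin
enum class ThermalStatus { NONE, LIGHT, MODERATE, SEVERE, CRITICAL }
enum class CoreAffinity { BIG_ONLY, MIXED, EFFICIENCY_ONLY }

data class InferencePolicy(
    val fixedBlockSize: Int?,         // null = derive from the device profile
    val adjustFrom: ThermalStatus,    // first thermal level that triggers adjustment
    val affinity: CoreAffinity,
    val maxFps: Int?                  // null = no frame-rate cap
)

enum class PowerPreset(val policy: InferencePolicy) {
    PERFORMANCE(InferencePolicy(null, ThermalStatus.SEVERE, CoreAffinity.BIG_ONLY, null)),
    BALANCED(InferencePolicy(null, ThermalStatus.MODERATE, CoreAffinity.MIXED, null)),
    // Assumption: Efficiency reacts from MODERATE, like Balanced.
    EFFICIENCY(InferencePolicy(128, ThermalStatus.MODERATE, CoreAffinity.EFFICIENCY_ONLY, 10))
}

// Minimum interval between frames implied by the preset's FPS cap (0 = uncapped).
fun minFrameIntervalMs(preset: PowerPreset): Long =
    preset.policy.maxFps?.let { 1000L / it } ?: 0L
```

Keeping the presets declarative means the application only passes a PowerPreset value, while block size, core pinning, and frame gating stay inside the runtime wrapper.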
In GlucoScan AI's PPG analysis pipeline, users in "battery saver" mode saw 60-second measurement sessions drop from 18% battery drain to 11%, with inference latency rising from 42ms to 78ms per frame—well within the 100ms budget for real-time vitals.
Measuring the Impact
A/B tests across 12,000 devices running KidzCare's speech analysis engine showed:
- Adaptive block sizing reduced P95 latency variance from 340ms to 85ms
- Thermal throttling events (>80°C) dropped by 60% in 5-minute sessions
- Battery drain per 100 inferences improved 22% on mid-range devices
The cost: 8KB of lookup table data, 200ms init overhead, and ~50 lines of thermal polling logic. The payoff: inference that gracefully degrades under load rather than collapsing into 2-second stalls.
Implementation Sketch
Wrap ONNX session creation with a capability-aware factory:
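import ai.onnxruntime.OrtEnvironment
import ai.onnxruntime.OrtSession

// ThermalState, ThermalMonitor, BlockSizeProfile, detectDeviceProfile() and
// stateStableFor() are app-level helpers assumed to be defined elsewhere.
class AdaptiveONNXSession(
    private val blockSizeTable: Map<String, BlockSizeProfile>,
    private val thermalMonitor: ThermalMonitor
) {
    private var currentBlockSize: Int = 0

    fun createSession(modelPath: String): OrtSession {
        val profile = detectDeviceProfile()
        currentBlockSize = profile.coldBlockSize
        val options = OrtSession.SessionOptions().apply {
            setIntraOpNumThreads(profile.bigCoreCount)
            // Illustrative key: verify it against the config keys your build supports.
            addConfigEntry("session.intra_op.gemm_block_size", currentBlockSize.toString())
        }
        thermalMonitor.start { thermalState ->
            adjustBlockSize(thermalState, profile)
        }
        return OrtEnvironment.getEnvironment().createSession(modelPath, options)
    }

    private fun adjustBlockSize(state: ThermalState, profile: BlockSizeProfile) {
        val newSize = when (state) {
            ThermalState.MODERATE -> profile.warmBlockSize
            ThermalState.SEVERE, ThermalState.CRITICAL -> profile.hotBlockSize
            else -> profile.coldBlockSize
        }
        // Only switch once the new thermal state has held for 2 seconds.
        if (newSize != currentBlockSize && stateStableFor(2000)) {
            currentBlockSize = newSize
            // trigger session reconfiguration
        }
    }
}
The addConfigEntry API requires ONNX Runtime 1.15+; earlier versions require recompiling with custom build flags. Confirm that the session.intra_op.gemm_block_size key is recognized by your ONNX Runtime build before relying on it.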
Caveats and Extensions
This approach assumes CPU inference. GPU execution providers (NNAPI, QNN) have their own tile size heuristics, though similar principles apply—query GPU thermal zones and adjust workgroup dimensions. For multi-model pipelines, assign thermal budgets proportionally: a vision encoder might get 60% of the headroom, leaving 40% for a language model.
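The proportional split for multi-model pipelines can be sketched as below. The function name and the use of degrees Celsius of headroom as the budget unit are assumptions for illustration.

```kotlin
// Splits the available thermal headroom across models in proportion to
// their weights. Weights need not sum to 1; they are normalized here.
fun splitHeadroom(headroomC: Double, weights: Map<String, Double>): Map<String, Double> {
    val total = weights.values.sum()
    require(total > 0.0) { "weights must sum to a positive value" }
    return weights.mapValues { (_, w) -> headroomC * w / total }
}
```

For example, with 10 °C of headroom and weights of 60 for the vision encoder and 40 for the language model, the encoder is allotted 6 °C and the language model 4 °C.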
On iOS, ProcessInfo.thermalState provides analogous signals, but Metal Performance Shaders abstracts tile size. The pattern still applies to Core ML via MLComputeUnits selection (all, cpuAndGPU, cpuOnly) based on thermal state.
Conclusion
Adaptive block size tuning is a 20% effort, 80% reward optimization. It requires no model retraining, minimal runtime overhead, and directly addresses the heterogeneity problem plaguing mobile ML deployment. For production apps serving millions of devices, the ability to gracefully degrade under thermal pressure—while maintaining acceptable latency on budget hardware—separates polished experiences from battery-draining disasters.