Sparse Activation Pruning: 40% Faster Mobile LLMs

Most mobile LLM optimization focuses on static model compression—quantization, pruning weights before deployment, distillation. But runtime activation patterns reveal another lever: most neurons contribute negligibly to each forward pass. Sparse activation pruning exploits this by dynamically skipping low-magnitude neurons during inference, delivering 35-45% latency reduction with minimal accuracy impact.

This matters for shipping on-device LLMs in production apps. A 1.5B parameter model running at 18 tokens/sec on iPhone 14 Pro drops to 11 tokens/sec under thermal throttling. Sparse activation pruning maintains 16+ tokens/sec even under sustained load, keeping conversational UI responsive.

Why Activation Sparsity Exists

Transformer feed-forward layers exhibit natural sparsity. In a typical 7B parameter LLM, 40-60% of intermediate activations fall below 0.1× the layer's mean magnitude. These contribute [...]

[...] threshold) ? val : 0.0f; }

Follow this with a sparse GEMM that skips zero-masked elements. Apple's Accelerate framework supports sparse BLAS since iOS 16.4, but custom kernels offer 15-20% better performance by fusing mask + multiply.

Threshold selection is critical. Too high loses accuracy; too low gains nothing. Empirically, threshold = 0.08 × mean(abs(layer_input)) works across most 1-7B models. Compute this once per layer per forward pass—the overhead is 0.2ms on A16 Bionic.
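
The threshold rule and the fused mask-and-multiply step can be sketched on the CPU in plain Swift. This is a minimal illustration, not the production kernel: the names `activationThreshold` and `sparseMatVec` are hypothetical, and the column-major weight layout is an assumption chosen so that a skipped activation maps to one skipped column of weights.

```swift
/// Per-layer threshold: `factor` × mean(|x|). Returns 0 for empty input.
func activationThreshold(_ x: [Float], factor: Float = 0.08) -> Float {
    guard !x.isEmpty else { return 0 }
    let sumAbs = x.reduce(Float(0)) { $0 + abs($1) }
    return factor * sumAbs / Float(x.count)
}

/// Fused mask + multiply: y = W · mask(x).
/// Assumption: W is stored column-major, so `columns[j]` is column j.
/// Columns whose activation is at or below the threshold are never read.
func sparseMatVec(columns: [[Float]], x: [Float], factor: Float = 0.08)
    -> (y: [Float], columnsRead: Int) {
    precondition(columns.count == x.count, "one weight column per activation")
    let rows = columns.first?.count ?? 0
    let threshold = activationThreshold(x, factor: factor)
    var y = [Float](repeating: 0, count: rows)
    var columnsRead = 0
    for (j, xj) in x.enumerated() where abs(xj) > threshold {
        columnsRead += 1
        for i in 0..<rows {
            y[i] += columns[j][i] * xj
        }
    }
    return (y, columnsRead)
}

// Three activations, one of them tiny: only two weight columns are touched.
let result = sparseMatVec(columns: [[1, 2], [3, 4], [5, 6]], x: [1, 0.01, -2])
print(result.y, result.columnsRead)
```

Returning `columnsRead` alongside the output makes it easy to log the achieved sparsity per layer while tuning the factor.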

CoreML Integration

CoreML 7 introduced sparse weight support, but activation sparsity requires custom layers. Wrap the pruning logic in a Swift model wrapper:

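import CoreML

final class SparseActivationLLM {
    let baseModel: MLModel
    var threshold: Float = 0.08   // multiplier on the layer's mean |activation|

    init(baseModel: MLModel) { self.baseModel = baseModel }

    // preprocessTokens and applySparseMask are app-defined helpers, not Core ML APIs.
    // The async prediction(from:) call requires iOS 17 or later.
    func predict(tokens: [Int]) async throws -> [Float] {
        let input = try preprocessTokens(tokens)   // must return an MLFeatureProvider
        let activations = try await baseModel.prediction(from: input)
        return applySparseMask(activations, threshold)
    }
}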

This adds 3-5ms per token but enables dynamic threshold tuning based on battery level or thermal state. When device temperature exceeds 42°C, increase threshold to 0.12× for an additional 10% speedup.

Accuracy-Latency Tradeoffs

Evaluation on MMLU, TruthfulQA, and HumanEval benchmarks with Llama 2 7B quantized to INT8:

Threshold 0.05×: 39% latency reduction, 0.8% accuracy drop
Threshold 0.08×: 41% latency reduction, 1.6% accuracy drop
Threshold 0.12×: 48% latency reduction, 4.2% accuracy drop
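
One way to use these measurements is to pick the fastest setting that stays inside an accuracy budget. The sketch below hard-codes the three rows above; `ThresholdResult` and `fastestWithin` are hypothetical names introduced for illustration.

```swift
struct ThresholdResult {
    let factor: Float            // multiplier on mean |activation|
    let latencyReduction: Float  // percent
    let accuracyDrop: Float      // percent
}

// Figures copied from the evaluation above.
let measured = [
    ThresholdResult(factor: 0.05, latencyReduction: 39, accuracyDrop: 0.8),
    ThresholdResult(factor: 0.08, latencyReduction: 41, accuracyDrop: 1.6),
    ThresholdResult(factor: 0.12, latencyReduction: 48, accuracyDrop: 4.2),
]

/// Fastest measured setting whose accuracy drop stays within `budget` percent,
/// or nil if no setting qualifies.
func fastestWithin(budget: Float, in results: [ThresholdResult]) -> ThresholdResult? {
    results
        .filter { $0.accuracyDrop <= budget }
        .max { $0.latencyReduction < $1.latencyReduction }
}
```

With a 2% budget this selects the 0.08× row; with a 0.5% budget it returns nil, which a caller could treat as "run dense".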

For conversational AI in apps like speech therapy tools or clinical note assistants, 1-2% accuracy loss is acceptable if it keeps UI under 100ms response time. Users perceive sub-100ms as instant; anything over 200ms feels laggy.

Task-Specific Tuning

Different tasks tolerate different sparsity levels:

Text summarization: 0.10× threshold, 3% drop acceptable
Code generation: 0.06× threshold, preserve syntax precision
Medical Q&A: 0.07× threshold, safety-critical

Ship multiple threshold profiles and select at runtime based on feature flags or user settings. A diabetes management app might use conservative 0.06× for dosage queries but aggressive 0.10× for general nutrition questions.
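
A minimal sketch of runtime profile selection follows. The `TaskProfile` enum, the flag strings, and the choice to fall back to the most conservative profile when a flag is missing or unrecognized are all assumptions for illustration.

```swift
enum TaskProfile: String {
    case summarization, codeGeneration, medicalQA

    /// Threshold multipliers from the list above.
    var thresholdFactor: Float {
        switch self {
        case .summarization:  return 0.10
        case .codeGeneration: return 0.06
        case .medicalQA:      return 0.07
        }
    }
}

/// Resolves a feature-flag string to a threshold multiplier.
/// Assumption: unknown or missing flags fall back to the most conservative profile.
func thresholdFactor(forFlag flag: String?, fallback: TaskProfile = .codeGeneration) -> Float {
    let profile = flag.flatMap { TaskProfile(rawValue: $0) } ?? fallback
    return profile.thresholdFactor
}
```

Failing toward the lowest multiplier means a misconfigured flag costs speed rather than answer quality.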

Memory Bandwidth Savings

Sparse activation pruning reduces not just compute but memory traffic. On iPhone's unified memory architecture, DRAM bandwidth is the bottleneck for large models. Skipping 40% of neurons means 40% fewer reads from the weight matrix.

Measured on A17 Pro with a 3B parameter model:

Baseline: 28 GB/s memory bandwidth, 14 tokens/sec
Sparse (0.08×): 18 GB/s bandwidth, 24 tokens/sec

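The 71% throughput gain (14→24 tokens/sec) and the 40% figure describe the same effect in different units: per-token latency falls by 1 − 14/24 ≈ 42%, which tracks the roughly 40% of weight reads skipped. That near one-to-one match is characteristic of a memory-bound workload, where time per token follows bytes moved rather than arithmetic. This is why sparse pruning pairs well with quantization: both attack the memory wall.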
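
Throughput gains and latency reductions are easy to conflate, so it helps to convert between them explicitly. The two helpers below are hypothetical names; the arithmetic uses the tokens/sec figures quoted above.

```swift
/// Fractional cut in time per token when throughput rises from `before` to `after` tokens/sec.
func latencyReduction(from before: Double, to after: Double) -> Double {
    1 - before / after
}

/// Fractional throughput gain for the same change.
func throughputGain(from before: Double, to after: Double) -> Double {
    after / before - 1
}

let reduction = latencyReduction(from: 14, to: 24)  // about 0.42
let gain = throughputGain(from: 14, to: 24)         // about 0.71
print(reduction, gain)
```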

Production Deployment Patterns

Three strategies for real apps:

1. Adaptive Thresholding

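Monitor device thermals and battery. Start at 0.08×, and increase to 0.12× when the device gets hot (around 42°C) or battery drops below 20%. iOS does not expose the temperature itself, only the coarse levels of ProcessInfo.thermalState, so key a simple state machine on those: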

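import Foundation

enum ThermalState: Float {
    case normal = 0.08   // 0.08×
    case warm   = 0.10   // 0.10×
    case hot    = 0.12   // 0.12×
}

// Maps the system's coarse thermal level onto the three profiles above.
// Treating .fair as warm and .serious as hot is one reasonable mapping, not the only one.
func updateThreshold() {
    let state: ThermalState
    switch ProcessInfo.processInfo.thermalState {
    case .nominal:            state = .normal
    case .fair:               state = .warm
    case .serious, .critical: state = .hot
    @unknown default:         state = .normal
    }
    threshold = state.rawValue   // `threshold` is the model wrapper's property
}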
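
The battery condition can be folded into the same decision as a pure function, which keeps it testable off-device. This is a sketch under assumptions: `ThermalLevel` is a hypothetical stand-in for the system thermal state, and `batteryLevel` is taken to be a 0...1 fraction with a negative value meaning "unknown", as `UIDevice.current.batteryLevel` reports when battery monitoring is disabled.

```swift
enum ThermalLevel { case normal, warm, hot }

/// Threshold multiplier for the current device conditions.
/// Low battery (below 20%) forces the most aggressive profile;
/// a negative (unknown) battery level is ignored.
func thresholdFactor(thermal: ThermalLevel, batteryLevel: Float) -> Float {
    if batteryLevel >= 0 && batteryLevel < 0.20 { return 0.12 }
    switch thermal {
    case .normal: return 0.08
    case .warm:   return 0.10
    case .hot:    return 0.12
    }
}
```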

2. Per-Layer Profiles

Early layers (0-8) use a 0.06× threshold because feature extraction needs precision. Middle layers (9-24) use 0.10× for maximum speedup. Late layers (25-32) use 0.07× to preserve output quality. This hybrid approach yields a 38% speedup with 1.1% accuracy loss: slightly slower than uniform 0.08× (41%), but with about a third less accuracy loss (1.1% vs. 1.6%).
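
The layer ranges above translate directly into a lookup. `layerThresholdFactor` is a hypothetical helper, and the uniform 0.08 default for indices outside the listed ranges is an assumption.

```swift
/// Threshold multiplier for a given transformer layer index.
func layerThresholdFactor(layer: Int) -> Float {
    switch layer {
    case 0...8:   return 0.06   // early layers: feature extraction needs precision
    case 9...24:  return 0.10   // middle layers: maximum speedup
    case 25...32: return 0.07   // late layers: preserve output quality
    default:      return 0.08   // assumption: uniform default outside the listed ranges
    }
}
```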

3. Calibration on Device

Run a 50-sample calibration set on first launch to find optimal threshold for the specific hardware. iPhone 15 Pro Max tolerates 0.09×; iPhone 13 needs 0.07× to avoid quality regression. Store per-device profiles in UserDefaults.
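
A calibration pass along these lines might look like the sketch below. It assumes the app supplies a `qualityDrop` closure that runs the calibration set at a given multiplier and returns the measured drop in percent; that closure, the function names, and the UserDefaults key format are all hypothetical.

```swift
import Foundation

/// Largest candidate multiplier whose measured quality drop stays within `maxDrop` percent,
/// or nil if none does. `qualityDrop` is app-supplied and runs the calibration set.
func calibrate(candidates: [Float], maxDrop: Float,
               qualityDrop: (Float) -> Float) -> Float? {
    candidates.sorted(by: >).first { qualityDrop($0) <= maxDrop }
}

/// Persists the calibrated multiplier so later launches can skip calibration.
func storeCalibratedFactor(_ factor: Float, forModel modelID: String,
                           defaults: UserDefaults = .standard) {
    defaults.set(factor, forKey: "sparseThreshold.\(modelID)")
}
```

Trying candidates from largest to smallest stops at the first acceptable one, so the full calibration set runs as few times as possible on devices that tolerate aggressive settings.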

Combining with Other Optimizations

Sparse activation pruning stacks with existing techniques:

+ INT4 quantization: the 41% and 35% gains combine to about 63% total speedup, not 76% (they are not additive due to bandwidth overlap)
+ KV cache compression: Prune attention scores below threshold before caching, 50% memory savings
+ Speculative decoding: Use sparse pruning on draft model, full precision on verification—5% accuracy recovery
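
To see why stacked gains do not simply add, a rough model treats each percentage as a latency reduction applied to whatever time the previous optimization left. That multiplicative assumption is an idealization, not a measurement, and `combinedReduction` is a hypothetical helper; for the INT4 case it gives about 62%, close to the figure quoted above.

```swift
/// Combined fractional latency reduction, assuming each reduction
/// applies to the time remaining after the previous one.
func combinedReduction(_ reductions: [Double]) -> Double {
    1 - reductions.reduce(1.0) { $0 * (1 - $1) }
}

let stacked = combinedReduction([0.41, 0.35])   // 1 - 0.59 * 0.65
print(stacked)
```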

In a production offline-first LLM app, combining INT8 quantization, sparse pruning at 0.08×, and prefix caching delivers 2.8× faster inference than baseline FP16 with