Gesture recognition in mobile apps demands real-time performance: users expect instant feedback when they wave, point, or perform custom motions. Achieving 120fps inference on an iPhone camera feed—matching ProMotion display refresh rates—requires a tightly optimized pipeline from capture to classification. This article walks through the architecture decisions, model design, and Metal integration patterns that make sub-8ms inference possible on A15 and newer silicon.

The 8.3ms Budget

At 120fps, each frame has an 8.3ms budget. Subtract camera capture overhead (~1.5ms) and display synchronization (~0.5ms), and roughly 6.3ms remains for inference plus pre- and post-processing. This constraint eliminates most off-the-shelf models: MobileNetV3 averages 12-18ms on iPhone 13 Pro even with GPU acceleration. You need a purpose-built architecture.
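The budget arithmetic above can be written down as a few constants, which makes it easy to check a measured cost against what is left of the frame. This is only a sketch: the constant and function names are illustrative, and the overhead figures are the estimates quoted in the text rather than new measurements.

```cpp
// Frame-budget arithmetic for a 120fps pipeline (names are illustrative).
constexpr double kTargetFps = 120.0;
constexpr double kFrameBudgetMs = 1000.0 / kTargetFps;  // about 8.33 ms
constexpr double kCaptureOverheadMs = 1.5;              // estimate quoted in the text
constexpr double kDisplaySyncMs = 0.5;                  // estimate quoted in the text

// Time left for preprocessing, inference, and post-processing.
constexpr double remainingBudgetMs() {
    return kFrameBudgetMs - kCaptureOverheadMs - kDisplaySyncMs;
}

// True when a measured per-frame cost fits in what is left of the frame.
constexpr bool fitsBudget(double measuredMs) {
    return measuredMs <= remainingBudgetMs();
}
```

For instance, fitsBudget(12.0) is false, which is why a model in the 12-18ms range cannot be used at this frame rate.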

Our gesture classifier uses a lightweight temporal convolutional network: three 1D convolution layers over a sliding window of 16 frames at 224×224 resolution. Input is grayscale (single channel), which cuts input memory bandwidth to about a third of what an RGB frame would need. The model outputs 12 gesture classes plus a null class for non-gestures. Total parameter count: 340K. Inference on Neural Engine: 4.2ms average, 5.8ms p99.

Model Architecture Trade-offs

We evaluated three approaches during prototyping. A 3D CNN (MobileNetV2 backbone with temporal dimension) achieved 94% validation accuracy but required 22ms per inference even with INT8 quantization. A two-stage pipeline—pose estimation then gesture classification—hit 11ms but suffered from pose detector false negatives in poor lighting.

The winning architecture uses depthwise separable convolutions with kernel size 3×3 in spatial layers, followed by 1×3 temporal kernels. Batch normalization follows each convolution layer, and ReLU6 activations bound the output range to [0, 6], which makes the network easier to quantize. The final fully connected layer outputs logits; softmax runs on the CPU during post-processing. The training dataset contains 180K labeled gesture sequences collected from 340 participants, augmented with random crops, brightness jitter, and simulated motion blur.
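To see why depthwise separable convolutions help keep the parameter count low, the standard weight-count formulas can be compared directly. The sketch below uses the textbook formulas with biases omitted; the channel counts in the usage note are hypothetical, since the article does not list the layer widths.

```cpp
#include <cstdint>

// Weights in a standard k x k convolution (bias omitted).
constexpr std::uint64_t standardConvParams(std::uint64_t k,
                                           std::uint64_t cin,
                                           std::uint64_t cout) {
    return k * k * cin * cout;
}

// Depthwise separable: one k x k filter per input channel,
// followed by a 1 x 1 pointwise convolution that mixes channels.
constexpr std::uint64_t separableConvParams(std::uint64_t k,
                                            std::uint64_t cin,
                                            std::uint64_t cout) {
    return k * k * cin + cin * cout;
}
```

With a hypothetical 3×3 layer going from 32 to 64 channels, the standard form needs 18,432 weights and the separable form 2,336, roughly an 8× reduction.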

Quantization Strategy

CoreML supports float16, int8, and mixed precision. We use int8 for all convolution weights and float16 for batch norm parameters. This hybrid approach reduces model size from 1.4MB to 380KB while maintaining 92.7% top-1 accuracy (vs 93.1% for full float32). Quantization-aware training is essential: post-training quantization dropped accuracy to 87%.

The coremltools conversion pipeline applies per-channel quantization for convolutional layers. Calibration data: 5,000 representative frames sampled uniformly across gesture classes and lighting conditions. Activation quantization uses symmetric mode with scale factors computed from min/max observed values during calibration.
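As an illustration of what per-channel symmetric quantization means, here is a minimal sketch that derives one scale per output channel from the largest absolute weight. It is not the coremltools implementation, whose internals may differ; the type and function names are hypothetical, and applying the symmetric scheme to weights is an assumption made to keep the example small.

```cpp
#include <algorithm>
#include <cmath>
#include <cstdint>
#include <vector>

struct QuantizedChannel {
    float scale;                      // real value represented by one int8 step
    std::vector<std::int8_t> values;  // quantized weights
};

// Symmetric quantization: the zero point is fixed at 0 and the scale maps
// the largest absolute weight in the channel to 127.
QuantizedChannel quantizeChannel(const std::vector<float>& weights) {
    float maxAbs = 0.0f;
    for (float w : weights) maxAbs = std::max(maxAbs, std::fabs(w));

    QuantizedChannel q;
    q.scale = (maxAbs > 0.0f) ? maxAbs / 127.0f : 1.0f;
    q.values.reserve(weights.size());
    for (float w : weights) {
        long rounded = std::lround(w / q.scale);
        rounded = std::max(-127L, std::min(127L, rounded));
        q.values.push_back(static_cast<std::int8_t>(rounded));
    }
    return q;
}

// Per-channel: every output channel gets its own scale, so a channel with
// small weights is not crushed by a channel with large ones.
std::vector<QuantizedChannel> quantizePerChannel(
        const std::vector<std::vector<float>>& channels) {
    std::vector<QuantizedChannel> out;
    out.reserve(channels.size());
    for (const auto& c : channels) out.push_back(quantizeChannel(c));
    return out;
}
```

A single per-tensor scale would give a channel whose weights peak at 0.1 only about a tenth of the int8 range when another channel peaks at 1.0; per-channel scales avoid that loss of resolution.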

Metal-Accelerated Preprocessing

Camera frames arrive as CVPixelBuffer in YCbCr format. Converting to grayscale and resizing to 224×224 on the CPU takes 3-4ms. Metal compute shaders reduce this to 0.6ms. Our pipeline uses a single compute kernel that samples the Y plane (luminance), applies bilinear interpolation for resize, and writes directly to the CoreML input buffer.

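#include <metal_stdlib>
using namespace metal;

// Resamples the luminance (Y) plane to the output size with bilinear filtering.
kernel void preprocessFrame(
    texture2d<float, access::sample> inTexture  [[texture(0)]],
    texture2d<float, access::write>  outTexture [[texture(1)]],
    uint2 gid [[thread_position_in_grid]])
{
    // Skip threads outside the output when the dispatch grid is padded.
    if (gid.x >= outTexture.get_width() || gid.y >= outTexture.get_height()) return;

    // The sampler performs bilinear interpolation in hardware.
    constexpr sampler bilinear(coord::normalized, address::clamp_to_edge, filter::linear);

    // Sample at the texel center so the resize is not shifted by half a pixel.
    float2 outSize = float2(outTexture.get_width(), outTexture.get_height());
    float2 uv = (float2(gid) + 0.5f) / outSize;
    float luminance = inTexture.sample(bilinear, uv).r;
    outTexture.write(float4(luminance, 0.0f, 0.0f, 1.0f), gid);
}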
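A CPU reference for the resize step is useful when validating a GPU kernel like this one in unit tests. The sketch below is an assumption about how such a check might look, not part of the original pipeline: GrayImage and resizeBilinear are hypothetical names, and it uses the common texel-center convention with clamp-to-edge addressing. GPU samplers filter at limited precision, so any comparison against it should allow a small tolerance.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>

// Row-major single-channel image.
struct GrayImage {
    int width;
    int height;
    std::vector<float> pixels;

    // Clamp-to-edge addressing for reads outside the image.
    float at(int x, int y) const {
        x = std::min(std::max(x, 0), width - 1);
        y = std::min(std::max(y, 0), height - 1);
        return pixels[static_cast<std::size_t>(y) * width + x];
    }
};

// Bilinear resize that maps output texel centers into the source image.
GrayImage resizeBilinear(const GrayImage& src, int outW, int outH) {
    GrayImage dst{outW, outH,
                  std::vector<float>(static_cast<std::size_t>(outW) * outH)};
    for (int y = 0; y < outH; ++y) {
        for (int x = 0; x < outW; ++x) {
            float sx = (x + 0.5f) * src.width / outW - 0.5f;
            float sy = (y + 0.5f) * src.height / outH - 0.5f;
            int x0 = static_cast<int>(std::floor(sx));
            int y0 = static_cast<int>(std::floor(sy));
            float fx = sx - x0;  // horizontal blend weight
            float fy = sy - y0;  // vertical blend weight
            float top = src.at(x0, y0) * (1 - fx) + src.at(x0 + 1, y0) * fx;
            float bottom = src.at(x0, y0 + 1) * (1 - fx) + src.at(x0 + 1, y0 + 1) * fx;
            dst.pixels[static_cast<std::size_t>(y) * outW + x] =
                top * (1 - fy) + bottom * fy;
        }
    }
    return dst;
}
```

Resizing a 2×2 checkerboard of 0 and 1 down to a single pixel yields 0.5, the average of the four source texels.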

The Metal command buffer is submitted asynchronously. While the GPU executes preprocessing, the CPU prepares the next frame's capture settings. This overlap hides most of the preprocessing latency.

Temporal Window Management

Gesture recognition requires temporal context. Our model consumes 16 consecutive frames, creating a 133ms observation window at 120fps. A naive implementation would buffer frames in an array and copy them into the model input on each inference. This introduces memory allocation churn and cache misses.

Instead, we use a circular buffer backed by a single contiguous MTLBuffer. The preprocessing Metal kernel writes each new frame to the next slot in the buffer. The CoreML model input is configured as an MLMultiArray that directly references this Metal buffer—no copy required. When the buffer wraps, older frames are overwritten. This zero-copy design eliminates 1.2ms of overhead per frame.
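The slot bookkeeping for such a ring can be sketched independently of Metal. The class below only computes byte offsets into one contiguous allocation; FrameRing and its methods are hypothetical names, and the one-byte-per-pixel frame size in the usage note is an assumption. The article does not say how frames are presented to the model in temporal order once the ring wraps, so slotForAge is included only to show where the oldest frame lives.

```cpp
#include <cstddef>

// Slot bookkeeping for a ring of fixed-size frames in one contiguous buffer.
class FrameRing {
public:
    FrameRing(std::size_t slots, std::size_t bytesPerFrame)
        : slots_(slots), bytesPerFrame_(bytesPerFrame) {}

    // Byte offset where the next frame should be written; advances the ring.
    std::size_t claimWriteOffset() {
        std::size_t offset = head_ * bytesPerFrame_;
        head_ = (head_ + 1) % slots_;
        if (count_ < slots_) ++count_;
        return offset;
    }

    // True once every slot holds a frame.
    bool full() const { return count_ == slots_; }

    // Slot index of the i-th oldest frame (0 = oldest); valid once full.
    std::size_t slotForAge(std::size_t i) const { return (head_ + i) % slots_; }

private:
    std::size_t slots_;
    std::size_t bytesPerFrame_;
    std::size_t head_ = 0;   // next slot to overwrite, which holds the oldest frame
    std::size_t count_ = 0;  // number of slots written so far, capped at slots_
};
```

With 16 slots of 224×224 single-byte pixels, FrameRing ring(16, 224 * 224) covers about 800KB, and the 17th call to claimWriteOffset() returns offset 0 again, overwriting the oldest frame.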

Handling Variable Frame Rates

Camera capture does not always hit 120fps. Thermal throttling, background tasks, or scene complexity can drop the rate to 90-100fps. Our gesture classifier adapts by tracking inter-frame timestamps. If a frame arrives late (delta > 10ms, against a nominal 8.3ms), we duplicate the previous frame so that the 16-frame window keeps covering roughly the same span of wall-clock time. This prevents temporal misalignment without retraining the model for variable rates.
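The late-frame rule can be sketched as a small timestamp tracker that tells the caller how many window slots to fill. FramePacer is a hypothetical name, and returning a slot count (2 meaning "duplicate the previous frame, then write the new one") is an assumed interface; only the 10ms threshold comes from the text.

```cpp
// An inter-frame gap above this value counts as late (threshold from the text).
constexpr double kLateThresholdMs = 10.0;

class FramePacer {
public:
    // Returns how many window slots to fill for the frame at timestampMs:
    // 1 for an on-time frame, 2 when the previous frame should be
    // duplicated before the new frame is written.
    int slotsForFrame(double timestampMs) {
        bool late = hasPrevious_ &&
                    (timestampMs - lastTimestampMs_) > kLateThresholdMs;
        lastTimestampMs_ = timestampMs;
        hasPrevious_ = true;
        return late ? 2 : 1;
    }

private:
    double lastTimestampMs_ = 0.0;
    bool hasPrevious_ = false;  // the first frame has no gap to measure
};
```

Note that a sustained 90fps feed has an 11.1ms gap between frames, so under this rule every frame would trigger a duplicate; whether that is the intended behaviour at the low end of the range is not stated in the article.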

Neural Engine Scheduling

CoreML automatically schedules models on Neural Engine, GPU, or CPU based on model characteristics and system load. For our gesture classifier, Neural Engine is optimal: it delivers 4.2ms inference with 0.3W power draw, compared to 7.1ms and 1.1W on GPU. However, Neural Engine contention from other apps or system tasks can spike latency to 12-15ms.

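We mitigate this by setting the computeUnits property of MLModelConfiguration to .cpuAndNeuralEngine (available on iOS 16 and later) and monitoring per-frame inference time. If p95 latency exceeds 7ms over a 2-second window, we temporarily fall back to GPU. Because compute units are fixed when a model is loaded, the fallback requires a second model instance configured for the GPU. This adaptive scheduling maintains 120fps during multitasking scenarios like picture-in-picture video playback.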
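The monitoring side of this fallback can be sketched as a rolling window of latency samples with a percentile check. LatencyMonitor and ComputeTarget are hypothetical names, the percentile uses the nearest-rank definition (an assumption, since the article does not specify one), and the sketch only recommends a target; actually switching Core ML model instances is outside its scope, as is the rule for returning to the Neural Engine.

```cpp
#include <algorithm>
#include <cstddef>
#include <deque>
#include <vector>

enum class ComputeTarget { NeuralEngine, Gpu };

class LatencyMonitor {
public:
    // Record one inference and drop samples older than the 2-second window.
    void record(double nowMs, double latencyMs) {
        samples_.push_back({nowMs, latencyMs});
        while (!samples_.empty() &&
               nowMs - samples_.front().timeMs > kWindowMs) {
            samples_.pop_front();
        }
    }

    // Nearest-rank 95th percentile of the latencies currently in the window.
    double p95() const {
        if (samples_.empty()) return 0.0;
        std::vector<double> sorted;
        sorted.reserve(samples_.size());
        for (const auto& s : samples_) sorted.push_back(s.latencyMs);
        std::sort(sorted.begin(), sorted.end());
        std::size_t rank = (sorted.size() * 95 + 99) / 100;  // ceil(0.95 * n)
        return sorted[rank - 1];
    }

    // Recommend the GPU while p95 latency is above the 7 ms threshold.
    ComputeTarget recommend() const {
        return p95() > kThresholdMs ? ComputeTarget::Gpu
                                    : ComputeTarget::NeuralEngine;
    }

private:
    struct Sample { double timeMs; double latencyMs; };
    static constexpr double kWindowMs = 2000.0;
    static constexpr double kThresholdMs = 7.0;
    std::deque<Sample> samples_;
};
```

Sorting a copy on every query is fine at this scale: a 2-second window at 120fps holds at most about 240 samples.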

Post-Processing and Smoothing

Raw model outputs are noisy: a pointing gesture might flicker between 'point' and 'null' classes for 3-4 frames during the transition. We apply temporal smoothing using a 5-frame median filter on the logits before softmax. This reduces false positives by 68% while adding only 0.3ms of CPU time.
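A per-class median over the last five logit vectors can be sketched as follows. The names are hypothetical, the 13-entry vector assumes 12 gestures plus the null class as described earlier, and the warm-up behaviour (taking the median of however many frames exist so far, upper median for even counts) is an assumption the article does not cover.

```cpp
#include <algorithm>
#include <array>
#include <cstddef>
#include <deque>

constexpr std::size_t kNumClasses = 13;   // 12 gestures plus the null class
constexpr std::size_t kFilterLength = 5;  // frames of history, from the text

using Logits = std::array<float, kNumClasses>;

class LogitMedianFilter {
public:
    // Add the newest logits and return the per-class median over the history.
    Logits push(const Logits& latest) {
        history_.push_back(latest);
        if (history_.size() > kFilterLength) history_.pop_front();

        const std::size_t n = history_.size();
        Logits out{};
        for (std::size_t c = 0; c < kNumClasses; ++c) {
            // Gather this class's logit from every frame in the history.
            std::array<float, kFilterLength> column{};
            for (std::size_t i = 0; i < n; ++i) column[i] = history_[i][c];
            // Partially sort so the middle element is the median.
            auto mid = column.begin() + n / 2;
            std::nth_element(column.begin(), mid, column.begin() + n);
            out[c] = *mid;
        }
        return out;
    }

private:
    std::deque<Logits> history_;
};
```

A single-frame spike in one class is removed entirely, because the median of five values ignores up to two outliers.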

For gesture start/end detection, we use a simple state machine: a gesture is confirmed after 3 consecutive frames with the same predicted class and confidence above 0.85. Gesture end is triggered by 2 consecutive null predictions or a different gesture class. This hysteresis prevents spurious detections from hand movements between intentional gestures.
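The confirmation and release rules can be sketched as a small state machine. This is one possible reading of the rules, not the production code: the class name, the choice of index 12 for the null class, and the handling of cases the article leaves open (a low-confidence frame of the active class keeps the gesture alive; a frame of a different class ends the gesture and counts toward the new candidate) are all assumptions.

```cpp
constexpr int kNullClass = 12;  // assumption: the null class is the last index

class GestureStateMachine {
public:
    // Feed one smoothed prediction; returns the active gesture or kNullClass.
    int update(int predictedClass, float confidence) {
        if (active_ != kNullClass) {
            if (predictedClass == active_) {
                nullRun_ = 0;  // gesture continues
                return active_;
            }
            if (predictedClass == kNullClass) {
                if (++nullRun_ >= 2) reset();  // two nulls in a row end it
                return active_;
            }
            reset();  // a different gesture class ends the current one
        }

        // No active gesture: count consecutive confident frames of one class.
        bool qualifies = predictedClass != kNullClass && confidence > 0.85f;
        if (!qualifies) {
            candidate_ = kNullClass;
            candidateRun_ = 0;
            return active_;
        }
        if (predictedClass == candidate_) {
            ++candidateRun_;
        } else {
            candidate_ = predictedClass;
            candidateRun_ = 1;
        }
        if (candidateRun_ >= 3) {  // three in a row confirm the gesture
            active_ = candidate_;
            candidate_ = kNullClass;
            candidateRun_ = 0;
        }
        return active_;
    }

private:
    void reset() {
        active_ = kNullClass;
        candidate_ = kNullClass;
        candidateRun_ = 0;
        nullRun_ = 0;
    }

    int active_ = kNullClass;
    int candidate_ = kNullClass;
    int candidateRun_ = 0;
    int nullRun_ = 0;
};
```

The asymmetry is what provides the hysteresis: three confident frames are needed to enter a gesture, while a single stray null frame is not enough to leave it.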

Power and Thermal Considerations

Continuous 120fps inference draws 1.8-2.2W on iPhone 14 Pro, including camera and display. After 90 seconds, device temperature rises to 42°C and iOS begins thermal throttling: CPU frequency drops from 3.2GHz to 2.4GHz, and Neural Engine duty cycle is reduced. Inference latency climbs to 6-7ms; with roughly 2ms of capture and display overhead on top, frames occasionally miss the 8.3ms deadline.

Our mitigation strategy is adaptive frame rate: if we detect three consecutive deadline misses (measured via CADisplayLink callback timing), we drop to 60fps inference. The camera still captures at 120fps, but we process every other frame. Gesture recognition accuracy remains above 91% at 60fps, and thermal stability is restored within 15 seconds. When temperature drops below 38°C, we ramp back to 120fps.
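The rate-switching policy can be sketched as a small controller driven by two inputs per display callback. FrameRateController is a hypothetical name, and temperatureC is an assumed input: the article does not say how the temperature reading is obtained, and an app relying only on public API would more likely key off the coarse thermal state reported by the system.

```cpp
class FrameRateController {
public:
    // Call once per display callback; returns the inference rate (120 or 60).
    int update(bool missedDeadline, double temperatureC) {
        consecutiveMisses_ = missedDeadline ? consecutiveMisses_ + 1 : 0;
        if (rate_ == 120 && consecutiveMisses_ >= 3) {
            rate_ = 60;  // three misses in a row: halve the inference rate
            consecutiveMisses_ = 0;
        } else if (rate_ == 60 && temperatureC < 38.0) {
            rate_ = 120;  // device has cooled: ramp back up
            consecutiveMisses_ = 0;
        }
        return rate_;
    }

    // The camera keeps delivering 120fps, so at 60fps every other frame
    // is skipped (odd frame indices here).
    bool shouldProcess(unsigned long frameIndex) const {
        return rate_ == 120 || frameIndex % 2 == 0;
    }

private:
    int rate_ = 120;
    int consecutiveMisses_ = 0;
};
```

The gap between the 42°C throttling point and the 38°C recovery point keeps the controller from oscillating between the two rates.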

Production Metrics

In a health-focused app built by Omar's team, this gesture pipeline enabled hands-free navigation for users with limited mobility. Over six months in production with 12,000 active users, the system logged 4.2 million gesture interactions. Median inference time: 4.4ms. p99: 6.1ms. False positive rate (unintended gesture detected): 0.8%. User-reported accuracy: 94%. Battery impact during 10-minute continuous use: 3-4% drain on iPhone 13 and newer.

The architecture scales to custom gesture sets with minimal retraining. Adding two new gestures required 8,000 labeled examples and 6 hours of fine-tuning on a single M1 MacBook Pro. The quantized model grew by only 40KB.

Key Takeaways

Achieving 120fps gesture recognition on mobile demands co-design of model architecture, preprocessing pipeline, and runtime scheduling. Int8 quantization with per-channel calibration preserves accuracy while fitting within Neural Engine memory constraints. Metal compute shaders eliminate preprocessing bottlenecks. Circular buffer management and zero-copy data flow minimize memory overhead. Adaptive frame rate and compute unit fallback ensure consistent performance under thermal stress. The result: responsive, power-efficient gesture interfaces that feel native to the platform.