Modern mobile video apps face a paradox: users expect broadcast-quality streams with sub-second latency, yet cellular bandwidth fluctuates wildly and device thermals constrain on-device encoding. Pure cloud transcoding adds 2–4 seconds of glass-to-glass latency; pure client-side encoding drains batteries in twelve minutes and produces blocky output on older devices. The answer lies in hybrid pipelines that partition encoding stages across edge and cloud, leveraging each tier's strengths.

This article dissects a production architecture deployed in a telehealth video platform handling 40,000 daily consultations, where hybrid transcoding cut P95 latency from 3.1s to 900ms and halved upstream bandwidth.

The Transcoding Problem Space

Video encoding is compute-bound and latency-sensitive. A 720p30 H.264 stream at CRF 23 requires ~15 GOP operations per second on a mid-range mobile SoC, consuming 1.8W and producing 1.2–1.8 Mbps output. Cloud transcoding offloads this work but introduces round-trip time: upload raw frames (or a mezzanine codec), wait for encode, download the result. On 4G with 60ms RTT, this adds 180–240ms per segment even with optimal chunking.

The hybrid model splits responsibility: the client performs lightweight spatial downsampling and applies a fast intra-only codec (MJPEG or I-frame-only H.264), then streams this mezzanine format to an edge node within 20ms. The edge node runs hardware-accelerated H.264/HEVC encoding with temporal prediction (P/B frames), producing the final adaptive bitrate ladder. Return path sends only metadata and occasional keyframes for client-side preview.

Why Mezzanine Codecs Win

MJPEG at quality 80 encodes 720p frames in 4–6ms on Apple's A-series or Qualcomm Snapdragon 8-series using dedicated ISP blocks. Output bitrate is 8–12 Mbps, 6× higher than final H.264, but the encode latency is 1/10th of full H.264 with motion estimation. For sub-second pipelines, this tradeoff is correct: we're latency-bound, not bandwidth-bound on the client-to-edge link, which is typically WiFi or strong 5G in clinical/enterprise settings.

Alternative: I-frame-only H.264 at ultrafast preset. Encodes in 8–10ms, outputs 4–6 Mbps. Better bandwidth efficiency but loses ISP acceleration on iOS; requires software encode via VideoToolbox, which competes with camera pipeline for GPU cycles. In battery tests, MJPEG consumed 1.1W vs 1.6W for I-frame H.264 over a 10-minute session.

Edge Node Architecture

The edge tier runs FFmpeg with NVENC (NVIDIA) or QuickSync (Intel) hardware encoding. Input: RTMP or WebRTC ingest of the mezzanine stream. Output: HLS or DASH manifests with 3–5 renditions (360p to 1080p) at 1–4 Mbps per layer. Key design choices:

  • Segment duration: 2-second chunks for HLS. Shorter (1s) increases manifest overhead and origin load; longer delays ABR switching. In A/B tests, 2s hit the sweet spot for telehealth where network conditions change every 10–15 seconds.
  • Keyframe interval: 2 seconds (60 frames at 30fps). Aligns with segment boundaries to avoid split GOPs. Reduces decode complexity on low-end Android receivers.
  • Lookahead: 20 frames for rate control. NVENC's lookahead buffer analyzes upcoming frames to allocate bits optimally. Adds 15ms latency but improves quality 8–12% (VMAF) at same bitrate.
  • B-frames: 0 for sub-second latency, 2 for archival/playback. B-frames require frame reordering, adding 60–100ms decode latency. Real-time consultations skip them; recorded sessions use 2 B-frames for 15% bitrate savings.

Placement and Failover

Edge nodes sit in AWS Wavelength zones (5G edge) or on-premises in hospital data centers. Target: 30ms, fallback to regional cloud transcoding (higher latency but better than failure). In six months of production, 92% of sessions stayed on-edge; 8% failed over due to congestion or zone unavailability.

Client-Side Optimizations

The mobile app (Flutter + native platform channels) handles capture, mezzanine encode, and adaptive upload. Key techniques:

Dynamic Resolution Scaling

Monitor device thermals via iOS ProcessInfo.thermalState or Android PowerManager. When thermal state exceeds nominal, drop capture resolution from 720p to 540p. This cuts encode power 40% and prevents throttling. In user studies, 540p was visually acceptable for medical consultations (dermatology, PT sessions) where extreme detail isn't critical.

Frame Dropping Under Congestion

If upstream socket buffer exceeds 200ms of data (measured via SO_NWRITE on iOS), drop frames to maintain real-time behavior. Encode only every other frame, doubling effective framerate margin. The edge node's decoder handles missing frames gracefully since mezzanine is intra-only. P99 frame drop rate: 3.2% on LTE, 0.4% on 5G.

Audio/Video Sync

Mezzanine video and Opus audio (48kHz, 32kbps) are muxed into FLV containers with presentation timestamps derived from CMTime (iOS) or MediaCodec buffer timestamps (Android). Edge node preserves these timestamps through transcode, ensuring