Shared Memory Texture Buffers: GPU↔CPU Zero-Copy

In real-time computer vision pipelines—face tracking, object detection, AR overlays—the CPU often needs pixel data that the GPU just rendered or processed. The naive approach copies texture bytes from VRAM to system RAM via glReadPixels or MTLTexture.getBytes, stalling the pipeline for 15–50ms per frame. Shared memory texture buffers eliminate this copy entirely, letting CPU and GPU read the same physical memory. This article walks through Metal (iOS) and Vulkan (Android) implementations, synchronization primitives, cache coherency tradeoffs, and real-world latency wins in a 60fps face-mesh pipeline.

The Readback Bottleneck

A typical vision pipeline renders or processes a texture on the GPU, then needs CPU access for post-processing, ML inference, or logging. The standard path:

GPU writes texture in VRAM
getBytes or readPixels blocks until GPU finishes
Driver copies VRAM → system RAM (PCIe or SoC interconnect)
CPU wakes, processes data

On an iPhone 14 Pro, a 1920×1080 RGBA8 texture (8MB) takes ~18ms to read back via MTLTexture.getBytes. On a Snapdragon 8 Gen 2, glReadPixels for the same buffer averages 22ms. At 60fps (16.67ms/frame), this alone blows the budget. The copy is synchronous, so the GPU idles while the CPU waits, and vice versa.
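The budget arithmetic behind those numbers can be checked with a few lines of C++. This is a minimal sketch: the helper names (textureBytes, frameBudgetMs, readbackExceedsBudget) are hypothetical, and the texture is assumed to be tightly packed with no row padding.

```cpp
#include <cassert>
#include <cstdint>

// Bytes in a tightly packed texture (assumes no row padding).
inline std::uint64_t textureBytes(std::uint32_t width, std::uint32_t height,
                                  std::uint32_t bytesPerPixel) {
    return static_cast<std::uint64_t>(width) * height * bytesPerPixel;
}

// Milliseconds available per frame at the given frame rate.
inline double frameBudgetMs(double fps) {
    assert(fps > 0.0);
    return 1000.0 / fps;
}

// True when a blocking readback alone is longer than one frame interval.
inline bool readbackExceedsBudget(double readbackMs, double fps) {
    return readbackMs > frameBudgetMs(fps);
}
```

For the 1080p RGBA8 case, textureBytes(1920, 1080, 4) is 8,294,400 bytes, and readbackExceedsBudget(18.0, 60.0) is true because 18ms is more than the 16.67ms frame interval.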

Shared Memory Fundamentals

Modern mobile SoCs (Apple A-series, Qualcomm Snapdragon, Samsung Exynos) use unified memory architecture: CPU and GPU share the same physical RAM pool. The OS can map a buffer so both processors access identical bytes—no copy, no transfer. The trick is convincing the graphics API to allocate textures in this shared region and managing coherency.

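Metal (iOS/macOS): Use MTLStorageModeShared when creating an MTLTexture or MTLBuffer. The resource lives in system RAM and is accessible to both CPU and GPU. Trade-off: GPU access can be slower than with MTLStorageModePrivate, which is GPU-only and lets the driver choose GPU-optimized layouts and compression (on Apple silicon both modes sit in the same unified memory; only Macs with discrete GPUs have separate VRAM), but the copy disappears.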

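Vulkan (Android): Look for a memory type that has both VK_MEMORY_PROPERTY_HOST_VISIBLE_BIT and VK_MEMORY_PROPERTY_DEVICE_LOCAL_BIT. If one exists (common on mobile), allocate the VkImage's memory from it. For the CPU to interpret the mapped bytes, create the image with VK_IMAGE_TILING_LINEAR and take the row pitch from vkGetImageSubresourceLayout; VK_IMAGE_TILING_OPTIMAL images use an implementation-defined layout. Fallback: copy into a staging buffer in VK_MEMORY_PROPERTY_HOST_CACHED_BIT memory for fast CPU reads.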

Metal Implementation

Create a shared texture:

let descriptor = MTLTextureDescriptor.texture2DDescriptor(
    pixelFormat: .rgba8Unorm,
    width: 1920,
    height: 1080,
    mipmapped: false
)
descriptor.storageMode = .shared
descriptor.usage = [.shaderWrite, .shaderRead]
let sharedTexture = device.makeTexture(descriptor: descriptor)!

In a compute or fragment shader, write to the texture. On the CPU side, access bytes directly:

let bytesPerRow = 1920 * 4
let region = MTLRegion(origin: MTLOrigin(x: 0, y: 0, z: 0),
                       size: MTLSize(width: 1920, height: 1080, depth: 1))
var pixels = [UInt8](repeating: 0, count: 1920 * 1080 * 4)
sharedTexture.getBytes(&pixels,
                       bytesPerRow: bytesPerRow,
                       from: region,
                       mipmapLevel: 0)

But getBytes still copies! The real win is to back the texture with an MTLBuffer and read the buffer's memory directly. For a buffer-backed texture, MTLTexture.buffer returns the underlying buffer (it is nil for textures created any other way). For maximum control, allocate an MTLBuffer with storageMode .shared, then create a texture view over it:

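let bufferSize = 1920 * 1080 * 4
// Shared storage: one allocation visible to both CPU and GPU.
let sharedBuffer = device.makeBuffer(length: bufferSize,
                                     options: .storageModeShared)!
let textureDescriptor = MTLTextureDescriptor.texture2DDescriptor(...)
// The texture's storage mode must match the buffer it is created from.
textureDescriptor.storageMode = .shared
// bytesPerRow must be a multiple of device.minimumLinearTextureAlignment(for:).
let texture = sharedBuffer.makeTexture(descriptor: textureDescriptor,
                                       offset: 0,
                                       bytesPerRow: 1920 * 4)!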

Now sharedBuffer.contents() returns a raw pointer to the same memory the GPU writes. Zero-copy.
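Once the CPU holds a raw pointer, reading a pixel is plain row-pitch arithmetic. The sketch below is written in C++ to match the Vulkan snippets in this article; the Rgba8 struct and readPixel helper are hypothetical names, and it assumes a linear RGBA8 layout whose bytesPerRow may include padding.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// One RGBA8 pixel as it sits in a linear (row-major) buffer.
struct Rgba8 {
    std::uint8_t r, g, b, a;
};

// Reads pixel (x, y) from a linear RGBA8 buffer. bytesPerRow is the row
// pitch, which can be larger than width * 4 when rows are padded.
inline Rgba8 readPixel(const std::uint8_t* base, std::size_t bytesPerRow,
                       std::size_t x, std::size_t y) {
    assert(base != nullptr);
    assert((x + 1) * 4 <= bytesPerRow);  // pixel must fit inside its row
    const std::uint8_t* p = base + y * bytesPerRow + x * 4;
    return Rgba8{p[0], p[1], p[2], p[3]};
}
```

With the buffer-backed texture above, base would be the pointer returned by contents() and bytesPerRow would be 1920 * 4. The read is only meaningful after the GPU work that wrote the texture has completed.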

Vulkan Implementation

Query memory properties:

// Find a memory type that is both CPU-mappable and device-local.
const VkMemoryPropertyFlags wanted = VK_MEMORY_PROPERTY_HOST_VISIBLE_BIT |
                                     VK_MEMORY_PROPERTY_DEVICE_LOCAL_BIT;
uint32_t sharedMemoryTypeIndex = UINT32_MAX;  // UINT32_MAX means "not found"
VkPhysicalDeviceMemoryProperties memProps;
vkGetPhysicalDeviceMemoryProperties(physicalDevice, &memProps);
for (uint32_t i = 0; i < memProps.memoryTypeCount; i++) {
    if ((memProps.memoryTypes[i].propertyFlags & wanted) == wanted) {
        sharedMemoryTypeIndex = i;
        break;
    }
}

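Allocate memory for the VkImage from this memory type, bind it, then map it:

// allocInfo: allocationSize from vkGetImageMemoryRequirements,
// memoryTypeIndex = sharedMemoryTypeIndex.
VkDeviceMemory imageMemory;
vkAllocateMemory(device, &allocInfo, nullptr, &imageMemory);
vkBindImageMemory(device, image, imageMemory, 0);
// Map the whole allocation so the CPU can read it in place.
void* mappedPtr = nullptr;
vkMapMemory(device, imageMemory, 0, VK_WHOLE_SIZE, 0, &mappedPtr);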

The GPU renders to image; the CPU reads from mappedPtr. No vkCmdCopyImageToBuffer, no staging.
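The selection logic, including the staging fallback, can be sketched without the Vulkan headers. Everything here is an illustrative stand-in: the k-prefixed constants mirror the numeric values of the corresponding VK_MEMORY_PROPERTY_* bits, typeFlags plays the role of memProps.memoryTypes[i].propertyFlags, and the two bit masks play the role of the memoryTypeBits reported for the image and for the staging buffer. MemoryChoice, findMemoryType, and chooseReadbackMemory are hypothetical names.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Stand-ins with the same numeric values as the Vulkan flag bits, so the
// sketch compiles without the Vulkan headers.
constexpr std::uint32_t kDeviceLocal = 0x1;  // VK_MEMORY_PROPERTY_DEVICE_LOCAL_BIT
constexpr std::uint32_t kHostVisible = 0x2;  // VK_MEMORY_PROPERTY_HOST_VISIBLE_BIT
constexpr std::uint32_t kHostCached = 0x8;   // VK_MEMORY_PROPERTY_HOST_CACHED_BIT
constexpr std::uint32_t kNotFound = UINT32_MAX;

struct MemoryChoice {
    std::uint32_t typeIndex;  // kNotFound when nothing matched
    bool zeroCopy;            // true: shared path, false: staging fallback
};

// First memory type permitted by allowedTypeBits whose flags contain
// every bit in `required`.
inline std::uint32_t findMemoryType(const std::vector<std::uint32_t>& typeFlags,
                                    std::uint32_t allowedTypeBits,
                                    std::uint32_t required) {
    assert(typeFlags.size() <= 32);
    for (std::uint32_t i = 0; i < typeFlags.size(); ++i) {
        const bool allowed = (allowedTypeBits & (1u << i)) != 0;
        if (allowed && (typeFlags[i] & required) == required) {
            return i;
        }
    }
    return kNotFound;
}

// Prefer device-local + host-visible memory for the image; otherwise fall
// back to host-visible + host-cached memory for a staging buffer.
inline MemoryChoice chooseReadbackMemory(const std::vector<std::uint32_t>& typeFlags,
                                         std::uint32_t imageTypeBits,
                                         std::uint32_t stagingTypeBits) {
    const std::uint32_t shared =
        findMemoryType(typeFlags, imageTypeBits, kDeviceLocal | kHostVisible);
    if (shared != kNotFound) {
        return {shared, true};
    }
    return {findMemoryType(typeFlags, stagingTypeBits, kHostVisible | kHostCached),
            false};
}
```

Filtering by the resource's permitted type bits matters because a memory type can carry the right property flags and still be unusable for a particular image.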

Synchronization and Coherency

Shared memory doesn't mean instant visibility. You must synchronize:

Metal: Wait for the GPU work to finish before touching the bytes: call MTLCommandBuffer.waitUntilCompleted(), register a handler with addCompletedHandler, or use an MTLSharedEvent with signal/wait. For .shared resources that is sufficient; MTLBlitCommandEncoder.synchronize(resource:) applies only to .managed resources on macOS and is not needed here. Without the wait, the CPU may read stale or partially written data.

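Vulkan: Record a pipeline barrier from the writing stage to the host: srcAccessMask VK_ACCESS_SHADER_WRITE_BIT, dstAccessMask VK_ACCESS_HOST_READ_BIT, dstStageMask VK_PIPELINE_STAGE_HOST_BIT, and a srcStageMask that matches the stage doing the writes (VK_PIPELINE_STAGE_COMPUTE_SHADER_BIT for a compute pass, VK_PIPELINE_STAGE_FRAGMENT_SHADER_BIT for fragment-shader stores). The barrier alone does not tell the CPU when the work is done, so also wait on a VkFence for the submission (vkWaitForFences) before reading. If the memory type lacks VK_MEMORY_PROPERTY_HOST_COHERENT_BIT, call vkInvalidateMappedMemoryRanges after the fence signals and before CPU reads to ensure cache coherency. On most mobile GPUs, coherent memory is available but slower; non-coherent requires manual flushing but is faster for GPU writes.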

Real-World Latency: Face Mesh Pipeline

A production face-tracking app processes camera frames at 60fps. The pipeline:

Camera → CVPixelBuffer (iOS) or AHardwareBuffer (Android)
GPU: YUV→RGB convert, bilateral filter (compute shader)
CPU: Run TFLite face mesh model (468 landmarks)
GPU: Render overlay triangles

Original (readback): 41ms/frame (GPU: 8ms, readback: 18ms, CPU inference: 12ms, render: 3ms). Frame rate: 24fps.

Shared memory (Metal storageMode: .shared): 23ms/frame (GPU: 9ms, 1ms slower because of shared storage; readback: 0ms; CPU inference: 12ms; render: 2ms). Frame rate: 43fps.

Further optimization: overlap GPU and CPU work. While the GPU renders frame N, the CPU processes frame N-1 from the shared buffer. This requires double-buffering (two shared textures used in ping-pong fashion). The result was 16.2ms/frame, sustaining 60fps with about 0.5ms of headroom against the 16.67ms budget.
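The ping-pong indexing and the effect of overlapping the two stages can be sketched as follows. The function names are hypothetical, and the timing model is deliberately idealized: it ignores synchronization cost and memory contention, so it gives a lower bound rather than a prediction.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

// Slot the GPU writes while producing frame n (two shared textures).
inline std::uint32_t gpuWriteSlot(std::uint64_t frame) {
    return static_cast<std::uint32_t>(frame % 2);
}

// Slot the CPU reads during the same interval: the one holding frame n-1.
inline std::uint32_t cpuReadSlot(std::uint64_t frame) {
    assert(frame >= 1);  // frame 0 has no predecessor to read
    return static_cast<std::uint32_t>((frame - 1) % 2);
}

// Frame interval when GPU and CPU stages run back to back.
inline double serialFrameMs(double gpuMs, double cpuMs) {
    return gpuMs + cpuMs;
}

// Idealized frame interval when the stages overlap: the slower stage
// dominates. Real pipelines land above this bound.
inline double overlappedFrameMs(double gpuMs, double cpuMs) {
    return std::max(gpuMs, cpuMs);
}
```

With the shared-memory stage times above (GPU 9ms + 2ms render = 11ms, CPU 12ms), serialFrameMs gives 23ms and overlappedFrameMs gives 12ms. The measured 16.2ms sits between the two.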

Cache Coherency Pitfalls

On many ARM SoCs, CPU and GPU caches are not guaranteed to be coherent for every memory type. Writing from the GPU and reading from the CPU without a barrier can yield:

Stale reads: CPU cache holds old texture data; GPU wrote new data to RAM but CPU didn't invalidate its cache lines.
Torn reads: CPU reads mid-write if the GPU write isn't atomic at cache-line granularity (typically 64 bytes on ARM Cortex cores, 128 bytes on Apple cores).

Solutions:

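Metal: MTLStorageModeShared resources need no explicit cache maintenance on Apple GPUs; GPU writes are visible to the CPU once the command buffer that produced them has completed. didModifyRange applies only to MTLStorageModeManaged buffers on macOS, where it reports CPU writes to the GPU.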
Vulkan: If memory is HOST_COHERENT, the driver handles it. Otherwise, vkInvalidateMappedMemoryRanges before CPU reads, vkFlushMappedMemoryRanges after CPU writes before GPU reads.

In practice, coherent memory adds 5-10% GPU overhead on Adreno 730 (measured via systrace). Non-coherent + manual invalidation is faster but error-prone—one missed barrier corrupts frames.
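One common source of missed invalidations on non-coherent memory is range alignment: the offset passed to vkInvalidateMappedMemoryRanges or vkFlushMappedMemoryRanges must be a multiple of the device's nonCoherentAtomSize, and the size must be a multiple of it unless the range ends at the end of the allocation. The helper below is a sketch of that rounding in plain C++; MappedRange and alignToAtom are hypothetical names, and atomSize stands for VkPhysicalDeviceLimits::nonCoherentAtomSize.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

// Offset and size of a mapped range, in bytes.
struct MappedRange {
    std::uint64_t offset;
    std::uint64_t size;
};

// Expands [offset, offset + size) outward to atomSize boundaries, clamping
// the end to the allocation size so the range never runs past the memory.
inline MappedRange alignToAtom(std::uint64_t offset, std::uint64_t size,
                               std::uint64_t atomSize,
                               std::uint64_t allocationSize) {
    assert(atomSize > 0);
    assert(offset + size <= allocationSize);
    const std::uint64_t begin = (offset / atomSize) * atomSize;
    std::uint64_t end = ((offset + size + atomSize - 1) / atomSize) * atomSize;
    end = std::min(end, allocationSize);
    return MappedRange{begin, end - begin};
}
```

For example, a 50-byte region at offset 100 with a 64-byte atom becomes offset 64, size 128. Invalidating slightly more than needed is harmless for reads; invalidating less is how stale pixels slip through.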

When Not to Use Shared Memory

Shared memory isn't always a win:

Write-once, read-never textures: If the CPU never needs the data (e.g., intermediate G-buffers), keep them in storageMode: .private (Metal) or device-local-only memory (Vulkan) for peak GPU performance.
High GPU write frequency: Shared storage is slower for GPU writes (70-80% of private-storage bandwidth on Apple A16). If a texture is written every frame but read by the CPU once per second, the readback cost may be less than the cumulative GPU slowdown.
Large textures: A 4K RGBA16F texture (3840×2160 at 8 bytes per pixel, about 66MB) in shared mode consumes precious system RAM, reducing the memory available for app logic. Profile the footprint with Xcode Instruments (iOS) or dumpsys meminfo (Android).

Benchmark both paths. In a GlucoScan-style PPG processing pipeline (512×512 texture, read every 30ms for FFT), shared memory cut frame time from 28ms to 19ms—but for a 1080p video filter app (read once per 5 seconds for thumbnail), readback was negligible and private storage kept GPU renders at 11ms vs 14ms shared.
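The trade-off can be framed as a per-second cost comparison before any profiling. This is a back-of-the-envelope sketch, not a substitute for measurement: the function names are hypothetical, and the inputs (extra GPU milliseconds per frame in shared mode, cost of one readback, read frequency) have to come from your own benchmarks.

```cpp
#include <cassert>

// GPU time lost per second of rendering because the texture is shared.
inline double sharedPenaltyMsPerSec(double extraGpuMsPerFrame, double fps) {
    assert(fps > 0.0);
    return extraGpuMsPerFrame * fps;
}

// Time spent per second on explicit readbacks with private storage.
inline double readbackCostMsPerSec(double readbackMs, double readsPerSec) {
    assert(readsPerSec >= 0.0);
    return readbackMs * readsPerSec;
}

// True when shared storage costs less per second than private + readback.
inline bool sharedStorageWins(double extraGpuMsPerFrame, double fps,
                              double readbackMs, double readsPerSec) {
    return sharedPenaltyMsPerSec(extraGpuMsPerFrame, fps) <
           readbackCostMsPerSec(readbackMs, readsPerSec);
}
```

Plugging in the video filter case (3ms extra per frame from 14ms vs 11ms, one read every 5 seconds) and assuming 60fps and an 18ms readback, the shared penalty is 180ms per second against 3.6ms per second of readback, so private storage wins. For the face-mesh pipeline (1ms extra per frame, a read every frame), the comparison is 60ms against 1080ms, and shared storage wins.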

Tooling and Verification

Metal: Xcode GPU Frame Capture shows texture storage mode in the resource inspector. Memory Graph Debugger reveals if shared buffers are pinned in RAM. Check for "Shared" label in the texture list.

Vulkan: RenderDoc captures memory allocation properties. Android GPU Inspector (AGI) profiles memory bandwidth—compare device-local-only vs host-visible. Look for "Memory Type Index" in allocation details.

Validate correctness by writing a known pattern (checkerboard) from GPU, reading on CPU, asserting pixel values match. Fuzz test with random writes at varying cadences to catch cache bugs.
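A CPU-side reference for the checkerboard check might look like the sketch below. All names are hypothetical, and it assumes a linear RGBA8 layout; in a real test the GPU shader would write the same pattern and countMismatches would run on the shared pointer after the command buffer completes.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Expected channel value at (x, y): white (255) or black (0) cells.
inline std::uint8_t checkerValue(std::size_t x, std::size_t y, std::size_t cell) {
    assert(cell > 0);
    return static_cast<std::uint8_t>(((x / cell + y / cell) % 2 == 0) ? 255 : 0);
}

// Fills a linear RGBA8 image with the reference pattern (alpha = 255).
inline void fillCheckerboard(std::vector<std::uint8_t>& pixels, std::size_t width,
                             std::size_t height, std::size_t bytesPerRow,
                             std::size_t cell) {
    assert(bytesPerRow >= width * 4);
    pixels.assign(height * bytesPerRow, 0);
    for (std::size_t y = 0; y < height; ++y) {
        for (std::size_t x = 0; x < width; ++x) {
            std::uint8_t* p = &pixels[y * bytesPerRow + x * 4];
            p[0] = p[1] = p[2] = checkerValue(x, y, cell);
            p[3] = 255;
        }
    }
}

// Counts pixels whose bytes differ from the reference pattern.
inline std::size_t countMismatches(const std::uint8_t* pixels, std::size_t width,
                                   std::size_t height, std::size_t bytesPerRow,
                                   std::size_t cell) {
    std::size_t bad = 0;
    for (std::size_t y = 0; y < height; ++y) {
        for (std::size_t x = 0; x < width; ++x) {
            const std::uint8_t* p = pixels + y * bytesPerRow + x * 4;
            const std::uint8_t v = checkerValue(x, y, cell);
            if (p[0] != v || p[1] != v || p[2] != v || p[3] != 255) {
                ++bad;
            }
        }
    }
    return bad;
}
```

Returning a mismatch count rather than a boolean makes it easier to tell a missed barrier (a band of stale rows) from a layout bug (nearly everything wrong).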

Production Lessons

In HearingAid Pro's AirPods spatial audio pipeline, we used shared memory to pass FFT magnitude buffers (2048 floats) from Metal compute shaders to Core Audio callbacks. Latency dropped from 12ms to 3ms. Key learnings:

Use ring buffers (4 slots) to avoid GPU stalls when CPU is slow.
Align buffer sizes to the page size, queried at runtime (16KB on arm64 iOS devices, commonly 4KB on Android), so that slots never share a page or a cache line.
On iOS, shared memory survives app backgrounding—but Metal device loss invalidates pointers. Re-map after MTLDevice reset.
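The ring-buffer idea from the first lesson can be sketched as a single-producer, single-consumer index ring. This is an illustrative C++ sketch rather than the code from that pipeline: SlotRing and its methods are hypothetical names, the slots themselves (shared buffers) live elsewhere, and it assumes exactly one thread publishes (for example from a GPU completion callback) and exactly one thread consumes.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Tracks which of SlotCount shared buffers are free or ready. It holds
// indices only; the buffers are owned by the caller.
template <std::size_t SlotCount>
class SlotRing {
public:
    // Producer: slot to fill next, or -1 when every slot is still unread
    // (the caller can drop the frame instead of stalling the GPU).
    int beginWrite() const {
        const std::uint64_t head = head_.load(std::memory_order_relaxed);
        const std::uint64_t tail = tail_.load(std::memory_order_acquire);
        if (head - tail == SlotCount) {
            return -1;
        }
        return static_cast<int>(head % SlotCount);
    }

    // Producer: publish the slot once its contents are complete.
    void endWrite() {
        head_.store(head_.load(std::memory_order_relaxed) + 1,
                    std::memory_order_release);
    }

    // Consumer: oldest ready slot, or -1 when nothing is pending.
    int beginRead() const {
        const std::uint64_t tail = tail_.load(std::memory_order_relaxed);
        const std::uint64_t head = head_.load(std::memory_order_acquire);
        if (head == tail) {
            return -1;
        }
        return static_cast<int>(tail % SlotCount);
    }

    // Consumer: hand the slot back to the producer.
    void endRead() {
        tail_.store(tail_.load(std::memory_order_relaxed) + 1,
                    std::memory_order_release);
    }

private:
    std::atomic<std::uint64_t> head_{0};  // slots published so far
    std::atomic<std::uint64_t> tail_{0};  // slots consumed so far
};
```

With SlotRing<4>, the producer can run up to four buffers ahead of a slow consumer before beginWrite starts returning -1.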

For cross-platform code (Flutter plugin), abstract shared memory behind a protocol: SharedTextureHandle with platform-specific implementations (Metal on iOS, Vulkan on Android, fallback to readback on web).
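One possible shape for that abstraction, sketched in C++ for the native side of a plugin. Only the name SharedTextureHandle comes from the text above; the method set, ReadbackTextureHandle, and the injected readback callback are assumptions made for illustration. Metal and Vulkan implementations would return the shared pointer directly from lockForRead after waiting on the GPU.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <utility>
#include <vector>

// Platform-neutral view of a texture the CPU can read.
class SharedTextureHandle {
public:
    virtual ~SharedTextureHandle() = default;
    // Returns CPU-readable pixels once GPU writes are visible.
    virtual const std::uint8_t* lockForRead() = 0;
    // Signals that the CPU is done with the pointer.
    virtual void unlock() = 0;
    virtual std::size_t bytesPerRow() const = 0;
    // False when lockForRead has to copy (the readback fallback).
    virtual bool isZeroCopy() const = 0;
};

// Fallback for platforms without shared memory: keeps a CPU-side copy
// that a platform-specific readback callback refreshes on every lock.
class ReadbackTextureHandle final : public SharedTextureHandle {
public:
    using ReadbackFn = std::function<void(std::uint8_t* dst, std::size_t size)>;

    ReadbackTextureHandle(std::size_t width, std::size_t height, ReadbackFn readback)
        : bytesPerRow_(width * 4),
          staging_(width * height * 4),
          readback_(std::move(readback)) {
        assert(readback_ != nullptr);
    }

    const std::uint8_t* lockForRead() override {
        readback_(staging_.data(), staging_.size());  // the copy happens here
        return staging_.data();
    }
    void unlock() override {}
    std::size_t bytesPerRow() const override { return bytesPerRow_; }
    bool isZeroCopy() const override { return false; }

private:
    std::size_t bytesPerRow_;
    std::vector<std::uint8_t> staging_;
    ReadbackFn readback_;
};
```

Exposing isZeroCopy lets callers log or adapt when they are running on the slow path instead of silently absorbing the copy.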

Conclusion

Shared memory texture buffers turn a roughly 20ms serialized copy (18–22ms in the measurements above) into a pointer dereference. The cost: slightly slower GPU writes, careful synchronization, and platform-specific code. For real-time vision, audio DSP, or any CPU-GPU data exchange above 10Hz, the latency win is transformative. Measure both paths, profile memory pressure, and fuzz test coherency, then ship zero-copy.