Hypura: A storage-tier-aware LLM inference scheduler for Apple Silicon


The Architecture: Tiers of Compromise

When an LLM's footprint exceeds the physical memory available on Apple Silicon, the system can crash with an out-of-memory error. Hypura addresses this by acting as a storage-tier-aware inference scheduler: it manages the placement of LLM weights across the memory hierarchy of GPU (Metal), unified RAM, and NVMe SSD.

The core mechanism reads tensor weights from GGUF files on NVMe into RAM or GPU memory pools. Computation occurs exclusively in RAM or on the GPU. As an LLM inference scheduler, Hypura functions as a specialized memory manager, distinct from a generic OS page fault handler. It exploits the deterministic access patterns of transformer layers during inference, prefetching upcoming tensors and issuing large sequential reads to maximize NVMe throughput.
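The NVMe read path described above can be sketched in Rust. This is an illustrative sketch, not Hypura's actual code: the function name `read_tensor` and the chunk size are assumptions, and the macOS-specific `F_NOCACHE` fcntl call is noted in a comment rather than performed, to keep the sketch portable and dependency-free.

```rust
use std::fs::File;
use std::io::Write;
use std::os::unix::fs::FileExt;

const CHUNK: usize = 1 << 20; // 1 MiB reads to favor sequential NVMe throughput

/// Read `len` bytes of tensor data starting at `offset`, in large chunks.
/// (Hypura additionally sets F_NOCACHE via fcntl on macOS to bypass the
/// page cache; that platform-specific call is omitted from this sketch.)
fn read_tensor(file: &File, offset: u64, len: usize) -> std::io::Result<Vec<u8>> {
    let mut buf = vec![0u8; len];
    let mut done = 0;
    while done < len {
        let n = (len - done).min(CHUNK);
        file.read_exact_at(&mut buf[done..done + n], offset + done as u64)?;
        done += n;
    }
    Ok(buf)
}

fn main() -> std::io::Result<()> {
    // Demo against a small temp file standing in for a GGUF tensor blob.
    let path = std::env::temp_dir().join("hypura_demo.bin");
    let mut f = File::create(&path)?;
    f.write_all(&vec![7u8; 4096])?;
    let f = File::open(&path)?;
    let data = read_tensor(&f, 1024, 2048)?;
    println!("{} {}", data.len(), data[0]);
    std::fs::remove_file(&path)?;
    Ok(())
}
```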

Hypura classifies tensors for optimal placement across these tiers. Attention layers, norms, and embeddings are pinned to the GPU (Metal): they are accessed on every token and need the fastest path, constrained by the recommendedMaxWorkingSetSize. Unified RAM serves as the next tier, holding layers that overflow the GPU's working set, accessed via mmap. The bulk of the remaining layers reside on the NVMe SSD, loaded on demand using direct I/O (F_NOCACHE + pread) with prefetching.
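The placement policy above can be distilled into a small decision function. This is a hypothetical sketch: the type names, the name-matching heuristic, and the budgets are illustrative assumptions, not Hypura's actual implementation.

```rust
// Hypothetical sketch of the tier-placement policy; names and thresholds
// are illustrative assumptions, not the project's actual API.
#[derive(Debug, PartialEq)]
enum Tier { Gpu, Ram, Nvme }

struct Budget { gpu_free: u64, ram_free: u64 } // bytes remaining per tier

/// Place hot tensors (attention, norms, embeddings) on the GPU while the
/// Metal working-set budget lasts; overflow goes to RAM, then NVMe.
fn place(name: &str, size: u64, b: &mut Budget) -> Tier {
    let hot = name.contains("attn") || name.contains("norm") || name.contains("embd");
    if hot && b.gpu_free >= size {
        b.gpu_free -= size;
        Tier::Gpu
    } else if b.ram_free >= size {
        b.ram_free -= size;
        Tier::Ram
    } else {
        Tier::Nvme
    }
}

fn main() {
    let mut b = Budget { gpu_free: 8 << 30, ram_free: 4 << 30 };
    println!("{:?}", place("blk.0.attn_q.weight", 1 << 30, &mut b)); // Gpu
    println!("{:?}", place("blk.0.ffn_up.weight", 6 << 30, &mut b)); // Nvme: cold and over RAM budget
}
```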

For Mixture-of-Experts (MoE) models like Mixtral, Hypura exploits sparsity. It intercepts the router to load only the necessary expert strides from NVMe, reducing I/O by up to 75% in our internal evaluations. It also maintains a neuron cache for loaded expert slices, achieving a 99.5% hit rate in our benchmark suite, and tracks expert co-activations for speculative prefetching.
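The neuron-cache idea can be sketched as a keyed cache with hit/miss accounting. Everything here is an assumption for illustration: the `ExpertCache` name, the `(layer, expert)` key, and the stubbed load stand in for Hypura's real data structures and NVMe reads.

```rust
use std::collections::HashMap;

// Illustrative sketch of a neuron/expert cache; names are assumptions.
struct ExpertCache {
    slices: HashMap<(usize, usize), Vec<u8>>, // (layer, expert) -> weight slice
    hits: u64,
    misses: u64,
}

impl ExpertCache {
    fn new() -> Self { Self { slices: HashMap::new(), hits: 0, misses: 0 } }

    /// Return the cached slice, or load it (here a stub for an NVMe pread).
    fn get(&mut self, key: (usize, usize)) -> &Vec<u8> {
        if self.slices.contains_key(&key) {
            self.hits += 1;
        } else {
            self.misses += 1;
            self.slices.insert(key, vec![0u8; 16]); // stand-in for a GGUF read
        }
        &self.slices[&key]
    }

    fn hit_rate(&self) -> f64 {
        self.hits as f64 / (self.hits + self.misses) as f64
    }
}

fn main() {
    let mut c = ExpertCache::new();
    // Mixtral routes 2 of 8 experts per token; repeated tokens re-hit hot experts.
    for _ in 0..100 {
        c.get((0, 3));
        c.get((0, 5));
    }
    println!("hit rate: {:.3}", c.hit_rate()); // 2 misses out of 200 lookups
}
```

A high hit rate like Mixtral's 99.5% emerges naturally when a few experts dominate the routing distribution across consecutive tokens.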

Dense FFN weights, which can constitute 60% of a model's size, are streamed from NVMe via a dynamically-sized pool buffer. The system automatically configures buffer sizes, prefetch depth, and memory budgets based on the hardware profile.
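The auto-configuration step might look like the following. The sizing heuristics here (a bounded fraction of free RAM, prefetch depth tiered by pool size) are invented for illustration; Hypura's real heuristics are its own.

```rust
// Hypothetical auto-configuration sketch; the constants are assumptions.
struct StreamConfig { pool_mb: u64, prefetch_depth: usize }

/// Derive pool size and prefetch depth from the RAM (in MB) left over after
/// resident layers: bigger machines get larger pools and deeper prefetch.
fn configure(free_ram_mb: u64) -> StreamConfig {
    let pool_mb = (free_ram_mb / 4).clamp(256, 8192); // quarter of free RAM, bounded
    let prefetch_depth = if pool_mb >= 4096 { 8 } else if pool_mb >= 1024 { 4 } else { 2 };
    StreamConfig { pool_mb, prefetch_depth }
}

fn main() {
    let c = configure(32_768); // e.g. ~32 GB of headroom
    println!("pool={} MB, depth={}", c.pool_mb, c.prefetch_depth);
}
```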

Hypura performs read-only operations from the SSD during inference. This design prevents SSD wear, addressing a frequent concern when NVMe storage functions as an extension of RAM. Beyond its core functionality as an LLM inference scheduler, Hypura provides an Ollama-compatible HTTP API server, accessible by default at 127.0.0.1:8080. This enables straightforward integration into existing LLM workflows, supporting standard endpoints like /api/generate and /api/chat. Building Hypura requires Rust 1.75+ and CMake.
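A client request to the Ollama-compatible endpoint can be assembled by hand. The endpoint path and address come from the text above; the model name and prompt are placeholders, and the hand-built JSON is only for a dependency-free sketch (a real client should use a JSON library to handle escaping).

```rust
// Sketch of a request to the Ollama-compatible /api/generate endpoint
// described above. Model name and prompt are placeholder values.
fn generate_request(model: &str, prompt: &str) -> String {
    // Minimal JSON by hand to keep the sketch dependency-free; use a JSON
    // library in real code so special characters are escaped correctly.
    let body = format!(r#"{{"model":"{}","prompt":"{}","stream":false}}"#, model, prompt);
    format!(
        "POST /api/generate HTTP/1.1\r\nHost: 127.0.0.1:8080\r\nContent-Type: application/json\r\nContent-Length: {}\r\n\r\n{}",
        body.len(), body
    )
}

fn main() {
    println!("{}", generate_request("llama3.3:70b", "Summarize this corpus."));
}
```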

The Bottleneck: Latency as a Limiting Factor for LLM Inference

While Hypura can run these models, execution speed becomes the critical factor, especially for interactive use. Here's what the performance data shows:

For instance, a Qwen 2.5 14B Q4_K_M model (8.4 GB) fits entirely within the GPU and achieves 21 tokens/second. In contrast, a Mixtral 8x7B Q5_K_M model (30.9 GB) requires expert-streaming, utilizing 1.1 GB of GPU memory and 29.8 GB from NVMe, resulting in 2.2 tokens/second. A Llama 3.3 70B Q4_K_M model (39.6 GB), which uses dense FFN-streaming with 7.8 GB on GPU and 31.8 GB on NVMe, yields only 0.3 tokens/second.

The 0.3 tokens per second for a 70B model exposes a fundamental architectural bottleneck. While Hypura's paging and prefetching mechanisms are optimized, they cannot circumvent the physical limitations of data transfer. NVMe sequential reads on a 2021 14-inch MacBook Pro with an M1 Max (1TB SSD) reach approximately 5.1 GB/s, as measured by fio with 1MB block sizes.

However, random reads, which still occur for certain access patterns even with prefetching, can fall to roughly 500 MB/s under the same fio methodology with 4KB block sizes. When 30+ GB of model weights must be streamed, even optimal sequential access leaves transfer time dominating per-token latency.
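The numbers above can be checked with simple arithmetic: the time to stream Mixtral's 29.8 GB of NVMe-resident weights at the measured sequential (5.1 GB/s) versus random (0.5 GB/s) rates.

```rust
// Back-of-the-envelope check of the figures in the text: streaming time for
// ~30 GB of weights at sequential vs random NVMe read rates.
fn stream_seconds(gb: f64, gb_per_s: f64) -> f64 { gb / gb_per_s }

fn main() {
    let seq = stream_seconds(29.8, 5.1); // sequential, large block sizes
    let rnd = stream_seconds(29.8, 0.5); // 4KB random reads
    println!("sequential: {:.1}s, random: {:.1}s", seq, rnd);
}
```

The roughly 10x gap between the two cases is why Hypura's large sequential reads and prefetching matter so much.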

Even for models approaching the limits of current consumer hardware, the sheer volume of data to stream can lead to unacceptable latency. For instance, if a model required streaming 2TB of FP16 weights per forward pass, it would result in over 300 seconds per token. Such performance renders it unusable for any foreground task. While the system achieves Availability (the model runs), it sacrifices Usability (the model is too slow for practical application).
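The 2TB thought experiment works out as follows, using the measured sequential bandwidth from above:

```rust
// Worked version of the 2 TB thought experiment: per-token latency is
// bandwidth-bound even at full sequential NVMe throughput.
fn seconds_per_token(stream_gb: f64, gb_per_s: f64) -> f64 { stream_gb / gb_per_s }

fn main() {
    // 2 TB (2048 GB) of FP16 weights streamed per forward pass at 5.1 GB/s
    let s = seconds_per_token(2048.0, 5.1);
    println!("{:.0} s/token", s);
}
```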

Hypura's optimized paging mitigates worst-case random access penalties, particularly for MoE models where only a subset of experts are required. The 99.5% neuron cache hit rate for Mixtral exemplifies this optimization. However, for dense models, where a larger portion of FFN weights must be streamed, the bottleneck shifts from random access inefficiency to the raw bandwidth and latency of data transfer from NVMe to GPU/RAM.

The Trade-offs: Capacity vs. Latency in LLM Inference Scheduling

Hypura's core design prioritizes capacity over latency, making it possible to run models on consumer hardware whose footprint exceeds physical memory. This is a deliberate architectural choice, but it comes with clear performance implications.

This involves trading interactive, real-time performance for the fundamental ability to execute the model. This isn't a new concept; we've seen similar compromises in virtual memory systems for decades. The key distinction, however, lies in the sheer data scale and the LLM inference application's acute sensitivity to latency. A human user expects responses in seconds, not minutes.

This represents a resource contention scenario. When a model's working set exceeds the fastest memory tiers (GPU VRAM, unified RAM), the system must utilize a slower tier (NVMe). Performance then becomes constrained by the slowest component in this hierarchy. Hypura's optimizations aim to maximize the efficiency of this slower link, but they cannot achieve unified memory performance from an NVMe SSD.

The Pattern: Know Your Workload, Design for Purpose

Using Hypura effectively requires a clear understanding of workload characteristics and latency requirements.

For interactive use, where user-facing applications demand sub-second responsiveness and at least single-digit tokens per second, Hypura's streaming modes for very large models are generally unsuitable. In such scenarios, prioritize smaller models or cloud offload: quantize aggressively so the model fits within GPU and RAM, or choose a smaller, efficient model that delivers adequate quality for the task.

For truly massive models, a distributed inference cluster in the cloud often remains the only method to achieve acceptable interactive latency. This is a fundamental decision about where to perform the computation.
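The decision logic of the last few paragraphs can be distilled into a small rule. This is an illustrative summary of the section's advice, not Hypura's published guidance; the type names and the fast-memory threshold are assumptions.

```rust
// Illustrative decision rule distilled from this section; names and
// thresholds are assumptions, not Hypura's published guidance.
#[derive(Debug, PartialEq)]
enum Plan { LocalResident, LocalStreaming, CloudOffload }

/// Pick an execution plan from the model's footprint, the machine's fast
/// memory (GPU + RAM budget), and whether the task is interactive.
fn plan(model_gb: f64, fast_mem_gb: f64, interactive: bool) -> Plan {
    if model_gb <= fast_mem_gb {
        Plan::LocalResident   // fits: full speed, no streaming overhead
    } else if interactive {
        Plan::CloudOffload    // streaming is too slow for foreground use
    } else {
        Plan::LocalStreaming  // batch work tolerates ~0.3 tok/s
    }
}

fn main() {
    println!("{:?}", plan(8.4, 24.0, true));   // LocalResident
    println!("{:?}", plan(39.6, 24.0, true));  // CloudOffload
    println!("{:?}", plan(39.6, 24.0, false)); // LocalStreaming
}
```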

Conversely, Hypura excels in scenarios where latency is not paramount, effectively functioning as a local batch processing engine for LLMs. If an overnight task requires processing a large corpus with a 70B model, and a multi-hour execution time is acceptable, then a rate of 0.3 tokens/second becomes viable in a background context. This enables local execution for tasks previously impossible on consumer hardware, albeit at a reduced speed.
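A quick sanity check of the overnight-batch claim, with a hypothetical corpus size of 10,000 output tokens (the corpus size is an assumption for illustration):

```rust
// Hours to generate N tokens at the measured 0.3 tok/s for the 70B
// streaming configuration; the 10,000-token corpus is a placeholder.
fn batch_hours(tokens: f64, tok_per_s: f64) -> f64 { tokens / tok_per_s / 3600.0 }

fn main() {
    println!("{:.1} h", batch_hours(10_000.0, 0.3));
}
```

Roughly nine hours: slow, but entirely workable as an unattended overnight job.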

Finally, the data consistently shows that Hypura introduces no overhead for models that fit entirely within GPU and RAM. This highlights a crucial point: more fast memory always helps. Investing in Apple hardware with ample unified memory therefore remains the most direct route to higher local LLM inference performance. Ultimately, Hypura is a mitigation for memory constraints, not a substitute for sufficient memory.

Hypura represents a sophisticated engineering solution that expands the capabilities of consumer hardware. It successfully addresses Out-Of-Memory crashes for large LLMs on Apple Silicon. However, it does not eliminate the fundamental latency challenges inherent in moving gigabytes of data between storage tiers. For interactive, high-speed inference with very large models, the architectural requirement remains more fast memory or offloading computation. Ultimately, Hypura shines as a specialized tool for workloads where latency isn't the primary concern, rather than a general-purpose accelerator.

Dr. Elena Vosk
specializes in large-scale distributed systems. Obsessed with CAP theorem and data consistency.