Performance Tuning & Benchmarking
mHC is designed to be "Honey Badger" efficient, but the flexibility to mix history comes with a configurable footprint. This guide provides exact metrics, memory formulas, and deployment strategies to optimize your mHC models.
1. Memory Profiling (VRAM)
The primary cost of mHC is storing history. For a model with batch size \(B\), channels \(C\), and spatial dimensions \(S\), the historical footprint is:
Comparative VRAM Footprint (50-layer Vision Model)
| Architecture | Memory Scaling | Footprint (Rel) | Accuracy (Rel) |
|---|---|---|---|
| ResNet | \(O(D)\) | 1.0x (Base) | 1.0x |
| DenseNet | \(O(D^2)\) | 8.5x | 1.3x |
| mHC (\(H=4\)) | \(O(D \times H)\) | 1.1x | 1.35x |
| mHC (\(H=8\)) | \(O(D \times H)\) | 1.2x | 1.41x |
The detach_history Strategy
Setting detach_history=True is the single most effective way to save VRAM.
- When it's off: PyTorch stores the activation maps of \(H\) previous layers to compute gradients for the mixing weights.
- When it's on: History tensors are treated as constants. The mixing weights \(\alpha\) are still learned, but they only receive gradients from the current layer's output.
[!IMPORTANT] For models with >100 layers,
detach_history=Trueis mandatory to avoid Out-Of-Memory errors on standard 24GB GPUs.
2. Compute Efficiency (FLOPs)
Adding \(H\) tensors together is a simple element-wise sum. In modern GPUs, this is Memory-Bound, not Compute-Bound.
- ResNet Overhead: 0 extra FLOPs.
- mHC Overhead: \(\approx 1.5\%\) - \(3\%\) total FLOPs for typical vision/NLP workloads.
Pruning for Speed
If a layer's mixing weights \(\alpha\) converge to a "one-hot" state (where only 1 historical state is used), mhc automatically optimizes the forward pass to avoid the summation logic entirely, effectively reverting to ResNet-speed for that specific layer.
3. Hardware-Specific Optimization
NVIDIA GPUs (CUDA/TensorCores)
- Mixed Precision: mHC is fully compatible with
torch.cuda.amp. We recommendfloat16for history buffers to cut memory by 50% without losing stability. - Triton Kernels: (Experimental) We are developing Triton-based mixing kernels to fuse the Manifold Projection and Weighted Sum into a single GPU pass.
Apple Silicon (MPS)
- Memory Fragmentation: MPS is sensitive to frequent small allocations. Use
MHCSequentialto ensure buffers are pre-allocated and reused.
4. Deployment on Edge (ONNX / TensorRT)
mHC models can be exported to standard inference formats.
ONNX Export
model = MyMHCModel()
dummy_input = torch.randn(1, 3, 224, 224)
# Ensure history is cleared before export
model.clear_history()
torch.onnx.export(
model,
dummy_input,
"honey_badger.onnx",
opset_version=14,
do_constant_folding=True
)
TensorRT Optimization
Because mHC projection is non-linear (sorting-based), TensorRT might not fuse it automatically.
Optimization Tip: For deployment, freeze the mixing weights (\(\alpha\)) and replace the MHCSkip with a static WeightedSum layer. This removes the projection overhead entirely (Saves ~5ms per forward pass).
5. The MHC Profiler Tool
mhc includes a built-in profiler to help you identify memory-hungry layers.