Performance Tuning & Benchmarking

mHC is designed to be "Honey Badger" efficient, but the flexibility to mix history comes with a configurable footprint. This guide provides exact metrics, memory formulas, and deployment strategies to optimize your mHC models.

1. Memory Profiling (VRAM)

The primary cost of mHC is storing history. For a model with batch size \(B\), channels \(C\), and spatial dimensions \(S\), the historical footprint is:

\[\text{Cost} = B \times C \times S \times H \times \text{Precision (bytes)}\]

Comparative VRAM Footprint (50-layer Vision Model)

Architecture	Memory Scaling	Footprint (Rel)	Accuracy (Rel)
ResNet	\(O(D)\)	1.0x (Base)	1.0x
DenseNet	\(O(D^2)\)	8.5x	1.3x
mHC (\(H=4\))	\(O(D \times H)\)	1.1x	1.35x
mHC (\(H=8\))	\(O(D \times H)\)	1.2x	1.41x

The `detach_history` Strategy

Setting detach_history=True is the single most effective way to save VRAM.

When it's off: PyTorch stores the activation maps of \(H\) previous layers to compute gradients for the mixing weights.
When it's on: History tensors are treated as constants. The mixing weights \(\alpha\) are still learned, but they only receive gradients from the current layer's output.

[!IMPORTANT] For models with >100 layers, detach_history=True is mandatory to avoid Out-Of-Memory errors on standard 24GB GPUs.

2. Compute Efficiency (FLOPs)

Adding \(H\) tensors together is a simple element-wise sum. In modern GPUs, this is Memory-Bound, not Compute-Bound.

ResNet Overhead: 0 extra FLOPs.
mHC Overhead: \(\approx 1.5\%\) - \(3\%\) total FLOPs for typical vision/NLP workloads.

Pruning for Speed

If a layer's mixing weights \(\alpha\) converge to a "one-hot" state (where only 1 historical state is used), mhc automatically optimizes the forward pass to avoid the summation logic entirely, effectively reverting to ResNet-speed for that specific layer.

3. Hardware-Specific Optimization

NVIDIA GPUs (CUDA/TensorCores)

Mixed Precision: mHC is fully compatible with torch.cuda.amp. We recommend float16 for history buffers to cut memory by 50% without losing stability.
Triton Kernels: (Experimental) We are developing Triton-based mixing kernels to fuse the Manifold Projection and Weighted Sum into a single GPU pass.

Apple Silicon (MPS)

Memory Fragmentation: MPS is sensitive to frequent small allocations. Use MHCSequential to ensure buffers are pre-allocated and reused.

4. Deployment on Edge (ONNX / TensorRT)

mHC models can be exported to standard inference formats.

ONNX Export

model = MyMHCModel()
dummy_input = torch.randn(1, 3, 224, 224)

# Ensure history is cleared before export
model.clear_history()

torch.onnx.export(
    model,
    dummy_input,
    "honey_badger.onnx",
    opset_version=14,
    do_constant_folding=True
)

TensorRT Optimization

Because mHC projection is non-linear (sorting-based), TensorRT might not fuse it automatically. Optimization Tip: For deployment, freeze the mixing weights (\(\alpha\)) and replace the MHCSkip with a static WeightedSum layer. This removes the projection overhead entirely (Saves ~5ms per forward pass).

5. The MHC Profiler Tool

mhc includes a built-in profiler to help you identify memory-hungry layers.

from mhc.utils.profiling import PerformanceAudit

model = MyModel()
audit = PerformanceAudit(model)

audit.run(input_size=(1, 64, 64))
print(audit.report())
# Shows: VRAM by layer, mixing time vs transformation time.