Troubleshooting & Deep Architecture FAQ
mHC is a "Honey Badger"βit doesn't care about depthβbut complex PyTorch integrations can still cause friction. This guide provides an exhaustive list of known edge cases, architectural "gotchas," and precision fixes.
π Numerical Stability Issues
1. "My Loss is NaN from the first step"
- The Diagnosis: Usually caused by using a high
learning_ratewithmode="hc"(Softmax). Softmax has a "long tail" that can sometimes accumulate gradients too aggressively. - The Cure:
- Switch to
mode="mhc". The Euclidean Projection used inmhcmode is naturally more robust because it can zero-out noisy history. - Check for
input_normalization. mHC amplifies existing signal; if your inputs are not zero-mean, they will quickly saturate the history mixing.
- Switch to
2. "Floating Point Underflow in Simplex Projection"
- The Case: Logging shows that mixing weights \(\alpha\) are extremely small (\(1e-7\)) but not exactly zero.
- The Cure: Adjust the
temperature. If \(\tau\) is too high, the weights will stay small and uniform. - Sweet Spot: Maintain \(\tau\) between \(0.5\) and \(2.0\).
π Performance & Device Issues
3. "RuntimeError: Expected all tensors to be on the same device"
- The Diagnosis: This happens if you instantiate layers manually and use
HistoryBufferwithout correctly moving it to the GPU. - The Cure: Use
MHCSequential. It is "Device Aware" and will automatically move history buffers whenever the model moves (model.to('cuda')).
4. "OOM (Out of Memory) during Backprop"
- The Case: Your model fits in VRAM during the forward pass but crashes during
loss.backward(). - The Diagnosis: You are using
detach_history=False(the default). PyTorch is trying to store the gradient graphs for \(H\) previous layers. - The Cure: Set
detach_history=True. This is the recommended mode for models with >100 layers. It cuts the memory footprint by up to 5x with minimal impact on accuracy.
π Visual Debugging & Monitoring
5. Heatmap "Symmetry"
If you use visualization tools and see that the mixing weights \(\alpha\) look identical for all layers, check your clear_history setting. If history is never cleared between batches, the manifold might "collapse" into a single global state.
6. Early Warning Signs of Divergence
- Entropy Spike: If the entropy of your mixing weights suddenly jumps to maximum (\(log(H)\)), it means the model has lost its specialized paths and is defaulting to a noisy average.
- Weight Saturation: If \(\alpha_{latest}\) hits \(0.99\) constantly, your network is ignoring history and behaving like a standard sequential model. Increase the
temperatureor decrease theepsilonto force more historical exploration.
π Integration "Gotchas"
7. "My custom layer is being ignored by the Dashboard"
- The Cure: You must use the
@mhc_compatibledecorator on your customnn.Module. This adds a secret marker that theMHCLightningCallbackuses to find your layers during model traversal.
8. "Mixing weights are all zeros"
- The Diagnosis: Mathematically impossible if using
simplexoridentityconstraints. - Check: Are you extracting parameters from the model before any data has been pushed? Buffers are initialized with zeros and only populate after the first
forward()call.