Skip to content

Troubleshooting & Deep Architecture FAQ

mHC is a "Honey Badger"β€”it doesn't care about depthβ€”but complex PyTorch integrations can still cause friction. This guide provides an exhaustive list of known edge cases, architectural "gotchas," and precision fixes.


πŸ›‘ Numerical Stability Issues

1. "My Loss is NaN from the first step"

  • The Diagnosis: Usually caused by using a high learning_rate with mode="hc" (Softmax). Softmax has a "long tail" that can sometimes accumulate gradients too aggressively.
  • The Cure:
    1. Switch to mode="mhc". The Euclidean Projection used in mhc mode is naturally more robust because it can zero-out noisy history.
    2. Check for input_normalization. mHC amplifies existing signal; if your inputs are not zero-mean, they will quickly saturate the history mixing.

2. "Floating Point Underflow in Simplex Projection"

  • The Case: Logging shows that mixing weights \(\alpha\) are extremely small (\(1e-7\)) but not exactly zero.
  • The Cure: Adjust the temperature. If \(\tau\) is too high, the weights will stay small and uniform.
  • Sweet Spot: Maintain \(\tau\) between \(0.5\) and \(2.0\).

πŸš€ Performance & Device Issues

3. "RuntimeError: Expected all tensors to be on the same device"

  • The Diagnosis: This happens if you instantiate layers manually and use HistoryBuffer without correctly moving it to the GPU.
  • The Cure: Use MHCSequential. It is "Device Aware" and will automatically move history buffers whenever the model moves (model.to('cuda')).

4. "OOM (Out of Memory) during Backprop"

  • The Case: Your model fits in VRAM during the forward pass but crashes during loss.backward().
  • The Diagnosis: You are using detach_history=False (the default). PyTorch is trying to store the gradient graphs for \(H\) previous layers.
  • The Cure: Set detach_history=True. This is the recommended mode for models with >100 layers. It cuts the memory footprint by up to 5x with minimal impact on accuracy.

πŸ” Visual Debugging & Monitoring

5. Heatmap "Symmetry"

If you use visualization tools and see that the mixing weights \(\alpha\) look identical for all layers, check your clear_history setting. If history is never cleared between batches, the manifold might "collapse" into a single global state.

6. Early Warning Signs of Divergence

  • Entropy Spike: If the entropy of your mixing weights suddenly jumps to maximum (\(log(H)\)), it means the model has lost its specialized paths and is defaulting to a noisy average.
  • Weight Saturation: If \(\alpha_{latest}\) hits \(0.99\) constantly, your network is ignoring history and behaving like a standard sequential model. Increase the temperature or decrease the epsilon to force more historical exploration.

πŸ”Œ Integration "Gotchas"

7. "My custom layer is being ignored by the Dashboard"

  • The Cure: You must use the @mhc_compatible decorator on your custom nn.Module. This adds a secret marker that the MHCLightningCallback uses to find your layers during model traversal.
from mhc import mhc_compatible

@mhc_compatible
class MySuperLayer(nn.Module):
    # ... your logic here

8. "Mixing weights are all zeros"

  • The Diagnosis: Mathematically impossible if using simplex or identity constraints.
  • Check: Are you extracting parameters from the model before any data has been pushed? Buffers are initialized with zeros and only populate after the first forward() call.