Quamba2: A Robust and Scalable Post-Training Quantization Framework for Selective State Space Models (SSMs)
Quamba2 highlights
Supports W4A8 / W4A16 / W4AX / W8A8 for Mamba1 and Mamba2 (see the precision sketch after this list)
Achieves 4x memory reduction and 3x generation speedup
Enables 8B model inference on Orin Nano 8G at 13 tokens/sec
Outperforms W4A8KV4 Llama3-8B in both speed and quality
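To make the precision labels above concrete: "WxAy" means weights are quantized to x-bit integers and activations to y-bit integers before the matrix multiply. The snippet below is a minimal, self-contained sketch of symmetric W4A8 quantization for a toy linear layer; it is not Quamba2's actual kernels, and the function names, scale shapes, and per-channel/per-tensor choices are illustrative assumptions.

```python
# Minimal sketch of W4A8: 4-bit weights, 8-bit activations (not Quamba2's implementation).
import numpy as np

def quantize_symmetric(x, n_bits, axis=None):
    """Symmetric uniform quantization to signed n_bits integers."""
    qmax = 2 ** (n_bits - 1) - 1                       # 7 for 4-bit, 127 for 8-bit
    max_abs = np.max(np.abs(x), axis=axis, keepdims=True)
    scale = np.maximum(max_abs, 1e-8) / qmax           # avoid division by zero
    q = np.clip(np.round(x / scale), -qmax, qmax).astype(np.int32)
    return q, scale

# Toy linear layer: y = x @ W.T
rng = np.random.default_rng(0)
W = rng.standard_normal((16, 64)).astype(np.float32)   # [out_features, in_features]
x = rng.standard_normal((4, 64)).astype(np.float32)    # [batch, in_features]

W_q, w_scale = quantize_symmetric(W, n_bits=4, axis=1)  # per-output-channel 4-bit weights
x_q, x_scale = quantize_symmetric(x, n_bits=8)          # per-tensor 8-bit activations

# Integer matmul, then dequantize with the product of the two scales.
y_quantized = (x_q @ W_q.T) * (x_scale * w_scale.T)
y_reference = x @ W.T
print("max abs error:", np.max(np.abs(y_quantized - y_reference)))
```

The same pattern extends to the other configurations: W4A16 keeps activations in 16-bit floating point, and W4AX mixes weight precisions across layers or heads.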
Background
Deploying state space models (SSMs), which excel at processing long sequences but demand...