Quamba2: a robust and scalable post-training quantization framework for selective state space models (SSMs)

Quamba2 highlights

  • Supports W4A8 / W4A16 / W4AX / W8A8 for Mamba1 and Mamba2
  • Achieves 4x memory reduction and 3x generation speedup
  • Enables 8B model inference on Orin Nano 8G at 13 tokens/sec
  • Outperforms W4A8KV4 Llama3-8B in both speed and quality

Background

Deploying state space models (SSMs), which excel at processing long sequences but demand substantial memory and computation, is challenging on resource-constrained hardware. Conventional post-training quantization (PTQ) methods that work well for Transformer architectures tend to fail on the distinctive structure of SSMs, particularly the linear recurrence (selective scan) component. The result is either unacceptable accuracy degradation or insufficient reductions in latency and memory footprint, making it difficult to achieve both efficiency and quality in real-world applications on mobile and edge devices.

Researchers at The University of Texas at Austin previously developed and patented Quamba, a groundbreaking quantization method for SSMs. Quamba applies static 8-bit per-tensor quantization and introduced two key innovations: suppressing the largest input activation values so the remaining range can be quantized more finely, and applying the Hadamard transform so that output activations are quantized in an outlier-free space. Quamba2 builds on this work and further advances the state of the art in SSM quantization.
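
To make these two ideas concrete, the following is a minimal sketch in PyTorch of percentile-based activation clipping and a Hadamard rotation before 8-bit quantization; the percentile value, helper names, and matrix construction are illustrative assumptions, not the patented implementation.

    import torch

    def hadamard_matrix(n: int) -> torch.Tensor:
        # Normalized Walsh-Hadamard matrix for n a power of two (Sylvester construction).
        H = torch.ones(1, 1)
        while H.shape[0] < n:
            H = torch.cat([torch.cat([H, H], dim=1),
                           torch.cat([H, -H], dim=1)], dim=0)
        return H / (n ** 0.5)

    def calibrate_scale(x_calib: torch.Tensor, clip_percentile: float = 99.9) -> float:
        # Static per-tensor scale: clipping to a high percentile of |x| suppresses the
        # largest activation values so the remaining range is quantized more finely.
        clip_val = torch.quantile(x_calib.abs().flatten().float(), clip_percentile / 100.0)
        return (clip_val / 127.0).item()

    def quantize_int8(x: torch.Tensor, scale: float) -> torch.Tensor:
        return torch.clamp(torch.round(x / scale), -128, 127).to(torch.int8)

    def quantize_output(y: torch.Tensor, scale: float, H: torch.Tensor) -> torch.Tensor:
        # Rotate output activations into an (approximately) outlier-free space, then quantize.
        return quantize_int8(y @ H, scale)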

Technical description

The Quamba2 framework implements post-training quantization for state space models by exploiting inherent properties of these models to reduce precision while preserving computational invariance. It quantizes the linear recurrence inputs with an offline approach that sorts and clusters channels and applies per-state-group quantization to the associated parameters, followed by an offline weight rearrangement that keeps the computation invariant to the reordering. It also uses the Walsh-Hadamard transform to shift output activations into an outlier-free space and applies group quantization to further refine the precision of inputs and their associated parameters. Multiple bit-width configurations (W8A8, W4A8, and W4A16) balance memory reduction, speed, and accuracy, enabling deployment on resource-constrained hardware while integrating with existing libraries such as PyTorch and CUTLASS.
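
The sort-and-cluster step can be pictured with the short sketch below, which assumes calibrated per-channel maxima are available offline; the function names, the number of groups, and the simplification to contiguous clusters after sorting are assumptions for exposition rather than the paper's exact procedure.

    import torch

    def sort_and_cluster(channel_absmax: torch.Tensor, n_groups: int = 8):
        # Offline: order channels by their calibrated |max|, split them into contiguous
        # groups, and give each group its own int8 scale.
        perm = torch.argsort(channel_absmax)
        groups = torch.chunk(perm, n_groups)
        scales = torch.stack([channel_absmax[g].max() / 127.0 for g in groups])
        return perm, groups, scales

    def quantize_grouped(x: torch.Tensor, groups, scales) -> torch.Tensor:
        # Online: quantize activations with a separate scale per channel group.
        x_q = torch.empty_like(x, dtype=torch.int8)
        for g, s in zip(groups, scales):
            x_q[..., g] = torch.clamp(torch.round(x[..., g] / s), -128, 127).to(torch.int8)
        return x_q

    # The permutation `perm` would be applied once, offline, to the weights of the
    # projection that produces x, so channels arrive pre-sorted at run time and the
    # model output is unchanged (compute-invariance).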

Benefits

  • Memory efficiency: Quamba2 achieves up to a 4× memory reduction compared to full-precision models, enabling the deployment of large-scale language models on resource-limited hardware (e.g., Nvidia Orin Nano 8G), an advantage not offered by conventional PTQ techniques for Transformers.
  • Improved latency and throughput: With a 1.3× speed-up in the prefilling stage and a 3× speed-up in the generation stage, Quamba2 outperforms both PTQ methods designed for Transformers and earlier SSM quantization approaches, delivering faster inference.
  • Flexible bit-width configurations: By supporting multiple configurations (W8A8, W4A8, and W4A16), Quamba2 provides tailored tradeoffs for different deployment scenarios, unlike existing approaches that often rely on a single fixed bit-width, enhancing efficiency for both cloud and edge applications (see the weight-quantization sketch after this list).
  • Enhanced accuracy preservation: Despite aggressive quantization, the framework limits accuracy degradation to only a 1.6% drop, outperforming many existing quantization strategies that suffer significant accuracy losses when applied to state space models.
  • Specialized adaptation to state space models: Quamba2’s design leverages the channel order preserving and activation persistence properties of SSMs while preserving compute invariance, addressing the shortcomings of PTQ methods originally devised for Transformers, which struggle with the selective scan (linear recurrence) component.
  • Innovative quantization techniques: The integration of techniques such as the Walsh-Hadamard transformation and group quantization leads to improved outlier handling and precision in the quantization process, offering a technical edge over traditional quantization approaches that fail to optimize these aspects.
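
To make the bit-width tradeoff concrete, the sketch below shows a simple symmetric 4-bit group quantization of a weight matrix, i.e., the W4 side of a W4A8 or W4A16 configuration; the group size and helper names are illustrative assumptions, not the framework's fixed choices.

    import torch

    def quantize_weight_w4(w: torch.Tensor, group_size: int = 128):
        # Each group of `group_size` input channels shares one scale; int4 range is [-8, 7].
        out_features, in_features = w.shape
        assert in_features % group_size == 0
        w_g = w.reshape(out_features, in_features // group_size, group_size)
        scales = w_g.abs().amax(dim=-1, keepdim=True).clamp_min(1e-8) / 7.0
        w_q = torch.clamp(torch.round(w_g / scales), -8, 7).to(torch.int8)
        return w_q.reshape(out_features, in_features), scales.squeeze(-1)

    def dequantize_weight_w4(w_q: torch.Tensor, scales: torch.Tensor, group_size: int = 128):
        # Recover an approximate floating-point weight for reference or accuracy checks.
        out_features, in_features = w_q.shape
        w_g = w_q.reshape(out_features, in_features // group_size, group_size).float()
        return (w_g * scales.unsqueeze(-1)).reshape(out_features, in_features)

In rough terms, pairing such 4-bit weights with 8-bit activations (W4A8) maximizes memory savings and throughput, W4A16 keeps activations at 16 bits to retain more accuracy at a modest cost, and W8A8 quantizes both sides to 8 bits.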

Commercial applications

  • Edge AI applications
  • Mobile AI deployments
  • Autonomous vehicle processing
  • Cloud-based NLP services
  • Virtual reality computing

Publication link

https://arxiv.org/pdf/2503.22879