fal.ai Blog | Generative AI Model Releases & Tutorials
  • Home
  • Docs
  • Discord

Yigithan Yigit

Chasing 6+ TB/s: an MXFP8 quantizer on Blackwell

We built an MXFP8 quantizer in CuTeDSL that hits 6+ TB/s on B200. The kernel writes scale factors directly into the packed layout that Blackwell's block-scaled Tensor Cores expect, so downstream GEMMs can consume them without an additional pack step. MXFP8 is a microscaling format (from the
Jan 27, 2026 · 7 min read
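
For readers new to microscaling, here is a minimal pure-Python sketch of the per-block scaling step that an MXFP8 quantizer performs. It is not the CuTeDSL kernel from the post. It assumes the OCP Microscaling convention of 32 elements per block, a power-of-two shared scale stored as a biased E8M0 exponent, and FP8 E4M3 elements with a maximum finite value of 448. The function name `quantize_blocks` is hypothetical. Rounding onto the E4M3 grid and the packed scale-factor layout for the Tensor Cores are deliberately left out.

```python
import math

BLOCK = 32         # elements sharing one scale (OCP MX convention, assumed here)
E4M3_MAX = 448.0   # largest finite FP8 E4M3 value
E4M3_EMAX = 8      # exponent of that value: 448 = 1.75 * 2**8
E8M0_BIAS = 127    # bias of the 8-bit exponent-only scale format


def quantize_blocks(values):
    """Return (biased E8M0 scales, scaled and clamped values) per 32-element block.

    Sketch only: the scaled values stay Python floats and are not rounded
    to the E4M3 grid.
    """
    if len(values) % BLOCK != 0:
        raise ValueError("length must be a multiple of 32")

    scales, scaled = [], []
    for start in range(0, len(values), BLOCK):
        block = values[start:start + BLOCK]
        amax = max(abs(v) for v in block)
        if amax == 0.0:
            exp = -E8M0_BIAS  # all-zero block: use the smallest scale
        else:
            # frexp gives amax = m * 2**e with 0.5 <= m < 1,
            # so floor(log2(amax)) == e - 1 exactly.
            _, e = math.frexp(amax)
            exp = (e - 1) - E4M3_EMAX
            exp = max(-E8M0_BIAS, min(E8M0_BIAS, exp))
        scale = 2.0 ** exp
        scales.append(exp + E8M0_BIAS)
        # Values up to just under 2x the E4M3 maximum can remain after
        # scaling, so saturate them to the representable range.
        scaled.extend(max(-E4M3_MAX, min(E4M3_MAX, v / scale)) for v in block)
    return scales, scaled


if __name__ == "__main__":
    s, q = quantize_blocks([0.5] * 32 + [1000.0] + [1.0] * 31)
    print(s, q[0], q[32], q[33])
```

In the second block the largest value is 1000, the shared scale becomes 2, and 1000 / 2 = 500 saturates to 448. This is why the clamp is needed even after scaling.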