Yigithan Yigit

Chasing 6+ TB/s: an MXFP8 quantizer on Blackwell

We built an MXFP8 quantizer in CuTeDSL that hits 6+ TB/s on B200. The kernel writes scale factors directly into the packed layout that Blackwell's block-scaled Tensor Cores expect, so downstream GEMMs can consume them without an additional pack step. MXFP8 is a microscaling format (from the