We built an MXFP8 quantizer in CuTeDSL that hits 6+ TB/s on B200. The kernel writes scale factors directly into the packed layout that Blackwell's block-scaled Tensor Cores expect, so downstream GEMMs can consume them without an additional pack step.
MXFP8 is a microscaling format (from the