In many ML workloads, a GEMM is followed by small operations like bias, activation, scaling, or type conversion. These ops are cheap in math, but they often cost extra global memory traffic (store GEMM result, read it back, write again).
Epilogue fusion is a way to avoid this, we can