Abdussamet Turker

Crafting Efficient Kernels with Epilogue Fusion

In many ML workloads, a GEMM is followed by small operations such as bias addition, activation, scaling, or type conversion. These ops are cheap in arithmetic, but when run as separate kernels they often cost extra global memory traffic (store the GEMM result, read it back, write it again). Epilogue fusion is a way to avoid this: we can