# Pull Request #7298: [Gluon][Tutorial] Persistent Attention

## Overview

This pull request rewrites the attention kernel in the Gluon tutorial to be persistent. The persistent kernel improves performance at small context sizes. For fp16 at large context sizes, however, performance regresses due to a ptxas instruction-scheduling issue in the softmax partition. The fp8 version is roughly 100 TFLOPS faster when the kernel name contains "cutlass". (A minimal sketch of the persistent-program pattern is included at the end of this summary.)

## Performance Results

Benchmarks compare triton-fp16 and triton-fp8 attention with the following parameters (a hedged sketch of a comparable benchmark sweep also appears at the end of this summary):

- Z=4, H=32
- D=64 and D=128
- causal=True and causal=False
- context size N_CTX swept from 1024 to 65536

| Config | N_CTX | triton-fp16 (TFLOPS) | triton-fp8 (TFLOPS) |
|--------------------------------|-------|---------|---------|
| Z=4, H=32, D=64, causal=False  | 1024  | 359.57  | 370.12  |
|                                | 65536 | 699.87  | 728.56  |
| Z=4, H=32, D=64, causal=True   | 1024  | 181.88  | 177.98  |
|                                | 65536 | 692.95  | 694.65  |
| Z=4, H=32, D=128, causal=False | 1024  | 718.58  | 709.86  |
|                                | 65536 | 1064.59 | 1518.74 |
| Z=4, H=32, D=128, causal=True  | 1024  | 355.64  | 351.16  |
|                                | 65536 | 970.53  | 1413.28 |

## Additional Notes

- The previous best results, from before the conversion to a persistent kernel, are included for comparison.
- The fp16 performance drop at large context sizes is acknowledged and attributed to a ptxas instruction-scheduling issue with no known workaround.
- The "cutlass" naming trick: a kernel name containing "cutlass" enables an NVIDIA-specific optimization bit that changes instruction scheduling in ptxas. The optimization is likely experimental and unstable, so other kernels should not expect the same gain simply by renaming (see the naming illustration at the end of this summary).
- The review discussion also covers layout mismatches between blocked and sliced layouts in the kernel's broadcast phase.
- Open questions remain about how reliable and durable the "cutlass" speedup is.
- Multiple collaborators reviewed and approved the changes, and the pull request was merged on July 9, 2025.

## Key Contributors

- Author and main committer: Mogball
- Reviewers: Jokeren, ThomasRaoux, peterbell10, aeng-openai, among others

## Summary

This PR introduces a persistent attention kernel in the Gluon tutorial, improving performance particularly at small context sizes. Naming the kernel with "cutlass" yields a substantial fp8 speedup, attributed to a specialized NVIDIA instruction-scheduling optimization in ptxas. Although fp16 performance at large context sizes suffers from a known compiler scheduling issue, the change is a significant step in optimizing the attention kernel in Triton's Gluon tutorials.

---

For detailed performance tables, code snippets, and review discussions, refer to the original GitHub pull request #7298 in the triton-lang/triton repository.
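
## Appendix: Illustrative Sketches

The PR makes the tutorial's attention kernel persistent. As a point of reference only, here is a minimal sketch of the persistent-program pattern in plain Triton (not the PR's Gluon code, and not attention): a fixed number of programs, roughly one per SM, is launched once, and each program loops over many tiles rather than handling exactly one.

```python
import torch
import triton
import triton.language as tl


@triton.jit
def persistent_scale_kernel(x_ptr, out_ptr, n_elements, scale,
                            NUM_PROGRAMS: tl.constexpr,
                            BLOCK_SIZE: tl.constexpr):
    # Each program owns tiles pid, pid + NUM_PROGRAMS, pid + 2*NUM_PROGRAMS, ...
    pid = tl.program_id(axis=0)
    num_tiles = tl.cdiv(n_elements, BLOCK_SIZE)
    for tile in range(pid, num_tiles, NUM_PROGRAMS):
        offsets = tile * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
        mask = offsets < n_elements
        x = tl.load(x_ptr + offsets, mask=mask)
        tl.store(out_ptr + offsets, x * scale, mask=mask)


def scale_persistent(x: torch.Tensor, scale: float) -> torch.Tensor:
    out = torch.empty_like(x)
    # Launch about one program per SM so the programs stay resident
    # ("persistent") for the whole problem, instead of one program per tile.
    num_programs = torch.cuda.get_device_properties(x.device).multi_processor_count
    persistent_scale_kernel[(num_programs,)](
        x, out, x.numel(), scale,
        NUM_PROGRAMS=num_programs, BLOCK_SIZE=1024,
    )
    return out
```

The improvement the PR reports at small context sizes is consistent with this pattern: when there are few tiles of work, a persistent launch avoids paying per-program launch and prologue costs for every tile.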
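
The roughly 100 TFLOPS fp8 gain is tied to the kernel name containing "cutlass", which flips an NVIDIA-specific instruction-scheduling heuristic in ptxas. As a hypothetical illustration only (the PR's actual kernel and naming mechanism may differ), a `@triton.jit` kernel's PTX entry point is derived from the Python function name, so putting "cutlass" in that name is what exposes the generated code to the heuristic:

```python
import triton
import triton.language as tl


# Placeholder kernel: the point is only that the Python function name, and
# therefore the PTX entry name seen by ptxas, contains the substring "cutlass".
@triton.jit
def attn_fwd_cutlass(x_ptr, out_ptr, n_elements, BLOCK_SIZE: tl.constexpr):
    offsets = tl.program_id(axis=0) * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
    mask = offsets < n_elements
    tl.store(out_ptr + offsets, tl.load(x_ptr + offsets, mask=mask), mask=mask)
```

As the PR discussion notes, this behavior is likely experimental and unstable, so it should not be treated as a supported tuning knob.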
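
The table's configurations map directly onto Triton's standard benchmarking helpers. Below is a hedged sketch, not the PR's harness, of how such a sweep could be driven with `triton.testing`; `attention(q, k, v, causal, sm_scale)` stands in for the tutorial's attention entry point and is assumed, and only the fp16 line is shown (fp8 would additionally need dtype conversion of the inputs).

```python
import torch
import triton
import triton.testing

Z, H = 4, 32  # batch size and head count from the table above

configs = [
    triton.testing.Benchmark(
        x_names=["N_CTX"],
        x_vals=[2**i for i in range(10, 17)],  # 1024 ... 65536
        line_arg="provider",
        line_vals=["triton-fp16"],
        line_names=["triton-fp16"],
        ylabel="TFLOPS",
        plot_name=f"attention-Z{Z}-H{H}-D{D}-causal={causal}",
        args={"D": D, "causal": causal},
    )
    for D in (64, 128)
    for causal in (False, True)
]


@triton.testing.perf_report(configs)
def bench_attention(N_CTX, D, causal, provider):
    dtype = torch.float16
    q = torch.randn((Z, H, N_CTX, D), device="cuda", dtype=dtype)
    k = torch.randn_like(q)
    v = torch.randn_like(q)
    # `attention` is a placeholder for the tutorial's forward entry point.
    ms = triton.testing.do_bench(lambda: attention(q, k, v, causal, D**-0.5))
    flops = 2 * 2.0 * Z * H * N_CTX * N_CTX * D  # QK^T and PV matmuls
    if causal:
        flops *= 0.5
    return flops * 1e-12 / (ms * 1e-3)  # report TFLOPS


if __name__ == "__main__":
    bench_attention.run(print_data=True)
```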