A Community-Built Kernel Just Outperformed AMD’s Own Attention Library on Every Single Test

When a small team outside AMD beats AMD’s own optimized library on every tested shape and rounding mode, the question stops being about one kernel and starts being about who actually controls hardware performance. MoonMath AI has open-sourced a bf16 forward attention kernel for the AMD MI300X GPU, written in HIP and released under the MIT license, that outperforms AMD’s own AITER v3 across all benchmarked configurations — with per-shape RTNE speedups reaching 1.26× and geomean speedups of 1.18× (RTNE), 1.15× (RTNA), and 1.08× (RTZ). That last figure is the most telling: RTZ is AITER’s own fastest mode, and the MoonMath AI kernel still wins.

The kernel computes the standard scaled dot-product attention operation QKᵀ/√d in bf16 (brain float 16 — a 16-bit floating-point format that trades precision range for memory efficiency), targeting the gfx942 ISA of the AMD MI300X. It accepts tensors in both BSHD and BHSD layouts, fixes head dimension at 128, and is deterministic across all three bf16 rounding modes: RTNE (round to nearest even), RTNA (round to nearest, ties away from zero), and RTZ (truncate toward zero). The real-world consequence is immediate: a SGLang pull request integrating this kernel accelerated Wan2.1 video diffusion by 1.23× with no quality regression.

The broader implication is structural. Vendor-optimized libraries like AITER v3 are built to cover a wide range of hardware configurations and use cases. A community kernel that targets one chip — the MI300X — with surgical precision can exploit architectural details that a general-purpose library cannot afford to hard-code. MoonMath AI’s release makes that gap visible and reproducible.

How Three Memory Decisions Produced a Faster Kernel Than AMD’s Own

The performance advantage documented in the MoonMath AI release notes traces directly to three memory placement choices, each exploiting a specific feature of the CDNA3 compute unit. K streams from HBM into LDS (Local Data Share — a 32 KiB fast scratchpad shared within a compute unit), double-buffered and shared by all eight waves per block. V stays hot in L1 cache, resident and prefetched, so it is reread across every PV matrix multiply without returning to HBM. Q and the output accumulators live in VGPRs (Vector General-Purpose Registers) — read every iteration, never reloaded. The cascade effect matters: moving V to L1 freed the LDS that the third Q tile (3Q) then fills, raising data reuse per loaded K and V tile without adding synchronization cost.

The pipeline itself is structured as eight waves per block, split into two groups of four. Each K-block iteration involves two phases, bounded by two s_barrier instructions per iteration — a synchronization primitive that coordinates the phase handoff at each iteration boundary without per-counter waits. The matrix core never idles during this eight-wave pipeline execution. The team chose the 16×16×16 MFMA (Matrix Fused Multiply-Add) instruction over the 32×32×8 MFMA: both deliver the same throughput on the CDNA3 compute unit’s four SIMD units, but the 16×16×16 shape carries lower VGPR pressure and better power efficiency. The actual instruction is exposed through a one-instruction assembly wrapper: asm volatile("v_mfma_f32_16x16x16_bf16 %0, %1, %2, %0" : "+v"(c) : "v"(a), "v"(b)); declared as __device__ __forceinline__, bypassing compiler abstraction layers entirely.

A Flash-Decoding-style tail KV split rescues the stranded fractional round across MI300X’s 304 CUs — the 304 compute units that make up the full chip. Without this split, workloads with sequence lengths that do not divide evenly across CUs leave some units underutilized. The combination of these wins cascades: the benchmark table across five input shapes (B, H, S, D) — including (2, 24, 8192, 128), (2, 24, 16384, 128), (4, 16, 16384, 128), (2, 24, 32768, 128), and (1, 16, 131072, 128) — shows the MoonMath AI kernel posting 3.083 ms against AITER v3’s 3.792 ms on the smallest shape (1.23× vs AITER), and 11.670 ms against 14.691 ms on the next (1.26× vs AITER). Against Modular MAX, the geomean advantage runs from 1.44× to 1.49×, with per-shape speedups reaching 1.59×. This is not a marginal improvement on a single configuration — it holds across every shape and every rounding mode tested.

The Portability Trade-Off That Every Adopter Must Price In

The MoonMath AI kernel’s performance is inseparable from its constraints. It is hard-coded to the gfx942 ISA — the instruction set architecture of the AMD MI300X — and will not run on other hardware. The head dimension is fixed at 128, matching the bf16 ULP (Unit in the Last Place, the smallest representable difference between adjacent floating-point values) precision requirements of the target workload. There is no causal mask support, no GQA (Grouped Query Attention, a technique that reduces key-value memory by sharing attention heads across query groups), and no varlen batching (variable-length sequence batching, which allows a single kernel call to process sequences of different lengths). These are not oversights — they are the price of the performance. Every generalization that AITER v3 supports is a constraint that the MoonMath AI kernel deliberately avoids.

The comparison with FlashAttention-3 is instructive here. FlashAttention-3’s matmul and softmax alternation and FA3’s producer and consumer warp split represent a different architectural philosophy — one that uses an asynchronous memory move with a dedicated producer wave to overlap computation and data transfer. The MoonMath AI kernel does not adopt this approach on CDNA3, instead relying on the s_barrier-bounded two-phase iteration to keep the matrix core busy. Whether this is the right trade-off depends entirely on the workload: for non-causal, fixed-head-dimension, full-sequence attention on MI300X, the numbers say yes. For anything requiring causal masking, GQA, or variable-length batching, the kernel is currently not applicable.

The RTZ result deserves specific attention. RTZ is AITER’s own fastest mode — truncation toward zero avoids the tie-breaking logic of RTNE and RTNA, making it the least computationally expensive rounding path. The MoonMath AI kernel’s geomean speedup of 1.08× over AITER v3 in RTZ mode, with the tightest race occurring at the (4, 16, 16384) RTZ shape where the range narrows to 0.95× to 1.07×, shows that the advantage compresses but does not disappear even on AITER’s home turf. That compression is the honest signal: the kernel’s lead is real but not uniform, and workloads that happen to land on the shapes where RTZ narrows the gap will see less benefit.

Kernel / Library	Key Differentiator	Best For
MoonMath AI HIP Kernel	gfx942-specific memory layout (K in LDS, V in L1, Q in VGPRs); geomean 1.18×/1.15×/1.08× over AITER v3	Non-causal, fixed head-dim 128 attention on AMD MI300X
AMD AITER v3	Vendor-maintained, broader shape and feature coverage including GQA and varlen batching	Production deployments requiring causal masks, GQA, or variable-length sequences
Modular MAX	Cross-hardware portability; MoonMath AI kernel leads by 1.44× to 1.49× geomean on MI300X	Multi-hardware deployments where MI300X is not the exclusive target

📊 Key Numbers

Geomean speedup vs AITER v3 (RTNE): 1.18× across all tested shapes
Geomean speedup vs AITER v3 (RTNA): 1.15× across all tested shapes
Geomean speedup vs AITER v3 (RTZ): 1.08× — RTZ is AITER’s own fastest mode
Peak per-shape RTNE speedup vs AITER v3: 1.26× (shape 2, 24, 16384, 128)
SGLang / Wan2.1 video diffusion speedup: 1.23× with no quality regression
Geomean speedup vs Modular MAX: 1.44× to 1.49×; per-shape speedups reach 1.59×
Benchmark timing sample (RTNE, shape 2, 24, 8192, 128): 3.083 ms (MoonMath AI) vs 3.792 ms (AITER v3) — 1.23×
Benchmark timing sample (RTNE, shape 2, 24, 16384, 128): 11.670 ms vs 14.691 ms — 1.26×
LDS allocation: 32 KiB, double-buffered, shared by all eight waves per block
MFMA instruction selected: 16×16×16 bf16 (same throughput as 32×32×8, lower VGPR pressure)
MI300X compute units targeted: 304 CUs across the full chip via Flash-Decoding-style tail KV split

🔍 Context

The benchmarks and architectural analysis were published directly by MoonMath AI as part of the open-source release documentation, making the team itself the primary source for all performance claims — independent third-party replication has not yet been reported. The specific gap this kernel addresses is the absence of a publicly available, gfx942-optimized attention kernel that exploits the CDNA3 memory hierarchy at the assembly level; AITER v3 covers a broader feature surface but does not hard-code the K-in-LDS, V-in-L1, Q-in-VGPR placement that the MoonMath AI kernel treats as fixed. This release arrives as AMD’s MI300X has become a primary alternative to NVIDIA hardware for large-scale inference deployments, creating direct demand for software that closes the performance gap between the two ecosystems — a gap that has historically been wider on the software side than the hardware side. The closest architectural alternative is AITER v3 itself, which supports causal masks, GQA, and varlen batching that the MoonMath AI kernel explicitly omits; teams needing those features cannot substitute this kernel without modification. The SGLang integration and the Wan2.1 video diffusion result establish that the kernel’s gains transfer from synthetic benchmarks to a real inference serving framework, which is the condition that makes the release practically relevant rather than academically interesting.

💡 AIUniverse Analysis

Our reading: The genuine advance here is not the speedup number itself but the mechanism that produces it. MoonMath AI’s team identified that the CDNA3 compute unit’s four SIMD units and 32 KiB LDS create a specific opportunity: if you fix head dimension at 128 and abandon generality, you can keep the matrix core busy through every K-block iteration by using two s_barrier instructions per iteration rather than per-counter waits, and by letting V stay resident in L1 rather than reloading it from HBM. The 3Q tile optimization — processing three Q tiles per loaded K and V tile — is the compounding move that turns a good memory layout into a consistently faster kernel. The fact that this holds even in RTZ mode, where AITER v3 is at its fastest, means the advantage is structural rather than a rounding-mode artifact.

The shadow is the feature list that does not exist. No causal mask means no decoder-only transformer inference in the standard autoregressive configuration. No GQA means no efficient inference for models like Llama 3 or Mistral that use grouped query attention to reduce KV cache memory. No varlen batching means no production serving system that packs variable-length sequences into a single kernel call for throughput efficiency. The 1.23× speedup on Wan2.1 video diffusion in SGLang is real and meaningful — but Wan2.1 is a diffusion model with non-causal attention, which is precisely the narrow workload profile where this kernel applies. A team deploying a decoder LLM on MI300X cannot use this kernel today without significant modification. The benchmark comparison against Modular MAX, which shows 1.44× to 1.49× geomean advantage, is also self-reported and has not been independently verified against Modular MAX’s current production configuration.

For this to matter in 12 months, either MoonMath AI or the community will need to extend the kernel to support causal masking and GQA — at which point the portability trade-off becomes a genuine architectural choice rather than a current limitation. If those extensions preserve even 80% of the current speedup, the case for gfx942-specific kernels over general-purpose vendor libraries becomes difficult to argue against.

⚖️ AIUniverse Verdict

✅ Promising. The geomean 1.18× RTNE speedup over AITER v3 is independently meaningful, but the SGLang/Wan2.1 integration result is what converts a benchmark into a production signal. The feature gaps — causal mask, GQA, varlen batching — are the only thing separating this from a tier-1 recommendation for any MI300X deployment.

Analysis based on reporting by MarkTechPost. Original article here.

A Community-Built Kernel Just Outperformed AMD’s Own Attention Library on Every Single Test

ByAI Universe

A Community-Built Kernel Just Outperformed AMD’s Own Attention Library on Every Single Test

How Three Memory Decisions Produced a Faster Kernel Than AMD’s Own

The Portability Trade-Off That Every Adopter Must Price In

📊 Key Numbers

🔍 Context

💡 AIUniverse Analysis

⚖️ AIUniverse Verdict

By AI Universe

Related Post

Zyphra’s TSP Strategy Slashes AI Model Training Time by 2.6x

Open-Sourced FlashKDA Delivers Major Speedups for Long-Context AI Models

Meta Is Building an AI Compute Stack That No Longer Depends on Any Single Chip

You missed

DeepSeek Cuts AI Generation Time Up To 85% With New Optimization Framework

OpenAI and Broadcom Forge a Path to Bespoke AI Silicon

Why Meta Had to Reinvent the Battery to Make AI Glasses Actually Work

A Community-Built Kernel Just Outperformed AMD’s Own Attention Library on Every Single Test