Dense Matrix Multiplication’s Dominance Is Being Challenged — And the Numbers Back It Up

Feedforward layers sit at the heart of every large language model, consuming over two-thirds of model parameters and more than 80% of total floating-point operations. Researchers from Sakana AI and NVIDIA have now published an ICML 2026 paper introducing TwELL — a tile-wise ELLPACK sparse format with fused CUDA kernels — that squeezes 20.5% faster inference and 21.9% faster training out of those layers without changing the underlying model architecture. The paper is available on arXiv under identifier 2603.23198.
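
As a rough sanity check on those ratios, here is a minimal sketch under textbook assumptions (un-gated feedforward with d_ff = 4 * d_model, counting only the attention projection weights, ignoring biases, norms, and embeddings); it recovers the roughly two-thirds parameter share, while the greater-than-80% FLOP figure quoted above comes from the paper, not from this toy accounting.

```python
# Rough per-layer parameter accounting for a transformer block, as a sanity
# check on the feedforward share. Assumptions (not from the paper): un-gated
# FFN with d_ff = 4 * d_model, attention projections W_q/W_k/W_v/W_o only,
# biases, norms, and embeddings ignored.

def ffn_share(d_model, d_ff=None):
    d_ff = d_ff if d_ff is not None else 4 * d_model
    attn_params = 4 * d_model * d_model      # query, key, value, output projections
    ffn_params = 2 * d_model * d_ff          # up- and down-projection matrices
    # Each weight contributes ~2 matmul FLOPs per token, so under these
    # assumptions the FLOP split between FFN and attention projections
    # mirrors the parameter split.
    return ffn_params / (attn_params + ffn_params)

for d in (1024, 4096, 8192):
    print(f"d_model={d}: FFN share of block params ~ {ffn_share(d):.0%}")
```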

The core insight is deceptively simple: for any given token processed by an LLM, only a tiny fraction of hidden neurons actually fire. The rest output zero after the activation function, a phenomenon called activation sparsity. Because NVIDIA GPUs are engineered to excel at dense matrix multiplication, standard dense kernels have historically ignored that sparsity entirely, multiplying through the zeros as if they were meaningful values. TwELL is built to skip that wasted work, and the measured speedups suggest how substantial it was.
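
To make the phenomenon concrete, here is a minimal numpy sketch that measures post-activation sparsity for a toy ReLU feedforward layer on random inputs; the roughly 50% sparsity it reports is simply a property of random Gaussian pre-activations, and trained LLMs typically exhibit far higher per-token sparsity than this toy setup.

```python
import numpy as np

# Minimal sketch: measure activation sparsity in a toy feedforward layer.
# With random Gaussian weights and ReLU, roughly half the hidden units are
# zero per token; trained LLMs often show far higher per-token sparsity.

rng = np.random.default_rng(0)
d_model, d_ff, n_tokens = 512, 2048, 1000

W_in = rng.standard_normal((d_model, d_ff)) / np.sqrt(d_model)
x = rng.standard_normal((n_tokens, d_model))

hidden = np.maximum(x @ W_in, 0.0)          # ReLU activation
sparsity = np.mean(hidden == 0.0)           # fraction of zero activations
print(f"fraction of zero activations: {sparsity:.1%}")
```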

What makes this announcement technically credible is its scope. Prior sparse LLM kernels, including TurboSparse, ProSparse, and Q-Sparse, addressed memory-bound GEMV (General Matrix-Vector Multiplication) operations suited to single- or few-token inference. TwELL targets compute-bound GEMM (General Matrix-Matrix Multiplication) operations in a batched setting with thousands of tokens, covering the regimes that actually matter for training and high-throughput production inference.
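
One way to see why the batched regime is the harder target is arithmetic intensity, the ratio of FLOPs performed to bytes moved. The sketch below, assuming fp16 storage and a single read or write of each operand (an assumption for illustration, not a profiling result), shows intensity climbing from about 1 FLOP/byte for a single token into the hundreds and thousands as the token count grows, which is the transition from memory-bound GEMV to compute-bound GEMM.

```python
# Rough arithmetic-intensity comparison (FLOPs per byte moved) for a single
# feedforward weight matrix, assuming fp16 (2 bytes/element) and that weights
# and activations are each read or written exactly once. Illustrative only.

def arithmetic_intensity(n_tokens, d_in, d_out, bytes_per_elem=2):
    flops = 2 * n_tokens * d_in * d_out                       # mul + add per weight per token
    bytes_moved = bytes_per_elem * (d_in * d_out              # weight matrix
                                    + n_tokens * d_in         # input activations
                                    + n_tokens * d_out)       # output activations
    return flops / bytes_moved

d_in, d_out = 4096, 16384
for n in (1, 8, 256, 4096):
    print(f"{n:>5} tokens: ~{arithmetic_intensity(n, d_in, d_out):.0f} FLOPs/byte")
# A single token sits near 1 FLOP/byte (memory-bound GEMV); hundreds to
# thousands of tokens push intensity into the hundreds or thousands,
# where compute rather than memory bandwidth becomes the limit.
```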

Why Activation Sparsity Has Been a Locked Door Until Now

Activation sparsity is not a new observation. Engineers have known for years that LLM feedforward layers produce large numbers of zeros after activation functions. The problem has always been exploitation: historically, sparse operations on modern GPUs ran slower than their dense counterparts, making the theoretical savings unreachable in practice. Traditional sparse formats also impose a conversion overhead — the cost of reorganizing data into a sparse representation — that frequently negates any compute savings before a single multiply-accumulate is skipped.

The conventional ELLPACK format, for instance, packs non-zero values row-by-row. Constructing it from tiled matrix-multiplication output requires a separate kernel launch, a global memory read, and synchronization across Cooperative Thread Arrays (CTAs): three sequential bottlenecks that collectively erode the benefit of skipping zero-valued computations. This is why prior approaches, despite targeting the right problem, could not generalize beyond narrow single-token inference scenarios.
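
For reference, here is a minimal numpy sketch of conventional row-wise ELLPACK packing: the textbook layout the paper contrasts against, not TwELL itself. Every row is padded out to the width of the densest row, and building the format from a dense result requires a full second pass over the data.

```python
import numpy as np

# Minimal sketch of conventional (row-wise) ELLPACK packing: every row is
# padded to the length of the densest row, producing a rectangular values
# array plus a column-index array. This is the format TwELL moves away from.

def ellpack_pack(dense, pad_value=0.0):
    n_rows, _ = dense.shape
    nnz_per_row = (dense != 0).sum(axis=1)
    width = int(nnz_per_row.max())                      # widest row sets the padding
    values = np.full((n_rows, width), pad_value, dtype=dense.dtype)
    col_idx = np.zeros((n_rows, width), dtype=np.int64)
    for r in range(n_rows):
        cols = np.flatnonzero(dense[r])
        values[r, :cols.size] = dense[r, cols]
        col_idx[r, :cols.size] = cols
    return values, col_idx

rng = np.random.default_rng(1)
a = rng.standard_normal((4, 8)) * (rng.random((4, 8)) < 0.25)   # ~75% zeros
vals, idx = ellpack_pack(a)
print(vals.shape, idx.shape)
```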

TwELL sidesteps this with tile-wise packing: the sparse format is constructed directly in the epilogue of the matmul kernel itself, with columns partitioned into horizontal tiles that match the kernel's tile size T_n. The result, as documented in the ICML 2026 paper, is that the sparse format is built without the separate kernel launch, global memory round-trip, or CTA synchronization that made traditional ELLPACK impractical. This fused CUDA kernel design is what converts a theoretically attractive idea into a measured 20.5% inference speedup and 21.9% training speedup on NVIDIA GPUs.
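
A host-side sketch of the general tile-wise idea follows. It is an assumption-laden simplification in numpy, not the fused CUDA epilogue described in the paper, and the exact layout TwELL uses may differ. The point it captures is that packing decisions are made per column tile of width T_n, so the thread block that already holds a tile of matmul output could in principle write the packed form without a separate pass over global memory.

```python
import numpy as np

# Illustration (not TwELL's exact layout): pack non-zeros per column tile of
# width tile_n instead of per whole row. Each tile could then be packed by
# the same thread block that produced it in the matmul epilogue, avoiding a
# separate kernel launch and global-memory round trip to build the format.

def tilewise_pack(dense, tile_n):
    n_rows, n_cols = dense.shape
    assert n_cols % tile_n == 0, "illustrative sketch assumes tile-aligned width"
    tiles = []
    for c0 in range(0, n_cols, tile_n):
        tile = dense[:, c0:c0 + tile_n]
        nnz_per_row = (tile != 0).sum(axis=1)
        width = int(nnz_per_row.max())                  # padding decided per tile
        values = np.zeros((n_rows, width), dtype=dense.dtype)
        col_idx = np.zeros((n_rows, width), dtype=np.int64)
        for r in range(n_rows):
            cols = np.flatnonzero(tile[r])
            values[r, :cols.size] = tile[r, cols]
            col_idx[r, :cols.size] = cols + c0          # global column indices
        tiles.append((values, col_idx))
    return tiles
```

Because padding is decided per tile rather than per full row in this sketch, one locally dense region only inflates the width of its own tile instead of the whole packed matrix.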

The Harder Question: What This Approach Gives Up

The speedups are real and corroborated across multiple sources, which confirm that TwELL achieves over 20% faster LLM training and inference on NVIDIA GPUs without altering model architecture. But the approach makes a deliberate trade: it abandons the highly optimized dense computation pathways that NVIDIA's Tensor Cores are specifically designed to exploit. Dense GEMM on modern NVIDIA hardware benefits from years of compiler and library tuning, cuBLAS optimization, and dedicated hardware support. TwELL introduces kernel complexity that sits outside that established stack.

The unstructured nature of the sparsity is also a genuine engineering challenge. Unstructured sparsity — meaning zeros appear at arbitrary positions rather than in regular patterns — is fundamentally harder to exploit efficiently than structured sparsity, where hardware can predict memory access patterns. Managing irregular sparse data in a compute-bound batched setting, at the scale of thousands of tokens, requires careful kernel design to avoid introducing new memory bottlenecks that replace the compute bottleneck being eliminated.
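
For contrast, NVIDIA's 2:4 structured sparsity (at most two non-zeros in every group of four consecutive elements) is the kind of regular pattern hardware can exploit with predictable memory access. The small check below illustrates that constraint; per-token activation sparsity of the kind TwELL targets generally does not satisfy it, which is why it is described here as unstructured.

```python
import numpy as np

# Illustration: 2:4 structured sparsity requires at most 2 non-zeros in every
# group of 4 consecutive elements along a row. Activation sparsity produced by
# an LLM's feedforward layer is unstructured and typically violates this.

def satisfies_2_to_4(mat):
    rows, cols = mat.shape
    assert cols % 4 == 0
    groups = (mat != 0).reshape(rows, cols // 4, 4)
    return bool((groups.sum(axis=-1) <= 2).all())

rng = np.random.default_rng(2)
unstructured = rng.standard_normal((8, 16)) * (rng.random((8, 16)) < 0.5)
print("random 50% mask satisfies 2:4:", satisfies_2_to_4(unstructured))  # almost certainly False
```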

The prior art comparison is instructive here. TurboSparse, ProSparse, and Q-Sparse each addressed a narrower, more tractable version of the same problem — memory-bound single-token inference — and still did not achieve widespread production adoption. TwELL’s extension to batched GEMM is a meaningful technical step, but the gap between a published speedup on a research benchmark and a production-grade kernel that ships in inference frameworks is not trivial to close.

| Sparse Kernel Approach | Operation Targeted | Best For |
| --- | --- | --- |
| TurboSparse / ProSparse / Q-Sparse | Memory-bound GEMV | Single- or few-token inference |
| TwELL (Sakana AI & NVIDIA) | Compute-bound GEMM (batched) | Training and high-throughput inference |
| Dense GEMM (cuBLAS / Tensor Cores) | Dense matrix multiplication | General-purpose LLM compute, all batch sizes |

📊 Key Numbers

  • Inference speedup (TwELL on LLMs): 20.5% faster than dense baseline
  • Training speedup (TwELL on LLMs): 21.9% faster than dense baseline
  • Feedforward layer parameter share: Over two-thirds of total LLM model parameters
  • Feedforward layer compute share: More than 80% of total FLOPs in LLMs
  • Architecture change required: None — speedups achieved without modifying model architecture
  • Target operation: Batched GEMM with thousands of tokens (compute-bound regime)

🔍 Context

The research was conducted jointly by teams at Sakana AI and NVIDIA and accepted at ICML 2026, with the full paper published on arXiv under identifier 2603.23198. The specific gap TwELL addresses is the absence of any practical sparse kernel for the compute-bound batched GEMM regime — the setting that governs both LLM training runs and high-throughput serving at scale. Every prior sparse LLM kernel (TurboSparse, ProSparse, Q-Sparse) was designed for memory-bound GEMV, leaving the dominant compute workload untouched. TwELL’s tile-wise packing, fused directly into the matmul kernel epilogue, eliminates the separate kernel launch and global memory round-trip that made traditional ELLPACK formats impractical. The “why now” is architectural: as LLMs grow larger and feedforward layers consume an ever-larger share of total FLOPs, the cost of ignoring activation sparsity compounds — making a working batched sparse kernel more valuable with each generation of model scaling.

💡 AIUniverse Analysis

Our reading: ★ LIGHT — TwELL’s genuine advance is not the speedup number itself but the mechanism that produces it: fusing sparse format construction into the matmul kernel epilogue eliminates the three-step overhead (separate launch, global memory read, CTA synchronization) that made ELLPACK unworkable in batched settings. That is a concrete kernel engineering contribution, not a benchmark artifact. The fact that it applies to training — not just inference — means the efficiency gain compounds across the entire model development lifecycle, not just at deployment time.

★ SHADOW — The cautious read is that unstructured sparsity at batched GEMM scale has defeated multiple prior attempts precisely because irregular memory access patterns can introduce new bottlenecks as fast as old ones are removed. The 20.5% and 21.9% figures come from a research paper, not a production deployment. A CTO evaluating TwELL for a real inference stack would need to know: does the speedup hold across different sparsity levels and model architectures, or is it sensitive to the specific activation patterns in the benchmark models? The complexity of maintaining a custom CUDA kernel outside the cuBLAS ecosystem also creates a long-term maintenance burden that does not appear in the headline numbers.

For TwELL to matter in 12 months, it would need to be integrated into at least one major inference framework — such that engineers can access the speedup without writing or maintaining custom CUDA code themselves.

⚖️ AIUniverse Verdict

✅ Promising. The 20.5% inference and 21.9% training speedups are backed by an ICML 2026 paper and achieved without model architecture changes, but production impact depends on whether TwELL’s fused CUDA kernels can be adopted into mainstream inference frameworks beyond the research setting.

🎯 What This Means For You

Founders & Startups: A 20%+ reduction in inference compute costs — without retraining or re-architecting models — could meaningfully shift the unit economics of LLM-powered products. Watch for TwELL integration in inference runtimes before budgeting GPU spend for 2026 deployments.

Developers: The fused CUDA kernel approach means the speedup is not accessible through a simple config flag today — it requires kernel-level integration. Developers building on NVIDIA hardware should track the arXiv paper (2603.23198) and any downstream framework support announcements.

Enterprise & Mid-Market: Training cost is the larger lever here: a 21.9% training speedup applied to large-scale fine-tuning runs translates directly to reduced GPU-hours per experiment. Enterprises running continuous fine-tuning pipelines should evaluate whether TwELL-compatible tooling becomes available in their existing MLOps stack.

General Users: The efficiency gains target the infrastructure layer, not the user interface — but faster, cheaper inference at the provider level typically flows through to lower latency and broader model availability over time.

⚡ TL;DR

  • What happened: Sakana AI and NVIDIA introduced TwELL, a tile-wise ELLPACK sparse format with fused CUDA kernels that achieves 20.5% faster LLM inference and 21.9% faster training by exploiting activation sparsity in feedforward layers.
  • Why it matters: Feedforward layers account for over 80% of LLM FLOPs, and no prior sparse kernel had successfully targeted the compute-bound batched GEMM regime that governs both training and high-throughput inference.
  • What to do: Track the ICML 2026 paper (arXiv 2603.23198) and monitor whether major inference frameworks adopt TwELL’s CUDA kernels before committing to GPU infrastructure plans.

📖 Key Terms

Activation sparsity
The condition where most hidden neurons in an LLM feedforward layer output zero after the activation function for any given token — the phenomenon TwELL is designed to exploit rather than ignore.
Unstructured sparsity
A sparsity pattern where zero values appear at arbitrary, unpredictable positions in a matrix, making it harder to exploit efficiently than structured sparsity where zeros follow regular patterns.
GEMM
General Matrix-Matrix Multiplication — the compute-bound operation that dominates LLM training and high-throughput inference, and the specific target of TwELL’s batched sparse kernel.
GEMV
General Matrix-Vector Multiplication — a memory-bound operation suited to single- or few-token inference, which prior sparse LLM kernels like TurboSparse, ProSparse, and Q-Sparse addressed instead of GEMM.
Tensor Cores
Specialized processing units on NVIDIA GPUs optimized for dense matrix multiplications — the hardware pathway that TwELL’s unstructured sparse approach deliberately steps outside of in exchange for skipping zero-valued computations.

Analysis based on reporting by MarkTechPost. Original article here. Additional sources consulted: Github Repository — github.com/Benjamin-Dobell/nvidia-update; Independent Source — di.gg/ai/psdt91ak; Github Repository — github.com/nvidia.

By AI Universe
