Abstract
Autoregressive decoding of large language models becomes bandwidth-limited at long contexts, as generating each token requires streaming the full key-value (KV) cache. We introduce Stochastic Additive No-mulT Attention (SANTA), a method that sparsifies value-cache access by sampling a small number of indices from the post-softmax distribution and aggregating only those value rows via gather-and-add. SANTA yields an unbiased estimator of the attention output while replacing value-stage multiply-accumulates with additions. We further develop variance-reduced variants based on stratified and systematic sampling (S2ANTA) and implement custom CUDA kernels that achieve a 1.5x decode-step attention kernel speedup over FlashInfer and FlashDecoding on an NVIDIA RTX 6000 Ada at 32k-token contexts while matching baseline accuracy. We also propose Bernoulli qKᵀ sampling as a complementary technique that sparsifies the score stage through stochastic ternary queries, reducing key-feature access during decoding. Both methods are orthogonal to KV-cache compression, quantization, and eviction, and point toward sparse, multiplier-free, energy-efficient inference.
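To make the value-stage idea concrete, the following is a minimal NumPy sketch, not the authors' implementation: for a single decode-step query, indices are drawn from the post-softmax distribution and the corresponding value rows are gathered and summed, giving an unbiased estimate of the exact attention output. The function names (exact_attention, sampled_attention), the parameter num_samples, and the i.i.d. sampling are illustrative assumptions; the stratified/systematic S2ANTA variants, Bernoulli qKᵀ sampling, and the CUDA kernels are not reproduced here.

    import numpy as np

    def softmax(x):
        # Numerically stable softmax over a 1-D score vector.
        x = x - x.max()
        e = np.exp(x)
        return e / e.sum()

    def exact_attention(q, K, V):
        # Standard single-query attention: softmax(q K^T / sqrt(d)) V.
        p = softmax(K @ q / np.sqrt(q.shape[-1]))
        return p @ V

    def sampled_attention(q, K, V, num_samples, rng):
        # Illustrative SANTA-style estimator (i.i.d. sampling assumed):
        # draw indices from the post-softmax distribution p, then
        # gather-and-add only the selected value rows. Since E[V[i]] for
        # i ~ p equals sum_i p_i V[i], the sample mean is unbiased, and
        # the value stage reduces to a gather-and-add over num_samples
        # rows instead of a weighted sum over all n cached rows.
        p = softmax(K @ q / np.sqrt(q.shape[-1]))
        idx = rng.choice(len(p), size=num_samples, p=p)
        return V[idx].sum(axis=0) / num_samples

    rng = np.random.default_rng(0)
    d, n = 64, 4096                      # head dimension, context length
    q = rng.standard_normal(d)           # decode-step query
    K = rng.standard_normal((n, d))      # key cache
    V = rng.standard_normal((n, d))      # value cache

    err = np.abs(exact_attention(q, K, V) - sampled_attention(q, K, V, 256, rng))
    print("mean absolute error:", err.mean())

In this sketch only the gathered rows of V are touched, which is the source of the reduced value-cache traffic; variance shrinks as num_samples grows, and the stratified/systematic sampling of S2ANTA would reduce it further at the same sample budget.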
Biography
Kerem Çamsari is an Associate Professor of Electrical and Computer Engineering at UC Santa Barbara, leading the Orchestrating Physics for Unconventional Systems (OPUS) Lab. His research focuses on probabilistic computing, a physics-inspired paradigm that treats randomness as a computational resource rather than a nuisance. A central idea he and his colleagues introduced is the probabilistic bit (p-bit), a naturally fluctuating unit that explores solutions for optimization and machine learning. He is co-founder and CEO of Flucta, a company developing energy-efficient hardware and software for AI. His work has appeared in leading journals spanning physics and engineering, and has helped grow a worldwide community working on probabilistic computing.