Beyond Binary Search: When Interpolation Beats Quaternary, Radix, and SIMD
You have a sorted array. Need to search it. `std::lower_bound`: O(log n), predictable, robust. This is the default solution. But my industry experience breeds skepticism. Every few months, a new paper
Modern C++ // dev Apr 28, 2026 10 min read
Kernel Fusion on CPU: What llama.cpp's RMS_NORM + MUL Fusion Teaches Us About LLM Performance
Llama.cpp's PR #22423 landed a kernel fusion for RMS_NORM + MUL in the ggml CPU backend a few weeks ago. The speedup: 1.60×. Consistently. Across dimension sizes, thread counts, even hardware variatio
Modern C++ // dev Apr 21, 2026 7 min read
Designing a SIMD Algorithm from Scratch
I manually unrolled a byte-counting loop with four independent accumulators — the textbook ILP optimization — and it ran 2.08x *slower* than the plain loop. The plain loop that GCC had quietly autovec
Modern C++ // dev Mar 31, 2026 10 min read