key developments

simon willison highlights a standout piece on agentic engineering and the “build, then rebuild” pattern. lalit maganti’s writeup on building syntaqlite (high-fidelity sqlite devtools: parser, formatter, verifier) is one of the best long-form accounts of what coding agents actually change about development workflow. the key insight: ai let maganti bypass eight years of procrastination by giving him concrete prototypes to react to rather than abstract designs to think through. but the first ai-built prototype was eventually thrown away entirely: because refactoring felt cheap, ai “made me procrastinate on key design decisions”. the ai-as-scaffolding pattern (get to a working prototype fast, then rebuild with proper architecture) is becoming a recurring theme in serious agentic engineering work, and this is one of the most honest accounts of both its power and its failure modes. willison also built a wasm playground for syntaqlite. writeup | playground

fused moe dispatch kernel in pure triton beats megablocks at inference batch sizes. subhadip mitra built a fused moe dispatch pipeline that handles the full forward pass in 5 kernel launches instead of 24+. on mixtral-8x7b (a100), it achieves 131% of megablocks performance at 32 tokens and 124% at 128 tokens, the batch sizes that actually matter for inference serving. the core technique fuses the gate+up projection so the two matmuls share input tiles from l2 cache, with silu computed in registers, eliminating ~470mb of intermediate buffers per forward pass. tested across mixtral, deepseek-v3 (256 experts), and qwen2-moe; passes all tests on amd mi300x with zero code changes. this matters because pure triton matching or beating hand-tuned cuda at practical batch sizes significantly lowers the barrier for custom moe kernel work, and the cross-vendor portability is genuinely notable. code | writeup
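at the math level, the gate+up fusion computes the standard swiglu expert mlp. a minimal numpy sketch of what one expert computes, and why the two projections can share input tiles (names and shapes are illustrative, not mitra's kernel):

```python
import numpy as np

def silu(x):
    # silu(x) = x * sigmoid(x); in the fused kernel this runs in registers
    return x / (1.0 + np.exp(-x))

def expert_forward_unfused(x, w_gate, w_up, w_down):
    # naive pipeline: separate matmuls plus an elementwise pass, each
    # materializing an intermediate buffer
    gate = x @ w_gate          # (tokens, d_ff)
    up = x @ w_up              # (tokens, d_ff)
    hidden = silu(gate) * up   # (tokens, d_ff)
    return hidden @ w_down     # (tokens, d_model)

def expert_forward_fused(x, w_gate, w_up, w_down):
    # gate and up projections read the same input rows, so a kernel can load
    # each input tile once and compute both; numerically identical
    w_gate_up = np.concatenate([w_gate, w_up], axis=1)  # one combined matmul
    gu = x @ w_gate_up
    d_ff = w_gate.shape[1]
    hidden = silu(gu[:, :d_ff]) * gu[:, d_ff:]
    return hidden @ w_down

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 8))
w_gate, w_up = rng.standard_normal((8, 16)), rng.standard_normal((8, 16))
w_down = rng.standard_normal((16, 8))
assert np.allclose(expert_forward_unfused(x, w_gate, w_up, w_down),
                   expert_forward_fused(x, w_gate, w_up, w_down))
```

the real win is in memory traffic, not flops: the fused version never writes `gate`, `up`, or `hidden` back to global memory, which is where the ~470mb of intermediate buffers goes away.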

willison building new llm abstraction layer, publishes raw api research. willison is redesigning the abstraction layer for his llm python library to handle features like server-side tool execution that current plugins can’t support. to inform the design, he had claude code read through the anthropic, openai, gemini, and mistral python clients and generate raw curl commands for streaming and non-streaming modes across scenarios. the captured outputs and scripts are in a new repo. this is interesting as a methodology note: using ai to systematically map vendor api surfaces before designing an abstraction layer is a practical pattern that others building multi-provider tooling should steal.

notable

  • scan-for-secrets 0.1: willison released a tool that scans directories for api keys including common encodings (json escaping, backslash escaping), built via readme-driven development with claude code. solves a real problem for anyone publishing ai session transcripts. link

  • per-layer embeddings explainer for gemma 4 e2b/e4b: clear community explainer on how gemma 4’s small “e” models differ from both moe and dense architectures, enabling new inference performance tradeoffs. worth reading if you missed the gemma 4 architecture details. link

  • turboquant kv cache quantization showing strong results on gemma 4: ~3.1 bits per k channel with near-zero accuracy loss and a 34% speedup at 131k context on gemma 4 26b; per-layer outlier-aware k quantization beats q8_0 perplexity on qwen models at ~5-6 bits per value. suggests per-layer bit allocation matters more than the choice of base quantizer. link

  • qwen 3.5 tool calling bug catalog: comprehensive writeup documenting four specific bugs that break tool calling in qwen 3.5 across llama.cpp, ollama, and vllm, with workarounds. useful reference if you’re running qwen 3.5 in agentic setups. link

  • real-time multimodal ai (audio/video in, voice out) running locally on m3 pro with gemma 4 e2b: demonstrates the practical threshold these small multimodal models have crossed for local, real-time use cases like language learning. repo
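on the scan-for-secrets item above: the interesting detail is scanning for keys in their encoded forms, not just verbatim. a minimal sketch of that idea (hypothetical key pattern, not willison's implementation):

```python
import re

# hypothetical "sk-..." key shape; a real tool ships many vendor patterns
KEY_PATTERN = re.compile(r"sk-[A-Za-z0-9]{20,}")

def decodings(text):
    # yield the raw text plus variants with common escapings undone, so a key
    # serialized into a transcript as "sk\u002d..." or "sk\\-..." still matches
    yield text
    yield text.replace("\\\\", "\\")                  # backslash escaping
    try:
        yield text.encode().decode("unicode_escape")  # json-style \uXXXX escapes
    except UnicodeDecodeError:
        pass

def find_keys(text):
    hits = set()  # dedupe: the same key may surface in several decodings
    for variant in decodings(text):
        hits.update(KEY_PATTERN.findall(variant))
    return sorted(hits)
```

the point is that a transcript dump rarely contains the key as typed; it contains whatever the serializer did to it, so the scanner has to undo the serializer first.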
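and on the turboquant item above: a toy illustration of why per-channel (and per-layer) bit allocation matters more than the base quantizer, using a plain symmetric quantizer, which is not turboquant's actual scheme:

```python
import numpy as np

def quantize_per_channel(k, bits):
    # symmetric quantization with one scale per key channel, so an outlier
    # channel doesn't inflate the scale for every other channel
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(k).max(axis=0) / qmax              # one scale per channel
    scale = np.where(scale == 0, 1.0, scale)
    q = np.clip(np.round(k / scale), -qmax - 1, qmax)
    return q.astype(np.int8), scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
k = rng.standard_normal((128, 64)).astype(np.float32)
k[:, 3] *= 50.0                                       # simulate one outlier channel

q, scale = quantize_per_channel(k, bits=4)
err_per_channel = np.abs(dequantize(q, scale) - k).mean()

# per-tensor baseline: a single scale, dominated by the outlier channel
s = np.abs(k).max() / 7
err_per_tensor = np.abs(np.clip(np.round(k / s), -8, 7) * s - k).mean()
assert err_per_channel < err_per_tensor
```

with one scale per tensor, the outlier channel sets the scale and every well-behaved channel rounds to almost nothing; per-channel scales isolate the damage, which is the same intuition behind spending extra bits only on the layers and channels that need them.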

papers

  • “signals: finding the most informative agent traces without llm judges”, salman et al. (katanemo labs / digitalocean). proposes computing lightweight structured signals from live agent interactions to surface the most informative trajectories for review without gpu overhead; achieves an 82% informativeness rate vs 54% for random sampling on τ-bench. practical for anyone running agentic systems at scale who can’t afford to review or llm-judge every trace. arxiv 2604.00356
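the general shape of the idea can be sketched in a few lines; the signals and weights below are made up for illustration and are not the paper's feature set:

```python
from dataclasses import dataclass

@dataclass
class Trace:
    turns: int
    tool_errors: int
    retries: int
    user_corrections: int

def informativeness(t: Trace) -> float:
    # cheap structured signals computed from the trace itself: no llm judge,
    # no gpu; traces where things went wrong are usually the ones worth reading
    return 1.0 * t.tool_errors + 0.5 * t.retries + 2.0 * t.user_corrections + 0.1 * t.turns

def top_k(traces, k):
    # surface the k traces most worth a human's review time
    return sorted(traces, key=informativeness, reverse=True)[:k]

traces = [
    Trace(turns=5, tool_errors=0, retries=0, user_corrections=0),
    Trace(turns=12, tool_errors=3, retries=2, user_corrections=1),
    Trace(turns=8, tool_errors=1, retries=0, user_corrections=0),
]
assert top_k(traces, 1)[0].tool_errors == 3  # the error-heavy trace ranks first
```

the contrast with random sampling is the whole pitch: scoring is o(traces) over fields you already log, so it can run inline on live traffic.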