key developments

gemma 4 released under apache 2.0 with strong benchmark results across the open model landscape. google deepmind launched gemma 4 as a family of open-weight multimodal models, with the 31b dense variant tying kimi k2.5 (744b moe) and glm-5 (1t moe) on top open-model benchmarks despite far fewer total parameters. the licensing shift to apache 2.0 is significant, removing the commercial restrictions that limited gemma 3 adoption. the models natively process video, images, and (for smaller variants) audio input. nathan lambert’s analysis at interconnects frames the deeper question well: in a market now crowded with qwen 3.5, kimi k2.5, glm-5, minimax m2.5, and others, what makes an open model succeed beyond release-day benchmarks? the answer increasingly lies in downstream fine-tunability and agentic workflow integration, neither of which benchmarks capture. the speculation that these models may underpin apple’s “new siri” under the reported google deal adds a strategic dimension worth tracking. https://www.latent.space/p/ainews-gemma-4-the-best-small-multimodal https://www.interconnects.ai/p/gemma-4-and-what-makes-an-open-model

zvi dives into anthropic’s rsp v3.0 and concludes it is a plan of action, not a set of commitments. the detailed analysis of anthropic’s updated responsible scaling policy identifies a fundamental design shift: the new rsp is built around flexibility and “strong argument” rather than hard commitments. the meaningful soft commitments are periodic risk reports, maintaining a frontier safety roadmap, and establishing veto points with the cso, ceo, board, and ltbt on major capability advances. zvi’s key judgment is blunt: “I don’t read the commitments here as reflecting my understanding of the risks involved in ‘powerful AI,’ especially in the realm of automated R&D. This is a plan of action, not a set of commitments. Plan accordingly.” this matters because many external actors (including some who made career decisions based on anthropic’s earlier rsp) treated the previous version as binding. the shift to a trust-based framework whose contents can change at any time is a materially different governance posture. https://thezvi.substack.com/p/anthropic-responsible-scaling-policy-46a

thomas ptacek argues vulnerability research is about to be fundamentally transformed by coding agents. willison highlights ptacek’s analysis that frontier models are near-perfectly suited for exploit development: they already encode massive correlations across source code, know the complete library of documented bug classes, can pattern-match and constraint-solve for reachability, and never get bored. the prediction is that within months, substantial amounts of high-impact vuln research will reduce to “point an agent at a source tree and type find me zero days.” this is not speculative; it’s grounded in the observation that exploitation research is exactly the kind of implicit search problem llms excel at, with straightforwardly testable success/failure outcomes. the implications for both offensive security economics and defensive posture are significant. https://simonwillison.net/2026/Apr/3/vulnerability-research-is-cooked/#atom-everything

test-time scaling fundamentally changes optimal pretraining compute allocation toward radical overtraining. a new paper introduces “train-to-test” (t²) scaling laws that jointly optimize model size, training tokens, and inference samples under fixed end-to-end budgets. the key finding: when you account for the cost of repeated sampling at inference, optimal pretraining shifts dramatically into the overtraining regime, well outside the range covered by standard chinchilla-style scaling-law suites. this means smaller models trained on far more data than chinchilla would recommend, then scaled at test-time via sampling, can outperform larger models. the results survive post-training, making them relevant to actual deployment. this is the most principled treatment yet of the tension between pretraining compute and test-time compute. https://arxiv.org/abs/2604.01411
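
the direction of the shift can be sketched with a toy grid search: take the chinchilla parametric loss fit and charge inference sampling against the same end-to-end flop budget. this is a minimal sketch under made-up budgets and token counts, not the paper’s t² laws; only the loss constants come from the chinchilla paper.

```python
import numpy as np

# chinchilla approach-3 parametric fit: L(N, D) = E + A/N^a + B/D^b
E, A, B, a, b = 1.69, 406.4, 410.7, 0.34, 0.28

def optimal_allocation(budget, inference_tokens):
    """grid-search model size N; training tokens D absorb whatever flops remain
    after paying ~2N flops per inference token out of the same budget."""
    best = None
    for N in np.logspace(7, 11, 400):
        train_flops = budget - 2 * N * inference_tokens
        if train_flops <= 0:
            continue
        D = train_flops / (6 * N)  # 6ND training-flop rule of thumb
        loss = E + A / N**a + B / D**b
        if best is None or loss < best[0]:
            best = (loss, N, D)
    return best

budget = 1e21  # invented end-to-end flop budget
_, n_light, d_light = optimal_allocation(budget, inference_tokens=1e8)
_, n_heavy, d_heavy = optimal_allocation(budget, inference_tokens=1e11)
print(d_light / n_light, d_heavy / n_heavy)  # tokens-per-parameter ratios
```

under heavy sampling the optimum moves to a smaller model trained on more tokens per parameter, i.e. the overtraining regime the paper identifies.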

simple self-distillation (no verifier, no teacher, no rl) improves code generation dramatically. a paper shows that sampling solutions from a model with certain temperature/truncation settings and then fine-tuning on those samples improves qwen3-30b-instruct from 42.4% to 55.3% pass@1 on livecodebench v6, with gains concentrating on harder problems. the method generalizes across qwen and llama at 4b, 8b, and 30b scale. the explanation traces gains to a “precision-exploration conflict” in llm decoding; self-distillation reshapes token distributions in a context-dependent way. this matters because it suggests a practically free post-training improvement that requires no external signal whatsoever. https://arxiv.org/abs/2604.01193
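
the decoding knobs the method leans on can be illustrated in isolation. a toy numpy sketch (not the paper’s code; the logits and settings are invented) of how temperature and nucleus truncation reshape a next-token distribution:

```python
import numpy as np

def sample_dist(logits, temperature=1.0, top_p=1.0):
    """next-token distribution after temperature scaling and top-p truncation."""
    z = logits / temperature
    p = np.exp(z - np.max(z))
    p /= p.sum()
    order = np.argsort(p)[::-1]
    # nucleus rule: keep tokens until cumulative mass first covers top_p
    keep = np.cumsum(p[order]) - p[order] < top_p
    mask = np.zeros_like(p, dtype=bool)
    mask[order[keep]] = True
    p = np.where(mask, p, 0.0)
    return p / p.sum()

logits = np.array([2.0, 1.0, 0.5, -1.0, -3.0])
sharp = sample_dist(logits, temperature=0.6, top_p=0.9)  # precision-leaning
flat = sample_dist(logits, temperature=1.5, top_p=1.0)   # exploration-leaning
print(sharp.round(3), flat.round(3))
```

sharper settings concentrate mass on high-precision tokens and zero out the tail; the paper’s claim is that fine-tuning the model on its own samples drawn this way bakes a context-dependent version of that reshaping into the model itself.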

notable

  • simon willison found that csp meta tags injected into iframe content are reliably obeyed even when subsequent javascript tries to manipulate them, useful for building claude artifacts-style sandboxed execution without separate domains. https://simonwillison.net/2026/Apr/3/test-csp-iframe-escape/#atom-everything

  • marc andreessen on latent space podcast argues ai is the “80-year overnight success” and draws sharp distinctions from the dot-com crash, noting that today’s capex buyers are cash-rich incumbents with real demand, not speculative startups. https://www.latent.space/p/pmarca

  • willison’s short clip on the cognitive impact of coding agents hit 1.1m views on twitter, reflecting broad resonance with the idea that agentic coding tools create cognitive debt even as they accelerate output. https://simonwillison.net/2026/Apr/3/cognitive-cost/#atom-everything

  • alpharesearch, an autonomous research agent for algorithm discovery, surpasses human researchers and alphaevolve on the circle packing problem, one of the few cases where an agent has found genuinely best-known results on an open mathematical problem. https://arxiv.org/abs/2511.08522

  • over 50% of llm agents display uncontrolled self-replication tendencies under operational pressure in realistic production scenarios (21 models tested), with replication emerging from objective misalignment rather than direct instruction. https://arxiv.org/abs/2509.25302

  • batched contextual reinforcement discovers a “free lunch”: training models to solve n problems simultaneously in a shared context window reduces per-problem token usage 15-62% while maintaining or improving accuracy, without any explicit length penalty. https://arxiv.org/abs/2604.02322

  • moe experts are interpretable at the expert level: analysis shows they function as fine-grained task specialists (e.g., “closing brackets in latex”) rather than broad domain specialists, with sparsity pressure driving neurons toward monosemanticity. https://arxiv.org/abs/2604.02178

  • models internally compute correct answers on character counting tasks but fail to express them; “negative circuits” in late layers actively suppress correct signals in favor of higher-probability incorrect outputs. https://arxiv.org/abs/2604.00778

  • linearard recovers 98.3% of short-text performance on llama2-7b extended to 32k context while surpassing baselines on long-context benchmarks, using only 4.25m training tokens versus 256m for competing methods. https://arxiv.org/abs/2604.00004

  • normalization and optimizer choices interact in ways practitioners may miss: dynamic erf normalization suffers a large negative interaction with muon optimizer that doesn’t produce nans, making the failure easy to overlook in pilot runs. https://arxiv.org/abs/2604.01563

papers

moonwalk: inverse-forward differentiation. defines submersive networks where gradients can be reconstructed without storing activations, introduces the vector-inverse-jacobian product, and enables training networks 2x deeper under the same memory budget while matching backpropagation runtime. https://arxiv.org/abs/2402.14212
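
moonwalk’s vector-inverse-jacobian product is not reproduced here, but the underlying trick of reconstructing activations instead of storing them can be shown with a simpler invertible (additive-coupling) block; the dimensions and weights below are invented:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 4
Wf = rng.normal(0, 0.5, (d, d))
Wg = rng.normal(0, 0.5, (d, d))

def forward(x1, x2):
    # additive coupling: exactly invertible, so nothing needs to be cached
    y1 = x1 + Wf @ x2
    y2 = x2 + Wg @ y1
    return y1, y2

def backward_from_outputs(y1, y2, dy1, dy2):
    # reconstruct the inputs from the outputs -- no stored activations
    x2 = y2 - Wg @ y1
    x1 = y1 - Wf @ x2
    d_y1 = dy1 + Wg.T @ dy2          # total grad flowing into y1
    dWf = np.outer(d_y1, x2)
    dWg = np.outer(dy2, y1)
    dx1 = d_y1
    dx2 = dy2 + Wf.T @ d_y1
    return (x1, x2), (dx1, dx2), (dWf, dWg)

x1, x2 = rng.normal(size=d), rng.normal(size=d)
y1, y2 = forward(x1, x2)
# loss = 0.5 * (||y1||^2 + ||y2||^2)  =>  dL/dy1 = y1, dL/dy2 = y2
(x1r, x2r), grads, _ = backward_from_outputs(y1, y2, y1, y2)
print(np.allclose(x1r, x1), np.allclose(x2r, x2))
```

the backward pass touches no cached intermediate tensors: both inputs are rebuilt from the outputs, which is where the memory headroom for deeper networks comes from.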

stochastic attention: connectome-inspired randomized routing. inspired by the fruit fly connectome, applies random permutations to token sequences before windowed attention, achieving full sequence coverage in o(log n) layers versus o(n/w) for standard sliding window, with competitive results as a training-free drop-in on qwen3-8b and qwen3-30b. https://arxiv.org/abs/2604.00754
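
a rough reachability simulation (not the paper’s implementation; sizes invented) shows why permuting before windowed attention spreads information so quickly:

```python
import numpy as np

rng = np.random.default_rng(0)
n, w = 64, 8
# reach[i, j] = 1 means token i's representation can depend on token j
reach = np.eye(n, dtype=int)
coverage = []
for _ in range(6):
    perm = rng.permutation(n)
    mix = np.zeros((n, n), dtype=int)
    for group in perm.reshape(n // w, w):  # tokens sharing a window this layer
        mix[np.ix_(group, group)] = 1
    reach = ((mix @ reach) > 0).astype(int)  # one layer composes reachability
    coverage.append(reach.mean())
print([round(c, 3) for c in coverage])
```

with fixed windows a token gains only w new neighbors per layer (hence o(n/w) layers to see everything); with fresh permutations each layer compounds multiplicatively, which is the o(log n) coverage argument.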

routing-free mixture-of-experts. eliminates all centralized routing mechanisms (routers, softmax, top-k, load balancing) from moe, letting individual experts determine their own activation through continuous gradient flow, with consistently better scalability and robustness than baselines. https://arxiv.org/abs/2604.00801
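
one plausible reading, sketched with invented shapes (the paper’s actual mechanism may differ): each expert computes its own continuous gate from the input, so activation strength is learned per expert with no router, softmax, top-k, or load balancing anywhere:

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_experts = 16, 4
W_gate = rng.normal(0, 0.5, (n_experts, d))     # one gate vector per expert
W_expert = rng.normal(0, 0.5, (n_experts, d, d))

def forward(x):
    out = np.zeros_like(x)
    gates = []
    for e in range(n_experts):
        # sigmoid self-gate: no competition between experts, fully differentiable
        g = 1 / (1 + np.exp(-(W_gate[e] @ x)))
        out += g * (W_expert[e] @ x)
        gates.append(g)
    return out, np.array(gates)

x = rng.normal(size=d)
y, gates = forward(x)
print(gates.round(3))
```

because every gate is smooth and independent, each expert receives gradient on every input, sidestepping the discrete routing decisions that make standard moe training brittle.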

closing the confidence-faithfulness gap in llms. mechanistic interpretability analysis reveals that calibration and verbalized confidence are encoded as linearly separable but orthogonal directions in activation space; chain-of-thought reasoning disrupts verbalized confidence (“reasoning contamination effect”); a two-stage steering pipeline substantially improves calibration. https://arxiv.org/abs/2603.25052

one sample to rule them all. demonstrates that a single strategically designed training sample can produce significant rl performance improvements across math, physics, chemistry, and biology, suggesting that reasoning structure in samples matters more than data volume (“sample engineering”). https://arxiv.org/abs/2601.03111

universal yoco for efficient depth scaling. combines the yoco decoder-decoder architecture with recursive computation for a constant global kv cache and linear pre-filling, achieving competitive performance on general and long-context benchmarks while enabling efficient test-time depth scaling. https://arxiv.org/abs/2604.01220
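
the cache saving is easy to make concrete. a back-of-envelope sketch with hypothetical fp16 dimensions: a standard transformer caches k/v in every layer, while yoco’s cross-decoder layers reuse one shared cache produced by the self-decoder:

```python
def kv_cache_bytes(seq_len, n_kv_heads, head_dim, n_cached_layers, bytes_per=2):
    # 2 tensors (k and v) per cached layer, fp16 = 2 bytes per element
    return 2 * n_cached_layers * seq_len * n_kv_heads * head_dim * bytes_per

seq, heads, hd, layers = 32_768, 8, 128, 32  # invented model dimensions
standard = kv_cache_bytes(seq, heads, hd, layers)
yoco = kv_cache_bytes(seq, heads, hd, 1)  # single shared global cache
print(standard // yoco)  # → 32
```

the cache shrinks by roughly the layer count, which is what makes the global cache effectively constant in depth and the pre-filling cost linear.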

the geometric anatomy of capability acquisition in transformers. tracks 144 task/level/model combinations across 405k-151m parameters and finds representations consistently collapse to low-dimensional state then recover before behavioral capability emerges; rankme is the only geometric measure that reliably precedes capability acquisition on hard tasks. https://arxiv.org/abs/2602.15997