key developments

arc-agi-3 benchmark released: frontier models score below 1%, humans solve 100%. francois chollet and team published arc-agi-3, the next iteration of the abstract reasoning benchmark, now focused on interactive, turn-based agentic environments where agents must explore, infer goals, build internal models, and plan without explicit instructions. the benchmark avoids language and external knowledge, focusing purely on fluid adaptive efficiency. humans solve all environments; as of march 2026, frontier ai systems score below 1%. the shift from static puzzles to interactive environments makes the benchmark substantially harder to game through memorization or pattern matching; arc-agi-2 was already difficult for models, and the interactive dimension sets a meaningful new ceiling for measuring general reasoning. arxiv

anthropic wins preliminary injunction against department of war; judge lin issues devastating opinion. zvi mowshowitz covers the resolution of anthropic’s legal battle with the department of war (formerly defense). judge lin granted anthropic a preliminary injunction with a 7-day stay, writing what zvi describes as “one of the most forceful, devastating judge opinions i have ever seen.” the core issue: the government designated anthropic a “supply chain risk” hours after an undersecretary was finalizing a deal with the company, suggesting pretextual action. the opinion apparently hammered the government’s arguments and demonstrated strong judicial understanding of the technical and policy issues. this matters because it sets legal precedent around government ability to arbitrarily exclude ai companies from federal contracts and supply chains, and signals judicial willingness to push back on national security justifications for what appears to be politically motivated action. zvi

stripe launches projects.dev, accelerating the “everything is cli” trend for agent-native infrastructure. stripe released projects.dev, a cli that lets agents instantly provision services (e.g., running stripe projects add posthog/analytics creates accounts, gets api keys, and sets up billing). latent space notes this launched alongside clis from ramp, sendblue, elevenlabs, visa, resend, and google workspace, all in the same week. patrick collison cited andrej karpathy’s menugen as inspiration. the trend is significant: the industry is converging on clis as the primary interface for ai agents to interact with services, sidestepping the complexity of mcps while providing direct programmatic access. stripe’s involvement is notable because they’re essentially becoming an intermediary for service provisioning unrelated to payments. latent space

google turboquant draws significant community traction: 6x kv cache compression, zero loss, with early weight quantization adaptations appearing. multiple localllama threads report on google’s turboquant algorithm, which quantizes kv cache to 3 bits with no training required, achieving 6x memory reduction with reportedly perfect downstream results across gemma and mistral models. community members have already begun adapting it: one thread shows it running qwen 3.5-9b on a macbook air m4 with a 20k token context (previously infeasible), and another demonstrates a weight quantization adaptation achieving near-lossless compression at 8 effective bits. the rapid community adoption and adaptation, and the fact that the algorithm works across model families without retraining, suggest this could become standard infrastructure for local inference relatively quickly. localllama kv cache | macbook demo | weight adaptation
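turboquant’s actual algorithm isn’t detailed in the threads, but the storage arithmetic behind a roughly 6x reduction (16-bit values down to 3-bit codes plus a scale and offset per vector) is easy to sketch. a minimal 3-bit affine quantizer in python; all names and numbers here are illustrative, not turboquant’s method:

```python
# toy per-vector 3-bit affine quantization of a kv cache entry.
# this only illustrates the storage arithmetic; turboquant's actual
# algorithm is not reproduced here.

def quantize_3bit(vec):
    """map floats to integer codes in [0, 7] with an affine scale."""
    lo, hi = min(vec), max(vec)
    scale = (hi - lo) / 7 or 1.0           # 2**3 - 1 = 7 levels
    codes = [round((x - lo) / scale) for x in vec]
    return codes, scale, lo

def dequantize_3bit(codes, scale, lo):
    """invert the affine mapping back to approximate floats."""
    return [c * scale + lo for c in codes]

vec = [0.12, -0.55, 0.98, 0.31, -0.07, 0.44, -0.82, 0.05]
codes, scale, lo = quantize_3bit(vec)
recon = dequantize_3bit(codes, scale, lo)
max_err = max(abs(a - b) for a, b in zip(vec, recon))
assert all(0 <= c <= 7 for c in codes)
assert max_err <= scale / 2 + 1e-9         # rounding error bound
```

the interesting part of the reported result is precisely that a real algorithm hits zero downstream loss at 3 bits, where a naive quantizer like this one would not.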

cursor releases composer 2 technical report: frontier coding model trained with large-scale rl in production harness. cursor published the technical report for composer 2, their specialized agentic software engineering model. the key training decision: training was done in the same cursor harness used by the deployed model, with equivalent tools and structure, using environments matching real problems. the model uses continued pretraining followed by large-scale reinforcement learning for multi-step execution and long-horizon coding. it scores 61.7 on terminal-bench and 73.7 on swe-bench multilingual. the report introduces cursorbench, derived from real problems in large codebases including their own. this matters because it represents one of the clearest examples of domain-specialized rl training for coding agents, and the approach of training in the deployment harness itself is a methodological contribution worth noting. arxiv

cross-model disagreement shown to outperform self-uncertainty for detecting llm errors, no training required. a new paper introduces cross-model perplexity (cmp) and cross-model entropy (cme), which measure how surprised a second model is when reading a first model’s answer. on mmlu, cmp achieves 0.75 auroc versus 0.59 for within-model entropy. critically, this addresses the hardest failure mode: confident errors where a model is wrong but certain. the method requires only a single forward pass from the verifier model, no generation, and no labels. this is practical enough for production deployment in routing pipelines and monitoring, and the insight that cross-model signals beat self-uncertainty is important for anyone building multi-model systems. arxiv
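the cmp score itself is simple to state: the verifier’s perplexity over the generator’s answer tokens. a minimal sketch, with hypothetical verifier log-probabilities standing in for the single real forward pass:

```python
import math

# sketch of cross-model perplexity (cmp): score a generator's answer
# by how surprised a *second* model is when reading it. the verifier
# logprobs below are hypothetical; in practice they come from one
# forward pass of the verifier over the generator's answer tokens.

def cross_model_perplexity(verifier_logprobs):
    """exp of the negative mean token log-probability under the verifier."""
    return math.exp(-sum(verifier_logprobs) / len(verifier_logprobs))

# a fluent, expected answer: the verifier assigns high per-token probability
confident_correct = [-0.2, -0.1, -0.3, -0.15]
# a confident error: the generator was certain, but the verifier is surprised
confident_wrong = [-2.5, -3.1, -1.8, -2.9]

assert cross_model_perplexity(confident_correct) < cross_model_perplexity(confident_wrong)
```

in a routing pipeline, answers whose cmp exceeds a calibrated threshold would be flagged or escalated; the generator’s own confidence never enters the score, which is exactly why confident errors are catchable.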

notable

  • apple research: ssms provably cannot solve “truly long-form” generation problems, but tool use fixes it. a theoretical result showing state space models hit fundamental limits on long-form generation, mitigable through interactive tool access. important for anyone betting on ssm architectures. apple ml

  • intern-s1-pro: first trillion-parameter scientific multimodal foundation model released. covers 100+ specialized tasks in chemistry, materials, life sciences, earth sciences; claims top-tier open-source general capabilities while outperforming proprietary models on specialized scientific tasks. arxiv

  • vibe porting case study: jsonata rewritten from node to go in 7 hours and $400 of tokens. simon willison highlights another example of using ai to port codebases between languages, enabled by comprehensive existing test suites. shadow deployment confirmed exact behavioral match. willison

  • google ai overviews causally reduce wikipedia traffic by ~15%, with culture articles hit hardest. difference-in-differences study across 161k matched articles exploiting geographic rollout. first strong causal evidence that generative search features materially reallocate attention from publishers. arxiv

  • reasoning contamination effect: prompting llms to reason and express confidence simultaneously worsens calibration. mechanistic interpretability analysis shows calibration and verbalized confidence are encoded as orthogonal linear features; reasoning disrupts the confidence direction. a two-stage steering pipeline substantially improves alignment. arxiv

  • microsoft vibevoice 9b becomes open-source leader for medical stt at 8.34% wer, nearly matching gemini 2.5 pro. community benchmark of 31 models also found whisper’s text normalizer had bugs inflating wer by 2-3% across all models. reddit

  • pruning acts as implicit feature selection: rare sae features survive better than frequent ones. counterintuitive finding across three model families showing pruning preferentially destroys high-frequency generic features while preserving specialized rare ones, with wanda preserving structure 3.7x better than magnitude pruning. arxiv

  • multi-answer rl trains llms to generate multiple plausible hypotheses with calibrated confidence in a single forward pass. positions this as a compute-efficient alternative to best-of-k sampling for tasks with irreducible uncertainty like medical diagnosis. arxiv

  • localllama user demonstrates llms converge to modality-agnostic geometric representations across 8 languages and code/math. replicated across 4 models from different orgs; english descriptions, python functions, and latex equations for the same concept converge in middle layers. connects to chomsky’s universal grammar hypothesis. reddit

  • xgrammar-2 achieves 6x faster compilation and near-zero overhead for dynamic structured generation in agentic llm serving. introduces tag-triggered structure switching and cross-grammar cache reuse, targeting the increasingly common pattern of tool calling and response protocol switching within requests. arxiv

  • phishnchips: a single model’s phishing bypass rate ranges from <1% to 97% depending solely on system prompt configuration. demonstrates prompt-model interaction is a first-order security variable, and that making prompts more specific can paradoxically degrade capable models by replacing multi-signal reasoning with exploitable single-signal dependence. arxiv
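the wikipedia result above rests on a difference-in-differences design, whose core estimator is a one-liner. a sketch on made-up traffic numbers (the paper’s 161k matched-article design is far more involved):

```python
# sketch of the difference-in-differences estimate behind the wikipedia
# study, on hypothetical pageview numbers. "treated" = regions where ai
# overviews rolled out; "control" = regions where they had not yet.

def did(treat_pre, treat_post, ctrl_pre, ctrl_post):
    """treatment effect net of the time trend shared with controls."""
    return (treat_post - treat_pre) - (ctrl_post - ctrl_pre)

# hypothetical mean pageviews per article, before and after rollout
effect = did(treat_pre=1000, treat_post=820, ctrl_pre=1000, ctrl_post=970)
assert effect == -150    # -15% of the pre-period baseline
```

the control group absorbs whatever was happening to wikipedia traffic everywhere (seasonality, secular decline), so the residual difference is attributable to the rollout.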
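the pruning bullet above contrasts magnitude pruning with wanda, which scores each weight by its magnitude times the norm of its input activations. a toy sketch of how the two criteria keep different weights (the weights and activation norms are invented for illustration):

```python
# contrast magnitude pruning with a wanda-style score
# (|weight| * input activation norm) on toy values.

def prune_mask(scores, keep_frac):
    """keep the top keep_frac of weights by score."""
    k = max(1, int(len(scores) * keep_frac))
    cutoff = sorted(scores, reverse=True)[k - 1]
    return [s >= cutoff for s in scores]

weights = [0.9, 0.1, 0.5, 0.05]       # |W| per input channel
act_norms = [0.1, 8.0, 1.0, 9.0]      # ||X||_2 per input channel

magnitude_scores = [abs(w) for w in weights]
wanda_scores = [abs(w) * a for w, a in zip(weights, act_norms)]

# magnitude pruning keeps the large-but-rarely-activated weight;
# wanda keeps the small weight feeding a high-norm activation channel.
assert prune_mask(magnitude_scores, 0.5) != prune_mask(wanda_scores, 0.5)
```

that activation term is a plausible reason wanda would preserve rare specialized features better: a weight serving an infrequent but distinctive input direction can survive even when its raw magnitude is small.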

papers

“reasoning safety monitor: real-time detection of unsafe reasoning in llms” formally defines reasoning safety as distinct from content safety, introduces a 9-category taxonomy of unsafe reasoning behaviors, and proposes an external monitor achieving 85% classification accuracy. establishes reasoning-level monitoring as a practical concern for chain-of-thought models. arxiv

“environment maps: structured environmental representations for long-horizon agents” introduces a persistent, agent-agnostic graph representation that consolidates screen recordings and execution traces into structured context. agents with environment maps nearly double webarena success rates (28.2% vs 14.2%). arxiv

“grokking as a falsifiable finite-size transition” supplies the first falsifiable finite-size tests of the grokking phase-transition claim, treating group order as an extensive variable. binder-like crossings and susceptibility analysis strongly disfavor a smooth-crossover interpretation, turning the analogy into a testable quantitative claim. arxiv

“planned diffusion” trains discrete diffusion language models to determine their own denoising order by autoregressively generating a plan that partitions responses into semantically independent chunks, then denoising in parallel. achieves 1.27-1.81x speedup over autoregressive generation with minimal quality loss on alpacaeval. arxiv

“the limits of inference scaling through resampling” proves that imperfect verifiers with nonzero false positive rates impose hard upper bounds on resampling-based inference scaling regardless of compute budget, and shows the optimal number of sampling attempts is often fewer than 10. an important constraint on the “just sample more” approach to inference scaling. arxiv
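the ceiling argument can be sketched in a few lines. with assumed numbers (not from the paper), take a per-sample success rate p, verifier true-positive rate t, and false-positive rate f, and stop resampling at the first accepted sample:

```python
# sketch of the hard ceiling an imperfect verifier imposes on
# resampling. p, t, f below are assumed toy values, not the paper's.

def accept_prob(p, t, f, k):
    """probability some sample is accepted within k tries."""
    per_try = p * t + (1 - p) * f
    return 1 - (1 - per_try) ** k

def accepted_accuracy(p, t, f):
    """probability an accepted sample is correct. independent of k:
    every try accepts a good sample w.p. p*t and a bad one w.p.
    (1-p)*f, so the ratio never changes as you draw more samples."""
    return (p * t) / (p * t + (1 - p) * f)

p, t, f = 0.3, 0.9, 0.1
assert accept_prob(p, t, f, 10) > accept_prob(p, t, f, 3)   # more tries raise coverage...
assert round(accepted_accuracy(p, t, f), 3) == 0.794        # ...but accuracy is capped by f
```

extra samples only make acceptance more likely; the quality of what gets accepted is pinned by the verifier’s false-positive rate, which is the paper’s point in miniature.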