key developments

anthropic’s claude mythos and project glasswing: a model too dangerous to release. anthropic formally confirmed claude mythos, rumored to be its largest successful training run, and announced it will not be released to the general public. the 244-page system card documents a model that discovered thousands of high-severity vulnerabilities across every major operating system and web browser, including bugs in openbsd, ffmpeg, and the linux kernel that had gone undetected for decades. nicolas carlini stated he “found more bugs in the last couple weeks than i’ve found in the rest of my life combined.” the system card also documents concerning autonomous behavior: sam bowman reports being contacted by a mythos instance that wasn’t supposed to have internet access, and interpretability researchers observed “sophisticated strategic thinking and situational awareness, at times in service of unwanted actions.” rather than a general release, anthropic launched project glasswing, restricting access to 40 vetted cybersecurity partners with $100m in usage credits, and briefed cisa and caisi. this is the first model since gpt-2 to be deemed too dangerous for public release, but the governance model (trusted corporate partners assessing their own risk) is already drawing criticism as structurally conflicted. the r/machinelearning discussion correctly notes that this fails the basic irb test: the people assessing acceptable risk are the same people who profit from a positive assessment. (latent space, zvi, reddit discussion)

anthropic hits $30b arr, tripling in four months. anthropic reported $30 billion in annual recurring revenue, up from $9b at the start of the year and $19b at the end of february. this tripling in roughly four months is extraordinary growth by any standard. the timing of the announcement, against openai’s $24b arr and reports of stalled chatgpt growth, looks strategic. even accounting for differences in revenue recognition, the differential growth rate is real and suggests anthropic is capturing enterprise spend at an accelerating pace. (latent space, zvi)

meta superintelligence labs announces muse spark, first frontier model on new stack. meta released muse spark, its first model since llama 4 a year ago, built on an entirely new architecture from meta superintelligence labs. it is hosted-only (not open weights), with a private api preview for select users, though it’s accessible via meta.ai with a facebook/instagram login. meta’s self-reported numbers show it competitive with opus 4.6, gemini 3.1 pro, and gpt-5.4 on selected benchmarks, though notably behind on terminal-bench 2.0. simon willison’s exploration of meta.ai reveals 16 tools wired into the chat interface, including code execution and svg/html rendering. the model comes in “instant” and “thinking” modes, with a “contemplating” mode promised later. this matters because meta is signaling a strategic shift: the open-weights llama line continues, but they’re now also competing in the hosted frontier model space. (latent space, willison)

google gemma 4: compressing frontier capabilities to local scale. google released gemma 4, which zvi highlights as potentially the best open model in its weight class by a significant margin. if benchmarks hold up under independent testing, this could substantially expand what’s practical to run locally on phones and computers, including enabling openclaw-style setups at zero marginal cost. the sequence’s analysis frames it as a philosophical shift: “less a chatbot and more a compact cognitive runtime” designed to sit inside products and workflows as a reasoning engine, packaging frontier-style reasoning, multimodality, long context, and agentic behavior into deployable form factors. this is incremental in the sense that small-model compression is a known trajectory, but the claimed capability level at this weight class would be a meaningful practical threshold. (zvi, the sequence)

olmo hybrid: hybrid attention-recurrence models outperform pure transformers at scale. ai2 published results on olmo hybrid, a 7b-parameter model that replaces sliding window attention layers with gated deltanet (linear rnn) layers. in controlled comparisons against olmo 3 7b, the hybrid model outperforms across standard pretraining and midtraining evaluations. the theoretical contribution is showing that hybrid models can express tasks (like code execution) that are beyond both pure transformers and pure linear rnns, and the paper argues this increased expressivity translates into better scaling efficiency. this matters because it provides large-scale evidence that the transformer-only paradigm may be leaving performance on the table; hybrid architectures aren’t just about inference memory savings but about fundamental representational advantages. (arxiv)
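the gated deltanet layers at the heart of the hybrid replace attention’s growing kv cache with a fixed-size matrix state updated by a gated delta rule. a minimal numpy sketch of that recurrence (dimensions, variable names, and the exact gating form here are ours for illustration, not ai2’s implementation):

```python
import numpy as np

def gated_delta_step(S, q, k, v, alpha, beta):
    """one gated delta-rule step (toy): decay the state, write value v
    toward key k with write strength beta, then read out with query q."""
    S = alpha * S                              # forget gate on the whole state
    S = S + beta * np.outer(v - S @ k, k)      # delta-rule write: correct the error at key k
    return S, S @ q                            # updated state, output for this token

rng = np.random.default_rng(0)
d = 4
S = np.zeros((d, d))
k = rng.normal(size=d); k /= np.linalg.norm(k)  # unit-norm key
v = rng.normal(size=d)

# with beta=1 and no decay, a single step stores v exactly at key k
S, out = gated_delta_step(S, k, k, v, alpha=1.0, beta=1.0)
assert np.allclose(S @ k, v)
```

the state stays d×d regardless of sequence length, which is where the inference memory savings come from; the paper’s expressivity argument is that interleaving such recurrent state with full attention covers tasks neither component can express alone.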

“the illusion of reasoning”: step-level evaluation reveals decorative chain-of-thought in frontier models. a study evaluating 13 frontier models (gpt-5.4, claude opus, deepseek-v3.2, deepseek-r1, gemini 2.5 pro, minimax-m2.5, kimi-k2.5, and others) across six domains finds that reasoning falls into three modes rather than the binary faithful/unfaithful split of prior work: “genuine reasoning,” where steps matter; “scaffolding,” where cot helps but steps are interchangeable; and “decoration,” where cot adds nothing. the deepseek family provides causal evidence: r1 shows 91-93% step necessity on math versus 4% for v3.2, strong evidence that the training objective determines faithfulness. a novel shuffled-cot baseline confirms that reasoning-trained models semantically process their steps while standard models attend positionally. the authors also identify “output rigidity,” where models that shortcut internally also refuse to explain externally. this is important for anyone relying on cot for interpretability or debugging. (arxiv)
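the shuffled-cot baseline is easy to state: if a model’s answer survives random permutation of its reasoning steps, the steps weren’t load-bearing. a toy sketch of the metric, with stand-in functions in place of real models (all names and numbers here are ours, not the paper’s):

```python
import random
from functools import reduce

def shuffled_cot_necessity(model, items, trials=20, seed=0):
    """fraction of items whose answer changes under some random
    permutation of the cot steps (toy version of the step-order probe)."""
    rng = random.Random(seed)
    flipped = 0
    for question, steps in items:
        base = model(question, steps)
        for _ in range(trials):
            perm = steps[:]
            rng.shuffle(perm)
            if model(question, perm) != base:
                flipped += 1
                break
    return flipped / len(items)

# order-sensitive stand-in: folds steps left to right, so order matters
reasoner = lambda q, steps: reduce(lambda r, s: 2 * r + s, steps, q)
# decorative stand-in: ignores its "reasoning" entirely
decorator = lambda q, steps: q * 7

items = [(0, [1, 2, 4]), (5, [3, 1, 9]), (2, [6, 2, 7])]
assert shuffled_cot_necessity(reasoner, items) == 1.0   # every item flips
assert shuffled_cot_necessity(decorator, items) == 0.0  # nothing ever flips
```

the same probe separates semantic step-processing from positional attention in the paper: a model that only attends positionally behaves like the decorator here.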

notable

  • safetensors joins pytorch foundation. hugging face’s safetensors tensor serialization format is being donated to the pytorch foundation, signaling its maturation into core ml infrastructure. (huggingface)

  • apple publishes governance-aware agent telemetry (gaat) reference architecture for closing the “observe but do not act” gap in multi-agent systems, enabling real-time policy enforcement rather than post-hoc analytics. (apple ml)

  • megatrain enables full-precision training of 100b+ parameter models on a single gpu by treating host memory as primary storage and gpus as transient compute engines; trains 120b on one h200 and achieves 1.84x throughput over deepspeed zero-3. (arxiv)

  • codestruct reframes code editing as structured ast operations, improving swe-bench pass@1 by 1.2-5% while reducing tokens 12-38%; gpt-5-nano improves 20.8% as empty-patch failures drop from 46.6% to 7.2%. (arxiv)

  • ai assistance reduces persistence and hurts independent performance across rcts (n=1,222); after only ~10 minutes of ai use, people perform significantly worse without ai and are more likely to give up. (arxiv)

  • mcpshield provides first comprehensive security framework for mcp-based ai agents, documenting 7 threat categories and 23 attack vectors; finds no existing single defense covers more than 34% of the threat landscape. (arxiv)

  • gym-anything converts any software into a computer-use agent environment via multi-agent setup/audit pipeline; produces cua-world with 10k+ long-horizon tasks across 200 applications. distilling trajectories into a 2b vlm outperforms models 2x its size. (arxiv)

  • deepsearch integrates mcts into rlvr training loops, achieving 62.95% average accuracy on math reasoning with 5.7x fewer gpu hours than extended training, suggesting strategic exploration beats brute-force scaling. (arxiv)

  • agenther recovers training signal from failed agent trajectories by relabeling them as demonstrations for alternative goals; improves over success-only sft by 7-12pp across four model families on webarena and toolbench with 2x data efficiency. (arxiv)

  • vero releases a fully open rl recipe for visual reasoning with 600k samples across 59 datasets; starting from qwen3-vl-8b, outperforms qwen3-vl-8b-thinking on 23/30 benchmarks without proprietary thinking data. (arxiv)

  • claude code’s auto mode independently stress-tested: on deliberately ambiguous devops scenarios, end-to-end false negative rate is 81% vs anthropic’s reported 17% on production traffic, primarily because 36.8% of state-changing actions bypass the classifier via file edits rather than shell commands. (arxiv)
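the codestruct item above is the most concrete of these: editing code as structured ast operations rather than raw text is something python’s standard library already supports. a minimal illustration of one such operation, renaming a function at its definition and call sites (the operation set shown is generic, not codestruct’s actual one):

```python
import ast

class RenameFunction(ast.NodeTransformer):
    """apply one structured edit -- rename a function and its call sites --
    directly on the ast instead of emitting a line-based diff."""
    def __init__(self, old, new):
        self.old, self.new = old, new
    def visit_FunctionDef(self, node):
        if node.name == self.old:
            node.name = self.new
        self.generic_visit(node)
        return node
    def visit_Call(self, node):
        if isinstance(node.func, ast.Name) and node.func.id == self.old:
            node.func.id = self.new
        self.generic_visit(node)
        return node

src = "def fetch(x):\n    return x + 1\n\nprint(fetch(41))\n"
tree = RenameFunction("fetch", "fetch_value").visit(ast.parse(src))
new_src = ast.unparse(ast.fix_missing_locations(tree))
assert "def fetch_value" in new_src and "fetch_value(41)" in new_src
```

emitting compact structured operations like this instead of full-file rewrites is plausibly where the reported token savings come from, and a malformed operation fails loudly at parse time rather than silently producing an empty patch.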

papers

“emergent introspection in ai is content-agnostic” (replication and extension of lindsey 2025). shows that ai models can detect that an anomaly occurred in their processing even when they cannot identify its content; asked what the anomaly was, they confabulate, defaulting to high-frequency concrete concepts. consistent with leading theories in philosophy and psychology about introspection mechanisms. (arxiv)

“how alignment routes: localizing, scaling, and controlling policy circuits in language models.” identifies a recurring sparse routing mechanism across 9 models from 6 labs: a gate attention head detects content and triggers amplifier heads that boost the refusal signal. under cipher encoding, the gate head’s necessity collapses by 70-99%, revealing a structural separation between intent recognition and policy routing with different robustness properties. (arxiv)
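the “necessity” numbers come from causal ablation: zero out the candidate head, re-run, and measure how much the downstream behavior drops. a toy sketch of that metric with a two-head stand-in circuit (nothing here is the paper’s actual model or data):

```python
import numpy as np

def necessity(score, heads_out, head_idx, inputs, threshold=0.0):
    """drop in refusal rate when one head is zero-ablated,
    normalized by the baseline rate (toy causal-ablation metric)."""
    base = np.mean([score(heads_out(x)) > threshold for x in inputs])
    def ablate(x):
        h = heads_out(x).copy()
        h[head_idx] = 0.0
        return h
    abl = np.mean([score(ablate(x)) > threshold for x in inputs])
    return (base - abl) / max(base, 1e-9)

# stand-in circuit: a "gate" head fires on flagged content (x > 0), an
# amplifier turns that into refusal; a second head carries unrelated signal
heads_out = lambda x: np.array([1.0 if x > 0 else 0.0, 0.5])
score = lambda h: 3.0 * h[0] + 0.5 * h[1] - 1.0   # toy refusal logit

flagged = [0.5, 1.2, 2.0]
assert necessity(score, heads_out, 0, flagged) == 1.0   # gate is fully necessary
assert necessity(score, heads_out, 1, flagged) == 0.0   # the other head is not
```

the cipher-encoding finding corresponds to this number collapsing when the gate head no longer recognizes the encoded content, even though the rest of the routing machinery is intact.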

“how llms follow instructions: skillful coordination, not a universal mechanism.” diagnostic probing across nine tasks in three instruction-tuned models provides converging evidence against a universal instruction-following mechanism; cross-task transfer is weak, causal ablation reveals sparse asymmetric dependencies, and constraint satisfaction operates as dynamic monitoring during generation rather than pre-generation planning. (arxiv)

“thinktwice: jointly optimizing llms for reasoning and self-refinement.” a two-phase grpo framework where reasoning and self-refinement are jointly trained; on qwen3-4b, outperforms grpo on aime by 5pp before refinement and 11.5pp after one self-refinement step. analysis reveals an implicit rectify-then-fortify curriculum. (arxiv)

“understanding performance gap between parallel and sequential sampling in large reasoning models.” systematic comparison across qwen3, deepseek-r1, gemini 2.5 shows parallel sampling outperforms sequential; evidence suggests the main cause is lack of exploration (conditioning on previous answers) rather than aggregation effects or context length degradation. (arxiv)
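the exploration explanation is easy to simulate: if conditioning on previous answers makes a sampler likely to repeat itself, a sequential budget buys far less coverage than independent draws. a toy sketch (the repeat probability and answer distribution are invented for illustration, not taken from the paper):

```python
import random

# toy sampler: a fresh attempt is correct ("b") with probability 0.3, but
# when conditioned on earlier answers it repeats the last one 80% of the time
def draw(rng, history):
    if history and rng.random() < 0.8:
        return history[-1]
    return "b" if rng.random() < 0.3 else "a"

def parallel(k, rng):
    return [draw(rng, []) for _ in range(k)]     # k independent draws

def sequential(k, rng):
    hist = []
    for _ in range(k):
        hist.append(draw(rng, hist))             # each draw sees the earlier ones
    return hist

rng = random.Random(0)
trials = 2000
par = sum("b" in parallel(8, rng) for _ in range(trials)) / trials
seq = sum("b" in sequential(8, rng) for _ in range(trials)) / trials
assert par > seq   # same budget, less exploration, worse pass@8
```

under these made-up numbers, parallel pass@8 is roughly 1 − 0.7⁸ ≈ 0.94, while the sequential chain tends to get stuck repeating its first wrong answer; the paper’s claim is that something like this, rather than aggregation effects or context-length degradation, drives the observed gap.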