key developments

anthropic withholds claude mythos from public release, launches project glasswing for coordinated vulnerability patching. anthropic announced its latest model, claude mythos, but is not releasing it publicly; instead, it is available only to a restricted set of security partners under “project glasswing.” the reason: mythos’s cybersecurity capabilities are advanced enough that anthropic believes the software industry needs time to patch before the model proliferates. the model has already found thousands of high-severity vulnerabilities across every major operating system and web browser. the technical details are striking: mythos autonomously wrote a browser exploit chaining four vulnerabilities with a jit heap spray that escaped both the renderer and os sandboxes; it exploited subtle race conditions for local privilege escalation on linux; and it wrote a remote root exploit against freebsd’s nfs server using a 20-gadget rop chain split across multiple packets. claude opus 4.6 had near-0% success rates on these same tasks, so this is a genuinely new capability threshold. the decision to restrict rather than ship is also a first: no major lab has previously held back a general-purpose model specifically because of offensive security capabilities rather than general safety concerns. zvi mowshowitz’s analysis frames this as the dominant story of the day. (simonwillison.net, thezvi.substack.com)

anthropic reportedly passed openai in annualized revenue, reaching $30b arr vs openai’s $24b. according to saastr’s analysis of recent disclosures, anthropic hit $30 billion in annualized run-rate revenue, up from $9 billion at year-end 2025, while openai sits at $24 billion ($2b/month). a year ago anthropic was at roughly $1b and openai at $6b. the gap closed through enterprise api contracts, developer adoption, and claude code rather than consumer scale. the wsj reportedly published confidential financials showing anthropic spends roughly a quarter of what openai does on training. these numbers should be treated with caution: saastr is not a primary financial source, and “annualized run-rate” can be misleading during periods of rapid growth. but if directionally correct, this is a meaningful shift in the competitive landscape and a validation of the enterprise-first strategy. (saastr.com)
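to make the run-rate caveat concrete, here is a small sketch with hypothetical numbers: annualized run-rate extrapolates the latest month, so for a fast-growing company it can substantially overstate what was actually booked over the trailing year.

```python
# illustration of why "annualized run-rate" (arr) can overstate trailing
# revenue during rapid growth. the growth rate and units below are
# hypothetical, chosen only to show the shape of the effect.

def arr(latest_month_revenue: float) -> float:
    """annualized run-rate: extrapolate the most recent month."""
    return latest_month_revenue * 12

def trailing_12m(monthly: list[float]) -> float:
    """revenue actually booked over the last 12 months."""
    return sum(monthly[-12:])

# a company growing ~12% month over month for a year
monthly = [1.0]
for _ in range(11):
    monthly.append(monthly[-1] * 1.12)

print(round(arr(monthly[-1]), 1))       # run-rate from the latest month
print(round(trailing_12m(monthly), 1))  # what was actually earned
```

at steady (zero-growth) revenue the two numbers coincide; the faster the growth, the larger the gap between the headline arr and trailing revenue.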

openai’s frontier team runs a 1m+ loc codebase with zero human-written code and zero human review. ryan lopopolo of openai’s frontier team detailed their approach, which they call “harness engineering,” on a latent space podcast. the team built and shipped an internal beta product over five months with no manually written code and no human code review before merge. when an agent failed, rather than prompt-engineering around it, they asked “what capability, context, or structure is missing?” and built infrastructure accordingly. the system, called symphony, is a multi-agent orchestration framework written in elixir in which codex agents are prompted with the specificity of full prd specs; it consumes over 1 billion tokens per day (roughly $2-3k/day). this matters because it demonstrates a concrete production workflow in which the human role has shifted entirely from writing and reviewing code to designing agent infrastructure. lopopolo calls it “borderline negligent” not to operate at this token volume. (latent.space)
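symphony itself is internal (and in elixir), so nothing below comes from openai; this is a minimal, hypothetical python sketch of the harness-engineering pattern described above: when an agent fails, the harness records which capability was missing and gets extended, instead of the prompt being reworded.

```python
# hypothetical sketch of the "harness engineering" loop. all names and
# interfaces here are invented for illustration -- the point is the control
# flow: agent failure -> log missing capability -> extend the harness.

from dataclasses import dataclass, field

@dataclass
class Task:
    spec: str                                     # full prd-style spec for the agent
    needs: set[str] = field(default_factory=set)  # capabilities the task requires

@dataclass
class Harness:
    capabilities: set[str] = field(default_factory=set)
    gaps: list[str] = field(default_factory=list)  # infrastructure backlog

    def run(self, task: Task, agent) -> bool:
        missing = task.needs - self.capabilities
        if missing:
            # don't prompt-engineer around the failure; record what's missing
            self.gaps.extend(sorted(missing))
            return False
        return agent(task.spec)

    def extend(self, capability: str) -> None:
        self.capabilities.add(capability)

# usage: a stub agent that succeeds once the harness supports the task
agent = lambda spec: True
h = Harness(capabilities={"repo_access"})
t = Task(spec="implement billing page per prd", needs={"repo_access", "browser"})

assert not h.run(t, agent)   # fails: harness lacks "browser"
h.extend(h.gaps.pop())       # build the missing infrastructure instead
assert h.run(t, agent)       # now the agent completes the task
```

the design choice this illustrates is that failure feedback accumulates in the harness (a durable capability set) rather than in per-task prompt tweaks.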

z.ai releases glm-5.1, a 754b parameter mit-licensed open model. chinese lab z.ai released glm-5.1, a 754b parameter (1.51tb on hugging face) model under the mit license. simon willison tested it and found it notably capable, particularly at generating complex svg with css animations and self-correcting rendering issues through conversation. the model is available via openrouter. at 754b parameters with a fully open license, this is one of the largest openly available models. the significance is less about any single benchmark and more about the continued expansion of the open-weight frontier from chinese labs. (simonwillison.net)
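since the model is served via openrouter’s openai-compatible chat completions endpoint, querying it looks like any other openrouter call; a sketch follows. the model slug `z-ai/glm-5.1` is a guess and should be checked against openrouter’s model list before use.

```python
# sketch of querying glm-5.1 through openrouter. the endpoint is
# openrouter's standard openai-compatible one; the model slug
# "z-ai/glm-5.1" is an assumption -- verify it on openrouter first.
import json

OPENROUTER_URL = "https://openrouter.ai/api/v1/chat/completions"

def build_request(prompt: str, api_key: str) -> tuple[dict, bytes]:
    """assemble headers and json body for a chat completion request."""
    headers = {
        "Authorization": f"Bearer {api_key}",
        "Content-Type": "application/json",
    }
    body = json.dumps({
        "model": "z-ai/glm-5.1",   # assumed slug
        "messages": [{"role": "user", "content": prompt}],
    }).encode()
    return headers, body

# build (but don't send) a request like willison's svg-generation test
headers, body = build_request("generate an animated svg scene", "sk-...")
print(json.loads(body)["model"])
```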

coral multi-agent system outperforms alphaevolve on erdős minimum overlap problem. a new autonomous multi-agent infrastructure called coral demonstrated that giving agents more autonomy and less rigid structure can outperform tightly constrained evolutionary approaches. on the erdős minimum overlap problem, coral achieved a 2.5x higher improvement rate and 10x faster evolution than openevolve using the same backbone model (opus 4.6), ultimately reaching a better final score. on anthropic’s kernel benchmark, four agents improved the best known result from 1363 to 1103 cycles (lower is better). the core claim is that agents given freedom to explore, reflect, and iterate reach better results than constrained setups like alphaevolve. this is a single system with limited evaluation, but the results on established benchmarks make it worth tracking. (reddit)

notable

  • mistral released voxtral tts, a 4b parameter open-weight text-to-speech model that clones voices from 3 seconds of audio across 9 languages, beating elevenlabs flash v2.5 with a 68.4% human preference win rate. runs on 3gb ram. weights on hugging face under cc by-nc. (mistral.ai)

  • openai, anthropic, and google reportedly forming joint effort to combat model copying in china, per bloomberg. details sparse but signals coordinated industry response to ip concerns. (bloomberg via reddit)

  • gemma 4 31b tops kimi k2.5 and grok 4.20 on duellab’s competitive coding leaderboard (53.9 vs 50.5 vs 46.8), notable mainly because an open 31b model is beating much larger proprietary models on this specific axis. (reddit)

  • langchain released deep agents v0.5 with async (non-blocking) subagents that run in the background on remote servers, enabling parallel delegation for long-running tasks. moves from blocking to stateful async orchestration. (blog.langchain.com)
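this is not the deep agents api, but the blocking-to-non-blocking shift it describes can be illustrated with plain asyncio: subagents are launched as background tasks and the orchestrator only waits when it actually needs the results, so long-running delegations overlap instead of serializing.

```python
# generic asyncio sketch of non-blocking subagent delegation. names are
# invented for illustration; the real deep agents implementation runs
# subagents on remote servers, which asyncio.sleep stands in for here.
import asyncio
import time

async def subagent(name: str, seconds: float) -> str:
    await asyncio.sleep(seconds)   # stand-in for a long-running remote task
    return f"{name}: done"

async def main() -> list[str]:
    # launch both subagents without blocking, then gather when needed
    tasks = [asyncio.create_task(subagent(n, 0.1)) for n in ("research", "code")]
    return await asyncio.gather(*tasks)

start = time.monotonic()
results = asyncio.run(main())
elapsed = time.monotonic() - start
print(results, round(elapsed, 2))  # both finish in roughly one task's time
```

with blocking delegation the two 0.1s tasks would take ~0.2s; run concurrently they take ~0.1s, and the gap widens with more subagents.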

  • “the format tax” paper shows that asking open-weight llms to respond in json/xml/latex degrades reasoning accuracy substantially, with most degradation caused by format-requesting instructions in the prompt rather than constrained decoding. decoupling reasoning from formatting recovers most lost accuracy. closed-weight models mostly don’t have this problem. (arxiv)
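the paper’s proposed fix, decoupling reasoning from formatting, amounts to a two-pass pattern: reason with no format instructions in the prompt, then format the finished answer in a separate step. the sketch below uses a hard-coded stub in place of a real model, purely to show the control flow; the stub’s degraded behavior under format pressure is contrived for illustration.

```python
# sketch of decoupling reasoning from formatting. `llm` is a stub standing
# in for any chat model; its format-pressure failure mode is hard-coded
# here only to make the two code paths observable.
import json

def llm(prompt: str) -> str:
    if "reply only in json" in prompt.lower():
        return '{"answer": "?"}'   # format pressure degrades the reasoning
    return "the answer is 42"      # free-form reasoning succeeds

def coupled(question: str) -> str:
    # format instruction mixed into the reasoning prompt (the "format tax")
    return llm(f"{question}\nreply only in json.")

def decoupled(question: str) -> str:
    answer = llm(question)             # pass 1: reason, no format constraints
    return json.dumps({"answer": answer})  # pass 2: formatting only

print(coupled("what is 6 * 7?"))
print(decoupled("what is 6 * 7?"))
```

the second pass needs no reasoning at all (here it is plain `json.dumps`), which is why the paper finds decoupling recovers most of the lost accuracy.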

  • knowledge packs paper proposes zero-token rag by pre-computing kv caches, achieving identical results to standard rag with up to 95% token savings on qwen3-8b and llama-3.1-8b. also demonstrates behavioral steering via contrastive deltas on cached values. no training required. (arxiv)
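the token accounting behind “zero-token rag” is simple to sketch: the document’s kv cache is computed once offline, so each query pays only for its own tokens instead of re-encoding the document. the toy below uses a dict as a stand-in for real key/value tensors and shows where the ~95% figure comes from at plausible sizes.

```python
# toy illustration of zero-token rag accounting. the cache is a plain dict
# standing in for precomputed transformer kv tensors; sizes are invented
# to show the arithmetic, not taken from the paper's datasets.
cache: dict[str, str] = {}

def precompute(doc_id: str, doc_tokens: int) -> None:
    """one-time offline step: encode the document and store its kv cache."""
    cache[doc_id] = f"kv<{doc_tokens}>"   # placeholder for cached tensors

def per_query_tokens(doc_tokens: int, query_tokens: int, cached: bool) -> int:
    # standard rag re-encodes the document on every query
    return query_tokens if cached else doc_tokens + query_tokens

doc, query = 1900, 100
precompute("manual", doc)
standard = per_query_tokens(doc, query, cached=False)    # 2000 tokens
zero_token = per_query_tokens(doc, query, cached=True)   # 100 tokens
savings = 1 - zero_token / standard
print(f"{savings:.0%} token savings")                    # 95% at these sizes
```

the savings ceiling is set by the document-to-query token ratio, which is why the reported figure is “up to 95%” rather than a constant.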

  • self-distillation without any external signal improves code generation: “no reference answers, no teacher model, no reward model, no verifier, no execution environment, and no reinforcement learning of any kind.” (reddit/mlscaling)

  • a taxonomy of llm exploit triggers mapped across 10,000 trials: only “goal reframing” reliably triggers exploitation (38-40% on claude sonnet 4); nine other hypothesized attack dimensions, including incentives, authority appeals, and identity priming, produce no detectable effect. gpt-4.1 shows zero exploitation across 1,850 trials. (arxiv)

  • new yorker published an 18,000-word profile of sam altman and openai’s history, alongside openai’s policy proposal and their acquisition of tbpn. zvi’s analysis characterizes all three stories unfavorably. (thezvi.substack.com)

papers

  • “automated conjecture resolution with formal verification” (rethlas/archon framework). automatically resolved an open problem in commutative algebra and formally verified the proof in lean 4 with essentially no human involvement. combines informal reasoning with formal verification via structured task decomposition. (arxiv)
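the paper’s artifacts aren’t reproduced here, but as a minimal illustration of what “formally verified in lean 4” means, here is a trivially verified statement (not the paper’s commutative-algebra result): once the file compiles, the lean kernel has checked the proof end to end with no human judgment involved.

```lean
-- minimal illustration only: a verified statement in lean 4 core,
-- not the paper's result. `Nat.mul_comm` is a kernel-checked proof
-- that natural-number multiplication commutes.
theorem mul_comm_nat (a b : Nat) : a * b = b * a :=
  Nat.mul_comm a b
```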

  • “qed-nano: teaching a tiny model to prove hard theorems.” a 4b model post-trained for olympiad-level proofs using sft distillation, rl with rubric-based rewards, and a reasoning cache. surpasses much larger open models (nomos-1, gpt-oss-120b) and approaches gemini 3 pro at a fraction of inference cost. full pipeline released. (arxiv)

  • “truth as a compression artifact in language model training.” controlled experiments show llm truth preference tracks compressibility of errors, not truth per se. when false answers follow a single coherent alternative rule system, truth bias vanishes entirely. adding a second competing false rule restores it. proposes the compression-consistency principle. (arxiv)

  • “how alignment routes: localizing, scaling, and controlling policy circuits in language models.” identifies a sparse gate-amplifier routing mechanism for refusal across 9 models from 6 labs. reveals structural separation between intent recognition and policy routing; cipher encoding collapses routing while the model still represents harmful content at deeper layers. (arxiv)

  • “vero: an open rl recipe for general visual reasoning.” fully open vlm family using 600k rl samples across 59 datasets. vero from qwen3-vl-8b-instruct outperforms qwen3-vl-8b-thinking on 23/30 benchmarks without proprietary thinking data. ablations show different task categories produce distinct reasoning patterns that transfer poorly in isolation. (arxiv)

  • “don’t blink: evidence collapse during multimodal reasoning.” documents that reasoning vlms progressively lose visual grounding as they think, even while accuracy increases. attention to evidence regions drops by over half during reasoning. low-entropy, visually disengaged predictions are hazardous on visual-reference tasks but benign on symbolic ones. (arxiv)

  • “klong: training llm agent for extremely long-horizon tasks.” introduces trajectory-splitting sft and progressive rl for long-horizon tasks. klong (106b) surpasses kimi k2 thinking (1t) by 11.28% on paperbench and generalizes to swe-bench verified and mle-bench. (arxiv)