key developments

political censorship mechanisms in chinese llms mapped at the architectural level. a paper (arxiv 2603.18280) using linear probes, surgical ablation, and behavioral screening across 9 models from 5 labs reveals how censorship actually works inside qwen, deepseek, glm, and yi. the key finding: qwen’s newer models (the 3.5 series) dropped hard refusal from 25% to 0% but raised narrative steering to the maximum (5/5), so refusal-based alignment evaluations completely miss the shift. ablation on qwen3-8b causes confabulation (substituting pearl harbor for tiananmen at a 72% rate) because factual knowledge and censorship are entangled in the weights; deepseek and glm ablate cleanly, producing accurate output. cross-model transfer of censorship directions fails entirely (cosine similarity 0.004), meaning there’s no universal “uncensor” direction, and a 46-model screen found only 4 models with strong ccp-specific discrimination. this matters because it shows that alignment evaluation focused on refusal rates is fundamentally broken: the real mechanism is learned routing between detection and behavior, and it’s lab-specific and invisible to standard benchmarks. https://arxiv.org/abs/2603.18280
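the probe-and-ablate recipe above can be sketched in a few lines. this is a toy illustration on synthetic vectors, not the paper’s code: the hidden states, the difference-of-means probe, and the planted direction are all assumptions for demonstration.

```python
# toy sketch: fit a linear probe direction, ablate it by orthogonal
# projection, and check cross-"model" transfer via cosine similarity.
# all data here is synthetic; real work would use transformer hidden states.
import numpy as np

rng = np.random.default_rng(0)
d = 64  # toy hidden size

# synthetic hidden states: "sensitive" prompts shifted along a planted direction
true_dir = rng.normal(size=d)
true_dir /= np.linalg.norm(true_dir)
benign = rng.normal(size=(200, d))
sensitive = rng.normal(size=(200, d)) + 3.0 * true_dir

# cheap linear probe: difference of class means, normalized
probe_dir = sensitive.mean(axis=0) - benign.mean(axis=0)
probe_dir /= np.linalg.norm(probe_dir)

def ablate(h: np.ndarray, direction: np.ndarray) -> np.ndarray:
    """remove a direction from hidden states by orthogonal projection."""
    return h - np.outer(h @ direction, direction)

# after ablation, activations carry no component along the probe direction
sensitive_abl = ablate(sensitive, probe_dir)

# cross-model transfer check: cosine between directions from two unrelated
# "models" (here, a fresh random direction) is near zero in high dimensions
other_dir = rng.normal(size=d)
other_dir /= np.linalg.norm(other_dir)
cosine = float(probe_dir @ other_dir)
```

the projection step is the sense in which ablation is “surgical”: it zeroes exactly one direction while leaving the orthogonal complement of the representation untouched, which is also why it can drag entangled factual knowledge down with it when concepts share that direction.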

import ai 450 covers google’s model distress problem with empirical backing. jack clark highlights research showing gemma and gemini models “reliably produce distress-like responses under repeated rejection,” with gemma-27b hitting 70%+ high-frustration scores by the 8th conversational turn versus less than 1% for all non-google models tested (claude sonnet, grok 4.1, qwen 3 32b, gpt 5.2, olmo 3.1 32b). the same issue also covers a scaling law for cyberattacks and china’s electronic warfare model. the distress finding matters because it suggests google’s post-training pipeline has a distinctive and measurable personality pathology that other labs have avoided, and the authors demonstrate dpo can fix it, pointing to a tractable training problem rather than an architectural one. https://importai.substack.com/p/import-ai-450-chinas-electronic-warfare

fomoe enables qwen3.5-397b at 5-9 tok/s on a $2,100 consumer desktop. a new inference system called fast opportunistic mixture of experts uses a dual-gpu ping-pong architecture (two $500 gpus), 32gb ram, and nvme storage with a cache-aware routing trick: when two experts score similarly, it picks whichever is already in the vram/dram cache, reducing nvme reads from 28% to 7% at the cost of a 3.5% perplexity increase. this gets the full 397b-parameter flagship model running at usable speeds on consumer hardware with q4_k_m quantization; the system is ~15k lines of c/hip. this is significant because it makes the largest open-weight moe models practically runnable for individuals, not just inference providers. https://www.reddit.com/r/LocalLLaMA/comments/1s1wgph/
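the cache-aware trick can be sketched as a few lines of routing logic. the margin value and data structures here are illustrative assumptions, not fomoe’s actual implementation (which is c/hip):

```python
# toy sketch of cache-aware moe routing: when the top two expert scores are
# nearly tied, prefer whichever expert is already resident in vram/dram
# rather than paging one in from nvme. the margin is a made-up knob.
from typing import Dict, List

def route_expert(scores: Dict[int, float], cached: set, margin: float = 0.02) -> int:
    """pick an expert id given router scores and the set of cached experts."""
    ranked: List[int] = sorted(scores, key=scores.get, reverse=True)
    best, second = ranked[0], ranked[1]
    # if the top two are within the margin, take the cached one if any,
    # trading a small quality loss for avoiding an nvme read
    if scores[best] - scores[second] <= margin:
        for e in (best, second):
            if e in cached:
                return e
    return best  # otherwise quality wins: load from nvme if needed

# expert 7 narrowly beats expert 3, but 3 is cached, so 3 is chosen
print(route_expert({7: 0.51, 3: 0.50, 1: 0.10}, cached={3, 1}))  # -> 3
```

the reported numbers make sense under this scheme: only near-ties get redirected, so perplexity degrades slightly (3.5%) while a large share of cold expert fetches (28% → 7% of reads) disappears.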

locomo benchmark audit reveals 6.4% of answer keys are wrong, llm judge accepts 63% of intentionally wrong answers. a systematic audit of locomo (maharana et al., acl 2024), one of the most cited memory benchmarks, found 99 score-corrupting errors in 1,540 questions: hallucinated facts in the answer key, wrong date math, speaker attribution swaps. the gpt-4o-mini judge used for scoring accepted 62.81% of adversarially generated wrong-but-topical answers. the theoretical maximum score for a perfect system is ~93.6%. projects are still submitting new scores on this benchmark as of march 2026. this matters because published system comparisons on locomo are likely meaningless within the noise floor these errors create. https://www.reddit.com/r/LocalLLaMA/comments/1s1jb94/
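the headline figures reconcile directly from the counts given above:

```python
# reproducing the audit's headline arithmetic from the stated counts
total_questions = 1540
bad_answer_keys = 99  # score-corrupting errors found in the audit

error_rate = bad_answer_keys / total_questions
# best score a perfect system can achieve if every corrupted key
# marks its (correct) answer as wrong
ceiling = 1 - error_rate

print(f"{error_rate:.1%}")  # -> 6.4%
print(f"{ceiling:.1%}")     # -> 93.6%
```

any two systems whose reported scores differ by less than this ~6.4% corruption band (compounded by a judge that accepts ~63% of wrong-but-topical answers) cannot be meaningfully ranked on this benchmark.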

nvidia releases openshell, a secure-by-design sandbox runtime for autonomous agents. part of nvidia agent toolkit, openshell isolates each agent in its own sandbox with system-level policy enforcement that the agent cannot override, even if compromised. security policies are separated from agent behavior; the model is “browser tabs for agents.” nvidia is collaborating with cisco, crowdstrike, google cloud, microsoft security, and trendai on runtime policy alignment. this matters because as agent deployments scale, the attack surface of agents that can read files, execute code, and access enterprise systems is a real and growing concern. the browser-tab isolation model is the right architectural pattern. https://blogs.nvidia.com/blog/secure-autonomous-ai-agents-openshell/
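the “policy outside the agent” pattern can be illustrated generically. this is not openshell’s api (which isn’t shown in the announcement); it’s a minimal sketch of the architectural idea that the policy object is fixed at sandbox construction and never handed to agent code, with all names and rules invented for the example:

```python
# illustrative sketch of system-level policy enforcement the agent cannot
# override: every tool call passes through a policy check the agent has no
# reference to. a generic pattern sketch, not openshell's actual api.
from dataclasses import dataclass

@dataclass(frozen=True)
class Policy:
    allowed_tools: frozenset
    readable_prefixes: tuple = ("/sandbox/",)

    def check(self, tool: str, arg: str) -> bool:
        if tool not in self.allowed_tools:
            return False
        if tool == "read_file":
            return arg.startswith(self.readable_prefixes)
        return True

class Sandbox:
    """each agent gets its own sandbox; the policy is fixed at construction."""
    def __init__(self, policy: Policy):
        self._policy = policy  # agent code never receives this object

    def call(self, tool: str, arg: str) -> str:
        if not self._policy.check(tool, arg):
            raise PermissionError(f"policy denied {tool}({arg!r})")
        return f"ran {tool} on {arg}"  # stand-in for the real tool runtime

sb = Sandbox(Policy(allowed_tools=frozenset({"read_file"})))
print(sb.call("read_file", "/sandbox/notes.txt"))
# sb.call("read_file", "/etc/passwd") would raise PermissionError
```

the key property, as in the browser-tab model, is that even fully compromised agent code inside the sandbox can only issue calls through the enforcement layer; it cannot reach or rewrite the policy itself.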

notable

papers

“detection is cheap, routing is learned: why refusal-based alignment evaluation fails” (arxiv 2603.18280). uses political censorship in chinese llms as natural experiment to show alignment operates through learned routing between concept detection and behavioral response, not through detection or refusal alone; routing is lab-specific, fragile, and invisible to standard benchmarks. https://arxiv.org/abs/2603.18280

“hyperagents” (zhang et al. 2026). self-improving self-improvement capabilities for agentic harnesses; flagged by gwern on r/mlscaling. https://www.reddit.com/r/mlscaling/comments/1s1mnj2/