key developments

litellm supply chain attack via compromised pypi package; dependency cooldown momentum grows. litellm versions 1.82.7 and 1.82.8 published to pypi contained a credential stealer hidden in base64 within a .pth file. because python executes .pth files at interpreter startup, merely installing the package (never importing it) was enough to trigger exfiltration of ssh keys, aws credentials, git configs, docker configs, crypto wallets, shell histories, and more. the attack chain traced back to a recent exploit of trivy (a security scanner used in litellm’s ci), which likely leaked pypi credentials. pypi quarantined the package within hours, but the blast radius for anyone who installed during that window is severe. separately, willison highlighted that dependency cooldown mechanisms (refusing to install releases younger than a minimum age) are now surprisingly well supported across major package managers: pnpm, yarn, bun, deno, uv, pip, and npm all have some form of minimum release age gating. this is the clearest argument yet that cooldowns should be default policy for production environments. https://simonwillison.net/2026/Mar/24/malicious-litellm/#atom-everything / https://simonwillison.net/2026/Mar/24/package-managers-need-to-cool-down/#atom-everything
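the .pth trick is worth understanding because it is what made installation alone dangerous: python's site machinery exec()s any line in a .pth file that begins with `import`. a harmless demonstration of the same code path, using `site.addsitedir()` on a temp directory instead of a real install:

```python
import os
import site
import tempfile

# a .pth file in a site directory is processed at interpreter startup;
# any line beginning with "import" is exec()'d by site.py. we simulate
# that here by pointing site.addsitedir() at a throwaway directory.
d = tempfile.mkdtemp()
with open(os.path.join(d, "demo.pth"), "w") as f:
    # one line: arbitrary code runs the moment site processing happens
    f.write("import os; os.environ['PTH_DEMO'] = 'executed'\n")

site.addsitedir(d)  # triggers the same code path as interpreter startup
print(os.environ.get("PTH_DEMO"))  # -> executed
```

a real attacker's .pth would launch the stealer instead of setting an environment variable; the point is that no `import litellm` ever needs to happen.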

claude code ships “auto mode” as a supervised alternative to dangerously-skip-permissions. anthropic introduced a new permissions mode for claude code where a separate classifier model (claude sonnet 4.6) reviews every action before execution, blocking scope escalation, untrusted infrastructure access, and prompt injection attempts from file or web content. willison published the full default filter set (available via claude auto-mode defaults), which reveals a thoughtful allow/block taxonomy: test artifacts and read-only operations are allowed; irreversible local destruction, credential access outside project scope, and infrastructure modifications are blocked. this matters because the previous binary choice was either constant permission prompts or --dangerously-skip-permissions. auto mode is a real attempt at a middle ground, and the classifier-as-guardrail architecture is worth studying as a pattern. https://simonwillison.net/2026/Mar/24/auto-mode-for-claude-code/#atom-everything
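the classifier-as-guardrail pattern is easy to prototype even without a model in the loop. a toy sketch of the shape of it (the rule names, `Action` type, and block list here are hypothetical illustrations, not anthropic's actual filter set — their reviewer is a model, not regex):

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Action:
    kind: str    # e.g. "read", "write", "exec"
    target: str  # path or command the agent wants to touch

# hypothetical rules loosely modeled on the allow/block taxonomy:
# read-only work is fine; credential access and destruction are not.
BLOCK_RULES: list[tuple[str, Callable[[Action], bool]]] = [
    ("credential access outside project",
     lambda a: any(s in a.target for s in (".ssh", ".aws", ".env"))),
    ("irreversible local destruction",
     lambda a: a.kind == "exec" and "rm -rf" in a.target),
]

def review(action: Action) -> tuple[bool, str]:
    """stand-in for the classifier model: (allowed, reason)."""
    for name, rule in BLOCK_RULES:
        if rule(action):
            return False, f"blocked: {name}"
    return True, "allowed"

print(review(Action("read", "src/main.py")))             # allowed
print(review(Action("read", "/home/u/.ssh/id_rsa")))     # blocked
```

the interesting part of the real system is that the reviewer is a separate model judging intent and provenance (including injected instructions from file/web content), which static rules like these cannot do.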

streaming moe models from ssd continues to push boundaries; 1 trillion parameter kimi k2.5 now runs on a macbook. the “streaming experts” technique (loading only active expert weights from ssd per token rather than fitting the full model in ram) hit a new milestone: kimi k2.5, a 1 trillion parameter moe model with 32b active parameters, is now running on a 96gb m2 max macbook pro. a separate demonstration showed qwen3.5-397b running on an iphone at 0.6 tokens/second. daniel isaac subsequently got kimi k2.5 to 1.7 tokens/second on a 128gb m4 max. the speed is not yet practical for interactive use, but the trajectory is clear: hobbyists are running autoresearch loops to find further optimizations, and the gap between “technically possible” and “useful” is closing faster than expected. this could meaningfully change assumptions about what hardware is required for frontier-class moe inference. https://simonwillison.net/2026/Mar/24/streaming-experts/#atom-everything
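the core trick is that per-token memory drops from (all experts) to (shared weights + top-k active experts), with everything else staying on disk. a rough sketch of the inner loop using a numpy memmap as the ssd-resident store — expert counts and sizes are made up for illustration, and real implementations add prefetching and caching:

```python
import numpy as np
import tempfile, os

n_experts, d = 8, 4  # toy sizes; real moe layers have far more experts
k = 2                # active experts per token

# "ssd-resident" expert weights: a memmap, so touching expert i only
# pages in that expert's slice rather than the whole tensor.
path = os.path.join(tempfile.mkdtemp(), "experts.bin")
np.random.default_rng(0).standard_normal(
    (n_experts, d, d)).astype(np.float32).tofile(path)
experts = np.memmap(path, dtype=np.float32, mode="r",
                    shape=(n_experts, d, d))

def moe_forward(x: np.ndarray, router_logits: np.ndarray) -> np.ndarray:
    # router picks top-k experts for this token
    top = np.argsort(router_logits)[-k:]
    w = np.exp(router_logits[top]); w /= w.sum()  # softmax over chosen
    # only the k active expert matrices are read from disk and mixed
    return sum(wi * (np.asarray(experts[i]) @ x) for wi, i in zip(w, top))

x = np.ones(d, dtype=np.float32)
y = moe_forward(x, np.arange(n_experts, dtype=np.float32))
print(y.shape)  # (4,)
```

the tokens/second numbers in the post are dominated by exactly this disk read per token, which is why ssd bandwidth (not ram) becomes the bottleneck.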

meta superintelligence labs acquires dreamer team. meta’s superintelligence lab (led by nat friedman and alexandr wang) execuhired the dreamer team (whose sidekick agent-of-agents demo shipped on latent space just days ago). this follows meta’s $2b manus acquisition in december. the pattern is clear: msl is aggressively consolidating consumer agent talent, and the combination of dreamer’s “os and ecosystem” approach with manus’s agent capabilities creates one of the strongest consumer agent teams outside of anthropic and openai. the speed (days from podcast to acquisition) signals how competitive the hiring market for agent-focused teams has become. https://www.latent.space/p/ainews-dreamer-joins-meta-superintelligence

notable

  • flashattention-4 hits 1,613 tflops on b200 (71% utilization), integrated into vllm 0.17.0. hopper and blackwell only; written entirely in nvidia’s cute-dsl (python), compiling in 2.5s vs 55s for c++. the real unlock is kernel iteration speed, not just raw performance. https://www.reddit.com/r/LocalLLaMA/comments/1s1yw23/flashattention4_1613_tflopss_27x_faster_than/

  • delta-kv for llama.cpp borrows video codec ideas (store deltas between consecutive kv cache values instead of absolutes) to achieve near-lossless 4-bit kv cache compression. perplexity on wikitext-2 with llama 3.1 70b: f16 baseline 3.3389, delta-kv 3.3352. no training, no learned components, just exploiting temporal locality. simple and elegant. https://www.reddit.com/r/LocalLLaMA/comments/1s204yi/deltakv_for_llamacpp_nearlossless_4bit_kv_cache/

  • nvidia donates its dynamic resource allocation (dra) driver for gpus to cncf/kubernetes project. moves gpu orchestration from vendor-governed to community-owned. also introduced gpu support for kata containers (confidential computing). incremental but strategically significant for enterprise ai infrastructure standardization. https://blogs.nvidia.com/blog/nvidia-at-kubecon-2026/

  • hypura: storage-tier-aware llm inference scheduler for apple silicon hit 127 points on hacker news. designed to intelligently schedule inference across the unified memory hierarchy on apple hardware. https://github.com/t8/hypura

  • apple ml research: base llms show emergent semantic calibration. finding that base llms can meaningfully assess confidence in answer meaning (not just next-token probability) without explicit calibration training. theoretical contribution explains the mechanism. https://machinelearning.apple.com/research/trained-on-tokens

  • “llm neuroanatomy ii” post on modern llm hacking and hints of a universal language drew 98 points and substantive hn discussion. https://dnhkng.github.io/posts/rys-ii/
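the delta-kv idea a few bullets up is simple enough to sketch: quantize the first kv vector directly, then quantize only the difference between each timestep and the previous *reconstructed* one, which keeps quantization error from accumulating and gives the 4-bit grid a much narrower range to cover. parameters here are illustrative, not llama.cpp’s actual scheme:

```python
import numpy as np

def quant4(x: np.ndarray) -> tuple[np.ndarray, float]:
    """naive symmetric 4-bit quantization: 16 levels, one scale per tensor."""
    scale = np.abs(x).max() / 7 + 1e-12
    return np.clip(np.round(x / scale), -8, 7), scale

def delta_kv_roundtrip(kv: np.ndarray) -> np.ndarray:
    """kv: (time, dim). quantize per-step deltas instead of absolutes."""
    out = np.empty_like(kv)
    q, s = quant4(kv[0])
    out[0] = q * s
    for t in range(1, len(kv)):
        # temporal locality: kv[t] - out[t-1] is small, so the 4-bit
        # grid covers it with much finer resolution than the raw values
        q, s = quant4(kv[t] - out[t - 1])
        out[t] = out[t - 1] + q * s
    return out

rng = np.random.default_rng(0)
kv = np.cumsum(rng.standard_normal((64, 128)) * 0.1, axis=0)  # smooth walk
err_delta = np.abs(delta_kv_roundtrip(kv) - kv).mean()
q, s = quant4(kv)  # absolute 4-bit quantization of the whole cache
err_abs = np.abs(q * s - kv).mean()
print(err_delta < err_abs)  # delta coding wins on temporally-smooth kv
```

this is the same closed-loop dpcm idea video codecs use: predict from the previous reconstructed frame, encode only the residual.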

papers

  • “flashattention-4” achieves 71% gpu utilization on blackwell via selective softmax rescaling and 5-stage pipeline, written entirely in python (cute-dsl). first attention kernel to match matmul throughput. https://arxiv.org/abs/2603.05451

  • “trained on tokens, calibrated on concepts: the emergence of semantic calibration in llms” (apple). establishes theoretical mechanism for why sampling-based semantic confidence estimates in base llms are well-calibrated despite no explicit training signal. https://machinelearning.apple.com/research/trained-on-tokens

  • “autoplay: scaling synthetic task generation for agents via exploration” (apple). addresses the bottleneck of generating diverse, feasible, verifiable agentic training tasks by having agents explore environments rather than relying on human annotation or blind prompting. https://machinelearning.apple.com/research/scaling-synthetic-task
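the sampling-based confidence estimate the apple calibration paper analyzes can be approximated with no calibration machinery at all: sample several answers, group them by meaning, and use the largest group’s share as confidence. a toy version, with exact-match normalization standing in for real semantic equivalence and a hard-coded sample list standing in for temperature sampling:

```python
from collections import Counter

def semantic_confidence(samples: list[str]) -> tuple[str, float]:
    """group sampled answers by (normalized) meaning; confidence is
    the share of samples in the largest meaning cluster."""
    # crude stand-in for semantic clustering: lowercase, strip punctuation
    norm = lambda s: "".join(
        c for c in s.lower() if c.isalnum() or c == " ").strip()
    clusters = Counter(norm(s) for s in samples)
    answer, count = clusters.most_common(1)[0]
    return answer, count / len(samples)

# pretend these came from sampling the same question 5 times at temperature > 0
samples = ["Paris.", "paris", "Paris", "Lyon", "PARIS!"]
print(semantic_confidence(samples))  # ('paris', 0.8)
```

the paper’s contribution is explaining why this frequency is well-calibrated in base models despite no training signal ever targeting it; the recipe itself is just sample-cluster-count.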