key developments
vllm security vulnerability silently ignores trust_remote_code=False (cve-2026-27893). two model files in vllm (nemotron-vl and kimi-k25) hardcode trust_remote_code=True, overriding an explicit False setting with no warning. a malicious hugging face repo targeting either architecture achieves code execution on the inference server. versions 0.10.1 through 0.17.x are affected; 0.18.0 contains the fix. this is the third time this vulnerability class has appeared in vllm, each time in a different code path, which suggests a systemic design problem rather than a one-off bug. anyone running vllm in production should upgrade immediately or audit model loading paths. https://nvd.nist.gov/vuln/detail/CVE-2026-27893
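if you can't upgrade yet, one cheap audit is a grep-style scan of your installed model-loading code for hardcoded trust_remote_code=True. this is a minimal illustrative script, not an official vllm tool, and a textual match is only a starting point for manual review:

```python
import re
import sys
from pathlib import Path

# whitespace-tolerant match for a hardcoded trust_remote_code=True
PATTERN = re.compile(r"trust_remote_code\s*=\s*True")

def find_hardcoded_trust(root: str) -> list[tuple[str, int, str]]:
    """Scan a source tree for lines that hardcode trust_remote_code=True."""
    hits = []
    for path in Path(root).rglob("*.py"):
        for lineno, line in enumerate(
            path.read_text(errors="ignore").splitlines(), 1
        ):
            if PATTERN.search(line):
                hits.append((str(path), lineno, line.strip()))
    return hits

if __name__ == "__main__":
    # e.g. point it at your site-packages/vllm directory
    for path, lineno, line in find_hardcoded_trust(sys.argv[1]):
        print(f"{path}:{lineno}: {line}")
```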
langflow critical rce with no installable patch (cve-2026-33017). cisa added this to the kev catalogue on march 25. a single unauthenticated post to the public flow build endpoint allows arbitrary python execution. the fix is supposedly in 1.9.0 but no such release exists on pypi or github; latest installable is 1.8.3. if you have langflow exposed to the internet, you need compensating controls now: block unauthenticated access, disable public flows, set AUTO_LOGIN=false. the gap between “fix announced” and “fix available” is operationally dangerous. https://raxe.ai/labs/advisories/RAXE-2026-043
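until a fixed release actually ships, compensating controls look roughly like the following. the LANGFLOW_* variable names follow langflow's documented env-prefix convention, but verify them against your deployed version's docs before relying on them:

```shell
# compensating controls for langflow while no installable patch exists.
# verify variable names against your version's documentation.
export LANGFLOW_AUTO_LOGIN=false            # require authentication
export LANGFLOW_SUPERUSER=admin
export LANGFLOW_SUPERUSER_PASSWORD='use-a-strong-secret'
# bind to loopback only and put an authenticating reverse proxy in front
langflow run --host 127.0.0.1 --port 7860
```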
kv cache rotation in llama.cpp recovers math performance lost to q8 quantization. a pr adding kv-cache rotation found that existing q8 kv quantization significantly degrades aime25 scores, and that rotation largely recovers them. this is practically significant for anyone running quantized local models on hard reasoning tasks; the silent accuracy loss from kv quantization was previously underappreciated. worth tracking as this merges. https://github.com/ggml-org/llama.cpp/pull/21038#issuecomment-4150413357
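the intuition for why rotation helps quantization can be seen in a toy example: an orthogonal rotation (here a normalized sylvester hadamard transform) spreads a single outlier channel across all channels, shrinking the quantization scale and the per-element rounding error. this is a sketch of the general quarot-style idea, not the pr's actual implementation:

```python
import math

def hadamard(x):
    """Normalized Sylvester-Hadamard transform (orthogonal, self-inverse).
    Length of x must be a power of two."""
    n = len(x)
    y = list(x)
    h = 1
    while h < n:
        for i in range(0, n, h * 2):
            for j in range(i, i + h):
                a, b = y[j], y[j + h]
                y[j], y[j + h] = a + b, a - b
        h *= 2
    s = math.sqrt(n)
    return [v / s for v in y]

def q8_roundtrip(x):
    """Symmetric int8 quantize/dequantize: one scale for the whole vector."""
    scale = max(abs(v) for v in x) / 127 or 1.0
    return [round(v / scale) * scale for v in x]

def rms_err(a, b):
    return math.sqrt(sum((u - v) ** 2 for u, v in zip(a, b)) / len(a))

x = [100.0] + [1.0] * 7                       # one outlier dominates the scale
plain = q8_roundtrip(x)                       # quantize directly
rot = hadamard(q8_roundtrip(hadamard(x)))     # rotate, quantize, rotate back
print(rms_err(x, plain), rms_err(x, rot))
```

because the rotation is orthogonal, the error introduced in the rotated domain carries over unchanged in norm, so the comparison is fair; the rotated path wins because no single channel pins the scale.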
gpt-5.4-mini shows a 22-percentage-point regression on vanilla prompting versus gpt-5-mini. across 1,800 evaluations on 12 tasks, vanilla accuracy dropped from 69.5% to 47.2%. the model produces shorter, terser outputs by default. a recursive language model implementation that forces computation through a python repl absorbed most of the regression (72.7% to 69.5%). this matters because it shows newer model versions can silently degrade on simple prompting patterns that benchmarks don’t test, and that structured scaffolding can compensate. the pattern of models getting worse at “just answering” while improving on structured reasoning continues. https://github.com/avilum/minrlm
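the “force computation through a repl” pattern can be sketched minimally: instead of trusting the model’s literal answer, extract the python it emits and execute that. this is an illustration of the scaffolding idea with a canned response standing in for an api call, not the minrlm implementation:

```python
import re

FENCE = "`" * 3  # a literal triple-backtick fence marker

def run_model_code(response: str) -> dict:
    """Extract the first fenced python block from a model response and
    exec it, returning the resulting namespace. Illustrative only:
    executing model output needs real sandboxing in production."""
    m = re.search(FENCE + r"python\n(.*?)" + FENCE, response, re.DOTALL)
    if m is None:
        raise ValueError("no python block in response")
    ns: dict = {}
    exec(m.group(1), ns)
    return ns

# a canned "model response" standing in for a real api call
response = (
    "to count the r's in strawberry, compute it:\n"
    f"{FENCE}python\n"
    'answer = "strawberry".count("r")\n'
    f"{FENCE}\n"
)
print(run_model_code(response)["answer"])
```

the point of the scaffold is that the final answer comes out of the interpreter, so a model that has gotten terser about “just answering” still gets the arithmetic right.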
notable
- simon willison covered pretext, a new browser library from former react core developer cheng lou that calculates line-wrapped text height without touching the dom, enabling fast text rendering effects previously impractical in browsers. https://simonwillison.net/2026/Mar/29/pretext/#atom-everything
- simon willison built a python vulnerability lookup tool using the osv.dev open cors json api; paste a pyproject.toml or requirements.txt and get reported vulnerabilities. https://simonwillison.net/2026/Mar/29/python-vulnerability-lookup/#atom-everything
- the sequence covered turboquant from google research: 3-bit kv cache compression via polar coordinate conversion and johnson-lindenstrauss sign-bit reduction, claiming 6x memory reduction and up to 8x speedup on h100s with zero accuracy loss, training-free. https://thesequence.substack.com/p/the-sequence-radar-832-last-week
- an independent researcher implemented the missing hebbian fast-weight write-back for the bdh architecture paper, showing selective writeback (top 10% of rows) preserves episodic memory signal where dense writeback degrades it. mechanism proof only (25m params, synthetic tasks), but the implementation fills a gap the original paper left. https://github.com/fleeb83/bdh-fast-weights
- someone trained luganda language models (20m to 110m params) from scratch and got them running fully on-device on android without gpu or internet; a meaningful contribution to low-resource language accessibility. https://huggingface.co/datasets/mwebazarick/BULaMU
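the osv.dev lookup pattern from simon willison's tool is easy to reproduce server-side; a minimal sketch against the public v1 query endpoint, with the request-building and response-parsing split out from the network call so they can be checked offline:

```python
import json
import urllib.request

OSV_URL = "https://api.osv.dev/v1/query"

def build_query(name: str, version: str) -> dict:
    """Request body for the osv.dev v1 query endpoint (PyPI ecosystem)."""
    return {"version": version, "package": {"name": name, "ecosystem": "PyPI"}}

def summarize(result: dict) -> list[tuple[str, str]]:
    """Flatten an osv.dev response into (vulnerability id, summary) pairs."""
    return [(v["id"], v.get("summary", "")) for v in result.get("vulns", [])]

def lookup(name: str, version: str) -> list[tuple[str, str]]:
    """POST a query to osv.dev and summarize the reported vulnerabilities."""
    req = urllib.request.Request(
        OSV_URL,
        data=json.dumps(build_query(name, version)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return summarize(json.load(resp))

if __name__ == "__main__":
    for vuln_id, summary in lookup("django", "3.2"):
        print(vuln_id, summary)
```

willison's version runs the same query client-side in the browser, which works because the api sends open cors headers.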
papers
“llms do not grade essays like humans” — evaluates gpt and llama family models on essay scoring against human raters. llms assign higher scores to short/underdeveloped essays and penalize longer essays with minor errors. llm scores are internally consistent with their generated feedback but rely on different signals than human raters. useful calibration for anyone considering llm-as-judge for text evaluation. https://arxiv.org/abs/2603.23714
mdm-prime-v2: binary encoding and index shuffling enable compute-optimal scaling of diffusion language models (chao et al. 2026) — claims diffusion language models outperform autoregressive transformers, though with a more data-hungry scaling curve. if the results hold, this is a meaningful datapoint for non-autoregressive architectures as viable alternatives. https://www.reddit.com/r/mlscaling/comments/1s70otw/