key developments
h100 rental prices are rising sharply, defying depreciation expectations. latent space reports that since december 2025, h100 rental prices have reversed their post-deepseek-r1 decline and are climbing significantly, with the chips reportedly worth more today than at launch three years ago. the driver is a combination of the general chip shortage, the reasoning-model and agent inflection point, and dramatically better inference software making older hardware more productive than anticipated. this matters because it breaks the standard 4-7 year gpu depreciation model that data center economics are built on, and it signals that compute demand from reasoning and agentic workloads is structurally outpacing supply. if sustained, this reprices the entire inference infrastructure stack. (latent space)
anthropic’s next-generation model “capybara” leaked via an unsecured cms bucket, confirmed by fortune. leaked marketing pages (preserved before being pulled) describe a model, referred to as both “mythos” and “capybara”, that posts dramatically higher scores than claude opus 4.6 in coding, academic reasoning, and cybersecurity. anthropic confirmed to fortune that it is developing, and testing with early-access customers, a model representing a “step change” in capabilities. the relationship between the names is unclear: github strings suggest capybara may be a tier above opus rather than a distinct model name, but other evidence treats it as a separate model. the leaked page carries a “03|26” date. notably, anthropic acknowledged the model will be very expensive to serve, which connects to the compute demand story above. (r/mlscaling, latent space)
turboquant is getting rapid community implementations and clear explanations. google’s kv cache compression method from zandieh et al. 2025 has generated a burst of activity. the key insight (well explained on r/localllama) is simple: randomly rotate a vector before quantizing it, then apply the inverse rotation on dequantization. this works because llm state vectors have highly non-uniform coefficient magnitudes (a few outlier dimensions dominate the quantization scale), and a random rotation spreads that energy more evenly across dimensions, making uniform quantization far less lossy. practical implementations now include an mlx port with custom metal kernels achieving 4.6x kv cache compression at 98% of fp16 speed on qwen2.5-32b (github), a weight quantization adaptation showing near-lossless 8-bit and usable 4-bit compression (r/machinelearning), and integration into llama.cpp alongside h2o and streamingllm for 256k+ context on a 16gb 4060 ti (github). this is moving from paper to production tooling unusually fast.
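the rotate-quantize-inverse-rotate idea can be seen in a few lines of numpy. this is an illustrative sketch, not the turboquant implementation: the random orthogonal matrix (via qr of a gaussian matrix), the symmetric 4-bit grid, and the single outlier coordinate are all assumptions chosen to demonstrate the effect.

```python
import numpy as np

def random_rotation(d, seed=0):
    # qr decomposition of a gaussian matrix yields a random orthogonal matrix
    q, _ = np.linalg.qr(np.random.default_rng(seed).normal(size=(d, d)))
    return q

def quantize(x, bits=4):
    # symmetric uniform quantization; the largest coefficient sets the scale
    levels = 2 ** (bits - 1) - 1
    scale = np.abs(x).max() / levels
    return np.round(x / scale).astype(np.int8), scale

def dequantize(q, scale):
    return q.astype(np.float64) * scale

rng = np.random.default_rng(0)
d = 64
x = rng.normal(size=d)
x[0] = 50.0  # one outlier dominates, as in llm state vectors

# direct quantization: the outlier forces a coarse grid for every other coordinate
q, s = quantize(x)
err_direct = np.linalg.norm(dequantize(q, s) - x)

# rotate first, quantize, then counter-rotate on dequantization
R = random_rotation(d)
q_rot, s_rot = quantize(R @ x)
x_hat = R.T @ dequantize(q_rot, s_rot)
err_rotated = np.linalg.norm(x_hat - x)

print(err_rotated < err_direct)
```

on spiky inputs like this, the rotation spreads the outlier's energy across all 64 dimensions, so the quantization grid is much finer per coordinate and the reconstruction error drops substantially.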
litellm supply chain attack compromised versions 1.82.7 and 1.82.8 on pypi. a malicious .pth file was injected through a compromised publish token (stolen via trivy, ironically a vulnerability scanner). because .pth files execute at interpreter startup, the payload scraped ssh keys, aws/gcp credentials, k8s secrets, and environment variables on every python process start, with no import required. over 2,000 downstream packages depend on litellm, including dspy and mlflow. the attack was only caught because a fork bomb bug in the malicious code crashed machines. if you use litellm, check your version immediately; anything above 1.82.6 should be treated as a full compromise, with all credentials rotated. (r/machinelearning)
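a quick version check can be done with the stdlib, no third-party tooling. the helper names below are hypothetical; the compromised-version list is the one from the report.

```python
from importlib import metadata

# versions named in the report as carrying the malicious .pth payload
COMPROMISED = {"1.82.7", "1.82.8"}

def classify(version):
    """map an installed litellm version string to a triage verdict."""
    if version is None:
        return "not installed"
    if version in COMPROMISED:
        return "compromised: rotate ssh keys, cloud credentials, and k8s secrets"
    return "ok: " + version

def installed_litellm_version():
    try:
        return metadata.version("litellm")
    except metadata.PackageNotFoundError:
        return None

print(classify(installed_litellm_version()))
```

note this only checks the currently active environment; ci images, containers, and lockfiles pinned to an affected range need the same audit.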
notable
- glm-5.1 weights releasing april 6-7 per zhipu’s discord; the next iteration of their open model family is imminent. (r/localllama)
- ibm granite 4.0 3b vision released as a lora adapter on granite 4.0 micro, focused on enterprise document extraction (charts, tables, key-value pairs) with a single deployment supporting both multimodal and text-only workloads. (r/localllama)
- cern burns tiny ai models into fpgas/asics using hls4ml for nanosecond-scale lhc data filtering; 0.02% of collision events kept, the rest gone forever. extreme edge inference that no large model could do. (r/localllama)
- mlx lora pipeline for encoder/embedding models on apple silicon; fine-tuning bge-m3 takes 56 min vs 6-8 hours on pytorch, at 78% gpu utilization vs <5%. (github)
- pentanet: pentanary {-2,-1,0,1,2} quantization at 124m scale shows 6.4% perplexity improvement over bitnet ternary with zero additional multiplier cost (multiply-by-2 is just a bit shift). incremental but clean result. (r/machinelearning)
- kv cache architecture evolution overview with per-token costs: gpt-2 at 300 kib/token down to deepseek v3 at 68.6 kib/token, plus the observation that there is no architectural slot between ephemeral kv cache and permanent weights for medium-term memory. (r/localllama)
- fal cto on ai gross margins: “as models get better, costs don’t go down, they go up” because customers always want the newest, most expensive models. software advances faster than hardware; the margin squeeze is structural, not temporary. (saastr)
- titans-trainer library released for google’s titans architecture (neurips 2025) with test-time memory updates; biotitan genomic model approaches geneformer performance with 120x less training data. (github)
- tidesurf web agent harness achieves 30x token reduction and 12x ttft reduction for browser-use agents by rendering dom as compressed markdown rather than raw html or screenshots. (github)
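the per-token figures in the kv cache bullet can be reproduced with back-of-envelope arithmetic. a sketch assuming gpt-2 xl dimensions (48 layers, hidden size 1600) and deepseek v3's mla cache (61 layers, a 512-dim compressed latent plus 64-dim decoupled rope keys), both in fp16; these configs are assumptions on my part, not stated in the post:

```python
def mha_kv_bytes_per_token(layers, hidden, dtype_bytes=2):
    # standard multi-head attention caches one key and one value vector per layer
    return 2 * layers * hidden * dtype_bytes

def mla_kv_bytes_per_token(layers, latent_dim, rope_dim, dtype_bytes=2):
    # multi-head latent attention caches one compressed latent plus decoupled rope keys
    return layers * (latent_dim + rope_dim) * dtype_bytes

gpt2_xl = mha_kv_bytes_per_token(48, 1600)   # 307200 bytes = 300.0 kib
dsv3 = mla_kv_bytes_per_token(61, 512, 64)   # 70272 bytes ~ 68.6 kib

print(gpt2_xl / 1024, dsv3 / 1024)
```

both numbers land exactly on the figures quoted in the bullet, which suggests those are the configurations the overview used.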
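the "multiply-by-2 is just a bit shift" claim in the pentanet bullet can also be sketched: with weights restricted to {-2, -1, 0, 1, 2}, a dot product needs only adds, subtracts, and one-bit shifts. this is an illustrative quantizer (bitnet-style absmean scaling is my assumption), not the pentanet code.

```python
import numpy as np

def pentanary_quantize(w):
    # scale by mean |w| (absmean, as in bitnet-style schemes), round to the five levels
    scale = np.abs(w).mean() + 1e-8
    q = np.clip(np.round(w / scale), -2, 2).astype(np.int8)
    return q, scale

def penta_dot(q, x):
    # dot product with no multipliers: +/-1 terms are adds/subtracts,
    # +/-2 terms double an accumulated sum (x * 2 would be x << 1 in integer hw)
    acc = x[q == 1].sum() - x[q == -1].sum()
    acc += 2 * x[q == 2].sum() - 2 * x[q == -2].sum()
    return acc

rng = np.random.default_rng(1)
w = rng.normal(size=256)
x = rng.normal(size=256)
q, s = pentanary_quantize(w)

# the shift/add accumulation matches a plain dot product with the quantized weights
print(np.isclose(penta_dot(q, x) * s, (q * s) @ x))
```

ternary bitnet drops the +/-2 levels; pentanary keeps them at essentially no hardware cost, which is where the reported perplexity gain comes from.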