Artificial Intelligence AI News


We are a community of machine learning enthusiasts/researchers/journalists/writers who share interesting news and articles about the applications of AI. You will never miss any updates on ML/AI/CV/NLP fields because we post them daily. JOIN NOW


Country not specified · Technology & Apps · 24 093

Subscribers: 3 227
→ 24 hours: -1
→ 7 days: +18
→ 30 days: +84

Post views: 547
→ 24 hours: ~246
→ 48 hours: ~304

Engagement rate: 16.95%

Posts per day: no data


Post archive


Sakana AI Releases Fugu-Cyber: An Orchestration Model Reporting 86.9% on CyberGym and 72.1% on CTI-REALM

It is not a new frontier model. It is a third endpoint on the Fugu orchestrator, tuned for security reasoning. Here's what's actually interesting.

1. The CyberGym number only means something with context
→ Fugu-Cyber: 86.9%
→ GPT-5.5-Cyber: 85.6%
→ Claude Mythos Preview: 83.1%
→ Best agent in the original CyberGym paper: ~20%
The benchmark asks an agent to write a PoC that crashes the pre-patch build but not the post-patch build. 1,507 instances, 188 OSS-Fuzz projects. Sakana's score is a small step past the reported frontier, not a leap. The leap already happened.

2. The CTI-REALM figure is a different metric than it sounds
Microsoft scores CTI-REALM as a 0–1 trajectory reward, not pass/fail. Its own eval put the top three configs at 0.624–0.685. Sakana reports 72.1% and calls it a success rate. Read it as reward 0.721.

3. Detection engineering still breaks on cloud
Microsoft's per-platform means across evaluated models:
→ Linux endpoints: 0.585
→ AKS: 0.517
→ Azure cloud: 0.282

4. The pricing is a clean 1.2×
→ $6 input / $36 output / $0.60 cached, per 1M tokens
→ All three double above 272K context
→ Exactly 20% over Fugu-Ultra on every line
Access is gated: manual approval, defensive-use AUP, Token Plan only, no EU/EEA, no weights.

Full analysis: https://www.marktechpost.com/2026/07/25/sakana-ai-releases-fugu-cyber-orchestration-model-cybergym-cti-realm/
Technical details: https://sakana.ai/fugu-cyber-release/
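
The pricing lines in the Fugu-Cyber post can be turned into a rough cost estimate. The sketch below is only arithmetic on the listed rates: the function names are hypothetical, and the rule that "context" means fresh plus cached input tokens is an assumption, since the post does not say how the 272K threshold is measured.

```python
# Hypothetical helper: names and the tier rule are assumptions, not Sakana's API.
BASE_USD_PER_M = {"input": 6.00, "output": 36.00, "cached": 0.60}
LONG_CONTEXT_THRESHOLD = 272_000  # tokens; the post says all rates double above this
PREMIUM_OVER_ULTRA = 1.2          # "exactly 20% over Fugu-Ultra on every line"


def request_cost(input_tokens, output_tokens, cached_tokens=0, context_tokens=None):
    """Estimate the USD cost of one request from the listed per-1M-token rates."""
    if context_tokens is None:
        # Assumption: context size is fresh input plus cached input tokens.
        context_tokens = input_tokens + cached_tokens
    multiplier = 2.0 if context_tokens > LONG_CONTEXT_THRESHOLD else 1.0
    usage = {"input": input_tokens, "output": output_tokens, "cached": cached_tokens}
    return sum(BASE_USD_PER_M[k] * multiplier * usage[k] / 1_000_000 for k in usage)


def implied_ultra_rates():
    """Back out the Fugu-Ultra rates implied by the stated 20% premium."""
    return {k: round(v / PREMIUM_OVER_ULTRA, 4) for k, v in BASE_USD_PER_M.items()}
```

Under these assumptions, a 100K-in / 10K-out request stays on the base tier, while a 300K-in request is billed at double rates on every line.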


Meet Open Dreamer: A JAX/Flax Reproduction of the Dreamer 4 World Model Pipeline, With the Full Training Recipe Published

No VAE. No KL loss. No adversarial loss. Here's how it works:

1. Two models, one backbone
A causal video tokenizer and an action-conditioned dynamics model share the same block-causal transformer. Space layers move information inside a frame. Causal time layers move it between frames.

2. The tokenizer is a Masked Autoencoder, not a VAE
Masking makes the latent space more diffusible, so no KL or adversarial term is needed.
→ ~100× compression, 512 latent tokens at width 16 per frame
→ 360×640 frames padded to 368×640 for clean 16×16 patches

3. The rollout is folded into blocks
Each timestep is (previous action, state, policy). Spatial attention runs inside the block, causal time attention across blocks. World-model tokens cannot read the agent token, so policy information reaches future states only through the next action.
→ 1.6B params, depth 30, d_model 1920, 30 heads / 3 KV heads

4. Stability, not throughput, was the bottleneck
Most failures happened while the loss was still going down. MSE improves smoothly while generation quality degrades.
→ Muon replaced LaProp, which spiked randomly and increasingly often
→ ~400 B200 hours per optimizer comparison run

5. The numbers (B200 dynamics training)
→ 57–58% MFU, against 60% described as very healthy
→ 292 FLOP/byte roofline crossover, 256 frames per GPU to clear it
→ ~24 GiB model state, activations were the real memory cost
→ plain data parallelism beat FSDP, tensor and sequence parallelism

Full analysis: https://www.marktechpost.com/2026/07/25/meet-open-dreamer-a-jax-flax-reproduction-of-the-dreamer-4-world-model-pipeline-with-the-full-training-recipe-published/
Research and Demo: https://next-state.github.io/open-dreamer/
Code: https://github.com/next-state/open-dreamer
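
The attention rules in point 3 of the Open Dreamer post can be written as two boolean masks. This is a minimal sketch in plain Python, not the project's JAX code: the block layout in `ROLES` and both function names are hypothetical, and the only rules encoded are the ones the post states (world-model tokens cannot read the agent token; time attention is causal across blocks).

```python
# Hypothetical layout of one timestep block: previous action, two state tokens, agent token.
ROLES = ("action", "state", "state", "agent")


def spatial_mask(roles):
    """allowed[q][k] for attention inside one timestep block.

    World-model tokens (action, state) may not read the agent token;
    the agent token may read every token in its own block.
    """
    return [[not (rq != "agent" and rk == "agent") for rk in roles] for rq in roles]


def causal_time_mask(num_steps):
    """allowed[t][s]: timestep t may read timestep s only if s <= t."""
    return [[s <= t for s in range(num_steps)] for t in range(num_steps)]
```

With these masks, the agent token's output can influence later state tokens only after it has been turned into the next block's action, which is the information path the post describes.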


Claude Opus 5 is out. The agentic numbers, not the coding ones, are the story:

• FrontierBench v0.1 → 43.3% vs Opus 4.8's 18.7%
• OSWorld 2.0 → 70.57% vs 55.7%
• AutomationBench → 26.0% vs 17.0% (Fable 5: 17.4%)
• ARC-AGI-3 → 30.16%, ~4x the previous best on the leaderboard

Same $5/$25 pricing. Fable 5 still edges it on SWE-bench Pro (80.0 vs 79.2).

Full analysis: https://www.marktechpost.com/2026/07/24/meet-the-new-claude-opus-5-frontier-class-agentic-coding-and-computer-use-at-unchanged-opus-pricing/
Technical details: https://www.anthropic.com/news/claude-opus-5
Paper: https://www-cdn.anthropic.com/c5fbac3f0b1280a933ebd26d3cb8bb9f5bdeaf48/Claude%20Opus%205%20System%20Card.pdf#page=73


Andrew Ng Just Released OpenWorker: An Open-Source, Local-First Desktop AI Coworker That Returns Finished Deliverables Instead of Chat

Here are some key takeaways:

1. The approval layer is typed, not cosmetic
Most desktop agents bolt approvals onto the UI. OpenWorker classifies every tool call into one of four risk classes before it runs:
→ read: no side effects, always allowed
→ write_local: mutates the workspace, path-scoped
→ exec: runs commands
→ external: side effects off the machine
Five permission modes then decide what happens: discuss, plan, interactive (default), auto, custom.

2. Unattended ≠ more autonomous
Unattended mode doesn't raise the autonomy ceiling. It only reroutes approval prompts to an Inbox and suspends the run until a human answers. Autonomy and attention are separate axes, and most agent frameworks conflate them.

3. No inference service, by design
You bring a key or run local:
→ 30 curated models, 13 providers
→ Native OpenAI / Anthropic / Google, open-weight via Together and Fireworks
→ Fully local via Ollama, no key
→ Matrix limited to tool-calling models; anything else stalls the agent loop

The stack
→ Tauri 2 + React shell over a local Python FastAPI server (127.0.0.1:8765)
→ 35 connectors live, plus any MCP server
→ Built on aisuite, ~32.4k lines of Python, 78 test modules

Full analysis: https://www.marktechpost.com/2026/07/23/andrew-ng-just-released-openworker-an-open-source-local-first-desktop-ai-coworker-that-returns-finished-deliverables-instead-of-chat/
Repo: https://github.com/andrewyng/openworker
Project: https://openworker.com/
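
A typed approval layer of the kind described in the OpenWorker post might look like the sketch below. It is not OpenWorker's code: the class names, the `decide` function, and the `INTERACTIVE_POLICY` table are all hypothetical, and the post does not spell out what each permission mode does for non-read calls. Only two rules come from the post: `read` is always allowed, and unattended mode reroutes prompts to an inbox without changing what is permitted.

```python
from enum import Enum


class Risk(Enum):
    READ = "read"
    WRITE_LOCAL = "write_local"
    EXEC = "exec"
    EXTERNAL = "external"


class Decision(Enum):
    ALLOW = "allow"
    ASK = "ask"      # prompt the user in the foreground
    QUEUE = "queue"  # park the prompt in an inbox and suspend the run
    DENY = "deny"


# Hypothetical policy table for one mode; the real per-mode rules are not in the post.
INTERACTIVE_POLICY = {
    Risk.WRITE_LOCAL: Decision.ASK,
    Risk.EXEC: Decision.ASK,
    Risk.EXTERNAL: Decision.ASK,
}


def decide(risk, policy, unattended=False):
    """Classify-then-decide: the risk class is fixed before the mode is consulted."""
    if risk is Risk.READ:
        return Decision.ALLOW  # stated in the post: read is always allowed
    decision = policy.get(risk, Decision.DENY)
    if unattended and decision is Decision.ASK:
        return Decision.QUEUE  # attention changes, the autonomy ceiling does not
    return decision
```

Keeping `unattended` as a separate flag rather than a sixth mode mirrors the post's point that autonomy and attention are independent axes.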


Meet Gigatoken: A Rust BPE Tokenizer that Encodes Text at 24.53 GB/s on a 144-core AMD EPYC 9565, against 24.8 MB/s for HuggingFace tokenizers and 36.0 MB/s for tiktoken on the same machine

Both baselines are multithreaded Rust implementations. The difference comes from how the work is structured, not the language.

1. Pretokenization without a regex engine
Most tokenizers delegate pretokenization to a regex engine. Gigatoken implements it directly:
→ A 256-byte lookup table classifies the first byte in O(1), replacing alt/backtrack dispatch
→ SWAR loads 8 bytes as a u64 and checks all 8 for the letter property with branchless arithmetic
→ Two independent cursors run from a safe split point, so the out-of-order engine overlaps their instruction streams
The repo's optimization log records the progression on single-threaded GPT-2 pretokenization: fancy-regex at 47 MiB/s, NEON at 462 MiB/s, LUT + SWAR at 830 MiB/s, dual-cursor at 1,049 MiB/s.

2. Pretoken caching
Words seen before are looked up rather than re-encoded through BPE. The author notes this is the hard part: the cache grows quickly and pretoken distributions are long-tailed.

3. Measured results across hardware
GPT-2 on the 11.9 GB OpenWebText corpus:
→ EPYC 9565 (144 cores): 24.53 GB/s
→ Apple M4 Max (16 cores): 8.79 GB/s
→ Ryzen 7 9800X3D (16 cores): 6.27 GB/s
Methodology note: Gigatoken encodes the full file un-split and finds its own boundaries. HuggingFace tokenizers gets the first 100 MB and tiktoken the first 1 GB, both presplit on <|endoftext|>. Best of 3 interleaved rounds, fresh process per measurement.

4. Relevant workloads
Pretraining data preparation, where a corpus is retokenized on each mixture or filter change. And time-to-first-token in serving: vLLM and SGLang hash token chunks into prefix trees, so tokenization runs before the KV-cache lookup.

Full analysis: https://www.marktechpost.com/2026/07/23/meet-gigatoken-a-rust-bpe-tokenizer-that-encodes-text-at-24-53-gb-s-up-to-989x-faster-than-huggingface-tokenizers/
GitHub Repo: https://github.com/marcelroed/gigatoken/#benchmarks
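
The SWAR step in the Gigatoken post (8 bytes checked at once with branchless arithmetic) can be illustrated in Python, using an integer masked to 64 bits in place of a `u64`. This is a sketch of the general technique, not Gigatoken's Rust code: it assumes "letter" means ASCII `A-Z` / `a-z` only, and the function names are made up for the example.

```python
ONES = 0x0101010101010101
HIGH = 0x8080808080808080
LOW7 = 0x7F7F7F7F7F7F7F7F


def ascii_letter_mask(chunk: bytes) -> int:
    """Return a 64-bit value with bit 7 of byte i set iff chunk[i] is an ASCII letter."""
    assert len(chunk) == 8
    word = int.from_bytes(chunk, "little")
    non_ascii = word & HIGH                     # bytes >= 0x80 are never ASCII letters
    folded = (word | (0x20 * ONES)) & LOW7      # fold upper case to lower, keep 7 bits per byte
    at_least_a = folded + (0x80 - 0x61) * ONES  # bit 7 set iff folded byte >= 'a'
    above_z = folded + (0x7F - 0x7A) * ONES     # bit 7 set iff folded byte > 'z'
    # Each per-byte sum stays below 0x100, so no carry leaks into the neighbouring byte.
    return at_least_a & ~above_z & ~non_ascii & HIGH


def mask_to_flags(mask: int):
    """Unpack the mask into one boolean per input byte, in input order."""
    return [((mask >> (8 * i + 7)) & 1) == 1 for i in range(8)]
```

For example, `mask_to_flags(ascii_letter_mask(b"Hi, a_Z{"))` marks positions 0, 1, 4 and 6. The same two additions and three ANDs classify all eight bytes with no per-byte branch, which is the property the post credits for the speedup.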


Poolside released Laguna S 2.1 last week, and the interesting part is not the benchmark table. It is what fits in memory.

It is a 118B-parameter Mixture-of-Experts coding model that activates ~8B parameters per token. Roughly 6.8% of the network fires on any given step.

1. The weight-class claim
→ 78.5% SWE-Bench Multilingual: tops Poolside's published table outright
→ 70.2% Terminal-Bench 2.1: first among open, disclosed-size models
→ 59.4% SWE-Bench Pro
→ 40.4% DeepSWE v1.1, against DeepSeek-V4-Pro-Max at 9.0% with ~6× the active parameters
Closed frontier models still lead several of these. Claude Fable 5 hits 80.3% on SWE-Bench Pro. The claim is the weight class, not the top of the board.

2. Thinking mode is doing the heavy lifting
Two modes only: off and max, with max as default. No user-configurable effort control yet.
→ Terminal-Bench 2.1: 60.4% → 70.2%
→ DeepSWE v1.1: 16.5% → 40.4%
→ Cost: DeepSWE trajectories go from ~99k to ~249k completion tokens
That is a real inference bill, not a free lunch. Worth modelling before you switch it on in production.

3. Sizing it correctly
This is where teams get MoE wrong. Every expert stays resident, so you size on 118B, not 8B.
→ 4-bit (NVFP4/INT4): ~59 GB, fits one NVIDIA DGX Spark (128 GB)
→ FP8: ~118 GB, one Spark or one H200
→ BF16: ~236 GB, two linked Sparks or a multi-GPU node

Day-one support for vLLM, SGLang, and Ollama. Hosted on OpenRouter at $0.10 / $0.20 / $0.01 per 1M input / output / cache-read tokens.

Full analysis: https://www.marktechpost.com/2026/07/21/poolside-releases-laguna-s-2-1/
Technical details: https://poolside.ai/blog/introducing-laguna-s-2-1
Trajectories: https://trajectories.poolside.ai/
Technical report: https://poolside.ai/assets/laguna/laguna-m1-xs2-technical-report.pdf
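
The sizing figures in the Laguna post follow from one multiplication: total parameters times bytes per parameter. The sketch below reproduces them under the simplifying assumption that only weights count (no KV cache, activations, or quantization scale overhead); the names are illustrative.

```python
TOTAL_PARAMS = 118e9   # every expert stays resident, so size on the total
ACTIVE_PARAMS = 8e9    # approximate parameters used per token
BYTES_PER_PARAM = {"4-bit": 0.5, "FP8": 1.0, "BF16": 2.0}


def weight_gb(total_params, bytes_per_param):
    """Weight memory in GB (10^9 bytes). Weights only, by assumption."""
    return total_params * bytes_per_param / 1e9


def active_fraction(active_params, total_params):
    """Share of the network that fires on a single token."""
    return active_params / total_params


if __name__ == "__main__":
    for name, width in BYTES_PER_PARAM.items():
        print(f"{name}: ~{weight_gb(TOTAL_PARAMS, width):.0f} GB")
    print(f"active: {active_fraction(ACTIVE_PARAMS, TOTAL_PARAMS):.1%}")
```

Real deployments need headroom on top of these numbers, which is why a figure that nominally fits a device (FP8 at ~118 GB on a 128 GB machine) leaves little room for context.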


NVIDIA just put a full world model (perception, prediction, and action) inside a 4B model that runs on the robot itself, no cloud round-trip.

I spent some time analyzing the Cosmos 3 Edge release. Here is what stood out to me, and why it matters for anyone building physical AI.

1. One model spans understanding, prediction, and action
A world model learns how an environment changes over time: objects, motion, and the effects of actions. Cosmos 3 Edge brings that on-device, so a system can read the current state, simulate a likely future, and connect that future to an action.

2. Two transformer towers, one shared representation
It uses a Mixture-of-Transformers design.
→ Autoregressive tower (reasoner): vision + text tokens, causal attention
→ Diffusion tower (generator): vision + audio + action tokens, broad context attention
The towers keep separate norm layers and MLPs, but share multimodal attention. So the model reasons about a scene before it generates anything.

3. A common action space across embodiments
Actions are encoded as compact geometric vectors (translation, rotation, manipulation state) so control maps directly to pixel changes.
→ camera / autonomous vehicle: 9D
→ single-arm robot: 10D · dual-arm: 20D
→ egocentric: 57D · humanoid: 29D

4. Policy mode runs in both directions
Current state in → action + expected visual consequence out. Run it the other way and it infers the action from an observed change. That is what connects world modeling to policy training and evaluation.

5. On-device numbers that matter
→ 4B params (2B dense reasoner)
→ 640×360 robot-control resolution
→ 32 actions per inference on Jetson Thor
→ 15 Hz real-time control loop
→ runs on Jetson (T2000 / T3000 / Thor), RTX PRO, GeForce RTX, DGX
→ #1 on VANTAGE-Bench for vision analytics among 4B models (vendor-stated; benchmark on your own scenes)

6. What ships alongside it
→ Cosmos 3 Edge Policy (DROID): a pick-and-place manipulation policy, with post-training scripts
→ Cosmos 3 Super 4-Step Distillation: cuts diffusion from 35–50 denoising steps to 4, up to 25× faster for text-to-image and image-to-video
→ post-train for your embodiment and sensors in about a day on an H100 cluster or DGX Station

Full analysis: https://www.marktechpost.com/2026/07/21/nvidia-releases-cosmos-3-edge-a-4b-parameter-open-world-model-that-reasons-and-generates-robot-actions-on-device/
Model weight: https://huggingface.co/nvidia/Cosmos3-Edge
Technical details: https://huggingface.co/blog/nvidia/cosmos3edge?linkId=100000431533160


Zyphra Releases ZUNA1.1: An Apache 2.0 EEG Foundation Model With Variable-Length Inputs From 0.5 To 30 Seconds

Most EEG foundation models only work on the clean, fixed-length slices they were trained on. Real recordings are messy, and Zyphra spent an entire release closing that gap.

They released ZUNA1.1, a 380M masked diffusion autoencoder for scalp EEG under Apache 2.0, which reconstructs, denoises, and upsamples across arbitrary channel layouts. The architecture is nearly unchanged from ZUNA1. Almost everything that moved, moved in the training.

Here's what's actually interesting:
→ Variable-length inputs from 0.5 to 30 seconds, snapped to a 0.125 s token grid: one model serves a trial snippet and a 30 s stretch, no reconfiguration
→ Four dropout schemes instead of one: whole channels, time stretches across every channel, stretches on some channels only, and scattered points
→ Corpus grew from ~2M to ~3.5M channel-hours by scoring quality per channel, per second, instead of discarding whole recordings
→ 4D RoPE over (x, y, z, t) means position, not array index, tells the model where a channel sits, so it can generate signals at electrode positions never recorded
→ Reported NMSE equal to or better than ZUNA1, and both beat MNE's spherical-spline interpolation

Full analysis: https://www.marktechpost.com/2026/07/17/zyphra-releases-zuna1-1-an-apache-2-0-eeg-foundation-model-with-variable-length-inputs-from-0-5-to-30-seconds/
Model weight: https://huggingface.co/Zyphra/ZUNA1.1
Repo: https://github.com/Zyphra/zuna
Technical Details: https://www.zyphra.com/our-work/zuna1.1
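
The 0.125 s token grid in the ZUNA1.1 post implies a simple mapping from clip length to time steps. The sketch below shows that mapping under two assumptions that the post does not confirm: durations are rounded to the nearest grid step, and out-of-range clips are clamped to 0.5–30 s. The function name is hypothetical.

```python
TOKEN_SECONDS = 0.125               # grid step stated in the post
MIN_SECONDS, MAX_SECONDS = 0.5, 30.0  # supported input range stated in the post


def snap_to_grid(duration_s):
    """Return (time_steps, snapped_seconds) for a clip duration.

    Assumption: round to the nearest grid step and clamp to the supported range.
    """
    clamped = min(max(duration_s, MIN_SECONDS), MAX_SECONDS)
    steps = round(clamped / TOKEN_SECONDS)
    return steps, steps * TOKEN_SECONDS
```

On this grid the shortest supported clip is 4 time steps and the longest is 240, so the sequence length along time varies by a factor of 60 within one model.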


NVIDIA AI Releases Nemotron 3 Embed: An Open Embedding Collection Whose 8B Checkpoint Ranks #1 on RTEB

Most RAG stacks treat the embedding model as a commodity: pick one, index, move on. Nemotron 3 Embed is NVIDIA's argument that the retrieval layer is where agent cost actually gets set.

They released three open checkpoints (Nemotron-3-Embed-8B-BF16, 1B-BF16, and 1B-NVFP4) built on Ministral bases, trained with bidirectional attention masking, pooled by averaging token-level representations, all taking 32,768-token inputs under OpenMDW-1.1.

Here's what's actually interesting:
→ The 8B ranks #1 overall on RTEB: 78.46 avg NDCG@10, alongside 75.45 on MMTEB Retrieval and 60.60 on ViDoRe-V3 text
→ The 1B wasn't trained small. It was pruned from a 3B parent with ModelOpt mcore_minitron NAS, then distilled from the 8B teacher on COS + MSE loss, twice
→ That pipeline lands the 1.14B checkpoint at 72.38 RTEB, up 10.4 points on the prior-generation llama-nemotron-embed-vl-1b-v2
→ NVFP4 costs 0.38 RTEB points (72.00 vs 72.38, ~99.5% retention) and buys up to 2x BF16 throughput on Blackwell

Full Analysis: https://www.marktechpost.com/2026/07/17/nvidia-ai-releases-nemotron-3-embed-an-open-embedding-collection-whose-8b-checkpoint-ranks-1-on-rteb/
Model weight: https://huggingface.co/collections/nvidia/nemotron-3-embed
Technical details: https://huggingface.co/blog/nvidia/nemotron-3-embed-wins-rteb
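
"Pooled by averaging token-level representations" in the Nemotron 3 Embed post is ordinary mean pooling. A dependency-free sketch follows; skipping padded positions via an attention mask is an assumption about how batching is handled, and the function name is illustrative rather than taken from NVIDIA's code.

```python
def mean_pool(token_vectors, attention_mask):
    """Average token embeddings into one vector, skipping positions where mask == 0."""
    kept = [vec for vec, keep in zip(token_vectors, attention_mask) if keep]
    if not kept:
        raise ValueError("no unmasked tokens to pool")
    dim = len(kept[0])
    return [sum(vec[d] for vec in kept) / len(kept) for d in range(dim)]


if __name__ == "__main__":
    tokens = [[1.0, 2.0], [3.0, 4.0], [9.0, 9.0]]  # last row is padding
    print(mean_pool(tokens, [1, 1, 0]))
```

Because attention is bidirectional, every token representation already sees the whole input, so an unweighted average is a reasonable summary of the sequence.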


Moonshot AI Releases Kimi K3: A 2.8 Trillion Parameter Open MoE Model With Kimi Delta Attention and 1M Context

Here's how it works:

1. Kimi Delta Attention
A hybrid linear attention mechanism scaling across sequence length. It breaks conventional prefix caching, so Moonshot upstreamed a KDA implementation to vLLM.
→ up to 6.3x faster decoding at million-token contexts

2. Attention Residuals
The other axis: depth, not length. It selectively retrieves representations across depth instead of accumulating them uniformly.
→ ~25% higher training efficiency at under 2% added cost

3. Stable LatentMoE
At 16-of-896 sparsity, routing becomes a first-order problem. Quantile Balancing derives expert allocation straight from router-score quantiles, dropping heuristic updates and one sensitive hyperparameter.
→ ~2.5x scaling efficiency vs K2

4. The numbers (max reasoning effort)
→ 91.2 BrowseComp, 88.3 Terminal Bench 2.1, 77.8 Program Bench
→ beats Fable 5 and GPT 5.6 Sol on 6 of 35 published rows
→ trails Fable 5 on FrontierSWE (81.2 vs 86.6) and HLE-Full (43.5 vs 53.3)

Full analysis: https://www.marktechpost.com/2026/07/16/moonshot-ai-releases-kimi-k3-a-2-8-trillion-parameter-open-moe-model-with-kimi-delta-attention-and-1m-context/
Technical details: https://www.kimi.com/blog/kimi-k3
Try it: https://platform.kimi.ai/


Thinking Machines Lab Releases Inkling: A 975B-Parameter Open-Weights Multimodal MoE With a Trained Effort Dial.

No RoPE. No vision encoder. No audio encoder. Here's how it works:

1. Encoder-free multimodality
Most omni models bolt a separate encoder onto each modality. Inkling doesn't. Audio enters as dMel spectrograms: 100ms chunks classified into discrete mel bins. Images become 40x40 pixel patches through a four-layer hMLP. Both hit a lightweight embedding tower, then get processed jointly with text by the same decoder.
→ 91.4% VoiceBench, 73.5% MMMU Pro

2. Relative attention, not RoPE
Position is encoded directly in the attention logits. Beyond Q/K/V, a fourth projection produces a per-token, per-head relative feature, tweaked with key-query distance. Sliding-window and global layers interleave 5:1 with 8 KV heads.
→ 66 layers, 1M-token context

3. MoE with a shared expert sink
The router scores routed and shared experts together. Top-6 over 256 routed experts, plus 2 shared always active. Sigmoid routing, auxiliary-loss-free load balancing.
→ 975B total params, 41B active

4. Controllable thinking effort
The part worth stealing. During RL, the lab varied the system message and adjusted per-token cost across samples, so the model learned to modulate its own token budget. Now exposed as reasoning_effort in transformers.
→ Matches Nemotron 3 Ultra on Terminal Bench 2.1 at ~1/3 the tokens

5. The numbers (open weights, effort=0.99)
→ 78.0% FORTRESS Adversarial: highest of any open-weights model compared
→ 77.6% SWEBench Verified, 74.1% MCP Atlas
→ 63.8% Terminal Bench 2.1, trailing GLM 5.2 by 18.9 pts

Full analysis: https://www.marktechpost.com/2026/07/15/thinking-machines-lab-releases-inkling-a-975b-parameter-open-weights-multimodal-moe-with-41b-active-parameters-and-controllable-thinking-effort/
Model card: https://thinkingmachines.ai/model-card/inkling/
HF: https://huggingface.co/thinkingmachines/Inkling
Technical details: https://thinkingmachines.ai/news/introducing-inkling/


PrismML Releases Bonsai 27B: 1-bit and Ternary Builds of Qwen3.6-27B Hitting 89.5% of FP16 at 3.9GB.

No new pretrain. No higher-precision escape hatches. No multi-GPU rig. Here's how it works. 👇

(1) Codes, not floats
Every weight becomes a code, with one shared FP16 scale per group of 128. Ternary is {−1, 0, +1}, binary is {−1, +1}. Sharing the scale across 128 weights keeps its cost at 16/128 = 0.125 bits.
→ Ternary: log2(3) + 16/128 ≈ 1.71 bits/weight → 5.9GB
→ Binary: 1 + 16/128 = 1.125 bits/weight → 3.9GB

(2) Post-training, not from scratch
No BitNet-style low-bit pretrain. It starts from off-the-shelf Qwen3.6-27B, architecture unchanged. The representation runs end to end across embeddings, attention projections, MLP projections, and the LM head.
→ 9.4× (ternary) and 14.2× (binary) vs the 54GB FP16 baseline

(3) Labels are not bit-widths
Conventional low-bit builds are mixed-precision by construction. The advertised name describes the most-compressed tensors, not the model.
→ Q4_K_XL, labeled "4-bit," is really 5.2 bits/weight at 17.6GB
→ IQ2_XXS, labeled "2-bit," is really 2.8 bits/weight at 9.4GB

(4) Fitting a phone is two budgets
iOS caps a single app near half of RAM, so a 12GB iPhone exposes ~6GB. The KV cache grows on top. Hybrid attention at ~75% linear means only 16 of 64 layers cache.
→ 4-bit KV: 4.3GB at 262K context, down from 17.2GB
→ 11.0 tok/s on iPhone 17 Pro Max

(5) The numbers (15 benchmarks, thinking mode)
→ Ternary: 80.49 avg at 5.9GB, 94.6% of FP16
→ 1-bit: 76.11 avg at 3.9GB, 89.5% of FP16
→ IQ2_XXS falls to 57.5 on AIME26 while still scoring 88.93 on MMLU-Redux

The key takeaway: 27B-class reasoning without the 54GB checkpoint, using group-wise ternary and binary codes, an end-to-end low-bit language stack, and 4-bit KV, on one phone.

Full analysis: https://www.marktechpost.com/2026/07/14/prismml-releases-bonsai-27b-1-bit-and-ternary-builds-of-qwen3-6-27b-that-run-on-laptops-and-phones/
Repo: https://github.com/PrismML-Eng/Bonsai-demo/
Model weight: https://huggingface.co/collections/prism-ml/bonsai-27b
Technical details: https://prismml.com/news/bonsai-27b
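
The bits-per-weight arithmetic in point (1) of the Bonsai post, and a toy version of group-wise ternary coding, can be sketched as follows. The bit accounting comes straight from the post. The quantizer does not: the post does not say how PrismML chooses scales or codes, so the mean-absolute-value scale with round-to-nearest used here is a stand-in assumption, and all function names are hypothetical.

```python
import math


def bits_per_weight(levels, group_size=128, scale_bits=16):
    """Code bits plus the amortized cost of one shared scale per group."""
    return math.log2(levels) + scale_bits / group_size


def compression_vs_fp16(bits):
    """Ratio of 16-bit storage to the low-bit representation."""
    return 16 / bits


def quantize_ternary(group):
    """Map a group of floats to codes in {-1, 0, +1} plus one scale.

    Assumption: scale = mean absolute value, codes = round-to-nearest, clipped.
    """
    scale = sum(abs(w) for w in group) / len(group) or 1.0
    codes = [max(-1, min(1, round(w / scale))) for w in group]
    return codes, scale


def dequantize(codes, scale):
    """Reconstruct approximate weights from codes and the shared scale."""
    return [c * scale for c in codes]
```

With the default group of 128, `bits_per_weight(3)` is about 1.71 and `bits_per_weight(2)` is exactly 1.125, and dividing 16 by each gives the 9.4× and 14.2× ratios quoted in point (2).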


Mistral AI Releases Robostral Navigate: An 8B Model Enabling Robots to Navigate Complex Environments Hitting 76.6% on R2R-CE With One RGB Camera.

No LiDAR. No depth sensor. No multi-camera rig. Here's how it works. 👇

1. Pointing, not metric commands
The model predicts the pixel coordinates of the next target in the camera view, plus the arrival orientation. Working in pixel space keeps it robust to camera intrinsics and world scale. When the target leaves the frame, it falls back to local displacements ("2m forward, 1.5m left, turn 25°").

2. Grounding-first
No open-source VLM base. It starts from Mistral's grounding model (pointing, counting, localization). Navigation emerges once the model knows where things are.
→ ~400,000 trajectories across 6,000 simulated scenes

3. Prefix-caching for training
A tree-based attention mask packs a full episode into one sequence, with all time steps in a single forward pass.
→ 22× fewer training tokens; months of training done in days

4. Online RL on top
After supervised training, CISPO adds trial-and-error learning to fight distribution shift from behavior cloning.
→ +3.2% success rate from RL alone

5. The numbers (R2R-CE, Matterport3D)
→ 76.6% success on validation unseen
→ +9.7 pts over best single-camera approach
→ +4.5 pts over best depth/multi-camera system

The key takeaway: state-of-the-art continuous VLN without a sensor stack, built from grounding-init, pixel-space actions, prefix-cached SFT, and online RL, on one RGB camera.

Full analysis: https://www.marktechpost.com/2026/07/14/mistral-ai-releases-robostral-navigate-an-8b-model-enabling-robots-to-navigate-complex-environments-using-a-single-rgb-camera/
Technical details: https://mistral.ai/news/robostral-navigate/


[Most robots react. This one thinks a step ahead.]

Ant Group's Robbyant just published LingBot-VA 2.0, a video-action foundation model built from scratch for robot control, not fine-tuned from a video generator.

The usual approach takes a video generator made for content creation and bolts a robot policy onto it. LingBot-VA 2.0 argues that's the wrong starting point, and pretrains the whole causal stack natively instead.

What stands out:
→ Foresight Reasoning: the robot predicts the next action chunk while executing the current one, then overwrites the imagined frame with the real observation. Prediction and execution stop waiting on each other.
→ 927 ms → 142 ms per chunk, across four cumulative optimizations. That lifts asynchronous control from 35 Hz to 225 Hz, a 6.5× speedup.
→ One shared latent space. A semantic visual-action tokenizer puts world states and actions in the same coordinates, so unlabeled web video carries action-relevant signal.
→ Sparse MoE video stream: 128 experts, top-8 routing. Roughly 2.5B of ~15.3B parameters fire per token.
→ Few-shot by design: adapts from 10–15 demonstrations, and a human demo video can replace the text instruction entirely.

Full breakdown: https://www.marktechpost.com/2026/07/11/ant-groups-robbyant-unveils-lingbot-va-2-0/
Paper: https://github.com/Robbyant/lingbot-va/blob/main/LingBot_VA2_paper.pdf
Project Page: https://technology.robbyant.com/lingbot-va-v2


[Really Cool Research from NVIDIA] They Released Nemotron-Labs-3-Puzzle-75B-A9B: A Compressed Hybrid MoE LLM Delivering 2.03x Server Throughput at Matched User Throughput

It goes from 120.7B total / 12.8B active parameters to 75.3B / 9.3B, while keeping the parent's 88-block hybrid Mamba-Transformer MoE layout completely intact.

The serving operating point was fixed first. The architecture was searched to hit it.

Here's what's actually interesting:
→ Iterative Puzzle: compress a little, heal with distillation, rescore against the compressed model, repeat. Worth +0.57 avg points over single-shot at the same target
→ The architecture is not a scaled-down teacher. Active routed MoE capacity ranges 8.7% to 62.3% per layer, mean 30.9%
→ 8xB200, decode-heavy 8K/64K at 100 tok/s per user: 42,601 vs 20,939 tok/s. 2.03x
→ 8xB200, prefill-heavy 50K/2K, same floor: only 1.60x. Compression pays less when you're compute-bound
→ Single H100 at 1M context: weights drop 70GB to 44.5GB, so concurrency goes from 1 request to 8
→ MTP acceptance length 3.45 → 4.34 at draft length 7, after fixing the teacher-forced vs autoregressive drafting mismatch

Full analysis: https://www.marktechpost.com/2026/07/09/nvidia-releases-nemotron-labs-3-puzzle-75b-a9b-a-compressed-hybrid-moe-llm-delivering-2-03x-server-throughput-at-matched-user-throughput/
Paper: https://arxiv.org/pdf/2607.04371


Robbyant Releases LingBot-VLA 2.0: An Open-Source 6B Vision-Language-Action (VLA) Model for Cross-Embodiment Robot Manipulation

Most VLA models are trained for one robot, then re-trained for the next one. That's a per-embodiment pipeline, and Ant Group's Robbyant team just collapsed it into a single action space.

They released LingBot-VLA 2.0, a 6B vision-language-action model built on a Qwen3-VL-4B-Instruct backbone, pretrained on ~60,000 hours spanning 20 robot configurations, with one 55-dimensional canonical vector encoding every embodiment's state and action.

Here's what's actually interesting:
→ 50,000 hours of robot trajectories + 10,000 hours of egocentric human video, filtered from a 110,000-hour raw pool by jerk, velocity/acceleration Z-scores, and URDF-replay checks
→ One 55-D vector covers arms, end-effectors, grippers, 12-DoF dexterous hands, waists, heads, and mobile bases; unused dimensions are simply padded
→ Token-level MoE action expert with DeepSeek-V3-style auxiliary-loss-free routing: a bias corrects expert load, so no load-balancing loss touches the action objective
→ Dual-query distillation: current and future queries supervised by LingBot-Depth (geometry) and DINO-Video (causal temporal dynamics), so the policy predicts the future frame before acting
→ GM-100 generalist, AgileX Cobot Magic: 66.2 / 34.4 progress / success vs 59.1 / 32.2 for π0.5
→ Long-horizon mobile manipulation, stove cleaning in-domain: 84.3 / 66.7 vs 79.9 / 60.0 for π0.5, and it holds the lead out-of-distribution
→ ~130 ms inference on an RTX 4090D at 10 denoising steps. Apache-2.0, weights on Hugging Face and ModelScope

Full analysis: https://www.marktechpost.com/2026/07/08/robbyant-releases-lingbot-vla-2/
Paper: https://github.com/Robbyant/lingbot-vla-v2/blob/main/assets/LingBot_VLA_2_0.pdf
Model weights: https://huggingface.co/collections/robbyant/lingbot-vla-v2
Technical details: https://technology.robbyant.com/lingbot-vla-v2


Netflix AI Team Cuts Wide-Partition Read Latency from Seconds to Milliseconds by Splitting Cassandra Partitions Per ID

Netflix published how its TimeSeries Abstraction splits wide Cassandra 4.x partitions per TimeSeries ID, asynchronously, with no application changes. Here is the mechanism, stage by stage:

Context: partitions grow wide as events accumulate. Wide-partition reads pushed tail latency into seconds, causing timeouts, GC pauses, high CPU, and thread queueing. They ship two approaches.

1. Time Slice Re-Partitioning (table level)
A background worker reads partition-size percentiles from nodetool tablehistograms, exposed via a Cassandra virtual table.
→ Computes an adjustment factor when partitions miss a configured density (typically 2–10 MiB).
→ Rewrites future Time Slices, e.g. time_bucket interval 60s → 604800s.
Helps only when most of the table is mis-partitioned.

2. Dynamic Partitioning per ID (partition level)
An async pipeline: Detection → Planning & Splitting → Serving Reads.

Detection (read path):
→ Each read tracks bytes read; over a threshold it emits a detection event to Kafka.
→ Detect on reads, not writes, because most data never needs splitting.
→ Immutable partitions only, in the first implementation.

Planning & Splitting:
→ Planner reads the full partition once and checkpoints to a wide_row metadata table.
→ EventBucketPartitionSplitStrategy assigns more event buckets to a time bucket; ultra-wide partitions cap bucket count to bound read amplification.
→ Pre- and post-split checksums must match before status = COMPLETED.
→ The original partition is retained as a fallback.

Serving Reads:
→ Completed split keys load into in-memory Bloom filters (lookup: single-digit microseconds).
→ On a hit, cached wide_row metadata routes the query; PartitionReader reads the smaller split partitions and results are merged.

Correctness before rollout:
→ Data Bridge Spark jobs validate that split data matches the original, offline.
→ Phased rollout with a shadow-mode Comparison phase (bytes served, old vs new read path).

Measured results:
→ Average read latency: seconds → low double-digit milliseconds
→ Tail latency: several seconds → ~200 ms or better
→ Read timeouts down; lower CPU; minimal thread queueing
→ 500MB+ partitions remained queryable (one paginated read: time_taken 41.072410142s, available instead of timing out)

Full analysis: https://www.marktechpost.com/2026/07/08/netflix-ai-team-cuts-wide-partition-read-latency-from-seconds-to-milliseconds-by-splitting-cassandra-partitions-per-id/
Technical details: https://netflixtechblog.com/dynamically-splitting-wide-partitions-in-cassandra-for-time-series-workloads-0eded064f456
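
The detection and serving steps in the Netflix post can be sketched with in-memory stand-ins. Nothing here is Netflix's code: a Python `set` replaces the Bloom filter, a dict replaces the cached `wide_row` metadata, `read_partition` is a hypothetical callable, and the 10 MiB threshold is an assumed value. Events are `(timestamp, payload)` tuples and each partition is assumed to be sorted by timestamp.

```python
import heapq

SPLIT_THRESHOLD_BYTES = 10 * 1024 * 1024  # hypothetical detection threshold


def should_emit_detection(bytes_read, threshold=SPLIT_THRESHOLD_BYTES):
    """Read-path detection: flag a partition once a read exceeds the threshold."""
    return bytes_read > threshold


def read_events(ts_id, completed_splits, split_metadata, read_partition):
    """Route a read to split partitions when a completed split exists.

    `completed_splits` stands in for the Bloom filter. A real Bloom filter can
    return false positives, so the metadata lookup is checked as well and the
    original partition remains the fallback.
    """
    if ts_id in completed_splits and ts_id in split_metadata:
        parts = [read_partition(key) for key in split_metadata[ts_id]]
        return list(heapq.merge(*parts))  # k-way merge of sorted split partitions
    return read_partition(ts_id)
```

A quick usage example: with `store = {"id1#0": [(1, "a"), (3, "c")], "id1#1": [(2, "b"), (4, "d")]}` and metadata `{"id1": ["id1#0", "id1#1"]}`, calling `read_events("id1", {"id1"}, metadata, store.__getitem__)` returns the four events in timestamp order.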


Ant Group’s Robbyant Open-Sources LingBot-Vision: A 1B Boundary-Centric Vision Foundation Model for Dense Spatial Perception

Most vision foundation models treat boundaries as an output, something a task head predicts after pretraining. Robbyant (Ant Group) just inverted that, and the numbers are hard to ignore.

They released LingBot-Vision, a 1.1B ViT-g/16 trained purely with self-supervision, where the teacher's own boundary predictions decide which tokens the student must reconstruct. No labels, no external edge detectors, no pretrained backbone anywhere in the loop.

Here's what's actually interesting:
→ Boundary-forcing masking: teacher-discovered boundary tokens B are forced into the student's masked set, M⁺ = M ∪ B, so the least redundant tokens become the hardest prediction targets
→ Boundary fields reparameterized as categorical distributions over K=32 bins per channel: regression collapses in the EMA teacher-student loop, classification inherits DINO-style centering and sharpening
→ The uniform distribution over bins doubles as the a-contrario null hypothesis, so an NFA test validates every segment for free, and unsupported structure never becomes a teaching signal
→ NYUv2 linear-probe RMSE: 0.296 vs. 0.309 for the 7B DINOv3, with ~7× fewer parameters and a 161M-image corpus, an order of magnitude smaller than LVD-1689M
→ The distilled 0.3B student matches the 7B DINOv3 on NYUv2 (0.310 vs. 0.309) with roughly 23× fewer parameters
→ Swapping only the encoder into LingBot-Depth 2.0 set leading results on 14 depth-completion benchmarks, and the advantage widens as training data scales from 3M to 150M

Full analysis: https://www.marktechpost.com/2026/07/07/ant-groups-robbyant-open-sources-lingbot-vision-a-1b-boundary-centric-vision-foundation-model-for-dense-spatial-perception/
Paper: https://github.com/robbyant/lingbot-vision/blob/main/paper.pdf
Model Weights: https://huggingface.co/collections/robbyant/lingbot-vision
Technical Details: https://technology.robbyant.com/lingbot-vision


NVIDIA Releases Audex (Nemotron-Labs-Audex-30B-A3B): A Unified Audio-Text LLM That Preserves the Text Intelligence of Its Backbone

Most unified audio models pay a text tax. Add audio output, and reasoning benchmarks drop, even when the only new output is speech. NVIDIA just released one that doesn't.

Audex (Nemotron-Labs-Audex-30B-A3B) is a 30B MoE with 3B active parameters, built on the text-only Nemotron-Cascade-2-30B-A3B backbone. Audio inputs are projected into the text embedding space. Text tokens and quantized audio tokens are then generated the same way, inside one MoE decoder, with no thinker–talker split and no stacked cascade.

Here's what's actually interesting:
→ One model, audio in and out: understanding, ASR, translation, TTS, text-to-audio, speech-to-speech
→ Text holds vs its own backbone: IMO AnswerBench 81.1 vs 79.3, MMLU-Redux 86.4 vs 86.3
→ Beats text-only Qwen3.5-35B-A3B on several tasks: LiveCodeBench v6 85.3 vs 74.6, IFBench 77.8 vs 70.2
→ The usual tax, for contrast: Qwen3-Omni-30B-A3B-Thinking drops to 60.4 on HMMT vs 71.4 for its text backbone
→ 6.82 WER on OpenASR, ahead of Step-Audio-R1.1-33B (7.91) and Qwen3-Omni-Thinking (8.00)
→ Two codecs: X-Codec2 for speech (50 tok/s, FSQ, 65,536 codebook), X-Codec for general audio (200 tok/s, 4 flattened RVQ layers). It is the only strong open model generating general audio beyond speech

Full analysis: https://www.marktechpost.com/2026/07/07/nvidia-releases-audex-nemotron-labs-audex-30b-a3b-a-unified-audio-text-llm-that-preserves-the-text-intelligence-of-its-backbone/
Paper: https://arxiv.org/pdf/2607.05196
Model weights: https://huggingface.co/nvidia/Nemotron-Labs-Audex-30B-A3B


Liquid AI Open-Sources Antidoom: A Final Token Preference Optimization (FTPO) Method that Reduces Doom Loops in Reasoning Models

Most fixes for repetition loops in reasoning models reweight the entire output distribution. That treats a single-token failure as a whole-distribution problem, and Liquid AI just drew a clear line between the two.

They open-sourced Antidoom, a method that removes doom loops by retraining exactly one position. A doom loop is when a model emits a span, then repeats it until the context window is exhausted. Antidoom finds the token that starts the loop, trains the model to prefer coherent alternatives at that spot, and leaves the rest of the vocabulary largely intact. It runs on Final Token Preference Optimization (FTPO), a DPO-style algorithm built to move a handful of tokens with minimal disturbance elsewhere.

Here's what's actually interesting:
→ The loop almost always starts on one overtrained token, usually interruptives like "Wait," "So," or "Alternatively"
→ FTPO trains only the trailing token mid-generation, spreads probability across multiple chosen tokens, and uses a KL-like loss in logit space
→ LFM2.5-2.6B: doom-loop rate fell from 10.2% to 1.4%
→ Qwen3.5-4B: 22.9% → 1% under greedy sampling, with eval gains coming entirely from the drop in looping
→ Whole pipeline runs in a few hours: ~1h to generate the set on 8× MI325, 1–2h to train on a single GPU

Full analysis: https://www.marktechpost.com/2026/07/07/liquid-ai-antidoom-doom-loops-ftpo/
GitHub Repo: https://github.com/Liquid4All/antidoom
Technical Details: https://www.liquid.ai/blog/antidoom
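
The first step the Antidoom post describes, locating where a doom loop begins, can be approximated with a simple tail-repetition check. This is a hypothetical detector, not the one in the Antidoom repo: it assumes a loop shows up as a span repeated back to back at the end of the output, and it returns the index where the repeated region starts. The thresholds `min_repeats` and `max_period` are illustrative defaults.

```python
def find_loop_start(tokens, min_repeats=3, max_period=64):
    """Return the index where a trailing repeated span begins, or None.

    Tries the shortest period first. For each candidate period p, it extends
    the periodic region backwards from the tail and accepts it once the region
    holds at least `min_repeats` full copies of the span.
    """
    n = len(tokens)
    for period in range(1, min(max_period, n // min_repeats) + 1):
        start = n - period
        # Walk backwards while each token equals the one a full period later.
        while start > 0 and tokens[start - 1] == tokens[start - 1 + period]:
            start -= 1
        if (n - start) // period >= min_repeats:
            return start
    return None


if __name__ == "__main__":
    out = ["The", "answer", "Wait", ",", "so", "Wait", ",", "so", "Wait", ",", "so"]
    idx = find_loop_start(out)
    print(idx, out[idx] if idx is not None else None)
```

In the example, the repeated span is `Wait , so` and the detector points at the first `Wait`, which matches the post's observation that loops tend to begin on an interruptive token. Whether FTPO trains at the first occurrence or at the first repetition is not specified in the post.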