Researchers from Renmin University and Huawei Noah's Ark Lab develop GUI-G1, a framework that improves visual grounding in GUI agents through targeted reinforcement learning techniques, achieving state-of-the-art accuracy on ScreenSpot benchmarks while requiring only 17K training samples and generating fewer output tokens than existing approaches.
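Methods in this line typically optimize a rule-based grounding reward rather than a learned critic. The minimal sketch below shows the basic "predicted click lands inside the target element" signal such rewards build on; the function name, normalized-coordinate convention, and binary shaping are illustrative assumptions, not GUI-G1's exact reward design.

```python
# Illustrative rule-based grounding reward for RL fine-tuning of a GUI agent.
# Coordinates are assumed normalized to [0, 1]; GUI-G1's actual reward shaping
# (e.g., any box-size handling) may differ from this minimal version.
def grounding_reward(pred_xy, target_box):
    """Return 1.0 if the predicted click falls inside the ground-truth element box."""
    x, y = pred_xy
    x1, y1, x2, y2 = target_box
    return 1.0 if (x1 <= x <= x2 and y1 <= y <= y2) else 0.0

# Example: a click at (0.42, 0.87) hits a box spanning (0.40, 0.85)-(0.55, 0.90).
assert grounding_reward((0.42, 0.87), (0.40, 0.85, 0.55, 0.90)) == 1.0
```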
Researchers from the National University of Singapore introduce Thinkless, a framework that enables LLMs to adaptively choose between short and long-form reasoning based on task complexity, reducing unnecessary chain-of-thought reasoning by 50-90% across mathematical benchmarks while maintaining accuracy through a novel decoupled reinforcement learning algorithm.
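At inference time, frameworks of this kind let the model pick its own reasoning mode by emitting a control token before the answer. The sketch below assumes two such tokens, "<short>" and "<think>", a placeholder checkpoint path, and illustrative token budgets, none of which are taken from the paper verbatim.

```python
# Minimal inference sketch of control-token routing between short and long-form
# reasoning. The model path, token strings, and token budgets are assumptions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

path = "path/to/hybrid-reasoning-model"  # hypothetical checkpoint
tok = AutoTokenizer.from_pretrained(path)
model = AutoModelForCausalLM.from_pretrained(path)

short_id = tok.convert_tokens_to_ids("<short>")
think_id = tok.convert_tokens_to_ids("<think>")

def answer(question: str) -> str:
    ids = tok(question, return_tensors="pt")
    next_logits = model(**ids).logits[0, -1]
    # The model itself decides the mode: compare the two control-token logits.
    mode_id = short_id if next_logits[short_id] > next_logits[think_id] else think_id
    prefix = torch.cat([ids["input_ids"], torch.tensor([[mode_id]])], dim=-1)
    budget = 256 if mode_id == short_id else 4096  # concise answer vs. full chain of thought
    out = model.generate(prefix, max_new_tokens=budget)
    return tok.decode(out[0], skip_special_tokens=True)
```

The decoupled reinforcement learning piece of the method concerns how the control token is learned during training; this sketch covers only the inference-time routing behavior.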
Princeton, Peking University, and ByteDance researchers introduce MMaDA, a unified multimodal diffusion framework that handles text reasoning, image generation, and multimodal understanding through a shared architecture and novel post-training methodology, achieving competitive performance against specialized models while enabling cross-task generalization without additional fine-tuning.
DeepSeek-AI researchers present insights from developing DeepSeek-V3, documenting specific hardware constraints and architectural solutions that enable efficient large language model training through innovations in mixed-precision computation, network topology optimization, and memory management while achieving competitive performance with significantly reduced hardware requirements.
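A recurring idea in that line of work is keeping weights and activations in a low-precision format with fine-grained scaling factors. The sketch below simulates block-wise FP8 quantize/dequantize in PyTorch; the 128x128 block size and the torch.float8_e4m3fn format are illustrative choices, and real training kernels fuse this with higher-precision accumulation rather than materializing the dequantized tensor.

```python
# Minimal simulation of block-wise FP8 quantization with one scale per tile.
# An illustration of fine-grained scaling, not DeepSeek-V3's production kernels.
import torch

def quantize_dequantize_blockwise(w: torch.Tensor, block: int = 128) -> torch.Tensor:
    """Round-trip a 2-D tensor through FP8 (e4m3) using per-tile scales."""
    fp8 = torch.float8_e4m3fn
    fp8_max = torch.finfo(fp8).max
    out = torch.empty_like(w)
    for i in range(0, w.shape[0], block):
        for j in range(0, w.shape[1], block):
            tile = w[i:i + block, j:j + block]
            scale = tile.abs().max().clamp(min=1e-12) / fp8_max  # per-tile scale
            q = (tile / scale).to(fp8)                           # stored/communicated in FP8
            out[i:i + block, j:j + block] = q.to(w.dtype) * scale  # dequantize
    return out

w = torch.randn(256, 256)
err = (w - quantize_dequantize_blockwise(w)).abs().max()  # small round-trip error
```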
Researchers from Cambridge and UCL introduce Visual Planning, a framework enabling large vision models to reason directly through image sequences without textual mediation, demonstrating superior performance on navigation tasks compared to language-based approaches while reducing the modality gap in multimodal reasoning.
Microsoft Research Asia and collaborating institutions introduce R&D-Agent(Q), a data-centric multi-agent framework that automates quantitative investment strategy development through coordinated factor-model optimization, achieving 2x higher annualized returns than classical factor libraries while using 70% fewer factors and operating at minimal cost.
ByteDance researchers introduce BAGEL, an open-source multimodal AI model that combines understanding and generation capabilities through carefully structured interleaved data and a Mixture-of-Transformer-Experts architecture, achieving competitive performance with proprietary systems while demonstrating emergent abilities in complex visual manipulation and world navigation.
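The Mixture-of-Transformer-Experts idea can be pictured as one interleaved token sequence that shares attention while each modality gets its own dedicated parameters. The simplified block below routes tokens to one of two feed-forward experts by a token-type id; the dimensions, the two-expert split, and the shared attention projection are illustrative simplifications rather than BAGEL's exact architecture.

```python
# Simplified Mixture-of-Transformer-Experts block: all tokens attend to each
# other, but each token type is processed by its own feed-forward expert.
import torch
import torch.nn as nn

class MoTBlock(nn.Module):
    def __init__(self, d_model=1024, n_heads=16, d_ff=4096):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        # One FFN "expert" per token type: 0 = understanding, 1 = generation.
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(2)
        ])

    def forward(self, x, expert_ids):
        # Shared self-attention over the full interleaved sequence.
        h = self.norm1(x)
        attn_out, _ = self.attn(h, h, h, need_weights=False)
        x = x + attn_out
        # Route each token to its modality-specific expert.
        h = self.norm2(x)
        out = torch.zeros_like(x)
        for k, expert in enumerate(self.experts):
            mask = expert_ids == k
            out[mask] = expert(h[mask])
        return x + out

# Usage: understanding and generation tokens attend jointly but never share FFN weights.
y = MoTBlock()(torch.randn(2, 16, 1024), torch.randint(0, 2, (2, 16)))
```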
USC researchers demonstrate that textual steering vectors extracted from language model backbones can improve visual understanding in multimodal LLMs, enabling better performance on spatial reasoning and counting tasks while requiring no additional training data or model modifications.
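Steering vectors of this kind are commonly built from contrastive text prompts and injected into the residual stream at inference. The sketch below follows that generic recipe with a Llama-style backbone; the model name, layer index, prompt pairs, and scaling coefficient are illustrative assumptions, not the paper's exact procedure.

```python
# Minimal sketch of extracting a textual steering vector and applying it with a
# forward hook. Model name, layer choice, prompts, and scale are assumptions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "meta-llama/Llama-2-7b-hf"  # placeholder language backbone
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name, torch_dtype=torch.float16, device_map="auto")
layer = 14  # illustrative mid-layer

@torch.no_grad()
def mean_last_token_state(prompts):
    # Average the last-token hidden state at the chosen layer over a prompt set.
    states = []
    for p in prompts:
        ids = tok(p, return_tensors="pt").to(model.device)
        hs = model(**ids, output_hidden_states=True).hidden_states
        states.append(hs[layer + 1][0, -1])  # +1: index 0 holds the embeddings
    return torch.stack(states).mean(dim=0)

# Contrastive text-only prompts targeting a visual skill such as counting.
pos = ["Carefully count every object one by one before answering.",
       "Enumerate the items step by step, then give the total."]
neg = ["Answer immediately without counting anything.",
       "Give a quick guess about the number of items."]
steering = mean_last_token_state(pos) - mean_last_token_state(neg)

def add_steering(module, inputs, output):
    # Shift every token's hidden state along the steering direction.
    hidden = output[0] if isinstance(output, tuple) else output
    hidden = hidden + 4.0 * steering.to(hidden.dtype)
    return (hidden,) + output[1:] if isinstance(output, tuple) else hidden

handle = model.model.layers[layer].register_forward_hook(add_steering)
# ... run (multimodal) generation as usual, then: handle.remove()
```

Because the vector is computed from text alone and added only at inference, this recipe needs no gradient updates to the multimodal model, which matches the training-free spirit of the summary above.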
A comprehensive taxonomy establishes clear distinctions between AI Agents (autonomous task-specific entities) and Agentic AI (orchestrated multi-agent systems), mapping their architectural differences, capabilities, and limitations while providing structured frameworks for system design and evaluation across domains like robotics, healthcare, and research automation.