arXiv AI Top 10

Aligning Progress and Feasibility: A Neuro-Symbolic Dual Memory Framework for Long-Horizon LLM Agents

Bin Wen¹, Ruoxuan Zhang², Yang Chen¹, Hongxia Xie², Lan-Zhe Guo¹∗
¹Nanjing University ²Jilin University

Large language models (LLMs) have demonstrated strong potential in long-horizon decision-making tasks, such as embodied manipulation and web interaction. However, agents frequently struggle with endless trial-and-error loops or deviate from the main objective in complex...

35 pages arxiv ↗

ESL-Bench: An Event-Driven Synthetic Longitudinal Benchmark for Health Agents

Chao Li∗, Cailiang Liu∗, Ang Gao, Kexin Deng, Shu Zhang, Langping Xu, Xiaotong Shi, Xionghao Ding, Jian Pei†, Xun Jiang†
Shanda Group

Longitudinal health agents must reason across multi-source trajectories that combine continuous device streams, sparse clinical exams, and episodic life events, yet evaluating them is hard: real-world data cannot be released at scale, and temporally grounded attribution...

29 pages arxiv ↗

GrandCode: Achieving Grandmaster Level in Competitive Programming via Agentic Reinforcement Learning

Xiaoya Li, Xiaofei Sun, Guoyin Wang∗, Songqiao Su, Chris Shum and Jiwei Li
DeepReinforce Team

Competitive programming remains one of the last few human strongholds in coding against AI. The best AI system to date still underperforms the best humans in competitive programming: the most recent best result, Google’s Gemini 3 Deep Think, attained 8th place even not being...

31 pages arxiv ↗

Holos: A Web-Scale LLM-Based Multi-Agent System for the Agentic Web

Xiaohang Nie¹,³,⁶,†, Zihan Guo¹,⁴,†, Zicai Cui², Jiachi Yang¹, Zeyi Chen¹,², Leheyi De¹, Yu Zhang⁶, Junwei Liao¹,², Bo Huan

As large language model (LLM)-driven agents transition from isolated task solvers to persistent digital entities, the emergence of the Agentic Web, an ecosystem where heterogeneous agents autonomously interact and co-evolve, marks a pivotal shift toward Artificial General...

38 pages arxiv ↗

AIVV: Neuro-Symbolic LLM Agent-Integrated Verification and Validation for Trustworthy Autonomous Systems

Jiyong Kwon¹, Ujin Jeon², Sooji Lee³, Guang Lin¹,⁴
¹School of Mechanical Engineering, Purdue University

Deep learning models excel at detecting anomaly patterns in normal data. However, they do not provide a direct solution for anomaly classification and scalability across diverse control systems, frequently failing to distinguish genuine faults from nuisance faults caused by...

20 pages arxiv ↗

IMPROVING ROLE CONSISTENCY IN MULTI-AGENT

Guoling Zhou, School of Information Science and Technology, Northeast Normal University; Wenpei Han, School of Information Science and Technology, Northeas

In large language model (LLM)-driven multi-agent systems, "disobey role specification" (failure to adhere to the defined responsibilities and constraints of an assigned role, potentially leading to an agent behaving like another) is a major failure mode (Cemri et al. [2025]). To...

13 pages arxiv ↗

AutoVerifier: An Agentic Automated Verification Framework Using Large Language Models

Yuntao Du, Minh Dinh, Kaiyuan Zhang, Ninghui Li
Purdue University
Winner of 2025-2026 Radiance Technologies Innovation Bowl

16 pages arxiv ↗

Xpertbench: Expert Level Tasks with Rubrics-Based

ByteDance Seed

As Large Language Models (LLMs) exhibit plateauing performance on conventional benchmarks, a pivotal challenge persists: evaluating their proficiency in complex, open-ended tasks characterizing genuine expert-level cognition. Existing frameworks suffer from narrow domain...

25 pages arxiv ↗

“I must delete the evidence”: AI Agents Explicitly Cover up Fraud and Violent Crime

Thomas Rivasseau¹, Benjamin Fung¹

As ongoing research explores the ability of AI agents to be insider threats and act against company interests, we showcase the abilities of such agents to act against human well-being in service of corporate authority. Building on Agentic Misalignment and AI scheming...

24 pages arxiv ↗

Let’s Have a Conversation: Designing and Evaluating LLM Agents for Interactive Optimization

Joshua Drossman, Alexandre Jacquillat
Operations Research Center and Sloan School of Management, Massachusetts

49 pages arxiv ↗

Compositional Neuro-Symbolic Reasoning

Anugyan Das¹, Omkar Ghugarkar¹, Vishvesh Bhat¹, Asad Aali²
¹CoreThink AI ²Stanford University

We study structured abstraction-based reasoning for the Abstraction and Reasoning Corpus (ARC) and compare its generalization to test-time approaches. Purely neural architectures lack reliable combinatorial generalization, while strictly symbolic systems struggle with per-...

24 pages arxiv ↗