11 papers · 304 pages · agent-curated from cs.AI
Large language models (LLMs) have demonstrated strong potential in long-horizon decision-making tasks, such as embodied manipulation and web interaction. However, agents frequently struggle with endless trial-and-error loops or deviate from the main objective in complex...
Longitudinal health agents must reason across multi-source trajectories that combine continuous device streams, sparse clinical exams, and episodic life events, yet evaluating them is hard: real-world data cannot be released at scale, and temporally grounded attribution...
Competitive programming remains one of the last few human strongholds in coding against AI. The best AI system to date still underperforms the best humans in competitive programming: the most recent best result, Google’s Gemini 3 Deep Think, attained 8th place even not being...
As large language model (LLM)-driven agents transition from isolated task solvers to persistent digital entities, the emergence of the Agentic Web, an ecosystem where heterogeneous agents autonomously interact and co-evolve, marks a pivotal shift toward Artificial General...
Deep learning models excel at detecting anomaly patterns in normal data. However, they do not provide a direct solution for anomaly classification and scalability across diverse control systems, frequently failing to distinguish genuine faults from nuisance faults caused by...
In large language model (LLM)-driven multi-agent systems, "disobey role specification" (failure to adhere to the defined responsibilities and constraints of an assigned role, potentially leading to an agent behaving like another) is a major failure mode (Cemri et al. [2025]). To...
As Large Language Models (LLMs) exhibit plateauing performance on conventional benchmarks, a pivotal challenge persists: evaluating their proficiency in complex, open-ended tasks characterizing genuine expert-level cognition. Existing frameworks suffer from narrow domain...
As ongoing research explores the ability of AI agents to be insider threats and act against company interests, we showcase the abilities of such agents to act against human well-being in service of corporate authority. Building on Agentic Misalignment and AI scheming...
We study structured abstraction-based reasoning for the Abstraction and Reasoning Corpus (ARC) and compare its generalization to test-time approaches. Purely neural architectures lack reliable combinatorial generalization, while strictly symbolic systems struggle with per-...