📅 Sunday, March 8, 2026

Top ML papers scored in PaperBrief digests on this day.

Top Papers — Mar 8

The highest-scoring arXiv ML papers from Mar 8, ranked by LLM relevance.

1📄 Notable
Building AI Coding Agents for the Terminal: Scaffolding, Harness, Context Engineering, and Lessons Learned

Nghi D. Q. Bui · 2026-03-05

5.0

The landscape of AI coding assistance is undergoing a fundamental shift from complex IDE plugins to versatile, terminal-native agents. Operating directly where developers manage source control, execut…

2📄 Notable
STRUCTUREDAGENT: Planning with AND/OR Trees for Long-Horizon Web Tasks

Elita Lobo +7 · 2026-03-05

5.0

Recent advances in large language models (LLMs) have enabled agentic systems for sequential decision-making. Such agents must perceive their environment, reason across multiple time steps, and take ac…

3📄 Notable
KARL: Knowledge Agents via Reinforcement Learning

Jonathan D. Chang +25 · 2026-03-05

5.0

We present a system for training enterprise search agents via reinforcement learning that achieves state-of-the-art performance across a diverse suite of hard-to-verify agentic search tasks. Our work …

4📄 Notable
Judge Reliability Harness: Stress Testing the Reliability of LLM Judges

Sunishchal Dev +4 · 2026-03-05

4.0

We present the Judge Reliability Harness, an open-source library for constructing validation suites that test the reliability of LLM judges. As LLM-based scoring is widely deployed in AI benchmarks, m…

5📄 Notable
WebChain: A Large-Scale Human-Annotated Dataset of Real-World Web Interaction Traces

Sicheng Fan +6 · 2026-03-05

4.0

We introduce WebChain, the largest open-source dataset of human-annotated trajectories on real-world websites, designed to accelerate reproducible research in web agents. It contains 31,725 trajectori…

6📄 Notable
TimeWarp: Evaluating Web Agents by Revisiting the Past

Md Farhan Ishmam +1 · 2026-03-05

4.0

The improvement of web agents on current benchmarks raises the question: Do today's agents perform just as well when the web changes? We introduce TimeWarp, a benchmark that emulates the evolving web …

7📄 Notable
iAgentBench: Benchmarking Sensemaking Capabilities of Information-Seeking Agents on High-Traffic Topics

Preetam Prabhu Srikar Dammu +3 · 2026-03-04

4.0

With the emergence of search-enabled generative QA systems, users are increasingly turning to tools that browse, aggregate, and reconcile evidence across multiple sources on their behalf. Yet many wid…

8📄 Notable
From Word to World: Can Large Language Models be Implicit Text-based World Models?

Yixia Li +9 · 2025-12-21

4.0

Agentic reinforcement learning increasingly relies on experience-driven scaling, yet real-world environments remain non-adaptive, limited in coverage, and difficult to scale. World models offer a pote…

9📄 Notable
Graph2Eval: Automatic Multimodal Task Generation for Agents via Knowledge Graphs

Yurun Chen +10 · 2025-10-01

4.0

As multimodal LLM-driven agents advance in autonomy and generalization, traditional static datasets face inherent scalability limitations and are insufficient for fully assessing their capabilities in…

10📄 Notable
Towards Provably Unbiased LLM Judges via Bias-Bounded Evaluation

Benjamin Feuer +2 · 2026-03-05

3.0

As AI models progress beyond simple chatbots into more complex workflows, we draw ever closer to the event horizon beyond which AI systems will be utilized in autonomous, self-maintaining feedback loo…
