ML Research Digest — March 8, 2026

1

STRUCTUREDAGENT: Planning with AND/OR Trees for Long-Horizon Web Tasks

✨ 5.0 | Agents / Planning | ELita Lobo, Xu Chen, Jingjing Meng et al. | arXiv

Recent advances in large language models (LLMs) have enabled agentic systems for sequential decision-making. Such agents must perceive their environment, reason across multiple time steps, and take actions that optimize long-term objectives…

2

Building AI Coding Agents for the Terminal: Scaffolding, Harness, Context Engineering, and Lessons Learned

✨ 5.0 | Agents / Memory | Nghi D. Q. Bui | arXiv

The landscape of AI coding assistance is undergoing a fundamental shift from complex IDE plugins to versatile, terminal-native agents. Operating directly where developers manage source control, execute builds, and deploy environments, CLI-b…

3

KARL: Knowledge Agents via Reinforcement Learning

✨ 5.0 | Agents / Tool Use | Jonathan D. Chang, Andrew Drozdov, Shubham Toshniwal et al. | arXiv

We present a system for training enterprise search agents via reinforcement learning that achieves state-of-the-art performance across a diverse suite of hard-to-verify agentic search tasks. Our work makes four core contributions. First, we…

4

Building AI Coding Agents for the Terminal: Scaffolding, Harness, Context Engineering, and Lessons Learned

✨ 5.0 | Agents / Planning | Nghi D. Q. Bui | arXiv

The landscape of AI coding assistance is undergoing a fundamental shift from complex IDE plugins to versatile, terminal-native agents. Operating directly where developers manage source control, execute builds, and deploy environments, CLI-b…

5

KARL: Knowledge Agents via Reinforcement Learning

✨ 5.0 | RAG & Grounding | Jonathan D. Chang, Andrew Drozdov, Shubham Toshniwal et al. | arXiv

We present a system for training enterprise search agents via reinforcement learning that achieves state-of-the-art performance across a diverse suite of hard-to-verify agentic search tasks. Our work makes four core contributions. First, we…

6

STRUCTUREDAGENT: Planning with AND/OR Trees for Long-Horizon Web Tasks

✨ 5.0 | Agents / Memory | ELita Lobo, Xu Chen, Jingjing Meng et al. | arXiv

Recent advances in large language models (LLMs) have enabled agentic systems for sequential decision-making. Such agents must perceive their environment, reason across multiple time steps, and take actions that optimize long-term objectives…

7

iAgentBench: Benchmarking Sensemaking Capabilities of Information-Seeking Agents on High-Traffic Topics

✨ 4.0 | RAG & Grounding | Preetam Prabhu Srikar Dammu, Arnav Palkhiwala, Tanya Roosta et al. | arXiv

With the emergence of search-enabled generative QA systems, users are increasingly turning to tools that browse, aggregate, and reconcile evidence across multiple sources on their behalf. Yet many widely used QA benchmarks remain answerable…

8

TimeWarp: Evaluating Web Agents by Revisiting the Past

✨ 4.0 | Agent Evaluation & Reliability | Md Farhan Ishmam, Kenneth Marino | arXiv

The improvement of web agents on current benchmarks raises the question: Do today's agents perform just as well when the web changes? We introduce TimeWarp, a benchmark that emulates the evolving web using containerized environments that va…

9

WebChain: A Large-Scale Human-Annotated Dataset of Real-World Web Interaction Traces

✨ 4.0 | Agents / Planning | Sicheng Fan, Rui Wan, Yifei Leng et al. | arXiv

We introduce WebChain, the largest open-source dataset of human-annotated trajectories on real-world websites, designed to accelerate reproducible research in web agents. It contains 31,725 trajectories and 318k steps, featuring a core Trip…

10

Judge Reliability Harness: Stress Testing the Reliability of LLM Judges

✨ 4.0 | Agent Evaluation & Reliability | Sunishchal Dev, Andrew Sloan, Joshua Kavner et al. | arXiv

We present the Judge Reliability Harness, an open-source library for constructing validation suites that test the reliability of LLM judges. As LLM-based scoring is widely deployed in AI benchmarks, more tooling is needed to efficiently ass…
