A Reproducibility Study of LLM-Based Query Reformulation
Amin Bigdeli, Radin Hamidi Rad, Hai Son Le et al.
cs.IR · cs.CL · Apr 30, 2026
Large Language Models (LLMs) are now widely used for query reformulation and expansion in Information Retrieval, with many studies reporting substantial effectiveness gains. However, these results are…
TwinGate: Stateful Defense against Decompositional Jailbreaks in Untraceable Traffic via Asymmetric Contrastive Learning
Bowen Sun, Chaozhuo Li, Yaodong Yang et al.
cs.CR · cs.CL · cs.LG · Apr 30, 2026
Decompositional jailbreaks pose a critical threat to large language models (LLMs) by allowing adversaries to fragment a malicious objective into a sequence of individually benign queries that…
Qualitative Evaluation of Language Model Rescoring in Automatic Speech Recognition
Thibault Bañeras-Roux, Mickaël Rouvier, Jane Wottawa et al.
cs.CL · Apr 30, 2026
Evaluating automatic speech recognition (ASR) systems is a classical but difficult and still open problem, which often boils down to focusing only on the word error rate (WER). However, this metric…
PRISM: Pre-alignment via Black-box On-policy Distillation for Multimodal Reinforcement Learning
Sudong Wang, Weiquan Huang, Xiaomin Yu et al.
cs.CV · cs.AI · cs.CL · Apr 30, 2026
The standard post-training recipe for large multimodal models (LMMs) applies supervised fine-tuning (SFT) on curated demonstrations followed by reinforcement learning with verifiable rewards (RLVR)…
A Pattern Language for Resilient Visual Agents
Habtom Kahsay Gidey, Alexander Lenz, Alois Knoll
cs.AI · cs.SE · Apr 30, 2026
Integrating multimodal foundation models into enterprise ecosystems presents a fundamental software architecture challenge. Architects must balance competing quality attributes: the high latency and…
Heterogeneous Scientific Foundation Model Collaboration
Zihao Li, Jiaru Zou, Feihao Fang et al.
cs.AI · cs.CL · cs.LG · Apr 30, 2026
Agentic large language model systems have demonstrated strong capabilities. However, their reliance on language as the universal interface fundamentally limits their applicability to many real-world…
Claw-Eval-Live: A Live Agent Benchmark for Evolving Real-World Workflows
Chenxin Li, Zhengyang Tang, Huangxin Lin et al.
cs.SE · cs.AI · Apr 30, 2026
LLM agents are expected to complete end-to-end units of work across software tools, business services, and local workspaces. Yet many agent benchmarks freeze a curated task set at release time and…
Synthetic Computers at Scale for Long-Horizon Productivity Simulation
Tao Ge, Baolin Peng, Hao Cheng et al.
cs.AI · cs.CL · cs.LG · Apr 30, 2026
Realistic long-horizon productivity work is strongly conditioned on user-specific computer environments, where much of the work context is stored and organized through directory structures and content…
From Mirage to Grounding: Towards Reliable Multimodal Circuit-to-Verilog Code Generation
Guang Yang, Xing Hu, Xiang Chen et al.
cs.SE · cs.AI · Apr 30, 2026
Multimodal large language models (MLLMs) are increasingly used to translate visual artifacts into code, from UI mockups into HTML and scientific plots into Python scripts. A circuit diagram can be…
Can AI Be a Good Peer Reviewer? A Survey of Peer Review Process, Evaluation, and the Future
Sihong Wu, Owen Jiang, Yilun Zhao et al.
cs.CL · cs.AI · Apr 30, 2026
Peer review is a multi-stage process involving reviews, rebuttals, meta-reviews, final decisions, and subsequent manuscript revisions. Recent advances in large language models (LLMs) have motivated…
From Unstructured Recall to Schema-Grounded Memory: Reliable AI Memory via Iterative, Schema-Aware Extraction
Alex Petrov, Alexander Gusak, Denis Mukha et al.
cs.AI · cs.CL · Apr 30, 2026
Persistent AI memory is often reduced to a retrieval problem: store prior interactions as text, embed them, and ask the model to recover relevant context later. This design is useful for thematic…
Measuring research data reuse in scholarly publications using generative artificial intelligence: Open Science Indicator development and preliminary results
Lauren Cadwallader, Iain Hrynaszkiewicz, Parth Sarin et al.
cs.DL · cs.CL · Apr 30, 2026
Numerous metascience studies and other initiatives have begun to monitor the prevalence of open science practices when it is more important to understand the 'downstream' effects or impacts of open…
Measurement Risk in Supervised Financial NLP: Rubric and Metric Sensitivity on JF-ICRS
Sidi Chang, Peiying Zhu, Yuxiao Chen et al.
cs.AI · cs.CL · Apr 30, 2026
As LLMs become credible readers of earnings calls, investor-relations Q&A, guidance, and disclosure language, supervised financial NLP benchmarks increasingly function as decision evidence for model…
DPN-LE: Dual Personality Neuron Localization and Editing for Large Language Models
Lifan Zheng, Xue Yang, Jiawei Chen et al.
cs.CL · Apr 30, 2026
With the widespread adoption of large language models (LLMs), understanding their personality representation mechanisms has become critical. As a novel paradigm in Personality Editing, most existing…
Sentiment Analysis of AI Adoption in Indonesian Higher Education Using Machine Learning and Transformer-Based Models
Happy Syahrul Ramadhan, Ahmad Sahidin Akbar, Karin Yehezkiel Sinaga et al.
cs.CL · Apr 30, 2026
This study analyzes Indonesian student opinions on the adoption of artificial intelligence in higher education using two approaches: TF-IDF-based machine learning and Transformer-based deep learning…
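As background for the first approach named above: TF-IDF weights each term by its within-document frequency scaled by the log inverse document frequency, so terms common to every document are down-weighted. A minimal pure-Python sketch with toy documents (illustrative only, not the study's pipeline; real systems typically use a library such as scikit-learn's TfidfVectorizer, which adds smoothing and normalization):

```python
import math
from collections import Counter

def tfidf(docs):
    """TF-IDF for a list of tokenized documents.
    tf = term count / doc length; idf = log(N / df)."""
    n = len(docs)
    df = Counter()                      # document frequency per term
    for doc in docs:
        df.update(set(doc))
    weights = []
    for doc in docs:
        tf = Counter(doc)
        total = len(doc)
        weights.append({t: (c / total) * math.log(n / df[t])
                        for t, c in tf.items()})
    return weights

# Toy corpus of tokenized student comments (hypothetical)
docs = [["ai", "helps", "learning"],
        ["ai", "raises", "concerns"],
        ["learning", "improves"]]
w = tfidf(docs)
```

Here "ai" appears in two of three documents, so its idf is log(3/2); a term appearing in every document would receive weight zero.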
Mapping the Methodological Space of Classroom Interaction Research: Scale, Duration, and Modality in an Age of AI
Dorottya Demszky, Edith Bouton, Alison Twiner et al.
cs.AI · cs.CL · cs.CY · Apr 30, 2026
Research on classroom interaction has long been divided between large-scale observation and in-depth ethnographic work. We propose a framework mapping this methodological space along three dimensions…
Beyond the Mean: Within-Model Reliable Change Detection for LLM Evaluation
Jon-Paul Cacioli
cs.CL · cs.AI · Apr 30, 2026
We adapted the Reliable Change Index (RCI; Jacobson and Truax, 1991) from clinical psychology to item-level LLM version comparison on 2,000 MMLU-Pro items (K=10 samples at T=0.7). Two within-family…
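The Jacobson-Truax RCI cited above has a simple closed form: the score difference divided by the standard error of the difference score, where SE = s1·sqrt(1 − r) and r is the measure's reliability. A minimal sketch (the score, spread, and reliability values are illustrative, not taken from the paper):

```python
import math

def reliable_change_index(x1, x2, s1, r):
    """Jacobson-Truax Reliable Change Index.

    x1, x2 : scores at time 1 and time 2 (e.g., per-item accuracy
             across K samples for two model versions)
    s1     : standard deviation of time-1 scores
    r      : reliability of the measure (e.g., test-retest)
    """
    se = s1 * math.sqrt(1 - r)       # standard error of measurement
    s_diff = math.sqrt(2 * se ** 2)  # SE of the difference score
    return (x2 - x1) / s_diff

# Conventionally, |RCI| > 1.96 is read as reliable change at p < .05
rci = reliable_change_index(x1=0.60, x2=0.80, s1=0.15, r=0.8)
print(abs(rci) > 1.96)
```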
Stable Behavior, Limited Variation: Persona Validity in LLM Agents for Urban Sentiment Perception
Neemias B da Silva, Rodrigo Minetto, Daniel Silver et al.
cs.CL · cs.SI · Apr 30, 2026
Large Language Models (LLMs) are increasingly used as proxies for human perception in urban analysis, yet it remains unclear whether persona prompting produces meaningful and reproducible behavioral…
MiniCPM-o 4.5: Towards Real-Time Full-Duplex Omni-Modal Interaction
Junbo Cui, Bokai Xu, Chongyi Wang et al.
cs.CL · Apr 30, 2026
Recent progress in multimodal large language models (MLLMs) has brought AI capabilities from static offline data processing to real-time streaming interaction, yet they remain far from…
On the Proper Treatment of Units in Surprisal Theory
Samuel Kiegeland, Vésteinn Snæbjarnarson, Tim Vieira et al.
cs.CL · Apr 30, 2026
Surprisal theory links human processing effort to the predictability of an upcoming linguistic unit, but empirical work often leaves the notion of a unit underspecified. In practice, experimental…
To Diff or Not to Diff? Structure-Aware and Adaptive Output Formats for Efficient LLM-based Code Editing
Wei Cheng, Yongchang Cao, Chen Shen et al.
cs.SE · cs.CL · Apr 30, 2026
Large Language Models (LLMs) are increasingly used for code editing, yet the prevalent full-code generation paradigm suffers from severe efficiency bottlenecks, posing challenges for interactive…
Universal statistical laws governing culinary design
Ganesh Bagler, Gopal Krishna Tewari, Aditya Raj Yadav et al.
physics.soc-ph · cs.CL · Apr 30, 2026
Cooking is a cultural expression of human creativity that transcends geography and time through the orchestration of ingredients and techniques, much like languages do through words and syntax. Yet…
LLMs Capture Emotion Labels, Not Emotion Uncertainty: Distributional Analysis and Calibration of Human-LLM Judgment Gaps
Keito Inoshita, Xiaokang Zhou, Akira Kawai et al.
cs.CL · Apr 30, 2026
Human annotators frequently disagree on emotion labels, yet most evaluations of Large Language Model (LLM) emotion annotation collapse these judgments into a single gold standard, discarding the…
Test Before You Deploy: Governing Updates in the LLM Supply Chain
Mohd Sameen Chishti, Damilare Peter Oyinloye, Jingyue Li
cs.SE · cs.AI · Apr 30, 2026
Large Language Models (LLMs) are increasingly used as core dependencies in software systems. However, the hosted LLM services evolve continuously through provider-side updates without explicit version…
ScaleBox: Enabling High-Fidelity and Scalable Code Verification for Large Language Models
Jiasheng Zheng, Xin Zheng, Boxi Cao et al.
cs.SE · cs.CL · Apr 30, 2026
Code sandboxes have emerged as a critical infrastructure for advancing the coding capabilities of large language models, providing verifiable feedback for both RL training and evaluation. However…
Leading Across the Spectrum of Human-AI Relationships: A Conceptual Framework for Increasingly Heterogeneous Teams
Alejandro R. Jadad
cs.AI · cs.CL · cs.CY · cs.HC · Apr 30, 2026
What shapes a consequential decision when human and artificial intelligence work on it together? The answer is becoming harder to see. A decision may look human-led after AI has set the frame, or…
Exploration Hacking: Can LLMs Learn to Resist RL Training?
Eyon Jang, Damon Falck, Joschka Braun et al.
cs.LG · cs.CL · Apr 30, 2026
Reinforcement learning (RL) has become essential to the post-training of large language models (LLMs) for reasoning, agentic capabilities and alignment. Successful RL relies on sufficient exploration…
HATS: An Open data set Integrating Human Perception Applied to the Evaluation of Automatic Speech Recognition Metrics
Thibault Bañeras Roux, Jane Wottawa, Mickael Rouvier et al.
cs.CL · Apr 30, 2026
Conventionally, Automatic Speech Recognition (ASR) systems are evaluated on their ability to correctly recognize each word contained in a speech signal. In this context, the word error rate (WER)…
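The WER referenced in both ASR entries in this digest is standard word-level edit distance (substitutions + deletions + insertions) normalized by the number of reference words. A minimal stdlib sketch of the conventional metric, not any paper's implementation:

```python
def word_error_rate(reference, hypothesis):
    """WER = word-level edit distance divided by reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # Dynamic-programming edit distance over word sequences
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[len(ref)][len(hyp)] / len(ref)

# One deleted word out of six reference words -> WER of 1/6
print(word_error_rate("the cat sat on the mat", "the cat sat on mat"))
```

Note that WER can exceed 1.0 when the hypothesis contains many insertions, one of the quirks that motivates the human-perception-based evaluation these papers pursue.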
Instruction-Guided Poetry Generation in Arabic and Its Dialects
Abdelrahman Sadallah, Kareem Elozeiri, Mervat Abassy et al.
cs.CL · cs.AI · Apr 30, 2026
Poetry has long been a central art form for Arabic speakers, serving as a powerful medium of expression and cultural identity. While modern Arabic speakers continue to value poetry, existing research…
Learning When to Remember: Risk-Sensitive Contextual Bandits for Abstention-Aware Memory Retrieval in LLM-Based Coding Agents
Mehmet Iscan
cs.CL · cs.AI · cs.LG · Apr 30, 2026
Large language model (LLM)-based coding agents increasingly rely on external memory to reuse prior debugging experience, repair traces, and repository-local operational knowledge. However, retrieved…