A Reproducibility Study of LLM-Based Query Reformulation
Amin Bigdeli, Radin Hamidi Rad, Hai Son Le et al.
cs.IR · cs.CL · Apr 30, 2026
Large Language Models (LLMs) are now widely used for query reformulation and expansion in Information Retrieval, with many studies reporting substantial effectiveness gains. However, these results are…
TwinGate: Stateful Defense against Decompositional Jailbreaks in Untraceable Traffic via Asymmetric Contrastive Learning
Bowen Sun, Chaozhuo Li, Yaodong Yang et al.
cs.CR · cs.CL · cs.LG · Apr 30, 2026
Decompositional jailbreaks pose a critical threat to large language models (LLMs) by allowing adversaries to fragment a malicious objective into a sequence of individually benign queries that…
Qualitative Evaluation of Language Model Rescoring in Automatic Speech Recognition
Thibault Bañeras-Roux, Mickaël Rouvier, Jane Wottawa et al.
cs.CL · Apr 30, 2026
Evaluating automatic speech recognition (ASR) systems is a classical but difficult and still open problem, which often boils down to focusing only on the word error rate (WER). However, this metric…
PRISM: Pre-alignment via Black-box On-policy Distillation for Multimodal Reinforcement Learning
Sudong Wang, Weiquan Huang, Xiaomin Yu et al.
cs.CV · cs.AI · cs.CL · Apr 30, 2026
The standard post-training recipe for large multimodal models (LMMs) applies supervised fine-tuning (SFT) on curated demonstrations followed by reinforcement learning with verifiable rewards (RLVR)…
A Pattern Language for Resilient Visual Agents
Habtom Kahsay Gidey, Alexander Lenz, Alois Knoll
cs.AI · cs.SE · Apr 30, 2026
Integrating multimodal foundation models into enterprise ecosystems presents a fundamental software architecture challenge. Architects must balance competing quality attributes: the high latency and…
Heterogeneous Scientific Foundation Model Collaboration
Zihao Li, Jiaru Zou, Feihao Fang et al.
cs.AI · cs.CL · cs.LG · Apr 30, 2026
Agentic large language model systems have demonstrated strong capabilities. However, their reliance on language as the universal interface fundamentally limits their applicability to many real-world…
Claw-Eval-Live: A Live Agent Benchmark for Evolving Real-World Workflows
Chenxin Li, Zhengyang Tang, Huangxin Lin et al.
cs.SE · cs.AI · Apr 30, 2026
LLM agents are expected to complete end-to-end units of work across software tools, business services, and local workspaces. Yet many agent benchmarks freeze a curated task set at release time and…
Synthetic Computers at Scale for Long-Horizon Productivity Simulation
Tao Ge, Baolin Peng, Hao Cheng et al.
cs.AI · cs.CL · cs.LG · Apr 30, 2026
Realistic long-horizon productivity work is strongly conditioned on user-specific computer environments, where much of the work context is stored and organized through directory structures and content…
From Mirage to Grounding: Towards Reliable Multimodal Circuit-to-Verilog Code Generation
Guang Yang, Xing Hu, Xiang Chen et al.
cs.SE · cs.AI · Apr 30, 2026
Multimodal large language models (MLLMs) are increasingly used to translate visual artifacts into code, from UI mockups into HTML and scientific plots into Python scripts. A circuit diagram can be…
Can AI Be a Good Peer Reviewer? A Survey of Peer Review Process, Evaluation, and the Future
Sihong Wu, Owen Jiang, Yilun Zhao et al.
cs.CL · cs.AI · Apr 30, 2026
Peer review is a multi-stage process involving reviews, rebuttals, meta-reviews, final decisions, and subsequent manuscript revisions. Recent advances in large language models (LLMs) have motivated…
From Unstructured Recall to Schema-Grounded Memory: Reliable AI Memory via Iterative, Schema-Aware Extraction
Alex Petrov, Alexander Gusak, Denis Mukha et al.
cs.AI · cs.CL · Apr 30, 2026
Persistent AI memory is often reduced to a retrieval problem: store prior interactions as text, embed them, and ask the model to recover relevant context later. This design is useful for thematic…
Measuring research data reuse in scholarly publications using generative artificial intelligence: Open Science Indicator development and preliminary results
Lauren Cadwallader, Iain Hrynaszkiewicz, Parth Sarin et al.
cs.DL · cs.CL · Apr 30, 2026
Numerous metascience studies and other initiatives have begun to monitor the prevalence of open science practices when it is more important to understand the 'downstream' effects or impacts of open…
Measurement Risk in Supervised Financial NLP: Rubric and Metric Sensitivity on JF-ICRS
Sidi Chang, Peiying Zhu, Yuxiao Chen et al.
cs.AI · cs.CL · Apr 30, 2026
As LLMs become credible readers of earnings calls, investor-relations Q&A, guidance, and disclosure language, supervised financial NLP benchmarks increasingly function as decision evidence for model…
DPN-LE: Dual Personality Neuron Localization and Editing for Large Language Models
Lifan Zheng, Xue Yang, Jiawei Chen et al.
cs.CL · Apr 30, 2026
With the widespread adoption of large language models (LLMs), understanding their personality representation mechanisms has become critical. As a novel paradigm in Personality Editing, most existing…
Sentiment Analysis of AI Adoption in Indonesian Higher Education Using Machine Learning and Transformer-Based Models
Happy Syahrul Ramadhan, Ahmad Sahidin Akbar, Karin Yehezkiel Sinaga et al.
cs.CL · Apr 30, 2026
This study analyzes Indonesian student opinions on the adoption of artificial intelligence in higher education using two approaches: TF-IDF-based machine learning and Transformer-based deep learning…
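As background for the first approach named above: TF-IDF weights each term by its within-document frequency scaled by the log inverse document frequency, so terms common to every document are down-weighted. A minimal pure-Python sketch with toy documents (illustrative only, not the study's pipeline; real systems typically use a library such as scikit-learn's TfidfVectorizer, which adds smoothing and normalization):

```python
import math
from collections import Counter

def tfidf(docs):
    """TF-IDF for a list of tokenized documents.
    tf = term count / doc length; idf = log(N / df)."""
    n = len(docs)
    df = Counter()                      # document frequency per term
    for doc in docs:
        df.update(set(doc))
    weights = []
    for doc in docs:
        tf = Counter(doc)
        total = len(doc)
        weights.append({t: (c / total) * math.log(n / df[t])
                        for t, c in tf.items()})
    return weights

# Toy corpus of tokenized student comments (hypothetical)
docs = [["ai", "helps", "learning"],
        ["ai", "raises", "concerns"],
        ["learning", "improves"]]
w = tfidf(docs)
```

Here "ai" appears in two of three documents, so its idf is log(3/2); a term appearing in every document would receive weight zero.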
Mapping the Methodological Space of Classroom Interaction Research: Scale, Duration, and Modality in an Age of AI
Dorottya Demszky, Edith Bouton, Alison Twiner et al.
cs.AI · cs.CL · cs.CY · Apr 30, 2026
Research on classroom interaction has long been divided between large-scale observation and in-depth ethnographic work. We propose a framework mapping this methodological space along three dimensions…
Beyond the Mean: Within-Model Reliable Change Detection for LLM Evaluation
Jon-Paul Cacioli
cs.CL · cs.AI · Apr 30, 2026
We adapted the Reliable Change Index (RCI; Jacobson and Truax, 1991) from clinical psychology to item-level LLM version comparison on 2,000 MMLU-Pro items (K=10 samples at T=0.7). Two within-family…
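The Jacobson-Truax RCI cited above has a simple closed form: the score difference divided by the standard error of the difference score, where SE = s1·sqrt(1 − r) and r is the measure's reliability. A minimal sketch (the score, spread, and reliability values are illustrative, not taken from the paper):

```python
import math

def reliable_change_index(x1, x2, s1, r):
    """Jacobson-Truax Reliable Change Index.

    x1, x2 : scores at time 1 and time 2 (e.g., per-item accuracy
             across K samples for two model versions)
    s1     : standard deviation of time-1 scores
    r      : reliability of the measure (e.g., test-retest)
    """
    se = s1 * math.sqrt(1 - r)       # standard error of measurement
    s_diff = math.sqrt(2 * se ** 2)  # SE of the difference score
    return (x2 - x1) / s_diff

# Conventionally, |RCI| > 1.96 is read as reliable change at p < .05
rci = reliable_change_index(x1=0.60, x2=0.80, s1=0.15, r=0.8)
print(abs(rci) > 1.96)
```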
Stable Behavior, Limited Variation: Persona Validity in LLM Agents for Urban Sentiment Perception
Neemias B da Silva, Rodrigo Minetto, Daniel Silver et al.
cs.CL · cs.SI · Apr 30, 2026
Large Language Models (LLMs) are increasingly used as proxies for human perception in urban analysis, yet it remains unclear whether persona prompting produces meaningful and reproducible behavioral…
MiniCPM-o 4.5: Towards Real-Time Full-Duplex Omni-Modal Interaction
Junbo Cui, Bokai Xu, Chongyi Wang et al.
cs.CL · Apr 30, 2026
Recent progress in multimodal large language models (MLLMs) has brought AI capabilities from static offline data processing to real-time streaming interaction, yet they remain far from…
On the Proper Treatment of Units in Surprisal Theory
Samuel Kiegeland, Vésteinn Snæbjarnarson, Tim Vieira et al.
cs.CL · Apr 30, 2026
Surprisal theory links human processing effort to the predictability of an upcoming linguistic unit, but empirical work often leaves the notion of a unit underspecified. In practice, experimental…
To Diff or Not to Diff? Structure-Aware and Adaptive Output Formats for Efficient LLM-based Code Editing
Wei Cheng, Yongchang Cao, Chen Shen et al.
cs.SE · cs.CL · Apr 30, 2026
Large Language Models (LLMs) are increasingly used for code editing, yet the prevalent full-code generation paradigm suffers from severe efficiency bottlenecks, posing challenges for interactive…
Universal statistical laws governing culinary design
Ganesh Bagler, Gopal Krishna Tewari, Aditya Raj Yadav et al.
physics.soc-ph · cs.CL · Apr 30, 2026
Cooking is a cultural expression of human creativity that transcends geography and time through the orchestration of ingredients and techniques, much like languages do through words and syntax. Yet…
LLMs Capture Emotion Labels, Not Emotion Uncertainty: Distributional Analysis and Calibration of Human-LLM Judgment Gaps
Keito Inoshita, Xiaokang Zhou, Akira Kawai et al.
cs.CL · Apr 30, 2026
Human annotators frequently disagree on emotion labels, yet most evaluations of Large Language Model (LLM) emotion annotation collapse these judgments into a single gold standard, discarding the…
Test Before You Deploy: Governing Updates in the LLM Supply Chain
Mohd Sameen Chishti, Damilare Peter Oyinloye, Jingyue Li
cs.SE · cs.AI · Apr 30, 2026
Large Language Models (LLMs) are increasingly used as core dependencies in software systems. However, the hosted LLM services evolve continuously through provider-side updates without explicit version…
ScaleBox: Enabling High-Fidelity and Scalable Code Verification for Large Language Models
Jiasheng Zheng, Xin Zheng, Boxi Cao et al.
cs.SE · cs.CL · Apr 30, 2026
Code sandboxes have emerged as a critical infrastructure for advancing the coding capabilities of large language models, providing verifiable feedback for both RL training and evaluation. However…
Leading Across the Spectrum of Human-AI Relationships: A Conceptual Framework for Increasingly Heterogeneous Teams
Alejandro R. Jadad
cs.AI · cs.CL · cs.CY · cs.HC · Apr 30, 2026
What shapes a consequential decision when human and artificial intelligence work on it together? The answer is becoming harder to see. A decision may look human-led after AI has set the frame, or…
Exploration Hacking: Can LLMs Learn to Resist RL Training?
Eyon Jang, Damon Falck, Joschka Braun et al.
cs.LG · cs.CL · Apr 30, 2026
Reinforcement learning (RL) has become essential to the post-training of large language models (LLMs) for reasoning, agentic capabilities and alignment. Successful RL relies on sufficient exploration…
HATS: An Open data set Integrating Human Perception Applied to the Evaluation of Automatic Speech Recognition Metrics
Thibault Bañeras Roux, Jane Wottawa, Mickael Rouvier et al.
cs.CL · Apr 30, 2026
Conventionally, Automatic Speech Recognition (ASR) systems are evaluated on their ability to correctly recognize each word contained in a speech signal. In this context, the word error rate (WER)…
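The WER referenced in both ASR entries in this digest is standard word-level edit distance (substitutions + deletions + insertions) normalized by the number of reference words. A minimal stdlib sketch of the conventional metric, not any paper's implementation:

```python
def word_error_rate(reference, hypothesis):
    """WER = word-level edit distance divided by reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # Dynamic-programming edit distance over word sequences
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[len(ref)][len(hyp)] / len(ref)

# One deleted word out of six reference words -> WER of 1/6
print(word_error_rate("the cat sat on the mat", "the cat sat on mat"))
```

Note that WER can exceed 1.0 when the hypothesis contains many insertions, one of the quirks that motivates the human-perception-based evaluation these papers pursue.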
Instruction-Guided Poetry Generation in Arabic and Its Dialects
Abdelrahman Sadallah, Kareem Elozeiri, Mervat Abassy et al.
cs.CL · cs.AI · Apr 30, 2026
Poetry has long been a central art form for Arabic speakers, serving as a powerful medium of expression and cultural identity. While modern Arabic speakers continue to value poetry, existing research…
Learning When to Remember: Risk-Sensitive Contextual Bandits for Abstention-Aware Memory Retrieval in LLM-Based Coding Agents
Mehmet Iscan
cs.CL · cs.AI · cs.LG · Apr 30, 2026
Large language model (LLM)-based coding agents increasingly rely on external memory to reuse prior debugging experience, repair traces, and repository-local operational knowledge. However, retrieved…