PoC: Performance-oriented Context Compression for Large Language Models via Performance Prediction

Mar 20, 2026

Runsong Zhao, Shilei Liu, Jiwei Tang, Langming Liu, Haibin Chen, Weidong Zhang, and 5 more

Abstract

Context compression can mitigate the growing inference costs of Large Language Models (LLMs) by shortening contexts, but existing methods that specify a target compression ratio or length suffer from unpredictable performance degradation, which hinders their reliable deployment. We introduce Performance-oriented Context Compression (PoC), a paradigm in which developers specify an acceptable performance floor instead of a compression ratio. PoC employs a lightweight performance predictor to automatically find the most aggressive compression ratio that satisfies this constraint, and then steers an off-the-shelf compressor with that ratio. We design and compare two predictor variants: a simple context-agnostic predictor and a more sophisticated context-aware one that considers the input's inherent compressibility. On both question-answering and summarization benchmarks, the context-aware predictor consistently achieves lower performance prediction error than the context-agnostic predictor, and the resulting context-aware PoC attains superior overall performance. Our work paves the way for more reliable, efficient, and performance-aware deployment of context compression for LLMs.
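The selection step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the candidate grid, and the toy linear predictor are all hypothetical. The sketch also assumes the ratio is expressed as the fraction of the context that is kept, so a smaller value means more aggressive compression. A context-aware predictor could be modeled by binding the input context into the `predict` callable, while a context-agnostic one would ignore it.

```python
from typing import Callable, Optional, Sequence


def select_keep_ratio(
    predict: Callable[[float], float],
    floor: float,
    candidates: Sequence[float],
) -> Optional[float]:
    """Return the smallest keep ratio whose predicted performance meets the floor.

    `predict` maps a keep ratio (fraction of the context retained, an
    assumption of this sketch) to a predicted task score. Returns None
    when no candidate satisfies the floor.
    """
    feasible = [r for r in candidates if predict(r) >= floor]
    return min(feasible) if feasible else None


def toy_predictor(keep_ratio: float) -> float:
    """Hypothetical stand-in predictor: score rises linearly with kept context."""
    return 0.4 + 0.5 * keep_ratio


# Hypothetical candidate grid of keep ratios.
CANDIDATES = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0]

if __name__ == "__main__":
    # With a floor of 0.72, the toy predictor needs keep_ratio >= 0.64,
    # so the most aggressive feasible candidate on this grid is 0.7.
    print(select_keep_ratio(toy_predictor, 0.72, CANDIDATES))  # -> 0.7
```

The selected ratio would then be passed to whatever off-the-shelf compressor is in use. Returning `None` when no candidate meets the floor leaves the fallback policy (for example, skipping compression) to the caller.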
