Block-Level Recursion: Adaptive Test-Time Routing in Large Language Models
MCML Authors
Quentin Bouniot
Dr.
* Former Member
Abstract
Test-time routing improves frozen large language models (LLMs) by taking non-linear paths through their layers, without modifying weights or generating extra tokens. Existing approaches define route spaces that grow exponentially with depth, making them costly to search and hard to learn from. We therefore introduce Block-Level Recursion (BLR), a restricted route family that repeats a single contiguous block of transformer layers once. This reduces the number of routes from exponential to quadratic in the number of layers, making exhaustive per-instance evaluation tractable and the oracle upper bound directly measurable. Despite this restriction, BLR retains most of the routing potential. Across six model families and ten reasoning benchmarks, the optimal block varies across models, tasks, and individual inputs, with per-instance oracle gains of +59.1% on average and up to +75.8% on individual tasks. BLR also supports two practical policies: a single train-selected block (sBLR) that requires no router or per-input overhead, and a learned global router (aBLR) trained from dense per-instance rewards over all routes. sBLR already recovers a substantial fraction of the available gains, while aBLR improves further by selecting routes per input. With a frozen Qwen2.5-0.5B backbone, aBLR achieves higher accuracy than the unrouted Qwen2.5-7B model at lower FLOPs.
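The route family described in the abstract can be illustrated with a small sketch. It is not the authors' implementation. It assumes 0-based layer indices, a block given as an inclusive (start, end) pair, and a hypothetical per-instance reward table with one column per route. The function names (`enumerate_blocks`, `blr_route`, `select_single_block`, `oracle_accuracy`) are made up for illustration. Whether the paper also counts the unrouted forward pass as a route is not stated here, so the sketch leaves it out.

```python
from typing import List, Tuple

Block = Tuple[int, int]  # (start, end), inclusive layer indices (assumed convention)


def enumerate_blocks(num_layers: int) -> List[Block]:
    """All contiguous blocks with 0 <= start <= end < num_layers.

    There are num_layers * (num_layers + 1) / 2 of them, i.e. quadratic in depth.
    """
    return [(s, e) for s in range(num_layers) for e in range(s, num_layers)]


def blr_route(num_layers: int, block: Block) -> List[int]:
    """Layer execution order when `block` is run one extra time."""
    start, end = block
    if not 0 <= start <= end < num_layers:
        raise ValueError(f"invalid block {block} for {num_layers} layers")
    prefix = list(range(0, end + 1))           # layers 0..end, as in a normal pass
    repeat = list(range(start, end + 1))       # the chosen block, a second time
    suffix = list(range(end + 1, num_layers))  # remaining layers
    return prefix + repeat + suffix


def select_single_block(rewards: List[List[float]]) -> int:
    """sBLR-style choice: the route index with the highest total train reward.

    rewards[n][r] is the (hypothetical) reward of route r on training instance n.
    Ties resolve to the lowest route index.
    """
    num_routes = len(rewards[0])
    totals = [sum(row[r] for row in rewards) for r in range(num_routes)]
    return max(range(num_routes), key=totals.__getitem__)


def oracle_accuracy(rewards: List[List[float]]) -> float:
    """Per-instance oracle: the best route is picked separately for each input."""
    return sum(max(row) for row in rewards) / len(rewards)
```

For a 4-layer model, `enumerate_blocks(4)` yields 10 blocks, and `blr_route(4, (1, 2))` gives the order `[0, 1, 2, 1, 2, 3]`. Since the number of blocks is only quadratic in depth, a dense reward table over all routes stays small enough to fill exhaustively, which is what makes both the single-block selection and the oracle bound cheap to compute in this sketch.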
inproceedings SKS+26
AdaptFM @ICML 2026
Workshop on Resource-Adaptive Foundation Model Inference at the 43rd International Conference on Machine Learning. Seoul, South Korea, Jul 06-11, 2026. To be published. Preprint available.
Authors
K. Sakalyan • S. Kim • L. Schwinn • Q. Bouniot • Z. Akata
Links
URL
Research Area
BibTeXKey: SKS+26