Test-Time Alignment for Large Language Models via Textual Model Predictive Control

Kuang-Da Wang1, Teng-Ruei Chen1, Yu Heng Hung1, Guo-Xun Ko1, Shuoyang Ding2, Yueh-Hua Wu2, Yu-Chiang Frank Wang2,3, Chao-Han Huck Yang2, Wen-Chih Peng1, Ping-Chun Hsieh1
1National Yang Ming Chiao Tung University    2NVIDIA    3National Taiwan University
Concept teaser: curse of horizon vs curse of dimensionality

Token-level preference control induces a long effective horizon (curse of horizon), while full-response rewriting searches an enormous action space (curse of dimensionality). TMPC resolves this trade-off using short-horizon predictive planning with hindsight-discovered subgoals.

Abstract

Preference alignment of large language models (LLMs) is often achieved through finetuning, which can be costly and slow to iterate. We target test-time alignment—improving outputs at inference without updating model weights—by viewing text generation as a sequential decision-making problem. We identify two complementary bottlenecks: token-level guided decoding struggles with a long decision horizon, while response-level rewriting suffers from a high-dimensional action space. Inspired by Model Predictive Control (MPC), we propose Textual Model Predictive Control (TMPC), a predictive planning framework that repeatedly: (i) samples short rollouts, (ii) evaluates them with a reward model, (iii) extracts reusable hindsight subgoals from high-reward segments, and (iv) performs subgoal-conditioned re-generation to improve the next segment. TMPC avoids hard, pre-defined text boundaries by enabling adaptive segmentation during generation. Across discourse-level machine translation, long-form response generation, and program synthesis, TMPC improves both preference reward and downstream task performance.

One sentence: TMPC turns alignment into a short-horizon planning problem and uses hindsight subgoals to keep progress consistent across iterations.

From MPC to Textual MPC

Model Predictive Control (MPC) solves long-horizon decision making by repeatedly optimizing over a short moving horizon. At step t, MPC chooses the next action by approximately maximizing the cumulative reward over a horizon H (with H ≪ T), executes the first action, then re-plans from the new state. TMPC transfers this principle to language generation by treating text as a trajectory and using a preference reward model to score short rollouts.

Model Predictive Control objective with moving horizon H

Optimize locally over a moving horizon H instead of globally optimizing over the full length T
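In standard notation, the moving-horizon objective above can be written as follows (a reconstruction from the surrounding description, not copied from the paper; s denotes the state, a the action, and R the reward):

```latex
a_t^{*} \in \arg\max_{a_{t:t+H-1}} \; \sum_{k=t}^{t+H-1} R(s_k, a_k), \qquad H \ll T .
```

The controller executes only the first action \(a_t^{*}\), observes the new state \(s_{t+1}\), and re-solves the same short-horizon problem from there.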

TMPC Framework

TMPC adapts Model Predictive Control (MPC) to language generation by repeatedly planning over a short horizon. At iteration t, it (1) samples K candidate rollouts for the next segment from the current state (prompt + partial output), (2) evaluates each rollout with a preference reward model R, (3) retrospectively extracts high-reward intermediate segments as subgoals and stores them in a bounded buffer B, and (4) regenerates the next segment by conditioning on buffered subgoals. This short-horizon re-planning reduces long-horizon brittleness while avoiding unstable whole-response rewrites.

TMPC framework overview: rollout, reward, subgoals, re-generation

Framework overview: sample short-horizon rollouts, score with a reward model, distill high-quality intermediate segments into a subgoal buffer B, then regenerate the next segment conditioned on B.

Two Core Components

TMPC relies on two mechanisms, expressed by the update rules below; we present them as equations to preserve the method's precise design and keep the page faithful to the paper.

Hindsight subgoal identification

Hindsight subgoal identification: buffer update equation

Buffer update. After scoring rollouts, TMPC aggregates high-reward segments into the subgoal buffer B. When B reaches capacity, lower-quality subgoals are replaced, keeping only the most useful waypoints.
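The bounded buffer update can be sketched as a fixed-capacity top-k structure over reward-scored segments (an illustrative sketch; the function name and interface are assumptions, not the paper's exact rule):

```python
import heapq

def update_subgoal_buffer(buffer, scored_segments, capacity):
    """Add high-reward segments to the subgoal buffer B, keeping only the
    top-`capacity` entries by reward (hypothetical sketch of the buffer
    update; the paper's exact replacement rule may differ)."""
    for reward, segment in scored_segments:
        if len(buffer) < capacity:
            heapq.heappush(buffer, (reward, segment))     # min-heap keyed on reward
        elif reward > buffer[0][0]:
            heapq.heapreplace(buffer, (reward, segment))  # evict lowest-reward subgoal
    return buffer
```

A min-heap keeps the eviction check O(1): the root is always the weakest buffered subgoal, so a new segment replaces it only if it scores higher.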

Subgoal-conditioned re-generation

Subgoal-conditioned re-generation: conditioning equation

Re-generation. TMPC filters rollouts by a reward threshold and conditions the next segment on buffered subgoals in B, guiding generation toward previously validated high-reward directions while keeping the planning horizon short.
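The filtering-and-conditioning step might look like the following (an illustrative sketch; `generate`, `reward_model`, and the waypoint-prompt format are assumed interfaces, not the paper's):

```python
def regenerate_segment(generate, reward_model, state, buffer, k=4, tau=0.0):
    """Subgoal-conditioned re-generation sketch: condition on buffered
    subgoals, sample k rollouts, filter by reward threshold tau, and
    return the best surviving candidate."""
    # Expose buffered subgoals to the generator as in-context waypoints.
    subgoal_hint = "\n".join(seg for _, seg in sorted(buffer, reverse=True))
    prompt = state + "\n# High-reward waypoints:\n" + subgoal_hint
    # Sample k short rollouts and keep those clearing the reward threshold.
    rollouts = [generate(prompt) for _ in range(k)]
    kept = [r for r in rollouts if reward_model(state + r) >= tau]
    # Fall back to the unfiltered pool if nothing clears the threshold.
    return max(kept or rollouts, key=lambda r: reward_model(state + r))
```

The threshold keeps only previously validated high-reward directions in play, while the fallback guarantees a segment is always emitted.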

Algorithm

The full procedure alternates between rollout sampling, reward evaluation, hindsight subgoal buffering, and subgoal-conditioned re-generation until termination.

Algorithm 1: TMPC
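The full alternation can be sketched end to end as a single loop (an illustrative sketch under assumed interfaces, not the authors' implementation; termination is simplified to a fixed number of segments):

```python
import heapq

def tmpc(generate, reward_model, prompt, k=4, capacity=8, tau=0.0, max_iters=3):
    """Sketch of the TMPC loop: alternate (i) rollout sampling, (ii) reward
    scoring, (iii) hindsight subgoal buffering, and (iv) subgoal-conditioned
    re-generation, committing one segment per iteration."""
    state, buffer = prompt, []  # partial output and bounded subgoal buffer B
    for _ in range(max_iters):
        hint = "".join(seg for _, seg in buffer)
        # (i) sample K short-horizon rollouts, conditioned on buffered subgoals
        rollouts = [generate(state + hint) for _ in range(k)]
        # (ii) score each rollout with the preference reward model R
        scored = [(reward_model(state + r), r) for r in rollouts]
        # (iii) hindsight buffering: keep high-reward segments, bounded by capacity
        for rw, seg in scored:
            if rw >= tau:
                heapq.heappush(buffer, (rw, seg))
                if len(buffer) > capacity:
                    heapq.heappop(buffer)  # evict the lowest-reward subgoal
        # (iv) commit the best rollout as the next segment, then re-plan
        state += max(scored)[1]
    return state
```

Each pass commits only the next segment before re-planning, mirroring MPC's execute-first-action-then-re-solve structure.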

Experiments & Results

Discourse-level Machine Translation (WMT’24 Literary Translation)

TMPC improves translation quality under document-level constraints by re-planning with short rollouts and reusing hindsight subgoals that correspond to meaningful context units.

WMT'24 translation results table

Discourse-level translation: improvements under long-context constraints (e.g., document-level coherence and style).

Long-form Response Generation (HH-RLHF)

We evaluate long-form instruction-following with a learned preference reward model. TMPC consistently increases the reward-model score, indicating improved alignment at inference time without updating model weights.

HH-RLHF: reward model score results

Higher reward-model scores indicate stronger alignment to the learned preference signal for long-form responses.

Deeper Analysis: Evaluation Signal, Hard-to-Segment Codegen, and Iterative Dynamics

Beyond aggregate benchmarks, we examine three complementary views of test-time alignment. (Left) On HH-RLHF, GPT-4 pairwise judgments provide a strong, task-agnostic signal for preference alignment, revealing how often TMPC produces outputs preferred by an external evaluator. (Middle) For code generation—where high-quality intermediate boundaries are often ambiguous—TMPC’s hindsight subgoals provide adaptive anchors that improve success without relying on fixed segmentation. (Right) In iterative translation, TMPC maintains steady gains across iterations compared with conventional iterative refinement, reflecting more stable progress when planning is kept short-horizon.

HH-RLHF: GPT-4 evaluation win rate

HH-RLHF (GPT-4). Pairwise win rates: an external evaluator’s preference between TMPC and baselines, complementing reward-model-based evaluation.

Code generation: pass rate across methods

Code generation. Pass rates across methods: when intermediate “boundaries” (what to fix next) are unclear, hindsight subgoals provide adaptive anchors for planning.

Translation: performance across iterations

Iteration dynamics. Iteration-by-iteration trajectory: TMPC maintains steady improvements across iterations compared with conventional iterative refinement.

BibTeX

@inproceedings{wang2026testtime,
  title     = {Test-Time Alignment for Large Language Models via Textual Model Predictive Control},
  author    = {Kuang-Da Wang and Teng-Ruei Chen and Yu Heng Hung and Guo-Xun Ko and Shuoyang Ding and Yueh-Hua Wu and Yu-Chiang Frank Wang and Chao-Han Huck Yang and Wen-Chih Peng and Ping-Chun Hsieh},
  booktitle = {The Fourteenth International Conference on Learning Representations},
  year      = {2026},
  url       = {https://openreview.net/forum?id=DsS3xRPSs5}
}