A Reward-Free Viewpoint on Multi-Objective Reinforcement Learning

1National Yang Ming Chiao Tung University, Hsinchu, Taiwan
2Massachusetts Institute of Technology
*Equal Contribution

Abstract

Many sequential decision-making tasks involve optimizing multiple conflicting objectives, requiring policies that adapt to different user preferences. In multi-objective reinforcement learning (MORL), one widely studied approach addresses this by training a single policy network conditioned on preference-weighted rewards. In this paper, we explore a novel algorithmic perspective: leveraging reward-free reinforcement learning (RFRL) for MORL.

While RFRL has historically been studied independently of MORL, it learns optimal policies for any possible reward function, making it a natural fit for MORL's challenge of handling unknown user preferences. We propose using the RFRL training objective as an auxiliary task to enhance MORL, enabling more effective knowledge sharing beyond the multi-objective reward function given at training time. To this end, we adapt a state-of-the-art RFRL algorithm to the MORL setting and introduce a preference-guided exploration strategy that focuses learning on the relevant parts of the environment.

Through extensive experiments and ablation studies, we demonstrate that our approach significantly outperforms state-of-the-art MORL methods across diverse MO-Gymnasium tasks, achieving superior performance and data efficiency. This work provides the first systematic adaptation of RFRL to MORL, demonstrating its potential as a scalable and empirically effective approach to multi-objective policy learning.

Method Overview

We reframe MORL as a special case of RFRL by adopting a Forward-Backward (FB) representation. The reward function, revealed a posteriori through a preference vector λ, is the linear scalarization R(λ) = λᵀr(s,a), where r(s,a) is the vector of per-objective rewards.
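For background, the identities below recall the standard FB representation (as in Touati & Ollivier, 2021), specialized to state-dependent rewards for brevity; MORL-FB's exact parameterization may differ, so read this as a sketch rather than the paper's precise formulation:

    M^{\pi_z}(s_0, a_0, \mathrm{d}s') \approx F(s_0, a_0, z)^\top B(s')\,\rho(\mathrm{d}s'),
    \qquad
    \pi_z(s) = \arg\max_a F(s, a, z)^\top z.

    Given a reward revealed a posteriori, its latent is
    z_r = \mathbb{E}_{s \sim \rho}[\, r(s)\, B(s) \,].
    Under the linear scalarization R(\lambda) = \lambda^\top r(s), this gives
    z_\lambda = \mathbb{E}_{s \sim \rho}[\, (\lambda^\top r(s))\, B(s) \,]
              = \sum_i \lambda_i \, z_{r_i},

so the latent is linear in the preference λ; intuitively, this is why a single FB model can serve any preference at test time. With this machinery in place, our method, MORL-FB, introduces two core designs to enhance training and exploration: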

  • Preference-Guided Exploration (PG-Explore): Instead of sampling latent vectors from a standard Gaussian distribution, we sample preference vectors from the unit simplex during training and project them into a projection variable z_λ. This variable guides both exploration and network updates, and requires no additional samples to compute at test time (see the sketch after this list).
  • Auxiliary Q Loss: Since the reward function is known during training, we construct the Q loss directly from the observed scalarized rewards (λᵀr) rather than relying on pseudo-rewards.
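A minimal PyTorch-style sketch of the two designs above; sample_preferences, project_z, auxiliary_q_loss, the network interfaces, and the normalization are illustrative assumptions, not the paper's released implementation.

    import torch

    def sample_preferences(batch_size, n_obj):
        # PG-Explore: Dirichlet(1, ..., 1) is the uniform distribution on the
        # unit simplex, so sampled preferences match valid test-time preferences.
        return torch.distributions.Dirichlet(torch.ones(n_obj)).sample((batch_size,))

    def project_z(lam, backward_net, states, vec_rewards):
        # Projection variable z_lambda ~ E[(lambda^T r) * B(s)], estimated by
        # Monte Carlo over a batch of logged transitions.
        # Illustrative shapes: states (B, S), vec_rewards (B, n_obj), B(s) (B, d).
        scalar_r = (lam * vec_rewards).sum(-1, keepdim=True)   # lambda^T r
        z = (scalar_r * backward_net(states)).mean(dim=0)      # (d,)
        return z / z.norm()                                    # hypothetical normalization

    def auxiliary_q_loss(forward_net, z, s, a, s2, a2, lam, vec_rewards, gamma=0.99):
        # Since r(s, a) is observed during training, regress Q(s, a, z) = F(s, a, z)^T z
        # toward the scalarized Bellman target rather than an FB pseudo-reward target.
        q = (forward_net(s, a, z) * z).sum(-1)
        with torch.no_grad():
            scalar_r = (lam * vec_rewards).sum(-1)
            target = scalar_r + gamma * (forward_net(s2, a2, z) * z).sum(-1)
        return torch.nn.functional.mse_loss(q, target)

Under this sketch, z_λ is linear in λ (see the identities above), so the per-objective latents can be precomputed once and any test-time preference handled without extra samples.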

Experiments & Results

We evaluated MORL-FB across diverse multi-objective MuJoCo domains (HalfCheetah2d, Walker2d, Hopper3d, Ant3d, Humanoid2d, and Humanoid5d, where the numeric suffix denotes the number of objectives) and measured its performance against state-of-the-art baselines (PD-MORL, Q-Pensieve, CAPQL, PG-MORL, etc.) using three metrics: Utility (UT), Hypervolume (HV), and Episodic Dominance (ED).
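For reference, one common way to compute the UT and HV metrics from a set of evaluated return vectors is sketched below; the preference distribution, reference point, and function names are illustrative assumptions, not the paper's exact evaluation protocol.

    import numpy as np

    def utility(returns, prefs):
        # Expected utility: for each sampled preference, take the best
        # scalarized return over the evaluated policies, then average.
        # returns: (n_policies, n_obj); prefs: (n_prefs, n_obj) on the simplex.
        return (prefs @ returns.T).max(axis=1).mean()

    def hypervolume_2d(front, ref):
        # Hypervolume for 2 maximized objectives w.r.t. reference point `ref`:
        # sweep points by descending first objective, accumulating rectangles.
        pts = sorted((p for p in front if p[0] > ref[0] and p[1] > ref[1]),
                     key=lambda p: -p[0])
        hv, y_max = 0.0, ref[1]
        for x, y in pts:
            if y > y_max:
                hv += (x - ref[0]) * (y - y_max)
                y_max = y
        return hv

Episodic Dominance similarly compares per-episode return vectors under Pareto dominance; see the paper for its exact definition.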

Key Highlights:

  • Higher Sample Efficiency: MORL-FB consistently achieves superior performance and sample efficiency across all benchmarked environments.
  • Generalizability: Our method generalizes effectively over the entire preference simplex, even when trained on a heavily reduced subset of preference vectors (e.g., only [1,0,0], [0,1,0], [0,0,1], and [1/3, 1/3, 1/3]).
  • Zero-Shot Transfer: We demonstrated effective zero-shot cross-objective transfer (e.g., from 2-objective Hopper directly to 3- and 4-objective variants), showcasing the robustness of the learned FB representations; a minimal sketch follows this list.
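A sketch of how the FB latent could enable this kind of zero-shot transfer, assuming logged transitions annotated with the new, higher-dimensional reward vector; zero_shot_z, backward_net, and the normalization are illustrative names, not the paper's exact procedure.

    import torch

    def zero_shot_z(backward_net, states, new_vec_rewards, new_lam):
        # Zero-shot cross-objective transfer: no retraining. Estimate
        # z = E[(lambda^T r_new) * B(s)] from logged states, then act greedily
        # with the frozen forward network: pi_z(s) = argmax_a F(s, a, z)^T z.
        with torch.no_grad():
            scalar_r = (new_vec_rewards * new_lam).sum(-1, keepdim=True)
            z = (scalar_r * backward_net(states)).mean(dim=0)
        return z / z.norm()   # hypothetical normalization, as in training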

BibTeX


@inproceedings{chen2026morlfb,
  title={A Reward-Free Viewpoint on Multi-Objective Reinforcement Learning},
  author={Ying-Tu Chen and Wei Hung and Bing-Shu Wu and Zhang-Wei Hong and Ping-Chun Hsieh},
  booktitle={The Fourteenth International Conference on Learning Representations},
  year={2026},
  url={https://openreview.net/forum?id=IwiwmY3Mzz}
}