Cross-domain reinforcement learning (CDRL) aims to improve the data efficiency of reinforcement learning by leveraging data samples collected from a source domain to facilitate learning in a similar target domain. Despite its potential, cross-domain transfer in RL faces two fundamental and intertwined challenges. First, the source and target domains may have distinct state spaces or action spaces, which makes direct transfer infeasible and requires more sophisticated inter-domain mappings. Second, the transferability of a source-domain model in RL is not easily identifiable a priori, making CDRL prone to negative transfer effects.
In this paper, we jointly tackle these two challenges through the lens of cross-domain Bellman consistency and hybrid critics. Specifically, we first introduce the notion of cross-domain Bellman consistency as a way to measure the transferability of a source-domain model. We then propose $Q$Avatar, which combines Q-functions from both the source and target domains using an adaptive, hyperparameter-free weighting function.
Through this design, we characterize the convergence behavior of $Q$Avatar and show that it achieves reliable transfer by effectively leveraging a source-domain Q-function for knowledge transfer to the target domain. Experiments demonstrate that $Q$Avatar achieves favorable transferability across various reinforcement learning benchmark tasks, including locomotion and robot arm manipulation.
Cross-domain reinforcement learning (CDRL) aims to improve the sample efficiency of reinforcement learning by transferring useful knowledge from a source domain to a related target domain. The source domain is often easier or cheaper to collect data from, while the target domain is the actual environment where efficient learning is desired.
Figure: source domain (data-rich, low-cost) and target domain (data-scarce, high-cost).
$Q$Avatar is designed to enable reliable knowledge transfer from a source domain to a target domain. It combines source-domain and target-domain Q-functions through an adaptive, hyperparameter-free weighting mechanism, allowing the agent to exploit useful source knowledge while reducing the risk of negative transfer.
For inter-domain mapping, most prior works enforce transition consistency between the source and target domains. However, this form of dynamics consistency can fail to identify the optimal mapping. Please refer to Appendix D.1 for a detailed MiniGrid example illustrating this failure mode.
To address this issue, $Q$Avatar learns the inter-domain mappings via cross-domain Bellman consistency. By requiring the mapped source-domain Q-function to satisfy a Bellman-like equation on target-domain transitions and rewards, the learned correspondence becomes task-aware and can distinguish mappings that dynamics consistency cannot.
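As a minimal sketch of what such a Bellman-consistency check could look like, the tabular example below scores a candidate pair of mappings by the squared Bellman residual of the mapped source Q-function on target-domain transitions. Everything here is an illustrative assumption rather than the paper's formulation: the names `cross_domain_bellman_error`, `phi`, and `psi` are hypothetical, the mappings are plain index arrays, and the backup uses the Bellman optimality (max) form as one possible choice.

```python
import numpy as np

def cross_domain_bellman_error(q_src, phi, psi, transitions, gamma=0.99):
    """Mean squared Bellman residual of the mapped source Q on target data.

    q_src:       array [S_src, A_src], fixed source-domain Q-table.
    phi:         int array [S_tar], hypothetical state map (target -> source).
    psi:         int array [A_tar], hypothetical action map (target -> source).
    transitions: iterable of (s, a, r, s_next, done) from the target domain.
    """
    # Express the source Q-table over target states and actions.
    q_mapped = q_src[phi][:, psi]  # shape [S_tar, A_tar]
    residuals = []
    for s, a, r, s_next, done in transitions:
        # Assumption: Bellman optimality backup (max over target actions).
        bootstrap = 0.0 if done else gamma * q_mapped[s_next].max()
        residuals.append(q_mapped[s, a] - (r + bootstrap))
    return float(np.mean(np.square(residuals)))

# Toy data (made up for illustration): 2 states and 2 actions in each domain.
q_src = np.array([[1.0, 0.0], [0.0, 2.0]])
target_transitions = [
    (0, 0, 1.0, 0, True),
    (1, 1, 2.0, 1, True),
    (0, 1, -1.0, 1, False),
]
identity = np.array([0, 1])
swapped = np.array([1, 0])

err_identity = cross_domain_bellman_error(
    q_src, identity, identity, target_transitions, gamma=0.5)
err_swapped = cross_domain_bellman_error(
    q_src, identity, swapped, target_transitions, gamma=0.5)
```

Because the residual depends on target-domain rewards as well as transitions, the two candidate action maps in this toy example receive different scores, which is the sense in which the criterion is task-aware.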
We adopt an NPG-style policy update with hybrid critics:
Specifically, the target critic $Q^{(t)}_{\text{tar}}$ is trained by minimizing the TD loss, whereas the source critic $Q_{\text{src}}$ is pre-trained in the source domain and remains fixed throughout target-domain learning:
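A minimal tabular sketch of how an update of this kind might be organized is given below, under several assumptions. The hybrid critic is taken to be a convex combination of the target critic and the mapped source critic with a weight `alpha` that is simply passed in (the adaptive weighting rule is not reproduced here), the policy step uses the standard closed form of NPG for tabular softmax policies, and the TD step is an expected-SARSA-style update. The function names and the index-array mappings `phi` and `psi` are hypothetical.

```python
import numpy as np

def hybrid_q(q_tar, q_src, phi, psi, alpha):
    """Convex combination of the target critic and the mapped source critic."""
    q_mapped = q_src[phi][:, psi]  # source Q-table indexed by target (s, a)
    return (1.0 - alpha) * q_tar + alpha * q_mapped

def npg_softmax_step(policy, q_values, eta):
    """Tabular softmax NPG step: pi_new(a|s) proportional to pi(a|s) * exp(eta * Q(s, a)).

    Assumes every entry of `policy` is strictly positive.
    """
    logits = np.log(policy) + eta * q_values
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    new_policy = np.exp(logits)
    return new_policy / new_policy.sum(axis=1, keepdims=True)

def td_update(q_tar, policy, transition, gamma, lr):
    """One TD step on the target critic only; the source critic is never touched."""
    s, a, r, s_next, done = transition
    bootstrap = 0.0 if done else gamma * (policy[s_next] @ q_tar[s_next])
    q_new = q_tar.copy()
    q_new[s, a] += lr * (r + bootstrap - q_tar[s, a])
    return q_new

# Toy setup (made up for illustration): 2 states, 2 actions, identity mappings.
q_src = np.array([[1.0, 0.0], [0.0, 2.0]])  # fixed, pre-trained source critic
q_tar = np.zeros((2, 2))                    # target critic, learned online
phi = psi = np.array([0, 1])
policy = np.full((2, 2), 0.5)

q_tar = td_update(q_tar, policy, (0, 0, 1.0, 1, True), gamma=0.9, lr=0.5)
q_mix = hybrid_q(q_tar, q_src, phi, psi, alpha=0.5)
policy = npg_softmax_step(policy, q_mix, eta=1.0)
```

Keeping the source critic as a read-only input means that all target-domain learning flows through `q_tar` and the policy, so a poorly matched source critic can be down-weighted through `alpha` without retraining it.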
Proposition (Average Sub-Optimality Gap): Under $Q$Avatar, assuming an exploratory initial distribution, the average sub-optimality over $T$ iterations is bounded as follows:
Source and target domains are evaluated across locomotion, robot manipulation, and navigation tasks.
| Environment | HalfCheetah | Ant | Door Opening | Table Wiping | Navigation |
|---|---|---|---|---|---|
| Source Domain | | | | | |
| Target Domain | | | | | |
Figure panels: (a) HalfCheetah, (b) Ant, (c) Door Opening, (d) Table Wiping, (e) Navigation.
@inproceedings{
chen2026cross,
title={Cross-domain policy optimization via {B}ellman consistency and hybrid critics},
author={Chen, Ming-Hong and Pan, Kuan-Chen and Huang, You-De and Liu, Xi and Hsieh, Ping-Chun},
booktitle={The Fourteenth International Conference on Learning Representations},
year={2026},
url={https://openreview.net/forum?id=kTXRFtWHnM}
}