Cross-Domain Policy Optimization via Bellman Consistency and Hybrid Critics

Ming-Hong Chen1*   Kuan-Chen Pan1*   You-De Huang1*   Xi Liu2   Ping-Chun Hsieh1
1 National Yang Ming Chiao Tung University, Hsinchu, Taiwan
2 Applied Machine Learning, Meta AI, Menlo Park, CA, USA
* These authors contributed equally to this work.

Abstract

Cross-domain reinforcement learning (CDRL) aims to improve the data efficiency of reinforcement learning by leveraging data samples collected from a source domain to facilitate learning in a similar target domain. Despite its potential, cross-domain transfer in RL faces two fundamental and intertwined challenges. First, the source and target domains may have distinct state spaces or action spaces, which makes direct transfer infeasible and requires more sophisticated inter-domain mappings. Second, the transferability of a source-domain model in RL is not easily identifiable a priori, making CDRL prone to negative transfer effects.

In this paper, we jointly tackle these two challenges through the lens of cross-domain Bellman consistency and hybrid critics. Specifically, we first introduce the notion of cross-domain Bellman consistency as a way to measure the transferability of a source-domain model. We then propose $Q$Avatar, which combines Q-functions from both the source and target domains using an adaptive, hyperparameter-free weighting function.

Through this design, we characterize the convergence behavior of $Q$Avatar and show that it achieves reliable transfer by effectively leveraging a source-domain Q-function for knowledge transfer to the target domain. Experiments demonstrate that $Q$Avatar achieves favorable transferability across various reinforcement learning benchmark tasks, including locomotion and robot arm manipulation.

Cross-Domain Reinforcement Learning

Cross-domain reinforcement learning (CDRL) aims to improve the sample efficiency of reinforcement learning by transferring useful knowledge from a source domain to a related target domain. The source domain is often easier or cheaper to collect data from, while the target domain is the actual environment where efficient learning is desired.

[Figure: CDRL overview. The source domain (data-rich, low-cost) and the target domain (data-scarce, high-cost) differ in state/action space, reward function, and transition probability. State and action mapping functions $\phi$ and $\psi$ are learned to utilize source-domain knowledge.]

$Q$Avatar Framework

$Q$Avatar is designed to enable reliable knowledge transfer from a source domain to a target domain. It combines source-domain and target-domain Q-functions through an adaptive, hyperparameter-free weighting mechanism, allowing the agent to exploit useful source knowledge while reducing the risk of negative transfer.

Inter-Domain Mapping

For inter-domain mapping, most prior works enforce transition consistency between the source and target domains. However, this form of dynamics consistency ignores rewards and can therefore fail to identify the optimal mapping. Please refer to Appendix D.1 of the paper for a detailed MiniGrid example illustrating this failure.

To address this issue, $Q$Avatar learns the inter-domain mappings via cross-domain Bellman consistency. By requiring the mapped source-domain Q-function to satisfy a Bellman-like equation on target-domain transitions and rewards, the learned correspondence becomes task-aware and can distinguish mappings that dynamics consistency cannot.

Dynamics Consistency
Matches transition dynamics.
Can fail to find optimal mappings.
Objective
\[ \mathcal{L}_{\mathrm{dyn}}(\phi,\psi;\ T_{\mathrm{src}}, \mathcal{D}_{\mathrm{tar}}) := \mathbb{E}_{(s,a,s')\sim \mathcal{D}_{\mathrm{tar}}} \left[\ \delta_{\mathrm{dyn}} \ \right], \]
\[ \delta_{\mathrm{dyn}} := \left\| \phi(s') - T_{\mathrm{src}}\bigl(\phi(s),\psi(s,a)\bigr) \right\|_1. \]

Bellman Consistency (Ours)
Matches Bellman optimality.
Discovers reward-aligned mappings.
Objective
\[ \mathcal{L}_{\mathrm{bell}}(\phi,\psi;\ Q_{\mathrm{src}}, \pi_{\mathrm{tar}}, \mathcal{D}_{\mathrm{tar}}) := \mathbb{E}_{(s,a,r_{\mathrm{tar}},s')\sim \mathcal{D}_{\mathrm{tar}}} \left[\ \delta_{\mathrm{bell}} \ \right], \]
\[ \delta_{\mathrm{bell}} := \left| r_{\mathrm{tar}} + \gamma \mathbb{E}_{a'\sim \pi_{\mathrm{tar}}} \left[ Q_{\mathrm{src}}\bigl(\phi(s'),\psi(a')\bigr) \right] - Q_{\mathrm{src}}\bigl(\phi(s),\psi(a)\bigr) \right|^2. \]
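The Bellman-consistency objective above can be illustrated with a minimal tabular sketch. This is not the authors' implementation: the function name `bellman_consistency_loss`, the use of lookup tables for $\phi$, $\psi$, and $Q_{\mathrm{src}}$, and the two-state toy MDP are all assumptions made for illustration. In the toy case the source and target MDPs are identical, so the identity mapping should give zero loss while a swapped action mapping should not.

```python
import numpy as np

def bellman_consistency_loss(q_src, phi, psi, pi_tar, batch, gamma):
    """Mean squared cross-domain Bellman residual on target transitions.

    q_src:  [S_src, A_src] frozen source-domain Q-table (hypothetical layout)
    phi:    [S_tar] int array, state mapping target -> source
    psi:    [A_tar] int array, action mapping target -> source
    pi_tar: [S_tar, A_tar] target-policy probabilities
    batch:  iterable of (s, a, r_tar, s_next) target-domain transitions
    """
    residuals = []
    for s, a, r, s_next in batch:
        # E_{a' ~ pi_tar(.|s')} [ Q_src(phi(s'), psi(a')) ]
        next_q = np.dot(pi_tar[s_next], q_src[phi[s_next], psi])
        delta = r + gamma * next_q - q_src[phi[s], psi[a]]
        residuals.append(delta ** 2)
    return float(np.mean(residuals))

# Toy MDP (assumed): action a moves to state a, reward is 1 for a == 1.
# With gamma = 0.5 and a uniform policy, Q(s, 0) = 0.5 and Q(s, 1) = 1.5.
q_src = np.array([[0.5, 1.5], [0.5, 1.5]])
pi_tar = np.full((2, 2), 0.5)
batch = [(0, 0, 0.0, 0), (0, 1, 1.0, 1), (1, 0, 0.0, 0), (1, 1, 1.0, 1)]
identity = np.array([0, 1])
swapped = np.array([1, 0])

loss_identity = bellman_consistency_loss(q_src, identity, identity, pi_tar, batch, 0.5)
loss_swapped = bellman_consistency_loss(q_src, identity, swapped, pi_tar, batch, 0.5)
```

Here `loss_identity` evaluates to 0 and `loss_swapped` to 1, so the objective separates the reward-aligned action mapping from the misaligned one in this toy case. In a continuous setting the mappings would be parameterized networks trained by gradient descent on the same residual.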

Hybrid Critics

We adopt an NPG-style policy update with hybrid critics:

$ \pi^{(t+1)}(a \mid s) \propto \pi^{(t)}(a \mid s) \cdot \exp \left( \eta \left( (1-\alpha(t))Q_{\mathrm{tar}}^{(t)}(s,a) + \alpha(t)Q_{\mathrm{src}} \left( \phi^{(t)}(s), \psi^{(t)}(a) \right) \right) \right). $

Specifically, the target critic $Q^{(t)}_{\mathrm{tar}}$ is trained by minimizing the TD loss, whereas the source critic $Q_{\mathrm{src}}$ is pre-trained in the source domain and remains fixed throughout target-domain learning.
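The NPG-style update with hybrid critics can be sketched for a small tabular policy as follows. This is an illustrative sketch under stated assumptions, not the authors' code: the name `hybrid_npg_update` is hypothetical, the policy is assumed strictly positive so that its logarithm is finite, and `q_src_mapped[s, a]` is assumed to already hold $Q_{\mathrm{src}}(\phi^{(t)}(s), \psi^{(t)}(a))$.

```python
import numpy as np

def hybrid_npg_update(pi, q_tar, q_src_mapped, alpha, eta):
    """One multiplicative-weights step with a mixed critic.

    pi:           [S, A] current policy, rows sum to 1, entries > 0
    q_tar:        [S, A] target critic values
    q_src_mapped: [S, A] source critic evaluated at mapped (state, action)
    alpha:        weight on the source critic, in [0, 1]
    eta:          step size
    """
    q_mix = (1.0 - alpha) * q_tar + alpha * q_src_mapped
    logits = np.log(pi) + eta * q_mix
    # Subtracting the row maximum leaves the softmax unchanged
    # and avoids overflow in exp.
    logits = logits - logits.max(axis=1, keepdims=True)
    new_pi = np.exp(logits)
    return new_pi / new_pi.sum(axis=1, keepdims=True)

# Example: one state, two actions, uniform initial policy.
pi0 = np.full((1, 2), 0.5)
q_tar = np.zeros((1, 2))
q_src_mapped = np.array([[0.0, np.log(3.0)]])
pi_src_only = hybrid_npg_update(pi0, q_tar, q_src_mapped, alpha=1.0, eta=1.0)
pi_tar_only = hybrid_npg_update(pi0, q_tar, q_src_mapped, alpha=0.0, eta=1.0)
```

With `alpha=1.0` the update follows only the mapped source critic and shifts the policy to probabilities 0.25 and 0.75; with `alpha=0.0` the all-zero target critic leaves the uniform policy unchanged.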

Adaptive Weight Mechanism

Proposition (Average Sub-Optimality Gap): Under $Q$Avatar, assuming an exploratory initial distribution, the average sub-optimality over $T$ iterations is bounded as follows:

$$ \begin{aligned}
&\frac{1}{T}\sum_{t=1}^T \mathbb{E}_{s \sim \mu_{\mathrm{tar}}} \Big[V^{\pi^{*}}(s) - V^{\pi^{(t)}}(s)\Big] \\[2mm]
&\leq \underbrace{ \frac{\log |\mathcal{A}_{\mathrm{tar}}| + 1}{\sqrt{T}(1-\gamma)} }_{(a)} + \underbrace{ \frac{C_0}{T}\sum_{t=1}^T \mathbb{E}_{(s,a) \sim d^{\pi^{(t)}}} \left[ \left| (1-\alpha(t))Q^{(t)}_{\mathrm{tar}}(s,a) + \alpha(t)Q_{\mathrm{src}}(\phi^{(t)}(s),\psi^{(t)}(a)) - Q^{\pi^{(t)}}(s,a) \right| \right] }_{(b)} \\[2mm]
&\leq \underbrace{ \frac{\log |\mathcal{A}_{\mathrm{tar}}| + 1}{\sqrt{T}(1-\gamma)} }_{(a)} + \underbrace{ \frac{C_1}{T}\sum_{t=1}^T \Big( \alpha(t) \lVert \epsilon_{\mathrm{cd}} (Q_{\mathrm{src}}, \phi^{(t)}, \psi^{(t)}) \rVert_{d^{\pi^{(t)}}} + (1-\alpha(t)) \lVert \epsilon_{\mathrm{td}}^{(t)} \rVert_{d^{\pi^{(t)}}} \Big) }_{(c)},
\end{aligned} $$

where

$$ \begin{aligned}
&\lVert\epsilon_{\mathrm{td}}^{(t)}\rVert_{d^{\pi^{(t)}}} := \mathbb{E}_{(s,a)\sim d^{\pi^{(t)}}} \Big[ \Big| Q^{(t)}_{\mathrm{tar}}(s,a) - r_{\mathrm{tar}}(s,a) - \gamma \mathbb{E}_{\substack{ s'\sim P_{\mathrm{tar}}(\cdot\mid s,a)\\ a'\sim \pi^{(t)}(\cdot\mid s') }} \big[ Q^{(t)}_{\mathrm{tar}}(s',a') \big] \Big| \Big], \\[4pt]
&\lVert\epsilon_{\mathrm{cd}}(Q_{\mathrm{src}},\phi^{(t)},\psi^{(t)})\rVert_{d^{\pi^{(t)}}} := \mathbb{E}_{(s,a)\sim d^{\pi^{(t)}}} \Big[ \Big| Q_{\mathrm{src}}(\phi^{(t)}(s),\psi^{(t)}(a)) - r_{\mathrm{tar}}(s,a) - \gamma \mathbb{E}_{\substack{ s'\sim P_{\mathrm{tar}}(\cdot\mid s,a)\\ a'\sim \pi^{(t)}(\cdot\mid s') }} \big[ Q_{\mathrm{src}}(\phi^{(t)}(s'),\psi^{(t)}(a')) \big] \Big| \Big].
\end{aligned} $$

Based on the average sub-optimality gap proposition, each summand of term (c) is linear in $\alpha(t) \in [0,1]$, so at each iteration $t$ it is minimized at an endpoint. That is, term (c) can be minimized by choosing $\alpha(t)$ as an indicator function: 1 when $\lVert\epsilon_{\text{cd}}(Q_{\text{src}}, \phi^{(t)}, \psi^{(t)})\rVert_{d^{\pi^{(t)}}} < \lVert\epsilon_{\text{td}}^{(t)}\rVert_{d^{\pi^{(t)}}}$, and 0 otherwise.

In practice, estimating these two error terms can be noisy, so using an indicator function may lead to large fluctuations in $\alpha(t)$ and unstable training. To address this issue, we propose a smoother variant:

\[\alpha(t) = \frac{\lVert\epsilon_{\text{td}}^{(t)}\rVert_{d^{\pi^{(t)}}}}{\lVert\epsilon_{\text{cd}}(Q_{\text{src}}, \phi^{(t)}, \psi^{(t)})\rVert_{d^{\pi^{(t)}}} + \lVert\epsilon_{\text{td}}^{(t)}\rVert_{d^{\pi^{(t)}}}}.\]

Notably, this design is hyperparameter-free and incurs minimal deployment overhead.
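As a rough illustration of the smoothed weighting rule, the sketch below estimates the two Bellman errors from a batch and forms $\alpha(t)$. The helper names `mean_abs_bellman_error` and `adaptive_weight` are hypothetical, `next_q` is assumed to be the caller's estimate of the inner expectation over $s'$ and $a'$, and returning 0.5 when both errors are exactly zero is an added safeguard that the source does not specify.

```python
import numpy as np

def mean_abs_bellman_error(q_sa, rewards, next_q, gamma):
    """Batch estimate of E | Q(s,a) - r - gamma * E[Q(s',a')] |.

    q_sa:    critic values at sampled (s, a)
    rewards: target-domain rewards r_tar(s, a)
    next_q:  caller-supplied estimates of E_{s',a'}[Q(s', a')]
    """
    q_sa, rewards, next_q = (np.asarray(x, dtype=float) for x in (q_sa, rewards, next_q))
    return float(np.mean(np.abs(q_sa - rewards - gamma * next_q)))

def adaptive_weight(eps_td, eps_cd):
    """alpha = eps_td / (eps_cd + eps_td), the weight on the source critic."""
    denom = eps_cd + eps_td
    if denom == 0.0:
        return 0.5  # assumption: split evenly when both errors vanish
    return eps_td / denom

# Example: the mapped source critic has the smaller Bellman error,
# so it receives the larger weight.
eps_td = mean_abs_bellman_error([4.0, 4.0], [0.5, 0.5], [1.0, 1.0], gamma=0.5)
eps_cd = mean_abs_bellman_error([2.0, 2.0], [0.5, 0.5], [1.0, 1.0], gamma=0.5)
alpha = adaptive_weight(eps_td, eps_cd)
```

In this example `eps_td` is 3.0 and `eps_cd` is 1.0, giving `alpha = 0.75`. The weight exceeds 0.5 exactly when the cross-domain error is the smaller of the two, which matches the direction of the indicator rule.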

Experimental Results

Evaluation Environments

Source and target domains are evaluated across locomotion, robot manipulation, and navigation tasks.

Environments: HalfCheetah, Ant, Door Opening, Table Wiping, Navigation.

[Figure: source-domain and target-domain renderings for each environment.]

Evaluation Results

Learning Curves

[Figure: learning curves for (a) HalfCheetah, (b) Ant, (c) Door Opening, (d) Table Wiping, and (e) Navigation, with a shared legend.]

Time to Threshold

[Figure: time to threshold.]

Aggregated IQM

[Figure: aggregated IQM.]

Ablation Study: Does $\alpha(t)$ Reflect Source Model Transferability?

Strong Positive/Negative Transfer

We consider a task where the source domain is the standard 'Ant-v3' and the target domain changes the goal to moving backward, with all else unchanged. Here, $Q_{\text{src}}$ and $Q_{\text{tar}}$ are adversarial because the goals are opposite. We evaluate $Q$Avatar in two scenarios: (a) Learning the state/action mapping: strong transferability exists, as Ant is symmetric along the front-back axis, allowing a perfect mapping. (b) Fixing the mapping as the identity: a strong negative transfer case, since $Q_{\text{src}}$ provides adversarial reward signals. In the results, $Q$Avatar captures both positive transfer (high $\alpha(t)$) and negative transfer (low $\alpha(t)$), demonstrating that $\alpha(t)$ reflects transferability.

Source Model of Varying Quality

We evaluate a scenario with source models of varying quality in the HalfCheetah environment. Specifically, we use a low-quality source model with a total return of 1000, compared with approximately 7000 for the expert. The learning process and $\alpha(t)$ of $Q$Avatar are shown below. When the source model is of low quality, $\alpha(t)$ decreases to a small value by the end of training, thereby mitigating the effect of negative transfer.

[Figure: learning process and $\alpha(t)$ values on HalfCheetah, with legend.]

Citation

@inproceedings{
chen2026cross,
title={Cross-Domain Policy Optimization via {B}ellman Consistency and Hybrid Critics},
author={Chen, Ming-Hong and Pan, Kuan-Chen and Huang, You-De and Liu, Xi and Hsieh, Ping-Chun},
booktitle={The Fourteenth International Conference on Learning Representations},
year={2026},
url={https://openreview.net/forum?id=kTXRFtWHnM}
}