FLAG: Flow Policy MaxEnt-RL by Latent Augmented Guidance

Sungha Kim *, 1
Gawon Lee *, 1
1 Lab for Autonomous Robotics Research, Seoul National University 2 Georgia Institute of Technology
* Equal contribution

TL;DR

FLAG trains expressive flow policies for MaxEnt-RL through supervised target matching, rather than direct gradient-based policy optimization through the flow. It replaces fragile global importance sampling with latent-conditioned local importance sampling, enabling scalable high-dimensional control and state-of-the-art performance.

From Global to Local Importance Sampling

Prior methods train diffusion- and flow-based policies by matching a target action distribution, such as the unnormalized MaxEnt target $\exp(Q(s, a) / \alpha)$. Since this target cannot be sampled from directly, they use importance sampling (IS): sample actions from a proposal policy, reweight them toward the target, and perform weighted score or flow matching.
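The reweighting step can be sketched as follows. This is a minimal illustration of self-normalized importance sampling, not code from the paper: the function names, the array shapes, and the squared-error loss standing in for weighted score or flow matching are all assumptions.

```python
import numpy as np

def snis_weights(q_values, log_proposal, alpha):
    """Self-normalized importance weights toward the target exp(Q / alpha).

    q_values:     shape (N,), Q(s, a_i) for N actions drawn from the proposal
    log_proposal: shape (N,), log-density of each a_i under the proposal
    alpha:        temperature, assumed > 0
    """
    log_w = q_values / alpha - log_proposal  # unnormalized log-weights
    log_w = log_w - np.max(log_w)            # stabilize before exponentiating
    w = np.exp(log_w)
    return w / np.sum(w)

def weighted_matching_loss(pred, target, weights):
    """Importance-weighted squared error (a stand-in for weighted matching)."""
    per_sample = np.sum((pred - target) ** 2, axis=-1)
    return float(np.sum(weights * per_sample))
```

Because the weights are normalized over the sampled actions only, the unknown normalizer of the target never has to be computed.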

The limitation is that IS can only reweight actions that were already sampled. When the proposal-target discrepancy is large, especially in high-dimensional control, the proposal rarely samples target-relevant actions. This yields sparse supervision, and training can fail as a result.
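A toy experiment makes the effect visible. The setup below is an assumption chosen for illustration (a unit Gaussian proposal and a unit Gaussian target whose mean is shifted by `shift` in every coordinate); it is not an experiment from the paper. It measures the effective sample size (ESS) of the importance weights, which counts roughly how many samples carry useful weight.

```python
import numpy as np

def effective_sample_size(log_w):
    """ESS = (sum w)^2 / sum w^2, computed from unnormalized log-weights."""
    log_w = log_w - np.max(log_w)
    w = np.exp(log_w)
    return float(np.sum(w) ** 2 / np.sum(w ** 2))

def ess_for_dim(dim, n_samples=256, shift=1.0, seed=0):
    """ESS when the proposal is N(0, I) and the target is N(shift * 1, I)."""
    rng = np.random.default_rng(seed)
    a = rng.standard_normal((n_samples, dim))
    # log target minus log proposal; the Gaussian normalizers cancel
    log_w = -0.5 * np.sum((a - shift) ** 2, axis=1) + 0.5 * np.sum(a ** 2, axis=1)
    return effective_sample_size(log_w)

if __name__ == "__main__":
    for d in (1, 4, 16, 32):
        print(d, ess_for_dim(d))
```

In this toy setting the variance of the log-weights grows linearly with the dimension, so the ESS shrinks quickly as `dim` increases even though the per-coordinate mismatch stays fixed.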

Our key insight is to change where IS is performed. Prior methods use Global IS, which reweights samples over the full action space. FLAG uses Local IS: it conditions both the proposal and target on the same flow latent variable $z$, so reweighting happens inside a shared local region.
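One way to picture local reweighting is sketched below. This is a simplified illustration and not necessarily the paper's construction: it assumes the local proposal is a Gaussian centered on the action that the flow produces for a given latent, and that the local target is that same Gaussian tilted by exp(Q / alpha), so the proposal density cancels in the weights. The names `local_is_target`, `q_fn`, and `sigma` are hypothetical.

```python
import numpy as np

def local_is_target(base_action, q_fn, sigma, alpha, n_samples, rng):
    """Locally reweighted action label around one latent-conditioned action.

    base_action: shape (D,), the action the flow maps some latent z to
    q_fn:        hypothetical callable, actions (N, D) -> Q-values (N,)
    sigma:       std of the local Gaussian proposal (assumption of this sketch)
    """
    noise = rng.standard_normal((n_samples, base_action.shape[0]))
    candidates = base_action + sigma * noise
    # Local target assumed proportional to proposal * exp(Q / alpha),
    # so the weights depend on Q only.
    log_w = q_fn(candidates) / alpha
    w = np.exp(log_w - np.max(log_w))
    w = w / np.sum(w)
    return w @ candidates  # weighted mean, usable as a regression label
```

Every candidate lies near `base_action`, so even a handful of samples yields a label that moves in the direction of higher Q within that region.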

Local IS turns global target matching into local target matching

Global IS (prior work)
  • Region: Full action space
  • Samples: Useful ones are rarely drawn
  • Supervision: Weak

Local IS (FLAG, ours)
  • Region: Local (latent-conditioned)
  • Samples: Even a few stay informative
  • Supervision: Dense

We first illustrate this effect in a controlled multi-goal task, then show the same pattern in high-dimensional continuous control tasks in the Experimental Results section.

Multi-goal Environment (Didactic Experiment)

Comparison of global and local importance sampling in a multi-goal environment
The multi-goal task isolates the small-sample regime: global-IS baselines fail when $N \le 8$, while FLAG recovers the optimal multi-modal behavior with only $N = 2$ samples.

Method

Q1 How do we construct local IS, and is it consistent with the original MDP?
Challenge

For local IS to be principled — not just a trick — two conditions must hold:

  • Localization. The proposal and target distributions must share a region indexed by a latent $z$.
  • Consistency. Optimizing inside that local region must be the same optimization problem as optimizing the RL objective in the original MDP.

Without Consistency, local IS optimizes the wrong objective and any gains are illusory.

Q2 How do we incorporate this into MaxEnt-RL and update the policy?
Challenge

Two obstacles block plugging the z-MDP into MaxEnt-RL:

  • Intractable Entropy. The composite policy’s log-probability $\log \pi(a \mid s)$ requires marginalizing over the latent $z$, and there is no closed form.
  • RL via Supervised Learning. We want to cast MaxEnt-RL as an EM algorithm whose policy update reduces to a supervised learning problem — avoiding backprop through the flow ODE (BPTT).
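To make the first obstacle concrete, suppose the composite policy draws a latent from a base distribution $p(z)$ and then an action from a latent-conditioned policy $\pi(a \mid s, z)$. This notation is assumed here for illustration; in particular, $p(z)$ is not named in the text above. The entropy term of MaxEnt-RL then needs the log of an integral:

```latex
\pi(a \mid s) = \int p(z)\, \pi(a \mid s, z)\, \mathrm{d}z,
\qquad
\mathcal{H}\big(\pi(\cdot \mid s)\big)
  = -\,\mathbb{E}_{a \sim \pi(\cdot \mid s)}\big[\log \pi(a \mid s)\big].
```

Sampling an action is easy (draw $z$, then $a$), but evaluating $\log \pi(a \mid s)$ for that action requires integrating over every latent that could have produced it.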
Q3 Does FLAG provably improve the policy, and how does it relate to SAC?
Challenge

FLAG updates the policy by supervised distillation, not by differentiating the objective through the flow. Two guarantees are therefore not obvious — and this section establishes both:

  • Monotonic improvement. Does FLAG’s update actually raise the objective, i.e. is $\mathcal{J}_{k+1} \geq \mathcal{J}_k$ guaranteed?
  • Relation to SAC. We optimize a MaxEnt-RL objective — so how does FLAG’s update relate to Soft Actor–Critic (SAC), and does it inherit SAC’s soft policy improvement?
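For reference, the soft policy improvement step from the SAC paper is reproduced below, written with the temperature $\alpha$ so that it lines up with the MaxEnt target $\exp(Q(s, a) / \alpha)$. Here $\Pi$ is the policy class, $Q^{\pi_{\text{old}}}$ is the soft Q-function of the current policy, and $Z^{\pi_{\text{old}}}(s)$ is the normalizer. This restates SAC only; how FLAG's update maps onto it is the subject of Q3 and is not derived here.

```latex
\pi_{\text{new}} = \arg\min_{\pi' \in \Pi}
D_{\mathrm{KL}}\!\left(
  \pi'(\cdot \mid s) \,\Big\|\,
  \frac{\exp\!\big(Q^{\pi_{\text{old}}}(s, \cdot) / \alpha\big)}{Z^{\pi_{\text{old}}}(s)}
\right),
\qquad
Q^{\pi_{\text{new}}}(s, a) \ge Q^{\pi_{\text{old}}}(s, a).
```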

Experimental Results

We organize the results around the three questions posed in Section 5 of the paper.

Term used in experiments

best-of-P: whenever an action is sampled from the policy (e.g., during rollout and policy updates), P candidate actions are drawn and the one with the highest Q-value is selected. FLAG does not use this heuristic, whereas the baselines rely on it heavily.
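The best-of-P heuristic can be sketched in a few lines. The callables `sample_action` and `q_fn` are hypothetical placeholders for a baseline's policy sampler and critic; their signatures are assumptions made for this example.

```python
import numpy as np

def best_of_p(sample_action, q_fn, state, p, rng):
    """Draw p candidate actions and return the one with the highest Q-value.

    sample_action: hypothetical callable (state, rng) -> action of shape (D,)
    q_fn:          hypothetical callable (state, actions (P, D)) -> Q-values (P,)
    """
    candidates = np.stack([sample_action(state, rng) for _ in range(p)])
    q_values = q_fn(state, candidates)
    return candidates[int(np.argmax(q_values))]
```

Each selected action costs P policy samples and P critic evaluations, which is why larger best-of-P budgets translate into longer runtimes.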

Q1 Does FLAG scale to high-dimensional action spaces under limited sample budgets?
A1
FLAG is Scalable to High-dimensional Action Spaces and Robust to Sample Budget
Q1.1 Is FLAG scalable to high-dimensional action spaces?

FLAG stays in the high-return, low-runtime region as action dimensionality increases, while global-proposal baselines degrade or require larger best-of-P budgets.

Q1.2 Is FLAG robust to the sample budget?

Local proposal matching keeps importance samples informative even when the update uses only a small number of samples.

Table 1: Ablation study on the number of training samples in DMC Dog tasks.
Figure: Learning curves.
Q2 How does FLAG compare to action-gradient and BPTT-based actor-critic methods?
A2
FLAG Outperforms BPTT-based Actor-Critic and Action Gradient Methods
Q3 How do our key design choices (covariance scheduling and the guidance buffer) connect to the theoretical results?
A3
FLAG's Key Design Choices Align with Theoretical Results
A3.1 Covariance Scheduling (controls σ²ₖ in Theorem 4.5; see Q3)

Annealing the local covariance suppresses the covariance-dependent reward drift in Theorem 4.5 while preserving enough early exploration.

σinit (σfinal) | -1(-1) | -1(-2) | -1(-3) | -2(-2) | -2(-3) | -2(-4) | -2(-5) | -2(-6)
Return (1k) ↑  | 0.614  | 0.675  | 0.732  | 0.588  | 0.680  | 0.597  | 0.547  | 0.269
DMC Dog-run, 5 seeds. -2(-3) is the default used in Sections 5.1 and 5.2.
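An annealing schedule of this kind could look like the sketch below. The log-linear (geometric) shape, the function name, and the parameter names are assumptions for illustration; the text above does not specify the schedule's functional form or how the table's settings map to numeric values.

```python
def annealed_sigma(step, total_steps, sigma_init, sigma_final):
    """Geometric interpolation from sigma_init to sigma_final.

    Assumes total_steps > 0 and both sigmas > 0. The schedule shape is an
    illustrative choice, not taken from the paper.
    """
    frac = min(max(step / total_steps, 0.0), 1.0)  # clamp progress to [0, 1]
    return sigma_init * (sigma_final / sigma_init) ** frac
```

A geometric schedule spends equal numbers of steps per order of magnitude, so the early, wide-covariance phase that supports exploration is not cut short.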
A3.2 Guidance Buffer (controls εₖ in Theorem 4.5; see Q3)

A moderate guidance buffer reduces CFM projection error by reusing recent improved action labels without letting targets become stale.

Buffer Size   | 0     | 10.24k | 51.2k | 102.4k | 204.8k
Return (1k) ↑ | 0.601 | 0.680  | 0.670 | 0.618  | 0.589
DMC Dog-run, 5 seeds. 10.24k is the default used in Sections 5.1 and 5.2.
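The trade-off between reuse and staleness can be pictured with a bounded first-in, first-out buffer, sketched below. The class name, the stored tuple layout `(state, latent, improved_action)`, and uniform sampling are all assumptions; the paper's buffer may differ.

```python
import random
from collections import deque

class GuidanceBuffer:
    """FIFO store of (state, latent, improved_action) tuples (hypothetical layout)."""

    def __init__(self, capacity):
        # deque with maxlen evicts the oldest entry once capacity is reached
        self.items = deque(maxlen=capacity)

    def add(self, state, latent, improved_action):
        self.items.append((state, latent, improved_action))

    def sample(self, batch_size, rng=random):
        """Uniformly sample up to batch_size stored tuples without replacement."""
        k = min(batch_size, len(self.items))
        return rng.sample(list(self.items), k)

    def __len__(self):
        return len(self.items)
```

With a small `capacity`, only labels produced under recent policies survive; a large `capacity` keeps more labels for reuse but lets older, possibly stale ones linger.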

BibTeX citation

@misc{kim2025flag,
title = {{FLAG}: Flow Policy MaxEnt-RL by Latent Augmented Guidance},
author = {Kim, Sungha and Lee, Gawon and Lee, Jusuk and Park, Jonghae and Kim, H. Jin and Cho, Daesol},
year = {2026},
eprint = {2605.00000},
archivePrefix = {arXiv},
primaryClass = {cs.RO},
url = {https://arxiv.org/abs/2506.00000},
}