FLAG: Flow Policy MaxEnt-RL by Latent Augmented Guidance

Sungha Kim ^{*, 1}

Gawon Lee ^{*, 1}

Jusuk Lee ¹

Jonghae Park ¹

H. Jin Kim ¹

Daesol Cho ²

¹ Lab for Autonomous Robotics Research, Seoul National University ² Georgia Institute of Technology

^* Equal contribution

TL;DR

FLAG trains expressive flow policies for MaxEnt-RL through supervised target matching, rather than direct gradient-based policy optimization through the flow. It replaces fragile global importance sampling with latent-conditioned local importance sampling, enabling scalable high-dimensional control and state-of-the-art performance.

From Global to Local Importance Sampling

Prior methods train diffusion- and flow-based policies by matching a target action distribution, such as the unnormalized MaxEnt target $\exp(Q(s, a) / \alpha)$. Since this target cannot be sampled from directly, they use importance sampling (IS): sample actions from a proposal policy, reweight them toward the target, and perform weighted score or flow matching.
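The reweighting step can be sketched with self-normalized importance sampling. This is a minimal illustration rather than the paper's implementation: the function name `snis_weights` and the toy Q-values are assumptions, and the proposal log-densities are taken as given.

```python
import numpy as np

def snis_weights(q_values, log_proposal, alpha):
    """Self-normalized importance weights toward the target exp(Q / alpha)."""
    # log w_i = Q(s, a_i) / alpha - log q(a_i | s), up to an additive constant
    log_w = np.asarray(q_values, dtype=float) / alpha - np.asarray(log_proposal, dtype=float)
    log_w = log_w - log_w.max()   # subtract the max for numerical stability
    w = np.exp(log_w)
    return w / w.sum()            # the unknown normalizer of the target cancels here

# Toy example: three sampled actions with equal proposal density (hypothetical values)
weights = snis_weights(q_values=[1.0, 2.0, 3.0], log_proposal=[0.0, 0.0, 0.0], alpha=1.0)
```

With equal proposal densities the weights reduce to a softmax of `Q / alpha`, so a smaller `alpha` concentrates the weight on the highest-value sample.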

The limitation is that IS can only reweight actions that were already sampled. When the proposal-target discrepancy is large, especially in high-dimensional control, the proposal rarely samples target-relevant actions. This yields sparse supervision and leads to training failure.
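One way to see this collapse is the effective sample size (ESS) of the importance weights. The sketch below uses an assumed toy setup that is not from the paper: a standard Gaussian proposal and a unit-covariance Gaussian target whose mean is shifted by `shift` in every coordinate. The names `effective_sample_size` and `ess_for_dim` are illustrative.

```python
import numpy as np

def effective_sample_size(weights):
    """ESS = 1 / sum(w_i^2) for normalized weights; ranges from 1 to N."""
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()
    return 1.0 / np.sum(w ** 2)

def ess_for_dim(dim, n_samples=64, shift=1.0, seed=0):
    """ESS when a N(0, I) proposal is reweighted toward a N(shift * 1, I) target."""
    rng = np.random.default_rng(seed)
    a = rng.standard_normal((n_samples, dim))   # samples from the proposal
    # log(target / proposal) = shift * sum(a) - dim * shift^2 / 2; the constant cancels
    log_w = shift * a.sum(axis=1)
    log_w = log_w - log_w.max()
    return effective_sample_size(np.exp(log_w))

ess_low_dim = ess_for_dim(1)
ess_high_dim = ess_for_dim(64)
```

In this toy setup the log-weight variance grows linearly with the dimension, so at high dimension a handful of samples carry nearly all of the weight.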

Our key insight is to change where IS is performed. Prior methods use Global IS, which reweights samples over the full action space. FLAG uses Local IS: it conditions both the proposal and target on the same flow latent variable $z$, so reweighting happens inside a shared local region.
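A rough sketch of the local variant follows. It rests on assumptions the text above does not spell out: the local proposal is taken to be a Gaussian of scale `sigma` around the action that the flow produces for one latent, and the local target is taken to be that same Gaussian tilted by $\exp(Q / \alpha)$, so the proposal density cancels in the weights. The names `local_is` and `toy_q` are hypothetical.

```python
import numpy as np

def local_is(base_action, q_fn, alpha, sigma, n_samples, rng):
    """Draw candidates near one latent-conditioned action and weight them by exp(Q / alpha)."""
    base_action = np.asarray(base_action, dtype=float)
    noise = rng.standard_normal((n_samples, base_action.size))
    candidates = base_action + sigma * noise           # assumed local Gaussian proposal
    q_values = np.array([q_fn(a) for a in candidates])
    log_w = q_values / alpha                           # proposal density cancels (assumption)
    log_w = log_w - log_w.max()
    w = np.exp(log_w)
    return candidates, q_values, w / w.sum()

# Toy usage: a quadratic Q peaked at a hypothetical goal action
rng = np.random.default_rng(0)
goal = np.array([1.0, 1.0])
toy_q = lambda a: -float(np.sum((a - goal) ** 2))
candidates, q_values, w = local_is(np.zeros(2), toy_q, alpha=0.1, sigma=0.3, n_samples=4, rng=rng)
improved_action = w @ candidates                       # weighted local regression target
```

Because every candidate lies in the same neighborhood, even a few samples give a usable direction of improvement for that latent.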

Local IS turns global target matching into local target matching

	Global IS (prior work)	Local IS (FLAG, ours)
Region	Full action space	Local (latent-conditioned)
Samples	Useful ones are rarely drawn	Even a few stay informative
Supervision	Weak	Dense

We first illustrate this effect in a controlled multi-goal task, then show the same pattern in high-dimensional continuous control tasks in the Experimental Results section.

Multi-goal Environment (Didactic Experiment)

Comparison of global and local importance sampling in a multi-goal environment. The multi-goal task isolates the small-sample regime: global-IS baselines fail when $N \le 8$, while FLAG recovers the optimal multi-modal behavior with only $N = 2$ samples.

Method

Q1 How do we construct local IS, and is it consistent with the original MDP?

Challenge

For local IS to be principled — not just a trick — two conditions must hold:

Localization. The proposal and target distributions must share a region indexed by a latent $z$.
Consistency. Optimizing inside that local region must be the same optimization problem as optimizing the RL objective in the original MDP.

Without Consistency, local IS optimizes the wrong objective and any gains are illusory.

Q2 How do we incorporate this into MaxEnt-RL and update the policy?

Challenge

Two obstacles block plugging the z-MDP into MaxEnt-RL:

Intractable Entropy. The composite policy’s log-probability $\log\pi(a \mid s)$ requires marginalizing over the latent $z$ — there is no closed form.
RL via Supervised Learning. We want to cast MaxEnt-RL as an EM algorithm whose policy update reduces to a supervised learning problem — avoiding backprop through the flow ODE (BPTT).
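To make the supervised form concrete, here is a minimal sketch of a weighted conditional flow matching (CFM) loss. It assumes a straight-line probability path between the latent and the target action, which the text above does not specify, and `velocity_fn` stands in for a learned velocity network. No gradient flows through an ODE solve: the loss is a weighted regression evaluated at sampled times.

```python
import numpy as np

def weighted_cfm_loss(velocity_fn, z, target_actions, weights, t):
    """Weighted regression of a velocity field onto straight-line path velocities."""
    x_t = (1.0 - t)[:, None] * z + t[:, None] * target_actions   # point on the path at time t
    u_t = target_actions - z                                      # velocity of the straight path
    pred = velocity_fn(x_t, t)
    per_sample = np.sum((pred - u_t) ** 2, axis=1)
    return float(np.sum(weights * per_sample))

# Toy usage with hypothetical data: three latents, all mapped to the same target action
z = np.zeros((3, 2))
targets = np.tile([1.0, -1.0], (3, 1))
w = np.full(3, 1.0 / 3.0)
t = np.array([0.1, 0.5, 0.9])
zero_field = lambda x, t: np.zeros_like(x)
loss_untrained = weighted_cfm_loss(zero_field, z, targets, w, t)
```

The weights would come from the importance-sampling step, so samples closer to the target distribution contribute more to the regression.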

Solution

Proposition 4.3 — Cross-entropy surrogate

$$\begin{aligned} D_{\mathrm{TV}}(\pi, \tilde\pi) &= \mathcal{O}\big(\sqrt{\mathrm{tr}(\Sigma)}\big), \quad D_{\mathrm{KL}}(\pi \,\|\, \tilde\pi) = \mathcal{O}\big(\mathrm{tr}(\Sigma)^2\big) \\[4pt] \mathcal{H}(\pi, \tilde\pi) &= \mathcal{H}(\pi) + \mathcal{O}\big(\mathrm{tr}(\Sigma)^2\big) \approx \mathcal{H}(\pi) \end{aligned}$$

Plot showing the global policy converging to the base flow as the local variance shrinks

Replace the entropy $\mathcal{H}(\pi)$ with the cross-entropy $\mathcal{H}(\pi, \tilde\pi)$ against the base flow $\tilde\pi$.
Estimate $\log \tilde \pi(a\mid s)$ via the Hutchinson trace estimator, using common random numbers (CRN) to reduce the estimator's variance (Appendix D.3).
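For a flow defined by an ODE, the change in log-density along a trajectory is the integral of the divergence of the velocity field, which is the trace of its Jacobian. The sketch below shows only the trace-estimation step on explicit matrices, with Rademacher probes. The function names are illustrative, and how the paper applies CRN is described in its Appendix D.3, not here. In this sketch, CRN means reusing the same probes for two related estimates so that their noise cancels in the difference.

```python
import numpy as np

def rademacher_probes(n_probes, dim, seed):
    """Probe vectors with independent +1 / -1 entries."""
    rng = np.random.default_rng(seed)
    return rng.choice([-1.0, 1.0], size=(n_probes, dim))

def hutchinson_trace(matvec, probes):
    """Estimate tr(A) as the mean of v^T (A v) over the probe vectors."""
    return float(np.mean([v @ matvec(v) for v in probes]))

# Two nearby matrices standing in for Jacobians at nearby points (toy values)
A = np.array([[2.0, 1.0], [1.0, 3.0]])
B = A + 0.1 * np.eye(2)

shared = rademacher_probes(n_probes=4, dim=2, seed=0)   # common random numbers
trace_diff = hutchinson_trace(lambda v: B @ v, shared) - hutchinson_trace(lambda v: A @ v, shared)
```

With shared probes, the off-diagonal noise is identical in both estimates and drops out of the difference. With independent probes it would not.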

Q3 Does FLAG provably improve the policy, and how does it relate to SAC?

Challenge

FLAG updates the policy by supervised distillation, not by differentiating the objective through the flow. Two guarantees are therefore not obvious — and this section establishes both:

Monotonic improvement. Does FLAG’s update actually raise the objective, i.e. is $\mathcal{J}_{k+1} \geq \mathcal{J}_k$ guaranteed?
Relation to SAC. We optimize a MaxEnt-RL objective — so how does FLAG’s update relate to Soft Actor–Critic (SAC), and does it inherit SAC’s soft policy improvement?

Solution

Theorem 4.5 — Monotonic Improvement Guarantee

$$\mathcal{J}_{k+1} - \mathcal{J}_k \;\geq\; \underbrace{\lambda \beta \mathcal{G}_k}_{\substack{\text{frozen-reward}\\\text{MPO improvement}}} \;-\; \underbrace{\alpha\, C_\Sigma\, \sigma_k^2\, \beta \mathcal{G}_k}_{\substack{\text{cross-entropy}\\\text{reward drift}}} \;-\; \underbrace{\lambda\, \epsilon_k^{\mathrm{proj}}}_{\substack{\text{CFM projection}\\\text{error}}} \;+\; \mathcal{O}(\beta^2)$$

One FLAG step splits into a positive MPO improvement term, a cross-entropy drift that scales with the local variance $\sigma_k^2$, and the CFM projection error $\epsilon_k^{\mathrm{proj}}$.
Whenever $\lambda > \alpha\, C_\Sigma\, \sigma_k^2$ and the projection error is small, the improvement term dominates, so $\mathcal{J}_{k+1} \geq \mathcal{J}_k$ up to the $\mathcal{O}(\beta^2)$ term, i.e., monotonic improvement.
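Grouping the first two terms of the bound in Theorem 4.5 makes the condition explicit. This is only a rearrangement of the stated inequality, with no new assumptions:

```latex
\mathcal{J}_{k+1} - \mathcal{J}_k
\;\geq\;
\big(\lambda - \alpha\, C_\Sigma\, \sigma_k^2\big)\, \beta\, \mathcal{G}_k
\;-\; \lambda\, \epsilon_k^{\mathrm{proj}}
\;+\; \mathcal{O}(\beta^2)
```

Since the improvement term is described as positive, the first group is positive exactly when $\lambda > \alpha\, C_\Sigma\, \sigma_k^2$. To first order in $\beta$, the right-hand side is then non-negative once $(\lambda - \alpha\, C_\Sigma\, \sigma_k^2)\, \beta\, \mathcal{G}_k \geq \lambda\, \epsilon_k^{\mathrm{proj}}$, which is what "the projection error is small" needs to mean here.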

Proof Roadmap

Experimental Results

We present the results around the three questions used in Section 5 of the paper.

Term used in experiments

best-of-P: when sampling actions from the policy (e.g., during rollout and policy update), P candidate actions are drawn and the one with the highest Q-value is selected. FLAG does not use this heuristic, whereas the baselines rely heavily on it.
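The heuristic as defined above can be sketched in a few lines. The callables `sample_action` and `q_fn` are placeholders for a policy sampler and a critic, not names from any baseline's code.

```python
import numpy as np

def best_of_p(sample_action, q_fn, p, rng):
    """Draw p candidate actions and keep the one with the highest Q-value."""
    candidates = [sample_action(rng) for _ in range(p)]
    q_values = [q_fn(a) for a in candidates]
    return candidates[int(np.argmax(q_values))]

# Toy usage: scalar actions, Q peaked at zero (hypothetical)
rng = np.random.default_rng(0)
chosen = best_of_p(lambda r: float(r.standard_normal()), lambda a: -abs(a), p=8, rng=rng)
```

Each selected action costs P policy samples and P critic evaluations, which is why the required budget P matters when comparing runtime.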

Q1 Does FLAG scale to high-dimensional action spaces under limited sample budgets?

FLAG is Scalable to High-dimensional Action Spaces and Robust to Sample Budget

Q1.1 Is FLAG scalable to high-dimensional action spaces?

FLAG stays in the high-return, low-runtime region as action dimensionality increases, while global-proposal baselines degrade or require larger best-of-P budgets.

Q1.2 Is FLAG robust to the sample budget?

Local proposal matching keeps importance samples informative even when the update uses only a small number of samples.

Learning Curves

Q2 How does FLAG compare to action-gradient and BPTT-based actor-critic methods?

FLAG Outperforms BPTT-based Actor-Critic and Action Gradient Methods

Q3 How do our key design choices (covariance scheduling and the guidance buffer) connect to the theoretical results?

FLAG's Key Design Choices Align with Theoretical Results

A3.1 Covariance Scheduling · Controls σ²ₖ · Theorem 4.5 (Q3)

Annealing the local covariance suppresses the covariance-dependent reward drift in Theorem 4.5 while preserving enough early exploration.

σinit (σfinal)	-1(-1)	-1(-2)	-1(-3)	-2(-2)	-2(-3)	-2(-4)	-2(-5)	-2(-6)
Return (1k) ↑	0.614	0.675	0.732	0.588	0.680	0.597	0.547	0.269

DMC Dog-run, 5 seeds. -2(-3) is the default used in Sections 5.1 and 5.2.
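As a rough illustration of annealing between an initial and a final scale, the sketch below interpolates linearly in log space. Both the schedule shape and the example values are assumptions: this page states neither the functional form of the schedule nor the units of the table entries, and `annealed_sigma` is a hypothetical helper.

```python
import math

def annealed_sigma(step, total_steps, sigma_init, sigma_final):
    """Log-linear interpolation from sigma_init to sigma_final (assumed schedule shape)."""
    frac = min(max(step / total_steps, 0.0), 1.0)   # training progress clipped to [0, 1]
    log_sigma = (1.0 - frac) * math.log(sigma_init) + frac * math.log(sigma_final)
    return math.exp(log_sigma)

# Hypothetical values: start wide for exploration, end narrow to limit drift
schedule = [annealed_sigma(s, 100, sigma_init=0.1, sigma_final=0.01) for s in (0, 50, 100)]
```

A larger early scale keeps local exploration alive, while the smaller final scale shrinks the $\sigma_k^2$-dependent drift term in Theorem 4.5.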

A3.2 Guidance Buffer · Controls εₖ · Theorem 4.5 (Q3)

A moderate guidance buffer reduces CFM projection error by reusing recent improved action labels without letting targets become stale.

Buffer Size	0	10.24k	51.2k	102.4k	204.8k
Return (1k) ↑	0.601	0.680	0.670	0.618	0.589

DMC Dog-run, 5 seeds. 10.24k is the default used in Sections 5.1 and 5.2.
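One plausible reading of the buffer is a fixed-capacity first-in, first-out store of recent improved-action labels. The sketch below follows that reading. The class name, the stored tuple layout, and the uniform sampling are all assumptions rather than the paper's design.

```python
from collections import deque
import random

class GuidanceBuffer:
    """FIFO store of recent improved-action labels (hypothetical layout)."""

    def __init__(self, capacity):
        # A bounded deque drops the oldest entry once full, which limits staleness.
        # Capacity 0 keeps nothing, i.e. no reuse of past labels.
        self.entries = deque(maxlen=capacity)

    def add(self, state, latent, improved_action):
        self.entries.append((state, latent, improved_action))

    def sample(self, batch_size, rng=random):
        # Uniform sampling without replacement, capped at the current size
        k = min(batch_size, len(self.entries))
        return rng.sample(list(self.entries), k)

    def __len__(self):
        return len(self.entries)

# Toy usage: capacity 2, so the first of three labels is evicted
buffer = GuidanceBuffer(capacity=2)
for i in range(3):
    buffer.add(state=i, latent=i, improved_action=i)
```

The capacity trades off the two effects named above: more reuse of improved labels against older, staler targets.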

BibTeX citation

@misc{kim2025flag,
  title         = {{FLAG}: Flow Policy MaxEnt-RL by Latent Augmented Guidance},
  author        = {Kim, Sungha and Lee, Gawon and Lee, Jusuk and Park, Jonghae and Kim, H. Jin and Cho, Daesol},
  year          = {2026},
  eprint        = {2605.00000},
  archivePrefix = {arXiv},
  primaryClass  = {cs.RO},
  url           = {https://arxiv.org/abs/2506.00000},
}